Towards Building a Live 3D Digital Twin of the World
by
Fawad Ahmad
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2022
Copyright 2022 Fawad Ahmad
Acknowledgements
I would like to thank my advisor, Prof. Ramesh Govindan, without whom this dissertation would not have
been possible. I am grateful for having had the opportunity to work with, and learn from, such an outstanding
mentor and exemplary person. If I were to do my PhD again, without an ounce of doubt, it would be with
Ramesh. For the entire time that I have worked with him, there is not a single thing I can complain about.
We have a belief that a teacher is a spiritual father and for me, no one fits that bill better than Ramesh.
I would like to thank my committee members, Prof. Konstantinos Psounis, Prof. Barath Raghavan, Prof.
Muhammad Naveed, and Prof. Nora Ayanian for their insightful discussions, and valuable feedback which
helped shape my dissertation.
I would like also to thank my collaborators, Fan Bai, Eugene Chai, and Karthik Sundaresan who have
been a source of inspiration and advice for me. I am grateful to all my labmates and friends at the Networked
Systems Laboratory (NSL) who shared this journey with me. I am especially thankful to my collaborators,
Hang, Christina, Weiwu, Rajrup, and Ray who helped in this dissertation. I am thankful to Zahaib who has
been a constant source of advice throughout my PhD.
I am very grateful to my parents (Daulat Khan and Nazneen). They have been a source of unwavering, and
unconditional support for me throughout my life and especially during my PhD. Without their support, this
dissertation would not have been possible. They have given numerous sacrifices to give me this opportunity
and I hope this achievement will make them proud. I would also like to thank my brother (Junaid) and
sister (Saima). They have been very encouraging and supportive in this endeavour. I am very grateful to my
wife (Aamna) for her patience, encouragement, and unwavering support throughout the entire PhD program.
Without her, this thesis would not have been possible. Last, but not least, I am extremely grateful to my
one-year-old daughter, Amal, for adding color and bringing unparalleled joy into our lives.
Abstract
A live 3D digital twin is a high-fidelity 3D representation of a physical scene or structure. This twin continu-
ously mirrors the physical object in near real-time. Live 3D digital twins enable novel and exciting capabilities
like safer and more efficient autonomous driving; efficient and targeted disaster relief operations; and quick
structural integrity checks for aircraft between flights. To build a live digital twin, we continuously fuse
voluminous 3D data from multiple sensors like LiDARs and stereo cameras into a single coordinate system.
However, live twins have strict accuracy and performance requirements, i.e., a twin must replicate the physical
world on the order of milliseconds with centimeter-level accuracy. Achieving these requirements is challenging
for two reasons: a) limited network bandwidth, and b) limited compute resources. To tackle these challenges,
this thesis introduces techniques (and embodies them in the form of end-to-end systems) that ensure
high accuracy and low latency using commercial-off-the-shelf (COTS) technologies. The contributions of this
thesis are four-fold. First, it introduces lean representations of 3D data to address the network bottleneck.
Second, it builds fast pipelines to process large amounts of 3D data with COTS compute resources, at low
latency. Third, it introduces robust algorithms to fuse 3D data from multiple 3D sensors into a single
coordinate system. Lastly, it proposes loss compensation techniques that ensure high accuracy even with lean 3D
representations. As such, this thesis builds end-to-end systems that construct and leverage live digital twins of
large urban spaces with a latency in milliseconds and accuracy on the order of centimeters.
Table of Contents
Acknowledgements
Abstract
List Of Tables
List Of Figures
Chapter 1: Introduction
1.1 Live Digital Twins
1.1.1 Net-driving: Rethinking Autonomous Driving (Chapter 3)
1.2 Up-to-Date Digital Twins
1.2.1 CarMap: Fast 3D Feature Map Updates (Chapter 2)
1.2.2 Fresco: Fast, and Accurate 3D Reconstruction (Chapter 4)
1.3 Thesis Outline
Chapter 2: CarMap: Fast 3D Feature Map Updates
2.1 Introduction
2.2 Background and Motivation
2.3 Design of CarMap
2.3.1 Map Segment Generator
2.3.2 Robust and Scalable Feature Matching
2.3.3 Dynamic Object Filter
2.3.4 Map Updater
2.3.5 Reconstruction
2.4 CarMap Implementation
2.5 CarMap Evaluation
2.5.1 Methodology
2.5.2 Near Real-Time Map Updates
2.5.3 End-to-End Localization Accuracy
2.5.4 Other Performance Measures
2.5.5 Robustness
2.6 Conclusion
Chapter 3: Net-driving: Rethinking Autonomous Driving
3.1 Introduction
3.2 Net-driving Perception
3.2.1 Perception Overview
3.2.2 Accurate Alignment for Fast Fusion
3.2.3 Reusing 3D Bounding Boxes
3.2.4 Fast, Accurate Heading Vectors
3.2.5 Other Details
3.3 Net-driving Planning
3.4 Evaluation
3.4.1 Methodology
3.4.2 Real-world Experiment
3.4.3 Latency Breakdown
3.4.4 Accuracy
3.4.5 Safety
3.4.6 High-Throughput Traffic Management
3.5 Conclusions and Future Work
Chapter 4: Fresco: Fast, and Accurate 3D Reconstruction
4.1 Introduction
4.2 Motivation
4.3 Fresco Design
4.3.1 Model Collection
4.3.1.1 SLAM-imposed constraints
4.3.1.2 Density-imposed constraints
4.3.1.3 Optimized trajectory generation
4.3.2 Drift estimation and re-calibration
4.3.3 Discovering Rooftop Geometry
4.3.4 Point-Cloud Compression
4.4 Evaluation
4.4.1 3D Model Reconstruction
4.4.2 Performance
4.4.3 Ablation Studies
4.4.4 Data Collection
4.4.5 Boundary Detection
4.5 Conclusions
Chapter 5: Related Work
5.1 Autonomous Vehicles
5.1.1 Vehicle Sensing and Connectivity
5.1.2 Mapping
5.1.3 Perception
5.2 Drone Positioning
5.3 3D Reconstruction
5.3.1 Offline Reconstruction using Images
5.3.2 Offline Reconstruction using LiDAR or Radar
5.4 SLAM
5.4.1 Visual SLAM
5.4.2 Decentralized SLAM
Chapter 6: Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
6.2.1 Autonomous Driving
6.2.2 Collaborative 3D Reconstruction
6.2.3 Indoor Reconstruction
Reference List
List Of Tables
3.1 Net-driving uses a qualitatively different stack to meet throughput and latency requirements. Both perception and planning use different algorithms (highlighted in bold) than existing open-source stacks.
3.2 p99 per-frame latency (in milliseconds) for perception. We exclude latency numbers for motion vector estimation, which are on the order of a few microseconds.
3.3 Impact of optimizations on p99 latency
3.4 p99 per-frame planning latency in milliseconds.
3.5 Perception accuracies from real-world data and simulation. The last column provides accuracies reported for these by prior work, for calibration, where available.
3.6 Net-driving's novel alignment algorithm outperforms existing state-of-the-art initial alignment algorithms.
3.7 With more comprehensive perception, Net-driving can provide safe passage to vehicles in both scenarios.
4.1 Challenges and contributions
4.2 Photogrammetry can be slow and inaccurate. The small building is 50 m x 50 m x 20 m, the large is 100 m x 50 m x 20 m.
4.3 Relative to Fresco, alternative reconstruction approaches for drone-based LiDAR reconstruction give poor quality 3D models.
4.4 LiDAR SLAM on Jetson TX2. For both accuracy and completeness, lower is better.
4.5 Fresco comparison with Photogrammetry and SLAM.
4.6 Reconstruction accuracy results.
4.7 Fresco performance results.
4.8 Ablation study results.
4.9 Positioning errors for parallel and perpendicular orientations at different speeds (at distance 20 m) and errors for different distances (at a speed of 1 m/s) for real-world traces.
List Of Figures
1.1 A live digital twin.
1.2 Design space.
1.3 Net-driving.
1.4 CarMap.
1.5 Fresco.
2.1 If short timescale events like traffic accidents (left) are not updated in maps, a vehicle cannot localize itself (blue line) because it cannot match the scene with the map. On the other hand, vehicles with updated maps (red line) can localize themselves accurately.
2.2 Localization using a feature-based map. The picture on the left shows the features in an image, and the picture on the right shows the feature map generated for an area. Features are color-coded by the type of object those features belong to.
2.3 Components of feature-based map generators.
2.4 Architecture and workflow of CarMap
2.5 Dependencies between map components
2.6 Semantic segmentation of an image while driving.
2.7 When adding a new region to the base map, the vehicle uploads the whole map segment (above). For updating an existing map segment, CarMap generates a map diff containing new map features (below, new map features marked in blue).
2.8 CarMap stitching together two feature maps. The highlighted regions represent overlapped sub-segments.
2.9 End-to-end latency for CarMap's map update operation: CarMap enables real-time map updates (average end-to-end latency is approximately 0.6 seconds).
2.10 End-to-end latency for CarMap's map stitch operation. The stitch operation, on average, takes approximately 2.0 seconds for unmapped regions.
2.11 Vehicle map uploads for map stitch and update operations. Map updates reduce required bandwidth by 2x as compared to stitching map segments.
2.12 Mapping error and map sizes for a static-map used with static- and dynamic-traces, for each scenario. ∞ indicates that the scheme was not able to localize at all.
2.13 For a map built, and used in a static trace collected from CarLA, 75% of the mapping errors for CarMap are less than 0.2% with respect to the length of the trace.
2.14 For CarLA maps built from static, and used in dynamic environments, CarMap has a max error of 4%. ORB-SLAM2 and QuickSketch have maximum errors of 90%.
2.15 Mapping error and map sizes for a dynamic-map used with static- and dynamic-traces, for each scenario. ∞ indicates that the scheme was not able to localize at all.
2.16 For maps built, and used in CarLA's dynamic environments, CarMap has a maximum error of 2%. ORB-SLAM2 and QuickSketch have maximum errors of 90%.
2.17 Mapping error (%) for multi-lane localization in static environments using maps collected from one lane in other parallel lanes.
2.18 Mapping error (%) for multi-lane localization in dynamic environments using a map collected from one lane in other parallel lanes. CarMap is robust to spatio-temporal changes.
2.19 Map sizes on KITTI traces: for each alternative, the map size is normalized by CarMap's map size. The number on top of each group of bars shows the size in MB of CarMap's map for the corresponding KITTI trace. CarMap reduces map size by 20x for unmapped regions.
2.20 Bandwidth requirements for the four mapping schemes averaged over diverse environments in all 11 KITTI sequences at different speeds. CarMap supports near-real time uploads over LTE at speeds up to 80 kph whereas other schemes fail even at low speeds.
2.21 Bandwidth requirements for mapping schemes at different speeds in CarLA. The bandwidth required to upload CarMap maps is well below the LTE upload limits.
2.22 Bandwidth requirements for map updates in CarMap under different traffic conditions
2.23 For a map built, and used from a real-world trace (KITTI Trace 00), 80% of CarMap's mapping errors are less than 0.4% with respect to the length of the trace.
2.24 Localization error for CarMap over all KITTI sequences. Even though CarMap uses 20x fewer features in its map, its localization error is almost the same as ORB-SLAM2.
2.25 Mapping accuracy of mapping schemes with varying distance, averaged over all KITTI sequences. The overall localization error decreases over longer distances, and CarMap's localization error is almost the same as ORB-SLAM2's.
2.26 Breakdown of reconstruction time for CarMap across all KITTI sequences
2.27 Load times on KITTI traces: for each alternative, the load times are normalized by CarMap's load time (whose absolute value is on top of each group of bars). CarMap loads faster than ORB-SLAM2 (i.e., ORB-SLAM2's load time ratio > 1), except for 3 KITTI sequences.
2.28 Mapping error (m) with different overlapping regions. CarMap can stitch with fewer overlapping frames than QuickSketch and 30x faster than progressive relocalization.
2.29 Computational overhead of stitching. Even map segments as large as 1000 keyframes can be stitched in under 7 seconds.
2.30 CarMap's robust feature matching finds more features in different conditions and thus localizes better than ORB-SLAM2.
2.31 Semantic segmentation accuracy for different DCNNs. By classifying labels into static and dynamic objects, the segmentation accuracy for all DCNNs is above 96%.
2.32 Semantic segmentation accuracy at different frame rates. If CarMap segments every other keyframe, classification accuracy is 85%.
2.33 Mapping error (m) for multi-lane stitching. CarMap's stitching algorithm uses a more robust feature search based on position hints to stitch map segments two lanes apart where competing strategies fail (∞ shows an unsuccessful stitch operation).
2.34 Mapping errors (m) for stitching map segments from different traffic conditions. CarMap is robust to temporal changes because it: a) removes dynamics, and b) uses a robust feature search.
3.1 Autonomous vehicles run perception, planning and control on-board. Net-driving augments autonomous driving; it runs perception and planning on edge-compute infrastructure, and transmits trajectories directly to an autonomous vehicle's control module.
3.2 How Net-driving works in a scenario from CarLA [87]. Roadside LiDARs transmit data over a wired network to an edge cluster. Net-driving's perception extracts vehicles and their motion information, and its planning component determines that the blue vehicle can safely pass the stalled vehicle. Net-driving wirelessly delivers a trajectory designed for the blue vehicle, whose control component executes the trajectory.
3.3 (a) A bird's eye view of an intersection. LiDARs are mounted on the traffic light poles at each of the four corners of the intersection. (b) An individual frame from each of the 4 LiDARs. (c) A fused frame. (d) Point clouds of traffic participants at the intersection. (e) Bounding boxes and motion vectors for traffic participants, calculated over successive frames.
3.4 Perception stages and sub-stages. We describe sub-stages in bold in detail.
3.5 Net-driving finds the yaw for the two LiDARs by searching ((a) and (b)) for that combination where two surfaces on the point clouds have the smallest angle (c).
3.6 The figure shows the points belonging to a car, in two successive frames t and t + 1. (a) Strawman approach for heading determination. (b) Net-driving's approach.
3.7 (a) The first (red) planned vehicle can use the entire drivable space. (b) The second (blue) vehicle treats the first as an obstacle (different shades represent different times at which grids are occupied). (c) The motion-adaptive buffer around a vehicle is proportional to its speed.
3.8 With autonomous driving's decentralized planning, in the absence of a traffic light controller, vehicles come to a deadlock at the intersection where they are unable to cross it.
3.9 End-to-end latency from real-world experiment.
3.10 The orange truck obstructs the ego-vehicle's (yellow box) view of the red-light violator.
3.11 The orange trucks obstruct the ego-vehicle's (yellow box) view of the left-turning car.
3.12 Net-driving enables traffic light free intersections and can reduce wait times by 5x.
3.13 Net-driving can ensure collision-free trajectories with up to 70% packet-loss rates.
4.1 MVS reconstruction (right image) of a large building (left image shows ground truth 3D model).
4.2 GPS (left), SLAM (middle), and Fresco (right) derived models of a large complex on our campus. The models are color-coded with the same height ramp function.
4.3 Rectilinear trajectory for SLAM-based reconstruction.
4.4 Fresco architecture
4.5 Pentagon
4.6 Plus
4.7 Gabled
4.8 Fresco is designed for buildings with vertical sides and either gabled or polygonal roofs.
4.9 Parallel and perpendicular LiDAR orientation
4.10 Overlap at various orientations and speeds
4.11 LiDAR orientation vs. SLAM positioning error.
4.12 Point density vs. SLAM positioning error.
4.13 Pentagon building
4.14 Unfolded pentagon
4.15 Meshed pentagon
4.16 Graph of pentagon
4.17 Derived Trajectory
4.18 Folded Trajectory
4.19 Illustrating steps in optimized trajectory generation
4.20 Rotation throws SLAM off.
4.21 Fresco re-calibration flight (red) to mitigate drift (red circle).
4.22 3D model accuracy at different compression levels.
4.23 End-to-end latency from a real-world drone flight.
4.24 SLAM errors for LiDAR orientations.
4.25 SLAM error for speeds and building distances.
Chapter 1
Introduction
A live digital twin is a high-fidelity 3D representation of a physical object that continuously mirrors the
physical object, over time, in the digital world. For instance, Figure 1.1 shows a live digital twin (right image)
of an intersection (left image). A live digital twin creates unprecedented capabilities for both computer and
human consumption. With real-time situational awareness, a live twin has the potential to improve on-road
safety for automobiles (§3.1), motorcyclists, and pedestrians. A live twin enables better localization (§2.1) and
enhanced context-awareness for timely and efficient disaster relief operations (§4.1, [53]). Monitoring the
live twin over time can yield real-time, automated, fine-grained analytics in urban areas (e.g., parking spot
detection, traffic accident reporting, maintenance), agriculture, and forests. For humans, a digital twin
means the possibility of living, interacting and experiencing the physical world in augmented, virtual, and
mixed reality applications (e.g., digital teleportation).
This thesis takes the first steps towards building a live digital twin of the entire world by introducing
systems and techniques to replicate large physical spaces in near real-time. This vision is fueled by recent
technology trends. First, the proliferation of 3D sensors like LiDARs and stereo cameras (e.g., in automo-
biles [8] and smartphones [13] etc.) enables the possibility of perceiving the world in 3D, at scale. Second,
ubiquitous wireless connectivity opens up avenues to instantly share data from multiple 3D sensors over the
air. Lastly, edge and cloud resources have the potential of processing the shared data at low latency to build
live digital twins. To construct a digital twin, we continuously combine data from a set of 3D sensors into a
Figure 1.1: A live digital twin (left: physical world scene; right: live digital twin).
single coordinate system. We call this operation stitching. For instance, we can build a live digital twin of an
intersection in Figure 1.1 by stitching data from the four 3D sensors (green crosses) mounted on traffic light
posts, and two vehicles.
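To make the stitching operation concrete, the minimal sketch below (illustrative only; the NumPy-based helper, its name, and the assumption that each sensor's rigid pose in a common frame is already known are mine, not part of any system in this thesis) fuses per-sensor point clouds into a single coordinate system by applying each sensor's rotation and translation:

```python
import numpy as np

def stitch(point_clouds, poses):
    """Fuse per-sensor point clouds into one (world) coordinate system.

    point_clouds: list of (N_i, 3) arrays, each in its sensor's local frame.
    poses: list of (R, t) pairs, where R is a 3x3 rotation and t a 3-vector
           giving each sensor's pose in the world frame (assumed known,
           e.g., from calibration or alignment).
    Returns a single (sum N_i, 3) array in the world frame.
    """
    fused = []
    for cloud, (R, t) in zip(point_clouds, poses):
        fused.append(cloud @ R.T + t)   # rotate, then translate each point
    return np.vstack(fused)

# Toy example: two sensors, the second mounted 5 m away along x.
cloud_a = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
cloud_b = np.array([[0.0, 1.0, 0.0]])
identity = (np.eye(3), np.zeros(3))
shifted = (np.eye(3), np.array([5.0, 0.0, 0.0]))
print(stitch([cloud_a, cloud_b], [identity, shifted]))
```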
Technical Challenges. A live digital twin creates unprecedented capabilities like those mentioned above.
However, these capabilities have strict performance and accuracy requirements. Achieving these requirements
is impossible today for two reasons: limited wireless bandwidth, and limited on-board compute resources.
To stitch data from multiple 3D sensors, the data has to be transported over the network to some compute
resource for processing. Sometimes, the 3D data has to be transported over a bandwidth constrained wireless
network. However, cellular networks cannot cater to the bandwidth requirements of 3D sensors that generate
data at 100s of Mbps to Gbps. Secondly, limited on-board compute (e.g., on ground or aerial robots)
prevents processing 3D sensor data in real-time. Today, this data has to be collected and processed offline. So,
it is infeasible to construct a live digital twin.
Contributions. To tackle these challenges, this thesis builds end-to-end systems that transcend compute and
wireless bandwidth boundaries to replicate the physical world, live, in 3D. These systems leverage commercial-
off-the-shelf (COTS) technologies to enable real-time sharing and stitching of 3D data and meet the perfor-
mance and accuracy requirements of these novel capabilities. Based on the size of the scene to reconstruct
and the update frequency, our work falls into two categories (Figure 1.2). For physical scenes the size of an
Figure 1.2: Design space. Axes: physical scene size (intersection to city) vs. update latency (100 ms to 1 sec); Net-driving targets live digital twins, while CarMap and Fresco target up-to-date digital twins.
intersection, with dense sensor deployments, this thesis explored a set of novel capabilities that construct live
digital twins of the physical world within 100 ms. For larger areas like cities, with sparser sensor deployments,
this thesis devised another set of capabilities that construct up-to-date digital twins. These systems leverage
mobile sensors to sense changes to the physical world and use fast 3D updates to update the digital twin
accordingly in less than a second.
1.1 Live Digital Twins
Autonomous vehicles have a plethora of on-board sensors to understand their surroundings. On-board sensors
are prone to line-of-sight limitations, occlusions, etc. If vehicles share 3D data with other vehicles, and with the
infrastructure, in real-time, they can build a live digital twin of their surroundings. This digital twin extends
a vehicle's visual horizon and equips it with enhanced real-time 3D situational awareness for safer driving.
Limited wireless bandwidth and millisecond latency requirements prevent this. As such, this thesis explored
novel capabilities, through Net-driving (§3.1), that transcend bandwidth, and compute boundaries to build
live 3D digital twins.
1.1.1 Net-driving: Rethinking Autonomous Driving (Chapter 3).
Net-driving re-thinks the entire autonomous driving architecture based on three technology trends (i.e.,
cost-effective 3D sensors, edge-compute and 5G deployments). These trends motivate Net-driving’s design, in
which LiDARs are cheap enough to mount ubiquitously on roadside light poles (Figure 1.3 (A)); edge-compute
infrastructure can process LiDAR data to build 3D digital twins in near real-time; and 5G can convey the
Figure 1.3: Net-driving. (A) LiDARs at an intersection; (B) 3D frames from the LiDARs; (C) stitched 3D frames; (D) perception output.
results of that processing to vehicles in milliseconds. Thus, Net-driving offloads perception and planning to
roadside infrastructure; control remains on-board because it involves tight sub-millisecond feedback loops.
The key challenge in Net-driving was to meet the performance and accuracy standards set for an autonomous
vehicle while fusing data from multiple road-side LiDAR sensors, planning for tens of vehicles, and using only
COTS compute platforms. Net-driving processes voluminous sensor data at full frame rate
with a tail latency of less than 100 ms (video demos¹), without sacrificing accuracy. Net-driving uses an
accurate and lightweight perception component that relies on composite views (Figure 1.3 (C)) derived from
overlapping sensors (Figure 1.3 (B)), and a planner that jointly plans trajectories for multiple vehicles. The
perception component leverages the stationary LiDAR setting for faster stitching, makes judicious use of
hardware by offloading heavy computations to the GPU, and makes algorithmic innovations that leverage
a common vehicle abstraction (i.e., the 3D bounding box, Figure 1.3 (D)) for faster computations. Net-
driving's fast centralized planner uses motion-adaptive bounding boxes, incorporates vehicle dynamics, and
plans for a longer time horizon for safer driving. Net-driving is safer than autonomous driving, and enables
higher traffic throughput at intersections.
¹ Net-driving: https://www.youtube.com/channel/UCpb947_nEBAv_oikE7KsC-A
1.2 Up-to-Date Digital Twins
For larger physical areas, with sparse sensor deployments, this thesis leverages mobile sensors to build a base
digital twin. Then, it updates the digital twin with fast 3D updates.
1.2.1 CarMap: Fast 3D Feature Map Updates (Chapter 2)
The ability of a vehicle to localize itself accurately in the world is paramount to safe autonomous driving.
Vehicles use digital twins (3D maps) of the world to localize themselves. Today, these maps are infrequently
collected and built offline by fleets. Environmental changes render maps stale, and stale maps harm localization
accuracy. However, the huge imbalance between 3D sensor data rates and cellular bandwidths prevents real-
time updates to these maps. To tackle this, this thesis introduces CarMap (demo video²), a system that
crowd-sources map collection and updates. Each vehicle collects segments of a 3D map (Figure 1.4) and a
central cloud service stitches them together. The key challenge CarMap solves is finding a representation of
3D maps that can be shared over cellular networks in near real-time without compromising the localization
accuracy. For this, CarMap builds novel lean map representations that are 75x smaller than 3D maps. Then,
CarMap introduces novel feature matching techniques that use positional information instead of image-feature
similarity to ensure accurate localization with sparse data. To build stable maps, it filters dynamic objects,
and it has a suite of novel and efficient algorithms for updating 3D maps. CarMap disseminates map updates
within a split second, and localizes vehicles 50x better than non-updated maps.
1.2.2 Fresco: Fast, and Accurate 3D Reconstruction (Chapter 4)
Unlike vehicles, mobile devices like drones have two additional constraints, i.e., limited compute and limited
battery life. Because drones can have an intriguing top-down perspective, they are extensively used for 3D
modeling to build digital twins. These twins are used for AR/VR applications, construction monitoring, and agriculture.
² CarMap: https://youtu.be/SlG4QGq5ypk
Figure 1.4: CarMap (map segments from individual vehicles stitched into a single 3D map).
Figure 1.5: Fresco (a large building complex and the digital twin captured by Fresco).
Today, drones use photogrammetry for reconstruction, which depends heavily on the trajectory
used to capture images. Because there are no fixed rules for the trajectory design and photogrammetry
pipelines are slow and offline, it can take several weeks to get an accurate digital twin. To this end, this
thesis introduces Fresco, which provides fast, accurate reconstruction with a drone-mounted LiDAR. Accuracy
requires high drone positioning accuracy, and, because GPS can be inaccurate, Fresco uses SLAM. However, in
doing so it must deal with several competing constraints: drone battery and compute resources, SLAM error
accumulation, and LiDAR resolution. Fresco incorporates a novel integer linear programming based trajectory
optimization that minimizes flight length while observing constraints designed to ensure low SLAM error. It
also tracks, on-the-fly, SLAM drift accumulation and re-calibrates SLAM mid-flight to minimize error. Fresco
offloads SLAM to the cloud by streaming compressed LiDAR data over LTE. It reconstructs large structures
to within a few centimeters, and incurs less than 100 ms compute latency (Figure 1.5). It is significantly faster, and
more accurate than the state-of-the-art photogrammetry.
1.3 Thesis Outline
This thesis is organized as follows.
• Chapter 2 introduces CarMap to enable fast 3D updates to digital twins for robust autonomous vehicle
localization.
• Chapter 3 introduces Net-driving to build and leverage live digital twins for safer and more efficient au-
tonomous driving.
• Chapter 4 introduces Fresco to build fast and accurate live digital twins for large physical structures
via drones.
• Chapter 5 summarizes related work.
• Chapter 6 concludes the thesis and points out directions for future work.
Chapter 2
CarMap: Fast 3D Feature Map Updates
2.1 Introduction
Autonomous vehicles use a three-dimensional (3D) map of the environment to position themselves accurately
with respect to the environment. A 3D map contains features in the environment (§2.2), and their associated
positions. As a vehicle drives, it perceives these features using advanced depth perception sensors (such as
LiDAR and stereo cameras), then matches these to features in the map, and using the feature positions,
triangulates its own position.
Maps need to be updated whenever there are significant changes to the environment. Changes to the
environment can impact the set of features visible to a vehicle. For example, road or lane closures due to
construction or accidents, parked delivery vans impeding traffic flow, parked vehicles on the road-side, or
closures for sporting events can cause the set of features in the map to be different from the set of features
visible to the vehicle. This impacts feature matching, and can reduce localization accuracy. Figure 2.1
quantifies this in a simple scenario. In the image on the left, a street has been closed due to an accident.
With an outdated map, a car is unable to position itself; an updated map is necessary for accurate positioning.
Keeping this map up to date can be tedious. Today, large companies (e.g., Waymo [199], Uber [37],
Lyft [34], Here [110], Apple [27], Baidu [28], Kuandeng [32], Mapper [26]) employ fleets of vehicles equipped
with expensive sensors (LiDAR, Radar, stereo cameras) and GPS devices. For instance, Apple Map [22] uses
Figure 2.1: If short timescale events like traffic accidents (left) are not updated in maps, a vehicle cannot
localize itself (blue line) because it cannot match the scene with the map. On the other hand, vehicles with
updated maps (red line) can localize themselves accurately.
vans equipped with a high-precision GPS device, 4 LiDAR arrays, and 8 cameras, besides other equipment for
capturing mapping data. These vehicles scan neighborhoods periodically with a frequency determined by
cost considerations, which could be up to several thousand dollars per kilometer [25]. The scan frequency
determines the timescale of environmental changes captured by the map [23]. To capture these changes, vehicle
fleets have to continuously traverse the mapped area at very fine timescales [29], which can be prohibitively
expensive.
In this chapter, we take a first step towards answering the question: What techniques and methods can
ensure near real-time updates to 3D maps? The most promising architectural approach to this question, which
we explore in this chapter, is crowd-sourcing.¹
In this approach, which leverages the increasing availability of
depth perception sensors in vehicles, each vehicle, as it drives through a road segment, uploads map updates
in near real-time over a cellular network to a cloud service. The cloud service, which acts as a rendezvous
point, applies these updates to the map and makes these updates available to other vehicles.
Given today’s cellular bandwidths, this architecture is most suitable for a class of 3D maps in which
landmarks are sparse features in the environment. Even so, today’s feature-based 3D maps of the kind
generated by Simultaneous Localization and Mapping (SLAM) algorithms require an order of magnitude
higher bandwidth than cellular speeds (§2.2).
¹ Incentives for crowd-sourcing are beyond the scope of this work. Waze has successfully employed crowd-sourcing from vehicles by providing a navigation service, and CarMap can use similar techniques.
Contributions. Our first contribution (§2.3) is to identify the most parsimonious representation of feature
maps. SLAM feature maps preserve a large number of features, even transient ones, and build indices to enable
fast and effective feature matching. We show that it is possible to preserve fewer features, and reconstruct
the indices, without impacting localization accuracy while reducing map size significantly.
Because our lean map representation throws away information, we have had to re-think feature matching.
Our second contribution leverages the observation that, unlike robots, cars have approximate position infor-
mation (e.g., from GPS). Thus, instead of using statistics of features alone for matching, we also use position
information to enable a more robust feature search, leading to improved localization accuracy.
Vehicles will use feature maps over longer time-scales than SLAM maps used by robots,² so we must avoid
including features (e.g., from parked cars, or pedestrians) that may disappear over those time-scales. We
observe that semantic segmentation algorithms can identify such features. Our third contribution is a robust
resource-aware algorithm that incorporates the semantics of objects in the scene to perform dynamic object
filtering.
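As a hedged sketch of this idea (the class names below come from common semantic-segmentation label sets and the helper is hypothetical; CarMap's actual filter is described in §2.3.3), features whose pixels fall on movable objects are dropped before they can enter the map:

```python
DYNAMIC_CLASSES = {"car", "truck", "bus", "bicycle", "motorcycle", "person", "rider"}

def filter_dynamic_features(features, label_of):
    """Keep only features that lie on static structure.

    features: iterable of feature objects with a .pixel attribute (u, v).
    label_of: callable mapping a pixel to its semantic class, e.g., the output
              of a segmentation DCNN (assumed available).
    """
    return [f for f in features if label_of(f.pixel) not in DYNAMIC_CLASSES]
```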
Updates to a map can be of two kinds: map segments representing a previously unseen road segment, and
map diffs representing a transient in a previously-mapped road segment. Our last contribution is a collection
of algorithms for map update: a fast and efficient map diff algorithm which generates compact diffs and can
integrate these quickly into the map, and a robust map segment stitching algorithm that reliably identifies
areas of overlap between the map segment and the existing map, and uses features within the overlapped
region to transform the segment into the existing map’s coordinate frame of reference.
We have embodied these contributions in a system called CarMap. Using experiments on an implementa-
tion of CarMap built upon the top-ranked visual open-source SLAM algorithm [148], and real traces as well
as traces from a game-engine simulator [87], we show that (§2.5): CarMap requires 75× lower bandwidth than
competing algorithms; it can generate a map update, disseminate it to a participating vehicle, and integrate
the update into the vehicle's map in less than a second; its localization accuracy is better than state-of-the-art
SLAM algorithms, especially when a map is used in dramatically different conditions (e.g., denser traffic) than
when it was collected; it can localize a vehicle in some cases when other competitors cannot, such as when a
map obtained from one lane is used in another lane in a multi-lane street; its computational overhead is com-
parable to, and sometimes better than, competing strategies; and its feature labeling achieves upwards of 95%
accuracy in distinguishing static from non-static objects even when the underlying segmentation algorithms
have lower accuracy.
² In a robot, SLAM algorithms perform mapping and localization simultaneously. For vehicular use, a SLAM map is collected once, updated intermittently, and used often.
Figure 2.2: Localization using a feature-based map. The picture on the left shows the features in an image, and the picture on the right shows the feature map generated for an area. Features are color-coded by the type of object those features belong to.
2.2 Background and Motivation
SLAM Principles. SLAM represents a map by a set of landmarks and their associated positions [69]. As a
vehicle traverses the environment, its sensors (LiDAR, cameras) continuously generate measurements of the
environment. SLAM continuously outputs (a) detected landmarks, and (b) the current pose (position and
orientation) of the vehicle. It does this by using maximum a posteriori (MAP) estimation [150], finding the
landmark position and vehicle pose that best explain the observed measurements.
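To make this concrete, a standard textbook-style way to write the MAP estimate (not necessarily the exact estimator of [150]) is:

    \hat{X}, \hat{L} = \arg\max_{X,L} \; p(X, L \mid Z) = \arg\max_{X,L} \; p(Z \mid X, L)\, p(X, L),

where X are the vehicle poses, L the landmark positions, and Z the sensor measurements. Under Gaussian measurement noise, this reduces to a nonlinear least-squares problem over the measurement residuals, which SLAM back-ends solve with techniques such as bundle adjustment (§2.2).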
Feature-based Maps: Terminology. SLAM maps can contain either feature-based landmarks (extracted
from cameras [148] or LiDAR [207, 209]) or dense representations such as image frames [89] and occupancy
grids [186]. In this chapter, we explore crowd-sourcing feature-based maps (Figure 2.2), leaving denser
representations for future work.³ A feature is a lower-dimensional representation of some high-dimensional entity
in the environment (e.g., a leaf on a tree, or a part of a letter on a roadside sign), and is represented by a
feature signature.
Features are usually extracted from LiDAR or camera frames. For storage efficiency, SLAM implemen-
tations store features from approximately every k frames (so-called keyframes), for small k. These imple-
mentations associate each feature in a keyframe with a relative 3D position with respect to that keyframe.
They extract landmarks for the feature-based map from a subset of these features; we call these map-features.
Maps have a single coordinate frame of reference, and map-features have 3D positions relative to the map’s
coordinate frame of reference.
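The following sketch (hypothetical field names, not CarMap's or any SLAM library's actual classes) summarizes this terminology: keyframes hold features with keyframe-relative positions, while map-features carry a signature, a position in the map's coordinate frame, and the keyframes they appear in:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Feature:
    signature: bytes          # compact descriptor of the image/LiDAR feature
    rel_position: Vec3        # 3D position relative to its keyframe

@dataclass
class KeyFrame:
    frame_id: int
    pose: Tuple[Vec3, Vec3]   # (position, orientation) in the map frame
    features: List[Feature] = field(default_factory=list)

@dataclass
class MapFeature:
    signature: bytes
    position: Vec3                                          # in the map's coordinate frame
    keyframe_ids: List[int] = field(default_factory=list)   # keyframes it appears in
```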
SLAM Practice. Practical SLAM implementations are complex (Figure 2.3) because they have to deal
with sensor and estimation errors. We briefly describe SLAM components here, and introduce additional
background in later sections.
³ Which map technology a vehicle uses is generally proprietary information, but we conjecture, based on anecdotal evidence, that lower levels of autonomous driving [153] or vehicles that use stereo cameras will use feature-based maps [30] for cost reasons, while higher-end fully-autonomous vehicles with LiDAR will use denser maps.
Figure 2.3: Components of feature-based map generators.
Feature matching. Feature matching (or data association) is the process of matching features in the current
frame with features seen in one or more keyframes in the map. SLAM implementations match features
in a number of different ways: e.g., image feature matching uses the similarity of image signatures (feature
descriptors), and LiDAR 3-D features use feature geometry. Matching is a crucial building block for identifying
map-features (as described below). SLAM implementations contain two data structures to speed up feature
matching. A map-feature index associates map-features to keyframes they occur in. A feature index can
search for the keyframe whose features most closely match the features in a given frame.
Pose estimation. This component contains algorithms that estimate the pose of the vehicle. As a vehicle
traverses an environment, it first extracts features from each frame received from its sensor. Then, the vehicle
matches the extracted features with those extracted in the last frame. At this point, the vehicle knows (a) the
pose estimate in the previous frame, (b) the positions of the matched features in the previous frame and the
current frame. It then uses MAP estimation [150] to estimate the current pose of the vehicle. If the feature
matching step does not return enough features to estimate pose accurately, the vehicle uses the feature index
to search the entire map for keyframes containing features matching those seen in this frame, a step called
relocalization.
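A minimal sketch of this per-frame loop follows (the match threshold, the L2 descriptor distance, and the callable placeholders for MAP pose estimation and relocalization are assumptions, not the actual SLAM implementation):

```python
import numpy as np

MIN_MATCHES = 30  # assumed threshold; real systems tune this

def match(desc_a, desc_b, max_dist=0.7):
    """Nearest-neighbour descriptor matching (L2 distance) between two
    (N, K) and (M, K) descriptor arrays; returns index pairs (i, j)."""
    pairs = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            pairs.append((i, j))
    return pairs

def localize_frame(frame_desc, prev_desc, prev_pose, estimate_pose, relocalize):
    """One pose-estimation iteration: match against the previous frame's
    features; if enough matches, estimate the new pose, else relocalize."""
    pairs = match(frame_desc, prev_desc)
    if len(pairs) >= MIN_MATCHES:
        return estimate_pose(prev_pose, pairs)   # MAP estimation over the matches
    return relocalize(frame_desc)                # search the whole feature index
```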
Map augmentation. Pose estimation can estimate the 3D positions of features in each keyframe. It adds some
of these features as map-features, but only after filtering transient features (those that do not occur across
multiple frames [148, 209]) or dynamic features (e.g., features that belong to moving vehicles) whose position
is not stable across frames.
Error minimization. This component minimizes the error accumulated in the feature map. Local error
minimization rectifies error accumulation in successive frames using, for example, extended Kalman filters
for LiDARs and bundle adjustment [194] for cameras. When vehicles visit a previously-traversed part of the
environment, a loop closure algorithm finds matches between features in the current frame and features already
in the map, then reconciles their position estimates (while also correcting positions of features discovered
within the loop), thereby reducing error.
Challenges. CarMap faces four challenges.
Map size. CarMap could simply upload, over the cellular network, a SLAM map to the cloud, but these maps,
which include map-features, keyframes, and the two indices, can be large. A 1 km stretch of our campus
generates a 1.5 GB map. A car traveling at 30 kph would require a sustained bandwidth of 100 Mbps, well
above achievable LTE speeds [24].⁴ Our first challenge is to find a lean map representation that fits within
wireless bandwidth constraints.
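As a back-of-the-envelope check using only the numbers above: at 30 kph a vehicle covers 1 km in 120 s, and a 1.5 GB map segment is roughly 12,000 megabits, so

    (1.5 GB × 8,000 Mb/GB) / 120 s = 100 Mbps.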
Environmental dynamics. CarMap maps are meant to be used over a longer timescale than SLAM maps used
by robots, so they must be robust to environmental dynamics. For example, if a map includes features from
a parked car that has since moved, localization error can increase.
Effective feature matching. As in SLAM, CarMap relies heavily on accurate feature matching for pose es-
timation, relocalization, and loop closure. However, because CarMap’s lean map has less information than
SLAM’s, its feature matching accuracy can be lower, so CarMap must use a fundamentally different strategy.
Fast map-updates. CarMap must devise fast algorithms to (a) stitch additions to the map received from
vehicles traversing a previously unmapped road segment (decentralized SLAM algorithms [94] have a similar
capability but differ significantly in the details; see Chapter 5), and (b) generate and incorporate changes to the map
from temporary obstructions.
⁴ With standard compression techniques (e.g., gzip [86]), the sustained bandwidth is approximately 60 Mbps. Moreover, gzip compression adds latency: it takes approximately 25 seconds to compress a 500 MB map collected over 4 minutes.
Figure 2.4: Architecture and workflow of CarMap. Vehicle operations: map data collection, feature extraction, dynamic object filtering, perceived map segment, and map download for localization. Cloud operations: map stitching and the base feature map served to requesting vehicles.
2.3 Design of CarMap
Architecture and Workflow. As vehicles traverse streets (Step 1, Figure 2.4), they derive lean represen-
tations of feature maps using a map segment generator that runs on the vehicle (Step 2, §2.3.1). To this
representation, CarMap applies a dynamic object filter to improve robustness to environmental dynamics
(Step 3, §2.3.3). CarMap then determines whether this is a new map segment (not available in its own base
map). If so, it uploads the entire map segment, else it uploads a map diff (Step 4) to a cloud service. The
cloud service runs a stitcher to add a new segment to the map, or a patcher to patch the diff into the existing
map (Step 5, §2.3.4).
A vehicle receives, from the cloud service, segments or diffs contributed by other vehicles (Step 6), recon-
structs the complete map, and uses it for localizing the vehicle (Step 7, §2.3.5). Diff generation, stitching,
patching, and reconstruction use a position-based feature index for feature matching (§2.3.2), resulting in high
feature matching accuracy.
The on-vehicle compute resources needed to run map generation, matching, diff generation, and recon-
struction are comparable to those provided by commercial on-vehicle computing platforms like the NVIDIA
Drive AGX [35]. CarMap uses (a) cloud storage as a rendezvous for map updates from vehicles, and (b) cloud
compute to integrate map updates. Extensions to this architecture to use road-side units for storage and
processing are left to future work.
Figure 2.5: Dependencies between map components.
2.3.1 Map Segment Generator
The Problem. As a vehicle traverses a street, it produces map segments. The map segment generator
must find the leanest representation of the map that respects cellular bandwidth constraints while permitting
accurate localization.
As discussed in §2.2, a complete map contains four distinct components: (A) map-features, (B) features
associated with every keyframe, (C) a map-feature index that associates map-features with keyframes used to
generate the map-features (recall that a map-feature is one whose position is stable across several keyframes),
and (D) a feature index that finds the most similar keyframe to the current frame. Uploading the complete
map is well beyond cellular bandwidths (§2.2).
CarMap’s Approach. Consider Figure 2.5 in which an arrow from B to A indicates that B is needed to
generate A. Thus, for example, map features are generated from keyframe features. Similarly, to generate
the map-feature index, we need both map-features and keyframe features.
From this figure, it is clear that all other components can be generated from keyframe features. Thus, in
theory, it would suffice for CarMap to upload only the keyframe features, thereby reducing the volume of
data to be uploaded. Unfortunately, this does not provide significant bandwidth savings. For a 1 km stretch
of a street on our campus (§2.2), the keyframe features require 400 MB. At 30 kmph, this would require an
upload bandwidth of 26.67 Mbps, still above nominal LTE speeds. At higher speeds, CarMap would require
proportionally greater bandwidth since the vehicle covers more of the environment (Figure 2.20).
A Lean Map. CarMap uses a slightly non-intuitive choice of map representation: the map-features alone.
Each map-feature contains the feature signature, the 3D position in the map’s frame of reference, and the list
of keyframes in which the map-feature appears. In §2.5, we show that this representation permits real-time
map uploads.
Reconstruction. However, to understand why this is a reasonable representation, we describe how one can
reconstruct a full SLAM map from these map-features. Map-features have, associated with them, a list of
keyframes in which they appear. From these, we can generate keyframe features (a sequence of keyframes
and features seen in those keyframes). From these keyframe features, it is possible to generate the feature
index and the map-feature index, resulting in the complete SLAM map. §2.3.5 presents the details.
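A minimal sketch of this inversion step (attribute names match the earlier illustrative MapFeature sketch and are hypothetical; CarMap's actual reconstruction is described in §2.3.5): invert the map-feature-to-keyframe lists to recover, for each keyframe, the features visible in it, from which the two indices can then be rebuilt.

```python
from collections import defaultdict

def reconstruct_keyframe_features(map_features):
    """map_features: iterable of objects with .signature, .position, .keyframe_ids.
    Returns {keyframe_id: [map_feature, ...]}, i.e., per-keyframe feature lists
    from which the feature index and the map-feature index can be rebuilt."""
    keyframe_features = defaultdict(list)
    for mf in map_features:
        for kf_id in mf.keyframe_ids:
            keyframe_features[kf_id].append(mf)
    return dict(keyframe_features)
```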
However, the CarMap map contains only map-features whereas a SLAM map contains all features seen
in every keyframe. These fewer features can potentially impair feature matching accuracy. To address this,
CarMap employs a better feature search strategy.
2.3.2 Robust and Scalable Feature Matching
Background. Feature matching is a crucial component in feature-based localization, and determines both
the robustness of feature matching as well as scalability. Feature matching requires two operations: given a
frame F, (a) find keyframes with the most similar features, and (b) given a feature f in F and a keyframe
K, find those map-features m in K that are most similar to f. The first operation is used in relocalization
and loop closure, and the second operation is used for these two tasks as well as fine-grained pose estimation
(§2.2).
Similarity matching. Both of these operations use similarity matching techniques. For example, if a feature
is represented by a vector, then, the most similar feature is one closest by Euclidean distance to this feature.
Similarly, if a frame F can be represented by a signature in a multi-dimensional space, then the most similar
keyframe K is one that is closest by some distance measure.
Scaling similarity matching. To derive scalable feature matching, many SLAM implementations arrange
keyframe features in fast data structures. We have used the term feature index in §2.3.1 to describe these
data structures. In practice, implementations construct multiple indices.
To ground the discussion, we take a concrete example from a popular visual SLAM [148] implementation.
This implementation discretizes the space of features into hypercubes, and represents each hypercube by a
word. For example, if a feature f is represented by a vector < 1,5 >, and the hypercube has a side of 10
units, then, f falls into the hypercube defined at the origin. Suppose the hypercube is assigned the word “0”.
Then, any feature f′ that is assigned “0” (i.e., falls into the same hypercube) is close in feature space to f.
Search indices. The two feature matching operations can be implemented in scalable fashion using this word-based discretization. The first operation uses an inverted index I that maps each word to all the keyframes that it appears in. To find a keyframe closest to a given frame F, we can use the following algorithm. (i) Map each feature f in F to a word w_f. (ii) For each w_f in F, find all keyframes K associated with w_f in I. (iii) Take the intersection of all keyframes across all words w_f, then find those keyframes whose word histogram is most similar to F. The second operation requires a word search tree per keyframe K that maps a given w_f to those features in K that are closest to w_f in feature space.
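The following Python sketch illustrates these word-based indices. The discretization, the candidate-selection step (we take the union of keyframes sharing any word rather than a strict intersection), and the histogram-intersection similarity are simplifications of ours, not the exact procedure of ORB-SLAM2 or any other implementation.

from collections import Counter, defaultdict

def feature_to_word(feature, cell_size=10.0):
    # Discretize feature space into hypercubes of side cell_size; each
    # hypercube is a "word".
    return tuple(int(x // cell_size) for x in feature)

def build_inverted_index(keyframes):
    # keyframes: dict of keyframe_id -> list of feature vectors.
    index = defaultdict(set)
    for kf_id, features in keyframes.items():
        for f in features:
            index[feature_to_word(f)].add(kf_id)
    return index

def most_similar_keyframe(frame_features, keyframes, index):
    words = [feature_to_word(f) for f in frame_features]
    frame_hist = Counter(words)
    # Candidate keyframes: those sharing at least one word with the frame.
    candidates = set()
    for w in frame_hist:
        candidates |= index.get(w, set())
    if not candidates:
        return None
    def similarity(kf_id):
        kf_hist = Counter(feature_to_word(f) for f in keyframes[kf_id])
        # Histogram intersection as a simple word-histogram similarity.
        return sum(min(count, kf_hist[w]) for w, count in frame_hist.items())
    return max(candidates, key=similarity)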
The Problem. While these data structures work well for indoor robot navigation in relatively static en-
vironments, they can fail in more dynamic environments for outdoor vehicles. For example, keyframe word
histogram matching can fail when a map’s keyframe K was collected from an unobstructed view, while frame
F, taken at the same position, had a car in front of it which obscured many of the features in K. As another
example, consider a map of a 2-lane street where the map was taken from the right lane, but the vehicle
using the map is on the left lane; in this case, a feature’s signature may change if perceived from a different
3D position and orientation and hence result in a mismatch if the matching is based on feature similarity.
In these cases, feature matching can result in false positives: a keyframe K far from the vehicle's current position may better match the current frame F than the correct match K′ because features at completely different locations in a frame may look visually similar (e.g., features from trees of the same species).
CarMap’s Approach. To address these problems, instead of searching all keyframes in the map, CarMap
searches for matches in the vicinity of the vehicle’s current position. CarMap relies on a vehicle’s GPS
position to scope the search. However, GPS is known to be erroneous, especially in highly obstructed envi-
ronments [135], so CarMap searches over a large radius around the current GPS position (in our experiments,
50 m, larger than the maximum error reported in [135]).
Keyframe matching. Specifically, in addition to using the inverted index and word histogram similarity to
find matching keyframes in the base map, CarMap maintains a global k-d tree [62] of keyframes and uses
it to search for all keyframes in the map within a given radius. Then, to localize a vehicle with a frame F
in a given map, CarMap uses the GPS coordinates of the vehicle to get all keyframes within a large radius
around the GPS position. It then finds the subset of these keyframes that most closely resemble F based on
histogram matching. If it cannot find any resembling keyframes, then CarMap uses the keyframes closest to
the vehicle’s GPS coordinates. For each keyframe K in this subset, CarMap tries to find, for each feature f
in F, the closest matching feature in K. To do this, it first performs a coordinate transformation to find the
position of f in the map, assuming that F is at K’s position, and then performs feature matching.
Feature matching. Based on the position hints of the features, CarMap also maintains another global k-d
tree of map features, which partitions 3-D space into different regions to find all features in the map that are
closest (by position) to a given feature f. Then, for each feature f in frameF, CarMap finds all map-features
thatarespatialneighbors, andusesfeaturesimilaritytoidentifythematchingfeatures. Usingthesematching
features, it can perform pose estimation. CarMap then attempts to refine this pose estimate by searching
nearby (in position) map-features for additional feature matches.
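A simplified sketch of this position-scoped matching, using SciPy's k-d tree in place of CarMap's own implementation; the radii, the descriptor-distance threshold, and the per-call tree construction are illustrative (a real system would build and maintain the trees incrementally).

import numpy as np
from scipy.spatial import cKDTree

def keyframes_near_gps(gps_xyz, keyframe_positions, radius=50.0):
    # Scope keyframe matching to a large radius around the (noisy) GPS fix.
    tree = cKDTree(keyframe_positions)
    return tree.query_ball_point(gps_xyz, r=radius)

def match_feature_by_position(query_pos, query_desc, map_positions, map_descs,
                              search_radius=2.0, max_hamming=50):
    # Find map-features that are spatial neighbours of the query feature's
    # estimated position in the map, then pick the most similar descriptor.
    tree = cKDTree(map_positions)
    best_idx, best_dist = None, max_hamming + 1
    for idx in tree.query_ball_point(query_pos, r=search_radius):
        # Hamming distance between binary descriptors stored as uint8 arrays.
        dist = int(np.unpackbits(np.bitwise_xor(query_desc, map_descs[idx])).sum())
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx  # None if no spatially nearby map-feature is similar enough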
2.3.3 Dynamic Object Filter
Background. As a vehicle traverses an environment, it encounters three types of objects: a) static, b) semi-dynamic, and c) dynamic objects. Static objects are those that are at rest when perceived by the vehicle and are likely to stay in the same position for a long time, e.g., roads, buildings, traffic lights, and traffic signs. Dynamic objects are those that are in motion when perceived by the vehicle, e.g., moving vehicles and pedestrians. Semi-dynamic objects are those that have the ability to move but might not be in motion when perceived by the vehicle, e.g., parked vehicles and construction trucks.
The Problem. SLAM algorithms contain techniques to estimate whether a feature belongs to a dynamic object or not; if it does, that feature is not used in the map (§2.2). However, for a system designed for vehicles like CarMap, this is insufficient. These techniques work only if the majority of the scene is static, and fail in highly dynamic environments (as we show in §2.5). Similarly, unlike SLAM, CarMap maps are intended to be re-used over longer time scales, during which the environment might change significantly. If a map contains a feature f, say, belonging to a semi-dynamic object such as a parked car which has moved away by the time a vehicle uses that map (before another vehicle has contributed a map diff), keyframe matching and feature matching might fail.
Figure 2.6: Semantic segmentation of an image while driving.
CarMap’s Approach. To counter this, CarMap uses semantic segmentation to classify the whole scene
into static and (semi-)dynamic objects. Semantic segmentation can be performed on camera data as well as
LiDAR data, and refers to the task of assigning every pixel/voxel in a frame a semantic label (Figure 2.6),
such as “car”, “building” etc. In addition to motion analysis (§2.2), CarMap leverages these semantic labels
to determine whether to add features to the map.
Specifically, CarMap extracts features (Figure 2.4) and uses semantic segmentation to label each
point/pixel in the frame. It then associates each feature with the corresponding semantic label of the
particular pixel(s) that the feature covers. As a result, when a feature is generated, besides its feature signature and 3D position, CarMap also appends a semantic label to it. If the semantic label belongs to a dynamic or semi-dynamic object (e.g., car, truck, pedestrian, bike etc.), CarMap does not add it to the map. (For brevity, we use the term dynamic object filter for this capability, but it can detect semi-dynamic objects as well.)
To detect moving objects we could have used background subtraction, but CarMap needs the ability to
also detect semi-dynamic objects (e.g., parked cars). Object detectors can generate loose bounding boxes
for semi-dynamic objects, which can result in incorrect matches between features and their corresponding
objects.
Challenges. Semantic segmentation poses two challenges in practice. First, it is prone to errors, especially at the boundaries of different objects. For example, a state-of-the-art segmentation tool, DeepLabv3+ [73], has an iIoU score of 62.4% on a semantic segmentation benchmark (CityScapes [80]). (The IoU, or intersection over union, metric is biased towards classes covering a large image area; for autonomous driving, the iIoU metric is preferred because it is fairer towards all classes.) Second, it uses deep convolutional neural networks that are computationally very expensive (e.g., DeepLabv3+ runs at only 1.1 FPS on a relatively powerful desktop equipped with an NVIDIA GeForce RTX 2080 GPU).
Robust labeling. To tackle the first challenge, CarMap tracks feature labels across multiple frames and uses
a majority voting scheme for deriving robust labels. Consider a feature f that is detected and tracked
in multiple keyframes (only these features are likely to be added as map-features). In each keyframe, we
determine the semantic label associated with f. Instead of labeling each feature with its semantic label, we perform a coarser classification, determining whether that label belongs to a static class (road surface, traffic signals, buildings, vegetation etc.) or a non-static class (cars, trucks, pedestrians). This coarser
classification overcomes boundary errors in segmentation: even if the segmentation algorithm identifies a
pixel as belonging to a building when it actually belongs to a tree in front of the building, because both of
these are static objects, the pixel would be correctly classified as static. CarMap then does a majority voting
across these coarser labels to determine whether f is static or non-static. In §2.5, we show that this approach results in high classification accuracy.
Figure 2.7: When adding a new region to the base map, the vehicle uploads the whole map segment (above). For updating an existing map segment, CarMap generates a map diff containing new map features (below, new map features marked in blue).
Resource usage. Semantic segmentation CNNs can run at low frame rates. However, CarMap only needs to determine the label of a feature when creating map-features. These are assessed at keyframes, so segmentation need only be applied at keyframes. Depending on the vehicle's speed, SLAM algorithms [148] can generate keyframes at 1-10 frames per second. In §2.5, we explore a resource/accuracy trade-off: running slightly less accurate, but less resource-intensive, CNNs still gives acceptable performance in our setting. When segmentation cannot run on every keyframe, we mark the missed keyframe's features as unlabeled.
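A minimal sketch of the majority-voting label filter described above; the label sets and the policy for features with no labels at all are illustrative assumptions.

from collections import Counter

STATIC_LABELS = {"road", "sidewalk", "building", "traffic light", "traffic sign",
                 "vegetation", "pole"}
NON_STATIC_LABELS = {"car", "truck", "bus", "bicycle", "motorcycle", "pedestrian",
                     "rider"}

def coarse_class(label):
    # Collapse fine-grained semantic labels into static / non-static; keyframes
    # on which segmentation did not run yield "unlabeled".
    if label in NON_STATIC_LABELS:
        return "non-static"
    if label in STATIC_LABELS:
        return "static"
    return "unlabeled"

def keep_as_map_feature(per_keyframe_labels):
    # per_keyframe_labels: the label observed for this feature in each keyframe
    # that tracked it. Vote over coarse classes, ignoring unlabeled observations.
    votes = Counter(coarse_class(l) for l in per_keyframe_labels)
    votes.pop("unlabeled", None)
    if not votes:
        return True   # one possible policy when no keyframe was segmented
    return votes["static"] > votes["non-static"]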
2.3.4 Map Updater
Map Diffs. When a vehicle traverses a segment that exists in its own map, CarMap generates a compact
map diff to report newly discovered map features.
The Problem. CarMap may discover new features for two reasons. In Figure 2.7, if the feature map were
constructed from the top image, a vehicle traveling through the same region at a later time (bottom image)
might see new features previously occluded by the bus. Moreover, sparse SLAM algorithms are designed to
capture only a small portion of all the features in the environment to ensure real-time operation, so a new
traversal may discover additional features (Figure 2.7).
CarMap's Approach. A map diff compactly represents the newly discovered features. To explain how CarMap generates a map diff, consider a vehicle V, traversing a road segment R_A at time t_1, having an on-board map segment M_A of the same area from an earlier time t_0. CarMap loads the on-board map segment M_A into memory and marks all map elements (map-points and keyframes) as pre-loaded elements in the map. As the vehicle V traverses R_A, it localizes itself in the map segment M_A. At the same time, for every feature f_road the vehicle perceives, it uses CarMap's robust feature matching (§2.3.2) to query and match it against features f_A present in the map segment M_A in the same spatial vicinity. If the match is successful, that means the feature is already present in the map. If not, it is a new feature. This yields a set of features f_diff and keyframes K_diff that have been introduced in the time interval δt = t_1 − t_0. The vehicle uploads this diff map to the cloud service. The cloud service's patcher receives this and patches these map elements (f_diff and K_diff) into the base map. It also sends out the patch to all vehicles so that they can update their base maps.
Removing features no longer visible is tricky because those features could be, for example, occluded by a
parked vehicle. It is, however, important to do this in practice (e.g., features from objects present during a
transient road closure). We are currently working on a robust algorithm for this.
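The diff-generation step can be sketched as follows; match_fn stands in for the position-scoped feature matching of §2.3.2, and all names are ours rather than CarMap's.

def generate_map_diff(perceived_features, onboard_segment, match_fn):
    # Return only the perceived features that have no match in the on-board
    # map segment: these are the newly discovered features (f_diff).
    new_features = []
    for feature in perceived_features:
        # match_fn(feature, segment) returns a matching map-feature in the
        # same spatial vicinity, or None if the feature is not yet in the map.
        if match_fn(feature, onboard_segment) is None:
            new_features.append(feature)
    return new_features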
Map Segment Stitching. When it traverses a previously unseen road segment, CarMap uploads the map
segment to a map stitcher in the cloud.
The Problem. CarMap’s stitcher adds map segments (§2.3.1) received from vehicles into its base map (Fig-
ure 2.8). CarMap must address three challenges while stitching a map segment into a base map. It must
efficiently find potential regions of overlap between two map segments. The stitcher only has access to map-features at keyframes whereas SLAM algorithms preserve all features in each keyframe; feature matching can potentially be more difficult in CarMap. To scale well, CarMap must incrementally add new map segments to the base map without recomputing the whole map.
Figure 2.8: CarMap stitching together two feature maps. The highlighted regions represent overlapped sub-segments.
Input: Base map M_b and new map segment M_s
Output: Stitched base map M'_b
1  if M_b is empty then
2      M'_b ← M_s;
3  else
4      R_b ← Reconstruct(M_b);
5      R_s ← PartialReconstruct(M_s);
6      O_b, O_s ← FindOverlap(R_b, R_s);
7      T_bs ← FindTransform(O_b, O_s);
8      M*_s ← T_bs ∗ M_s;
9      M'_b ← Merge(M_b, M*_s);
10 end
Algorithm 1: Stitching Algorithm
CarMap's Approach. Algorithm 1 depicts the stitching algorithm. Suppose we have two map segments, the new incoming map segment M_s and the base map M_b. To stitch M_s with the base map M_b, CarMap first reconstructs (lines 4-5) (§2.3.5) the two map segments (R_b, R_s). Then, it uses (line 6) fast feature search (§2.3.2) to find the sub-segments (sequences of keyframes) that overlap (O_b, O_s). It then applies (line 7) feature matching between these sub-segments and uses these matches to compute the coordinate transformation matrix between M_s and M_b. It uses this matrix to transform M_s into M_b's coordinate frame of reference (lines 8-9). Finally, it removes duplicate features observed in both segments.
Finding Overlap. To find potential regions of overlap, CarMap uses two strategies. When the cloud service receives the new map segment M_s, it uses the GPS positions and word-histograms associated with M_s to coarsely find potentially overlapping keyframes in the base map M_b. For this, CarMap reconstructs all the data structures in M_b and only the word-histograms and keyframe-features of M_s, using the methods described in §2.3.5.
Then, CarMap finds a finer-grained overlap O_b and O_s (at the granularity of map-points) between M_s and M_b. For this, CarMap uses the reconstructed keyframe features of M_s. For each keyframe k_s in O_s, it uses the k-d tree to find all features (§2.3.2) in O_b that match features in k_s, instead of only matching features belonging to the two overlapping keyframes k_s and k_b. At the end of this process, there is a pairwise matching of features between O_b and O_s.
Computing the transformation matrix. In the next step, CarMap computes the transform (translation and rotation) to re-orient and position M_s in M_b. To do this, it finds the keyframe k_s from the new map segment with the maximum number of matched features from the previous step. Then it uses a perspective n-point (PnP [128]) solver to derive the coordinate transformation matrix, then transforms each map feature in M_s to M_b's frame of reference. After the transformation, CarMap removes all the duplicate map-features in the overlapping region O_b of the resulting base map M'_b that originated as a result of the transformation.
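An illustrative sketch of lines 8-9 of Algorithm 1: applying the estimated rigid transform to the new segment's map-feature positions and dropping duplicates that coincide with existing base-map features. The 4×4 homogeneous transform T_bs is assumed to come from the PnP step; the deduplication radius is an assumption, and a real implementation would also compare feature signatures before discarding a duplicate.

import numpy as np
from scipy.spatial import cKDTree

def apply_transform(positions, T_bs):
    # positions: (N, 3) array in M_s's frame; T_bs: 4x4 homogeneous transform
    # from M_s's frame of reference to M_b's.
    homogeneous = np.hstack([positions, np.ones((len(positions), 1))])
    return (homogeneous @ T_bs.T)[:, :3]

def merge_segments(base_positions, segment_positions, T_bs, dedup_radius=0.2):
    moved = apply_transform(segment_positions, T_bs)
    tree = cKDTree(base_positions)
    # Keep only transformed features that do not fall within dedup_radius
    # metres of an existing base-map feature.
    keep = [i for i, p in enumerate(moved)
            if not tree.query_ball_point(p, r=dedup_radius)]
    return np.vstack([base_positions, moved[keep]])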
2.3.5 Reconstruction
Map Segment Download. Before a vehicle enters a street, it retrieves a map segment from the cloud
service. This segment uses the lean representation described in §2.3.1.
Reconstruction Details. CarMap places map-features into keyframes, and adds them to the k-d tree
structures. It then generates word histograms and per-keyframe word search trees as in SLAM. To do this,
it must compute the 2D and 3D positions of each map-feature in the associated keyframe (recall that a
map-feature’s position in a map segment is with respect to the map’s frame of reference). To reconstruct
the position of a given map-feature f in a keyframe k, P^f_k, CarMap uses the global 3D position of the map-feature P^f_O, the respective keyframe's position P^K_O, and rotation matrix R^K_O to perform an inverse transformation:

P^f_k = (R^K_O)^{-1} (P^f_O − P^K_O)    (2.1)
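A one-function sketch of this inverse transformation, under the usual convention that R^K_O is a rotation matrix (so its inverse is its transpose); the function and argument names are ours.

import numpy as np

def feature_position_in_keyframe(p_f_global, p_k_global, R_k_global):
    # Equation 2.1: P^f_k = (R^K_O)^(-1) (P^f_O - P^K_O).
    return R_k_global.T @ (np.asarray(p_f_global) - np.asarray(p_k_global))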
2.4 CarMap Implementation
CarMap’s implementation consists of the following components.
Map segment generator. This component takes the output of ORB-SLAM2 (which includes map-features, keyframe features, and the two indices), and simply strips all components other than the map-features. We have also added the ability to periodically transmit complete map segments. On the receiver, we added a module to reconstruct (§2.3.5) the keyframe features from the received lean map and re-generate the indices.
Fast feature search. For this, we added the k-D tree data structure, and associated code for manipulating the
tree and searching in the tree, and re-used ORB-SLAM2 code for re-positioning a feature in a keyframe.
Stitcher. Stitching functionality does not exist in ORB-SLAM2. For stitching maps, we wrote our own
modules for ORB-SLAM2. We also added support for finding overlapped keyframes and computing the
transformation matrix.
Map updater. We wrote our own module for map updates. At the vehicle, our map update module uses a
fast feature search for finding differences in the two feature sets (environment and base map). At the cloud,
the module integrates these differences into the base map.
Dynamic object filter. We added a dynamic object filter to the mapping component of ORB-SLAM2 which
invokes semantic segmentation and applies majority voting to decide the label associated with each map
feature.
Map exchange. We added another module to allow the exchange of map segments, map updates, and the
base map between the vehicles and the cloud service.
2.5 CarMap Evaluation
In this section, we evaluate (a) real-time end-to-end latency of map update using experiments and (b) the
localization accuracy of CarMap using trace-driven simulation. We then report on microbenchmarks for its
lean map representation, feature map stitching, segmentation, and spatio-temporal robustness in localization,
using both synthetic and real-world traces.
2.5.1 Methodology
Traces. For our end-to-end accuracy evaluations, we use 15 km of stereo camera traces that we curated using
CarLA [87], the leading simulation platform for autonomous driving supported both by car manufacturers
and major players in the computing industry. CarLA can simulate multiple vehicles driving through realistic
environments — the simulator has built-in 3D models of several environments including freeways, suburban
areas, and downtown streets. Each vehicle can be equipped with stereo cameras or LiDAR sensors, and the
simulator produces a trace of the sensor outputs as the cars drive through. When curating our CarLA traces,
we model a stereo camera with the same properties (stereo baseline, focal length etc.) used in the KITTI
dataset. When evaluating CarMap, we only extract the left and right images from the modeled camera after
which we process the frames like we would for a real-world camera. We do not extract depth or segmentation
labels from CarLA but instead generate them using ORB-SLAM2's stereo matching and a segmentation CNN,
respectively.
For some of our microbenchmarks, we also used 22 km of real-world traces from the KITTI odometry
benchmark [102]. The KITTI benchmark traces only have a single run for each route, but for our end-to-end
evaluations, we need one run to build the map, and another to use the map for localization. This is why we
use traces from CarLA.
Finally, to validate real-time map updates (§2.5.2), we used 8 km of stereo camera data from our campus.
Metrics. For most evaluations, we are interested in end-to-end latency, localization accuracy, and map size.
To calculate localization accuracy, we build a map for a region and then localize another vehicle that drives in
the same region using that map. In this case, the localization error is the average translational/localization
error (used in KITTI odometry benchmark [102]) between the ground truth position of the vehicle and its
estimated position, averaged over the whole trace. In some experiments, we also measure compute times for
various operations. These measurements were taken on an Alienware laptop equipped with an Intel i7 CPU
running at 4.4 GHz with 16 GB DDR4 RAM and an NVIDIA 1080p GPU with 2560 CUDA cores.
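The error metric can be sketched as below. This is a simplification of ours (the official KITTI benchmark evaluates translational error over sub-sequences of several fixed lengths); it is meant only to illustrate normalizing the error by the length of the trace.

import numpy as np

def mean_translational_error_percent(estimated_xyz, ground_truth_xyz):
    est = np.asarray(estimated_xyz, dtype=float)
    gt = np.asarray(ground_truth_xyz, dtype=float)
    per_frame_error = np.linalg.norm(est - gt, axis=1)                # metres
    trace_length = np.linalg.norm(np.diff(gt, axis=0), axis=1).sum()  # metres
    return 100.0 * per_frame_error.mean() / trace_length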
Scenarios. For the end-to-end accuracy experiments, we generate CarLA traces to mimic three different
kinds of driving conditions: a) suburban streets (light traffic and some parked vehicles), b) freeway roads
(dense traffic), and c) downtown roads (dense traffic, with parked vehicles on both sides). For each of these,
we generate traces for a static scene (no traffic), and for a dynamic scene (with traffic). This allows us to
evaluate maps built for one kind of scene (e.g., static), but used in another (e.g., dynamic).
Comparison. In all these evaluations, we compare the performance of: a) maps generated by ORB-SLAM2,
b) a stitched map generated by QuickSketch [53], and c) a stitched CarMap map. QuickSketch is a competing
approach to map crowdsourcing that does not attempt near real-time map updates. In QuickSketch, map
segments are raw stereo camera traces, and the stitching algorithm feeds new map elements from the camera
trace into an existing base map generated by ORB-SLAM2. QuickSketch uses ORB-SLAM2’s relocalization
and feature matching components. We repeat each experiment three times and report the average values.
2.5.2 Near Real-Time Map Updates
Methodology. To measure end-to-end latency of map updates, we drove a vehicle for 16 minutes (8 km)
equipped with an Alienware laptop tethered to a phone with an LTE connection. The laptop sends map
updates to a remote server which runs CarMap’s diff integration and stitching operations, then sends the
map updates back to the vehicle. The end-to-end latency includes: update generation and transmission on
the sender, update processing on the cloud, and update transmission and integration on the receiver. We
conducted two experiments.
Map Diffs. In the first experiment, we measure end-to-end latency when all updates are in the form of map
diffs ( i.e., the vehicle drives through a previously mapped area). CarMap generates map diffs every 10 s.
Figure 2.9: End-to-end latency for CarMap's map update operation. CarMap's map updates are real-time: the average end-to-end latency is approximately 0.6 seconds.
As Figure 2.9 shows, the average end-to-end latency for CarMap's map update operation is 0.6 s. (As an aside, vehicles rely on these maps only to localize themselves, not for safety-critical operations, for which they use their sensors.) Update transmission times dominate the cost, since diff integration is fast (§2.3.4).
Map Segment Stitching. In the second real-time experiment, we measure the end-to-end latency when all updates are in the form of map segments (i.e., the vehicle drives through a previously un-mapped area). As before, CarMap generates map segments every 10 s; in this case, however, the cloud service needs to perform an expensive stitch operation (§2.3.4).
The overall end-to-end latency for map segment updates in CarMap (Figure 2.10), although about 3.2× more than map updates, is still only 2.1 seconds on average. Two factors contribute to a higher end-to-end latency. First, map segments are about 2-4× larger than map diffs (Figure 2.11). Second, they require about 10× more computation than map updates. Map update integration is only 50 ms whereas the partial reconstruction and stitching take nearly 500 ms. Even so, transmission and reception times dominate.
Figure 2.10: End-to-end latency for CarMap's map stitch operation. The stitch operation, on average, takes approximately 2.0 seconds for unmapped regions.
In summary, in CarMap, map updates can be made available to other vehicles in under a second. Even
in the rare event that a vehicle traverses an un-mapped road segment, map updates can be made available
in about 2 s.
2.5.3 End-to-End Localization Accuracy
We now demonstrate that CarMap has comparable or better localization accuracy than ORB-SLAM2 and
QuickSketch [53] for three different scenarios: static scene, dynamic scene, and multi-lane localization.
Static Scene Maps. In this scenario, we build a map from a static scene with no dynamic or semi-dynamic
objects (a static-map). We then use this map to localize a vehicle that drives in: a) the same static scene
(resulting in a static-trace), and b) the same scene with parked and moving vehicles (resulting in a dynamic-
trace). Figure 2.12 shows the average error and map sizes for each scheme and scenario. (We show the error
distributions in Figure 2.13 and Figure 2.14).
In all three environments (suburbia, downtown, and freeway), the localization error for the static-trace
in the static-map shows that CarMap is able to localize as accurately as ORB-SLAM2 even though the map
sizes are 23-26× smaller. Similar results hold for CarMap when compared against recent map crowdsourcing work, QuickSketch. This is because CarMap preserves all map-features that contribute most towards accurate localization.
Figure 2.11: Vehicle map uploads for map stitch and update operations. Map updates reduce required bandwidth by 2× as compared to stitching map segments.
However, for the dynamic-trace on the static-map, CarMap has nearly 28× better localization accuracy
than ORB-SLAM2 and QuickSketch. These differences arise from two features in our scenarios: traffic, and
the presence of parked cars, which impact localization accuracy in different ways.
Figure 2.12: Mapping error and map sizes for a static-map used with static- and dynamic-traces, for each
scenario. ∞ indicates that the scheme was not able to localize at all.
Figure 2.13: For a map built and used in a static trace collected from CarLA, 75% of the mapping errors for CarMap are less than 0.2% with respect to the length of the trace.
Figure 2.14: For CarLA maps built from static, and used in dynamic, environments, CarMap has a max error of 4%. ORB-SLAM2 and QuickSketch have maximum errors of 90%.
To understand why, consider a dynamic-trace on a suburban street. If the location or number of parked
cars in the dynamic-trace are different from those in the static-map, the signature of the observed frame (its
word histogram) is different in the trace than in the map. Because ORB-SLAM2 relies on word-histogram
matching for re-localization, it fails to find the right keyframe candidates to localize. In contrast, because
CarMap filters features belonging to parked cars, the vehicle in the suburban street sees similar features as
in the map, and can re-localize more accurately.
Now consider a dynamic-trace on the freeway, in which a vehicle’s view can be obscured by other vehicles,
so it is unable to observe many of the features in the map. This causes ORB-SLAM2’s word histogram
matching to fail. CarMap uses all keyframes within a 50 m radius of its current position, so it always
has keyframe candidates to search from. Even when histogram matching succeeds, ORB-SLAM2 uses per-
keyframe word search trees that can result in false-positive feature matches. CarMap uses feature position
based search to avoid this. In this scenario, moreover, ORB-SLAM2 believes features belonging to vehicles
moving in the same direction to be stable (since their relative speed is near zero), makes them map-features and uses them to track its own motion. CarMap's dynamic object filter avoids this pitfall.
Figure 2.15: Mapping error and map sizes for a dynamic-map used with static- and dynamic-traces, for each scenario. ∞ indicates that the scheme was not able to localize at all.
Figure 2.16: For maps built, and used, in CarLA's dynamic environments, CarMap has a maximum error of 2%. ORB-SLAM2 and QuickSketch have maximum errors of 90%.
Dynamic Scene Maps. In this scenario, we build a map from a dynamic scene (a dynamic-map) and then use the map to localize in a dynamic- or static-trace. Figure 2.15 summarizes the results from this experiment (Figure 2.16 plots the distribution of mapping errors).
The results for the dynamic-map are more dramatic than those for the static-map. CarMap’s map is
15-36× smaller than ORB-SLAM2’s or QuickSketch’s map. Despite this, these two approaches fail to localize
(denoted by ∞) on static-traces in downtown and suburban streets. In the static-trace, very few of the
perceived features appear in the dynamic-map, and relocalization fails completely. CarMap does well here
because it filters out all cars (parked or moving). For the dynamic-trace, its accuracy is nearly 50× better than ORB-SLAM2 and QuickSketch. CarMap's accuracy is lowest for the downtown dynamic-trace (with a 5% translational error) in which parked and moving cars obscure a lot of features in the map, resulting in fewer matches.
Figure 2.17: Mapping error (%) for multi-lane localization in static environments using maps collected from one lane in other parallel lanes.
Figure 2.18: Mapping error (%) for multi-lane localization in dynamic environments using a map collected from one lane in other parallel lanes. CarMap is robust to spatio-temporal changes.
Multi-Lane Localization. In this set of experiments, we consider a somewhat more challenging case, for
each of our scenarios: building a map by traversing one lane of a multi-lane street (4 freeway lanes, or 2 lanes
in the suburban and downtown streets), and then trying to localize the vehicle in each of the remaining lanes.
As before, we build both static-maps (Figure 2.17) and dynamic-maps (Figure 2.18).
For the freeway static-map, ORB-SLAM2 cannot localize beyond the second lane, while CarMap can
localize across all four lanes. For the dynamic-map, a more challenging case, CarMap can localize one lane over, but ORB-SLAM2 and QuickSketch cannot localize at all (denoted by ∞). In all these cases, ORB-SLAM2's search strategy fails because its keyframe search relies on the vehicle's perspective being the same as the map's perspective: in these experiments, that assumption does not hold. CarMap, by contrast, matches features by position, not perspective, so it is much more robust.
Figure 2.19: Map sizes on KITTI traces: for each alternative, the map size is normalized by CarMap's map size. The number on top of each group of bars shows the size in MB of CarMap's map for the corresponding KITTI trace. CarMap reduces map size by 20× for unmapped regions.
Similar results hold for suburban and downtown streets: ORB-SLAM2 and QuickSketch are unable to
localize, but CarMap is able to localize in all cases, with low error.
In §2.5.4, we show that CarMap’s mapping accuracy, which measures the inherent error introduced by
mapping, is comparable to ORB-SLAM2.
2.5.4 Other Performance Measures
Map Sizes in Real-World Traces. §2.5.3 shows that CarMap’s maps are lean relative to competing
strategies, but these are for synthetically generated traces. Figure 2.19 shows the map sizes for the 11 real-
world KITTI sequences. Across all sequences, CarMap reduces map size 20× . About 20% of this savings
comes from removing the reconstructible indices (the No Index column), and another 60% from removing
keyframe-features after generating the indices (the No Keyframe-features column).
As a vehicle travels faster, feature maps capture more data from the environment and generate data at a higher rate. We validate this in Figure 2.20 by calculating the bandwidth requirements of all 11 KITTI traces.
Maps generated by ORB-SLAM2 and the No-Index approach are impractical at all speeds for LTE wireless
upload and impractical for LTE download at speeds over 40 kph. The No Keyframe-features alternative is
impractical for LTE upload at speeds over 60 kph. CarMap requires less than 3 Mbps up to 80 kph (the
highest speed in the KITTI traces). Similar results hold for CarLA-generated traces (Figure 2.21).
Other factors determine map size, including visual richness of the environment, lighting, weather etc.
CarMap’s map size should still be an order of magnitude smaller than competing approaches; future work
can validate this.
Bandwidth Savings with Map Updates. In this section, we evaluate the ability of CarMap’s update
operation to reduce the amount of bandwidth required to update the base map. For these experiments, we
collected traces from the same area in CarLA in three different traffic conditions i.e., static with no parked
vehicles, semi-dynamic with only parked vehicles and dynamic with both parked and moving vehicles. We
build a map for each traffic condition and then measure the amount of bandwidth required to update the
existing map with features from a different set of conditions. The baseline we compare against is the map stitch case, in which we upload the whole map segment to the cloud service and the cloud service adds only the new map elements to the map.
The results from the experiment (Figure 2.22) show that, given a base map of the area, map updates can
reduce the amount of bandwidth required to integrate new features in the base map by 4-10× compared to
sending the whole map segment (75× savings as compared to QuickSketch and ORB-SLAM2). This happens because the map update only sends new features whereas the map stitch sends the whole perceived map segment.
Figure 2.20: Bandwidth requirements for the four mapping schemes averaged over diverse environments in all 11 KITTI sequences at different speeds. CarMap supports near real-time uploads over LTE at speeds up to 80 kph whereas other schemes fail even at low speeds.
Figure 2.21: Bandwidth requirements for mapping schemes at different speeds in CarLA. The bandwidth required to upload CarMap maps is well below the LTE upload limits.
Figure 2.22: Bandwidth requirements for map updates in CarMap under different traffic conditions.
Mapping Accuracy. In this section, we evaluate how CarMap’s reduced map sizes affect localization
accuracy. For this experiment, we use all 11 real-world traces from the KITTI dataset. We generate maps
for each of these traces, use them as base maps and localize the same trace in these maps. We compare the
generated trajectory with the ground truth positions. Figure 2.24 shows the average localization error divided
by the length of the whole sequence for all the KITTI sequences. Even though CarMap reduces map sizes by
a factor of 20, it is able to localize as accurately as ORB-SLAM2 in almost all KITTI sequences because: a)
it preserves the most important map elements (map-features), and b) it uses robust feature matching. Figure 2.23 shows that the error distribution of CarMap is similar to that of ORB-SLAM2 and QuickSketch for a map built from, and used in, the first KITTI sequence, despite reducing map sizes by a factor of 20.
An important property of a map is to be able to localize accurately over long distances. To study how
CarMap’s localization accuracy changes with the mapped area, we calculate the average translational error
at different distances ( i.e., 50m to 5km) for all 11 KITTI sequences. We average these errors on all KITTI
sequences and report the numbers in Figure 2.25. As distance increases, the average translational error decreases, and CarMap does as well as ORB-SLAM2 in almost all cases. The reason for this, as mentioned in §2.3, is that although CarMap removes keyframe-features, the robust feature matching (§2.3.2) makes up for the 20× fewer features with better matching.
Figure 2.23: For a map built from, and used in, a real-world trace (KITTI Trace 00), 80% of CarMap's mapping errors are less than 0.4% with respect to the length of the trace.
Map Reconstruction. CarMap reduces map size by trading off compute for storage. The map load time
for CarMap consists of the time to load the map from disk and the reconstruction time. After loading the
map into memory, CarMap reconstructs two indices and infers the 2D and 3D position of map-features in
keyframes (§2.3.5). Even so, as shown in Figure 2.27, except for sequences 00, 01, and 06, the load times for CarMap are less than the ORB-SLAM2 baseline (on average, 0.95×).
Figure 2.26 shows the breakdown of the various map elements that contribute to map reconstruction
time for all 11 KITTI sequences. In all sequences, reconstructing the feature-index takes around 40% of
the overall reconstruction time. This, however, is still 2-4x less than the reconstruction time for keyframes
that contain keyframe-features (in other mapping schemes) instead of just map-features. Calculating the 2D
and 3D positions of map-features also takes an average 35% of the overall reconstruction time. The main
reason for higher load times (Figure 2.27), as compared to ORB-SLAM2, in some cases (sequences 00, 01, and 06) is the variability in map-feature index (orange bar) reconstruction times. The map-feature index is a graph that relates map-points to the keyframes they were detected in. Hence, for environments like highways where the scene stays relatively constant, this graph is denser and so the reconstruction costs for the map-feature index are relatively greater. On the other hand, for environments where features change quickly, e.g., narrow streets, the map-feature index reconstruction times are lower because these graphs are not as dense. For instance, the feature-index reconstruction for sequence 00 (captured in narrow streets) is approximately 3× greater than for sequence 01 (captured on the highway).
Figure 2.24: Localization error for CarMap over all KITTI sequences. Even though CarMap uses 20× fewer features in its map, its localization error is almost the same as ORB-SLAM2's.
Figure 2.25: Mapping accuracy of mapping schemes with varying distance, averaged over all KITTI sequences. The overall localization error decreases over longer distances and CarMap's localization error is almost the same as ORB-SLAM2's.
Figure 2.26: Breakdown of reconstruction time for CarMap across all KITTI sequences.
Localization Time. CarMap’s accuracy comes at the cost of a slightly higher per-frame localization time.
During localization, CarMap’s feature search adds overhead. To quantify this, we built a map from a very
large trace with 4541 frames and then tried to localize in the same trace. ORB-SLAM2 has a per-frame
localization cost of 0.023 s, while CarMap’s is only marginally higher (0.033 s).
Map Stitching Evaluation. In this section, we evaluate the ability of CarMap to accurately stitch map
segments collected from different spatial and temporal conditions. We compare CarMap against two other map stitching schemes: progressive relocalization and QuickSketch. In progressive relocalization, as opposed to CarMap (one-shot stitching), we relocalize every keyframe from the incoming map segment instead of using the global transformation matrix. QuickSketch can only stitch a stereo camera trace with a QuickSketch-generated map segment. So, for stitching, QuickSketch loads the QuickSketch map as a base map and then stitches by localizing the stereo camera trace in it.
Figure 2.27: Load times on KITTI traces: for each alternative, the load times are normalized by CarMap's load time (whose absolute value is on top of each group of bars). CarMap loads faster than ORB-SLAM2 (i.e., ORB-SLAM2's load time ratio > 1), except for 3 KITTI sequences.
We evaluate two metrics for stitching: mapping error, and stitching time. After stitching two map
segments, we localize a trace in the stitched map and calculate the absolute translational error (m) for each
frame. Mapping error is the mean of the translational errors over the whole trace. The stitching time is the
amount of time required to do the whole stitch operation of two map segments.
Stitching Overlap. In the first experiment, we evaluate the mapping error and stitching time of the three
mapping schemes as a function of the overlap between the two map segments. For this, we take a single
stereo camera trace and split it into two traces with different overlaps. Figure 2.28 shows that QuickSketch
fails to stitch when the number of overlapping frames between the two map segments is less than 10 frames
(1 second). This is because it is not able to find enough feature matches between the two map segments.
On the other hand, CarMap can find enough feature matches even though it uses 20x fewer features due to
its robust feature matching (§2.3.2). The mapping accuracy remains relatively constant irrespective of the
amount of overlap because CarMap only needs to localize a single keyframe in the base map for a successful
stitch operation. Although the mapping error of progressive relocalization is identical to CarMap, it takes
approximately 30x more time to stitch the same area. In the stitch operation, localizing a keyframe in the
base map is the most expensive operation. CarMap intelligently localizes a single keyframe in the base map
and then uses a transformation matrix to shift the remaining map elements. On the other hand, progressive
relocalization localizes all keyframes in the base map and hence takes a much longer time. So, as the size of the
incoming map segment increases, the stitching time for progressive relocalization will increase significantly.
Stitching Overhead. To study the overhead of stitching, we take a KITTI trace and split it into two map
segments (with a few overlapping frames). In doing so, we mark one as the base map and the other as the
incoming map segment. We keep the size of the base map constant and vary the size of the incoming map segment. Figure 2.29 shows that the stitching time increases with the size of the incoming map segment. It also shows that for map segments containing as many as 1,000 keyframes (15 MB), stitching takes only 7 seconds.
Figure 2.28: Mapping error (m) with different overlapping regions. CarMap can stitch with fewer overlapping frames than QuickSketch and 30× faster than progressive relocalization.
Map Load Time. When it receives a map segment, CarMap needs to read the segment from disk,
reconstruct the keyframe features, and the indices. Figure 2.27 quantifies the total cost of these operations
(called the map load time) for each of the 11 KITTI sequences. The load times for other alternatives are
normalized by those for CarMap.
Interestingly, except for sequences 00, 01 and 06, load times for CarMap are less than ORB-SLAM2 (on
average, 0.95× ). For most sequences, CarMap’s load time is lower than ORB-SLAM2 because the latter’s
map is large enough that the time to load it from disk exceeds CarMap’s reconstruction overhead. Other
alternatives (No Index and No Keyframe-features) have large maps and high reconstruction overhead. When
CarMap’s reconstruction cost is (marginally) higher than ORB-SLAM2, it is because the corresponding
scenes have a dense map-feature index, leading to a slightly higher reconstruction cost. (See §2.5.4 for details.) Denser map-feature indices are found in environments with keyframes that have a large number of common map-features (e.g., freeways). We have verified both these observations (equivalent map load times and slightly higher load times for dense map-feature indices) for CarLA sequences.
Figure 2.29: Computational overhead of stitching. Even map segments as large as 1000 keyframes can be stitched in under 7 seconds.
Loop Closure. Loop closure is an important component of SLAM systems. For the KITTI dataset, we have
verified that, even though its maps contain only map-features, CarMap can perform all loop closures that
ORB-SLAM2 can.
2.5.5 Robustness
Robust Feature Matching. We compare CarMap's feature matching performance to that of ORB-SLAM2's native feature matching approach (we use ORB-SLAM2's default parameters for matching). For this, we build a map segment for a static trace and then use that trace to localize: a) the same static trace, b) a static trace
from a parallel lane, c) a dynamic trace from the same lane, and d) a dynamic trace from a parallel lane. We
collect the trace using CarLA on a freeway, and use two metrics: a) feature matching ratio (the percentage of map-features matched in the current trace), and b) localization error (m).
Figure 2.30: CarMap's robust feature matching finds more features in different conditions and thus localizes better than ORB-SLAM2.
Figure 2.30 shows that for all scenarios, robust feature matching is able to find more matches and hence
results in lower localization error as compared to ORB-SLAM2's feature matching. The base case (static-map
used by a static trace) shows that normal feature matching fails to detect 30% of the features even though
the same trace is used for mapping and localization. The introduction of dynamic objects reduces the feature
matching ratio because features are occluded by vehicles and hence cannot be detected even with robust
matching.
Making Semantic Segmentation Robust. CarMap makes segmentation robust by voting across multiple
keyframes, and using a coarser static vs. non-static classification. Figure 2.31 shows CarMap’s overall
accuracy, for three different versions of DeepLabv3+: DeepLabv3+ pre-trained on the CityScapes dataset, a fine-tuned DeepLabv3+ trained on the KITTI dataset, and a light-weight version of DeepLabv3+ (MobileNetv2) for mobile devices.
Figure 2.31: Semantic segmentation accuracy for different DCNNs. By classifying labels into static and dynamic objects, the segmentation accuracy for all DCNNs is above 96%.
The third column shows that CarMap achieves
upwards of 96% accuracy if we apply segmentation to every keyframe. Semantic segmentation, by itself,
achieves only 70% accuracy in label assignment (second column).
The first column shows the frame rate these DNNs run at. The frame rate needs to be fast enough to
process every keyframe, or at worst, every other keyframe (at which segmentation accuracy drops to about 85%, and below which it drops to unacceptable levels; see Figure 2.32). In the KITTI dataset, the average
across the 11 sequences is 3.17 keyframes per second, well within the rate of the MobileNetv2 version. One of
these sequences runs at 10 keyframes per second, so for this sequence MobileNetv2 would process every other
keyframe. For more dynamic scenes, it might be necessary to devise faster semantic segmentation techniques,
and we expect the vision community will make advances in this direction.
Figure 2.32 plots the accuracy of segmentation in CarMap using majority voting at different frame rates.
We start by running segmentation on every keyframe and evaluate up to running segmentation once every 10 keyframes. In the KITTI dataset, the average number of keyframes inserted per second is 3.17 and the worst case is 10 keyframes per second. The worst case corresponds to running segmentation every 2 keyframes, i.e., a class accuracy of
86% with CarMap using MobileNetv2 in a majority voting scheme.
Multi-Lane Stitching. CarMap can stitch map segments collected from different lanes. For this experi-
ment, we collect traces from four parallel lanes on a freeway in CarLA. Using each of these four traces as base
maps, we try to stitch map segments from other lanes into it, then evaluate the mapping error for the new
maps. Figure 2.33 shows the absolute mapping errors (in meters) for these stitched map segments. The first
column shows the lane used to collect the base map and the last four columns show the absolute mapping
error of a stitched map with each of these lanes. The ∞ sign represents a failure to stitch segments from the two lanes.
Figure 2.32: Semantic segmentation accuracy at different frame rates. If CarMap segments every other keyframe, classification accuracy is 85%.
Although QuickSketch’s base map has 20× more features than CarMap and it localizes a stereo camera
trace in that base map instead of another map segment (CarMap), it cannot stitch two lanes away. On the
other hand, CarMap’s stitching algorithm uses robust feature matching (§2.3.2) and can stitch map segments
collected two lanes away (e.g., map segments from lane 1 and lane 3). CarMap’s robustness comes purely
from using position hints to find the set of key-frames to match, and to find matching map features, while
QuickSketch uses ORB-SLAM2’s built-in matching methods (in this experiment, we do not compare against
ORB-SLAM2 because it does not contain a map stitch operation).
Stitching in Different Traffic Conditions. Besides being robust to spatial changes, crowdsourced map
collection and update requires robustness to temporal changes as well (e.g., changes in traffic during different
times of day). To evaluate this, we collect stereo camera traces from CarLA in suburban and downtown areas
in the same environment during different traffic conditions (no traffic and heavy traffic). Using these traces,
we evaluate the ability of the mapping schemes to stitch these map segments by comparing their mapping error.
Figure 2.33: Mapping error (m) for multi-lane stitching. CarMap's stitching algorithm uses a more robust feature search based on position hints to stitch map segments two lanes apart where competing strategies fail (∞ shows an unsuccessful stitch operation).
Figure 2.34: Mapping errors (m) for stitching map segments from different traffic conditions. CarMap is robust to temporal changes because: a) it removes dynamic objects, and b) it uses robust feature search.
Figure 2.34 shows that QuickSketch is unable to stitch because it fails to relocalize a trace in different
traffic conditions ( §2.5.3). This, again, is because its stitching is solely based on appearance-based matching
whereas CarMap uses position hints as well to make its stitching more robust.
2.6 Conclusion
CarMap enables near real-time crowd-sourced updates, over cellular networks, of feature-based 3D maps of
the environment. It finds a lean representation of a feature map that fits within wireless capacity constraints,
incorporates robust position-based feature search, removes dynamic and semi-dynamic features to enable
better localization, and contains novel map update algorithms. CarMap has better localization accuracy
than competing approaches, and can localize even when other approaches fail completely. Future work
can explore LiDAR sensors, mapping over timescales in which even relatively static features can disappear,
dense map representations, infrastructure-based sensing for map updates in low vehicle density areas, and
automated update of semantic map overlays (accidents, available parking spots).
Chapter 3
Net-driving: Rethinking Autonomous Driving
3.1 Introduction
Autonomous driving has made rapid strides in the past few years, and full autonomy may be feasible for some
cars by 2030 [133]. An autonomous vehicle’s software stack consists of three logical components [157, 159]:
perception, planning, and control (Figure 3.1).
Perception extracts knowledge of the vehicle’s surroundings, using one or more 3D depth-perception
sensors (e.g., radars, stereo cameras, and LiDARs). A LiDAR, for example, sends millions of light pulses
10 or more times a second in all directions, whose reflections enable the LiDAR to estimate the position of
points on objects or surfaces in the environment. The output of a LiDAR is a 3D point cloud that contains
points with associated 3D positions (the same is true of stereo cameras). The perception module extracts an
abstract scene description consisting of static and dynamic objects, from these point clouds. Static objects
include road surfaces, lane markers [112], and safe drivable space [76]. The module also identifies dynamic
objects (vehicles, pedestrians, bicyclists) and tracks them.
Planning uses the abstract scene description from the perception component to plan and execute a safe
path for the vehicle, while satisfying performance and comfort objectives [157] (e.g., smooth driving, minimal
travel time etc.). Planning executes at the timescale of 100s of milliseconds and its output is a trajectory — a
sequence of way-points, together with the precise times at which the vehicle must arrive at those way-points.
Figure 3.1: Autonomous vehicles run perception, planning and control on-board. Net-driving augments autonomous driving; it runs perception and planning on edge-compute infrastructure, and transmits trajectories directly to an autonomous vehicle's control module.
Control takes a trajectory and converts it, at millisecond timescales, to low-level actuator (e.g., throttle,
steering, brake) signals.
Net-driving: Infrastructure-Assist at Intersections. In some situations, it can be beneficial to augment autonomous driving with roadside compute and sensing infrastructure. In this approach, called Net-driving, roadside LiDARs (roadside LiDAR deployment pilots exist in several places [33, 38, 36]) have an unobstructed view of the roadway, and Net-driving feeds LiDAR outputs to a stack running on edge compute (Figure 3.1). Net-driving's perception module produces a continuously updated abstract scene description from the LiDARs. Its planning module plans trajectories for all autonomous vehicles; Net-driving transmits trajectories wirelessly (e.g., over 5G) to the autonomous vehicle, which feeds that trajectory to its control module (Figure 3.1). Thus, when Net-driving infrastructure is available, autonomous vehicles use Net-driving's trajectories, else they use trajectories generated by their
on-board planner. Net-driving is technically feasible because LiDARs are becoming cheaper [193, 40], and
5G [41, 43, 39] and edge computing [44] are seeing rapid ongoing deployment.
Net-driving can help improve safety and traffic throughput at intersections. Annually, about 2.5 million
accidents occur at intersections, which contribute to 50% of all serious accidents and 20% of fatalities [50].
Intersection safety is a challenge problem for autonomous vehicles as well [70] because of line-of-sight limita-
tions [163]. In addition, wait times at intersections in major metropolitan areas can be significant. A recent
report [48] estimates that drivers in California spend, on aggregate, 2.7M hours in a 24-hour period waiting
at intersections!
Net-driving has two properties that enable it to address these problems at intersections:
• LiDARs mounted at each corner of an intersection above car heights can get a comprehensive view
of all vehicles and traffic participants at an intersection (Figure 3.1). This improved perspective can
help overcome safety issues caused by line-of-sight limitations. Consider Figure 3.2, which illustrates
a common and deadly safety issue at intersections: red-light violations, which account for 700-800
fatalities in the US annually [50]. In this figure, the yellow vehicle’s view of the red-light violator (the
blue car) is blocked by a red truck. This scenario is known to be challenging for autonomous vehicles [70]
as well. Net-driving, because it notices the blue car, can generate a trajectory for the yellow vehicle to
cause it to slow down, and thereby avoid a collision.
• Because it jointly plans trajectories for all autonomous vehicles at an intersection, Net-driving can increase traffic throughput and even enable traffic-light-free intersections [188]. To see why, consider
the same scenario as in Figure 3.2, but now suppose that all vehicles are autonomous. Also suppose
there are no traffic lights at the intersection. Then, as the vehicles arrive, Net-driving can develop
trajectories that slow down the blue car or speed up the red truck and the yellow car (or both), so
that all cars can safely traverse the intersection. More important, joint trajectory planning is necessary
to achieve this capability. If Net-driving had only offloaded perception to infrastructure, and left it to
vehicles to plan trajectories based on its scene description, traffic can get deadlocked (§3.4.6).
Figure 3.2: How Net-driving works in a scenario from CarLA [87]. Roadside LiDARs transmit data over a
wired network to an edge cluster. Net-driving’s perception extracts vehicles and their motion information,
and its planning component determines that the blue vehicle can safely pass the stalled vehicle. Net-driving
wirelessly delivers a trajectory designed for the blue vehicle, whose control component executes the trajectory.
Net-driving Challenge. Net-driving has stringent bandwidth, latency, accuracy and robustness require-
ments, and it must be possible to meet these requirements on commodity hardware deployed on edge com-
puting infrastructure.
LiDARs generate data at 10 frames per second. At 30 MB per frame, this translates to a raw data rate
of nearly 10 Gbps for four LiDARs at an intersection. LiDARs can have wired network connections so the
network is unlikely to be a bottleneck, but extracting scene understanding from voluminous data can stress
compute.
Net-driving must produce trajectories within a tight latency (time taken to run a single frame from each
LiDAR through its perception and planning) constraint. Recent work [130] argues that autonomous vehicles
must make decisions with a tail latency of less than 100 ms. Other work exploring autonomous driving
safety also suggests the same constraint [181].
Net-driving generates scene descriptions which consist of objects, their tracks, positions, heading, and
speed. The accuracy of these quantities must match or exceed the state of the art.
Perception:
  Mapping. Net-driving: none. Baidu Apollo [46]: pre-built HD-Map. Autoware [142]: pre-built HD-Map.
  Localization. Net-driving: Live Fused Point Cloud (§3.2.2). Apollo: LiDAR scan matching fused with GPS-RTK and IMU. Autoware: LiDAR scan matching.
  Object detection. Net-driving: 3D Object Detection (§3.2.3). Apollo: obstacle detection in a 2D BEV projection. Autoware: 2D object detection in camera images, projected to 3D.
  Tracking and motion estimation. Net-driving: Kalman filter with matching, plus Heading Estimation (§3.2.4). Apollo: Kalman filter with matching. Autoware: Kalman filter with matching.
Planning:
  Net-driving: Centralized Interval Planner (§3.3). Apollo: ego-planner using hybrid A⋆ search. Autoware: ego-planner using conformal lattices.
Table 3.1: Net-driving uses a qualitatively different stack to meet throughput and latency requirements. Both perception and planning use different algorithms (the Net-driving entries with section references above, shown in bold in the original table) than existing open-source stacks.
Net-driving transmits trajectories wirelessly, and this must be robust to packet losses.
It is not immediately obvious that Net-driving can meet these requirements; for instance, the most accurate
2D and 3D object detectors on the KITTI leaderboard [101] require 60-300 ms!
Contributions. To address these challenges, we make the following contributions:
• Net-driving incorporates a novel stack for perception and planning, which is qualitatively different from
existing autonomous driving stacks.
• ItsperceptionpipelinefusesmultipleLiDARoutputsusinganovelalignmentalgorithmwhoseaccuracy
is significantly higher than prior work.
• Accurate alignment enables fast and cheap implementation of object extraction, tracking, and speed
estimation, all centered around a single bounding box abstraction computed early in the pipeline.
• However, cheap algorithms for estimating heading are inaccurate, so Net-driving develops a more accu-
rate heading estimator, and offloads this to a GPU to meet the latency constraint.
• Net-driving adapts an existing fast single-robot motion planner to jointly plan trajectories for multiple
vehicles, and to account for real-world non-idealities, and wireless packet loss.
Summary of Results. Evaluations on a testbed as well as on a photorealistic autonomous driving simula-
tor [87] show that: Net-driving achieves its performance objectives end-to-end, being able to process frames
from 4 LiDARs at 10 fps with a 99-th percentile tail latency under 100 ms; its perception component can
track vehicles to centimeter-level accuracy (comparable to SLAM [207]-based positioning); its planner ensures
safe passage in all red-light violation and unprotected left-turn scenarios, while autonomous driving can do
so only 27% of the time; it can reduce wait times 5× (relative to intelligent traffic lights).
Related Work. Net-driving is inspired by recent and planned roadside deployments of advanced sensors, and
a large body of work on augmenting connected autonomous driving vehicle perception using infrastructure-
based sensors (chapter 5). To our knowledge, no prior work has considered combining infrastructure-based
perception and planning — Net-driving demonstrates the end-to-end viability of such a capability.
Figure 3.3: (a) A bird’s eye view of an intersection. LiDARs are mounted on the traffic light poles at each of
the four corners of the intersection. (b) An individual frame from each of the 4 LiDARs. (c) A fused frame.
(d) Point clouds of traffic participants at the intersection. (e) Bounding boxes and motion vectors for traffic
participants, calculated over successive frames.
3.2 Net-driving Perception
Net-driving consists of two components (Table 3.1): perception and planning. In this section, we describe
the former.
3.2.1 Perception Overview
Inputs and Outputs. The input to perception is a continuous sequence of LiDAR frames from each LiDAR in a set of overlapping LiDARs deployed roadside. To ground the discussion, consider four LiDARs deployed at an intersection; Figure 3.3(a) shows a view of this intersection, with LiDARs mounted on traffic light poles at the four corners. (In this and subsequent sections, we focus on traffic management at intersections, a significant challenge in transportation research [120]; Net-driving generalizes to other settings, e.g., a busy thoroughfare, since it assumes nothing more than static deployment of overlapping LiDARs. Practical deployments of roadside LiDARs will need to consider coverage redundancy and other placement geometry issues, which are beyond the scope of this chapter.)
The output of perception is a compact abstract scene description: a list of bounding boxes of moving
objects in the environment together with their motion vectors (Figure 3.3(e)). This forms the input to the
planner (§3.3), which also requires other inputs that we describe later.
Approach. To address the challenges described in §3.1, Net-driving's perception uses a three-stage pipeline, each with three sub-stages (Figure 3.4). Fusion combines multiple LiDAR views into fused frames, and then subtracts the static background to reduce data. Participant extraction identifies traffic participants by clustering and estimates a tight 3D bounding box around each object. Tracking associates objects across frames, and estimates their heading and motion vectors.
This design achieves Net-driving’s goals using three ideas:
• Net-driving exploits the fact that LiDARs are static to cheaply fuse point clouds from multiple over-
lapping LiDARs into a single fused frame. Such a frame may have more complete representations of
objects than individual LiDAR frames. Figure 3.3(b) shows the view from each of the 4 LiDARs; in
many of them, only parts of a vehicle are visible. Figure 3.3(c) shows a fused frame which combines
the 4 LiDAR frames into one; in this, all vehicles are completely visible. (This may not be immediately obvious from the figure, but a LiDAR frame, or a fused frame, is a 3D object, of which Figure 3.3(c) is a 2D projection.)
• Net-driving builds most algorithms around a single abstraction, the 3D bounding box of a participant.
Its tracking, speed estimation, and motion vector estimation rely on the observation that the centroid
of the bounding box is a convenient consistent point on the object for estimating these quantities,
especially when fused LiDAR frames provide comprehensive views of an object.
• Net-driving uses, when possible, cheap algorithms rather than expensive deep neural networks (DNNs).
Only when higher accuracy is required does Net-driving resort to expensive algorithms, but employs
hardware acceleration to meet the latency budget; its use of a specialized heading vector estimation
algorithm is an example.
Qualitative Design Comparison. Net-driving’s perception design is qualitatively different (Table 3.1)
from that of other popular open-source autonomous driving stacks, Baidu’s Apollo [46] and Autoware [142],
in three different ways.
• Approaches [46, 142] rely on a pre-built HD-map, and localize the ego-vehicle (the one on which the
autonomous driving stack runs) by matching its LiDAR scans against the HD map (Apollo additionally
improves position estimates using multi-sensor fusion). In contrast, Net-driving does not require a map
for positioning: all vehicle positions can be directly estimated from the fused frame.
Figure 3.4: Perception stages and sub-stages. Fusion comprises alignment, stitching, and background subtraction; participant extraction comprises clustering, bounding box estimation, and cyber-physical association; tracking comprises track association, heading vector estimation, and motion vector estimation. We describe the sub-stages shown in bold in detail.
• Autonomous driving stacks use different ways to identify traffic participants. Apollo extracts obstacles
on a 2D occupancy grid and infers participants from contiguous occupied grids. Autoware uses camera
images and uses 2D object-detector DNNs, then back-projects these to the 3D LiDAR view. In contrast, Net-driving directly extracts the 3D point cloud associated with each participant. It can do this cheaply
using background subtraction because its LiDARs are static.
• Autonomous driving stacks use Kalman filters to estimate motion properties (speed, heading) of other vehicles (to estimate their own motion and heading, autonomous vehicles use SLAM). For heading, Net-driving uses a more sophisticated algorithm to ensure higher tail accuracy.
We now describe parts of Net-driving’s novel perception relative to autonomous driving stacks (bold text in
Table 3.1).
3.2.2 Accurate Alignment for Fast Fusion
Point Cloud Alignment. Each frame of a LiDAR contains a point cloud, a collection of points with 3D
coordinates. These coordinates are in the LiDAR’s own frame of reference. Fusing frames from two different
LiDARs is the process of converting all 3D coordinates of both point clouds into a common frame of reference.
Alignment computes the transformation matrix for this conversion.
Prior work has developed Iterative Closest Point (ICP) [75, 60] techniques that search for the lowest error
alignment. The effectiveness of these approaches depends upon the initial guess for LiDARs’ poses. Poor
initial guesses can result in local minima. SAC-IA [173] is a well-known algorithm to quickly obtain an initial
guess for ICP. As we demonstrate in §3.4, with SAC-IA’s initial guesses, ICP generates poor alignments on
full LiDAR frames.
Initial guesses using minimal information. Besides the 3D point clouds, SAC-IA requires no additional input. We have developed an algorithm to obtain good initial guesses (that enable highly accurate ICP alignments) using minimal additional input. Specifically, Net-driving only needs the distances on the ground (measured using, for example, an off-the-shelf laser rangefinder) between a reference LiDAR and all others to get good initial guesses. (Another possibility is to provide the GPS locations of the LiDARs as input; in §3.4, we show that, and explain why, this alternative also results in poor alignment.) We now describe the algorithm for two LiDARs L1 and L2; the technique generalizes to multiple LiDARs, as described in Algorithm 2.
• Inputs and outputs. The inputs are two point clouds C1 and C2 captured from the corresponding LiDARs at the same instant, and the distance d on the ground between the LiDARs. The output is an initial guess for the pose of each LiDAR. We feed these guesses into ICP to obtain good alignments.
• Fixing the base coordinates. Without loss of generality, set LiDAR L1's x and y coordinates to be (0, 0) (i.e., the base is at the origin). Then, assume that L2's base is at (d, 0).
• Estimating height, roll and pitch. In this step, we determine the height of each LiDAR zi, the roll (angle around the x axis), and the pitch (angle around the y axis). For these, Net-driving relies on fast plane-finding algorithms [84] that extract planes from a collection of points (lines 1 and 3 of Algorithm 2). These techniques output the equations of the planes. Assuming that the largest plane is the ground plane (a reasonable assumption for roadside LiDARs), Net-driving aligns the z axis of the two LiDARs with the normal to the ground plane (line 4). In this way, it implicitly fixes the roll and pitch of the LiDAR. Moreover, after alignment, the height zi of each LiDAR is also known (because the z axis is perpendicular to the ground plane).
Figure 3.5: Net-driving finds the yaw for the two LiDARs by searching ((a) and (b)) for the combination where two surfaces on the point clouds have the smallest angle (c).
• Estimating yaw. Finally, to determine yaw (angle around the z axis), we use a technique illustrated
in Figure 3.5 (line 6). Consider a point cloud visible in the two LiDARs. Net-driving searches for the
combination of yaw settings for the two LiDARs (Figure 3.5(a) and (b)) until it finds a combination
that results in the smallest angle between the two point clouds' orientations (Figure 3.5(c)). We have found that ICP is robust to initial guesses for yaw that are within about 15-20◦ of the actual yaw, so Net-driving discretizes the search space by this amount.
Net-driving repeats this procedure (lines 3-6) for every other LiDAR Li with respect to L1, to obtain initial guesses for the poses of every LiDAR. It feeds these into ICP to obtain an alignment.
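To make this procedure concrete, the following is a minimal Python sketch of the initial-guess computation for one (reference, other) LiDAR pair. The use of Open3D, the function names, and the numeric thresholds are illustrative assumptions; in addition, the sketch scores candidate yaws by ICP fitness rather than by the surface-orientation test described above.

    import numpy as np
    import open3d as o3d

    def level_transform(cloud):
        # Rotation + translation that make the dominant plane (assumed to be the ground) the z = 0 plane.
        (a, b, c, d), _ = cloud.segment_plane(distance_threshold=0.1, ransac_n=3, num_iterations=1000)
        n = np.array([a, b, c]) / np.linalg.norm([a, b, c])
        if n[2] < 0:
            n = -n                                   # make the plane normal point "up"
        v, cos = np.cross(n, [0.0, 0.0, 1.0]), n[2]
        K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        R = np.eye(3) + K + K @ K / (1.0 + cos)      # Rodrigues' formula: rotate n onto +z
        height = abs(d) / np.linalg.norm([a, b, c])  # LiDAR height above the ground plane
        T = np.eye(4)
        T[:3, :3], T[2, 3] = R, height
        return T

    def align_pair(cloud_ref, cloud_i, dist_i, yaw_step_deg=15.0):
        # Initial guess (height, roll, pitch from the ground plane; translation from the measured
        # ground distance dist_i; yaw from a coarse search), refined with ICP.
        T_ref, T_i = level_transform(cloud_ref), level_transform(cloud_i)
        ref = o3d.geometry.PointCloud(cloud_ref)
        ref.transform(T_ref)                         # leveled reference cloud, base at the origin
        best = None
        for yaw in np.radians(np.arange(0.0, 360.0, yaw_step_deg)):
            Rz = np.eye(4)
            Rz[:2, :2] = [[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]]
            shift = np.eye(4)
            shift[0, 3] = dist_i                     # LiDAR i sits dist_i meters from the reference
            guess = shift @ Rz @ T_i
            fitness = o3d.pipelines.registration.evaluate_registration(
                cloud_i, ref, max_correspondence_distance=1.0, transformation=guess).fitness
            if best is None or fitness > best[0]:
                best = (fitness, guess)
        icp = o3d.pipelines.registration.registration_icp(
            cloud_i, ref, 0.5, best[1],
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        return icp.transformation                    # maps cloud_i into the leveled reference frame

The returned transformation plays the role of Ti in Algorithm 2; iterating this over i = 2, ..., N yields the full set of initial alignments.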
Re-calibrating alignment. Alignment is run only once, when installing the LiDARs. Re-alignment may be
necessary if a LiDAR is replaced or re-positioned. Alignment is performed infrequently but is crucial to
Net-driving’s accuracy (§3.4).
Using alignment: Stitching. LiDARs generate frames at 10 fps (or more). In Figure 3.3, when each LiDAR
generates a frame, the fusion stage performs stitching. Stitching applies the coordinate transformation for
each LiDAR generated by alignment, resulting in a fused frame (Figure 3.3(c)).
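Concretely, stitching amounts to applying each LiDAR's alignment transformation and concatenating the results; the following is a minimal numpy sketch (the function name is illustrative):

    import numpy as np

    def stitch(frames, transforms):
        # frames: one (N_i x 3) point array per LiDAR; transforms: the per-LiDAR 4x4 matrices
        # computed by alignment (identity for the reference LiDAR).
        fused = []
        for points, T in zip(frames, transforms):
            homogeneous = np.hstack([points, np.ones((len(points), 1))])
            fused.append((homogeneous @ T.T)[:, :3])   # into the reference coordinate system
        return np.vstack(fused)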
input : (1) A set of Point Clouds, C = {C1, ..., CN}, where Ci denotes the Point Cloud of LiDAR Li.
        (2) A set of distances from the reference LiDAR, D = {d2, ..., dN}, where di is the known distance from LiDAR i to LiDAR 1.
output: A set of transformations T = {T2, T3, ..., TN}; we can apply transformation Ti to point cloud Ci to translate into the reference LiDAR's (C1) coordinate system.
1  segment ground plane(C1);
2  for i ← 2 to N by 1 do
3      segment ground plane(Ci);
4      align ground plane(Ci, C1);
5      align position(Ci, C1, di);
6      estimate yaw(Ci);
7  end
Algorithm 2: Estimating an initial guess for alignment
3.2.3 Reusing 3D Bounding Boxes
Net-driving’s efficiency results from re-using the 3D bounding box of a participant (Figure 3.3(e)) in many
processing steps. After stitching, Net-driving extracts the bounding box by performing background subtraction on the fused point cloud to extract points belonging to moving objects, then applies clustering techniques to determine points belonging to individual objects, and runs a bounding box estimation algorithm. These use
well-known algorithms, albeit with some optimizations; we describe these in §3.2.5.
Using the 3D Bounding Box. Net-driving uses the bounding box for many of its algorithms. Accurate
alignment enables low complexity algorithms that use the pre-computed bounding box geometry. These
algorithms are key to achieving low latency perception without compromising accuracy.
• To associate objects across frames (tracking), it uses a Kalman filter to predict the position of the
centroid of the 3D bounding box, then finds the best match (in a least-squares sense) between predicted positions and the actual positions of the centroids in the frame (see the sketch later in this subsection). Although tracking in point clouds is a challenging problem [196] for which research is exploring deep learning, a Kalman filter works exceedingly
well in our setting. The biggest challenge in tracking is occlusions: when one object occludes another
in a frame, it may be mistaken for the other in subsequent frames (an ID-switch). Because our fused
frame includes perspectives from multiple LiDARs, ID-switches occur rarely (in our evaluations, §3.4).
• To estimate speed of a vehicle, Net-driving measures the distance between the centroid of the bounding
box in one frame, and the centroid w frames in the past (w is a configurable window size parameter),
then estimates speed by dividing the distance by the time taken to generate w frames.
• Net-driving needs to associate an object seen in the LIDAR with a cyber endpoint (e.g., an IP address)
so its planning component can send a customized trajectory to each vehicle. For this cyber-physical
association (Figure 3.4), Net-driving uses a calibration step performed once. In this step, given a vehicle
for which we know the cyber-physical association (e.g., a LiDAR installer’s vehicle), we estimate the
transformation between the trajectory of the vehicle seen in the LiDAR view with the GPS trajectory
(we omit the details for brevity). When Net-driving runs, it uses this to transform a vehicle’s GPS
trajectory to its expected trajectory in the scene, then matches actual scene trajectory to expected
trajectory in a least-squares sense. To define the scene trajectory, we use the centroid of the bounding
box of the vehicle.
In all of these tasks, the centroid of the bounding box serves as an easily computed, consistent point within the vehicle. Moreover, because we have multiple LiDARs that capture a vehicle from
multiple directions, the centroid of the bounding box is generally a good estimate of the actual centroid of
the vehicle.
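The following minimal Python sketch illustrates the centroid-based association and speed estimation described above. It uses Hungarian matching from SciPy for the least-squares association, omits the Kalman filter's predict/update equations, and the gate and window values are assumptions rather than the dissertation's settings.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(predicted, detections, gate=2.0):
        # predicted: {track_id: predicted bounding-box centroid (3,)} from the per-track filters;
        # detections: list of bounding-box centroids in the current fused frame.
        ids = list(predicted)
        cost = np.array([[np.sum((predicted[i] - d) ** 2) for d in detections] for i in ids])
        rows, cols = linear_sum_assignment(cost)          # least-squares (Hungarian) matching
        return {ids[r]: c for r, c in zip(rows, cols) if cost[r, c] < gate ** 2}

    def speed(centroid_history, w=5, frame_rate=10.0):
        # Distance between the latest centroid and the centroid w frames earlier, divided by the
        # time taken to generate w frames.
        if len(centroid_history) <= w:
            return 0.0
        dist = np.linalg.norm(np.asarray(centroid_history[-1]) - np.asarray(centroid_history[-1 - w]))
        return float(dist * frame_rate / w)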
Besides these, Net-driving (a) estimates heading direction from the axes of the bounding box (discussed
in §3.2.4) and (b) uses the dimensions of the box to represent spatial constraints for planning (discussed in
§3.3).
3.2.4 Fast, Accurate Heading Vectors
Planning also needs the motion vector to predict trajectories for vehicles. To compute this, Net-driving first
determines, for each object, its instantaneous heading (direction of motion), which is one of the three surface
normals of the bounding box of the vehicle. It estimates the motion vector as the average of the heading
vectors in a sliding window of w frames. Most autonomous driving stacks can obtain heading from SLAM or
Figure 3.6: The figure shows the points belonging to a car in two successive frames t and t+1. (a) Strawman approach for heading determination. (b) Net-driving's approach.
visual odometry (chapter 5), so little prior work has explored extracting heading from infrastructure LiDAR
frames.
A strawman approach. Consider an object A at time t and time t+1. Figure 3.6(a) shows the points belonging
to that object. Since those points are already in the same frame of reference, a strawman algorithm is: (a)
find the vector between the centroid of A at time t and centroid of A at time t+1, (b) the heading direction
is the surface normal (from the bounding box) that is most closely aligned with this vector (Figure 3.6(a)).
We have found that the error distribution of this approach can have a long tail (although average error is
reasonable). If A has fewer points in t+1 than in t (Figure 3.6(b), upper), the computed centroid will be
different from the true centroid, which can induce significant error.
Net-driving’s approach. To overcome this, we (a) use ICP to find the transformation matrix between A’s
point cloud in t and in t+1, (b) “place” A’s point cloud from t in frame t+1 (Figure 3.6(b), lower), (c)
then compute the vector between the centroids of these two (so that the centroid calculations are based on the same set of points). As before, the heading direction is the surface normal (from the bounding box) that
is most closely aligned with this vector.
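The following is a minimal Python sketch of this heading computation, assuming Open3D's CPU ICP (the actual system uses the GPU implementation described next, and the correspondence threshold is an assumption):

    import numpy as np
    import open3d as o3d

    def heading(points_t, points_t1, box_axes):
        # points_t, points_t1: (N x 3) arrays with the object's points in frames t and t+1 (already
        # in the fused frame of reference); box_axes: the three surface normals of its bounding box.
        src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points_t))
        dst = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points_t1))
        icp = o3d.pipelines.registration.registration_icp(
            src, dst, 0.5, np.eye(4),
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        placed = o3d.geometry.PointCloud(src)
        placed.transform(icp.transformation)          # "place" the frame-t points in frame t+1
        v = np.asarray(placed.points).mean(axis=0) - np.asarray(src.points).mean(axis=0)
        # Heading is the bounding-box surface normal most closely aligned with the displacement.
        cosines = [np.dot(v, n) / (np.linalg.norm(v) * np.linalg.norm(n) + 1e-9) for n in box_axes]
        best = box_axes[int(np.argmax(np.abs(cosines)))]
        return best if np.dot(best, v) >= 0 else -best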
GPU acceleration. ICP is compute-intensive, even for small object point clouds. If there are multiple objects
in the frame, Net-driving must run ICP for each of them. We have experimentally found this to be the
bottleneck, so we developed a fast GPU-based implementation of heading vector estimation, which reduces
the overhead of this stage (§3.4.3).
A typical ICP implementation has four steps: (1) estimate correspondence between the two point clouds,
(2) estimate transformation between two point clouds, (3) apply the transformation to the source point cloud
and (4) check for the ICP convergence. The first step requires a nearest neighbor search; instead of using
octrees, we adapt a parallelizable version described in [100] but use CUDA’s parallel scanning to find the
nearest neighbor. The second step shuffles points in the source point cloud then applies the Umeyama [195]
algorithm for the transformation matrix. We re-implemented this algorithm using CUDA’s demean kernel,
and a fast SVD implementation [99]. For the third and fourth steps, we developed custom CUDA kernels.
3.2.5 Other Details
Background Subtraction. This step removes points belonging to static parts of the scene. It is (a) especially crucial for voluminous LiDAR data and (b) feasible in our setting because the LiDARs are static. It requires a calibration step to extract a background point cloud from each LiDAR [118], then creates a background fused frame using the results from alignment; taking the intersection of a few successive point clouds, and then aggregating intersections taken at a few different time intervals, works well to generate the background point cloud. (Static parts of the scene, e.g., an object on the drivable surface, might be important for path planning. Net-driving uses the static background point cloud to determine the drivable surface; this is an input to the planner (§3.3). We omit the details for brevity.)
Optimization: background subtraction before stitching. Net-driving subtracts the background fused frame
from each fused frame generated by stitching. We have found that a simple optimization can significantly
reduce processing latency: first removing the background from each LiDAR frame (using its background point
cloud), and then stitching points in the residual point clouds. Stitching scales with the number of points,
which this optimization reduces significantly.
Optimization: exploiting LiDAR characteristics. Many LiDAR devices only output returns from reflected
laser beams. Generic background subtraction algorithms require a nearest-neighbor search to match a return
with the corresponding return on the background point cloud. Some LiDARs (like the Ouster), however,
indicate non-returns as well, so that the point cloud contains the output of every beam of the LiDAR. For
these, it is possible to achieve fast background subtraction by comparing corresponding beam outputs in a
point cloud and the background point cloud.
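A minimal sketch of the beam-wise comparison, assuming the LiDAR driver exposes a fixed-size range image (beams x azimuth bins) with zeros encoding non-returns; the threshold value is an assumption:

    import numpy as np

    def foreground_mask(ranges, background_ranges, threshold=0.3):
        # ranges, background_ranges: (beams x azimuth_bins) arrays of per-beam distances, in meters.
        # A beam output is foreground if it returned something and its range differs from the
        # background range for the same beam by more than `threshold`.
        return (ranges > 0) & (np.abs(ranges - background_ranges) > threshold)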
At the end of these two steps, the fused frame only contains points belonging to objects that are not part
of the scene’s background (Figure 3.3(d)).
Computing the 3D bounding box. On the points in the fused frame remaining after background sub-
traction, Net-driving uses a standard, fast, clustering algorithm (DBSCAN) [91] to extract multiple clusters
of points, where each cluster represents one traffic participant. Then, it uses an off-the-shelf algorithm [20],
which determines a minimum oriented bounding box (Figure 3.3(d)) of a point cloud using principal compo-
nent analysis (PCA). From these, we can extract the three surface normals of the object (e.g., vehicle): the
vertical axis, the axis in the direction of motion, and the lateral axis.
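A minimal Python sketch of this step, using scikit-learn's DBSCAN and a simple PCA-based box; note that a PCA box is not necessarily the minimum oriented box produced by the off-the-shelf algorithm [20], and the clustering parameters are assumptions:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def extract_participants(points, eps=0.7, min_samples=10):
        # points: (N x 3) foreground points of a fused frame (after background subtraction).
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        participants = []
        for label in set(labels) - {-1}:                     # -1 marks DBSCAN noise
            cluster = points[labels == label]
            centroid = cluster.mean(axis=0)
            # Principal axes of the cluster serve as the box's surface normals.
            _, _, axes = np.linalg.svd(cluster - centroid, full_matrices=False)
            projected = (cluster - centroid) @ axes.T
            participants.append({"centroid": centroid,
                                 "axes": axes,
                                 "size": projected.max(axis=0) - projected.min(axis=0)})
        return participants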
3.3 Net-driving Planning
Inputs, Outputs, and Goals. From perception, Net-driving’s planner receives bounding boxes for non-
background objects, and their motion vectors. Some of these objects will be Net-driving-capable vehicles;
these follow trajectories designed by Net-driving. For these, Net-driving needs each vehicle’s navigation goal,
e.g., where it is headed: we assume the planning component can get this by communicating with the vehicle.
Other objects will be Net-driving-oblivious: these include pedestrians, or vehicles driven by humans or on-
board autonomous driving agents. The output of planning is a trajectory — a sequence of way-points, and
times at which those way-points should be reached — for each Net-driving-capable vehicle.
Planning executes at every frame and must satisfy two goals: (a) it must fit within the tail latency
budget of 100 ms, and (b) it must generate collision-free trajectories at every instant even in the presence of
Net-driving-oblivious objects.
Net-driving's Planner. The best-known centralized robot motion planner, Conflict Based Search [182], which jointly optimizes paths for multiple robots, has high tail latency [55, 160]. Instead, we use a fast single-robot motion planner, SIPP [160]. The input to SIPP is a goal for a robot and the positions over time of the dynamic
obstacles in the environment. The output is a provably collision-free shortest path for the robot [160].
At each frame, Net-driving must plan trajectories for every Net-driving-capable vehicle. Without loss of
generality, assume that vehicles are sorted in some order. Net-driving’s centralized planner iteratively plans
trajectories for vehicles in that order:
When running SIPP on the i-th vehicle, Net-driving represents all i − 1 previously planned vehicles as dynamic obstacles in SIPP.
Figure 3.7(b) illustrates this, in which the trajectory of a previously planned (red) vehicle is represented as
an obstacle when planning a trajectory for the blue vehicle.
This approach has two important properties. First, all trajectories are collision-free. Two cars i and
j cannot collide, since, if j > i, j’s trajectory would have used i’s as an obstacle, and SIPP generates a
Figure 3.7: (a) The first (red) planned vehicle can use the entire drivable space. (b) The second (blue) vehicle treats the first as an obstacle (different shades represent different times at which grids are occupied). (c) The motion-adaptive buffer around a vehicle is proportional to its speed.
Figure 3.8: With autonomous driving’s decentralized planning, in the absence of a traffic light controller,
vehicles come to a deadlock at the intersection where they are unable to cross it.
collision-free trajectory for each vehicle. Second, Net-driving’s planner cannot result in a traffic deadlock. A
deadlock occurs when there exists a cycle of cars in which each car's forward progress on its currently planned
trajectory is hampered by another. Figure 3.8 shows examples of deadlocked traffic with two, three, and four
cars (the scenarios under which these deadlocks occurred are described in §3.4.6). However, Net-driving
cannot deadlock. If there exists a cycle, there must be at least one pair of cars i,j where i < j and i blocks
j. But this is not possible, because, when planning for j, Net-driving represents i’s trajectory as an obstacle.
However, this approach may sacrifice optimality in transit times: vehicles later in the planning order may
see longer transit times. On the flip side, it may allow Net-driving to achieve novel traffic policies (e.g., those
that prioritize emergency vehicles). Future work can explore such traffic policies.
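The iterative planning loop can be summarized by the following outline; sipp_plan and predicted_trajectory are hypothetical helpers that stand in for the SIPP planner and for extrapolating Net-driving-oblivious objects, so this is a sketch of the control flow rather than runnable code:

    def plan_all(vehicles, oblivious_objects, drivable_grid):
        # vehicles: Net-driving-capable vehicles in a fixed planning order, each with .id, .start, .goal.
        # oblivious_objects: bounding boxes plus motion/heading vectors from perception.
        obstacles = [predicted_trajectory(obj) for obj in oblivious_objects]   # hypothetical helper
        trajectories = {}
        for vehicle in vehicles:
            # Every previously planned vehicle (and every oblivious object) is a dynamic obstacle.
            trajectory = sipp_plan(vehicle.start, vehicle.goal, obstacles, drivable_grid)  # hypothetical
            trajectories[vehicle.id] = trajectory
            obstacles.append(trajectory)
        return trajectories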
Because it was designed for robots, SIPP makes some idealized assumptions. Net-driving adapts SIPP to
relax these.
Robustness to Net-driving-oblivious participants. SIPP plans the entire trajectory for a robot once, given
all the dynamic obstacles in the environment. Net-driving, to deal with Net-driving-oblivious vehicles whose
trajectories can change dynamically, re-plans trajectories for every Net-driving-capable vehicle at every frame. This also relaxes the assumption in SIPP that all objects move at a fixed speed. For each Net-driving-oblivious
object, Net-driving uses its bounding box, and motion and heading vector.
Robustness to perception errors. SIPP assumes that each object has the same size, occupying one grid element.
To relax this to permit different vehicle sizes (cars, trucks), Net-driving uses the bounding boxes generated
by perception to determine grid occupancy. Because these bounding boxes may be incorrect, Net-driving
extends the bounding box by a small buffer on all sides. This extended buffer is motion-adaptive: the faster
the vehicle travels, the larger the buffer around it (Figure 3.7(c)).
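A minimal sketch of the motion-adaptive buffer; the base buffer and per-speed coefficient are illustrative assumptions, not values from the dissertation:

    import numpy as np

    def inflated_footprint(box_size_xy, speed_mps, base_buffer_m=0.3, per_mps=0.1):
        # Grow the vehicle's 2D footprint by a buffer that increases with its speed.
        buffer_m = base_buffer_m + per_mps * speed_mps
        return np.asarray(box_size_xy, dtype=float) + 2.0 * buffer_m
        # e.g., inflated_footprint((4.5, 1.9), 15.0) -> roughly an 8.1 m x 5.5 m occupancy footprint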
Robustness to packet losses. The trajectory generated by Net-driving is transmitted over a wireless network to
the vehicle. To be robust to packet losses, Net-driving generates waypoints over a longer horizon (10 s [181])
than the inter-frame time (100 ms in our case). This way, it transmits trajectories every 100 ms, but if a
transmission is lost, the vehicle can use the previously-received trajectory. (If there is a catastrophic failure, Net-driving-capable vehicles must have a fallback-to-human strategy, but autonomous driving will also need this.)
Robustness to vehicle kinematics limitations. SIPP assumes that a vehicle can start and stop instantaneously.
In practice, a vehicle’s kinematic characteristics dictate how quickly vehicles can start and stop. The longer
planning horizon gives the vehicle's controller time to adapt to vehicle kinematics. The controller (on-
board the vehicle) can determine, from the received trajectory, if it has to stop at some future point in time;
it can then decide when to apply the brakes, and with what intensity, to effect the stop.
Net-driving also extends on-board control to respect vehicle kinematic constraints; we omit these for
brevity.
3.4 Evaluation
Our evaluations show that Net-driving: achieves frame rate processing and less than 100 ms tail latency both
in real-world experiments (§3.4.2) and in simulation (§3.4.3); achieves accuracy comparable to prior work
(§3.4.4); can be safer than autonomous driving (§3.4.5); and permits high-throughput traffic management
(§3.4.6).
3.4.1 Methodology
Implementation. We implemented Net-driving’s perception and planning on the Robot Operating Sys-
tem (ROS [164]). ROS provides inter-node (ROS modules are called nodes) communication using publish-
subscribe, and natively supports point clouds and other data types used in perception-based systems. Net-
driving’s perception component runs as a ROS node that subscribes to point clouds, processes them as
described in §3.2, and publishes the results. Net-driving’s planner builds on top of an open-source SIPP
implementation [200], runs as a ROS node, subscribes to the perception results, and publishes trajectories
for each vehicle. Perception requires 6909 lines of C++ code, and planning 3800.
Real-World Testbed. Our testbed consists of three 64-beam Ouster [156] LiDARs (two with 90◦ and one with a 45◦ field of view). The testbed also includes an edge computing unit with an Intel i9-10900KF CPU (20-core, 3.7 GHz) and a GeForce RTX 3080 GPU; Net-driving's planning and perception components run on this.
The LiDARs and the edge computing unit connect through Ethernet cables and an Ethernet switch (using
an off-the-shelf Wi-Fi access point). Finally, Net-driving transmits trajectories to an in-vehicle Raspberry Pi
3 over Wi-Fi (as a proxy for 5G).
Simulation. In our testbed, Net-driving-derived trajectories cannot be used to control vehicles, since vehic-
ular control systems are proprietary. So, we complement our evaluations with a simulator which incorporates
vehicular control. The simulator also helps us explore scaling of perception and planning. We use CarLA [87],
an industry-standard photo-realistic simulator for autonomous driving perception and planning. It contains
descriptions of virtual urban and suburban streets, and, using a game engine, can (a) simulate the control of
vehicles in these virtual worlds, and (b) produce LiDAR point clouds of time-varying scenes within these vir-
tual worlds. Unless otherwise noted, our simulation-based evaluations focus on intersections; several challenge
scenarios for autonomous driving focus on intersections [151, 70].
Metrics. We quantify end-to-end performance in terms of the 99th percentile of the latency (p99 latency)
between when a LiDAR generates a frame, and when the vehicle receives the trajectory corresponding to that
frame. To quantify accuracy of individual perception components, we use metrics described in prior work
(defined later). We quantify planning efficacy by the rate of safety violations, or of undesirable outcomes
such as deadlock.
Comparison. Relative to autonomous driving, Net-driving has more comprehensive scene understanding and can jointly plan for multiple vehicles at a time. To quantify these benefits, we compare it against
autonomous driving.
3.4.2 Real-world Experiment
Deployment. We deployed our testbed on a sidewalk along a major street in a large metropolitan area. For
this experiment, we deployed three LiDARs along the side of the road, separated from one another by 20 m.
These LiDARs connected to the edge compute (described above) using Ethernet cables. The edge compute
connected to Raspberry Pis on the vehicles through Wi-Fi (to proxy 5G). We collected data for nearly 23
minutes; we measured and report the end-to-end latency for every frame.
Ideally, we would have liked to deploy the LiDARs at an intersection, but that would have involved
disrupting traffic to establish wired connectivity across the intersection. We argue that our street-side de-
ployment stresses Net-driving at least as much as a deployment at an intersection would have, if not more. In
our deployment, cars travel faster (speed limit: 45 mph); at an intersection, some cars are stationary or slow
moving, so tracking, motion estimation and heading estimation algorithms need not be invoked on them.
Results. Figure 3.9 shows the end-to-end latency for each frame, for over 14000 frames (23 min), broken
down by component. The y-axis is in log scale. The p99 latency is 49 ms; in our experiment, a small
number of frames exceeded 100 ms. Moreover, Net-driving processed LiDAR input at full frame rate. This is
Figure 3.9: End-to-end latency from real-world experiment.
encouraging, and suggests that it is possible to achieve end-to-end performance with Net-driving comparable
to those in today’s autonomous driving.
From this graph, we also observe that: network latency is small, as is planning latency (together they
are under 7-8 ms); however, planning scales with the number of vehicles, so in more realistic deployments, we
expect its share of latency to be higher; and, perception dominates, which motivates the careful algorithmic
and implementation choices in §3.2.
Perception latencies were higher (p99 of 37 ms) than in our CarLA-based evaluations (§3.4.3). In our
deployment, perception latency was affected by object size. Our LiDARs were mounted at lower heights
(2 m) than one might expect in a roadside deployment (to mimic such a deployment, our LiDARs in CarLA were mounted at 11 m). As a result, the point clouds corresponding to individual objects are larger, and several components
that scale with the number of points of an object (like clustering, and heading vector estimation) take more
time in our experiments. In practical deployments, point cloud size may not affect performance; because LiDAR scans are radial, the number of points representing an object drops off quickly with distance from
the sensor.
3.4.3 Latency Breakdown
Setup. To explore the total latency with more vehicles, and to understand the breakdown of latency by
component, we designed several scenarios in CarLA with increasing numbers of vehicles traversing a four-way
intersection. In these scenarios, CarLA (a) generates LiDAR data and (b) controls vehicles based on received
trajectories; the results use the same Net-driving code used in the real-world experiment. To control vehicles,
we set the steer, brake and throttle values of a vehicle, exposed through the CarLA API.
At this intersection, both streets have two lanes in each direction. We varied the number of vehicles from
2 to 14, to understand how Net-driving’s components scale. To justify this range of the number of vehicles,
we use the following data: the average car length is 15 ft [72], and the width of a lane is 12 ft [47]. Allowing
for lane markers, medians, and sidewalks, let us conservatively assume that the intersection is 60 ft across.
Suppose the intersection has traffic lights. Then, if traffic is completely stalled or moving very slowly, at most
4 cars can be inside the intersection per lane, resulting in a maximum of 16 cars (i.e., the maximum capacity
of the intersection is 16). If cars are stalled, Net-driving does not incur much latency since it doesn’t have to
estimate motion, heading, or plan for these, so we limit our simulations to 14 vehicles (we have verified that, above 14 vehicles, end-to-end latencies actually drop). On the other hand,
if traffic is moving at 45 mph (or 66 fps) and cars maintain a 3-second [79] safe following distance, then at
most one car can be within the intersection per lane, for a total of 4 cars.
Breakdown for Perception. Table 3.2 depicts the breakdown of 99-th percentile (p99) latency by compo-
nent for perception, as well as the total p99 latency, as a function of the number of vehicles in the scene. In
all our experiments, Net-driving processed frames at the full frame rate (10 fps).
The total p99 latency for perception increases steadily from 19 ms for 2 vehicles to 82 ms for 14 vehicles. (For many of our experiments, including this one, we have generated videos to complement our textual descriptions; these are available at an anonymous YouTube channel: https://www.youtube.com/channel/UCpb947_nEBAv_oikE7KsC-A.) This highlights the dependency of perception on scene complexity (§3.2); many components scale with
the number of participants. At 14 vehicles, Net-driving exceeds the latency budget, since planning requires
Component                  2       4       7       10      12      14      (number of vehicles)
Background Subtraction     7.5     8.9     9.9     9.9     11.0    18.7
Stitching                  0.08    0.11    0.13    0.13    0.17    0.19
Clustering                 4.3     5.8     11.2    13.0    22.2    20.5
Bounding Box               0.06    0.07    0.09    0.15    0.17    0.16
Tracking                   0.1     0.15    0.21    0.28    0.37    0.38
Heading Vector             8.8     12.9    24.0    28.8    41.9    52.6
Total                      18.5    26.4    43.7    47.1    69.9    81.5
Table 3.2: p99 per-frame latency (in milliseconds) for perception, as a function of the number of vehicles. We exclude latency numbers for motion vector estimation, which are on the order of a few microseconds.
Optimization                                                    Before    After    Ratio
Interchanging stitching and background subtraction              67.7      10.0     6.7
Exploiting LiDAR characteristics in background subtraction      9.9       1.5      6.6
Heading vector GPU acceleration                                 1057.7    28.8     36.6
Table 3.3: Impact of optimizations on p99 latency (in milliseconds).
over 20 ms for 14 vehicles (see below). These numbers suggest that modest off-the-shelf compute hardware
that we have used in our experiments might be sufficient, at least for traffic management at moderately busy
intersections. This data dependency also suggests that deployments of Net-driving will need to carefully
provision their infrastructures based on historical traffic (similar to network planning and provisioning).
The three most expensive components are background subtraction, clustering, and heading vector esti-
mation. Background subtraction accounts for about 10 ms, but depends slightly on the number of vehicles;
to be robust, it uses a filter (details omitted in §3.2.2) that is sensitive to the number of points (or vehicles).
Clustering accounts for about 20 ms with 14 vehicles and is strongly dependent on the number of vehicles,
since each vehicle corresponds to a cluster.
Heading vector estimation accounts for nearly 65% of perception latency, even after GPU acceleration
(§3.2.4). These results reflect the fact that heading vector estimation depends not only on the number of vehicles, but on their dynamics as well. When we ran perception on 16 vehicles, p99 latency actually dropped;
in this setting, 16 vehicles congested the intersection, so each vehicle moved very slowly. Heading vector
estimation uses ICP between successive object point clouds; if a vehicle hasn’t moved much, ICP converges
faster, accounting for the drop.
No. of Vehicles     2      4      7      10     12     14
p99 Latency (ms)    2.4    5.1    11.9   17.1   28.7   24.2
Table 3.4: p99 per-frame planning latency in milliseconds.
Other components are negligible. Stitching is fast because of the optimization described in §3.2.2. Bound-
ing box estimation is inherently fast. Track association is cheap because it tracks a single point per vehicle
(the centroid of the bounding box). Motion estimation is very fast (on the order of microseconds) and relies
on positions computed during stitching. Thus, careful design choices that leverage abstractions and quantities
computed earlier in the pipeline can be crucial for meeting latency targets (§3.2.4).
Benefits of optimizations. Table 3.3 quantifies the benefits of our optimizations. Stitching before background
subtraction requires nearly 70 ms in total; reversing the order reduces this time by 6.7× . By exploiting
LiDAR characteristics (§3.2.2), Net-driving can perform background subtraction in 1.5 ms per frame (Table 3.2 does not include this optimization, since it can only be applied to some LiDARs). A CPU-based heading vector estimation requires nearly 1 s, which would have rendered Net-driving infeasible; GPU acceleration (§3.2.4) reduces latency by 35×.
Calibration Steps. Finally, alignment (§3.2.2) of 4 LiDARs takes about 4 minutes. This includes not just
the time to guess initial positions, but to run the ICP (on a CPU). Because it is invoked infrequently, we
have not optimized it.
Planning. Table 3.4 depicts the p99 planning latency. As expected, there is a dependency on the number
of vehicles, since Net-driving individually plans for each vehicle. Planning latencies can be slightly non-
monotonic (the planning cost for 12 vehicles is more than that for 14) because the planner’s graph search can
depend upon the actual trajectories of the vehicles, not just their numbers.
3.4.4 Accuracy
Perception Component Accuracy. Using real-world data and simulation traces described above, we show
that Net-driving’s positioning, heading, velocity, and tracking accuracies are comparable to that reported in
other work. (For our real-world traces, we manually labeled ground-truth positions).
Metrics. We report positioning error (in m), which chiefly depends upon the accuracy of alignment. Heading
accuracy is measured using the average deviation of the heading vector, in degrees, from ground truth. For
estimating accuracy of velocity estimation, we report both absolute and relative errors. Finally, we use
two measures to capture tracking performance [63]: multi-object tracking accuracy (MOTA) and precision
(MOTP). The former measures false positives and negatives as well as ID switches (§3.2); the latter measures
average distance error from the ground-truth track.
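For reference, under the standard definitions of these metrics [63], MOTA = 1 − (FP + FN + IDSW) / GT, where FP, FN, IDSW, and GT are the false positives, false negatives, identity switches, and ground-truth objects summed over all frames, and MOTP is the mean distance between each matched object and its ground-truth position.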
Results. Table 3.5 summarizes our findings. Positioning error is about 8-10 cm in Net-driving, both in
simulation and in the real world; the state of the art LiDAR SLAM [209] reports about 15 cm error. Our
heading estimates are comparable to prior work that uses a neural network to estimate heading. Speed
estimates are highly accurate, both in an absolute sense (error of a few cm/s) and in a relative sense (over
97%).
Finally, tracking is also highly accurate. In the real-world experiment, tracking was perfect. In simulation,
with 10 vehicles concurrently visible, MOTA is still very high (over 99%), and significantly outperforms the
state-of-the-art neural network in 3D tracking [114], for two reasons. The neural network solves a harder
problem, tracking from a moving LiDAR. Our fused LiDAR views increase tracking accuracy; when using a
single LiDAR to track, MOTA falls to 93%. Finally, MOTP is largely a function of positioning error, so it is
comparable to that value.
Alignment Accuracy. Although alignment is performed only once, its accuracy is crucial for Net-driving;
without accurate alignment, Net-driving’s perception components could not have matched the state of the
art (Table 3.5).
Comparison alternatives. To show the efficacy of Net-driving’s novel alignment algorithm, we compared it
against two alternative ways of obtaining an initial guess for ICP: a standard feature-based approach for ICP
initial guesses, SAC-IA [173]; and using GPS for alignment. In this experiment, we use point clouds from our
simulations and real-world experiments. The three approaches estimate initial guesses for the pose of three
LiDARs, and feed those poses to ICP for building a stitched point cloud. In our evaluations, we report, for each approach, the average root mean square error (RMSE) between the stitched point cloud (after running ICP on the initial guesses) and a ground truth.
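One plausible way to compute such an RMSE (the exact correspondence used is not spelled out here) is a nearest-neighbor comparison against the ground-truth cloud:

    import numpy as np
    from scipy.spatial import cKDTree

    def stitching_rmse(stitched_points, ground_truth_points):
        # For each stitched point, the distance to its nearest neighbor in the ground-truth cloud.
        distances, _ = cKDTree(ground_truth_points).query(stitched_points)
        return float(np.sqrt(np.mean(distances ** 2)))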
Results. Table 3.6 shows that Net-driving’s alignment results in errors of a few cm, almost 2-3 orders of
magnitude lower error than the competing approaches, which explains why we chose this approach. SAC-
IA [173] does not take any inputs other than the point clouds, and estimates the transformation between
two point clouds using 3D feature matching. However, this works well only when point clouds have a large
amount of overlap. In our setting, LiDARs are deployed relatively far from each other, resulting in less
overlap, SAC-IA is unable to extract matching features from multiple LiDAR point clouds. Using GPS for
alignment provides a good initial guess for the relative translation between the LiDARs. However, GPS
cannot estimate the relative rotation between LiDARs, so it results in poor accuracy.
3.4.5 Safety
Net-driving’s perception has a comprehensive view of an intersection, so it can lead to increased safety. To
demonstrate this, we implemented two scenarios in CarLA from the US National Highway Transportation
                         Real      Sim       Prior
Positioning Error (m)    0.10      0.08      0.15 [209]
Heading Error (◦)        8.17      6.45      5.10 [71]
Speed Error (m/s)        0.04      0.06      —
Speed Accuracy (%)       98.80     97.49     —
MOTA (%)                 100       99.54     84.52 [114]
MOTP (m)                 0.12      0.08      —
Table 3.5: Perception accuracies from real-world data and simulation. The last column provides accuracies reported for these by prior work, for calibration, where available.
                             RMSE (m)
                             Average    Std Dev
Simulation    Net-driving    0.03       0.02
              SAC-IA         39.2       13.4
              GPS            23.8       11.1
Real world    Net-driving    0.09       0.04
              SAC-IA         11.8       13.9
              GPS            13.0       1.2
Table 3.6: Net-driving's novel alignment algorithm outperforms existing state-of-the-art initial alignment algorithms.
Safety Administration (NHTSA) precrash typology [151]; these are challenging scenarios for autonomous
driving [70].
Red-light violation. A red truck and the ego-vehicle (yellow box) approach an intersection (Figure 3.10). An
oncoming vehicle (red box) on the other road violates the red traffic light. The red truck can see the violator
and hence avoid collision, but the ego-vehicle cannot.
Unprotected left-turn. The ego-vehicle (yellow box) heads towards the intersection (Figure 3.11). A vehicle
(red bounding box) on the opposite side of the intersection wants to make an unprotected left-turn. The
ego-vehicle’s view is blocked by the trucks on the left lane.
Methodology and metrics. In each scenario, the ego-vehicle is Net-driving-capable. When comparing against
autonomous driving, to ensure a more-than-fair comparison, we: (a) equip autonomous driving with ground-
truth, so its perception is perfect; and (b) use Net-driving’s planner and controller for autonomous driving in
single-vehicle mode (it plans only for the ego-vehicle). The alternative would have been to use an open-source
autonomous driving stack like Autoware [142] which has its own perception and planning modules. However,
NHTSA Scenario             Safe Passage (%)
                           Net-driving    Autonomous Driving
Red-light Violation        100            37
Unprotected Left-turn      100            18
Table 3.7: With more comprehensive perception, Net-driving can provide safe passage to vehicles in both scenarios.
because in these experiments we are trying to understand the impact of architectural differences (autonomous driving
vs. Net-driving), we chose a simpler approach that equalizes implementations.
For both scenarios, we vary speeds and positions of the ego-vehicle and oncoming vehicle to generate 16
different experiments. We then compare for what fraction of experiments each approach can guarantee safe
passage.
Results. In both scenarios (Table 3.7), autonomous driving ensures safe passage in fewer than 40% of the
cases. Net-driving achieves safe passage in all cases because it senses the oncoming traffic even though it
is occluded from the vehicle's on-board sensors (please see the YouTube channel for videos). This gives the planner enough time to react and plan a
collision avoidance maneuver (in this case, stop the vehicle). Of the two cases, the unprotected left-turn was
the more difficult one for Net-driving (as it is for autonomous driving, which fails more often in this case),
yet it is still able to guarantee safe passage in all 16 cases. In the red-light violation scenario, Net-driving
senses the oncoming traffic early on and has enough time to react. However, in the unprotected left-turn, the
ego-vehicle is traveling relatively fast and the oncoming traffic takes the left-turn at the last moment. Though
Net-driving has less time to react, its motion-adaptive bounding box and stopping distance estimation
ensure that the vehicle stops on time.
Robustness to Packet Loss. Net-driving's planner transmits trajectories wirelessly to vehicles. It generates
trajectories over a longer time horizon (§3.3) to be robust to packet loss. To quantify its robustness, we
simulated packet losses ranging from 0 to 100% in the red-light violation scenario (Figure 3.10). For each
loss rate, we measured the stopping distance between the ego-vehicle and the oncoming traffic which violates
the red-light. Higher stopping distances are good (Figure 3.13). We repeated the experiment multiple times
Figure 3.10: The orange truck obstructs the ego-vehicle's (yellow box) view of the red-light violator.
Figure 3.11: The orange trucks obstruct the ego-vehicle's (yellow box) view of the left-turning car.
Figure 3.12: Net-driving enables traffic-light free intersections and can reduce wait times by 5× (average wait time in seconds versus the number of vehicles, for static lights, intelligent lights, and Net-driving).
Figure 3.13: Net-driving can ensure collision-free trajectories with up to 70% packet-loss rates (stopping distance in meters versus packet loss rate).
for each loss rate. As we increased the packet-loss rate, the stopping distance decreased because the ego-vehicle was operating on increasingly stale information. Even so, Net-driving ensures collision-free passage for the ego-vehicle through the intersection up to 70% loss, with minimal degradation in stopping distance until about 40% loss.
3.4.6 High-Throughput Traffic Management
Intersections contribute significantly to traffic congestion [120]; traffic-light free intersections can reduce
congestion and wait times. Net-driving, because it centrally plans trajectories for all Net-driving-capable vehicles, can plan collision-free trajectories at an intersection without traffic lights. We have verified this in simulations as well (please see the YouTube channel for videos).
Wait-time Comparison. A traffic-light free intersection can significantly reduce wait times at intersections,
thereby enabling higher throughput. To demonstrate this, we compared Net-driving’s average wait times at
an intersection against three other approaches:
• An intersection with static traffic lights.
• Intelligent traffic light control which prioritizes longer queues.
• A traffic-light free intersection where autonomous vehicles use centralized perception but on-board
decentralized planning. This approximates cooperative perception based approaches like AVR [163] and
EMP [210].
For the first two approaches, we obtained policies from published best practices [45, 51, 49]. In all experiments,
we placed four LiDARs at an intersection in CarLA.
Results. Compared to static and intelligent traffic lights, Net-driving reduces average wait times for vehicles
by up to 5× (Figure 3.12). Net-driving performs better than even intelligent traffic lights because it can
minimize the stop and start maneuvers at the intersection (by preemptively slowing down some vehicles),
and hence can increase overall throughput, leading to lower wait times (please see the YouTube channel for videos).
Interestingly, beyond about 6 vehicles, wait times for strategies that use traffic lights drop. Recall that
our intersection has two lanes in each direction. As the number of vehicles increases, the probability that
a vehicle can pass the intersection without waiting increases. For example, with intelligent traffic lights, a
vehicle waiting at an intersection can trigger a green light, so a vehicle that arrives in the adjoining lane can
pass without waiting. Even so, Net-driving has lower wait times than other alternatives up to 10 vehicles
(wait time comparisons are similar beyond this number; we omit results for brevity).
Figure 3.12 does not include wait times for the decentralized planner. That is because, although one
might expect that a decentralized planner might have comparable throughput to Net-driving, we found that
it led to a deadlock (§3.3) with as few as two vehicles: each vehicle waited indefinitely for the other to make progress, resulting in zero throughput (please see the YouTube channel for videos). Fundamentally, this occurs because vehicles lack global knowledge of planning decisions.
3.5 Conclusions and Future Work
Net-driving decouples autonomous vehicle perception and planning from low-level control. By doing this, it
can perceive more than a single vehicle can, and jointly plan for multiple vehicles. To demonstrate that it can
do this, while meeting the performance requirements of existing autonomous vehicles, and without sacrificing
accuracy, we develop a novel stack containing new perception algorithms and a fast planner. Jointly, these
achieve the desired performance targets on commodity hardware, and have accuracy comparable to state-of-
the-art algorithms. At intersections, Net-driving enables higher safety than autonomous driving, and higher-throughput traffic management compared to using traffic lights.
Beyond intersections. Net-driving can, in theory, enable a future world without autonomous driving; cars
would not have on-board perception and planning, but would simply follow Net-driving-generated trajec-
tories. This can eliminate the need for expensive sensors [21, 168], power-hungry on-board compute [130],
and expensive high-definition maps [187]. To achieve this vision, infrastructure needs to be installed every-
where, which will require much work: cost/benefit economic analyses, public-private partnerships to install
infrastructure, regulatory approvals, buy-in from vehicle owners, etc.
Chapter 4
Fresco: Fast, and Accurate 3D Reconstruction
4.1 Introduction
Drone-based 3D reconstruction. The last few years have seen impressive advances in the commoditization of drones. Today, drones can be equipped with on-board compute, a cellular (LTE) radio, and sophisticated
sensors (e.g., cameras, stereo cameras and LiDAR). Given this, in the coming years, drones will likely revo-
lutionize aerial photography, mapping and three-dimensional (3D) reconstruction. Recent estimates put the
total market for these drone-based services at 63 billion US dollars by 2025 [16], with 3D reconstruction, the
task of generating a digital 3D model of the environment, accounting for nearly a third of the drone services
market [167].
What is a 3D model? The term 3D model covers a range of geometric representations of the surfaces of objects, from coarse-grained approximations (cylinders, cubes, intersections of planes) to finer-grained representations such as meshes (small-scale surface tessellations that capture structural variations). In this chapter, we seek to extract a fine-grained point cloud of a large structure (e.g., a building), which consists of dense points on the surface of the structure. Each point has an associated 3D position, along with other attributes (depending on the sensor used to generate the point cloud). A point-cloud based 3D model can generate all other representations.
Applications. 3D models are used in animations in films and video games, for preserving historical and
tourist sites, for archaeology, city planning, and to capture buildings or rooftops for repair and solar instal-
lation, and as inputs to immersive virtual reality (VR) applications. Recently, startups are promising to
deliver models in hours or within a day [3] after drone data collection. This trend has inspired newer, more
time-sensitive applications of 3D modeling like (a) post-disaster reconstruction for search and rescue mis-
sions [1, 2, 7, 12, 17], (b) construction site monitoring [3], (c) mapping mines, cell-towers and tunnels [11, 66],
(d) documenting interior office spaces [15], (e) inspecting oil and gas installations [6], and (f) between-flight
modeling of aircraft fuselages or blimps to determine structural integrity [145].
State-of-the-art: Photogrammetry. Today, to build a 3D model, a drone operator flies a drone around a physical structure in a given trajectory to collect 2D images. After the drone lands, the operator uploads these images to a cloud service. The cloud service uses photogrammetry to build a 3D model. Photogrammetry (chapter 5) infers a 3D model from a sequence of 2D images captured using drone-mounted cameras [82, 190, 123, 127, 10], and is computationally intensive. The quality of the inferred models depends on the quality of the 2D images, which in turn depends on the drone trajectory. To obtain accurate reconstructions, photogrammetry can require multiple iterations with human intervention for trajectory design. We quantify these shortcomings of photogrammetry in §4.2.
Drone-mounted LiDAR-based reconstruction. Unlike photogrammetry, LiDAR-based reconstruction can more directly infer 3D models, because LiDARs directly provide depth information (unlike cameras). A LiDAR sensor measures distances to surfaces using lasers mounted on a mechanical rotating platform. With each revolution of the lasers, the sensor returns a point cloud, or a LiDAR frame. The point cloud is a set of 3D data points, each corresponding to a distance measurement of a particular position of the surrounding environment from the LiDAR. For instance, an Ouster OS1-64 LiDAR [155] has 64 lasers that scan at 10 to 20 Hz, with a horizontal and vertical field of view of 360° and 45°, respectively.
To reconstruct a structure, e.g., a building, one can mount a LiDAR on a drone, fly the drone around the building, and merge point clouds captured from different locations around the building. Consider point clouds p and p′. A point x on the surface of the building may appear both in p and p′. However, since the drone has moved, this point x appears at different positions (relative to the LiDAR) in the two point clouds. If we precisely transform both point clouds to the same coordinate frame of reference, then the union of points in p and p′ constitutes part of the 3D model of the building. This operation is fast, but depends critically on knowing the precise position of the drone when it captures point clouds.
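To make this concrete, the following is a minimal sketch (not Fresco's implementation) of fusing two frames once per-frame poses are known; the 4x4 transforms T_p and T_p_prime below are hypothetical rigid transforms from each frame's LiDAR coordinates to a common world frame, and the point clouds are random stand-ins.

import numpy as np

def to_world(points, T):
    # Apply a 4 x 4 rigid transform to an N x 3 point cloud.
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (T @ homogeneous.T).T[:, :3]

# p and p_prime: two LiDAR frames (N x 3), captured at two drone poses.
p = np.random.rand(1000, 3)
p_prime = np.random.rand(1000, 3)

T_p = np.eye(4)                      # pose when p was captured (identity, for illustration)
T_p_prime = np.eye(4)                # pose when p_prime was captured
T_p_prime[:3, 3] = [5.0, 0.0, 0.0]   # e.g., the drone moved 5 m along x

# The union of the transformed frames is a piece of the building's 3D model.
model = np.vstack([to_world(p, T_p), to_world(p_prime, T_p_prime)])

The hard part, of course, is obtaining T_p and T_p_prime accurately; that is what the rest of this chapter addresses.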
Using SLAM for positioning. LiDAR SLAM (Simultaneous Localization And Mapping [88, 58]) algorithms can provide accurate pose (position and orientation) estimates, which can be used to position point clouds (in §4.2, we explain why other approaches, e.g., GPS or GNSS/RTK, can be inaccurate). These algorithms use scan [111] or feature matching [208] techniques to align consecutive LiDAR frames to determine the pose of the drone throughout its flight.
Challenges. ALiDARSLAM-based3Dreconstructionapproachpresentsfourdistinctchallenges(quantified
in §4.2):
• The drone’s flight trajectory and its flight parameters critically impact SLAM accuracy. SLAM esti-
mates positions by aligning successive point clouds, and alignment relies on the overlap between point
clouds. If the drone flies too fast, for example, successive point clouds may not overlap enough to effect
good alignment.
• Even with a well-designed trajectory, SLAM algorithms accumulate drift [88, 58], so position estimates
can degrade on longer flights.
• Drones have limited flight endurance. When fully loaded and starting from a full charge, the M600Pro
can fly for approximately 25 minutes. This necessitates careful trajectory planning to minimize flight
duration.
• Detecting and correcting for this drift mid-flight can avoid re-doing the entire flight (as may be required
for photogrammetry). However, existing SLAM algorithms cannot run at full frame rate on mobile
computing platforms mounted on commodity drones.
Our contributions. This chapter presents the design and implementation of Fresco, which quickly builds high-quality 3D models of large physical structures and which, unlike photogrammetry, is computationally efficient and does not require iterative capture or human intervention. To address the challenges described above, Fresco makes four contributions (Table 4.1):
• It uses a novel optimization formulation to obtain minimal duration 3D flight trajectories for model
collection of a large class of buildings. It constrains the trajectory search using flight parameters
selected to minimize SLAM error, and plans the flight on a discretized graph embedded on an unfolded
2D representation of the building (§4.3.1).
• Fresco continuously detects excessive drift: for this, it uses a novel algorithm that tracks consistency between the shape of GPS traces and SLAM positions. This algorithm is robust to GPS error. When the shapes diverge, Fresco re-directs the drone to return to a designated origin; this effects a loop closure in SLAM and starts a new session, both of which minimize error [88, 58].
• Fresco ’s careful model collection can exhaust a drone’s battery. If a building’s location and shape are
not known a priori, it is more efficient to expend some battery resources to find the building in an area
of interest, so model collection flights can focus on the area where the building is located. To this end,
Fresco introduces a robust and efficient geometry extraction algorithm using a cheap reconnaissance
flight before model collection.
• Fresco offloads computation to a cloud server by compressing point clouds on the drone to the point
where they can be transmitted over LTE (§4.3.4). It processes these point clouds at LiDAR frame rate
and with minimal end-to-end processing delay.
Challenge Mechanism
Model accuracy Model collection trajectory planning
Drift accumulation Flight re-calibration
Limited battery Cheap reconnaissance flight
Limited compute Offload compressed point clouds
Table 4.1: Challenges and contributions
Experiments (§4.4) on real-world drone flights and a photo-realistic drone simulator (AirSim [180]) demonstrate that Fresco can achieve under 10 cm reconstruction accuracy for a wide range of building shapes and sizes, even after compressing the raw sensor data by almost two orders of magnitude. Not only is Fresco faster than a state-of-the-art photogrammetry approach [178], it is also more accurate, since the latter cannot re-calibrate mid-flight. An experiment using the complete Fresco implementation on a drone is able to stream compressed point clouds over LTE, reconstruct the 3D model on-the-fly, and deliver an accurate 3D model under a second after flight completion. To our knowledge, Fresco is the first to demonstrate on-line, cloud-based, autonomous, sub-10 cm 3D model reconstruction. (The focus of this chapter is on outdoor 3D reconstruction. Indoor reconstruction with smartphones or mobile devices is an orthogonal problem because their sensors have limited range and resolution; we leave an examination of this to future work.)
4.2 Motivation
In this section, we provide results from experiments that motivate Fresco’s approach. Before we do this, we
describe metrics that capture the quality of 3D reconstruction.
Quantifying 3D model quality. Prior work on 3D reconstruction [179] has proposed two metrics, accuracy and completeness. Consider a 3D model M and a corresponding ground truth M_g. Accuracy describes how closely the 3D model M resembles the ground truth M_g; it is the root mean-square error (RMSE) of the distance from each point in M to the nearest point in M_g. Completeness describes what extent of the ground truth M_g the 3D model M captures; it is the RMSE of the distance from each point in M_g to the nearest point in M.
To put these numbers in context, an accuracy of 10 cm means that every point on the 3D model M is displaced by 10 cm, on average, from its ground truth position in M_g. A completeness of 50 cm means that, on average, for every point on the ground truth M_g, the nearest point on the 3D model M is 50 cm away. If both accuracy and completeness are zero, M perfectly matches M_g. In other words, for both metrics, lower is better. If M captures all points in M_g, but the positions of the points are erroneous, then both accuracy and completeness will be non-zero. If M captures only one point in M_g, but positions it correctly, its accuracy is perfect, but its completeness is poor. These two metrics are roughly analogous to precision (accuracy) and recall (completeness).
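As a concrete illustration, the sketch below computes both metrics for point clouds given as numpy arrays, using a k-d tree for nearest-neighbor search; it assumes M and M_g are already expressed in the same coordinate frame, and it is not the evaluation code used in this chapter.

import numpy as np
from scipy.spatial import cKDTree

def rmse_to_nearest(src, dst):
    # RMSE of the distance from each point in src to its nearest neighbor in dst.
    dists, _ = cKDTree(dst).query(src)
    return float(np.sqrt(np.mean(dists ** 2)))

def accuracy(M, M_g):
    # Accuracy: how far the model's points are from the ground truth.
    return rmse_to_nearest(M, M_g)

def completeness(M, M_g):
    # Completeness: how far the ground truth's points are from the model.
    return rmse_to_nearest(M_g, M)

# M and M_g are N x 3 and K x 3 arrays of points in the same frame.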
Drone-based photogrammetry can be slow and inaccurate. Photogrammetry infers 3D models using
multi-view stereo [96] (MVS). MVS algorithms infer the structure (position, and orientation) of the scene
from the set of images. Using this structure and stereo correspondence from multiple images, these algo-
rithms estimate the depth of every pixel in each image. A fusion of these 3D points generates a dense 3D
reconstruction of the entire scene/object. Though MVS 3D reconstructions are generally accurate, they may
contain missing regions or holes [96] which arise if MVS cannot confidently estimate their depths from the
set of images.
Figure 4.1: MVS reconstruction (right image) of a large building (left image shows ground truth 3D model).
Today, to build a 3D model, a drone operator flies a drone in a given trajectory to collect 2D images.
Then, the operator uploads these images to a cloud service after the drone lands. The cloud service runs an
MVS algorithm to build a 3D model. If the model contains missing regions (detected by visual inspection),
the operator adjusts the trajectory and flies the drone again [9, 18]. It can take multiple iterations to get the
desired 3D model. Moreover, MVS algorithms are compute-intensive.
To quantify these shortcomings, we conducted an experiment (methodology described in §4.4) in which
we used an open source implementation of MVS (ColMap [178]) to reconstruct a 3D model of two buildings.
For each building, we tried several trajectories, and selected the one that provided the highest quality model.
Table 4.2 shows these values, and Figure 4.1 is a visual illustration for that highest quality flight. The
left image shows the ground truth model whereas the right image shows the 3D model from the MVS
reconstruction. Accuracy is slightly higher than 10 cm; this is desirable. Completeness, however, even for the
best flight is 75 cm or more (right image in Figure 4.1 illustrates many regions in the MVS reconstruction
are incomplete); Fresco achieves 5 to 9 cm completeness (§4.4, almost an order of magnitude better). Finally,
MVS requires nearly 7-9 hours to reconstruct the model; Fresco assembles the model in-flight, so the model
is ready immediately after the drone lands.
Building Accuracy (m) Completeness (m) Time (s)
Small 0.11 0.80 23084
Large 0.16 0.75 31600
Table 4.2: Photogrammetry can be slow and inaccurate. The small building is 50 m x 50 m x 20 m, the large
is 100 m x 50 m x 20 m.
Figure 4.2: GPS (left), SLAM (middle), and Fresco (right) derived models of a large complex on our campus.
The models are color-coded with the same height ramp function.
In theory, photogrammetry could detect, in-flight, when the model might be incomplete and adapt the
trajectory accordingly. This has proven to be hard to do without running the entire MVS pipeline [146, 152]
on the set of images. To minimize iterations, practitioners have developed rules of thumb for designing
trajectories (such as taking highly-overlapping uniformly-illuminated images of texture-rich surfaces) [178, 5],
but realize there is no way to ensure complete and high-quality 3D reconstruction without iteration [146]. So,
companies hire drone pilots/photogrammetrists [9, 18] who have developed trajectory design intuitions over
time.
Other alternatives for positioning are unsuitable. Because drones can carry GPS receivers, GPS can
be used to position point clouds. Unfortunately, GPS errors can be several meters in obstructed settings,
resulting in poor accuracy and completeness of the 3D model. The left image of Figure 4.2 shows a 3D model
of a building assembled using a drone flight over a commercial building that uses GPS for positioning; the
building’s outline is fuzzy, as are the contours of the trees surrounding the building. Table 4.3 shows that
the accuracy and completeness for the GPS reconstructed model are 1.6 m and 0.53 m respectively, both of
which are undesirable. The right image shows a 3D reconstruction using techniques proposed in this chapter,
which does not use GPS. (All reconstructions in Figure 4.2 use real data captured from a drone-mounted
LiDAR §4.4).
High-precision GNSS/RTK receivers can provide more accurate positioning but require additional infras-
tructure, are costly, and can perform poorly (tens of meters error [116]) in urban environments due to non
line-of-sight signals that degrade accuracy. Prior work [131, 201] has used expensive GPS receivers in the
Approach Accuracy (m) Completeness (m)
GPS 1.60 0.53
SLAM 2.30 1.30
Fresco 0.13 0.09
Table 4.3: Relative to Fresco, alternative reconstruction approaches for drone-based LiDAR reconstruction
give poor quality 3D models.
Figure 4.3: Rectilinear trajectory for SLAM-based reconstruction.
context of remote sensing, using specialized unmanned aerial vehicles (UAVs) with long-range LiDAR sen-
sors (but for offline model reconstruction). In contrast, in this chapter we consider solutions that employ
off-the-shelf technologies: drones, and commodity GPS and LiDAR.
SLAM-based positioning can be inaccurate. Even with SLAM, reconstruction quality is critically dependent on the design of the drone trajectory. Figure 4.2 (middle) shows reconstruction using SLAM from a rectilinear flight (Figure 4.3) that ensures building coverage. For this flight, the accuracy and completeness are 2.3 m and 1.3 m respectively (Table 4.3), more than an order of magnitude worse than Fresco-based reconstruction. This reconstruction is visually worse than Fresco-based reconstruction (Figure 4.2 (right)), because the underlying SLAM algorithm is unable to track LiDAR frames (i.e., it is unable to match points in successive point clouds). Fresco, which layers trajectory generation and drift re-calibration on top of SLAM, is significantly more accurate and complete. For SLAM-based reconstructions, we used the Google Cartographer [111] LiDAR SLAM implementation.
Limited compute on drones. For Fresco to detect SLAM drift (§4.1) in-flight, it must run SLAM as the drone flies. However, existing SLAM algorithms cannot run at full frame rate on mobile computing platforms mounted on drones. A DJI M600Pro hexacopter (which we use in this chapter) has a maximum payload weight of 5 kg. It carries an A3Pro flight controller that contains three IMUs and three GNSS receivers. We have also mounted an LTE radio and an Ouster OS1-64 LiDAR, as well as a Jetson TX2 board. This compute capability is far from sufficient to run LiDAR SLAM (with a 64-beam LiDAR, SLAM must process up to 480 Mbps of 3D data).
On the TX2, we ran two popular, representative LiDAR-SLAM algorithms, Cartographer [111] and
LOAM [208]. In this experiment, both SLAM implementations use Fresco’s optimized 3D flight trajectories.
In Table 4.4 we show accuracy, completeness and performance results for the two LiDAR SLAM algorithms.
Cartographer, a scan matching algorithm, can only process 2 frames per second (LiDARs generate 10-20
frames per second) and LOAM, a feature-matching algorithm, can process 5 frames per second. Though
faster, LOAM's reconstructions are worse than Cartographer's.
Metric Cartographer [111] LOAM [208]
Frame Rate (fps) 2 5
Accuracy (m) 0.21 5.75
Completeness (m) 0.09 2.74
Table 4.4: LiDAR SLAM on Jetson TX2. For both accuracy and completeness, lower is better.
We make two observations from Table 4.4. First, Cartographer is a more accurate SLAM algorithm
than LOAM for drone-mounted LiDAR. For this reason, Fresco uses Cartographer. Second, neither SLAM
implementation can run at full frame-rate on the drone. In particular, Cartographer runs only at 2 fps, but
we need to process SLAM at full-frame rate to be able to detect drift in-flight, so Fresco offloads SLAM to
the cloud.
In the future, mobile compute capabilities will improve. However, at the same time, LiDAR technology
continues to evolve: 128-beam LiDARs are already available, and these will require significantly more
compute. So it is unclear if, or when, accurate LiDAR SLAM can run entirely on the drone. Even if that
becomes feasible, Fresco’s trajectory optimization and re-calibration algorithms will be relevant for accurate
3D modeling.
These experimental results motivate the design of Fresco, which we now describe.
Figure 4.4: Fresco architecture. On the drone: reading the LiDAR stream and point cloud compression. In the cloud: trajectory planning, SLAM, and drift estimation and re-calibration.
4.3 Fresco Design
Because drone-based 3D reconstruction is a complex multi-dimensional problem (Table 4.1), we have focused
on a geometrically regular, but important, subset of structures for reconstruction: buildings.
Specifically, we focus on buildings with (a) vertical sides and (b) (convex or non-convex) polygonal flat roofs
or gabled roofs (Figure 4.8). This covers most commercial buildings as well as many residential structures.
We leave extensions to other shapes to future work.
Overview. To use Fresco to reconstruct a building, a user specifies the rooftop geometry: the approximate GPS locations of the corners of the roof. For example, for a building with a rectangular roof, the user specifies the GPS locations of the four corners of the rooftop. It is possible to obtain this information automatically using a reconnaissance flight that we describe in §4.3.3.
The user also specifies a minimum target point density. Point density, the number of points per unit area on the surface of a point cloud, is a measure of the level of detail captured in a model. A minimum point density ensures that Fresco captures all parts of the building surface at or above the target level of detail. It is a knob for trading off model fidelity for lower cost (when an application can tolerate a less detailed model), since more detailed captures require longer flights.
Figure 4.5: Pentagon. Figure 4.6: Plus. Figure 4.7: Gabled. Figure 4.8: Fresco is designed for buildings with vertical sides and either gabled or polygonal roofs.
Given these inputs, Fresco automatically guides a drone to capture LiDAR point clouds of buildings, while minimizing flight duration at the given minimum point density. To a first approximation, drone battery usage increases with flight duration; we have left it to future work to incorporate drone battery models.
To do this, Fresco splits its functionality across two components (Figure 4.4): (a) a lightweight subsystem that runs on the drone, and (b) a cloud-based component that discovers buildings, generates drone trajectories, and reconstructs the 3D models on-the-fly.
Fresco’s cloud component (Figure 4.4) prepares a careful model collection trajectory (§4.3.1) that designs
a minimal duration flight to ensure high 3D model accuracy. As the drone flies this trajectory, Fresco streams
compressed point clouds to a cloud service over a cellular (LTE) connection (§4.3.4). The cloud service
continuously runs SLAM and estimates whether SLAM drift is sufficient to warrant recalibration (§4.3.2),
which results in a return-to-origin maneuver to ensure loop closure.
This design addresses (Table 4.1) the issues discussed in §4.2: cloud offload circumvents the limited compute on-board; careful trajectory design ensures that Fresco completely covers the building in a manner designed to ensure SLAM's ability to continuously track positions; and re-calibration enables SLAM to recover from excessive drift when it occurs. Table 4.5 summarizes Fresco's design in comparison with photogrammetry and LiDAR SLAM approaches.
Property                        Photogrammetry   SLAM      Fresco
Complexity                      High             Low       Low
Reconstruction                  Offline          Offline   Online
Automated Trajectory Planning   ×                ×         ✓
Quality Feedback                ×                ×         ✓
Density Control                 ×                ×         ✓
Table 4.5: Fresco comparison with Photogrammetry and SLAM.
4.3.1 Model Collection
Fresco designs a model collection trajectory to capture a high quality model. Fresco must simultaneously (a)
ensure good accuracy and completeness, (b) satisfy the target point density, and (c) use short flights.
To achieve these, Fresco uses a collection of inter-related mechanisms. At the core of Fresco is an opti-
mization formulation that generates a minimum length trajectory that covers the building sides and roof.
This trajectory must satisfy two kinds of constraints: those imposed by SLAM, and those imposed by the
target point density. Even when a trajectory respects these constraints, SLAM may incur drift over long
flights, so Fresco includes a drift estimation and re-calibration technique. We describe these below.
4.3.1.1 SLAM-imposed constraints
The trajectory of the drone flight impacts 3D model completeness and accuracy because a poorly designed
trajectory can increase SLAM error. For example, if a drone flies too fast, two successive point clouds may
have little overlap ( Figure 4.10), increasing SLAM error. Besides speed, other parameters that affect SLAM
error include distance from the building surface and the orientation of the LiDAR with respect to the ground.
Figure 4.9: Parallel and perpendicular LiDAR orientation. Figure 4.10: Overlap at various orientations and speeds.
SLAM algorithms are complex, so it is difficult to derive analytical models that can predict the error resulting from different choices of these parameters. So, we resort to a parameter sweep in simulation and on real-world flights. We run SLAM in the AirSim simulator [180] and on real-world flight traces for different flight parameters, and find the best combination. We present the results of the sweep in §4.4, but summarize the main findings below. These findings point out two important factors that determine SLAM error: point density, and the degree of overlap between successive point clouds (SLAM estimates pose by matching successive point clouds; §4.1).
Orientation impacts accuracy. At a fixed height and speed, a parallel orientation of the LiDAR (Figure 4.9), in which the line at which the LiDAR's scan plane intersects the building surface aligns with the drone's direction of motion, results in higher overlap between two successive point clouds than a perpendicular orientation, and therefore in lower SLAM error and better accuracy. Figure 4.11, obtained using the methodology described in §4.4, quantifies this intuition: different orientations have different degrees of overlap, and as overlap decreases, SLAM's positioning error increases. A parallel orientation (0°) has the lowest SLAM error because it has the highest visibility lifespan. (Visibility lifespan, the time for which a point on the building's surface is visible during flight, is a proxy for overlap; a longer lifespan indicates greater overlap.)
Speed impacts model accuracy. If the drone flies fast, two successive point clouds will have fewer over-
lapping points, resulting in errors in the SLAM’s pose transformations and (therefore) pose estimates (for a
reason similar to Figure 4.11) which leads to poor 3D model accuracy. For high accuracy, Fresco must fly the
drone at 1 m/s (§4.4.4).
Height impacts both accuracy and completeness. Because LiDAR beams are radial, the higher a drone flies, the less dense the points on the surface of the building. Lower density results in worse completeness. Accuracy is also worse, because the likelihood of matching the same point on the surface between two scans decreases with point density. Figure 4.12, obtained using the methodology described in §4.4, illustrates this. It plots positioning error as a function of point density, obtained by altering the altitude of the drone. Figure 4.12 shows that the positioning errors for point densities of 2.2 points per m² and 3.0 points per m² are 2.5 m and 1.0 m respectively. So, Fresco must fly as low as possible. For high accuracy, Fresco must fly the drone at no more than 20 m from the surface.
Figure 4.11: LiDAR orientation vs. SLAM positioning error. Figure 4.12: Point density vs. SLAM positioning error.
Finally, the drone must never rotate the LiDAR. This maneuver can increase SLAM error. Figure 4.20
shows an example in which the green dashed line is the ground truth drone trajectory, and the blue line
SLAM’s estimated pose. At the bottom right corner, when the drone rotates, SLAM is completely thrown
off.
Two of these constraints are universal: that the parallel orientation is best, and that the drone must
never rotate the LiDAR. Two others (height and speed) depend on the LiDAR characteristics and the SLAM
algorithm. For a different model of LiDAR than the one we have used in this chapter (a 64-beam Ouster)
or SLAM implementation (Cartographer), we may need to re-run the simulations to obtain these flight
parameters. We envision that a practical implementation of Fresco will include pre-computed parameters, so users will not have to determine these.
4.3.1.2 Density-imposed constraints
As Figure 4.9 shows, for a given orientation, scan length is measured along, and scan width perpendicular to, the motion of the drone. Given a distance h at which the drone flies from the surface, the target point density requirement imposes constraints on scan width. As Figure 4.9 shows, two successive scans of the surface (e.g., the legs of a U-shaped trajectory) cannot be separated by more than the scan width without violating the density constraint.
Fresco’s key observation is that these constraints can be derived from an analytical model of the LiDAR, given its configuration. For instance, for an Ouster LiDAR with 64 beams and a vertical field of view of 45°, two consecutive beams are separated by 0.7° (45/64). During one full 360° rotation, the laser emits 1024 pulses; successive pulses are 0.35° (360/1024) apart. Each pulse generates a point in the point cloud. (This is idealized: in practice, LiDARs may drop some reflections if they are noisy [141], so our density guarantee is nominal. Future work can model this noise for better equi-dense trajectory designs.) Then, the 3D coordinates of a point from the n-th pulse of the b-th beam of a LiDAR mounted on a drone at height h pointing downwards are:

(x, y, z) = (r sin θ_L, r cos θ_L sin θ_B, r cos θ_L cos θ_B)

where θ_L = θ_F/n is the angle of the n-th pulse of the b-th beam, whose beam angle is θ_B = b/360; r_max is the maximum range; θ_F is the vertical field of view; and r is the distance between the LiDAR and the point where the pulse hits the surface:

r = h / cos(θ_L + θ_B), for all r ≤ r_max
From this, Fresco derives the coordinates of every point on the surface. From this point cloud, it can
derive the fine-grained point density distribution. Thus, given a minimum point density distribution and
a distance h (obtained from the SLAM-derived constraints), Fresco calculates the scan width. Using the
same procedure, Fresco derives the scan length, the length on the surface along the direction of the flight
containing points from a single frame satisfying the minimum density constraint (Figure 4.9). Scan length
helps discretize the search space for trajectory optimization.
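The sketch below illustrates this style of calculation under a simplified model of our own (a downward-facing LiDAR over a flat surface, with beam angles measured from nadir and illustrative values for r_max and the averaging radius); it is not Fresco's exact derivation, but it shows how, given the flight distance h and the LiDAR's beam and pulse spacing, one can estimate where returns land and hence the achieved point density.

import numpy as np

def footprint_on_surface(h, num_beams=64, pulses_per_rev=1024, fov_deg=45.0, r_max=120.0):
    # Idealized footprint of one LiDAR revolution, at distance h above a flat surface.
    # Beams are spread over fov_deg starting at nadir; pulses sweep 360 degrees of azimuth.
    beam_elev = np.radians(np.arange(num_beams) * fov_deg / num_beams)  # angle from straight down
    azimuth = np.radians(np.arange(pulses_per_rev) * 360.0 / pulses_per_rev)
    elev, az = np.meshgrid(beam_elev, azimuth)
    r = h / np.cos(elev)                     # slant range to the surface for each pulse
    keep = r <= r_max                        # pulses beyond the maximum range return nothing
    ground_dist = h * np.tan(elev[keep])     # horizontal distance from the nadir point
    x = ground_dist * np.cos(az[keep])
    y = ground_dist * np.sin(az[keep])
    return np.column_stack([x, y])

def density_near_track(points_xy, radius=10.0):
    # Average point density (points per square meter) within `radius` of the nadir point.
    d = np.linalg.norm(points_xy, axis=1)
    return np.sum(d <= radius) / (np.pi * radius ** 2)

# Example: estimate the single-frame density achievable from 20 m away. Sweeping the lateral
# offset between adjacent passes against the target density yields the scan width; a similar
# calculation along the direction of motion yields the scan length.
pts = footprint_on_surface(h=20.0)
print(density_near_track(pts))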
4.3.1.3 Optimized trajectory generation
Fresco, inspired by prior work [56, 143], uses an Integer Linear Programming (ILP) based optimization to generate the shortest flight path that respects density and SLAM-imposed constraints for the class of buildings discussed above. Figure 4.19 illustrates the steps, described below, in trajectory generation for a building with a flat pentagonal roof (Figure 4.13).
Figure 4.13: Pentagon building. Figure 4.14: Unfolded pentagon. Figure 4.15: Meshed pentagon. Figure 4.16: Graph of pentagon. Figure 4.17: Derived trajectory. Figure 4.18: Folded trajectory. Figure 4.19: Illustrating steps in optimized trajectory generation.
Unfolding. Unfolding a 3D building on a 2D plane reduces trajectory planning to coverage path planning
(CPP) [98]. Fresco unfolds the sides of the building, so that the resulting object is planar. In Figure 4.14,
this results in a pentagon surrounded by rectangles. (Fresco applies a similar unfolding trick to the gabled
roof, details omitted for brevity.)
Meshing. This imposes a rectangular grid on the roof and all sides to discretize the unfolded structure into
grid elements with dimensions scan length/2 by scan width/2 (the points in Figure 4.15 depict the centers
of the rectangles). Discretization reduces the search space [56, 97], and using the scan width ensures target
point density. The longer side of the grid element parallels the longer side or dominant orientation of the
corresponding polygon; Fresco aligns the LiDAR parallel to the longer side to maximize flight segments with
a parallel orientation. Finally, a traversal that visits all mesh centers guarantees coverage of the structure.
For a non-convex flat roof, Fresco generates the mesh on the convex hull.
Graph generation. Fresco now embeds a graph on the mesh; the vertices are the centers of the rectangles, and each center is connected to every neighbor with an edge (Figure 4.16). Fresco actually embeds two sub-graphs: one on the roof mesh, and one on the side meshes taken together. It generates tours separately for each sub-graph, because the two require different LiDAR orientations. For each sub-graph, the tours start and terminate at a fixed origin (denoted by × in Figure 4.15), which ensures a loop closure before the LiDAR needs to be rotated for the other sub-graph.
Optimization formulation. For each sub-graph, Fresco models the trajectory as a Travelling Salesman Problem (TSP) tour, starting and ending at the origin. It formulates the TSP traversal as an ILP. This step takes a subgraph G = (V, E) as input, where V and E are the set of vertices and edges, respectively, and vertices are indexed i = 1, 2, ..., N, with i = 1 the origin. Each edge (i, j) ∈ E has a weight w_ij which is the 2D Euclidean distance between the positions of vertices i and j. If the drone flies from vertex i to vertex j, given (i, j) ∈ E, the binary decision variable x_ij is set to 1, and 0 otherwise. The formulation below finds a minimum length trajectory:

    minimize    Σ_i Σ_j (w_ij x_ij + λ p_ij)                                  (4.1a)
    subject to  Σ_i x_ij = 1,                         ∀ j ≠ 1                 (4.1b)
                Σ_i x_ij ≥ 1,                         j = 1                   (4.1c)
                Σ_{i≠j} x_ij = Σ_{k≠j} x_jk,          ∀ j                     (4.1d)
                u_1 = 1 and 2 ≤ u_i ≤ N,              ∀ i ≠ 1                 (4.1e)
                u_i − u_j + 1 ≤ (N − 1)(1 − x_ij),    2 ≤ i ≠ j ≤ N           (4.1f)

The first term in the objective is the tour length. To force tours to traverse the structure in a parallel LiDAR orientation for as much of the tour as possible, the second term penalizes edges selected in a non-parallel direction by a factor λ ≥ 0, where λ is a hyper-parameter. If a selected edge x_ij = 1 is in a non-parallel direction, p_ij is set to 1, and 0 otherwise.
The optimization is subject to several constraints: (4.1b, 4.1c) the drone visits each vertex of the graph exactly once, except the origin; (4.1d) if the drone visits a vertex j, it has to leave the same vertex; and (4.1e, 4.1f) no subtours of size < N are allowed. We use the Miller-Tucker-Zemlin (MTZ) subtour elimination constraint [158] for this optimization; it requires an auxiliary decision variable u_i for each vertex.

Algorithm 3: Detecting Excessive Drift
Input: SLAM poses S and GPS tags G
Output: Imperfect regions I
 1  S′, G′ ← TimeSynchronization(S, G)
 2  G′_a ← GPSToMercator(G′)
 3  t_cw ← UmeyamaAlignment(G′_a, S′)
 4  S′_a ← TransformTrajectory(S′, t_cw)
 5  foreach segment s′_i in S′_a and g′_i in G′_a do
 6      r_i ← RMSE(s′_i, g′_i)
 7      if IsExcessive(r_i) then
 8          I.Append(g′_i)
 9      end
10  end
Refolding. Once it obtains the 2D trajectories (Fig. 4.17), the trajectory planner projects each 2D point to
3D coordinates. Since the drone has to fly at a distance h from the building, we obtain the 3D coordinates at
a distance of h from the scanned face by projecting each point onto the plane parallel to that face at a distance h away from it (Fig. 4.18).
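For illustration, here is a minimal sketch of a tour of this kind over a toy set of mesh centers, written with the open-source PuLP modeler rather than the Gurobi-based implementation this chapter describes; for brevity it requires exactly one entry and exit at every vertex, including the origin, and omits the parallel-orientation penalty λ·p_ij.

import itertools
import math
import pulp

# Toy mesh centers (vertex 0 is the origin).
coords = [(0, 0), (0, 10), (10, 10), (10, 0), (5, 5)]
N = len(coords)
w = {(i, j): math.dist(coords[i], coords[j])
     for i in range(N) for j in range(N) if i != j}

prob = pulp.LpProblem("tour_sketch", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", list(w), cat="Binary")            # x[i, j] = 1 if edge (i, j) is flown
u = pulp.LpVariable.dicts("u", range(N), lowBound=1, upBound=N)  # MTZ ordering variables

prob += pulp.lpSum(w[e] * x[e] for e in w)                       # objective: total tour length

for j in range(N):
    prob += pulp.lpSum(x[(i, j)] for i in range(N) if i != j) == 1   # enter every vertex once
    prob += pulp.lpSum(x[(j, k)] for k in range(N) if k != j) == 1   # leave every vertex once

prob += u[0] == 1
for i, j in itertools.permutations(range(1, N), 2):
    prob += u[i] - u[j] + 1 <= (N - 1) * (1 - x[(i, j)])             # MTZ: forbid subtours

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([e for e in w if pulp.value(x[e]) > 0.5])                      # edges of the chosen tour

In Fresco itself, the penalty term steers the solver toward edges aligned with the LiDAR's parallel orientation, which a production implementation would add as extra terms in the objective above.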
4.3.2 Drift estimation and re-calibration
SLAM accumulates error on long flights; this is a well-known problem. To minimize error due to drift, SLAM
uses loop closure — this return to a previously-visited spot allows SLAM to re-calibrate itself. However, it
is hard to predict when SLAM incurs significant drift, so Fresco continuously estimates drift and triggers a
mid-flight return to origin (a designated spot at which to close the loop) and restarts a new SLAM session.
Mid-flight re-calibration is the primary reason for Fresco's cloud offload of SLAM: to detect drift while the drone is flying, Fresco must obtain SLAM's position estimates and assess whether drift has occurred. Detecting excessive drift is non-trivial because Fresco lacks accurate ground truth, so it has no direct way of knowing when SLAM's position estimates are wrong.
Figure 4.20: Rotation throws SLAM off.
Figure 4.21: Fresco re-calibration flight (red) to mitigate drift (red circle).
Fresco’s key insight is to detect excessive drift by comparing the shape of the SLAM-generated trajectory
with the shape of the GPS-generated trajectory. Thus, Fresco leverages the GPS readings associated with
SLAM poses: each sequence of readings gives a GPS trajectory, and Fresco attempts to find the best possible
match between the GPS trajectory (e.g., the green line in Figure 4.20) and the SLAM-generated trajectory
(the blue line in Figure 4.20). If there is a significant deviation, Fresco assumes there is a drift and invokes
re-calibration.
This approach is robust to GPS errors, since it matches the shape of the two trajectories, not their
precise positions. Specifically, Fresco continuously executes 3D SLAM on the stream of compressed LiDAR
frames from the drone, and estimates the pose of each frame. Then, it runs the algorithm described in
Algorithm 3. It synchronizes the GPS timestamps with the LiDAR timestamps (line 1), then transforms GPS readings using the Mercator projection (line 2). It then aligns the GPS trajectory and the SLAM-generated trajectory using the Umeyama algorithm [195] (line 3) to determine the rigid transformation matrices (i.e., translation and rotation) that best align SLAM and GPS poses (line 4). Fresco partitions trajectories into fixed-length segments and, after alignment, computes the RMSE between the two trajectories in each segment, using these as an indicator of excessive drift (lines 5-10): if the RMSE is greater than a threshold ρ, Fresco invokes return-to-origin (Figure 4.21).
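A minimal sketch of this shape comparison appears below. It assumes time-synchronized SLAM and GPS positions as N x 3 arrays with GPS already projected to meters, uses a standard least-squares rigid (Kabsch/Umeyama-style) alignment rather than Fresco's exact implementation, aligns SLAM onto GPS (the direction does not matter for a shape comparison), and uses illustrative values for the segment length and threshold rho.

import numpy as np

def rigid_align(src, dst):
    # Least-squares rigid transform (R, t) that maps src onto dst (no scale).
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def drifting_segments(slam_xyz, gps_xyz, seg_len=50, rho=2.0):
    # Return start indices of fixed-length segments whose aligned SLAM and GPS
    # trajectories diverge by more than rho meters RMSE.
    R, t = rigid_align(slam_xyz, gps_xyz)
    aligned = slam_xyz @ R.T + t
    bad = []
    for start in range(0, len(aligned), seg_len):
        a = aligned[start:start + seg_len]
        g = gps_xyz[start:start + seg_len]
        rmse = np.sqrt(np.mean(np.sum((a - g) ** 2, axis=1)))
        if rmse > rho:
            bad.append(start)
    return bad

Because only the shapes are compared after a best-fit alignment, a constant GPS bias of several meters does not trigger re-calibration, whereas accumulated SLAM drift does.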
4.3.3 Discovering Rooftop Geometry
Fresco’s model collection requires the rooftop geometry as input. Depending on the type of structure, it may
be difficult to get this information. Fortunately, it is possible to estimate this information using a drone
flight.
The input to this estimator is an area of interest containing one or more buildings whose 3D models Fresco
must capture. The outputs are the rooftop geometries for the buildings, as defined in §4.3. These form inputs
to model collection.
To obtain these outputs, Fresco must use a reconnaissance (or recon) flight, with as short a duration as
possible to minimize battery usage. In addition, recon (a) must not assume prior existence of a 3D model of
the building (prior work in the area makes this assumption [165, 126]); (b) must be robust to nearby objects
like trees that can introduce error; and (c) must generalize to the class of buildings Fresco targets (§4.3).
The recon trajectory. Recon uses a rectilinear scan (Figure 4.3), but unlike model collection, during recon the drone flies fast (4 m/s) and high (60 m above the building's roof; we assume the nominal building heights in an area are known, for example, from zoning restrictions), with the LiDAR mounted in a perpendicular orientation, in order to make the flight as short as possible. During this flight, Fresco streams point clouds to its cloud component, which runs the boundary detection algorithms described below.
The algorithm. Boundary detection runs two algorithms.
Surface extraction. The cloud component receives GPS-tagged compressed point clouds from the drone. It first uncompresses them, then computes the surface normal of every point in the point cloud. A surface normal for a point determines the direction normal to the surface formed by points within a fixed radius of the point. Then, Fresco uses RANSAC [93] (a plane-fitting algorithm) to segment the LiDAR points into groups of points that fall onto planes. It removes the plane furthest from the drone (likely to be the ground plane) and eliminates planes with irregular surface normals (e.g., from nearby trees). The remaining planes correspond to rooftops. For additional robustness, it uses majority voting across multiple frames to identify the rooftop.
Estimating the boundary. Fresco uses the drone's GPS location to transform each surface to the same coordinate frame of reference, then combines all surfaces into a single point cloud that represents the extracted rooftop of the building. To extract the boundary of the building, it extracts the alpha shape [54] (a sequence of piecewise linear curves in 2D) of the stitched point cloud. This allows Fresco to generalize to non-convex shapes as well. Finally, to detect the boundaries of multiple buildings, Fresco clusters the rooftop point clouds.
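The plane-segmentation step might look like the following sketch, written with Open3D rather than the PCL-based implementation Fresco uses; the thresholds, the horizontality test, and the ground-plane heuristic are all illustrative, and the input is assumed to be an N x 3 array in a frame whose z axis points up.

import numpy as np
import open3d as o3d

def extract_rooftop_planes(xyz, dist_thresh=0.2, min_points=500, max_planes=5):
    # Iteratively fit planes with RANSAC; keep near-horizontal ones and drop the lowest,
    # which is assumed to be the ground plane.
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(xyz))
    candidates = []
    for _ in range(max_planes):
        if len(pcd.points) < min_points:
            break
        model, inliers = pcd.segment_plane(distance_threshold=dist_thresh,
                                           ransac_n=3, num_iterations=1000)
        plane_pts = np.asarray(pcd.select_by_index(inliers).points)
        normal = np.asarray(model).flatten()[:3]
        if abs(normal[2]) / np.linalg.norm(normal) > 0.9:        # near-horizontal plane
            candidates.append((plane_pts, float(plane_pts[:, 2].mean())))
        pcd = pcd.select_by_index(inliers, invert=True)          # remove inliers, fit the next plane
    candidates.sort(key=lambda c: c[1])                          # sort by mean height
    if len(candidates) > 1:
        candidates = candidates[1:]                              # drop the lowest plane (likely ground)
    return [pts for pts, _ in candidates]

The surviving planes would then be stitched using the drone's GPS positions and handed to an alpha-shape routine to trace the rooftop outline.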
4.3.4 Point-Cloud Compression
LiDARs generate voluminous 3D data. For instance, the Ouster OS1-64, with a 360° horizontal and 45° vertical field-of-view (FoV), generates 20 point clouds per second, requiring 480 Mbps, well beyond the capabilities of even future cellular standards. Fresco compresses these point clouds to a few Mbps (1.2 to 4.0) using two techniques: viewpoint filtering, and octree compression.
Viewpoint filtering. In a drone-mounted LiDAR (Figure 4.9), only a portion of the full 360° contains useful information. Viewpoint filtering removes returns from beams directed towards the sky, from objects beyond the LiDAR range (because they generate zero returns), and from the drone's body.
Octree compression. Fresco compresses the retained data using an octree compression algorithm [177] designed for point clouds (so better than data-agnostic compression techniques like gzip). Fresco uses different values of octree resolution and point resolution, parameters that govern compressibility, to achieve point-cloud transmission rates of 1.2, 2.5, and 3.8 Mbps (§4.4), corresponding to high, medium, and low compression, respectively (all within achievable LTE speeds).
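Viewpoint filtering itself is simple; the numpy sketch below (with our own illustrative parameters and axis convention) drops zero and out-of-range returns, sky-facing returns, and returns from the drone body, before the retained points are handed to an octree codec (Fresco uses PCL's octree point-cloud compression, which we do not reproduce here).

import numpy as np

def viewpoint_filter(points, max_range=120.0, body_radius=1.0):
    # Keep only informative returns from a downward-facing, drone-mounted LiDAR.
    # `points` is an N x 3 array in the sensor frame, with z pointing down toward the ground.
    d = np.linalg.norm(points, axis=1)
    keep = d > body_radius          # drop zero returns and returns from the drone's own body
    keep &= d <= max_range          # drop out-of-range returns
    keep &= points[:, 2] > 0.0      # drop beams directed toward the sky
    return points[keep]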
4.4 Evaluation
Implementation. We have implemented Fresco using the Point Cloud Library (PCL [177]), the Cartogra-
pher [111] LiDAR SLAM implementation, the Boost C++ libraries [175], and the Robot Operating System
(ROS [185]). For the recon phase described in §4.3.3, we used functions from the Point Cloud Library
(PCL [177]) for plane-fitting, outlier removal and clustering. Our compression and extraction modules are
ROSnodesthatusePCL.ThedriftdetectionmoduleusesaPythonpackagefortheUmeyamaalignment[106].
The trajectory optimization uses Gurobi [138]. Not counting libraries and packages it uses, Fresco is 16,500
lines of code.
Simulations. We use a photorealistic simulator, AirSim [180] that models realistic physical environments
using a game engine, then simulates drone flights over these environments and records sensor readings taken
from the perspective of the drone. AirSim has a parametrizable model for a LiDAR; we used the parameters
for the Ouster OS1-64 in our simulation experiments. Fresco generates trajectories for the AirSim drone,
then records the data generated by the LiDAR, and processes it to obtain the 3D model. For computing the
metrics above, we obtain ground truth from AirSim. To build the ground truth 3D model, we flew a drone
equipped with a LiDAR several times over the region of interest in AirSim (using exhaustive flights) and then
stitched all the resulting point clouds using ground truth positioning from AirSim.
Real-world traces. In addition, we have collected data from nearly 30 flights (each of about 25 minutes)
on an M600Pro drone with an Ouster OS1-64 LiDAR on a commercial complex. For almost all experiments,
we evaluated Fresco on both real-world and simulation-driven traces. Simulation-driven traces give us the
flexibility to explore the parameter space more (as we show below). However, we use real-world traces to
validate all these parameter choices and estimate reconstruction accuracy in practice.
Metrics. We quantify end-to-end latency, 3D model accuracy and completeness (§4.3.1), and positioning
error. We also quantify Fresco’s energy-efficiency (using flight duration as a proxy for drone battery usage)
and the computational capabilities of its processing pipeline.
Scheme Acc. (m) Comp. (m) Rec. (s) Flight (s)
Large building (100 m x 50 m x 20 m)
ColMap 0.16 0.75 31600 1320
Fresco-st 0.21 0.24 in-flight 1520
Fresco 0.09 0.05 in-flight 1430
Small building (50 m x 50 m x 20 m)
ColMap 0.11 0.80 23084 760
Fresco-st 0.25 0.14 in-flight 900
Fresco 0.12 0.09 in-flight 864
(a) Reconstruction accuracy (Acc.), completeness (Comp.), reconstruction time (Rec.), and flight time (Flight) for two buildings using: a) photogrammetry reconstruction with an optimized trajectory (ColMap), b) Fresco with a simple rectilinear trajectory (Fresco-st), and c) Fresco.
Scheme BW (Mbps) Acc. (m) Comp. (m)
Offline SLAM 480 2.30 1.30
Online with GPS 3.80 1.60 0.53
Fresco 3.80 0.13 0.09
(b) Reconstruction quality of a real-world 70 x 40 x 20 m building with online and offline approaches, relative to an
uncompressed trace.
Structure type Flight dur. (s) Acc. (m) Comp. (m)
Tall rect. 1430 0.08 0.06
Large rect. 1145 0.09 0.05
Pentagonal 1204 0.11 0.08
Small rect. 864 0.12 0.09
Gabled roof 862 0.13 0.07
Plus-shaped 1063 0.16 0.07
(c) Fresco 3D reconstruction times (recon (§4.3) and model collection) and quality for different structures at low
compression.
Table 4.6: Reconstruction accuracy results.
4.4.1 3D Model Reconstruction
Fresco outperforms photogrammetry-based reconstruction. We compare Fresco against ColMap [178], a state-of-the-art photogrammetry-based tool that uses multi-view stereo (MVS [96]). (We also compare Fresco against LiDAR-based reconstruction in Table 4.6b and §4.4.3.)
For this, we use two rectangular buildings in AirSim: a) a large 100 m x 50 m x 20 m (L x W x H) building, and b) a small 50 m x 50 m x 20 m building. We calculated the accuracy and completeness of the models generated
by these approaches by comparing them against ground truth models generated from AirSim. Lower is
better for accuracy and completeness. For these experiments, Fresco uses compressed point clouds with
bandwidth requirements that are compatible with LTE speeds today (i.e., upload bandwidth of 3.8 Mbps);
Figure 4.22: 3D model accuracy at different compression levels: ground truth, raw trace (0.1 m), low compression (0.12 m), medium compression (0.35 m), and high compression (0.44 m).
we study the effect of compression on Fresco model reconstruction more in §4.4.3. For ColMap, we flew the
drone in various trajectories and collected 2D images for offline reconstruction; we report the result for the
best performing trajectory.
As Table 4.6a shows, Fresco achieves cm-level accuracy and completeness for both buildings. ColMap, like
Fresco, is largely dependent on trajectory design. For ColMap, we tried multiple trajectories, including ones
suggested for drone photogrammetry [4]. From these, we report the trajectory with the best reconstruction
quality. Without proper trajectory design, ColMap reconstruction fails altogether. With proper trajectory
planning, ColMap reconstructs both buildings within 0.16 m accuracy and 0.80 m completeness. Fresco
outperforms ColMap because it has real-time insight into the drone’s tracking accuracy and can re-calibrate
on-the-fly. ColMap reconstructions are almost as accurate as Fresco but highly incomplete. This
is because of the inherent nature of photogrammetry-based reconstructions; if ColMap cannot converge to a
good solution for a given region, it will leave that region empty, resulting in an incomplete model.
ColMap's flight duration is shorter than Fresco's, which might appear to put Fresco at a disadvantage. Actually, Fresco's planned trajectory for both buildings was shorter; however, Fresco detected drift mid-flight and incurred a few re-calibrations, resulting in longer duration flights.
Fresco reconstructs the small building within 15 min and the large building in just over 20 min (the flight
duration). On the other hand, ColMap takes several hours even on a powerful GPU-equipped workstation.
Thus, Fresco reconstructs the 3D model approximately 40× faster than ColMap.
We do not evaluate Fresco against smartphone-based LiDAR reconstruction approaches which are designed
for indoor 3D reconstruction. These approaches are unsuitable for outdoor scenes for two reasons: the smart-
phone LiDAR’s limited sensing range [139] (3 m), and high depth estimation errors beyond this range [191]
(i.e., up to 30 cm). As such, in outdoor spaces, these applications can have tracking errors of several meters [42],
which is undesirable. Even with perfect tracking, to ensure complete reconstruction, the drone has to do
many flight traversals close to the building’s surface which can easily exhaust the drone’s battery. Our cal-
culations show that, for such a smartphone-based reconstruction approach, reconstructing the large building
(100m x 50m x 20m) can take over 1.5 hours of the drone flight time (the M600Pro has a flight time of 25
minutes).
Fresco generalizes to different building shapes. Fresco is designed to reconstruct a variety of building
types (§4.3). Fresco reconstructs all these buildings within a single drone battery cycle (25 to 30 mins) at
cm-level accuracy and completeness (Table 4.6c). Flight durations for fairly large buildings (100 x 50 x 20 m
large rectangle, pentagon and 50 x 50 x 40 m tall building) are relatively longer because they require multiple
re-calibrations. Even for these, Fresco preserves cm-level accuracy and completeness.
Fresco is accurate on real-world data. Results from our real-world drone flights validate that real-
world data can result in comparable accuracy (Table 4.6b). We reconstructed the 3D model of a real-world
70 m x 40 m x 20 m rectangular building. Because we lack a reference ground truth for real-world data, we
use the 3D model generated from raw, uncompressed traces. Offline reconstruction using SLAM after the
drone lands fails completely for the same reasons mentioned above (i.e., no trajectory planning, and no re-
calibration). With GPS, it is possible to do in-flight reconstruction, however, the accuracy and completeness
being 1.60 m and 0.53 m, make such 3D models unusable. With Fresco, on the other hand, we can build
accurate, and complete 3D models whose completeness and accuracy are 9 cm and 13 cm respectively.
Component Average (ms) 99th percentile (ms)
Compression 88.2 109.45
Network Latency 104.0 409.9
Decompression 8.9 11.7
Reconstruction 33.3 87.8
End-to-End Latency 234.0 552.1
(a) Fresco enables fast 3D reconstruction over LTE. Each row shows per frame average and 99th percentile latency
for that component.
Stage Sub-component Time (ms)
Surface Extraction
Point Cloud Decompression 3.0± 0.3
Surface Normal Estimation 76.0± 82
RANSAC Plane-fitting 5.0± 9.0
Outlier Removal 0.2± 0.3
Rooftop Extraction 6.0± 5.0
Boundary Estimation Rooftop Stitching 3.0± 2.0
Total Time 93± 90.0
(b) Per-frame processing times for Fresco’s building geometry estimation.
Table 4.7: Fresco performance results.
4.4.2 Performance
Fast 3D reconstruction over LTE is feasible. To validate that Fresco can collect a 3D model in the
real-world, we used our end-to-end implementation of Fresco to reconstruct a 70 x 40 x 20 m building by
streaming compressed point clouds over LTE whilst the drone was in-flight (at 10 Hz) to a 16-core
Azure VM with 64-GB RAM that ran 3D reconstruction. The experiment ran for about 16 minutes of drone
flight (Figure 4.23 and Table 4.7a).
Fresco has a per-frame end-to-end 99th percentile latency of approximately half a second. This
means that the cloud can detect excessive drift and initiate mid-flight recalibration within a second. An
interesting side-effect of this design is that the 3D model is ready within half a second of flight
completion. The average end-to-end processing latency per frame is about 234 ms, of which nearly 104 ms
is network latency. On-board point cloud compression takes on average 88 ms per frame but is able to keep
up with the LiDAR's frame rate, i.e., 10 Hz. Decompression and 3D reconstruction are fairly fast in the cloud. Trajectory generation runs once and takes on average 14.6 ms per building. Drift estimation runs periodically and takes on average 11.2 ms.
Figure 4.23: Per-frame end-to-end latency (compression, decompression, 3D reconstruction, and network latency) from a real-world drone flight.
Fresco’s geometry extraction processes at full frame rates. We profiled the execution time of each
component in Fresco's building geometry extraction on a 15-minute real-world trace. Point cloud compression
executes on the drone, and other components run on an AWS VM with 16 cores, 64 GB RAM and an Nvidia
T4 GPU. Extracting the building geometry requires 93 ms per frame (Table 4.7b); with these numbers, we
can sustain 10 fps. At this frame rate, our building detector is quite accurate (§4.3.3). The most expensive
component is surface normal extraction (76 ms), even though we offload it to a GPU. Thus, a moderately
provisioned, cloud VM suffices to run Fresco at full frame rate with an end-to-end compute latency of about
100 ms for reconnaissance, and 33 ms for model collection.
4.4.3 Ablation Studies
Careful trajectory planning is crucial for reconstruction. To demonstrate this, we evaluated Fresco
without its optimized trajectory generation. Instead, we use a rectilinear trajectory (Figure 4.3) that sat-
isfies the minimum point density requirement (Fresco-st, Table 4.6a). Fresco-st has poorer accuracy and completeness relative to Fresco, even though it has a longer duration flight (which should, in theory, give it a better chance to capture more detail of the building). However, a larger proportion of Fresco-st's trajectory is not in a parallel orientation; this illustrates the importance of explicitly optimizing for parallel flights in the objective function (Equation 4.1, in §4.3.1).
Re-calibration helps reduce error. To show the effect of in-flight re-calibration, we compare reconstruc-
tion quality with (w) and without (w/o) recalibration in AirSim (Table 4.8a). Across our six buildings, on
average, at the expense of 35% (285 seconds) longer flights, Fresco improves accuracy by 70% (28 cm) and
completeness by 64% (13 cm) with re-calibration flights. Larger buildings (tall rectangle, large rectangle,
and pentagonal) require longer aerial flights which accumulate higher drift. This results in relatively more
re-calibration flights and hence higher flight duration. Even so, Fresco is able to reconstruct these build-
ings accurately (within cm-level accuracy and completeness), demonstrating the importance of re-calibration
whilst in-flight.
Fresco builds accurate models at low bandwidths. We explore the impact of compression on accuracy
and completeness using (a) a synthetic building in AirSim and (b) real-world traces. In addition to the three
compression schemes discussed earlier (§4.3.4), we compute accuracy and completeness for (a) viewpoint
compression and (b) lossless compression. The first contextualizes our results, while the second alternative
explores reconstruction performance under higher bandwidth as would be available, e.g., in 5G deployments.
Viewpoint filtering retains the same points as a raw point cloud, but with zero returns removed; with this alone, it achieves roughly 10× compression. As Table 4.8b shows, low compression is an order of magnitude more efficient beyond this. Despite this, Fresco can achieve high quality reconstruction. For the AirSim building, consider accuracy: the viewpoint-filtered point cloud has an accuracy of 0.10 m and a completeness of 0.08 m, which is attributable
Structure type    Flight dur. (s) (w/o, w)    Accuracy (m) (w/o, w)    Comp. (m) (w/o, w)
Tall rect. 1147 1430 0.31 0.08 0.15 0.06
Large rect. 855 1145 0.13 0.09 0.06 0.05
Pentagonal 891 1204 0.61 0.11 0.22 0.08
Small rect. 678 864 0.52 0.12 0.13 0.09
Gabled roof 667 862 0.32 0.13 0.11 0.07
Plus-shaped 620 1063 0.48 0.16 0.51 0.07
(a) Flight duration (dur.) and reconstruction quality for buildings at low compression with (w) and without (w/o)
re-calibration.
Compression profile BW (Mbps) Acc. (m) Comp. (m)
Real-world 70 m x 40 m x 20 m large building
View-point 42.7 0.00 0.00
Lossless 7.86 0.06 0.07
Low 3.80 0.13 0.09
Medium 2.50 0.23 0.16
High 1.27 0.28 0.29
AirSim 50 m x 50 m x 20 m small building
View-point 42.7 0.10 0.08
Lossless 7.86 0.10 0.10
Low 3.80 0.12 0.09
Medium 2.50 0.35 0.09
High 1.27 0.44 0.10
(b) The impact of compression on accuracy/completeness.
Table 4.8: Ablation study results.
entirely to SLAM error. Low compression, with a bandwidth of 3.8 Mbps (easily achievable over LTE and over 100× more compact than the raw LiDAR output), only adds 2 cm and 1 cm to accuracy and completeness, respectively. This also shows that there is little room for improvement with 5G bandwidths. Medium and high compression have significantly poorer accuracy and completeness. Results for other buildings are similar, so we omit them for brevity.
Results from our drone flights validate that real-world data of a large building (dimensions in Table 4.8b)
results in comparable performance (Table 4.8b). Since we lack a reference ground truth for real-world data,
we use the 3D model generated from raw traces. With real-world traces, we can build accurate, and complete
3D models that are within 9-13 cm completeness and accuracy for low compression, and about 16-23 cm for
medium compression, relative to the raw traces. This suggests that highly compressed point clouds do not
significantly impact accuracy and completeness.
LiDAR Orientation    Speed 1.5 m/s    Speed 3.0 m/s    Distance 20 m    Distance 40 m
Parallel             1.3              3.3              1.2              5.4
Perpendicular        3.1              7.6              2.1              ∞
Table 4.9: Positioning errors (m) for parallel and perpendicular orientations at different speeds (at a distance of 20 m) and at different distances (at a speed of 1 m/s), for real-world traces.
To get a visual feel for the degradation resulting from lower accuracy, Figure 4.22 shows the ground-truth
model, together with the Fresco reconstructions for different compression levels. With an accuracy of 0.12 m
(using 3.8 Mbps upload bandwidth), the model closely matches the ground truth. As accuracy worsens at
higher levels of compression, the textures on the building increasingly become less distinct and start to show
some small artifacts, arising not because of compression but because of SLAM imperfections (§4.4.3).
Lower target density can shorten flight duration. We evaluated Fresco at two different densities: 7.5 points per m² and 1 point per m². The lower density flight took only 31% of the higher density flight time. However, as expected, reconstruction is worse at the lower target density (0.18 m accuracy and 0.24 m completeness vs. 0.08 m and 0.06 m for the higher density). For applications that can tolerate this degradation, the smaller flight time might result in cost savings.
4.4.4 Data Collection
Fresco determines model collection flight parameters with a parameter sweep in simulation and on real-world
traces (§4.3.1). In simulations, we evaluated SLAM error for every combination of drone speed (0.5 m/s to
3 m/s), distance from building (10 m to 40 m), and LiDAR orientation (ranging from parallel to perpendic-
ular).
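To illustrate the structure of this sweep (not Fresco's actual tooling), the Python skeleton below enumerates the same grid of speeds, distances, and yaw angles; evaluate_slam_error is a hypothetical placeholder for flying one simulated pass in AirSim and returning the SLAM positioning RMSE against the simulator's ground truth.

```python
from itertools import product

def evaluate_slam_error(speed_mps: float, distance_m: float, yaw_deg: float) -> float:
    """Hypothetical placeholder: fly one simulated pass with these parameters and
    return the SLAM positioning RMSE against ground truth. The constant return
    value only keeps this skeleton runnable; replace it with a real evaluation."""
    return 0.0

speeds = [0.5, 1.0, 2.0, 3.0]      # m/s
distances = [10, 20, 30, 40]       # m from the building surface
yaws = [0, 30, 45, 60, 90]         # degrees; 0 = parallel, 90 = perpendicular

errors = {
    (speed, dist, yaw): evaluate_slam_error(speed, dist, yaw)
    for speed, dist, yaw in product(speeds, distances, yaws)
}
best_speed, best_dist, best_yaw = min(errors, key=errors.get)
print(f"lowest SLAM error at speed={best_speed} m/s, "
      f"distance={best_dist} m, yaw={best_yaw} deg")
```

In our simulations, this kind of sweep leads to the choices discussed next: a parallel orientation, a 20 m distance, and a speed of 1 m/s.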
Best choice of orientation is parallel. Figure 4.24 plots SLAM error as a function of LiDAR orientation
(Figure 4.9) with respect to the direction of motion. A parallel orientation has the lowest SLAM error (in
Figure 4.24, a yaw of 0° corresponds to parallel and a yaw of 90° to perpendicular), because it has the highest overlap
between successive frames; as yaw increases, overlap decreases, resulting in higher SLAM error (§4.3.1).
Best choice of distance is 20 m. Figure 4.25 plots the SLAM error as a function of the drone's distance
from the building surface for the parallel orientation of the LiDAR. Error increases slowly with height; beyond
Figure 4.24: SLAM errors for LiDAR orientations (positioning error in m vs. LiDAR yaw in degrees).
Figure 4.25: SLAM error for speeds and building distances (positioning error RMSE in m vs. distance from the building in m, and vs. speed in m/s).
a 20 m distance from the building, the error is more than 1 m. Point densities decrease with height and affect
SLAM’s ability to track features/points across frames (§4.3.1). Rather than fly lower, Fresco operates at a
20 m distance from the building to reduce flight duration.
Best choice of speed is 1 m/s. Speed impacts SLAM positioning significantly (Figure 4.25). Beyond
1 m/s, SLAM cannot track frames accurately due to lower overlap between frames (§4.3.1). Below 1 m/s
(i.e., at 0.5 m/s), the flight duration is twice that at 1 m/s, resulting in drift accumulation. For
accurate reconstruction, Fresco flies the drone at 1 m/s.
Real-world traces confirm these observations. To validate them, we performed a parameter
sensitivity study on real-world flights to determine the optimum parameters for SLAM positioning. Lacking
accurate ground truth in the real world, we compare SLAM positions against a GPS trace. Because
GPS is erroneous, we only draw qualitative conclusions.
As Table 4.9, taken from our drone traces, shows, slower flights have lower SLAM error than faster ones,
and parallel orientations have lower SLAM error than perpendicular ones. Similarly, SLAM error increases with
height and, in real-world traces, the parallel orientation seems to be significantly better than the perpendicular
orientation (Table 4.9). At a distance of 20 m from the surface of the building, the parallel orientation has
the minimum positioning error (1.25 m). Beyond 40 m for parallel and 20 m for perpendicular, SLAM
loses track completely because of lower point density.
4.4.5 Boundary Detection
Methodology and metrics. We use two metrics for building boundary estimation: accuracy and completeness.
Accuracy is the average (2D) distance between each point (quantized to 0.1 m) on the predicted
boundary and the nearest point on the actual building boundary. Completeness is the average distance
between each point on the actual boundary and the nearest point on Fresco's predicted boundary. Lower
values of accuracy and completeness are better.
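The following Python sketch shows one way to compute these two metrics with nearest-neighbor queries (NumPy and SciPy); the 0.1 m quantization grid mirrors the description above, but the function itself, its name, and the toy square boundary are illustrative rather than Fresco's code.

```python
import numpy as np
from scipy.spatial import cKDTree

def boundary_accuracy_completeness(predicted, actual, grid=0.1):
    """Accuracy: mean distance from each predicted boundary point to its nearest
    point on the actual boundary. Completeness: the same in the reverse direction.
    Inputs are N x 2 arrays of (x, y) points; `grid` quantizes them to 0.1 m."""
    pred = np.unique(np.round(np.asarray(predicted) / grid) * grid, axis=0)
    act = np.unique(np.round(np.asarray(actual) / grid) * grid, axis=0)
    accuracy = cKDTree(act).query(pred)[0].mean()      # predicted -> actual
    completeness = cKDTree(pred).query(act)[0].mean()  # actual -> predicted
    return accuracy, completeness

# Toy example: a square ground-truth boundary and an estimate offset by 0.2 m
# in x and y (about 0.28 m diagonally).
t = np.linspace(0.0, 10.0, 101)
zeros, tens = np.zeros_like(t), np.full_like(t, 10.0)
actual = np.vstack([np.c_[t, zeros], np.c_[t, tens], np.c_[zeros, t], np.c_[tens, t]])
predicted = actual + 0.2
acc, comp = boundary_accuracy_completeness(predicted, actual)
print(f"accuracy = {acc:.2f} m, completeness = {comp:.2f} m")
```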
We use both real-world traces collected from the Fresco prototype and synthetic traces from AirSim. For
real-world traces, we pinpointed the building's boundary on Google Maps [104] for ground truth. For AirSim,
we collected the ground truth from the Unreal engine.
Boundary detection can run at full frame rate. Table 4.7b shows the execution time for boundary
detection components on real-world traces, using the cloud VM. The average processing time per frame is
93 ms, dominated by GPU-accelerated surface normal estimation (76 ms). This can sustain 10 fps.
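For context, per-point surface normals of the kind this step produces can be estimated from local neighborhoods of the point cloud. The sketch below uses Open3D on the CPU purely as an illustration; Fresco's pipeline uses its own GPU-accelerated routine, and the synthetic wall, the 0.5 m search radius, and the 30-neighbor cap are assumptions made for the example.

```python
import numpy as np
import open3d as o3d  # assumption: Open3D is installed; it is not part of Fresco

# Synthetic example: noisy points sampled on a vertical wall at y = 2 m.
rng = np.random.default_rng(1)
n = 5000
wall = np.column_stack([
    rng.uniform(0.0, 10.0, n),        # x along the wall
    2.0 + rng.normal(0.0, 0.01, n),   # y: the wall plane, with a little noise
    rng.uniform(0.0, 20.0, n),        # z: height
])

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(wall)
# Per-point normals from a local neighborhood (0.5 m radius, at most 30 neighbors).
pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.5, max_nn=30))
normals = np.asarray(pcd.normals)
# Near-horizontal normals (small z component) indicate wall-like surfaces, the
# kind of cue a boundary detector can exploit.
print("mean |normal_z| on the wall:", np.abs(normals[:, 2]).mean())
```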
Boundary detection is accurate. We experimented with 3 real-world traces collected over a 70 m x
60 m x 20 m building. Fresco's average accuracy is 1.42 m and completeness is 1.25 m, even at the highest
compression and when it samples every other frame. Overall, these results justify our choice of a fast, high
reconnaissance flight (§4.3.3).
Other results. We evaluated the boundary detection algorithm's robustness to different shapes (Table 4.6c),
point cloud compression (Table 4.8b), sub-sampling, and flight parameters. We found that: (a) recon flights can be
short (boundary detection is insensitive to point density and overlap), so the drone can use a perpendicular orientation
and fly at 60 m from the building at 4 m/s; (b) boundary detection tolerates sub-sampling down to one point cloud per second; and (c) it
generalizes to different building geometries.
4.5 Conclusions
In this chapter, we have taken a step towards accurate, near-real time 3D reconstruction using drones.
Our system, Fresco, uses novel techniques for navigating the tension between cellular bandwidths, SLAM
positioning errors, and compute constraints on the drone. It contains algorithms for optimized trajectory
generation and for detecting excessive SLAM drift. It can achieve reconstruction accuracy to within
10 cm in near real-time, even after compressing LiDAR data enough to fit within achievable LTE speeds.
Future work can include using more sophisticated drone battery models, cooperative reconstruction of large
campuses using multiple drones, and generalizing further to structures of arbitrary shape.
Chapter 5
Related Work
5.1 Autonomous Vehicles
5.1.1 Vehicle Sensing and Connectivity
Network connectivity in vehicles has opened up large avenues for research; this section covers them briefly.
A large body of work [183] has explored wireless technologies and standards (such as DSRC) for vehicle-
to-vehicle, and vehicle-to-infrastructure communication. Connected autonomous vehicles have also inspired
proposals for cooperative perception [163, 210, 109], and cooperative driving [83, 136, 59] in which autonomous
vehicles share information with each other to improve safety and utilization. Some have proposed approaches
to offload route planning (but not trajectory planning) to the cloud [107]. Others explore platooning [174, 119]
in which vehicles collaboratively and dynamically form platoons to enable smooth traffic flows. Beyond
inter-vehicle collaboration, several proposals have explored infrastructure support for connected autonomous
vehicles, with infrastructure augmenting perception [166, 105], or delivering traffic light status [174]. Other
work focuses on infrastructure-assisted traffic management at intersections [120]. Net-driving (chapter 3)
goes beyond this body of work by demonstrating the feasibility of decoupling both perception and planning
from vehicular control.
LiveMap [77] uses GPS and monocular cameras to automate road abnormality detection (e.g., pothole
detection). With its depth perception capabilities, Fresco can more accurately position roadside hazards.
AVR [163] extends vehicular vision using feature maps and would benefit from Fresco. Although the band-
width requirements for CarMap (chapter 2) are within the LTE speeds today, it can benefit from systems
[124] that schedule redundant transmissions over multiple networks. Recent work in object detection on
mobile devices [134] introduces a fast object tracking method that can be used in CarMap to enable faster
segmentation. For stitching map segments from rural, unmapped regions, CarMap can benefit from [154]
which enables autonomous navigation in such areas.
5.1.2 Mapping
In this thesis, the CarMap (chapter 2) implementation uses traditional computer vision-based features
(ORB [172]) to build the map, but these can be replaced with better, more stable CNN-based features [85].
After running a feature extractor, CarMap uses motion tracking and semantic segmentation to select sta-
ble features to build the map. Mask-SLAM [117] proposes a similar dynamic object filter to CarMap but
CarMap uses majority voting and robust labeling to account for limited on-board computational resources
and boundary segmentation errors. Other approaches [64, 121] remove dynamic features from multiple maps
collected along the same trace using background subtraction. However, even the most static features are not persistent
over longer timescales. Future work on longer timescale mapping can integrate CarMap with the persistence
filter presented in [170], which estimates the lifetime of a feature based on an environmental evolution model.
CarMap also benefits from map-element culling techniques [122] that scale map sizes with the size of the
environment rather than the number of miles driven. Mobileye [31] crowdsources 3D map collection for vehicles
using monocular cameras, whereas CarMap is designed for 3D sensors like LiDARs and stereo cameras.
5.1.3 Perception
LiDAR-based perception. Prior work has explored using infrastructure LiDAR to detect pedestrians [211],
and other road features such as lanes and drivable surfaces [202, 112], and to warn vehicles of impending
collisions [57, 189]. One work [206] proposes a genetic algorithm based LiDAR alignment, but unlike Net-
driving (chapter 3), has not explored the efficacy of an entire perception pipeline built on top of LiDAR
fusion.
Motion estimation. Heading and velocity can be estimated using deep neural networks [140, 71],
SLAM [147, 149, 90], or visual odometry [95, 198, 171]. Net-driving (chapter 3) uses a lightweight technique
since it relies on static LiDARs.
Deep neural networks for detection and tracking. Some work has developed neural nets for point-
cloud based detection [212, 197, 125, 74, 205], and tracking [162, 103, 184, 65, 136, 140, 204]. These are too
heavyweight for Net-driving, and have been developed for vehicle-mounted LiDAR (detection and tracking
are easier for static LiDARs).
5.2 Drone Positioning
Robotics literature has studied efficient coverage path-planning for single [192], and multiple drones [144].
Fresco’s (chapter 4) trajectory design is influenced by more intricate constraints like SLAM accuracy and
equi-density goals. Accurately inferring drone motion is important for SLAM-based positioning [67]. Car-
tographer [111], which Fresco uses for positioning, utilizes motion models and on-board IMUs for estimating
motion. A future version of Fresco can leverage drone orchestration [108] and SLAM edge-offloading [61] for
larger scale reconstruction.
5.3 3D Reconstruction
5.3.1 Offline Reconstruction using Images
UAV photogrammetry [92] reconstructs 3D models offline from 2D photographs. Several pieces of work [82,
190, 123, 127] study the use of cameras (either RGB or RGB-D) on UAVs for 3D reconstruction. Prior
work [82] has proposed a real-time, interactive interface into the reconstruction process for a human guide.
The most relevant of these [146, 152] predicts the completeness of 3D reconstruction in-flight, using a quality
confidence predictor trained offline, for a better offline 3D reconstruction. However, unlike Fresco (chapter 4),
this work requires human intervention, computes the 3D model offline, requires close-up flights, cannot ensure
equi-dense reconstructions, cannot dynamically re-calibrate for drift, and is not an end-to-end system. A body
of work has explored factors affecting reconstruction accuracy: sensor error [115], tracking drift, and the degree
of image overlap [82, 137]. Other work [129, 127, 68] has explored techniques to reduce errors by fusing with
depth information, or using image manipulations such as upscaling. Unlike Fresco, almost all of this work
reconstructs the 3-D model offline.
5.3.2 Offline Reconstruction using LiDAR or Radar
3D model reconstruction using LiDAR [131, 201] relies on additional positioning infrastructure such as base
stations for real-time kinematic (RTK) positioning, and long-range specialized LiDAR to achieve tens of
centimeters model accuracy. Fresco (chapter 4) explores a different part of the design space: online recon-
struction with sub-meter accuracy using commodity drones, GPS, and LiDAR. More recent work has explored
drone-mounted, LiDAR-based offline reconstruction of tunnels and mines, but requires specialized LiDARs and
a human-in-the-loop [66, 11] for drone guidance (either manually or by defining a set of waypoints). Finally,
mmMesh [203] explores cloud-offload for human mesh reconstructions using millimeter-wave radar.
5.4 SLAM
5.4.1 Visual SLAM
Although this thesis implemented CarMap (chapter 2) on top of ORB-SLAM2 [148], our study of other
SLAM systems shows that it can be easily ported to other keyframe-based visual SLAM algorithms like S-
PTAM [161]. In future work, we can extend CarMap to group features into higher-dimensional planes [113] to
further improve localization accuracy. As wireless speeds increase, it might be possible to design over-the-air
map updates for dense mapping systems like [132] using techniques similar to ones presented in this thesis.
We have left this to future work.
5.4.2 Decentralized SLAM
Decentralized SLAM systems [81] leverage multiple agents to run SLAM in unknown environments. CarMap
can be considered an instance of decentralized SLAM [78] with some differences. In decentralized SLAM,
the agents (robots) have limited compute-power and only run visual odometry [94]. This leads to inaccurate
localization whereas vehicles in CarMap localize more accurately because they run both mapping and local-
ization. Decentralized SLAM sends all keyframe features to a central collector which performs all mapping
operations [176] whereas CarMap only sends map-features to a cloud service to ensure real-time map ex-
changes. Similarly, in decentralized SLAM, the collector finds overlap between maps of different agents using
the histogram word approach, does not remove environmental dynamics, and hence is not as robust as CarMap.
Decentralized SLAM [169] uses features from a single overlapping keyframe to compute the transformation matrix
whereas CarMap is more robust and uses features from multiple keyframes.
Chapter 6
Conclusion and Future Work
6.1 Conclusion
Live 3D digital twins are high-fidelity 3D representations that replicate the physical world in near real-time.
Live twins create unprecedented capabilities for autonomous driving, monitoring ongoing construction, and
integrity checks for large physical structures. To this end, this thesis makes two contributions: a) it builds
re-usable perception infrastructure to generate live digital twins, and b) it builds end-to-end cyber-physical
systems that leverage live digital twins to enable the capabilities mentioned above.
To construct a live digital twin, we continuously fuse data from multiple 3D sensors into a single coordinate
system, in near real-time. Live twins and the capabilities that leverage them have very strict accuracy and
performance requirements. A live twin should replicate the physical scene within milliseconds with an accuracy
of a few centimeters. This is not possible today for two main reasons: a) limited network bandwidth, and
b) limited compute resources. 3D sensors like LiDARs and stereo cameras generate data at rates of hundreds of
Mbps to Gbps, which stresses both network and compute resources.
To this end, this thesis builds systems that ensure high accuracy of live digital twins whilst dealing
with other competing constraints like network bottleneck, computational latency and energy. The main
contribution of this thesis is in meeting these constraints with minimum or no tradeoffs in accuracy. CarMap
(chapter 2) addresses the network bottleneck whilst ensuring high accuracy. Net-driving (chapter 3) addresses
the compute bottleneck without trading off accuracy. Fresco (chapter 4) ensures low latency and low energy
consumption whilst maintaining high accuracy. As such, this thesis makes four contributions as follows.
• It introduces lean representations for 3D data to transfer them over wireless networks. This addresses
the network bottleneck.
• It builds pipelines and techniques to operate on voluminous 3D data at frame-rate with COTS compute
resources. This addresses the compute bottleneck and ensures low latency.
• It proposes loss compensation techniques that achieve the accuracy requirements for live digital twins even with lean 3D representations.
• It puts forward techniques to fuse 3D data from multiple sensors into a single live digital twin.
6.2 Future Work
There are three main fronts on which to push this research in future work.
6.2.1 Autonomous Driving
For widespread adoption, autonomous vehicles have to demonstrate a high degree of reliability. Cooperative
perception, collaborative planning, and large training corpora can help achieve this. Edge-assisted autonomous
driving. As Net-driving demonstrates, edge resources (sensing and compute) can play a pivotal
role in improving perception and planning for autonomous driving. Compared to Net-driving, other intermediate
vehicle-edge offload designs are equally advantageous and may be easier to adopt. For instance,
edge-sensors can share live views with vehicles. Or, edge-compute can stitch 3D views from vehicles and
send results back to vehicles. However, leveraging edge resources means tackling a number of challenging
research problems. For instance, both designs described above are advantageous but which design can scale
gracefully to dense traffic conditions? To meet autonomous driving performance requirements, what kinds
of hardware accelerators can we leverage at the edge? To ensure robustness, how can we schedule multiple
simultaneous 3D view transmissions? What can be a fallback mechanism for catastrophic hardware/software
failures? Net-driving’s centralized planner opens up exciting avenues but how can we protect it against
security attacks?
3D map updates. Autonomous vehicles collect and compute dense LiDAR and camera-based maps for
collision and obstacle avoidance. CarMap [52] develops techniques for map updates, but dense maps are
fundamentally different, and so they are updated offline. Future work can investigate how to
update these maps on the fly. Specifically, is it possible to efficiently detect changes in a dense LiDAR map?
Can we leverage edge compute resources to instantly update these maps on the fly?
Autonomous driving datasets. A single autonomous driving vehicle generates up to 100 TB of data per
day. Manufacturers store this data over large timescales (e.g., 10 years) for two reasons: a) federal regulations,
and b) training. There is an opportunity to build a service for A/B testing autonomous driving pipelines
using autonomous driving datasets. To do this, there is a need to find efficient techniques that move large
amounts of 3D data from the edge to the cloud, store it efficiently, and index it for fast retrieval at the cloud.
6.2.2 Collaborative 3D Reconstruction
The proliferation of depth sensors (e.g., in traffic lights, security cameras, drones, and vehicles), the emergence
of 5G, and the applications these can enable open up exciting avenues. By fusing the capabilities of Net-
driving and Fresco, we can build live digital twins of larger urban spaces. With such a capability, consumers
can digitally teleport to experience large live entertainment events (e.g., concerts and sports matches etc.) or
their favorite tourist locations in 3D. Though exciting, there are technical challenges to surmount before we
can enable these capabilities. For instance, how can we scale Fresco to multiple drones that can automatically
configure themselves to build and stream a live 3D model of a given region? With a plethora of different 3D
sensors, can we build a common abstraction layer for them to share data on?
6.2.3 Indoor Reconstruction
For the most part, this thesis has focused on live 3D views of outdoor spaces. Indoor replicas are equally
useful and have been used in 3D video conferencing [14], social media [19], and immersive training. Today,
live 3D views are built indoors in production studios equipped with specialized hardware. As manufacturers
equip smartphones with depth sensors, I am interested in using them to make these applications ubiquitous.
Doing so requires tackling technical challenges unique to indoor settings. 3D reconstruction and 3D stitching
are compute-intensive applications. How can we efficiently run them on resource-constrained mobile phones
or VR headsets? How can we transfer large amounts of 3D data over wireless links? Future work can look
at leveraging indoor compute and network resources to offload some of these tasks. With so many different
kinds of phones and VR headsets, can we build abstractions that all vendors can agree on?
Reference List
[1] 5 Ways Drones are Being Used for Disaster Relief. https://safetymanagement.eku.edu/blog/5-ways-
drones-are-being-used-for-disaster-relief/.
[2] Best Drones for Natural Disaster Response. https://www.dronefly.com/blogs/news/drones-flooding-
sar-disaster/.
[3] Dronedeploy. https://www.dronedeploy.com/solutions/roofing/.
[4] The dronedeploy app. https://www.dronedeploy.com/.
[5] DroneDeploy: Making Successful Maps. https://support.dronedeploy.com/docs/making-successful-
maps.
[6] Drones In The Oil and Gas Industry. https://digitalaerolus.com/drones-in-the-oil-and-gas-industry/.
[7] Expediting Disaster Relief Services With Drone Technology. https://www.dronedeploy.com/blog/expediting-disaster-relief-services-with-drone-technology/.
[8] Guided by Lasers: How Autonomous Cars See. https://www.volvocars.com/uk/about/humanmade/
discover-volvo/lidar.
[9] Hard-Learned Lessons in Drone Photogrammetry. https://www.pobonline.com/articles/101581-hard-
learned-lessons-in-drone-photogrammetry.
[10] Hover. https://hover.to/.
[11] Hovermap. https://www.emesent.io/hovermap/.
[12] How Drones Aid in Disaster Response. https://www.precisionhawk.com/blog/how-drones-aid-in-
disaster-response.
[13] How to Use the LiDAR in iPhone 12 Pro. https://appleinsider.com/articles/21/03/02/
how-to-use-the-lidar-scanner-in-iphone-12-pro.
[14] Introducing Microsoft Mesh — Here can be Anywhere. https://www.microsoft.com/en-us/mesh.
[15] Kaarta. https://www.kaarta.com/products/contour/.
[16] Markets and markets: The drone service market. https://www.marketsandmarkets.com/
Market-Reports/drone-services-market-80726041.html.
[17] New Technology for Drone-based Emergency Response Missions.
https://cordis.europa.eu/article/id/415444-new-technology-for-drone-based-emergency-response-
missions.
[18] Troubleshooting Photogrammetry Issues. https://www.aerotas.com/troubleshooting-photogrammetry-issues.
[19] Welcome to meta — meta. https://about.facebook.com/meta/.
[20] Find minimum oriented bounding box of point cloud. http://codextechnicanum.blogspot.com/
2015/04/find-minimum-oriented-bounding-box-of.html, 2015.
[21] What it really costs to turn a car into a self-driving vehicle. https://qz.com/924212/
what-it-really-costs-to-turn-a-car-into-a-self-driving-vehicle/, 2017.
[22] Apple Is Rebuilding Maps From the Ground Up. https://techcrunch.com/2018/06/29/
apple-is-rebuilding-maps-from-the-ground-up/, 2018.
[23] Here Self-Healing Maps. https://go.engage.here.com/self-healing.html, 2018.
[24] State of Mobile Networks: USA - OpenSignal. https://opensignal.com/reports/2018/07/usa/
state-of-the-mobile-network, 2018.
[25] The Golden Age of HD Mapping for Autonomous Driving. https://medium.com/syncedreview/
the-golden-age-of-hd-mapping-for-autonomous-driving-b2a2ec4c11d, 2018.
[26] There’s No Google Maps for Self-Driving So This Startup Is Building It. https://www.
technologyreview.com/s/612202/theres-no-google-maps-for-self-driving-cars-so-this-startup-is-building-it/,
2018.
[27] Apple Maps Image Collection. https://maps.apple.com/imagecollection/, 2019.
[28] Baidu. https://www.baidu.com/, 2019.
[29] Carmera. https://www.carmera.com/fleets/, 2019.
[30] GM’s Hands-free Driving Feature to Work on 70,000 Additional Miles
of Highways This Year. https://www.theverge.com/2019/6/5/18653628/
gms-super-cruise-hands-free-driving-feature-highway-milage, 2019.
[31] HERE and Mobileye: Crowdsourced HD Mapping for Autonomous Cars. https://360.here.com/
2016/12/30/here-and-mobileye-crowd-sourced-hd-mapping-for-autonomous-cars/, 2019.
[32] Kuandeng. http://www.kuandeng.com/html/1/index.html, 2019.
[33] Lidar sensors for combined smart city and av ecosystem. https://www.traffictechnologytoday.
com/news/smart-cities/lidar-sensors-for-combined-smart-city-and-av-ecosystem.html,
2019.
[34] Lyft Level 5. https://level5.lyft.com/, 2019.
[35] NVIDIA Drive AGX. https://www.nvidia.com/en-us/self-driving-cars/drive-platform/
hardware/, 2019.
[36] The top five applications for lidar beyond the automotive market. https://enterpriseiotinsights.
com/20191113/channels/opinion/five-applications-lidar-beyond-automotive-reader-forum,
2019.
[37] Upgrading Uber’s 3D Fleet. https://medium.com/uber-design/
upgrading-ubers-3d-fleet-4662c3e1081, 2019.
[38] Victoria to install lidar at high-crash intersections. https://www.itnews.com.au/news/
victoria-to-install-lidar-at-high-crash-intersections-530967, 2019.
[39] Ericsson 5G. https://www.ericsson.com/en/5g, 2020.
[40] Exclusive: Huawei's lidar is on the car, the price may be as low as hundreds of dollars. https://ycnews.com/exclusive-huaweis-lidar-is-on-the-car-the-price-may-be-as-low-as-hundreds-of-dollars/, 2020.
[41] Introducing 5G Technology and Networks. https://www.thalesgroup.com/en/markets/
digital-identity-and-security/mobile/inspired/5G, 2020.
[42] iPhone and iPad LiDAR Spatial Tracking Capabilities: Second Test. - VGIS - Leading Augmented
Reality Solutions for BIM, GIS and 3D Scans, Dec 2020.
[43] Qualcomm 5G. https://www.qualcomm.com/invention/5g, 2020.
[44] What is Edge Computing. https://www.ibm.com/cloud/what-is-edge-computing, 2020.
[45] Automatically generated tls-programs. Traffic Lights - SUMO Documentation, 2021.
[46] Baidu apollo specifications. https://github.com/ApolloAuto/apollo/blob/master/docs/specs/
README.md, 2021.
[47] Lane, Nov 2021.
[48] Los angeles county home to state’s worst intersections: Report. Patch, 2021.
[49] Signal cycle lengths. National Association of City Transportation Officials, 2021.
[50] Statistics on intersection accidents. Federal Highway Administration, 2021.
[51] Traffic signal timing manual, 2021. Chapter 5 - Office of Operations.
[52] Fawad Ahmad, Hang Qiu, Ray Eells, Fan Bai, and Ramesh Govindan. Carmap: Fast 3d feature map updates for automobiles. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 1063–1081, 2020.
[53] Fawad Ahmad, Hang Qiu, Xiaochen Liu, Fan Bai, and Ramesh Govindan. QuickSketch: Building 3D Representations in Unknown Environments using Crowdsourcing. In 2018 21st International Conference on Information Fusion (Fusion), pages 2314–2321. IEEE, 2018.
[54] Nataraj Akkiraju, Herbert Edelsbrunner, Michael Facello, Ping Fu, E. P. Mucke, and Carlos Varela. Alpha shapes: definition and software. In Proceedings of the 1st International Computational Geometry Software Workshop, volume 63, page 66, 1995.
[55] Anton Andreychuk, Konstantin Yakovlev, Eli Boyarski, and Roni Stern. Improving continuous-time
conflict based search, 2021.
[56] Esther M. Arkin, S´ andor P. Fekete, and Joseph S.B. Mitchell. Approximation algorithms for lawn
mowing and milling. Computational Geometry, 17(1):25–50, 2000.
[57] Olivier Aycard. Intersection Safety Using Lidar and Stereo Vision Sensors on a Demonstrator Vehicle.
Transportation, (IV):863–869, 2011.
[58] T. Bailey and H. Durrant-Whyte. Simultaneous localization and mapping (slam): part ii. IEEE
Robotics Automation Magazine, 13(3):108–117, 2006.
[59] Alessandro Bazzi, Barbara M Masini, Alberto Zanella, and Ilaria Thibault. On the performance of
ieee 802.11 p and lte-v2v for the cooperative awareness of connected vehicles. IEEE Transactions on
Vehicular Technology, 66(11):10419–10432, 2017.
[60] Ben Bellekens, Vincent Spruyt, Rafael Berkvens, Rudi Penne, and Maarten Weyn. A benchmark survey of rigid 3d point cloud registration algorithms. Int. J. Adv. Intell. Syst, 8:118–127, 2015.
[61] Ali J Ben Ali, Zakieh Sadat Hashemifar, and Karthik Dantu. Edge-SLAM: Edge-assisted Visual Si-
multaneous Localization and Mapping. In Proceedings of the 18th International Conference on Mobile
Systems, Applications, and Services, pages 325–337, 2020.
[62] Jon Louis Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Commun.
ACM, 18(9):509–517, September 1975.
[63] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear
mot metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
[64] Julie Stephany Berrio, James Ward, Stewart Worrall, and Eduardo Nebot. Identifying Robust Land-
marks in Feature-based Maps. arXiv preprint arXiv:1809.09774, 2018.
[65] Adel Bibi, Tianzhu Zhang, and Bernard Ghanem. 3d part-based sparse tracker with automatic syn-
chronization and registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1439–1448, 2016.
[66] Liam Brown, Robert Clarke, Ali Akbari, Ujjar Bhandari, Sara Bernardini, Puneet Chhabra, Ognjen
Marjanovic, Thomas Richardson, and Simon Watson. The Design of Prometheus: A Reconfigurable
UAV for Subterranean Mine Inspection. Robotics, 9(4):95, 2020.
[67] M. Bryson and S. Sukkarieh. Observability analysis and active control for airborne slam. IEEE Trans-
actions on Aerospace and Electronic Systems, 44(1):261–280, 2008.
[68] E. Bylow, R. Maier, F. Kahl, and C. Olsson. Combining depth fusion and photometric stereo for
fine-detailed 3D models. In Scandinavian Conference on Image Analysis (SCIA), Norrk¨ oping, Sweden,
June 2019.
[69] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, Jose Neira, Ian Reid,
and John J. Leonard. Past, Present, and Future of Simultaneous Localization and Mapping: Toward
the Robust-perception Age. Trans. Rob., 32(6):1309–1332, December 2016.
[70] CarLA. Carla autonomous driving challenge, 2019.
[71] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet: Learning to Predict Intention from Raw
Sensor Data. In Aude Billard, Anca Dragan, Jan Peters, and Jun Morimoto, editors, Proceedings of
The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages
947–956. PMLR, 2018.
[72] C.e.h. Average car length - list of car lengths in details - a new way forward: Automotive and home
advice & review, Nov 2020.
[73] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-
decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV, 2018.
[74] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network
for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), July 2017.
[75] Yang Chen and G´ erard G. Medioni. Object modeling by registration of multiple range images. In
Proceedings of the 1991 IEEE International Conference on Robotics and Automation, Sacramento, CA,
USA, 9-11 April 1991, pages 2724–2729, 1991.
[76] Zhe Chen, Jing Zhang, and Dacheng Tao. Progressive lidar adaptation for road detection. IEEE/CAA
Journal of Automatica Sinica, 6(3):693–702, 2019.
[77] Kevin Christensen, Christoph Mertz, Padmanabhan Pillai, Martial Hebert, and Mahadev Satya-
narayanan. Towards a Distraction-free Waze. In Proceedings of the 20th International Workshop on
Mobile Computing Systems and Applications, pages 15–20. ACM, 2019.
[78] Titus Cieslewski, Siddharth Choudhary, and Davide Scaramuzza. Data-efficient Decentralized Visual
Slam. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2466–2473.
IEEE, 2018.
[79] Travelers Risk Control. 3-second rule for safe following distance.
[80] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benen-
son, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene
Understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 3213–3223, 2016.
[81] Alexander Cunningham, Manohar Paluri, and Frank Dellaert. Ddf-sam: Fully Distributed Slam using
Constrained Factor Graphs. In 2010 IEEE/RSJ International Conference on Intelligent Robots and
Systems, pages 3025–3030. IEEE, 2010.
[82] S. Daftry, C. Hoppe, and H. Bischof. Building with drones: Accurate 3d facade reconstruction using
mavs. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 3487–3494,
2015.
[83] Reza Dariani and Julian Schindler. Cooperative strategical decision and trajectory planning for auto-
mated vehicle in urban areas. In 2019 IEEE International Conference on Vehicular Electronics and
Safety (ICVES), pages 1–6. IEEE, 2019.
[84] Konstantinos G Derpanis. Overview of the ransac algorithm. Image Rochester NY, 4(1):2–3, 2010.
[85] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised Interest
Point Detection and Description. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, pages 224–236, 2018.
[86] P. Deutsch. RFC1952: GZIP File Format Specification Version 4.3, 1996.
[87] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An
Open Urban Driving Simulator. arXiv preprint arXiv:1711.03938, 2017.
[88] H. Durrant-Whyte and T. Bailey. Simultaneous localization and mapping: part i. IEEE Robotics
Automation Magazine, 13(2):99–110, 2006.
[89] Jakob Engel, Thomas Sch¨ ops, and Daniel Cremers. LSD-SLAM: Large-scale Direct Monocular Slam.
In European conference on computer vision, pages 834–849. Springer, 2014.
[90] Jakob Engel, J¨ org St¨ uckler, and Daniel Cremers. Large-scale direct slam with stereo cameras. In
2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1935–1942.
IEEE, 2015.
[91] Martin Ester, Hans-Peter Kriegel, J¨ org Sander, Xiaowei Xu, et al. A density-based algorithm for
discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.
[92] A. Federman, M. Quintero, S. Kretz, J. Gregg, M. Lengies, C. Ouimet, and Jeremy Laliberte. Uav
photogrammetric workflows: A best practice guideline. ISPRS - International Archives of the Pho-
togrammetry, Remote Sensing and Spatial Information Sciences, XLII-2/W5:237–244, 08 2017.
[93] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting
with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June
1981.
[94] Christian Forster, Simon Lynen, Laurent Kneip, and Davide Scaramuzza. Collaborative Monocular
Slam with Multiple Micro Aerial Vehicles. In 2013 IEEE/RSJ International Conference on Intelligent
Robots and Systems, pages 3962–3970. IEEE, 2013.
[95] Christian Forster, Zichao Zhang, Michael Gassner, Manuel Werlberger, and Davide Scaramuzza. Svo:
Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics,
33(2):249–265, 2016.
[96] Yasutaka Furukawa and Carlos Hern´ andez. Multi-view stereo: A tutorial. Found. Trends Comput.
Graph. Vis., 9(1-2):1–148, 2015.
[97] Y. Gabriely and E. Rimon. Spanning-tree based coverage of continuous areas by a mobile robot. In
Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation, volume 2, pages
1927–1933 vol.2, 2001.
[98] Enric Galceran and Marc Carreras. A survey on coverage path planning for robotics. Robotics and
Autonomous Systems, 61(12):1258–1276, 2013.
[99] Ming Gao, Xinlei Wang, Kui Wu, Andre Pradhana, Eftychios Sifakis, Cem Yuksel, and Chenfanfu Jiang. GPU optimization of material point methods. ACM Transactions on Graphics (TOG), 37(6):1–12, 2018.
[100] Vincent Garcia, Eric Debreuve, and Michel Barlaud. Fast k nearest neighbor search using gpu, 2008.
[101] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti
dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[102] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are We Ready for Autonomous Driving? the Kitti
Vision Benchmark Suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition,
pages 3354–3361. IEEE, 2012.
[103] Silvio Giancola, Jesus Zarzar, and Bernard Ghanem. Leveraging shape completion for 3d siamese
tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 1359–1368, 2019.
[104] Google. Google maps. https://www.google.com/maps.
[105] Swaminathan Gopalswamy and Sivakumar Rathinam. Infrastructure enabled autonomy: A distributed
intelligence architecture for autonomous vehicles. In 2018 IEEE Intelligent Vehicles Symposium, IV
2018, Changshu, Suzhou, China, June 26-30, 2018, pages 986–992, 2018.
[106] Michael Grupp. evo: Python package for the evaluation of odometry and SLAM. https://github.
com/MichaelGrupp/evo, 2017.
[107] Jacopo Guanetti, Yeojun Kim, and Francesco Borrelli. Control of connected and automated vehicles:
State of the art and future challenges. Annual Reviews in Control, 45:18–40, 2018.
[108] Songtao He, Favyen Bastani, Arjun Balasingam, Karthik Gopalakrishnan, Ziwen Jiang, Mohammad
Alizadeh, Hari Balakrishnan, Michael J Cafarella, Tim Kraska, and Sam Madden. Beecluster: drone
orchestration via predictive optimization. In MobiSys, pages 299–311, 2020.
[109] Yuze He, Li Ma, Zhehao Jiang, Yi Tang, and Guoliang Xing. Vi-eye: Semantic-based 3d point cloud
registration for infrastructure-assisted autonomous driving. In Proceedings of the 27th Annual Interna-
tional Conference on Mobile Computing and Networking, MobiCom ’21, page 573–586, New York, NY,
USA, 2021. Association for Computing Machinery.
[110] Here. The Self-healing Map From Here. https://go.engage.here.com/self-healing.html, 2019.
[111] Wolfgang Hess, Damon Kohler, Holger Rapp, and Daniel Andor. Real-Time Loop Closure in 2D LIDAR SLAM. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1271–1278, 2016.
[112] Aharon Bar Hillel, Ronen Lerner, Dan Levi, and Guy Raz. Recent progress in road and lane detection:
a survey. Machine vision and applications, 25(3):727–745, 2014.
[113] Mehdi Hosseinzadeh, Yasir Latif, and Ian Reid. Sparse Point-plane Slam. In Australasian Conference
on Robotics and Automation 2017 (ACRA 2017).
[114] Hou-Ning Hu, Qi-Zhi Cai, Dequan Wang, Ji Lin, Min Sun, Philipp Krahenbuhl, Trevor Darrell, and
Fisher Yu. Joint monocular 3d vehicle detection and tracking. In Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV), October 2019.
[115] S. Jiang, N. Y. Chang, C. Wu, C. Wu, and K. Song. Error analysis and experiments of 3d reconstruction using a rgb-d sensor. In 2014 IEEE International Conference on Automation Science and Engineering (CASE), pages 1020–1025, 2014.
[116] Yurong Jiang, Hang Qiu, Matthew McCartney, Gaurav Sukhatme, Marco Gruteser, Fan Bai, Donald
Grimm, and Ramesh Govindan. Carloc: Precise positioning of automobiles. In Proceedings of the 13th
ACM Conference on Embedded Networked Sensor Systems, SenSys ’15, page 253–265, New York, NY,
USA, 2015. Association for Computing Machinery.
[117] Masaya Kaneko, Kazuya Iwami, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Mask-SLAM: Robust Feature-based Monocular SLAM by Masking using Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 258–266, 2018.
[118] Alireza G Kashani, Michael J Olsen, Christopher E Parrish, and Nicholas Wilson. A review of lidar
radiometric processing: From ad hoc intensity correction to rigorous radiometric calibration. Sensors,
15(11):28099–28128, 2015.
[119] Pooja Kavathekar and YangQuan Chen. Vehicle platooning: A brief survey and categorization. In
ASME 2011 International Design Engineering Technical Conferences and Computers and Information
in Engineering Conference, pages 829–845. American Society of Mechanical Engineers Digital Collec-
tion, 2011.
[120] Mohammad Khayatian, Mohammadreza Mehrabian, Edward Andert, Rachel Dedinsky, Sarthake
Choudhary, Yingyan Lou, and Aviral Shirvastava. A survey on intersection management of connected
autonomous vehicles. ACM Trans. Cyber-Phys. Syst., 4(4), August 2020.
[121] B Ravi Kiran, Luis Roldao, Be˜ nat Irastorza, Renzo Verastegui, Sebastian S¨ uss, Senthil Yogamani,
Victor Talpaert, Alexandre Lepoutre, and Guillaume Trehard. Real-time Dynamic Object Detection
for Autonomous Driving using Prior 3d-maps. In European Conference on Computer Vision, pages
567–582. Springer, 2018.
[122] Henrik Kretzschmar, Giorgio Grisetti, and Cyrill Stachniss. Lifelong Map Learning for Graph-based
SLAM in Static Environments. KI, 24:199–206, 09 2010.
[123] D. Lapandic, J. Velagic, and H. Balta. Framework for automated reconstruction of 3d model from
multiple 2d aerial images. In 2017 International Symposium ELMAR, pages 173–176, 2017.
[124] HyunJong Lee, Jason Flinn, and Basavaraj Tonshal. Raven: Improving Interactive Latency for the
Connected Car. In Proceedings of the 24th Annual International Conference on Mobile Computing and
Networking, pages 557–572. ACM, 2018.
[125] Bo Li. 3d fully convolutional network for vehicle detection in point cloud. In 2017 IEEE/RSJ Interna-
tional Conference on Intelligent Robots and Systems (IROS), pages 1513–1518. IEEE, 2017.
[126] Hui Li, Cheng Zhong, Xiaoguang Hu, Long Xiao, and Xianfeng Huang. New methodologies for precise
building boundary extraction from lidar data and high resolution image. Sensor Review, 2013.
[127] J. Li, W. Gao, and Y. Wu. High-quality 3d reconstruction with depth super-resolution and completion.
IEEE Access, 7:19370–19381, 2019.
[128] Shiqi Li, Chi Xu, and Ming Xie. A Robust O (n) Solution to the Perspective-n-point Problem. IEEE
transactions on pattern analysis and machine intelligence, 34(7):1444–1450, 2012.
[129] Z. Li, P. C. Gogia, and M. Kaess. Dense surface reconstruction from monocular vision and lidar. In
2019 International Conference on Robotics and Automation (ICRA), pages 6905–6911, 2019.
[130] Shih-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md. Enamul Haque, Lingjia Tang, and
Jason Mars. The architectural implications of autonomous driving: Constraints and acceleration. In
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018, pages
751–766, 2018.
[131] Y.-C. Lin, Y.-T. Cheng, T. Zhou, R. Ravi, S.M. Hasheminasab, J.E. Flatt, C. Troy, and A. Habib.
Evaluation of UAV LiDAR for Mapping Coastal Environments. Remote Sensing, 2019.
[132] Yonggen Ling and Shaojie Shen. Building Maps for Autonomous Navigation using Sparse Visual Slam
Features. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on,
pages 1374–1381. IEEE, 2017.
[133] T. Litman. Autonomous vehicle implementation predictions: Implications for transport planning, 2015.
[134] Luyang Liu, Hongyu Li, and Marco Gruteser. Edge Assisted Real-time Object Detection for Mobile
Augmented Reality. In Proceedings of the 25th Annual International Conference on Mobile Computing
and Networking. ACM, 2019.
[135] Xiaochen Liu, Suman Nath, and Ramesh Govindan. Gnome: A Practical Approach to NLOS Mitigation
for GPS Positioning in Smartphones. In Proceedings of the 16th Annual International Conference on
Mobile Systems, Applications, and Services, MobiSys ’18, page 163–177, New York, NY, USA, 2018.
Association for Computing Machinery.
[136] Ye Liu, Xiao-Yuan Jing, Jianhui Nie, Hao Gao, Jun Liu, and Guo-Ping Jiang. Context-aware three-
dimensional mean-shift with occlusion handling for robust object tracking in rgb-d videos. IEEE Trans-
actions on Multimedia, 21(3):664–677, 2018.
[137] Jean Li´ enard, Andre Vogs, Demetrios Gatziolis, and Nikolay Strigul. Embedded, real-time uav control
for improved, image-based 3d scene reconstruction. Measurement, 81:264 – 269, 2016.
[138] Gurobi Optimization LLC. Gurobi optimizer reference manual. http://www.gurobi.com, 2020.
[139] Gregor Luetzenburg, Aart Kroon, and Anders A Bjørk. Evaluation of the Apple iPhone 12 Pro LiDAR
for an Application in Geosciences. Scientific reports , 11(1):1–9, 2021.
[140] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and Furious: Real Time End-to-End 3D Detec-
tion, Tracking and Motion Forecasting With a Single Convolutional Net. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), jun 2018.
[141] Sivabalan Manivasagam, Shenlong Wang, Kelvin Wong, Wenyuan Zeng, Mikita Sazanovich, Shuhan
Tan, Bin Yang, Wei-Chiu Ma, and Raquel Urtasun. Lidarsim: Realistic lidar simulation by leverag-
ing the real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020.
[142] Keita Miura, Shota Tokunaga, Noriyuki Ota, Yoshiharu Tange, and Takuya Azumi. Autoware toolbox:
Matlab/simulink benchmark suite for ros-based self-driving software platform. In Proceedings of the
30th International Workshop on Rapid System Prototyping (RSP’19), RSP ’19, page 8–14, New York,
NY, USA, 2019. Association for Computing Machinery.
[143] Jalil Modares, Farshad Ghanei, Nicholas Mastronarde, and Karthik Dantu. Ub-anc planner: Energy
efficient coverage path planning with multiple drones. In 2017 IEEE International Conference on
Robotics and Automation (ICRA), pages 6182–6189, 2017.
[144] Jalil Modares, Farshad Ghanei, Nicholas Mastronarde, and Karthik Dantu. Ub-anc planner: Energy
efficientcoveragepathplanningwithmultipledrones. In 2017IEEEinternationalconferenceonrobotics
and automation (ICRA), pages 6182–6189. IEEE, 2017.
[145] Trevor Mogg. Austrian airlines is flying a drone around its planes
for a good reason. https://www.digitaltrends.com/cool-tech/
austrian-airlines-is-flying-a-drone-around-its-planes-for-a-good-reason/.
[146] Christian Mostegel, Markus Rumpler, Friedrich Fraundorfer, and Horst Bischof. UAV-based Au-
tonomous Image Acquisition with Multi-view Stereo Quality Assurance by Confidence Prediction. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages
1–10, 2016.
[147] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate
monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.
[148] Raul Mur-Artal and Juan D Tard´ os. ORB-SLAM2: An Open-source Slam System for Monocular,
Stereo, and Rgb-d Cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[149] Raul Mur-Artal and Juan D Tard´ os. Orb-slam2: An open-source slam system for monocular, stereo,
and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[150] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[151] Wassim G Najm, Raja Ranganathan, Gowrishankar Srinivasan, John D Smith, Samuel Toma, Elizabeth
Swanson, August Burgett, et al. Description of light-vehicle pre-crash scenarios for safety applications
based on vehicle-to-vehicle communications. Technical report, United States. National Highway Traffic
Safety Administration, 2013.
[152] V. S. Nguyen, T. H. Trinh, and M. H. Tran. Hole boundary detection of a surface of 3d point clouds.
In 2015 International Conference on Advanced Computing and Applications (ACOMP), pages 124–129,
2015.
[153] Society of Automotive Engineers International. Automated Driving Levels of Driving Automation Are
Defined in New SAE International Standard J3016. (2014), 2014.
[154] Teddy Ort, Liam Paull, and Daniela Rus. Autonomous Vehicle Navigation in Rural Environments
Without Detailed Prior Maps. In 2018 IEEE International Conference on Robotics and Automation
(ICRA), pages 2040–2047. IEEE, 2018.
[155] Ouster. Ouster os1-64 user guide. https://data.ouster.io/downloads/software-user-guide-v1.
13.0.pdf. [Online; accessed 03-31-2020].
[156] Ouster. Ouster LiDAR. https://ouster.com/, 2020.
[157] Brian Paden, Michal Cap, Sze Zheng Yong, Dmitry Yershov, and Emilio Frazzoli. A survey of mo-
tion planning and control techniques for self-driving urban vehicles. IEEE Transactions on intelligent
vehicles, 1(1):33–55, 2016.
[158] Gábor Pataki. Teaching integer programming formulations using the traveling salesman problem. SIAM
review, 45(1):116–123, 2003.
[159] Scott Drew Pendleton, Hans Andersen, Xinxin Du, Xiaotong Shen, Malika Meghjani, You Hong Eng,
Daniela Rus, and Marcelo H Ang. Perception, planning, control, and coordination for autonomous
vehicles. Machines, 5(1):6, 2017.
[160] Mike Phillips and Maxim Likhachev. Sipp: Safe interval path planning for dynamic environments. In
2011 IEEE International Conference on Robotics and Automation, pages 5628–5635. IEEE, 2011.
[161] Taih´ u Pire, Thomas Fischer, Gast´ on Castro, Pablo De Crist´ oforis, Javier Civera, and Julio Jacobo
Berlles. S-ptam: Stereo Parallel Tracking and Mapping. Robotics and Autonomous Systems, 93:27–42,
2017.
[162] Haozhe Qi, Chen Feng, Zhiguo Cao, Feng Zhao, and Yang Xiao. P2b: Point-to-box network for 3d
object tracking in point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), June 2020.
[163] Hang Qiu, Fawad Ahmad, Fan Bai, Marco Gruteser, and Ramesh Govindan. AVR: Augmented Vehic-
ular Reality. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applica-
tions, and Services (Mobisys), MobiSys ’18, pages 81–95, Munich, Germany, 2018. ACM.
[164] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler,
and Andrew Y Ng. Ros: an open-source robot operating system. In ICRA workshop on open source
software, volume 3, page 5. Kobe, Japan, 2009.
[165] Anandakumar M Ramiya, Rama Rao Nidamanuri, and Ramakrishan Krishnan. Segmentation based
buildingdetectionapproachfromlidarpointcloud. The Egyptian Journal of Remote Sensing and Space
Science, 20(1):71–77, 2017.
[166] Deepika Ravipati, Kenny Chour, Abhishek Nayak, Tyler Marr, Sheelabhadra Dey, Alvika Gautam,
Sivakumar Rathinam, and Swaminathan Gopalswamy. Vision based localization for infrastructure
enabledautonomy. In2019IEEEIntelligentTransportationSystemsConference, ITSC2019, Auckland,
New Zealand, October 27-30, 2019, pages 1638–1643, 2019.
[167] Grand View Research. Drone data services market size and share. https://www.grandviewresearch.
com/industry-analysis/drone-data-services-market.
[168] Reuters. Self-driving costs could drop 90 percent by 2025, Del-
phi CEO says. https://www.reuters.com/article/us-autos-delphi/
self-driving-costs-could-drop-90-percent-by-2025-delphi-ceo-says-idUSKBN1DY2AC,
December 2017.
[169] Luis Riazuelo, Javier Civera, and JM Mart´ ınez Montiel. C2tam: A Cloud Framework for Cooperative
Tracking and Mapping. Robotics and Autonomous Systems, 62(4):401–413, 2014.
[170] David M Rosen, Julian Mason, and John J Leonard. Towards Lifelong Feature-based Mapping in Semi-
static Environments. In 2016 IEEE International Conference on Robotics and Automation (ICRA),
pages 1063–1070. IEEE, 2016.
[171] Antoni Rosinol, Marcus Abate, Yun Chang, and Luca Carlone. Kimera: an open-source library for real-
time metric-semantic localization and mapping. In 2020 IEEE International Conference on Robotics
and Automation (ICRA), pages 1689–1696. IEEE, 2020.
[172] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An Efficient Alternative to
Sift or Surf. 2011.
[173] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (FPFH) for 3d
registration. In 2009 IEEE International Conference on Robotics and Automation, ICRA 2009, Kobe,
Japan, May 12-17, 2009, pages 3212–3217, 2009.
[174] Julian Schindler, Reza Dariani, Michele Rondinone, and Thomas Walter. Dynamic and flexible pla-
tooning in urban areas. 2018.
[175] Boris Schling. The Boost C++ Libraries. XML Press, 2011.
[176] Patrik Schmuck and Margarita Chli. Multi-uav Collaborative Monocular Slam. In 2017 IEEE Inter-
national Conference on Robotics and Automation (ICRA), pages 3863–3870. IEEE, 2017.
[177] Ruwen Schnabel and Reinhard Klein. Octree-based point-cloud compression. pages 111–120, 01 2006.
[178] Johannes Lutz Sch¨ onberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view
selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV),
2016.
[179] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and
evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE computer society conference
on computer vision and pattern recognition (CVPR’06), volume 1, pages 519–528. IEEE, 2006.
[180] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and
physical simulation for autonomous vehicles, 2017.
[181] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. On a formal model of safe and scalable
self-driving cars, 2018.
[182] Guni Sharon, Roni Stern, Ariel Felner, and Nathan R Sturtevant. Conflict-based search for optimal
multi-agent pathfinding. Artificial Intelligence , 219:40–66, 2015.
[183] J. E. Siegel, D. C. Erb, and S. E. Sarma. A survey of the connected vehicle landscape—architectures,
enabling technologies, applications, and development areas. IEEE Transactions on Intelligent Trans-
portation Systems, 19(8):2391–2406, 2018.
[184] Shuran Song and Jianxiong Xiao. Tracking revisited using rgbd camera: Unified benchmark and
baselines. In Proceedings of the IEEE international conference on computer vision, pages 233–240,
2013.
[185] Stanford Artificial Intelligence Laboratory et al. Robotic Operating System.
[186] Lu Sun, Junqiao Zhao, Xudong He, and Chen Ye. DLO: Direct Lidar Odometry for 2.5d Outdoor
Environment. 2018 IEEE Intelligent Vehicles Symposium (IV), Jun 2018.
[187] Synced. The Golden Age of HD Mapping for Autonomous Driving. https://medium.
com/syncedreview/the-golden-age-of-hd-mapping-for-autonomous-driving-b2a2ec4c11d, Au-
gust 2018.
[188] Remi Tachet, Paolo Santi, Stanislav Sobolevsky, Luis Reyes-Castro, Emilio Frazzoli, Dirk Helbing, and
Carlo Ratti. Revisiting street intersections using slot-based systems. PLOS ONE, 11:e0149607, 03
2016.
[189] Andrew Tarko, Kartik Ariyur, Mario Romero, Vamsi Bandaru, and Cristhian Lizarazo. T-Scan: Sta-
tionary LiDAR for Traffic and Safety Applications—Vehicle Detection and Tracking . Number March.
2017.
[190] L. Teixeira and M. Chli. Real-time local 3d reconstruction for aerial inspection using superpixel expan-
sion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 4560–4567,
2017.
[191] Ruslan Timchenko. iPhone’s 12 Pro LiDAR: How to get and Interpret Data, Aug 2021.
[192] P. Tokekar, J. Vander Hook, D. Mulla, and V. Isler. Sensor planning for a symbiotic uav and ugv
system for precision agriculture. In 2013 IEEE/RSJ International Conference on Intelligent Robots
and Systems, pages 5321–5326, 2013.
[193] Digital Trends. A Self-Driving Car in Every Driveway? Solid-State Lidar is the Key. https://www.
digitaltrends.com/cars/solid-state-lidar-for-self-driving-cars/, 2018.
[194] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle Adjustment—a
Modern Synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999.
[195] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns.
IEEE Computer Architecture Letters, 13(04):376–380, 1991.
[196] Victor Vaquero, Ivan del Pino, Francese Moreno-Noguer, Joan Sola, Alberto Sanfeliu, and Juan
Andrade-Cetto. Deconvolutional networks for point-cloud vehicle detection and tracking in driving
scenarios. In 2017 European Conference on Mobile Robots (ECMR), pages 1–7. IEEE, 2017.
[197] Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In
Robotics: Science and Systems, volume 1, pages 10–15607. Rome, Italy, 2015.
[198] Rui Wang, Martin Schworer, and Daniel Cremers. Stereo dso: Large-scale direct sparse visual odometry
with stereo cameras. In Proceedings of the IEEE International Conference on Computer Vision, pages
3903–3911, 2017.
[199] Waymo. Building Maps for a Self-driving Car. https://medium.com/waymo/
building-maps-for-a-self-driving-car-723b4d9cd3f4, 2016.
[200] Whoenig. Library with Search Algorithms for Task and Path Planning for Multi Robot/Agent Systems.
[201] Bin Wu, Bailang Yu, Qiusheng Wu, Shenjun Yao, Feng Zhao, Weiqing Mao, and Jianping Wu. A
graph-based approach for 3d building model reconstruction from airborne lidar point clouds. Remote
Sensing, 9:92, 01 2017.
[202] J Wu, H Xu, and J Zheng. Automatic background filtering and lane identification with roadside LiDAR data. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 1–6, Oct 2017.
[203] Hongfei Xue, Yan Ju, Chenglin Miao, Yijiang Wang, Shiyang Wang, Aidong Zhang, and Lu Su.
mmmesh: towards 3d real-time dynamic human mesh construction using millimeter-wave. In Pro-
ceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services,
pages 269–282, 2021.
[204] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds.
In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7652–7660,
2018.
139
[205] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Std: Sparse-to-dense 3d object
detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV), October 2019.
[206] Rui Yue, Hao Xu, Jianqing Wu, Renjuan Sun, and Changwei Yuan. Data registration with ground
points for roadside LiDAR sensors. Remote Sensing, 11(11), 2019.
[207] Ji Zhang and Sanjiv Singh. LOAM: Lidar Odometry and Mapping in Real-time. In Robotics: Science
and Systems, volume 2, page 9, 2014.
[208] Ji Zhang and Sanjiv Singh. Loam: Lidar odometry and mapping in real-time. In Robotics: Science and
Systems, volume 2, 2014.
[209] Ji Zhang and Sanjiv Singh. Visual-lidar Odometry and Mapping: Low-drift, Robust, and Fast. In 2015
IEEE International Conference on Robotics and Automation (ICRA), pages 2174–2181. IEEE, 2015.
[210] Xumiao Zhang, Anlan Zhang, Jiachen Sun, Xiao Zhu, Y Ethan Guo, Feng Qian, and Z Morley Mao.
Emp: edge-assisted multi-vehicle perception. In Proceedings of the 27th Annual International Confer-
ence on Mobile Computing and Networking, pages 545–558, 2021.
[211] Junxuan Zhao, Hao Xu, Hongchao Liu, Jianqing Wu, Yichen Zheng, and Dayong Wu. Detection and
tracking of pedestrians and vehicles using roadside LiDAR sensors. Transportation Research Part C:
Emerging Technologies, 100(December 2017):68–87, 2019.
[212] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2018.
140