Robust Loop Closures for Multi-Robot SLAM in Unstructured Environments
by
Elizabeth R. Boroson
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2022
Copyright 2022 Elizabeth R. Boroson
Acknowledgements
Completing this thesis has been a long journey, and I would like to thank the many people who
have helped and supported me along the way.
First, thank you to my advisor, Nora Ayanian, for welcoming me into the ACT Lab and sup-
porting me through my Ph.D. You provided the balance I needed of flexibility to pursue my own
interests and guidance to make progress and see that work through.
Thank you to the rest of my committee: to Gaurav Sukhatme, for stepping in as my committee
chair and asking insightful questions, and to Stefanos Nikolaidis and Ketan Savla, for the helpful
feedback as I developed my ideas into this thesis.
Thank you to all of the members of the ACT Lab that I’ve had the pleasure to interact with
over my time there: Wolfgang Hönig, Jingyao Ren, Eric Ewing, Baskın Şenbaşlar, Connie Zhang,
Kegan Strawn, Chotiwat Chawannakul, Mark Debord, Laura Lytle, Aditya Barve, Nanda Kishore,
and probably others who I’ve missed. I’ve enjoyed interesting and insightful conversations with
all of you about work and non-work topics, and you’ve helped me to grow as a researcher and as a
person.
Thank you to all the students I’ve had the pleasure to mentor and teach, both one-on-one and
as a TA, including Jack Zeiders, Adelina Alegria, Rodolfo Edeza, Ceasar Navarro, Qiaoyi Yin,
and Eric Chen. They say that the best way to learn something is to teach it, and your questions
constantly forced me to reconsider my work and learn to present my thoughts in new ways.
Thank you to my NSTRF collaborator, Jean-Pierre de la Croix, for helping me to get involved
at JPL and direct my research interests toward real projects by joining the Autonomous PUFFER
team. It’s been wonderful having you as a mentor, and I look forward to continuing to work with
you as a colleague.
Thank you to everyone else I’ve had the opportunity to interact with at JPL. I’d especially
like to thank Rob Hewitt, who was an invaluable source of ideas, brainstorming, sanity checking,
debugging, and testing help on Autonomous PUFFER. I’d also like to thank Mike Wolf and Amir
Rahmani, former and current managers of the Maritime and Multi-Agent Autonomy group, for
welcoming me as part of the group. And of course, thank you to everyone I worked closely
and interacted with: Federico Rossi, Josh Vander Hook, Maira Saboia da Silva, Changrak Choi,
Saptarshi Bandyopadhyay, Gary Doran, and many others.
Thank you to my parents and siblings, who provided an environment where I was encouraged
to read, learn, and follow my interests. I am grateful for your support in everything I do.
Thank you to my husband and partner, Brian. You’ve been an amazing source of support for
many years, through so many changes in our lives. You’ve been there for me every day, whether I
needed you to talk through an idea, help me relax when I was stressed, or just make me laugh. In
the last few months, you’ve kept our household running through sleepless nights, family illnesses,
daycare closures, unplanned trips, and so much more. I truly couldn’t have done this without you.
Finally, thank you to my son, Samuel. You haven’t made the process of writing this thesis
easier, but you’ve made it worth it. As you grow up, I can’t wait to show you the things I love
working on and see where your interests take you.
I have been fortunate to receive funding for my Ph.D. that allowed me to spend my time an-
swering interesting questions and applying those solutions to real problems. The first two years,
during which I completed parts of Chapter 3, were funded by a Viterbi Fellowship from USC.
The remaining time, during which I completed the rest of Chapter 3 and all of the other chap-
ters, and also my visits to JPL, was supported by a NASA Space Technology Research Fellow-
ship (NSTRF). The work on Autonomous PUFFER (Chapters 5 and 6) was carried out at the Jet
Propulsion Laboratory, California Institute of Technology, and was sponsored by NSTRF and the
National Aeronautics and Space Administration (80NM0018D0004).
Table of Contents

Acknowledgements
Abstract

Chapter 1: Introduction

Chapter 2: Background and Literature Review
  2.1 Multi-Robot SLAM System Overview
    2.1.1 Visual (and other) Odometry
    2.1.2 Single-Robot Loop Closure
    2.1.3 Inter-Robot Loop Closure
    2.1.4 Optimization
    2.1.5 Full Multi-Robot SLAM Systems
  2.2 Image Matching with Features
  2.3 Absolute Localization

Chapter 3: Finding Loop Closures Between Robots with Heterogeneous Sensors
  3.1 Introduction
  3.2 3D Keypoint Detection
    3.2.1 Extensions from Image Keypoints
    3.2.2 Keypoints from Surface Curvature
    3.2.3 Range Image Keypoints
    3.2.4 Keypoints we did not evaluate
  3.3 Related Work
  3.4 Methods
  3.5 Results
    3.5.1 Scans from same viewpoint
    3.5.2 Scans from different viewpoints
  3.6 Conclusion and Future Work

Chapter 4: Loop Closures between Robots with Different Perspectives
  4.1 Introduction
  4.2 Related Work
  4.3 Problem Formulation
  4.4 Algorithm
    4.4.1 Rock Matching as a Maximum Weighted Clique Problem
    4.4.2 IQP Optimization
    4.4.3 ILP Optimization
    4.4.4 Using Matched Rocks for Loop Closures
  4.5 Experiments and Results
    4.5.1 Simulation
    4.5.2 Comparison to Other Methods
    4.5.3 Dataset and Preprocessing
    4.5.4 Matching Procedure and Results
  4.6 Conclusion and Future Work

Chapter 5: Loop Closures with Range Sensors for Small Robots
  5.1 Introduction
  5.2 Related Work
    5.2.1 Pose Graph Optimization
    5.2.2 Range Sensors
  5.3 Algorithm
    5.3.1 PUFFER Localization Subsystem
    5.3.2 Pose Graph Optimization with Range Edges
  5.4 Experiments and Discussion
    5.4.1 Range Sensors
    5.4.2 Experiment Setup
    5.4.3 Two-Robot Experiment
    5.4.4 Three-Robot Experiment
    5.4.5 Base Station Experiment
  5.5 Conclusion and Future Work

Chapter 6: PUFFER: A Semi-Online Multi-Robot SLAM System for Planetary Exploration
  6.1 Introduction
  6.2 PUFFER System
    6.2.1 PUFFER System Hardware
    6.2.2 Other Software Subsystems
  6.3 Real-Time Visual-Inertial Odometry System
    6.3.1 ROVIOLI Base
    6.3.2 Wheel Odometry Updates
    6.3.3 Pose Initialization
  6.4 Multi-Robot SLAM Architecture
    6.4.1 Maplab Base
    6.4.2 Wrappers for Real-Time Multi-Robot Performance
    6.4.3 Integration with MapDB
  6.5 Experiments and Results
    6.5.1 Single-Robot System
    6.5.2 Full Multi-Robot System
  6.6 Conclusion and Future Work

Chapter 7: Conclusion and Future Work

References
Abstract
A key capability for a team of robots operating together in an unknown environment is building
and sharing maps. As each robot explores, it must be able to build its own local map and use
it for navigation. To take advantage of the benefits of working in a team, the robots should also
be able to share and merge those maps. Merging these local maps into a global map requires
identification of loop closures, or places where the maps overlap. However, tasks in unstructured
environments, such as planetary exploration, are not well-suited to traditional visual loop closure
methods like scene or object detection. These tasks may involve robots with unusual sensors,
the robots may not observe the same areas, and the environment may not allow for identification
of standard visual features, which all make it challenging to identify loop closures. The team
may also be heterogeneous, so there may be differences in how and where the robots make their
observations.
This thesis addresses the challenge of identifying robust loop closures in spite of these limi-
tations. It includes several methods that successfully find inter-robot loop closures in challenging
unstructured environments, including a method using heterogeneous sensors, a method for robots
that view the world from different perspectives, and a method with ranging sensors for scenarios
where robots’ trajectories do not overlap. It also discusses the Autonomous PUFFER multi-robot
SLAM system, a semi-real time system developed for a team of robots operating autonomously in a
planetary exploration environment. Finally, it discusses how these techniques provide a framework
for future multi-robot mapping in unstructured environments. The maps and systems developed
will need to accurately model the environment while also supporting diverse robots and teams.
Chapter 1
Introduction
Robots are one of the most important tools available for operating in environments that are remote,
dangerous for humans, or otherwise challenging. Robots are capable of exploring distant places,
such as other planets [89] or the bottom of the ocean [93], eliminating the need to provide life
support and round trip capabilities. They perform work in dangerous areas, such as collapsed
buildings after a disaster [88], without any danger of getting lost or risking human lives if the roof
caves in and traps them. They are rapidly becoming critical tools for fighting fires [74], surveying
uninhabited areas [136], and performing ocean rescues [17]. Recently, robots have been used
successfully for mapping and cleanup in the aftermath of the Fukushima reactor meltdown [6].
The robots that have operated in these contexts are typically large and powerful, with a complex
suite of sensors. These robots are very expensive, and are designed to operate alone. For example,
the DARPA Robotics Challenge funded development of large robots for search and rescue in a
disaster area [32]. The challenge culminated in a 2015 competition in which the robots were
required to climb over rubble, open a door, and drive a vehicle. The most successful robots were
humanoid, and were large and expensive, which enabled them to cover a large area but made
failures very costly.
In planetary exploration, missions to explore and study other planets have utilized increasingly
larger and more powerful rovers and landers. Mars has been the most thoroughly explored, with
NASA missions from 1997’s Pathfinder and Sojourner to 2021’s Perseverance and Ingenuity [89].
But rovers on other bodies have followed, including the Chinese Yutu-2, which landed on the Moon
in 2019 [33]. Many future missions are planned, to the Moon, Mars, and beyond, and the increas-
ing distance will require increasing autonomy as the communication time becomes too great to
send detailed commands to the mission. The successful missions make enormous contributions to
science, but failures are extremely expensive. The Mars Surveyor ’98 program, in which the orbiter
failed due to an error in unit conversion and the lander lost contact on entry to the atmosphere, cost
hundreds of millions of dollars in development and operation [79].
In addition to being expensive and risking devastating failures, the use of single large, powerful
robots can be limiting. A single robot can only visit one area at a time, so its coverage rate in
building a map or searching for survivors after a disaster is slow. A single robot is also limited
to a predefined architecture, which may be tied to the sensors or other equipment it has available.
For example, a robot carrying heavy equipment will necessarily be large and powerful, and may
not be able to squeeze through tight spaces or traverse rocky terrains. Meanwhile, a robot that can
fit into small openings and navigate through (or even fly over) challenging areas will probably be
restricted to a weight too low to carry powerful sensors or to actually rescue and carry a person. In
general, attempts to use one robot that does many tasks can result in a robot that does not excel at
any of those tasks.
Instead, it is often beneficial for these tasks in remote, challenging, or dangerous areas to be
performed by a team of robots. A team can explore more efficiently, because multiple robots can
cover more areas and extend the map in multiple directions at the same time. By maintaining a
communication network, they can travel farther and still return to the same place to work together
if necessary. A heterogeneous team has additional potential advantages: robots with different
capabilities can explore different terrains and have the potential to work together to build a better
map. In an unexplored environment, this mission might include a small UAV to build a low-
resolution overhead map, with one or more ground robots then visiting interesting areas of the map
to improve it. In search and rescue, robots might explore different parts of a building, with UAVs
flying up stairwells and small robots digging through piles of rubble. If a survivor is found, the
robots would then communicate over their network to find a route for a rescue robot to reach the
survivor and carry them out. And, if any individual robots are disabled, the rest of the team could
continue operating and still achieve part of the mission.
However, teams of robots also have their own challenges, including communication, data shar-
ing, and controls. As the robots separate and move through the environment, it may be challenging
for them to maintain contact, and their communication may have bandwidth limitations and de-
lays. Thus, algorithms that depend on one robot having access to data collected by another robot
must carefully consider when and how data should be shared. Additionally, the large amounts of
data collected by all of the robots may require additional processing, and the small robots that are
beneficial for exploration are likely to have limited onboard processing capabilities, so techniques
developed for multiple robots need to consider where processing is done. Finally, to benefit from
the team’s unique capabilities, the algorithms used to select the robots’ actions must consider them
in the context of the whole team and may need to coordinate those actions.
This thesis addresses the challenge of mapping in a robot team, and considers it in the context of
a complex multi-robot planetary exploration scenario, shown in Fig. 1.1. In this scenario, a group
of robots, including large rovers like NASA’s Perseverance, small ground robots like PUFFERs
(described in Chapter 6), and UAVs like NASA's Ingenuity, land in an unstructured, previously
unexplored area on an uninhabited planet. The planet is distant enough that the communication
delay makes it impossible to have humans sending detailed commands or building a map. The
team may have some previous low-resolution imagery, such as from an orbiter, which can help it to
identify a good landing site, but not to navigate after landing. The robots are tasked with building
up a detailed map of the area, identifying areas of scientific interest (which may be difficult to
reach, such as in caves, under overhangs, or on rocky outcroppings), navigating to those areas,
collecting the appropriate data, and sending it back to Earth. To achieve these objectives, the
robots must work together. They must survey the area to build a map. They must select targets to
visit, which may be distant from the landing site, decide which robots should visit the targets, and
actually navigate to them. During that navigation, they may use the other robots to assist with the
drive by moving rocks or observing parts of the route. Once they have reached the targets, they
Figure 1.1: A scenario in which multiple robots operate together during planetary exploration. The
team includes large rovers, small ground robots, UAVs, and even an orbiter.
must maintain a communication network to analyze and share the data. They may also need to
collect and return samples, monitor targets over time, or complete other tasks. Throughout all of
this, they need to build and maintain a map that contains all relevant information and that all robots
can use to navigate.
The problem of robots building a map and localizing themselves within it is called Simultane-
ous Localization and Mapping (SLAM). In popular SLAM systems, a robot uses its observations
from cameras and possibly an inertial measurement unit (IMU) to build a local map. When it ob-
serves a place it has seen before, it detects that as a loop closure, and updates its pose history and
map to remove error that has accumulated since the previous observation. In multi-robot SLAM,
the robots each build their own local maps, which can also be merged if inter-robot loop closures
are observed. The parts of a multi-robot SLAM system, and the current state of the art for those
parts, are described in Sec. 2.1.
One of the major challenges in multi-robot SLAM is finding those inter-robot loop closures.
When single robots return and view a scene for the second time, they usually view it with exactly
the same sensors and from a similar perspective, so the scene and the objects in it tend to look
nearly the same. Also, if the robot has been tracking its pose, it has an estimate of where it is and
when it might be near a location it has seen previously. However, none of those things are true in
inter-robot loop closures. Robots with identical designs will view the world through similar, though
not always identical, sensors and perspectives. Robots with different designs may not have the same
sensors, and they may have very different views, such as aerial and ground robots. Furthermore,
until their maps have been merged, the robots have no idea when they are viewing the same area,
meaning that they must search much more data looking for matches.
These challenges are particularly amplified in unstructured environments, or those without
man-made structure. In man-made environments, such as on city streets or in occupied build-
ings, anything that moves through the environment tends to follow the same routes, like streets or
hallways. The objects around are often rectangular, so they have predictable features that can be
observed. Even objects with irregular shapes still have known structures, which can be learned
ahead of time. Additionally, large amounts of data of similar scenes are available, which can be
used to train algorithms like scene detectors. These environments may even have accurate maps
or floor plans that the robot can use to navigate. But in unstructured environments like planetary
exploration, none of these shortcuts exist. There are no maps and no roads, so robots may not fol-
low the same routes. Objects observed are typically natural features like rocks, craters, and ridges,
which cannot be learned ahead of time and may not be recognizable from different perspectives.
Finally, there is unlikely to be much, if any, similar data available for training. This is particularly
true in scenarios like planetary exploration where the environment is unique and has possibly not
been seen before, but it also applies in remote areas that have not been visited extensively.
The contribution of this thesis is a framework for finding inter-robot loop closures and merging
maps in unstructured environments. In the absence of defined organization (streets, hallways) and
known objects (buildings, cars, furniture), the methods developed must create their own structure
and use whatever properties of the world are known. However, despite sometimes being sensitive
to the environment, the framework is flexible in the robots and teams it supports, enabling
techniques adaptable to greatly varying architectures, types
of mobility, and even sensors. After a review of background information on building multi-robot
SLAM systems, visual feature matching, and localization (Chapter 2), this thesis presents several
loop closure methods that work well with different types of robots and environments, including het-
erogeneous teams, and demonstrates how they might be used for mapping in multi-robot SLAM.
These methods include identifying 3D keypoints that appear in the same places in scans from
different sensors (Chapter 3), identifying groups of unstructured objects viewed from different
perspectives (Chapter 4), and using inter-robot range measurements to find loop closures between
robots with no visual overlap (Chapter 5). Finally, it presents the PUFFER multi-robot SLAM sys-
tem, which uses range-based loop closures in addition to traditional visual feature-based methods
and was designed for multiple robots operating together in an unstructured planetary exploration
environment (Chapter 6).
Chapter 2
Background and Literature Review
This chapter reviews background information about multi-robot SLAM and localization, which
provides helpful context for the remaining chapters. First, it reviews the pieces included in a
typical multi-robot SLAM system and discusses the current state of the art of each piece, showing
how the work in this thesis addresses real challenges and fits into a SLAM system. It also reviews
some existing full multi-robot SLAM systems, noting how they are constructed from those pieces
and the areas where they work well or fall short. Next, it reviews image matching with features,
which is a necessary tool in the odometry and loop closure parts of most SLAM systems. Several
of the novel methods presented in this thesis aim to improve on how image matching is used.
Finally, it discusses absolute localization using maps or external signals like GPS, which are not
typically part of SLAM systems but often provide sensors or techniques that can be repurposed for
multi-robot applications.
2.1 Multi-Robot SLAM System Overview
In general, the processing of a multi-robot Simultaneous Localization and Mapping (SLAM) sys-
tem consists of 4 parts: each individual robot’s onboard visual odometry, single robot loop closures,
inter-robot loop closures, and optimization. Most of those parts can be structured in a centralized
or a distributed way, and algorithms for the same part are fairly interchangeable. (Though trading,
for example, one type of loop closure detection for another may require modifying interfaces to
ensure that the right types of data are available.) While this division may or may not reflect the
software architecture, it provides a framework to think about how the processing is structured and
where one algorithm can be replaced by another. The division of multi-robot SLAM processing is
shown in Fig. 2.1.
2.1.1 Visual (and other) Odometry
The first piece of a SLAM system, and the only piece that is nearly always done onboard, is odom-
etry. As the robot moves around the environment, it processes measurements from its onboard
sensors to estimate its state, consisting of position, orientation, velocity, and possibly angular ve-
locity and sensor errors. This estimate is often made by an extended Kalman filter (EKF), which
iteratively propagates the state and covariance based on timesteps or sensor data, then updates those
estimates based on measurements from other sensors. When no IMU is available, visual odometry
may be done with only imagery using sliding-window bundle adjustment; see Scaramuzza and
Fraundorfer [103] for a discussion of these methods. However, methods that include an IMU are
typically the best choice for unstructured or uncertain environments, as they continue working if
feature tracking is lost and thus are more reliable, so we focus on EKF-based methods here.
An EKF’s processing consists of two steps:
1. Propagation: Based on the current estimate of the state and covariance and the IMU mea-
surement, predicts the state and covariance at the next timestep.
2. Update: Uses sensor measurements to improve the state and covariance estimate.
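To make these two steps concrete, the following is a minimal sketch of an EKF in Python. The constant-velocity state [x, y, vx, vy], the noise values, and the wheel-encoder-style velocity measurement are illustrative assumptions, and the class and variable names are mine; this is not the filter used on PUFFER or in any of the cited systems.

```python
import numpy as np

class SimpleEKF:
    """Minimal constant-velocity EKF over the state [x, y, vx, vy]."""

    def __init__(self):
        self.x = np.zeros(4)                          # state estimate
        self.P = np.eye(4)                            # covariance estimate
        self.Q = np.diag([1e-4, 1e-4, 1e-2, 1e-2])    # process noise (assumed)

    def propagate(self, dt):
        # Propagation: predict the state and covariance at the next timestep.
        A = np.eye(4)
        A[0, 2] = dt                                  # x += vx * dt
        A[1, 3] = dt                                  # y += vy * dt
        self.x = A @ self.x
        self.P = A @ self.P @ A.T + self.Q

    def update(self, z, H, R):
        # Update: correct the prediction with a measurement z = H x + noise.
        y = z - H @ self.x                            # innovation
        S = H @ self.P @ H.T + R                      # innovation covariance
        K = self.P @ H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ H) @ self.P

# Example: a wheel-encoder-style velocity measurement updates only [vx, vy].
ekf = SimpleEKF()
ekf.propagate(dt=0.01)
H_vel = np.array([[0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
ekf.update(z=np.array([0.5, 0.0]), H=H_vel, R=0.01 * np.eye(2))
```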
The EKF state vector is the estimate of the robot’s pose, and typically includes elements for
states like position, velocity, and orientation. It may also include angular velocity, sensor biases,
or other states specific to the updates being performed. For example, a system that uses wheel
encoder measurements might track scale factors to estimate errors like uncertainty in the wheel
diameter [69, 80], and visual odometry systems may track position and orientation at past timesteps
in the state [84] or errors in predicted feature observations [10]. In addition to state, an EKF also
[Figure 2.1 panels: 1. Odometry: as a robot moves, it estimates its motion from observations. 2. Single-Robot Loop Closure: the robot finds loops and builds maps from past observations. 3. Multi-Robot Loop Closure: two robots recognize that they have been in the same place and share maps. 4. Optimization: all observations are used to refine trajectories.]
Figure 2.1: The 4 parts of a multi-robot SLAM system. The work in this thesis primarily focuses
on multi-robot loop closure, which is particularly challenging in unstructured environments. Chap-
ters 3, 4, and 5 describe methods developed to find inter-robot loop closures in scenarios where
traditional visual loop closure methods are not effective. The methods in Chapters 3 and 4 are also
appropriate for single robots, and the method in Chapter 5 also includes updates to the optimiza-
tion. Finally, Chapter 6 describes a complete multi-robot SLAM system containing all of these
parts.
maintains the covariance matrix; it is a square matrix with the same number of rows and columns as
elements in the states, which describes the estimated error in each state estimate and the correlation
between the estimated states. The diagonal element corresponding to a state gives the expected
variance of that state estimate, and each off-diagonal element describes how the two corresponding
states are correlated. For example, in an EKF that estimates pose and sensor biases, we might
observe an acceleration toward one side, and it is difficult to tell whether it was observed because
of a bias in an accelerometer, or an orientation error causing that accelerometer to observe gravity.
As the robot moves and observations are made on the sensors, it is able to observe the difference
between those two possible errors, and the correlation between those states drops.
During the propagation step, the current estimates of the state and covariance are propagated
forward in time. If a very high-rate sensor, such as an inertial measurement unit (IMU) with
an accelerometer and gyroscope, is available, the IMU measurements are used to construct the prop-
agation matrix A. This allows it to incorporate dynamics, such as changes in acceleration, which
is necessary in high-dynamics applications [10]. However, the propagation can also be done by
simply projecting the velocity forward in time [86]. The propagation can also include controls used
to determine the robot’s next actions, though this is not common in passive localization systems.
During the update step, measurements from any other sensor are used to update the state. The
observation matrix transforms the measurement into the observed states, and it can be very complex
depending on the sensor used. For simple sensors, such as wheel encoders, it just converts from a
click count into a velocity or distance, and rotates it into the reference frame used in the state, as in
the PUFFER onboard localization described in Chapter 6. But more complex measurements, such
as features tracked in images, will have more complex observation matrices. There are different
ways that the measurement from images might be used. In the Multi-State Constraint Kalman
Filter [84], a number of features are tracked over several frames, then errors in where those features
are expected are transformed to update the poses at each of those past frames. In ROVIO [9, 10],
several features are tracked in the state, and errors in their predicted appearance are used to update
the pose estimate. In loosely-coupled systems, the motion of one or more features or the whole
image is used to estimate the robot's motion, which is then used to update the state [82, 134].
Due to vehicle dynamics, the EKF propagation must be performed at a very high rate, typically
100 Hz or higher. Because of this processing requirement, it is often not feasible to do the EKF
computations anywhere other than onboard the robot. However, the needed computational power is
relatively small, especially compared to any image processing, so this is not typically constraining.
Within this thesis, an EKF is used for onboard local visual odometry and mapping on the
Autonomous PUFFER SLAM system, described in Chapter 6. The system uses ROVIO [9, 10]
as a base, and is modified with the addition of wheel encoder measurements, which are used as a
velocity update.
2.1.2 Single-Robot Loop Closure
During single-robot SLAM, the robot finds loop closures, or events where it returns to view a
previous location again, by comparing new observations to its past observations. This can be done
by comparing images to detect those looking at the same place or by matching observations (either
landmarks or dense scene reconstructions) to the map it has already built. By finding these loop
closures, the robot can observe any error that has accumulated since the previous view of that scene
or landmark and update its estimate of past states to remove that error. Estimates of past poses and
sometimes observed landmarks are usually maintained and updated in a pose graph, described in
detail in Sec. 2.1.4.
In systems where a map is not needed in real time, it may make sense to use a technique
where the robot compares its new observations directly to its observation history, typically using
an image/scene descriptor like DBoW [27, 28, 44] or NetVLAD [3] to make the search more
efficient, or using a learned function to directly find similar images or scans [23]. If matches
are found, edges can be added directly between the poses corresponding to each image in the
pose graph. This allows the robot to find loop closures and constrain its pose history without
the overhead of building and updating a map. Depending on the size of the map that would be
constructed, this can result in significant memory savings, so it is appropriate for scenarios where
the map is not needed immediately and the robot covers a large area.
In systems where a map should be available in real time, that map may contain either land-
marks, which are features or objects detected in the observations, or a dense representation of the
environment. In systems that track features in the map, such as ORB-SLAM [86, 87], each image
is processed, and features that are observed consistently are saved as landmarks with known posi-
tions. As new images are collected, features are detected and compared with landmarks already in
the map. If good matches are found, edges are added between the new pose and the observed land-
marks in the pose graph, and the optimization estimates both the poses and the landmark position.
If the area explored is very large or busy, there are often a large number of landmarks available,
and pruning or sparsification are needed to reduce the pose graph to a manageable size. Even with
these measures, the pose graph may become too large to do online optimization updates. Systems
that use landmarks for loop closures may also maintain a dense map for visual odometry, allowing
them to have a usable map available online [101, 130]. Instead of maintaining a map of landmarks,
some SLAM systems maintain a dense occupancy grid as a map. This is common in 2D systems
like GMapping [49], which is popular on small ground robots like Turtlebot and matches a 2D
LIDAR scan to the previously constructed map. It is also used in 3D applications, which build
large-scale dense maps as the robot explores [129]. Even though the map of the environment is
dense in these systems, an occupancy grid has only a binary state or a single-value belief for each
voxel [31], so the map still grows slowly.
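To illustrate why such a map stays compact, the sketch below keeps a single log-odds belief per cell and updates it additively from hypothetical hit/miss observations. The grid size and update increments are assumed values, and the class name is mine; this is not the representation used by GMapping or any cited system.

```python
import numpy as np

class OccupancyGrid2D:
    """One log-odds value per cell; occupancy probability via a sigmoid."""

    def __init__(self, size=100, l_hit=0.85, l_miss=-0.4):
        self.logodds = np.zeros((size, size))   # 0 corresponds to p = 0.5 (unknown)
        self.l_hit, self.l_miss = l_hit, l_miss

    def update_cell(self, i, j, occupied):
        # The Bayesian update in log-odds form is a simple addition.
        self.logodds[i, j] += self.l_hit if occupied else self.l_miss

    def probability(self):
        return 1.0 / (1.0 + np.exp(-self.logodds))

grid = OccupancyGrid2D()
grid.update_cell(10, 12, occupied=True)    # e.g., a LIDAR return in this cell
grid.update_cell(10, 11, occupied=False)   # a cell the beam passed through
```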
This thesis presents methods for inter-robot loop closure, which typically do not apply to single-
robot place recognition. However, some of the techniques, like sensor-agnostic features (discussed
in Chapter 3) and groups of unstructured objects (discussed in Chapter 4), could also apply to
single robots closing large loops.
2.1.3 Inter-Robot Loop Closure
Thus far, we have discussed pieces that are used even in single-robot SLAM systems. However,
one of the major challenges in building a system that does multi-robot SLAM is the need to merge
the robots’ maps, which requires the identification of loop closures between robots. These inter-
robot loop closures are events where two robots have overlapping maps, and they can be identified
either by recognizing that the robots viewed the same area or by detecting that they were near
each other at the same time. Sometimes, the techniques from single-robot loop closures can be
used. For example, inter-robot loop closures can be found by matching observed scenes or land-
marks with those observed by other robots in the past. However, with multiple robots, identifying
those matches is more challenging due to the need to share any observation data between robots,
which may have limited bandwidth or have communication restricted at times. Additionally, when
reviewing a single robot’s trajectory, the robot has some prior estimate of pose, so it can predict
where it might find loop closures. This often does not exist in multi-robot systems, essentially
meaning that much more data must be searched even though that data is more challenging to pro-
cess.
One way to handle this large amount of data is to do all map matching and optimization in a
centralized way, such as in a base station. In this architecture, each robot collects observations
as it moves and sends those observations to a centralized repository. As the observations are
received, they are compared with the complete observation history of all robots, and inter-robot
loop closures can be found. This approach is typically used in systems that do the optimization
and map-building offline. For example, maplab [106] is a multi-session mapping system that builds
a map, then merges each new session into the map by matching observed landmarks with those in
the map. Other examples are CORB-SLAM [73] and segment-based LIDAR SLAM [34], which
do online processing, but depend on robots each sending their observations back to a base station.
While this centralized method finds the largest possible number of loop closures, it is inefficient
due to the bandwidth required to share the full observation history and the computation required to
search the large map for matches.
There are alternative approaches to both the problem of sharing data and the problem of match-
ing against a large observation history. Rather than sharing full images or long lists of detected
features, many systems aim to reduce the information shared. One way to do this is by com-
pressing full images into descriptors, or vectors that uniquely describe the image. Like feature
descriptors, these have the property that descriptors of images of the same place tend to be simi-
lar, so they can be matched efficiently. The most commonly used image descriptor is the bag of
words (DBoW) [45]. To form this descriptor, the most common features in the image are indexed,
then a vector is constructed where each element corresponds to one of those features and gives the
number of times that similar features are observed in the image. These descriptors can be stored,
shared, and matched efficiently, and are commonly used to find inter-robot loop closures [73, 118].
Another descriptor is NetVLAD [3], which is a learned descriptor that approximates DBoW. The
NetVLAD detector can be learned from training on images similar to those observed during the
robots’ operation. A distributed SLAM system has been demonstrated based on NetVLAD de-
scriptors, and it performs nearly as well as a centralized system while communicating much less
data [25, 26]. Finally, other (learned or hand-coded) mappings can be used that transform images
or scans to a space where they are easier to compare, which may even help to identify loop closures
between observations from different sensors [58, 116].
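As a rough sketch of the bag-of-words construction described above, the code below quantizes ORB descriptors from one image against a precomputed vocabulary and counts how often each visual word appears. The vocabulary, the feature count, and the Euclidean nearest-word assignment are simplifying assumptions (DBoW itself uses a hierarchical vocabulary of binary features compared with Hamming distances), and the function name is hypothetical.

```python
import numpy as np
import cv2

def bow_descriptor(image_path, vocabulary):
    """Build a bag-of-words histogram for one image.

    vocabulary: (K, 32) array of ORB descriptor centroids, assumed to have
    been learned offline (e.g., with k-means) from training images.
    """
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=500)
    _, descriptors = orb.detectAndCompute(img, None)
    hist = np.zeros(len(vocabulary))
    if descriptors is None:
        return hist
    # Assign each descriptor to its nearest visual word (Euclidean here for
    # simplicity; binary descriptors are normally compared with Hamming distance).
    d = np.linalg.norm(descriptors[:, None, :].astype(float)
                       - vocabulary[None, :, :].astype(float), axis=2)
    for w in d.argmin(axis=1):
        hist[w] += 1
    return hist / max(hist.sum(), 1)   # normalize so images are comparable

# Two images of the same place should yield similar histograms, so descriptors
# can be shared and compared far more cheaply than the full images.
# vocab = np.random.randint(0, 256, size=(200, 32))   # placeholder vocabulary
# h = bow_descriptor("frame_000123.png", vocab)
```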
Another way to reduce the amount of data shared is to detect semantic information in the
images, such as objects. In man-made environments, many objects are observed that have known
shapes and appearances, such as pieces of furniture or products like cans of soda. If those objects
are detected, they can be shared between robots with very little information, because the robot
only needs to share that it observed a can of soda in a particular place, and not the details of its
pixels or features. This technique has been demonstrated successfully, using the YOLO object
detector [92] to identify objects and a distributed method to do optimization with very little data
shared [24]. Sometimes individual objects are not unique enough to find loop closures, but groups
of objects may be. Multi-robot SLAM systems have been demonstrated using bag-of-words-style
semantic descriptors of groups of objects [91] and finding similar triangles in graphs connecting
identified objects [46]. However, detecting objects requires working in a man-made environment in
which objects have predictable forms. Without depending on man-made objects, it is also possible
to build an occupancy grid map and match portions of the maps built by each robot [107, 132].
Chapter 4 discusses a method for finding loop closures using groups of unstructured objects such
as rocks, which appear in unstructured environments like planetary exploration.
In addition to limiting the amount of data that must be shared for each observation, it may be
helpful to distribute the processing to identify matches. One way to do this in applications where
inter-robot communication is available at all times is to assign a subset of the possible image
descriptors to each agent in the system [26]. When a robot wants to share a descriptor, it can
determine which agent is responsible for similar descriptors and only share with that one. Thus,
the image descriptor does not need to be shared to all robots, and the processing to find a match
is limited to only one agent and a limited number of similar images. In applications where inter-
robot communication is only available at rendezvous, the robots may only exchange high-level
information like image descriptors initially, then exchange images or features for only the poses
where a loop closure is possible [24, 58, 118].
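A minimal sketch of the idea of assigning descriptor regions to agents, under the assumption that all robots share a set of agreed-upon centroids partitioning the descriptor space; this only illustrates the concept, not the actual protocol from [26], and the function name is hypothetical.

```python
import numpy as np

def responsible_agent(descriptor, agent_centroids):
    """Return the index of the agent responsible for descriptors like this one.

    agent_centroids: (num_agents, D) array partitioning descriptor space,
    assumed to be agreed on by all robots ahead of time.
    """
    distances = np.linalg.norm(agent_centroids - descriptor, axis=1)
    return int(distances.argmin())

# A robot sends a new descriptor only to this one agent, which also holds all
# previously shared descriptors in the same region, so the similarity search
# stays local and small instead of being broadcast to the whole team.
```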
The methods described above depend on two or more robots observing the same area, but do
not require them to do so at the same time. In fact, they are equally applicable to multi-session
SLAM, in which one or more robots observe the same area at different times [106]. However,
when the application requires that the robots actually explore at the same time, it may be possible
to make direct observations between them instead of or in addition to visual loop closures. These
direct observations give a well-known transformation between two poses, so they can be easily
incorporated into the pose graph and used to merge maps. The most common way to make these
observations is by one robot directly observing another [19, 107, 124]. These methods assume
that one robot views another robot with its camera, and the observation is detailed enough that the
viewing agent recognizes which robot it is viewing and can estimate its relative pose accurately.
In lab applications, this level of recognition is typically accomplished by labeling each robot with
AprilTags [127], which are patterns in a known size whose exact relative pose can be estimated
from a single image. However, these tags may not be an option in real applications. Robots may
be required to look a certain way (such as in planetary exploration, where robots are painted in
a specific way to withstand the extreme environment [50]), or they may get covered with dust
or dirt (such as during search and rescue operations after a natural disaster). Thus, recognizing
other robots may be more challenging in real applications. However, this technique is used in the
PUFFER system, described in Chapter 6, in which the ground robots recognize an AprilTag on the
base station to initialize their poses and shared map frame.
An alternative way to make direct observations between robots is to use other sensors to share
information that allows relative localization. Those sensors are typically not high-dimensional like
cameras, so they may be able to estimate full poses or they may only make a single-dimensional
measurement. One common type of sensor makes ranging measurements, which measure the
straight-line distance between sensors on each of the robots. One type of distance sensor that can be
used for relative pose estimation is acoustic pulses, which are used underwater [53]. Another type
of distance sensor is ultra wideband (UWB) ranging radios, which are cheap, lightweight sensors
that provide low-noise measurements over short distances based on a time-of-flight measurement.
UWB ranging sensors are typically used for absolute positioning using beacons, as described in
Sec. 2.3. However, they can also be adapted as an observation of direct loop closures, and a method
to achieve this capability is described in Chapter 5.
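As a small illustration of how a single range measurement constrains two poses, the function below predicts the antenna-to-antenna distance from two planar pose estimates and returns a whitened residual against the UWB measurement. The pose parameterization, antenna offsets, noise value, and function name are assumptions for this sketch; the formulation actually used in this thesis is given in Chapter 5.

```python
import numpy as np

def range_residual(pose_a, pose_b, offset_a, offset_b, measured_range, sigma=0.1):
    """Residual of one inter-robot UWB range measurement.

    pose_a, pose_b: (x, y, theta) estimates of each robot in a shared frame.
    offset_a, offset_b: antenna positions in each robot's body frame.
    """
    def antenna_world(pose, offset):
        x, y, th = pose
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])
        return np.array([x, y]) + R @ np.asarray(offset)

    predicted = np.linalg.norm(antenna_world(pose_a, offset_a)
                               - antenna_world(pose_b, offset_b))
    return (measured_range - predicted) / sigma   # whitened scalar residual

# e.g. range_residual((0, 0, 0), (3.0, 4.0, 1.57), (0.1, 0.0), (0.1, 0.0), 5.02)
# is close to zero, so this measurement is consistent with the two pose estimates.
```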
2.1.4 Optimization
The final piece of actually performing SLAM is the optimization, in which all of the poses and
the map are estimated using past information, including both odometry measurements and loop
closures. In modern SLAM systems, including nearly all of those developed in the last ten years,
pose graph optimization (PGO) is used to estimate pose histories and sometimes also to build
the map. While there are other ways to do the estimation, including particle filters and extended
Kalman filters, these methods are typically not appropriate for robots exploring large complex
environments, and are particularly challenging for systems of multiple robots. Therefore, this
discussion is limited to only PGO-based SLAM techniques. For a discussion of non-PGO based
multi-robot SLAM techniques, see Gupta and Conrad’s review [52].
In PGO, a graph is constructed such that each vertex represents a pose, or a robot’s state at a
particular time. The vertices have edges connecting them, with each edge representing a measure-
ment made that constrains the two poses it connects. The edge could constrain two consecutive
poses based on onboard sensor data, such as IMU measurements or visual odometry, or it could
constrain two poses that have a known (or partially known) transformation, based on loop clo-
sures. Once a pose graph has been built, the process of PGO is used to re-estimate all poses by
minimizing the error introduced by each constraint.
Overall, PGO is a large optimization problem solving:
X^* = \operatorname*{argmin}_X \sum_{i,j} \lVert f_{ij}(x_i) - x_j \rVert^2_{\Lambda}    (2.1)

Here, x_i gives the pose of vertex v_i, with X describing the poses of all vertices. f_ij is a function
implementing the constraint associated with edge e_ij, so f_ij(x_i) = \hat{x}_j is the estimate of what x_j
should be, based on x_i and the constraint on e_ij. Finally, \lVert \cdot \rVert_{\Lambda} is the Mahalanobis distance with
covariance Λ, allowing each term in the optimization to be scaled based on how accurate the
measurement is that gives the corresponding constraint. Thus, PGO attempts to find the overall
best set of poses, given all constraints.
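As a concrete toy instance of Eq. 2.1, the sketch below optimizes a three-pose 2D graph with two odometry edges and one loop-closure edge, using SciPy's nonlinear least-squares solver in place of a dedicated PGO library. The poses, measurements, and weights are made up for illustration, and angle wrapping is ignored for brevity.

```python
import numpy as np
from scipy.optimize import least_squares

# Each pose is (x, y, theta); each edge is (i, j, measured relative pose, weight).
edges = [
    (0, 1, np.array([1.0, 0.0, 0.0]), 1.0),             # odometry
    (1, 2, np.array([1.0, 0.0, np.pi / 2]), 1.0),        # odometry
    (2, 0, np.array([0.0, 2.0, -np.pi / 2]), 10.0),      # loop closure
]

def relative_pose(xi, xj):
    # Express pose j in the frame of pose i (the f_ij of Eq. 2.1 for these edges).
    dx, dy = xj[:2] - xi[:2]
    c, s = np.cos(xi[2]), np.sin(xi[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, xj[2] - xi[2]])

def residuals(flat_poses):
    poses = flat_poses.reshape(-1, 3)
    res = [np.sqrt(w) * (relative_pose(poses[i], poses[j]) - z)
           for i, j, z, w in edges]
    res.append(poses[0])           # anchor the first pose at the origin
    return np.concatenate(res)

# Noisy initial guess; the optimizer redistributes the loop-closure error.
x0 = np.array([0, 0, 0, 1.1, 0.1, 0.05, 2.2, 0.2, np.pi / 2 + 0.1], dtype=float)
solution = least_squares(residuals, x0)
print(solution.x.reshape(-1, 3))
```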
In single-robot SLAM, the optimization is generally performed in a centralized way, either
onboard during operation [67, 86, 87], or afterwards during map construction [106]. However,
in multi-robot systems, this centralized approach may not always be feasible or the best option.
The best optimization structure is dependent on the system architecture, applications, what type of
measurements are made, and how much information is available to each robot.
The simplest PGO architecture to implement is a centralized architecture, where all robots
make measurements and send them back to a base station or another robot that is designated as a
central processor, and the base station performs the optimization and sends back optimized poses
and measurements. This technique provides excellent results, as all information from all robots
is available to be used in the optimization. However, it generally requires high bandwidth com-
munications between the robots and the base station to share all measurements, which may not
be available in some applications. Additionally, the central pose graph can get very large, and
performing the optimization may be slow and computationally expensive. Depending on the appli-
cation, this may be acceptable. In a short-term application with small robots, the graph may not get
large enough to slow down the optimization. Also, if the application does not require immediate
updating of poses and maps onboard the robots, this architecture may be effective by processing
the data from each robot in batches like missions [106] or submaps [108]. Despite the communi-
cation restrictions, this architecture has the advantage that robots do not need to communicate with
or even be aware of each other, removing a lot of coordination needed in more distributed architec-
tures. Several systems have been developed with this architecture, including CORB-SLAM [73],
segment-based LIDAR SLAM [34], and CCM-SLAM [105], which have immediate updates, and
LAMP [38] and Autonomous PUFFER SLAM, discussed in Chapter 5, which allow for commu-
nication between robots and the base station to be interrupted.
In some scenarios, it may not be ideal for a centralized computer to perform all PGO. For
example, there may not be any robot with a computer powerful enough to act as a base station, or
there may not be a guarantee that communication will be available. If the robots will do significant
exploration out of range of the base station, it may not make sense to rely on regular updates from
the base station. That might occur in scenarios like cave exploration or search and rescue in a
collapsed building, where communication is likely to be unavailable. A less centralized solution
may also be a good choice in scenarios where the robots are likely to have frequent communication,
such as those where multiple robots are working together in a small area. In those cases, it may
be useful to perform some or all of the PGO onboard each robot, with robots only exchanging
observations at convenient times. However, this makes the optimization much more challenging,
as each robot has access to only its own pose history and whatever information is exchanged, so it
may be missing parts of other robots’ pose graphs. Thus, the optimization must be performed in
a distributed way, so that each robot is able to estimate only part of the pose graph [24, 25, 119].
This distributed optimization is challenging for two reasons. First, the communication protocol
and how the optimization is constructed must be designed so that each robot has the information
it needs to do the optimization [39, 118]. Second, if the information is shared between pairs of
robots, it is necessary that the optimization algorithm understands the measurements in a way that
ensures they are not double-counted during sequential communications/updates [36, 124]. Even
in scenarios where communication is not restricted, distributed processing can help with reducing
the load on each robot’s processor, and it may even be helpful to distribute the processing to many
servers in the cloud [135].
One of the key distinctions in the different PGO architecture options, and in how multi-robot
systems can be classified, is how aware a robot is of other robots. In order for a robot to perform
PGO onboard that includes inter-robot loop closures, it must be aware of those other robots, even if
it does not have detailed information about their trajectories. But in a centralized system, each robot
can simply collect observations and send them back to the base station, without any awareness of
where other robots are or even whether they exist. The choice of architecture depends on the
processing power available, as the robots’ onboard processors may or may not be sufficient for
a large pose graph, and there may or may not be a powerful base station available. The choice
of architecture may also depend on the application: it may be necessary for the robots to know
about each other if their desired behaviors require them to interact, and they may not benefit from
knowing about each other if they will visit separate areas. The PUFFER multi-robot SLAM system,
described in Chapter 6, uses a centralized architecture because the robots are small with very
limited processing power, and the application does not require them to interact. Instead, even
though they make direct inter-robot measurements, the base station actually constructs the full
pose graph with loop closures and does the optimization.
A further challenge in multi-robot SLAM is detecting and removing incorrect loop closures.
In any pose graph, a loop closure that incorrectly connects two nodes can cause the graph and
the map it produces to be distorted and unusable. In single-robot systems, the prior estimate of
the robot’s poses allows many incorrect matches to be rejected. But in multi-robot systems, these
prior estimates do not exist. Instead, checks can be performed on the loop closures to make sure
they are consistent with the rest of the maps. Pairwise consistency checks are commonly used
before the loop closures are added to the graph [38, 68], which works well for graphs with sparse
loop closures. The PUFFER SLAM system uses a different approach, in which range-based loop
closures, described in Chapter 5, are implemented as switchable constraints [113, 114]. These
edges in the pose graph can be disabled during optimization if they are not consistent with the rest
of the graph.
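A minimal sketch of how a switchable constraint [113, 114] can be expressed as a residual: the ordinary loop-closure error is scaled by a switch variable that the optimizer may drive toward zero, at the cost of a prior penalty that keeps consistent edges enabled. The scaling, prior weight, and function name here are illustrative, not the exact formulation used in the PUFFER system.

```python
import numpy as np

def switchable_residual(loop_error, switch, prior_weight=1.0):
    """Gate a loop-closure error vector with a switch variable in [0, 1].

    loop_error: the ordinary loop-closure residual (e.g., relative-pose error).
    switch: an extra optimization variable; driving it to 0 disables the edge
            but incurs the switch-prior penalty, so only edges that conflict
            with the rest of the graph get turned off.
    """
    s = np.clip(switch, 0.0, 1.0)
    return np.concatenate([s * np.asarray(loop_error, dtype=float),
                           [prior_weight * (1.0 - s)]])

# An edge whose error stays large after optimization ends up with switch near 0,
# effectively removing the incorrect loop closure from the graph.
print(switchable_residual([0.4, -0.2, 0.05], switch=0.9))
```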
2.1.5 Full Multi-Robot SLAM Systems
In Table 2.1, we show a number of complete PGO-based multi-robot SLAM systems that have
been developed recently. All of the systems contain the four parts described above, with different
architectures and implementations. The last row shows our PUFFER multi-robot SLAM system,
described in Chapter 6.
There are some clear similarities between all of these systems. They all include some type of
odometry, usually using visual or LIDAR observations depending on the sensors available. In the
systems that do not include or choose not to use IMUs, ORB-SLAM is a very popular implemen-
tation. When an IMU is available, most systems use similar EKF-based VIO, incorporating IMU
measurements and updates from tracked features. Some also incorporate wheel encoders in the
EKF, though not all methods are designed for ground vehicles. Also, almost none of the systems
do single-robot loop closure in a different way from multi-robot loop closure. Many of them only
find single-robot loop closures incidentally as they look for inter-robot loop closures, and even
those that do explicitly look for single-robot loop closures mostly use the same method as they use
to find inter-robot loop closures. A major exception is CORB-SLAM, which implements single-
robot loop closure onboard the robots using ORB-SLAM, but uses centralized scene recognition
for inter-robot loop closures. But in most cases, to avoid repeating the same processing, it makes
sense to perform the two loop closure steps in the same way.
Most of the systems find loop closures through scene matching, based on DBoW or NetVLAD
image descriptors, scan descriptors, or submap matching. These loop closure methods require
significant preparation ahead of time, including training scene detectors. Very few of the systems
directly match map landmarks, and maplab and the PUFFER system (which is based on maplab)
are the only ones that match feature landmarks. (Object SLAM treats objects as map landmarks,
but the object detectors are learned ahead of time.) The decision about what type of loop closures to
use is heavily dependent on the environment in which the system is expected to be used. In well-
known, well-understood environments, like city streets or inside office buildings, it is expected
that many known objects will be observed and recognized, and there is a large amount of similar
imagery available for training. This contrasts with unstructured environments like the one where
the Autonomous PUFFER system is used and the other environments studied in this thesis. In those
environments, there is no expectation of well-understood order or known objects, and there may
not be any prior images to train with, so loop closure methods based on scene or object recognition
are not feasible.
Finally, about half of the methods use distributed PGO, while the rest are centralized. This
largely reflects the intended application: the centralized implementations are typically intended for
systems with less onboard processing power, such as the small UAVs that maplab is demonstrated
on, while the distributed systems run on robots that can handle doing the optimization onboard,
like the Clearpath Jackals used to demonstrate Object SLAM. This is not surprising, as a robot with
more onboard compute power would be expected to operate with a higher level of autonomy, which
likely involves tracking pose and observation history, and building a pose graph, onboard. While
significant advances have been made in distributed PGO, these methods remain very complex to
implement.
Table 2.1: Summary of currently-used multi-robot SLAM systems, showing how they implement each of the four parts described.

6D Graph SLAM [107, 108]
  Odometry: visual odometry in EKF with time delay
  Single-robot loop closure: submap matching with surface point cloud
  Multi-robot loop closure: direct matching with AprilTag detections, submap matching
  Optimization: centralized on base station, performed in real time

Object SLAM [24]
  Odometry: wheel odometry
  Single-robot loop closure: same as multi-robot loop closure
  Multi-robot loop closure: objects detected using YOLO and tracked in map
  Optimization: distributed (DGS), observations shared during encounters
  Notes: demonstrated with household objects

CORB-SLAM [73]
  Odometry: visual odometry (ORB-SLAM)
  Single-robot loop closure: match to map landmarks
  Multi-robot loop closure: DBoW for place recognition, then matches ORB features
  Optimization: centralized on base station, performed in real time

Segment-based LIDAR SLAM [34]
  Odometry: IMU, wheel odometry, LIDAR scan-matching
  Single-robot loop closure: part of multi-robot loop closure
  Multi-robot loop closure: submap matching by matching segment features
  Optimization: centralized on base station, performed in real time
  Notes: designed for LIDAR scans

maplab [106]
  Odometry: ROVIO patch-based VIO
  Single-robot loop closure: none
  Multi-robot loop closure: features detected in new observations and matched to landmarks in map
  Optimization: centralized with maplab PGO, performed offline
  Notes: dense map must be constructed offline

DSLAM [25]
  Odometry: visual odometry (ORB-SLAM)
  Single-robot loop closure: same as multi-robot loop closure
  Multi-robot loop closure: NetVLAD image descriptors, then match landmarks
  Optimization: distributed (DGS), observations shared during encounters
  Notes: novel method to distribute matching among robots

CCM-SLAM [105]
  Odometry: visual odometry (ORB-SLAM)
  Single-robot loop closure: part of multi-robot loop closure
  Multi-robot loop closure: DBoW for place recognition, then matches map landmarks
  Optimization: centralized on base station, performed in real time

TSDF-SLAM [36]
  Odometry: eVO stereo VIO
  Single-robot loop closure: same as multi-robot loop closure
  Multi-robot loop closure: submap mesh matching with ICP
  Optimization: distributed (Ceres), observations broadcast, bookkeeping to track latest submap

DOOR-SLAM [68]
  Odometry: RTAB-Map stereo visual odometry
  Single-robot loop closure: none
  Multi-robot loop closure: NetVLAD image descriptors, then match features
  Optimization: distributed (DGS), observations shared during encounters
  Notes: novel outlier rejection to validate loop closures

LAMP [38]
  Odometry: LIDAR scan-to-scan matching
  Single-robot loop closure: LIDAR scan to submap matching
  Multi-robot loop closure: LIDAR scan matching with GICP
  Optimization: centralized, data shared with base station when communication available
  Notes: outlier rejection to remove inconsistent loop closures

DiSCo-SLAM [58]
  Odometry: LIO-SAM LIDAR-inertial odometry
  Single-robot loop closure: same as multi-robot loop closure
  Multi-robot loop closure: Scan Context LIDAR scan descriptors
  Optimization: distributed (global and local), observations shared during encounters
  Notes: based on LIDAR scans instead of images

Kimera-Multi [118]
  Odometry: Kimera-VIO
  Single-robot loop closure: same as multi-robot loop closure
  Multi-robot loop closure: DBoW image descriptors, ORB features
  Optimization: distributed (GNC), observations shared during encounters

PUFFER SLAM (Chapter 6)
  Odometry: ROVIO patch-based VIO, wheel odometry
  Single-robot loop closure: part of multi-robot loop closure
  Multi-robot loop closure: map landmarks, inter-robot UWB ranging
  Optimization: centralized on base station, performed in batches in real time
  Notes: designed for the Autonomous PUFFER program at JPL
2.2 Image Matching with Features
The previous section described many different types of loop closures, which identify that two
robots are in the same area or that their sensors view the same scene. One of the key techniques
for working with that sensor data is feature matching, in which the same features can be detected
in both frames. This section will discuss what feature matching is and how it is used in typical
systems.
A keypoint is a point in an image or other view of a scene that is both unique and repeatable:
it should look different from other points in the image, and multiple observations of the same
keypoint should look the same. In images of real scenes, keypoints are associated with a point in
3D space, though that point is not required to represent an object. For example, it could be the
edge of a shadow, or a place where two corners of buildings appear to overlap. In views from
3D sensors, keypoints are often associated with real structures in the world, like the corner of a
building. Typically, a keypoint detector is used to identify keypoints, using methods like corner
detection [54, 96, 109] or blob detection by processing the gradients of the image [5, 75, 111].
Corner keypoints are typically very fast to detect but are less distinctive and can be difficult to
match without a prior estimate of where they might be found. Blob keypoints, on the other
hand, are slower to detect but can often be matched more successfully.
The keypoints described above are simply image or scan points that are associated with a point
in space. To tell when two images or scans are viewing the same area, we need to detect those
keypoints in each observation, identify that two detections are associated with the same point in
space, and estimate the transform between the two camera frames based on pairs of keypoints
representing the same point in space. The first step of this procedure is to find the keypoints that
represent the same 3D point, which is typically done using descriptors, or vectors describing the
keypoint in a way that makes it straightforward to match. Together, the keypoint and descriptor
are called a feature. The first commonly-used descriptors were SIFT [75] and SURF [5], which
are scale- and rotation-invariant and describe blob features by constructing a vector of gradients
around the keypoint location. These features can be matched by comparing the Euclidean distance
or dot product between the descriptor vectors, as different views of the same keypoint will have
very similar descriptors and different keypoints will have different descriptors. Later, binary de-
scriptors like ORB [96] and BRISK [72] were proposed for corner features, which are much faster
to create and use because they can be matched very efficiently using Hamming distance. Because
the descriptors can be matched in this way, all features in each image can be searched to find
matches, so no prediction of where a feature’s match is likely to be is required. Thus, matching
features using descriptors is effective for finding loop closures and estimating relative poses, as the
relative poses do not need to be known in advance.
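A minimal sketch of this kind of binary descriptor matching, using OpenCV's ORB detector and a brute-force Hamming matcher, is shown below; the image filenames are placeholders, and the feature count is an arbitrary choice.

```python
import cv2

# Load two grayscale views of (possibly) the same scene; filenames are placeholders.
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary ORB descriptors in each image.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors exhaustively with Hamming distance; cross-checking keeps
# only pairs that are mutual nearest neighbors, a simple filter for bad matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} putative matches")
```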
Though an estimated transform is not required to find feature matches, it is available in some
applications and can be useful. Notably, in visual odometry, an estimate of how the camera has
moved from one frame to the next is available, either by propagating IMU measurements [9, 10]
or just by projecting the motion from the previous frame [86, 87]. In those cases, feature matches
can be found much more efficiently because the whole image does not need to be searched, so
correct matches are more likely to be found and incorrect matches are likely to be avoided. The
projections may not always be correct, such as during trajectories with high dynamics, so these
methods must also be able to recover from not finding feature matches.
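The sketch below illustrates this kind of guided search: a tracked 3D landmark is projected into the predicted camera frame with a pinhole model, and only detections within a small window around the projection are considered as match candidates. The transform convention, intrinsics, and window size are illustrative assumptions, not a specific VIO implementation.

```python
import numpy as np

def guided_candidates(landmark_w, T_cw, K, keypoints_px, window=20.0):
    """Return indices of detected keypoints near the landmark's predicted projection.

    landmark_w: 3D landmark position in the world frame.
    T_cw: predicted 4x4 world-to-camera transform (from IMU propagation or a motion model).
    K: 3x3 pinhole intrinsics.
    keypoints_px: (N, 2) array of detected keypoint pixel locations.
    """
    p_c = T_cw @ np.append(landmark_w, 1.0)        # landmark in the camera frame
    if p_c[2] <= 0:                                # behind the camera: no candidates
        return np.array([], dtype=int)
    uv = (K @ (p_c[:3] / p_c[2]))[:2]              # pinhole projection to pixels
    dists = np.linalg.norm(keypoints_px - uv, axis=1)
    return np.flatnonzero(dists < window)          # only search inside the window
```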
Typically, in two observations of the same scene, matching features from one frame to the next
gives pairs of features that are mostly correct. However, there may be some incorrect matches,
often because a feature did not have any true matches so it was paired incorrectly, or because
two features look very similar. These incorrect matches are often identified as the relative pose
is estimated using an outlier rejector like RANSAC [41]. For all correct pairs of features, the
difference in camera frames between the observations can be used to predict where the features
will be found in the images or scans. For cameras, this is described by the fundamental matrix F,
which has the property $x_1^T F x_2 = 0$, as shown in Fig. 2.2. From a set of correctly matched features, $F$ can be estimated:

$$\arg\min_{\hat{F}} \; \sum_{x} \big[ x_1^T \hat{F} x_2 \big]^2 \quad \text{such that} \quad \det(\hat{F}) = 0$$
Figure 2.2: Geometry of two cameras observing features $x_1$ and $x_2$, which are both observations of the same 3D point $X$. $F$ is the 3×3 fundamental matrix describing the epipolar constraint between the two camera frames, with the property $x_1^T F x_2 = 0$.
RANSAC performs this estimation and outlier rejection at the same time by repeatedly selecting
a subset of the pairs, using it to estimate F, then rejecting any pairs that do not fit that estimate.
After several iterations, a fundamental matrix that describes the correct matches is found, and all
matches that do not fit that matrix are rejected as being incorrect. [4]
The fundamental matrix requires at least 8 feature pairs to estimate, but more pairs are helpful in
case of outliers. In scenarios where all features have 3D positions that are known to be coplanar, we
can instead use a planar homography, which relates the two views but is only defined up to scale,
so it can be estimated with only 4 feature pairs (but again, more pairs are preferred). Several
chapters of this thesis discuss methods that search for correspondences with the goal of finding
enough correspondences to estimate relative pose. In Chapter 3, at least 8 correspondences are
needed between 3D points in point clouds from different sensors to estimate the sensors’ relative
poses. Chapter 4 describes a method that finds correspondences between observations of rocks
from different frames. The rocks are located on the ground, which is assumed to be fairly flat, so
only 4 correspondences are needed to estimate the homography and recover the relative pose.
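The sketch below illustrates both estimators with OpenCV's RANSAC-based routines; the camera intrinsics, motion, and point sets are synthetic values made up so the snippet runs on its own, and the inlier thresholds are arbitrary.

```python
import numpy as np
import cv2

# Synthetic correspondences so the snippet runs end to end: random 3D points
# projected into two views with a small rotation and baseline between them.
rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])

def project(X, R, t):
    x = (K @ (X @ R.T + t).T).T
    return (x[:, :2] / x[:, 2:]).astype(np.float32)

R, _ = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))
t = np.array([0.3, 0.0, 0.0])

# General (non-planar) scene: the fundamental matrix needs at least 8 pairs.
X = np.column_stack([rng.uniform(-2, 2, 100), rng.uniform(-1, 1, 100), rng.uniform(4, 8, 100)])
F, f_mask = cv2.findFundamentalMat(project(X, np.eye(3), np.zeros(3)),
                                   project(X, R, t), cv2.FM_RANSAC, 1.0, 0.99)

# Planar scene (e.g., rocks on roughly flat ground): a homography needs only 4 pairs.
Xp = np.column_stack([rng.uniform(-2, 2, 50), np.full(50, 1.0), rng.uniform(4, 8, 50)])
H, h_mask = cv2.findHomography(project(Xp, np.eye(3), np.zeros(3)),
                               project(Xp, R, t), cv2.RANSAC, 3.0)
print("F inliers:", int(f_mask.sum()), "H inliers:", int(h_mask.sum()))
```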
The procedure described above is for matching features between images, which have the added
challenge of scale because they are 2D. However, keypoints and features can also be detected in
3D data from LIDAR scanners, stereo cameras, or RGB-D cameras. These features can be used
to find the relative pose between sensors in a similar, but simpler way. Because the measurements
include scale, the relative pose is simply a rotation and translation that maps one set of keypoints
onto the other. A thorough review of 3D keypoints is in Tombari et al. [120], and reviews of 3D
descriptors are in Alexandre [2] and Guo et al. [51]. Chapter 3 evaluates 3D keypoints specifically
for use during SLAM.
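As a minimal illustration of that simpler 3D case, the sketch below recovers the rotation and translation aligning two sets of already-matched 3D keypoints with the standard SVD-based (Kabsch) solution; in practice this would sit inside an outlier-robust loop such as RANSAC, and the synthetic data at the end is only a self-check.

```python
import numpy as np

def rigid_transform_3d(P, Q):
    """Find R, t minimizing ||R @ P_i + t - Q_i|| over matched 3D keypoints.

    P, Q: (N, 3) arrays of corresponding points (N >= 3, not all collinear).
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                # 3x3 cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # correct a reflection if one appears
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t

# Quick self-check with a synthetic rotation and translation.
rng = np.random.default_rng(1)
P = rng.normal(size=(20, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
Q = P @ R_true.T + np.array([0.5, -1.0, 2.0])
R_est, t_est = rigid_transform_3d(P, Q)
print(np.allclose(R_est, R_true, atol=1e-6))  # True
```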
2.3 Absolute Localization
So far, all of the localization measurements discussed in this chapter are relative measurements.
IMUs and wheel odometers measure the robot’s motion since the previous measurement. Even
though images show the world at a particular point, they are mainly used as relative measurements
by estimating the relative pose between two images. Thus, the map that is built by these methods
is a relative map, based on an origin at the robot’s initial pose, the base station, or some other
reference point.
However, there are often ways to measure the robot’s pose relative to known points in the
world, either by observing known objects or by making measurements to a known point. In those
cases, the robot can build an absolute map, with the origin at a known point in the world. Using an
absolute map is a necessity in many scenarios. In space exploration, landmarks to visit may have
been identified in a satellite map, so the robots need to reach the correct location. In search and
rescue, whether the mission is indoors or outdoors, there is a known area that must be searched. If
a robot finds a person who needs to be rescued, it may need to direct rescuers to a specific location.
These tasks require using a map fixed to the global reference frame.
One method of making absolute measurements, which can also be done by humans, is matching
to previously built maps. In many scenarios, including planetary exploration and outdoor search
and rescue, maps are likely to have previously been constructed from satellite imagery. Even in
indoor scenarios, there is likely to be some initial map, such as a building plan. These maps will
generally include information like terrain and large obstacles, but they may also have outdated
information or be collected at a different time of day or year. As a tool for localization, observing
a scene and matching it to these maps is similar to finding a loop closure to a point with a known
position. Similar techniques to those used to find visual loop closures can often be used, though
they may need to be modified due to the maps’ different information. For example, some satellites
collect infrared images, which must be processed with edge detection or learned mappings to
match with visible images [116]. Some maps are collected at a different time of day or year, so
again will need additional processing (for example, if the map has snow on the ground but the
robot is operating in the summer, or if the map was collected in the morning but the robot operates
in the evening, when shadows are in very different places). This is a major challenge in lifelong
mapping, and strategies include continually finding loop closures where possible [16, 62], storing
features from multiple passes in the map [37, 85], and tracking how features persist or change over
time [94].
The other frequently-used method of finding absolute pose is making measurements to ob-
jects at known positions. The most common way to do this in real-world applications on Earth is
with GPS. GPS satellites orbit the Earth in well-understood orbits with very accurate clocks, and
broadcast a specific signal. GPS receivers correlate a local signal with the one received from each
satellite, and are able to triangulate their location by estimating the distance from each satellite and
the position of the satellite when it sent the signal. This produces a position estimate for the re-
ceiver with an accuracy of about 5 meters in basic civilian applications like cell phone positioning,
down to centimeter-level accuracy for military and surveying applications. Multi-antenna systems
can further improve this accuracy and also estimate orientation with differential GPS, which com-
pares the signal received at two antennas a known distance apart. [15]
GPS is only available in outdoor, above-ground applications on Earth. But in areas that are ac-
cessible prior to the robotic mission, similar techniques can be used with beacons placed at known
locations. This is typically done with ultra-wideband (UWB) radios, which are very small portable
sensors that measure the distance between an anchor and a tag. Beacons are set up around the area
of robotic operations, each containing a tag. During operation, each robot carries an anchor, which
it uses to measure the distance to each beacon. Because those beacons have known locations, the
robot can triangulate its position with each measurement and use that position directly or as an
update to its EKF pose estimate [43, 78, 125]. If the beacons’ positions are not known initially,
they can be estimated during operation and then used for absolute positioning [139]. Even if the
beacons do not fully cover the space, they can be used to help the robots estimate scale and reduce
drift, which are particularly challenging for robots with monocular cameras [18, 70].
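The sketch below shows the basic position fix, assuming beacons at known 2D positions and a nonlinear least-squares solve over the range residuals; the beacon layout, measured ranges, and initial guess are made-up values for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

# Known beacon positions and the UWB ranges measured to each (made-up values,
# roughly consistent with a robot located near (5, 2)).
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 8.0], [10.0, 8.0]])
ranges = np.array([5.39, 5.39, 7.81, 7.81])

def residuals(p):
    # Difference between predicted and measured distance to each beacon.
    return np.linalg.norm(beacons - p, axis=1) - ranges

# Solve for the robot position from an initial guess (e.g., the last estimate).
sol = least_squares(residuals, x0=np.array([1.0, 1.0]))
print("estimated position:", sol.x)
```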
While the UWB signals are typically used for absolute positioning, the beacons are not actually
required to be at fixed positions. Instead, they can also be placed on robots or other objects to be
localized, and used for relative positioning. In Chapter 5, we use UWB sensors to make range
measurements between robots, enabling the PUFFER system to find inter-robot loop closures and
merge their maps even when the robots’ trajectories do not overlap.
Chapter 3

Finding Loop Closures Between Robots with Heterogeneous Sensors
3.1 Introduction
Many applications for multi-robot systems in unstructured environments benefit from heteroge-
neous teams. In search and rescue, a team may include flying robots for searching, large powerful
robots to carry anyone who needs to be rescued, and small robots to squeeze through small spaces.
In space exploration, the team might include UAVs like Ingenuity, large rovers like Perseverance,
and small ground robots like PUFFERs. Though each robot localizes on its own and builds its own
map, they also must merge their maps so they can explore more efficiently and so that they can use
shared observations to help with planning and navigation.
Most approaches for multi-robot SLAM improve localization for a robot team [24, 34, 76,
100], but only consider homogeneous groups, meaning all robots have the same sensors. Thus,
detecting that two robots are observing the same scene can be done using the same techniques as
detecting single-robot loop closures, such as using feature or scene descriptors.
However, teams of heterogeneous robots may not have the same sensors available, preventing
them from using these loop closure techniques to find observations of shared scenes. Consider
a scenario where large ground vehicles map an outdoor region with the aid of small UAVs. The
ground vehicles may have high-quality IMUs and heavy, expensive sensors such as LIDAR, which
provide an accurate point cloud of the scene but no visual information. They may also have cam-
eras for specific purposes, such as the navigation cameras on Perseverance that assist in wheel
placement, but those cameras may not cover the whole scene and may be difficult to match with
images from other perspectives, as discussed in Chapter 4. The UAVs are too small to carry a
LIDAR, but have lightweight, low-quality IMUs and one or more small cameras. They can re-
construct a point cloud using stereo vision, but it has very different error properties from the LIDAR
scans. Ideally, both types of robots would be able to contribute to a shared map, with the UAV
covering a large area and the ground vehicles providing detailed measurements of important areas,
and both robots would also be able to match their observations to this map to improve their pose
estimates.
Finding correspondences between LIDAR scans and stereo cameras is challenging since LI-
DAR does not provide imagery. While both can provide 3D point clouds (as can other sensors like
RGB-D cameras), the point clouds generated have different resolutions and error models, which
make dense methods for data association challenging (see Fig. 3.1). In particular, the point cloud
from a LIDAR scanner has lower resolution due to the physical structure of the scanner, but the
points have very low noise and cover the full 360-degree field of view. Meanwhile, a point cloud
reconstructed from stereo imagery has much higher resolution, as there is a point for each pixel in
the images, but there is a lot of noise on the position of each point, particularly in the depth mea-
surement. However, even if the types of point clouds do not lend themselves to direct point-to-point
matching, it may be possible to identify correspondences by matching keypoints observed in all
types of point clouds. Like in images, these keypoints can be identified using keypoint detectors,
or methods to select unique and repeatable points in the point cloud.
Once keypoints have been detected in point clouds, using them for multi-robot SLAM would
require identifying matches, or pairs of keypoints corresponding to the same physical location.
These matches are typically found using descriptors, which uniquely describe each selected point.
Several types of descriptors exist for keypoints in point clouds, including descriptors associated
with keypoint detectors (e.g., NARF [111] and ISS [138]), and standalone descriptors (e.g., Spin
Images [59] and variations on Point Feature Histograms [97, 98]). Alexandre [2] and Guo et
al. [51] evaluated 3D descriptors on point clouds from various sensors, and found that all descrip-
tors worked significantly better when applied to detected keypoints rather than a uniform sampling.
However, this descriptor matching approach assumes that those keypoints are the same between
point clouds. Thus, before detected keypoints can be matched and used in SLAM, it is necessary
to identify keypoint detectors with this property, which is the focus of this paper.
The contribution of this chapter is an evaluation of 3D keypoint detectors for loop closure in
heterogeneous multi-robot SLAM. Specifically, we seek a keypoint detector that:
1. Finds repeatable keypoints across different sensor types;
2. Finds repeatable keypoints in observations made from different poses;
3. Detects relatively few high-quality keypoints that can be shared even with low-bandwidth
communication; and
4. Detects keypoints efficiently from LIDAR and stereo camera point clouds to enable real-time
SLAM.
This chapter is based on a published paper [11], which was the first comparison of 3D keypoint
detectors for SLAM or visual odometry.
We evaluated 5 commonly used 3D keypoint detectors for these properties using point clouds
from LIDAR scans and stereo camera images in the KITTI dataset [47], which are representative
of measurements made by mobile robots in SLAM applications. We evaluated repeatability of
each detector, measuring how likely it is that a keypoint detected in a scan from one sensor will
be found in a scan from the other sensor, and we considered scans from the same pose and from
different nearby poses. Overall, the evaluations indicate that NARF [111] and KPQ-SI keypoints
[83] have the best relative repeatability. The KPQ-SI detector finds many more keypoints, but has
significantly worse computational performance on all scans. We conclude that both detectors can
work well, but the NARF keypoint detector is the best choice if computational power or bandwidth
is limited.
Figure 3.1: Several views of frame 2181 from track 0 of the KITTI data. (a) Image from the left camera. (b) Point cloud from LIDAR scan, shown from two viewpoints. The area also visible to the stereo camera is shown in green. (c) Point cloud from stereo camera images, shown from two viewpoints. The different properties of the two sensors are visible in the point clouds. LIDAR scans are less dense, but they have little noise and a 360-degree field of view. The camera images have a denser view of part of the scene, but also significant noise along the axis toward the camera. They also have erroneous points in the “sky,” which can challenge keypoint detectors.
3.2 3D Keypoint Detection
In applications where information is extracted from high-dimensional sensor measurements such
as images or LIDAR scans of the environment, it can be helpful to use keypoints rather than con-
sidering the entire scan, which may contain thousands of points. Keypoints are points that contain
most of the relevant information from the measurement while reducing the data that must be stored
to a manageable size.
In feature-based SLAM methods, robot poses are reconstructed using observations of those
keypoints in several images and the correspondences between them. Thus, an important require-
ment for feature-based SLAM is features that are both repeatable, meaning they are found in the
same place over many scans, and unique, meaning that each keypoint looks different from other
keypoints in the same scan. It is common to first extract repeatable keypoints, then compute a
unique descriptor for each keypoint and use this descriptor to find correspondences between im-
ages. With visual sensors, such algorithms often use SIFT, FAST, or ORB keypoints [45, 75, 86,
95, 96].
The keypoints mentioned above apply to 2D sensors such as cameras. But several types of key-
points exist that can be detected in 3D data, such as surface meshes from object models or point
clouds from 2.5D sensors (which make depth measurements from a single viewpoint, e.g., LIDAR
and stereo or RGB-D cameras) [20, 83, 109, 111, 112, 123, 133, 138]. Many were developed for
object recognition, where a 2.5D scan is matched against a known 3D model of the object. In
contrast, in feature-based SLAM, correspondences are found by identifying pairs of matched fea-
tures from individual scans. To this end, we selected 5 keypoint detection methods, which we now
describe in detail, to compare based on the requirements described above and their performance in
other studies. Other methods we considered are also briefly described.
3.2.1 Extensions from Image Keypoints
Common 2D keypoint detectors evaluate functions of the image intensity. Some of those methods
can be adapted to 3D point clouds, either by evaluating the same expressions over volumetric data
or by computing surface normals and evaluating them over the surface. A volumetric model of
the scene cannot be accurately reconstructed from a 2.5D single scan. However, normals can be
computed, so methods evaluated over surfaces are particularly applicable.
One such keypoint type is Harris corners [54]. In 2D, Harris corners are defined at the maxima
of an autocorrelation function that describes how similar a patch of the image is to a slightly shifted
copy of itself. This is an efficient way to detect corners, as patches containing them vary rapidly in
both directions. In the Harris3D detector, a 3-dimensional version of Harris corners, the normal at
each point is computed by fitting a plane to nearby points, then the same autocorrelation operator
is applied on these normals and peaks are selected as Harris3D keypoints [109]. These keypoints
are efficient to compute, making them ideal for robotics applications where computational power
may be limited and real-time performance is required.
3.2.2 Keypoints from Surface Curvature
Intrinsic Shape Signatures (ISS) are a keypoint and descriptor for a region of a 3D surface. The de-
tector finds keypoints with a unique orientation and neighbors with large surface normal variation
in all 3 dimensions [138]. The ISS detector did well in previous comparison studies in efficiency
and repeatability over viewpoint changes [40, 102, 120]. The paper also presents a descriptor
designed for ISS keypoints, making it a promising choice for both extracting keypoints and identi-
fying feature correspondences.
Mian et al. [83] define another keypoint detector, which we refer to here as KPQ (KeyPoint
Quality). In the KPQ detector, points’ neighborhoods are aligned with the local frame and key-
points are selected with the largest ratio in how far that neighborhood patch extends along each
axis. This selects points where the surface has much more curvature in one direction than the
other. A surface is then fitted to the neighborhood points to compute a quality measure. KPQ uses
a fixed neighborhood size, but a scale-invariant method is also described by identifying keypoints
at multiple scales. We refer to the scale-invariant method as KPQ-SI. KPQ-SI also performed well
in previous surveys [102, 120].
3.2.3 Range Image Keypoints
Steder et al. [111] propose a very different approach. Rather than work with the full 3D point
cloud, they reduce the point cloud to a range image by projecting it onto a spherical camera plane
located at the viewpoint and coloring each pixel by its depth. The range image is a distorted view
of the full 360-degree surroundings. If the point cloud is a 2.5D measurement, like the scans in
SLAM scenarios, and the origin used for the projection is the same as the pose where the scan was
collected, no information is lost in this projection. The Normal Aligned Radial Feature (NARF)
keypoint extractor finds points in the range image which are on stable surfaces that are not sensitive
to viewpoint changes, but are still unique [111]. Each point in the range image is assigned an
interest value, which is high when the point’s immediate neighbors (within a fixed neighborhood)
have similar normals, but more distant neighbors have large normal variations. Keypoints are
selected at maxima of this interest value. A descriptor is also proposed for these keypoints, but we
do not evaluate it here. The NARF detector has not been evaluated in previous surveys because
information is lost when projecting a 3D model to a range image, so it is not appropriate for
object recognition applications. However, because SLAM requires only matching between 2.5D
measurements, NARF keypoints are appropriate for SLAM.
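The core of the range-image projection is easy to sketch, even though we use the PCL implementation of the full NARF detector: each point is binned by azimuth and elevation, and each pixel keeps the nearest range seen in that direction. The sketch below is a simplified stand-in for that projection, not PCL's actual code, and the angular resolution is an assumed parameter.

```python
import numpy as np

def to_range_image(points, ang_res_deg=0.5):
    """Project a 2.5D point cloud (sensor at the origin) to a spherical range image.

    points: (N, 3) array of x, y, z in the sensor frame.
    Returns a 2D array where each pixel holds the closest range seen in that direction.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    az = np.degrees(np.arctan2(y, x))                      # azimuth in [-180, 180]
    el = np.degrees(np.arcsin(z / np.maximum(r, 1e-9)))    # elevation in [-90, 90]
    cols = ((az + 180.0) / ang_res_deg).astype(int)
    rows = ((el + 90.0) / ang_res_deg).astype(int)
    img = np.full((int(180 / ang_res_deg) + 1, int(360 / ang_res_deg) + 1), np.inf)
    # Keep the nearest return per pixel, like a depth buffer.
    np.minimum.at(img, (rows, cols), r)
    return img
```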
3.2.4 Keypoints we did not evaluate
Our evaluation does not include 3D keypoints that performed poorly with data similar to ours
in other studies or that have requirements that are unreasonable for robots in a SLAM system.
In particular, the MeshDOG [133], Salient Points [20], Laplace-Beltrami Scale Space [123], and
Heat Kernel Signature [112] detectors require a mesh surface to be fitted to the point cloud. This is
computationally expensive and slow, particularly for the scans of very large areas used in SLAM.
The Local Surface Patches detector [21] identifies points with different surface curvature from their
neighborhood, but it performed very poorly on point clouds from laser scans in other comparison
studies [102, 120].
3.3 Related Work
While each keypoint has been analyzed for repeatability and matching performance in the litera-
ture, the analyses typically use different data and methods. Several studies have directly compared
different types of keypoints. Salti et al. [102] and Tombari et al. [120] evaluated a variety of key-
point detectors for object recognition over 5 datasets. Each dataset had 3D models of objects and
scans of those objects from different viewpoints and in scenes. They compared all keypoint detec-
tors described in Sec. 3.2 except the NARF detector and found that KPQ-SI, MeshDOG, and ISS
had the best performance. ISS was most efficient, but performed poorly when the object was par-
tially occluded. Filipe and Alexandre [40] evaluated keypoints for object detection in an RGB-D
dataset, and also found that ISS keypoints had the best performance. These studies used datasets
with only one sensor, so they could not evaluate matching across sensor types, as is required for
heterogeneous multi-robot SLAM. Additionally, they evaluated matching of 3D models with 2.5D
scans, which applies well to object recognition but less so to SLAM, where models of the scene
are unlikely to exist.
Yu et al. [131] compared several types of keypoint detectors on volumetric data for object de-
tection. They worked with both synthetic and real data. They extended typical 2D image keypoints
like Harris corners and SURF to 3D, and found that MSER features had the best performance but
were inefficient to compute. The conversion to occupancy grids and evaluation over those grids
can be slow, and occupancy is difficult to determine from a 2.5D scan, so these detectors are not
appropriate for SLAM scenarios.
Kostusiak [63] evaluated 2D keypoint detectors and descriptors (e.g., SURF, KAZE, ORB) in
RGB-D data for visual odometry. They used the depth channel only for 3D positions of those
keypoints, which is common in SLAM with RGB-D cameras [90, 104]. While these detectors
perform well in RGB-D SLAM, they require image-based sensors, which may not apply in a
heterogeneous group.
Systems that perform feature-based SLAM or visual odometry typically use 2D image key-
points [45, 86]. Most SLAM systems without visual sensors use dense matching between scans
or LIDAR-specific methods like SegMatch [35], which are sensitive to changes in the sensor
model. Because point clouds from different kinds of sensors have very different properties, dense
matching cannot be used for heterogeneous sensors. To the best of our knowledge, this work [11]
was the first comparison of 3D keypoints for SLAM or visual odometry in the literature.
3.4 Methods
As described in Section 3.2, we compared the 3D keypoint detectors that were reasonable for the
data and computational limitations in heterogeneous multi-robot SLAM: Harris3D, NARF, ISS,
KPQ, and KPQ-SI. We used implementations of Harris3D, ISS, and NARF available in the open-
source Point Cloud Library (PCL) [99]. KPQ and KPQ-SI were implemented in C++ to operate
on PCL point clouds.
We modified KPQ and KPQ-SI to better apply to SLAM. These methods have two parts: an ini-
tial identification step based on the keypoint’s neighborhood, followed by a filtering step in which
a mesh is fit to that neighborhood and used to determine the keypoint’s quality. Surface fitting
a set of points is computationally intensive for large point clouds generated by robotic sensors,
particularly on embedded processors. Instead, we use all the keypoints identified in the first step,
which is equivalent to setting the filtering threshold to 0. Including the selection step results in
fewer features with a larger fraction repeatable, but the full implementation may not be feasible to
perform on robots in real time. Additionally, the KPQ-SI detector requires evaluation of keypoints
at several scales. To limit computational time, we evaluated only 3 scales. Keypoints were detected
at all 3 scales. More scales may have resulted in a larger number of keypoints, but significantly
worse efficiency.
Many of the detectors we evaluate have performed well in past comparison studies for object
recognition, as described in Sec. 3.3. In object recognition, it is important to retain all model
information in order to recognize the object from any viewpoint. However, the NARF detector
works on a range image from a particular pose, so any parts of the point cloud obscured from that
pose are lost. For that reason, the NARF keypoint detector was not evaluated in those comparison
studies. In SLAM, however, observations are 2.5D, thus reduction to a range image does not lose
any information. Therefore, the NARF keypoint extractor is appropriate for SLAM applications,
so it is included in our evaluation.
We evaluate keypoint repeatability for pairs of scans observed using different sensors, from
different viewpoints, or both. That is, for keypoints $p_i^1 \in P^1$, where $P^1$ is the set of keypoints
in image 1, and $p_j^2, p_k^2 \in P^2$, where $P^2$ is the set of keypoints in image 2, we consider the set of
correspondences $C \subseteq P^1 \times P^2$, where $(p_i^1, p_j^2) \in C$ is a pair of detections of the same keypoint.
Keypoints consist of only a location within the scan, with no other identifying information. Thus,
we define keypoints in different scans to be the same point if the Euclidean distance between
their locations in the global frame is less than some search radius $r$. Additionally, we allow each
keypoint from a scan to have only one match in another scan. For example, for keypoint $p_i^1$ in scan
1, if keypoints $p_j^2$ and $p_k^2$ in scan 2 both have distance less than $r$ from $p_i^1$, with $p_j^2$ closer, only the
one that is closer is considered a match. Thus, $p_i^1$ and $p_j^2$ are matched and $p_k^2$ cannot have $p_i^1$ as a
match.
For each keypoint type, we evaluate both absolute and relative repeatability. Absolute repeatability gives the total number of keypoints repeated between scans:

$$r_{abs} = |C|$$
This number depends on the total number of keypoints detected, which may also depend on param-
eters such as scales and thresholds in the detector. A higher number here may be helpful, as more
keypoints will continue to be visible even if part of the scene is obscured. However, a high number
may indicate lower quality keypoints, so future matching using other methods such as descriptors
may be more difficult.
Relative repeatability is the fraction of keypoints that are repeatable:

$$r_{rel}^1 = \frac{|C|}{|P^1|} \quad \text{and} \quad r_{rel}^2 = \frac{|C|}{|P^2|}$$
Since scans from different sensors may have very different numbers of keypoints detected (due to
different resolutions or qualities of the underlying data), we compute relative repeatability for both
sensor types.
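A minimal sketch of this repeatability computation is shown below (not the exact evaluation code used in this chapter): a KD-tree provides the radius search, and a greedy pass over the candidate pairs, closest first, enforces the one-to-one matching described above. The keypoint arrays are assumed to already be expressed in the global frame.

```python
import numpy as np
from scipy.spatial import cKDTree

def repeatability(kp1, kp2, r):
    """Compute r_abs, r_rel^1, r_rel^2 for keypoint arrays (N, 3) in a common global frame."""
    tree = cKDTree(kp2)
    candidates = []
    for i, p in enumerate(kp1):
        for j in tree.query_ball_point(p, r):          # all scan-2 keypoints within radius r
            candidates.append((np.linalg.norm(p - kp2[j]), i, j))
    # Greedily keep the closest pairs first so each keypoint matches at most once.
    used1, used2, matches = set(), set(), 0
    for _, i, j in sorted(candidates):
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            matches += 1
    return matches, matches / len(kp1), matches / len(kp2)
```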
For all experiments, we consider only keypoints in the region visible in both scans.
We do not evaluate the distribution of repeatable keypoints within each scan. This is important
when applying this work, such as in a loop closure application, as well-distributed matches give
better observability of relative pose. If the keypoints are not well distributed, it might be due to
inherent properties of the sensors or asymmetries in the environment. For example, points farther
from a sensor might have more error, making those keypoints more difficult to detect. Before using
SLAM, it would be necessary to calibrate the sensors used to determine any regions of the scans
with larger errors and adjust the system parameters accordingly. Bias may also be introduced when
some parts of the scene have more keypoints; this is commonly overcome by dividing the scan into
regions and searching the regions individually. While these are important concerns in using the
detected keypoints, such an analysis of the sensors and environment is outside the scope of this
chapter.
3.5 Results
To evaluate keypoint repeatability in data representative of robotic applications, we selected several
trajectories from the KITTI odometry dataset [47]. We selected tracks 0, 2, 6, and 9, which gave a
large number of frames with varied scenery and a relatively small number of other moving objects
observed. Track 0 also allowed for evaluations over different views of the same scene, as it has segments where the same area is observed from multiple orientations.

Figure 3.2: Examples of detected keypoints in frame 2181 from track 0. (a) NARF keypoints in the LIDAR scan. (b) KPQ-SI keypoints in the LIDAR scan. (c) NARF keypoints in the stereo camera scan. (d) KPQ-SI keypoints in the stereo camera scan.

Figure 3.3: Repeatability of each keypoint detector between scans taken with different sensors from the same pose. Considering the LIDAR scan as scan 1 and the stereo camera point cloud as scan 2, the plots show $r_{rel}^1$ (left), $r_{rel}^2$ (center), and $r_{abs}$ (right) as functions of the search radius (m). Values shown are means over all poses evaluated.
3.5.1 Scans from same viewpoint
For each time step in the data, we have the true pose and scans from each sensor recorded at that
time. We use the LIDAR scans recorded from a Velodyne HDL-64E LIDAR, and grayscale images
from two Point Grey Flea 2 cameras. Prior to our analysis, some preprocessing was done on the
LIDAR data so that all points in each scan are valid at the timestamp of the corresponding pose.
Additionally, the images have already been undistorted and rectified.
For each track selected, we evaluated the keypoint repeatability between sensors at each frame.
To do this, we created point clouds from the stereo camera scans using the OpenCV 3D recon-
struction library to compute a disparity map between the two images, then project each pixel to a
point in space, with depth determined by the disparity. We then downsampled to a density similar
to the LIDAR point cloud. As shown in Fig. 3.1, the LIDAR and stereo camera point clouds have
very different noise and types of errors.
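A sketch of that reconstruction step with OpenCV is shown below; the SGBM parameters are illustrative rather than the ones used for this evaluation, and the reprojection matrix Q, which would normally come from the stereo calibration (e.g., via cv2.stereoRectify), is filled with placeholder values.

```python
import numpy as np
import cv2

# Rectified grayscale stereo pair (filenames are placeholders).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; parameters here are illustrative, not tuned for KITTI.
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=9)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Q normally comes from the calibration (cv2.stereoRectify); placeholder values here.
fx, cx, cy, baseline = 700.0, 600.0, 180.0, 0.54
Q = np.float32([[1, 0, 0, -cx],
                [0, 1, 0, -cy],
                [0, 0, 0,  fx],
                [0, 0, -1.0 / baseline, 0]])

points = cv2.reprojectImageTo3D(disparity, Q)   # (H, W, 3): one 3D point per pixel
valid = disparity > 0                            # drop pixels with no disparity
cloud = points[valid]
```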
For each pair of scans from the same viewpoint, we detected keypoints in each scan and com-
puted the repeatability as described in Section 3.4; results in Fig. 3.3 are averages over the four
tracks. We evaluate repeatability as a function of search radius, so the plots show, for a given search
radius on the x-axis, what fraction of keypoints in one scan have a matching keypoint within that
search radius in the other scan. Examples of some detected keypoints are shown in Fig. 3.2.

Table 3.1: Mean number of keypoints detected by each method
Scan Type      Harris   ISS   KPQ   KPQ-SI   NARF
LIDAR            92     135   130     250      71
Stereo Camera   173     168   223     318     102
The KPQ-SI detector finds the most features and the NARF detector the fewest (Table 3.1),
resulting in quite different absolute repeatability (Fig. 3.3). Using a search radius below 0.25m,
the NARF detector finds the most repeatable keypoints, but still only finds an average of about
5 keypoints in each frame. For larger search radii, the KPQ-SI detector finds a larger number of
repeatable keypoints. However, fewer keypoints may be preferred for SLAM, as the bandwidth
required for robots to share keypoints and the time required for matching increase with the number
of keypoints. While it may be possible to identify keypoints with better chances of matching (e.g.,
with the KPQ mesh-fitting step), the selection process increases computation time so tradeoffs in
selecting the best keypoints must be considered.
For a small search radius (<0.5m), NARF keypoints have significantly higher relative repeata-
bility than the other keypoint detectors (Fig. 3.3). For larger search radii, KPQ-SI features have
better relative repeatability, but all methods reach similar values. The best search radius depends
on factors like the quality of the point cloud from each sensor, the localization accuracy of each
detector, and the density of keypoints. Noise and localization errors may increase the appropriate
search radius. However, too large of a search radius results in unrelated keypoints being matched.
3.5.2 Scans from different viewpoints
In previous keypoint analyses for tasks such as object detection, keypoints from scans of scenes are
compared against a full 3-D model. In that case, while only some regions of the object are visible
in the scan, all of those regions will be visible in the model.

Figure 3.4: Sample frames from track 0 (top) and track 6 (bottom).

In SLAM applications, however,
differing poses for two scans may lead to limited or no overlap of the visible parts of the scene.
Comparing two scans from the same pose, as in Section 3.5.1, guarantees that the same parts of
the scene are visible, but this is not realistic for multi-robot SLAM. Even if two robots can detect
a loop closure, they may still not be close enough to assume that the scene is observed from the
same pose. To understand how the system might perform in such scenarios, we evaluated keypoint
repeatability in different frames, selecting sections of tracks 0 and 6 for cross-frame comparison.
Based on the repeatability observed between scans from the same sensor, we select 0.5m as a
search radius. We evaluated repeatability between scans measured in nearby frames and in an area
where the track intersected itself. Track 0 is in a crowded neighborhood, with many buildings and
cars close to the street, while track 6 is on a larger street, where any visible buildings and cars are farther
away (Fig. 3.4 shows sample images). Thus, the point clouds in track 6 cover a much larger area
than those in track 0.
For matching between nearby frames, we chose sections of tracks 0 and 6 where the track
is straight and the vehicle has a near-constant velocity, so difference in frame numbers can be
associated with a displacement distance. In track 0, we used frames 0-80 and 3770-3870; the
vehicle averages 1 meter of forward motion per frame. In track 6, we used frames 0-180 and
330-615; the vehicle averages 1.2 meters of forward motion per frame. All poses had nearly the
same orientation (cameras facing forward, in the direction of motion). We measured absolute
repeatability for scans separated by up to 20 meters in track 0 and 12 meters in track 6 (Fig. 3.5).

Figure 3.5: Absolute repeatability of each keypoint detector between LIDAR and camera frames from different poses in tracks 0 (left) and 6 (right). The distance between poses is given on the x-axis. Repeatability decreases as the distance increases, both because the area covered by the scans has less overlap and because the viewpoints for the scans differ more.
For comparison, we also computed repeatability for scans from the same sensor in these pose
pairs, shown in Fig. 3.6. Though most of the keypoint detectors have not previously been evaluated
in outdoor scenes of this scale, performance is similar to that reported in other comparison studies
[120].
Over the pairs of scans with the same orientation, both the NARF and KPQ-SI detectors find
a significant number of repeatable keypoints. The KPQ-SI detector again has a higher absolute
number, but the relative repeatabilities are similar. As expected, the number of repeatable keypoints
decreases as the distance between the frames compared increases. However, even at a distance of
10 meters, the NARF detector still typically finds about 10 matching keypoints, and the KPQ-SI
detector finds 20-30. This is enough to compute point cloud registration and indicates that loop
closure would be possible between different 3D sensors using these detectors.

Figure 3.6: Absolute repeatability of each keypoint detector between scans from the same sensor from different poses.
In the area where the track crossed itself, we selected two segments where the same area was
visible but the vehicle had different orientations. We used track 0, frames 2326-2340 and 3255-
3269; here, the vehicle encounters an intersection on one road and later returns to it on another
road. The average numbers of repeatable keypoints between LIDAR scans in the first segment and
stereo camera frames in the second are given in Table 3.2. While all detectors find some repeatable
keypoints, only NARF and KPQ-SI find enough to consistently estimate relative poses and perform
loop closure.
Table 3.2: Mean repeatable keypoints between LIDAR and stereo camera scans from poses with
different orientations
Harris ISS KPQ KPQ-SI NARF
3 5 5 28 11
Table 3.3: Mean time to extract keypoints (in seconds)
Scan Type Harris ISS KPQ KPQ-SI NARF
LIDAR 7.77 25.9 31.0 48.4 0.22
Stereo Camera 0.24 0.51 0.68 1.05 0.05
We also evaluated the computational performance of each detector. We randomly selected 100
frames from the four tracks and measured the mean time for each detector to find keypoints in each
type of scan; Table 3.3 reports measured time. The evaluations were performed on an Intel Xeon
3.5 GHz processor with 32 GB of RAM, and all methods were written in C++ to interface with
PCL. All detectors were significantly slower on the LIDAR point clouds; this can be somewhat
mitigated by only searching the areas visible in both scans. However, NARF was the only detector
that could perform keypoint detection in real time for a system making scans at 1 Hz, a standard
rate for outdoor SLAM systems.
Overall, the NARF and KPQ-SI detectors find keypoints with the highest repeatability between
scans from different sensors. This suggests that they are good choices for heterogeneous SLAM
applications. The NARF detector is very efficient, but its repeatability decreases more quickly with
increased distance between poses (Fig. 3.5). Thus, it is likely the best choice for most applications.
If high bandwidth for inter-robot communication and significant computing power are available, it
may be better to use the KPQ-SI detector, particularly if keypoint quality can be estimated.
3.6 Conclusion and Future Work
In this chapter, 3D keypoint detectors were evaluated for repeatability of detected keypoints in
point clouds from heterogeneous sensors for multi-robot SLAM. This work was the first compari-
son of 3D keypoints for SLAM or visual odometry. The Harris3D, ISS, KPQ, KPQ-SI, and NARF
detectors were evaluated, comparing their performance on LIDAR and stereo camera scans from
the KITTI dataset. NARF and KPQ-SI had the highest relative repeatability, with both detectors
finding about 10% repeatable keypoints in scans from different sensors in the same pose. While
KPQ-SI found more keypoints overall, it had the lowest computational efficiency. In scans observ-
ing the same area with different sensors and in different poses, NARF and KPQ-SI consistently
found enough matching keypoints to perform point cloud registration and find loop closures, while
the other methods did not. For building a feature-based heterogeneous multi-robot SLAM system,
NARF is the best choice of keypoint if computing power or bandwidth are limited. If they are not,
KPQ-SI may be a better choice due to the higher absolute number of keypoints.
This work only identified keypoints in scans and images from heterogeneous sensors. To use
those keypoints in multi-robot SLAM, it would still be necessary to use a descriptor to describe
those keypoints, identify matching features, and estimate the relative poses between the sensors.
The paper presenting NARF keypoints also includes a descriptor [111], which may be a good
choice, or a number of other 2D and 3D descriptors exist that could be applied to NARF keypoints
in point clouds or in range images. The next step toward building a multi-robot SLAM system
using NARF keypoints would be an evaluation of these descriptors in SLAM-like data for precision
and recall to identify one that works well.
Beyond the details of exactly how many keypoints were detected, this work showed more
fundamentally that scans and images from different types of sensors contain information about the
structure of the scene, even if the scans have very different sensor and noise models. Some of the
keypoint detectors evaluated are very sensitive to aspects of the sensor model like point density
and error distribution, and those detectors do not tend to find keypoints that are repeatable in scans
from different sensors. But other detectors are more invariant to those differences, and can be used
to identify that two scans show the same scene. The NARF detector is particularly relevant here,
as it completely removes parameters like point density in the conversion to a range image.
In any group of heterogeneous robots, the robots are very likely to have different sensors,
whether those sensors use completely different modalities like the ones studied in this chapter
or are simply cameras, LIDARs, or other sensors with different error models. In any of those
cases, finding loop closures and merging maps requires a deeper understanding of the scene than
simply finding some interesting pixels, and learning how to find the same parts of the scene in
scans or images from different sensors is a major step toward a system that takes advantage of that
understanding. In the next chapter, we discuss how an understanding of the scene structure lets us
detect objects and use them to find loop closures between very different types of robots.
Chapter 4

Loop Closures between Robots with Different Perspectives
4.1 Introduction
In multi-robot SLAM, loop closures, or events where two robots are in the same place or observe
the same scene, are used to merge the maps built by each robot. These shared observations can
be used to estimate the robots’ relative poses, allowing them to transform their maps to the same
coordinate frame and combine them. In typical SLAM systems, this is done by finding observations
of the same scene [25, 132] or features [34, 106], or by making inter-robot measurements [14].
However, these methods typically require robots to view the scene from the same perspective
or even to be in the same place at the same time. Those methods that work for robots viewing
the scene from different perspectives require assumptions that the environment is man-made or
structured and has well-defined objects visible [24, 46, 91].
Due to the constraints imposed by these assumptions, exploration in unstructured environments
is not always possible with current multi-robot SLAM methods. Consider two robots performing
planetary exploration in an unknown area, such as Perseverance, a rover, and Ingenuity, a UAV.
While these two robots do not operate together autonomously, a pair of similar robots may do so
on a future mission. In that scenario, an aerial robot might first fly through an area to be explored,
make observations, and build a map. Later, the ground-based rover might use that map to navigate
to an area of scientific interest. During navigation, we would like the rover to localize itself against
the map. Additionally, if it views areas that were not included in the map, it should be able to
estimate its relative pose well enough to add those new observations into the map. However, the
rover views the environment from a different perspective than the aerial robot, and since there will
not be any features or structured objects to recognize, the rover will not be able to localize with
respect to the aerial robot’s map using current state of the art methods.
This challenge is not limited to the case of an aerial and ground robot working together. It
applies generally whenever two or more robots operate in an unstructured environment. In the ab-
sence of known objects, the robots must recognize that they are viewing the same scene to find loop
closures. In many cases, the robots must visit exactly the same place to recognize loop closures,
which both reduces the advantage of having multiple robots and may be infeasible, especially in
unstructured environments that do not have previously-built roads or paths. These constraints in
finding loop closures are particularly challenging in systems including small robots. For example,
when working with the PUFFERs (described in Chapter 6), it is common that the robots’ physical
trajectories will come within centimeters of each other, or even cross, but image processing will
not identify a visual loop closure. Because the cameras are so close to the ground, the scene is
observed at a very small scale, and often even robots that are close together do not observe the
same details on small rocks. Instead of depending on finding features that may not be observed, it
would be helpful for the robots to be able to find loop closures when viewing the same scene from
different places, allowing much more freedom while still allowing them to share and merge maps.
In these scenarios where two robots view a scene from different perspectives, it is helpful
to consider how a human might localize themselves in a similar situation. A human lost in an
unstructured environment with access to one or more images of the area would immediately begin
looking for landmarks. They might identify a unique-looking rock or rock cluster, or look at terrain
variation to identify ridges. If they found one or more rocks in a unique pattern, they would use
that to place themselves in the image and then would be able to use that image to navigate. Even
without any priors on what the environment looked like, they would begin by finding objects like
rocks or ridges to use as reference points. This work aims to develop a method for robots to localize
themselves using similar techniques.
In this chapter, we present a method for finding loop closures and estimating relative pose be-
tween two views of the same scene by identifying groups of unstructured objects. We demonstrate
the method here with rocks detected in a planetary exploration environment, but it could be applied
to any other feature that can be identified, such as ridges, dunes, or craters. This method assumes
that the two views include observations of several rocks, with at least four visible in both views.
The rock observations can be detected using 3D reconstruction, a learned or edge-based detector,
or extracted by hand from a map. Using combinatorial optimization, we identify correspondences
that match the rocks based on both appearance and spatial distribution, allowing the relative pose
between the two views to be estimated. Unlike most feature-based loop closure techniques, this
method includes outlier rejection during the optimization, removing the need for a later outlier
rejection step. Two methods are presented for performing the optimization that give equally good
results; each may be preferable in certain scenarios. Finally, we demonstrate the method’s perfor-
mance on simulated rock fields and data from the Katwijk Planetary Rover dataset [55].
This chapter is based on a submitted paper that is currently in review [12], which is a novel use
of optimization methods for matching from different perspectives and the first use of the ILP for-
mulation for feature matching. The work was also presented at the ICRA 2022 workshop "Robotic Perception and Mapping: Emerging Techniques" [13].
4.2 Related Work
In general, inter-robot loop closures are found by matching features [34, 106, 124], scenes [25,
132], or objects [24]. Features and scenes are both dependent on being viewed from roughly
the same orientation: SIFT [75] and ORB [96] feature descriptors are invariant up to about 50
degrees viewpoint change. Object-based SLAM is not dependent on viewpoint, as objects can be
recognized from any perspective, but it does depend on objects in the environment being known
and having models. None of these methods work well for robots with different perspectives in
unstructured environments.
There have been some attempts to find loop closures with groups of known objects, particularly
in environments that tend to have the same objects repeated. Ramtoula et al. [91] create constella-
tions of objects like chairs and cups in an office area, then match those constellations based on the
distribution of objects. Gao et al. [46] find objects like traffic lights in driving scenes, build graphs
connecting those objects, then find matches by identifying similar triangles in the graphs. Both of
these works are in a man-made environment, where the robots are recognizing known objects.
In stationary environments, feature matching is typically done in two steps, by first matching
the features by appearance, then later rejecting outliers (with a few exceptions that use graph
matching [61, 137]). However, graph matching approaches that consider appearance and spatial
information together are more common when matching non-identical objects, such as in object
detection [71], facial recognition [7], or motion capture applications [22, 126]. These problems
involve features that differ more in appearance, but have stronger expectations that nearby features
should appear together in every image. There are typically many features, making the graph very
large, so solutions often partition the optimization to only search a few features at a time [121].
This is possible in these scenarios because more distant features do not need to have the same
spatial relationship. For example, in motion capture, all of the features on an arm or leg in one
image should also be on the same arm or leg and therefore close together in another image, but
the two arms may be in different positions and features located on different arms may have moved
relative to one another. This decomposition does not apply to rock matching because we still
require spatial consistency between rocks that are not close together.
Graph methods set up feature matching as a Maximum Weighted Clique Problem (MWCP),
which has been thoroughly explored in the combinatorial optimization literature; see Hosseinian et
al. [56] for a full discussion and Gouveia and Martins [48] for details of the formulation. Typically
only the Maximum Edge Weight Clique Problem is discussed, but the same optimization formulation applies when both node and edge weights are considered [110]. We do not consider any of
the MWCP-specific relaxations in the literature due to our graph structure varying greatly depend-
ing on the observations, but these could be considered in the future to improve our algorithm’s
performance.
4.3 Problem Formulation
Consider two robots exploring an unknown, unstructured environment. We assume that the robots
both have cameras or other high-dimensional sensors, such as 3D LIDARs. Using these sensors,
the robots both make observations of the world from which regions of interest can be detected.
In this work, we assume that those regions of interest are rocks, which can be detected from the
images or point clouds. Rock or region detection is commonly done during robot operations,
either using AI tools like AEGIS [42], which segments regions of a specific color or texture on
Mars rovers, or by segmenting point clouds [128]. These techniques can identify rocks in the
observations and detect their properties, such as size and color.
Assume that one robot observes an image or point cloud containing the set of rock observations $M$. A second image or point cloud is observed, either by a different robot or by the same robot at a different time, which contains the set of rock observations $N$. We denote the rock observations as $m_i \in M$ and $n_j \in N$. When we refer to more than one rock observation in the same image, the second one is denoted $m_{i'} \in M$ or $n_{j'} \in N$ to avoid confusing indices between the two images. Each rock observation has a 3D position relative to its own camera frame, but the relative pose between the two camera frames is unknown.
We seek a set of correspondences $\mathcal{M} \subseteq M \times N$, where $(m_i, n_j) \in \mathcal{M}$ represents observations of the same physical rock. Each rock observation should be used at most once in $\mathcal{M}$. Each pair of rock observations in this set should be spatially consistent: the two observations should be of rocks with similar physical properties, and any two pairs selected should describe rock observations that are the same distance apart. By selecting pairs of observations with the same appearance and distances between them, $\mathcal{M}$ is likely to describe two different observations of the same pattern of a group of rocks.
4.4 Algorithm
To find a set of correspondences, a naive approach would be to set up a graph with each rock observation (e.g., $m_i$) as a node, and select the edges connecting observations that should match. This is a straightforward assignment problem, and matches of rock observations by appearance can be easily identified with the Hungarian method [64]. However, this approach does not allow us to consider whether the corresponding observation pairs are spatially consistent.
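For illustration, a minimal sketch of this appearance-only baseline is shown below, using SciPy's implementation of the Hungarian method; the function name and the use of size difference as the matching cost are assumptions made for the example, not part of the thesis implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def appearance_only_matching(size_m, size_n):
    """Naive baseline: match rock observations by size similarity alone."""
    cost = np.abs(size_m[:, None] - size_n[None, :])  # appearance cost = size difference
    rows, cols = linear_sum_assignment(cost)          # Hungarian method (min-cost assignment)
    return list(zip(rows, cols))
```

Because the assignment only looks at per-rock appearance, two rocks of similar size can be matched even when their neighbors disagree spatially, which is exactly the failure mode the formulation below avoids.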
Instead, we consider the dual of this optimization, in which the set of correspondences can be
found by solving a Maximum Weighted Clique Problem (MWCP) [121].
4.4.1 Rock Matching as a Maximum Weighted Clique Problem
For the MWCP approach, we construct a graph $G = (V, E)$ from the two sets of rock observations. An example of the graph construction is shown in Fig. 4.1, and we describe it in detail here. Each node $v_{i,j}$ in the graph represents a possible correspondence between observation $m_i \in M$ and observation $n_j \in N$. Node $v_{i,j}$ has weight $w_{i,j}$, which describes the appearance similarity between observations $m_i$ and $n_j$. A pair of observations that look very similar has a high weight because they are more likely to describe a true correspondence.
The graph has undirected edges $e_{i,j,i',j'}$, which connect $v_{i,j}$ and $v_{i',j'}$. The edges have weights $c_{i,j,i',j'}$, describing the spatial agreement of nodes $v_{i,j}$ and $v_{i',j'}$. A higher edge weight indicates that the two potential correspondences are consistent, or could appear together in a pair of images. For example, in a pair of consistent correspondences, we would expect that observations $m_i$ and $m_{i'}$ in image $M$ are separated by the same distance as observations $n_j$ and $n_{j'}$ in image $N$.
Using this graph, we want to find the clique with maximum weight (including both vertex and
edge weights), which is equivalent to finding the most consistent set of correspondences. To see
that this is true, consider what a clique means in the graph. A clique is a fully connected set
of nodes, so in this graph, it describes a set of correspondences between rock observations from
each image. Those nodes should have maximum vertex weight, meaning that each correspondence
matches two observations of rocks with similar appearance. Additionally, every vertex must have
an edge to every other vertex, and those edges should have the maximum weight, meaning that
every pair of observations selected is compatible with every other pair. Because the graph does not
include edges that are impossible, such as those that would link two correspondences including the
same observation, we know that the entire clique describes a spatially consistent set of correspon-
dences and no further outlier rejection is required. An example of the complete constructed graph
and the clique describing the best set of correspondences is shown in Fig. 4.1.
Above, we stated that node weights are set based on appearance similarity and edge weights
are set based on spatial compatibility. We use measured size in the largest dimension to describe
appearance. Due to the different perspectives of each view, more detailed measures such as features
or rock shape are not appropriate. To evaluate spatial compatibility, we use measured distance
between two rock observations. Both weights are implemented following the simple divergence
described in Leordeanu and Hebert [71]:
\[
w_{i,j} =
\begin{cases}
C - \dfrac{(\ell_i - \ell_j)^2}{\theta_\ell^2}, & \text{if } |\ell_i - \ell_j| < 3\theta_\ell \\[4pt]
0, & \text{otherwise}
\end{cases}
\]
and
\[
c_{i,j,i',j'} =
\begin{cases}
C - \dfrac{(d_{ii'} - d_{jj'})^2}{\theta_d^2}, & \text{if } |d_{ii'} - d_{jj'}| < 3\theta_d \\[4pt]
0, & \text{otherwise.}
\end{cases}
\]
As shown in Fig. 4.1, $\ell_i$ is the measured size of observed rock $m_i$, $d_{ii'}$ is the distance between observations $m_i$ and $m_{i'}$, which are observed in the same image, and $\theta_\ell$ and $\theta_d$ are the standard deviations of those measurements. We use a constant value of $C = 9$, making these weights continuous. They reach a maximum when the pairs of rocks have the same size or the same distance in both images.
Figure 4.1: An example of the graph for two images, a forward view from a rover (a) and an overhead view (b), with the rover's pose and view marked by the camera on the right side. The graph structure is shown in (c), with all edges connected to node $v_{2,2}$ shown. This node describes a potential correspondence between observation $m_2$ and observation $n_2$. It is connected to all other nodes except those including $m_2$ or $n_2$, as two correspondences using the same observation cannot be selected together. All other nodes are similarly connected; these edges are not shown for readability. The clique containing the correct correspondences, which has maximum weight, is highlighted in (d).
If the difference in distance is so large that an edge weight would be 0, i.e., $|d_{ii'} - d_{jj'}| \geq 3\theta_d$, those correspondences are not compatible and we remove the edge between the two nodes. This results in graphs that vary between being almost fully connected and very sparse, depending on the distribution of rocks in the scene and how accurately the errors in the rock observations are modeled.
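As an illustration of this construction, the sketch below builds the candidate-correspondence nodes and the pairwise edge weights from two sets of rock observations, following the weight form given above. The function and parameter names are assumptions for the example, only appearance-compatible pairs are kept as nodes, and the unmatched nodes described below are omitted for brevity.

```python
import numpy as np
from itertools import product

def build_graph(pos_m, size_m, pos_n, size_n, theta_l=0.1, theta_d=0.12, C=9.0):
    """Build candidate-correspondence nodes and pairwise edge weights.

    pos_m, pos_n: (num_rocks, 3) arrays of rock positions in each camera frame.
    size_m, size_n: arrays of measured rock sizes (largest dimension).
    Returns the node list (i, j), the node weights w, and a dict of edge weights c.
    """
    nodes, w = [], []
    for i, j in product(range(len(size_m)), range(len(size_n))):
        dl = size_m[i] - size_n[j]
        if abs(dl) < 3 * theta_l:                 # appearance-compatible pair
            nodes.append((i, j))
            w.append(C - dl**2 / theta_l**2)

    edges = {}
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            i, j = nodes[a]
            ip, jp = nodes[b]
            if i == ip or j == jp:                # same observation used twice: no edge
                continue
            dd = np.linalg.norm(pos_m[i] - pos_m[ip]) - np.linalg.norm(pos_n[j] - pos_n[jp])
            if abs(dd) < 3 * theta_d:             # spatially compatible: keep the edge
                edges[(a, b)] = C - dd**2 / theta_d**2
    return nodes, np.array(w), edges
```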
It is likely that at least one rock detection in one image will not have a match in the other
image. This may occur if a rock is outside of the frame or obscured by another rock, or because
noise causes the rock detection algorithm to identify something in one frame that is not visible
from the perspective of the other frame. In these cases, we do not want to erroneously match that
rock detection to another rock just to have a match, or to fail because there are no possible matches.
To allow for this scenario, we add one node to the graph for each rock detection representing that
rock being unmatched; see, for instance, node $v_{3,\varnothing}$ in Fig. 4.1. These nodes have edges to all
other compatible nodes in the graph, with weights such that they will not generally be selected if a
correct match is possible, but will be selected over an incorrect match. An example of a scenario
in which some rock observations are unmatched, showing the clique describing the best set of
correspondences, is shown in Fig. 4.2.
The MWCP can be solved with an integer optimization. Below, we describe a quadratic
method, in which the nodes giving the maximum weight clique are selected, and a linear method,
in which both nodes and edges are selected and additional constraints force the solution to form a
clique. To describe these methods clearly, we introduce the shorthand $k = (i, j)$ and $l = (i', j')$. Thus, $v_k$ and $v_l$ refer to nodes in the graph of potential correspondences, and $e_{k,l} = e_{i,j,i',j'}$.
4.4.2 IQP Optimization
A straightforward way to solve the MWCP described is by writing the problem as a quadratic
optimization:
Figure 4.2: An example of two images in which some of the observations do not match. The rover has moved forward, shown by its new pose in the overhead view, and the rock in the foreground is no longer visible. The forward image (a) and the overhead image (b) each contain an observation ($m_3$ and $n_3$, respectively) that does not have a match in the other image, so they should be left unmatched. The graph has the same structure as in Fig. 4.1, but the correct maximum weight clique for this pair of images is shown in (c).
\[
\begin{aligned}
\underset{x}{\text{maximize}} \quad & x^\top Q x \\
\text{subject to} \quad & \sum_{m_i \in M} x_{i,j} \leq 1, \quad \forall\, n_j \in N \\
& \sum_{n_j \in N} x_{i,j} \leq 1, \quad \forall\, m_i \in M \\
& x_k \in \{0, 1\}.
\end{aligned}
\tag{4.1}
\]
In this problem, $x_k = x_{i,j}$ are binary variables describing whether node $v_k$ is selected in the clique. $Q$ is the affinity matrix for the graph: it is a symmetric matrix where element $q_{kl}$ gives the weight $c_{k,l}$, or $w_k$ if $k = l$. The constraints prevent the selection of nodes that are not connected, thus enforcing that the selected nodes form a clique.
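As a concrete illustration, the sketch below poses Eq. (4.1) with the Gurobi Python interface. It assumes the node list and affinity matrix $Q$ from the graph construction above; the function name and parameter settings are illustrative, not the thesis implementation.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_iqp(nodes, Q):
    """Select maximum-weight-clique nodes by maximizing x^T Q x (Eq. 4.1).

    nodes: list of candidate correspondences (i, j).
    Q: symmetric affinity matrix with Q[k, k] = w_k and Q[k, l] = c_{k,l} (0 if no edge).
    """
    n = len(nodes)
    model = gp.Model("rock_iqp")
    model.Params.NonConvex = 2  # allow the non-convex binary quadratic objective
    x = model.addVars(n, vtype=GRB.BINARY, name="x")

    # Quadratic objective: sum over all node pairs of Q[k, l] * x_k * x_l.
    model.setObjective(
        gp.quicksum(Q[k, l] * x[k] * x[l] for k in range(n) for l in range(n)),
        GRB.MAXIMIZE)

    # Each rock observation may appear in at most one selected correspondence.
    for obs in {i for i, _ in nodes}:
        model.addConstr(gp.quicksum(x[k] for k, (i, _) in enumerate(nodes) if i == obs) <= 1)
    for obs in {j for _, j in nodes}:
        model.addConstr(gp.quicksum(x[k] for k, (_, j) in enumerate(nodes) if j == obs) <= 1)

    model.optimize()
    return [nodes[k] for k in range(n) if x[k].X > 0.5]
```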
This quadratic formulation is the method used in other optimization-based feature matching applications. The weights
often include terms to enforce groups of nearby features to be matched together [121], or perform
the matching hierarchically by first matching small regions of each image and then combining
them [126]. However, these approaches based on groupings of features are designed for problems
like deformable object matching, with many features that are more consistently grouped together.
Thus, these grouping techniques do not apply to rock matching.
4.4.3 ILP Optimization
The IQP optimization described above uses the fewest possible variables, but is not usually convex
so cannot always be solved efficiently. An alternative approach is to rewrite the optimization as an
ILP. The ILP objective is linear, so it can be solved more efficiently. However, nodes and edges are
selected separately, so additional constraints are needed to make sure that they form a clique. The
resulting ILP is:
\[
\begin{aligned}
\underset{x,\,y}{\text{maximize}} \quad & \sum_{v_k \in V} w_k \, x_k \;+\; \sum_{e_{k,l} \in E} q_{kl} \, y_{kl} \\
\text{subject to} \quad & y_{kl} - x_k \leq 0, \quad y_{kl} - x_l \leq 0, \quad \forall\, e_{kl} \in E \\
& x_k + x_l - y_{kl} \leq 1, \quad \forall\, e_{kl} \in E \\
& x_k + x_l \leq 1, \quad \forall\, k, l \ \text{s.t.}\ e_{kl} \notin E \\
& x_k, \, y_{kl} \in \{0, 1\}.
\end{aligned}
\tag{4.2}
\]
Here, $x_k = x_{i,j}$ are again binary variables describing whether node $v_k$ is selected, and $y_{kl} = y_{i,j,i',j'}$ are binary variables describing whether edge $e_{kl}$ is selected. The constraints are constructed to ensure that the set of nodes and edges selected forms a clique: if an edge is selected, both nodes it connects must also be selected, and if a pair of nodes is selected, the edge between them must also be selected. This follows formulation F1 from Gouveia and Martins [48], which applies to graphs with some edges missing and does not make assumptions about the size of the maximum clique, both properties we would like for our solution.
Experiment type                  | Number of observations with matches | Number of tests | Successful IQP tests | Successful ILP tests | Successful ICP tests
Noise only                       | 5                                    | 10              | 10                   | 10                   | 1
Noise only                       | 8                                    | 10              | 10                   | 10                   | 1
Noise only                       | 12                                   | 10              | 10                   | 10                   | 2
Extra observations in one view   | 5                                    | 10              | 10                   | 10                   | 2
Extra observations in one view   | 8                                    | 10              | 10*                  | 10*                  | 1
Extra observations in both views | 5                                    | 10              | 10                   | 10                   | 2
Extra observations in both views | 8                                    | 10              | 10                   | 10                   | 2
Table 4.1: Results from the simulated experiments, showing the number of successful runs for each type/size of experiment. We define a successful run as one in which at least 4 simulated rock correspondences are found. Both the IQP and ILP optimizations were successful on all simulated data, while the ICP comparison failed on most runs. Note that all correspondences were found in all successful runs except those marked with *, in which one correspondence was not matched. In those runs, all other matches were found and no incorrect correspondences were selected, so they were still successful.
Though the ILP has many more constraints than the IQP, it is linear and so can be solved
efficiently as the size of the graph grows. To the best of our knowledge, such a technique has not
been used before for image or feature matching.
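To make the linearization concrete, a sketch of Eq. (4.2) in the Gurobi Python interface is shown below; as before, the data structures and names are assumptions for the example rather than the exact implementation used in the experiments.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_ilp(nodes, w, edges):
    """Linearized MWCP (Eq. 4.2): select nodes x_k and edges y_kl jointly.

    nodes: list of candidate correspondences (i, j); w: node weights.
    edges: dict {(k, l): c_kl} of spatially compatible node pairs, with k < l.
    """
    n = len(nodes)
    model = gp.Model("rock_ilp")
    x = model.addVars(n, vtype=GRB.BINARY, name="x")
    y = model.addVars(list(edges), vtype=GRB.BINARY, name="y")

    model.setObjective(
        gp.quicksum(w[k] * x[k] for k in range(n))
        + gp.quicksum(c * y[k, l] for (k, l), c in edges.items()),
        GRB.MAXIMIZE)

    for (k, l) in edges:
        model.addConstr(y[k, l] <= x[k])             # a selected edge needs both endpoints
        model.addConstr(y[k, l] <= x[l])
        model.addConstr(x[k] + x[l] - y[k, l] <= 1)  # two selected nodes force their edge
    for k in range(n):
        for l in range(k + 1, n):
            if (k, l) not in edges:                  # incompatible pairs cannot coexist
                model.addConstr(x[k] + x[l] <= 1)

    model.optimize()
    return [nodes[k] for k in range(n) if x[k].X > 0.5]
```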
4.4.4 Using Matched Rocks for Loop Closures
So far, we have only discussed the problem of finding correspondences between rocks. However,
to use these correspondences as loop closures, we also need to find the robots’ relative poses. If a
group of rocks is matched correctly, this is equivalent to identifying several pairs of image points,
each representing the same point in the real world. In general, the relative pose can be found by
estimating the fundamental matrix between the two images, which requires 8 matched points. If
the rocks are known to be coplanar, as is common in planetary exploration, this can be done using
the homography matrix, which requires only 4 points. Thus, we consider a pair of matched images
with at least 4 correctly matched rocks to be successful.
In typical image matching applications, the relative poses between two images are found by
estimating the fundamental matrix from all feature pairs using an outlier rejection method like
RANSAC [41]. RANSAC relies on the assumption that a large fraction of the matches will be
correct, allowing outliers to be identified and discarded. In the rock matching problem, there may
not be enough matches found to allow for some to be safely discarded (or for RANSAC to run at
all). However, our method validates spatial consistency during the optimization, so no additional outlier rejection is required.
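As a sketch of this relative pose step, the snippet below estimates a homography from matched image points with OpenCV and decomposes it into candidate rotations and translations. The OpenCV functions are real, but the wrapper, its arguments, and the assumption of known intrinsics are illustrative.

```python
import cv2
import numpy as np

def relative_pose_from_matches(pts_a, pts_b, K):
    """Estimate relative pose from matched rock image points.

    pts_a, pts_b: (N, 2) arrays of matched pixel coordinates, N >= 4.
    K: 3x3 camera intrinsic matrix.
    Assumes the rocks are (approximately) coplanar, so a homography suffices.
    """
    H, _ = cv2.findHomography(pts_a.astype(np.float64), pts_b.astype(np.float64), 0)
    # Decompose the homography into candidate rotations/translations (up to scale);
    # the physically plausible solution is then chosen from these candidates.
    _, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)
    return rotations, translations, normals
```

When the coplanarity assumption does not hold and 8 or more matches are available, the fundamental matrix (e.g., via `cv2.findFundamentalMat`) would be estimated instead, as noted above.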
4.5 Experiments and Results
4.5.1 Simulation
We first demonstrate the performance of these algorithms on simulated data. We generate a variety
of simulated rock fields, add realistic errors, and create views of them from different perspectives,
then use both optimization methods to find correspondences between rocks in the two views.
We simulate rock fields with rocks of random sizes placed at random positions. The rock
sizes are uniformly distributed between 0.5 and 2.0 meters, and the rocks’ positions are uniformly
distributed within a 40 meter square area. We assume that the ground is flat, as in the real data in
later experiments, so no height estimate is added to the positions. However, the only assumption
made about height in the algorithms is that only 4 matches are required because the rocks are
coplanar. In scenes with significant elevation change, more matches would be required to estimate
relative pose. The simulated dataset includes 30 rock fields, of sizes 5, 8, and 12 rocks. These
parameters give rock fields similar to those observed in real data, described in Section 4.5.3.
In the first set of experiments, both views use observations of exactly the same rocks, with
realistic noise added. For each rock field, the generated points are used as the first view, as they
describe how it would be observed by an overhead camera. The second view is created by randomly
generating a rotation matrix, then rotating the points by that matrix. Finally, normally-distributed
noise is added to the positions and rock sizes in each view. The rock positions have σ = 0.12 m, and the rock sizes have σ = 0.1 m. In this experiment, all correspondences were identified correctly in
all sizes of rock field by both the IQP and ILP formulations. This indicates that our method works
very well in scenarios where all rock observations can be matched.
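A minimal sketch of how such a simulated test can be generated is shown below, with the noise levels stated above; the function names and the choice of a planar (2D) rotation are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_rock_field(num_rocks, area=40.0):
    """Random rock field: positions uniform in an area x area square, sizes 0.5-2.0 m."""
    positions = rng.uniform(0.0, area, size=(num_rocks, 2))
    sizes = rng.uniform(0.5, 2.0, size=num_rocks)
    return positions, sizes

def make_second_view(positions, sizes, pos_sigma=0.12, size_sigma=0.1):
    """Rotate the field by a random angle and add measurement noise to the second view."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    view_positions = positions @ R.T + rng.normal(0.0, pos_sigma, positions.shape)
    view_sizes = sizes + rng.normal(0.0, size_sigma, sizes.shape)
    return view_positions, view_sizes
```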
Including matches for all rock observations, however, is not a realistic assumption for real data.
It is likely that at least one view will include an observation that is not matched in the other view,
either because the rock is occluded or out of the frame, or because it is an artifact from image
processing. In either case, the algorithm should match the observations of all rocks that are visible
in both frames successfully. To evaluate this, in the second experiment, 5 rock observations were
added to one view. They described rocks with the same sizes/distribution as the original field, and
should have been left unmatched after optimization. In the third experiment, 5 unmatched rock
observations were added to both views.
Even when additional observations were added to one or both views, both the IQP and ILP
formulations successfully found nearly all correspondences. As at least 4 matches are needed to estimate relative pose, we define a successful run as one in which 4 or more correspondences are matched
correctly. All of the simulated experiments were successful, and in fact far exceeded this standard.
Out of all observations in all experiments, only one match was missed, because the noise added to
that rock was particularly large and our graph construction algorithm rejects any correspondence
with too much error. However, 7 of the 8 correspondences in that run were found successfully, so
the run was still successful, and no incorrect correspondences were identified in any experiment.
A summary of the results is shown in Table 4.1.
4.5.2 Comparison to Other Methods
While no other approaches have been proposed to solve this problem, we can consider some ap-
proaches that might be used based on related problems. One option would be a two-stage approach,
in which we would first identify matches and then perform outlier rejection, as is typically used in
feature matching. However, the appearance information we have (rock sizes) is not unique enough
to identify matches, and most of the images we consider do not have enough rocks to use any
standard outlier rejection methods.
An alternative option would be to only use spatial information. In this case, the rocks identified
in each view could be treated as a point cloud and aligned with Iterative Closest Point (ICP) [8].
We evaluated ICP on the simulated data described above. For each experiment, the ICP algorithm
was used to align the point clouds from the two views, and the alignment was evaluated. We
consider those alignments with RMS error of a similar magnitude to the noise added to the points
to be correct. Of the 70 tests, ICP produced a correct alignment for only a few of them. These
results are shown in Table 4.1. Due to the ICP algorithm’s dependence on initial conditions, the
exact number of successful runs varied but was never more than about 20% of the total.
ICP is not a suitable choice for this problem because of its dependence on initialization and its
handling of outliers. Without a reasonable initial estimate of the transform, ICP tends to get stuck
in local minima and can fail to find the correct alignment. Additionally, it does not adequately
handle scenarios where a large fraction of the points do not match.
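For reference, a minimal sketch of this ICP baseline using the Open3D library is given below; the correspondence threshold and the identity initialization are assumptions for the example, and the thesis experiments may have used a different ICP implementation.

```python
import numpy as np
import open3d as o3d

def icp_align(rocks_a, rocks_b, threshold=1.0):
    """Align two sets of 3D rock positions with point-to-point ICP."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(rocks_a)
    dst = o3d.geometry.PointCloud()
    dst.points = o3d.utility.Vector3dVector(rocks_b)

    # ICP needs an initial guess; with no prior we start from the identity,
    # which is exactly the situation in which it tends to fall into local minima.
    result = o3d.pipelines.registration.registration_icp(
        src, dst, threshold, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation, result.inlier_rmse
```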
4.5.3 Dataset and Preprocessing
We evaluate the algorithms on real data from the Katwijk Beach Planetary Rover dataset [55]. For
this dataset, artificial rocks were placed on a sandy beach and imaged both from overhead and from
a ground rover with a sensor suite including stereo cameras and IMU. For the evaluation, we use
the images from the PanCam, which includes roll and pitch metadata, and has the clearest images.
We focus on the data from traverse 3, in which the rocks were placed closer together to allow for
more rocks in each image.
For the forward-facing view of the rocks, each timestamp has a pair of images from stereo
cameras. The dataset includes a MATLAB function to process the image into point clouds. From
the point cloud, we first throw away all points that are more than 20m from the camera, as they
are triangulated relatively inaccurately. Additionally, this 20m distance is the scale at which the
beach is approximately flat, so we can assume that the rocks lie on a plane. Next, we fit a plane
to the surface and throw away all points on that plane, which are likely on the ground, not on
rocks. Finally, we segment the remaining points into clusters, and label each cluster with a large
number of points as a rock observation. This procedure detects all rocks that are visible and can be
reconstructed well. It sometimes also includes incorrect rock identifications due to large ridges or
other places where the beach is not planar. The horizontal width of each cluster is used as the rock
size. Examples of a forward view from the dataset, and the observations detected in it (including
one false detection), are shown in Figs. 4.3a and 4.3b.
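A simplified sketch of this preprocessing pipeline is shown below; the least-squares plane fit, the DBSCAN clustering, and all thresholds are illustrative assumptions standing in for the MATLAB processing described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_rocks(points, max_range=20.0, plane_tol=0.05, min_cluster_size=50):
    """Rough rock detection from a stereo point cloud (camera at the origin, z up)."""
    # 1. Discard points triangulated too far away to be reliable.
    points = points[np.linalg.norm(points, axis=1) < max_range]

    # 2. Fit a plane z = ax + by + c by least squares and keep points above it.
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    residual = points[:, 2] - A @ coeffs
    off_plane = points[residual > plane_tol]

    # 3. Cluster the remaining points; large clusters are labeled as rock observations.
    labels = DBSCAN(eps=0.2, min_samples=10).fit_predict(off_plane)
    rocks = []
    for lbl in set(labels) - {-1}:
        cluster = off_plane[labels == lbl]
        if len(cluster) >= min_cluster_size:
            extent = cluster[:, :2].max(axis=0) - cluster[:, :2].min(axis=0)
            rocks.append((cluster.mean(axis=0), float(extent.max())))  # center, horizontal width
    return rocks
```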
For the overhead view of the rocks, the true rock positions, provided with the dataset, are used.
This simulates matching rock observations to a previously collected and processed map. The rocks
are artificial and have known sizes, so the true size is also used. The overhead views used are
cropped from the overhead map, as shown in Fig. 4.3, and only the rocks within the cropped area
are considered for matching.
Figure 4.3: (a) A ground rover view, (b) the rock observations detected in the image, and (c) the overhead views that it was matched with during the experiments. The overhead views are marked by regions 1-6, with region 6 being the full map. The same three rocks are marked with stars in both images to show which areas overlap. Region 5 is the one with only partial overlap: it contains only 3 rocks that were detected in the forward view, while the other regions contain 6. Of those 3 rocks, only 1 set of observations was matched, so experiment 5 was unsuccessful.
Figure 4.4: Fraction of tests that were successful, for pairs of images with different numbers of rocks visible. Both the IQP and ILP formulations were successful on the same tests.
4.5.4 Matching Procedure and Results
To evaluate the accuracy of each matching method, we select 8 forward views collected throughout
the dataset and match each one with several overhead views of different sizes, as shown in Fig. 4.3.
The smallest of the overhead views (region 1) is cropped to contain close to the same area as the
forward view, and the largest (region 6) shows the entire rock field. Additionally, an overhead view
is tested that only partially overlaps with the forward view (region 5), to evaluate the risk of false
positive matches.
In all of the forward views, both the IQP and ILP formulations were successful in every test
except the one with only partial overlap (recall that we define a successful test as one in which at
least 4 rocks were matched successfully). Just as with the simulated data, the results far exceeded
that measure of success. In every test except the one with partial overlap, nearly every visible rock
observation was matched successfully, and no incorrect matches were identified. These tests had
5-9 rocks visible from both views, and in all of them, at least 4 (and usually more) observations
were matched correctly. Some of those tests also had erroneous rock observations as described
earlier, and those were never matched incorrectly. In tests with 4 rocks visible, which occurred in
some partial overlap tests, most tests were successful, but a few failed because some observations were left unmatched or were matched incorrectly. The results of all these
tests are shown in Fig. 4.4. Based on these results, although 4 rock observation matches are enough
to estimate relative pose, it is safer to use 5 matches as a threshold to trust the identified matches.
Tests where one or more observations were missed fell into a few cases. First, sometimes, one
of the rock observations that was not matched did not represent a real rock. These observations
were typically areas of the beach that diverged from the ground plane, and were never incorrectly
matched to real observations (except in tests that did not have enough rocks visible for relative pose
estimation). Second, sometimes one of the rocks at the very edge of the view area had significant
triangulation error; then that observation was not matched even though it did correspond to a real
rock. In both of these cases, all other observations were matched correctly and provided enough
matches for relative pose estimation.
The third case where matches were not identified correctly occurred when there was not enough
overlap between the two images to find a successful match, as in the tests with rocks from region
5. In these cases, there are three or fewer true correspondences (and sometimes none). For these
tests with few or no correspondences, it is common that not all true correspondences are found
and two or three incorrect matches might be identified. However, these tests never have more than
4 matches identified, and often have fewer. Since three true correspondences are not enough for relative pose estimation anyway, we should discard the results of these tests. These results indicate
that views with 4 or fewer correspondences found should be discarded. They also indicate that the
risk of an erroneous relative pose estimate from scenes with no true matches is minimal, and the
algorithm (correctly) fails when the images do not overlap.
Next, the runtimes of the same tests are evaluated and shown in Fig. 4.5. All preprocessing is
done in MATLAB and the optimization is done in Gurobi, and only the runtime for the optimization
is shown. All experiments were performed on an Intel Xeon 8-core 3.5 GHz processor with 32 GB
of RAM. The runtimes are evaluated relative to the number of graph vertices, which is a rough
measure of the size of the problem. However, even in problems with the same number of vertices,
there is still some variation because the graphs are not fully connected, so the number of edges
may vary.
Figure 4.5: Runtimes for IQP (a) and ILP (b) optimizations, evaluated in graphs with different numbers of nodes. To show how the graph sizes vary with different inputs, the graph for matching the image in Fig. 4.3 with region 1 has 112 nodes (line 1), the graph for matching that image with region 3 has 208 nodes (line 3), and the graph for matching that image with region 6, the whole map, has 376 nodes (line 6).
For tests with only a few rocks in each image, both optimizations run efficiently, finding all
correspondences in less than 1 second. However, on these very small problems, the IQP formula-
tion is faster because the problem is so much smaller, with fewer variables and constraints. As the
graph gets larger, the IQP runtime increases exponentially. The ILP runtime is initially longer, but it grows more slowly, remaining only slightly above linear growth due to the added constraints. For
graphs above about 300 vertices, the ILP is equally or more efficient. Thus, it is the formulation
that should be used for very large tests, like matching an image with several rocks to a full map.
The experiments were done in Gurobi, which uses distributed algorithms to solve mixed-integer
programs more efficiently. In a computer with fewer parallel processing capabilities, such as an
embedded computer that might be found on a robot, the IQP may become less efficient and it may
be necessary to switch to the ILP formulation for smaller graphs.
4.6 Conclusion and Future Work
This chapter presented an optimization-based method for finding loop closures between two robots
with different perspectives by matching groups of unstructured objects detected in each robot’s
view. This method uses information about both appearance and spatial distribution, making it
robust to noise and to erroneous detections in either image. It performs well on simulated data and
real data from a planetary exploration environment, identifying nearly all rock correspondences
correctly. Two versions of the optimization are presented; both have similar results and may be
preferable in different scenarios. The ILP formulation is particularly interesting, as it has not been
previously used in feature matching but it scales better to very large problems. This work addresses
the problem of vision-based inter-robot loop closure in unstructured environments, which was not
previously addressed in the literature.
While we demonstrate the algorithm on rocks, it is not strictly limited to this application. De-
pending on the environment, a different type of landmark may be a better choice. In planetary
exploration, this could be part of the terrain, like ridges, dunes, or craters. In unstructured environ-
ments on Earth, it could also include trees. In any of those cases, it would be necessary to evaluate
whether object size is still an appropriate measure of appearance to identify similar objects, or if
another feature might be a better choice. In the future, it would be interesting to demonstrate this
algorithm’s performance in other environments with different landmarks.
It is clear from the results shown that the optimization becomes very slow as the graph size
grows, which would make it a challenge to use for large problems such as localizing a single
image within a map. Thus, a useful direction for future work would be finding ways to speed
up the optimization on large problems, such as by further sparsifying the graph, by dividing the
optimization into smaller subproblems, or by instead looking for the maximum weight 5-clique,
which is likely to be in the maximum weight clique. In particular, it should be straightforward to
divide up the map into regions to search, since the robot’s view covers a known area and therefore
it could not match to areas of the map separated by more than that distance. It may also be helpful
to focus on better understanding the graph structure in order to apply heuristics developed for the
MWCP.
Additionally, there are applications where the optimization time may not be problematic. For
example, the current method of localization in planetary exploration, if there is no initial pose
estimate, is to send images from the robot back to Earth and have human operators look at the
images to estimate the robot’s pose. Doing this, there is already a significant delay between the
robot making observations and receiving a pose that it can use. The robot could instead send
images to Earth and the optimization could be done on a powerful computer, reducing the need for
a human to do the pose estimation without adding significant overhead. In general, this method
has the advantage that it has low communication requirements (only a few images, sent once), so the computation can be moved to take advantage of processing power wherever it is available. Thus, it may also be
effective in scenarios where processing power is available but separated from the robots, such as
cloud-based systems.
This chapter demonstrated using this method for the problem of matching forward views with
overhead views. However, nothing in the method actually requires these two specific views. An-
other scenario where this method could be very effective is in finding loop closures between two
small robots such as PUFFERs, described in the next two chapters. Working with PUFFERs is
challenging because they are small and have cameras close to the ground, so visual loop closures
are difficult to find unless they are in exactly the same place. However, they have stereo cameras,
so could estimate rock positions as we do with the forward views in this chapter. Thus, matching
the forward views of two PUFFERs could be a way to find loop closures when they view the scene
from different perspectives, and should be considered in similar robots in the future.
Chapter 5
Loop Closures with Range Sensors for Small Robots
5.1 Introduction
When multiple robots autonomously map a previously unmapped area, as in planetary exploration,
it is vital that they maintain estimates of their relative poses to build a complete, consistent map.
Visual loop closures can provide relative pose measurements when they are available. However,
visual loop closures can be challenging to find, particularly with small robots in exploration sce-
narios. In typical exploration approaches, the robots explore different areas to cover a large region
efficiently, so loop closures may not be available if their trajectories do not overlap. Even if the
robots’ observations cover the same area, they may not be able to find loop closures, as discussed
in the previous chapter. Additionally, loop closures are even more difficult to find if the robots
are small. This chapter proposes an alternative method of estimating relative poses using ultra-
wideband (UWB) ranging measurements between robots, and shows that it can be used to build
consistent maps in scenarios where visual loop closures fail. It demonstrates this capability using
PUFFERs, developed for planetary exploration and described in more detail in Chapter 6.
This system has multiple rovers, each equipped with motor encoders and a stereo camera with
included IMU (StereoLabs ZED Mini). They have sufficient compute power to run visual-inertial
odometry on-board in real time, but pose graph optimization may take a long time to complete.
The work in this chapter was carried out at the Jet Propulsion Laboratory, California Institute of Technology, and
was sponsored by NSTRF and the National Aeronautics and Space Administration (80NM0018D0004).
Additionally, the onboard computers must be shared with other real-time tasks like planning and
navigation. They have a wireless radio to transmit data to and from other rovers or the parent
spacecraft, which allows for offloading data and computation tasks. In this multi-robot SLAM im-
plementation, the pose graph optimization is offloaded to the parent spacecraft with more compute
power at the expense of needing to spend energy and time transmitting data. The tools to support
data exchange amongst rovers and the parent spacecraft are not described in detail in this chapter
– it is sufficient to know that this mechanism is capable of asynchronously sharing pose graph
data from rovers to the parent spacecraft and map updates from the parent spacecraft to the rovers.
Lastly, the system also equips each rover, as well as the parent spacecraft, with a pair of UWB
ranging radios (Decawave DWM1001) to measure rover-to-rover and rover-to-parent distances
bidirectionally. During operation, it is common that up to 4 PUFFERs operate simultaneously, and
they may work near each other or cover different areas.
The contribution of this chapter is a method for using ranging measurements in multi-robot
SLAM to estimate robots’ relative positions, which enables them to build consistent maps even
when loop closures cannot be found. The measurements can be collected over a large distance and
do not require inter-robot communication. This method adds range edges to the pose graph, which
are used in pose graph optimization to constrain the robots’ trajectories at times when loop closures
are not available, such as when they explore separate non-overlapping areas. In experiments with
PUFFERs, this method performs similarly to a visual loop closure-based approach when loop
closures are available. When visual loop closures are not available, range edges still enable the
system to build a consistent map, while other methods fail to produce a usable map. Most of this
chapter is based on a published paper [14], which was the first example of utilizing UWB range
measurements for multi-robot SLAM in an application with few visual loop closures. Sec. 5.4.5
has not previously been published, but describes an extension of the work in the rest of the chapter.
5.2 Related Work
5.2.1 Pose Graph Optimization
Single-robot Simultaneous Localization and Mapping (SLAM) is a well-studied problem; see
Taketomi et al. [115] for a complete discussion. Multi-robot extensions have been proposed for
most popular SLAM implementations. Early examples included multi-robot SLAM systems based
on particle filters [19, 57], Kalman filters [140], and information filters [117]. However, like in the
single-robot case, filter based algorithms accumulate error and do not allow for global map updates.
Current approaches typically perform EKF-based visual odometry for short-term localization and
maintain the map and pose history as a pose graph, which can be updated via optimization and
bundle adjustment.
Multi-robot SLAM implementations have two unique aspects: they must detect inter-robot loop
closures and they must either provide communication for maps and pose histories to be shared or
perform distributed optimization and mapping.
Inter-robot loop closures are required to combine each robot’s local map into the global map
and to estimate robots’ relative positions. These loop closures are found when robots directly
observe each other [19] or when they observe the same landmark [24, 34] or scene [25, 81]. Loop
closure methods can be computation-intensive, as robots do significant image processing onboard
to find features or image descriptors for matching. Also, most vision-based place recognition
requires viewing the scene from nearly the same perspective. For small robots with cameras near
the ground, like PUFFERs, achieving this recognition requires that the two poses are in nearly the
same place, which is often not realistic. This chapter presents an alternative to visual loop closure
using range measurements, which can be collected over a larger distance and require no inter-robot
communication.
The other challenge commonly faced in multi-robot SLAM is pose graph optimization. In the
centralized approach, a central server maintains the global pose graph and robots share all poses
and observations with the server. Robots may perform single-agent SLAM onboard [73, 105] or
collect data and share it with the server [34]. This method benefits from additional computational
power available on the server, but is limited by the need for continuous communication to maintain
an updated map. In the distributed approach, each robot maintains a pose graph and performs
optimization onboard, and has only limited knowledge of other robots’ trajectories [24, 29, 30].
Because these methods exchange information between robots only when they can, they are robust
to drops in communication but may have worse performance, particularly if communication is lost
for a long time. This chapter uses centralized pose graph optimization, but reduces the reliance on continuous communication by only exchanging completed mapping sessions, which can be shared
and integrated with the global map at any time.
Multi-session SLAM is an application that is closely related to multi-robot SLAM. Multi-
session SLAM is concerned with only a single robot at a time, and expects that map sessions will
be collected and later joined to optimize them and build a complete map. While multi-session
algorithms do not support multiple robots, they often provide a framework that can be adapted for
centralized implementations. Maplab [106] is one such multi-session SLAM algorithm, and in this
chapter we adapt it to multiple robots. Other examples include RTAB-Map, which produces a 3D
reconstruction of the scene [65, 66], and the implementation developed by McDonald et al., which
was the first such system and uses iSAM as the pose graph backend [81].
5.2.2 Range Sensors
UWB sensors work by measuring the time-of-flight of a signal sent from one UWB sensor and
received by another, which is then converted to a range measurement. By arranging 4 or more sensors around a workspace, triangulation of these measurements is possible. This solution for position estimation is widely used as an alternative to camera- or laser-based tracking systems. While they are less precise than other options (e.g., the DWM1001 UWB sensor used in this chapter has measurements with a standard deviation of 2 cm, compared to camera tracking and LiDAR systems
that offer mm scale precision), they are easier to install, inexpensive, do not require line-of-sight
and work well in brightly lit outdoor environments.
There are numerous examples of fusing UWB range measurements from 4 or more sensors with
measurements from an inertial measurement unit (IMU) allowing the estimation of a full 6 degree-
of-freedom pose on a single robot [43, 78, 125]. Taking this a step further, Lutz et al. included
UWB range measurements in a visual-inertial SLAM system [77]. Relative pose estimation was
demonstrated by Trawny et al. using only dead-reckoning and inter-robot distance measurement
[122]. It was also shown by Lee et al. that it is possible to estimate scale in a monocular visual
odometry system using only a single UWB sensor pair [70]. The system described here can operate
on as little as one robot containing a camera, IMU and UWB sensor and a base station that also has
a UWB sensor, essentially combining the SLAM and single UWB pair measurements described
above into a single system. However, the focus of this chapter is how the system can also be scaled
to include multiple robots that measure their range to each other and include these additional range
measurements as edges in the multi-robot SLAM problem. This is the first example of utilizing
UWB range measurements in this fashion in a multi-robot SLAM problem.
5.3 Algorithm
5.3.1 PUFFER Localization Subsystem
For localization and mapping within the PUFFER system, we use maplab [106], which includes
two parts:
1. ROVIOLI, an online visual-inertial odometry (VIO) tool that estimates state using an ex-
tended Kalman filter [10], localizes against a previously created map, and produces a map
from the series of past poses, and
2. the maplab console, an offline map-merging tool with plugins to load maps from different
sessions, find loop closures between them, and optimize the multi-session map.
We adapt this framework for semi-online multi-robot localization.
As each robot moves to scientific or exploration targets, it runs ROVIOLI in one or more
sessions and creates a map from each session. In addition to VIO, our ROVIOLI implementation
incorporates velocity updates from wheel odometry. These odometry measurements help the filter
to estimate IMU biases, which are often not observable during typical PUFFER operation due to
the low velocities and lack of high-dynamic movement.
When a robot begins a new map session, its pose is initialized from the end of the previous
session. Additionally, the most recent localization map received from the base station is passed
into ROVIOLI, allowing it to localize against observations made by other robots without explicitly
having those robots’ pose histories or observations.
Once a map session is completed, it is shared with the centralized base station as communica-
tion is available. The base station runs the maplab console, where it maintains a map of all sessions
received from all robots. As new sessions are received, they are added to the map, loop closures
are found with any previous missions, and the pose graph is re-optimized. The base station is also
added to the pose graph as a stationary node with a known pose, and range measurements to the
base station are also added as pose graph edges and used during optimization. From this optimized
pose graph, an updated localization map is shared with all active robots. Additionally, the observa-
tions made during the map session are incorporated into the global map, which is used by the base
station to identify science targets.
Individual robots do not maintain a long-term map of their own pose history and they do not
find loop closures onboard. This frees up onboard computing for other processes. Additionally, a
robot does not know the location or observations of other robots. If communication is lost, a robot’s
operation is not affected. It continues to explore and produce map sessions, to be eventually shared
with the base station if the connection is re-established. It does not receive updated localization
maps, but these are not necessary for basic operation.
5.3.2 Pose Graph Optimization with Range Edges
Maplab uses a pose graph designed to find loop closures and optimize over multiple missions, which are created from map sessions. The vertices $v_i$ are landmarks and poses within the missions. Poses represent a 6-DOF robot pose at a specific timestamp. Landmarks represent stationary points in the map, and have only position. Edges represent constraints of different types between poses, with edge $e_{ij}$ representing the constraint between $v_i$ and $v_j$. The edge type describes the type of measurement used to estimate the constraint and determines how the edge is added to the optimization. Visual-inertial edges connect consecutive frames in a single mission, and contain estimated transformations from inertial and visual measurements. Landmark observation edges connect keyframes to landmarks they observe, and loop closure edges directly connect keyframes that observed the same set of landmarks. Each mission has an associated base frame, which is estimated when the mission is added to the map by aligning it with the other missions. All poses in the mission are measured relative to the mission base frame.
During general pose graph optimization, the following optimization problem is solved:
\[
X^* = \underset{X}{\operatorname{argmin}} \; \sum_{i,j} \left\| f_{ij}(x_i) - x_j \right\|^2_{\Lambda}
\tag{5.1}
\]
Here, $x_i$ is the pose of vertex $v_i$, and $f_{ij}$ describes the constraints for edge $e_{ij}$. $\|\cdot\|_\Lambda$ is the Mahalanobis distance with covariance $\Lambda$. When an optimization is performed, this measurement is set up for each edge, and the Ceres nonlinear optimization library [1] is used to solve the optimization.
When a range measurement is made from one robot to another, it is added as an edge to the
global pose graph. The edge is attached to the vertices corresponding to the poses on those robots
with time nearest the measurement time. The edge contains the measured range value and es-
timated variance. If suitable pose graph nodes cannot be found for both robots involved in the
measurement, the measurement is dropped.
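A sketch of this attachment step is shown below, assuming each robot's mission exposes a time-sorted list of (timestamp, vertex id) pairs; the `graph.add_range_edge` call, the data structures, and the time tolerance are hypothetical names used only for illustration.

```python
import bisect

def attach_range_edge(graph, poses_a, poses_b, meas, max_dt=0.1):
    """Attach a range measurement to the nearest-in-time vertex on each robot.

    poses_a, poses_b: time-sorted lists of (timestamp, vertex_id) for the two robots.
    meas: object with fields .time, .range, .variance.
    Returns the new edge, or None if no suitable vertices exist (measurement dropped).
    """
    def nearest(poses, t):
        times = [ts for ts, _ in poses]
        k = bisect.bisect_left(times, t)
        candidates = [idx for idx in (k - 1, k) if 0 <= idx < len(poses)]
        best = min(candidates, key=lambda idx: abs(poses[idx][0] - t), default=None)
        if best is None or abs(poses[best][0] - t) > max_dt:
            return None
        return poses[best][1]

    va, vb = nearest(poses_a, meas.time), nearest(poses_b, meas.time)
    if va is None or vb is None:
        return None  # no pose close enough in time on one of the robots
    return graph.add_range_edge(va, vb, meas.range, meas.variance)
```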
A range edge has a single 1-dimensional measurement $r_{ij}$ with variance $\sigma_r^2$. Additionally, to account for occasional spurious measurements with the range sensors, we implement range edges using switchable constraints, as introduced by Sünderhauf and Protzel [113, 114] and also used by maplab for some types of loop closure edge. Switchable constraints allow some edges to be switched off (with some penalty) if they cannot be reconciled with the rest of the pose graph. The edge between nodes $x_i$ and $x_j$ has a switch variable $s_{ij}$, and a switch function $\Psi(s_{ij}) : \mathbb{R} \rightarrow [0,1]$ describing the state of the edge. When $\Psi(s_{ij}) = 1$, the constraint is used normally, and when $\Psi(s_{ij}) = 0$, it is switched off. For each edge, we add to the optimization problem the expression:
\[
\frac{1}{\sigma_r^2} \Big( \Psi(s_{ij}) \big( f(x_i, x_j) - r_{ij} \big) \Big)^2 + \frac{1}{\sigma_s^2} \big( \gamma_{ij} - s_{ij} \big)^2 .
\]
Here, $\gamma_{ij}$ is the prior value for the switch variable. Initially, all edges are switched on, so $\Psi(\gamma_{ij}) = 1$, and turning off a constraint incurs a high penalty. But if the constraint is switched off, the term describing the error from it is 0, allowing spurious measurements to be ignored without needing to perform a complex outlier rejection procedure. The variance of the switch variable is $\sigma_s^2$, which determines how difficult it is to switch a constraint off.
For the switch function $\Psi$, we use the sigmoid function
\[
\Psi(s_{ij}) = \frac{1}{1 + e^{-s_{ij}}} .
\]
Sünderhauf and Protzel originally proposed this formulation [114], as it is nearly a step function between 0 and 1, but is differentiable everywhere. They proposed to use switch variables to make SLAM robust to incorrect loop closures, which induce very large errors. They found in a later work that a linear function ($\Psi(s_{ij}) = s_{ij}$, with $s_{ij} \in [0,1]$) worked as well and could optimize at extreme values of $s_{ij}$, where the sigmoid has derivatives near zero [113]. However, as we will describe in Sec. 5.4.1, the errors our sensors exhibit are relatively small, and we need to disable those errors altogether. This is not possible with a linear switch function, as it only lowers the weight on those edges as much as necessary. Thus, while the standard is to use a linear switch function, we opt for the sigmoid since it gives a nearly step response, which is the desired binary behavior.
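In the actual system this residual is implemented as a cost function inside the Ceres-based maplab optimization; the Python sketch below only illustrates the two terms contributed by one switchable range edge, and the prior value used for the switch variable is an illustrative assumption.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def range_edge_residuals(p_i, p_j, s_ij, r_ij, sigma_r, sigma_s, gamma_ij=10.0):
    """Residuals for one switchable range edge (squared and summed by the solver).

    p_i, p_j: 3D positions of the two connected vertices.
    s_ij: current value of the switch variable; gamma_ij: its prior (edge switched on).
    Returns the weighted range residual and the switch-prior residual.
    """
    predicted = np.linalg.norm(p_i - p_j)          # f(x_i, x_j): inter-robot distance
    r_range = sigmoid(s_ij) * (predicted - r_ij) / sigma_r
    r_switch = (gamma_ij - s_ij) / sigma_s         # penalty for turning the edge off
    return r_range, r_switch
```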
5.4 Experiments and Discussion
5.4.1 Range Sensors
The error model of the Decawave DWM1001 UWB ranging sensors was evaluated by placing pairs of
sensors at several distances apart and measuring the mean and standard deviation of each sensor’s
measurement. To produce results that directly apply to their use during PUFFER operation, these
values were measured in the test environment, with the sensors inside the robots. Results of these
measurements are shown in Fig. 5.1. The range measurements are unidirectional and do not require
communication between the robots, so measurements made by each robot in a pair at the same time
are independent.
Overall, the measurements at a particular distance are stable, with mean close to the true range
value and standard deviation of about 0.02 m. The measurements do not show any systematic
bias or scale factor, but they do contain some artifacts. Occasionally, the measurement will drop
to the integer below the true value, as though the decimal part of the measurement is truncated.
These spurious measurements could potentially be removed, but that would risk also removing
real measurements when the true range value was near a whole number of meters. Instead, these
measurements are included, and are eliminated during optimization by the switchable constraints.
5.4.2 Experiment Setup
Recall that each rover (PUFFER) is equipped with a ZED Mini stereo camera (with integrated
IMU) and a pair of Decawave DWM1001 ultra-wideband ranging radios. Consequently, each PUFFER
is capable of stereo visual-inertial odometry that integrates motor encoder data. The resulting pose
graphs are exchanged over the network with a base station, which can provide an optimized map back to all PUFFERs. The software implementation was achieved in ROS (Robot Operating System)
by building upon maplab. It runs locally on-board the PUFFER’s Nvidia Jetson Nano or NX, as
well as on a standard Intel x86-64 NUC or laptop.
Figure 5.1: (a) Measured range by one pair of sensors over a series of distances. Spurious measurements at the integer below the true range value are visible. (b) Mean of errors from this test, with spurious measurements removed. Error bars show one standard deviation.
Figure 5.2: Three PUFFERs moving within the test area. The test area was mostly flat, with
scattered rocks. This is reflected in the map built using range measurements.
We performed experiments with two and three PUFFERs in the JPL Mars Yard, an outdoor
planetary exploration analog environment. For all experiments, the area covered was fairly flat
with scattered rocks (see Fig. 5.2). Visual odometry was performed in real time, and pose graph
optimization was performed offline on the base station, a laptop with Intel Core i7 processors and
16 GB of RAM. We also performed additional experiments with one robot and a ranging sensor
on the base station in the JPL Mini Mars Yard, which is a similar environment. Ground truth positioning was not available during the experiments. However, we assess the quality of the localization by evaluating the map produced from the collected images after optimization.
In all experiments, the range-optimized maps show consistency and detail. The maps are con-
tinuous, and they show flat terrain in the areas that the robots traveled and identifiable obstacles
where rocks were observed. With two robots, we show that the optimized solution with only range
edges produces similar results to the solution with visual loop closures between the robots. With
three robots, no visual loop closures were available. We show that the map produced without range
measurements is unusable, but optimizing with range edges produces a consistent, usable map.
5.4.3 Two-Robot Experiment
Two PUFFERs, AP1 and AP4, were driven around in a 5 m × 6 m area. Their trajectories, shown
in Fig. 5.3, covered most of the area, with significant overlap between the two trajectories. While
no ground truth positioning was available, the trajectories had enough overlap to allow for several
loop closures between them. Thus, we use the optimized solution from loop closures as a reference
(cf. Fig. 5.3b). Relative to this reference, the error in the trajectories before optimization is shown
in Fig. 5.4. Notably, the visual odometry-only solution with no optimization has a significant scale
error, which is observed and corrected during optimization.
For each robot, the images, IMU, and wheel encoder measurements were processed into a map
session using ROVIOLI, the online VIO portion of maplab. These map sessions were added into
a pose graph as missions, with the baseframe of each mission initialized from loop closures. This
simulates planetary exploration, where the robots would initialize their baseframes by being placed
in a known location or by measuring their pose relative to the lander.
During the experiment, each robot collected range measurements to the other robot. The range
measurements varied between 1 and 5 meters and exhibited the same error model and occasional
spurious measurements described above, as shown in Fig. 5.6. Approximately 2500 measurements
were collected by each robot during the 7-minute experiment, so only a subset of them was used.
To ensure that the subset selected was well-distributed over the whole trajectory, every tenth mea-
surement from each robot was added into the pose graph for optimization, adding 512 edges total.
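To make the subsampling concrete, the sketch below (an illustrative Python sketch, not the maplab
implementation; the RangeEdge container, the field names, and the 5 cm noise value are assumptions)
keeps every tenth measurement and attaches it to the pose-graph node closest in time on each robot’s
trajectory, which is one reasonable way to associate measurements with nodes:

    import bisect
    from dataclasses import dataclass

    @dataclass
    class RangeEdge:
        node_a: int          # closest-in-time node on the first robot
        node_b: int          # closest-in-time node on the second robot
        measured_range: float
        sigma: float         # assumed 1-sigma measurement noise (m)

    def nearest_node(node_times, node_ids, stamp):
        """Return the id of the node whose (sorted) timestamp is closest to stamp."""
        i = bisect.bisect_left(node_times, stamp)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(node_times)]
        best = min(candidates, key=lambda j: abs(node_times[j] - stamp))
        return node_ids[best]

    def subsample_range_edges(measurements, traj_a, traj_b, stride=10, sigma=0.05):
        """Keep every stride-th range measurement so the resulting edges stay
        spread over the whole trajectory rather than clustering in time."""
        edges = []
        for i, m in enumerate(measurements):
            if i % stride:
                continue
            edges.append(RangeEdge(
                node_a=nearest_node(traj_a["times"], traj_a["ids"], m["stamp"]),
                node_b=nearest_node(traj_b["times"], traj_b["ids"], m["stamp"]),
                measured_range=m["range"],
                sigma=sigma))
        return edges

With a stride of ten, the roughly 2500 measurements collected per robot reduce to the 512 edges used
in this experiment.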
The robot trajectories after optimization with range edges only are shown in Fig. 5.3c, with
errors relative to the reference in Fig. 5.5. Range measurements improve the pose estimates for
both robots significantly. The errors compared to the solution with loop closures are much smaller
and less noisy than before optimization, showing that much of the detail in the trajectories can be
captured with range edges. The optimized solution with range edges also has nearly the same scale
as the reference solution, showing that much of the scale error in the visual odometry solution can
be observed with range edges. Additionally, Fig. 5.7 shows where the range edges are located on
the trajectories and which edges are switched off.
Figure 5.3: Trajectories from the two-robot experiment (a) with no optimization and after opti-
mization using (b) loop closures only and (c) range measurements only. We use (b) as a truth
reference. Note the larger scale in (a), which is corrected by both types of optimization.
Figure 5.4: Errors in unoptimized trajectories compared to reference trajectories with loop closures
in the two-robot experiment.
Figure 5.5: Errors in optimized trajectories with range edges compared to reference trajectories
with loop closures in the two-robot experiment.
Figure 5.6: Range measurements in the two-robot experiment. (a) Measured range from AP1 to
AP4. (b) Measured range from AP4 to AP1.
Figure 5.7: Range edges used during optimization in the two-robot experiment. Edges marked in
green are switched on, and edges marked in red are switched off.
Figure 5.8: Reconstructed maps from the two-robot experiment with noise removed, colored by
depth. (a) The map created with no optimization. (b) The map created with visual loop closures.
(c) The map created with only range measurements. In both (b) and (c), the area is flat and indi-
vidual rocks can be identified.
Notably, no edges have switch variables with values near zero; all edges are either on or off. It is
not visible from the image, but the edges that are switched off are those with measurements at the
integer below the expected range, indicating that they are incorrect measurements and the switch
variables are removing them as desired.
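One common way to realize this behavior is the switchable-constraint formulation, sketched below in
Python with illustrative noise values (this conveys the general idea, not maplab’s actual edge code):
each range edge is scaled by a switch variable s, and a prior pulls s back toward 1, so the optimizer
only switches an edge off when its measurement is badly inconsistent with the trajectory.

    import numpy as np

    def switchable_range_residuals(p_a, p_b, s, measured_range,
                                   sigma_range=0.05, sigma_switch=0.1):
        """Residuals for one switchable range edge.

        p_a, p_b : 3-vectors, positions of the two connected pose-graph nodes
        s        : scalar switch variable; ~1 keeps the edge, ~0 disables it
        Returns the weighted range residual and the switch-prior residual.
        """
        predicted = np.linalg.norm(np.asarray(p_a) - np.asarray(p_b))
        r_range = s * (predicted - measured_range) / sigma_range
        r_prior = (1.0 - s) / sigma_switch
        return np.array([r_range, r_prior])

During optimization, each s is estimated jointly with the poses, which is how edges corresponding to
the spurious integer-offset measurements end up driven toward zero.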
Additionally, we evaluate the map created from the optimized pose graph, as this is the main
product of the localization and mapping subsystem. The stereo dense reconstruction tool in maplab
is used to construct point clouds from each pair of stereo images, then integrate those point clouds
into a dense map, shown in Fig. 5.8. The map created from the range-optimized solution shows
similar consistency and detail to the map with visual loop closures, and is much more usable for
navigation than the map created with no loop closures.
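Conceptually, the reconstruction amounts to transforming each keyframe’s stereo point cloud with its
optimized pose and accumulating the result; the sketch below is a simplified stand-in for maplab’s
dense reconstruction tool, with an arbitrary 5 cm voxel size.

    import numpy as np

    def integrate_point_clouds(keyframes, voxel=0.05):
        """Merge per-keyframe point clouds into one world-frame cloud.

        keyframes : iterable of (R, t, points), where R is a 3x3 rotation and
                    t a 3-vector giving the keyframe's optimized world pose,
                    and points is an Nx3 array in the keyframe's camera frame.
        A crude voxel-grid filter keeps one point per occupied cell.
        """
        clouds = [pts @ R.T + t for R, t, pts in keyframes]
        world = np.vstack(clouds)
        cells = np.floor(world / voxel).astype(np.int64)
        _, keep = np.unique(cells, axis=0, return_index=True)
        return world[np.sort(keep)]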
5.4.4 Three-Robot Experiment
Three PUFFERs (AP1, AP4, and AP5) were driven around in a 10 m × 10 m area. The robots all
began near the same initial location, representing the base station. Despite their initial proximity,
they were not near enough for loop closures to be found between all robots to initialize their poses.
Instead, they were initialized manually based on measured initial poses. The robots were driven in
exploration-like trajectories: they covered only new terrain, and their paths did not overlap. This
is typical of the behavior that would be used by PUFFERs during autonomous exploration. They
would mostly explore different areas with little overlap to build the map most efficiently.
During the experiment, each robot collected range measurements to each of the other robots.
The maximum ranges observed were about 6 meters, with the same error properties as previously
described. Again, every tenth range measurement from each robot was used in optimization, with
774 total range edges added to the pose graph. This resulted in a distribution of range measure-
ments over all of the trajectories and between all pairs of robots, as shown in Fig. 5.9.
The optimized solution using only loop closures is shown in Fig. 5.10a. In this experiment, fewer
than 10 features were matched between the initial positions of AP4 and AP5, and none were matched
between either of them and AP1 or anywhere else along their trajectories. Thus, visual
loop closure-based pose graph optimization failed to find a reasonable set of trajectories, even
when the robots’ baseframes were initialized manually. Note that this solution places the paths of
AP4 and AP5 on top of one another, even though they actually did not cover the same area. The
unoptimized solution is not shown here, but it showed the same behavior, with the paths of AP4
and AP5 overlapping. No loop closures were found later in the trajectories, so optimization with
loop closures was unable to correct this error.
On the other hand, because range measurements were available throughout the dataset, our
approach using pose graph optimization with range edges was able to identify the separate trajec-
tories followed by each of the robots. The pose estimates produced with range edges correctly
show paths that do not overlap, shown in Fig. 5.10b. This is also visible in the reconstruction
produced by each set of trajectories, as shown in Fig. 5.11. The optimization with range edges
produces a smooth map, which is continuous between areas covered by different robots and has
recognizable rocks. However, the optimization without range edges gives a discontinuous map in
which objects are not identifiable and two layers of ground are even projected on top of each other.
Figure 5.9: Range edges used in optimization with three robots. Edges between AP1 and AP4 are
shown in red, between AP1 and AP5 in green, and between AP4 and AP5 in blue.
Figure 5.10: Optimized trajectories in the three-robot experiment with (a) loop closures only and
(b) range edges. All robots began and ended in the lower left corner. Note that during the experi-
ment, AP5 remained between AP1 and AP4 at all times and the paths of the robots did not overlap.
5.4.5 Base Station Experiment
Recall that a ranging sensor was also placed on the base station, though it was not used in the
previously described experiments. This experiment demonstrates the system’s ability to correct
scale errors, even when operating with only a single robot. A single PUFFER (AP1) was driven
through a loop about 5 meters long. At several points on this trajectory, the PUFFER stopped for
long periods of time; this data was later used to test the system’s behavior for multiple missions
collected by the same robot and merged later. Visual loop closures were observed between the
beginning and end of the loop. A UWB tag was placed at the base station, and measurements to
it from the anchor onboard the PUFFER were recorded. Visual odometry was performed onboard
the PUFFER with 2 Hz monocular imagery. Note that this rate is much lower than in the other
experiments; several other subsystems were also active onboard AP1 during operation. With such
a low frame rate, significant scale error accumulates during VIO.
Figure 5.11: Reconstructed maps from the three-robot experiment with noise removed, shaded by
depth on the left and a side view on the right. (a) The map created with visual loop closures. (b)
The map created with range measurements. The map in (b) shows a flat surface with identifiable
rocks, while the map in(a) does not have recognizable features and has two layers on top of each
other.
On the base station, a fixed node representing the base station’s pose was added to the map at its measured location, and the range
measurements from AP1 were added as edges to the pose graph and used in optimization.
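A minimal way to picture the fixed node is to treat the base station’s surveyed position as a constant
rather than an optimization variable, as in the following sketch (the position, field names, and 5 cm
sigma are placeholders; maplab’s fixed-vertex handling is more involved):

    import numpy as np

    BASE_STATION = np.array([0.0, 0.0, 0.0])   # surveyed position, held fixed

    def base_station_range_residuals(node_positions, edges, sigma=0.05):
        """Residuals for range edges anchored at the stationary base station.

        node_positions : dict mapping node id -> current 3-vector estimate
        edges          : list of (node_id, measured_range) pairs
        Because BASE_STATION is a constant rather than an entry in the state
        vector, optimization can move the robot nodes but never the anchor.
        """
        residuals = []
        for node_id, measured in edges:
            predicted = np.linalg.norm(node_positions[node_id] - BASE_STATION)
            residuals.append((predicted - measured) / sigma)
        return np.array(residuals)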
The dataset was collected over 741 seconds, or about 12 minutes; the rover was stationary for
a large fraction of that time. Once the measurements were reduced to only those measuring the
distance to the base station (excluding measurements to the tag on AP1), and the measurements
were filtered to a lower rate, 157 range edges were added to the pose graph covering most of the
trajectory.
Ground tracks and generated maps for three scenarios are shown in Fig. 5.12. The first sce-
nario shows the localization solution before optimization. This is the output from VIO with wheel
encoders included. The estimated pose accumulates error in the Z axis over time; this growth is clearly
incorrect, as the ground was nearly flat and the robot returned to near its initial position. The sec-
ond scenario adds optimization using visual loop closures. Loop closures were observed between
the beginning and end of the trajectory, as well as near the points where the robot was stationary.
However, they are not observed during times that the robot is moving. These loop closures effec-
tively limit the growth in the Z axis, as they enforce that the robot returns to near its initial position.
However, they do not help to constrain scale errors that appear during VIO due to the low image
rate, the use of only one camera, and wheel slip in the sandy terrain.
Finally, the third scenario adds the range measurements to the optimization. The base station
was added as a node at its measured position, and the range measurements were added as edges
to the pose graph. Loop closures were also detected. In this scenario, the optimization still
corrects the growth in the Z axis, and the range edges are also able to help correct the scale. From
observing the maps, it is clear that both the Z axis correction and the scale correction are necessary
to construct a consistent map. In the map from the first scenario, the error in the Z axis overwhelms
any features that might be visible on the ground. In the map from the second scenario, the ground
is flat, but there are large holes in the map in areas where obstacles were observed because the
localization estimate drifted.
Figure 5.12: Maps produced from the localization solution in the base station range experiment,
shaded by depth. Maps with (a) no optimization, (b) visual loop closures only, and (c) ranges to
base station.
However, in the map from the third scenario, these obstacles are compact and identifiable, indicating that the scale of this map is correct. Of the maps that were
constructed, this map with range measurements is the only one that could be used for navigation.
The scenario evaluated here is a more realistic example of onboard PUFFER localization than
the other experiments in this chapter. Unlike in the other experiments, this PUFFER ran its full
navigation subsystem onboard, which limited the rate that it could perform VIO. That limitation
resulted in significant scale drift as well as error in the Z axis pose estimate, which was not entirely
corrected by visual loop closures. However, this experiment showed that range measurements
could be used to correct even those large scale errors and build an accurate, usable map.
Additionally, this experiment demonstrated the addition of the base station to the pose graph as
a fixed stationary node. This capability was added to the maplab PGO system in order to use the
base station as a reference.
5.5 Conclusion and Future Work
This chapter presented a method for adding range measurements between robots to a pose graph
and using them in optimization for multirobot SLAM. These range edges can be used to augment
other types of measurements that may be made, including loop closures made by the same robot
revisiting a past location, inter-robot loop closures made by different robots visiting the same area,
and sensor measurements describing a robot’s motion between consecutive poses. In scenarios
where visual inter-robot loop closures are available, optimization with range edges has similar
performance. In scenarios where visual loop closures are not available, range edges constrain the
robots’ trajectories so a map can be built using all of their observations, which is not possible
without those measurements. The optimization performance with range edges is demonstrated on
datasets with one, two, and three PUFFERs, on trajectories typical for their use.
This approach helps to constrain the trajectories of robots operating together when they cannot
use visual inter-robot loop closures. However, it can only perform this task at a later time, after
the missions collected by each robot are complete. In the future, range sensors would be more
beneficial to multi-robot systems by constraining their pose in real time, so that their observations
can immediately contribute to a global map. This can be partially done by measuring range to a
fixed point and incorporating those measurements into the real-time EKF. To get the full benefit of
inter-robot range measurements, the robots must exchange information about some poses to use the
measurements onboard. The question of how to exchange this information and what information
to exchange is currently an open problem in multi-robot SLAM, and will need to be solved to
effectively use inter-robot range measurements in real time.
The techniques discussed in this chapter make a number of assumptions about the system’s
architecture and operation. They assume that processing power is available on the base station,
that at least sporadic communication is available between each PUFFER and the base station, and
that the PUFFERs operate at the same time and mostly stay fairly close together. All of these
assumptions make sense when working with the PUFFERs. The next chapter will discuss the
details of PUFFER system architecture and operation, in which range measurements are used to
build a shared map that all PUFFERs can use for navigation and targeting.
Chapter 6
PUFFER: A Semi-Online Multi-Robot SLAM System for Planetary Exploration
6.1 Introduction
The previous chapter discussed a method for using UWB ranging measurements between robots,
as well as from robots to the base station, to estimate relative poses and merge the local maps
of the robots even when no visual loop closures were available. The method worked well, and
consistently produced usable merged maps at the correct scale. The technique of adding range
edges to the pose graph was developed for use on a team of PUFFERs (Pop-Up Flat-Folding
Explorer Robots), which were developed at JPL for planetary exploration and are shown in Fig. 6.1.
The Autonomous PUFFER project developed autonomy for a novel origami-inspired, ultra-
compact rover technology to expand the scientific reach of future NASA missions. The original
PUFFER design is a “pop-up” robot that folds into a package roughly the size and weight of a smartphone
[60]. This compactness allows a large number of PUFFERs to be packed into a larger “parent”
spacecraft at low payload cost, and then used by the parent spacecraft to provide increased surface
mobility. For example, missions could include a planetary lander with small rovers for increased
exploration. Alternatively, a larger parent rover could use a collection of PUFFERs to explore
extreme terrains that are easier to access with a small, low-cost child rover.
The work in this chapter was carried out at the Jet Propulsion Laboratory, California Institute of Technology, and
was sponsored by NSTRF and the National Aeronautics and Space Administration (80NM0018D0004).
Figure 6.1: Two of the PUFFER robots used in the experiments.
When the parent spacecraft finds an exciting region for exploration, it simply ejects one or more PUFFERs, which
then pop up and explore the target or targets of interest. In either case, the PUFFERs are sent
into environments without much prior information (e.g., a map) or localization infrastructure (e.g.,
GPS). Consequently, they must rely on themselves to localize within the environment and build a
map, but can also coordinate with other PUFFERs, as well as the parent spacecraft, by exchanging
data over a communication network. Autonomous PUFFER extended the original PUFFER design
to develop PUFFER v4.0, which is approximately 25cm across and weighs 1.5 kg, still much
smaller than any other planetary rover concepts currently in use. PUFFER v4.0 carries upgraded
compute elements and sensors, including an NVIDIA Jetson NX embedded processor with GPU
and a Stereolabs ZED stereo camera with an integrated IMU. The systems described in this chapter
were developed for use on PUFFER v4.0.
During a mission, a team of PUFFERs is sent out with the goal of exploring an area, identifying
targets of interest, such as those that should be studied further by the science system, and eventually
building a complete map that the team can use to navigate to those targets. An exploration subsystem
selects actions for each robot to cover the area, and an autonomous science targeting subsystem
identifies rocks that should be studied further. As the PUFFERs explore, a global map is built up of
the areas that have been explored, including both obstacles and science targets. As the mission goes
on, the exploration subsystem selects more actions, optimizing for coverage and revisiting targets
of interest, and the localization and mapping subsystems continue to use all of the PUFFERs’
observations to improve the map.
This chapter discusses the architecture and development of the PUFFER system, and demon-
strates the use of a team of PUFFERs in planetary exploration scenarios. Due to the hardware
requirements, including limited onboard compute power, and the mission goals, PUFFER uses a
semi-online multi-robot SLAM system that takes advantage of onboard visual odometry and spo-
radic communication with the base station. This design allows the PUFFERs to use global maps
when they are available and to work with only local information when they are not. The SLAM
subsystem interacts with other subsystems on the robot and the base station, including mapping
and exploration, and the whole system operates autonomously in real time.
The work described in this chapter was performed at NASA’s Jet Propulsion Laboratory by
the Autonomous PUFFER team. As part of the team, I did much of the design, development, and
testing of the multi-robot SLAM system. Specifically, I ported ROVIOLI to the PUFFER, adapted
it to the available camera and IMU, and tested that integration. I did not implement the wheel
odometry updates, but I did test them. I designed and implemented the range measurement loop
closures, and led the data collection to test them. I worked with the rest of the localization and
mapping team to design and test the interface between the localization subsystem and MapDB,
and I implemented the wrappers for the localization system on both the PUFFERs and the base
station. I also performed a variety of miscellaneous tasks, including data collection, working with
the hardware team to get the PUFFERs ready to drive, and debugging many hardware and firmware
problems. The final demonstration was initially scheduled for May 2020, but was delayed when
access to JPL was restricted due to COVID-19. During that time, I contributed remotely, continuing
my development and testing on recorded data and a single PUFFER I tested with at home. The
final demonstration was completed in April 2021. I was not able to attend, but the team used the
software that I had developed, described in this chapter.
6.2 PUFFER System
6.2.1 PUFFER System Hardware
The Autonomous PUFFER system consists of multiple rovers and a single base station. Each rover
(PUFFER) is equipped with motor encoders and a stereo camera with an integrated inertial measurement
unit (Stereolabs ZED Mini). Each rover has an NVIDIA Jetson NX onboard, where it runs
local computations. This provides sufficient compute power to run visual-inertial odometry and
other tasks like navigation on-board in real time, albeit at a low rate, but more computationally
intensive tasks like pose graph optimization would take a long time to complete and may only be
possible if the rover remains stationary and stops other tasks. Each robot has a wireless radio to
transmit data to and from other rovers or the parent spacecraft. Lastly, the system also equips each
rover, as well as the parent spacecraft, with a pair of UWB ranging radios (Decawave DWM1001)
to measure rover-to-rover and rover-to-parent distances bidirectionally.
The multi-robot SLAM approach used on the PUFFER system is largely driven by this ar-
chitecture and by the typical exploration scenarios the system encounters. Each rover performs
visual-inertial odometry (VIO) onboard in real time, producing a local map which can be used
immediately for local navigation and obstacle avoidance. This solution also incorporates veloc-
ity measurements from the wheel encoders, which are helpful for measuring scale. Because of
the computational load of other processes that must run in real time, the VIO processing is only
able to handle monocular images at 4 Hz. Thus, this additional scale measurement is necessary
to maintain scale over long drives. The system also takes advantage of the parent spacecraft and
wireless radio for offloading data and computation tasks. The base station, the computer onboard
the parent spacecraft, has significantly more computational power — the experiments used a laptop
with Intel Core i7 processors and 16 GB of RAM. The pose graph optimization is offloaded to the
parent spacecraft with more compute power at the expense of needing to spend energy and time
transmitting data.
This approach also takes advantage of the physical setup of the system. Notably, while the
PUFFERs move around the exploration area, the base station, which would be the lander during
planetary exploration, remains stationary in a known position. The base station is used as a beacon
to help estimate PUFFER poses in two ways. First, when the exploration begins, the PUFFERs
have been placed in unknown locations, but they are able to turn until they face toward the lander.
Because it has a known size and appearance, each robot is able to estimate its own initial pose.
In the experiments, this is modeled by placing an AprilTag [127] on the base station, which
forms the origin of the global map frame. Each robot estimates its initial pose in that frame before
it starts moving. Second, the base station has a UWB ranging tag, which each robot can measure
a distance to during nearby operation. These measurements are used as range edges in the pose
graph, where they help to estimate the scale of the robot’s movement.
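As an illustration of the first mechanism (the frame names, the 4x4 homogeneous-transform convention,
and the function signature below are assumptions for the sketch, not the flight code), a single tag
detection plus the camera extrinsic is enough to place the robot in the global frame defined by the tag:

    import numpy as np

    def initial_pose_from_tag(T_cam_tag, T_robot_cam):
        """Estimate the robot's initial pose in the global (tag) frame.

        T_cam_tag   : 4x4 pose of the AprilTag in the camera frame, as
                      returned by a typical tag detector.
        T_robot_cam : 4x4 pose of the camera in the robot frame (calibration).
        The tag on the base station defines the global origin, so the robot's
        pose in that frame is the inverse of the tag's pose in the robot frame.
        """
        T_robot_tag = T_robot_cam @ T_cam_tag     # tag expressed in the robot frame
        return np.linalg.inv(T_robot_tag)         # robot expressed in the tag/global frame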
6.2.2 Other Software Subsystems
A number of software subsystems run within the Autonomous PUFFER system, both onboard the
rovers and on the base station, to enable the system’s exploration goals. While the focus of this
chapter is the localization and mapping subsystem, the other subsystems and their interactions are
briefly described below to give an overview of how the full system works. A diagram of the system
architecture, including subsystems running both onboard the PUFFERs and on the base station, is
shown in Fig. 6.2.
The subsystems running onboard the PUFFERs are:
1. Localization (ROVIOLI): Processes sensor measurements in real time to estimate pose and
build local map.
2. MapDB: Database for maintaining costmaps of obstacles and targets, and sharing local maps
and costmaps with base station.
Figure 6.2: The PUFFER system architecture, including subsystems running on the base station
(top) and onboard each PUFFER (bottom). Commands and data are shared between each PUFFER
and the base station using ZeroMQ and Bucardo to sync the DB, and the subsystems on the same
platform communicate via ROS messaging.
3. Autonomous Science: Processes images to identify science targets.
4. Navigation (MoveBase): Given a goal location, plans and traverses trajectory within local
map to reach goal.
5. Onboard Executive: Controls robot actions and sharing of data with base station.
The subsystems running onboard the base station are:
1. Multi-Robot Localization (maplab): Collects local maps and inter-robot measurements
from all robots and merges them into global map.
2. Mapping DB: Database for maintaining costmaps of obstacles and targets, incorporating
updates from localization, and sharing global maps and costmaps with robots.
3. Exploration: Uses global costmaps to select targets for all robots, with the goal of cov-
ering the whole workspace and revisiting targets that have been identified as scientifically
interesting.
During general operation, each PUFFER receives exploration targets and uses the costmap it
has received, as well as local information from its sensors, to navigate to those targets. As it drives,
the localization subsystem uses sensor data to estimate the robot’s pose and build a local map. At
regular intervals, the PUFFER stops moving, saves off the map it has built, attempts to share its
updated pose history and local map with the base station, and receives any updated instructions.
Meanwhile, on the base station, the localization subsystem merges local maps from each robot
into a global map. The mapping subsystem collects pose graph node updates and observations
from each pose showing traversability, obstacles, and science targets into a global costmap. The
exploration subsystem uses that costmap to designate targets for each PUFFER. Finally, commu-
nication between the base station and the PUFFERs is done in two ways. Commands are shared
between the exploration subsystem on the base station and the onboard executive on the PUFFERs
via a ZeroMQ interface, which detects and handles disconnects. Meanwhile, maps and costmaps
are shared between the base station and the PUFFERs through MapDB, which uses Bucardo for
database syncing.
6.3 Real-Time Visual-Inertial Odometry System
6.3.1 ROVIOLI Base
Each PUFFER runs the visual odometry portion of the localization subsystem onboard. This
system collects measurements from sensors including the IMU, stereo camera, and wheel en-
coders, publishes the robot’s pose estimate, and builds a local map. The localization is based
on ROVIO [10], a real-time visual-inertial odometry tool. As initially developed, ROVIO includes
only camera and inertial measurements, and is particularly well-suited to the limited processing
power onboard the PUFFERs.
ROVIO implements an iterated extended Kalman filter, which improves on a standard EKF
by repeatedly updating and linearizing the measurement Jacobian. This procedure improves the
quality of the estimate from very nonlinear measurements, like those needed to track features over
a series of images. The KF includes the standard states tracking pose, sensor biases, and camera-
to-IMU transform, and also tracks the 3D positions of a number of landmarks, as distance and
bearing from the robot.¹
¹ This feature caused one of the most challenging bugs I’ve ever encountered, which caused our implementation of the filter to crash only when it tracked an even number of features.
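To make the iterated update concrete, the following generic sketch shows the textbook iterated-EKF
measurement update (not ROVIO’s source code): the measurement Jacobian is relinearized about the
refreshed estimate on every pass, which is what helps with strongly nonlinear measurements such as
warped image patches.

    import numpy as np

    def iekf_update(x, P, z, h, H_of, R, iters=5):
        """Generic iterated-EKF measurement update.

        x, P : prior state estimate and covariance
        z    : measurement vector
        h    : measurement model, h(state) -> predicted measurement
        H_of : function returning the Jacobian of h at a given state
        R    : measurement noise covariance
        """
        x_i = np.array(x, dtype=float)
        for _ in range(iters):
            H = H_of(x_i)                          # relinearize at the current iterate
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            # Innovation corrected for how far the iterate has moved from the prior.
            x_i = x + K @ (z - h(x_i) - H @ (x - x_i))
        P_new = (np.eye(len(x)) - K @ H) @ P
        return x_i, P_new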
The landmarks are detected as patch-based corner features in the camera images, and are
tracked from frame to frame. In each frame, a landmark’s appearance is predicted by warping the
previous view of the landmark based on IMU measurements, and the difference between this pre-
dicted view and the actual observation is used as the measurement. When possible, the landmarks
that are included in the filter are selected to provide the maximum amount of information from
these errors. For example, landmarks described by patches with strong gradients in both directions
are preferred, as misalignment in any direction between the predicted and actual observations will
be observable.
The above describes the base ROVIO system. But the version used in maplab, ROVIOLI, is
augmented with Localization Integration [106]. When a map containing landmarks from previous
exploration is available, the system searches for matches between those landmarks and features
it observes. If matches are observed, they are used to estimate the robot’s pose in the global
frame and perform a loose Kalman filter update. The map used for Localization Integration can be
generated from a previous run of ROVIOLI, or by combining multiple runs, finding loop closures,
and reoptimizing the pose graph using the full offline maplab system.
6.3.2 Wheel Odometry Updates
The base ROVIOLI system is augmented on the PUFFERs by incorporating velocity updates from
the wheel encoders. The wheel encoders provide 20Hz measurements of the position of each
wheel; these measurements are differenced to get a velocity estimate, which is processed into
forward and angular velocity estimates for the robot. Those velocity estimates are then used as
measurements in the filter. They greatly improve the scale estimate of the robot’s motion, though
they tend to slightly overestimate velocity due to wheel slip, which does occur in sandy terrains.
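A rough sketch of that conversion, treating the PUFFER as a differential-drive or skid-steer platform
(the encoder resolution, wheel geometry, and slip factor below are placeholder assumptions, not the
real vehicle parameters):

    import numpy as np

    def wheel_velocity(ticks_now, ticks_prev, dt,
                       ticks_per_rev=1024, wheel_radius=0.03,
                       track_width=0.2, slip_factor=0.95):
        """Difference encoder counts into forward and angular velocity.

        ticks_now, ticks_prev : (left, right) cumulative encoder counts
        dt                    : time between readings (s), e.g. 0.05 at 20 Hz
        slip_factor < 1 crudely accounts for the overestimate caused by wheel slip.
        Returns (v, omega): forward speed (m/s) and turn rate (rad/s).
        """
        d_ticks = np.asarray(ticks_now, float) - np.asarray(ticks_prev, float)
        wheel_speed = d_ticks / ticks_per_rev * 2.0 * np.pi * wheel_radius / dt
        v_left, v_right = slip_factor * wheel_speed
        v = 0.5 * (v_left + v_right)
        omega = (v_right - v_left) / track_width
        return v, omega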
6.3.3 Pose Initialization
The base ROVIO implementation estimates pose and builds a map with the initial pose of the robot
as the origin. However, our goal is to merge the maps created by the robots, and it is possible that
some maps will be very small or will not have loop closures. Because of this, it is beneficial to use
all available information about the robots’ poses to help put the maps in the global frame.
The PUFFERs use any available information to find the initial pose for each local map, giving
an estimate of that map’s transform to the global reference frame (which uses the base station as its
origin). On initialization at the beginning of an exploration session, each PUFFER views the base
station to form an estimate of its initial pose. During the experiments, the base station included
an AprilTag [127], which was visible from everywhere in the exploration area. In a real planetary
exploration application, the base station might not have an AprilTag, but the lander would still have
a known appearance that could be used for initialization.
Later in the experiment, each time a PUFFER closes the local map it has built and begins a new
map, it needs to initialize the transform of that map. In this case, the map is initialized with the
final pose of the previous map, as the robot does not move between missions. These transforms can
be refined during pose graph optimization, but they provide an initial reference for placing each
local map in the global map.
6.4 Multi-Robot SLAM Architecture
6.4.1 Maplab Base
The SLAM system uses maplab, an open-source multi-session mapping tool [106], as a base.
Maplab consists of two parts:
1. ROVIOLI, described in Sec. 6.3, an online visual-inertial odometry (VIO) tool that estimates
state using an extended Kalman filter [10], localizes against a previously created map, and
produces a map from the series of past poses, and
2. the maplab console, an offline map-merging tool with plugins to load maps from different
sessions, find loop closures between them, and optimize the multi-session map.
Autonomous PUFFER adapts this framework for semi-online multi-robot localization and map-
ping. That is, within the system, local pose estimates and maps for each robot are available imme-
diately, and global maps are available with some delay, but during operation. While global maps
and pose estimates may not be ready instantaneously, there is no need to do offline postprocessing
before the map can be used. The additions to the maplab framework include:
1. Wrappers for each part of maplab to support semi-real time data storage, map merging, and
costmap updates,
2. range edges stored in the pose graph and used for optimization, and console plugins to add
range edges to the pose graph, and
3. stationary map missions in the pose graph, which can be used as anchors for range measure-
ments to sensors with known locations.
6.4.2 Wrappers for Real-Time Multi-Robot Performance
As written, maplab is a multi-session mapping tool. In its published use case, individual sessions
(called missions) are collected using ROVIOLI on one or more robots. Those missions can be
merged into a global map offline using the maplab console, which allows the user to find visual
loop closures, perform pose graph optimization, and generate a 3D reconstruction or create a lo-
calization map to be used in future mapping sessions. The data elements in this system, including
missions and localization maps, are stored as files or directories in the filesystem, so after they
have been created and saved, they can be shared between platforms.
The PUFFER multi-robot use case leads to different requirements for data processing. Since
the robots are identical, missions from different robots at the same time look nearly the same
during processing as missions from different sessions with the same robot. (There may be minor
differences, as missions recorded at the same time may include views of other robots moving. In
practice, these robots have few enough features that they do not induce incorrect matches, and
maplab’s loop closure procedure does not penalize missing feature matches.) However, as the
PUFFERs might want to use an earlier map as a base when starting a new mapping session, it
is best for the maps created from other robots’ missions to be available as soon as possible. To
accomplish this, the processing architecture runs ROVIOLI on the robots and maplab on the base
station, and uses wrappers to interface with the rest of the PUFFER system and to trigger recording
missions and merging maps as required by system operations.
These requirements are driven by unique aspects of PUFFER operations. First, due to the
restrictions on weight and power, each PUFFER has a single NVIDIA Jetson NX embedded com-
puter for all onboard processing. The Jetson NX includes an ARM CPU, NVIDIA GPU, and 4GB
of RAM. Because there is limited RAM available, and several other subsystems must be running
onboard at the same time as localization, the size of the local map that can be built in a single map-
ping session is limited. However, PUFFERs drive at a low speed (< 0.1 m/s), and exploring even the
relatively small areas covered in these experiments takes tens of minutes. To ensure that the pose
estimate remains accurate, it is necessary that VIO is running anytime the robot is moving and
that the missions collected cover all areas the robots explore. Thus, mapping sessions and periods
where the robot is moving must be coordinated. If a mission has grown too large and needs to be
closed, the robot should stop moving until the next mission is started. If the robot reaches a point
where it needs to stop for operational reasons, the localization subsystem should take advantage of
that to close the previous mission and start a new one.
Second, due to the limited processing power available, onboard processing by the localization
subsystem is limited to only things that are required to be done in real time. This is mainly VIO, as
the system must immediately process measurements from the camera, IMU, and wheel encoders,
and provide a real-time pose estimate to navigation. This pose estimate provides enough infor-
mation for the PUFFER to navigate within its local map, but does not provide information about
what other PUFFERs are doing or areas that they have explored. (In fact, each PUFFER is not
explicitly aware of the other PUFFERs operating at the same time.) Meanwhile, all non real-time
processor-intensive work is shifted to the base station, including map merging, using UWB ranging
measurements, and pose graph optimization.
Finally, there is no guarantee that a PUFFER will maintain constant communication with other
PUFFERs or with the base station. The system makes an effort to maintain communication most
of the time, but this is not always possible. During the experiments presented here, a WiFi network
was used to connect all robots. However, during an actual mission, the network might depend on
line of sight or all robots staying within a fixed range of the base station, so it could drop out if a
robot moves behind an obstacle or moves too far away. Thus, communication between PUFFERs
and the base station is designed so that data is exchanged largely when the robot is stationary, and
a loss of communication does not prevent the robot from continuing to operate according to the
maps it already has onboard. Control messages such as assigning new navigation targets are sent
through ZeroMQ, which handles disconnects, and large data items like map missions, global maps,
and costmaps are shared using MapDB, part of the mapping subsystem, described in Sec. 6.4.3.
Both interfaces are robust to losses of communication and in the case of a disconnect, MapDB
waits and syncs those data items when the PUFFER reconnects.
During nominal operation, each PUFFER navigates to goals set by the exploration subsys-
tem. While the PUFFER is moving, ROVIOLI runs onboard to process images, IMU, and wheel
encoder measurements, generate a pose estimate for the navigation subsystem, and build up a mis-
sion. When the PUFFER reaches its goal, or when the mission has grown too large, the PUFFER
stops moving and the mission is closed. The mission is added to the database and the database
is synchronized to the base station, sharing that mission so it can be added to the global map. If
a new localization map is available, it is saved out of the database for use in future mapping ses-
sions. Finally, ROVIOLI is started again, with the most recent localization map available, and the
PUFFER continues moving toward its goal or it receives a new goal and begins moving toward it.
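The onboard sequencing can be sketched as the loop below. Every method name on the executive,
rovioli, and mapdb objects is a hypothetical stand-in for the actual ROS services and ZeroMQ commands,
so this illustrates the ordering of operations rather than the real interface.

    import time

    def run_exploration_step(executive, rovioli, mapdb, max_mission_sec=240):
        """Drive toward a goal while recording a mission, closing and syncing
        the mission whenever the local map has grown too large or the goal is
        reached.  The robot is always stationary when a mission is closed."""
        goal = executive.wait_for_goal()
        rovioli.start_mission(localization_map=mapdb.latest_localization_map())
        mission_start = time.time()

        while not executive.reached(goal):
            executive.drive_toward(goal)                  # uses the real-time VIO pose
            if time.time() - mission_start > max_mission_sec:
                executive.stop()                          # stop before closing the mission
                mission = rovioli.close_mission()
                mapdb.add_mission(mission)
                mapdb.add_uwb_log(rovioli.uwb_log())
                mapdb.sync_with_base_station()            # tolerant of dropped links
                rovioli.start_mission(
                    localization_map=mapdb.latest_localization_map(),
                    initial_pose=mission.final_pose)      # new mission starts where the last ended
                mission_start = time.time()

        executive.stop()
        mapdb.add_mission(rovioli.close_mission())
        mapdb.add_uwb_log(rovioli.uwb_log())
        mapdb.sync_with_base_station()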
Meanwhile, on the base station, planning is done in exploration steps. At the beginning of an
exploration step, each PUFFER is assigned a goal to move to. Once it has reached that goal, and
sometimes at other times, it shares the mission that was collected with the base station. The local-
ization subsystem periodically checks for new missions from active robots. When new missions or
UWB measurements have been received, they are added to the global map, and loop closure and
pose graph optimization are performed. This process adds local information that the PUFFERs
have recently collected to the global map and re-estimates the pose of every previously collected
node. This system also produces a new localization map that the PUFFERs can use if they are in
an area that was previously explored (by themselves or a different PUFFER), and this map is added
to MapDB to be shared with each PUFFER when it syncs its own MapDB. The PGO process runs
asynchronously, so the PUFFERs do not need to wait for it to complete before beginning their next
mission.
In the case where a PUFFER loses communication with the base station, the rest of the system
can continue running. The PUFFER can begin or continue moving toward a goal it has already
received, and it can use the maps it has already received to navigate and to localize against. It
can also estimate its pose, collect range measurements and missions, and build local maps. When
the connection is re-established, the PUFFER shares the collected data with the base station and
receives any updates to the global map that it missed.
6.4.3 Integration with MapDB
The subsystem that localization is most tightly integrated with is the mapping database, MapDB.
MapDB is responsible for both compiling measurements from different subsystems into a costmap
used for exploration and navigation, and for sharing data between PUFFERs and the base station.
A timing diagram showing the interactions between both parts of the localization subsystem and
MapDB is shown in Fig. 6.3. This figure shows the behavior of a single PUFFER, but during
operation, several PUFFERs might all be adding nodes, missions, and UWB measurements to
MapDB and commanding synchronizations, and the PGO process incorporates all of that data.
For building costmaps, MapDB maintains a database of all of the nodes tracked by the lo-
calization system, and any navigation or mission features observed at those nodes, like observed
obstacles or exploration targets. Nodes are added to this database as ROVIOLI adds them to a
mission, and their poses are updated as maplab re-estimates them or removed if maplab removes
them to reduce the map size. The database is synced between each PUFFER and the base station,
so all nodes added or updated on one PUFFER or the base station are available throughout the
system. MapDB merges all of the measurements it has stored into a global costmap of obstacles
and exploration targets. The costmap is updated as new nodes are added and as existing nodes
are re-estimated. The global costmap is used by the exploration subsystem to select exploration
targets, and a local costmap can be extracted that covers only the area a PUFFER is expected to
traverse, which can be used by the navigation subsystem to navigate to those targets.
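The key property is that observations are stored relative to the node that made them, so re-estimating
node poses after PGO moves the obstacles along with them. The sketch below rebuilds a simple 2D
occupancy-style grid from node-local observations (the grid parameters are arbitrary, and the real
MapDB costmap is richer than a binary grid):

    import numpy as np

    def rebuild_costmap(nodes, resolution=0.1, size=(200, 200), origin=(-10.0, -10.0)):
        """Rasterize node-local obstacle observations into an occupancy-style grid.

        nodes : iterable of dicts with keys 'R' (2x2 rotation), 't' (2-vector,
                the node's current world pose), and 'obstacles' (Nx2 points in
                the node frame).
        """
        grid = np.zeros(size, dtype=np.uint8)
        size_arr = np.asarray(size)
        origin_arr = np.asarray(origin)
        for node in nodes:
            world = node["obstacles"] @ node["R"].T + node["t"]
            cells = np.floor((world - origin_arr) / resolution).astype(int)
            valid = ((cells >= 0) & (cells < size_arr)).all(axis=1)
            grid[cells[valid, 1], cells[valid, 0]] = 1
        return grid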
Figure 6.3: Timing diagram of interactions between the localization subsystem and MapDB. The
localization subsystem includes both ROVIOLI, running onboard the PUFFER (left) and maplab,
running onboard the base station (right). Green arrows indicate database synchronization, which
is done using Bucardo. Orange arrows indicate updates involving nodes, which are done via pub-
lished ROS messages and result in updates to the costmap. Black arrows show ROS services,
which are used for data that is shared using the database but does not update nodes, like missions
and localization maps.
In addition to nodes, MapDB also shares other types of data used by the localization sub-
system between the PUFFERs and the base station. Specifically, it shares map missions, which
are constructed by the visual odometry system onboard each PUFFER, and UWB measurements,
which are collected in ROS bag files onboard the PUFFERs, with the base station, and it shares
the localization map created by pose graph optimization with the PUFFERs. These data items are stored
as complete files or folders, so they can be shared without considering the underlying
structure of the data. To deal with the high bandwidth required to share these items, the actual
database synchronization is commanded at specific times. During typical usage, maplab missions
and UWB data bags are added to MapDB after a mission is completed, and a sync is commanded
while the robots are stationary between missions. When the missions and UWB measurements are
received on the base station, they are read out of MapDB and added into the map being constructed
by maplab. The PUFFER only waits for the sync to complete before beginning the next mission,
so it does not have the results of the pose graph update using the mission it just shared. Pose
graph optimization proceeds in parallel on the base station while the PUFFERs move on to their
next navigation goal. Once pose graph optimization has been completed, maplab produces a new
localization map, which is added to MapDB and then shared to all of the PUFFERs during the next
sync to be used in future runs of ROVIOLI.
6.5 Experiments and Results
The Autonomous PUFFER system was demonstrated in a series of experiments as it was devel-
oped and all the parts were integrated. For the localization system, experiments were performed
showing onboard visual odometry and visual loop closures, pose graph optimization with range
measurements, and the full system integration. Examples of the experiments demonstrating the
single-robot system, including both visual odometry and visual loop closures, and the full system
integration are described below. In both cases, they mainly serve as validation of the system hard-
ware and software architecture. The experiments with range measurements, including the novel
inclusion of range edges in the pose graph, were discussed in Sec. 5.4.
6.5.1 Single-Robot System
The single-robot system, including both visual-inertial odometry and visual loop closures with
PGO, was first demonstrated indoors in a lab environment. While no truth reference was available,
this allowed the PUFFER to traverse a measured path. The driving area included scattered rocks
that had been collected from the JPL Mini Mars Yard, where the full system experiments were
performed later, allowing for evaluation of whether features were adequately detected on the rocks.
An example image observed by the PUFFER in this experiment is shown in Fig. 6.4.
In the experiment presented here, the PUFFER was driven around two nearly square loops, each about
2.25 meters on a side. Because the PUFFER was driven by hand, the corners are
slightly rounded and the sides are not perfectly aligned. Fig. 6.5a shows the ground track estimated
onboard the robot by the VIO system, which includes visual-inertial odometry and wheel encoder
measurements. Overall, the scale is estimated quite well, but some error appears during turns due
to wheel scale factor errors and the lack of long feature tracks. Since the two loops covered the
same area, a large number of visual loop closures were found between the first and second loops.
Fig. 6.5b shows the ground track of the same trajectory after optimization. Those loop closures
help to correct the heading errors that appear during turns, and the correct trajectory, in which both
loops were almost the same and the PUFFER nearly returned to its initial pose, is shown.
6.5.2 Full Multi-Robot System
The full Autonomous PUFFER system was demonstrated in experiments in the JPL Mini Mars
Yard, a planetary analog environment. The Mini Mars Yard is similar to the JPL Mars Yard where
the experiments in the previous chapter were performed.
Figure 6.4: An image collected by the PUFFER during the indoor experiment, showing some of
the rocks that were collected.
Figure 6.5: Ground tracks from the single robot experiment in the lab, where the PUFFER com-
pleted two nearly square loops. (a) Ground track with only odometry (including vision, IMU, and
wheel encoders). The PUFFER maintains a good scale estimate during straight portions, but ac-
cumulates heading errors on corners due to the lack of long feature tracks. (b) Ground track after
visual loop closures and PGO. The loop closures allow the system to correct the heading errors
and estimate the trajectory accurately.
Figure 6.6: Overhead view of the Mini Mars Yard, with rock features highlighted. These rock
features are identified in the maps constructed during PUFFER operation. The initial positions of
two PUFFERs in a different experiment are shown, but the rock features are the same. The three
robots in this experiment started in a similar location.
It is smaller and has sandier terrain, with a number of larger rock features. An overhead view of the Mini Mars Yard with rock fea-
tures highlighted is shown in Fig. 6.6. All onboard processing, including real-time visual-inertial
odometry, was done on the PUFFER hardware, which uses an Nvidia Jetson NX, as described in
Sec. 6.2.1. A laptop with Intel Core i7 processors and 16 GB of RAM was used as the base station,
and all base station processing, including map merging and PGO, was performed onboard. The full
system experiments were performed in April 2021. I was not able to be present due to COVID-19
restrictions, but the software described in this chapter was used.
During the experiment, 3 PUFFERs were placed in initial locations near the base station. They
began facing the base station, so they could use the AprilTag to initialize their poses. The team’s
task was to explore the area. The exploration subsystem designated an area to be
covered, and commanded each PUFFER to locations in that area. As they reached those locations,
new locations were commanded for the robots to cover the whole area. As the robots explored,
they built localization maps of the area and merged observations of obstacles into the costmap.
Additionally, the science mapping subsystem identified rocks of interest, which had been placed in
the exploration area ahead of time, and these rocks were also added to the costmap.
Figure 6.7: Missions collected on (a) AP1, (b) AP3, and (c) AP6 during the full system experiment.
Each color denotes a separate mission, with each mission beginning where the previous one ended.
The length of each mission is limited to keep the local maps from growing too large; the longest
missions shown here are about 4 minutes and the full exploration took about 15 minutes.
The system as a whole was evaluated on several benchmarks, including how well these rocks were detected and
localized.
Onboard each PUFFER, the localization system was controlled by the executive as described
earlier in this chapter. Each robot collected a series of short map missions containing a local map,
with each mission beginning at the same pose where the previous mission ended. The robots each
collected 4-6 missions over the 15 minute experiment, with the longest covering about 4 minutes
and a few meters of motion. The missions collected by each robot are shown in Fig. 6.7. The
robots also each collected measurements from their UWB sensors to each of the other robots and
to the base station.
On the base station, the missions from each robot and the UWB measurements were merged
into a single pose graph. Additionally, the nodes in the pose graph were added to MapDB, where
they were used to construct a costmap showing the explored area. As all missions and UWB
measurements were received on the base station, the pose graph was optimized with loop closures
and the node poses were re-estimated. Those updated poses were shared with MapDB and used to
update the costmap. The initial map of the area before optimization and the optimized final map
are shown in Fig. 6.8.
Figure 6.8: Costmaps constructed as the 3 PUFFERs explored the Mini Mars Yard. (a) Ground
truth. (b) The costmap constructed before PGO, using only the robots’ onboard visual odometry
and their initial poses from viewing the AprilTag. (c) The costmap after PGO. As nodes are up-
dated, the obstacles they observed are also updated, and the map structure matches the true layout
of the exploration area.
6.6 Conclusion and Future Work
This chapter described the semi-online multi-robot SLAM system developed and demonstrated on
the Autonomous PUFFER project at JPL. As part of the Autonomous PUFFER team, I developed
the SLAM system architecture and designed and implemented most of the modifications and inter-
faces described in this chapter. I also assisted in all testing except the full system demo, when I was
not able to participate due to COVID-19 restrictions. The program demonstrated the feasibility of
using a team of small robots to autonomously explore an unknown area. The methods developed
on Autonomous PUFFER, including the SLAM architecture and the use of range measurements
for loop closures, are currently being used on CADRE, a lunar technology demonstration targeted
for launch in 2024. The development of CADRE includes an updated flight-hardened hardware
design for the rovers and flight safety-focused improvements to the algorithms.
Though the PUFFERs are identical, they encounter many of the same challenges as the hetero-
geneous robots discussed in the other chapters of this thesis. Because they are so small, PUFFERs
primarily view the rocks nearest to them, and they view them from a perspective very close to the
ground. This means that two PUFFERs, even if they are in the same area, often do not observe the
same features unless they are in nearly the same location. Thus, it is challenging to find feature-
based visual loop closures, so PUFFERs must depend on methods that do not use visual features
or do not use vision at all. Autonomous PUFFER implemented loop closures from range mea-
surements, described in Chapter 5. But systems like this one would also benefit from loop closure
methods that take advantage of the scene structure, like those using 3D features, as described in
Chapter 3, and those using unstructured objects, described in Chapter 4. Additionally, in more
complex exploration scenarios, PUFFERs may be only one part of a multi-robot team. The team
might include other types of robots, such as large rovers to traverse long distances and carry heavy
equipment, UAVs to survey larger areas and map terrain, and even orbiters to help with long-term
mission planning. As part of this team, PUFFERs could explore challenging environments like
caves or rock overhangs and squeeze into small areas to collect important samples. Here, they
would face all the same challenges of finding loop closures between heterogeneous robots in un-
structured environments, and any of the techniques discussed in this thesis may help the team to
build, update, and share their maps.
Chapter 7
Conclusion and Future Work
Multi-robot mapping is a key capability for robots to operate together in unknown environments.
Any missions they might be used for will require them to build and use maps together, and to
be able to localize themselves relative to each other. In challenging scenarios, especially those
that include heterogeneous robots operating in unstructured environments, traditional loop closure
techniques like visual feature, scene, or object recognition are often unsuccessful. These tradi-
tional techniques depend on assumptions about the designs of the robots, the paths they might take
through the environment, and even what the environment looks like, but those assumptions do not
apply in unstructured environments like planetary exploration or search and rescue after a disaster.
New techniques are needed for these scenarios, particularly those that take advantage of what is
known about the structure of the environment or do not depend on the environment at all.
This thesis introduced a framework for identifying and using inter-robot loop closures in groups
of robots operating in unstructured environments. Because the environments lack organization
and do not typically have man-made objects, the methods developed for inter-robot loop closures
cannot depend on the robots following specific paths or observing known objects. Instead, they
must use what information is known about the environment to build structure into the map, leading
to maps that describe the important features of the environment while being adaptable for use
with diverse robots and teams. Developing those maps requires inter-robot loop closures to enable
sharing and merging maps, even between different kinds of robots. The work discussed in this
thesis included novel methods of finding loop closures between robots with heterogeneous sensors
(Chapter 3), robots with different perspectives or methods of mobility (Chapter 4), and even robots
that do not observe visual loop closures at all (Chapter 5). It also included a full multi-robot
SLAM system (Chapter 6), which uses ranging measurements to improve or replace visual loop
closures, and provides a framework for the considerations needed to develop a multi-robot SLAM
architecture.
One of the key insights that enables these robust mapping and localization methods is that,
even when the environment is unstructured, meaning that it does not have any man-made structure,
it still contains objects and terrain. By taking advantage of whatever structure does exist in the
environment, like unique features in the 3D shape of the world, or detecting natural objects like
rocks, these methods make unique observations that support multi-robot mapping. And by building
maps around features of the environment rather than artifacts of the sensors used, they are also more
adaptable to diverse and heterogeneous robots and teams.
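
As a small illustration of this insight, the Python sketch below matches clusters of detected rocks between two robots' maps by keeping only correspondences whose pairwise distances agree, a property that does not depend on either robot's viewpoint. The greedy consistency check, variable names, and synthetic data are simplifications for illustration, not the matching algorithm developed in Chapter 4.

# Simplified sketch: associate rocks seen by two robots by requiring that
# candidate matches preserve pairwise distances (invariant to viewpoint).
import numpy as np
from itertools import product

def match_rock_clusters(rocks_a, rocks_b, tol=0.1):
    """Greedily select correspondences whose pairwise distances agree within tol."""
    candidates = list(product(range(len(rocks_a)), range(len(rocks_b))))

    def consistent(p, q):
        (i, j), (k, l) = p, q
        if i == k or j == l:                      # never reuse a rock
            return False
        da = np.linalg.norm(rocks_a[i] - rocks_a[k])
        db = np.linalg.norm(rocks_b[j] - rocks_b[l])
        return abs(da - db) < tol

    # Score each candidate pair by how many other pairs it is consistent with,
    # then greedily keep candidates that agree with everything kept so far.
    scores = [sum(consistent(p, q) for q in candidates) for p in candidates]
    matches = []
    for _, p in sorted(zip(scores, candidates), reverse=True):
        if all(consistent(p, q) for q in matches):
            matches.append(p)
    return matches

# Hypothetical example: robot B observes the same four rocks from a rotated,
# shifted viewpoint, plus one extra rock that should remain unmatched.
rocks_a = np.array([[0.0, 0.0], [4.0, 0.0], [1.0, 3.0], [5.0, 5.0]])
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
rocks_b = np.vstack([rocks_a @ R.T + np.array([3.0, -2.0]), [[12.0, -1.0]]])
print(match_rock_clusters(rocks_a, rocks_b))  # recovers the four correct (i, i) pairs

Because the check uses only distances between rocks, it works across robots with different sensors, perspectives, or methods of mobility.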
As future multi-robot systems work toward supporting diverse teams, they will require the de-
velopment of better qualitative maps. The work in this thesis provides a framework for what those
maps should look like. Future development of more adaptable techniques will require localization
systems that move beyond landmark tracking and scene descriptors to develop an understanding of
the structure of the world. The “features” used for heterogeneous loop closures or rock matching
are simply ways of processing the actual 3D structure of the world, and the maps created need to
represent that structure and the objects in it. The PUFFER system separates the localization
subsystem, which uses visual feature landmarks, from the mapping subsystem, which maps objects like
rocks for navigation purposes. In an adaptable system, which can process and apply information
about the world in many different ways, the localization subsystem should be able to directly use
the rocks and other objects that are identified.
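
One possible reading of this recommendation is a map whose landmark store is agnostic to how a landmark was produced, so that the localization subsystem can query visual features and detected rocks through the same interface. The sketch below is a hypothetical illustration of such an interface; the class names and fields are assumptions made for the example, not the PUFFER software design.

# Hypothetical map interface in which visual features and natural objects
# (e.g., rocks) are interchangeable landmarks for the localization back end.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Landmark:
    landmark_id: int
    position: np.ndarray           # estimated 3D position in the map frame
    covariance: np.ndarray         # 3x3 position uncertainty

@dataclass
class VisualFeatureLandmark(Landmark):
    descriptor: np.ndarray = field(default_factory=lambda: np.zeros(32))

@dataclass
class ObjectLandmark(Landmark):
    label: str = "rock"            # semantic class from the mapping subsystem
    radius_m: float = 0.1          # coarse size, useful for re-association

class LandmarkMap:
    """Stores heterogeneous landmarks and serves them to any localization front end."""
    def __init__(self):
        self._landmarks = {}

    def add(self, lm: Landmark):
        self._landmarks[lm.landmark_id] = lm

    def query_radius(self, center: np.ndarray, radius: float):
        """All landmarks, of any type, within `radius` meters of `center`."""
        return [lm for lm in self._landmarks.values()
                if np.linalg.norm(lm.position - center) <= radius]

# Usage: the estimator asks for nearby landmarks without caring how they were made.
m = LandmarkMap()
m.add(VisualFeatureLandmark(1, np.array([2.0, 0.5, 0.0]), np.eye(3) * 0.01))
m.add(ObjectLandmark(2, np.array([2.5, -0.3, 0.0]), np.eye(3) * 0.04, label="rock"))
print([type(lm).__name__ for lm in m.query_radius(np.array([2.0, 0.0, 0.0]), 1.0)])

With this kind of interface, adding a new landmark source only requires defining another landmark type, not modifying the estimator.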
As future teams of robots take on more demanding challenges and become more complex and
more autonomous, complete SLAM systems will be required that can build, share, and use joint
maps. These systems should be designed to look beyond the traditional loop closure methods
and support novel maps and methods like those discussed here. While the exact implementation
depends on the target mission and hardware, they must support robust options like sharing different
types of observations (such as visual features, 3D features, scene descriptors, objects, or patterns
of objects), using different types of measurements for map merging and optimization (such as
relative pose measurements, landmark observations, or distance or bearing measurements), and map
representations that allow all of that information to be stored, retrieved, and computed as it is
needed.
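
One way to support these different measurement types in a single back end is to give every measurement a common residual interface, so that relative poses, landmark observations, and ranges all become interchangeable factors in one optimization. The sketch below illustrates that abstraction with planar states and synthetic values; the class names are hypothetical and do not correspond to any particular SLAM library.

# Hypothetical factor abstraction: every measurement type exposes the same
# residual() interface, so one optimizer can consume all of them together.
import numpy as np

class RelativePoseFactor:
    """Odometry or scan-matching result between two poses of one or two robots."""
    def __init__(self, i, j, measured_delta, sigma):
        self.i, self.j, self.delta, self.sigma = i, j, np.asarray(measured_delta), sigma
    def residual(self, poses, landmarks):
        return (poses[self.j] - poses[self.i] - self.delta) / self.sigma

class LandmarkObservationFactor:
    """Observation of a mapped landmark (visual feature, rock, ...) from a pose."""
    def __init__(self, i, k, measured_offset, sigma):
        self.i, self.k, self.offset, self.sigma = i, k, np.asarray(measured_offset), sigma
    def residual(self, poses, landmarks):
        return (landmarks[self.k] - poses[self.i] - self.offset) / self.sigma

class RangeFactor:
    """Distance-only measurement between two poses (e.g., from UWB radios)."""
    def __init__(self, i, j, measured_range, sigma):
        self.i, self.j, self.range, self.sigma = i, j, measured_range, sigma
    def residual(self, poses, landmarks):
        d = np.linalg.norm(poses[self.j] - poses[self.i])
        return np.atleast_1d((d - self.range) / self.sigma)

def stacked_residual(factors, poses, landmarks):
    """What a generic back end would hand to its nonlinear least-squares solver."""
    return np.concatenate([f.residual(poses, landmarks) for f in factors])

# Toy check with planar (x, y) states only; all residuals evaluate to zero.
poses = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]), 2: np.array([0.0, 3.0])}
landmarks = {0: np.array([1.0, 1.0])}
factors = [RelativePoseFactor(0, 1, [1.0, 0.0], 0.05),
           LandmarkObservationFactor(1, 0, [0.0, 1.0], 0.1),
           RangeFactor(0, 2, 3.0, 0.1)]
print(stacked_residual(factors, poses, landmarks))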
Abstract
A key capability for a team of robots operating together in an unknown environment is building and sharing maps. As each robot explores, it must be able to build its own local map and use it for navigation. To take advantage of the benefits of working in a team, the robots should also be able to share and merge those maps. Merging these local maps into a global map requires identification of loop closures, or places where the maps overlap. However, tasks in unstructured environments, such as planetary exploration, are not well-suited to traditional visual loop closure methods like scene or object detection. These tasks may involve robots with unusual sensors, the robots may not observe the same areas, and the environment may not allow for identification of standard visual features, which all make it challenging to identify loop closures. The team may also be heterogeneous, so there may be differences in how and where the robots make their observations.
This thesis addresses the challenge of identifying robust loop closures in spite of these limitations. It includes several methods that successfully find inter-robot loop closures in challenging unstructured environments, including a method using heterogeneous sensors, a method for robots that view the world from different perspectives, and a method with ranging sensors for scenarios where robots’ trajectories do not overlap. It also discusses the Autonomous PUFFER multi-robot SLAM system, a semi-real time system developed for a team of robots operating autonomously in a planetary exploration environment. Finally, it discusses how these techniques provide a framework for future multi-robot mapping in unstructured environments. The maps and systems developed will need to accurately model the environment while also supporting diverse robots and teams.