Pretraining Transferable Encoders for Visual Navigation using Unlabeled Datasets
by
Kiran Kumar Lekkala
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2025
Copyright 2025 Kiran Kumar Lekkala



Dedication
To everyone in the last 28 years of my life, especially my parents, who have been a pillar of support and
constantly motivated and inspired me.

Acknowledgements
This PhD thesis would not have been possible without the support, guidance, and encouragement of many
people, to whom I am deeply grateful.
First and foremost, I would like to express my heartfelt gratitude to my advisor, Prof. Laurent Itti
for their unwavering support, invaluable insights, and constant encouragement throughout this journey.
Their expertise and dedication have been a guiding light, and I am immensely fortunate to have had the
opportunity to learn under their mentorship.
I extend my appreciation to the members of my dissertation committee, Prof. Erdem Biyik and Prof.
Bartlett Mel, for their constructive feedback and invaluable suggestions that have enriched this research.
A special thanks to my colleagues and collaborators at the University of Southern California, especially
my lab mates, for their camaraderie, stimulating discussions, and technical assistance. Working alongside
such talented individuals has been both inspiring and rewarding.
I am profoundly grateful to Dolby Labs for giving me the opportunity to intern with their team and for
the exposure to cutting-edge research in 3D Large Language Models. The experience significantly shaped
my perspective and contributed to this work.
To my family, Bhavani and Suri Naidu Lekkala, I owe my deepest thanks for their unwavering belief in
me and for their unconditional love and support. To my friends, who provided a much-needed balance
between work and life, your encouragement has meant the world to me.

Finally, I would also like to thank my brother, Jayath, who has been a constant support and played a
major role by being in close vicinity during hardships.
This thesis is a testament to the collective efforts of everyone who supported me throughout this
journey. Thank you.

Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Learning for Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Imitation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Imitation from Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Generalizable pretraining using image datasets for transfer . . . . . . . . . . . . . . . . . . 8
1.2.1 Unsupervised pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Representation learning through Visual attention . . . . . . . . . . . . . . . . . . . 11
1.2.3 Representation learning through Multi-task learning . . . . . . . . . . . . . . . . . 11
1.2.4 Generalizable transfer using Meta learning . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.5 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 2: BEV based visual navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Perception model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Temporal model with Robustness modules . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Experimental platform and setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Experimental platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2.1 Train dataset from CARLA simulator . . . . . . . . . . . . . . . . . . . . 28
2.3.2.2 Validation dataset from Google Street View . . . . . . . . . . . . . . . . . 28
2.3.2.3 Test dataset from Beobotv3 . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Discussion and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 3: Value Explicit Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Problem Setting and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Contrastive Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Discounted Returns and Value Functions . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4: USCIlab3D dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Related datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Multi-view datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.2 Scene datasets with semantic labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Dataset collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Robot platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2 Dataset collected over the entire USC campus . . . . . . . . . . . . . . . . . . . . . 54
4.2.3 Synchronization of cameras and LiDAR . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.4 Sensor calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Dataset annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.1 GPT4-based candidate labels and clustering . . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Grounded-SAM masks on pixel space . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.3 Post-processing after Grounded-SAM . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.4 Projecting 2D semantic masks to 3D pointcloud . . . . . . . . . . . . . . . . . . . . 60
4.3.5 Released data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.1 Evaluation on Novel View Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.2 Evaluation on Semantic Segmentation and Completion . . . . . . . . . . . . . . . . 64
4.5 Caveats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 5: BeoGym . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Proposed Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.1 Gaussian splat based rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Sequence Graph for querying splat files . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Benchmarking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 6: Conclusion and Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

List of Tables
4.1 Comparison of the existing datasets with our USCILab3D dataset. . . . . . . . . . . . . . . 53
4.2 Comparison of Semantic Classes and Labels Across Existing Datasets and Our USCILab3D
Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Clustering of the semantic labels. We use GPT-4 to cluster 267 labels into 12 categories
using the prompt "Could you help me classify by following category: Vehicle, Nature,
Human, Ground, Structure, Street Furniture, Architectural Elements." . . . . . . . . . . . . 58
4.4 Percentage of incorrect pixel labels. Quantitative measures to show robustness through
the change in the percentage of incorrect pixel labels with additional prompts. Note that
this table is in relation to the above Figure 4.5 . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Performance comparison of 3D Gaussian splatting on different datasets. Our dataset
achieves superior performance compared to other datasets. Although Deep Blending
demonstrates a higher PSNR, it only contains 2.6K images. . . . . . . . . . . . . . . . . . . 64

List of Figures
1.1 Examples of simulators that are used to train and evaluate Reinforcement learning
algorithms. (a) Carla (left) and the (b) UR5 Gazebo (right) simulators. In Carla simulator,
the agent had to solve a point goal navigation task, where it has to reach a destination
location from a randomly spawned location. In UR5 Gazebo simulator, the task for the
agent was to move the fixed end-effector in alignment with the trajectory consisting of the
target pose (blue) and the waypoints (red). . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Blue represents the uniform sampling distribution used by vanilla-Evolutionary strategy
methods, and orange represents the sampling distribution of more sophisticated Zeroth-order
methods. Center dots in both sampling distributions represent the mean. . . . . . . . . . . 3
1.3 Evaluating deployable model involves pretraining a system on a dataset and then deploying
it. This process consists of freezing the model parameters and allowing the system to
quickly learn and adapt to unseen tasks on the fly. As we can see in the above figure, during
deployment, the pretrained encoder model accommodates an unseen task and so only the
policy is learnt to perform that task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Reconstructions obtained from the trained VAE model on the test-dataset. Note that
the model has not been trained on any of the above games. Odd rows correspond to the
reconstructions of the subsequent even rows. . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Generic framework for Multi-task meta-learning that could be used for pretraining. Each
task-specific head is adapted to perform an unseen subtask optimally. . . . . . . . . . . . . 14
2.1 Overview of our system. We first pretrain the visual navigation system on a large-scale
dataset, consisting of unlabeled expert videos (expert videos without actions used to
pretrain the encoder) and random trajectory rollouts (used to train the memory module),
collected in the simulator. Once the model is pretrained, the frozen model is deployed for
performing Visual Navigation either using Reinforcement Learning or Planning. . . . . . . 17
2.2 Working of the System. RGB observation ot at time step t is passed to the perception
model (blue) that compresses it into an embedding zt. The memory model takes the current
latent representation zt and uses the historical context to refine the state into ẑt. These
embeddings could either be used to train a control policy (orange) or to reconstruct the
Bird’s Eye View (BEV) for planning (grey). Both utilities result in an action command at. . . 18
2.3 Training pipeline for the perception model. (a) During the training phase, the ResNet
model is trained using a set of temporal sequences, consisting of pairs of input (FPV images
o, displacement ∆g and orientation to goal ∆ϕ) and output (BEV images x) from the
simulator. Our contrastive loss embeds positives zps closer to the anchor zan and negatives
zng farther away. (b) In the bottom, we pictorially show the input embeddings zt from FPV
images, actions at and the output zt+1 that is used to train the memory module. . . . . . . 22
2.4 Robustness enhancement using Memory module. TSC (red) only takes input from
the representation zt when it comes with a high confidence score. Otherwise, it takes
the previous prediction by the LSTM zˆt−1 as interpolation. ASC (green) improves the
representation of the incoming observation by making it in-domain. The crosses above
correspond to rejecting the precepts and using the model’s state prediction as the current
state. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Out-of-domain and real-world evaluation We constructed two 6-class validation
datasets: one from the simulator (upper-portion in the table) and another from real-world
street-view data (lower-portion). The values in the header rows correspond to number
of data samples. Each class corresponds to the BEV images shown above. We specify
accuracies for each class. Along with that, we also specify the success rate (SR) of the agent,
when the encoder is deployed for real-world visual navigation. Our method outperformed
the ResNet classifier (baseline) on both the unseen simulation dataset, the real-world
validation dataset and real-world navigation as shown above. . . . . . . . . . . . . . . . . . 25
2.6 Ablation experiments on the Test Dataset. Classes in the above table have the
same correspondences as the classes in Fig. 2.5. Each double-row corresponds to a data
sequence. We demonstrate that our approach not only attains high ACC (accuracy),
but also provides a more granular BEV representation compared to the naive classifier,
as indicated by the MSE (Mean Squared Error) and CE (Cross-Entropy) metrics. (a).
In the upper portion of the table, we assessed our method independently of the LSTM
on an unseen temporal sequence from the simulator, contrasting it with the baseline CNN
classifier. (b) In the lower portion, we compared the performance of system with and
without LSTM on a real-world data sequence. Note that dashes in the table indicate the
absence of a class in the respective sequence. We compute the mean values for each row as
shown in the last column. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Comparison of runtime. Computation costs (runtime in milliseconds) of each module in
the navigation system for policy learning and planning are shown above. . . . . . . . . . . 29
2.8 Policy learning and Planning experiments on navigation task using pretrained
representations. Using a pretrained ResNet encoder, we compare our method with
different baselines. The training curves are obtained when we train a 1-layer policy, using
RL, that takes the embeddings from the frozen encoder. The x and y axis corresponds to
iterations and the cumulative reward, with the shaded regions showing the 95% confidence
intervals. We also perform planning experiments, where the BEV reconstructions are used
to navigate to the goal, as shown by the success rate (SR), through the dotted lines
corresponding to each method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 High-level overview of our problem statement The encoder fϕ is pretrained using play
data from a set of train tasks, that is then reused for an unseen task. We evaluate pretrained
encoders produced by our method and the baselines on the Atari and Navigation benchmarks. 35
3.2 Description of our method (VEP). We compute value estimates (Bellman returns),
as denoted by G, for each frame. We then use a contrastive learning-based pretraining
method that learns task-agnostic representations based on G. The above figure is a pictorial
representation of a training scenario where the sampling batch size bT is 2 and the training
batch size bG is 1. This results in anchor, positive and negative sampled from two
sequences in each batch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Pretraining results on Atari (Left). Performance of different pretraining methods on
the respective games as mentioned above on the left. The encoder is pretrained only on
the first two games (Demon-Attack and Space-Invaders) and is evaluated on the other
out-of-distribution games. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Pretraining results on Navigation (Right). Performance of different pretraining
methods on the respective cities as mentioned above right. Similar to the Atari experiments,
for all the baselines, play data from the first two tasks (Wall Street and Union Square) were
used for pretraining. VEP representations improve PPO policy performance by up to 2×. . 40
3.5 (a). Reward functions for Navigation (left). For a specific map, the agent spawns at a
predetermined starting location (red), with the flexibility to initiate at a random location
within an r-step radius of the fixed starting point. The sparsity of the rewards (brown lines) that
enable the agent to navigate to the goal (green) can be adjusted through the parameter L.
(b). Comparison with other existing pretrained models (right). We show the bar
plot that compares VEP with other existing pretrained models using the mean cumulative
reward of the policy on the out-of-distribution task. . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Comparison of our method with End-to-end trained method for Navigation task.
Note that in each of the above training curves, the end-to-end baseline has the entire model
trained on each of the above tasks, whereas our method (VEP) is pretrained only on play
data from Wall Street and Union Square. The x axis corresponds to the wall-clock
time (not including the pretraining time for VEP since it's negligible compared to the online
RL train-time). Compared to any pretrained method, End-to-end training baseline takes
significantly longer time (2.1× for Navigation and 3.3× for Atari). Since both the methods
were trained for the same number of timesteps (20M), our method finished earlier and the
dotted line is only for comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 (a) Different early stop iterations (left). Notice that with an increase in number of
pretraining tasks (cities) from 2 to 4, our method performs better with fewer training
iterations. (b) Larger batch size (right). We compared TCN by equating the batch size
and the number of iterations to match those of VEP by combining sample and train batch
size, to show that the learning ability of our method is due to value estimates amidst tasks. 47
3.8 Performance comparison on the quality of play data. Each of the above bar plots
corresponds to the evaluation of the encoder in a different city. Each coloured bar
corresponds to a specific play dataset used for pretraining. We also provide 95% confidence
intervals along with the mean cumulative reward. . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Images with the respective 3D pointclouds. Our adjacent five cameras provide
comprehensive coverage with overlap at the same timeframe, ensuring redundancy of the
captured information. We also show the corresponding point cloud view for every image. 51
4.2 The pipeline of our semantic annotations method. We use GPT4 and Grounded-SAM
to create pixel-wise semantic labels and align the 2D and 3D points. . . . . . . . . . . . . . 53
4.3 Overview of the data collection robot and its hardware. Beobot-v3 is a differential-drive, non-holonomic mobile robot, equipped with five Intel Realsense D455 cameras and
one Velodyne HDL-32E LiDAR sensor used to collect the dataset. . . . . . . . . . . . . . . 55
4.4 Sample snapshots from our dataset at various daylight times. These are images
obtained from randomly sampling across the entire dataset. . . . . . . . . . . . . . . . . . . 56
4.5 Robustness of Grounded SAM to prompts. Comparison of the semantic masks obtained
using different prompts for the same image by Grounded-SAM model, showing the
robustness of the model. On the right image, the additional prompts were "fire hydrant,
person, car, Parking lot lines, Boat, Scooter, Dog, Bear, Cat" along with the common prompts
"Trees, Bushes, Benches, Tables, Chairs, Pavement, Buildings, Windows, Doors, Emergency call
box, Umbrellas, Leaves, Grass" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Histogram of semantic label frequency in point cloud scans and points. Top 50 most
frequently estimated semantic classes in points (orange), and the corresponding point cloud
scan frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1 Comparison with other simulators. Rendering speed or Frames per Second (FPS)
recorded for a single-thread process with frame resolution of 1280 × 720; a single episode
is ∼ 200 timesteps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Overview of our simulator. The agent obtains a percept/image from the simulator and
estimates a control signal ut. The outer loop in the simulator determines whether the agent
has passed the boundaries of the current sector and if a new splat file has to be loaded using
the sequence graph. The inner loop corresponds to the motion model that computes the
pose xt that is then used to render the percept in the next timestep using the Gaussian splat
file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Sample sequence graph. The left graph depicts the node graph, where each node
represents a scan in the trajectory. The right graph illustrates the corresponding sequence
graph, with each node representing a sector. The value of each node in the sequence graph
indicates the number of scans within that sector. . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Occupancy map and Elevation map. These maps are computed from the collected 3D
data from the USCILab3D dataset and are used for validating agent poses and estimating
collisions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 Comparison of different novel-view rendering methods quality on our dataset. Instant-NGP
exhibits poor quality compared to Gaussian Splatting, with noticeable blurring in the area
highlighted by the red circle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

Abstract
Advancing the field of autonomous navigation and multimodal pretraining, this thesis presents a comprehensive study that leverages innovative methodologies for pretraining using large-scale datasets. The
central aim is to enable robust, task-agnostic transfer learning and efficient training of autonomous agents
across diverse environments and tasks.
We first introduce a novel multimodal pretraining framework for visual navigation, which fuses
components of a traditional World Model into a unified system. This system leverages Bird’s Eye View
(BEV) representations as an intermediate perceptual space to bridge complex First-Person View (FPV)
observations and downstream policy learning. Pretrained entirely on unlabeled videos and simulator data,
the model demonstrates zero-shot transfer to unseen environments, achieving faster reinforcement learning
through its BEV-based embeddings. A state-checking module further enhances robustness by interpolating
uncertain or missing observations. Extensive evaluations using a differential drive robot in both simulated
(CARLA) and real-world settings underline the effectiveness of this approach, supported by open-source
resources for reproducibility.
To generalize representations across tasks, we propose Value Explicit Pretraining (VEP), a task-agnostic
framework that learns objective-conditioned encodings invariant to environmental variations. Unlike
traditional methods relying on optimal task completions, VEP utilizes sub-optimal play data, learning
temporally smooth representations via self-supervised contrastive loss. This enables efficient adaptation to
new tasks with shared objectives, evidenced by superior performance on Atari benchmarks and realistic
navigation tasks, achieving up to 3× improvement in sample efficiency and task rewards.
This work is further supported by the USCILab3D dataset, a large-scale, annotated dataset collected
using a mobile robot navigating the USC campus under diverse conditions. Comprising 10M multi-view
images and 1.4M semantically annotated point clouds, this dataset enriches multimodal 3D research by
providing detailed semantic labels, pose-stamped trajectories, and dense 3D reconstructions. The dataset’s
high granularity facilitates precise 3D labeling and serves as a foundation for diverse tasks in computer
vision, robotics, and machine learning.
Finally, leveraging volumetric rendering techniques, we developed Beogym, a data-driven simulator
built using Gaussian Splatting to render realistic navigation environments. By processing USCILab3D
data into interconnected Gaussian splat files, Beogym provides seamless transitions across sectors of the
environment, enabling first-person view imagery and realistic training scenarios. The simulator supports
advanced evaluation of autonomous agents, bridging the gap between real-world data and simulation-based
training.
In conclusion, this thesis lays the groundwork for scalable, task-agnostic pretraining frameworks,
enriched multimodal datasets, and realistic simulators to advance autonomous navigation and reinforcement
learning research.

Chapter 1
Introduction
Visual navigation for robotics has seen significant advancements in recent years, with researchers developing
innovative approaches to enable robots to perceive and navigate complex environments autonomously.
Traditional visual navigation methods often rely on geometric approaches such as Simultaneous
Localization and Mapping (SLAM). These techniques typically use depth sensors and RGB cameras to
construct geometric maps of the environment while simultaneously determining the robot’s position within
those maps. Semantic SLAM extends this concept by incorporating semantic data, allowing robots to
identify and store information about objects in their surroundings.
Recent advancements in machine learning and computer vision have led to the development of end-to-end approaches for visual navigation, which use deep neural networks to learn semantic information directly
from visual observations, often combining Convolutional Neural Networks (CNNs) for visual encoding
with Recurrent Neural Networks (RNNs) for action prediction.
When using learning-based methods, the majority of works focus on mapless navigation techniques that
allow robots to navigate without prior knowledge of the environment. These methods often use reactive
approaches based on qualitative characteristics extraction, appearance-based localization, optical flow, and
feature tracking. At every timestep, given an image of the environment captured by the camera, the objective
of the agent is to execute a control signal that allows the robot to navigate in the surroundings. Some
novel approaches solve visual navigation semantically through object-based topological navigation by
detecting and recognizing objects and creating a topological map of the environment, reducing dependence
on precise metric maps and improving navigation reliability.
1.1 Learning for Robotics
Learning-based approaches typically use an optimization strategy using a predefined cost or loss function.
Depending on the availability of labelled, unlabeled or sparsely-labelled datasets, the model parameters are
optimized using supervised, unsupervised, self-supervised or reinforcement learning.
Figure 1.1: Examples of simulators that are used to train and evaluate Reinforcement learning algorithms.
(a) Carla (left) and (b) UR5 Gazebo (right) simulators. In the Carla simulator, the agent had to solve a point
goal navigation task, where it has to reach a destination location from a randomly spawned location. In
the UR5 Gazebo simulator, the task for the agent was to move the fixed end-effector in alignment with the
trajectory consisting of the target pose (blue) and the waypoints (red).
Reinforcement learning (RL) is one of the most popular methods used in a wide range of Robotic
applications. Although RL has shown promising results in the past, one of the main challenges that arise in
RL is that it requires evaluating a large number of samples on the environment. Evolutionary strategies
(ES) are one of the least widely used RL methods, perhaps because, although they are easily parallelizable,
they have poor sample efficiency. In situations where we have access to waypoints/intermediate subgoals,
we could further improve the performance of these methods.

Figure 1.2: Blue represents the uniform sampling distribution used by vanilla-Evolutionary strategy
methods, and orange represents the sampling distribution of more sophisticated Zeroth-order methods.
Center dots in both sampling distributions represent the mean.
A common practice to run RL algorithms involves single-threaded sequential policy learning. Algorithms
that scale these sequential learning methods to distributed settings [123, 130] are gaining popularity as they
can be linearly scaled to the number of machines/cores. However, these distributed methods require the
entire gradient of the policy to be communicated across machines, which imposes restrictions on the
network bandwidth and limits usage. On the other hand, ES methods just transmit parameters and the
reward value and the distributed computation is only used for rollouts.
Existing RL works focus on policy optimization and neglect learning useful information like agent
dynamics, i.e., which action would lead to a specific state transition. Some methods like [84] propose
shaping the policy search such that the sampling procedure is biased towards the optimal
parameters (Figure 1.2). Standard RL methods also rely on reward signals from the environment to find the
optimal parameters for a policy. Two of the most active research areas in RL are overcoming sparse rewards
and exploration. In many practical scenarios, environments have long horizon and sparse rewards, and it
becomes challenging for RL methods to optimize the reward function. [4] proposes a novel approach to
alleviate sparse rewards by learning from failed trajectories, which dovetails with our motivation. Although
shaped rewards can be used in these scenarios, they often lead to sub-optimal solutions.
Derivative-free optimization methods have been used in the past for problems where computation of a
gradient is not feasible, like that of RL. Most popular Zeroth-order methods like evolutionary strategies work
by sampling parameters from a distribution and evaluating them to take an update step. Covariance-based
Evolution strategy (CMA-ES) [33] and Cross-Entropy [93] are the two most prominent algorithms widely used
in RL, and they perform surprisingly well when compared to the state-of-the-art policy gradient methods.
Other versions of ES like PEPG [104] and NES [122] carry out parameter search with novel techniques
inspired by REINFORCE, which is one of the earliest policy gradient algorithms [30].
Although Zeroth-order methods like CMA-ES are not constrained to environments satisfying the Markov
Decision Process (MDP) assumption, and so can be used in many more applications, they are highly sample-inefficient. Note that, in the ideal case, if ground-truth actions exist that resemble the actual goal/task-specific
action labels, this entire formulation would be reduced to a trivial problem of behaviour cloning using
first-order gradient descent. However, if they had access to intermediate waypoints, these Zeroth-order
methods could be improved by augmenting the optimization procedure through shaping of the sampling distribution
[51, 61].
1.1.1 Imitation Learning
Imitation learning (IL) involves learning a policy from a set of expert demonstrations without explicitly
designing any reward function. The most naive form of Imitation learning, sometimes referred to as
Behaviour Cloning requires fitting a model to the expert trajectories. Although behaviour cloning is a
seemingly trivial problem, a significant challenge involves distribution mismatch of the expert and the
agent, which leads to compounding errors. Furthermore, the trained agent would have obtained sub-optimal
behaviour compared to the expert, and so the agent can never be better than the expert. Recent advances
like Inverse Reinforcement Learning (IRL) also fall under the broad umbrella of IL. These
methods use expert demonstrations to learn the reward function, which could then be used to optimize the
policy using RL.
Expert demonstrations have also been used to supplement exploration in RL. Some of the prominent
ones include fusing demonstrations in the optimization procedure with Q-learning as an auxiliary objective
[23] and hybrid policy gradient [84]. Although the methods mentioned above show promising results
on some benchmarks, there are inherent issues. First, expert demonstrations are off-policy, which makes
the optimization challenging. Second, while trying to learn a policy using RL, the agent could forget the
expert’s trajectories, which nullifies the reason for bringing in the demonstrations.
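To make the naive form concrete, below is a minimal behaviour cloning sketch in PyTorch; the observation/action dimensions, dataset, and hyperparameters are hypothetical placeholders rather than the setup used in this thesis.

import torch
import torch.nn as nn

# Hypothetical expert dataset: (observation, action) pairs from expert trajectories.
obs = torch.randn(10000, 64)    # expert states (placeholder dimensionality)
acts = torch.randn(10000, 2)    # expert actions (e.g., linear and angular velocity)

# Policy pi_theta(s) -> a, fit directly to the expert trajectories.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    for i in range(0, len(obs), 256):
        s, a = obs[i:i+256], acts[i:i+256]
        loss = nn.functional.mse_loss(policy(s), a)   # supervised regression to expert actions
        opt.zero_grad(); loss.backward(); opt.step()

# At deployment the agent drifts away from the expert's state distribution,
# which is the compounding-error problem discussed above.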
1.1.2 Imitation from Observation
Imitation from Observation (IfO) is an ongoing research area which deals with policy learning from expert
observations, without the need of the expert actions. The policy is then learnt with the help of RL, by taking
guidance from the expert observations, which could be the same as the waypoints/intermediate goals. Unlike using
demonstrations in RL, problems of fusing in off-policy data do not arise since there are no actions
available for the expert data. However, determining the actions and reachable states is still a challenge.
Recent IfO works, like [28] and [77], use an Inverse Dynamics model.
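To make the role of the inverse dynamics model concrete, here is a minimal sketch (with hypothetical shapes and names, not the implementation of [28] or [77]): the model is trained on the agent's own transitions to predict the action taken between consecutive states, and is then used to infer pseudo-action labels for the action-free expert observations.

import torch
import torch.nn as nn

# Inverse dynamics model: f(s_t, s_{t+1}) -> a_t, trained on the agent's own rollouts.
inv_dyn = nn.Sequential(nn.Linear(64 * 2, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(inv_dyn.parameters(), lr=1e-3)

# Agent-collected transitions with known actions (placeholder data).
s_t, s_t1, a_t = torch.randn(5000, 64), torch.randn(5000, 64), torch.randn(5000, 2)
for _ in range(100):
    pred = inv_dyn(torch.cat([s_t, s_t1], dim=-1))
    loss = nn.functional.mse_loss(pred, a_t)
    opt.zero_grad(); loss.backward(); opt.step()

# Expert observations come without actions; infer pseudo-actions between
# consecutive expert states, then train a policy on (state, pseudo-action) pairs.
expert_obs = torch.randn(1000, 64)
with torch.no_grad():
    pseudo_actions = inv_dyn(torch.cat([expert_obs[:-1], expert_obs[1:]], dim=-1))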
1.1.3 Reinforcement Learning
Reinforcement learning methods are based on an MDP formulation, characterized by the tuple (S, A, R, S′), where S and A are the sets of states and actions, R(s, a) is the reward function when the agent is at state s and takes action a, and S′ is the set of possible next states reached from S. For almost all the RL training procedures, we use Proximal Policy Optimization (PPO) [101], an on-policy method, which is the state-of-the-art policy gradient method on continuous control tasks.
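For reference, a minimal sketch of PPO's clipped surrogate objective for one minibatch (advantage estimation and the value/entropy terms are omitted, and all tensors are placeholders):

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective of PPO [101] for one minibatch."""
    ratio = torch.exp(logp_new - logp_old)          # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # maximize surrogate => minimize negative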
Unlike the MDP based methods, there exist black-box based evolutionary strategies for policy search.
CMA-ES, one of the most widely used variants of evolutionary strategies, is known for its robustness to multiple local minima and has proved to be efficient in solving RL problems in comparison to the state-of-the-art policy gradient methods [95]. The core idea is to sample policy parameters from a parameterized distribution, like a Gaussian, and evaluate those parameters on the environment. The mean µ and the covariance matrix Σ of the Gaussian distribution are then updated using the M_best parameters (those with the highest rewards in our case; 25 in the most standard case), over every g-th generation. Although the covariance matrix calculation during every generation is expensive, as it is O(N²), and might not be feasible for more than 10000 parameters, there are some practical tricks to mitigate these complexities [30].

µ = (1/M_best) ∑_{k=1}^{M_best} Θ_k    (1.1)

Σ_ij = (1/M_best) ∑_{k=1}^{M_best} (Θ_k^i − µ^i)(Θ_k^j − µ^j)    (1.2)

In the above equations, Θ_k is the k-th sampled parameter and Θ^i is the i-th index of all sampled parameters; Σ_ij refers to the entry in the i-th row and j-th column of the matrix. In the remainder of the thesis, parameters refer to those of the neural network after flattening. A vanilla version of the ES algorithm exists, which involves updating the mean of the Gaussian distribution alone while keeping the covariance fixed. We improve this by using the dynamics to estimate the covariance matrix. The vanilla-ES method is based on finite differences and falls roughly into the category of Zeroth-order methods, which are used in applications where gradients are not available. Compared to CMA-ES, these methods supposedly fail in many applications because of their inability to escape local minima. Similar to CMA-ES, these methods sample perturbations ϵ from a standard Gaussian N(µ, σ²I), where I and σ represent the identity matrix and variance respectively, and use the evaluated values to move the mean in a given direction. Traditionally these methods use antithetic sampling, which involves sampling P points and P mirror images of those points with respect to the mean of the distribution. Antithetic sampling has been proven to reduce the variance of the Monte-Carlo estimator as it draws correlated rather than independent samples [89]. All the members in a population are then generated as µ + ϵσ. These members are evaluated in the simulator. In the remainder of the thesis, the term population represents a set of sampled policy parameters.
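As a minimal NumPy sketch of one generation of this procedure, combining antithetic sampling with the updates of Eqs. (1.1) and (1.2); the evaluate function (a simulator rollout returning cumulative reward), the population size P, and M_best are hypothetical placeholders:

import numpy as np

def es_generation(mu, sigma, evaluate, P=32, m_best=25):
    """One generation: antithetic sampling, evaluation, and the Eq. (1.1)-(1.2) update."""
    eps = np.random.randn(P, mu.size)
    population = np.concatenate([mu + sigma * eps, mu - sigma * eps])  # P points + P mirrors
    rewards = np.array([evaluate(theta) for theta in population])
    best = population[np.argsort(rewards)[-m_best:]]      # M_best highest-reward members
    new_mu = best.mean(axis=0)                            # Eq. (1.1)
    centered = best - new_mu
    new_cov = centered.T @ centered / m_best              # Eq. (1.2); O(N^2) storage
    return new_mu, new_cov

# Usage with a toy objective standing in for a simulator rollout:
mu, cov = es_generation(np.zeros(8), 0.1, lambda th: -np.sum(th**2))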

Notations. Now we present the notations and definitions which are used in this thesis pertaining to Reinforcement and Imitation learning. We denote πθ(st) as the policy, a neural network parametrized
by θ, which receives a state st derived from an observation ot and outputs an action at at every time-step t. The gradient of
the policy is represented by ∇L(θ), where L is the loss function (Behaviour cloning or Imitation learning)
or cost function (in the case of Reinforcement learning; also known as the reward function). Lastly, note
that the bold counterpart of a symbol corresponds to an array of scalars or vectors, depending on that symbol.
While many visual navigation methods show promising results in simulated environments, translating
these successes to real-world scenarios remains challenging. Current research also focuses on bridging
the gap between simulation and reality, developing robust methods that can handle the complexities and
uncertainties of real-world environments [14, 109]. Adding to this, as robots become smaller and more
ubiquitous, there is a growing need for visual navigation methods that can operate on resource-constrained
platforms. Insect-inspired approaches are being explored to develop navigation techniques aimed explicitly
at resource-restricted robots. It is essential to evaluate the generalization of these pre-trained models and
other systems that consist of models trained on datasets from simulators on a real-world robot.
Unlike task-specific learning and evaluating, integrating multiple sensory modalities for pretraining
presents a solution that could enable visual encoders to learn more compact representations that are
task-agnostic [127]. These multimodal approaches can potentially enhance robotic spatial awareness and
navigation capabilities in complex environments. Developing navigation methods that can scale to large,
diverse environments and generalize across different scenarios remains an open challenge and those that
leverage pretrained models and zero-shot learning show promise in this direction.

1.2 Generalizable pretraining using image datasets for transfer
Humans have an innate ability to sequentially learn and perform new tasks without forgetting them, all
while leveraging prior knowledge during this process. Continual and lifelong learning is an imperative
skill that needs to be acquired by any intelligent machine. This is especially true in the real-world, where
environments keep evolving; thus, agents need to remember previously executed tasks in order to perform
these tasks in the future without forgetting. A few of the proposed continual learning methods [29, 57] use
complex memory modules and data augmentations that become difficult to scale and deploy on real-world
robotic systems. Furthermore, apart from not forgetting previously learnt tasks, it’s important that models
are pretrained offline using large datasets so that when deployed they offer good inductive bias to warm-start
the learning process.
This type of pretraining from large offline datasets, for enabling efficient transfer, could be done in
various ways. In this thesis, we limit the pretraining strategies to image datasets and learn representations using perception or computer-vision-related methods; the resulting representation bottleneck or
visual encoder can then be frozen and used for robotic tasks during deployment. This is typically the case in
pretraining where the optimization strategy is different from the downstream learning process. We discuss
the fundamental optimization procedures used in various works in the remainder of this section.
1.2.1 Unsupervised pretraining
Offline pretraining is a fast-growing field that involves using unlabeled, unorganized data to
learn a pretrained representation model [2]. This model can be used to learn the inductive bias of tasks (like
temporal sequencing), relationships of actions with states, and value function estimates corresponding to a
state. Currently, the only forward transfer we have in our system is the priors that the backbone/encoder
model learns during the offline pretraining, but we are working on improving transfer within
games of the same type.

Figure 1.3: Evaluating a deployable model involves pretraining a system on a dataset and then deploying it.
This process consists of freezing the model parameters and allowing the system to quickly learn and adapt
to unseen tasks on the fly. As shown in the above figure, during deployment, the pretrained encoder
model accommodates an unseen task, and so only the policy is learnt to perform that task.
A pretrained encoder is used to extract the relevant features from the observations required for task
identification and downstream policy execution. During the deployment phase, the pretrained encoder
is frozen and used to infer the embeddings, which are then utilized by the downstream policy. Many
prior works use a VAE-based encoder that is trained on offline datasets. The reconstructions obtained
using the encoder model are presented in Figure 1.4. For the reconstructions shown in this figure, we use a
ResNet-based architecture for the VAE encoder. The encoder and the decoder block are built for an 84 × 84
image and result in a latent vector of size 512. The entire network was end-to-end trained for 100 epochs
which took about 27 hours on a NVIDIA V100 GPU.
To perform a specific task, a single or multi-layer fully-connected network could be trained using
Reinforcement learning. This policy receives the feature embedding and predicts an action command to
perform the necessary task.
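A minimal sketch of this deployment recipe follows; the encoder below is a placeholder stand-in for the ResNet-based VAE encoder (kept consistent with the 84 × 84 input and 512-dimensional latent described above), not the actual architecture.

import torch
import torch.nn as nn

# Placeholder encoder standing in for the pretrained ResNet-based VAE encoder.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),     # 84x84 -> 20x20
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),    # 20x20 -> 9x9
    nn.Flatten(), nn.Linear(64 * 9 * 9, 512),     # 512-d latent embedding
)
for p in encoder.parameters():
    p.requires_grad = False                       # frozen at deployment

# Lightweight fully-connected policy trained with RL on top of the frozen embeddings.
policy = nn.Linear(512, 4)                        # e.g., 4 discrete action commands

obs = torch.randn(1, 3, 84, 84)                   # one RGB observation
with torch.no_grad():
    z = encoder(obs)                              # inferred feature embedding
action_logits = policy(z)                         # only this head is learnt per task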

Figure 1.4: Reconstructions obtained from the trained VAE model on the test-dataset. Note that the model has
not been trained on any of the above games. Odd rows correspond to the reconstructions of the subsequent
even rows.

A VAE-based representation bottleneck learns embeddings of the observations solely based on appearance.
Even when there are differences in appearances, if the skills or the task-objectives learnt are similar, the
games can still be played using a single set of parameters. For example, even though Phoenix and AirRaid
have different appearances, they share the same action space and are both Shoot Up games. Our work as
outlined in the next chapters incorporates representations that are skill/dynamic-aware, as opposed to
those based solely on appearance, like VAEs. This would not only identify previously learnt tasks, but also
help reuse existing policy parameters to play a new game sharing existing skills.
1.2.2 Representation learning through Visual attention
Visual Attention has been used by researchers in vision and language models alike. Earlier works on
top-down attention for CNNs try to learn channel dependencies via a fully connected layer [39]. [119]
uses attentive feature selection and distillation for transfer learning, which was partly inspired by [22]. In
few-shot learning, attention is used in [78, 38] to highlight features which tend to maximize the correlation
between support and query samples. However, most existing approaches are limited to classification, as
they use the discrete class information.
1.2.3 Representation learning through Multi-task learning
Multi Task Learning involves learning multiple high-level tasks concurrently, and executing all of them at
test-time. Existing methods predominantly involve either hard or soft sharing of model parameters.
Since different tasks have different feature learning rates, [46] propose task loss weighting schemes, which
balance the loss by enabling tasks to regularize each other during training. Recent works have proposed
novel architectures to enhance multi-task learning [56, 24, 68], but all of them are geared towards training
and testing on a fixed set of domain-specific tasks [54, 26].

Existing methods in Multi-task learning [56, 68] leverage inter-modal features by training on multiple
modalities/tasks together, where these tasks are assumed to be fixed. Here, we define task as achieving
different goals and outputs; e.g., image classification, depth estimation, or surface normal prediction. Since
data is prone to domain and task shift, it is essential to consider these shifts. Meta learning [87] addresses
task shift by learning a set of variants for a given task (which we here define as subtasks; e.g., learning
different sets of classes or from different datasets within a general object recognition task), in such a way
that the model can quickly generalize to new unseen datasets. Current methods in meta learning deal only
with subtask variants having identical output dimensions and loss functions [18], making them unusable
for more heterogeneous situations.
Popular methods in meta learning use multiple evaluation benchmarks, ranging from image classification
to pose regression. Although earlier works typically train and test within one modality (e.g., different
subsets of mini ImageNet [117]), recent works extend to multiple datasets [115], yet still within a specific
task (e.g., image classification). In multi-modal meta learning, [128] recently proposed a method which
selects task-specific clusters for network parameters, while [118] modulate the parameters of the neural
network during adaptation based on the modality of the subtask. Although these methods deal with broader
data distributions, they are limited to a specific task with fixed output structure. [132] discuss an interesting
idea of generalizing adaptation to various output structures, but only for fully-connected layers.
1.2.4 Generalizable transfer using Meta learning
Meta Learning deals with applying prior knowledge from various tasks to learn a new task in a few shot
setting (note: although prior works mention "tasks", these so far have been subtasks per our terminology).
One of the most promising methods (MAML) is optimization-based [18]. During meta-training, MAML
learns a parameter initialization which enables the model to quickly adapt to a new unseen subtask in a few
shot setting. This involves computing Hessian-vector products which introduce computational instabilities.

To alleviate these problems and scale meta learning, there have been many improvements [85]. In some
applications like classification, other categories of meta learning algorithms, namely black-box [8] and
parametric methods [113] also achieve state of the art results. We limit our discussion to optimization-based
methods as we are concerned with flexible meta learning involving various heterogeneous tasks with varied
output structures and loss functions [37]. Meta learning is also applied to Domain adaptation as seen in
[43].
Lately, works aim to explain the effectiveness of Meta learning approaches concerning representation
and adaptation aspects [83]. Their findings surprisingly indicate that the success behind MAML is primarily
due to feature reuse amongst different subtasks. [83] presented an analysis of MAML, which shows that
actual parameter adaptation happens only in the last layer(s) and the test accuracy depends on the quality of
features learnt during meta learning. In fact, several modern few-shot methods use a fixed feature learning
backbone and adapt/update only the final layer during test-time [114]. These methods, surprisingly, beat
MAML by a significant margin. Inspired by this, we follow a similar style, where a visual encoder is trained
to learn representations, which is not adapted at all. We also include multiple heterogeneous tasks, as prior
multitask learning research has shown evidence of positive transfer. Since the quality of representations is
important to excel at meta-test time, we hypothesize that better, if not equal, quality of features can be
learnt by learning multiple modalities together.
In standard optimization-based meta learning [18], there is no separate body and head, as all model
parameters are used for adaptation. That is, these methods minimize the objective:
min
Φ
X
Ti∼p(T )
LTi
(fΦ
′) = X
Ti∼p(T )
LTi
(fΦ−α∇ΦLˆ
Ti
(fΦ)
) (1.3)
Figure 1.5: Generic framework for Multi-task meta-learning that could be used for pretraining. Each
task-specific head is adapted to perform an unseen subtask optimally.

In the above equation, fΦ is the model, including the head, and Φ are the entire model parameters. Note
that since all prior meta learning approaches optimize on a set of tasks (subtasks as per our terminology),
the loss function, unlike our notation, would be L_{Ti}. To avoid confusion, when we write (sub)tasks, it
means others have considered them as tasks, but in our terminology, they are subtasks. Also, the actual
notation of the model is fΦ(I), where I is the input. Recently, [83] showed that Almost No Inner-Loop
MAML (ANIL-MAML), a variant of MAML which only adapts the last layer, performs almost as well as
MAML. Based on more elaborate experiments, we found that ANIL-MAML performs as well as MAML.
Based on these advances, we use the ANIL-MAML training procedure:
min_{Φ,θ} ∑_{Ti∼p(T)} L_{Ti}(g_{θ′}) = ∑_{Ti∼p(T)} L_{Ti}(g_{θ − α∇_θ L̂_{Ti}(g_θ)})    (1.4)
In this case, Φ are the parameters of the model except the last layer or the head. In other words, if the
model has B conv-blocks, Φ represents the parameters of those conv-blocks and θ are the head parameters,
which in most cases is a fully connected layer. Again, the actual notation for the head gθ is gθ(x), where
x = fΦ(I), i.e., the embedding or the output activations of I by the backbone network f. Note that this notation
of g_{θ − α∇_θ L̂_{Ti}(g_θ)} is valid for (sub)tasks which have task-shifts in them. We term this (sub)task adaptation.
For tasks whose (sub)tasks have only domain shift, we use domain adaptation by pre-training (training
all the data together without any task distinction; refer to [18]) instead of meta-training, in which case
g_{θ − α∇_θ L̂_{Ti}(g_θ)} is replaced by g_θ.
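To make the ANIL-MAML procedure concrete, here is a minimal single-step sketch of the inner loop (the backbone, head, and support batch below are hypothetical placeholders): only the head parameters θ are adapted, while the backbone fΦ supplies frozen embeddings.

import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU())    # f_phi: shared, not adapted
head = nn.Linear(128, 5)                                   # g_theta: adapted per (sub)task

def adapt_head(head, x_support, y_support, alpha=0.01):
    """One inner-loop step: theta' = theta - alpha * grad_theta L_hat(g_theta)."""
    feats = backbone(x_support).detach()          # embeddings x = f_phi(I), kept frozen
    loss = nn.functional.cross_entropy(head(feats), y_support)
    grads = torch.autograd.grad(loss, list(head.parameters()))
    # Build adapted parameters theta' without modifying theta in place.
    adapted = [p - alpha * g for p, g in zip(head.parameters(), grads)]
    return adapted  # used to evaluate the outer-loop loss L_Ti(g_theta')

# Hypothetical support set for one subtask:
x_s, y_s = torch.randn(10, 64), torch.randint(0, 5, (10,))
theta_prime = adapt_head(head, x_s, y_s)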
1.2.5 Thesis organization
Visual navigation for robotics is a rapidly evolving field, with researchers exploring diverse approaches
ranging from traditional geometric methods to cutting-edge deep learning techniques. The integration of
multiple modalities, such as language and haptic feedback, is pushing the boundaries of what’s possible in
robotic navigation. As these methods continue to mature, we can expect to see more capable and versatile
robots operating autonomously in increasingly complex real-world environments. This PhD thesis attempts
to solve long-challenging problems of data and generalizable transfer of models for challenging real-world
tasks. Each of the next four chapters are based on the following publications:
• Bird’s Eye View Based Pretrained World model for Visual Navigation published at ISRR 2024
• Value Explicit Pretraining for Learning Transferable Representations Accepted and Presented at CoRL
2024 workshop on pretraining for robots, CoLLA 2024 workshop track
• USCILab3D: A Large-scale, Long-term, Semantically Annotated Outdoor Dataset Published at NeurIPS
2024
• BeoGym: Real-world Visual Navigation in a Simulator: A New Benchmark Presented at CVPR 2024
Workshop on Virtual Humans for Robotics and Autonomous Driving

Chapter 2
BEV based visual navigation
Reinforcement Learning (RL) has predominantly been conducted in simulator environments, primarily
due to the prohibitive costs associated with conducting trial-and-error processes in the real world. With
the advances in graphics and computational technologies, there has been a significant development in
realistic simulators that capture the system (robot) information. RL is a widely sought-after learning method
because it only needs a sparse reward signal for the task. However, it is compute-intensive and slow,
especially when we train models end-to-end in simulators [123]. An alternative to RL is Imitation learning
or Behaviour cloning, but it necessitates the collection of expert data.
In this chapter, we formulate a new setting for Zero-shot transfer for Visual Navigation without Maps,
involving unlabeled expert videos and random trajectory rollouts obtained from the CARLA simulator,
as outlined in Fig. 2.1. To avoid any Sim2Real gap within the control pipeline and focus only on the
perception transfer, we built a Differential-drive-based robot in the CARLA simulator that closely resembles
our real-world robot. Using this setup, we construct a large dataset consisting of First-person view (FPV) and
Bird’s eye view (BEV) image sequences from the CARLA [14] simulator. The system is pre-trained entirely
on these unlabeled videos and random trajectory datasets obtained from the simulator. The system is then
frozen and deployed for an online visual navigation task. This pretraining is inexpensive as it runs in a
simulator, but we hypothesize that BEV maps contain crucial information to facilitate learning of future
navigation tasks. Here, we hence seek to answer the question of whether such pre-training can benefit and
accelerate the learning of downstream visual navigation tasks.

Figure 2.1: Overview of our system. We first pretrain the visual navigation system on a large-scale
dataset, consisting of unlabeled expert videos (expert videos without actions, used to pretrain the encoder)
and random trajectory rollouts (used to train the memory module), collected in the simulator. Once the model
is pretrained, the frozen model is deployed for performing Visual Navigation either using Reinforcement
Learning or Planning.
2.1 Related work
Although many methods [25, 45, 51, 52] use simulators for learning through an extensive amount of
experience that could be used to train a policy end to end, some recent works [5] have shown
promising results on various tasks [41] using encoders that are pretrained on large unlabelled expert data,
with a significantly smaller network then trained on top of the frozen encoder. Since these encoders are not
trained on a specific task, we refer to this as pretraining. Representations estimated using these pretrained and
frozen encoders help the model remain lightweight and flexible, which is desirable for mobile platforms.
In our work, we employ such an approach with a new pre-training objective (to reconstruct BEV maps
from FPV inputs), which we show provides very good generalization for downstream robotics tasks. Since
learning representations does not involve any dynamics, any navigation dataset consisting of FPV-BEV
Figure 2.2: Working of the system. The RGB observation $o_t$ at time step $t$ is passed to the perception
model (blue), which compresses it into an embedding $z_t$. The memory model takes the current latent
representation $z_t$ and uses the historical context to refine the state into $\hat{z}_t$. These embeddings
can either be used to train a control policy (orange) or to reconstruct the Bird's Eye View (BEV) for
planning (grey). Both utilities result in an action command $a_t$.
could be used to pretrain the encoder. By training a vision encoder on a large aggregated dataset, our
approach could become a comparable alternative to the current ViTs [79] used for robotics.
Bird's Eye View (BEV) based representations allow for a compact representation of the scene, invariant
to any texture changes, scene variations, occlusions or lighting differences in an RGB image. This makes
for an optimal representation for PointGoal navigation. Furthermore, it is one of the most efficient and
lightweight forms of information, since the BEV maps (occupancy maps) that we use are binary. For example,
the corresponding BEV image of a 1 MB FPV image is around 0.5 KB. Some works estimate BEV maps from
RGB images, such as [58], [74] and [88]. However, these map predictions from FPV images are typically
only evaluated for visual tasks, with a lack of evidence that BEV-based representations can be useful for
robotic tasks. Furthermore, [3] have shown that reconstruction-based methods like the VAE [48] perform close
to random encoders. Ensuring that these representations are compatible as inputs for training downstream
models on robotic tasks is indeed challenging. Our pretraining approach not only allows
for learning visual representations that are optimal for robotic tasks, but also allows these representations
to reconstruct the corresponding BEV map. Together, they allow the lightweight policy model to efficiently
learn the task through these representations.
Recurrent world-models. [31] introduces a novel approach to RL, incorporating a vision model for
sensory data representation and a memory model for capturing temporal dynamics, which collectively
improve agent performance. Apart from the advantages of pretraining each module, some of the modules in
this architecture can be frozen after learning the representation of the environment, paving the way for
more efficient and capable RL agents.
We propose a novel training regime and develop a perception model, pretrained on a large simulated
dataset, to translate FPV-based RGB images into embeddings that align with the representations of the
corresponding BEV images. Along with that, we upgrade the existing world-models framework with novel
model-based Temporal State Checking (TSC) and Anchor State Checking (ASC) methods that add robustness to
the navigation pipeline when transferred to the real world. We release the code for pre-training, RL training
and ROS-based deployment of our system on a real-world robot, along with the FPV-BEV dataset and
pre-trained models. With the above contributions, we hope to move closer towards open-sourcing a robust
visual navigation system that uses models pre-trained on large datasets and simulations for efficient
representation learning.
2.2 Proposed Method
For an autonomous agent to navigate using camera imagery, we use a simple system that consists of a
perception model and a control model, as shown in Fig. 2.2. The perception model takes an input
observation $o_t$ and outputs an embedding $z_t$ that is then passed on to the policy, as part of the
control model, to output an action vector $a_t$ (throttle and steer). We first outline the perception
model, whose objective is to efficiently learn compact intermediate representations compatible with
downstream policy learning, solely from a
sequence of observations from the simulator. We then describe our second contribution, which involves
the enhancement of the robustness and stability of the predictions during real-world evaluation.
2.2.1 Perception model
When training the perception model, we focus on three main principles. First, the embedding vector $z_t$
should always be consistent with the BEV reconstruction. Second, BEV images must be represented in a
continuous latent space with smooth temporal transitions to similar BEV images. Finally, the perception
model must efficiently utilize an unlabelled sequence of images as an expert video portraying optimal
behaviour. This would also allow for unsupervised training/fine-tuning of the model using real-world
expert videos, which we leave for future work.
The perception model consists of a ResNet-50 [34] that processes the observation $o_t$
obtained from an RGB camera, with the primary objective of comprehending the environmental context
in which the robot operates, and compresses $o_t$ into a consistent intermediate representation $z_t$,
which, when decoded through a BEV decoder, outputs a BEV image $x_t$. Our choice of BEV observations is
rooted in their capacity to convey the surrounding roadmaps with minimal information redundancy. To learn
such representations from a set of FPV and corresponding binary BEV images, prior methods [58] train a
Variational Autoencoder (VAE) [48] to encode an RGB image $o_t$ that is decoded using
$z_t \in \mathbb{R}^{B \times d}$, where $B$ is the batch size and $d$ is the embedding dimension. Given
batches $x$ (BEV predictions through the model logits) and $y$ (ground-truth BEV observations), we can
then optimize the following reconstruction loss $\mathcal{L}_R$:

$\mathcal{L}_R = -[\, y \cdot \log(x) + (1 - y) \cdot \log(1 - x) \,]$   (2.1)

Using the above loss, the VAE encoder will learn to embed the FPV observations $o$ into $z$, which the
decoder reconstructs into the corresponding BEV outputs $x$, with $y$ being
the corresponding ground-truth BEV outputs. Additionally, the KL (Kullback-Leibler) divergence forces
the embeddings to lie within a Gaussian distribution of zero mean and unit covariance, which allows for
smooth interpolation. The representations learnt by the VAE would embed two FPV observations that are
very similar, for example two straight roads with a slight variation in angle, closer together than a
straight road and an intersection. The following is the loss function used to train the VAE baseline:

$\mathcal{L}_{ELBO} = \mathcal{L}_R + \beta \cdot \mathrm{KL}[\, \mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1) \,]$   (2.2)
Although the above ELBO loss allows the model to learn appropriate representations for understanding the
observation, these representations do not capture the temporal structure of the task. Typically,
representations for robotics embed observations in a way that makes it easier for the policy to learn the
behaviour of an objective quickly and efficiently. One of the earliest methods for self-supervised
learning, Time-Contrastive Networks (TCN) [106], disambiguates temporal changes by embedding observations
that are close in time close in the embedding space, and farther apart otherwise, by optimizing the
following loss function, which is used in the TCN baseline:
$\mathcal{L}_{InfoNCE} = \mathbb{E}_{z^{ps}}\left[ -\log \frac{S_\phi(z^{an}, z^{ps})}{\mathbb{E}_{z^{ng}}\, S_\phi(z^{an}, z^{ng})} \right]$   (2.3)
In the above function, $z^{an}$, $z^{ps}$ and $z^{ng}$ are batches of embeddings corresponding to anchors,
positives and negatives, and $S_\phi$ is the similarity metric on embeddings from the encoder $f_\phi$.
For a given observation sample $o_t$, whose embedding serves as the anchor $z^{an}$, we uniformly sample a
frame within a temporal distance threshold $d_{thresh}$ to obtain $z^{ps}$ at timestep $t + \delta$, and
$z^{ng}$ anywhere from $t + \delta$ to the end of the episode. However, [60] has recently shown that
in-domain embeddings learnt by TCN are discontinuous, leading to sub-optimal policies. To alleviate this
problem, we also add the reconstruction loss
Figure 2.3: Training pipeline for the perception model. (a) During the training phase, the ResNet model
is trained using a set of temporal sequences consisting of pairs of input (FPV images $o$, displacement
$\Delta g$ and orientation to goal $\Delta\phi$) and output (BEV images $x$) from the simulator. Our
contrastive loss embeds positives $z^{ps}$ closer to the anchor $z^{an}$ and negatives $z^{ng}$ farther
away. (b) At the bottom, we pictorially show the input embeddings $z_t$ from FPV images, actions $a_t$,
and the output $z_{t+1}$ that is used to train the memory module.
$\mathcal{L}_R$ that enhances the stability of the training process and helps learn better
representations. To achieve the FPV-BEV translation with our method, we optimize the model parameters
using the following contrastive-with-reconstruction loss $\mathcal{L}_{CR}$ for image encoding:

$\mathcal{L}_{CR} = \mathcal{L}_R + \beta \cdot \mathcal{L}_{InfoNCE}$   (2.4)

In the above loss function, $\beta$ balances the reconstruction loss against the contrastive loss, since
the model optimizes the reconstruction loss more slowly than the contrastive loss. Using this loss, the
model learns temporally continuous and smoother embeddings, as it constrains the proximity of the
embeddings not only through the contrastive loss but also through the BEV reconstructions.
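Eq. (2.3) leaves the similarity $S_\phi$ generic; the sketch below instantiates the combined objective of
Eq. (2.4) with a cosine-similarity InfoNCE term written in its usual cross-entropy form, with one positive
and a batch of negatives per anchor. The function names and the temperature and beta defaults are
illustrative assumptions, not the tuned values.

import torch
import torch.nn.functional as F

def info_nce(z_an, z_ps, z_ng, temperature=0.1):
    # Cosine-similarity InfoNCE (Eq. 2.3): the positive logit sits at column 0
    # of each row, so the target label for every anchor is 0.
    z_an, z_ps, z_ng = (F.normalize(z, dim=-1) for z in (z_an, z_ps, z_ng))
    pos = (z_an * z_ps).sum(-1, keepdim=True) / temperature   # (B, 1)
    neg = (z_an @ z_ng.t()) / temperature                     # (B, B)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(z_an), dtype=torch.long, device=z_an.device)
    return F.cross_entropy(logits, labels)

def lcr_loss(bev_logits, bev_target, z_an, z_ps, z_ng, beta=0.5):
    # L_CR = L_R + beta * L_InfoNCE (Eq. 2.4): BEV reconstruction plus the
    # contrastive term, with beta balancing their optimization speeds.
    recon = F.binary_cross_entropy_with_logits(bev_logits, bev_target)
    return recon + beta * info_nce(z_an, z_ps, z_ng)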
2.2.2 Temporal model with Robustness modules
To enhance the robustness of the perception model and transfer it to the real-world setting, we
implemented an additional model in the pipeline. Fig. 2.4 shows our proposed method of robustness
enhancement. This involves the integration of an LSTM functioning as a memory model. The LSTM was trained
on sequences $\{\langle o_j, a_j \rangle\}_{j=0}^{T}$ gathered from trajectories
$\{T_0, T_1, \ldots, T_n\}$ in the simulator. The primary purpose of this memory model is to effectively
infuse historical context $\{\langle z_j, a_j \rangle\}_{j=0}^{T}$ into the prediction of $\hat{z}_t$,
which forms a candidate for $z_t$, enhancing the robustness of the perception module when confronted with
unseen real-world data.
$\hat{z}_t \sim P(\hat{z}_t \mid a_{t-1}, \hat{z}_{t-1}, h_{t-1})$   (2.5)

where $a_{t-1}$, $\hat{z}_{t-1}$ and $h_{t-1}$ respectively denote the action, the state prediction, and
the historical hidden state at time step $t - 1$. $\hat{z}_t$ is the latent representation that is given
as input to the policy. We optimize the memory model $M$ with the loss function below:
$\mathcal{L}_M = -\frac{1}{T} \sum_{t=1}^{T} \log\left( \sum_{j=1}^{K} \theta_j \cdot \mathcal{N}(z_t \mid \mu_j, \sigma_j) \right)$   (2.6)

where $T$, $K$, $\theta_j$ and $\mathcal{N}(z_t \mid \mu_j, \sigma_j)$ are, respectively, the training
batch size, the number of Gaussian components, the Gaussian mixture weights with the constraint
$\sum_{j=1}^{K} \theta_j = 1$, and the probability of the ground truth at time step $t$ conditioned on
the predicted mean $\mu_j$ and standard deviation $\sigma_j$ for Gaussian component $j$. Note that this
is the same loss objective used in the Mixture Density Network RNN (MDN-RNN) [31].
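A compact PyTorch sketch of this mixture-density negative log-likelihood, written for $K$ Gaussian
components over $d$-dimensional latents, is given below; the tensor shapes are assumptions made for
illustration.

import torch

def mdn_loss(pi_logits, mu, log_sigma, z_true):
    # pi_logits: (T, K) unnormalized mixture weights; mu, log_sigma: (T, K, d);
    # z_true: (T, d) ground-truth latents from the perception model.
    log_pi = torch.log_softmax(pi_logits, dim=-1)                 # (T, K)
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    # Per-component log-density of z_true, summed over latent dimensions.
    log_prob = comp.log_prob(z_true.unsqueeze(1)).sum(-1)         # (T, K)
    # Eq. (2.6): negative log of the mixture density, via logsumexp for
    # numerical stability.
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()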
Nonetheless, it is noteworthy that the $z_t$ obtained from the ResNet-50 may be slightly distinct from
the latent distribution of BEV images when the perception model is applied to real-world observations
$o_t$, potentially impacting the performance of the LSTM and the policy. To mitigate this concern, we
collected a dataset $\mathcal{R}$ comprising the BEV-based latent embeddings $s \in \mathcal{R}$ of 1,439
FPV images, which we define
Figure 2.4: Robustness enhancement using the memory module. TSC (red) only takes input from the
representation $z_t$ when it comes with a high confidence score. Otherwise, it takes the previous
prediction by the LSTM, $\hat{z}_{t-1}$, as an interpolation. ASC (green) improves the representation of
the incoming observation by making it in-domain. The crosses above correspond to rejecting the percepts
and using the model's state prediction as the current state.
Figure 2.5: Out-of-domain and real-world evaluation. We constructed two 6-class validation datasets: one
from the simulator (upper portion of the table) and another from real-world street-view data (lower
portion). The values in the header rows correspond to the number of data samples. Each class corresponds
to the BEV images shown above. We specify accuracies for each class, along with the success rate (SR) of
the agent when the encoder is deployed for real-world visual navigation. Our method outperformed the
ResNet classifier (baseline) on the unseen simulation dataset, the real-world validation dataset and
real-world navigation, as shown above.
as the BEV anchors. In practice, upon obtaining the output vector $z_t$ from the ResNet-50, we measure its
proximity to each $s \in \mathcal{R}$ and identify the closest match. We replace $z_t$ with the identified
anchor embedding $\bar{z}_t$, ensuring that both the LSTM and the policy consistently operate on the
pre-defined BEV data distribution. We pass $\bar{z}_t$ as input to the LSTM, along with the previous
action $a_{t-1}$, to get the output $\hat{z}_{t+1}$. Again, we find the closest match
$\hat{s}_t \in \mathcal{R}$ for $\hat{z}_t$. We call this module Anchor State Checking (ASC):

$\bar{z} = \arg\min_{s \in \mathcal{R}} \|z - s\|$   (2.7)
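In code, ASC reduces to a nearest-neighbour lookup against the anchor matrix; a minimal sketch follows,
assuming the tensor holding the 1,439 anchor embeddings has been precomputed.

import torch

def anchor_state_check(z, anchors):
    # Eq. (2.7): snap the encoder output z (shape (d,)) to the closest BEV
    # anchor (anchors has shape (N, d)), keeping the LSTM and policy inputs
    # on the known BEV embedding distribution.
    dists = torch.linalg.norm(anchors - z, dim=-1)   # (N,)
    return anchors[dists.argmin()]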
We also utilize the LSTM model to reject erroneous predictions by the ResNet-50, further enhancing the
system's robustness against noise. If the processed prediction $\bar{z}_t$ from the perception model is
estimated with a confidence score $\tau_t$, obtained from either cosine similarity or MSE, below a
predefined threshold $\rho$, we deliberately discard $\bar{z}_t$ and opt for $\hat{z}_t$, resorting to the
output of the LSTM at the previous time step. This module is known as Temporal State Checking (TSC):

$\hat{z}_t = \begin{cases} \bar{z}_t, & \tau_t \geq \rho \\ \hat{z}_{t-1}, & \tau_t < \rho \end{cases}$   (2.8)
Apart from adding robustness to the system via TSC, the memory model also serves the crucial purpose of
interpolating the robot's state in instances where actual observations $o_t$ are delayed, ensuring the
continuity and reliability of the entire system. There is often a notable discrepancy in update
frequencies between control signals and camera frames: control signals often exhibit a significantly
higher update rate (50 Hz in our setup) than the incoming stream of camera frames (15 Hz). This is also
beneficial in the case of recent large vision-language models like RT-X [72], which can solve many robotic
tasks but operate at a lower frequency, typically around 5 Hz.
2.3 Experimental platform and setup
To leverage the extensive prior knowledge embedded in a pre-trained model, we train a ResNet-50 [34]
model, initialized with ImageNet pre-trained weights, on a large-scale dataset containing FPV-BEV image
pairs captured in the simulator. We collected the training dataset from the CARLA simulator to train both
the perception and the memory model. We also collected validation and test datasets from two different
real-world sources, detailed below.
2.3.1 Experimental platform
For evaluating zero-shot real-world transfer, we built a hardware apparatus: a non-holonomic,
differential-drive robot (Beobotv3) for the task of visual navigation. Our system is implemented using the
Figure 2.6: Ablation experiments on the test dataset. Classes in the above table have the same
correspondences as the classes in Fig. 2.5. Each double row corresponds to a data sequence. We demonstrate
that our approach not only attains high ACC (accuracy), but also provides a more granular BEV
representation compared to the naive classifier, as indicated by the MSE (mean squared error) and CE
(cross-entropy) metrics. (a) In the upper portion of the table, we assess our method independently of the
LSTM on an unseen temporal sequence from the simulator, contrasting it with the baseline CNN classifier.
(b) In the lower portion, we compare the performance of the system with and without the LSTM on a
real-world data sequence. Dashes in the table indicate the absence of a class in the respective sequence.
We compute the mean values for each row, as shown in the last column.
ROS (Robot Operating System) middleware and uses a Coral Edge TPU, an ASIC chip designed to run CNN models
for edge computing, for all onboard compute. We used this Edge TPU to run the forward inference of the
ResNet-50 through a ROS node.
The CARLA simulator has been primarily tailored to self-driving applications, which use Ackermann
steering; we therefore extended an existing differential-drive setup, Schoomatic [59], and upgraded the
CARLA simulator. This is necessary because our real-world hardware system is based on differential drive,
and to enable seamless transfer without any Sim2Real gap in the control pipeline, both control systems
need to have similar dynamics. Luttkus [59] designed a model for the integration of a differential-drive
robot into the CARLA environment. Building upon their work, we developed a version of the CARLA simulator
catering to differential-drive robots for reinforcement learning, subsequently migrating it to the newly
introduced CARLA 0.9.13.
2.3.2 Data collection
2.3.2.1 Train dataset from CARLA simulator
Within the CARLA simulator, we have access to the global waypoints along various trajectories. To allow
more diversity, we randomly sampled a range of different orientations and locations. Leveraging this
setup, we generated a large dataset of FPV-BEV images. We augmented the simulator's realism by introducing
weather randomization and non-player traffic into the simulated environment.
2.3.2.2 Validation dataset from Google Street View
Using the Google Street View API, we obtained panoramic images from various locations on the USC campus.
The panoramic images were segmented with a horizontal field of view (FoV) of 90 degrees and manually
segregated into 6 different classes, as shown in Fig. 2.5. We then manually assigned a prototypical BEV
image to each of the 6 classes. The validation dataset has no temporal sequencing and is primarily focused
on a broader and more uniform data distribution across all classes. For these reasons, this dataset is an
optimal choice for evaluating the perception model.
2.3.2.3 Test dataset from Beobotv3
To evaluate the quality of the representations estimated by the entire system, we recorded video sequences
using a mobile robot. More precisely, we recorded a set of 5 ROSBag sequences at different locations on
the USC campus. We then labelled all the frames in each ROSBag sequence, following the procedure above.
Unlike the validation set, the test dataset has temporal continuity, which helps us judge the entire
navigation system.
2.4 Evaluation and Results
Through our experiments, we aim to answer the following questions regarding our proposed method.
Figure 2.7: Comparison of runtime. Computation costs (runtime in milliseconds) of each module in the
navigation system for policy learning and planning are shown above.
1. How good are the representations obtained from the pretrained model for learning to navigate using
online RL?
2. How well can we plan using the BEV reconstructions from the pretrained model?
3. Does contrastive learning help learn good representations compared to an auxiliary task?
4. What are the performance benefits of adding ASC, TSC, or both?
5. How efficient and optimal is the navigation system when transferred to the real-world setup?
Policy Learning. We performed RL experiments by deploying the frozen pretrained encoder and training a
1-layer policy in the CARLA simulator (Fig. 2.8). The task for the agent is to navigate to a goal
destination using an RGB image and goal information $(o_t, \Delta g_t, \phi_t)$. We accomplished this by
training a policy with the PPO algorithm [102]. The reward function is proportional to the number of
waypoints the robot reaches on the way to the designated goal point. At each timestep, the policy receives
the current embedding of the observation $z_t$, concatenated with the directional vector pointing towards
the waypoint, and is tasked with producing a pair of (throttle, steer) values. We compared our method with
VAE (reconstructing only the BEV image; Eqn. 2.2), TCN (trained using Eqn. 2.3), Random (a randomly
initialized and frozen encoder), and CLIP-RN50 [79]. Note that many prior works [3, 10] have shown that
randomly initialized and frozen encoders do learn decent features from an observation.
Figure 2.8: Policy learning and planning experiments on the navigation task using pretrained
representations. Using a pretrained ResNet encoder, we compare our method with different baselines. The
training curves are obtained when we train a 1-layer policy, using RL, that takes the embeddings from the
frozen encoder. The x and y axes correspond to iterations and cumulative reward, with the shaded regions
showing 95% confidence intervals. We also perform planning experiments, where the BEV reconstructions are
used to navigate to the goal, as shown by the success rate (SR) through the dotted lines corresponding to
each method.
Planning. We use the TEB planner [91] to compute actions from an occupancy map (BEV reconstruction) to
perform a task. Typically, occupancy-map-based planners like TEB use LiDAR data to compute the map of the
environment and estimate a plan; in our case, we reconstruct the occupancy map from the embedding obtained
from RGB inputs. These maps are straightforward to compute for our method and the VAE baseline, since
these methods use a decoder. For the other baselines (the Random, CLIP and TCN encoders), we freeze the
encoder and train a decoder to upsample the embeddings into BEV reconstructions. The results obtained for
the planning task are shown in Fig. 2.8 as dotted lines. The success rates correspond to the percentage of
rollouts that reach the goal destination.
Quantitative Analysis. We evaluated our ResNet-50 model on the real-world validation dataset to assess the
out-of-distribution capabilities of the models; the results are shown in Fig. 2.5. The performance of our
perception model on both the simulation and real-world datasets is compared to the baseline, a 6-way
ResNet-50 classifier. Our perception model identifies the closest matching class for the output embedding.
The baseline is a ResNet-50 model trained on a 6-class training dataset comprising 140,213 labelled FPV
images. This shows that contrastive learning using BEV prediction enables better generalization to
out-of-domain data and better transfer from simulation to the real world (Fig. 2.5).
Ablation experiments for state checking. Following a similar approach, we used the test dataset to
evaluate the entire system. Apart from accuracy, we also used cross-entropy (CE) and mean squared error
(MSE) to judge the quality of the reconstructions by the LSTM model. These results are shown in Fig. 2.6.
Similarly, we used data from an unseen town in the CARLA simulator to assess the predictions of our
system, as shown in the top half of Fig. 2.6. The metrics in this table exhibit a slight decrease compared
to Fig. 2.5. This can be attributed to the increased presence of abnormal observations and higher
ambiguity between classes within the time-series data obtained from the robot, as opposed to the manually
collected and labelled validation dataset.
Evaluation on a real-world system. We perform experiments on a real-world robot, where the agent is tasked
with navigating to a given destination location using the pretrained ResNet encoder and the policy trained
in the CARLA simulator. Success rates (SR) for the planning experiments with our model are shown in
Fig. 2.5. For both policy learning and planning, we specify the computation costs in Fig. 2.7. As
mentioned before, the success rates correspond to the percentage of rollouts that reach the goal
destination.
2.5 Discussion and Future work
In this chapter, we proposed a robust navigation system that is trained entirely in a simulator and frozen
when deployed. We learn compact embeddings of an RGB image for visual navigation that are aligned with
temporally close representations and reconstruct corresponding BEV images. By decoupling the perception
model from the control model, we gain the added advantage of being able to pretrain the encoder using a
set of observation sequences irrespective of the robot dynamics. Our system also includes a memory module,
trained on an offline dataset from the simulator, that enhances the robustness of the navigation system.
Although our experiments in this chapter are limited to data obtained through the simulator, one of the
primary advantages of our method is the ability to use additional simulator or real-world FPV-BEV datasets
by aggregating them with the current dataset. We leave this for future work.
Chapter 3
Value Explicit Pretraining
While performing everyday tasks, humans have an innate ability to appropriately extract information from
what they perceive, often regardless of changes in the appearance or dynamics of the tasks. This ability
stems from understanding the objective of the task. While it comes naturally to humans, we need to equip
robots with generalizable representations of their visual observations to achieve the same advantages.
Unfortunately, learning generalizable representations for control is still an open problem in visual
sequential decision-making. Typically, in such representation learning works, an encoder $\phi$ is learned
using a large offline dataset via a predetermined objective function. Subsequently, $\phi$ is used for
control by mapping high-dimensional visual observations from the environment $o_{:t}$ into a
lower-dimensional latent representation $z_t$. The representation $z_t$ is fed into a policy
$\pi(\cdot \mid z_t)$ to generate an action $a_t$ to solve a task. The key question in visual
representation learning is: what should the learned $\phi$ be?
The challenge in learning $\phi$ mainly lies in discovering the correct inductive biases that yield
representations usable for learning a variety of downstream tasks in a sample-efficient manner. It is
unclear, however, what such useful inductive biases are. Initial approaches [107, 129, 75] to this problem
simply reused, zero-shot for control, pretrained vision models trained on computer vision tasks like image
recognition. Works like R3M [70] and VIP [60] tried to utilize temporal consistency, enforcing that
images that are temporally close in a video demonstration are embedded close to each other. Other works
like Voltron [44] and Masked Visual Pretraining [82, 80, 105] attempt to use image reconstruction as one
such inductive bias.
While biases induced by pretraining objectives like image reconstruction and temporal consistency
have been shown to greatly improve downstream policy performance, these pretraining objectives used to
learn ϕ are distinct from the downstream usage of ϕ, e.g., the task of image reconstruction is very different
from that of action prediction. There exists an unmet need for representation learning approaches that
explicitly encode information directly useful for downstream control during the process of learning ϕ.
This is, of course, challenging — how do we encode control-specific information without actually
training online on a control task? Our crucial insight is that encoding control-specific information in the
representations generated by ϕ is possible by harnessing the power of Monte Carlo estimates of control
heuristics computed offline using gameplay datasets.
Our key contribution is Value Explicit Pretraining (VEP), a contrastive learning approach that utilizes
offline play datasets (without any action labels) to learn a representation for visual observations. Our
method utilizes the insight that observations with similar estimates of Bellman returns across multiple
tasks share a similar propensity for success and, in tasks with related goals, also share a similar
optimal policy. For example, in shooter games on Atari, despite differences in the visual appearances of
adversaries, the strategy to effectively shoot them is similar. Our approach thus focuses on similarity of
progress towards the objective, as opposed to visual similarity.
VEP utilizes this intuition to learn an encoder using a contrastive loss which embeds observations
with similar value function estimates across a set of training tasks near each other. We investigate the
performance gains obtained by utilizing the VEP representation for policy learning, both on the training
set of tasks and on visually distinct yet related held-out tasks. We experiment on the Atari benchmark and
on a visual navigation benchmark comparing VEP to state-of-the-art methods like VIP [60] and SOM [17].
[Figure: Phase 1, pretraining the encoder ϕ on offline play datasets from the training tasks; Phase 2,
online RL with the frozen encoder on the test environment.]
Figure 3.1: High-level overview of our problem statement. The encoder $f_\phi$ is pretrained using play
data from a set of training tasks and is then reused for an unseen task. We evaluate pretrained encoders
produced by our method and the baselines on the Atari and navigation benchmarks.
We find up to a 2× improvement in the rewards obtained on both benchmarks and a 3× improvement in the
sample efficiency of online RL algorithms trained on VEP representations.
3.1 Related Work
Representation Learning for Robotics. Beyond the general idea that representations should encode the
essential information for a given task while discarding irrelevant aspects of the original data, typical
state representation learning methods attempt to embed an observation into a latent representation that
can be utilized by the downstream task [53]. It is also important that these methods produce a
low-dimensional representation that allows the control policy to efficiently learn the downstream task.
Traditionally, unsupervised methods like variational autoencoders [48] can learn disentangled
representations that correlate with the underlying factors of variation in observation data [36] for
policy learning [32]. However, in many environments, such representations make it difficult to learn an
optimal policy, since they lack temporal structure. [3] explore this direction and learn representations
by enforcing temporal structure through a contrastive loss. However, these works are limited in that they
do not explore generalization of the learnt representations to unseen tasks.
Pretraining for RL. Pretraining for representation learning, in the context of RL, involves learning
transferable knowledge, typically in the form of good representations, that helps the agent utilize its
observations better [126]. Compared with traditional unsupervised methods for pretraining, the objective
of self-supervised pretraining for RL is to learn representations by exploiting the underlying structure
within the data distribution. The majority of earlier online pretraining works learn representations that
model the task dynamics, which can be learned through expert videos during the RL procedure [76]. More
recent offline pretraining methods like [103] build on prior work [3] by pretraining an encoder using
unlabeled data and then fine-tuning on a small amount of task-specific data. In comparison with these
approaches, our method focuses on learning representations that not only aid in solving in-distribution
tasks but also generalize out-of-distribution, by relating to general objectives rather than overfitting
to individual task-specific attributes.
[Figure: play datasets from Task 1 and Task 2 with per-frame Bellman return estimates (e.g., G = 0.7,
0.71, 0.72, 0.73, 0.8, 0.9); anchors and positives with similar G across tasks are embedded near each
other, negatives far away.]
Figure 3.2: Description of our method (VEP). We compute value estimates (Bellman returns), denoted by $G$,
for each frame. We then use a contrastive-learning-based pretraining method that learns task-agnostic
representations based on $G$. The figure pictorially represents a training scenario where the sampling
batch size $b_T$ is 2 and the training batch size $b_G$ is 1. This results in an anchor, positive and
negative sampled from two sequences in each batch.
Transfer after Pretraining. Transferring knowledge or skills learned from a given set of tasks to an
unseen set of tasks is an active research area. Early works like progressive networks [94] attempt to
solve it by reusing features learned from source tasks through adapters. [21] perform image-to-image
translation using GANs. However, these methods are limited to predefined source or target domains. More
recent works focus on the more challenging problem of using only expert videos for offline pretraining
that can later be transferred to solve a novel downstream task. These methods have gained popularity in RL
for their use of self-supervised pretraining [106] based on contrastive learning. Compared to these
methods, our method only requires sub-optimal play data, consisting of episodes that need not always be
successful in achieving the task objective.
Baselines for VEP. Value Implicit Pretraining (VIP) [60] encodes the goal (positive) and start (anchor)
images close together and the middle images (negatives) farther away in the embedding space. By training
on this objective while sampling multiple sub-episodes, the encoder recursively learns temporally smooth
and continuous embeddings along a trajectory. Time-Contrastive Networks (TCN) [106] involve sampling the
positive within a certain margin distance $d_{thresh}$ of the anchor and a negative anywhere from the
positive to the end of the trajectory. If the anchor is sampled at time instant $t_a$, the positive at
$t_p$ and the negative at $t_n$, then $|t_n - t_a| > |t_p - t_a|$. We then use the standard triplet loss
for optimization, although other contrastive losses could also be used. Unlike TCN, [17] sample the
positives from the State Occupancy Measure (SOM) to be embedded close to the anchor. The negative, on the
other hand, is sampled anywhere from other episodes of the same task or from other tasks. The state
occupancy measure at a specific instant $t$ is a truncated geometric distribution
$\mathrm{Geo}^H_t(1 - \gamma)$ with probability mass re-distributed over the interval $[t, H]$, where $H$
is the horizon [63].
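For reference, sampling a SOM positive amounts to drawing an offset from this truncated, renormalized
geometric distribution; a small sketch under these assumptions:

import numpy as np

def sample_som_positive(t, horizon, gamma, rng=np.random.default_rng()):
    # Geo_t^H(1 - gamma): mass proportional to gamma^(k - t) for k in [t, H],
    # renormalized so the truncated tail's probability is redistributed.
    offsets = np.arange(horizon - t + 1)
    probs = (1 - gamma) * gamma ** offsets
    probs /= probs.sum()
    return t + rng.choice(offsets, p=probs)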
3.2 Problem Setting and Preliminaries
Let $T_{train} = \{T_1, T_2, \ldots, T_m\}$ be the set of training tasks with associated play datasets,
denoted by $D_{train}$, consisting of streams of images and sparse reward signals. During pretraining, we
assume that the encoder model $f_\phi$, parameterized by $\phi$, has access to $D_{train}$. The data in
$D_{train}$ corresponding to $T_i$ consists of a sequence of frames $\{o^i_t\}$ and sparse reward values
$\{r^i_t\}$. The encoder $f_\phi$ learns to encode images/observations $o_t$ into an embedding $z_t$,
which is taken as input by the policy $\pi$ to perform a test task. The set of test tasks is denoted by
$T_{test} = \{T_{m+1}, T_{m+2}, \ldots, T_n\}$. Note that although $T_{train} \cap T_{test} = \emptyset$,
all the tasks in $T_{train} \cup T_{test}$ share a semantically similar objective.
For evaluation, we selected tasks that have semantically similar objectives, in two settings or
benchmarks: 1) For urban vision-based navigation, every task corresponds to navigating to the same goal
destination relative to the start location, but in different cities; we use several cities and photographs
taken along the available streets for the agent to navigate. 2) In Atari, we select several shooter games
that all contain the "FIRE" action in their action space and whose objective semantically relates to
"shoot up the enemy". All of them highly resemble the Space Invaders concept, albeit with graphical and
other variations: an army of alien enemies descends towards the bottom of the screen, where the agent's
ship is, which can move left or right or shoot straight up.
For both our method and the baselines, the encoder fϕ is trained only using the play data Dtrain without
any fine-tuning. The objective of our method is to efficiently learn the encoder using sub-optimal play
datasets consisting of a sequence of observations and sparse reward signals Dtrain from the source tasks
Ttrain, such that the embeddings from fϕ could be zero-shot transferred to unseen test tasks Ttest.
3.2.1 Contrastive Representation Learning
Typically, contrastive representation learning methods for RL utilize offline video demonstration
datasets. These methods take as input a batch of anchors $o_{an}$, positives $o_{ps}$, and negatives
$o_{ng}$ and optimize a predetermined similarity-based objective that enables an encoder model to learn
consistent and meaningful representations for downstream tasks. The earliest known formulation, by [100],
uses Euclidean distance to embed the positives and the anchor close to each other and the negatives far
away from the anchor:
$\mathcal{L}_{triplet} = \sum_{z \in \mathcal{X}} \max\left[\, 0,\; \|z_{an} - z_{ps}\|_2^2 - \|z_{an} - z_{ng}\|_2^2 + \epsilon \,\right]$   (3.1)
In the above equation, $z_{an}$, $z_{ps}$ and $z_{ng}$ represent the embeddings obtained by passing the
observations $o_{an}$, $o_{ps}$ and $o_{ng}$ (anchors, positives and negatives) through the encoder
network $f_\phi$. Other metrics, such as cosine similarity, could be used instead of Euclidean distance to
compute the similarity between embeddings. This loss is used by VEP and all our baselines except VIP.
[Figure: training curves (episode rewards vs. timesteps) on six Atari games: Demon Attack, Space Invaders,
Carnival, Phoenix, Beam Rider and Air Raid, comparing VEP (Ours), VIP (Ma et al. 2023), SOM (Eysenbach et
al. 2022), TCN (Sermanet et al. 2018) and Random.]
Figure 3.3: Pretraining results on Atari. Performance of the different pretraining methods on the
respective games. The encoder is pretrained only on the first two games (Demon Attack and Space Invaders)
and is evaluated on the other, out-of-distribution games.
[Figure: training curves (episode rewards vs. timesteps) on six navigation tasks: Wall Street, Union
Square, Hudson River, Allegheny, South Shore and CMU, comparing VEP (Ours), VIP (Ma et al. 2023), SOM
(Eysenbach et al. 2022), TCN (Sermanet et al. 2018) and Random.]
Figure 3.4: Pretraining results on Navigation. Performance of the different pretraining methods on the
respective cities. Similar to the Atari experiments, for all the baselines, play data from the first two
tasks (Wall Street and Union Square) was used for pretraining. VEP representations improve PPO policy
performance by up to 2×.
Similar to recent methods like [60], the InfoNCE [73] objective can also be used to optimize the encoder
parameters. Unlike the triplet loss of Eq. (3.1), InfoNCE permits utilizing multiple negative examples in
the loss (via the expectation term in the denominator of Eq. (3.2)). As depicted below, InfoNCE aims to
maximize the mutual information between anchors and positives. This loss is used by VIP:

$\mathcal{L}_{InfoNCE} = \mathbb{E}_{z_{ps}}\left[ -\log \frac{S_\phi(z_{an}, z_{ps})}{\mathbb{E}_{z_{ng}}\, S_\phi(z_{an}, z_{ng})} \right]$   (3.2)
In the above equation, $S_\phi$ is a distance function in the $\phi$-representation space used to compute
the similarity between a pair of embeddings. In our experiments that use InfoNCE, $S_\phi$ takes the form
of cosine similarity.
More recently, the Soft-Nearest-Neighbor loss [19] was proposed, which generalizes InfoNCE to use single
or multiple positive examples [121] in the computation of the objective. We also experimented with this
loss function and obtained almost the same performance as with the standard triplet loss.
3.2.2 Discounted Returns and Value Functions
We consider a POMDP (Partially Observable Markov Decision Process) denoted by the tuple
$(O, S, A, p, \theta, r, T, \gamma)$, representing an observation space $O$, state space $S$, action space
$A$, transition function $p$, emission function $\theta$, reward function $r$, time horizon $T$, and
discount factor $\gamma$. An agent in state $s_t$ takes an action $a_t$ and consequently causes a
transition in the environment through $p(s_{t+1} \mid s_t, a_t)$. The agent receives the next observation
$o_{t+1}$ and a reward $r_t$ that is calculated from the state $s_t$ and action $a_t$. The objective for
the agent is to learn a policy $\pi$ which maximizes the expected discounted sum of rewards. The
discounted sum of rewards at a state $s_t$ in a trajectory $\tau$ is given by $G$:
$G(s_t, \tau) = r_t + \gamma r_{t+1} + \cdots + \gamma^3 r_{t+3} + \cdots = \sum_{k=t}^{T} \gamma^{(k-t)} r_k$   (3.3)

The expectation of this discounted return is often defined as the value of the state $s_t$ under policy
$\pi$, denoted by $V^\pi(s_t)$.
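Since the play data carries only sparse rewards, this Monte Carlo estimate can be computed for every frame
of an episode with a single backward pass; a minimal sketch (the discount value is illustrative):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Eq. (3.3): G_t = r_t + gamma * G_{t+1}, accumulated from the end of
    # the episode backwards.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

With sparse 0/1 rewards, $G_t$ decays geometrically with the temporal distance to the next rewarded frame,
which is exactly the quantity VEP groups observations by.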
3.3 Method
The value $V^\pi(s_t)$ of a state $s_t$ under a policy $\pi$ intuitively defines the propensity for
success of solving a task by following policy $\pi$. If two states have similar value estimates, they
likely have a similar expected return under $\pi$.

With this in mind, we now motivate VEP with an example. Consider the task of shooting an adversary in the
Atari game of Space Invaders. Assume that there exists an optimal policy for this task, denoted by
$\pi^*(\cdot \mid o_t)$, which operates on image observations, and the associated optimal value function
$V^{\pi^*}(\cdot)$. Now, consider a slightly perturbed version of this game in which all the adversaries
are colored orange. If policy $\pi^*$ must solve this perturbed task, it must be invariant to the color of
the adversary. One way to achieve this invariance is to enforce that the value estimates of states with
similar propensities for success are similar; e.g., the value estimate of a state containing a bullet very
close to an adversary should be the same regardless of whether the adversary is yellow or orange. VEP
utilizes this exact intuition by learning representations that induce such an invariance. We assume access
to play data from suboptimal agents doing the task (playing the game), consisting of only observations and
reward values (obtained sparsely) for the set of training tasks $T_{train}$. This kind of data can be
obtained from various online sources of gameplay and does not contain any action labels. Further, it is
assumed to be generated by a sub-optimal agent that obtains at least a few positive reward signals during
gameplay. These play datasets consist of data that is not always guaranteed to succeed in task completion.
If a play sequence ends with no reward at the terminal state, the last sub-sequence leading to the
terminal state is rejected, since all the frames in that sub-sequence end up with value estimates of 0.
Note that since the task objectives allow for sparse rewards during the task, we would still be using
parts of these play datasets that do not succeed at task completion. We also do not have access to the
true reward function, so we operate under a sparse reward setting, assuming a reward of 1 at a few
timesteps in the play data and 0 everywhere else. We now compute a value estimate for each observation
using Eq. (3.3). Ideally, this value estimate would be computed using $V^{\pi^*}(\cdot)$, but since we do
not have access to the true value function of the optimal policy, we utilize a Monte Carlo estimate via
Eq. (3.3). Note that the computation of value estimates is completely algorithmic and requires no human
effort.

Having obtained several play datasets for tasks in $T_{train}$, and computed value estimates at each frame
with $G(\cdot)$ from Eq. (3.3), we now train the encoder $\phi$ using a contrastive learning objective.
This procedure first involves sampling a scalar value estimate $g$ between 0 and 1 and then sampling
multiple observations from $D_{train}$ whose value estimates lie within $v_{thresh}$ of $g$. Subsequently,
an encoder $\phi$ is learned which embeds these observations close to each other. Consequently,
observations with a similar propensity for success have similar embeddings.
3.3.1 Implementation
To make training computationally efficient, we preprocess $D_{train}$ and save a dictionary that maps
sorted Bellman returns $G(\cdot)$ to the indices of the corresponding observations with the same Monte
Carlo value estimate. This speeds up the value look-up subroutines through binary search (see the
supplementary material for implementation details).
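A minimal version of this index, using Python's bisect module for the $O(\log n)$ lookups (the function
names are ours, for illustration):

import bisect

def build_value_index(returns):
    # Sort (value estimate, frame index) pairs once per task dataset;
    # frames with a value estimate of 0 are dropped, as in Algorithm 1.
    pairs = sorted((g, i) for i, g in enumerate(returns) if g > 0)
    keys = [g for g, _ in pairs]
    idxs = [i for _, i in pairs]
    return keys, idxs

def frames_near(keys, idxs, g, v_thresh):
    # Indices of all frames whose value estimate lies within v_thresh of g,
    # found via binary search over the sorted keys.
    lo = bisect.bisect_left(keys, g - v_thresh)
    hi = bisect.bisect_right(keys, g + v_thresh)
    return idxs[lo:hi]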
Figure 3.5: (a) Reward functions for Navigation (left). For a specific map, the agent spawns at a
predetermined starting location (red), with the flexibility to initiate at a random location within r
steps of the fixed starting point. The sparsity of the rewards (brown lines) that enable the agent to
navigate to the goal (green) can be adjusted through the parameter L. (b) Comparison with other existing
pretrained models (right). The bar plot compares VEP with other existing pretrained models using the mean
cumulative reward of the policy on the out-of-distribution task.
We first sample a batch of value estimates from the dataset, determined by the training batch size $b_G$.
Next, we sample $b_T$ training tasks. In our experiments, we only sample 2 training tasks ($T_i$ and
$T_j$) during pretraining, i.e., $b_T = 2$. The pretraining objective then becomes the following:
$\max_{\phi} \sum_{T_i \in T_{train}} \sum_{T_j \in T_{train}} \mathbb{E}_{g \sim \mathrm{Unif}(G)} \big[ S_\phi(z^{T_i}_{an}, z^{T_i}_{ps}) + S_\phi(z^{T_j}_{an}, z^{T_j}_{ps}) + S_\phi(z^{T_i}_{an}, z^{T_j}_{ps}) + S_\phi(z^{T_j}_{an}, z^{T_i}_{ps}) - S_\phi(z^{T_i}_{an}, z^{T_i}_{ng}) - S_\phi(z^{T_j}_{an}, z^{T_j}_{ng}) - S_\phi(z^{T_i}_{an}, z^{T_j}_{ng}) - S_\phi(z^{T_j}_{an}, z^{T_i}_{ng}) \big]$   (3.4)
where $G \subset (0, 1]$ is the set of Bellman return estimates of all observations in $D_{train}$. As
mentioned before, the similarity metric $S_\phi$ computes the distance between two embeddings obtained
from an encoder parameterized by weights $\phi$. $z^{T_i}_{an}$ corresponds to the embedding of the
anchor, i.e., an observation sampled from task $T_i$ with a value estimate within $v_{thresh}$ of $g$, and
$z^{T_j}_{an}$ corresponds to the embedding of the anchor sampled from task $T_j$, also within
$v_{thresh}$ of $g$. Similarly, $z^{T_i}_{ps}$ corresponds to a positive, which is temporally closer to
the anchor than the negative $z^{T_i}_{ng}$. Likewise, $z^{T_j}_{ps}$ and $z^{T_j}_{ng}$ relate to $T_j$.

Intuitively, this objective encourages the positives and anchors from all the sampled tasks to embed near
each other, using the value function estimate to organize the latent space of the learned encoder $\phi$.
For full implementation details, such as batch sizes, please refer to the supplementary material.
Algorithm 1 Value Explicit Pretraining
Require: $D_{train}$, the entire set of play data collected from tasks $\{T_j\}_{j=0}^{m}$
Require: Encoder $f_\phi$ parameterized by $\phi$
Require: $b_G$, $b_T$, the train and the sample batch sizes
Require: $d_{thresh}$, $v_{thresh}$, the distance and value thresholds
Require: $N$, the number of iterations
1: Randomly initialize $\phi$
2: Compute value estimates $G(\cdot)$ for every frame $o_t$ in the play data $D_{train}$
3: Remove all frames in the play data $D_{train}$ having a value estimate of 0
4: For every task $T_i$, create a dictionary $V_i$ mapping sorted value estimates to lists of frame indices in $D_{train}$
5: while iterations until $N$ do
6:   Sample a $b_G$-sized batch of values $g \sim (0, 1]$
7:   For each $g$ in the batch, sample a $b_T$-sized batch of tasks $\tau \sim \{T_i\}_{i=0}^{m}$
8:   For each sampled task $\tau$, select a frame $o_{an}$ with a value estimate within $v_{thresh}$ of $g$
9:   Sample a positive $o_{ps}$ within $d_{thresh}$
10:  Mine for negatives $o_{ng}$ such that $o_{ng}$ is farther away from $o_{an}$ than $o_{ps}$
11:  Estimate embeddings $z_{an}$, $z_{ps}$, $z_{ng}$ for the batch of $o_{an}$, $o_{ps}$, $o_{ng}$ by propagating through $f_\phi$
12:  Compute the contrastive loss using $z_{an}$, $z_{ps}$ and $z_{ng}$
13:  Optimize $\phi$
14: end while
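The sketch below illustrates one VEP update in Python, reusing frames_near and triplet_loss from the
earlier sketches; the batch sizes, thresholds and sampling ranges are illustrative assumptions rather than
the exact values we used.

import random
import torch

def vep_step(encoder, optimizer, datasets, indices, b_G=32, b_T=2,
             v_thresh=0.02, d_thresh=10):
    # datasets[i]: tensor of frames for task i; indices[i]: the (keys, idxs)
    # value index built with build_value_index above.
    anchors, positives, negatives = [], [], []
    for _ in range(b_G):
        g = random.random()                                  # g ~ Unif(0, 1)
        for task in random.sample(range(len(datasets)), b_T):
            keys, idxs = indices[task]
            cands = frames_near(keys, idxs, g, v_thresh)
            if not cands:
                continue
            t_an = random.choice(cands)                      # anchor frame
            t_ps = t_an + random.randint(1, d_thresh)        # nearby positive
            t_ng = t_ps + random.randint(d_thresh, 5 * d_thresh)  # farther negative
            if t_ng >= len(datasets[task]):
                continue
            anchors.append(datasets[task][t_an])
            positives.append(datasets[task][t_ps])
            negatives.append(datasets[task][t_ng])
    if not anchors:                  # no frames matched this batch of values
        return 0.0
    z_an = encoder(torch.stack(anchors))
    z_ps = encoder(torch.stack(positives))
    z_ng = encoder(torch.stack(negatives))
    loss = triplet_loss(z_an, z_ps, z_ng)                    # Eq. (3.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()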
3.4 Experimental Setup
We study (1) whether utilizing VEP as a pretraining objective to learn an encoder improves policy learning
on in-distribution tasks, i.e., tasks for which data was available to pre-train the encoder, and (2)
whether the learned encoder aids transfer learning on new tasks. We performed our experiments using the
benchmarks specified below. We used the RLlib library [55] under the Ray ecosystem
for all our RL experiments, and PPO [101] for training the policy. For all the baselines, we use the same
pretraining datasets as for our method. All the environments we use for evaluating the baselines and our
method are long-horizon tasks, with horizons of 2000-4000 timesteps. Additional details of our
experimental setup are provided in the supplementary material.
3.4.1 Environments
Atari. We used six Atari games with "FIRE" in their action set, all of which are shoot-'em-up games
similar in spirit to Space Invaders. Although all the games share a common objective of shooting enemies
that spawn from above, there are significant differences in appearance and dynamics across games. We split
these games into $T_{train}$ and $T_{test}$. For pretraining the encoder, we use play data without action
labels from the D4RL datasets [20]. The value estimate of each frame at timestep $t$ in a sequence is then
computed using Eq. (3.3), with $T$ being the timestep of the closest frame in the episode that obtains a
reward.
Navigation. We built an engine that loads the StreetLearn dataset [67] to perform visual navigation, based
on gym [9]. In a typical navigation task, the agent respawns randomly within a radius $r$ of a
predetermined location $(src_x, src_y)$, with the objective of reaching a goal location that is sampled
within a radius $r$ of a location $(d_x + src_x, d_y + src_y)$. Reward acquisition is structured through a
linear distribution of $L$ reward points (including the reward obtained upon reaching the target)
uniformly spanning the path from the starting point to the goal. The agent only earns rewards as it moves
closer to the goal, as depicted by the yellow lines in Figure 3.5. We have six cities for this benchmark,
and we established consistent horizontal and vertical displacements $(d_x, d_y)$ between the starting and
target points across all cities, avoiding the need for any explicit goal information (details are provided
in the supplementary material). The agent is then expected to transfer to an unseen test city after
learning from the play data obtained from a set of cities. Note that each task corresponding to a specific
city is non-trivial, since the agent needs to navigate in an unknown city with a different map and
appearance. Tasks across all the cities are solvable within a predefined horizon. Lastly, using a planner,
we obtain the play data by randomly generating paths within a specific distance bound. To ensure that the
play data had at least a few sparse rewards, the start and end locations of each path were chosen to lie
between the actual source and destination locations. For all the tasks, we set $L = 15$ and $r = 5$.
[Figure: training curves (episode rewards vs. compute time in minutes) for Union Square, Allegheny, South
Shore and CMU, comparing VEP (Ours) with End-to-End training.]
Figure 3.6: Comparison of our method with the end-to-end trained method on the Navigation task. Note that
in each of the above training curves, the end-to-end baseline has the entire model trained on the task at
hand, whereas our method (VEP) is pretrained only on play data from Wall Street and Union Square. The x
axis corresponds to wall-clock time (not including the pretraining time for VEP, since it is negligible
compared to the online RL training time). Compared to any pretrained method, the end-to-end training
baseline takes significantly longer (2.1× for Navigation and 3.3× for Atari). Since both methods were
trained for the same number of timesteps (20M), our method finished earlier; the dotted line is only for
comparison.
We use the same encoder architecture for both Atari and Navigation to embed pixel observations into a
vector space. To enable temporal understanding of the state, the embeddings from the past four timesteps
are concatenated together and passed to the policy. For the Navigation task, apart from the image
embeddings, we also obtain odometry information $(odom_x, odom_y)$ of the agent, which is concatenated
with the image embedding and passed into a linear layer. This enables the agent to understand its
egocentric pose with respect to the source location, which is crucial for understanding the objective and
navigating to the goal. We first pretrain the encoder $f_\phi$ using the method described in the previous
section and shown visually in Fig. 3.2. This is achieved using a sequence of unlabelled trajectories from
both games. Once we obtain the pretrained encoder, we use an online RL algorithm, in our case PPO [101],
to train a policy. We summarize the results in Figure 3.4.
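A sketch of this state construction follows, with the frozen encoder and illustrative sizes; the embedding
width and the use of a single fusion layer are assumptions made for exposition.

import torch
import torch.nn as nn

class NavStateBuilder(nn.Module):
    def __init__(self, encoder, emb_dim=256):
        super().__init__()
        self.encoder = encoder                 # frozen, VEP-pretrained
        self.fuse = nn.Linear(emb_dim + 2, emb_dim)

    def forward(self, frames, odom):
        # frames: (B, 4, C, H, W) last four observations; odom: (B, 4, 2).
        B, S = frames.shape[:2]
        with torch.no_grad():                  # the encoder stays frozen
            z = self.encoder(frames.flatten(0, 1))           # (B*4, emb_dim)
        z = torch.cat([z, odom.flatten(0, 1)], dim=-1)       # append (x, y)
        z = self.fuse(z).view(B, S, -1)
        return z.flatten(1)                    # concatenated policy input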
[Figure: training curves (episode rewards vs. timesteps) on Wall Street and Union Square, comparing VEP
pretrained for 1 epoch on 4 cities, 1 epoch on 2 cities, 0.25 epochs on 4 cities, and TCN+ with 1 epoch on
2 cities.]
Figure 3.7: (a) Different early-stop iterations (left). Notice that with an increase in the number of
pretraining tasks (cities) from 2 to 4, our method performs better with fewer training iterations. (b)
Larger batch size (right). We compared against TCN with the batch size and number of iterations matched to
those of VEP by combining the sample and train batch sizes, to show that the learning ability of our
method is due to value estimates across tasks.
3.4.2 Results
Online RL experiments on Atari. For the experiments involving Atari games, we trained the policy with the
pretrained encoder frozen, without any additional fine-tuning. The encoder is pretrained using offline
play data from Demon Attack and Space Invaders, and evaluated on a set of in-distribution and
out-of-distribution environments. We find that the pretrained encoder outperforms the baselines on the
in-distribution tasks by ∼25%. This margin increases in the transfer experiments, most notably on Phoenix,
with nearly a 2× improvement over the baselines.

Online RL experiments on Navigation. Similar to the above, we froze the encoder and trained a 1-layer
policy. As mentioned before, unlike the model used for Atari, we also had odometry information for each
image that had to be part of the embedding for the policy to perform the task. The embedding obtained from
the CNN was concatenated with the 2D odometry information and passed through another fully connected layer
to obtain the final embedding. All of these parameters were included in pretraining. VEP outperforms all
of our baselines by a larger margin on the navigation set, as seen in Figure 3.4. VEP also outperformed
the end-to-end trained baseline by achieving the same performance 2.1×
faster (Figure 3.6). In addition, we evaluate our method on out-of-distribution tasks alongside existing
state-of-the-art vision-language pretrained models. Specifically, we compared our method (VEP) with CLIP
[79], MVP [81], R3M [70] and VC-1 [62]; the results are shown in Figure 3.5. We hypothesize that the
better performance of our method on the Navigation tasks was due to a more similar distribution of value
estimates across the cities in the Navigation task than across the Atari games. Detailed specifications of
the value estimates for all the Atari games and the cities in Navigation are given in the supplementary
material.
Larger batch size and more training iterations for TCN. All the baseline approaches we compared against
had a fixed train batch size used for computing gradients. To ensure that the gains demonstrated by VEP
cannot be attributed solely to the larger number of samples in a batch ($b_G \times b_T$), we doubled the
batch size for TCN (denoted TCN+), as seen in Figure 3.7. Even with the larger batch size, TCN still does
not match the performance of VEP.
Early stopping to prevent overfitting. For the Navigation task, we increased the number of training tasks
from 2 to 4 and observed that performance degraded in this setting. As shown in Figure 3.7, when we reduce
the number of iterations, the model retains its performance, which suggests that our method learns much
faster with an increase in data diversity, and that early stopping can prevent overfitting.
Quality of play data. We also evaluated our method using different amounts of diversity and optimality in
the play dataset. Specifically, we compared datasets with episodes of length less than 400, 500-800, and
1000-1400 steps. All episodes in the respective datasets have a cumulative reward between 12 and 15. We
also used play datasets consisting of episodes that complete only 10% of the actual task. Further, we
included play datasets consisting of sub-optimal episodes from 3 and 4 cities. The results are shown in
Figure 3.8.
[Figure 3.8 panels: Wall Street, Union Square, Allegheny, South Shore, Hudson River, CMU; legend: 2 cities <400 steps, 2 cities 500-800 steps, 2 cities 1000-1400 steps, 2 cities 10% task completion, 3 cities, 4 cities (.25 epochs); y-axis: Episode Rewards]
Figure 3.8: Performance comparison on the quality of play data. Each bar plot corresponds to the evaluation of the encoder in a different city, and each coloured bar corresponds to a specific play dataset used for pretraining. We also provide 95% confidence intervals along with the mean cumulative reward.
3.5 Conclusion
Transferring policies to novel but related tasks is an important problem. We formulated a method to learn representations of states from different tasks based solely on the temporal distance to the goal frame. This way, the skills learned from the training tasks can be transferred to unseen related tasks. We show the efficacy of our method through comprehensive evaluations on Atari and visual navigation.
Chapter 4
USCIlab3D dataset
With the recent advancements in 3D vision techniques, the integration of three-dimensional perception has
become integral to many interdisciplinary domains. Unlike the abundant resources available for 2D vision,
the lack of comprehensive datasets for 3D vision poses a significant challenge to researchers. The progress
in this field can be significantly propelled by leveraging large-scale datasets, which offer adaptability across
a spectrum of downstream tasks.
In this chapter, we present USCILab3D, a large-scale, long-term, semantically annotated outdoor dataset. USCILab3D comprises over 10 million images and 1.4 million semantic point clouds, rendering it suitable for a wide range of vision tasks.
Unlike smaller-scale semantic datasets or larger-scale but coarsely annotated ones, our dataset not only encompasses a wide array of outdoor multi-view scene images but also provides detailed semantic annotations, facilitating enhanced understanding and utilization of 3D perception techniques. Given the massive scale of our new dataset, we have thus far focused on leveraging the latest foundation models to compute detailed annotations; our workflow using these models is detailed below.
Figure 4.1: Images with their respective 3D point clouds. Our five adjacent cameras provide comprehensive, overlapping coverage at each timestep, ensuring redundancy in the captured information. We also show the corresponding point-cloud view for every image.
4.1 Related datasets
Several large-scale scene datasets have been developed in recent years for indoor settings [86, 111, 92]. Additionally, several datasets have focused on outdoor city navigation [67]. Furthermore, some datasets are generated using simulators [14, 108]. These attempt to address the above problems but present their own challenges: while they offer controlled environments, there is a noticeable gap in scene quality compared to real-world scenes.
4.1.1 Multi-view datasets
Multi-view scene datasets are typically used for novel view synthesis tasks with generative models such
as Neural Radiance Fields (NeRF) [65] and 3D Gaussian Splatting [47]. The LLFF dataset [64] is an early
multi-view scene dataset that includes both indoor and outdoor scenes, with fewer than 1,000 low-resolution
images. The DTU [40] and ScanNet [13] datasets contain between 30K and 2,500K images, but they are
limited to indoor scenes. The ETH3D dataset [99] provides high-quality outdoor scenes but has sparse
scans and fewer than 1,000 images. Tanks and Temples [50] addresses these limitations by offering 147,000
high-quality outdoor images, which are commonly used in novel view synthesis benchmarks.
4.1.2 Scene datasets with semantic labels
Indoor datasets Datasets like [86, 111] are large-scale 3D reconstruction datasets tailored for research in indoor robotic navigation and scene understanding. Matterport3D [12] is a large-scale RGB-D indoor dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. However, this dataset is limited to indoor environments and offers only 20 labels for scene annotation.
In contrast, our dataset encompasses approximately 10 million images and over 4000 labels, providing
extensive coverage of outdoor scenes. Moreover, the inclusion of ground-truth point clouds in our dataset
enhances the accuracy of alignment between 2D images and 3D annotations, surpassing the alignment
capabilities of other datasets.
Outdoor datasets SemanticKITTI [7] is a widely used dataset for semantic segmentation and scene understanding in outdoor environments. It consists of dense point cloud sequences collected by a mobile LiDAR scanner, similar to ours. However, SemanticKITTI's semantic annotations are confined to only 25 categories. In contrast, leveraging multimodal model outputs, our dataset enables the labeling of almost every element within the scene, providing a comprehensive understanding of outdoor environments.
Our dataset addresses the limitations of the above datasets by providing large-scale outdoor scenes
with diverse weather and lighting conditions, along with various ground-truth semantic point clouds (Table
4.1 and Table 4.2). Leveraging multimodal foundational models, we accurately label 2D images and align
them in 3D space, resulting in precise 3D annotations.
Figure 4.2: The pipeline of our semantic annotation method. We use GPT-4 and Grounded-SAM to create pixel-wise semantic labels and align the 2D and 3D points.
Dataset | Frames | Indoor | Outdoor | LiDAR Point Cloud | Semantic
LLFF [64] | <1K images | ✓ | ✓ | ✗ | ✗
DTU [40] | 30K images | ✓ | ✗ | ✗ | ✗
ScanNet [13] | 2,500K images | ✓ | ✗ | ✗ | ✗
Tanks and Temples [50] | 147K images | ✓ | ✓ | ✗ | ✗
ETH3D [99] | <1K images | ✓ | ✗ | ✗ | ✗
Matterport3D [12] | 195K images | ✓ | ✗ | ✗ | ✓
Habitat [86] | - | ✓ | ✗ | ✗ | ✓
iGibson [111] | - | ✓ | ✗ | ✓ | ✓
SemanticKITTI [7] | 23K scans | ✗ | ✓ | ✓ | ✓
USCILab3D (ours) | 10M images, 1.4M scans | ✗ | ✓ | ✓ | ✓
Table 4.1: Comparison of the existing datasets with our USCILab3D dataset.
Dataset | Point Clouds | Semantic Labels | Semantic classes
nuScenes [11] | 390K | 31 | vehicle, human, animal, movable object, flat, static
Waymo motion [16] | 230K | 23 | traffic entities: car, truck, bus, motorcyclist, bicyclist, pedestrian, etc.
SemanticSTF [125] | 2K | 21 | flat, construction, nature, vehicle, human, object
WildScenes [116] | 12K | 15 | terrain, vegetation, object, structure, water, sky
USCILab3D (ours) | 1.4M | 267 | vehicle, nature, human, ground, structure, street furniture, architectural elements, signs and symbols, general objects, lighting
Table 4.2: Comparison of semantic classes and labels across existing datasets and our USCILab3D dataset.
4.2 Dataset collection
This section outlines our robot platform and data collection approach. Our robot, Beobot-v3, utilizes
multiple cameras and a LiDAR sensor for simultaneous data capture. We collect data across the USC
University Park campus and synchronize streams for analysis.
4.2.1 Robot platform
We built our robot, Beobot-v3, to collect the dataset, as shown in Figure 4.3. We use five Intel Realsense D455 cameras and a Velodyne HDL-32E LiDAR. The RGB images, featuring a field of view (FOV) of 90 × 65° and a resolution of 1280 × 720 pixels, are captured at a rate of 15 frames per second (FPS). Utilizing a 1 MP RGB sensor, these images ensure high-quality visual data acquisition. Furthermore, the LiDAR scans the environment at a rate of 10 Hz, capturing precise point clouds that complement the visual data. These point clouds offer comprehensive 3D spatial information essential for scene understanding and navigation tasks. Because of the limits of a single microcomputer, camera 1 and the LiDAR are controlled by one microcomputer, while each remaining camera is controlled by its own microcomputer. All microcomputers are coordinated by a central computer, and our data collection system orchestrates the simultaneous scanning and recording process. As the LiDAR initiates scanning, capturing a 360° view of the environment, its data is saved directly into the system while the five cameras capture images in tandem, storing them in separate ROS bag files.
4.2.2 Dataset collected over the entire USC campus
Our dataset is meticulously collected across the entirety of the USC University Park campus. Spanning an expansive area of 229 acres (0.93 km2), the campus lends our dataset great diversity. From the varied architecture of its buildings to the network of roads, stairs, trails, paths, gardens, and sidewalks, each corner offers a unique scene. By dynamically selecting its route, the robot explores the full extent of the campus' diverse terrain, from thoroughfares to hidden nooks, capturing a rich variety of surroundings.
Figure 4.3: Overview of the data collection robot and its hardware. Beobot-v3 is a differential-drive,
non-holonomic mobile robot, equipped with five Intel Realsense D455 cameras and one Velodyne HDL-32E
LiDAR sensor used to collect the dataset.
The data collection occurred in many daytime sessions, with a preference for sunrise or sunset periods to avoid crowds and mitigate harsh sunlight that could degrade image quality. However, a small portion of the captured images may still exhibit the effects of strong sunlight. Sample images are shown in Figure 4.4.
Our data collection efforts span from March 11, 2023, to March 16, 2024, encompassing 12 months. Over
this time frame, the environment undergoes dynamic changes, including variations in weather, seasons,
and alterations to the campus landscape, such as ongoing construction projects. This deliberate scheduling
ensures that our dataset encapsulates a diverse range of environmental scenarios, enriching the dataset
with a wide array of conditions for robust training and evaluation of algorithms.
4.2.3 Synchronization of cameras and LiDAR
To address the synchronization issue between the LiDAR and cameras, which are controlled by different microcomputers, we implement a synchronization process. Given that the LiDAR operates on the same system clock as camera 1, we only need to synchronize the remaining cameras with camera 1. To achieve this, we employ a method based on feature detection and optical flow tracking. At the onset of each session, the scene remains static. Leveraging Shi-Tomasi corner detection [112], we identify key features in the camera images. Subsequently, using the Lucas-Kanade optical flow algorithm, we track the movement of these features over consecutive frames. If the displacement of these features exceeds a predefined threshold, indicating that the robot has started moving, we designate this time as the session's start time.
Once the start time is determined for camera 1, we synchronize the start times of the remaining cameras
by aligning them with the start time of camera 1. This ensures temporal coherence across all camera feeds,
enabling accurate alignment of the visual and LiDAR data streams. Through this synchronization process,
we establish temporal consistency across all data sources, facilitating coherent analysis and interpretation
of the collected data.
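For concreteness, below is a minimal sketch of this start-time detection, assuming the frames of one camera have been decoded from its ROS bag into grayscale numpy arrays; the function name find_motion_start and the pixel threshold are our illustrative choices, not part of any released code.

```python
import cv2
import numpy as np

def find_motion_start(frames, timestamps, threshold_px=2.0):
    """Return the timestamp at which tracked features first move."""
    prev = frames[0]
    # Shi-Tomasi corners in the first (static) frame.
    p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                 qualityLevel=0.01, minDistance=10)
    for frame, ts in zip(frames[1:], timestamps[1:]):
        # Lucas-Kanade optical flow from the previous frame.
        p1, status, _ = cv2.calcOpticalFlowPyrLK(prev, frame, p0, None)
        good_new, good_old = p1[status == 1], p0[status == 1]
        if len(good_new) == 0:
            break
        displacement = np.linalg.norm(good_new - good_old, axis=1).mean()
        if displacement > threshold_px:
            return ts  # robot started moving: session start time
        prev, p0 = frame, good_new.reshape(-1, 1, 2)
    return None
```

The same routine, run per camera, yields the per-stream start times that are then aligned to camera 1.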
Figure 4.4: Sample snapshots from our dataset at various daylight timings. These images were obtained by randomly sampling across the entire dataset.
4.2.4 Sensor calibration
By aligning the coordinate systems of the Velodyne LiDAR and the camera, we ensure that the geometric
transformation from 3D to 2D space is accurate. With this calibrated setup, we can assign semantic labels
to the 3D points based on the information extracted from the images. The accurate alignment between
the Velodyne-frame and camera-frame ensures that the projected points correspond to the correct regions
in the images, enabling us to leverage the semantic information obtained from the images to label the 3D
points accurately.
To obtain the pose transformation between images and point clouds, we use a 1m × 1m checkerboard
as a calibration target for sensor alignment. Leveraging the MATLAB calibration toolbox, we apply the Line
and Plane Correspondence method [131] to refine sensor alignment and calibration with high precision.
In this approach, we treat edges in 3D as contours (C) and planes (a), while lines (L) in 3D space are
characterized by points within the same plane (a). This framework integrates point-to-line, point-to-plane,
and direction/normal-based adjustments, ensuring accurate alignment across sensors.
4.3 Dataset annotation
In this section, we describe methods used as part of the pipeline for our semantic annotations of 3D point
clouds. A high-level overview is shown in Figure 4.2.
4.3.1 GPT4-based candidate labels and clustering
We use GPT-4 [1] to detect the semantic labels in an image. Since images are obtained at 15 Hz and the robot moves at a velocity close to 1 m/s, it would be redundant and expensive to query the semantic labels of all images through the GPT-4 model. Instead, for about every 225 images from one camera (i.e., every 15 seconds), we extract the images of all five cameras at that time. A 15-second interval of movement (typically less than 12 meters) ensures only a small scene variation.
We then pass these five images to GPT-4 and prompt it to estimate the semantic labels of the images using the following prompt: "List every possible semantic class that exists in the scene. List only the names and nothing else." After standardizing and filtering the output, we obtain a total of 4162 labels. However, most labels are meaningless or near-duplicates in meaning, so we again use GPT-4 to perform clustering and categorization on the estimated semantic labels.
After removing the meaningless labels and merging semantically equivalent ones, we obtained 257 unique labels. Then, for all images, we asked GPT-4 to extract objects from the image again, now with the prompt: "I will give you a list of semantic classes; list every possible semantic class that exists in the scene. List only the names and nothing else, split by comma." This yields the final label list for each image.
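As an illustration, the following hedged sketch shows how such a query could be issued with the OpenAI Python SDK; the model name, the helper query_labels, and the base64 image encoding are assumptions on our part rather than the exact script used.

```python
import base64
from openai import OpenAI

client = OpenAI()
PROMPT = ("List every possible semantic class that exists in the scene. "
          "List only the names and nothing else.")

def query_labels(image_paths, model="gpt-4o"):  # model name is an assumption
    content = [{"type": "text", "text": PROMPT}]
    for path in image_paths:  # the five synchronized camera views
        b64 = base64.b64encode(open(path, "rb").read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": content}])
    # Standardize: lowercase, strip whitespace, split on commas/newlines.
    raw = resp.choices[0].message.content
    return sorted({s.strip().lower()
                   for s in raw.replace("\n", ",").split(",") if s.strip()})
```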
Vehicle: vehicle, bicycle, van, truck, motorcycle, golf cart, bus, car, skateboard
Nature: sky, grass, tree, shrub, shrubbery, hedge, trunk, tree trunk, green area, birds, bush, yard, plant, sun, palm, rock, soil, leaf, leaves, water, flower, branch, bushes, vegetation, bird, ivy
Human: person, hand
Ground: pavement, curb, gravel, rail, sidewalk, street, walkway, floor, road, pedestrian walkway, crosswalk, ramp, garden, ground, pathway, paving stone, golf course, parking lot, drainage grate, mulch
Structure: monument, structure, courtyard, fountain, public space, construction, emergency station, ceiling, fence, gate, wall, balcony, container, stadium, lattice, shed, house, construction, pipe, roof, building, sports field, campus, toilet, baseball field, architecture, site, parking structure, garage, scaffolding, archway, call station
Street Furniture: bench, pole, feeding station, patio, handicap, barrier, hydrant, construction cone, construction barrier, lamp post, lamp, trash can, recept, sign, parking meter, public art, statue, sculpture, bollard, bus stop, park bench
Architectural Elements: drain cover, manhole cover, vent, air vent, arch, sill, doorway, baluster, security camera, electric box, corridor, stair, ventilation grill, door handle, entrance, post, air unit, pillar, balustrade, handrail, window, door, elevator, gutter, bleachers, tank, generator, utility meter
General Objects: umbrella, table, chair, stroller, furniture, board, bottle, canopy, outdoor gear, advertisement, station, pot, rack, flag, locker, ladder, garbage, bulletin board, pallet, planter, equipment, tent, base, hat, curtain, blinds, cardboard, box, tire, wheels, bag, bed, frame, bucket, painting, poster, machine
Signs and Symbols: shadow, reflection, traffic cone, parking space line, space line, road marking, parking symbol, stop sign, street sign, road sign, symbol, plaque, banner, graffiti, waste container, signboard, security camera, camera, warning sign, fire safety sign, transportation sign, handicap sign, closed sign, exit sign, parking sign, reservation sign, rec sign
Materials: concrete, brick, construction materials, stone, wood, plastic, metal, glass, iron, materials
Lighting: outdoor lighting, light, street light, indoor light, lantern, sunlight, shade
Miscellaneous: cover, trash, outdoor, chain, unit, security, exterior, fire, electric, meter, lettering, phone, debris, railway, text, potted, space, portable, cone, stlight, cross, marker, grate, blea, stoller, units, picnic, electrical, cable, basin, pavilion, ster, bal, field, curve, bod, bay, pal, firent, box, exit, baseball, image, rec, sports, public, piping, grill, guttering, utility, call, case, recacle, gut, hydra, air, line, tile, cardboard, patch, reservoir, valve
Table 4.3: Clustering of the semantic labels. We use GPT-4 to cluster the 267 labels into 12 categories using the prompt "Could you help me classify by following category: Vehicle, Nature, Human, Ground, Structure, Street Furniture, Architectural Elements."
4.3.2 Grounded-SAM masks on pixel space
After we obtain the candidate labels, for an equally spaced subset of images, we use those labels as an input to the Grounded-SAM model [90] to detect and segment the image pixel-wise. Since we are using a differential-drive robot that can rotate in place, consecutive images may look very different quite rapidly, so we merge the five per-camera label lists from GPT-4 and pass them to the next step. After conducting our experiments, we found that the presence of unrelated labels (not visually represented in the images) does not significantly influence the results of Grounded-SAM. This observation is reflected in Figure 4.5 and Table 4.4 through the percentage of incorrect pixel labels in the masks of two images. We show the top 50 most frequent objects and their pixel percentage in images of our dataset in Figure 4.6.
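Because wrappers around Grounded-SAM vary, the following is only a pseudocode-level sketch of this step; grounded_sam and its return convention are hypothetical stand-ins for whichever detection-plus-segmentation interface is used.

```python
def segment_frame(image, gpt4_labels, grounded_sam):
    """Prompt Grounded-SAM with the merged GPT-4 labels for one frame."""
    # Merged label lists from the five synchronized camera views.
    prompt = ", ".join(sorted(set(gpt4_labels)))
    # `grounded_sam` is a hypothetical callable returning per-pixel masks
    # and the phrase (label) each mask was grounded to.
    masks, phrases = grounded_sam(image, prompt)
    return {phrase: mask for phrase, mask in zip(phrases, masks)}
```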
4.3.3 Post-processing after Grounded-SAM
Grounded-SAM's output does not always use the same vocabulary as our input labels; e.g., one may prompt it for 'vehicle' but obtain a segmented 'car'. It may also generate meaningless words or near-synonyms. To address this, we again perform clustering and categorization as in Section 4.3.1 to merge all similar labels. Additionally, we manually merge and remove some words. Ultimately, we obtain 267 labels and 12 categories (Table 4.3).
Additional prompts | Incorrect pixel labels
1 | 0.23%
2 | 0.63%
3 | 0.63%
10 | 0.92%
Table 4.4: Percentage of incorrect pixel labels. Quantitative measure of robustness: the percentage of incorrect pixel labels changes little as additional prompts are added. This table should be read in relation to Figure 4.5.
Figure 4.5: Robustness of Grounded-SAM to prompts. Comparison of the semantic masks obtained using different prompts for the same image with the Grounded-SAM model, showing the robustness of the model. For the right image, the additional prompts were "fire hydrant, person, car, Parking lot lines, Boat, Scooter, Dog, Bear, Cat" along with the common prompts "Trees, Bushes, Benches, Tables, Chairs, Pavement, Buildings, Windows, Doors, Emergency call box, Umbrellas, Leaves, Grass".
4.3.4 Projecting 2D semantic masks to 3D pointcloud
From the LiDAR data, we reconstruct 3D trajectories of the robot throughout the dataset. Essentially, we compute a pose transformation for each LiDAR scan in the dataset. We then interpolate the LiDAR poses to the camera images using the extrinsic parameters corresponding to the transformation of each camera with respect to the LiDAR sensor. This results in a pose estimate for every camera image in the dataset.
Utilizing the semantic map of every image obtained from Grounded-SAM, we use the ground-truth camera intrinsics and extrinsics to accurately project the 3D point clouds onto the 2D images, following equation (4.1). Here, (X, Y, Z) represents the world coordinates of a point, while (x, y) denotes the coordinates of the point projected onto the image plane, measured in pixels; the r_ij and t_i are the rotation and translation components, (c_x, c_y) is the principal point, and f_x, f_y are the focal lengths in pixels. Subsequently, we align the 2D and 3D points to assign labels to the 3D points.
\[
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\sim
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\tag{4.1}
\]
Considering the presence of moving objects and calibration errors, there may be some offset for each
projection. To reduce erroneous labels, we run DBSCAN clustering [15] on each label projection to check
whether the 3D points projected belong to a single cluster. If they do not, we only label the cluster with the
most points.
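To make equation (4.1) and the cluster-based filtering concrete, here is a minimal numpy/scikit-learn sketch; the function names, array conventions, and the eps/min_samples values are illustrative assumptions, not our exact pipeline code.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def project_points(points_w, K, R, t):
    """Project Nx3 world points into pixel coordinates via equation (4.1)."""
    cam = points_w @ R.T + t          # world -> camera frame
    in_front = cam[:, 2] > 0          # keep points in front of the camera
    uvw = cam[in_front] @ K.T         # apply intrinsics K = [[fx,0,cx],...]
    uv = uvw[:, :2] / uvw[:, 2:3]     # perspective divide -> (x, y) pixels
    return uv, in_front

def filter_label_projection(points_w, eps=0.5, min_samples=5):
    """Keep only the dominant DBSCAN cluster of points assigned one label."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(points_w).labels_
    clusters, counts = np.unique(labels[labels >= 0], return_counts=True)
    if len(clusters) == 0:
        return np.zeros(len(points_w), dtype=bool)
    return labels == clusters[np.argmax(counts)]
```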
Figure 4.6: Histogram of semantic label frequency in point cloud scans and points. Top 50 most frequently estimated semantic classes by number of points (orange) and the corresponding point cloud scan frequency.
4.3.5 Released data
We release the raw ROS Bagfiles, and extracted images, point cloud files, COLMAP [98] poses and sparse
reconstructions. The raw data consists of a set of sequences, each of which is collected during a specific data
recording session. To make the data more manageable, we divide each session into different subsequences
or "sectors", with each sector consisting of 1250 images and roughly 167 point cloud scans. In addition, we
conducted face detection and applied blurring techniques to ensure privacy protection on campus.
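As a sketch of the privacy-blurring step, the snippet below uses OpenCV's bundled Haar cascade face detector; the actual detector and blur parameters we used may differ.

```python
import cv2

# OpenCV ships this cascade file with the opencv-python package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def blur_faces(image_bgr):
    """Detect faces and blur them in place for privacy protection."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        roi = image_bgr[y:y + h, x:x + w]
        image_bgr[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (31, 31), 0)
    return image_bgr
```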
Multi-view images Each image is named according to the convention cam[id]-[timestamp].jpg.
We estimate synchronized timestamps for all images within a sector using the method described in Section 4.2.3. The wide field of view (FoV) of 90 degrees for each of the five cameras results in significant
overlap between their respective images, as depicted in Figure 4.1. This substantial overlap ensures more
robust Structure from Motion (SfM) reconstruction. By having multiple views of the same scene, the SfM
algorithm can triangulate feature points more accurately, leading to a more precise reconstruction of the
3D environment. This overlap also aids in improving the accuracy of semantic labelling. By leveraging
overlapping information from multiple viewpoints, inconsistencies or errors in semantic annotations of 3D
points from 2D-pixel maps can be identified and rectified through cross-validation. This double-checking
mechanism helps to enhance the reliability of semantic labels assigned to objects in the scene.
Semantic instances and masks for images In addition to the raw image data, we also provide
semantic labels and label masks generated by Grounded-SAM for each image in the dataset. These labels
offer valuable insights into the semantic understanding of the scene, allowing researchers to perform tasks
such as semantic segmentation and object detection.
Semantically annotated 3D point cloud streams As mentioned before, the point cloud streams are captured at 10 Hz. Similar to SemanticKITTI [7], we extract each of the point cloud scans and annotate the 3D points by assigning semantic labels to individual points based on the closest image's labels, using the method outlined in Section 4.3.4. The color and corresponding label for each point are saved in a JSON file named labels.json, ensuring easy access and interpretation of the semantic annotations.
Semantically annotated point clouds In addition to the individual semantically annotated point cloud scans, we have processed each session's point cloud data using LeGO-LOAM [110] to generate a merged
point cloud for each sector (the area corresponding to a segment of a trajectory). We report statistics of the distribution of points in each of the point cloud scans and the merged point clouds in the supplemental material. Unlike the individual scans, sector-based point clouds contain more points and offer a comprehensive overview of the semantically annotated scene. Through these semantic point clouds, researchers can gain deeper insights into the semantic structure and composition of the environment.
Pose annotations for images. We release interpolated poses from LeGO-LOAM and from COLMAP Structure from Motion (SfM) [98]. The COLMAP SfM results can serve as inputs for generative models such as NeRF or 3D Gaussian Splatting. Further, by utilizing the poses computed by COLMAP, we aim to improve the precision of our annotations given the different sampling rates of the LiDAR (10 Hz) and cameras (15 Hz). This alignment is crucial for accurately projecting semantic labels onto the 3D points based on the information extracted from the images. We are currently investigating how best to merge the LiDAR and COLMAP poses, likely resulting in a unified set of poses indexed non-uniformly in time, for each image and for each point cloud. We expect that these unified poses will be released with the next version of our dataset.
Robotic dataset for visual navigation. Our dataset comprises diverse sequences captured within a
university environment, reflecting a range of real-world scenarios. Leveraging the compact form factor of
our robot, we collected data across a variety of settings including roads, outdoor lobbies, ramps, and other
typical campus landscapes. This dataset is particularly valuable for applications in visual navigation and is
integrated into the comprehensive Open X-Embodiment dataset [72].
4.4 Benchmarks
4.4.1 Evaluation on Novel View Synthesis
We examine current state-of-the-art (SOTA) novel view synthesis methods on several datasets: USCILab3D, ETH3D [99], Mip-NeRF360 [DBLP:journals/corr/abs-2111-12077], Tanks&Temples [50], and Deep Blending [35]. For each dataset, we run 3D Gaussian Splatting and evaluate the generated image quality using the PSNR, SSIM, and LPIPS metrics. For each scene, we use 7/8 of the data as the training set and 1/8 as the test set, then calculate the average result per scene. Considering the large size of our dataset, we randomly extract one sector from each session to compute the average result.
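A sketch of this split, assuming the common every-8th-frame holdout used in novel-view-synthesis benchmarks (whether our evaluation interleaves frames exactly this way is an assumption):

```python
def split_images(images):
    """7/8 - 1/8 train/test split via an every-8th-frame holdout."""
    test = images[7::8]                                  # 1/8 of the frames
    train = [im for i, im in enumerate(images) if i % 8 != 7]
    return train, test
```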
Our dataset achieves strong PSNR and SSIM and the best LPIPS performance compared to the other datasets (Table 4.5). Among these datasets, ours is the only one that provides large-scale scenes, making it suitable for a wider range of applications, such as simulators [6].
Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Resolution | Iterations
USCILab3D (ours) | 26.02 | 0.86 | 0.20 | 1280 × 720 | 7000
ETH3D [99] | 21.25 | 0.83 | 0.27 | 6048 × 4032 | 7000
Tanks&Temples [50] | 21.20 | 0.77 | 0.28 | 980 × 540 | 7000
Mip-NeRF360 [DBLP:journals/corr/abs-2111-12077] | 25.19 | 0.75 | 0.25 | 1256 × 828 | 7000
Deep Blending [35] | 27.01 | 0.87 | 0.32 | 1332 × 876 | 7000
Table 4.5: Performance comparison of 3D Gaussian Splatting on different datasets. Our dataset achieves strong performance compared to other datasets. Although Deep Blending demonstrates a higher PSNR, it only contains 2.6K images.
4.4.2 Evaluation on Semantic Segmentation and Completion
We also evaluate our dataset using key tasks: semantic segmentation, panoptic segmentation, and semantic
scene completion. Semantic segmentation is crucial for understanding and labeling every point in a 3D
point cloud with a specific class, providing detailed insights into the composition of the scene. Panoptic
segmentation extends this by not only classifying each point but also distinguishing between different
instances of the same class. This is particularly valuable for environments with multiple similar objects,
enhancing the dataset’s utility in more complex and dynamic scenarios. Lastly, semantic scene completion
involves predicting the complete geometry and semantics of a scene, including occluded and unobserved
regions. This task is vital for creating comprehensive and accurate representations of environments, which
is indispensable for advanced applications in augmented reality and spatial analysis. We have included the
results in the supplemental material.
4.5 Caveats
Thus far, our annotations have been machine-generated using the latest foundation models. Although this may pose a few risks, to the best of our knowledge, our method is the first of its kind to annotate 3D point clouds using image- and text-based foundation models without any manual intervention. Casual inspection by the authors suggests that the annotations are indeed of high quality. However, we plan to validate them by hiring a group of human annotators to inspect and possibly correct a fraction of the machine-generated annotations. We expect that this will be completed by the time of publication.
4.6 Discussion and Conclusion
In this chapter, we introduced the USCILab3D dataset, a comprehensive outdoor 3D dataset designed to address the limitations of existing datasets in the domain of 3D scene understanding and navigation. Our dataset offers a diverse array of complex intersections and outdoor scenes meticulously collected across the USC University Park campus. With approximately 10 million images and 1.4 million dense point cloud scans, our dataset prioritizes intricate areas, enabling more precise 3D labelling and facilitating a broader spectrum of 3D vision tasks.
Moving forward, we believe that the USCILab3D dataset will serve as a valuable resource for researchers
and practitioners across various domains, including computer vision, robotics, and machine learning. We
anticipate that the dataset will stimulate further advancements in 3D vision-based models and foster the
development of robust algorithms capable of tackling real-world challenges in outdoor environments.
Chapter 5
BeoGym
Learning to navigate without a map is a task designed to enable agents to mimic human-like goal-oriented behaviors, relying solely on visual observations. Simulators are widely used in practice to enable agents to learn such behaviors. However, many recent works [14] fulfill only a few of the following requirements: photorealistic rendering, high performance, efficient utilization of compute resources, and real-world transferability. Our method aims to fulfill all of these requirements through advanced techniques in novel view rendering, such as Neural Radiance Fields (NeRF) [65] and Gaussian Splatting [47]. By interpolating intermediate views, we seek to develop a real-time simulator that not only facilitates more effective learning and navigation for robotic agents but also enhances their applicability in real-world environments.
5.1 Related Work
Existing Simulators Visual navigation is a critical component in numerous domains. StreetLearn [66] introduces a novel approach by providing a static dataset and a navigation engine tailored for visual navigation and reinforcement learning (RL) tasks. The StreetLearn dataset is derived from Google Street View imagery and uses a graph data structure, where every node corresponds to a geo-tagged panorama image. While StreetLearn offers real-world images, it relies on sparse images rather than complete 3D reconstructions.
[Bar chart: rendering speed (FPS) for Carla, StreetLearn, Isaac Gym, Habitat, and Ours]
Figure 5.1: Comparison with other simulators. Rendering speed in frames per second (FPS) recorded for a single-thread process at a frame resolution of 1280 × 720; a single episode is ∼200 timesteps.
On the other hand, the Habitat/Gibson indoor simulators [96, 124] build on large-scale 3D reconstruction datasets for research in indoor robotic navigation and scene understanding. Unlike the outdoor environments captured by StreetLearn, Habitat/Gibson focuses on indoor spaces and provides rich 3D reconstructed meshes of the scenes instead of sparse images. However, unlike ours, these simulators are limited to indoor environments; BeoGym is capable of rendering images from our large-scale outdoor scenes with different weather and lighting conditions. Although photo-realistic simulators like Carla or AirSim [109, 14] are capable of rendering dynamic simulations with custom weather presets and human and vehicle agents, they are computationally expensive and inefficient for data-hungry reinforcement learning algorithms. Moreover, since they are not built from real-world data, there is a significant gap between real-world and simulated percepts.
Novel-view scene synthesis plays a crucial role in various applications, offering the potential for more realistic simulations. Traditional methods have inherent limitations that impede the creation of truly lifelike environments. In recent years, Radiance Field methods [69, 120] and Gaussian Splatting [47, 27, 71] have emerged as promising solutions to address these limitations, significantly enhancing the quality of novel-view scene synthesis.
[Diagram labels: Agent, Simulator, Motion Model, Gaussian Splat, Sequence graph, Generic model (neural network), Control signal, Inner loop, Outer loop, Rendered image, Agent pose; the outer loop loads the Gaussian splat (GS), elevation map (EM), and occupancy map (OM) from the current node]
Figure 5.2: Overview of our simulator. The agent obtains a percept/image from the simulator and estimates a control signal ut. The outer loop in the simulator determines whether the agent has passed the boundaries of the current sector and whether a new splat file has to be loaded using the sequence graph. The inner loop corresponds to the motion model that computes the pose xt, which is then used to render the percept at the next timestep using the Gaussian splat file.
We explore the capabilities of Radiance Field methods and Gaussian Splatting, and their potential to create a high-performance simulator that operates in real time.
5.2 Proposed Simulator
We propose BeoGym, a real-time simulator that allows an autonomous agent to navigate in an environment. After the agent executes an action/control signal ut given an input image It, the simulator computes the pose xt+1 at the next timestep using a motion model and renders an image by querying the splat file. A crucial aspect of a simulator, apart from realistic quality, is high performance in terms of rendering speed. Our method ensures that the visual feedback provided to the agent is both realistic and computationally efficient, facilitating more effective training and navigation.
Figure 5.3: Sample sequence graph. The left graph depicts the node graph, where each node represents a scan in the trajectory. The right graph illustrates the corresponding sequence graph, with each node representing a sector. The value of each node in the sequence graph indicates the number of scans within that sector.
5.2.1 Gaussian splat based rendering
Our simulator uses the USCILab3D dataset [49], which consists of images and point cloud data collected by a mobile robot across the USC campus. The dataset consists of a set of sequences/trajectories, each comprising a stream of multi-view images and point cloud scans. Each scan in the sequences is annotated with a global pose; these are used to construct the elevation maps and the occupancy maps (Figure 5.4).
In our methodology, each sector is defined by a splat file trained on a segmented portion of a trajectory. To optimize the efficiency and performance of reconstruction, we subdivided each session into several sectors, each containing 1250 images captured from the five cameras. For every sector, we use the collection of images and COLMAP [98] to automatically obtain poses. This pose-annotated set of images is passed as an input for training a splat file using Gaussian Splatting [47]. The result is
an explicit 3D representation of a specific scene or sector. Subsequently, we store these splat files, which are later used for rendering purposes.
5.2.2 Sequence Graph for querying splat files
To facilitate navigation between sectors, we constructed a sequence graph based on discrete image poses. In this graph, each sector serves as a node, holding a splat file and other related metadata. The connectivity between sectors is represented by an edge that stores the transformation matrix T_a^b from splat file a to splat file b. An example of a generic graph structure as used in StreetLearn and the corresponding sequence-graph equivalent of the node graph is shown in Figure 5.3.
As the agent navigates within the simulation environment, a key element of realism is its transition between different sectors. When the agent approaches the boundary of its current sector, the simulator recognizes this transition and initiates the process of loading the appropriate splat file for the new sector. This mechanism ensures a seamless visual experience as the agent moves through diverse parts of the simulated environment.
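A minimal sketch of this structure, assuming networkx; the attribute names (splat_path, T, and so on) are our illustrative choices rather than the released format.

```python
import networkx as nx
import numpy as np

G = nx.Graph()
# Each node is a sector holding its splat file and map assets.
G.add_node("sector_00", splat_path="session0/sector_00.ply",
           elevation_map="session0/em_00.npy",
           occupancy_map="session0/om_00.npy")
G.add_node("sector_01", splat_path="session0/sector_01.ply",
           elevation_map="session0/em_01.npy",
           occupancy_map="session0/om_01.npy")
# The edge stores the 4x4 transform T_a^b between adjacent splat frames
# (identity here only as a placeholder).
G.add_edge("sector_00", "sector_01", T=np.eye(4))

def transform_to_neighbor(G, a, b, pose_h):
    """Map a homogeneous 4x4 pose from splat frame a into frame b."""
    return G.edges[a, b]["T"] @ pose_h
```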
Algorithm 2 BeoGym simulation pipeline
Require: Agent model fϕ parameterized by ϕ
Require: Motion model P
Require: Sequence graph G(V, E)
Require: Splat file SV, elevation map EV, occupancy map OV of a corresponding vertex V
Require: Let the image rendered by the simulator be It
Require: Let the current 6D pose of the agent be xt
Require: Let the goal pose be xg
1: Randomly sample a node V from G(V, E)
2: Obtain SV, EV and OV from the node V
3: Sample an initial pose xt from OV where OV[xt] is free
4: Render It using SV(xt)
5: while xt ≠ xg do
6:   Compute the action command ut = fϕ(It)
7:   Compute the next 6D pose xt+1 using P(xt, ut, EV)
8:   Set xt ← xt+1
9:   if OV(xt) is occupied then
10:    Collision detected
11:  end if
12:  if xt is outside EV then
13:    Load the node adjacent to V and set SV, EV, OV
14:  end if
15: end while
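The following Python transcription of Algorithm 2 covers the inner loop for a single sector; the renderer, agent, and motion model are abstracted as callables, and all names here are illustrative placeholders rather than the BeoGym API.

```python
import numpy as np

def run_episode(splat_render, agent, motion_model, om, em, goal_xy,
                max_steps=200, goal_radius=1.0):
    """Inner loop of Algorithm 2 for one sector (no sector switching)."""
    free = np.argwhere(om == 0)                # free cells of the occupancy map
    yx = free[np.random.randint(len(free))]    # step 3: sample a free start
    # Simplified pose (x, y, z, yaw) in grid coordinates; z from the EM.
    pose = np.array([yx[1], yx[0], em[yx[0], yx[1]], 0.0], dtype=float)
    for _ in range(max_steps):
        image = splat_render(pose)             # step 4/loop: render percept I_t
        u = agent(image)                       # step 6: control signal u_t
        pose = motion_model(pose, u, em)       # step 7: next pose x_{t+1}
        ix, iy = int(pose[0]), int(pose[1])
        if om[iy, ix] != 0:                    # steps 9-10: collision detected
            break
        if np.linalg.norm(pose[:2] - goal_xy) < goal_radius:
            return pose                        # goal reached
    return None
```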
In the BeoGym simulator, given a specific control signal, the agent's pose xt at each timestep t is governed by the motion model. This model is crucial for simulating realistic navigation behaviours, akin to those exhibited by actual robots in real-world scenarios. One key challenge addressed in our model is the maintenance of a consistent height, roll, and pitch orientation relative to the ground. This is crucial for ensuring realism and practical applicability, especially in environments where the terrain may vary, potentially leading to the agent "floating" above the ground or colliding with terrain features. To address this, we employ a method to accurately compute the z-axis component, or elevation, of the agent's pose, using elevation and occupancy maps obtained from the LiDAR data.
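A minimal sketch of this elevation lookup, assuming the elevation map is a 2D grid of ground heights at a known metric resolution; bilinear interpolation then gives the z component of the agent's pose at any (x, y). The resolution value is an illustrative assumption.

```python
import numpy as np

def ground_height(em, x, y, resolution=0.1):
    """Bilinearly interpolate the elevation map at metric position (x, y)."""
    gx, gy = x / resolution, y / resolution        # metric -> grid coordinates
    x0, y0 = int(np.floor(gx)), int(np.floor(gy))
    dx, dy = gx - x0, gy - y0
    z00, z10 = em[y0, x0], em[y0, x0 + 1]          # four surrounding cells
    z01, z11 = em[y0 + 1, x0], em[y0 + 1, x0 + 1]
    return ((1 - dx) * (1 - dy) * z00 + dx * (1 - dy) * z10 +
            (1 - dx) * dy * z01 + dx * dy * z11)
```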
Furthermore, our simulation is designed to be adaptable and responsive. An agent can be initialized at any location within the simulation environment that the occupancy map marks as free. As the agent moves, guided by the motion model, its pose is continually updated and the rendering process adapts accordingly, creating a seamless and continuous simulation experience. This iterative process, where the motion model predicts the next pose and the simulator renders the new view, forms the core inner loop of our simulation, as shown in Figure 5.2.
Figure 5.4: Occupancy map and elevation map. These maps are computed from the 3D data collected in the USCILab3D dataset and are used for validating agent poses and estimating collisions.
[Panels: Gaussian Splatting 30k, Gaussian Splatting 7k, Instant-NGP 30k]
Figure 5.5: Comparison of the quality of different novel-view rendering methods on our dataset. Instant-NGP exhibits poor quality compared to Gaussian Splatting, with noticeable blurring in the area highlighted by the red circle.
As stated before, within each sector we employ COLMAP [97] to obtain image poses that are then used for training a splat file. However, it is important to note that these poses are not the ground-truth poses obtained from the LiDAR sensor and are merely used for training a Gaussian splat file. To utilize elevation maps derived from the ground-truth point clouds, we must transform the coordinate system of the COLMAP poses to that of the elevation maps. We employ the Kabsch algorithm [42] to compute this coordinate transformation. The ground-truth poses obtained from LiDAR in each sector are less dense than the image data, so we pair these poses with the images having the closest timestamps. After forming all pairs, the Kabsch algorithm is executed to derive the optimal translation and rotation matrices that minimize the root mean square (RMS) deviation between the two sets of points.
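For reference, here is a minimal numpy sketch of the Kabsch step on paired COLMAP and LiDAR positions (both Nx3); it assumes a consistent scale between the two sets, and the function name is ours.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation t minimizing RMS of (R @ p + t) vs q."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)  # center both sets
    H = Pc.T @ Qc                                    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Q.mean(axis=0) - R @ P.mean(axis=0)
    return R, t                                      # so that Q ≈ P @ R.T + t
```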
5.3 Benchmarking Results
We compare the performance of our simulator with other SoTA simulators as shown in Figure 5.1. In our
investigation, we conducted a thorough assessment of various novel-view rendering techniques, with a
specific emphasis on rendering speed (FPS), rendering clarity (PSNR), computational memory footprint
(Memory) and resource utilization (GPU usage) – all of which are crucial considerations for our real-time
simulator.
Notably, 3D Gaussian Splatting exhibits remarkable frame rates and achieves leading Peak Signal-to-Noise Ratio (PSNR) scores. Given our navigation task's use of low-resolution imagery (84 × 84), minor differences in PSNR have a negligible impact on the overall outcome. Therefore, we opted for a training regimen of 7000 iterations, resulting in the highest observed frame rate of 392 frames per second (FPS). Conversely, Instant-NGP struggled to produce satisfactory results with our dataset, particularly in scenarios involving large-scale outdoor environments. Notably, it encountered difficulties in achieving clear reconstructions when scenes comprised both near and distant objects, as depicted in Figure 5.5. 4D Gaussian Splatting has a higher PSNR, but its frame rate is too low for our simulator. This superiority of 3D Gaussian Splatting is crucial for achieving the real-time rendering capabilities essential for our simulation framework. Additionally, we conducted an assessment of GPU memory and usage, as outlined in Table ??. This evaluation holds significant importance as it directly impacts the number of agents we can train concurrently. Striking a balance between rendering efficiency and GPU resource utilization is essential for optimizing our simulator's capacity for large-scale, concurrent agent training. The bottleneck for both Gaussian Splatting and Instant-NGP lies within GPU constraints. While there is a variance in memory usage between 3D Gaussian Splatting and Instant-NGP, this discrepancy becomes inconsequential due to an inherent limitation of GPU utilization: each GPU can only execute a single task at one time.
5.4 Discussion and Conclusion
In this chapter, we presented preliminary results on the performance and operation of our proposed data-driven simulator, BeoGym. BeoGym consists of a Gaussian splatting based generative model that uses camera poses to train and render images. During simulation, a particular splat file is loaded using a sequence graph, and the poses of the simulated agent used to render images are computed using a motion model. We evaluate a multitude of novel-view rendering methods to find the one best suited for optimal real-time
simulation. We are currently working on conducting more comprehensive evaluations and benchmarks for
our simulator.
Chapter 6
Conclusion and Future directions
This thesis has explored collecting and using image-based unlabeled datasets to pretrain visual encoders that can then be used to quickly learn novel downstream visual navigation tasks. In this way, we have addressed some key challenges in robot learning related to the limited availability of labeled datasets for learning robotic tasks. Through the combination of methods presented, this work has provided significant insights into using unlabeled video datasets to efficiently pretrain visual representations. In the later chapters, we also explored ways of automatically annotating unlabeled datasets using foundation models, which could potentially be used for improved pretraining; we leave this for future work.
We first deployed a system that uses models trained from unlabeled multi-modal data obtained from a photo-realistic simulator, demonstrating robustness and transfer to novel and real-world environments (Chapter 2). This was followed by another pretraining method, named Value Explicit Pretraining (VEP), that learns visual representations from unlabeled videos collected across various cities (Chapter 3). VEP enabled models to learn representations that are map- and city-agnostic and quickly generalize to new cities and environments. Collectively, these studies have advanced our understanding of efficiently pretraining models using unlabeled datasets, offering ways to learn visual encoders that can be deployed for fast downstream transfer.
With the recent breakthrough of using large transformer-based models to efficiently solve language and vision tasks, we also explored efficiently generating spatial augmentations and semantic labels for existing unlabeled datasets. We describe the details in Chapters 4 and 5.
Despite these achievements, this research has certain limitations: our work is limited to point-goal navigation. Although it could be applied to the current era of autonomous self-driving vehicles and package-delivery robots that operate in outdoor environments, the true utility of leveraging large-scale pretraining lies in the ability to interact with the surroundings. We have only begun to scratch the surface of the vast potential represented by unlabeled data, and significant advancements can be achieved by leveraging the extensive unlabeled datasets currently available. Using these vast amounts of unlabeled data, more complex tasks that involve semantic and spatial understanding could be used to evaluate the capability and performance of state-of-the-art pretrained models.
This direction could be further extended by enabling mobile robots to execute basic manipulation tasks alongside intelligent navigation. This would be useful for evaluating how well offline pretraining generalizes to real-world interactions. These limitations present opportunities for future research arising from this work.
In conclusion, this thesis has explored utilizing unlabeled datasets for pretraining, providing a foundation for future advancements in robot learning. Compared to labeled data, which is expensive to obtain, unlabeled data is abundantly available. The findings and methodologies presented here aim to enable robots to learn tasks quickly while minimizing the need for access to large labeled datasets, and it is hoped that this research will inspire further innovation and exploration.
References
[1] Josh Achiam et al. “Gpt-4 technical report”. In: arXiv preprint arXiv:2303.08774 (2023).
[2] Maruan Al-Shedivat et al. “Continuous Adaptation via Meta-Learning in Nonstationary and
Competitive Environments”. In: 6th International Conference on Learning Representations, ICLR 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
url: https://openreview.net/forum?id=Sk2u1g-0-.
[3] Ankesh Anand et al. “Unsupervised State Representation Learning in Atari”. In: Advances in Neural
Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019,
NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. Ed. by Hanna M. Wallach et al. 2019,
pp. 8766–8779. url: https:
//proceedings.neurips.cc/paper/2019/hash/6fb52e71b837628ac16539c1ff911667-Abstract.html.
[4] Marcin Andrychowicz et al. “Hindsight Experience Replay”. In: Advances in Neural Information
Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9
December 2017, Long Beach, CA, USA. Ed. by Isabelle Guyon et al. 2017, pp. 5048–5058. url:
http://papers.nips.cc/paper/7090-hindsight-experience-replay.
[5] Karol Arndt et al. “Meta Reinforcement Learning for Sim-to-real Domain Adaptation”. In: CoRR
abs/1909.12906 (2019). arXiv: 1909.12906. url: http://arxiv.org/abs/1909.12906.
[6] Henghui Bao, Kiran Lekkala, and Laurent Itti. “Real-world Visual Navigation in a Simulator: A New
Benchmark”. In: The First Workshop on Populating Empty Cities – Virtual Humans for Robotics and
Autonomous Driving at CVPR 2024, 2nd Round. 2024. url:
https://openreview.net/forum?id=e2InrwYhK5.
[7] J. Behley et al. “SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences”.
In: Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV). 2019.
[8] Luca Bertinetto et al. “Meta-learning with differentiable closed-form solvers”. In: 7th International
Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net, 2019. url: https://openreview.net/forum?id=HyxnZh0ct7.
[9] Greg Brockman et al. “OpenAI Gym”. In: CoRR abs/1606.01540 (2016). arXiv: 1606.01540. url:
http://arxiv.org/abs/1606.01540.
[10] Yuri Burda et al. “Large-Scale Study of Curiosity-Driven Learning”. In: 7th International Conference
on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
url: https://openreview.net/forum?id=rJNwDjAqYX.
[11] Holger Caesar et al. nuScenes: A multimodal dataset for autonomous driving. 2020. arXiv: 1903.11027
[cs.LG]. url: https://arxiv.org/abs/1903.11027.
[12] Angel Chang et al. “Matterport3D: Learning from RGB-D Data in Indoor Environments”. In:
International Conference on 3D Vision (3DV) (2017).
[13] Angela Dai et al. “ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes”. In: 2017 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26,
2017. IEEE Computer Society, 2017, pp. 2432–2443. doi: 10.1109/CVPR.2017.261.
[14] Alexey Dosovitskiy et al. “CARLA: An Open Urban Driving Simulator”. In: 1st Annual Conference
on Robot Learning, CoRL 2017, Mountain View, California, USA, November 13-15, 2017, Proceedings.
Vol. 78. Proceedings of Machine Learning Research. PMLR, 2017, pp. 1–16. url:
http://proceedings.mlr.press/v78/dosovitskiy17a.html.
[15] Martin Ester et al. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
with Noise”. In: Proceedings of the Second International Conference on Knowledge Discovery and Data
Mining (KDD-96), Portland, Oregon, USA. Ed. by Evangelos Simoudis, Jiawei Han, and
Usama M. Fayyad. AAAI Press, 1996, pp. 226–231. url:
http://www.aaai.org/Library/KDD/1996/kdd96-037.php.
[16] Scott Ettinger et al. Large Scale Interactive Motion Forecasting for Autonomous Driving : The Waymo
Open Motion Dataset. 2021. arXiv: 2104.10133 [cs.CV]. url: https://arxiv.org/abs/2104.10133.
[17] Benjamin Eysenbach et al. “Contrastive Learning as Goal-Conditioned Reinforcement Learning”. In: NeurIPS. 2022. url: http://papers.nips.cc/paper_files/paper/2022/hash/e7663e974c4ee7a2b475a4775201ce1f-Abstract-Conference.html.
[18] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine
Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. Ed. by Doina Precup and
Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 1126–1135.
url: http://proceedings.mlr.press/v70/finn17a.html.
[19] Nicholas Frosst, Nicolas Papernot, and Geoffrey E. Hinton. “Analyzing and Improving
Representations with the Soft Nearest Neighbor Loss”. In: Proceedings of the 36th International
Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Ed. by
Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning
Research. PMLR, 2019, pp. 2012–2020. url: http://proceedings.mlr.press/v97/frosst19a.html.
[20] Justin Fu et al. “D4RL: Datasets for Deep Data-Driven Reinforcement Learning”. In: CoRR
abs/2004.07219 (2020). arXiv: 2004.07219. url: https://arxiv.org/abs/2004.07219.
[21] Shani Gamrian and Yoav Goldberg. “Transfer Learning for Related Reinforcement Learning Tasks
via Image-to-Image Translation”. In: Proceedings of the 36th International Conference on Machine
Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Ed. by Kamalika Chaudhuri and
Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019,
pp. 2063–2072. url: http://proceedings.mlr.press/v97/gamrian19a.html.
[22] Xitong Gao et al. “Dynamic Channel Pruning: Feature Boosting and Suppression”. In: 7th
International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9,
2019. OpenReview.net, 2019. url: https://openreview.net/forum?id=BJxh2j0qYm.
[23] Yang Gao et al. “Reinforcement Learning from Imperfect Demonstrations”. In: CoRR abs/1802.05313
(2018). arXiv: 1802.05313. url: http://arxiv.org/abs/1802.05313.
[24] Yuan Gao et al. “MTL-NAS: Task-Agnostic Neural Architecture Search towards General-Purpose
Multi-Task Learning”. In: CoRR abs/2003.14058 (2020). arXiv: 2003.14058. url:
https://arxiv.org/abs/2003.14058.
[25] Yunhao Ge et al. “Lightweight Learner for Shared Knowledge Lifelong Learning”. In: Trans. Mach.
Learn. Res. 2023 (2023). url: https://openreview.net/forum?id=Jjl2c8kWUc.
[26] Timnit Gebru, Judy Hoffman, and Li Fei-Fei. “Fine-Grained Recognition in the Wild: A Multi-task
Domain Adaptation Approach”. In: IEEE International Conference on Computer Vision, ICCV 2017,
Venice, Italy, October 22-29, 2017. IEEE Computer Society, 2017, pp. 1358–1367. doi:
10.1109/ICCV.2017.151.
[27] Sharath Girish, Kamal Gupta, and Abhinav Shrivastava. “EAGLES: Efficient Accelerated 3D
Gaussians with Lightweight EncodingS”. In: Computer Vision - ECCV 2024 - 18th European
Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXIII. Ed. by Ales Leonardis
et al. Vol. 15121. Lecture Notes in Computer Science. Springer, 2024, pp. 54–71. doi:
10.1007/978-3-031-73036-8\_4.
[28] Xiaoxiao Guo et al. “Hybrid Reinforcement Learning with Expert State Sequences”. In: The
Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative
Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on
Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 -
February 1, 2019. AAAI Press, 2019, pp. 3739–3746. doi: 10.1609/aaai.v33i01.33013739.
[29] Xin Guo et al. “Continual Learning Long Short Term Memory”. In: Findings of the Association for
Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020. Ed. by Trevor Cohn,
Yulan He, and Yang Liu. Vol. EMNLP 2020. Findings of ACL. Association for Computational
Linguistics, 2020, pp. 1817–1822. doi: 10.18653/V1/2020.FINDINGS-EMNLP.164.
[30] David Ha. “A Visual Guide to Evolution Strategies”. In: blog.otoro.net (2017). url:
https://blog.otoro.net/2017/10/29/visual-evolution-strategies/.
[31] David Ha and Jürgen Schmidhuber. “Recurrent World Models Facilitate Policy Evolution”. In:
Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada. 2018, pp. 2455–2467.
url: http://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution.
[32] David Ha and Jürgen Schmidhuber. “Recurrent World Models Facilitate Policy Evolution”. In:
Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information
Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada. Ed. by Samy Bengio
et al. 2018, pp. 2455–2467. url: https:
//proceedings.neurips.cc/paper/2018/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html.
[33] Nikolaus Hansen. “The CMA Evolution Strategy: A Tutorial”. In: CoRR abs/1604.00772 (2016). arXiv:
1604.00772. url: http://arxiv.org/abs/1604.00772.
[34] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE
Computer Society, 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
[35] Peter Hedman et al. “Deep blending for free-viewpoint image-based rendering”. In: ACM Trans.
Graph. 37.6 (2018), p. 257. doi: 10.1145/3272127.3275084.
[36] Irina Higgins et al. “beta-VAE: Learning Basic Visual Concepts with a Constrained Variational
Framework”. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France,
April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. url:
https://openreview.net/forum?id=Sy2fzU9gl.
[37] Timothy M. Hospedales et al. “Meta-Learning in Neural Networks: A Survey”. In: CoRR
abs/2004.05439 (2020). arXiv: 2004.05439. url: https://arxiv.org/abs/2004.05439.
[38] Ruibing Hou et al. “Cross Attention Network for Few-shot Classification”. In: Advances in Neural
Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019,
NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada. Ed. by Hanna M. Wallach et al. 2019,
pp. 4005–4016. url:
http://papers.nips.cc/paper/8655-cross-attention-network-for-few-shot-classification.
[39] Jie Hu, Li Shen, and Gang Sun. “Squeeze-and-Excitation Networks”. In: 2018 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE
Computer Society, 2018, pp. 7132–7141. doi: 10.1109/CVPR.2018.00745.
[40] Rasmus Ramsbøl Jensen et al. “Large Scale Multi-view Stereopsis Evaluation”. In: 2014 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28,
2014. IEEE Computer Society, 2014, pp. 406–413. doi: 10.1109/CVPR.2014.59.
[41] Josip Josifovski et al. “Analysis of Randomization Effects on Sim2Real Transfer in Reinforcement
Learning for Robotic Manipulation Tasks”. In: IEEE/RSJ International Conference on Intelligent
Robots and Systems, IROS 2022, Kyoto, Japan, October 23-27, 2022. IEEE, 2022, pp. 10193–10200. doi:
10.1109/IROS47612.2022.9981951.
[42] Wolfgang Kabsch. “A solution for the best rotation to relate two sets of vectors”. In: Acta
Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography
32.5 (1976), pp. 922–923.
[43] Bingyi Kang and Jiashi Feng. “Transferable Meta Learning Across Domains”. In: Proceedings of the
Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California,
USA, August 6-10, 2018. Ed. by Amir Globerson and Ricardo Silva. AUAI Press, 2018, pp. 177–187.
url: http://auai.org/uai2018/proceedings/papers/61.pdf.
[44] Siddharth Karamcheti et al. “Language-Driven Representation Learning for Robotics”. In: Robotics:
Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023. Ed. by Kostas E. Bekris et al.
2023. doi: 10.15607/RSS.2023.XIX.032.
[45] Manuel Kaspar, Juan David Munoz Osorio, and Jürgen Bock. “Sim2Real Transfer for Reinforcement
Learning without Dynamics Randomization”. In: arXiv e-prints, arXiv:2002.11635 (Feb. 2020). doi:
10.48550/arXiv.2002.11635. arXiv: 2002.11635 [cs.AI].
[46] Alex Kendall, Yarin Gal, and Roberto Cipolla. “Multi-Task Learning Using Uncertainty to Weigh
Losses for Scene Geometry and Semantics”. In: 2018 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society,
2018, pp. 7482–7491. doi: 10.1109/CVPR.2018.00781.
[47] Bernhard Kerbl et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering”. In: ACM
Trans. Graph. 42.4 (2023), 139:1–139:14. doi: 10.1145/3592433.
[48] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: 2nd International
Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference
Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2014. url:
http://arxiv.org/abs/1312.6114.
[49] Kiran Lekkala*, Henghui Bao*, and Laurent Itti. USCILab3D: A Large-scale, Long-term, Semantically
Annotated Outdoor Dataset. 2024. url: https://sites.google.com/usc.edu/uscilab3d/ (visited on
02/01/2024).
[50] Arno Knapitsch et al. “Tanks and temples: benchmarking large-scale scene reconstruction”. In:
ACM Trans. Graph. 36.4 (2017), 78:1–78:13. doi: 10.1145/3072959.3073599.
[51] Kiran Lekkala and Laurent Itti. “Shaped Policy Search for Evolutionary Strategies using
Waypoints”. In: IEEE International Conference on Robotics and Automation, ICRA 2021, Xi’an, China,
May 30 - June 5, 2021. IEEE, 2021, pp. 9093–9100. doi: 10.1109/ICRA48506.2021.9561607.
[52] Kiran Kumar Lekkala and Vinay Kumar Mittal. “Artificial intelligence for precision movement
robot”. In: 2015 2nd International Conference on Signal Processing and Integrated Networks (SPIN).
IEEE. 2015, pp. 378–383.
[53] Timothée Lesort et al. “State representation learning for control: An overview”. In: Neural Networks
108 (2018), pp. 379–392. doi: 10.1016/J.NEUNET.2018.07.006.
[54] Da Li and Timothy M. Hospedales. “Online Meta-Learning for Multi-Source and Semi-Supervised
Domain Adaptation”. In: CoRR abs/2004.04398 (2020). arXiv: 2004.04398. url:
https://arxiv.org/abs/2004.04398.
[55] Eric Liang et al. “RLlib: Abstractions for Distributed Reinforcement Learning”. In: Proceedings of the
35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm,
Sweden, July 10-15, 2018. Ed. by Jennifer G. Dy and Andreas Krause. Vol. 80. Proceedings of
Machine Learning Research. PMLR, 2018, pp. 3059–3068. url:
http://proceedings.mlr.press/v80/liang18b.html.
[56] Shikun Liu, Edward Johns, and Andrew J. Davison. “End-To-End Multi-Task Learning With
Attention”. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach,
CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019, pp. 1871–1880. doi:
10.1109/CVPR.2019.00197.
[57] David Lopez-Paz and Marc’Aurelio Ranzato. “Gradient Episodic Memory for Continual Learning”.
In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information
Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. Ed. by Isabelle Guyon et al. 2017,
pp. 6467–6476. url: https://proceedings.neurips.cc/paper/2017/hash/f87522788a2be2d171666752f97ddebb-Abstract.html.
[58] Chenyang Lu, Marinus Jacobus Gerardus van de Molengraft, and Gijs Dubbelman. “Monocular
Semantic Occupancy Grid Mapping With Convolutional Variational Encoder–Decoder Networks”.
In: IEEE Robotics and Automation Letters 4.2 (2019), pp. 445–452. doi: 10.1109/lra.2019.2891028.
[59] Lennart Luttkus, Peter Krönes, and Lars Mikelsons. “Scoomatic: Simulation and Validation of a
Semi-Autonomous Individual Last-Mile Vehicle”. In: Sechste IFToMM D-A-CH Konferenz 2020: 27./28.
Februar 2020, Campus Technik Lienz. Vol. 2020. Feb. 21, 2020. doi: 10.17185/duepublico/71204.
[60] Yecheng Jason Ma et al. “VIP: Towards Universal Visual Reward and Representation via
Value-Implicit Pre-Training”. In: The Eleventh International Conference on Learning Representations,
ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. url:
https://openreview.net/pdf?id=YJ7o2wetJ2.
[61] Niru Maheswaranathan et al. “Guided evolutionary strategies: augmenting random search with
surrogate gradients”. In: Proceedings of the 36th International Conference on Machine Learning, ICML
2019, 9-15 June 2019, Long Beach, California, USA. Ed. by Kamalika Chaudhuri and
Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019,
pp. 4264–4273. url: http://proceedings.mlr.press/v97/maheswaranathan19a.html.
[62] Arjun Majumdar et al. “Where are we in the search for an Artificial Visual Cortex for Embodied
Intelligence?” In: Advances in Neural Information Processing Systems 36: Annual Conference on
Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16,
2023. Ed. by Alice Oh et al. 2023. url:
http://papers.nips.cc/paper_files/paper/2023/hash/022ca1bed6b574b962c48a2856eb207b-Abstract-Conference.html.
[63] Bogdan Mazoure et al. “Accelerating exploration and representation learning with offline
pre-training”. In: CoRR abs/2304.00046 (2023). doi: 10.48550/arXiv.2304.00046. arXiv: 2304.00046.
[64] Ben Mildenhall et al. “Local Light Field Fusion: Practical View Synthesis with Prescriptive Sampling
Guidelines”. In: ACM Transactions on Graphics (TOG) (2019).
[65] Ben Mildenhall et al. “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”. In:
Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020,
Proceedings, Part I. Ed. by Andrea Vedaldi et al. Vol. 12346. Lecture Notes in Computer Science.
Springer, 2020, pp. 405–421. doi: 10.1007/978-3-030-58452-8_24.
[66] Piotr Mirowski et al. “Learning to Navigate in Cities Without a Map”. In: Neural Information
Processing Systems (NeurIPS). 2018.
[67] Piotr Mirowski et al. “The StreetLearn Environment and Dataset”. In: CoRR abs/1903.01292 (2019).
arXiv: 1903.01292. url: http://arxiv.org/abs/1903.01292.
[68] Ishan Misra et al. “Cross-Stitch Networks for Multi-task Learning”. In: 2016 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE
Computer Society, 2016, pp. 3994–4003. doi: 10.1109/CVPR.2016.433.
[69] Thomas Müller et al. “Instant neural graphics primitives with a multiresolution hash encoding”. In:
ACM Trans. Graph. 41.4 (2022), 102:1–102:15. doi: 10.1145/3528223.3530127.
[70] Suraj Nair et al. “R3M: A Universal Visual Representation for Robot Manipulation”. In: Conference
on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand. Ed. by Karen Liu,
Dana Kulic, and Jeffrey Ichnowski. Vol. 205. Proceedings of Machine Learning Research. PMLR,
2022, pp. 892–909. url: https://proceedings.mlr.press/v205/nair23a.html.
[71] Simon Niedermayr, Josef Stumpfegger, and Rüdiger Westermann. “Compressed 3D Gaussian
Splatting for Accelerated Novel View Synthesis”. In: IEEE/CVF Conference on Computer Vision and
Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. IEEE, 2024, pp. 10349–10358. doi:
10.1109/CVPR52733.2024.00985.
[72] Abby O’Neill et al. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models : Open
X-Embodiment Collaboration”. In: IEEE International Conference on Robotics and Automation, ICRA
2024, Yokohama, Japan, May 13-17, 2024. IEEE, 2024, pp. 6892–6903. doi:
10.1109/ICRA57147.2024.10611477.
[73] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation Learning with Contrastive
Predictive Coding”. In: CoRR abs/1807.03748 (2018). arXiv: 1807.03748. url:
http://arxiv.org/abs/1807.03748.
[74] Bowen Pan et al. “Cross-View Semantic Segmentation for Sensing Surroundings”. In: IEEE Robotics
and Automation Letters 5.3 (2020), pp. 4867–4873. doi: 10.1109/lra.2020.3004325.
[75] Simone Parisi et al. “The unsurprising effectiveness of pre-trained vision models for control”. In:
International Conference on Machine Learning. PMLR. 2022, pp. 17359–17371.
[76] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. “Self-Supervised Exploration via
Disagreement”. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019,
9-15 June 2019, Long Beach, California, USA. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov.
Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 5062–5071. url:
http://proceedings.mlr.press/v97/pathak19a.html.
[77] Brahma S. Pavse et al. “RIDM: Reinforced Inverse Dynamics Modeling for Learning from a Single
Observed Demonstration”. In: IEEE Robotics Autom. Lett. 5.4 (2020), pp. 6262–6269. doi:
10.1109/LRA.2020.3010750.
[78] Hugo Prol, Vincent Dumoulin, and Luis Herranz. “Cross-Modulation Networks for Few-Shot
Learning”. In: CoRR abs/1812.00273 (2018). arXiv: 1812.00273. url:
http://arxiv.org/abs/1812.00273.
[79] Alec Radford et al. “Learning Transferable Visual Models From Natural Language Supervision”. In:
Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021,
Virtual Event. Ed. by Marina Meila and Tong Zhang. Vol. 139. Proceedings of Machine Learning
Research. PMLR, 2021, pp. 8748–8763. url: http://proceedings.mlr.press/v139/radford21a.html.
[80] Ilija Radosavovic et al. “Real-World Robot Learning with Masked Visual Pre-training”. In:
Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand. Ed. by
Karen Liu, Dana Kulic, and Jeffrey Ichnowski. Vol. 205. Proceedings of Machine Learning Research.
PMLR, 2022, pp. 416–426. url: https://proceedings.mlr.press/v205/radosavovic23a.html.
[81] Ilija Radosavovic et al. “Real-World Robot Learning with Masked Visual Pre-training”. In:
Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand. Ed. by
Karen Liu, Dana Kulic, and Jeffrey Ichnowski. Vol. 205. Proceedings of Machine Learning Research.
PMLR, 2022, pp. 416–426. url: https://proceedings.mlr.press/v205/radosavovic23a.html.
[82] Ilija Radosavovic et al. “Real-world robot learning with masked visual pre-training”. In: Conference
on Robot Learning. PMLR. 2023, pp. 416–426.
[83] Aniruddh Raghu et al. “Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness
of MAML”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020. OpenReview.net, 2020. url:
https://openreview.net/forum?id=rkgMkCEtPB.
[84] Aravind Rajeswaran et al. “Learning Complex Dexterous Manipulation with Deep Reinforcement
Learning and Demonstrations”. In: Robotics: Science and Systems XIV, Carnegie Mellon University,
Pittsburgh, Pennsylvania, USA, June 26-30, 2018. Ed. by Hadas Kress-Gazit et al. 2018. doi:
10.15607/RSS.2018.XIV.049.
[85] Aravind Rajeswaran et al. “Meta-Learning with Implicit Gradients”. In: Advances in Neural
Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019,
NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada. Ed. by Hanna M. Wallach et al. 2019,
pp. 113–124. url: http://papers.nips.cc/paper/8306-meta-learning-with-implicit-gradients.
[86] Santhosh Kumar Ramakrishnan et al. “Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D
Environments for Embodied AI”. In: Proceedings of the Neural Information Processing Systems Track
on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
Ed. by Joaquin Vanschoren and Sai-Kit Yeung. 2021. url: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/34173cb38f07f89ddbebc2ac9128303f-Abstract-round2.html.
[87] Sachin Ravi and Hugo Larochelle. “Optimization as a Model for Few-Shot Learning”. In: 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conference Track Proceedings. OpenReview.net, 2017. url:
https://openreview.net/forum?id=rJY0-Kcll.
[88] Lennart Reiher, Bastian Lampe, and Lutz Eckstein. “A Sim2Real Deep Learning Approach for the
Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented
Image in Bird’s Eye View”. In: 23rd IEEE International Conference on Intelligent Transportation
Systems, ITSC 2020, Rhodes, Greece, September 20-23, 2020. IEEE, 2020, pp. 1–7. doi:
10.1109/ITSC45102.2020.9294462.
[89] Hongyu Ren, Shengjia Zhao, and Stefano Ermon. “Adaptive Antithetic Sampling for Variance
Reduction”. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019,
9-15 June 2019, Long Beach, California, USA. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov.
Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 5420–5428. url:
http://proceedings.mlr.press/v97/ren19b.html.
[90] Tianhe Ren et al. “Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks”. In:
CoRR abs/2401.14159 (2024). doi: 10.48550/ARXIV.2401.14159. arXiv: 2401.14159.
[91] Christoph Rösmann, Frank Hoffmann, and Torsten Bertram. “Integrated online trajectory planning
and optimization in distinctive topologies”. In: Robotics Auton. Syst. 88 (2017), pp. 142–153. doi:
10.1016/J.ROBOT.2016.11.007.
[92] Denys Rozumnyi et al. “Estimating Generic 3D Room Structures from 2D Annotations”. In:
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information
Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by
Alice Oh et al. 2023. url:
http://papers.nips.cc/paper_files/paper/2023/hash/76bf913ad349686b2aa552a1c6ee0a2e-Abstract-Datasets_and_Benchmarks.html.
[93] Reuven Y. Rubinstein and Dirk P. Kroese. The Cross Entropy Method: A Unified Approach To
Combinatorial Optimization, Monte-Carlo Simulation (Information Science and Statistics). Berlin,
Heidelberg: Springer-Verlag, 2004. isbn: 038721240X.
[94] Andrei A. Rusu et al. “Progressive Neural Networks”. In: CoRR abs/1606.04671 (2016). arXiv:
1606.04671. url: http://arxiv.org/abs/1606.04671.
[95] Tim Salimans et al. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”. In:
CoRR abs/1703.03864 (2017). arXiv: 1703.03864. url: http://arxiv.org/abs/1703.03864.
[96] Manolis Savva et al. “Habitat: A Platform for Embodied AI Research”. In: CoRR abs/1904.01201
(2019). arXiv: 1904.01201. url: http://arxiv.org/abs/1904.01201.
[97] Johannes L. Schönberger and Jan-Michael Frahm. “Structure-from-Motion Revisited”. In: 2016 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30,
2016. IEEE Computer Society, 2016, pp. 4104–4113. doi: 10.1109/CVPR.2016.445.
[98] Johannes Lutz Schönberger and Jan-Michael Frahm. “Structure-from-Motion Revisited”. In:
Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[99] Thomas Schöps et al. “A Multi-view Stereo Benchmark with High-Resolution Images and
Multi-camera Videos”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017, pp. 2538–2547. doi:
10.1109/CVPR.2017.272.
[100] Florian Schroff, Dmitry Kalenichenko, and James Philbin. “FaceNet: A unified embedding for face
recognition and clustering”. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2015, Boston, MA, USA, June 7-12, 2015. IEEE Computer Society, 2015, pp. 815–823. doi:
10.1109/CVPR.2015.7298682.
[101] John Schulman et al. “Proximal Policy Optimization Algorithms”. In: CoRR abs/1707.06347 (2017).
arXiv: 1707.06347. url: http://arxiv.org/abs/1707.06347.
[102] John Schulman et al. “Proximal Policy Optimization Algorithms”. In: CoRR abs/1707.06347 (2017).
arXiv: 1707.06347. url: http://arxiv.org/abs/1707.06347.
[103] Max Schwarzer et al. “Pretraining Representations for Data-Efficient Reinforcement Learning”. In:
Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information
Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual. Ed. by Marc’Aurelio Ranzato
et al. 2021, pp. 12686–12699. url: https://proceedings.neurips.cc/paper/2021/hash/69eba34671b3ef1ef38ee85caae6b2a1-Abstract.html.
[104] Frank Sehnke et al. “Parameter-exploring policy gradients”. In: Neural Networks 23.4 (2010),
pp. 551–559. doi: 10.1016/j.neunet.2009.12.004.
[105] Younggyo Seo et al. “Masked world models for visual control”. In: Conference on Robot Learning.
PMLR. 2023, pp. 1332–1344.
[106] Pierre Sermanet et al. “Time-Contrastive Networks: Self-Supervised Learning from Video”. In: 2018
IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25,
2018. IEEE, 2018, pp. 1134–1141. doi: 10.1109/ICRA.2018.8462891.
[107] Rutav Shah and Vikash Kumar. “RRL: Resnet as representation for reinforcement learning”. In: arXiv
preprint arXiv:2107.03380 (2021).
[108] Shital Shah et al. “AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles”.
In: Field and Service Robotics, Results of the 11th International Conference, FSR 2017, Zurich,
Switzerland, 12-15 September 2017. Ed. by Marco Hutter and Roland Siegwart. Vol. 5. Springer
Proceedings in Advanced Robotics. Springer, 2017, pp. 621–635. doi:
10.1007/978-3-319-67361-5_40.
[109] Shital Shah et al. “AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles”.
In: CoRR abs/1705.05065 (2017). arXiv: 1705.05065. url: http://arxiv.org/abs/1705.05065.
[110] Tixiao Shan and Brendan J. Englot. “LeGO-LOAM: Lightweight and Ground-Optimized Lidar
Odometry and Mapping on Variable Terrain”. In: 2018 IEEE/RSJ International Conference on
Intelligent Robots and Systems, IROS 2018, Madrid, Spain, October 1-5, 2018. IEEE, 2018,
pp. 4758–4765. doi: 10.1109/IROS.2018.8594299.
[111] Bokui Shen et al. “iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic
Scenes”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2021, Prague,
Czech Republic, September 27 - Oct. 1, 2021. IEEE, 2021, pp. 7520–7527. doi:
10.1109/IROS51168.2021.9636667.
[112] Jianbo Shi and Carlo Tomasi. “Good features to track”. In: Conference on Computer Vision and
Pattern Recognition, CVPR 1994, 21-23 June, 1994, Seattle, WA, USA. IEEE, 1994, pp. 593–600. doi:
10.1109/CVPR.1994.323794.
[113] Jake Snell, Kevin Swersky, and Richard S. Zemel. “Prototypical Networks for Few-shot Learning”.
In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information
Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. Ed. by Isabelle Guyon et al. 2017,
pp. 4077–4087. url:
http://papers.nips.cc/paper/6996-prototypical-networks-for-few-shot-learning.
[114] Yonglong Tian et al. “Rethinking Few-Shot Image Classification: a Good Embedding Is All You
Need?” In: CoRR abs/2003.11539 (2020). arXiv: 2003.11539. url: https://arxiv.org/abs/2003.11539.
[115] Eleni Triantafillou et al. “Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few
Examples”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020. OpenReview.net, 2020. url:
https://openreview.net/forum?id=rkgAGAVKPr.
[116] Kavisha Vidanapathirana et al. “WildScenes: A Benchmark for 2D and 3D Semantic Segmentation
in Large-scale Natural Environments”. In: The International Journal of Robotics Research (2024). doi:
10.1177/02783649241278369.
[117] Oriol Vinyals et al. “Matching Networks for One Shot Learning”. In: Advances in Neural Information
Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December
5-10, 2016, Barcelona, Spain. Ed. by Daniel D. Lee et al. 2016, pp. 3630–3638. url:
http://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning.
[118] Risto Vuorio et al. “Toward Multimodal Model-Agnostic Meta-Learning”. In: CoRR abs/1812.07172
(2018). arXiv: 1812.07172. url: http://arxiv.org/abs/1812.07172.
[119] Kafeng Wang et al. “Pay Attention to Features, Transfer Learn Faster CNNs”. In: 8th International
Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
OpenReview.net, 2020. url: https://openreview.net/forum?id=ryxyCeHtPB.
[120] Peng Wang et al. “F2-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories”. In:
IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC,
Canada, June 17-24, 2023. IEEE, 2023, pp. 4150–4159. doi: 10.1109/CVPR52729.2023.00404.
[121] Lilian Weng. “Contrastive Representation Learning”. In: lilianweng.github.io (2021). url:
https://lilianweng.github.io/posts/2021-05-31-contrastive/.
[122] Daan Wierstra et al. “Natural evolution strategies”. In: J. Mach. Learn. Res. 15.1 (2014), pp. 949–980.
url: http://dl.acm.org/citation.cfm?id=2638566.
[123] Erik Wijmans et al. “DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames”.
In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April
26-30, 2020. OpenReview.net, 2020. url: https://openreview.net/forum?id=H1gX8C4YPr.
[124] Fei Xia et al. “Gibson Env: real-world perception for embodied agents”. In: Computer Vision and
Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE. 2018.
[125] Aoran Xiao et al. 3D Semantic Segmentation in the Wild: Learning Generalized Models for
Adverse-Condition Point Clouds. 2023. arXiv: 2304.00690 [cs.CV]. url:
https://arxiv.org/abs/2304.00690.
[126] Zhihui Xie et al. “Pretraining in Deep Reinforcement Learning: A Survey”. In: CoRR abs/2211.03959
(2022). doi: 10.48550/ARXIV.2211.03959. arXiv: 2211.03959.
[127] Junhong Xu et al. “Shared Multi-Task Imitation Learning for Indoor Self-Navigation”. In: IEEE
Global Communications Conference, GLOBECOM 2018, Abu Dhabi, United Arab Emirates, December
9-13, 2018. 2018, pp. 1–7. doi: 10.1109/GLOCOM.2018.8647614.
[128] Huaxiu Yao et al. “Hierarchically Structured Meta-learning”. In: Proceedings of the 36th International
Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Ed. by
Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning
Research. PMLR, 2019, pp. 7045–7054. url: http://proceedings.mlr.press/v97/yao19b.html.
[129] Zhecheng Yuan et al. “Pre-trained image encoder for generalizable visual reinforcement learning”.
In: Advances in Neural Information Processing Systems 35 (2022), pp. 13022–13037.
[130] Huasha Zhao and John F. Canny. “Sparse Allreduce: Efficient Scalable Communication for
Power-Law Data”. In: CoRR abs/1312.3020 (2013). arXiv: 1312.3020. url:
http://arxiv.org/abs/1312.3020.
[131] Lipu Zhou, Zimo Li, and Michael Kaess. “Automatic extrinsic calibration of a camera and a 3d lidar
using line and plane correspondences”. In: 2018 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS). IEEE. 2018, pp. 5562–5569.
[132] Yingtian Zou and Jiashi Feng. “Hierarchical Meta Learning”. In: CoRR abs/1904.09081 (2019). arXiv:
1904.09081. url: http://arxiv.org/abs/1904.09081.
Asset Metadata
Creator Lekkala, Kiran Kumar (author) 
Core Title Pretraining transferable encoders for visual navigation using unlabeled datasets 
Contributor Electronically uploaded by the author (provenance) 
School Andrew and Erna Viterbi School of Engineering 
Degree Doctor of Philosophy 
Degree Program Computer Science 
Degree Conferral Date 2025-05 
Publication Date 02/12/2025 
Defense Date 11/12/2024 
Publisher Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital) 
Tag 3D computer vision,artificial intelligence,deep learning,machine learning,OAI-PMH Harvest,pretraining,robotics 
Format theses (aat) 
Language English
Advisor Itti, Laurent (committee chair), Biyik, Erdem (committee member), Mel, Bartlett (committee member) 
Creator Email kiran4399@gmail.com,klekkala@usc.edu 
Unique identifier UC11399HPNI 
Identifier etd-LekkalaKir-13823.pdf (filename) 
Legacy Identifier etd-LekkalaKir-13823 
Document Type Dissertation 
Rights Lekkala, Kiran Kumar 
Internet Media Type application/pdf 
Type texts
Source 20250227-usctheses-batch-1242 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection) 
Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law.  Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright.  It is the author, as rights holder, who must provide use permission if such use is covered by copyright. 
Repository Name University of Southern California Digital Library
Repository Location USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email uscdl@usc.edu
Abstract Advancing the field of autonomous navigation and multimodal pretraining, this thesis presents a comprehensive study that leverages innovative methodologies for pretraining using large-scale datasets. The central aim is to enable robust, task-agnostic transfer learning and efficient training of autonomous agents across diverse environments and tasks.

We first introduce a novel multimodal pretraining framework for visual navigation, which fuses components of a traditional World Model into a unified system. This system leverages Bird's Eye View (BEV) representations as an intermediate perceptual space to bridge complex First-Person View (FPV) observations and downstream policy learning. Pretrained entirely on unlabeled videos and simulator data, the model demonstrates zero-shot transfer to unseen environments, achieving faster reinforcement learning through its BEV-based embeddings. A state-checking module further enhances robustness by interpolating uncertain or missing observations. Extensive evaluations using a differential drive robot in both simulated (CARLA) and real-world settings underline the effectiveness of this approach, supported by open-source resources for reproducibility.
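
To make the intended usage pattern concrete, the following is a minimal PyTorch-style sketch (module names, dimensions, and the toy backbone are illustrative assumptions, not the thesis implementation): a pretrained FPV-to-BEV encoder is frozen, and its embedding feeds a small downstream policy, so reinforcement learning only optimizes the policy head.

import torch
import torch.nn as nn

class FPVtoBEVEncoder(nn.Module):
    """Stand-in for the pretrained encoder mapping FPV frames to BEV embeddings."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, fpv_image: torch.Tensor) -> torch.Tensor:
        return self.backbone(fpv_image)

class PolicyHead(nn.Module):
    """Lightweight policy trained on top of the frozen BEV embedding."""
    def __init__(self, embed_dim: int = 512, num_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 128), nn.ReLU(),
                                 nn.Linear(128, num_actions))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

encoder = FPVtoBEVEncoder().eval()      # pretrained weights would be loaded here
for p in encoder.parameters():
    p.requires_grad_(False)             # encoder stays frozen during RL
policy = PolicyHead()
obs = torch.randn(1, 3, 128, 128)       # dummy first-person-view frame
action_logits = policy(encoder(obs))    # only the policy head receives gradients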

To generalize representations across tasks, we propose Value Explicit Pretraining (VEP), a task-agnostic framework that learns objective-conditioned encodings invariant to environmental variations. Unlike traditional methods relying on optimal task completions, VEP utilizes sub-optimal play data, learning temporally smooth representations via self-supervised contrastive loss. This enables efficient adaptation to new tasks with shared objectives, evidenced by superior performance on Atari benchmarks and realistic navigation tasks, achieving up to 3× improvement in sample efficiency and task rewards.
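
As an illustration of the self-supervised contrastive objective described above, here is a minimal InfoNCE-style sketch (function and tensor names are hypothetical, and VEP's actual value-based pairing rule is simplified away): an anchor embedding is pulled toward a frame judged to share its distance-to-objective and pushed away from sampled negatives.

import torch
import torch.nn.functional as F

def contrastive_loss(z_anchor, z_positive, z_negatives, temperature=0.1):
    # z_anchor, z_positive: (B, D); z_negatives: (B, K, D).
    # The positive is a frame with a similar distance-to-objective;
    # negatives are other frames drawn from the play data.
    z_anchor = F.normalize(z_anchor, dim=-1)
    z_positive = F.normalize(z_positive, dim=-1)
    z_negatives = F.normalize(z_negatives, dim=-1)
    pos = (z_anchor * z_positive).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    neg = torch.einsum("bd,bkd->bk", z_anchor, z_negatives) / temperature   # (B, K)
    logits = torch.cat([pos, neg], dim=1)           # positive sits at index 0
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

batch, negatives, dim = 8, 16, 128
z = torch.randn(batch, dim, requires_grad=True)     # anchor embeddings
loss = contrastive_loss(z, torch.randn(batch, dim),
                        torch.randn(batch, negatives, dim))
loss.backward()                                     # gradients flow to the encoder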

This work is further supported by the USCILab3D dataset, a large-scale, annotated dataset collected using a mobile robot navigating the USC campus under diverse conditions. Comprising 10M multi-view images and 1.4M semantically annotated point clouds, this dataset enriches multimodal 3D research by providing detailed semantic labels, pose-stamped trajectories, and dense 3D reconstructions. The dataset's high granularity facilitates precise 3D labeling and serves as a foundation for diverse tasks in computer vision, robotics, and machine learning.

Finally, leveraging volumetric rendering techniques, we developed Beogym, a data-driven simulator built using Gaussian Splatting to render realistic navigation environments. By processing USCILab3D data into interconnected Gaussian splat files, Beogym provides seamless transitions across sectors of the environment, enabling first-person view imagery and realistic training scenarios. The simulator supports advanced evaluation of autonomous agents, bridging the gap between real-world data and simulation-based training.
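
A minimal sketch of this sector-streaming idea (sector graph, file paths, and function names are hypothetical; Beogym's actual implementation may differ) keeps the Gaussian-splat files for the current sector and its neighbors resident, so crossing a sector boundary swaps in a splat file that is already loaded.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Sector:
    splat_path: str          # e.g. "sectors/A.splat" -- illustrative path only
    neighbors: tuple

SECTORS = {
    "A": Sector("sectors/A.splat", ("B",)),
    "B": Sector("sectors/B.splat", ("A", "C")),
    "C": Sector("sectors/C.splat", ("B",)),
}

def resident_set(current: str) -> set:
    """Sectors whose splat files should be loaded for seamless crossings."""
    return {current, *SECTORS[current].neighbors}

def step(current: str, crossed_into: Optional[str]) -> str:
    """Swap the active sector when the agent crosses into a preloaded neighbor."""
    if crossed_into in SECTORS[current].neighbors:
        return crossed_into
    return current

pos = "A"
pos = step(pos, "B")                  # neighbor was already resident: seamless
print(pos, resident_set(pos))         # B, with splats for A, B, and C loaded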

In conclusion, this thesis lays the groundwork for scalable, task-agnostic pretraining frameworks, enriched multimodal datasets, and realistic simulators to advance autonomous navigation and reinforcement learning research. 