LARGE-SCALE PATH PLANNING AND MANEUVERING WITH LOCAL INFORMATION
FOR AUTONOMOUS SYSTEMS
by
Chuanhui Hu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(MECHANICAL ENGINEERING)
August 2024
Copyright 2024 Chuanhui Hu
Acknowledgments
I would like to thank my advisor, my committee members, my colleagues, and my family.
Without them, I would never have completed my PhD journey.
I owe deep gratitude to my advisor Dr. Yan Jin, who helped me build my background
knowledge in engineering design and develop my research taste. His expertise in engineering
design and artificial intelligence helped me develop my research thinking and become a
researcher. I would also like to thank Prof. Gupta, Prof. Shiflett, Prof. Bermejo-Moreno, Prof.
Bansal, and Prof. Qian for their comments and suggestions to improve my dissertation. I would
like to acknowledge the contributions of my colleagues, including Xiongqing Liu, Hao Ji, Zijian
Zhang, Hristina Milojevic, Edwin Williams, Yunjian Qiu, Bingling Huang, Xinrui Wang, and
Ardalan Aryshad, who always provide suggestions on my research.
Finally, I want to thank my parents who always support me and encourage me to explore
new possibilities in life. I want to thank Lan Yu for her love and support. Without their help and
encouragement, I would never have overcome all the difficulties through my Ph.D. journey.
Chuanhui Hu
Table of Contents
Acknowledgments........................................................................................................................... ii
List of Tables .................................................................................................................................. v
List of Figures................................................................................................................................ vi
Abstract........................................................................................................................................ viii
Chapter 1. Introduction............................................................................................................. 1
1.1. Background And Motivations............................................................................... 1
1.2. Research Questions............................................................................................... 6
1.3. Research Approach ............................................................................................... 7
1.4. Overview of the Dissertation ................................................................................ 9
Chapter 2. Related Work ........................................................................................................ 10
2.1. Classic Motion Planning Methods ...................................................................... 10
2.2. Planning by Optimization.................................................................................... 14
2.3. Reinforcement Learning...................................................................................... 15
Chapter 3. Long-Range Path Planning for Autonomous Ships in Dynamic Environments
under Uncertainty.............................................................................................................. 18
3.1. Introduction ......................................................................................................... 18
3.2. Problem Description............................................................................................ 20
3.3. Risk-Aware Path Planning .................................................................................. 21
3.3.1. Studied area and data................................................................................ 21
3.3.2. Intention Estimation ................................................................................. 23
3.3.3. Risk Modeling.......................................................................................... 30
3.3.4. Risk-aware A* Algorithm........................................................................ 32
3.4. Case Study........................................................................................................... 34
3.4.1. Environment setup.................................................................................... 35
3.4.2. Results...................................................................................................... 38
3.5. Results and Discussion........................................................................................ 39
3.6. Summary and Findings........................................................................................ 41
Chapter 4. Large-Scale Path Planning in Complex Environments......................................... 44
4.1. Introduction ......................................................................................................... 44
4.2. Problem Description............................................................................................ 46
4.3. Path Planning Based on Focus Genetic Algorithm............................................. 48
4.3.1. Genetic Algorithm.................................................................................... 48
4.3.2. Self-improving Process............................................................................ 55
4.4. Case Study........................................................................................................... 59
4.4.1. Narrow Gap Problem ............................................................................... 59
4.4.2. Compare GA-focus with RRT* ............................................................... 64
4.4.3. Different Tough Environments ................................................................ 67
4.4.4. Multi-objective Planning.......................................................................... 70
4.5. Results and Discussion........................................................................................ 72
4.6. Summary and Findings........................................................................................ 75
Chapter 5. Trajectory Planning by Deep Reinforcement Learning with Knowledge Transfer
and Online Demonstration ................................................................................................ 78
5.1. Introduction ......................................................................................................... 78
5.2. Problem Description............................................................................................ 81
5.3. Dual-level Deep Reinforcement Learning Framework....................................... 82
5.3.1. Model Description.................................................................................... 85
5.3.2. Random Environment Generation............................................................ 86
5.3.3. Reward Function ...................................................................................... 87
5.3.4. Prioritized Experience Replay.................................................................. 91
5.3.5. Curriculum Learning................................................................................ 92
5.3.6. Domain Knowledge Capture and Transfer............................................... 92
5.3.7. Online Demonstrations............................................................................. 94
5.4. Experiments......................................................................................................... 95
5.4.1. Training Setup.......................................................................................... 96
5.4.2. Geometry Knowledge Transfer................................................................ 98
5.4.3. Online Demonstrations........................................................................... 101
5.5. Results and Discussion...................................................................................... 106
5.6. Summary and Findings...................................................................................... 108
Chapter 6. Conclusion .......................................................................................................... 111
6.1. Summary of Dissertation................................................................................... 111
6.2. Contributions..................................................................................................... 113
6.3. Future Work ...................................................................................................... 114
References................................................................................................................................... 116
List of Tables
Table 1 Performance comparison of different planners on the 10 randomly generated roadmaps.
........................................................................................................................................... 36
Table 2 Results of one-tailed paired t-test on the duration of encounters. ................................... 37
Table 3 Model settings for test case 1........................................................................................... 60
Table 4 The results of case study 1. The values are shown in the format 𝑚𝑒𝑎𝑛 ± 𝑠𝑡𝑑............... 61
Table 5 The results of case study 2. The values are shown in the format 𝑚𝑒𝑎𝑛 ± 𝑠𝑡𝑑............... 66
Table 6 The results of case study 3. The values are shown in the format 𝑚𝑒𝑎𝑛 ± 𝑠𝑡𝑑............... 67
Table 7 The results of case study 4. The values are shown in the format 𝑚𝑒𝑎𝑛 ± 𝑠𝑡𝑑............... 71
Table 8 Parameters of the proposed model................................................................................... 97
Table 9 Test results of the models and demonstrators................................................................ 102
List of Figures
Figure 1 Three focused research areas.......................................................................................... 10
Figure 2 (a) San Francisco Bay on a regular nautical chart; (b) The extracted pixel map from
ENC (ENC ID: US3CA14M). The land areas are white, and the water areas are black. . 21
Figure 3 (a) AIS records at 21:00, 07/04/2020 (SOG ≤ 0.5) (b) All ship trajectories recorded on
07/04/2020 ........................................................................................................................ 23
Figure 4 (a) The clusters of all stationary ships shown in different colors; (b) The median of each
cluster; (c) All the possible destinations shown in green squares (cluster medians + water
area on margin of the map)............................................................................................... 24
Figure 5 The path from each node of the tree to the root is the optimal path to the root regarding
the distance and steering cost. The roots are shown in green squares. (a) roadmap built by
PRM; (b) root = (1674, 3409); (c) root = (2788, 4320); (d) root = (1500, 2211)............. 28
Figure 6 Possible paths a ship may take given its initial position and course angle. The line width
represents the weight of the path. (a) pos = (2500, 1800), COG = 60; (b) pos = (2800,
3000), COG = 160; (c) pos = (2500, 2000), COG = 210; (d) pos = (2500, 3500), COG =
180..................................................................................................................................... 29
Figure 7 Risk values calculated by both methods on 10 random roadmap initializations at
different prediction horizons. (a) boxplot of risk values at different prediction horizons,
whisker = [5%, 95%]; (b) mean of risk values at different prediction horizons............... 31
Figure 8 A sample path taken by the proposed risk-aware A* in one of the test cases. The own
ship and the path are shown in green. (a) t = 0; (b) t = 1970; (c) t = 3940; (d) t = 5910. . 34
Figure 9 The paths taken by the planner based on intention estimation and constant velocity
assumption on 10 randomly generated roadmaps. The yellow path is a human-taken path
recorded in AIS data. (a) the 10 paths taken by the planner based on intention estimation
(green); (b) the 10 paths taken by the planner based on constant velocity assumption (red).
........................................................................................................................................... 36
Figure 10 (a) a sample grid map (size: 10,000×10,000) for path planning problems. The areas in
white are obstacles, and the areas in black are free space. The starting point is marked by a
blue square, and the goal point is marked by a red circle; (b) a solution to the problem is
shown in green; (c) the original map resized to 100×100. The topological structure of the
geometry has changed and the optimal solution becomes infeasible on the low-resolution
map.................................................................................................................................... 46
Figure 11 (a) quadtree decomposition of the map shown in Figure 1. The black blocks are free
space, and the grey blocks are obstacle blocks; (b) the collision cost of the line segment is
the sum of areas of the white blocks................................................................................. 50
Figure 12 A sample of crossover operation. The red path and the green path on the left-hand side
are the two parents. The two hollow dots on the right-hand side are the selected crossover
points, and the offspring is the black path. ....................................................................... 53
Figure 13 Five mutation operations used in this study. (a) add waypoint; (b) delete waypoint; (c)
move waypoint; (d) split segment; (e) cut corner. ............................................................ 54
Figure 14 Multiple steps of mutation operations are needed to escape a local optimum. (a) before
mutation; (b) after one mutation operation (delete waypoint); (c) after the second
mutation operation (add waypoint)................................................................................... 56
Figure 15 Split-and-focus operation. The fixed waypoints are shown with the lock icon. .......... 57
Figure 16 The grid map of case study 1 and the solutions by the three algorithms. (a) the starting
point and goal point are at the upper-left and upper-right corner, respectively; (b) the 10
solutions given by GA-linear; (c) the 10 solutions given by GA-area; (d) the 10 solutions
given by GA-focus............................................................................................................ 63
Figure 17 fitness history of the three methods in case study 1. The x-axis is the generation
number and is scaled to [0, 1]. The curve shows the mean±std....................................... 64
Figure 18 The solutions to the two environments. The starting point and goal point are marked
by the blue square and red dot, respectively. (a) 10 solutions by GA-focus on the Box
environment; (b) a sample solution and the generated tree by RRT* on the Box
environment; (c) 10 solutions by GA-focus on the Cross environment; (d) a sample
solution and the generated tree by RRT* on the Cross environment ............................... 66
Figure 19 The bar plot comparison of the path length and time consumption. The error bar shows
the standard deviation. (a) comparison of path length in both environments; (b)
comparison of time consumption in both environments................................................... 66
Figure 20 The solutions to four tough tasks by GA-focus. The starting point and goal point are
marked by the blue square and red dot, respectively. (a) an indoor environment; (b) a
non-square maze environment; (c) an environment with pocket-shaped obstacles and
narrow passages; (d) an environment filled with obstacles of different sizes. ................. 69
Figure 21 The grid map of case study 4 and the solutions by GA-focus. (a) the speed field in the
environment. The red color means that the agent can move faster in these areas; (b) the
10 solutions given by GA-focus. The starting point and goal point are at the upper-left
(blue square) and upper-right (red dot) corners, respectively........................................... 71
Figure 22 Training process of the proposed dual-level decision-making framework. ................. 86
Figure 23(a) A random environment generated by Perlin noise. The white areas are obstacles.
The goal is marked by the red dot, while the agent state is marked by the green polygon;
(b) The observation of the agent. The blue lines show the observation from laser scanning.
The red line marks the direction of the gradient on the Fast-Marching distance map...... 87
Figure 24(a) Path generated by PRM demonstrator. The roadmap is in blue, and the path is in red;
(b) Path generated by FM2 demonstrator. The color map shows the arrival time map of
FM2, and the sampled path is in red................................................................................. 95
Figure 25 Success rate curves of GeoTran and noTran. ............................................................... 99
Figure 26 The visualization of Q-values of three models. Higher values are shown in brighter
colors. (a) Q-values of noTran; (b) Positive Q-values of GeoTran; (c) Positive Q-values
of FM2Step. .................................................................................................................... 100
Figure 27 Success rate curves of models with online demonstration. ........................................ 103
Figure 28 Ratio of demonstration experiences during training. ................................................. 104
Figure 29 Sample test cases of FM2Step. The trajectory is shown with the green curve. The goal
position is marked by the red dot. The blue dots mark the waypoints the agent selected.
......................................................................................................................................... 105
Abstract
Motion planning has been a hot research topic in robotics and is a vital functionality for
autonomous systems. Its primary objective is to discover the most efficient path, accounting for
single or multiple objectives, while also adhering to the dynamic constraints of the agent and
avoiding collisions with obstacles. Given the complexity of addressing this challenge in large-scale and dynamic environments, coupled with the need for real-time control, pipeline planning
methods have gained popularity. These methods typically consist of two main components:
global path planning, which generates a global path from the origin to the destination, and local
trajectory planning, which generates the short-term trajectory in real time. This dissertation
tackles the computational challenges associated with global planning in extensive environments,
as well as the necessity of identifying dynamic-compliant trajectories during local planning. To
reduce the collision probability caused by the uncertain motions of other agents, we developed a
risk-aware A* algorithm to minimize the encountering risk in long-range path planning problems
in dynamic environments by incorporating the intention estimation method. To solve the
efficiency issue with large-scale planning problems, we developed a Focus Genetic Algorithm
(GA-focus) that can solve multi-objective path planning problems in large-scale and complex
environments. By introducing a collision area cost and the self-improving process, our GA-focus
algorithm can overcome the early-convergence problem of evolutionary algorithms and is
guaranteed to converge to a feasible solution. Lastly, we introduced a dual-level deep
reinforcement learning framework that combines high-level policy learning and low-level
integration of a controller and offers an efficient and adaptable solution for navigating through
local environments while considering dynamic constraints. Domain knowledge transfer and
online demonstrations can be integrated during training to significantly improve the agent's
performance in long-range trajectory planning tasks. By enhancing navigation safety, optimizing
efficiency, and increasing adaptability to various planning conditions, these algorithms can
significantly enhance the overall effectiveness and reliability of autonomous systems.
Chapter 1. Introduction
1.1. Background And Motivations
The use of autonomous systems has increased greatly in different fields in recent years
(Badrloo et al., 2022). For example, mobile robots can be applied in labor-intensive industries to
improve effectiveness and efficiency (Quaglia et al., 2020). Another key advantage of
autonomous systems is that they can be applied in hazardous environments to avoid human
exposure to dangers (Luk et al., 2005; Szczurek et al., 2023) or reduce human errors (European
Maritime Safety Agency, 2017). Autonomous robots can perform a wide range of tasks in
complex and unknown environments (De Croon et al., 2022). This requires a robot to sense the
environment and analyze the situation with its artificial intelligence (AI).
Motion planning is a vital task of autonomous systems. A motion planning problem can be
defined as finding an optimal trajectory from a given initial position to a goal position with
respect to single/multiple objectives, while not colliding with any static or dynamic obstacles
during the motion and complying with the agent’s own dynamics constraints. Motion planning is
a fundamental problem for autonomous vehicles, including autonomous cars (Abdallaoui et al.,
2022), unmanned aerial vehicles (UAV) (Jones et al., 2023), unmanned surface vehicles (USV)
(Long et al., 2023), and unmanned underwater vehicles (UUV) (Zhu & Yang, 2022). The
complexity of a motion planning problem increases with the scale and complexity of the
environment, the existence of dynamic objects, the dimension of the configuration space of the
agent, and the constraints of the agent. Although researchers have developed many different path
planning and trajectory planning algorithms that can solve problems of various types and scales,
it remains a challenging problem for autonomous systems.
As deep learning has shown its great power in diverse problems, much research about
motion planning is based on an end-to-end planning method, which converts raw perception data
into control commands (Teng et al., 2023). However, such an approach lacks interpretability,
making it difficult to identify defects when malfunctions or unexpected system behavior occurs.
Thus, the pipeline planning method is still popular for its interpretability (Aradi, 2022;
Grigorescu et al., 2020; Xiao et al., 2022). A modular pipeline usually contains a global planning
module, which finds the optimal path from the origin to the destination, and a local trajectory
planning module, which generates the short-term trajectory and avoids collision in real time.
Since the solution space of the global path planning problem is continuous in both space and
time, in many classic algorithms, the environment is modeled as a computational graph or a
potential field to further constrain the search for the optimal global path and reduce time
complexity. Usually, these algorithms assume that there exists a well-tuned controller to navigate
the agent along the path so that the agent's dynamics constraint can also be relaxed. In small-scale problems, search-based algorithms such as Dijkstra's algorithm and A* (Gasparetto et al., 2015) can efficiently find the optimal path on a graph. However, the complexity of search-based algorithms grows exponentially with the depth of the graph, so they need to be used carefully in
large-scale problems. Another popular approach is sampling-based algorithms, which randomly
draw samples in the configuration spaces and try to connect them and form a graph or a tree.
Typical applications of sampling-based algorithms are Rapidly-exploring Random Trees (RRT)
(Jaillet et al., 2010) and Probabilistic Roadmap (PRM) (Kavraki et al., 1996). Sampling-based
algorithms can react quickly and are widely applied in short-range planning problems. Artificial
Potential Field (APF) (Jayaweera & Hanoun, 2020) is also a popular approach. By constructing
an attractive field from the goal and a repulsive field from the obstacles, the agent will be pulled by the potential field toward the goal while being pushed away from obstacles so that no collision will happen. The quick response capability makes this method suitable for dynamic environments.
However, the resulting path is generally not optimal, and the agent may be trapped in a local minimum in some cases. Some other methods, such as the Fast Marching Method (FMM) (X.
Wang et al., 2021) and the Voronoi diagram (Bhattacharya & Gavrilova, 2007) can also be
applied to path planning problems, but they cannot find the optimal path either.
Since the classic planning algorithms may have problems in either time complexity or
solution quality, researchers are still looking for other approaches to solve the global path
planning problem. Many researchers tried to model the path planning problem as an optimization
problem and use machine learning techniques to solve it. In such formulations, the main
objective is to minimize the path length or the energy consumption, while the constraints can be
the dynamics constraints of the agent and the collision-free requirement in environments with
obstacles. Bio-inspired algorithms, as a category of optimization algorithms inspired by
biological systems, are popular for solving such optimization problems. Due to the flexibility of the
evaluation function, evolutionary algorithms can solve multi-objective problems efficiently.
Many algorithms have shown great performance in solving path planning problems, such as
Genetic Algorithm (GA) (Y. Hu & Yang, 2004; Lamini et al., 2018), Ant Colony Optimization
(ACO) (Duan et al., 2010; Konatowski, 2019), and Particle Swarm Optimization (PSO)
(Masehian & Sedighizadeh, 2010; B. Song et al., 2021). These algorithms generate a population
of potential solutions and improve the performance of the generation by mimicking the natural
selection and evolution process. Algorithms like GA and PSO do not plan on a computational graph, so the quality of the solution is not affected by how the environment is modeled. Although evolutionary algorithms are good at modeling complex problems such as multi-objective optimization, they are sensitive to the solution encoding method and hyperparameter settings and do not generalize well to different environments.
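To make the structure of such evolutionary planners concrete, the following minimal sketch (in Python) encodes a candidate path as a list of waypoints and evolves a population through selection, crossover, and mutation. It is an illustrative example only, not the GA-focus algorithm proposed in Chapter 4; the fitness function, operators, and hyperparameters shown here are placeholder assumptions.

    import random

    def fitness(path, is_collision_free, path_length):
        # Lower is better: path length plus a large penalty for colliding paths.
        return path_length(path) + (0.0 if is_collision_free(path) else 1e6)

    def crossover(parent_a, parent_b):
        # Exchange the tails of two waypoint sequences at a random cut point.
        cut = random.randint(1, min(len(parent_a), len(parent_b)) - 1)
        return parent_a[:cut] + parent_b[cut:]

    def mutate(path, rate=0.2):
        # Randomly perturb intermediate waypoints; the endpoints stay fixed.
        middle = [(x + random.uniform(-5, 5), y + random.uniform(-5, 5))
                  if random.random() < rate else (x, y) for x, y in path[1:-1]]
        return [path[0]] + middle + [path[-1]]

    def evolve(population, is_collision_free, path_length, generations=200):
        for _ in range(generations):
            population.sort(key=lambda p: fitness(p, is_collision_free, path_length))
            survivors = population[:len(population) // 2]      # truncation selection
            children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                        for _ in range(len(population) - len(survivors))]
            population = survivors + children
        return min(population, key=lambda p: fitness(p, is_collision_free, path_length))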
Due to the unpredictability of long-term motion patterns for dynamic obstacles like moving
agents, global plans are typically formulated in static environments. Once a global path is found,
the agent then applies a local planner to find a short-range plan in real time. During this process,
the agent needs to find a continuous trajectory with a local horizon and avoid collision with any
dynamic obstacles. Some global planning algorithms such as A* and RRT can be applied to the
local planning problem with a short horizon by optimizing a heuristic target function. APF, as a
reaction-based algorithm, can also be applied in this phase since it can respond quickly to any
dynamic obstacles. When the motion of the dynamic obstacles is observable and predictable, a
safe local plan can be found using the motion prediction. Velocity obstacle (VO) method (Fiorini
& Shiller, 1998) creates velocity obstacles according to the current positions and velocities of the
agent and dynamic obstacles so that the agent can avoid collision by selecting its velocity
outside the velocity obstacles. Dynamic window approach (DWA) (Fox et al., 1997) is also a
reaction-based algorithm that can do local planning quickly and intuitively. It searches for the
optimal action in the local translational and rotational velocity window to maximize an objective
function so that the agent can move forward on the global path while keeping a distance from
obstacles.
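As an illustration of the dynamic-window idea described above, the sketch below samples a discretized window of translational and rotational velocities, rolls each command forward with a simple unicycle model, and scores the resulting trajectories. The motion model, window limits, clearance threshold, and scoring weights are assumptions made for this example and are not taken from the cited works.

    import math

    def simulate(x, y, theta, v, w, dt=0.1, horizon=2.0):
        # Roll out a constant (v, w) command with a unicycle model.
        traj, t = [], 0.0
        while t < horizon:
            x += v * math.cos(theta) * dt
            y += v * math.sin(theta) * dt
            theta += w * dt
            traj.append((x, y))
            t += dt
        return traj

    def dwa_step(state, goal, obstacles, v_range=(0.0, 1.0), w_range=(-1.0, 1.0)):
        # Search the velocity window and return the best (v, w) command, or None.
        best, best_score = None, -float("inf")
        for i in range(11):
            for j in range(11):
                v = v_range[0] + (v_range[1] - v_range[0]) * i / 10
                w = w_range[0] + (w_range[1] - w_range[0]) * j / 10
                traj = simulate(*state, v, w)
                end = traj[-1]
                clearance = min((math.hypot(ox - px, oy - py)
                                 for (px, py) in traj for (ox, oy) in obstacles), default=10.0)
                if clearance < 0.3:          # discard commands that pass too close to obstacles
                    continue
                score = -math.hypot(goal[0] - end[0], goal[1] - end[1]) + 0.5 * clearance + 0.1 * v
                if score > best_score:
                    best, best_score = (v, w), score
        return best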
Deep learning (DL) has gained attention for its learning capability and excellent
performance. By training a neural network, deep learning can construct a latent feature space and
handle complex motion planning problems by path prediction or path generation. Pinzón-Arenas
et al. (Pinzón-Arenas et al., 2022) trained a convolutional neural network (CNN) to plan the
trajectory of a UR3 robot. The trajectory is modeled as a sequence of waypoints so that the CNN
can solve the regression problem. Zhao et al. (Zhao et al., 2022) modeled the problem as a
trajectory generation problem and used a generative adversarial network (GAN)-based
framework to automatically generate the guidewire path for the surgical tool. Deep learning can
also be combined with classic path planning algorithms. Wang et al. (J. Wang et al., 2020)
proposed a CNN-based model to generate a probability distribution of the optimal path and used
the distribution to guide the sampling process of the RRT* to find the optimal path more
efficiently. However, deep learning methods require large amounts of expert data from human
operators, and diverse operators may make different decisions under the same situations, which
leads to uncertainty during training. Furthermore, deep learning models are trained solely on the
provided dataset, thus they cannot learn from unseen data through a trial-and-error process.
Reinforcement learning, especially deep reinforcement learning (DRL), is another hot topic
for motion planning. DRL uses the framework of the Markov Decision Process (MDP), which
defines a discrete-time process that can be represented as a tuple $(S, A, T, R)$, where $S$ is the state space, $A$ the action space, $T$ the transition model, and $R$ the reward function. DRL trains an agent in an environment,
aiming at finding an optimal policy for the decision maker to maximize the expected
accumulated reward. Many researchers have proved that DRL can be applied to solve complex
motion planning problems with uncertainty from the environment (Elallid et al., 2022; H. Sun et
al., 2021). Once a DRL agent has been trained, it can react quickly in real time in an end-to-end
manner, e.g., observe the current state of the agent and directly output the low-level control
command. However, how to reuse the knowledge learned by the agent in similar problems remains an open question. For example, when different agents are put in the same environment, since the
agent dynamics are different, the low-level policy learned previously may not be transferred to
the new agent effectively with transfer learning, even though the optimal global paths for
different agents are very similar (X. Wang & Jin, 2022). Another shortcoming of DRL-based
planners is their lack of interpretability, which makes it challenging to comprehend the agent's
actions.
Despite the numerous algorithms developed for pipeline motion planning, shortcomings still
exist. Due to the unpredictability of long-term motion patterns for dynamic obstacles like moving
agents, global plans are typically formulated in static environments. Consequently, the agents
depend on local planners to avoid collisions. Overlooking the risk of collision with dynamic
obstacles may lead to a sub-optimal global path and may even place the agent in dangerous
situations. Additionally, global planning may face challenges such as high computational
complexity and suboptimal plans in large-scale and intricate environments. The effectiveness and
quality of global planning in extensive environments are significantly influenced by how the
environment is modeled and the heuristics used for searching. Furthermore, local planning may
yield suboptimal solutions due to its constrained planning horizon. Although DL and DRL can
be applied in motion planning problems, their lack of generalizability and interpretability makes
it difficult to adapt the agent to different types of environments and problems.
1.2. Research Questions
The main purpose of this research is to determine how to solve motion planning problems in large-scale and complex environments. We aim to answer the following research questions:
• What is a proper approach to address the uncertainty from the motion of dynamic
obstacles in long-range planning? How can we model the risks in complex environments
when the future motion of other dynamic agents is unknown?
• What approaches can be utilized to extend global planning algorithms for efficiently
generating high-quality solutions in large-scale and complex environments with
obstacles of diverse shapes and sizes?
• What strategies can be employed to design a variable-horizon local planning framework?
What are the crucial aspects of training a DRL-based local planning model to enhance its
interpretability? How can the training be optimized when guidance from a teacher is
accessible?
1.3. Research Approach
The path planning problem can be formulated as: Given the constraint set $\mathbb{C}$, objective set $\mathbb{P}$, obstacle set $Obs = \{Obs_{static}, Obs_{dynamic}\}$, initial position $p_{init}$, and goal position $p_{goal}$, design a model $p(t) = f(\mathbb{C}, \mathbb{P}, p_{init}, p_{goal})$ to find the optimal path $p(t) = \{x_1(t), x_2(t), \dots, x_n(t)\} \in \mathbb{R}^n$, $t \in [0, T]$ in the n-dimensional planning space that:

$$
\begin{aligned}
\text{minimize} \quad & P := \sum_i \lambda_i \cdot \mathbb{P}_i \\
\text{subject to} \quad & p(0) = p_{init}, \quad p(T) = p_{goal} \\
& \mathbb{C}_{map}: \ p(t) \cap Obs_{static} = \emptyset \\
& \mathbb{C}_{targets}: \ p(t) \cap Obs_{dynamic}(t) = \emptyset \\
& \mathbb{C}_{own}: \ p(t) \text{ is compliant with the agent's own dynamics}
\end{aligned}
\tag{1}
$$
For instance, when addressing the path planning problem for autonomous vehicles, the primary objectives $\mathbb{P}$ are centered around minimizing both the energy $\mathbb{P}_{energy}$ and the time consumption $\mathbb{P}_{time}$ during the journey from the initial position $p_{init}$ to the destination $p_{goal}$. During planning, the agent needs to avoid collisions with the static obstacles $Obs_{static}$ in the environment, such as buildings and trees. Additionally, it must factor in the movements of dynamic obstacles $Obs_{dynamic}(t)$, such as other vehicles. Autonomous vehicles often operate within limitations on their control capabilities, constraining their movements within the configuration space. Consequently, the path they follow must adhere to the agent's dynamic constraints $\mathbb{C}_{own}$.
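To illustrate how a candidate plan can be scored against the formulation in Eq. (1), the following short Python sketch returns an infinite cost whenever any constraint is violated and a weighted sum of the objective terms otherwise. The function names and the representation of the path are placeholder assumptions for illustration, not the concrete models developed in the later chapters.

    def plan_cost(path, objectives, weights, static_collision, dynamic_collision, dynamics_ok):
        # path:       a sequence of sampled states (t, x, y) along the candidate trajectory
        # objectives: functions mapping a path to a scalar cost (e.g., energy, time)
        # weights:    one weight per objective (the lambda_i in Eq. (1))
        if static_collision(path) or dynamic_collision(path) or not dynamics_ok(path):
            return float("inf")   # infeasible: a constraint of Eq. (1) is violated
        return sum(w * obj(path) for w, obj in zip(weights, objectives))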
To solve such a problem, the agent must have the capacity to solve multi-objective
optimization problems. Moreover, it must traverse environments with different scales and
intricacies, demanding both robustness and scalability. Since the future motion of the dynamic
obstacles can be uncertain, the agent needs to either predict their future trajectories or evaluate
collision risks at specific positions and times. Additionally, due to the constraints imposed by
dynamics, the agent must also demonstrate proficiency in identifying and following smooth
paths within the continuous space.
To address these challenges, we focus on three specific issues: (1) employing current
observations for risk evaluation and formulating safer plans under uncertainty; (2) formulating a
planning strategy to address path planning complexities in large-scale and intricate environments;
and (3) constructing a generalizable real-time decision-making framework with unlimited
planning horizon for the trajectory planning problem by DRL. Our contributions are:
• The risk-aware A* algorithm for long-range path planning problems in dynamic
environments, which incorporates intention estimation to address uncertainties arising
from the unpredictable movements of dynamic obstacles.
• The GA-focus algorithm that tackles the challenges arising from the scale and
complexity of environments, which is designed to solve multi-objective planning
problems in large-scale environments featuring obstacles of diverse shapes and sizes.
• A dual-level decision-making framework based on DRL. This framework is designed to
efficiently construct local plans that adhere to the agent's dynamic constraints with an
unlimited planning horizon while relying on local observations.
1.4. Overview of the Dissertation
In the rest of this dissertation, the related work is introduced in Chapter 2. The long-range path planning algorithm based on intention estimation is described in Chapter 3, followed by the details of the large-scale multi-objective path planning algorithm in Chapter 4. The dual-level DRL-based local planning framework is presented in Chapter 5. In Chapter 6, the conclusions are drawn and directions for future work are pointed out.
Chapter 2. Related Work
This research aims to develop a model to solve the path planning problem in dynamic
environments with limited information. The approach integrates theory from three major areas:
Motion Planning, Optimization, and Reinforcement Learning, as shown in Figure 1.
Figure 1 Three focused research areas
2.1. Classic Motion Planning Methods
Motion planning is a fundamental problem in robotics and autonomous vehicles. The aim of
path planning is to find a collision-free trajectory for the agent to move from a starting
configuration to a goal configuration. A path planning problem can involve various objectives
and constraints, such as minimizing energy consumption or restricting the maximum turning
angle during the motion. The complexity of the motion planning problem grows rapidly with the
scale and complexity of the environments, which makes it difficult for the global planning
module to meet the real-time requirement in complex situations. To solve this problem, the
motion planning of autonomous systems can be decomposed into global path planning and local
trajectory planning, where the global planning module generates a feasible path from the origin
to the destination, and the local planning module generates a short-term trajectory or plan in real
time (Teng et al., 2023). This framework enables reliable reasoning about how the system
generates specific behaviors and is popular in the industry.
Usually, a global planning algorithm assumes that there exists a well-tuned controller to
navigate the agent along the path so that the plan can be discretized to simplify the problem. One
category of path planning algorithms is search-based algorithms. Dijkstra’s algorithm, as a
popular graph search algorithm, can be applied to a path planning problem when the
environment is modeled as a connected graph. As shown in Wang’s work (H. Wang et al., 2011),
when the environment can be modeled as a small-scale grid world, the optimal path can be easily
found by Dijkstra’s algorithm. A* algorithm (Russell & Norvig, 2016), as a heuristic-guided
search-based algorithm, can find the optimal solution on the graph more efficiently with a
heuristic distance estimation function and has many applications. Shah and Gupta (Shah & Gupta,
2020) showed that A* can be applied in long-distance problems for USVs with a proper
modeling of the environment. Penin et al. (Penin et al., 2019) used a bi-directional A* algorithm
to solve the optimal control problem of UAVs under uncertainty. Search-based algorithms
such as the D* lite algorithm (Koenig & Likhachev, 2002) can also be applied in dynamic
environments. Jin et al. (Jin et al., 2023) showed that conflict-based search with D* lite can
effectively solve the path planning problem in dynamic environments.
Another popular path planning approach is sampling-based algorithms. By randomly
drawing samples in the configuration spaces and trying to connect them to form a graph, such
algorithms can usually find feasible solutions in a shorter time and have a good real-time
performance. Meng et al. (B. H. Meng et al., 2022) used Rapidly-exploring Random Tree (RRT)
to plan the trajectory of a continuum arm in its workspace. Wang et al. (H. Wang et al., 2022)
proposed an improved RRT* algorithm to solve the path planning for underground intelligent
vehicles. Although many sampling-based algorithms are guaranteed to converge to the global
optimal solution, the convergence speed is not guaranteed. To enhance the efficiency of
sampling-based algorithms, researchers have developed various methods to improve the
sampling strategy. Chen et al. (G. Chen et al., 2021) developed a virtual force to push the
samples inside an obstacle to the boundary so that the algorithm has better performance in
finding narrow passages. Nasir et al. (Nasir et al., 2013) proposed an RRT*-smart algorithm that
can intelligently draw samples around the feasible paths, and thus, the algorithm can more
efficiently converge to the optimal solution. Li et al. (Q. Li et al., 2022) improved the sampling
strategy of classic PRM by sampling on the spatial axis with a maximum deflection angle. These
studies have shown that sampling-based algorithms can be applied to complex environments
with various conditions.
Apart from the algorithms mentioned above, there are some other popular algorithms. Song
et al. (R. Song et al., 2017) proposed a multi-layer Fast Marching Method (FMM) to solve the
path planning problem for USVs in time-variant environments with currents and wind that can
affect a USV’s voyage. Chen et al. (R. Song et al., 2017) combined the Fast Marching Square
(FMS) algorithm and velocity obstacles method to handle the velocity constraint of the ship
under control and the collision risk. Wang et al. (J. Wang & Meng, 2020) showed a method to
initialize the sampling distribution of RRT* by the heuristic solution from the Voronoi diagram
of the map.
When a global plan has been generated, the local trajectory planner is used to find a safe,
comfortable, and continuous trajectory with a local horizon. Although planners such as A* and
RRT can also be applied in short-range planning, reaction-based algorithms, including artificial
potential field (APF), velocity obstacle (VO) method, and dynamic window approach (DWA),
are widely used in the local planning phase for their quick response and intuitive behavior. APF
is a popular local planning algorithm that can quickly react to the environment based on the
artificial potential fields that can pull the agent to the goal while pushing the agent away from
obstacles. Szczepanski et al. (Szczepanski et al., 2022) showed that predictive APF can be
applied in environments with non-convex-shaped obstacles and produce smooth trajectories
while avoiding stagnation. Jayaweera et al. (Jayaweera & Hanoun, 2020) developed a dynamic
APF to deal with the speed variation of dynamic obstacles by an additional potential field
generated by the relative velocity. VO relies on the current positions and velocity of dynamic
obstacles to compute a reachable avoidance velocity space and select a proper velocity of the
own agent to avoid collisions. Xia et al. (Xia et al., 2020) showed that VO is effective in USV
collision avoidance since the velocity and course of USVs do not change much in a short time
interval. Wang et al. (J. Wang et al., 2022) used improved VO to solve the collision avoidance
problem of USVs considering the threat from different types of target ships. DWA is a practical
local planning method that considers the dynamic constraints of the autonomous system. It
searches in the available velocity space and finds the optimal path from all candidates by a
heuristic cost function. Lee et al. (Lee et al., 2021) showed that DWA can be extended to
dynamic environments to realize safe navigation. Kobayashi and Motoi (Kobayashi & Motoi,
2022) combined dynamic window approach and virtual manipulators to generate non-straight
line and non-arc paths in environments with dynamic obstacles.
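To make the reaction-based behavior discussed in this subsection concrete, a minimal artificial-potential-field step is sketched below in Python: the agent descends the gradient of a quadratic attractive potential toward the goal plus an inverse repulsive potential around nearby obstacles. The potential shapes, gains, and influence radius are common textbook choices assumed here for illustration, not parameters from the cited works.

    import math

    def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=100.0, rho0=5.0, step=0.5):
        # Attractive force pulls the agent toward the goal.
        fx = k_att * (goal[0] - pos[0])
        fy = k_att * (goal[1] - pos[1])
        # Repulsive forces push the agent away from obstacles within radius rho0.
        for ox, oy in obstacles:
            dx, dy = pos[0] - ox, pos[1] - oy
            d = math.hypot(dx, dy)
            if 1e-6 < d < rho0:
                mag = k_rep * (1.0 / d - 1.0 / rho0) / (d ** 2)
                fx += mag * dx / d
                fy += mag * dy / d
        norm = math.hypot(fx, fy)
        if norm < 1e-6:
            return pos                       # net force vanishes: a local minimum
        return (pos[0] + step * fx / norm, pos[1] + step * fy / norm)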
2.2. Planning by Optimization
Although classic path planning algorithms have achieved good results in path optimization,
the time complexity of the algorithms usually grows rapidly with the scale of the problem.
Moreover, in some multi-objective or multi-agent planning problems, it can be difficult to model
the problem as relatively simple and sparse structures such as graphs. Many researchers have
tried to model the path planning problem as an optimization problem and use machine learning
methods to solve it.
Bio-inspired algorithms, as a family of optimization algorithms, are popular in path planning
problems for their flexibility in problem modeling and computational efficiency. Such algorithms
usually maintain a population and explore the solution space with mechanisms inspired by
biological systems. Luo et al. (Luo et al., 2020) solved the path planning problem in grid worlds
with Ant Colony Optimization (ACO). Lazarowska (Lazarowska, 2015) demonstrated that ACO
can solve the path planning for ships with dynamics restrictions. Kang et al. (Kang et al., 2018)
solved the collision avoidance problem for ships using Particle Swarm Optimization (PSO). Das
and Jena (Das & Jena, 2020) showed that PSO can also solve multi-agent planning problems.
Guo et al. (Guo et al., 2020) proposed an improved PSO to solve the traveling salesman problem
(TSP). Wang et al. (Y. Wang et al., 2019) used an improved grey wolf optimization algorithm
(GWO) to minimize the energy consumption of USVs. Le et al. (Le et al., 2020) applied a
Genetic Algorithm (GA)-based method to solve the TSP for tetriamond tiling robots. Segota et al.
(Baressi Šegota et al., 2020) used GA to solve the planning problems for six-degree-of-freedom
robotic manipulators. Elhoseny et al. (Elhoseny et al., 2018) modeled the solution with Bezier
curves and used a modified GA to find smooth paths.
Although these bio-inspired algorithms can solve the path planning problem efficiently, they
suffer from premature convergence and local optima. Also, the quality of the initial
population is critical to the final solution. Some recent research has developed hybrid algorithms
that combine different path planning methods, such as GWO+Differential Evolution (X. Yu et al.,
2023), GA+ACO (Shi et al., 2022), Simulated Annealing+PSO (Z. Yu et al., 2022), Firefly
Algorithm+GA (Zhang et al., 2022), ACO+PSO+A* (Sui et al., 2023). These methods take
advantage of the strengths of their component algorithms and show better performance in case studies than the original algorithms.
2.3. Reinforcement Learning
As a sequential decision-making process, motion planning can be effectively represented as
a Markov Decision Process (MDP) and addressed using reinforcement learning (RL) techniques.
By interacting with the environment, the RL agent can gradually acquire knowledge about the
problem and get the optimal policy for the task. RL can be utilized not only for end-to-end
motion planning by generating low-level control commands based on sensor observations, but
also to substitute any subsystem within the pipeline framework and generate behavior-level
decisions, such as generating driving patterns for autonomous vehicles. One of the most popular
and fundamental RL techniques is Q-learning, which is a model-free and off-policy algorithm, meaning that it does not require a model of the environment and can learn from historical data even when the data were not collected under its own exploration strategy.
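For reference, the tabular Q-learning update underlying the studies cited below can be written compactly as in the following sketch. The environment interface (reset, step, and a discrete action list), the learning rate, and the epsilon-greedy schedule are hypothetical placeholders used only to show the off-policy temporal-difference update.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # env is assumed to expose reset() -> state, step(action) -> (next_state, reward, done),
        # and a list of discrete actions env.actions.
        Q = defaultdict(float)                                  # Q[(state, action)]
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:                   # epsilon-greedy exploration
                    action = random.choice(env.actions)
                else:
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                best_next = max(Q[(next_state, a)] for a in env.actions)
                # Off-policy temporal-difference update toward the greedy target.
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q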
Due to these advantages, Q-learning is applied to various motion planning problems. Low et
al. (Low et al., 2023) used an improved Q-learning to solve the path planning problem in a
dynamic grid world. Maoudj and Hentout (Maoudj & Hentout, 2020) also showed that Qlearning can solve the 2-D planning problem efficiently. Hao et al. (Hao et al., 2023) used Qlearning to solve the path planning for unmanned surface vehicles (USVs) and improved the
efficiency by initializing the Q-table with artificial potential field (APF).
However, like classic path planning algorithms, Q-learning also suffers from the curse of dimensionality and cannot generalize its policy to other similar problems. The Deep Q-Network (DQN), which uses a deep neural network to model the Q function of the agent, was thus developed to
solve these problems. Wang and Jin (X. Wang & Jin, 2022) showed that DQN can solve the
collision avoidance problem of USVs. Lv et al. (Lv et al., 2019) proposed a PN-DQN algorithm
and showed a better success rate and learning speed in the path planning problem. Yang et al.
(Yang et al., 2020) solved the multi-robot path planning problem with DQN.
Policy gradient is another popular method that targets modeling and optimizing the policy
directly. It can be used to solve continuous action problems and is popular in dynamic control
problems, which are difficult for DQN because of its discretized action space. Following the Actor-Critic framework, which learns the value function and policy together, much research has been done to solve
the motion planning problem in continuous space. Xie et al. (Xie et al., 2021) trained a deep
deterministic policy gradient (DDPG) model to directly control the linear and rotational velocity
of the agent for robot navigation. Hu et al. (Z. Hu et al., 2021) proposed a REL-DDPG algorithm
to solve the motion planning of UAVs. Wang et al. (S. Wang et al., 2021) used DDPG to control
the formation of multi-USVs. Tai et al. (Tai et al., 2017) trained a DDPG agent to solve the
mapless motion planning task with low-resolution sensors. Li et al. (P. Li et al., 2021) designed
an improved DDPG algorithm to solve the path planning of mobile robots in dynamic
environments. Liu et al. (M. Liu et al., 2021) used DDPG to achieve lane keeping of autonomous
vehicles. Apart from DDPG, many other state-of-the-art models have been applied to this
problem and have achieved success. Xing et al. (Xing et al., 2022) designed an Advantage Actor
Critic (A2C) based local planner for mobile robots. Chen et al. (P. Chen et al., 2022) proposed a
Soft Actor-Critic (SAC) based method to achieve real-time obstacle avoidance for a manipulator.
Hadi et al. (Hadi et al., 2022) proposed a Twin Delayed Deep Deterministic Policy Gradient (TD3) based control system for autonomous underwater vehicles (AUVs) to avoid
obstacles in complex environments. Gao et al. (Gao et al., 2020) presented a path planning
framework that combines the probabilistic roadmap (PRM) global planner and TD3-based local
planner. These methods demonstrate that DRL can be effectively utilized for both high-level
decision-making and low-level control tasks.
Chapter 3. Long-Range Path Planning for Autonomous Ships in
Dynamic Environments under Uncertainty
3.1. Introduction
Long-range motion planning for autonomous systems faces significant challenges, primarily
due to the complexity of predicting the future movements of dynamic obstacles in the
environment, which increases uncertainty. Autonomous ships exemplify these challenges
distinctly due to their specific characteristics. They have low maneuvering capabilities and
necessitate large-scale, long-range planning, making them particularly susceptible to the
unpredictability of their operational environments. Additionally, these environments are typically
congested with various moving obstacles, complicating navigation and strategic planning further.
Thus, studying autonomous ships provides crucial insights into the broader challenges faced by
autonomous systems in dynamic and unpredictable settings.
Maritime transport is the most economical way of transporting goods globally. Over 80% of
international trade is carried out by sea (Sirimanne et al., 2022). Among all types of maritime accidents, ship collisions are the most common (European Maritime Safety Agency, 2017). The
International Maritime Organization (IMO) has established the International Regulations for
Preventing Collisions at Sea (COLREGs) (International Maritime Organization, 1972) to guide
collision avoidance actions. COLREGs have defined three different situations of ship encounters,
which are overtaking, head-on, and crossing. The responsibilities of the ships in an encounter are also
defined in COLREGs, which helps a ship determine whether it is a give-way ship or a stand-on
ship and what action it should take to avoid collisions. However, COLREGs do not provide
specific guidance for collision avoidance, and mariners must judge for themselves what the situation is and what action they should take. In addition, the three situations defined in COLREGs only cover two-ship encounters. This leaves multi-ship encounters as complex and risky situations,
where a ship can be the give-way ship to one ship and be the stand-on ship to another ship at the
same time. The safety of navigation still highly depends on the mariner’s experience and
judgment. Oftentimes, human errors are the main cause of ship collisions (European Maritime
Safety Agency, 2017).
To overcome human error in the collision avoidance of ships, autonomous or unmanned ships have become an important research direction in the shipping industry. In most cases, the state of
a ship can be described on a 2-D plane with its position, velocity, and course (orientation). Many
algorithms have been developed for path planning and collision avoidance in dynamic
environments where multiple other ships are detected. Although these algorithms can make a
short-range plan and avoid collisions in the presence of other dynamic obstacles, most of them assume that all target ships are moving in a constant direction at a constant speed, which may be
violated in long-range cases. Another common assumption is that the encounter of ships happens
in the open area, which is not true in complex environments, such as San Francisco Bay area.
In this dissertation, we propose a path planning algorithm to reduce the risks of the
autonomous ship encountering other ships in long-range motion in complex and dynamic
environments. Our approach does not rely on the constant direction and velocity assumption and
the path planned by our algorithm is based on the estimation of the intentions of other ships. Our
algorithm can help autonomous ships, as well as human mariners, to avoid complex encountering
situations and thus improve the safety of navigation.
3.2. Problem Description
In the field of autonomous ship navigation, planning routes becomes more challenging when
we consider the uncertainty of other ships' movements. While COLREGs provide rules for two-ship situations, they do not offer clear guidance for situations involving multiple ships. This lack of
guidance makes it difficult for autonomous ships to avoid collisions when they encounter
multiple vessels at once. Thus, one possible strategy for autonomous ships to reduce collision
risk is to reduce the risk of encountering other ships during the motion.
Global planning for autonomous ships needs to find a short route that is smooth and
compliant with the ship dynamics. To reduce collision risks, we propose to minimize the risk of
encountering other ships during the motion so that the autonomous ship does not need to face
challenging multi-ship encounter situations. Thus, the planning is modeled as a multi-objective
optimization problem to optimize the path length, path smoothness, and the risk of encountering
other ships.
To reduce the computation, we model the environment with a computational graph and represent the path as a sequence of connected vertices on the graph. The problem can thus be formulated as: Design a model $p = f(\mathbb{C}, \mathbb{P}, p_{init}, p_{goal})$ to find the optimal path $p = \{p_1, p_2, \dots, p_n\}$, $p_i \in \mathbb{R}_x \times \mathbb{R}_y$, that

$$
\begin{aligned}
\text{minimize} \quad & \mathbb{P}_{energy} + \lambda_1 \cdot \mathbb{P}_{steering} + \lambda_2 \cdot \mathbb{P}_{target} \\
\text{subject to} \quad & p_1 = p_{init}, \quad p_n = p_{goal} \\
& \mathbb{C}_{roadmap}: \ \text{given a roadmap } (V, E), \ p_i \in V \text{ and } \overline{p_i p_{i+1}} \in E
\end{aligned}
\tag{2}
$$
where $\mathbb{P}_{target}$ evaluates the risk of encountering other ships and should be minimized.
3.3. Risk-Aware Path Planning
When the future motion of the target ships is unknown, we cannot predict their exact
positions. Instead, our proposed algorithm will estimate the intention of target ships and find
possible paths they will take. A risk model is devised to assess the probability of encountering target ships at a certain position and time in the future, and a risk-aware A* algorithm is
introduced to find a path with a low accumulated risk of encounters. In this research, the phrase
“target ship” refers to all other ships, and the phrase “own ship” refers to the ship under our
control.
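The way the accumulated encounter risk enters the search can be sketched as an A*-style expansion over the roadmap in which every edge contributes its length plus a weighted risk term evaluated at the estimated arrival time. The sketch below is a simplified illustration under assumptions (unit speed, distance-only heuristic, no steering cost, and placeholder graph and risk functions); the actual risk model and algorithm are detailed in Sections 3.3.3 and 3.3.4.

    import heapq
    import itertools
    import math

    def risk_aware_astar(graph, start, goal, risk, pos, lam_risk=10.0):
        # graph: node -> iterable of neighbor nodes on the roadmap
        # risk:  risk(node, time) estimates the probability of encountering a target ship
        # pos:   node -> (x, y), used for edge lengths and the heuristic
        def dist(a, b):
            (ax, ay), (bx, by) = pos[a], pos[b]
            return math.hypot(ax - bx, ay - by)

        tie = itertools.count()                 # tie-breaker so the heap never compares nodes
        open_set = [(dist(start, goal), next(tie), 0.0, 0.0, start, [start])]
        best_g = {start: 0.0}
        while open_set:
            _, _, g, t, node, path = heapq.heappop(open_set)
            if node == goal:
                return path
            for nbr in graph[node]:
                step = dist(node, nbr)
                t_nbr = t + step                # time proxy: distance traveled at unit speed
                g_nbr = g + step + lam_risk * risk(nbr, t_nbr)
                if g_nbr < best_g.get(nbr, float("inf")):
                    best_g[nbr] = g_nbr
                    f = g_nbr + dist(nbr, goal) # heuristic uses distance only
                    heapq.heappush(open_set, (f, next(tie), g_nbr, t_nbr, nbr, path + [nbr]))
        return None                             # no feasible path found on the roadmap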
Figure 2 (a) San Francisco Bay on a regular nautical chart; (b) The extracted pixel map from ENC (ENC ID: US3CA14M). The
land areas are white, and the water areas are black.
3.3.1. Studied area and data
We chose San Francisco Bay as the area of interest, as shown in Figure 2(a). Several ports
are located in this area, and usually, over 100 ships are anchored here. In busy hours, there are
more than 50 ships moving in the Bay, from ocean to port, port to port, port to the ocean, etc.
This makes it hard to predict the motion of target ships since we do not know their intentions.
These ships will not move in straight lines, and they may also change their speed due to the
complex environment and encounter situations.
The nautical chart we use in this research is from the National Oceanic and Atmospheric
Administration (NOAA, https://www.noaa.gov/), which provides nautical charts of the ocean in
different formats. Our work is based on the electronic nautical chart (ENC). We selected the
longitude range from 122.67 W to 122.22 W and the latitude range from 37.54 N to 38.17 N,
extracted the land-area information from the ENC, and constructed a pixel map, as shown in
Figure 2(b). The white areas represent the land, while the black areas are the water. The variation
of the length of a degree of longitude or latitude across the map is neglected due to the small
scale of the area, and the length per degree is estimated at (37.84 N, 122.40 W): 87.81 km per
degree of longitude and 111.19 km per degree of latitude. In this map, each pixel represents a
10 m × 10 m square, which produces a map of size 7004 × 3951 pixels.
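As a minimal sketch of the coordinate conversion implied above, the snippet below maps a longitude/latitude pair to pixel indices using the per-degree lengths estimated at (37.84 N, 122.40 W); the function name, the choice of the south-west corner as the origin, and the rounding are illustrative assumptions.

```python
KM_PER_LON_DEG = 87.81        # km per degree of longitude at the reference point
KM_PER_LAT_DEG = 111.19       # km per degree of latitude at the reference point
LON_WEST, LAT_SOUTH = -122.67, 37.54   # assumed map origin (western longitudes are negative)
PIXEL_M = 10.0                # each pixel covers a 10 m x 10 m square

def lonlat_to_pixel(lon, lat):
    """Convert a (lon, lat) position in degrees to (col, row) pixel indices."""
    col = (lon - LON_WEST) * KM_PER_LON_DEG * 1000.0 / PIXEL_M
    row = (lat - LAT_SOUTH) * KM_PER_LAT_DEG * 1000.0 / PIXEL_M
    return int(round(col)), int(round(row))
```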
Currently, most ships are required to be equipped with the automatic identification system
(AIS), which broadcasts and receives ship information every 2-10 seconds, including
the ship's identity (Maritime Mobile Service Identity, MMSI), position (longitude and latitude),
course (direction of motion, in degrees), and speed (in knots). The AIS system enables ship-to-ship
and ship-to-port communication, providing more information to mariners and helping
improve navigation safety.
There exist many public AIS datasets, and one can also collect his or her own data with an
AIS device. In this research, we use the public data from MarineCadastre
(https://www.marinecadastre.gov/). MarineCadastre has provided daily AIS records since 2018,
which include the MMSI, record time, longitude (LON), latitude (LAT), course over ground
(COG), speed over ground (SOG), and heading. We built our test case with the AIS data recorded
on 07/04/2020.
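The following sketch shows how one day of MarineCadastre AIS records might be loaded with pandas; the file name and the exact column names (e.g., BaseDateTime) are assumptions based on the fields listed above, not the exact script used in this research.

```python
import pandas as pd

# Load one day of AIS records (columns assumed: MMSI, BaseDateTime, LAT, LON, SOG, COG, Heading).
ais = pd.read_csv("AIS_2020_07_04.csv", parse_dates=["BaseDateTime"])

stationary = ais[ais["SOG"] <= 0.5]   # anchored or stopped ships (candidate destinations)
moving = ais[ais["SOG"] > 0.5]        # moving target ships
```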
Figure 3 (a) AIS records at 21:00, 07/04/2020 (SOG ≤ 0.5) (b) All ship trajectories recorded on 07/04/2020
3.3.2. Intention Estimation
Stationary and moving ships: Ships in the studied area include both stationary and moving
ships. Figure 3(a) shows all the latest AIS data of the stationary ships whose SOG is less than 0.5
knots (around 0.26 m/s) recorded at 21:00, 07/04/2020. These ships are regarded as anchored or
stopped and signify possible port or berth locations, which will be used as possible ship
destinations. Their speed, if any, is caused by either wind or water currents. On the other hand,
the trajectories of all the moving ships, also called target ships, are plotted in Figure 3(b). The
ships are not taking random actions or steers during the motion; instead, most of them have a
clear destination, e.g., moving to a port or into the ocean. Therefore, in this research, it is
assumed that each target ship always has its own destination. This makes the estimation of the
intention of target ships possible.
Destination identification: The first step of intention estimation is to choose some positions
as possible destinations. Besides the water areas on the margins of the map, a port or an
anchoring area on the map can also be a destination. Thus, we collect the AIS data of the
stationary ships, recorded at 21:00 with 𝑆𝑂𝐺 ≤ 0.5, and use hierarchical clustering to determine
the possible destinations. The distance threshold of the clustering is set as 400 pixels, and the
median of each cluster is then used as a possible destination. The positions of all the possible
destinations are shown in Figure 4.
Figure 4 (a) The clusters of all stationary ships shown in different colors; (b) The median of each cluster; (c) All the possible
destinations shown in green squares (cluster medians + water area on margin of the map)
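A minimal sketch of this clustering step, assuming scikit-learn's AgglomerativeClustering with a 400-pixel distance threshold; the linkage criterion and the helper name are illustrative, since the text does not specify them.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def find_destinations(stationary_xy, distance_threshold=400.0):
    """Cluster stationary-ship positions (pixel coordinates, shape (n, 2)) and
    return the median of each cluster as a candidate destination."""
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(stationary_xy)
    return np.array([np.median(stationary_xy[labels == c], axis=0)
                     for c in np.unique(labels)])
```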
Intention estimate: To estimate the level of intention toward a certain destination, we need to
compare the current movement with the path leading to that destination. If the current motion of
the ship aligns with the path, there is a high probability that the ship will follow that path in the
future motion. However, each pixel of the water area in the map can be the possible position of a
ship. This requires the algorithm to find the path from each pixel to each destination efficiently.
Our approach uses a modified probabilistic roadmap algorithm (PRM). PRM is a popular
sampling-based path planning algorithm in robotics (Bohlin & Kavraki, 2000; Hsu et al., 2002;
Karaman & Frazzoli, 2011). In our planning problem, we need to find paths from different
positions to the same destination. We initialize the vertex set with the position of all possible
destinations and then do random sampling in the water area to build a roadmap. The vertex
number is set to 15,000, and the max neighbor number of each vertex is set to 16. This roadmap
covers the water area on the map and can be used to generate a path from each vertex to each
destination.
Path cost function: For water transportation, the path length, radius of turning, clearance
height of bridges, and water depth should be considered in the path selection (Petraška et al.,
2017). To simplify the problem, we assume that all ships have the same maneuverability and
share the same cost function of the path. The cost function considers the length of the path and
the radius of turning. We also assume that the effect of wind, waves, and tides can be neglected,
the clearance height of bridges in this area is high enough for all ships to go through, and all
ships have the same draught so that all ships plan the path on the same map.
Due to the limited maneuverability of ships, the optimal path on a graph should consider
both the path length and steering angles on each node of the path. For container ships, the fuel
consumption rate is mainly determined by their sailing speed and grows with speed in the form
of a power function (Q. Meng et al., 2016). Thus, when the path has already been determined,
there is a tradeoff between travel speed and fuel consumption. In this research, we assume that
the own ship is already moving at a speed so that the travel time and fuel consumption are
balanced, and the speed does not change in the future motion. In addition, we assume that the
own ship prefers small steering angles and will try to find a smooth path. Noticing that the
optimal path from a vertex to a destination is also the optimal path from the destination to that
vertex, we can build a tree from the destination vertex and span it to cover the space so that a
path on the tree is the optimal path from the vertex to the root. Such a tree is referred to as the
destination tree in this research. Here we use a modified Dijkstra’s algorithm (L. Sun et al., 2021)
to build the destination tree. Since ships have only limited steering capability, a steering cost is
introduced to build the tree. The path cost is defined in ( 3 ).
$$Cost = Cost_{dist} + \alpha \cdot Cost_{steer} \tag{3}$$

$$Cost_{dist} = \sum_{i,j} dist(v_i, v_j) \tag{4}$$

$$Cost_{steer} = \max\big(Steer(v_i, v_j, v_k)\big) + \beta \sum_{i,j,k} Steer(v_i, v_j, v_k) \tag{5}$$

$$Steer(v_i, v_j, v_k) = \frac{\tan\big(0.5\,(\varphi_{j,k} - \varphi_{i,j})\big)}{\min\big(dist(v_j, v_k),\ dist(v_i, v_j)\big)} \tag{6}$$

$v_i, v_j, v_k \in V$ are vertices on the roadmap, and $(v_j, v_k), (v_i, v_j) \in E$ are edges on the
roadmap. $dist(v_i, v_j)$ is the length of the edge $(v_i, v_j)$ and is calculated as the Euclidean
distance. $\varphi_{i,j}$ is the course angle from $v_i$ to $v_j$. The pseudo-code of the construction of the
destination tree is shown in Algorithm 1.
Algorithm 1 construct_destination_tree
Inputs: vertex_set (V), edge_set (E), root_vertex (d)
1. heap = [[0, None, d]]  # min-heap of [cost, parent, Vertex()]
2. destination_tree = empty
3. visited = empty set
4. while heap is not empty:
5.     cost, parent, vertex = heap.pop()  # pop the entry with the smallest cost
6.     if vertex not in visited:
7.         add vertex to visited
8.         add (vertex, parent) to destination_tree
9.         for neighbor in E[vertex]:
10.            calculate the new_cost of neighbor by (3)
11.            push [new_cost, vertex, neighbor] into heap
12.        end for
13.    end if
14. end while
15. return destination_tree
The $Cost$ consists of the path-length cost and the accumulated steering cost. The steering cost
is defined in (6) and encourages smooth turns. A max function is used in $Cost_{steer}$ to avoid
the situation where a sharp turn is made to reduce future steering costs. In our case, $\alpha = 1000$
and $\beta = 0.1$. Some examples of destination trees are shown in Figure 5.
Figure 5 The path from each node of the tree to the root is the optimal path to the root regarding the distance and steering cost.
The roots are shown in green squares. (a) roadmap built by PRM; (b) root = (1674, 3409); (c) root = (2788, 4320); (d) root =
(1500, 2211)
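To make equations (3)-(6) concrete, the sketch below evaluates the distance-plus-steering cost of a waypoint path with α = 1000 and β = 0.1 as stated above. The angle wrapping and the use of the magnitude of the steering term are our assumptions, since the text does not state how negative course changes are treated.

```python
import math

def steer_cost(vi, vj, vk):
    """Steering cost at vertex vj when traveling vi -> vj -> vk (Eq. 6).
    Vertices are (x, y) tuples in pixel coordinates."""
    phi_ij = math.atan2(vj[1] - vi[1], vj[0] - vi[0])
    phi_jk = math.atan2(vk[1] - vj[1], vk[0] - vj[0])
    # Wrap the course change into (-pi, pi]; take the magnitude (assumption).
    dphi = (phi_jk - phi_ij + math.pi) % (2 * math.pi) - math.pi
    return abs(math.tan(0.5 * dphi)) / min(math.dist(vj, vk), math.dist(vi, vj))

def path_cost(path, alpha=1000.0, beta=0.1):
    """Distance plus steering cost of a waypoint path (Eqs. 3-5)."""
    dist_cost = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    steers = [steer_cost(a, b, c) for a, b, c in zip(path, path[1:], path[2:])]
    steer = (max(steers) + beta * sum(steers)) if steers else 0.0
    return dist_cost + alpha * steer
```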
Intention score: The intention of each ship is then modeled by the level of alignment
between the present motion of the ship and the optimal path to each destination. Due to the
wind/water current on the ocean, a ship may change its heading to counteract the drifting. Thus,
we use the true direction of motion, which is the COG, to evaluate the intention of a ship. A
cosine distance is used to determine the intention weight for each destination. Given the current
position and COG of a ship, the closest vertex on the roadmap is selected as the starting point,
and a path from this vertex to each destination is found on the destination trees. The intention
score is calculated by ( 7 ).
$$IntentionScore = \cos(COG - \varphi_{start,m}) \tag{7}$$

In (7), $COG$ is the course angle of the ship, and $\varphi_{start,m}$ is the course angle from $v_{start}$ to
$v_m$, where $v_m$ is the first vertex along the path such that the length of $Path(v_{start}, v_m)$ is
greater than 100 pixels. The weight of each destination is calculated from the $IntentionScore$ by (8).

$$Weight(d) = \frac{e^{\lambda \cdot IntentionScore(d)}}{\sum_{d'} e^{\lambda \cdot IntentionScore(d')}} \tag{8}$$
Figure 6 Possible paths a ship may take given its initial position and course angle. The line width represents the weight of the
path. (a) pos = (2500, 1800), COG = 60; (b) pos = (2800, 3000), COG = 160; (c) pos = (2500, 2000), COG = 210; (d) pos =
(2500, 3500), COG = 180.
By applying (8), the weights of the destinations sum up to 1. $d$ is the index of possible
destinations, and $\lambda$ is a user-defined coefficient that scales how much weight a high intention
score receives. To determine the value of $\lambda$, an empirical study was carried out in which
different $\lambda$ values were set and the corresponding possible trajectories were observed. By
comparing the trajectories with the AIS data shown in Figure 3(b), the $\lambda$ values that give a high
weight to trajectories not observable in the AIS records were discarded. As a result of the study,
$\lambda = 7$ has been set.
A visualized result of the destination weights and paths is shown in Figure 6. The path to each
possible destination is drawn with a green line, and the line width represents the weight of the
path, which shows the level of intention that the ship will follow that path.
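A minimal sketch of the intention estimation in equations (7)-(8). It assumes each destination tree exposes a path_from(vertex) method returning the waypoint path toward that destination; this interface and the 100-pixel look-ahead handling are illustrative.

```python
import numpy as np

def destination_weights(cog_rad, start_vertex, destination_trees, lam=7.0, lookahead=100.0):
    """Weight of each destination given the ship's course over ground (radians)."""
    scores = []
    for tree in destination_trees:
        path = tree.path_from(start_vertex)          # waypoints toward this destination
        travelled, vm = 0.0, path[-1]
        for a, b in zip(path, path[1:]):             # first vertex > 100 pixels along the path
            travelled += np.hypot(b[0] - a[0], b[1] - a[1])
            if travelled > lookahead:
                vm = b
                break
        phi = np.arctan2(vm[1] - path[0][1], vm[0] - path[0][0])
        scores.append(np.cos(cog_rad - phi))         # IntentionScore, Eq. (7)
    weights = np.exp(lam * np.asarray(scores))
    return weights / weights.sum()                   # softmax over destinations, Eq. (8)
```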
3.3.3. Risk Modeling
After we get all possible paths that a ship may take and the corresponding weights of the
paths, we can assess the risk of the own ship encountering a certain ship at a position in a future
time. The position is estimated by assuming the ship is moving at a constant speed along the path,
and a disk-shaped area centered at that future position will be regarded as a risky area since the
ship may not strictly move along the estimated path and the uncertainty associated with its speed
and direction during the motion needs to be accommodated.
Considering that the average speed of the ship can be different from that in the latest AIS
record, as time goes on, the error of future position estimation will become larger and larger.
Thus, we use a Gaussian model with time-varying variance in the sense that the long-term
estimation will not be as reliable as a short-term estimation. The risk model is described in ( 9 ).
$$Risk(x, y, t) = \sum_{d}\sum_{k} Weight(d) \cdot \frac{1}{\sigma(t)} \cdot e^{-\frac{(x - x_k(t))^2 + (y - y_k(t))^2}{2\,\sigma(t)^2}} \tag{9}$$

$$\sigma(t) = \sigma_0 + \sigma_1 \cdot t \tag{10}$$
In (9), $d$ is the index of possible destinations, and $k$ is the index of target ships. The total
risk value is calculated by summing up the weighted risk values received at position $(x, y)$ from
each target ship. The time-varying spread $\sigma(t)$ of the risk function is defined in (10). The
coefficients in (10) must be selected carefully. On the one hand, when the variance is too large,
almost everywhere will be identified as risky, which is not meaningful; on the other hand, when
the variance is too small, most of the area will be identified as not risky, which does not help
decision making. A good risk model should show better performance in the risk value obtained
by intention estimation than in that obtained under the constant speed assumption. How to select
the coefficients remains an open question. In this research, the coefficients are empirically
selected so that, in long-term prediction, the average risk value received by the intention
estimation at the true future position of the target ships is higher than that by the constant speed
estimation. In our case, $\sigma_0 = 10$ and $\sigma_1 = 0.02$.
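A minimal sketch of the risk field in equations (9)-(10). Each target ship is assumed to be represented by its destination weights (Eq. 8) and a prediction function giving its position on the path toward each destination at look-ahead time t; that interface is an assumption of this sketch.

```python
import numpy as np

def risk(x, y, t, target_ships, sigma0=10.0, sigma1=0.02):
    """Encounter risk received at pixel (x, y) at look-ahead time t (Eqs. 9-10)."""
    sigma = sigma0 + sigma1 * t                   # time-varying spread, Eq. (10)
    total = 0.0
    for weights, predict in target_ships:         # one (weights, predict) pair per target ship
        for w, (xk, yk) in zip(weights, predict(t)):
            sq_dist = (x - xk) ** 2 + (y - yk) ** 2
            total += w / sigma * np.exp(-sq_dist / (2.0 * sigma ** 2))
    return total
```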
Figure 7 Risk values calculated by both methods on 10 random roadmap initializations at different prediction horizons. (a)
boxplot of risk values at different prediction horizons, whisker = [5%, 95%]; (b) mean of risk values at different prediction
horizons.
We calculate the risk values using both the constant velocity method and the proposed
intention estimation method on 10 randomly generated roadmaps. The results are shown in
Figure 7. The performance of the constant velocity solution is slightly better than the solution by
our approach in short-range estimation. This is because the roadmap in our approach is not
smooth, and the target ships cannot change their courses greatly in the short-range prediction.
However, our approach starts to outperform the constant velocity solution when the prediction
horizon is greater than 20 minutes. This tendency is present in all combinations of the values of
the coefficients in ( 10 ) that we have tested, implying that our approach can better estimate the
long-range risk of encountering other ships, and the magnitude of the long-range risk values is
not negligible compared to the short-range value.
3.3.4. Risk-aware A* Algorithm
A modified risk-aware A* algorithm (code available at
https://github.com/hu-chuanhui/RiskAware-Astar) is used in this study to find a path on the
roadmap with a low cost for the ship. The cost function is defined as a weighted sum of the
distance cost, steering cost, and risk cost, as defined in (11).
$$ownPathCost = ownDistCost + \alpha \cdot ownSteerCost + \gamma \cdot ownRiskCost \tag{11}$$

$$ownDistCost = \sum_{i,j} dist(v_i, v_j) \tag{12}$$

$$ownSteerCost = \max\big(ownSteer(v_i, v_j, v_k)\big) + \beta \sum_{i,j,k} ownSteer(v_i, v_j, v_k) \tag{13}$$

$$ownSteer(v_i, v_j, v_k) = \tan\big(0.5\,(\varphi_{j,k} - \varphi_{i,j})\big) \tag{14}$$

$$ownRiskCost = \sum_{i} Risk(x_i, y_i, t_i) \tag{15}$$
Like the definition in (3), $v_i, v_j, v_k \in V$ are vertices on the roadmap, and $(v_j, v_k), (v_i, v_j) \in E$
are edges on the roadmap. The $x_i, y_i$ in (15) are the coordinates of the vertex $v_i$, and $t_i$ is the
arrival time at that vertex. Thus, for the same vertex on the roadmap, the arrival time and the cost
may differ if the parent path is different.
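The arrival-time bookkeeping behind equation (15) can be sketched as follows, assuming the own ship moves at a constant speed along the candidate path; the function name and the constant-speed conversion are illustrative.

```python
import math

def own_risk_cost(path, speed, risk_field, t0=0.0):
    """Accumulated encounter risk along a candidate path (Eq. 15).
    `risk_field(x, y, t)` is the field from Eq. (9); `speed` is in pixels/s."""
    t, total = t0, 0.0
    for (xa, ya), (xb, yb) in zip(path, path[1:]):
        t += math.dist((xa, ya), (xb, yb)) / speed   # arrival time at the next vertex
        total += risk_field(xb, yb, t)
    return total
```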
Notice that when there is no target ship, the risk value will be zero, and the path cost will
only depend on the distance and steering cost. Thus, like how we build the destination tree, we
can build a tree from the goal vertex using the cost function from ( 11 ) with zero risk cost. The
path from each vertex to the goal vertex on the goal tree is the optimal path when there is no
target ship, and the cost received at each vertex on the goal tree will be used as the heuristic
distance in the risk-aware A*. The pseudo-code of the risk-aware A* is shown in Algorithm 2.
Algorithm 2 risk-aware A*
Inputs: vertex_set (V), edge_set (E), start_vertex (s), goal_vertex (g), goal_tree (GT)
1. heap = [[0, 0, None, s]]  # min-heap of [cost_f, cost_g, parent, Vertex()]
2. child_parent_set = empty set
3. visited = empty set
4. while heap is not empty:
5.     cost_f, cost_g, parent, vertex = heap.pop()  # pop the entry with the smallest cost_f
6.     if vertex == g:
7.         add [vertex, parent] to child_parent_set
8.         break
9.     end if
10.    if vertex not in visited:
11.        add vertex to visited
12.        add [vertex, parent] to child_parent_set
13.        for neighbor in E[vertex]:
14.            calculate cost_g of neighbor by (11)
15.            cost_h = cost_from_GT(neighbor, GT)
16.            cost_f = cost_g + cost_h
17.            push [cost_f, cost_g, vertex, neighbor] into heap
18.        end for
19.    end if
20. end while
21. path = find_path_from(child_parent_set, g)
22. return path
The risk-aware A* algorithm is complete in space, in the sense that it is guaranteed to visit every
vertex in the connected graph (V, E). However, each vertex will only be visited once, which
implies that the solution is not complete in the time domain. That is to say, the risk-aware A* will try to find a
smooth path with low risk instead of taking a detour or even circular motions to find a risk-free
path.
3.4. Case Study
Our proposed algorithm is applied to make a long-range plan in complex environments. As a
result of minimizing the risk defined above, the algorithm seeks to reduce the duration of
encounters and the number of ships encountered at the same time, thus reducing the complexity
of the encountering situation, and helping autonomous ships make safe navigation decisions.
Figure 8 A sample path taken by the proposed risk-aware A* in one of the test cases. The own ship and the path are shown in
green. (a) t = 0; (b) t = 1970; (c) t = 3940; (d) t = 5910.
3.4.1. Environment setup
The test data comes from the true AIS records on 7/4/2020. We chose to use the data
recorded from 21:56:57 to the end of the day and selected the ship with ID MMSI 477655900 as
our reference ship, or the own ship mentioned above. This is a 5041 TEU (twenty-foot
equivalent unit) container ship with a length of 294.13 meters.
The position of the reference ship at 21:56:57 is used as the starting point, and the position
recorded at 22:59:56 is used as the goal point.
The SOG recorded at the starting point is 10 knots (around 5.14 m/s). The length of the
recorded path of the reference ship is 29.62 km. The reference ship accelerated during the motion,
and it took the reference ship about an hour to reach the goal point.
In our case study, we remove the AIS records of the reference ship and use all the remaining
AIS records to rebuild the situation. Our proposed algorithm is used to estimate the risk in the
environment, and the risk-aware A* is applied to find a path from the starting point to the goal
point. Considering that many of the ships are stopped, we set a sog_threshold to trigger the
intention estimation. Only those whose SOG is greater than sog_threshold will be considered in
using the intention estimation to assess future risk. Otherwise, the future position of the ship will
be predicted by assuming it is moving at a constant velocity. For example, when sog_threshold is
2 knots, the risk value from those whose SOG is less than 2 knots will be estimated by letting
them move forward at the speed of 2 knots, and the future position and risk of those moving
faster than 2 knots will be estimated by letting them follow the paths leading to the possible
destinations.
The own ship plans a path at the beginning of motion with the latest AIS data received at
that moment and follows the planned path with constant speed without further planning. The
planned path of one test case is shown in Figure 8. The own ship needs to find a safe path to go
through the narrow area, where there are ports on either side, and a lot of ships are moving in this
area. It is inevitable to encounter other ships in such a situation, and our objective is to minimize
the duration of encounters and reduce the number of ships encountered at the same time.
Figure 9 The paths taken by the planner based on intention estimation and constant velocity assumption on 10 randomly
generated roadmaps. The yellow path is a human-taken path recorded in AIS data. (a) the 10 paths taken by the planner based on
intention estimation (green); (b) the 10 paths taken by the planner based on constant velocity assumption (red).
Table 1 Performance comparison of different planners on the 10 randomly generated roadmaps.

                                      Human     Constant velocity planner   Intention-based planner
Weight of RiskCost (γ)                /         10000       20000           10000       20000
SOG threshold (knots)                 /         infinity    infinity        2           2
Mean of path length (km)              29.62     29.45       30.05           29.46       30.19
Std of path length (km)               /         0.94        0.60            0.90        0.61
Mean of sum{steer}                    2.11      4.89        5.34            5.06        5.77
Std of sum{steer}                     /         0.61        0.78            0.78        0.81
Mean of max{steer}                    0.17      0.30        0.38            0.29        0.38
Std of max{steer}                     /         0.05        0.14            0.06        0.08
Mean of duration encounter ≥ 1 (s)    1910.00   2206.90     2054.10         2065.00     1900.20
Std of duration encounter ≥ 1 (s)     /         241.00      118.09          219.19      240.94
Mean of duration encounter ≥ 2 (s)    657.00    728.10      623.40          609.00      475.20
Std of duration encounter ≥ 2 (s)     /         137.53      102.51          154.96      152.33
Mean of duration encounter ≥ 3 (s)    10.00     143.10      158.90          54.70       45.10
Std of duration encounter ≥ 3 (s)     /         83.23       52.79           67.98       64.57
Mean of planning time (s)             /         171.46      246.38          960.77      1344.85
Max of planning time (s)              /         216.38      316.82          1133.80     1527.85
Std of planning time (s)              /         28.15       33.21           99.69       103.32
Table 2 Results of one-tailed paired t-test on the duration of encounters.

Weight of RiskCost (γ)   Encounter duration   t statistic   p value
10000                    encounter ≥ 1        -3.2521       0.0050
10000                    encounter ≥ 2        -2.0303       0.0365
10000                    encounter ≥ 3        -2.4879       0.0173
20000                    encounter ≥ 1        -2.0396       0.0359
20000                    encounter ≥ 2        -3.7173       0.0024
20000                    encounter ≥ 3        -3.8115       0.0021
3.4.2. Results
In this case study, we compare the result of the path plan using intention estimation with the
plan assuming all ships are moving at constant velocity. This comparison is realized by setting
different sog_threshold. When sog_threshold is set to 2 knots, the intention estimation is
activated, and the risk will be analyzed based on the possible path a ship may take. When
sog_threshold is set to infinity, the future position will be predicted purely with the constant
velocity assumption.
The effect of different weights γ of the risk cost is also shown in the results of the case study. By
setting different values, the level of risk tolerance can be modified to decide whether the own
ship will take a shortcut with a higher risk of encounters. Since the roadmap construction
depends on random sampling, we generate 10 roadmaps and test the performance of both
planners on them. The result of the case study is shown in Table 1, and the human performance
is also listed in Table 1. The term steer in the table is the steer cost at each node along the path as
defined in ( 14 ).
One weakness of both methods is the steering cost. The steering cost of the human path is
much smaller than that found on the roadmap by both algorithms. The reason is that the human
path is reconstructed from each piece of the AIS records and has a much smaller step size than
the edge length on the roadmap, which makes the human path smoother.
The performance is evaluated by the path length and the duration of encounters. In this case,
if a ship appears within 0.5 nautical miles (0.926 km) of the own ship, it is regarded as
encountered. Considering that encountering multiple ships at the same time is more complex, we
show the duration of encounters by the number of ships encountered. The durations of
encountering at least 1 ship, at least 2 ships, and at least 3 ships are shown in Table 1. We can
see that the mean path lengths of the two planners are similar, but the encounter duration of the
one based on intention estimation is much shorter than the one based on the constant velocity
assumption. The paths taken by the two planners are plotted in Figure 9. Although the paths by
the two planners look similar, Table 1 shows that the performance of our proposed algorithm is
much better, which proves that the intention estimation can make a better assessment of future
risk. A one-tailed paired t-test is conducted on the encounter duration as shown in Table 2. The
result also supports that our proposed method has a better performance in the encounter duration.
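The comparison in Table 2 can be reproduced with a paired, one-sided t-test such as the SciPy sketch below, given the per-roadmap encounter durations of the two planners; this is an illustrative call (the alternative argument requires SciPy ≥ 1.6), not the exact evaluation script.

```python
from scipy import stats

def encounter_duration_test(intention_durations, const_velocity_durations):
    """One-tailed paired t-test: alternative hypothesis is that the
    intention-based planner yields shorter encounter durations."""
    return stats.ttest_rel(intention_durations, const_velocity_durations,
                           alternative="less")
```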
3.5. Results and Discussion
As shown in Table 1, our proposed algorithm for autonomous ships outperforms the planner
that assumes all ships are moving at a constant speed and direction. The average lengths and
average steering costs of the paths planned by the two planners are similar, but the duration of
encounters by our proposed algorithm is much shorter. Our algorithm can better assess the risk of
encountering other ships in the long term. It takes the own ship around 100 minutes to reach the
goal, and the constant velocity and course assumption no longer holds over such a long time horizon in complex environments. Our algorithm utilizes the information gained from the
map and the AIS data and finds possible destinations for each ship. Since the optimal path from
the goal to a starting point is also the optimal path from that starting point to the goal, it is
possible to construct a goal tree for each destination that spans the water area and provides a
near-optimal path from each point in the water area to the destination. By this approach, the
pathfinding task for each ship to each destination can be reduced to the problem of finding the
closest vertex on the destination tree and retrieving the path from the tree.
This makes it possible to include the risk assessment in the risk-aware A* algorithm. Since
the construction of the roadmap and the destination trees just need to be done once, the time
complexity of the risk calculation at each vertex is $O(N \cdot D \cdot \log p)$, where $N$ is the number of
ships, $D$ is the number of possible destinations, and $p$ is the number of vertices on the path. The
overall complexity of the risk-aware A* algorithm is $O\big((|V| + |E|) \cdot N \cdot D \cdot \log p\big)$, where
$|V|$ and $|E|$ are the numbers of vertices and edges on the roadmap, respectively. From Table 1,
one can find that the variances of the duration of encountering at least three ships are very high.
This is because the current planner just plans the path at the beginning of motion and then
follows the path without any replan. In such a case, the randomness in the roadmap generation
can affect the quality of the path. A periodic replan may relieve this problem. However, the
planning time of the proposed method is over 16 minutes, which makes it hard to implement the
algorithm in real time.
The path planned by humans shows better performance in the duration of encountering at
least 3 ships. This result is reasonable since the human keeps receiving real-time information and
has more flexibility in the control of the ship. The human can change the speed of the ship to
drive through risky areas quickly, and the real-time information also helps the human to modify
the plan according to the situation.
Currently, this algorithm does not satisfy the real-time requirement. It takes over 16 minutes
for a gaming laptop to generate a plan in such a complex environment in the test case. The code
is written in Python, and the calculation is done by an AMD Ryzen 7 5800H CPU with 3.20 GHz
speed and 16 GB RAM. During the planning, the risk value at each vertex is the sum of the risk
value of each target ship, and the risk value of each target ship is calculated in parallel by 16
threads. The complexity of the proposed algorithm grows linearly with the number of ships. In
the test case, there are more than 200 ships in this area, and over 50 of them are moving. Future
studies on intention modeling and risk assessment are needed to improve the efficiency of the
algorithm.
3.6. Summary and Findings
In this study, we introduced a risk-aware path planning algorithm for autonomous ships
navigating in complex and dynamic environments. Hierarchical clustering is applied to
determine all possible destinations of other ships, and a pathfinding algorithm based on a
probabilistic roadmap algorithm and modified Dijkstra’s algorithm is developed to find the
possible paths a target ship may take. This approach does not rely on the constant velocity
assumption, which is common in similar research.
A time-varying Gaussian model is used to assess the risk of encountering a ship at a future
time. Our proposed algorithm will find a path with a low risk of encountering other ships. The
risk-aware A* algorithm can be used to reduce the duration of encounters and reduce the number
of ships encountered at the same time. This helps lower the complexity of encounter situations
and can help autonomous ships make safe navigation.
The proposed algorithm can make a long-range plan in complex and dynamic environments.
In the test case, it takes the own ship about 100 minutes to reach the goal, and the risk of
encountering other ships can be properly assessed by our algorithm. Our algorithm yields a much
shorter duration of encounters than the same planner with the constant velocity assumption. This
implies that the constant velocity assumption does not hold in such a complex situation, while
our algorithm can handle the situation well. The average encounter duration of the paths planned
by our algorithm is comparable to that of humans.
However, the human path has a longer duration of encountering at least two ships and a
shorter duration of encountering at least three ships. This implies that the human is sacrificing
the overall duration of encounters to avoid the complex situation of encountering many ships at a
time. Thus, an in-depth study of how various parameters in the current model interact with each
other and consequently impact the path planning performance is needed, and detailed modeling
of different encountering situations will be warranted. The current approach assumes that all the
ships have the same maneuverability and physics constraints. How to model the intention for
different ship types and sizes through a data-driven approach is another future direction.
Furthermore, the efficiency of the risk assessment algorithm needs improvement. The current
approach takes all target ships into consideration, while many of them may not be threatening to
the own ship. A better risk assessment process will be developed in our future research for
developing a data-driven and highly intelligent path planning and collision avoidance framework
for autonomous ships.
Based on the experiment results and ensuing discussions, the following conclusions can be
drawn:
• Intention estimation can improve risk assessment, especially at longer prediction horizons;
in our experiments, the estimated risk values remain effective within a 30-minute horizon.
• Intention estimation is effective in the long term, but the effectiveness drops as the
prediction horizon increases. The encounter risk is modeled as a time-variant Gaussian
distribution whose spread grows with the length of the prediction horizon. Thus, the long-range
risk estimation does not provide as much information as the short-range estimation does.
• Intention estimation based on the graph method is not as accurate as straight-line motion
estimation in the short term. Due to the graph representation, the trajectories are not smooth,
which results in a lower prediction accuracy in the short range, where the motion of ships is
close to straight lines. A continuous solution space is required to improve the accuracy of
motion prediction.
• The time complexity scales up with the number of target ships and the scale of the
environment. Since the search-based algorithm also scales exponentially with the depth
of the graph, a more efficient algorithm is required to solve the path planning problem in
large-scale environments.
Chapter 4. Large-Scale Path Planning in Complex Environments
4.1. Introduction
Autonomous systems are required to plan their motion within continuous and sometimes
very large-scale environments. This task is complicated by the complexity of these settings,
which are typically filled with obstacles of diverse shapes and sizes. A typical path planning
problem is to find a collision-free trajectory from an initial position to a target position in some
environments with obstacles, as shown in Figure 10. For different problem settings, additional
objectives (e.g., minimizing energy consumption) or constraints (e.g., restricting the turning
angle along the trajectory) can be included (Patle et al., 2019). The plan can also be made in
large-scale environments such as global navigation for ships (Shah & Gupta, 2020) or in high-dimensional spaces such as motion plans for multi-joint robot arms (G. Chen et al., 2021). These
make path planning a challenging task.
Researchers have developed various methods to solve the path planning problem. Search-based methods such as the A* algorithm are robust to different types of environments and can
always find the global optimal solution (Russell & Norvig, 2016). However, such methods need
to make plans on predefined graphs. Since environment modeling can be time-consuming, these
algorithms have poor efficiencies in large-scale problems (Shah & Gupta, 2020). Sampling-based algorithms such as probabilistic roadmap (PRM) (G. Chen et al., 2021), rapidly exploring
random trees (RRT) (Nasir et al., 2013), and their variants draw random samples in the space and
try to connect the sampled points to form a feasible path. Due to the nature of random sampling,
the convergence speed is not guaranteed, and it is difficult for these methods to find narrow
passages on the map. The artificial potential field (APF) can find feasible paths in environments
that only contain convex obstacles but may easily get stuck in local minima in complex
environments (Fan et al., 2020). The fast-marching method (Beser & Yildirim, 2018) and Voronoi
diagram-based methods (Bhattacharya & Gavrilova, 2008) are able to find a safe path that keeps
a distance from obstacles, while the length of the path found is not optimized. Furthermore, it is
difficult to model objectives such as environmental disturbances or time-varying risks using the
abovementioned methods.
To plan the path under complex conditions, many researchers tried to model the task as an
optimization problem and use machine learning techniques to solve it. Many studies have been
conducted to solve path planning problems with deep learning and deep reinforcement learning
(Pan et al., 2021; Panov et al., 2018; X. Wang & Jin, 2022). However, the quality of the model
highly relies on the quality of the training data and scenarios. Using evolutionary algorithms to
do path planning is another active research area (Das & Jena, 2020; Lazarowska, 2015; Roberge et
al., 2013; Y. Wang et al., 2019). Thanks to the flexibility of the evaluation function, evolutionary
algorithms are well-suited for modeling multi-objective or multi-robot path planning problems.
Nevertheless, when the scale of the problems grows, evolutionary algorithms suffer from early
convergence, and traditional exploration and evaluation methods may fail to find feasible
solutions.
Most of the current research on evolutionary-based path planning algorithms is conducted
on small grid maps, typically with a size smaller than 100×100. However, in complex
environments filled with obstacles of various shapes and sizes, down-sampling the original map
to a low-resolution version may destroy the topological structure of the geometry. As shown in
Figure 10(c), the optimal solution on the original high-resolution map becomes infeasible on the
down-sampled map. In extreme cases, there may be no feasible solution on the down-sampled
map, resulting in failure.
In this dissertation, we propose a path planning approach based on Genetic Algorithm (GA)
to solve large-scale path planning problems in complex environments. A novel collision
assessment method is introduced to guide the algorithm to find a feasible path. To overcome the
premature convergence problem of the Genetic Algorithm, we introduce a self-improving mechanism to ensure
that the algorithm pays attention to any potential solution. The proposed algorithm can produce
solutions with a variable number of waypoints to accommodate different complexities of the
problems and can find high-quality solutions in large-scale planning problems with the size of
the grid map up to 10,000×10,000.
Figure 10 (a) a sample grid map (size: 10,000×10,000) for path planning problems. The areas in white are obstacles, and the
areas in black are free space. The starting point is marked by a blue square, and the goal point is marked by a red circle; (b) a
solution to the problem is shown in green; (c) the original map resized to 100×100. The topological structure of the geometry
has changed, and the optimal solution becomes infeasible on the low-resolution map.
4.2. Problem Description
In large-scale path planning problems for autonomous systems in 2-D space, a continuous
path can be approximated by a sequence of waypoints. Given a waypoint set $\mathbb{W} = \{(x, y)\} \subseteq
\mathbb{R}_x \times \mathbb{R}_y$, the size of the solution space of a path defined by $n$ waypoints is $|\mathbb{W}|^n$, which grows
polynomially with the size of the waypoint set and exponentially with the number of waypoints.
Moreover, due to the different levels of complexity of the environment, a variable number of
waypoints is required to represent the solution.
One of the constraints of the path planning problem is to avoid collision. While modeled as
an optimization problem, this collision-free constraint can be represented by a large penalty,
which is an upper bound of other objective scores in the evaluation function. In this work, we
aim to solve the large-scale path planning problem in this solution space with a variable number
of waypoints. The problem can be formulated as:
Design a model $p(t) = f(\mathbb{C}, \mathbb{P}, p_{init}, p_{goal})$ to find the optimal path $path = \{p_1, p_2, \dots, p_n\}$, $p_i \in \mathbb{R}_x \times \mathbb{R}_y$, with respect to an objective function $fun(\cdot) = \alpha \cdot \mathbb{P}_{collision} + \sum_i \beta_i \cdot \mathbb{P}_i$ such that

$$
\begin{aligned}
\text{minimize} \quad & fun(path) \\
\text{subject to} \quad & p_1 = p_{init}, \quad p_n = p_{goal}
\end{aligned}
\tag{16}
$$

where $\mathbb{P}_{collision}$ is the collision cost and is guaranteed to converge to zero during the
optimization, and the $\mathbb{P}_i$'s are other objective functions such as energy consumption, sharp turns,
and time consumption.
A path in 2-D space is represented by a sequence of waypoints:

$$P = \{p_s, p_1, p_2, \dots, p_n, p_g\} \tag{17}$$

where $p_s = (x_s, y_s)$ and $p_g = (x_g, y_g)$ are the coordinates of the starting position and the goal
position, and $p_i = (x_i, y_i),\ i \in [1, n] \cap \mathbb{Z}$, are the coordinates of the intermediate waypoints. The
resulting path is represented by the line segments defined by the waypoints in $P$:

$$path = \overline{p_s p_1} \cup \overline{p_1 p_2} \cup \dots \cup \overline{p_i p_{i+1}} \cup \dots \cup \overline{p_n p_g}, \quad i \in [1, n-1] \cap \mathbb{Z} \tag{18}$$

where $\overline{p_i p_{i+1}}$ is the line segment defined by $p_i$ and $p_{i+1}$.
4.3. Path Planning Based on Focus Genetic Algorithm
4.3.1. Genetic Algorithm
Genetic Algorithm (GA) is an evolutionary algorithm inspired by the natural evolution
process of species. Each individual in GA's population is called a chromosome, and each element
in a chromosome is called a gene. A gene encodes the value of a parameter, and thus a
chromosome defines the parameter set of the solution. GA mimics the natural selection process
in that in each generation, each chromosome will be evaluated by a fitness function and will be
selected by a selection function. Only the best among them can survive and exchange their genes
with other survivors by crossover and thus produce offspring. Mutations may happen to the
offspring, which will randomly switch their genes to produce new features. From generation to
generation, the good features can be preserved, and the whole population will finally converge to
an optimal solution.
4.3.1.1. Solution encoding
The algorithm aims to find the best sequence of waypoints with respect to the fitness
function. Since the waypoint representation is an approximation of the trajectory, and the optimal
number of waypoints could differ for different problems, the proposed algorithm uses a
variable-length chromosome encoding, as shown below.

$$individual_j = \{p_{j,1}, p_{j,2}, \dots, p_{j,n}\} \tag{19}$$

where $n$ is the number of intermediate waypoints, which can vary during the optimization,
and individuals can have different numbers of waypoints.
4.3.1.2. Initial population
Since the algorithm can deal with variable-length chromosomes, we initialize the whole
population as one-waypoint solutions. In the initialization stage, each individual in the
population is assigned a randomly sampled waypoint:
$$individual_{j,init} = \{p_{j,init}\} \tag{20}$$
and the algorithm will determine whether it needs to add or delete waypoints by itself during the
processing.
Considering that the quality of the initial population can greatly affect the final solution, and
random initialization might be biased when the population size is small, we use an adapted
version of the initialization method in KMeans++ (Arthur et al., 2007) to ensure that the initial
population is spread out in the environment. The individuals are initialized one by one, and each
individual is initialized by finding the point that is farthest from the already initialized population.
The initialization algorithm is shown in Algorithm 3.
Algorithm 3 initialize_population
Inputs: map (M), population_size (n)
1. all_samples = randomly sample 10n free points on M
2. initial_population = []
3. randomly pick a point in all_samples and add it to initial_population
4. while len(initial_population) < n
5.     next_point = None
6.     next_point_dist = 0
7.     for point in all_samples
8.         dist_to_pop = min( [distance(point, init_point) for init_point in initial_population] )
9.         if point not in initial_population and dist_to_pop > next_point_dist
10.            next_point = point
11.            next_point_dist = dist_to_pop
12.        end if
13.    end for
14.    add next_point to initial_population
15. end while
16. return initial_population
Figure 11 (a) quadtree decomposition of the map shown in Figure 10. The black blocks are free space, and the grey blocks are
obstacle blocks; (b) the collision cost of the line segment is the sum of the areas of the white blocks.
4.3.1.3. Fitness function
A common path planning problem usually concerns the length and safety of the path. Some
studies also include the smoothness of the path. Since the shortest path should also be a smooth
path when the obstacles in the environment do not have sharp corners, we decided to use only the
path length as the fundamental objective. In safety modeling, hitting an obstacle should be
penalized. The final fitness function is defined as follows:
$$fitness = \frac{1}{\alpha \cdot cost_{dist} + \beta \cdot cost_{collision}} \tag{21}$$

$$cost_{dist} = dist(p_s, p_1) + \sum_{i=1}^{n-1} dist(p_i, p_{i+1}) + dist(p_n, p_g) \tag{22}$$
There are two popular types of collision penalization functions: constant cost and linear cost.
The constant cost method gives a constant or infinite penalty when a collision happens, while the
linear cost method uses the length of segments in the collision as the penalty. Both methods work
in small-scale environments with a small number of obstacles since random exploration is
powerful enough to find feasible solutions. However, in large-scale problems, these methods can
easily get stuck in infeasible solutions because the collision penalization function does not
provide enough information about the environment, and it is too difficult to find a feasible
solution with random exploration.
In this study, we propose a novel collision cost by calculating the area of obstacles cut by
the segments. Each part in a collision of each segment contributes to the collision cost, and the
cost value is the area of the smaller part of the obstacle being cut by the segment. Since the shape
of obstacles might be irregular, and the area calculation can be time-consuming, we decompose
the original grid map into a quadtree map for area calculation and approximate the area by
calculating the area of blocks in the quadtree.
$$cost_{collision} = \sum area_{collision} \tag{23}$$
As shown in Figure 11, the original 10,000×10,000 grid map is decomposed to a quadtree,
where all the black blocks are free blocks, and the grey blocks are obstacle blocks. The collision
cost of the line segment is the sum of the areas shown in white. Since the formula of the line
segments and blocks is known, the areas can be calculated analytically. When a line segment
only penetrates one obstacle, the obstacle is cut into two parts, and the smaller one is used to
calculate the collision cost. When the line segment moves, the collision cost changes
continuously until the segment no longer cuts the obstacle, at which point the collision cost becomes 0.
This continuity ensures that a collision-free path can always be found by small steps of motion
following the decrement of collision cost.
All cuts by a segment are independently used to calculate the collision cost so that adding
new waypoints on segments does not affect the value of the collision cost. To prevent the
algorithm from trying to find a path outside the boundary of the map, the area of the obstacle
blocks on the boundary of the map is set to a large value so that the smaller part of the obstacle
will always be on the opposite side of the map boundary.
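A minimal sketch of the fitness evaluation in equations (21)-(23), with α = 1 and β = 100 as used by GA-area in Table 3. The quadtree helper collision_area is a placeholder for the area computation described above; its name and interface are assumptions.

```python
import math

def fitness(waypoints, start, goal, quadtree, alpha=1.0, beta=100.0):
    """Fitness of a candidate path (Eq. 21).  quadtree.collision_area(p, q) is
    assumed to return the area of the smaller obstacle part cut off by the
    segment p-q, or 0 if the segment is collision-free."""
    path = [start, *waypoints, goal]
    cost_dist = sum(math.dist(p, q) for p, q in zip(path, path[1:]))        # Eq. (22)
    cost_collision = sum(quadtree.collision_area(p, q)                      # Eq. (23)
                         for p, q in zip(path, path[1:]))
    return 1.0 / (alpha * cost_dist + beta * cost_collision)
```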
4.3.1.4. Crossover operation
The crossover operation plays the role of communication among individuals. We use a
single-point crossover method here. Given two individuals as the parents, the crossover point of
𝑝𝑎𝑟𝑒𝑛𝑡1 is randomly selected, and thus 𝑝𝑎𝑟𝑒𝑛𝑡1 is split into two parts. The crossover point of
𝑝𝑎𝑟𝑒𝑛𝑡2 is determined by the proportion of the length of the first part of 𝑝𝑎𝑟𝑒𝑛𝑡1. For example,
the first part of 𝑝𝑎𝑟𝑒𝑛𝑡1 accounts for 68% of its length, and thus the crossover point of 𝑝𝑎𝑟𝑒𝑛𝑡2
will be the first waypoint that exceeds 68% of its length. An illustration of the crossover
operation is shown in Figure 12. The idea of a crossover is to combine the first part of 𝑝𝑎𝑟𝑒𝑛𝑡1
and the second part of 𝑝𝑎𝑟𝑒𝑛𝑡2, so that the properties of parents can be preserved, and the
algorithm can find a possible new path and escape from the current solution. By selecting parents
and doing a crossover, a set of offspring is produced, which contains the combinations of the
current best solutions. To keep the good property in the parents, the parents are also duplicated to
the new generation.
Figure 12 A sample of crossover operation. The red path and the green path on the left-hand side are the two parents. The two
hollow dots on the right-hand side are the selected crossover points, and the offspring is the black path.
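A minimal sketch of the single-point, length-proportional crossover described above; parents are lists of intermediate waypoints, and the handling of empty parents is an assumption of this sketch.

```python
import math
import random

def cumulative_lengths(path):
    """Cumulative path length at every waypoint of `path` (list of (x, y))."""
    cum = [0.0]
    for p, q in zip(path, path[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    return cum

def crossover(parent1, parent2, start, goal):
    """Join the first part of parent1 with the tail of parent2, cutting both
    parents at (approximately) the same fraction of their total lengths."""
    if not parent1 or not parent2:
        return list(parent1 or parent2)
    full1, full2 = [start, *parent1, goal], [start, *parent2, goal]
    cut1 = random.randint(1, len(full1) - 2)                 # crossover waypoint of parent1
    cum1, cum2 = cumulative_lengths(full1), cumulative_lengths(full2)
    fraction = cum1[cut1] / cum1[-1]
    # First waypoint of parent2 whose cumulative length exceeds that fraction.
    cut2 = next((i for i, c in enumerate(cum2) if c / cum2[-1] > fraction),
                len(full2) - 2)
    return full1[1:cut1 + 1] + full2[cut2:-1]                # child's intermediate waypoints
```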
4.3.1.5. Mutation operation
The mutation operation ensures that the solutions are not restricted to the waypoints that
have already appeared. It randomly happens on the offspring and adds, deletes, or moves the
waypoints so that new waypoints and line segments can be created and evaluated, and good ones
can be kept by the selection process. In this study, we use five mutation operators, which are
common in path planning problems. These operators have the same probability of being selected
in a mutation round.
• Add waypoint: add a waypoint to the individual, as shown in Figure 13(a). Notice that if a
segment is not in collision, adding a new waypoint between its vertices will always harm the
total length of the path. Thus, the collision costs of the segments are used as the sampling
weights so that only segments in collision will be considered for adding a waypoint. The
position of the new waypoint is randomly selected in the local area.
• Delete waypoint: delete a waypoint of the individual, as shown in Figure 13(b). It uses
$\cos^2(\theta/2)$ as the weight of random selection. Intuitively, this operator will delete sharp turns
in the solution (a sketch of this weighting is given after Figure 13).
• Move waypoint: randomly move a waypoint, as shown in Figure 13(c).
• Split segment: randomly split a segment, as shown in Figure 13(d). It uses the length of
segments as the weight of random selection. Thus, longer segments have a higher probability
of being split. The splitting point is randomly chosen. A short random motion is applied to
the new waypoint so that the fitness value of the new path differs from the original one.
• Cut corner: randomly cut a corner, as shown in Figure 13(e). It uses $\cos^2(\theta/2)$ as the weight of
corner selection. The cutting points are randomly chosen. This operator tends to cut sharp
corners and smooth the path. If the new segment is not in collision, the length of the whole
path will also be improved.
Figure 13 Five mutation operations used in this study. (a) add waypoint; (b) delete waypoint; (c) move waypoint; (d) split
segment; (e) cut corner.
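As an example of how the operator weights above can be realized, the sketch below implements the delete-waypoint mutation with cos²(θ/2) selection weights; treating θ as the interior angle at the waypoint is our assumption.

```python
import math
import random

def delete_waypoint(waypoints, start, goal):
    """Delete one intermediate waypoint, preferring sharp turns (weight cos^2(theta/2))."""
    if not waypoints:
        return list(waypoints)
    full = [start, *waypoints, goal]
    weights = []
    for i in range(1, len(full) - 1):
        a, b, c = full[i - 1], full[i], full[i + 1]
        turn = math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(b[1] - a[1], b[0] - a[0])
        turn = (turn + math.pi) % (2 * math.pi) - math.pi   # course change in (-pi, pi]
        theta = math.pi - abs(turn)                         # interior angle at the waypoint
        weights.append(math.cos(theta / 2.0) ** 2)
    if sum(weights) == 0:                                    # already a straight path
        return list(waypoints)
    idx = random.choices(range(len(waypoints)), weights=weights, k=1)[0]
    return waypoints[:idx] + waypoints[idx + 1:]
```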
4.3.1.6. Selection method
The fitness function in this study consists of the path length and collision cost, where the
magnitude of the collision cost is much larger than that of the path length in early generations.
To avoid the algorithm converging to the first collision-free path it finds, we use the tournament
selection method as the selection function. When selecting a parent, k individuals are randomly
chosen from the population, and the one with the highest fitness value among them is selected as
one of the parents.
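A minimal sketch of the tournament selection step with k = 2, as listed in Table 3; the function name is illustrative.

```python
import random

def tournament_select(population, fitness_values, k=2):
    """Pick k individuals uniformly at random and return the fittest of them."""
    contenders = random.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness_values[i])
    return population[best]
```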
The pseudocode of the planner, namely GA-area, is shown in Algorithm 4.
Algorithm 4 GA_area
Inputs: map (M), starting position (start), goal position (goal), max generation (n)
1. initialize the population of GA-area
2. generation = 0
3. while generation < n
4. pop_fitness = fitness(population, start, goal, M)
5. parents = selection(population, pop_fitness)
6. offspring = crossover(parents)
7. offspring = mutate(offspring)
8. population = union(parents, offspring)
9. generation = generation + 1
10. end while
11. solution = best individual in population
12. return solution
4.3.2. Self-improving Process
Although GA-area can find feasible paths in complex environments, it usually converges to
a local optimum and makes unnecessary detours. The mutation operations do local changes to
each individual, and if the fitness value is improved, the resulting individual has a higher
probability of surviving. However, to escape a local optimum, multiple steps of mutation
operations must be applied at the same position, as shown in Figure 14. After the first mutation
operation, the new path shown in Figure 14(b) is in collision again, which results in a great drop
in the fitness value so that the probability that the new path can survive in the selection process is
low. Since the mutation operator and mutated waypoint are randomly selected, it is hard for GA-area to keep this individual until a proper mutation appears again at the same position.
To solve the problem that the mutation operations are scattered across the path, we introduce a
self-improving process and propose a split-and-focus mechanism to let the algorithm split the original
problem into several sub-problems and improve them independently. During the split-and-focus
process of each individual, some waypoints are randomly selected and fixed. The fixed
waypoints are assumed to be part of the optimal solution, and a planner is used to plan the path
that connects the starting point, the fixed points, and the goal point. An illustration of the
split-and-focus is shown in Figure 15.
Figure 14 Multiple steps of mutation operations are needed to escape a local optimum. (a) before mutation; (b) after one
mutation operation (delete waypoint); (c) after the second mutation operation (add waypoint).
Figure 15 Split-and-focus operation. The fixed waypoints are shown with the lock icon.
We developed the GA-focus algorithm (code available at
https://github.com/hu-chuanhui/GA-focus), which follows all the processes mentioned in Section
4.3 except the mutation operations. Instead, a self-improving process is applied to the whole
population in each generation. In this study, we use GA-area to do the planning of the sub-problems.
For each individual, a budget for the total generation number is given and is equally allocated to
each sub-problem. For example, when the budget is 200 generations and there are 4 sub-problems,
the planner for each sub-problem will have 50 generations to solve it. After the fixed
points have been selected, the original problem is split into sub-problems and is optimized by
GA-area. The split-and-focus function is shown in Algorithm 5, and GA-focus is shown in
Algorithm 6. The number of sub-problems is randomly selected from $[2, 2 + num_{waypoint}/20]$.
The initial population of GA-focus, which consists of single-waypoint individuals, is
processed by split-and-focus once. During the process, each waypoint is assumed to be part of
the optimal solution, and the split-and-focus function will try to find a solution from the starting
position to the waypoint and from the waypoint to the goal position. The split-and-focus calls
GA-area to solve sub-problems. In both GA-focus and GA-area, the best individual of a
generation, namely elitism, is also duplicated to the next generation so that the current best
solution will not be discarded. One individual in the initial population of GA-area is a duplicate
of the original solution to the sub-problem, while all other individuals are empty (which means
no intermediate waypoints). Thus, GA-area can keep the current best solution while it can still
decide whether to discard part of the solution by the crossover between the current best solution
and an empty individual.
Algorithm 5 split_and_focus
Inputs: individual (individual), map (M), starting position (start), goal position (goal), generation budget (B)
1. s = randomly decide the number of sub-problems
2. sub_problems = randomly split individual into s sub-problems
3. n = int(B / s)
4. sub_solutions = empty
5. for sub_problem in sub_problems
6.     initialize the population of GA-area as an empty set
7.     copy the waypoints of sub_problem to the first element of the population
8.     sub_sol = run GA_area(M, sub_start, sub_goal, n)
9.     add sub_sol to sub_solutions
10. end for
11. solution = combine sub_solutions
12. return solution
Algorithm 6 GA_focus
Inputs: map (M), starting position (start), goal position (goal), generation number (n), generation budget (B)
1. initialize the population of GA-focus
2. for individual in population
3.     individual = split_and_focus(individual, M, start, goal, B)
4. end for
5. generation = 0
6. while generation < n
7.     pop_fitness = fitness(population, start, goal, M)
8.     parents = selection(population, pop_fitness)
9.     offspring = crossover(parents)
10.    population = union(parents, offspring)
11.    for individual in population
12.        individual = split_and_focus(individual, M, start, goal, B)
13.    end for
14.    generation = generation + 1
15. end while
16. solution = best individual in population
17. return solution
4.4. Case Study
4.4.1. Narrow Gap Problem
In the first case study, we generate an environment with its upper half filled with 300
randomly generated ellipses as obstacles. The starting position is at the upper-left corner, and the
goal position is at the upper right corner. As shown in Figure 16(a), to increase the complexity of
the problem, we added a vertical wall in the center of the ellipse area. There is a narrow gap in
the wall, so the optimal solution should be a path that connects the starting point and the goal
point while passing through the gap without hitting any ellipses.
From the figure, we can see that although it is difficult to find such a path, an easy but
feasible solution is to first go down to the obstacle-free area at the bottom, then go right and
reach the bottom-right corner of the map, and finally go upward and reach the goal. Such a path
will encounter fewer obstacles so that it can improve its fitness more easily. It thus will have a
higher fitness value and a higher probability of surviving in the selection process.
Table 3 Model settings for test case 1

                            GA-linear   GA-area   GA-focus-main   GA-focus-sub
length cost coef α          0           1         1               1
collision cost coef β       1           100       100             100
generation number (total)   800         800       3               200
population size             320         320       32              9
parent number               160         160       16              3
tournament selection k      2           2         2               2
crossover probability       1           1         1               1
mutation probability        1           1         /               1
keep parent                 True        True      True            True
keep elitism                True        True      True            True
self-improve                False       False     True            False
The size of the grid map is 10,000×10,000, and the waypoints are picked from all the free
grids. We compare the performance of three algorithms: GA-linear, GA-area, and GA-focus.
GA-linear and GA-area use the crossover and mutation operations described in Section 4.3. The
difference is that GA-linear uses the length of segments in a collision as the collision cost, while
GA-area uses the area cost as described in Section 4.3. GA-focus has a self-improving process,
using a GA-area as the planner to solve the sub-problems during self-improving. The main
problem solver is GA-focus-main, while the sub-problem solver is GA-focus-sub in the
following text. Both GA-focus-main and GA-focus-sub use the crossover operation described in
Section 4.3. GA-focus-main does not have a mutation process, while the GA-focus-sub uses the
mutation operation described in Section 4.3. Both GA-focus-main and GA-focus-sub use the
area-based collision cost. In the area cost estimation, the grid map is decomposed into a quadtree.
The max level of the tree is 9, and thus the size of the smallest block is about 10,000 · 2⁻⁹ ≈ 20
pixels in width. The number of blocks in Table 4 shows how complex the environment is. In this
test, the environment is decomposed into 26,527 quadtree blocks.
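The following minimal Python sketch illustrates the kind of quadtree decomposition described here; the function name and the small example map are illustrative only, and the actual implementation in this dissertation is not reproduced.

import numpy as np

def build_quadtree(occ, x0, y0, size, max_level, level=0, blocks=None):
    # Decompose a square occupancy grid (True = obstacle) into homogeneous blocks.
    # Mixed regions are split until max_level; a mixed smallest block is marked as
    # an obstacle so that obstacle blocks cover every obstacle grid.
    if blocks is None:
        blocks = []
    patch = occ[y0:y0 + size, x0:x0 + size]
    if level >= max_level or patch.all() or not patch.any():
        blocks.append((x0, y0, size, bool(patch.any())))
        return blocks
    half = size // 2
    for dx, dy in ((0, 0), (half, 0), (0, half), (half, half)):
        build_quadtree(occ, x0 + dx, y0 + dy, half, max_level, level + 1, blocks)
    return blocks

# Example on a 512x512 map: with max_level = 9 the smallest block is 512 / 2^9 = 1 pixel.
occ = np.zeros((512, 512), dtype=bool)
occ[100:200, 150:300] = True
print(len(build_quadtree(occ, 0, 0, 512, max_level=9)))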
Table 4 The results of case study 1. The values are shown in the format mean ± std.

| Algorithm | # blocks | # segments evaluated | # waypoints | Success rate (%) | Path length |
|---|---|---|---|---|---|
| GA-linear | 26527 | 293259.70 ± 3723.80 | 4.70 ± 2.57 | 0 | 10001.96 ± 220.75 |
| GA-area | 26527 | 254682.50 ± 9148.08 | 26.10 ± 8.86 | 100 | 17393.24 ± 1706.17 |
| GA-focus | 26527 | 280539.20 ± 1265.26 | 33.00 ± 4.43 | 100 | 11797.43 ± 135.57 |
All parameters of the models are listed in Table 3. Since GA-linear uses a different collision
cost, we let the coefficient of length cost be zero so that it will neglect the length of the path and
focus on finding a collision-free path. GA-linear and GA-area do not have a self-improving
process. Thus, we let them have a larger population size and generation number so that they can
have equal power in exploring the solution space compared to GA-focus. The exploration power
is estimated by the number of line segments evaluated. Each segment consists of two waypoints
in the space, and the number of segments evaluated shows how many different combinations of
waypoints the algorithm has created and tested by crossover and mutation.
We do the planning ten times with different random seeds, and the results are shown in
Figure 16 and Table 4. The average number of segments evaluated by the three algorithms is less
than 300,000. The quadtree of the environment has 26,527 blocks, which have 26,527 × 4 ≈ 10⁵
vertices. If a search-based algorithm is applied and we need to construct a graph with these vertices,
the worst-case number of segments evaluated on this map can be 10⁵ × 10⁵ = 10¹⁰, which is
much larger than the 3 × 10⁵ evaluated in our case.
The obstacle blocks of the quadtree cover all obstacle grids, which means that usually, the
corner grid of an obstacle block is actually in free space. Thus, if the collision cost of a solution
is less than 1, the solution is regarded as a feasible one in the results. For GA-linear, this means
that the length in a collision is less than 1 pixel wide, and for the other two algorithms, this
means that the collision area is smaller than 1 pixel. From Figure 16(b), we can see that GA-linear fails in all 10 tests. All the solutions hit the wall because the surrounding environment is
filled with obstacles, and hitting the wall results in the smallest collision cost. The linear
collision cost does not provide enough guidance on how the algorithm can avoid the obstacles it
is in collision with.
Both GA-area and GA-focus have a success rate of 100%. They both can find feasible
solutions in complex environments. However, GA-area suffers from premature convergence. It
tends to converge to an easy solution that goes through the obstacle-free space instead of trying
to find a feasible path through the ellipse obstacles. The number of segments evaluated by GA-area also shows this problem. A new segment can be produced by a crossover process or a
mutation process. The large standard deviation of the number of segments evaluated implies that
the GA-area converges to the same solution too early in some tests, and thus the parents in the
crossover process are identical so that no new segment is produced. Only 1 solution by GA-area
finds the narrow gap in the wall, yet it still does not try to find a path through the obstacle-filled
area. The history of fitness values is shown in Figure 17. It can be seen that GA-linear and GA-area suffer from the early convergence problem and stop exploring in the early stage, while
GA-focus keeps exploring the solution space and eventually finds the narrow gap in all 10 tests.
Figure 16 The grid map of case study 1 and the solutions by the three algorithms. (a) the starting point and goal point are at the
upper-left and upper-right corner, respectively; (b) the 10 solutions given by GA-linear; (c) the 10 solutions given by GA-area;
(d) the 10 solutions given by GA-focus
Figure 17 fitness history of the three methods in case study 1. The x-axis is the generation number and is scaled to [0, 1]. The
curve shows the mean±std.
In this test environment, GA-focus shows robust performance in preventing premature convergence
during planning. The algorithm is never trapped in an easy solution. All 10 solutions find the
narrow gap in the wall, and the small standard deviation of path length also shows the robustness
of GA-focus. The average number of waypoints in the final solution is 33, which shows that GA-focus can increase the number of waypoints in its solution according to the complexity of the
problem.
4.4.2. Compare GA-focus with RRT*
In this case study, we compare the performance of the proposed GA-focus with RRT* on
two different 10,000×10,000 grid maps, namely the Box environment and the Cross environment.
Each map's starting point and goal point are marked by the blue square and the red dot,
respectively, as shown in Figure 9. We do 10 experiments on each problem with different
random seeds. The parameters of GA-focus are the same as in Case Study 1, as stated in Table 3,
except that the generation number is set to 1 in both experiments. The rewire radius factor γ of
RRT* is set to 20,000, and the maximum step size of the Steer function is set to 1000. The RRT*
continuously adds new nodes until there are 10,000 nodes in the tree. Figure 18 shows the paths
generated by GA-focus and one of the solutions generated by RRT*. The results of both
algorithms are shown in Table 5 and the bar plots of the path length and time consumption are
shown in Figure 19.
Figure 18 The solutions to the two environments. The starting point and goal point are marked by the blue square and red dot,
respectively. (a) 10 solutions by GA-focus on the Box environment; (b) a sample solution and the generated tree by RRT* on the
Box environment; (c) 10 solutions by GA-focus on the Cross environment; (d) a sample solution and the generated tree by RRT*
on the Cross environment
Figure 19 The bar plot comparison of the path length and time consumption. The error bar shows the standard deviation. (a)
comparison of path length in both environments; (b) comparison of time consumption in both environments
Table 5 The results of case study 2. The values are shown in the format mean ± std.

| Environment | Algorithm | Path length | Success rate (%) | Time (s) |
|---|---|---|---|---|
| Box | GA-focus | 10371.14 ± 10.32 | 100 | 275.94 ± 16.88 |
| Box | RRT* | 10532.03 ± 48.81 | 100 | 473.53 ± 7.94 |
| Cross | GA-focus | 13629.71 ± 6.78 | 100 | 337.93 ± 6.76 |
| Cross | RRT* | 13702.87 ± 25.51 | 100 | 450.19 ± 5.98 |
In both environments, GA-focus shows robustness in the quality of the solution. GA-focus
can find near-optimal paths in a shorter time compared to RRT* while having smaller mean path
lengths and standard deviations.
4.4.3. Different Tough Environments
In this case study, we provide 4 different tough problems for GA-focus. We select four grid
maps from the benchmark set collected by Sturtevant (Sturtevant, 2012). These maps are from
the video game StarCraft I and are converted to greyscale images and resized to 10,000×10,000.
The starting point and goal point for each map are marked by the blue square and the red dot,
respectively, as shown in Figure 20. The shape of the obstacles in these environments is irregular,
and the maps show different challenges for path planning:
Table 6 The results of case study 3. The values are shown in the format mean ± std.

| Map | # generation | # blocks | # segments evaluated | # waypoints | Path length | Success rate (%) | Time (s) |
|---|---|---|---|---|---|---|---|
| Map 1 | 3 | 42373 | 291293.60 ± 1190.06 | 60.20 ± 10.32 | 16667.68 ± 329.45 | 100 | 1204.33 ± 18.15 |
| Map 2 | 3 | 36343 | 290477.00 ± 1068.95 | 67.70 ± 7.63 | 14999.16 ± 256.00 | 100 | 2054.82 ± 26.72 |
| Map 3 | 4 | 49792 | 364079.60 ± 1533.41 | 72.50 ± 9.63 | 18728.91 ± 801.54 | 100 | 1177.46 ± 31.96 |
| Map 4 | 4 | 61150 | 360363.30 ± 1627.90 | 76.40 ± 11.13 | 15907.43 ± 315.33 | 100 | 893.69 ± 30.52 |
• Map 1: An indoor environment as shown in Figure 20(a). There are many rooms and walls
in this environment, so it is difficult for evolutionary algorithms to escape the local optimum
and find feasible solutions.
• Map 2: A maze environment as shown in Figure 20(b). There are many dead ends in this
environment, and the boundary of the geometries is rough, which makes the graph
construction and search process time-consuming for regular search-based algorithms.
• Map 3: An environment filled with irregular obstacles, as shown in Figure 20(c). The
starting point and goal point are surrounded by pocket-shaped obstacles, and the free space
contains a lot of narrow passages. These problems are difficult for APF and sampling-based
algorithms.
• Map 4: An environment filled with obstacles whose sizes differ significantly, as shown in
Figure 20(d). The winding free space around small obstacles might be part of the optimal
solution.
Figure 20 The solutions to four tough tasks by GA-focus. The starting point and goal point are marked by the blue square and
red dot, respectively. (a) an indoor environment; (b) a non-square maze environment; (c) an environment with pocket-shaped
obstacles and narrow passages; (d) an environment filled with obstacles of different sizes.
Like case study 1, we also do 10 experiments on each problem with different random seeds.
The parameters of GA-focus are the same as in Case Study 1, as stated in Table 3. If the collision
cost of a path is smaller than 1, the path is considered a feasible one. The solutions are shown in
Figure 20, and the results are listed in Table 6.
The result shows that in all test cases, GA-focus can reach a 100% success rate. GA-focus
shows robustness in the path length. On map 1, map 2, and map 4, the standard deviation of the
path length is less than 2% of the average length. On map 3, only two of the ten solutions take an
obvious detour, and the rest of the solutions are all efficient paths. Seven of the ten solutions on
map 4 go through the area with many tiny obstacles, which shows that the winding narrow space
around small obstacles is not neglected by GA-focus.
4.4.4. Multi-objective Planning
One of the advantages of GA-based planning is that multi-objective planning problems can
be easily modeled by the fitness function (Ahmed & Deb, 2013; C. Liu et al., 2022; Weise &
Mostaghim, 2022). In this case study, we provide an illustration of how GA-focus can be applied
to multi-objective planning problems.
One critical problem in the path planning of robots is to avoid rollover (Yan et al., 2021). A
common solution is to reduce the turning angle and the speed. In this problem, we aim to find a
near-optimal solution considering the weighted sum of distance-cost, smoothness-cost, and time-cost. The grid map of the environment with a size of 10,000×10,000 is shown in Figure 21. We
let the allowed moving speed of the agent vary with the y coordinate, indicating the terrain
condition differs with the y coordinate, as shown in Figure 21(a). When the y coordinate gets
larger, the terrain becomes flatter, and the agent is allowed to move at a higher speed. The
allowed speed is set to change linearly with the increment of the y coordinate, which ranges from
0.05 to 1. The time cost of a line segment is defined as the traveling time along the segment, as
shown below:
\[ \mathrm{cost}_{time} = \frac{\mathrm{length}_{seg}}{\mathrm{speed}_{avg}} \tag{24} \]
Figure 21 The grid map of case study 4 and the solutions by GA-focus. (a) the speed field in the environment. The red color
means that the agent can move faster in these areas; (b) the 10 solutions given by GA-focus. The starting point and goal point are
at the upper-left (blue square) and upper-right (red dot) corners, respectively.
Table 7 The results of case study 4. The values are shown in the format mean ± std.

| Path length | Time-cost | Smoothness-cost | Success rate (%) | Computation time (s) |
|---|---|---|---|---|
| 13479.54 ± 705.65 | 48504.92 ± 2867.61 | 56.00 ± 19.08 | 100 | 298.79 ± 12.33 |
In order to find a smooth solution, a smoothness cost is introduced to penalize sharp turns in
the solution. The smoothness cost is defined as follows, where 𝑑𝑒𝑔 denotes the turning angle of
two segments:
\[ \mathrm{cost}_{smooth} = \begin{cases} 0, & \text{if } deg < 10^{\circ} \\ 20, & \text{if } 10^{\circ} \le deg < 45^{\circ} \\ 50, & \text{if } 45^{\circ} \le deg < 90^{\circ} \\ 200, & \text{if } deg \ge 90^{\circ} \end{cases} \tag{25} \]
The final fitness function is defined as the inverse of the weighted sum of all the cost
functions:
\[ \mathrm{fitness} = \frac{1}{\alpha \cdot \mathrm{cost}_{col} + \beta_1 \cdot \mathrm{cost}_{length} + \beta_2 \cdot \mathrm{cost}_{smooth} + \beta_3 \cdot \mathrm{cost}_{time}} \tag{26} \]
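As a concrete illustration of equations (24)-(26), the following minimal Python sketch evaluates the multi-objective fitness of a candidate path using the coefficients given below (α = 100, β₁ = 1, β₂ = 100, β₃ = 1). The collision-cost callback, the sampling of the speed field along a segment, and the example path are simplifying assumptions, not the dissertation's implementation.

import math

def time_cost(p, q, speed_at, n_samples=20):
    # Travel time of segment p->q as length / average speed along it (eq. 24);
    # speed_avg is approximated by the mean of speeds sampled along the segment.
    length = math.dist(p, q)
    speeds = [speed_at(p[0] + (q[0] - p[0]) * (i + 0.5) / n_samples,
                       p[1] + (q[1] - p[1]) * (i + 0.5) / n_samples)
              for i in range(n_samples)]
    return length / (sum(speeds) / n_samples)

def smoothness_cost(deg):
    # Piecewise turning penalty of eq. 25; deg is the turning angle in degrees.
    if deg < 10:
        return 0
    if deg < 45:
        return 20
    if deg < 90:
        return 50
    return 200

def turning_angle(a, b, c):
    # Angle between segments a->b and b->c, in degrees.
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos_t = dot / (math.hypot(*v1) * math.hypot(*v2) + 1e-12)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def fitness(path, collision_cost, speed_at,
            alpha=100.0, beta1=1.0, beta2=100.0, beta3=1.0):
    # Inverse of the weighted sum of all costs (eq. 26); `path` includes start and goal.
    c_len = sum(math.dist(p, q) for p, q in zip(path, path[1:]))
    c_smooth = sum(smoothness_cost(turning_angle(a, b, c))
                   for a, b, c in zip(path, path[1:], path[2:]))
    c_time = sum(time_cost(p, q, speed_at) for p, q in zip(path, path[1:]))
    return 1.0 / (alpha * collision_cost(path) + beta1 * c_len
                  + beta2 * c_smooth + beta3 * c_time)

# Speed grows linearly with y from 0.05 to 1.0 over the 10,000-pixel map.
speed_at = lambda x, y: 0.05 + 0.95 * min(max(y / 10000.0, 0.0), 1.0)
no_collision = lambda path: 0.0
print(fitness([(0, 9500), (5000, 5000), (9999, 9500)], no_collision, speed_at))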
In this case study, the coefficient of collision cost is 𝛼 = 100. The coefficients of other costs
are 𝛽1 = 1, 𝛽2 = 100, 𝛽3 = 1, so the magnitudes of these costs are close. The parameters of
GA-focus are still the same as in Table 3, except that the number of generations of GA-focus-main is 2. We do 10 experiments with different random seeds. The solutions are shown in Figure
21(b), and the performance is shown in Table 7. GA-focus can efficiently solve such a multi-objective problem. It finds feasible solutions in all 10 tests in around 5 minutes. All solutions
tend to move to the center of the map first and then proceed to the goal so that at the middle of
the trajectory, the agent can move at a higher speed and thus save time.
4.5.Results and Discussion
As illustrated by case study 1, our proposed area-based collision cost can find feasible
solutions in an environment filled with obstacles. In path planning problems, the optimal path
may be more complicated than other local optimums. As shown in Table 4, the average number
of waypoints by GA-area is smaller than that by GA-focus. Also, we can see in Figure 16 that the
paths given by GA-focus go through the obstacle-filled area, while GA-area tends to make a
detour to the free area. This implies that more efforts are required to find the optimal solution,
and premature convergence of the algorithm will stop it from searching for a better but more complex
solution.
GA is prone to premature convergence and can be easily trapped in local optima. This is
caused by the selection process: low-quality individuals are completely discarded, and
temporarily high-quality individuals dominate the population. To address this issue, we
introduced the self-improving process. The self-improving process can reduce the probability
that a potentially good solution fails to survive the selection process. It gives each individual
more chances to explore the solution space around it and improve itself before the next selection
process happens. By the split-and-focus mechanism, each individual splits itself into several sub-problems so that the operations can be focused in a local area and the best operation has a higher
probability of emerging before the population is dominated by a local optimum. The number of
sub-problems and the position of the fixed waypoints are randomly chosen so that each
individual can try different combinations of sub-problems and are not restricted to the same set
of fixed waypoints. The results of case study 3 show that GA-focus can deal with different
challenges in path planning problems, such as pocket-shaped obstacles and winding areas. GAfocus can also be extended to multi-objective planning problems. As illustrated in case study 4,
by adding more terms in the fitness function, GA-focus can handle the multi-objective planning
problem in static environments.
Classic GA is sensitive to the hyperparameter setting, such as the crossover rate and the
mutation rate. However, GA-focus has shown robust performance in all test cases with the same
set of parameters except the generation number. The mutation operation of GA-focus is doing a
local search around an individual, and the mutation rate of GA-focus-sub decides whether an
individual explores the nearby space at each step and is trivially set to 1. The crossover operator
tries to combine sections of two different paths and generate a new one. Since the parents and
elitism of each generation are passed to the next generation, such an exploration is not harmful,
and the crossover rate is also set to 1. The population size and parent number of both GA-focus-main and GA-focus-sub are set to balance the computation time. The computation time grows
linearly with the population size since each individual undergoes a self-improving process.
The number of segments evaluated has a linear relationship to the number of generations.
There is a self-improving process before the selection in the first generation so that the number
of segments evaluated is proportional to num_generation + 1. Thus, the complexity of the proposed
algorithm does not scale too fast with the complexity of the environment. The number of
generations in the case studies is selected to keep a balance between the feasibility of the solution
and the time consumption. If a smoother and shorter path is needed, optimizing for one more
generation will usually increase the quality of the solution.
The code is written in Python, and the tests are done on a gaming laptop. The self-improving process of GA-focus is parallelized on 16 threads, and the calculation is done by an
AMD Ryzen 7 5800H CPU with 3.20 GHz speed and 16 GB RAM. The construction of the
quadtree takes several seconds. The time consumption in Table 4, Table 5, and Table 6 only
records the optimization time of GA-focus.
As shown in Table 6, more quadtree blocks do not always result in longer computing times.
However, how the blocks are distributed can influence the time consumption. For instance, the
time consumption on map 2 is much longer than on the other three maps because map 2 is a non-square maze environment, and the grid map is padded on both sides. The padding connects
multiple obstacles to form a large one, and the collision area computation must traverse the
blocks in this large obstacle, which results in heavy computation. Building a quadtree with
rectangular blocks or simply modifying the number of nodes at the first layer of a quadtree can
deal with such a rectangular map. However, it is not guaranteed that the boundary of an indoor
map is perpendicular to the axis of the grid map, as is the case in Figure 20(a). A map preprocessing
process such as rotating and cropping might help to reduce the calculation in such cases. Another
solution is to reduce the max level of the quadtree so that the number of blocks will reduce. This
will work when a high-resolution map is not required.
4.6.Summary and Findings
In this study, we introduced the GA-focus algorithm that can solve large-scale path planning
problems in complex environments. GA-focus follows the framework of GA, except that the
mutation process is replaced by a self-improving process. In the self-improving process, each
individual is split into several sub-problems, and a planner named GA-area is used to solve these
sub-problems. We developed a novel collision cost function for GA-area. When a line segment
of the path cuts an obstacle, the area of the smaller piece of the obstacle will be counted as its
collision cost. Such a cost function changes continuously with the motion of the line segment,
which ensures that the algorithm can always converge to a feasible solution.
The case studies show that the proposed GA-focus can solve different challenging path
planning problems on large grid maps with sizes of 10,000×10,000. GA-focus has a higher
convergence speed compared to RRT* and shows robust performance in finding high-quality
solutions in complex environments filled with obstacles of different sizes and shapes. This
indicates that GA can solve path planning problems on large-scale and complex grid maps
without any graph-construction process.
This study aims to provide a framework for solving large-scale path planning problems in
complex environments, and thus, we only consider the path length and collision cost in the
fitness function. However, one important feature of GA is its flexibility in the design of the
fitness function. As shown in Section 4.4.4, more features, such as smoothness cost and speed
cost, can be added to the fitness function so that the framework of GA-focus can be applied to a
multi-objective problem.
Also, it should be noted that although we use GA-area in the self-improving process, other
planners may also work in this stage. The reason we use GA as the base algorithm is that GA can
support operations among individuals of different lengths. When the length of the solution can be
predefined, other popular evolutionary algorithms such as PSO and GWO can also be applied in
the self-improving process and thus form a hybrid algorithm. We just have 5 common mutation
operators in GA-area, and more heuristic operators may help the convergence speed of the
algorithm as well. Additionally, the mutation operators choose the new waypoint randomly in
this study. A local search might also be helpful.
The collision cost is calculated by decomposing the map into a quadtree and summing up the
area of blocks. The algorithm's efficiency can be improved if a better area calculation method is
developed. The self-improving process of GA-focus utilizes a split-and-focus mechanism, which
assumes that the optimal solution of the sub-problem can be combined to form the optimal
solution of the main problem. This assumption can be violated in problems with time-varying
conditions, such as environmental disturbances or dynamic obstacles. Since the collision cost
with dynamic obstacles can be evaluated separately by a forward simulation (e.g., assuming that
the agent moves along the desired trajectory at a constant speed, simulate and see whether the
agent will collide with any dynamic obstacles), it is possible to extend GA-focus to a dynamic
environment. How to use this framework in a time-varying environment can also be a promising
future direction.
Based on the experiment results and ensuing discussions, the following conclusions can be
drawn:
• A convex collision-area cost guarantees the feasibility of the global plan. Unlike other
popular collision cost functions, the proposed collision area cost does not introduce
additional local minima to the solution space, and thus can avoid converging to
infeasible paths.
• Splitting the original path planning problem into sub-problems is an effective strategy to
improve planning efficiency and quality. GA-focus uses crossover and self-improving
mechanism to randomly split the original problem and keep exploring the solution space.
With this approach, GA-focus can use the same set of hyperparameters and show robust
performance in all case studies.
• Using local path operations helps to improve the solution quality. Although the initial
population just contains one-waypoint solutions, the small-step mutation operations will
finally extend the solution to a long sequence of waypoints. The resulting global paths
tend to be smooth even when smoothness is not considered in the fitness function.
Chapter 5. Trajectory Planning by Deep Reinforcement Learning with
Knowledge Transfer and Online Demonstration
5.1.Introduction
Motion planning is essential for autonomous systems as it facilitates safe and efficient
navigation through dynamic and challenging environments. A motion planning problem can
involve multiple goals, such as saving energy and finding a smooth route. In real-world
applications, it also needs to consider how the vehicle moves and requires a well-adjusted
controller to stay on the path accurately.
To address the challenge of motion planning, researchers have developed various algorithms,
with search-based (Shah & Gupta, 2020; Tang et al., 2021), sampling-based (G. Chen et al., 2021;
Kuffner & LaValle, 2000; Nasir et al., 2013), and field-based (Fan et al., 2020) approaches
emerging as the most widely recognized methods. Evolutionary algorithms are also a hot
research field for solving motion planning problems because of their flexibility in modeling
multi-objective problems (Das & Jena, 2020; Luo et al., 2020). These algorithms are widely
applied in the pipeline motion planning framework, which splits the motion planning process
into several stages, including the global path planning stage and the local trajectory planning stage (Teng et al.,
2023; Xiao et al., 2022).
Following the pipeline motion planning framework, a global planner is applied to find a
global path according to the objectives of the problem, and a local trajectory planner and
controller are then used to navigate the agent safely along the global path. However, the
efficiency of the algorithms mentioned above can be affected by the complexity and scale of the
environment. Moreover, when the agents are required to solve similar problems in slightly
varying environments, these algorithms still need to plan from scratch. Thus, much research has
tried to use neural networks to solve this problem by letting the agent learn common knowledge
about the task and reuse the knowledge in future planning (Morales et al., 2021).
Modeling the motion planning problem as a Markov decision process (MDP) and using
reinforcement learning to learn the optimal policy is straightforward due to its nature as a
sequential decision-making process. As neural networks have shown their great power in various
fields, deep reinforcement learning (DRL) is now widely applied in motion planning
problems (Gao et al., 2020; Z. Hu et al., 2021; B. Li et al., 2021; X. Wang & Jin, 2022, 2023; Xie
et al., 2021; Yao et al., 2020). However, most of the DRL applications use low-level actions such
as speed control or force control as the action, which makes it extremely difficult to interpret the
behavior of the agent in the long term. On the other hand, due to the randomness of exploration,
the reward function needs careful design to provide meaningful information to the agent.
Nevertheless, designing the reward function of DRL is a tricky task and demands substantial
domain knowledge.
Moreover, most of these approaches directly control the low-level dynamics input, such as
the velocity or load. This makes it hard to interpret the agent’s actions in long-horizon planning
problems. In some control problems, the DRL agent may produce abrupt changes to the low-level control commands and require post-processing to smooth the actions (Andrychowicz et
al., 2020). The end-to-end control framework also makes it difficult to transfer the policy to
similar problems when the agent dynamics have changed (X. Wang & Jin, 2023). Many
researchers have shown that DRL can learn high-level policy. Williams et al. (Williams et al.,
2023) modeled the action space of motion planning problems as a set of control points of a spline
curve and assumed that a controller could be designed to follow the desired output spline.
Albarella et al. (Albarella et al., 2023) developed a two-layer path planning architecture for
autonomous highway driving by using DRL to determine the high-level driving modes and
generate low-level control commands with model predictive control. Pertsch et al. (Pertsch et al.,
2022) showed that a robot agent can learn high-level skills from unlabeled demonstration data
even without clear goals.
Another common problem of DRL is that it requires a large number of interactions with the
environment to learn an effective policy. Starting with no prior knowledge about the task, the
agent must explore the environment through trial and error to learn a good policy. Research has
shown that learning from demonstrations is a cure for this problem (Hester et al., 2018; Hussein
et al., 2017; Le Mero et al., 2022). The demonstrations can be classified into two categories: pre-training demonstrations and online demonstrations. A pre-training demonstration set is gathered
prior to the actual training phase. This dataset serves as experiential data that can be utilized to
train various models, including a DRL agent or an imitation learning model such as behavior
cloning or inverse reinforcement learning. The online demonstration, on the contrary, provides
expert experience interactively during training. The online demonstrated trajectories are stored in
the experience buffer alongside the agent's own experience to improve the quality of the replay data. Although the
online demonstration can provide more case-related experiences during training, the main
problem with online demonstration is when to query the expert since the expert’s time is more
valuable than the agent’s time.
In motion planning problems, a demonstration set can be used to pre-train the policy model
and thus improve the learning efficiency and performance. Hu et al. (J. Hu et al., 2020) used
human demonstrations to pre-train the model for the multi-robot autonomous exploration problem.
Nair et al. (Nair et al., 2018) showed in multi-step robotics tasks that the problem of sparse
reward in long-horizon problems can also be tackled by resetting some training episodes using
demonstration episodes. In this dissertation, we introduce a dual-level DRL framework to solve
the local planning problem for autonomous vehicles. We developed a parallel architecture for the
critic network to enable the integration of high-level geometric knowledge using transfer
learning. More specifically, we try to answer two research questions: (1) What are effective ways
to capture high-level domain knowledge and distill it into a DRL agent through training? (2)
What impact a teacher’s guidance can make on the DRL agent’s learning effectiveness? The
proposed high-level decision maker picks waypoints in the global frame with a variable horizon
and is trained using high-level actions. This approach enhances the interpretability of the policy,
improves the training efficiency in long-range planning problems, and allows for the utilization
of various path planning algorithms as online demonstrators. Our experiment results demonstrate
that even a suboptimal demonstrator or teacher can significantly enhance the agent's performance.
Furthermore, the results indicate that single-step error correction from the demonstrator is more
effective than achieving a complete successful trajectory.
5.2.Problem Description
Given that the local planner of the agent does not have global information during
maneuvering and needs to deal with uncertainty coming from the unobserved environments, in
this work, we propose a decision-making framework that can plan based on local information. To
deal with the dynamics constraint of the agent, we split the policy into two levels: a high-level
policy that picks a target waypoint and a low-level policy that controls the motion of the agent.
The aim is to maximize the expected accumulated reward during the motion so that the agent can
reach the goal through a short and safe path. The problem is formulated as follows:
Given the current state 𝑠 ∈ 𝕊 and the action set 𝔸, find a policy 𝑤𝑝~𝜋(𝑠) ∈ ℝ𝑥 × ℝ𝑦 that
maximizes the expected accumulated discounted reward
\[ \pi^{*} = \operatorname*{argmax}_{\pi} \; E_{\,s' \sim P(s,a),\; a \sim ctrl(s,wp),\; wp \sim \pi(s)} \left[ \sum_{t=0}^{+\infty} \gamma^{t} R(s_t, a_t) \right] \tag{27} \]
where 𝑅(𝑠, 𝑎) is the single-step reward, 𝑃(𝑠, 𝑎) is the state transition probability,
𝑐𝑡𝑟𝑙(𝑠, 𝑤𝑝) is the low-level controller, 𝑎 is the low-level control command, 𝑤𝑝 is the waypoint,
and 𝛾 is the discount factor.
5.3.Dual-level Deep Reinforcement Learning Framework
Although end-to-end DRL has been applied in various decision-making tasks, it remains a
challenging problem to solve long-term planning problems with this framework. When training
an off-policy RL, an experience buffer can be applied to save and make better use of historical
data. However, in long-range planning problems, the experience buffer will almost be filled with
non-terminal experiences, which makes the terminating reward too sparse in the experience
buffer. On the other hand, since much research has been done to combine classic path planning
algorithms and control theories in real-world planning and control problems, it is a reasonable
assumption that an agent can encode the policy into two levels: a high-level policy to select a
target waypoint, and a low-level policy to navigate the agent to the waypoint in real time.
In this research, our case study is the motion planning problem for autonomous ships. We
use a low-fidelity but well-received ship agent dynamics model and a linear low-level controller
to demonstrate the feasibility of the proposed method. The state of the agent is described by its
rotational (i.e., yawing) speed 𝜓̇, heading angle 𝜓, 𝑥 and 𝑦 coordinates in the world frame, and
the linear speed 𝑣, as shown in ( 28 ):

\[ \begin{aligned} \ddot{\psi} &= -\frac{\dot{\psi}}{T} + \frac{K}{T}\,u_1 \\ \dot{x} &= v\cos(\psi) \\ \dot{y} &= v\sin(\psi) \\ \dot{v} &= c\,(u_2 - v) \end{aligned} \tag{28} \]
where 𝑢1 and 𝑢2 are the inputs to control the heading and speed of the agent. 𝐾, 𝑇 and 𝑐 are
constants. The agent is assumed to move at a constant speed, and the linear controller 𝑎 ≡
(𝑢1, 𝑢2) = 𝑐𝑡𝑟𝑙(𝑠, 𝑤𝑝) is defined as:

\[ \begin{aligned} u_1 &= p_1\dot{\psi} + p_2(\psi_{target} - \psi) \\ u_2 &= v \end{aligned} \tag{29} \]

where 𝑝1 = −10 is a constant and \(p_2 = -\left(\frac{K}{T}p_1 - \frac{1}{T}\right)^2 / \left(\frac{4K}{T}\right)\). 𝜓𝑡𝑎𝑟𝑔𝑒𝑡 is the course angle
from the current position to the target waypoint. The system will finally converge to 𝜓̇ = 0 and
𝜓 = 𝜓𝑡𝑎𝑟𝑔𝑒𝑡.
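The following Python sketch simulates one waypoint-tracking segment using the dynamics of equation (28) and the linear controller of equation (29). The values of K, T, and c are placeholders chosen only for demonstration (they are not given in this section), the heading-error wrapping is an added practical detail, and the sign of p₂ is taken so that the heading loop is critically damped under the error convention used here.

import math

# Illustrative constants; the dissertation's actual K, T, c values are not reproduced here.
K, T, C = 0.1, 10.0, 0.05
P1 = -10.0
# Gain magnitude from equation (29); sign chosen so that err = psi_target - psi is stabilizing.
P2 = (1.0 - K * P1) ** 2 / (4.0 * K * T)

def controller(state, waypoint):
    # Linear low-level controller of equation (29).
    x, y, v, psi, psi_dot = state
    psi_target = math.atan2(waypoint[1] - y, waypoint[0] - x)
    err = (psi_target - psi + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    return P1 * psi_dot + P2 * err, v   # (u1, u2); the commanded speed is kept constant

def step(state, u1, u2, dt=1.0):
    # One Euler integration step of the ship dynamics in equation (28).
    x, y, v, psi, psi_dot = state
    psi_ddot = -psi_dot / T + (K / T) * u1
    return (x + v * math.cos(psi) * dt,
            y + v * math.sin(psi) * dt,
            v + C * (u2 - v) * dt,
            psi + psi_dot * dt,
            psi_dot + psi_ddot * dt)

# Drive the agent toward one waypoint until it is within 1,000 m or the step budget runs out.
state, wp = (0.0, 0.0, 10.0, 0.0, 0.0), (5000.0, 3000.0)
for _ in range(2000):
    if math.dist(state[:2], wp) < 1000.0:
        break
    state = step(state, *controller(state, wp))
print(round(state[0]), round(state[1]))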
As stated in Section 5.2, the policy is split into a high-level policy that picks a target waypoint and a low-level policy that controls the motion of the agent toward it; the aim is still to maximize the expected accumulated reward during the motion so that the agent can reach the goal through a short and safe path.
Assuming the agent does not have global information and needs to deal with uncertainty
coming from unobserved environments, in this work, we propose a decision-making framework
that can plan based on local information and allows the agent to sense the local information of
Fast Marching (FMM) distance (Garrido et al., 2013) to obtain information about possible global
paths. The agent can sense the surrounding environment by laser rangers so that it can identify
the geometry of the obstacles nearby. The laser rangers are distributed every 5°, so the agent
receives 360°/5° = 72 feedbacks from the laser rangers, as shown in Figure 23(b). To provide
information on the goal position, the global coordinate of the goal point and the gradient of the
FMM distance are also given in the state. The state 𝑠 is defined as

\[ s = \{x, y, v, \psi, \dot{\psi}, x_{fmm}, y_{fmm}, dist_{fmm}, x_{goal}, y_{goal}, dist_{laser}\}, \tag{30} \]

where \((x_{fmm}, y_{fmm})\) is the gradient of the FMM distance field, \(dist_{fmm}\) is the FMM
distance of the current position, \((x_{goal}, y_{goal})\) is the coordinate of the goal, and \(dist_{laser} \in \mathbb{R}_{\ge 0}^{72}\)
is the feedback of the laser rangers. The high-level decision maker picks waypoints in the global
coordinate, and the output is given by 𝑤𝑝 ≡ (𝑥𝑤𝑝, 𝑦𝑤𝑝) = 𝜋(𝑠). The low-level controller is
defined in ( 29 ), and the output is given by 𝑎 ≡ (𝑢1, 𝑢2) = 𝑐𝑡𝑟𝑙(𝑠, 𝑤𝑝).
The problem is formulated as: given the current state 𝑠 ∈ 𝑆 and the action set 𝐴, find a
policy 𝑤𝑝~𝜋(𝑠) ∈ 𝑅𝑥 × 𝑅𝑦 that maximizes the expected accumulated discounted reward
\[ \pi^{*} = \operatorname*{argmax}_{\pi} \; E_{\,s' \sim P(s,a),\; a \sim ctrl(s,wp),\; wp \sim \pi(s)} \left[ \sum_{t=0}^{+\infty} \gamma^{t} \cdot r(s_t, a_t) \right] \tag{31} \]
where 𝑟(𝑠, 𝑎) is the single-step reward, 𝑃(𝑠, 𝑎) is the state transition model, 𝑐𝑡𝑟𝑙(𝑠, 𝑤𝑝) is
the low-level controller, 𝑎 is the low-level control command, 𝑤𝑝 is the waypoint, and 𝛾 is the
discount factor.
5.3.1. Model Description
In this work, we aim to find the optimal policy for picking the next waypoint, given the
current state. The waypoint can be picked from the entire 2D space, which is too large for Q-learning to model. Also, the model may suffer from typical problems such as over-estimating the
Q-values or updating the actor based on an inaccurate Q-network. Thus, we apply the twin-delayed
DDPG (TD3) (Fujimoto et al., 2018) as our training method. TD3 trains two independent critic
networks and uses the smaller Q-value from them in the update so that the over-estimation
problem can be relieved. The update of the actor network is less frequent than that of the critic
network for higher-quality policy updates.
Although the original TD3 can be applied to RL in continuous space, the behavior of the
exploration can be different in high-level and low-level cases. For example, in low-level control
of manipulators, the actions can be the torques applied on each joint. The effect of the torque is
integrated to finally move a link so that a Gaussian exploration noise can be added to the action
during training to explore possible good actions. However, in our high-level waypoint selection
case, adding Gaussian noise to the waypoint will create a nearby waypoint and does not make a
great difference to the agent’s behavior. Moreover, during the early stage of the training, when
the actor network has not been well-trained, it may sometimes produce a near-fixed output so
that the agent keeps picking waypoints in a small region and spinning in circles. Thus, we use the
epsilon-greedy method instead for exploration, picking random waypoints during training
with a decayed probability. Considering that the validity of waypoints should change sharply
across the boundary of obstacles, the target policy smoothing regularization of the original
TD3 model is also removed. The model architecture is shown in Figure 22.
Figure 22 Training process of the proposed dual-level decision-making framework.
5.3.2. Random Environment Generation
Although DRL has been applied in motion planning problems, many applications assume
that the training environment is identical or similar to the testing environment. In this case, the
agent tends to memorize information about the environment during training, including the
positions and shapes of the obstacles. However, such an agent cannot be
applied to plan the path in new and different environments.
To generalize the agent to different environments, we train the agent in random
environments with static obstacles. Perlin noise (Perlin, 1985), a gradient noise used to generate
visual effects that can increase realism in computer graphics, is applied in this work to generate
random terrain environments. As shown in Figure 23(a), the shape of the obstacles generated by
the Perlin noise can be irregular and non-convex so that the agent can learn the general
knowledge of how to plan a path in different kinds of environments. During training, a new
environment is generated every episode so that the agent does not experience the same map
twice.
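A minimal sketch of this kind of random environment generation is shown below. For simplicity it uses smoothed Gaussian noise (via scipy) as a stand-in for the Perlin noise described in the text; the function name, map size, and obstacle ratio are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def random_environment(size=256, sigma=12.0, obstacle_ratio=0.45, seed=None):
    # Generate an occupancy grid with irregular, non-convex obstacles (True = obstacle).
    # Thresholding a smoothed random field yields blob-like obstacle regions.
    rng = np.random.default_rng(seed)
    field = gaussian_filter(rng.standard_normal((size, size)), sigma=sigma)
    # Pick the threshold so roughly `obstacle_ratio` of the cells become obstacles.
    threshold = np.quantile(field, 1.0 - obstacle_ratio)
    return field > threshold

# A fresh map is generated for every training episode.
occ = random_environment(seed=0)
print(occ.shape, occ.mean())   # fraction of obstacle cells ~ obstacle_ratio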
Figure 23(a) A random environment generated by Perlin noise. The white areas are obstacles. The goal is marked by the red dot,
while the agent state is marked by the green polygon; (b) The observation of the agent. The blue lines show the observation from
laser scanning. The red line marks the direction of the gradient on the Fast-Marching distance map.
5.3.3. Reward Function
Reward items: The motion planning problem can be naturally modeled as a sparse-reward
MDP (e.g., positive reward for success and negative reward for failure). In this work, the reward
function for a time step is defined by ( 32 ):
\[ r(s, a) = r_{fuel} + r_{rg} + r_{col} + r_{angular} + r_{mw} \tag{32} \]
where 𝑟𝑓𝑢𝑒𝑙 is a negative fuel consumption reward per step, 𝑟𝑟𝑔 is a positive reaching-goal
reward when the agent has reached the goal position, 𝑟𝑐𝑜𝑙 is a negative reward when the agent
collides with obstacles, 𝑟𝑎𝑛𝑔𝑢𝑙𝑎𝑟 = 𝑐𝑎𝑛𝑔𝑢𝑙𝑎𝑟 ∙ 𝜓̇² is a negative steering-cost reward, and 𝑟𝑚𝑤 is
a negative reward for missing a waypoint, which penalizes the agent for picking unreachable
waypoints.
One possible problem in long-range planning is that if all the reward terms are given as
constant, the accumulated fuel-consumption cost may exceed the collision cost so that the agent
will not try to consume a lot of fuel to reach the goal and instead try to collide and end the
episode at the beginning of the simulation. Thus, we define a linear collision cost as
\(r_{col} = \left(1 + \frac{dist_{fmm}}{c_{col}}\right) \bar{r}_{col}\), where \(\bar{r}_{col}\) is a constant base collision cost and \(c_{col}\) is a scaling constant.
𝑑𝑖𝑠𝑡𝑓𝑚𝑚 is the FMM distance from the collision position to the goal so that if the agent collides
at a position far away from the goal, it will receive more penalty.
The values of the rewards are set by ( 33 ):
\[ \begin{aligned} r_{fuel} &= -10^{-4} \\ r_{rg} &= 10 \\ r_{col} &= -\left(1 + \frac{dist_{fmm}}{10000}\right) \\ r_{angular} &= -10 \cdot \dot{\psi}^{2} \\ r_{mw} &= -1 \end{aligned} \tag{33} \]
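A minimal Python sketch of this per-step reward is given below; the event flags and the FMM distance at the collision point are assumed to be provided by the simulator, and the function name is illustrative.

def step_reward(psi_dot, dist_fmm_at_collision=None, reached_goal=False,
                missed_waypoint=False):
    # Per-step reward of equations (32)-(33). dist_fmm_at_collision is the FMM
    # distance from the collision point to the goal (None if no collision).
    r = -1e-4                                   # fuel consumption, every step
    r += -10.0 * psi_dot ** 2                   # steering cost
    if reached_goal:
        r += 10.0                               # reaching-goal reward
    if dist_fmm_at_collision is not None:
        r += -(1.0 + dist_fmm_at_collision / 10000.0)   # distance-scaled collision
    if missed_waypoint:
        r += -1.0                               # penalty for an unreachable waypoint
    return r

# Example: a collision far from the goal is penalized more heavily than one nearby.
print(step_reward(psi_dot=0.01, dist_fmm_at_collision=40000.0))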
Variable horizon: This reward is calculated every time step, e.g., every second in the
simulation. However, the agent is making high-level decisions by selecting target waypoints.
Thus, the reward received for each waypoint is calculated by
\[ R(s_t, wp, s_{t+n}) = \sum_{i=0}^{n-1} \gamma^{i} \cdot r(s_{t+i}, a_{t+i}) \tag{34} \]
where 𝑛 is the number of time steps to travel from \(s_t\) to \(s_{t+n}\) by using the controller
𝑐𝑡𝑟𝑙(𝑠, 𝑤𝑝) to approach the waypoint 𝑤𝑝.
The model update thus is given by a variable-horizon n-step return:
\[ Q^{\pi}(s_t, wp_t) \leftarrow R(s_t, wp, s_{t+n}) + \gamma^{n} \max_{wp_{t+n}} Q^{\pi}(s_{t+n}, wp_{t+n}) \tag{35} \]
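The sketch below shows one way the per-step rewards along a waypoint segment could be collapsed into a single high-level transition with a variable-horizon n-step target, following equations (34)-(35). The environment interface (env.step, env.low_level_controller), the actor/critic callables, and the discount value 0.9999 (from Table 8) are assumptions for illustration.

def rollout_to_waypoint(env, state, waypoint, gamma=0.9999, max_steps=10000):
    # Simulate the low-level controller toward one waypoint and accumulate the
    # discounted per-step rewards of equation (34) into a single transition.
    total, n, done = 0.0, 0, False
    while n < max_steps:
        action = env.low_level_controller(state, waypoint)
        state, r, reached_wp, done = env.step(state, action)
        total += (gamma ** n) * r
        n += 1
        if reached_wp or done:
            break
    return total, state, n, done

def n_step_target(R, next_state, n, done, critic, actor, gamma=0.9999):
    # Variable-horizon target of equation (35): R + gamma^n * Q(s', wp').
    if done:
        return R
    next_wp = actor(next_state)
    return R + (gamma ** n) * critic(next_state, next_wp)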
Reward shaping: Notice that the only positive term in the reward function is 𝑟𝑟𝑔, which is
achieved only when reaching the goal. Thus, the reward recorded in the experience buffer will be
sparse, since the agent may fail a lot and it takes multiple steps to reach the goal. This sparsity of
the reward will result in the low efficiency of training. One popular method to improve training
efficiency is reward shaping (Ng et al., 1999). With a carefully designed heuristic reward
function, the agent will tend to pick actions that follow human heuristic knowledge (e.g.,
getting closer to the goal is better) and converge faster.
A heuristic field-based function Ф(𝑠) is applied in this work to improve training performance
while keeping the same optimal policy. For simplicity, some subscripts and input variables in the
following derivation are omitted.
\[ Q_t^{*} = E\left[ \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \max_{wp_{t+n} \in WP} Q_{t+n}^{*} \right] \]

\[ Q_t^{*} - \Phi_t = E\left[ \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \max_{wp_{t+n} \in WP} \left( Q_{t+n}^{*} - \Phi_{t+n} \right) \right] + \left( \gamma^{n}\Phi_{t+n} - \Phi_t \right) \tag{36} \]
Let

\[ \hat{Q}^{*} = Q^{*} - \Phi, \qquad F(s_t, s_{t+n}) = \gamma^{n}\Phi_{t+n} - \Phi_t \tag{37} \]
Then,

\[ \hat{Q}_t^{*} = E\left[ \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \max_{wp_{t+n} \in WP} \hat{Q}_{t+n}^{*} \right] + F(s_t, s_{t+n}) \tag{38} \]
\(F(s_t, s_{t+n})\) is the reward shaping term. \(Q^{*}\) indicates the original Q function of the optimal
policy, and \(\hat{Q}^{*}\) indicates the new Q function of the same optimal policy after introducing the
reward shaping term. The heuristic field-based function Ф(𝑠) is defined as \(\Phi(s) = r_{fuel} \cdot \frac{dist_{fmm}}{v}\),
which is an estimate of the accumulated fuel consumption over the remaining distance to
the goal.
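The potential-based shaping of equations (36)-(38) amounts to adding F = γⁿΦ(s') − Φ(s) to the accumulated segment reward, as in the minimal sketch below; the function names and the example numbers are illustrative, with γ = 0.9999 taken from Table 8.

def potential(dist_fmm, speed, r_fuel=-1e-4):
    # Heuristic potential Φ(s): estimated remaining fuel cost to the goal.
    return r_fuel * dist_fmm / speed

def shaped_segment_reward(R, dist_fmm_now, dist_fmm_next, speed, n, gamma=0.9999):
    # Add the shaping term F = gamma^n * Φ(s') - Φ(s) of equation (37)
    # to the accumulated waypoint reward R of equation (34).
    F = (gamma ** n) * potential(dist_fmm_next, speed) - potential(dist_fmm_now, speed)
    return R + F

# Moving 3,000 m closer to the goal yields a positive shaping bonus.
print(shaped_segment_reward(R=-0.05, dist_fmm_now=20000.0,
                            dist_fmm_next=17000.0, speed=10.0, n=300))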
5.3.4. Prioritized Experience Replay
Training on high-level actions greatly reduces the amount of data for training. To improve
the data efficiency of the experience buffer, a prioritized experience replay (PER) (Schaul et al.,
2015) is applied in this work. The idea of PER is to give higher priority to the experiences that
have larger loss values. The priority of an experience is defined as 𝑝𝑖 = |𝛿𝑖| + 𝜖, where 𝛿𝑖 is the
TD error of the Q-function, and 𝜖 is a small constant to ensure that the priority is greater than
zero. In this work, we initialize the priority of each newly generated experience to 𝐶 + 𝜖, where 𝐶 is a
constant.
During training, the mini-batch of experiences is sampled with probability 𝑃(𝑖), where:
\[ P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}} \tag{39} \]
The parameter 𝛼 determines how much we rely on the priority. When 𝛼 = 0, the priority
plays no role, and the mini-batch is thus drawn from a uniform distribution.
Since the sampling strategy differs from the original sample distribution, a bias is introduced
in the expectation of the loss. The importance sampling (IS) weight 𝑤𝑖 is used to correct the bias,
where:
\[ w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta} \tag{40} \]
When the parameter 𝛽 = 1, the weight 𝑤𝑖 fully compensates for the bias. During training, we
anneal 𝛽 from 𝛽0 to 1, so that 𝛽 = 1 is reached at the end of training.
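A compact numpy sketch of the sampling probability (39) and IS weights (40) is shown below, with α = 0.5 and β = 0.5 as reported later in the training setup; the buffer layout, the weight normalization by the maximum, and the example priorities are illustrative choices, not the dissertation's implementation.

import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.5, beta=0.5, rng=None):
    # Sample indices with P(i) proportional to p_i^alpha (eq. 39) and return the
    # importance-sampling weights of eq. 40, normalized by their maximum value.
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (1.0 / (len(probs) * probs[idx])) ** beta
    return idx, weights / weights.max()

# New experiences get a high initial priority C + eps so they are replayed soon.
priorities = [1.0, 0.1, 5.0, 0.5, 2.0]
idx, w = sample_prioritized(priorities, batch_size=3)
print(idx, w)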
5.3.5. Curriculum Learning
Curriculum learning is the idea of training a model by organizing the data in a meaningful
order, which gradually illustrates more concepts and progressively more complex ones (Bengio
et al., 2009; Soviany et al., 2022). It has been successfully applied to various challenging DRL
problems to improve the model performance (Bing et al., 2022; Gao et al., 2020; Pina et al., 2021;
Yao et al., 2020).
In this research, the curriculum is designed to gradually increase the distance from the
agent’s initial position to the goal position during training. At the early stage of training, the
agent is initialized around the goal, so it has more chance to reach the goal by random
exploration. By gradually increasing the distance, the agent will be initialized far away from the
goal at the late stage of training, and it needs to pick multiple waypoints sequentially to avoid
collision and reach the goal. Notice that as the agent picks waypoints and moves closer to the
goal, the problem becomes a short-range planning one as it is in the early stage, so the
knowledge learned at the early stage is thus essential through the complete training process.
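One simple way to implement such a curriculum is a schedule on the initial agent-to-goal distance, as in the sketch below; the distance bounds and the linear shape of the schedule are assumptions for illustration (the text only states that the distance is increased gradually).

def curriculum_distance(episode, total_episodes=20000,
                        min_dist=2000.0, max_dist=40000.0):
    # Initial agent-to-goal distance: starts near the goal and grows over training.
    frac = min(episode / total_episodes, 1.0)
    return min_dist + frac * (max_dist - min_dist)

# Early episodes start close to the goal; late episodes start far away.
print(curriculum_distance(0), curriculum_distance(10000), curriculum_distance(20000))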
5.3.6. Domain Knowledge Capture and Transfer
We apply transfer learning in this work to improve the model's performance. In this problem,
the agent needs to select a waypoint in the 2D simulation environment. The waypoint is valid
only when located within the range of the laser scans. However, we generate a new random
environment for each episode during training so that the agent can only visit the same state once
and does not have enough data to build spatial knowledge.
Although it is possible to train a waypoint validity classifier with augmented data from the
simulation, the features extracted from this classifier are only geometry-related and do not
consider the agent’s dynamics constraints. As shown in (X. Wang & Jin, 2023), not all features
can be transferred among different tasks, especially for policies trained on agents with different
dynamics models. Thus, we designed the parallel architecture for the critic networks, as shown in
Figure 22. To capture domain knowledge for ship collision avoidance, we trained a classifier to
identify whether a waypoint is valid: If a waypoint is located within the laser scans of the agent,
it is valid; otherwise, it is not.
As shown in Figure 22, the captured knowledge in the “waypoint validity classifier” is
transferred into the left branch of the “critic network” of the DRL agent by copying its two layers.
These layers are frozen during DRL agent training. The right branch of the critic network is
initialized randomly and is supposed to learn non-geometry-related features for motion planning.
The last layer of the left branch uses Tanh as the activation function, which can only produce
outputs in [−1, +1], and the last layer of the right branch uses ReLU as the activation function so
that it can only produce non-negative values. The outputs from the two branches are multiplied
to produce the final Q value, which can be any real number. Since the only positive reward in
this work is received when reaching the goal, a positive Q value indicates that the waypoint
selection is helpful in reaching the goal point. A positive Q value can only be produced when the
output of the left branch is positive so that the critic network is forced to utilize the transferred
geometry knowledge to identify whether a waypoint is valid or not.
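A minimal PyTorch sketch of this parallel critic is given below, assuming a 2D waypoint action and an 82-dimensional state (10 scalar state entries plus 72 laser readings, per equation (30)). The class name, layer widths, and the way the pre-trained classifier layers are passed in are illustrative, not the dissertation's implementation.

import torch
import torch.nn as nn

class ParallelCritic(nn.Module):
    # Two-branch critic: a geometry branch (Tanh output in [-1, 1], first layers
    # copied from a pre-trained waypoint-validity classifier and frozen) multiplied
    # by a task branch (ReLU output >= 0), so a positive Q requires a valid waypoint.
    def __init__(self, state_dim, action_dim, pretrained_layers=None, hidden=512):
        super().__init__()
        in_dim = state_dim + action_dim
        self.geo = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),
        )
        self.task = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.ReLU(),
        )
        if pretrained_layers is not None:
            # Copy the classifier's first two layers into the geometry branch
            # and freeze them so the transferred knowledge is preserved.
            self.geo[0].load_state_dict(pretrained_layers[0].state_dict())
            self.geo[2].load_state_dict(pretrained_layers[1].state_dict())
            for layer in (self.geo[0], self.geo[2]):
                for p in layer.parameters():
                    p.requires_grad = False

    def forward(self, state, waypoint):
        x = torch.cat([state, waypoint], dim=-1)
        return self.geo(x) * self.task(x)

critic = ParallelCritic(state_dim=82, action_dim=2)
q = critic(torch.zeros(4, 82), torch.zeros(4, 2))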
5.3.7. Online Demonstrations
Although the agent will explore the action space during training, the probability of reaching
the goal with random waypoints is very low. As a result, the experience replay buffer in DRL
will be filled mostly with failed cases. To help the DRL agent learn effectively, demonstrations
can be provided to guide the agent in having some successful experiences in the buffer (Ramírez
et al., 2022). Instead of providing demonstrations at the beginning of training, we generate online
demonstrations in each episode when the agent fails to reach the goal. Whenever a failure
happens in the simulation, we make a plan from the last valid state to the goal using a global
planner, simulate with the generated waypoints, and insert the experiences into the experience
buffer. Since the global planner has a global view of the environment, it can usually find a
feasible trajectory to the goal. Although such a trajectory does not consider the agent dynamics
and may lead to a collision in an area filled with obstacles, when the environment has enough
space for the agent to follow the trajectory safely with its low-level controller, a successful
experience can be provided to the agent.
Due to the variable horizon of the proposed framework, a demonstrator can be created with
a popular path planning algorithm to provide demonstrations by selecting waypoints and
performing simulations with them. In this research, we tested two different global planners as
demonstrators: probabilistic roadmap (PRM) (G. Chen et al., 2021) and fast marching square
(FM2) (Beser & Yildirim, 2018). Since the global planner is used to provide a possibly feasible
solution to the problem and does not consider the agent dynamics or any objective other than the
path length, it is not designed to be a powerful planner. For the PRM planner, we only randomly
pick 50 nodes on the map and try to find a path on the small roadmap so that it does not slow
down the training significantly. For the FM2 planner, we generate the desired trajectory by
following the gradient on the FM2 time map and then sample the waypoints evenly on the
trajectory every 6 km. The sampled waypoints of the two planners are shown in Figure 24.
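A sketch of resampling a dense trajectory at a fixed arc-length spacing, as done for the FM2 demonstrator, is shown below; the function name and the straight-line example are illustrative.

import numpy as np

def resample_by_arclength(trajectory, spacing=6000.0):
    # Pick waypoints every `spacing` meters of arc length along a dense (N, 2)
    # trajectory, always keeping the final point.
    traj = np.asarray(trajectory, dtype=float)
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    targets = np.arange(spacing, s[-1], spacing)
    idx = np.searchsorted(s, targets)
    return traj[idx].tolist() + [traj[-1].tolist()]

# A straight 20 km trajectory sampled every 6 km gives waypoints near 6, 12,
# and 18 km plus the endpoint.
dense = np.stack([np.linspace(0, 20000, 2001), np.zeros(2001)], axis=1)
print(resample_by_arclength(dense))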
Figure 24 shows that the path generated by the demonstrators does not always follow the
direction of the Fast-Marching shortest distance. Additionally, these paths do not consider the
dynamics constraints of the agent. Due to its low capability, in some complex environments, the
PRM may fail to find a path. Moreover, since the agent does not have access to the global map, it
may not be able to infer the reason for picking certain waypoints given by the demonstrators.
Although the capability of demonstrators is very limited, the experiments show that the
demonstrations are critical to the success of training.
Figure 24(a) Path generated by PRM demonstrator. The roadmap is in blue, and the path is in red; (b) Path generated by FM2
demonstrator. The color map shows the arrival time map of FM2, and the sampled path is in red.
5.4.Experiments
In this section, we evaluate the proposed framework by the success rate of planning. All the
models are tested on the same test set with 1,000 randomly generated environments. We also
present the agent’s high-level understanding of the problem by visualizing the distribution of the
Q-values and analyzing the effect of the high-level knowledge transfer and demonstrations.
5.4.1. Training Setup
In order to generalize the agent to various environments, we randomly generate a new
environment for each episode, so the agent can only visit each environment once. Hitting the
boundary of the environment is also considered a collision even when there are no obstacles
there. A goal position is first sampled in the environment, and then the initial position of the
agent is selected to ensure there exists a feasible path. The heading angle of the agent is
randomly initialized but it is guaranteed to have enough space in front of the agent. Thus, the
agent does not always face the correct direction at the beginning and may need to make a U-turn
from its initial state to approach the goal. The curriculum learning is designed to initialize the
agent at a close distance to the goal, and gradually increase the initial distance during training.
The range of the environment is 51,200×51,200 meters, and the agent’s speed is 10m/s. The
base time step of the simulation is 1 second, so it may take thousands of simulation steps to
finish an episode. We only store high-level experiences in the experience buffer. The high-level
waypoint selection has a variable horizon, which means that the agent can pick any waypoint in
the environment. Once the agent is within 1,000 meters of the goal or a waypoint, it is identified
as arrival. To prevent the agent from selecting a nearby waypoint and immediately reaching it,
the minimum moving time for each decision is set to 120 simulation seconds. If the agent has
reached a waypoint but has not traveled for 120 seconds, it keeps moving without changing the
target waypoint, so that it will spin around the waypoint until 120 seconds have elapsed. There is no
upper limit for the moving duration of each decision since we encourage the agent to make long-term decisions. In each episode, if the agent does not collide or reach the goal within 20
waypoint selections, the episode is terminated to avoid possible infinite spinning-around
behavior.
At the early stage of the training, the agent has not learned any meaningful information and
the output of the actor network turns out to be zeros. This results in the agent repeatedly picking the
center position of the map. The exploration method of the original TD3 model is adding
Gaussian noise to the action. In our case, this will result in picking a waypoint near the center of
the environment and does not provide much useful information. Thus, we use an 𝜖-greedy
exploration instead of the Gaussian noise. The agent has a chance of 80% to pick a random
position in the global frame as its waypoint at the beginning of training, and this probability
decays to 2% at 10,000 episodes and does not change anymore. This helps the agent to gather
more information by trying to move to different positions. To stabilize the training and avoid any
possible primary bias, we set 2,000 pre-training warm-up episodes where the agent keeps
exploring without training the network.
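A minimal sketch of this decayed epsilon-greedy waypoint selection is shown below; the linear decay shape is an assumption (the text gives only the endpoints of 80% at the start and 2% at 10,000 episodes), and the function names are illustrative.

import random

def epsilon(episode, start=0.8, end=0.02, decay_episodes=10000):
    # Exploration probability: 80% at the start, decaying (linearly here, as one
    # simple choice) to 2% at 10,000 episodes and constant afterwards.
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)

def pick_waypoint(actor_output, episode, env_size=51200):
    # Epsilon-greedy high-level action: with probability epsilon, pick a random
    # waypoint anywhere in the global frame instead of the actor's output.
    if random.random() < epsilon(episode):
        return (random.uniform(0, env_size), random.uniform(0, env_size))
    return actor_output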
Table 8 Parameters of the proposed model
Parameter Value
Warm-up rounds 2,000
Training episodes 20,000
Reward decay 𝛾 0.9999
PER buffer size 50,000
Batch size 512
Critic learning rate 1e-4
Actor learning rate 1e-5
Critic L2 regularization 1e-3
Actor L2 regularization 1e-3
TD3 soft update factor 𝜏 1e-3
𝜖-greedy exploration round 10,000
Due to the dual-level decision-making framework, the high-level experiences can be stored
in a relatively small replay buffer with a size of 50,000. The initial parameters of the prioritized
experience replay are set to 𝛼 = 0.5 and 𝛽 = 0.5, while 𝛽 anneals to 1 at the end of training. All
new experiences are given a high priority to increase the probability of sampling them in the
mini-batch training and updating their priorities.
Other parameters of the model are shown in Table 8.
5.4.2. Geometry Knowledge Transfer
In this experiment, the general geometry knowledge about the terrain is considered the
domain knowledge to be captured and transferred into the DRL agent. We evaluate the effect of
the geometry knowledge transfer of the two trained models with the settings described in Section
5.4.1. The model GeoTran stands for the model using the parallel critic architecture and transfers
geometry knowledge from a pre-trained classifier. The model noTran stands for the model using
a common multi-layer neural network as the critic network. The size of the noTran critic network
is designed as 512-512-512-1, which has the same architecture as one branch of the GeoTran
critic network. The output layer of the noTran critic network uses a linear activation function,
while all other layers use ReLU as the activation function.
A waypoint validity classifier is trained to tell whether a waypoint is located within the
range of laser scans, and the first two layers are transferred to the first two layers of the left
branch of the GeoTran critic network. These two layers are fixed during training so that the
agent is forced to evaluate the Q-value with the geometry features.
Figure 25 Success rate curves of GeoTran and noTran.
The training curves are shown in Figure 25. The success rate is calculated every 100
episodes, and the solid curves show the smoothed success rate. After transferring the high-level geometry knowledge, GeoTran significantly improves the success rate. The two models are
tested in 1,000 random environments, and the results are shown in Table 9. During testing, the
minimum moving time of each decision is changed from 120 seconds to 60 seconds so that the
agent does not make undesired turns for picking a waypoint located within the 120-second range.
The agents are randomly initialized in the test environments with a uniform distribution, so the
average distance from its initial position to the goal is shorter than the long initial distance at the
late stage of training due to curriculum learning.
Figure 26 The visualization of Q-values of three models. Higher values are shown in brighter colors. (a) Q-values of noTran; (b)
Positive Q-values of GeoTran; (c) Positive Q-values of FM2Step.
Since the agent is making decisions in the high-level 2-D space, its behavior can be
interpreted from the distribution of the Q-values in the 2-D space. Figure 26 shows the
visualization of the Q-values in a sample environment of noTran and GeoTran. We evenly
sampled 1,681 positions in the environment, queried the critic network about their Q-values, and
used linear interpolation to create the Q-value distribution. Figure 26(a) shows the Q-values of
noTran. The Q-value samples of noTran are all negative, and the resulting distribution does not
have any value change around the boundary of the obstacles. Figure 26(b) shows the positive Q-values of GeoTran. Since the only positive reward is received when reaching the goal, a positive
Q-value implies that the waypoint is a proper selection. GeoTran shows larger Q-values in the
open area in front of it and smaller Q-values in the correct direction, indicating that it has learned
knowledge about its motion pattern and desired motion direction.
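The following Python sketch outlines this visualization procedure. The critic and observation interfaces, the world extent, and the clipping to positive Q-values are illustrative assumptions rather than the exact code used to produce Figure 26.

```python
import numpy as np
import torch
from scipy.interpolate import griddata
import matplotlib.pyplot as plt

def visualize_q(critic, obs, world_size=12_000.0, n=41, fine=400):
    """Query the critic on an n x n grid of candidate waypoints (41 x 41 = 1,681)
    and linearly interpolate the Q-values onto a finer grid for plotting."""
    xs = np.linspace(0.0, world_size, n)
    xx, yy = np.meshgrid(xs, xs)
    waypoints = np.stack([xx.ravel(), yy.ravel()], axis=1)           # (1681, 2)

    with torch.no_grad():
        obs_batch = torch.as_tensor(obs, dtype=torch.float32).repeat(len(waypoints), 1)
        act_batch = torch.as_tensor(waypoints, dtype=torch.float32)
        q = critic(obs_batch, act_batch).squeeze(-1).numpy()         # (1681,)

    fx = np.linspace(0.0, world_size, fine)
    fxx, fyy = np.meshgrid(fx, fx)
    q_fine = griddata(waypoints, q, (fxx, fyy), method="linear")     # linear interpolation

    plt.imshow(np.clip(q_fine, 0.0, None), origin="lower",
               extent=[0.0, world_size, 0.0, world_size], cmap="viridis")
    plt.colorbar(label="Q-value (clipped at 0)")
    plt.show()
```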
During testing, if the agent does not reach the goal within 20 decisions, the simulation will
be terminated. Thus, the success rate and collision rate do not sum up to 1. The test results in
Table 9 show that the transfer of geometry knowledge has a positive effect. The success rate
rises from 25.3% to 46.0%. For GeoTran, 6.3% of the tests ended without reaching the goal or
collision. This happens when the agent cannot find a good waypoint and keeps selecting nearby
waypoints and spinning in circles to avoid collision. The success rate of GeoTran is still low, and
thus we use demonstrations to improve the model’s performance.
5.4.3. Online Demonstrations
To improve the exploration quality, we add expert demonstrations to the experience buffer.
Since the agent makes high-level decisions, the demonstrations should also be represented in
high-level action space as waypoints. In this experiment, we use two different types of global
planners as demonstrators: probabilistic roadmap (PRM) and fast marching square (FM2).
Whenever a collision happens in the simulation, we reset the agent to the last valid state, have the demonstrator make a global plan online from the current position to the goal, simulate with the
generated sequence of waypoints, and add the demonstration trajectory into the experience buffer.
The purpose of the demonstrator is to correct the agent’s wrong decisions interactively and
show potentially good ones when a collision happens. Therefore, the demonstrator is not
supposed to be a powerful or optimal planner. Also, the demonstrator makes the plan based on
global information so the agent may be unable to infer the reason for selecting these waypoints.
Moreover, it is not guaranteed that the demonstrator can generate successful experiences. The
PRM demonstrator we use has just 50 vertices and does not always find a solution. Although
FM2 always finds a smooth trajectory, we sample the FM2 trajectory every 6,000 meters so the
demonstrator cannot make smooth turns in a small region. The capability of the two
demonstrators is also tested on the test set. As shown in Table 9, PRM has a success rate of 76.2%
and a collision rate of 20.5%. In the remaining 3.3% of cases, PRM failed to find a solution. The
FM2 demonstrator has a much better performance. It can always provide solutions and has a
success rate of 91.8%.
Table 9 Test results of the models and demonstrators
Model Success (%) Collision (%)
noTran 25.3 74.4
GeoTran 46.0 47.7
PRMSeq 71.6 22.5
FM2Seq 75.1 21.2
PRMStep 78.8 16.4
FM2Step 79.3 17.1
PRM(global) 76.2 20.5
FM2(global) 91.8 8.2
In this experiment, we tested two different demonstration strategies:
(1) Sequence: When a collision happens, reset to the last valid state, generate a sequence of
waypoints with the demonstrator, and simulate the complete trajectory.
(2) Step: When a collision happens, reset to the last valid state, generate one waypoint with
the demonstrator, simulate with this waypoint, and allow the agent to resume control.
The Step demonstration strategy focuses on error correction instead of showing a complete
successful trajectory. However, if the agent keeps colliding with the obstacles after it resumes
control, the demonstrator can still show a complete trajectory with the step-by-step error
correction.
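The following Python sketch contrasts the two strategies with a single flag. The environment, buffer, and demonstrator interfaces (reset_to, plan, step, add) are hypothetical placeholders rather than the actual simulator API.

```python
# Hypothetical interfaces: env.reset_to(state), env.step(waypoint) -> (next_state, reward, done),
# demonstrator.plan(start, goal) -> list of waypoints, buffer.add(transition).
def demonstrate_on_collision(env, buffer, demonstrator, last_valid_state, goal, step_mode=True):
    """Insert expert experiences after a collision: one waypoint (Step) or a full plan (Sequence)."""
    env.reset_to(last_valid_state)
    waypoints = demonstrator.plan(last_valid_state.position, goal)  # PRM or FM2 global plan
    if not waypoints:
        return  # the demonstrator may fail (e.g., a sparse 50-vertex PRM)

    if step_mode:
        waypoints = waypoints[:1]  # Step strategy: correct a single decision, then hand control back

    state = last_valid_state
    for wp in waypoints:
        next_state, reward, done = env.step(wp)            # low-level controller tracks the waypoint
        buffer.add((state, wp, reward, next_state, done))  # stored as an expert (demonstration) experience
        state = next_state
        if done:
            break
```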
We trained four models:
(1) PRMSeq: Uses PRM as the demonstrator; the demonstrator generates a complete sequence
of decisions, and the expert experiences are complete trajectories.
(2) PRMStep: Uses PRM as the demonstrator; the expert experiences are single-step error
corrections.
(3) FM2Seq: Uses FM2 as the demonstrator; the demonstrator generates a complete sequence
of decisions, and the expert experiences are complete trajectories.
(4) FM2Step: Uses FM2 as the demonstrator; the expert experiences are single-step error
corrections.
Figure 27 Success rate curves of models with online demonstration.
The training curves are shown in Figure 27. When using complete demonstration sequences, FM2Seq has a higher success rate than PRMSeq. However, PRMStep and FM2Step
exhibit similar success rates. Despite the significant difference in the capabilities of the two
teachers, the agent learns better when the teaching strategy involves allowing the agent to
explore independently and only providing a single-step demonstration when the agent makes a
mistake.
Figure 28 Ratio of demonstration experiences during training.
The ratio of demonstration experiences to the total experiences in each episode is shown in
Figure 28. In error correction, a single-step demonstration is given only when the agent makes a
mistake, resulting in a maximum demonstration ratio of 0.5. As the agent's performance improves,
the demonstration ratio drops during training. The error-correction strategy has a smaller ratio of
demonstration experiences but results in better model performance. The four models are tested
on the same test set as in Section 5.4.2 and the results are shown in Table 9. The PRMStep and
FM2Step outperform the other two models using complete demonstration trajectories. Both
PRMStep and FM2Step have a success rate of around 79%, which implies that the error-correction online demonstration does not rely on the capability of the demonstrator.
Figure 29 Sample test cases of FM2Step. The trajectory is shown with the green curve. The goal position is marked by the red dot.
The blue dots mark the waypoints the agent selected.
Figure 29 shows some test cases and the trajectory generated by FM2Step. The agent has the
flexibility to select any waypoint within the global frame. Interestingly, it occasionally chooses
waypoints that are far away, even when there is no penalty for selecting a nearby waypoint. This
behavior is observed primarily in open areas. Conversely, when the agent is near obstacles, it
tends to select nearby waypoints so that it can exercise more careful control.
Figure 26(c) shows the visualization of the positive Q-values in a sample environment of
FM2Step. After integrating the online demonstration, the agent not only enhances its
performance but also gains a deeper understanding of the motion planning task. The positive Q-values are spread across the open area surrounding the agent, with the highest values occurring in
the desired motion direction. Waypoints closer to the agent have lower Q-values, indicating that
the agent has learned that selecting a nearby waypoint leads to a sharp turn, which is undesirable.
5.5. Results and Discussion
Domain geometry knowledge and task features: In this work, we proposed a dual-level
decision-making framework and transferred high-level geometry knowledge to the critic network.
The parallel architecture of the critic network allows the agent to keep the transferred knowledge
but at the same time build its own feature space. This method results in an interpretable critic
network, as shown in Figure 26(c). The positive Q-values at a given position indicate the agent's
preference for selecting that waypoint. The high-level transferred geometry knowledge helps
construct the boundary of the Q-values, ensuring that positive values are mostly distributed in the
open area surrounding the agent. The Q-values around the agent have smaller values, indicating
that the agent has learned the task feature knowledge that selecting a nearby waypoint may cause
an undesired sharp turn due to its dynamics constraints. This kind of feature knowledge is
constructed within the agent’s own feature space embedded in the right side of the critic network
as well as the actor network shown in Figure 22, since the pre-trained waypoint validity classifier
does not have any knowledge about the agent’s dynamics model.
Dual-level coordination: Our dual-level control framework trains the agent to acquire high-level knowledge about selecting waypoints. This learned high-level knowledge tends to guide the
agent to select such waypoints that are desirable for the agent to maneuver around the obstacles
given its dynamics constraints, signifying the implicit coordination between the high-level
decisions and the lower-level control. The reward function does not penalize selecting shorter-distance waypoints. Therefore, there is no distinction in selecting different numbers of waypoints
when traveling the same distance in the same direction. From Figure 29, however, we can
observe that the agent occasionally selects long-distance waypoints in open areas. One possible
reason is that each episode terminates after the agent has made 20 decisions during training. If
the agent keeps selecting short-distance waypoints, it cannot reach the goal in 20 selections and
will not receive any actual positive reward. Therefore, the shorter-distance waypoints are
reserved for the areas around the obstacles. As a result, this waypoint selection strategy helps the
agent maneuver carefully through narrow pathways.
The critical role of teaching through demonstration: Some environments are too complex for
the agent to find a feasible path by random exploration. To ‘teach’ the agent to explore more
effectively during training, online demonstrations are applied in this work. Due to the dual-level
framework, the agent has the flexibility to select any waypoint in the global frame, and thus any
path planning algorithm can be applied to provide demonstrations. In this work, we use two
types of global planners as demonstrators. These demonstrators make their plans based on global
information and do not consider the dynamics model of the agent. Although the demonstrators
can only provide sub-optimal solutions and may fail sometimes, the test results in Table 9 show
that the demonstrations are critical to the model performance.
Error correction vs. trajectory completion: We tested two demonstration strategies:
sequence-based trajectory completion and stepwise error correction. While the FM2
demonstrator itself has a high success rate, using complete FM2 trajectories does not lead to
better model performance. The results in Table 9 suggest that error correction is the key to
successful training. This interactive training strategy does not rely on the quality of the
demonstrations. Despite the significant difference in the capabilities of the PRM and FM2
demonstrators, PRMStep and FM2Step perform similarly in terms of success rates. The
performance of PRMStep exceeds that of its demonstrator after implementing the error correction.
Computing optimization: Only the high-level decision maker is trained by DRL, so the experience buffer contains only high-level decisions and is very compact. The experience buffer
in this work has a relatively small size of 50,000. The model is trained on a high-performance
computing cluster and can be efficiently trained in just 20,000 episodes using a single computing
node with one GPU. No parallel computing is utilized in the simulation. PRMStep takes 10.5
hours with an AMD EPYC 7313 CPU and an NVIDIA A40 GPU, and FM2Step takes 17.5 hours with an Intel Xeon E5-2640 v4 CPU and an NVIDIA Tesla P100-PCIE GPU.
5.6. Summary and Findings
In this research, we proposed a dual-level DRL framework for local trajectory planning
problems of autonomous vehicles. The agent can use the high-level decision-maker to select its
target waypoint based on local observations and travel to the waypoint with a low-level
controller. We implemented the TD3 model and developed a parallel architecture for the critic
network. This architecture enables the transfer of high-level geometry knowledge while still
allowing the agent to construct its own feature space during training. Such a framework
significantly enhances the interpretability of the model. The agent's understanding of the
environment can be inferred from the distribution of the Q-values in the 2-D world frame, as
shown in Figure 26.
In this work, we use a simple linear controller as the low-level policy model. In our future
work, we plan to employ a more sophisticated low-level planner, such as model predictive
control, to enhance the agent's ability to navigate in more complex environments.
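As an illustration, a proportional waypoint-tracking controller of this simple linear kind is sketched below. The unicycle-style state, the gains, and the saturation limits are assumptions made for the sketch, not the exact dynamics model or controller used in this work.

```python
import numpy as np

def linear_controller(state, waypoint, k_v=0.5, k_w=1.0, v_max=10.0, w_max=0.2):
    """Proportional waypoint tracking: returns (surge speed, turn rate) commands."""
    x, y, heading = state
    dx, dy = waypoint[0] - x, waypoint[1] - y
    distance = np.hypot(dx, dy)
    heading_error = np.arctan2(dy, dx) - heading
    heading_error = (heading_error + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi]
    v = np.clip(k_v * distance, 0.0, v_max)          # slow down as the waypoint is approached
    w = np.clip(k_w * heading_error, -w_max, w_max)  # turn toward the waypoint
    return v, w
```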
The current training does not account for dynamic obstacles or environmental disturbances.
Exploring this framework in time-varying environments is a promising direction for future
research.
Additionally, this dual-level framework can be expanded to address more complex robotic
planning tasks. For instance, in robot manipulator control, the high-level action space could
represent the motion of the end effector, while the low-level control could involve a high-dimensional path planner for joint control.
From the experiment results described above, the following conclusions can be drawn.
• A dual-level DRL framework can help the agent construct the high-level knowledge of
the problem and keep a flexible horizon that can be generalized to trajectory planning
problems in different environments. The dual-level structure also improves
interpretability and allows the agent to make informed decisions by integrating high-level
domain knowledge, such as geometric insights, into its learning processes and, hence, its
operational strategy.
• While the dual-level framework makes the agent learn only the high-level knowledge for
selecting waypoints, the learned high-level knowledge tends to be consistent with the
desirability of the lower-level controller if the control process is involved in training and
the reward function is designed to reflect the desirability.
• In complex tasks such as long-range motion planning problems, the experience quality is
critical to the model performance. A pure trial-and-error learning strategy leads to sparse successful experiences, which results in low performance, as shown in the
experiments. The introduction of teaching through online demonstrations significantly
enriches exploration quality and boosts agent performance. The results of applying two
different path planning algorithms show that interactive error correction is more effective
than complete trajectory demonstrations. This finding suggests that the sophistication of
the demonstrator algorithm is less critical when employing this approach.
• The model is optimized for efficiency, storing only high-level actions, rewards, and state
transitions in a compact experience buffer of 50,000 units. This setup enables training to
be completed overnight on a single computing node with one GPU.
Chapter 6. Conclusion
6.1. Summary of Dissertation
Due to the complexity of motion planning problems for autonomous systems and the real-time requirement of their decision making, pipeline methods are popular because they split the problem into different stages, including global path planning and local trajectory planning, so that they can solve complex and large-scale global planning problems with a high-level global planner and satisfy the real-time requirement with an agile local planner. However, due to the uncertainty
from the unpredictable motion of dynamic obstacles in the long term, most current autonomous
systems overlook collision risks in the long-range global planning phase. Instead, they rely on
the reactive behavior of the local planner to avoid collisions. This reliance can compromise the
safety of the overall global plan. Sometimes the system may need to deal with multi-agent or
multi-objective planning problems, which makes it extremely difficult for classic planning
algorithms to solve the problem in reasonable time. Machine learning tools, particularly bio-inspired optimization algorithms, are adaptable for addressing complex issues. Nonetheless, their
tendency to converge prematurely and their sensitivity to hyper-parameter settings make it
challenging to deploy them in large-scale path planning in complex environments. Deep learning
and reinforcement learning have become increasingly popular for tackling motion planning
problems, whether through an end-to-end approach or by substituting one or several components
within a pipeline framework. Yet, the effectiveness of these neural networks depends heavily on
the quality of data, and their lack of interpretability makes it difficult to identify errors or fine-tune the model on specific tasks.
In this dissertation, we follow the pipeline motion planning framework and tackle the path
planning problem in complex and dynamic environments and the local trajectory planning problem
in large-scale environments. To address these challenges, we focus on three specific issues: (1)
evaluating encountering risk and constructing safer plans under uncertainty based on the current
observation; (2) formulating a robust planning strategy to efficiently solve the multi-objective
planning problem in complex and large-scale environments; and (3) constructing an interpretable
real-time decision-making framework for local planning that uses local observations while accounting for agent dynamics.
In dynamic environments, a great source of uncertainty is the unpredictability of dynamic
obstacles. We demonstrate an intention estimation method to estimate the future motion of
dynamic obstacles and model the encountering risk as a Gaussian distribution. By integrating
risk assessment into the path planning process for autonomous ships, the duration of
encountering other ships can be reduced. This, in turn, alleviates the need for the autonomous
ship to face the complex multi-ship encountering situations and improves the safety of
navigation.
In real-world scenarios, the autonomous agent needs to plan the path in large-scale
environments with obstacles of diverse shapes and sizes. The robustness of the planner is critical
for the effectiveness of the plan. To address this, we develop a “split and focus” mechanism that
can split the large-scale planning problem into multiple smaller-scale sub-problems for the genetic algorithm. This approach ensures that the planner remains robust and does not encounter
scalability issues. Through the integration of this mechanism with an innovative collision-area
cost function, the Focus Genetic Algorithm consistently demonstrates strong and reliable
performance across various large-scale and complex environments.
To tackle dynamics-constrained, variable-horizon local planning, we derive a dual-level deep reinforcement learning framework that can construct local plans that adhere to the
agent's dynamics constraints, while relying on local information in uncertain scenarios.
Knowledge transfer and online demonstration methods are applied to help the model construct
interpretable domain-specific feature space. The proposed framework can be trained efficiently
since only the high-level actions, rewards, and state transitions are stored in the experience buffer.
The experimental results demonstrate that the dual-level DRL framework can be generalized
across various environments, provide an interpretable understanding of the environment, and
create variable-horizon plans while complying with the agent's dynamics constraints.
6.2. Contributions
Motion planning is a crucial functionality in autonomous systems, tasked with managing
varying scales of complexity and ensuring robust performance. The main challenges arise from
the time complexity involved in planning and the real-time demands of control. Our research
makes several significant contributions to the motion planning of autonomous systems.
Firstly, we have introduced an intention estimation method for long-range planning
problems that features a large number of dynamic obstacles. To enhance the efficiency of this
estimation, a destination tree algorithm was developed, which reduces the time complexity
involved in intention estimation in large-scale and complex environments. Additionally, by
implementing a long-term encountering risk evaluation approach in dynamic settings, the agent
ensures that no long-term risks are overlooked, leading to the identification of safer global paths.
To enhance the scalability and generalizability of optimization-based path planning methods,
we developed a novel collision penalty function to tackle the issue of infeasible solutions
produced by bio-inspired algorithms due to premature convergence. This approach allows the
system to adapt to environments that are large in scale and contain obstacles of various shapes
and sizes. Furthermore, to enhance the quality of the paths generated, we introduced a split-and-focus mechanism that prevents the algorithm from settling on simpler, lower-quality solutions.
We developed a dual-level deep reinforcement learning framework to create dynamics-compliant local plans by training the agent to select high-level waypoints and utilize a low-level
controller for navigation. To enhance the model's interpretability, we applied domain knowledge
transfer. We also boosted the model's generalizability and performance by training the agent in
random environments, using online demonstrations from traditional planning algorithms. This
dual-level framework can be trained efficiently using only high-level transition experiences, and
its performance is independent of the demonstrating algorithms' capabilities.
In summary, these contributions represent a significant step towards developing a general
and robust autonomous system. They offer a scalable solution to motion planning challenges and
enhance the interpretability of the training and decision-making processes within autonomous
systems.
6.3. Future Work
In this dissertation, we have developed innovative algorithms focused on enhancing motion
planning for autonomous systems, with significant advancements in both global path planning
and local trajectory planning. Looking ahead, future work will delve into several critical areas to
further refine these systems. Firstly, enhancing safety is paramount, and we need to mitigate risks
associated with environmental disturbances and unexpected movements of dynamic obstacles.
Bridging the gap between simulation and reality is crucial, making the development of machine
learning components that can adapt and improve through online learning from real-world
deployments a promising avenue for exploration. Secondly, exploring inter-agent and cross-task knowledge transfer will be important. Questions such as what constitutes the optimal
strategy for knowledge transfer, how to define the similarity among tasks, and what types of
knowledge are transferable will be central to future research. Lastly, enhancing the
interpretability of models to improve human understanding of agent behaviors remains a key
objective. This will support more effective multi-agent interactions and encourage deeper
collaboration between humans and autonomous systems, which is vital for operating in complex
environments.
Abstract
Motion planning has been a hot research topic in robotics and is a vital functionality for autonomous systems. Its primary objective is to discover the most efficient path, accounting for single or multiple objectives, while also adhering to the dynamic constraints of the agent and avoiding collisions with obstacles. Given the complexity of addressing this challenge in large-scale and dynamic environments, coupled with the need for real-time control, pipeline planning methods have gained popularity. These methods typically consist of two main components: global path planning, which generates a global path from the origin to the destination, and local trajectory planning, which generates the short-term trajectory in real time. This dissertation tackles the computational challenges associated with global planning in extensive environments, as well as the necessity of identifying dynamic-compliant trajectories during local planning. To reduce the collision probability caused by the uncertain motions of other agents, we developed a risk-aware A* algorithm to minimize the encountering risk in long-range path planning problems in dynamic environments by incorporating the intention estimation method. To solve the efficiency issue with large-scale planning problems, we developed a Focus Genetic Algorithm (GA-focus) that can solve multi-objective path planning problems in large-scale and complex environments. By introducing a collision area cost and the self-improving process, our GA-focus algorithm can overcome the early-convergence problem of evolutionary algorithms and is guaranteed to converge to a feasible solution. Lastly, we introduced a dual-level deep reinforcement learning framework that combines high-level policy learning and low-level integration of a controller and offers an efficient and adaptable solution for navigating through local environments while considering dynamic constraints. Domain knowledge transfer and online demonstrations can be integrated during training to significantly improve the agent's performance in long-range trajectory planning tasks. By enhancing navigation safety, optimizing efficiency, and increasing adaptability to various planning conditions, these algorithms can significantly enhance the overall effectiveness and reliability of autonomous systems.