Autonomous Obstacle Avoidance by Learning Policies for Reference Modification
Benjamin Evans¹, Herman A. Engelbrecht¹ and Hendrik W. Jordaan¹

¹Electrical and Electronic Engineering Department, Stellenbosch University, Stellenbosch, 7600, South Africa. {19811799, hebrecht, wjordaan}@sun.ac.za

arXiv:2102.11042v2 [cs.RO] 4 May 2021

Abstract— The key problem for autonomous robots is how to navigate through complex, obstacle-filled environments without colliding with obstacles. The navigation problem is split into generating a global reference path to the goal and then using a local planner to track the reference path and avoid obstacles by generating velocity and steering references for the robot control system to execute.

This paper presents a hybrid local planning architecture that combines a classic path following algorithm with a deep reinforcement learning agent. The novel "modification planner" uses the path follower to track the global plan and the deep reinforcement learning agent to modify the references generated by the path follower to avoid obstacles. Importantly, our architecture does not require an updated obstacle map and uses only 10 laser range finders to avoid obstacles. The modification planner is evaluated in the context of F1/10th autonomous racing and compared to a mapless navigation baseline, the Follow the Gap Method and an optimisation based planner. The results show that the modification planner achieves faster average times than the baseline mapless navigation planner, with a 94% success rate that is similar to the baseline.

Fig. 1. Examples of Autonomous Navigation: The green dotted line is the global plan, the blue blocks are the obstacles and the red line is the path taken by the vehicle. Path following algorithms are able to perfectly follow the global plan, yet crash into unknown obstacles. Reinforcement learning agents are able to avoid obstacles, yet suffer from being inefficient. The lower image shows the desired behaviour, which we create in the proposed modification planner.

I. INTRODUCTION

The general problem of robot navigation is to plan and execute a trajectory that moves a robot from a starting location to a goal location without colliding with any obstacles. Successful approaches to robot navigation have implemented a hierarchical approach by splitting the problem into global planning, generating a high-level plan before the mission begins, and local planning, following the global plan while avoiding obstacles [1]. Local planning is a difficult problem to solve because it has the dual objective of (1) trying to follow a predetermined path and (2) reacting appropriately to uncertainty in the environment, namely smoothly avoiding obstacles.

In this paper, we address the problem of local planning without an updated obstacle map, which involves using the global plan and onboard sensor readings to generate velocity and steering commands that follow the global reference while avoiding obstacles. We present our novel approach, a hybrid local planner that integrates a path follower and a deep reinforcement learning (RL) agent that modifies the steering reference generated by the path follower to accomplish the dual objective of tracking a reference path while also smoothly avoiding obstacles. We call our approach a reference modification planner. The advantage of our architecture is that it can learn to avoid obstacles more smoothly than a mapless navigation planner and does not require an updated obstacle map as optimisation based planners do. We evaluate our novel approach in the context of F1/10th autonomous racing and compare it to current approaches, which include a mapless navigation planner, the Follow the Gap Method, and an oracle optimisation based planner.

The rest of the paper is summarised as follows. We first discuss previous attempts to solve the local planning problem in Section II. This is followed by a theoretical overview of our proposed reference modification planner in Section III. In Section IV we explain the components that we use in building our solution, and in Section V the learning formulation for the Reinforcement Learning (RL) agent is explained. Lastly, in Sections VI and VII we detail the evaluation methodology and present our results.
II. RELATED WORK

Different methods of local planning, i.e. following a global plan while avoiding obstacles, have been developed. We study reactive methods [2], [3], classical planning techniques [4], [5], learning based approaches [6], [7] and hybrid systems [8], [9].
The earliest obstacle avoidance algorithms were reactive algorithms (such as "The Bug" algorithm), which generate steering references based on the surrounding environment at each step [2]. One of the most recently proposed reactive obstacle avoidance algorithms is the "Follow the Gap Method" (FGM) [3]. The FGM uses a dense LiDAR scan to navigate away from the nearest obstacle into free space. Reactive methods do not have a model of the vehicle or the ability to plan over multiple timesteps, and thus they cannot guarantee optimality.

Classical local planning methods aimed to improve on reactive methods by storing a map of the environment, which can be used to plan a trajectory multiple time-steps in advance. Classical methods typically build a map using a Simultaneous Localisation and Mapping (SLAM) algorithm [10], and then use the map to plan and optimise a collision-free trajectory [4]. The trajectory is then followed using a path tracking algorithm such as Pure Pursuit [5]. Classical planning and path tracking methods can generate and execute near-optimal trajectories; however, because map building is computationally expensive and slow, these solutions are not feasible for fast-moving, dynamic environments such as autonomous racing [11].

Due to the hardware requirements and constraints of classical planning solutions, machine learning (ML) methods have been applied to the problem of local planning for autonomous navigation [12]. Neural networks, trained with reinforcement learning (RL) [6] and imitation learning (IL) [7], have been used to associate sensor readings (camera images and LiDAR scans) directly with navigation commands. While using an RL agent as a local planner does not require the expensive maintenance of a map, RL agents exhibit poor long term performance and have been shown to require a high-level planner in certain circumstances [13].

In response to the problem of neural networks exhibiting poor long-term performance in complex environments, hybrid architectures have been introduced which combine high-level planners with neural networks. Paths generated with A* search have been used in conjunction with a neural network as a local planner and shown to be able to execute long missions [8], [9]. The above methods completely replace the local planner with an RL agent. A recent survey notes that using machine learning (ML) for specific subsystems tends to lead towards better overall performance [12]. We aim to improve upon the performance of these solutions by using an RL agent to learn the local planning subsystem of obstacle avoidance, and to improve the performance of the vehicle by giving the agent less influence over the navigation. Our solution does not require an updated obstacle map, as classical model-based planning solutions do, and can harness the adaptability of learning techniques to form a planner which is able to smoothly avoid obstacles while maintaining a global trajectory.
III. THEORETICAL DESIGN

We address the problem of generating navigation references for autonomous vehicles that result in a reference trajectory being followed and obstacles being smoothly avoided, as shown in Figure 1. The solution should track a reference trajectory and deviate as necessary to safely and efficiently avoid the obstacle. The desired behaviour is analogous to driving a car, seeing an obstacle in the road, swerving around it and then returning to the original lane.

We present a hybrid local planner that features a pure pursuit path follower in parallel with an RL agent. Figure 2 shows how we combine the path follower and an RL agent in parallel such that the agent can modify the path follower references to avoid obstacles. This system is designed so that the path follower maintains the reference path and the agent deals with the uncertainty of obstacles.

Fig. 2. Proposed Local Planner Architecture: A path follower to track the reference path is combined in parallel with an RL agent that modifies the path follower references.

The path follower uses the current location of the vehicle and a list of way-points to calculate the required steering and velocity commands to follow the global plan. An RL agent is added in parallel to avoid obstacles by modifying the steering angle calculated by the path follower. We train the agent by punishing it for crashing and rewarding it for progress along the global plan. These two components in the reward signal represent the dual objectives of achieving efficient, collision-free paths.

Where previous attempts have used RL agents to replace the local planner and learn to navigate, we use an RL agent to learn to navigate as a subsystem within our local planner. This is done with the aim of better utilising the global path and avoiding obstacles more smoothly. The aim is that the agent will learn to swerve around obstacles where needed, without modifying the behaviour of the vehicle when no obstacles are present.
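To make the parallel combination concrete, the following minimal Python sketch shows one planning step of the modification planner under stated assumptions: `delta_pp` comes from the pure pursuit follower described in Section IV, the `agent` object and its `act` method stand in for the trained policy of Section V, and the steering limit is taken from Table I. It illustrates the architecture rather than reproducing the exact implementation.

```python
import numpy as np

MAX_STEER = 0.4  # rad, maximum steering angle (see Table I)

def modification_planner_step(delta_pp, scan, velocity, steering, agent):
    """One planning step: combine the path follower reference with the
    RL agent's modification in parallel, as in Figure 2.

    delta_pp : steering reference from the pure pursuit path follower
    scan     : the 10 sparse laser range finder readings
    agent    : a trained policy exposing an act(state) method (an assumption)
    """
    # The agent observes the vehicle state, the path follower reference and the scan.
    state = np.concatenate(([velocity, steering, delta_pp], scan))
    delta_nn = float(agent.act(state))

    # The modification is added to the path follower reference and the result
    # is clipped to the steering limits before being sent to the control system.
    return float(np.clip(delta_pp + delta_nn, -MAX_STEER, MAX_STEER))
```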
IV. PRELIMINARIES: NAVIGATION STACK

We design our system in the context of a standard perception, planning and control stack, as shown in Figure 3. Before the mission begins, the global planner uses a map of the environment to generate a reference path of way-points that connect the start and goal locations. The map does not contain any dynamic obstacles. During the mission, the local planner and control system operate continuously to move the vehicle to the goal location.

Fig. 3. Planning, Perception, and Control Navigation Stack

We use a pure pursuit path follower which navigates towards a way-point situated at a fixed look-ahead distance in front of the vehicle. Figure 4 shows how the vehicle and way-points are modelled so that the steering angle can be calculated. Using the notation from Figure 4, the desired steering angle is calculated according to

\delta_{pp} = \arctan\left( \frac{2 L \sin(\alpha)}{l_d} \right), \qquad (1)

where L is the length of the vehicle body, l_d is the look-ahead distance to the way-point being pursued and α is the angle between the vehicle's current heading and the heading to the way-point.

Fig. 4. Pure Pursuit Path Follower: R is the turning radius, g is the goal point, α is the angle to the goal and δ is the steering angle which is calculated.

The velocity is calculated as the fastest the vehicle can safely travel while turning at the current steering angle. The velocity is calculated according to the maximum amount of lateral friction the vehicle can withstand, relative to the force exerted by the car when turning at the current steering angle. We compute the velocity as

V = s_f \sqrt{ \frac{b g}{\tan(|\delta_{ref}|)/l} }, \qquad (2)

where b is the coefficient of friction, g is the gravitational acceleration, l is the wheelbase length and s_f is a safety factor; the vehicle mass cancels out of the friction balance. The velocity is limited to the vehicle's maximum velocity.

The navigation stack uses a low-level proportional control system to command the onboard vehicle actuators to follow the local planning references. The control system calculates acceleration and steering-velocity commands which can be implemented on hardware.
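The two calculations above can be summarised in the following sketch. The pure pursuit steering follows Eq. (1) directly, with the 0.8 m look-ahead from Table I; the velocity function follows the reconstruction of Eq. (2), where the friction coefficient, safety factor and vehicle length are illustrative assumptions rather than the paper's exact parameters.

```python
import numpy as np

def pure_pursuit_steering(alpha, L=0.33, l_d=0.8):
    """Eq. (1): delta_pp = arctan(2 L sin(alpha) / l_d).

    alpha is the angle between the vehicle heading and the heading to the
    pursued way-point; the 0.33 m vehicle length is an assumed value.
    """
    return np.arctan(2.0 * L * np.sin(alpha) / l_d)

def friction_limited_velocity(delta_ref, b=0.8, g=9.81, l=0.33, s_f=0.9, v_max=7.0):
    """Eq. (2): the fastest speed the available lateral friction allows at the
    current steering angle, scaled by a safety factor and capped at the
    maximum vehicle speed (7 m/s, Table I). The friction coefficient and
    safety factor values here are illustrative assumptions.
    """
    denom = max(abs(np.tan(delta_ref)), 1e-6)  # avoid division by zero when driving straight
    return min(s_f * np.sqrt(b * g * l / denom), v_max)
```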
V. LEARNING POLICIES FOR REFERENCE MODIFICATION

The RL agent comprises a neural network that is trained from experience to maximise a reward signal [14]. The neural network is used to represent a policy that maps states to actions. For each training step, the network receives a state and selects an action that is implemented in the environment, as shown in Figure 5. The simulator then returns the updated state and a reward to the agent, and the process is repeated.

Fig. 5. Reinforcement Learning Agent Interacting With Environment

We train the agent to modify the pure pursuit steering reference δ_pp to prevent collisions. The state which the network receives consists of a vector containing the current velocity V_t and steering angle of the vehicle δ_t, the reference calculated by the path follower δ_pp, and the current range finder scan [r_1, ..., r_10]. The network outputs an action quantity which is used to modify the steering reference calculated by the path follower. The path follower (pp) and neural network (nn) steering commands are combined as

\delta_{ref} = \delta_{pp} + \delta_{nn}. \qquad (3)

The reference steering value is limited to be within the maximum range and then sent to the control system to be executed. The inputs and outputs of the network are in the range [−1, 1] and are scaled according to the maximum values.

Fig. 6. Reinforcement Learning Agent Architecture: The state consists of 10 sparse laser range finders and the pure pursuit steering reference. A neural network with two fully connected layers is used to calculate an action δ_nn, which is used to modify the path follower steering reference.

The aim in training the network is that the vehicle does not crash and that it tracks the reference path closely. Therefore, we use a reward signal which severely punishes crashing and rewards progress along the reference path. We write our reward signal as

r(s_t, a_t) = \begin{cases} r_{crash} = -1 & \text{if crash} \\ r_{complete} = 1 & \text{if lap complete} \\ r_{tracking} = (s_{t+1} - s_t)/s_{total} & \text{otherwise}, \end{cases} \qquad (4)

where s is the progress along the track at a specific step. The sum of the tracking rewards for completing the episode will always be 1, since the sum of the progress differences will be equal to the track length. The tracking reward must be a potential function of the terminal reward so that the agent is rewarded in proportion to the progress towards the ideal terminal state.
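A direct transcription of the reward signal in Eq. (4) is shown below; the function and argument names are illustrative.

```python
def reward(progress_t, progress_t1, track_length, crashed, lap_complete):
    """Reward signal of Eq. (4).

    Progress is the distance travelled along the reference path, so the
    per-step tracking rewards of a completed episode sum to 1.
    """
    if crashed:
        return -1.0
    if lap_complete:
        return 1.0
    return (progress_t1 - progress_t) / track_length
```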
VI. EVALUATION METHODOLOGY

The ability of the modification planner to follow a reference path and avoid obstacles efficiently is evaluated in the context of F1/10th autonomous racing. We aim to determine the performance of the modification planner and compare it to current baseline solutions. We measure performance in terms of the following metrics: (1) the time to navigate to a target and (2) the success rate, which is the percentage of episodes successfully completed without crashing.

A. Simulation Environment

We train and test our local planner architectures using a custom-built F1/10th autonomous racing simulator. The simulator models a non-holonomic 1/10th scale autonomous race car with a LiDAR sensor, as shown in Figure 7. At each time-step, the simulator receives an action, updates the state according to the transition dynamics and then returns the new state of the vehicle. A version of our test code is available online¹.

Fig. 7. F1/10th Vehicle Simulator with 10 Laser Range Finder Beams

¹Available at: https://github.com/BDEvan5/ReferenceModification

Figure 8 shows how the simulator interfaces with the local planner that is being evaluated. The local planner selects an action consisting of velocity and steering angle commands, which are passed to the simulator and used as the references for the control system. The feedback control system calculates acceleration and steering-velocity commands which are the input to the kinematic bicycle model equations used to model the car. As in practice, the simulated control system runs at a higher frequency of 100 Hz compared to the local planner, which runs at 10 Hz. Table I notes several important parameters used in our simulation.

Fig. 8. Schematic of Planning Loop in Simulation Environment

TABLE I: Parameters Used in the Simulation

Simulation Parameter             | Value
---------------------------------|--------
Simulation dynamics time step    | 0.01 s
Local planning time step         | 0.1 s
Max range finder value           | 10 m
Range finder noise σ_s           | 0.01
Max vehicle speed                | 7 m/s
Max steering angle               | 0.4 rad
Pure pursuit look-ahead distance | 0.8 m

The new vehicle state is used to simulate a LiDAR scan of 10 sparse range finders on the map, and the new state and the scan are returned to the local planner. To simulate the noise in range finder readings, Gaussian distributed process noise, N(0, σ_s), is added to the measurements.
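The timing described above can be illustrated with the following sketch of one episode of the planning loop in Figure 8. The `planner`, `controller`, `dynamics` and `lidar` objects are hypothetical stand-ins for the simulator components; the 10 Hz/100 Hz nesting and the Gaussian range noise follow Table I.

```python
import numpy as np

PLAN_DT, SIM_DT = 0.1, 0.01  # local planning and dynamics time steps (Table I)
SIGMA_S = 0.01               # standard deviation of the range finder noise

def run_episode(planner, controller, dynamics, lidar, max_steps=1000):
    state = dynamics.reset()
    scan = lidar.scan(state) + np.random.normal(0.0, SIGMA_S, size=10)
    for _ in range(max_steps):
        # Local planner: velocity and steering references at 10 Hz.
        v_ref, delta_ref = planner.plan(state, scan)
        # Control system and bicycle-model dynamics: 10 sub-steps at 100 Hz.
        for _ in range(int(PLAN_DT / SIM_DT)):
            accel, steer_vel = controller.control(state, v_ref, delta_ref)
            state = dynamics.step(state, accel, steer_vel, SIM_DT)
        if dynamics.crashed(state) or dynamics.reached_goal(state):
            break
        # Simulate a noisy scan of 10 sparse range finders for the next step.
        scan = lidar.scan(state) + np.random.normal(0.0, SIGMA_S, size=10)
    return state
```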
B. Baseline Solutions

We compare the modification planner to three common baseline local planners. The first baseline is the "Follow the Gap Method" (FGM) [3]. The FGM identifies a bubble around the nearest obstacle and then navigates away from the bubble into free space. In the implementation of this method, we adapt the simulator to produce 1080 beams.

The second baseline is an optimisation based, oracle planner which has full access to a map of the environment and the location of the obstacles before the episode begins. The obstacle map is used to generate an optimal path around the obstacles. The optimal plan is followed using the pure pursuit algorithm. This solution is treated as the best performance which could be achieved by any local planner.

Lastly, we attempt to replicate the mapless navigation planner presented by Tai et al. [6] and adapt it to a non-holonomic vehicle. The planner presented by Tai et al. uses a neural network that receives 10 laser range finder readings and the relative position of the target. The network is trained using reinforcement learning and rewarded according to the progress made towards the goal.

For all of our baseline implementations, we use the same method of calculating the speed based on the steering angle to ensure that the results are comparable.

C. Reinforcement Learning

The RL agent in the modification planner is trained using the Twin Delayed Deep Deterministic Policy Gradient (TD3) reinforcement learning algorithm with the original training hyper-parameters [15]. Neural networks with two hidden layers of 200 neurons are trained in mini-batches of 100 transitions. The training lasts for 200,000 steps, which is generally just below 4,000 episodes. Figure 9 shows the average reward per episode for the training of the network.
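A minimal sketch of an actor network consistent with this description is given below, assuming a 13-dimensional state (velocity, steering angle, pure pursuit reference and 10 range readings) and a single tanh-bounded output in [−1, 1]. The activation functions are assumptions, and the full TD3 agent (twin critics, target networks, exploration noise) is omitted.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: two hidden layers of 200 neurons, as reported in the text."""

    def __init__(self, state_dim=13, action_dim=1, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # scaled steering modification
        )

    def forward(self, state):
        return self.net(state)
```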
Fig. 9. Modification Planner Training Graph: average reward per episode over the training episodes.

Fig. 10. Example Trajectories Executed by the Modification and Navigation Planners

VII. EVALUATION RESULTS

A. Random Forests

The modification planner is evaluated in a random forest, which consists of a straight race track with randomly located obstacles. The random forest is a straight track (20 m x 2 m) where 4 obstacles (0.5 m x 0.5 m) are randomly placed along the track every episode. Table II shows the results from running 100 test episodes and taking the average time to navigate to the goal and the success rate.

TABLE II: Results from the 20 m Random Forest

Planner               | No Obstacles (s) | Obstacles (s) | Success
----------------------|------------------|---------------|--------
Follow the Gap Method | 3.73             | 4.38          | 99%
Oracle                | 3.72             | 3.87          | 100%
Navigation Planner    | 5.08             | 6.18          | 97%
Modification Planner  | 3.82             | 4.63          | 94%

In the tests with no obstacles, the modification planner is able to track the optimal path, a straight line through the forest, and achieve a very similar time to the optimisation based oracle planner (within 3%). The similar times when no obstacles are present show that the RL agent learns not to modify the steering command unnecessarily. The navigation planner achieves a significantly slower average time of 5.08 seconds in this test, compared to the 3.72 seconds which the other planners are able to achieve.

In the tests with obstacles present, the optimisation based oracle planner outperforms the other planners as expected, with an average time of 3.87 seconds. The FGM planner achieves a mediocre time of 4.38 seconds; however, this is still faster than any of the learning-based planners. The modification planner achieves an average time of 4.63 seconds for the forest filled with obstacles, which is faster than the navigation planner's time of 6.18 seconds. The decrease in time shows that the modification planner is able to find smoother, faster trajectories than the mapless navigation solution. Figure 10 shows a comparison of the trajectories taken by the navigation and modification planners and how the modification planner learns to take smoother trajectories.

The modification planner achieves a slightly lower success rate of 94% compared to the navigation planner's 97%. These high success rates show that both planners learn to avoid a significant majority of obstacles.

The influence of the network in modifying the path follower references is shown in Figure 11, which plots the output from the neural network as the vehicle moves along the path. The neural network modifies the path enough to avoid the obstacles while roughly maintaining the reference path.

Fig. 11. Example Trajectory from the Modification Planner with Neural Network Output

B. Repeatability

To measure the repeatability of the learned policies, we train and evaluate 10 agents on the random forest described above. The average and standard deviation statistics are presented in Table III.

TABLE III: Results from the Repeatability Evaluation

Statistic                | Average | Standard Deviation
-------------------------|---------|-------------------
No Obstacle Average Time | 3.85 s  | ± 0.10
Obstacle Average Time    | 4.63 s  | ± 0.17
Obstacle Success Rate    | 92.2%   | ± 5.3

The standard deviation in the times achieved by the different agents is small, which shows that the method of training agents to achieve similar times is stable and highly repeatable. The standard deviation of the success rate is larger, at over 5 percentage points, which shows that there is a non-trivial difference in how agents learn to avoid obstacles.
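For illustration, the random forest described in subsection A could be generated as in the following sketch. The margins and sampling scheme are assumptions, not the exact procedure used in our experiments.

```python
import numpy as np

def sample_random_forest(track_length=20.0, track_width=2.0,
                         n_obstacles=4, obstacle_size=0.5, rng=None):
    """Place four 0.5 m x 0.5 m obstacles at random positions on a 20 m x 2 m track."""
    rng = np.random.default_rng() if rng is None else rng
    obstacles = []
    for _ in range(n_obstacles):
        x = rng.uniform(2.0, track_length - 2.0)           # keep start and goal regions clear (assumed margin)
        y = rng.uniform(0.0, track_width - obstacle_size)   # keep the obstacle inside the track
        obstacles.append((x, y, obstacle_size, obstacle_size))
    return obstacles
```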
C. Discussion

The results show that the modification planner can learn policies that improve upon map-less navigation planners. Specifically, the modification planner avoids obstacles more smoothly, which results in the vehicle being able to travel at higher speeds. This shows that the modification planner achieves the behaviour that it was designed for: allowing non-holonomic vehicles to avoid obstacles smoothly at high speeds. The success rates which the planners achieve show that while the modification planner is able to learn to avoid a significant number of obstacles, it still underperforms the map-less navigation planner's ability to consistently avoid obstacles.

It is suggested that the reason for the lower rate of completion is that the complexity of the learning problem for the modification planner is higher than for the map-less navigation baseline. The transition dynamics (movement from one state to the next) for the modification planner are dependent on combining the action which the agent has selected with the pure pursuit reference. The result of this combination is that the agent has less influence on how the vehicle moves. Having less influence improves the quality of the path taken, which results in the modification planner achieving faster times. A problem that arises in having less influence is that it achieves a lower rate of completion.

This problem of guaranteeing vehicle safety is a current topic of research and could be addressed in future work by exploring different training methods such as imitation learning. Another possible solution is to combine the RL agent with a safety system that can stop the vehicle before it crashes. A recent survey [12] suggested that future work in mobile navigation should continue into using ML to learn specific subsystems of the navigation pipeline. In future, we plan to investigate how RL agents can be used in different architectures to learn obstacle avoidance behaviours by reducing the complexity of the navigation problem to simpler components.

VIII. CONCLUSION

In this paper, we addressed the problem of local planning for autonomous navigation. Local planners are responsible for tracking a reference trajectory and avoiding obstacles. We presented a novel local planning architecture that uses a path following algorithm in conjunction with an RL agent to smoothly avoid obstacles while maintaining the global plan. Our architecture is evaluated in the context of F1/10th autonomous racing and shown to achieve comparable racing times to optimisation based and reactive methods, yet using fewer inputs. Specifically, our modification local planner requires only 10 sparse laser range finder readings and is able to smoothly avoid obstacles. Additionally, we show that our planner is able to avoid obstacles more smoothly than previous solutions from map-less navigation.

REFERENCES

[1] P. Falcone, F. Borrelli, H. E. Tseng, J. Asgari, and D. Hrovat, "A hierarchical model predictive control framework for autonomous ground vehicles," in 2008 American Control Conference. IEEE, 2008, pp. 3719–3724.
[2] V. J. Lumelsky and A. A. Stepanov, "Path-planning strategies for a point mobile automaton moving amidst unknown obstacles of arbitrary shape," Algorithmica, vol. 2, no. 1-4, pp. 403–430, 1987.
[3] V. Sezer and M. Gokasan, "A novel obstacle avoidance algorithm: 'Follow the gap method'," Robotics and Autonomous Systems, vol. 60, no. 9, pp. 1123–1134, 2012.
[4] L. Heng, L. Meier, P. Tanskanen, F. Fraundorfer, and M. Pollefeys, "Autonomous obstacle avoidance and maneuvering on a vision-guided MAV using on-board processing," in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 2472–2477.
[5] R. C. Coulter, "Implementation of the pure pursuit path tracking algorithm," Carnegie Mellon University, Robotics Institute, Pittsburgh, PA, Tech. Rep., 1992.
[6] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 31–36.
[7] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, "PLATO: Policy learning using adaptive trajectory optimization," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3342–3349.
[8] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian, "Intention-net: Integrating planning and deep learning for goal-directed autonomous navigation," in Conference on Robot Learning. PMLR, 2017, pp. 185–194.
[9] M. Wei, S. Wang, J. Zheng, and D. Chen, "UGV navigation optimization aided by reinforcement learning-based path tracking," IEEE Access, vol. 6, pp. 57814–57825, 2018.
[10] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: Part I," IEEE Robotics & Automation Magazine, vol. 13, no. 2, pp. 99–110, 2006.
[11] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[12] X. Xiao, B. Liu, G. Warnell, and P. Stone, "Motion control for mobile robot navigation using machine learning: A survey," arXiv preprint arXiv:2011.13112, 2020.
[13] F. de Villiers and W. Brink, "Learning fine-grained control for mapless navigation," in 2020 International SAUPEC/RobMech/PRASA Conference. IEEE, 2020, pp. 1–6.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[15] S. Fujimoto, H. Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.