Hierarchical Control for Multi-Agent Autonomous Racing

Rishabh Saumil Thakkar*, Aryaman Singh Samyal*, David Fridovich-Keil*, Zhe Xu†, Ufuk Topcu*
*The University of Texas at Austin {rishabh.thakkar, aryamansinghsamyal, dfk, utopcu}@utexas.edu
†Arizona State University {xzhe1}@asu.edu

arXiv:2202.12861v3 [cs.MA] 27 Apr 2022

Abstract— We develop a hierarchical controller for multi-agent autonomous racing. A high-level planner approximates the race as a discrete game with simplified dynamics that encodes the complex safety and fairness rules seen in real-life racing and calculates a series of target waypoints. The low-level controller takes the resulting waypoints as a reference trajectory and computes high-resolution control inputs by solving a simplified formulation of a multi-agent racing game. We consider two approaches for the low-level planner to construct two hierarchical controllers. One approach uses multi-agent reinforcement learning (MARL), and the other solves a linear-quadratic Nash game (LQNG) to produce control inputs. We test the controllers against three baselines: an end-to-end MARL controller, a MARL controller tracking a fixed racing line, and an LQNG controller tracking a fixed racing line. Quantitative results show that the proposed hierarchical methods outperform their respective baseline methods in terms of head-to-head race wins and abiding by the rules. The hierarchical controller using MARL for low-level control consistently outperformed all other methods by winning over 88% of head-to-head races and more consistently adhered to the complex racing rules. Qualitatively, we observe the proposed controllers mimicking actions performed by expert human drivers such as shielding/blocking, overtaking, and long-term planning for delayed advantages. We show that hierarchical planning for game-theoretic reasoning produces competitive behavior even when challenged with complex rules and constraints.

Index Terms— multi-agent systems, reinforcement learning, hierarchical control, game theory, Monte Carlo methods

I. INTRODUCTION

Autonomous driving has seen an explosion of research in academia and industry [1]. While most of these efforts focus on day-to-day driving, there is growing interest in autonomous racing. Many advancements in commercial automobiles have originated from projects invented for use in motorsports, such as disc brakes, rear-view mirrors, and sequential gearboxes [2]. The same principle can apply when designing self-driving controllers because racing provides a platform to develop these controllers to be highly performant, robust, and safe in challenging scenarios.

Successful human drivers are required to both outperform opponents and adhere to the rules of racing. These objectives are effectively at odds with one another, but the best racers can satisfy both. Prior approaches in autonomous racing usually over-simplify the latter by only considering collision avoidance [3]–[6]. In reality, these racing rules often involve discrete variables and complex nuances [7]. For example, a driver may not change lanes more than a fixed number of times when traveling along a straight section of the track. While it is relatively straightforward to describe this rule in text, it is challenging to encode it in a mathematical formulation that can be solved by existing methods for real-time control. These methods have to compromise by either shortening their planning horizons or simply ignoring these constraints. The resulting behavior is an agent that is not optimal, or an agent that may be quick but is unsafe or unfair.

We develop a hierarchical control scheme that reasons about optimal long-term plans and closely adheres to the safety and fairness rules of a multi-agent racing game. The high-level planner forms a discrete approximation of the general formulation of the game. The solution of the discrete problem produces a series of waypoints that both adhere to the rules and are approximately optimal. The low-level planner solves a simplified, continuous state/action dynamic game with an objective to hit as many of the waypoints as possible and a reduced form of the safety rules. Our structure yields a controller that runs in real-time and outperforms other traditional control methods in terms of head-to-head performance and obedience to safety rules. The control architecture is visualized in Figure 1. Although we develop our controller in the context of a racing game, the structure of this method enables reasoning about long-term optimal choices in a game-theoretic setting with complex constraints involving temporal logic and both continuous and discrete dynamics. Hence, it is possible to apply this method to many other adversarial settings that exhibit the aforementioned properties, such as financial systems, power systems, or air traffic control.

II. PRIOR WORK

Because multi-agent racing is inherently a more complex problem, most prior work in autonomous racing is focused on single-agent lap time optimization, with fewer and more recent developments in multi-agent racing.

Single-agent racing approaches utilize a mixture of optimization and learning-based methods. One study uses Monte Carlo tree search to estimate where to position the car around various shaped tracks to define an optimal trajectory [8]. The work in [9] proposes a method that computes an optimal trajectory offline and uses a model predictive control (MPC) algorithm to track the optimized trajectory online. Similarly, the authors of [10] also perform calculations offline by creating a graph representation of the track to compute a target path and use spline interpolation for online path generation in an environment with static obstacles.
In the category of learning-based approaches, online learning to update parameters of an MPC algorithm based on feedback from applying control inputs is developed in [11]. Further, there are works that develop and compare various deep reinforcement learning methods to find and track optimal trajectories [12], [13].

Looking at multi-agent racing works, both optimization and learning-based control approaches are also used. The authors of [5] use a mixed-integer quadratic programming formulation for head-to-head racing with realistic collision avoidance but concede that this formulation struggles to run in real-time. Another study proposes a real-time control mechanism for a game with a pair of racing drones [14]. This work provides an iterative best-response method while solving an MPC problem that approximates a local Nash equilibrium. It is eventually extended to automobile racing [3] and multi-agent scenarios with more than two racers [4]. A faster, real-time MPC algorithm to make safe overtakes is developed in [6], but their method does not consider adversarial behavior from the opposing players. Again, these approaches do not consider racing rules other than simple collision avoidance. The work in [15] develops an autonomous racing controller using deep reinforcement learning that considers the rules of racing beyond just simple collision avoidance. Their controller outperforms expert humans while also adhering to proper racing etiquette. It is the first study to consider nuanced safety and fairness rules of racing and does so by developing a reward structure that trains a controller to understand when it is responsible for avoiding collisions and when it can be more aggressive.

Finally, hierarchical game-theoretic reasoning is a method that has been previously studied in the context of autonomous driving. A hierarchical racing controller was introduced in [16] that constructed a high-level planner with simplified dynamics to sample sequences of constant-curvature arcs and a low-level planner that uses MPC to track the arc that provided the furthest progress along the track. A two-level planning system is developed in [17] to control an autonomous vehicle in an environment with aggressive human drivers. The upper-level system produces a plan to be safe against the uncertainty of the human drivers in the system by using simplified dynamics. The lower-level planner implements the strategy determined by the upper-level planner using precise dynamics.

III. GENERAL MULTI-AGENT RACING GAME FORMULATION

Fig. 1. Two-level planning architecture of the proposed racing controller.

To motivate the proposed control design, we first outline a dynamic game formulation of a general multi-agent racing game. Let there be a set N of players racing over T steps in T = {1, ..., T}. There is a track defined by a sequence of τ checkpoints along the center, {c_i}_{i=1}^{τ}, whose indices are in a set C = {1, ..., τ}. The objective for each player i is to minimize its pairwise differences of the time to reach the final checkpoint with all other players. In effect, the player targets to reach the finish line with the largest time advantage. The continuous state (such as position, speed, or tire wear) for each player, denoted as x^i_t ∈ X ⊆ R^n, and control, denoted as u^i_t ∈ U ⊆ R^k, are governed by known dynamics f^i. We also introduce a pair of discrete state variables r^i_t ∈ C and γ^i ∈ T. The index of the latest checkpoint passed by player i at time t is r^i_t, and it is computed by a function p : X × C → C. The earliest time when player i reaches c_τ is γ^i. Using these definitions, we formulate the objective (1) and core dynamics (2)-(6) of the game as follows:

\min_{u^i_0, \ldots, u^i_T} \; (|N| - 1)\,\gamma^i - \sum_{j \neq i} \gamma^j    (1)

x^j_{t+1} = f(x^j_t, u^j_t), \quad \forall\, t \in \mathcal{T},\ j \in N    (2)
r^j_{t+1} = p(x^j_{t+1}, r^j_t), \quad \forall\, t \in \mathcal{T},\ j \in N    (3)
r^j_1 = 1, \quad \forall\, j \in N    (4)
r^j_T = \tau, \quad \forall\, j \in N    (5)
\gamma^j = \min\{t \mid r^j_t = \tau \wedge t \in \mathcal{T}\}, \quad \forall\, j \in N    (6)

In addition to the core dynamics of the game, there are rules that govern the players' states. To ensure that the players stay within the bounds of the track, we introduce a function, q : X → R, which computes a player's distance to the closest point on the center line. This distance must be limited to the width of the track w. Therefore, for all t ∈ T and j ∈ N:

q(x^j_t) \leq w    (7)

Next, we define the collision avoidance rules of the game. We use an indicator function that evaluates if one player is "behind" another. Depending on the condition, the distance between every pair of players, computed by a function d : X × X → R, is required to be at least s_1 if one player is behind the other, or s_0 otherwise. For all t ∈ T, j ∈ N, and k ∈ N \ {j}, these rules are expressed by the constraint:

d(x^j_t, x^k_t) \geq \begin{cases} s_1 & \text{if player } j \text{ is behind player } k \\ s_0 & \text{otherwise} \end{cases}    (8)
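As a concrete illustration, the snippet below shows one way the track-bound rule (7) and the distance rule (8) could be checked for a pair of simulated player states. The state layout, the helper names, and the use of progress along the center line to decide which player is "behind" are illustrative assumptions, not the paper's implementation.

```python
import math

def dist(p, q):
    """Euclidean distance between two players' positions."""
    return math.hypot(p["pos"][0] - q["pos"][0], p["pos"][1] - q["pos"][1])

def violates_track_bounds(player, nearest_center_point, w):
    """Constraint (7): distance to the closest center-line point must not exceed w."""
    dx = player["pos"][0] - nearest_center_point[0]
    dy = player["pos"][1] - nearest_center_point[1]
    return math.hypot(dx, dy) > w

def violates_min_distance(p_j, p_k, s0, s1):
    """Constraint (8): the required gap is s1 when player j trails player k, s0 otherwise."""
    required = s1 if p_j["progress"] < p_k["progress"] else s0
    return dist(p_j, p_k) < required

# Example with made-up numbers: player a trails player b and is too close.
a = {"pos": (0.0, 0.5), "progress": 10.0}
b = {"pos": (1.5, 0.0), "progress": 12.0}
print(violates_track_bounds(a, nearest_center_point=(0.0, 0.0), w=4.0))  # False
print(violates_min_distance(a, b, s0=1.0, s1=2.0))                       # True
```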
Finally, players are limited in how often they may change lanes depending on the part of the track they are at. We assume that there are λ ∈ Z^+ lanes across all parts of the track. If the player's location on the track is classified as a curve, there is no limit on lane changing. However, if the player is at a location classified as a straight, it may not change lanes more than L times for the contiguous section of the track classified as a straight. We define a set S that contains all possible states where a player is located at a straight section. We also introduce a function z : X → {1, 2, ..., λ} that returns the lane ID of a player's position on the track. Using these definitions, we introduce a variable l^j_t calculated by the following constraint for all t ∈ T and j ∈ N:

l^j_t = \begin{cases} l^j_{t-1} + 1 & \text{if } \mathbf{1}_{x^j_t \in S} = \mathbf{1}_{x^j_{t-1} \in S} \,\wedge\, z(x^j_t) \neq z(x^j_{t-1}) \\ 0 & \text{otherwise} \end{cases}    (9)

This variable effectively represents a player's count of "recent" lane changes over a sequence of states located across a contiguous straight or curved section of the track. However, the variable only needs to be constrained when the player is on a straight section of the track. Therefore, the following constraint must hold for all t ∈ T and j ∈ N if x^j_t ∈ S:

l^j_t \leq L    (10)

Most prior multi-agent racing formulations [3]–[5] do not include the complexities we introduced through defining constraints (8)-(10). They usually have a similar form regarding continuous dynamics and discrete checkpoints (2)-(6), and their rules only involve staying on track (7) and collision avoidance with a fixed distance. However, in real-life racing, these complexities do exist, both in the form of mutually understood unwritten rules and explicit safety rules [7]. As a result, we account for two of the key rules that ensure the game remains fair and safe:
1) There is a greater emphasis on, and responsibility for, collision avoidance for a vehicle that is following another (8).
2) A player may only switch lanes L times while on a straight section of the track (9)-(10).
The first rule ensures that a leading player can make a decision without needing to consider an aggressive move by the players that are following which risks a rear-end collision or a side collision while turning. The second rule ensures that the leading player may not engage in aggressive swerving or "zig-zagging" across the track that would make it impossible for a player that is following the leader to safely challenge for an overtake. While functions may exist to evaluate these spatially and temporally dependent constraints, their discrete nature suggests that they cannot be easily differentiated. Therefore, most state-of-the-art optimization algorithms would not apply or would struggle to find a solution in real time.
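Before moving to the control design, the lane-change rule can be made concrete with a short sketch. It is a literal reading of the counter update (9) and the limit (10); the booleans standing in for x ∈ S and the lane IDs returned by z(·) are assumed inputs.

```python
def update_lane_counter(l_prev, in_straight_prev, in_straight_now, lane_prev, lane_now):
    """Literal reading of (9): increment when the section classification is unchanged
    and the lane ID changed; otherwise the counter resets to zero."""
    if (in_straight_prev == in_straight_now) and (lane_now != lane_prev):
        return l_prev + 1
    return 0

def lane_rule_satisfied(l_now, in_straight_now, L):
    """Constraint (10): the counter is only limited on straight sections."""
    return (not in_straight_now) or (l_now <= L)

# Example: two consecutive lane changes on the same straight with L = 1 break the rule.
l = 0
l = update_lane_counter(l, True, True, lane_prev=2, lane_now=1)  # l = 1
l = update_lane_counter(l, True, True, lane_prev=1, lane_now=2)  # l = 2
print(lane_rule_satisfied(l, True, L=1))  # False
```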
IV. HIERARCHICAL CONTROL DESIGN

Traditional optimization-based control methods cannot easily be utilized for the general multi-agent racing game formulated with realistic safety and fairness rules. The rules involve nonlinear constraints over both continuous and discrete variables, and a mixed-integer nonlinear programming algorithm would be unlikely to run at rates of 25 Hz-50 Hz for precise control. This inherent challenge encourages utilizing a method such as deep reinforcement learning or trying to solve the game using short horizons.

Instead, we propose a hierarchical control design involving two parts that work to ensure all of the rules are followed while approximating long-term optimal choices. The high-level planner transforms the general formulation into a game with discrete states and actions where all of the discrete rules are naturally encoded. The solution provided by the high-level planner is a series of discrete states (i.e., waypoints) for each player, which satisfies all of the rules. Then, the low-level planner solves a simplified version of the racing game with an objective putting greater emphasis on tracking the series of waypoints and smaller emphasis on the original game-theoretic objective and a simplified version of the rules. Therefore, this simplified formulation can be solved by an optimization method in real-time or be learned by a neural network when using a learning-based method.

A. High-Level Planner

The high-level planner constructs a turn-based, discrete dynamic game that is an approximation of the general game (1)-(10). Continuous components of a player's state are broken into discrete "buckets" (e.g., speed between 2 m/s and 4 m/s, tire wear between 10% and 15%). In addition, λ (the number of lanes) points around each checkpoint are chosen along a line perpendicular to the direction of travel, where each point evaluates to a unique lane ID on the track when passed into the function z(·) defined in the general formulation. The left and center of Figure 2 visualize the checkpoints in the original, continuous formulation (in red) expanded into three discrete lanes (green or purple) for the high-level game.

The players' actions are defined by pairs of a lane ID, resolving to a target location near the next checkpoint, and a target speed for that location. Therefore, we can apply a simplified inverse approximation of the dynamics to determine the time it would take to transition from one checkpoint to the next and estimate the remaining state variables, or dismiss the action if it is dynamically infeasible. This action space also allows us to easily evaluate or prevent actions where rules of the game would be broken. By limiting choices to fixed locations across checkpoints, we ensure that the players always remain on track (7). Moreover, a player's actions can be dismissed if they would violate the limit on the number of lane changes, by simply checking whether choosing a lane would exceed their limit or checking whether the location is a curve or a straight (10). Finally, other actions that could cause collisions can also be dismissed by estimating that if two players reach the same lane at a checkpoint and have a small difference in their time states, there would be a high risk of collision (8).
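To make the discrete action space concrete, the sketch below enumerates (lane ID, target speed) pairs for one checkpoint transition and prunes those that are dynamically infeasible or that would exceed the lane-change budget. The constant-acceleration feasibility test, the average-speed travel-time estimate, and all parameter names are assumptions standing in for the paper's simplified inverse dynamics.

```python
from itertools import product

def feasible_transition(v_now, v_target, segment_len, a_max):
    """Crude inverse-dynamics check: can the kart reach v_target from v_now within
    the distance to the next checkpoint, given a maximum acceleration magnitude?"""
    required_a = abs(v_target**2 - v_now**2) / (2.0 * segment_len)
    return required_a <= a_max

def transition_time(v_now, v_target, segment_len):
    """Travel time assuming a constant average speed over the segment."""
    return segment_len / max(0.5 * (v_now + v_target), 1e-6)

def enumerate_actions(state, lanes, speed_buckets, segment_len, a_max,
                      on_straight, lane_limit):
    """Return (lane, target_speed, est_time) actions that respect the rules."""
    actions = []
    for lane, v_target in product(lanes, speed_buckets):
        if not feasible_transition(state["v"], v_target, segment_len, a_max):
            continue  # dynamically infeasible, dismiss
        changes = state["lane_changes"] + (lane != state["lane"])
        if on_straight and changes > lane_limit:
            continue  # would break the lane-change rule (10), dismiss
        actions.append((lane, v_target, transition_time(state["v"], v_target, segment_len)))
    return actions

# Example: 3 lanes, coarse speed buckets, and one lane change already used.
state = {"lane": 2, "v": 18.0, "lane_changes": 1}
print(enumerate_actions(state, lanes=[1, 2, 3], speed_buckets=[14.0, 18.0, 22.0],
                        segment_len=25.0, a_max=6.0, on_straight=True, lane_limit=1))
```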
Fig. 2. The uncountably infinite trajectories of the general game (left) discretized by the high-level planner (middle). The sequence of target waypoints calculated by the high-level planner (in green) is tracked by the low-level planner (right) and converges to a continuous trajectory (in black).

The game is played with each player starting at the initial checkpoint, and it progresses by resolving all players' choices one checkpoint at a time. The order in which the players take their actions is determined by which player has the smallest time state at each checkpoint. A lower time state value implies that a player was at the given checkpoint before other players with a larger time state, so it would have made its choice at that location before the others. This ordering also implies that players who arrive at a checkpoint after preceding players observe the actions of those preceding players. Therefore, these observations can contribute to their strategic choices. Most importantly, because the ordering forces the following players to choose last, we also capture the rule that the following players (i.e., those that are "behind" others) are responsible for collision avoidance after observing the leading players' actions.

The objective of the discrete game is to minimize the difference between one's own time state at the final checkpoint and that of all other players, just like the original formulation (1). Although the discrete game is much simpler than the original formulation, the state space grows as the number of actions and checkpoints increases. Therefore, we solve the game in a receding horizon manner, but our choice of the horizon (i.e., the number of checkpoints to consider) extends much further into the future than an MPC-based continuous state/action space controller can handle in real time [3]. In order to produce a solution to the discrete game in real-time, we use the Monte Carlo tree search (MCTS) algorithm [18]. The solution from applying MCTS is a series of waypoints in the form of target lane IDs (which can be mapped back to positions on track) and target velocities at each of the checkpoints for the ego player, along with estimates of the best-response lanes and velocities of the adversarial players.
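For reference, a compact, generic Monte Carlo tree search of the kind that could be applied to such a turn-based game is sketched below, together with a toy two-player race used only to exercise it. The `Game` interface (`player_to_move`, `actions`, `step`, `is_terminal`, `rewards`) and every constant are illustrative assumptions; the authors' planner additionally encodes the checkpoint ordering, rule checks, and receding horizon described above.

```python
import math, random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits = [], 0
        self.value = 0.0  # accumulated reward for the player who moved at the parent

def uct_value(child, parent_visits, c):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def uct_search(game, root_state, iters=2000, c=1.4):
    root = Node(root_state)
    for _ in range(iters):
        # 1) Selection: descend to a leaf, always from the mover's point of view.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: uct_value(n, node.visits, c))
        # 2) Expansion of the selected leaf (unless terminal).
        if not game.is_terminal(node.state):
            for a in game.actions(node.state):
                node.children.append(Node(game.step(node.state, a), node, a))
            node = random.choice(node.children)
        # 3) Random rollout to the end of the (truncated) game.
        state = node.state
        while not game.is_terminal(state):
            state = game.step(state, random.choice(game.actions(state)))
        rewards = game.rewards(state)  # one terminal reward per player
        # 4) Backpropagation: credit each edge with its mover's reward.
        while node.parent is not None:
            node.visits += 1
            node.value += rewards[game.player_to_move(node.parent.state)]
            node = node.parent
        root.visits += 1
    return max(root.children, key=lambda n: n.visits).action

class ToyRace:
    """Minimal turn-based game: on each turn the mover gains 1 or 2 progress units."""
    def __init__(self, horizon=6):
        self.horizon = horizon
    def player_to_move(self, s):      # s = (progress0, progress1, turn_index)
        return s[2] % 2
    def actions(self, s):
        return [1, 2]
    def step(self, s, a):
        p = list(s[:2]); p[self.player_to_move(s)] += a
        return (p[0], p[1], s[2] + 1)
    def is_terminal(self, s):
        return s[2] >= self.horizon
    def rewards(self, s):
        return (s[0] - s[1], s[1] - s[0])  # progress differences, in the spirit of (1)

print(uct_search(ToyRace(), (0, 0, 0), iters=500))  # expected best root action: 2
```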
B. Low-Level Planner

The low-level planner is responsible for producing the control inputs, so it must operate in real-time. Because we have a long-term plan from the high-level planner, we can formulate a reduced version of the original game for the low-level planner. The low-level game is played over a shorter horizon compared to the original game of just δ steps in T̂ = {1, ..., δ}. We assume that the low-level planner for player i has received k waypoints, ψ^i_{r^i_1}, ..., ψ^i_{r^i_1 + k}, from the high-level planner, and player i's last passed checkpoint is r^i_*. The low-level objective involves two components. The first is to maximize the difference between its own checkpoint index and the opponents' checkpoint indices at the end of δ steps. The second is to minimize the tracking error, η^i_y, of every passed waypoint ψ^i_{r^i_1 + y}. The former component influences the player to pass as many checkpoints as possible, which suggests reaching c_τ as quickly as possible. The latter influences the player to be close to the high-level waypoints when passing each of the checkpoints. The objective also includes a multiplier α that balances the emphasis on the two parts. The objective is written as follows:

\min_{u^i_1, \ldots, u^i_\delta} \; \Big( \sum_{j \neq i} r^j_\delta - (|N| - 1)\, r^i_\delta \Big) + \alpha \sum_{c = r^i_1}^{r^i_1 + k} \eta^i_c    (11)

The players' continuous state dynamics, calculations for each checkpoint, and constraints on staying within track bounds (12)-(15) are effectively the same as in the original formulation:

x^j_{t+1} = f(x^j_t, u^j_t), \quad \forall\, t \in \hat{\mathcal{T}},\ j \in N    (12)
r^j_{t+1} = p(x^j_{t+1}, r^j_t), \quad \forall\, t \in \hat{\mathcal{T}},\ j \in N    (13)
r^j_1 = r^j_*, \quad \forall\, j \in N    (14)
q(x^j_t) \leq w, \quad \forall\, t \in \hat{\mathcal{T}},\ j \in N    (15)

The collision avoidance rules are simplified to just maintaining a minimum distance s_0, as the high-level planner will have already considered the nuances of rear-end collision avoidance responsibilities in (8). As a result, we require the following constraint to hold for all t ∈ T̂, j ∈ N, and k ∈ N \ {j}:

d(x^j_t, x^k_t) \geq s_0    (16)
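A minimal sketch of evaluating objective (11) for a completed low-level rollout is shown below, using the ego and opponent checkpoint indices at step δ and the waypoint tracking errors η (whose dynamics are defined next). The argument names are assumptions made for illustration.

```python
def low_level_cost(ego_index, opponent_indices, tracking_errors, alpha):
    """Objective (11): opponents' checkpoint progress minus (|N|-1) times the ego's,
    plus alpha times the accumulated waypoint tracking error (lower is better)."""
    progress_term = sum(opponent_indices) - len(opponent_indices) * ego_index
    return progress_term + alpha * sum(tracking_errors)

# Example: ego passed checkpoint 12, two opponents passed 10 and 11, and the ego
# accrued small tracking errors at the waypoints it crossed.
print(low_level_cost(12, [10, 11], tracking_errors=[0.4, 0.1, 0.3], alpha=0.5))  # -2.6
```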
Finally, we define the dynamics of the waypoint error, η^i_y, introduced in the objective. It is equivalent to the accumulated tracking error of each target waypoint that player i has passed, using a function h : X × X → R that measures the distance. If a player has not passed a waypoint, then the variable indexed by that waypoint is set to 0. The variable's dynamics are expressed by the following constraint:

\eta^i_y = \begin{cases} \sum_{t \in \hat{\mathcal{T}}} h(x^i_t, \psi^i_y) & \text{if } \exists\, t \text{ such that } r^i_t \geq y \\ 0 & \text{otherwise} \end{cases}

1) Multi-Agent Reinforcement Learning Controller: Our first low-level approach trains a control policy with MARL. Its observation includes, among other inputs, rays spaced over a 180° field of view centered in the direction that the player is facing.

2) Linear-Quadratic Nash Game Controller: Our second low-level approach solves an LQNG using the coupled Riccati equations [19]. This method involves further simplifying the low-level formulation into a structure with a quadratic objective and linear dynamics. The continuous state is simplified to just four variables: x position, y position, v velocity, and θ heading. The control inputs u^i_t are explicitly broken into acceleration, a^i_t, and yaw rate, e^i_t. The planning horizon is reduced to δ̄, where δ̄ ≪ δ.
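The transcription does not spell out the linear model, but one way to obtain linear dynamics consistent with the four-variable state, the acceleration/yaw-rate inputs, and the time-invariant linearization around each agent's initial state mentioned in the results is to linearize a simple kinematic model. The model choice, the forward-Euler discretization, and the affine offset in the sketch below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def linearized_kart_dynamics(x0, dt):
    """Time-invariant linearization of the kinematic model
        x' = v cos(theta), y' = v sin(theta), v' = a, theta' = e
    about an initial state x0 = [x, y, v, theta], discretized with a forward-Euler
    step dt. Returns (A, B, c) such that x_{t+1} ~= A x_t + B u_t + c with u = [a, e]."""
    _, _, v0, th0 = x0
    f0 = np.array([v0 * np.cos(th0), v0 * np.sin(th0), 0.0, 0.0])
    A_c = np.array([[0.0, 0.0, np.cos(th0), -v0 * np.sin(th0)],
                    [0.0, 0.0, np.sin(th0),  v0 * np.cos(th0)],
                    [0.0, 0.0, 0.0,          0.0],
                    [0.0, 0.0, 0.0,          0.0]])
    B_c = np.array([[0.0, 0.0],
                    [0.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0]])
    A = np.eye(4) + dt * A_c
    B = dt * B_c
    c = dt * (f0 - A_c @ np.asarray(x0, dtype=float))
    return A, B, c

# Example: one 0.06 s step, matching the low-level planning interval reported later.
x0 = np.array([0.0, 0.0, 15.0, 0.2])
A, B, c = linearized_kart_dynamics(x0, dt=0.06)
print(A @ x0 + B @ np.array([1.0, -0.1]) + c)
```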
V. EXPERIMENTS

The high-level planner is paired with each of the two low-level planners discussed. We refer to our two hierarchical design variants as MCTS-RL and MCTS-LQNG.

A. Baseline Controllers

To measure the importance of our design innovations, we also consider three baseline controllers that resemble the methods developed in prior works.

1) End-to-End Multi-Agent Reinforcement Learning: The end-to-end MARL controller, referred to as "E2E," represents pure learning-based methods such as that of [15]. This controller has a similar reward/penalty structure as our low-level controller, but its observation structure is slightly different. Instead of observing the sequence of upcoming states as calculated by a high-level planner, E2E only receives the subsequence of locations from {c_i}_{i=1}^{τ} that denote the center of the track near the agent. As a result, it is fully up to its neural networks to learn how to plan strategic and safe moves.

2) Fixed Trajectory Linear-Quadratic Nash Game: The fixed trajectory LQNG controller, referred to as "Fixed-LQNG," uses the same LQNG low-level planner as our hierarchical variant, but it instead tracks a fixed trajectory around the track. This fixed trajectory is a racing line that is computed offline for a specific track using its geometry and the parameters of the vehicle, as seen in prior works [9], [10]. However, the online tracking involves game-theoretic reasoning rather than the single-agent optimal control of the prior works.

3) Fixed Trajectory Multi-Agent Reinforcement Learning: The fixed trajectory MARL controller, referred to as "Fixed-RL," is a learning-based counterpart to Fixed-LQNG. Control inputs are computed using a deep RL policy trained to track precomputed checkpoints that are fixed prior to the race.

B. Experimental Setup

Our controllers are implemented in the Unity Game Engine (code: https://www.github.com/ribsthakkar/HierarchicalKarting). Screenshots of the simulation environment are shown in Figure 3. We extend the Karting Microgame template [20] provided by Unity. The kart physics from the template is adapted to include cornering limitations and a tire wear percentage. Tire wear is modeled as an exponential decay curve that is a function of the accumulated angular velocity endured by the kart. This model captures the concept of losing grip as the tire is subjected to increased lateral loads. Multi-agent support is also added to the provided template in order to race the various autonomous controllers against each other or human players. The high-level planners run at 1 Hz, and the low-level planners run at 50 Hz. Specifically, δ̄ is set to 0.06 s for the LQNG planner. The implementation of the learning-based agents utilizes the Unity ML-Agents library [21]. All of the learning-based control agents are trained using the proximal policy optimization and self-play implementations from the library. They are also only trained on two sizes of oval-shaped tracks with the same number of training steps.
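As a rough sketch of the tire model described above (only its qualitative shape is stated in the paper), the snippet below treats remaining grip as an exponential decay in the angular velocity accumulated by the kart; the decay constant, the integration scheme, and the function names are made-up assumptions.

```python
import math

def tire_grip(accumulated_yaw, k=0.002):
    """Remaining grip fraction as an exponential decay in the accumulated
    angular velocity endured by the kart; k is an assumed decay constant."""
    return math.exp(-k * accumulated_yaw)

def tire_wear_percent(accumulated_yaw, k=0.002):
    return 100.0 * (1.0 - tire_grip(accumulated_yaw, k))

# Example: integrate |yaw rate| over time to accumulate lateral load.
dt, accumulated = 0.02, 0.0
for yaw_rate in [0.8, 1.2, 1.0, 0.5]:  # rad/s samples while cornering
    accumulated += abs(yaw_rate) * dt
print(round(tire_wear_percent(accumulated), 4))  # small wear after a brief corner
```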
Fig. 3. Kart racing environment from a racer's perspective (left), a bird's eye view of the oval track (right-top), and the complex track (right-bottom) in the Unity environment. The purple boxes visualize the lanes across checkpoints along the track, and the highlighted green boxes show planned waypoints determined by the hierarchical controllers.

Our experiments include head-to-head racing on a basic oval track (which the learning-based agents were trained on) and a more complex track shown in Figure 3. Specifically, the complex track involves challenging track geometry with turns whose radii change along the curve, tight U-turns, and turns in both directions. To be successful, the optimal racing strategy requires some understanding of the shape of the track along a sequence of multiple turns. Every pair of controllers competes head-to-head in 50 races on both tracks. The dynamical parameters of each player's vehicle are identical, and the players start every race at the same initial checkpoint. The only difference in their initial states is the lane in which they start. In order to maintain fairness with respect to starting closer to the optimal racing line, we alternate the starting lanes between each race for the players.

C. Results

Our experiments primarily seek to identify the importance of hierarchical game-theoretic reasoning and the strength of MCTS as a high-level planner for racing games. We count the number of wins against each opponent, average collisions-at-fault per race, average illegal lane changes per race, and a safety score (the sum of the prior two metrics) for the controllers. We also provide a video demonstrating them in action (https://www.youtube.com/playlist?list=PLEkfZ4KJSCcG4yGWD7K5ENGXW5nBPYiF1). Based on the results visualized in Figures 4 and 5, we conclude the following key points.

Fig. 4. Results from head-to-head racing on the oval track (wins against each opponent; average safety score, illegal lane changes, and collisions-at-fault per race, where a lower safety score is better).

Fig. 5. Results from head-to-head racing on the complex track (same metrics as Figure 4).

1) The proposed hierarchical variants outperformed their respective baselines. The results amongst MCTS-RL, Fixed-RL, and E2E show the effectiveness of our proposed hierarchical structure. While all three of the MARL-based agents were only trained on the oval track, the MCTS-RL agent was able to win the most head-to-head races while also maintaining the best safety score by better adapting its learning. Comparing the baselines against each other, Fixed-RL also has more wins and a better safety score than E2E across both tracks. This result indicates that some type of hierarchical structure is favorable. It suggests that the straightforward task of trajectory tracking is much easier to learn for a deep neural network than having to learn both strategic planning and respect for the safety and fairness rules.
Next, we compare MCTS-LQNG and Fixed-LQNG. Although MCTS-LQNG has a worse overall safety score, it has 25% more wins when aggregated over both tracks. Fixed-LQNG has a similar number of overall wins on the oval track, but when the racetrack is more complicated, Fixed-LQNG quickly becomes inferior. The oval track has just one main racing line, but there are many reasonable racing lines in the complex track that must be considered to be competitive. MCTS-LQNG accounts for these trajectories by using the high-level MCTS planner and is, therefore, more successful in its races against MARL-based agents on the complex track, with four times the number of wins against them compared to the Fixed-LQNG agent. MCTS-LQNG considered trajectories that could result in overtakes when opponents made mistakes in any part of the track. On the other hand, Fixed-LQNG was forced to rely on opponents making mistakes that were not along its fixed trajectory to make overtakes. However, considering alternative lines also accounts for the main difference in their safety scores: both have similar collisions-at-fault, but MCTS-LQNG has more illegal lane changes.

2) MARL is more successful and robust than LQNG as a low-level planner. Overall, the MARL-based agents outperformed their LQNG-based counterparts in terms of both key metrics: wins and safety scores. However, this result is likely due to our simplifications involving a time-invariant linearization around the initial state of each agent, meaning the approximation is only valid for a very short time horizon. Therefore, the LQNG-based agents could only rely on braking/acceleration instead of yaw rate to avoid collisions. As a result, the weights in the objective of the LQNG formulation are set conservatively to emphasize avoiding collisions. This setup also implies that LQNG-based agents often concede in close battles and thereby lose races because of the high cost in the planning objective of driving near another player even if there is no collision.

While Fixed-LQNG has a better safety score than Fixed-RL, MCTS-RL has a significantly better safety score than MCTS-LQNG. Just in terms of collision avoidance, both RL-based agents have worse numbers because the LQNG-based agents are tuned to be conservative. However, MCTS-LQNG has significantly more illegal lane changes per race compared to MCTS-RL, while Fixed-LQNG has slightly fewer illegal lane changes per race compared to Fixed-RL. As discussed previously, the fixed trajectory agents do not consider alternative racing lines, so they rarely break the lane-changing limit rule in the first place. In the MCTS case, the high-level planner runs in parallel with the low-level planner and at a lower frequency. As a result, the calculated high-level plan uses slightly out-of-date information and does not account for the fact that the low-level controllers have already made choices that might contradict the initial steps in the plan. This mismatch causes the LQNG-based controller to break the lane-changing rules more often by swerving across the track to immediately follow the high-level plan when it is updated. The MCTS-RL agent is more robust to this situation because it has those safety rules encoded in its reward structure, albeit with smaller weights. It does not track the waypoints exactly and learns to smooth the trajectory produced by the high-level plan with the live situation in the game.

3) MCTS-RL outperforms all other implemented controllers. Aggregating the results from both tracks, MCTS-RL recorded a win rate of 83% over its 400 head-to-head races and the second-best safety score, only behind the conservatively tuned Fixed-LQNG agent. It combined the advantage of having a high-level planner that evaluates long-term plans and a low-level planner that is robust to the possibility that the high-level plans may be out of date. For example, Figure 6a demonstrates how the high-level planner provided a long-term strategy, guiding the agent to give up an advantage at present for a greater advantage in the future when overtaking. The RL-based low-level planner approximately follows the high-level strategy even when stochasticity of the MCTS algorithm yields a waypoint that seems out of place (e.g., the checkpoint between t = 3 and t = 4 in Figure 6a). Furthermore, MCTS-RL is also successful at executing defensive maneuvers, as seen in Figure 6b, due to those same properties of long-term planning and low-level robustness. Both of these tactics resemble strategies of expert human drivers in real head-to-head racing.

VI. CONCLUSION

We developed a hierarchical controller for multi-agent autonomous racing that adheres to safety and fairness rules found in real-life racing and outperforms other common control techniques such as purely optimization-based or purely learning-based control methods. Our high-level planner constructed long-term trajectories that abided by the introduced complex rules about collision avoidance and lane changes.
Fig. 6. (a) An overtaking maneuver executed by the MCTS-RL agent (green) against the E2E agent (blue) on the complex track. Notice how, from t = 0 to t = 2, the MCTS-RL agent gives up a slight advantage and takes a wider racing line on the first turn. However, the exit of the wide racing line of the first turn places the MCTS-RL agent at the inside of the next two turns, where it is able to gain an even greater advantage when passing the E2E agent from t = 3 to t = 6. The green boxes along each checkpoint also highlight the long-term plan calculated by the MCTS planner for this tactic. (b) A defensive maneuver executed by the MCTS-RL agent (green) against the E2E agent (blue) on the complex track. Before reaching the turn, the MCTS planner calculates to switch lanes to the left first (t = 0 to t = 1) and then move to the right for the inside of the turn. This motion forces the E2E agent to make an evading move to avoid a collision and take an even wider turn, thus increasing the overall gap at the end. The green boxes along each checkpoint highlight the long-term plan calculated by the MCTS planner for this tactic.

As a result, we design an objective for the low-level controllers to focus on tracking the high-level plan, which is an easier problem to solve compared to the original racing-game formulation. Our method outperformed the baselines both in terms of winning head-to-head races and a safety score measuring obedience to the rules of the game. Finally, the controllers also exhibited maneuvers resembling those performed by expert human drivers.

Future work should introduce additional high-level and low-level planners and investigate policy-switching hierarchical controllers, where we switch between various high- and low-level controllers depending on the state of the game. Lastly, our hierarchical control design can be extended to other multi-agent systems applications where there exist complex rules, such as energy grid systems or air traffic control. Constructing a discrete high-level game allows for natural encoding of the complex constraints, often involving discrete components, to find an approximate solution that can warm start a more precise low-level planner.

REFERENCES

[1] Faisal, A., Kamruzzaman, M., Yigitcanlar, T. & Currie, G. Understanding autonomous vehicles. Journal Of Transport And Land Use. 12, 45-72 (2019)
[2] Edelstein, S. (2019, June 9). Racing Tech That Migrated to Your Current Car. Digital Trends. Retrieved February 17, 2022, from https://www.digitaltrends.com/cars/racing-tech-in-your-current-car/
[3] Wang, M., Wang, Z., Talbot, J., Gerdes, J. & Schwager, M. Game Theoretic Planning for Self-Driving Cars in Competitive Scenarios. Robotics: Science And Systems. (2019)
[4] Wang, M., Wang, Z., Talbot, J., Gerdes, J. & Schwager, M. Game-theoretic planning for self-driving cars in multivehicle competitive scenarios. IEEE Transactions On Robotics. 37, 1313-1325 (2021)
[5] Li, N., Goubault, E., Pautet, L. & Putot, S. Autonomous racecar control in head-to-head competition using mixed-integer quadratic programming. Institut Polytechnique De Paris, Tech. Rep. (2021)
[6] He, S., Zeng, J. & Sreenath, K. Autonomous Racing with Multiple Vehicles using a Parallelized Optimization with Safety Guarantee using Control Barrier Functions. ArXiv Preprint arXiv:2112.06435. (2021)
[7] Martin, T. (2020, October 22). The Guide to Road Racing, Part 8: Passing Etiquette. Winding Road For Racers. Retrieved February 17, 2022, from https://www.windingroad.com/articles/blogs/the-road-racers-guide-to-passing-etiquette/
[8] Hou, J. & Wang, T. The development of a simulated car racing controller based on Monte-Carlo tree search. 2016 Conference On Technologies And Applications Of Artificial Intelligence (TAAI). 104-109 (2016)
[9] Vázquez, J., Brühlmeier, M., Liniger, A., Rupenyan, A. & Lygeros, J. Optimization-based hierarchical motion planning for autonomous racing. 2020 IEEE/RSJ International Conference On Intelligent Robots And Systems (IROS). (2020)
[10] Stahl, T., Wischnewski, A., Betz, J. & Lienkamp, M. Multilayer graph-based trajectory planning for race vehicles in dynamic scenarios. 2019 IEEE Intelligent Transportation Systems Conference (ITSC). (2019)
[11] Kabzan, J., Hewing, L., Liniger, A. & Zeilinger, M. Learning-Based Model Predictive Control for Autonomous Racing. IEEE Robotics And Automation Letters. 4, 3363-3370 (2019)
[12] Remonda, A., Krebs, S., Veas, E., Luzhnica, G. & Kern, R. Formula RL: Deep reinforcement learning for autonomous racing using telemetry data. ArXiv Preprint arXiv:2104.11106. (2021)
[13] Fakhry, A. Applying a Deep Q Network for OpenAI's Car Racing Game. (2020)
[14] Spica, R., Cristofalo, E., Wang, Z., Montijano, E. & Schwager, M. A real-time game theoretic planner for autonomous two-player drone racing. IEEE Transactions On Robotics. 36, 1389-1403 (2020)
[15] Wurman, P., Barrett, S., Kawamoto, K., MacGlashan, J., Subramanian, K., Walsh, T., Capobianco, R., Devlic, A., Eckert, F., Fuchs, F. & Others. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature. 602, 223-228 (2022)
[16] Liniger, A. Path Planning and Control for Autonomous Racing. (Thesis, 2018)
[17] Fisac, J., Bronstein, E., Stefansson, E., Sadigh, D., Sastry, S. & Dragan, A. Hierarchical game-theoretic planning for autonomous vehicles. 2019 International Conference On Robotics And Automation (ICRA). pp. 9590-9596 (2019)
[18] Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. International Conference On Computers And Games. (2006)
[19] Basar, T. & Olsder, G. Dynamic Noncooperative Game Theory: Second Edition. (SIAM, 1999)
[20] Unity Technologies. Karting Microgame Template. (2021), https://assetstore.unity.com/packages/templates/karting-microgame-150956
[21] Juliani, A., Berges, V., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y., Henry, H., Mattar, M. & Others. Unity: A general platform for intelligent agents. ArXiv Preprint arXiv:1809.02627. (2018)