Hierarchical Control for Multi-Agent Autonomous Racing
Rishabh Saumil Thakkar*, Aryaman Singh Samyal*, David Fridovich-Keil*, Zhe Xu†, Ufuk Topcu*
*The University of Texas at Austin
{rishabh.thakkar, aryamansinghsamyal, dfk, utopcu}@utexas.edu
†Arizona State University
{xzhe1}@asu.edu

Abstract— We develop a hierarchical controller for multi-agent autonomous racing. A high-level planner approximates the race as a discrete game with simplified dynamics that encodes the complex safety and fairness rules seen in real-life racing and calculates a series of target waypoints. The low-level controller takes the resulting waypoints as a reference trajectory and computes high-resolution control inputs by solving a simplified formulation of a multi-agent racing game. We consider two approaches for the low-level planner to construct two hierarchical controllers. One approach uses multi-agent reinforcement learning (MARL), and the other solves a linear-quadratic Nash game (LQNG) to produce control inputs. We test the controllers against three baselines: an end-to-end MARL controller, a MARL controller tracking a fixed racing line, and an LQNG controller tracking a fixed racing line. Quantitative results show that the proposed hierarchical methods outperform their respective baseline methods in terms of head-to-head race wins and abiding by the rules. The hierarchical controller using MARL for low-level control consistently outperformed all other methods by winning over 88% of head-to-head races and more consistently adhered to the complex racing rules. Qualitatively, we observe the proposed controllers mimicking actions performed by expert human drivers such as shielding/blocking, overtaking, and long-term planning for delayed advantages. We show that hierarchical planning for game-theoretic reasoning produces competitive behavior even when challenged with complex rules and constraints.

Index Terms— multi-agent systems, reinforcement learning, hierarchical control, game theory, Monte Carlo methods

I. INTRODUCTION

Autonomous driving has seen an explosion of research in academia and industry [1]. While most of these efforts focus on day-to-day driving, there is growing interest in autonomous racing. Many advancements in commercial automobiles have originated from projects invented for use in motorsports such as disc brakes, rear-view mirrors, and sequential gearboxes [2]. The same principle can apply when designing self-driving controllers because racing provides a platform to develop these controllers to be highly performant, robust, and safe in challenging scenarios.

Successful human drivers are required to both outperform opponents and adhere to the rules of racing. These objectives are effectively at odds with one another, but the best racers can satisfy both. Prior approaches in autonomous racing usually over-simplify the latter by only considering collision avoidance [3]-[6]. In reality, these racing rules often involve discrete variables and complex nuances [7]. For example, a driver may not change lanes more than a fixed number of times when traveling along a straight section of the track. While it is relatively straightforward to describe this rule in text, it is challenging to encode it in a mathematical formulation that can be solved by existing methods for real-time control. These methods have to compromise by either shortening their planning horizons or simply ignoring these constraints. The resulting behavior is an agent that is not optimal, or an agent that may be quick but is unsafe or unfair.

We develop a hierarchical control scheme that reasons about optimal long-term plans and closely adheres to the safety and fairness rules of a multi-agent racing game. The high-level planner forms a discrete approximation of the general formulation of the game. The solution of the discrete problem produces a series of waypoints that both adhere to the rules and are approximately optimal. The low-level planner solves a simplified, continuous state/action dynamic game with an objective of hitting as many of the waypoints as possible and a reduced form of the safety rules. Our structure yields a controller that runs in real time and outperforms other traditional control methods in terms of head-to-head performance and obedience to safety rules. The control architecture is visualized in Figure 1. Although we develop our controller in the context of a racing game, the structure of this method enables reasoning about long-term optimal choices in a game-theoretic setting with complex constraints involving temporal logic and both continuous and discrete dynamics. Hence, it is possible to apply this method to many other adversarial settings that exhibit the aforementioned properties such as financial systems, power systems, or air traffic control.

II. PRIOR WORK

Because multi-agent racing is inherently a more complex problem, most prior work in autonomous racing is focused on single-agent lap time optimization, with fewer and more recent developments in multi-agent racing.

Single-agent racing approaches utilize a mixture of optimization and learning-based methods. One study uses Monte Carlo tree search to estimate where to position the car around various shaped tracks to define an optimal trajectory [8]. The work in [9] proposes a method that computes an optimal trajectory offline and uses a model predictive control (MPC) algorithm to track the optimized trajectory online. Similarly, the authors of [10] also perform calculations offline by creating a graph representation of the track to compute
a target path and use spline interpolation for online path generation in an environment with static obstacles. In the category of learning-based approaches, online learning to update parameters of an MPC algorithm based on feedback from applying control inputs is developed in [11]. Further, there are works that develop and compare various deep reinforcement learning methods to find and track optimal trajectories [12], [13].

Looking at multi-agent racing works, both optimization and learning-based control approaches are also used. The authors of [5] use a mixed-integer quadratic programming formulation for head-to-head racing with realistic collision avoidance but concede that this formulation struggles to run in real time. Another study proposes a real-time control mechanism for a game with a pair of racing drones [14]. This work provides an iterative best-response method while solving an MPC problem that approximates a local Nash equilibrium. It is eventually extended to automobile racing [3] and multi-agent scenarios with more than two racers [4]. A faster, real-time MPC algorithm to make safe overtakes is developed in [6], but their method does not consider adversarial behavior from the opposing players. Again, these approaches do not consider racing rules other than simple collision avoidance. The work in [15] develops an autonomous racing controller using deep reinforcement learning that considers the rules of racing beyond just simple collision avoidance. Their controller outperforms expert humans while also adhering to proper racing etiquette. It is the first study to consider nuanced safety and fairness rules of racing and does so by developing a reward structure that trains a controller to understand when it is responsible for avoiding collisions and when it can be more aggressive.

Finally, hierarchical game-theoretic reasoning is a method that has been previously studied in the context of autonomous driving. A hierarchical racing controller was introduced in [16] that constructed a high-level planner with simplified dynamics to sample sequences of constant curvature arcs and a low-level planner to use MPC to track the arc that provided the furthest progress along the track. A two-level planning system is developed in [17] to control an autonomous vehicle in an environment with aggressive human drivers. The upper-level system produces a plan to be safe against the uncertainty of the human drivers in the system by using simplified dynamics. The lower-level planner implements the strategy determined by the upper-level planner using precise dynamics.

Fig. 1. Two-level planning architecture of the proposed racing controller.

III. GENERAL MULTI-AGENT RACING GAME FORMULATION

To motivate the proposed control design, we first outline a dynamic game formulation of a general multi-agent racing game.

Let there be a set N of players racing over T steps in T = {1, ..., T}. There is a track defined by a sequence of τ checkpoints along the center, {c_i}_{i=1}^τ, whose indices are in a set C = {1, ..., τ}. The objective of each player i is to minimize the pairwise differences between its time to reach the final checkpoint and that of every other player. In effect, each player aims to reach the finish line with the largest time advantage. The continuous state (such as position, speed, or tire wear) of each player, denoted x^i_t ∈ X ⊆ R^n, and control, denoted u^i_t ∈ U ⊆ R^k, are governed by known dynamics f^i. We also introduce a pair of discrete state variables r^i_t ∈ C and γ^i ∈ T. The index of the latest checkpoint passed by player i at time t is r^i_t, and it is computed by a function p : X → C. The earliest time when player i reaches c_τ is γ^i. Using these definitions, we formulate the objective (1) and core dynamics (2)-(6) of the game as follows:

  min_{u^i_0, ..., u^i_T}  (|N| − 1) γ^i − Σ_{j≠i} γ^j                                    (1)

  x^j_{t+1} = f(x^j_t, u^j_t),        ∀ t ∈ T, j ∈ N                                      (2)
  r^j_{t+1} = p(x^j_{t+1}, r^j_t),    ∀ t ∈ T, j ∈ N                                      (3)
  r^j_1 = 1,                          ∀ j ∈ N                                             (4)
  r^j_T = τ,                          ∀ j ∈ N                                             (5)
  γ^j = min{t | r^j_t = τ ∧ t ∈ T},   ∀ j ∈ N                                             (6)

In addition to the core dynamics of the game, there are rules that govern the players' states. To ensure that the players stay within the bounds of the track, we introduce a function q : X → R, which computes a player's distance to the closest point on the center line. This distance must be limited to the width of the track w. Therefore, for all t ∈ T and j ∈ N:

  q(x^j_t) ≤ w                                                                            (7)

Next, we define the collision avoidance rules of the game. We use an indicator function that evaluates whether one player is "behind" another. Depending on this condition, the distance between every pair of players, computed by a function d : X × X → R, is required to be at least s1 if one player is behind the other, or s0 otherwise. For all t ∈ T, j ∈ N, and k ∈ N \ {j}, these rules are expressed by the constraint:

  d(x^j_t, x^k_t) ≥ { s1   if player k is behind player j
                    { s0   otherwise                                                      (8)
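To make these rule constraints concrete, the following minimal Python sketch checks the track-limit constraint (7) and the follower-distance constraint (8) for one pair of players. The helper names, the numeric values of w, s0, and s1, and the externally supplied "behind" flag are illustrative assumptions, not the paper's implementation.

import numpy as np

def dist_to_centerline(pos, centerline):
    """q(x): distance from a player's position to the closest centerline point."""
    return np.min(np.linalg.norm(centerline - pos, axis=1))

def racing_rules_ok(pos_j, pos_k, k_is_behind_j, centerline, w=4.0, s0=1.0, s1=2.0):
    """Check constraint (7) for player j and constraint (8) for the pair (j, k).
    w, s0, s1 are illustrative values; k_is_behind_j plays the role of the
    indicator used in (8)."""
    on_track = dist_to_centerline(pos_j, centerline) <= w           # constraint (7)
    required_gap = s1 if k_is_behind_j else s0                      # constraint (8)
    gap_ok = np.linalg.norm(pos_j - pos_k) >= required_gap
    return on_track and gap_ok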
Finally, players are limited in how often they may change lanes depending on the part of the track they are at. We assume that there are λ ∈ Z+ lanes across all parts of the track. If the player's location on the track is classified as a curve, there is no limit on lane changing. However, if the player is at a location classified as a straight, it may not change lanes more than L times for the contiguous section of the track classified as a straight. We define a set S that contains all possible states where a player is located at a straight section. We also introduce a function z : X → {1, 2, ..., λ} that returns the lane ID of a player's position on the track. Using these definitions, we introduce a variable l^j_t calculated by the following constraint for all t ∈ T and j ∈ N:

  l^j_t = { l^j_{t−1} + 1   if 1[x^j_t ∈ S] = 1[x^j_{t−1} ∈ S] ∧ z(x^j_t) ≠ z(x^j_{t−1})
          { 0               otherwise                                                     (9)

This variable effectively represents a player's count of "recent" lane changes over a sequence of states located across a contiguous straight or curved section of the track. However, the variable is only required to be constrained if the player is on a straight section of the track. Therefore, the following constraint must hold for all t ∈ T and j ∈ N if x^j_t ∈ S:

  l^j_t ≤ L                                                                               (10)
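Constraints (9) and (10) translate directly into a small per-step update. The sketch below is one way to track the counter, with on_straight standing in for membership in S and lane for the output of z(·); the limit L = 2 is an arbitrary illustrative value rather than one taken from the paper.

def update_lane_change_count(prev_count, prev_on_straight, on_straight, prev_lane, lane):
    """Update l_t per (9): increment when the straight/curve classification is
    unchanged and the lane ID changed; reset to 0 otherwise."""
    if on_straight == prev_on_straight and lane != prev_lane:
        return prev_count + 1
    return 0

def lane_change_rule_ok(count, on_straight, L=2):
    """Constraint (10): the count is only bounded while on a straight."""
    return (not on_straight) or (count <= L)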
                                                                  dynamic game that is an approximation of the general game
Most prior multi-agent racing formulations [3]-[5] do not include the complexities we introduced through defining constraints (8)-(10). They usually have a similar form regarding continuous dynamics and discrete checkpoints (2)-(6), and their rules only involve staying on track (7) and collision avoidance with a fixed distance. However, in real-life racing, these complexities do exist, both in the form of mutually understood unwritten rules and explicit safety rules [7]. As a result, we account for two of the key rules that ensure the game remains fair and safe:

1) There is a greater emphasis on and responsibility of collision avoidance for a vehicle that is following another (8).
2) The player may only switch lanes L times while on a straight section of the track (9)-(10).

The first rule ensures that a leading player can make a decision without needing to consider an aggressive move that risks a rear-end collision, or a side collision while turning, from the players that are following. The second rule ensures that the leading player may not engage in aggressive swerving or "zig-zagging" across the track that would make it impossible for a player that is following the leader to safely challenge for an overtake. While functions may exist to evaluate these spatially and temporally dependent constraints, their discrete nature suggests that they cannot be easily differentiated. Therefore, most state-of-the-art optimization algorithms would not apply or would struggle to find a solution in real time.

IV. HIERARCHICAL CONTROL DESIGN

Traditional optimization-based control methods cannot easily be utilized for the general multi-agent racing game formulated with realistic safety and fairness rules. The rules involve nonlinear constraints over both continuous and discrete variables, and a mixed-integer non-linear programming algorithm would be unlikely to run at the 25 Hz to 50 Hz rates needed for precise control. This inherent challenge encourages utilizing a method such as deep reinforcement learning or trying to solve the game using short horizons.

However, we propose a hierarchical control design involving two parts that work to ensure all of the rules are followed while approximating long-term optimal choices. The high-level planner transforms the general formulation into a game with discrete states and actions where all of the discrete rules are naturally encoded. The solution provided by the high-level planner is a series of discrete states (i.e., waypoints) for each player, which satisfies all of the rules. Then, the low-level planner solves a simplified version of the racing game with an objective that puts greater emphasis on tracking the waypoints and smaller emphasis on the original game-theoretic objective, along with a simplified version of the rules. Therefore, this simplified formulation can be solved by an optimization method in real time or trained into a neural network when using a learning-based method.

A. High-Level Planner

The high-level planner constructs a turn-based, discrete dynamic game that is an approximation of the general game (1)-(10). Continuous components of a player's state are broken into discrete "buckets" (e.g., speed between 2 m/s and 4 m/s, tire wear between 10% and 15%). In addition, λ (the number of lanes) points around each checkpoint are chosen along a line perpendicular to the direction of travel, where each point evaluates to a unique lane ID on the track when passed into the function z(·) defined in the general formulation. The left and center of Figure 2 visualize the checkpoints in the original, continuous formulation (in red) expanded into three discrete lanes (green or purple) for the high-level game.
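As a rough illustration of this discretization, the sketch below maps the continuous components of a player's state to a discrete high-level state of (lane ID, speed bucket, tire-wear bucket). The bucket edges are assumptions chosen only to echo the examples in the text; the paper does not specify them.

import bisect

SPEED_EDGES = [2.0, 4.0, 6.0, 8.0]      # m/s, illustrative bucket boundaries
WEAR_EDGES = [0.10, 0.15, 0.25, 0.50]   # tire wear fraction, illustrative

def to_discrete_state(lane_id, speed, tire_wear):
    """Bucket the continuous state; lane_id is assumed to come from z(.)."""
    speed_bucket = bisect.bisect_right(SPEED_EDGES, speed)
    wear_bucket = bisect.bisect_right(WEAR_EDGES, tire_wear)
    return (lane_id, speed_bucket, wear_bucket)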
Fig. 2. The uncountably infinite trajectories of the general game (left) discretized by the high-level planner (middle). The sequence of target waypoints
calculated by the high-level planner (in green) is tracked by the low-level planner (right) and converges to a continuous trajectory (in black).

The players' actions are defined by pairs of a lane ID, which resolves to a target location near the next checkpoint, and a target speed at that location. Therefore, we can apply a simplified inverse approximation of the dynamics to determine the time it would take to transition from one checkpoint to the next and estimate the remaining state variables, or dismiss the action if it is dynamically infeasible. This action space also allows us to easily evaluate or prevent actions where rules of the game would be broken. By limiting choices to fixed locations across checkpoints, we ensure that the players always remain on track (7). Moreover, the players' actions can be dismissed if they would violate the limit on the number of lane changes, by checking whether choosing a lane would exceed their limits or checking if the location is a curve or straight (10). Finally, other actions that could cause collisions can also be dismissed by estimating that if two players reach the same lane at a checkpoint and have a small difference in their time states, there would be a high risk of collision (8).
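The sketch below illustrates how candidate high-level actions could be enumerated and filtered in the way just described. The inverse-dynamics time estimate, the lane-rule check, the dictionary state layout, and the collision-gap threshold are all hypothetical placeholders; only the dismissal logic mirrors the text.

import itertools

def feasible_actions(state, opponents, lanes, speed_buckets,
                     transition_time, lane_rule_ok, collision_gap=0.5):
    """Enumerate (lane, speed) actions at the next checkpoint and drop those that
    are infeasible or rule-breaking.

    transition_time(state, lane, speed) -> estimated time to the next checkpoint,
        or None if the action is dynamically infeasible (simplified inverse dynamics).
    lane_rule_ok(state, lane) -> False if the lane-change limit (10) would be broken.
    collision_gap: illustrative threshold on the difference in time states used to
        flag a likely collision at a shared lane, in the spirit of (8).
    """
    kept = []
    for lane, speed in itertools.product(lanes, speed_buckets):
        dt = transition_time(state, lane, speed)
        if dt is None:
            continue                                    # dynamically infeasible
        if not lane_rule_ok(state, lane):
            continue                                    # would break (9)-(10)
        arrival = state["time"] + dt
        risky = any(opp["lane"] == lane and abs(opp["time"] - arrival) < collision_gap
                    for opp in opponents)
        if not risky:
            kept.append((lane, speed, arrival))
    return kept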
The game is played with each player starting at the initial checkpoint, and it progresses by resolving all players' choices one checkpoint at a time. The order in which the players take their actions is determined by who has the smallest time state at each checkpoint. A lower time state value implies that a player was at the given checkpoint before other players with a larger time state, so it would have made its choice at that location before the others. This ordering also implies that players who arrive at a checkpoint after preceding players observe the actions of those preceding players. Therefore, these observations can contribute to their strategic choices. Most importantly, because the ordering forces the following players to choose last, we also capture the rule that the following players (i.e., those that are "behind" others) are responsible for collision avoidance after observing the leading players' actions.

The objective of the discrete game is to minimize the difference between one's own time state at the final checkpoint and that of all other players, just like the original formulation (1). Although the discrete game is much simpler than the original formulation, the state space grows as the number of actions and checkpoints increases. Therefore, we solve the game in a receding horizon manner, but our choice of the horizon (i.e., number of checkpoints to consider) extends much further into the future than an MPC-based continuous state/action space controller can handle in real time [3]. In order to produce a solution to the discrete game in real time, we use the Monte Carlo tree search (MCTS) algorithm [18]. The solution from applying MCTS is a series of waypoints in the form of target lane IDs (which can be mapped back to positions on track) and target velocities at each of the checkpoints for the ego player, along with estimates of the best response lanes and velocities of the adversarial players.
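For concreteness, here is a compact UCT-style Monte Carlo tree search skeleton of the kind that could drive the high-level planner in a receding-horizon loop. The game object (legal_actions, step, is_terminal, payoff) is an assumed abstraction of the turn-based discrete racing game, the rollout policy is uniformly random, and none of this is the authors' code; the MCTS cited in [18], with tuned rollouts and best-response bookkeeping, is considerably more involved.

import math
import random

class Node:
    """One node of the turn-based discrete game tree."""
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action          # (lane ID, target speed) that led here
        self.children = []
        self.untried = None           # actions not yet expanded
        self.visits = 0
        self.value = 0.0              # accumulated payoff for the ego player

def ucb(child, parent, c=1.4):
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(game, root_state, iterations=2000):
    root = Node(root_state)
    root.untried = list(game.legal_actions(root_state))
    for _ in range(iterations):
        node = root
        # 1) Selection: descend through fully expanded nodes by UCB.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node))
        # 2) Expansion: create one child for an untried action, if any remain.
        if node.untried:
            action = node.untried.pop()
            child_state = game.step(node.state, action)
            child = Node(child_state, parent=node, action=action)
            child.untried = list(game.legal_actions(child_state))
            node.children.append(child)
            node = child
        # 3) Rollout: random play until the horizon/terminal state is reached.
        state = node.state
        while not game.is_terminal(state):
            state = game.step(state, random.choice(game.legal_actions(state)))
        payoff = game.payoff(state)   # ego player's time advantage at the end
        # 4) Backpropagation.
        while node is not None:
            node.visits += 1
            node.value += payoff
            node = node.parent
    if not root.children:
        return None
    return max(root.children, key=lambda ch: ch.visits).action

In a receding-horizon use, the action returned here would be handed to the low-level planner as the next target waypoint, and the search would be rerun as the game advances.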
B. Low-Level Planner

The low-level planner is responsible for producing the control inputs, so it must operate in real time. Because we have a long-term plan from the high-level planner, we can formulate a reduced version of the original game for our low-level planner. The low-level game is played over a shorter horizon compared to the original game of just δ steps in T̂ = {1, ..., δ}. We assume that the low-level planner for player i has received k waypoints, ψ^i_{r^i_1}, ..., ψ^i_{r^i_1+k}, from the high-level planner, and player i's last passed checkpoint r^i_*.

The low-level objective involves two components. The first is to maximize the difference between its own checkpoint index and the opponents' checkpoint indices at the end of δ steps. The second is to minimize the tracking error, η^i_y, of every passed waypoint ψ^i_{r^i_1+y}. The former component influences the player to pass as many checkpoints as possible, which suggests reaching c_τ as quickly as possible. The latter influences the player to be close to the high-level waypoints when passing each of the checkpoints. The objective also includes a multiplier α that balances the emphasis of the two parts. The objective is written as follows:

  min_{u^i_1, ..., u^i_δ}  ( Σ_{j≠i} r^j_δ − (|N| − 1) r^i_δ ) + α Σ_{c=r^i_1}^{r^i_1+k} η^i_c         (11)

The players' continuous state dynamics, calculations for each checkpoint, and constraints on staying within track bounds (12)-(15) are effectively the same as in the original formulation:

  x^j_{t+1} = f(x^j_t, u^j_t),        ∀ t ∈ T̂, j ∈ N                                     (12)
  r^j_{t+1} = p(x^j_{t+1}, r^j_t),    ∀ t ∈ T̂, j ∈ N                                     (13)
  r^j_1 = r^j_*,                      ∀ j ∈ N                                             (14)
  q(x^j_t) ≤ w,                       ∀ t ∈ T̂, j ∈ N                                     (15)

The collision avoidance rules are simplified to just maintaining a minimum distance s0, as the high-level planner would have already considered the nuances of rear-end collision avoidance responsibilities in (8). As a result, we require the following constraint to hold for all t ∈ T̂, j ∈ N, and k ∈ N \ {j}:

  d(x^j_t, x^k_t) ≥ s0                                                                    (16)
Finally, we define the dynamics of the waypoint error, η^i_y, introduced in the objective. It is equivalent to the accumulated tracking error of each target waypoint that player i has passed, using a function h : X × X → R that measures the distance. If a player has not passed a waypoint, then the variable indexed by that waypoint is set to 0. The variable's dynamics are expressed by the following constraint:

  η^i_y = { Σ_{t ∈ T̂} h(x^i_t, ψ^i_y)   if ∃ t ∈ T̂ : r^i_t ≥ y
          { 0                            otherwise                                        (17)
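Below is a minimal sketch of how the low-level tracking cost could be evaluated for a candidate trajectory, for example inside an RL reward or an optimizer's cost. The array layout is assumed, and the per-waypoint error is taken here as the closest approach of the trajectory to that waypoint, which is one plausible reading of (17) rather than the paper's exact definition.

import numpy as np

def low_level_cost(ego_ckpt, opp_ckpts, ego_traj, waypoints, passed_flags, alpha=0.5):
    """Evaluate an objective in the spirit of (11) at the end of the short horizon.

    ego_ckpt     : ego checkpoint index r^i_delta
    opp_ckpts    : list of opponents' checkpoint indices r^j_delta
    ego_traj     : (delta, 2) array of ego positions over the horizon
    waypoints    : (k, 2) array of high-level target waypoints psi
    passed_flags : length-k booleans, True if waypoint y was passed
    alpha        : illustrative weight balancing progress vs. tracking
    """
    n_players = len(opp_ckpts) + 1
    progress_term = sum(opp_ckpts) - (n_players - 1) * ego_ckpt
    tracking_term = 0.0
    for wp, passed in zip(waypoints, passed_flags):
        if passed:
            tracking_term += np.linalg.norm(ego_traj - wp, axis=1).min()
    return progress_term + alpha * tracking_term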
rays spaced over a 180° field of view centered in the direction that the player is facing.

2) Linear-Quadratic Nash Game Controller: Our second low-level approach solves an LQNG using the coupled Riccati equations [19]. This method involves further simplifying the low-level formulation into a structure with a quadratic objective and linear dynamics. The continuous state is simplified to just four variables: x position, y position, velocity v, and heading θ. The control inputs u^i_t are also explicitly broken into acceleration, a^i_t, and yaw rate, e^i_t. The planning horizon is reduced to δ̄, where δ̄ < δ.
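To illustrate the kind of simplification this involves, the sketch below linearizes a kinematic model with state (x, y, v, θ) and inputs (a, e) about an operating point and discretizes it with a forward-Euler step, which is consistent with the time-invariant linearization about the initial state mentioned later in the results. The model, step size, and discretization choice are assumptions for illustration; they are not the paper's matrices.

import numpy as np

def linearized_dynamics(v0, theta0, dt=0.02):
    """Discrete-time A, B for state [x, y, v, theta] and input [a, e], obtained by
    Euler-discretizing the Jacobians of x' = v cos(theta), y' = v sin(theta),
    v' = a, theta' = e about the operating point (v0, theta0).
    dt is an illustrative step size."""
    A = np.eye(4)
    A[0, 2] = dt * np.cos(theta0)
    A[0, 3] = -dt * v0 * np.sin(theta0)
    A[1, 2] = dt * np.sin(theta0)
    A[1, 3] = dt * v0 * np.cos(theta0)
    B = np.zeros((4, 2))
    B[2, 0] = dt   # acceleration enters the velocity state
    B[3, 1] = dt   # yaw rate enters the heading state
    return A, B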
V. EXPERIMENTS

The high-level planner is paired with each of the two low-level planners discussed. We refer to our two hierarchical design variants as MCTS-RL and MCTS-LQNG.

A. Baseline Controllers

To measure the importance of our design innovations, we also consider three baseline controllers that resemble the methods developed in prior works.

1) End-to-End Multi-Agent Reinforcement Learning: The end-to-end MARL controller, referred to as "E2E," represents pure learning-based methods such as that of [15]. This controller has a similar reward/penalty structure as our low-level controller, but its observation structure is slightly different. Instead of observing the sequence of upcoming states as calculated by a high-level planner, E2E only receives the subsequence of locations from {c_i}_{i=1}^τ that denote the center of the track near the agent. As a result, it is fully up to its neural networks to learn how to plan strategic and safe moves.

2) Fixed Trajectory Linear-Quadratic Nash Game: The fixed trajectory LQNG controller, referred to as "Fixed-LQNG," uses the same LQNG low-level planner as our hierarchical variant, but it instead tracks a fixed trajectory around the track. This fixed trajectory is a racing line that is computed offline for a specific track using its geometry and parameters of the vehicle, as seen in prior works [9], [10]. However, the online tracking involves game-theoretic reasoning rather than the single-agent optimal control of the prior works.

3) Fixed Trajectory Multi-Agent Reinforcement Learning: The fixed trajectory MARL controller, referred to as "Fixed-RL," is a learning-based counterpart to Fixed-LQNG. Control inputs are computed using a deep RL policy trained to track precomputed checkpoints that are fixed prior to the race.

B. Experimental Setup

Our controllers are implemented¹ in the Unity Game Engine. Screenshots of the simulation environment are shown in Figure 3. We extend the Karting Microgame template [20] provided by Unity. The kart physics from the template is adapted to include cornering limitations and tire wear percentage. Tire wear is modeled as an exponential decay curve that is a function of the accumulated angular velocity endured by the kart. This model captures the concept of losing grip as the tire is subjected to increased lateral loads. Multi-agent support is also added to the provided template in order to race the various autonomous controllers against each other or human players. The high-level planners run at 1 Hz, and the low-level planners run at 50 Hz. Specifically, δ̄ is set to 0.06 s for the LQNG planner. The implementation of the learning-based agents utilizes the Unity ML-Agents library [21]. All of the learning-based control agents are trained using proximal policy optimization and self-play implementations from the library. They are also only trained on two sizes of oval-shaped tracks with the same number of training steps.

¹ https://www.github.com/ribsthakkar/HierarchicalKarting
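As an illustration of the tire model described above, the sketch below decays grip exponentially in the angular velocity accumulated by the kart; the decay constant and the exact mapping from grip to a wear percentage are assumptions, since the text only states that wear follows an exponential decay in accumulated angular velocity.

import math

def tire_grip(accumulated_yaw, k=1e-3, base_grip=1.0):
    """Grip multiplier after accumulating |angular velocity| over time.
    k is an illustrative decay constant."""
    return base_grip * math.exp(-k * accumulated_yaw)

def tire_wear_percent(accumulated_yaw, k=1e-3):
    """Tire wear percentage implied by the same exponential decay."""
    return 100.0 * (1.0 - math.exp(-k * accumulated_yaw))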
Fig. 3. Kart racing environment from a racer's perspective (left), a bird's eye view of the oval track (right-top), and the complex track (right-bottom) in the Unity environment. The purple boxes visualize the lanes across checkpoints along the track, and the highlighted green boxes show planned waypoints determined by the hierarchical controllers.

Our experiments include head-to-head racing on a basic oval track (which the learning-based agents were trained on) and a more complex track shown in Figure 3. Specifically, the complex track involves challenging track geometry with turns whose radii change along the curve, tight U-turns, and turns in both directions. To be successful, the optimal racing strategy requires some understanding of the shape of the track along a sequence of multiple turns. Every pair of controllers competes head-to-head in 50 races on both tracks. The dynamical parameters of each player's vehicle are identical, and the players start every race at the same initial checkpoint. The only difference in their initial states is the lane in which they start. In order to maintain fairness with respect to starting closer to the optimal racing line, we alternate the starting lanes between each race for the players.

C. Results

Our experiments primarily seek to identify the importance of hierarchical game-theoretic reasoning and the strength of MCTS as a high-level planner for racing games. We count the number of wins against each opponent, average collisions-at-fault per race, average illegal lane changes per race, and a safety score (a sum of the prior two metrics) for the controllers. We also provide a video² demonstrating them in action. Based on the results visualized in Figures 4 and 5, we conclude the following key points.

² https://www.youtube.com/playlist?list=PLEkfZ4KJSCcG4yGWD7K5ENGXW5nBPYiF1

1) The proposed hierarchical variants outperformed their respective baselines.

The results amongst MCTS-RL, Fixed-RL, and E2E show the effectiveness of our proposed hierarchical structure. While all three of the MARL-based agents were only trained on the oval track, the MCTS-RL agent was able to win the most head-to-head races while also maintaining the best safety score by better adapting its learning. Comparing the baselines against each other, Fixed-RL also has more wins and a better safety score than E2E across both tracks. This result indicates that some type of hierarchical structure is favorable. It suggests that a straightforward task of trajectory
tracking is much easier to learn for a deep neural network than having to learn both strategic planning and respect for the safety and fairness rules.

Fig. 4. Results from head-to-head racing on the oval track. [Bar charts: wins against each opponent; overall safety scores (average illegal lane changes plus average collisions-at-fault per race, lower is better).]

Fig. 5. Results from head-to-head racing on the complex track. [Same metrics as Figure 4.]

Next, we compare MCTS-LQNG and Fixed-LQNG. Although MCTS-LQNG has a worse overall safety score, it has 25% more wins when aggregated over both tracks. Fixed-LQNG has a similar number of overall wins on the oval track, but when the racetrack is more complicated, Fixed-LQNG quickly becomes inferior. The oval track has just one main racing line, but there are many reasonable racing lines in the complex track that must be considered to be competitive. MCTS-LQNG accounts for these trajectories by using the high-level MCTS planner and is, therefore, more successful in its races against MARL-based agents on the complex track, with four times the number of wins against them compared to the Fixed-LQNG agent. MCTS-LQNG considered trajectories that could result in overtakes when opponents made mistakes from any part of the track. On the other hand, Fixed-LQNG was forced to rely on opponents making mistakes that were not along its fixed trajectory to make overtakes. However, considering alternative lines also accounts for the main difference in their safety scores. Both have similar collisions-at-fault scores, but MCTS-LQNG has more illegal lane changes.

2) MARL is more successful and robust than LQNG as a low-level planner.

Overall, the MARL-based agents outperformed their LQNG-based counterparts in terms of both key metrics: wins and safety scores. However, this result is likely due to our simplifications involving a time-invariant linearization around the initial state of each agent, meaning the approximation is only valid for a very short time horizon. Therefore, the LQNG-based agents could only rely on braking/acceleration instead of yaw rate to avoid collisions. As a result, the weights in the objective of the LQNG formulation are set conservatively to emphasize avoiding collisions. This setup also implies that LQNG-based agents often concede in close battles and thereby lose races because of the high cost in the planning objective of driving near another player even if there is no collision.

While Fixed-LQNG has a better safety score than Fixed-RL, MCTS-RL has a significantly better safety score than MCTS-LQNG. Just in terms of collision avoidance, both RL-based agents have worse numbers because the LQNG-based agents are tuned to be conservative. However, MCTS-LQNG has significantly more illegal lane changes per race than MCTS-RL, while Fixed-LQNG has slightly fewer illegal lane changes per race than Fixed-RL. As discussed previously, the fixed trajectory agents do not consider alternative racing lines, so they rarely break the lane-changing limit rule in the first place. In the MCTS case, the high-level planner runs in parallel with the low-level planner and at a lower frequency. As a result, the calculated high-level plan uses slightly out-of-date information and does not account for the fact that the low-level controllers have already made choices that might contradict the initial steps in the plan. This mismatch causes the LQNG-based controller to more often break the lane-changing rules by swerving across the track to immediately follow the high-level plan when it is updated. MCTS-RL is more robust to this situation because it has those safety rules encoded in its reward structure, albeit with smaller weights. It does not track the waypoints exactly and learns to smooth the trajectory produced by the high-level plan with the live situation in the game.

3) MCTS-RL outperforms all other implemented controllers.

Aggregating the results from both tracks, MCTS-RL recorded a win rate of 83% over its 400 head-to-head races and the second-best safety score, only behind the conservatively tuned Fixed-LQNG agent. It combined the advantage of having a high-level planner that evaluates long-term plans and a low-level planner that is robust to the possibility that the high-level plans may be out of date. For example, Figure 6a demonstrates how the high-level planner provided a long-term strategy, guiding the agent to give up an advantage at present for a greater advantage in the future when overtaking. The RL-based low-level planner approximately follows the high-level strategy in case stochasticity of the MCTS algorithm yields a waypoint that seems out of place (e.g., the checkpoint between t = 3 and t = 4 in Figure 6a). Furthermore, MCTS-RL is also successful at executing defensive maneuvers, as seen in Figure 6b, due to those same properties of long-term planning and low-level robustness. Both of these tactics resemble strategies of expert human drivers in real head-to-head racing.
Fig. 6. (a) An overtaking maneuver executed by the MCTS-RL agent (green) against the E2E agent (blue) on the complex track. Notice how, from t = 0 to t = 2, the MCTS-RL agent gives up a slight advantage and takes a wider racing line on the first turn. However, the exit of this wide line places the MCTS-RL agent on the inside of the next two turns, where it gains an even greater advantage when passing the E2E agent from t = 3 to t = 6. The green boxes along each checkpoint also highlight the long-term plan calculated by the MCTS planner for this tactic. (b) A defensive maneuver executed by the MCTS-RL agent (green) against the E2E agent (blue) on the complex track. Before reaching the turn, the MCTS planner plans to switch lanes to the left first (t = 0 to t = 1) and then move to the right to take the inside of the turn. This motion forces the E2E agent to make an evasive move to avoid a collision and take an even wider turn, thus increasing the overall gap at the end. The green boxes along each checkpoint highlight the long-term plan calculated by the MCTS planner for this tactic.

VI. CONCLUSION

We developed a hierarchical controller for multi-agent autonomous racing that adheres to the safety and fairness rules found in real-life racing and outperforms other common control techniques such as purely optimization-based or purely learning-based control methods. Our high-level planner constructed long-term trajectories that abided by the introduced complex rules about collision avoidance and lane changes. As a result, we designed an objective for the low-level controllers that focuses on tracking the high-level plan, which is an easier problem to solve than the original racing-game formulation. Our methods outperformed the baselines both in terms of winning head-to-head races and in terms of a safety score measuring obedience to the rules of the game. Finally, they also exhibited maneuvers resembling those performed by expert human drivers.
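As a minimal illustration of this two-rate structure (not the implementation used in our experiments), the following Python sketch replans waypoints at a low rate and tracks the most recent plan at a higher rate. The names plan_waypoints, track_waypoint, and the kinematic step are hypothetical placeholders standing in for the MCTS planner and the MARL or LQNG low-level controllers.

    import math
    from dataclasses import dataclass
    from typing import List, Tuple

    Waypoint = Tuple[float, float]  # (x, y) target on the track

    @dataclass
    class State:
        x: float
        y: float
        heading: float
        speed: float

    def plan_waypoints(state: State, horizon: int) -> List[Waypoint]:
        # Placeholder for the high-level planner (e.g., MCTS over a discrete game).
        # Here it simply projects points straight ahead of the vehicle.
        return [(state.x + (i + 1) * state.speed * math.cos(state.heading),
                 state.y + (i + 1) * state.speed * math.sin(state.heading))
                for i in range(horizon)]

    def track_waypoint(state: State, wp: Waypoint) -> Tuple[float, float]:
        # Placeholder for the low-level controller (MARL policy or LQNG solve).
        # Here it is a crude proportional law on heading error and distance.
        dx, dy = wp[0] - state.x, wp[1] - state.y
        heading_error = math.atan2(dy, dx) - state.heading
        steer = max(-1.0, min(1.0, 2.0 * heading_error))
        throttle = max(0.0, min(1.0, 0.1 * math.hypot(dx, dy)))
        return steer, throttle

    def step(state: State, steer: float, throttle: float, dt: float = 0.1) -> State:
        # Trivial kinematic update, included only to make the sketch executable.
        speed = state.speed + throttle * dt
        heading = state.heading + steer * dt
        return State(state.x + speed * math.cos(heading) * dt,
                     state.y + speed * math.sin(heading) * dt,
                     heading, speed)

    state = State(x=0.0, y=0.0, heading=0.0, speed=5.0)
    plan: List[Waypoint] = []
    for t in range(50):
        if t % 10 == 0:                 # replan at a lower rate than control
            plan = plan_waypoints(state, horizon=5)
        steer, throttle = track_waypoint(state, plan[0])
        state = step(state, steer, throttle)

The key design point this sketch captures is the separation of rates and responsibilities: the planner only needs to produce a coarse sequence of targets, while the controller handles high-resolution inputs between replanning steps.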
Future work should introduce additional high-level and low-level planners and investigate policy-switching hierarchical controllers, in which we switch among various high- and low-level controllers depending on the state of the game. Lastly, our hierarchical control design can be extended to other multi-agent systems applications governed by complex rules, such as energy grid systems or air traffic control. Constructing a discrete high-level game allows the complex constraints, which often involve discrete components, to be encoded naturally and yields an approximate solution that can warm start a more precise low-level planner.
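To illustrate the warm-starting idea in the last sentence, the following Python sketch seeds a continuous refinement with a coarse plan from a discrete planner. It is only an illustration under assumed names and a toy smoothing objective, not part of the proposed method.

    import numpy as np
    from scipy.optimize import minimize

    # Coarse waypoints, e.g., produced by a discrete high-level planner.
    coarse_plan = np.array([[0.0, 0.0], [5.0, 1.0], [10.0, 1.5], [15.0, 0.5]])

    # Densify the coarse plan by interpolation to form the initial guess.
    s_coarse = np.linspace(0.0, 1.0, len(coarse_plan))
    s_fine = np.linspace(0.0, 1.0, 20)
    x0 = np.column_stack([np.interp(s_fine, s_coarse, coarse_plan[:, k]) for k in range(2)])

    def cost(flat_pts):
        # Trade smoothness (second differences) against staying near the plan.
        pts = flat_pts.reshape(-1, 2)
        smoothness = np.sum(np.diff(pts, n=2, axis=0) ** 2)
        fidelity = np.sum((pts - x0) ** 2)
        return smoothness + 0.1 * fidelity

    # Warm start: the continuous optimizer begins from the interpolated
    # discrete solution rather than an arbitrary trajectory.
    refined = minimize(cost, x0.ravel(), method="L-BFGS-B").x.reshape(-1, 2)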
REFERENCES
[1] Faisal, A., Kamruzzaman, M., Yigitcanlar, T. & Currie, G. Understanding autonomous vehicles. Journal of Transport and Land Use. 12, 45-72 (2019)
[2] Edelstein, S. (2019, June 9). Racing tech that migrated to your current car. Digital Trends. Retrieved February 17, 2022, from https://www.digitaltrends.com/cars/racing-tech-in-your-current-car/
[3] Wang, M., Wang, Z., Talbot, J., Gerdes, J. & Schwager, M. Game Theoretic Planning for Self-Driving Cars in Competitive Scenarios. Robotics: Science and Systems. (2019)
[4] Wang, M., Wang, Z., Talbot, J., Gerdes, J. & Schwager, M. Game-theoretic planning for self-driving cars in multivehicle competitive scenarios. IEEE Transactions on Robotics. 37, 1313-1325 (2021)
[5] Li, N., Goubault, E., Pautet, L. & Putot, S. Autonomous racecar control in head-to-head competition using mixed-integer quadratic programming. Institut Polytechnique de Paris, Tech. Rep. (2021)
[6] He, S., Zeng, J. & Sreenath, K. Autonomous Racing with Multiple Vehicles using a Parallelized Optimization with Safety Guarantee using Control Barrier Functions. arXiv preprint arXiv:2112.06435 (2021)
[7] Martin, T. (2020, October 22). The Guide to Road Racing, Part 8: Passing Etiquette. Winding Road for Racers. Retrieved February 17, 2022, from https://www.windingroad.com/articles/blogs/the-road-racers-guide-to-passing-etiquette/
[8] Hou, J. & Wang, T. The development of a simulated car racing controller based on Monte-Carlo tree search. 2016 Conference on Technologies and Applications of Artificial Intelligence (TAAI). 104-109 (2016)
[9] Vázquez, J., Brühlmeier, M., Liniger, A., Rupenyan, A. & Lygeros, J. Optimization-based hierarchical motion planning for autonomous racing. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). (2020)
[10] Stahl, T., Wischnewski, A., Betz, J. & Lienkamp, M. Multilayer graph-based trajectory planning for race vehicles in dynamic scenarios. 2019 IEEE Intelligent Transportation Systems Conference (ITSC). (2019)
[11] Kabzan, J., Hewing, L., Liniger, A. & Zeilinger, M. Learning-Based Model Predictive Control for Autonomous Racing. IEEE Robotics and Automation Letters. 4, 3363-3370 (2019)
[12] Remonda, A., Krebs, S., Veas, E., Luzhnica, G. & Kern, R. Formula RL: Deep reinforcement learning for autonomous racing using telemetry data. arXiv preprint arXiv:2104.11106 (2021)
[13] Fakhry, A. Applying a Deep Q Network for OpenAI's Car Racing Game. (2020)
[14] Spica, R., Cristofalo, E., Wang, Z., Montijano, E. & Schwager, M. A real-time game theoretic planner for autonomous two-player drone racing. IEEE Transactions on Robotics. 36, 1389-1403 (2020)
[15] Wurman, P., Barrett, S., Kawamoto, K., MacGlashan, J., Subramanian, K., Walsh, T., Capobianco, R., Devlic, A., Eckert, F., Fuchs, F. & Others. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature. 602, 223-228 (2022)
[16] Liniger, A. Path Planning and Control for Autonomous Racing. Thesis (2018)
[17] Fisac, J., Bronstein, E., Stefansson, E., Sadigh, D., Sastry, S. & Dragan, A. Hierarchical game-theoretic planning for autonomous vehicles. 2019 International Conference on Robotics and Automation (ICRA). pp. 9590-9596 (2019)
[18] Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. International Conference on Computers and Games. (2006)
[19] Basar, T. & Olsder, G. Dynamic Noncooperative Game Theory: Second Edition. (SIAM, 1999)
[20] Unity Technologies. Karting Microgame Template. (2021), https://assetstore.unity.com/packages/templates/karting-microgame-150956
[21] Juliani, A., Berges, V., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y., Henry, H., Mattar, M. & Others. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627 (2018)