Grouptron: Dynamic Multi-Scale Graph Convolutional Networks for Group-Aware Dense Crowd Trajectory Forecasting

Rui Zhou1, Hongyu Zhou2, Huidong Gao3, Masayoshi Tomizuka3, Jiachen Li4,*, and Zhuo Xu3,*

arXiv:2109.14128v3 [cs.CV] 4 Mar 2022

Abstract— Accurate, long-term forecasting of pedestrian trajectories in highly dynamic and interactive scenes is a long-standing challenge. Recent advances in data-driven approaches have achieved significant improvements in prediction accuracy. However, the lack of group-aware analysis has limited the performance of forecasting models. This is especially nonnegligible in highly crowded scenes, where pedestrians move in groups and the interactions between groups are extremely complex and dynamic. In this paper, we present Grouptron, a multi-scale dynamic forecasting framework that leverages pedestrian group detection and utilizes individual-level, group-level, and scene-level information for better understanding and representation of the scenes. Our approach employs spatio-temporal clustering algorithms to identify pedestrian groups, creates spatio-temporal graphs at the individual, group, and scene levels, and then uses graph neural networks to encode dynamics at different scales and aggregate the embeddings for trajectory prediction. We conducted extensive comparisons and ablation experiments to demonstrate the effectiveness of our approach. Our method achieves a 9.3% decrease in final displacement error (FDE) compared with state-of-the-art methods on the ETH/UCY benchmark datasets, and a 16.1% decrease in FDE in more crowded scenes where extensive human group interactions are more frequently present.

[Fig. 1. Grouptron models the pedestrian behaviors in a multi-scale fashion and constructs spatio-temporal graphs for different scales.]

I. INTRODUCTION

Generating long-term and accurate predictions of human pedestrian trajectories is of enormous significance for developing reliable autonomous systems (e.g., autonomous vehicles, mobile robots), which are often required to safely navigate crowded scenarios with many human pedestrians present, such as crowded open traffic or warehouses. Therefore, the capability of understanding and predicting dense crowd behavior is instrumental for avoiding collisions with highly dynamic crowds and behaving in a socially-aware way.

Through daily interactions, humans develop the remarkable capability of understanding crowded, interactive environments and inferring the potential future movements of all other traffic participants. Similarly, recent advances in machine learning that leverage massive human interaction data have achieved state-of-the-art performance on trajectory prediction [1], [2], [3], [4], [5], [6]. However, most state-of-the-art methods fail to consider densely populated scenarios, which are extremely challenging due to the immensely dynamic and complex interactions. Moreover, in such highly crowded scenarios, groups of pedestrians are common: [7] estimates that 50% to 70% of pedestrians walk in groups, which exhibit dramatically distinct behaviors from individuals. Existing methods lack a mechanism for representing such group dynamics, which limits their performance in highly crowded situations.

In this paper, we seek to resolve this challenge by leveraging group detection and dynamic spatio-temporal graph representations at different scales, as visualized in Fig. 1. Concretely, we construct a dynamic multi-scale graph convolutional neural network that uses pedestrian group information to better learn and model the extremely complex dynamics at different scales and between different scales.

The contributions of this paper are summarized as follows:
• We propose to leverage spatio-temporal clustering algorithms to detect pedestrians in groups.
• We design a hierarchical prediction framework using spatio-temporal graph neural networks which encode the scene at three levels:
  1) The individual level encodes the historical trajectory of the predicted pedestrian.
  2) The group level encodes the dynamics and trajectory information within each pedestrian group.
  3) The scene level encodes the dynamics and interactions between pedestrian groups.
• The proposed multi-scale spatio-temporal architecture outperforms existing methods by 9.3% in terms of final displacement error. In particular, for densely crowded scenarios, the performance improvement can be as significant as 16.1%.

* These authors contributed equally to this paper.
1 Rui Zhou is with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720 USA. ruizhouzr@berkeley.edu
2 Hongyu Zhou is with the Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI 48109 USA. zhouhy@umich.edu
3 Huidong Gao, Masayoshi Tomizuka, and Zhuo Xu are with the Department of Mechanical Engineering, University of California, Berkeley, CA 94720 USA. {hgao9, tomizuka, zhuoxu}@berkeley.edu
4 Jiachen Li is with the Department of Aeronautics & Astronautics, Stanford University, CA 94305 USA. jiachen_li@stanford.edu
II. RELATED WORK

A. Human Trajectory Forecasting

One of the pioneering works of human trajectory forecasting is the Social Force model [8], which applies Newtonian forces to model human motion. Similar methods with strong priors have also been proposed [9]; yet, most of them rely on hand-crafted energy potentials, such as relative distances and rules, to model human motion.

Recently, machine learning methods have been applied to the problem of human trajectory forecasting to obtain models with better performance. One line of work formulates the problem as deterministic time-series regression and solves it using, e.g., Gaussian process regression (GPR) [10], inverse reinforcement learning (IRL) [11], or recurrent neural networks (RNNs) [1], [12], [13].

However, the issue with these deterministic regressors is that human behavior is rarely deterministic or unimodal. Hence, generative approaches have become the state-of-the-art trajectory forecasting methods, due to recent advancements in deep generative models [14], [15] and their ability to generate distributions of potential future trajectories (instead of a single future trajectory). Most of these methods use a recurrent neural network architecture with a latent variable model, such as a conditional variational auto-encoder (CVAE) [2], [16], [17], [18] or a generative adversarial network (GAN) [19], [20], [21], [22], [23], to encode multi-modality. Compared to previous work, we consider multi-modality not only from the perspective of a single agent but also at the group level, taking into account the phenomenon that people usually move in groups. We show that our group-aware prediction achieves a better understanding of the scenes and better forecasting performance.

B. Graph Convolutional Networks

Of the methods mentioned above, RNN-based methods have achieved better performance. However, recurrent architectures are parameter-inefficient and expensive to train [24]. Besides, to handle spatial context, RNN-based methods need additional structures. Most of them use graph models to encode neighboring pedestrians' information, since the topology of graphs is a natural way to represent interactions between pedestrians. The graph convolutional network (GCN) introduced in [25] is better suited to dealing with non-Euclidean data. Social-BiGAT [20] introduces a graph attention network [26] to model social interactions. GraphSAGE [27] aggregates nodes and fuses adjacent nodes in different orders to extract node embeddings. To capture both spatial and temporal information, the Spatio-Temporal Graph Convolutional Network (STGCN) extends the spatial GCN to a spatio-temporal GCN for skeleton-based action recognition [28]. STGCN is adapted by Social-STGCNN [29] for trajectory forecasting, where trajectories are modeled by graphs with edges representing social interactions, weighted by the distances between pedestrians. A development related to our paper is the dynamic multi-scale GNN (DMGNN) [30], which proposes a multi-scale graph to model human body relations and extract features at multiple scales for motion prediction. There are two kinds of sub-graphs in the multi-scale graph: (i) single-scale graphs, which connect body components at the same scale, and (ii) cross-scale graphs, which form cross-scale connections among body components. Based on the multi-scale graphs, a multi-scale graph computational unit is proposed to extract and fuse features across multiple scales. Motivated by this work, we adopt the multi-scale graph strategy for dense crowd forecasting, with scene-level graphs, group-level graphs, and individual-level graphs.

C. Group-aware Prediction

Moving in groups (such as with friends or family members) is a common phenomenon, and people in each group tend to exhibit similar motion patterns. Motivated by this phenomenon, group-aware methods [31] consider the possibility that human agents in groups or formations have more correlated motions than independent agents; they can therefore also model the reactions of agents to moving groups. Human agents can be assigned to different groups by clustering trajectories with similar motion patterns using methods such as k-means clustering [32], support vector clustering [33], coherent filtering [34], and spectral clustering [35].
III. GROUPTRON

A. Problem Formulation

We aim to predict the trajectory distributions of a time-varying number $N(t)$ of pedestrians $P_{N(t)}$. At time step $t$, given the 2-D position $s \in \mathbb{R}^2$ of each pedestrian and their previous trajectories over $T$ time steps, $T \in \mathbb{R}^{T \times 2}$, our goal is to predict the distributions of their trajectories over the next $F$ future time steps, $y = T^{1,\dots,N(t)}_{t:t+F} \in \mathbb{R}^{N(t) \times F \times 2}$. We denote the distributions as $p(y \mid x)$. The performance of the predictions is evaluated by standard distance-based metrics on benchmark datasets: the average displacement error (ADE) and the final displacement error (FDE).

B. Model Overview

Rooted in the CVAE architecture of Trajectron++ [2], we design a more expressive multi-scale scene encoding structure that actively takes group-level and scene-level information into consideration for a better representation of crowded scenes where groups of pedestrians are present. Concretely, we leverage spatio-temporal graphs at each level to model information and interactions at the corresponding level. We refer to our model as Grouptron. Our model is illustrated in Fig. 2. In this subsection, we provide an overview of the architecture, and in Section III-C, we elaborate on the details of the Grouptron model.

[Fig. 2. The diagram of the Grouptron model, which creates spatio-temporal graphs for different levels and encodes these graphs using different networks. Combining encodings for the individual level, the group level, and the scene level, Grouptron then uses a GRU decoder to output the predictions.]

At the individual level, we construct spatio-temporal graphs for individual pedestrians. Each graph is centered at the node whose trajectory we want to predict, which we call the "current node". Long short-term memory (LSTM) networks [36] are used to encode this graph. We group the pedestrians with an agglomerative clustering algorithm based on Hausdorff distances [35], and an STGCN is used to encode the dynamics within each group.

At the scene level, spatio-temporal graphs are created to model the dynamics among pedestrian groups and are encoded using a different STGCN. Lastly, the information across the different scales is combined, and a decoder is used to obtain trajectory predictions; the model can output either the most likely trajectory or the predicted trajectory distributions.
C. Multi-Scale Scene Encoder

1) Individual-Level Encoder: The first level of encoding is for the individual pedestrians. We represent information at the individual level using a spatio-temporal graph for the current node. The nodes include the current node, all other nodes that are in the perception range of the current node, and nodes whose perception range covers the current node. The node states are the trajectories of the nodes. The edges are directional, and there is an edge $e_{i,j}$ if pedestrian $i$ is in the perception range of pedestrian $j$. To encode the current node's trajectory history, we use an LSTM network with hidden dimension 32. To encode the edges, we first perform an element-wise sum on the states of all neighboring nodes to form a single vector that represents all neighbors of the current node. This vector is then fed into the edge LSTM network, an LSTM with hidden dimension 8. In this way, we obtain two vectors: one encoding the trajectory history of the current node and one encoding the representation of all the neighbors of the current node.
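As a concrete illustration of this design, below is a minimal PyTorch sketch of the two encoders; the class name, tensor layout, and the assumption that node states are 2-D positions are ours rather than the paper's.

```python
import torch
import torch.nn as nn

class IndividualLevelEncoder(nn.Module):
    """Sketch of the individual-level encoder: one LSTM for the current
    node's trajectory history (hidden size 32) and one LSTM over the
    element-wise sum of neighbor states (hidden size 8)."""

    def __init__(self, state_dim: int = 2):
        super().__init__()
        self.history_lstm = nn.LSTM(state_dim, 32, batch_first=True)
        self.edge_lstm = nn.LSTM(state_dim, 8, batch_first=True)

    def forward(self, history: torch.Tensor, neighbors: torch.Tensor):
        # history:   (batch, T, state_dim) trajectory of the current node
        # neighbors: (batch, num_neighbors, T, state_dim) neighbor trajectories
        _, (h_hist, _) = self.history_lstm(history)
        neighbor_sum = neighbors.sum(dim=1)      # element-wise sum over neighbors
        _, (h_edge, _) = self.edge_lstm(neighbor_sum)
        return h_hist[-1], h_edge[-1]            # (batch, 32), (batch, 8)

# usage sketch: encoder = IndividualLevelEncoder()
# e_his, e_edge = encoder(torch.randn(4, 8, 2), torch.randn(4, 5, 8, 2))
```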
2) Pedestrian Group Clustering: To cluster nodes into groups based on their trajectories, we propose to leverage the agglomerative clustering algorithm [35], which uses similarity scores based on Hausdorff distances between trajectories. The number of clusters (groups) to create for each scene is determined by

    $C(N) = (N + 1)/2$,    (1)

where $C$ is the number of clusters and $N$ is the total number of nodes to be clustered.

Furthermore, we only include nodes with an edge to or from the current node. This is because we only want to include nodes that can potentially influence the current node and to avoid unhelpful information from nodes that are too far away from it.
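For illustration, here is a hedged sketch of this clustering step using SciPy's directed Hausdorff distance and scikit-learn's agglomerative clustering; the symmetrization of the Hausdorff distance, the integer rounding of Eq. (1), and the choice of "average" linkage are our assumptions (older scikit-learn versions spell the precomputed-distance argument `affinity` instead of `metric`).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.cluster import AgglomerativeClustering

def cluster_pedestrians(trajectories: np.ndarray) -> np.ndarray:
    """Group pedestrians by trajectory similarity.
    trajectories: (N, T, 2) array of 2-D trajectories over T time steps.
    Returns a group id per pedestrian."""
    n = len(trajectories)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # symmetric Hausdorff distance between the two point sets
            d = max(directed_hausdorff(trajectories[i], trajectories[j])[0],
                    directed_hausdorff(trajectories[j], trajectories[i])[0])
            dist[i, j] = dist[j, i] = d
    n_clusters = (n + 1) // 2          # Eq. (1), rounded down to an integer
    clustering = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average")
    return clustering.fit_predict(dist)
```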
3) Group-Level Encoder: For each group, we create a spatio-temporal graph $G_{g,t} = (V_{g,t}, E_{g,t})$, where $t$ is the time step and $g$ is the group id. $V_{g,t} = \{v_{i,t} \mid \forall i \in \{1, \dots, N_g\}\}$ is the set of all nodes in group $g$, and the node states are the trajectories of the represented pedestrians. $E_{g,t} = \{e^{i,j}_t \mid \forall i, j \in \{1, \dots, N_g\}\}$ is the set of edges between the nodes in the current group, with every $e^{i,j}_t = 1$ to allow maximum interaction modeling within pedestrian groups.

After forming the aforementioned graph for each group, the graphs are passed to the group-level trajectory encoder to obtain the encoded vectors for the nodes in each group. The group-level trajectory encoder is the STGCN proposed in [28] and used in [29]. We set the convolution filter size to 3 and use the same weight parameters for all the groups.

We then average the encoded vectors of all nodes in each group to obtain the representation of the corresponding group. That is, $E_g = \frac{1}{N_g} \sum_{i=1}^{N_g} E_i$, where $E_g$ is the encoded vector for group $g$, $E_i$ is the encoded vector for node $i$ in the output of the group-level trajectory encoder, and $N_g$ is the number of nodes in group $g$.
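This averaging step amounts to a scatter-mean over group ids; a small sketch, assuming the STGCN output has already been flattened to one embedding per node:

```python
import torch

def pool_group_embeddings(node_emb: torch.Tensor,
                          group_ids: torch.Tensor,
                          num_groups: int) -> torch.Tensor:
    """Per-group averaging E_g = (1/N_g) * sum_i E_i.
    node_emb:  (N, D) encoded vectors from the group-level encoder
    group_ids: (N,) long tensor with each node's group id in [0, num_groups)"""
    d = node_emb.size(1)
    sums = torch.zeros(num_groups, d).index_add_(0, group_ids, node_emb)
    counts = torch.zeros(num_groups).index_add_(
        0, group_ids, torch.ones(len(group_ids)))
    return sums / counts.clamp(min=1).unsqueeze(1)   # (num_groups, D)
```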
4) Scene-Level Encoder: After obtaining the encoded vectors for each group, a scene-level spatio-temporal graph with nodes representing groups is created: $G_{scene,t} = (V_{scene,t}, E_{scene,t})$, where $V_{scene,t} = \{v_{g,t} \mid \forall g \in \{1, \dots, G\}\}$, $G$ is the total number of groups, and $t$ is the time step. The state of each node is $E_g$ from the group-level trajectory encoder. $E_{scene,t} = \{e^{i,j}_t \mid \forall i, j \in \{1, \dots, G\}\}$ is the set of edges between the groups in the scene; each $e^{i,j}_t$ is set to 1 to allow maximum message passing between group nodes.

We then select the encoded vector corresponding to the last time step and to the group id of the current node as the scene-level encoding: $E_{scene} = E_{g,T}$, where $g$ is the group id of the current node we are encoding for and $T$ is the total number of time steps.

5) Multi-Scale Encoder Output: The output of the multi-scale scene encoder is the concatenation of the encoded vectors from the three levels: the output of the node history encoder, the output of the individual-level edge encoder, and the output of the scene-level encoder. That is, $E_{multi} = [E_{his}; E_{edge}; E_{scene}]$, where $E_{his}$ is the encoded vector for the current node's trajectory history, $E_{edge}$ is the vector representing the individual-level neighbors, and $E_{scene}$ is the encoded vector from the scene-level encoder.
D. Decoder Network

Together with the latent variable $z$, $E_{multi}$ is passed to the decoder, a gated recurrent unit (GRU) [37] with 128 dimensions. The output of the GRU decoder is then fed into dynamics integration modules as control actions to output either the predicted trajectory distributions or the single most likely trajectory, depending on the task.
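A minimal sketch of this decoding step, assuming a `GRUCell` unrolled over the prediction horizon; the linear head mapping hidden states to 2-D control actions is a hypothetical stand-in, and the dynamics integration module itself is not reproduced:

```python
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    """Sketch: concatenate [E_his; E_edge; E_scene] with z, then unroll a
    GRU with hidden size 128 and emit one control action per future step."""

    def __init__(self, enc_dim: int, z_dim: int, horizon: int = 12):
        super().__init__()
        self.horizon = horizon
        self.gru = nn.GRUCell(enc_dim + z_dim, 128)
        self.to_action = nn.Linear(128, 2)   # hypothetical output head

    def forward(self, e_his, e_edge, e_scene, z):
        e_multi = torch.cat([e_his, e_edge, e_scene], dim=-1)
        x = torch.cat([e_multi, z], dim=-1)
        h = torch.zeros(x.size(0), 128, device=x.device)
        actions = []
        for _ in range(self.horizon):
            h = self.gru(x, h)
            actions.append(self.to_action(h))
        return torch.stack(actions, dim=1)   # (batch, horizon, 2)

# usage sketch: dec = GRUDecoder(enc_dim=32 + 8 + 16, z_dim=32)  # dims illustrative
```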
E. Loss Functions

We adopt the following objective function for the overall CVAE model:

    $\max_{\phi,\theta,\psi} \sum_{i=1}^{N} \mathbb{E}_{z \sim q_\phi(\cdot \mid x_i, y_i)} [\log p_\psi(y_i \mid x_i, z)] - \beta D_{KL}(q_\phi(z \mid x_i, y_i) \,\|\, p_\theta(z \mid x_i)) + \alpha I_q(x; z)$,    (2)

where $I_q$ is the mutual information between $x$ and $z$ under the distribution $q_\phi(x, z)$. We follow the process given in [38] to compute $I_q$: we approximate $q_\phi(z \mid x_i, y_i)$ with $p_\theta(z \mid x_i)$ and obtain the unconditioned latent distribution by summing out $x_i$ over the batch.
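A hedged sketch of this objective for a discrete latent variable, as in the Trajectron++-style CVAE; the function signature and the batch-marginal estimate of the mutual-information term are our illustrative assumptions:

```python
import torch
import torch.distributions as D

def cvae_objective(log_py, q_dist, p_dist, alpha=1.0, beta=1.0):
    """Eq. (2) objective (to be maximized) for a discrete latent z.
    log_py: (B,) Monte-Carlo estimate of log p_psi(y|x,z), z ~ q_phi(.|x,y)
    q_dist, p_dist: batched Categorical posterior / prior over z."""
    kl = D.kl_divergence(q_dist, p_dist)                    # (B,)
    # Unconditioned latent: average the prior over the batch (the text's
    # "summing out x_i"), then estimate I_q(x; z) as KL to that marginal,
    # using p_theta(z|x) in place of q_phi(z|x,y) as described above.
    marginal = D.Categorical(probs=p_dist.probs.mean(dim=0, keepdim=True))
    mi = D.kl_divergence(p_dist, marginal)                  # (B,)
    return (log_py - beta * kl + alpha * mi).mean()
```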
IV. EXPERIMENTS

A. Datasets

The model is trained on two publicly available benchmark datasets: ETH [39], with subsets named ETH and HOTEL, and UCY [40], with subsets named ZARA1, ZARA2, and UNIV. The trajectories are sampled at 0.4-second intervals. The model observes 8 time steps, corresponding to 3.2 seconds, and predicts the next 12 time steps, corresponding to 4.8 seconds.

To further evaluate and demonstrate Grouptron's performance in densely populated scenarios, we create UNIV-N test sets, where N is the minimum number of people present simultaneously at each time step. Each UNIV-N test set contains all time steps from the original UNIV test set that have at least N people in the scene simultaneously. In this way, we created the test sets UNIV-40, UNIV-45, and UNIV-50, in which, at each time step, there are at least 40, 45, and 50 people in the scene at the same time, respectively. These test sets are far more challenging than the original UNIV test set because of the more complex and dynamic interactions at different scales. We train the models on the original UNIV training set and evaluate them on the UNIV-40, UNIV-45, and UNIV-50 test sets.
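Constructing a UNIV-N test set is a simple filter over time steps; a sketch under an assumed data layout:

```python
def build_univ_n(positions_by_t: dict, n_min: int) -> list:
    """Keep every time step of the original UNIV test set in which at
    least n_min pedestrians are present simultaneously. positions_by_t
    maps a time step to the (num_peds, 2) array of positions observed
    at that step (assumed format)."""
    return [t for t, pos in sorted(positions_by_t.items())
            if len(pos) >= n_min]

# usage sketch: univ_45 = build_univ_n(univ_test_positions, 45)
```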

B. Evaluation Metrics

As in prior work [1], [2], [41], we use the following metrics to evaluate our model:

1) Final Displacement Error:

    $FDE = \frac{\sum_{i \in N} \|\hat{T}^i_{T_F} - T^i_{T_F}\|_2}{N}$,    (3)

which is the $\ell_2$ distance between the predicted final position and the ground-truth final position at the prediction horizon $T_F$.

2) Average Displacement Error:

    $ADE = \frac{\sum_{i \in N} \sum_{t \in T_F} \|\hat{T}^i_t - T^i_t\|_2}{N \times T_F}$,    (4)

where $N$ is the total number of pedestrians, $T_F$ is the number of future time steps we predict, and $\hat{T}^i_t$ is the predicted position for pedestrian $i$ at time step $t$. ADE is the mean $\ell_2$ distance between the ground-truth and predicted trajectories.
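A direct NumPy implementation of Eqs. (3) and (4) for a single predicted sample; the function name and array layout are our assumptions:

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """pred, gt: (N, T_F, 2) predicted and ground-truth positions for
    N pedestrians over T_F future time steps."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # (N, T_F) l2 errors
    ade = dists.mean()                            # mean over pedestrians and steps
    fde = dists[:, -1].mean()                     # final-step error, averaged
    return ade, fde
```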
C. Experiment Settings

The Grouptron model is implemented in PyTorch. The model is trained on an Intel i7 CPU and NVIDIA GTX 1080 Ti GPUs for 100 epochs with a batch size of 256. For the HOTEL, UNIV, ZARA1, and ZARA2 datasets, the output dimension of the group-level and scene-level encoders is 16. For the ETH dataset, we set the output dimension of the group-level and scene-level encoders to 8. This is because the ETH test set contains only 2 time steps with at least 5 people in the scene, out of 1161 time steps in total, while the training set contains 1910 such time steps out of 4976 in total. Thus, to help the model learn generalizable representations in this case, we decrease the output dimension of the STGCNs to 8. The learning rate is set to 0.001 initially and decayed exponentially every epoch with a decay rate of 0.9999. The model is trained using the Adam optimizer, and gradients are clipped at 1.0.
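For concreteness, a hypothetical PyTorch training-loop fragment matching these settings; `model`, `model.objective`, and `loader` are placeholders, and norm-based clipping is assumed where the paper only states that gradients are clipped at 1.0:

```python
import torch

def train(model, loader, epochs=100):
    """Adam with initial lr 1e-3, exponential decay (gamma 0.9999) once
    per epoch, and gradient clipping at 1.0; model.objective is assumed
    to return the Eq. (2) objective for a batch."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9999)
    for _ in range(epochs):
        for batch in loader:                  # batch size 256 in the paper
            loss = -model.objective(batch)    # maximize Eq. (2)
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
        sched.step()                          # decay once per epoch
```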
D. Evaluation of the Group Clustering Algorithm

[Fig. 3. An example of pedestrian group clustering. Pedestrians are divided into five groups. Different colors indicate different groups.]

Fig. 3 shows that the groups created by the agglomerative clustering method are very close to the natural definition of pedestrian groups. We can see that pedestrians 5 and 6 are travelling in a highly correlated fashion and that the trajectories of pedestrians 1, 2, 3, and 4 are highly similar as well. In both cases, the clustering algorithm correctly clusters these pedestrians into their corresponding groups. This shows that, by using agglomerative clustering based on Hausdorff distances, Grouptron is able to successfully generate naturally defined groups.

To quantitatively evaluate the groups generated by the algorithm, we selected 10 random time steps from the ETH training dataset. We invited 10 human volunteers to label groups for these time steps and used the agglomerative clustering method described in Section III-C to generate group clusters. For both the human-generated and the algorithm-generated groups, the number of groups to be formed is computed using Equation 1. We then compute the average Sørensen–Dice coefficient between the human-generated and algorithm-generated groups. That is, we use

    $DSC = \frac{1}{T} \sum_{t \in T} \frac{1}{H} \sum_{h \in H} \frac{2 \times |G_{h,t} \cap G_{a,t}|}{|G_{h,t}| + |G_{a,t}|}$,    (5)

where $T$ is the number of time steps, $H$ is the total number of human annotators, $G_{a,t}$ is the grouping created by the agglomerative clustering method for time step $t$, $G_{h,t}$ is the grouping created by annotator $h$ for time step $t$, and $|G_{h,t} \cap G_{a,t}|$ counts how many of the human and algorithm groups are exactly the same. The higher the Dice coefficient, the more similar the groups created by the agglomerative clustering method are to those created by the human annotators. The average Dice coefficient between the human annotators and the algorithm is 0.72, indicating that the groups output by the agglomerative clustering method closely match human-generated ones.
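The per-annotator, per-time-step Dice term of Eq. (5) can be sketched as follows, representing each grouping as a collection of frozensets of pedestrian ids (our assumed encoding of groups that are "exactly the same"):

```python
def dice_coefficient(human_groups: list, algo_groups: list) -> float:
    """Each argument is a list of groups, where a group is a frozenset of
    pedestrian ids; the intersection counts groups that match exactly."""
    exact_matches = len(set(human_groups) & set(algo_groups))
    return 2.0 * exact_matches / (len(human_groups) + len(algo_groups))

# usage sketch: dice_coefficient([frozenset({1, 2}), frozenset({3})],
#                                [frozenset({1, 2}), frozenset({3})])  # -> 1.0
```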
E. Quantitative Results

We compare Grouptron's performance with state-of-the-art methods and common baselines in terms of the FDE and ADE metrics; the results are shown in Table I. Overall, Grouptron outperforms all state-of-the-art methods with a considerable decrease in displacement errors. Since Grouptron is built on Trajectron++, we also compare the FDE and ADE values of Grouptron and Trajectron++ and find that Grouptron outperforms Trajectron++ on all 5 datasets by considerable margins. In particular, on the ETH dataset, Grouptron achieves an FDE of 1.56m, which is 7.1% better than the FDE of 1.68m achieved by Trajectron++. Furthermore, Grouptron achieves an FDE of 0.97m on the UNIV dataset, a 9.3% reduction compared with Trajectron++'s FDE of 1.07m on the same dataset.

TABLE I
FDE/ADE (m) values for baseline methods and Grouptron on the ETH and UCY datasets. Lower is better.

Method          ETH        HOTEL      UNIV       ZARA1      ZARA2      AVG
Social-LSTM     2.35/1.09  1.76/0.79  1.40/0.67  1.00/0.47  1.17/0.56  1.54/0.72
Social-GAN      1.52/0.81  1.61/0.72  1.26/0.60  0.69/0.34  1.84/0.42  1.18/0.58
SoPhie          1.43/0.70  1.67/0.76  1.24/0.54  0.63/0.34  0.78/0.38  1.15/0.54
Trajectron++    1.68/0.71  0.46/0.22  1.07/0.41  0.77/0.30  0.59/0.23  0.91/0.37
Grouptron       1.56/0.70  0.46/0.21  0.97/0.38  0.76/0.30  0.56/0.22  0.86/0.36

Moreover, we compare Grouptron's performance with Trajectron++'s in dense crowds on the UNIV-N test sets in Table II. Overall, Grouptron outperforms Trajectron++ on all the UNIV-N test sets by large margins. In particular, Grouptron achieves an FDE of 1.04m and an ADE of 0.40m on the UNIV-45 test set, which contains all time steps from the original UNIV test set with at least 45 pedestrians in the scene at the same time. This is a 16.1% improvement in FDE and a 13.0% improvement in ADE compared with Trajectron++ on the same test set.

TABLE II
FDE/ADE (m) values for baseline methods and Grouptron on the UNIV-N test sets. Lower is better.

Method          UNIV       UNIV-40     UNIV-45    UNIV-50
Trajectron++    1.07/0.41  1.17/0.436  1.24/0.46  1.25/0.47
Grouptron       0.97/0.38  1.00/0.39   1.04/0.40  1.07/0.42

Furthermore, we notice that the state-of-the-art method, Trajectron++, performs substantially worse as the number of pedestrians in the scene increases: its FDE increases from 1.07m to 1.25m as the minimum number of pedestrians in the scene increases from 1 to 50. In contrast, Grouptron's FDE remains relatively stable as the number of pedestrians increases. This shows that Grouptron performs much better and is more robust in densely populated scenarios.
F. Qualitative Analysis

[Fig. 4. Examples of Grouptron's predictions on the UNIV dataset. The predictions are the most likely trajectory predictions of the model. Green arrows indicate pedestrians' current positions and directions. Black dashed lines indicate trajectory histories. Grey dashed lines indicate ground-truth future trajectories. Green lines are Grouptron's predicted future trajectories.]

Fig. 4 shows Grouptron's most likely predictions for several examples from the UNIV dataset. Fig. 4a shows two pedestrian groups crossing paths. We can see that Grouptron's predictions are consistent with the groups; furthermore, it accurately predicts when and where the two groups' trajectories intersect. Fig. 4b shows a case where pedestrians are forming groups and merging paths, and Grouptron again successfully predicts the formation of this group. Fig. 4c and Fig. 4d show Grouptron's performance in densely populated scenes with more than 40 pedestrians. Even in these scenarios, which are extremely challenging for state-of-the-art methods, Grouptron still produces high-quality predictions that are consistent with the pedestrian groups. Furthermore, even when pedestrian groups are crossing paths or influencing each other, Grouptron successfully predicts these highly dynamic and complex interactions.

[Fig. 5. Comparisons of Grouptron's and Trajectron++'s distributions of the 20 most likely future trajectories on examples from the UNIV dataset. Rows indicate different examples and columns represent different methods. Orange stars indicate pedestrians of interest. Yellow stars indicate their companions in the same pedestrian groups. Green arrows indicate pedestrians' current positions and directions. Black dashed lines indicate trajectory histories. Grey dashed lines indicate ground-truth future trajectories. The comparisons show that Grouptron produces predictions with higher quality and better confidence levels.]

In Fig. 5, we compare Grouptron's distributions of the 20 most likely predictions with those of Trajectron++. Comparing Fig. 5a with 5b and Fig. 5c with 5d shows that Grouptron's predictions for the pedestrians of interest reflect the interactions within pedestrian groups more accurately. Furthermore, Grouptron's prediction distributions have much smaller ranges, indicating that it is much more confident in its prediction outcomes.

V. CONCLUSIONS

In this paper, we present Grouptron, a multi-scale graph neural network for pedestrian trajectory forecasting. Its spatio-temporal graphs model complex and highly dynamic pedestrian interactions at three scales: the individual level, the group level, and the scene level. Grouptron detects pedestrian groups using agglomerative clustering based on Hausdorff distances, uses GCNs and LSTM-based networks to encode information at different scales, and combines the embeddings across scales. With these novel designs, Grouptron achieves high performance even in situations that are challenging for state-of-the-art methods. Through extensive experiments, we show that Grouptron outperforms state-of-the-art methods on the ETH and UCY datasets, which are standard benchmarks in the field. Furthermore, through qualitative comparisons and experiments on the densely populated UNIV-N test sets, we show that Grouptron outperforms state-of-the-art methods by even larger margins and is robust to changes in the number of pedestrians in the scene.
REFERENCES

[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
[2] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, "Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII. Springer, 2020, pp. 683–700.
[3] J. Li, F. Yang, M. Tomizuka, and C. Choi, "EvolveGraph: Multi-agent trajectory prediction with dynamic relational reasoning," Advances in Neural Information Processing Systems, vol. 33, pp. 19783–19794, 2020.
[4] J. Li, F. Yang, H. Ma, S. Malla, M. Tomizuka, and C. Choi, "RAIN: Reinforced hybrid attention inference network for motion forecasting," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16096–16106.
[5] C. Choi, J. H. Choi, J. Li, and S. Malla, "Shared cross-modal trajectory prediction for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 244–253.
[6] H. Ma, Y. Sun, J. Li, M. Tomizuka, and C. Choi, "Continual multi-agent interaction behavior prediction with conditional generative memory," IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 8410–8417, 2021.
[7] J. Sochman and D. C. Hogg, "Who knows who - inverting the social force model for finding groups," in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, pp. 830–837.
[8] D. Helbing and P. Molnar, "Social force model for pedestrian dynamics," Physical Review E, vol. 51, no. 5, p. 4282, 1995.
[9] G. Antonini, M. Bierlaire, and M. Weber, "Discrete choice models of pedestrian walking behavior," Transportation Research Part B: Methodological, vol. 40, no. 8, pp. 667–687, 2006.
[10] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 283–298, 2007.
[11] N. Lee and K. M. Kitani, "Predicting wide receiver trajectories in American football," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2016, pp. 1–9.
[12] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, "Structural-RNN: Deep learning on spatio-temporal graphs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5308–5317.
[13] A. Vemula, K. Muelling, and J. Oh, "Social attention: Modeling attention in human crowds," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 4601–4607.
[14] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," Advances in Neural Information Processing Systems, vol. 28, pp. 3483–3491, 2015.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[16] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, "DESIRE: Distant future prediction in dynamic scenes with interacting agents," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.
[17] H. Ma, J. Li, W. Zhan, and M. Tomizuka, "Wasserstein generative learning with kinematic constraints for probabilistic interactive driving behavior prediction," in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 2477–2483.
[18] J. Li, H. Ma, Z. Zhang, J. Li, and M. Tomizuka, "Spatio-temporal graph dual-attention network for multi-agent prediction and tracking," IEEE Transactions on Intelligent Transportation Systems, 2021.
[19] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2255–2264.
[20] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, S. H. Rezatofighi, and S. Savarese, "Social-BiGAT: Multimodal trajectory forecasting using Bicycle-GAN and graph attention networks," arXiv preprint arXiv:1907.03395, 2019.
[21] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, "SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1349–1358.
[22] T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu, "Multi-agent tensor fusion for contextual trajectory prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12126–12134.
[23] J. Li, H. Ma, and M. Tomizuka, "Conditional generative neural system for probabilistic trajectory prediction," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 6150–6156.
[24] S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[25] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[27] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[28] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[29] A. Mohamed, K. Qian, M. Elhoseiny, and C. Claudel, "Social-STGCNN: A social spatio-temporal graph convolutional neural network for human trajectory prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14424–14432.
[30] M. Li, S. Chen, Y. Zhao, Y. Zhang, Y. Wang, and Q. Tian, "Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 214–223.
[31] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras, "Human motion trajectory prediction: A survey," The International Journal of Robotics Research, vol. 39, no. 8, pp. 895–935, 2020.
[32] J. Zhong, W. Cai, L. Luo, and H. Yin, "Learning behavior patterns from video: A data-driven framework for agent-based crowd modeling," in Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 2015, pp. 801–809.
[33] I. A. Lawal, F. Poiesi, D. Anguita, and A. Cavallaro, "Support vector motion clustering," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2395–2408, 2016.
[34] N. Bisagno, B. Zhang, and N. Conci, "Group LSTM: Group trajectory prediction in crowded scenarios," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[35] S. Atev, G. Miller, and N. P. Papanikolopoulos, "Clustering of vehicle trajectories," IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 3, pp. 647–657, 2010.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[38] S. Zhao, J. Song, and S. Ermon, "InfoVAE: Balancing learning and inference in variational autoencoders," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 5885–5892.
[39] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, "You'll never walk alone: Modeling social behavior for multi-target tracking," in 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009, pp. 261–268.
[40] A. Lerner, Y. Chrysanthou, and D. Lischinski, "Crowds by example," in Computer Graphics Forum, vol. 26, no. 3. Wiley Online Library, 2007, pp. 655–664.
[41] B. Ivanovic and M. Pavone, "The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2375–2384.