Gaze-Informed Multi-Objective Imitation Learning from Human Demonstrations
Ritwik Bera,1* Vinicius G. Goecks,2* Gregory M. Gremillion,2 Vernon J. Lawhern,2 John Valasek,1 Nicholas R. Waytowich2,3
1 Texas A&M University, 2 U.S. Army Research Laboratory, 3 Columbia University
{ritwik, vinicius.goecks, valasek}@tamu.edu, {gregory.m.gremillion, vernon.j.lawhern, nicholas.r.waytowich}.civ@mail.mil
* Equal contribution.

arXiv:2102.13008v1 [cs.LG] 25 Feb 2021

Abstract

In the field of human-robot interaction, teaching learning agents from human demonstrations via supervised learning has been widely studied and successfully applied to multiple domains such as self-driving cars and robot manipulation. However, the majority of the work on learning from human demonstrations utilizes only behavioral information from the demonstrator, i.e. what actions were taken, and ignores other useful information. In particular, eye gaze information can give valuable insight towards where the demonstrator is allocating their visual attention, and leveraging such information has the potential to improve agent performance. Previous approaches have only studied the utilization of attention in simple, synchronous environments, limiting their applicability to real-world domains. This work proposes a novel imitation learning architecture to learn concurrently from human action demonstration and eye tracking data to solve tasks where human gaze information provides important context. The proposed method is applied to a visual navigation task, in which an unmanned quadrotor is trained to search for and navigate to a target vehicle in a real-world, photorealistic simulated environment. When compared to a baseline imitation learning architecture, results show that the proposed gaze-augmented imitation learning model is able to learn policies that achieve significantly higher task completion rates, with more efficient paths, while simultaneously learning to predict human visual attention. This research aims to highlight the importance of multimodal learning of visual attention information from additional human input modalities and encourages the community to adopt them when training agents from human demonstrations to perform visuomotor tasks.

Figure 1: Block diagram of the proposed multi-objective learning architecture to learn from multimodal human demonstration signals, i.e. actions and eye gaze data.

1 Introduction

In the field of human-robot interaction, learning from demonstrations (LfD), also referred to as imitation learning, is widely used to rapidly train artificial agents to mimic the demonstrator via supervised learning (Argall et al. 2009; Osa et al. 2018). LfD using human-generated data has been widely studied and successfully applied to multiple domains such as self-driving cars (Codevilla et al. 2019), robot manipulation (Rahmatizadeh et al. 2018), and navigation (Silver, Bagnell, and Stentz 2010).

While LfD is a simple and straightforward approach for teaching intelligent behavior, it still suffers from sample complexity issues when learning a behavior policy directly from images, i.e. mapping what the robot's camera sees to what action it should take. The majority of the work on LfD utilizes only behavioral information from the demonstrator, i.e. what actions were taken, and ignores other information such as the eye gaze of the demonstrator (Argall et al. 2009). Eye gaze is an especially useful signal, as it can give valuable insight towards where the demonstrator is allocating their visual attention (Doshi and Trivedi 2012), and leveraging such information has the potential to improve agent performance when using LfD.

Eye gaze is an unobtrusive input signal that can be easily collected with the help of widely available eye tracking hardware (Poole and Ball 2006). Eye gaze data comes at almost no additional cost when teleoperating robots to collect expert demonstration data, as the human operator is able to do the task naturally as before. Eye gaze acts as an important signal in guiding our actions and filtering relevant parts of the environment that we perceive (Schütz, Braun, and Gegenfurtner 2011), and as such, measuring eye gaze gives us an indication of visual attention that can be leveraged when training AI agents.
Previous works that have attempted to utilize eye gaze for imitation learning (such as Zhang et al. (2018)) have only done so in simple, synchronous environments, limiting their applicability to real-world domains. In this paper, we utilize multi-objective learning for gaze-informed imitation learning of real-world, asynchronous robotics tasks, such as autonomous quadrotor navigation, which have proven difficult for traditional imitation learning techniques (Ross, Gordon, and Bagnell 2011; Bojarski et al. 2016).

This work frames the problem of visual quadrotor navigation as a supervised multi-objective learning problem wherein our model has an input space comprised of onboard inertial measurements, RGB and depth images, and attempts to predict both the demonstrator's actions as well as the demonstrator's eye gaze while performing a given task, as shown in Figure 1. The benefit of using a multi-objective optimization framework is two-fold. First and foremost, if a task is noisy or data is limited and high-dimensional, it can be difficult for a model to differentiate between relevant and irrelevant features. Multi-objective learning helps the model focus its attention on features that actually matter, since optimizing multiple objectives simultaneously forces the model to learn relevant features that are invariant across each objective (Caruana 1993; Abu-Mostafa 1990; Ruder 2017). Finally, a multi-objective learning framework, as proposed by this work, enables the eye gaze prediction loss to act as a regularizer by introducing an inductive bias. As such, it reduces the risk of overfitting as well as the Rademacher complexity of the model, i.e. its ability to fit random noise (Baxter 2000).

The main contributions of this work are:

• We propose a novel multi-objective learning architecture to learn from multimodal signals from human demonstrations, from both actions and eye gaze data.

• We demonstrate our approach using an asynchronous, real-world quadrotor navigation task with high-dimensional state and action spaces and a high-fidelity, photorealistic simulator.

• We show that our gaze-augmented imitation learning model is able to significantly outperform a baseline implementation, a behavior cloning model that does not utilize gaze information, in terms of task completion.

2 Related Work

Understanding Gaze Behavior

Saran et al. (2019) highlight the importance of understanding gaze behavior for use in imitation learning. The authors characterized gaze behavior and its relation to human intention while demonstrating object manipulation tasks to robots via kinesthetic teaching and video demonstrations. For this type of goal-oriented task, Saran et al. show that users mostly fixate at goal-related objects, and that gaze information could resolve ambiguities during subtask classification, reflecting the user's internal reward function. In their study the authors highlight that gaze behavior for teaching could vary for more complex tasks that involve search and planning, such as the one presented in this study.

Gaze-Augmented Imitation Learning

Augmenting imitation learning with eye gaze data has been shown to improve policy performance when compared to unaugmented imitation learning (Zhang et al. 2018) and improve policy generalization (Liu et al. 2019) in visuomotor tasks, such as playing Atari games and simulated driving. Xia et al. (2020) also proposed a periphery-fovea multi-resolution driving model to predict human attention and improve accuracy on a fixed driving dataset.

Attention Guided Imitation Learning (Zhang et al. 2018), or AGIL, collected human gaze and joystick demonstrations while playing Atari games and presented a two-step training procedure to incorporate gaze data in imitation learning, training first a gaze prediction network modeled as a human-like foveation system, then a policy network using the gaze predictions represented as attention heatmaps. The reported gains in policy performance vary from 3.4% to 1143.8%, depending on the game complexity and the number of sub-tasks the player has to attend to, which illustrates how the benefit of a multimodal learning approach is tied to the desired task to be solved. However, to eliminate effects of human reaction time and fatigue, eye gaze was collected in a synchronous fashion in which the environment only advanced to the next state once the human took an action, and game time was limited to 15 minutes, followed by a 20-minute rest period. This highlights the challenges of human-in-the-loop machine learning approaches and real-time human data collection. It also limits the applicability of attention-guided techniques to relatively simple domains where the environment can easily be stopped and started to synchronously line up with human input, and will not work in real-world environments such as quadrotor navigation.

Liu et al. (2019) proposed two approaches to incorporate gaze into imitation learning: 1) using gaze information to modulate input images, as opposed to using gaze as an additional policy input; and 2) using gaze to control dropout rates at the convolution layers during training. Both approaches use a pre-trained gaze prediction network before incorporating it into imitation learning. Evaluated on a simulated driving task, Liu et al. (2019) showed that both approaches reduced generalization error when compared to imitation learning without gaze, with the second approach (using gaze to control dropout rates during training) yielding about 24% error reduction compared to 17% for the first approach when predicting steering wheel angles on unseen driving tasks.

3 Methods

We train a novel multi-objective model via supervised learning to predict actions to accomplish a given task and the location of eye gaze of a human operating the quadrotor through a first-person camera view. The training is accomplished by collecting a set of human demonstrations where we record the observation space of the quadrotor (i.e. cameras and IMU sensors) as well as the demonstrator's eye gaze position and their input actions while performing a quadrotor search and navigation task.

Model Architecture

We take an end-to-end supervised learning approach to solving the visual search and navigation problem, aiming to avoid making any structural assumptions with regard to either the task being performed by the quadrotor or the environment the quadrotor is expected to operate in.
Figure 2: Model architecture illustrating how quadrotor onboard sensor data and camera frames are processed, how features are combined in a shared representation backbone, and how multiple outputs, i.e. the gaze and action prediction networks, are produced via independent model heads.

The model architecture is illustrated in Figure 2 and explained in the following paragraphs. The input space of the model is composed of three modalities:

• The quadrotor has access to a constant stream of RGB frames, 704 by 480 pixel resolution, captured by its onboard forward-facing camera, which are stored in an image buffer. Each frame is passed through a truncated version of a pre-trained ResNet34 (He et al. 2016), which includes only its first three convolutional blocks, generating a feature map of dimensions 128x28x28 (channel, height, width) that is stored in an intermediate buffer. Given the photorealistic scenes rendered by Microsoft AirSim's (Shah et al. 2018) Unreal Engine based simulator, ResNet was chosen as the pre-trained feature extractor as it was trained with real-world images. At each time instant t, the feature map corresponding to time instant t is concatenated with the feature map corresponding to time instant t-8, across the channel dimension, generating a 256x28x28 dimensional tensor. The concatenated feature maps are passed through a sequence of trainable convolution layers, as described in Figure 2, forming a 484-dimensional vector of visual features and the first input to the model.

• The quadrotor also has access to depth images, which are generated by the simulated onboard camera from which the RGB frames are obtained and, consequently, share the same geometric reference frame. The depth images, 704 by 480 pixel resolution, are passed through a Histogram of Oriented Gradients (Dalal and Triggs 2005) feature extractor that generates a 512-dimensional feature vector. This forms the second input to the model, HOG Features in Figure 2.

• The quadrotor also outputs filtered data from its simulated gyroscope and accelerometer, providing the model access to the vehicle's angular orientation, velocity, and acceleration, linear velocity and acceleration, and altitude, but no absolute position such as given by a GPS device, for a total of 16 features, as shown by Kinematic Features in Figure 2. The model is intentionally not provided with absolute position information in order to force it to be reliant on visual cues to perform the task instead of memorizing GPS information.

The resulting Visual, HOG, and Kinematic Features are concatenated, batch-normalized, and then passed through two fully connected layers with 128 output units each and a ReLU activation function, forming the shared backbone of the proposed multi-headed model. The intermediate feature vector is then passed through two separate output heads, the Action Prediction Head and the Gaze Prediction Head, as seen in the right side of Figure 2. Each head has a hidden layer with 32 units and an output layer. The Action Prediction Head outputs 4 scalar values bounded between -1 and 1 representing joystick values that control the quadrotor linear velocities associated with movement in the x, y, and z directions as well as the yaw. The Gaze Prediction Head outputs 2 scalar values bounded between 0 and 1 representing normalized eye gaze coordinates in the quadrotor camera's frame of reference.
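For reference, the shared backbone and the two output heads described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code: the layer widths follow the text (484 visual, 512 HOG, and 16 kinematic features; two 128-unit shared layers; 32-unit heads), while the bounding activations (tanh for actions, sigmoid for gaze) and the upstream convolutional stack that produces the 484-dimensional visual vector are assumptions.

# Minimal PyTorch sketch of the multi-headed model described above.
# Layer widths follow the text; the tanh/sigmoid output bounds are assumptions.
import torch
import torch.nn as nn

class GazeActionNet(nn.Module):
    def __init__(self, visual_dim=484, hog_dim=512, kin_dim=16):
        super().__init__()
        in_dim = visual_dim + hog_dim + kin_dim
        self.backbone = nn.Sequential(
            nn.BatchNorm1d(in_dim),
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        # Action Prediction Head: 4 joystick axes bounded in [-1, 1].
        self.action_head = nn.Sequential(
            nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 4), nn.Tanh()
        )
        # Gaze Prediction Head: normalized (x, y) gaze coordinates in [0, 1].
        self.gaze_head = nn.Sequential(
            nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 2), nn.Sigmoid()
        )

    def forward(self, visual, hog, kinematic):
        shared = self.backbone(torch.cat([visual, hog, kinematic], dim=-1))
        return self.action_head(shared), self.gaze_head(shared)

# Example forward pass with dummy feature batches.
model = GazeActionNet()
actions, gaze = model(torch.randn(8, 484), torch.randn(8, 512), torch.randn(8, 16))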
Training Objective

The loss function used in the training routine is a linear combination of a Gaze Prediction Loss, L_GP, and a Behavior Cloning Loss, L_BC:

L(θ) = λ1 · L_GP + λ2 · L_BC,   (1)

where θ represents the set of trainable model parameters and λ1 and λ2 are hyperparameters controlling the contribution of each loss component, defined as the mean-squared error between ground truth and predicted values:

L_GP = (π_gaze − g_t)²,
L_BC = (π_action − a_t)²,

where g_t is a vector representing the true x, y coordinates of the eye gaze in the camera frame at instant t, and a_t is the vector representing the expert joystick action components for forward velocity, lateral velocity, yaw angular velocity, and throttle taken at that same time instant t. π_gaze and π_action are the outputs of the Gaze Prediction and Action Prediction Heads, respectively.
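As a concrete reference, the combined objective in Equation 1 can be written in a few lines of PyTorch. The sketch below reuses the GazeActionNet sketch from the Model Architecture section and assumes mean-squared-error losses with λ1 = λ2 = 1.0 (the setting reported in the Results section); the optimizer, learning rate, and batch size follow the Experimental Setup text, but this is not the authors' exact training loop.

# Sketch of the training objective in Equation 1 (reuses GazeActionNet above).
import torch
import torch.nn.functional as F

model = GazeActionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr 0.0003 as in the text
lambda_gp, lambda_bc = 1.0, 1.0  # Vanilla BC baseline: lambda_gp = 0.0

def training_step(visual, hog, kinematic, expert_action, expert_gaze):
    pred_action, pred_gaze = model(visual, hog, kinematic)
    loss_gp = F.mse_loss(pred_gaze, expert_gaze)      # Gaze Prediction Loss (L_GP)
    loss_bc = F.mse_loss(pred_action, expert_action)  # Behavior Cloning Loss (L_BC)
    loss = lambda_gp * loss_gp + lambda_bc * loss_bc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a dummy batch of 128 samples (the batch size reported in the paper).
loss = training_step(torch.randn(128, 484), torch.randn(128, 512), torch.randn(128, 16),
                     torch.rand(128, 4) * 2 - 1, torch.rand(128, 2))

Setting λ1 to zero in this formulation recovers the gaze-free behavior cloning baseline used for comparison in the Results section.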
Figure 3: First-person view from the quadrotor illustrating the target vehicle (yellow truck) to be found and navigated to, and the realistic cluttered forest simulation environment using Microsoft AirSim (Shah et al. 2018).

Figure 4: Bird's-eye view of possible spawn locations of the quadrotor and the target vehicle in the simulated forest environment.

Task Description

The experiments were conducted in simulated environments rendered by the Unreal Engine using the Microsoft AirSim (Shah et al. 2018) plugin as an interface to receive data from and send commands to the quadrotor, a simulated Parrot AR.Drone vehicle. The task was a search and navigate task where the drone must seek out a target vehicle and then navigate towards it in a photo-realistic cluttered forest simulation environment, as seen in Figure 3. This environment emulates sun glare and presents trees and uneven rocky terrain as obstacles for navigation and visual identification of the target vehicle.

Data Collection Procedure

The task was presented to the demonstrator on a 23.8-inch display, 1920x1080 pixel resolution, and eye gaze data collection was conducted using a screen-mounted eye tracker positioned at a distance of 61 cm from the demonstrator's eye. Before the data collection, the height of the demonstrator's chair was adjusted in order to position their head at the optimal location for gaze tracking. The eye tracker sensor was calibrated according to an 8-point manufacturer-provided software calibration procedure. The demonstrator avoided moving the chair and minimized torso movements during data collection, while moving their eyes naturally. The demonstrator was also given time to acclimate to the task until they judged for themselves that they were confident in performing it. To perform the task, the demonstrator was only given access to the first-person view of the quadrotor and no additional information about their current location or the target's current location in the map. Using an Xbox One joystick, the demonstrator was able to control quadrotor throttle and yaw rate using the left stick and forward and lateral velocity commands using the right stick, as is standard for aerial vehicles. The complete data collection setup is shown in Figure 5.

Training data is collected with both initial locations for the quadrotor and target vehicle randomly sampled without replacement from a set of 25 possible locations covering a pre-defined area of the map, as shown in Figure 4. This prevents invalid initial locations, such as inside the ground or trees, while still covering the desired task area. Initial quadrotor heading is also uniformly sampled from 0 to 360 degrees. In total, 200 human demonstration trajectories representing unique quadrotor and vehicle location pairs (from a total number of 25 × 24 = 600 pairs) were collected to form an expert dataset for model training and testing. This includes RGB and depth images, the quadrotor's inertial sensor readings, and eye gaze coordinates.

Experimental Setup

The entire dataset of 200 trajectories is split into a training set of 180 trajectories and a test set of 20 trajectories for each experimental run. During exploratory data analysis on the collected dataset, it was found that the yaw-axis actions were mostly centered around zero with very low variance. To ensure that the agent is trained with enough data on how to properly turn and not just fly straight, weighted oversampling was employed in the training routine.
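The paper does not specify how the oversampling weights were chosen. The following is a minimal sketch of one common way to implement weighted oversampling in PyTorch, where demonstration frames with a noticeable yaw command are drawn more often; the dummy data, the yaw index, the 0.1 threshold, and the 10x weight are illustrative assumptions, not values from the paper.

# Hypothetical weighted oversampling of turning frames during training.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Dummy stand-ins for the recorded demonstration features and joystick actions;
# the yaw command is assumed to sit at index 2 of the 4-dimensional action.
features = torch.randn(1000, 1012)
actions = torch.randn(1000, 4).clamp(-1, 1)
dataset = TensorDataset(features, actions)

yaw_magnitude = actions[:, 2].abs()
# Frames with a noticeable yaw command get a larger sampling weight so that
# turning behavior is not drowned out by straight flight (values illustrative).
weights = torch.where(yaw_magnitude > 0.1,
                      torch.full_like(yaw_magnitude, 10.0),
                      torch.full_like(yaw_magnitude, 1.0))
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=128, sampler=sampler)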
Evaluation of each trained model is performed by conducting rollouts in the environment as follows. A configuration is defined as a pair consisting of an initial spawn location for the quadrotor and a target location where the target vehicle, that is, the truck, is spawned. Since there are 25 different starting and ending locations, there are a total of 600 unique configurations. We chose 200 unique configurations and collected 200 demonstration trajectories (1 for each configuration). The 200 configurations are then split into a training set and a testing set, following a 90/10 split. Expert demonstrations were collected that accomplish the task for each of those configurations. 180 trajectories (corresponding to the 180 training configurations) were used to train the model, and the remaining 20 configurations were used to test the model (i.e. performing rollouts in the environment).

Owing to the stochastic nature of the simulation dynamics and sensor sampling frequency, 5 rollouts are conducted for each configuration to obtain more accurate estimates of the evaluation metrics, explained in the next section. Since each test trajectory provides one unique configuration, a total of 100 rollouts are conducted while evaluating each model: 5 rollouts per configuration, for 20 different configurations. Since the test configurations remain the same for all models, evaluation consistency is ensured in each experimental run. All models are evaluated for 6 random seeds under identical conditions, including random seed value.

In terms of the hardware infrastructure, all experiments were run on a single machine running Ubuntu 18.04 LTS equipped with an Intel Core i9-7900X CPU, 128 GB of RAM, and an NVIDIA RTX 2080 Ti used only for Microsoft AirSim. In terms of the software infrastructure, the proposed model was implemented in Python v3.6.9 and PyTorch (Paszke et al. 2019) v1.5.0, with data handling using NumPy (Van Der Walt, Colbert, and Varoquaux 2011) v1.18.4 and Pandas (Wes McKinney 2010) v1.0.4, data storage using HDF5 via h5py v2.10.0, data versioning control via DVC v1.6.0, and experiment monitoring using MLflow (Zaharia et al. 2018) v1.10.0. Random seeds were set in Python, PyTorch, and NumPy using the methods random.seed(seed_value), torch.manual_seed(seed_value), and numpy.random.seed(seed_value), respectively. With respect to hyperparameters, the loss function is optimized using the Adam (Kingma and Ba 2014) optimizer with a learning rate of 0.0003 for 20 epochs and a batch size of 128 data samples. Hyperparameters for model size, such as the number of layers and units, are described in the "Model Architecture" section above. Hyperparameter tuning was done manually, and changes in network size had minor impact on the performance of the proposed approach.

Figure 5: Data collection setup used to record human gaze and joystick data, illustrating the relative positioning between user, task display, joystick, and eye tracker, as noted in the figure. The user is given only the first-person view of the quadrotor while performing the visual navigation task.

Metrics

In real-world robotics applications such as the one pursued in this work, standard machine learning metrics such as validation losses, and others that quantify how well a model has been trained on the given data, tend not to translate into high-performance models that can successfully satisfy end-user requirements. To that effect, this research employs custom metrics to evaluate the performance of our model when deployed to a new test scenario and to help demonstrate the effectiveness of using eye gaze as an additional supervision modality. The core metrics for comparison between the proposed model and baselines include:

• Task Completion Rate: evaluated by dividing the number of successful completions of the task by the total number of rollouts performed with that model. A successful task completion is defined as the quadrotor finding the target vehicle and intercepting it within a 5 meter radius.

• Collision Rate: evaluated by dividing the number of collisions during a rollout by the total number of rollouts performed with that model.

• Success weighted by (normalized inverse) Path Length (SPL): as defined in (Anderson et al. 2018), evaluated for episodes with successful completion by dividing the shortest-path distance from the agent's starting position to the target location by the distance actually taken by the agent to reach the goal. In this work, it is assumed the shortest-path distance is a straight line from the initial quadrotor location to the target vehicle location, irrespective of obstacles.
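For reference, below is a small sketch of how these rollout metrics could be computed from logged episode results. The episode record fields are hypothetical, and the SPL term uses the straight-line simplification described above together with the standard convention from Anderson et al. (2018) of counting failed episodes as zero; the paper does not spell out its exact post-processing.

# Hypothetical post-processing of logged rollouts into the three metrics above.
def summarize_rollouts(episodes):
    n = len(episodes)
    completion_rate = sum(e["success"] for e in episodes) / n
    collision_rate = sum(e["collided"] for e in episodes) / n
    # Straight-line SPL: shortest distance / max(shortest, actual path length),
    # with failed episodes contributing 0, as in Anderson et al. (2018).
    spl = sum(
        e["straight_line_dist"] / max(e["straight_line_dist"], e["path_length"])
        if e["success"] else 0.0
        for e in episodes
    ) / n
    return completion_rate, collision_rate, spl

# Example with two placeholder episode records.
episodes = [
    {"success": True, "collided": False, "straight_line_dist": 20.0, "path_length": 31.0},
    {"success": False, "collided": True, "straight_line_dist": 25.0, "path_length": 40.0},
]
print(summarize_rollouts(episodes))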
4 Results

The proposed gaze-augmented imitation learning model, denoted Gaze BC in this section, is compared against a standard imitation learning model with no gaze information, denoted Vanilla BC. Note that, when training our proposed gaze-augmented imitation learning model, the values of both λ1 and λ2 in Equation 1 are set to 1.0, while for training the baseline imitation learning model with no gaze information, λ1 is set to 0.0 while λ2 stays at 1.0.

In terms of policy performance, Table 1 provides a summary of the relevant metrics. Task Completion Rate can be seen to be higher for the Gaze BC model by approximately 13.7 percentage points. A Wilcoxon signed-rank test, a non-parametric statistical hypothesis test, was performed using the task completion rate values (one metric value per seed) for both models, to evaluate the statistical significance of the obtained results. A p-value of 0.0312 was obtained, which shows that our Gaze BC model outperforms the Vanilla BC model in statistically significant terms.

Metric | Vanilla BC | Gaze BC
Task Completion Rate (%) | 16.83 ± 2.84 | *30.5 ± 4.47
Collision Rate (%) | 52.5 ± 6.21 | 50.83 ± 5.97
Success weighted by Path Length (SPL) | 0.08 ± 0.02 | 0.14 ± 0.03

Table 1: Metric comparison in terms of mean and standard error between the proposed method, gaze-augmented imitation learning (Gaze BC), and the main imitation learning baseline with no gaze information (Vanilla BC). Asterisk (*) indicates statistical significance.

Collision rate remains roughly the same for both the Gaze BC and the Vanilla BC model. This can be attributed to the fact that both models have access to the same observation space (visual features, HOG features, and kinematic features), and thus both models have the information needed for collision avoidance.

SPL for the Gaze BC model can be seen to be 1.75x the SPL for the Vanilla BC model. This shows that whenever the models successfully accomplish the task, the Gaze BC model, on average, takes a significantly shorter path to reach the goal than the Vanilla BC model. This result again illustrates the superior performance of a policy trained with eye-gaze data.
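As a reference for the paired, non-parametric comparison reported above, a minimal sketch using SciPy is shown below; the per-seed completion-rate values are placeholders, not the values from this study.

# Hypothetical example of the per-seed Wilcoxon signed-rank comparison above.
# The six values per model are placeholders, not the study's data.
from scipy.stats import wilcoxon

vanilla_bc = [0.15, 0.12, 0.20, 0.18, 0.14, 0.22]  # completion rate per seed (placeholder)
gaze_bc    = [0.28, 0.25, 0.35, 0.31, 0.27, 0.37]  # completion rate per seed (placeholder)

statistic, p_value = wilcoxon(vanilla_bc, gaze_bc)
print(f"Wilcoxon signed-rank test: statistic={statistic}, p={p_value:.4f}")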
5 Discussion

In this work we proposed a novel imitation learning architecture to learn concurrently from human actions and eye gaze to solve tasks where gaze information provides important context. Specifically, we applied the proposed method to a visual search and navigation task, in which an unmanned quadrotor is trained to search for and navigate to a target vehicle in an asynchronous, photorealistic environment. When compared to a baseline imitation learning architecture, results show that the proposed gaze-augmented imitation learning model is able to learn policies that achieve significantly higher task completion rates, with more efficient paths, while simultaneously learning to predict human visual attention.

The closest related work to ours is Attention Guided Imitation Learning (Zhang et al. 2018), or AGIL, which presented a two-step training process to train a gaze prediction network and then a policy to perform imitation learning on Atari games, as explained in the Related Work section. We did not perform a direct comparison to AGIL due to the fact that AGIL currently operates on synchronous environments which can be paused, allowing the game states to be synchronized with the human data. The data collection process for AGIL requires advancing the game state one step at a time after the human performs each action. This is feasible when working with simple environments such as Atari where the game engine can be paused; however, we aimed to develop an approach that could be translated to real-world applications running asynchronously in real time. Moreover, AGIL's architecture does not include any pretrained feature extractor, which significantly increases data collection requirements when dealing with vision-based robotics tasks like the one featured in this work.

With respect to the gaze prediction performance, four distinct gaze patterns observed during the human demonstrations were also learned by the Gaze Prediction Head of the proposed model, as seen in Figure 6: 1) a "motion leading" gaze pattern where gaze attends to the sides of the image followed by yaw motion in the same direction, as illustrated in Figure 6a. This pattern is mostly observed at the beginning of the episode when the target vehicle is not in the field-of-view of the agent; 2) a "target fixation" gaze pattern where gaze is fixated on the target during the final approach, as illustrated in Figure 6b. In this pattern gaze is fixed at the top of the target vehicle, independent of the current motion of the agent; 3) a "saccade" gaze pattern where gaze rapidly switches fixation between nearby obstacles (Salvucci and Goldberg 2000), as illustrated in Figure 6c. This pattern is characteristic when there are multiple obstacles between the current agent location and the target vehicle; and 4) an "obstacle fixation" gaze pattern where gaze attends to nearby obstacles when navigating to the target, as learned from multiple demonstrations, which might differ from the obstacle the user is attending to at the moment, as illustrated in Figure 6d. This pattern is mostly observed when the quadrotor is in close distance to a potential obstacle. This illustrates how the Gaze Prediction Head of the proposed model was able to capture and replicate visual attention similar to that demonstrated by the user when performing the task.

Limitations and Future Work

A limitation of the current work is not being able to complete the proposed task at a higher rate. We argue that one simple reason for this issue, as in all end-to-end learning approaches, is the limited training dataset. This issue is accentuated in human-in-the-loop machine learning tasks, such as the one used in this work, since human-generated data is scarce. For this work we only had 200 human-generated trajectories available and believe that the task completion rate could increase by simply training the model with more data. This additional data could come from more independent trajectories, by labelling agent-generated trajectories and aggregating them to the original dataset (Ross, Gordon, and Bagnell 2011), or by tasking the human to oversee the agent and perform interventions when the agent's policy is close to failing, also aggregating this intervention data to the original dataset (Goecks et al. 2019). In this case, the ability of a human trainer to enact timely interventions may be greatly improved by the gaze predictions the model generates, which would highlight instances where this artificial gaze deviates from what they expect, potentially indicating subsequent undesirable behavior and providing some level of model interpretability or explainability.

Another possible avenue for future work is the addition of a one-hot encoded task specifier to the input space of the model, which could aid in training models that learn, like humans, to adapt their attention and behavior according to the task at hand.
Figure 6: Visualization of a sequence of three frames illustrating characteristic gaze patterns observed in the human demonstrations, shown as magenta circles in the figure, which were also learned by the proposed model, shown as cyan circles (best viewed in color). The larger, less transparent circle illustrates the current gaze observation and the smaller, more transparent circles represent gaze (and gaze prediction) from previous time-steps. Panels: (a) "Motion leading" gaze pattern; (b) "Target fixation" gaze pattern; (c) "Saccade" gaze pattern; (d) "Obstacle fixation" gaze pattern. Full videos with gaze predictions can be seen at the project webpage mentioned in the Discussion section.

Furthermore, the integration of additional human input modalities to the proposed approach, such as natural language, could similarly be used to condition the model to perform multiple different tasks. The combination of gaze and natural language would enable humans to more naturally interact with learning agents, and enable those agents to disambiguate demonstrator behavior and attention dynamics that are otherwise ambiguous because they are pursuant to distinct tasks and goals that could be grounded to language commands. The ability to use eye-gaze data to leverage a human's visual attention opens the door to adapting this research to learning unified policies that can generalize across multiple context- and goal-dependent tasks.

Acknowledgments

Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-20-2-0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

6 Ethics Statement

Imitation learning-based agents, to a greater degree when learning from human demonstrations, are susceptible to learning the bias present in the training dataset. Specifically related to this work, the learning agent would learn to replicate action and gaze patterns demonstrated in the training dataset. This issue can best be addressed during the data collection phase, where action and gaze information would be collected from a diverse pool of users with different backgrounds that could influence their action and gaze patterns, and consequently, the task execution. A positive ethical impact of learning from additional input modalities is being able to capture more information from humans, which in turn can be used to better train artificial intelligent agents that can model and act according to human preferences.

References

Abu-Mostafa, Y. S. 1990. Learning from hints in neural networks. Journal of Complexity 6(2): 192–198.

Anderson, P.; Chang, A. X.; Chaplot, D. S.; Dosovitskiy, A.; Gupta, S.; Koltun, V.; Kosecka, J.; Malik, J.; Mottaghi, R.; Savva, M.; and Zamir, A. R. 2018. On Evaluation of Embodied Navigation Agents. CoRR abs/1807.06757. URL http://arxiv.org/abs/1807.06757.

Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and Autonomous Systems 57(5): 469–483.

Baxter, J. 2000. A model of inductive bias learning. Journal of Artificial Intelligence Research 12: 149–198.

Bojarski, M.; Testa, D. D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L. D.; Monfort, M.; Muller, U.; Zhang, J.; Zhang, X.; Zhao, J.; and Zieba, K. 2016. End to End Learning for Self-Driving Cars. CoRR abs/1604.07316. URL http://arxiv.org/abs/1604.07316.

Caruana, R. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of the Tenth International Conference on Machine Learning, ICML'93, 41–48. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 1558603077. doi:10.5555/3091529.3091535.

Codevilla, F.; Santana, E.; López, A. M.; and Gaidon, A. 2019. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, 9329–9338.

Dalal, N.; and Triggs, B. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, 886–893. IEEE.
Doshi, A.; and Trivedi, M. M. 2012. Head and eye gaze dynamics during visual attention shifts in complex environments. Journal of Vision 12(2): 9–9.

Goecks, V. G.; Gremillion, G. M.; Lawhern, V. J.; Valasek, J.; and Waytowich, N. R. 2019. Efficiently Combining Human Demonstrations and Interventions for Safe Training of Autonomous Systems in Real-Time. In Proceedings of the AAAI Conference on Artificial Intelligence, 2462–2470. AAAI Press. doi:10.1609/aaai.v33i01.33012462.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Liu, C.; Chen, Y.; Tai, L.; Liu, M.; and Shi, B. E. 2019. Utilizing Eye Gaze to Enhance the Generalization of Imitation Networks to Unseen Environments. CoRR abs/1907.04728. URL http://arxiv.org/abs/1907.04728.

Osa, T.; Pajarinen, J.; Neumann, G.; Bagnell, J. A.; Abbeel, P.; and Peters, J. 2018. An Algorithmic Perspective on Imitation Learning. Foundations and Trends® in Robotics 7(1-2): 1–179. ISSN 1935-8253. doi:10.1561/2300000053. URL http://dx.doi.org/10.1561/2300000053.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Poole, A.; and Ball, L. J. 2006. Eye tracking in HCI and usability research. In Encyclopedia of Human Computer Interaction, 211–219. IGI Global.

Rahmatizadeh, R.; Abolghasemi, P.; Bölöni, L.; and Levine, S. 2018. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 3758–3765. IEEE.

Ross, S.; Gordon, G.; and Bagnell, D. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Gordon, G.; Dunson, D.; and Dudík, M., eds., Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, 627–635. Fort Lauderdale, FL, USA: PMLR. URL http://proceedings.mlr.press/v15/ross11a.html.

Ruder, S. 2017. An Overview of Multi-Task Learning in Deep Neural Networks.

Salvucci, D. D.; and Goldberg, J. H. 2000. Identifying fixations and saccades in eye-tracking protocols. In Proceedings of the 2000 Symposium on Eye Tracking Research & Applications, 71–78.

Saran, A.; Short, E. S.; Thomaz, A.; and Niekum, S. 2019. Understanding Teacher Gaze Patterns for Robot Learning. In Conference on Robot Learning (CoRL). URL http://arxiv.org/abs/1907.07202.

Schütz, A. C.; Braun, D. I.; and Gegenfurtner, K. R. 2011. Eye movements and perception: A selective review. Journal of Vision 11(5): 9–9. ISSN 1534-7362. doi:10.1167/11.5.9. URL https://doi.org/10.1167/11.5.9.

Shah, S.; Dey, D.; Lovett, C.; and Kapoor, A. 2018. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Hutter, M.; and Siegwart, R., eds., Field and Service Robotics, 621–635. Cham: Springer International Publishing. ISBN 978-3-319-67361-5. doi:10.1007/978-3-319-67361-5_40.

Silver, D.; Bagnell, J. A.; and Stentz, A. 2010. Learning from demonstration for autonomous navigation in complex unstructured terrain. The International Journal of Robotics Research 29(12): 1565–1592.

Van Der Walt, S.; Colbert, S. C.; and Varoquaux, G. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13(2): 22.

Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Stéfan van der Walt; and Jarrod Millman, eds., Proceedings of the 9th Python in Science Conference, 56–61. doi:10.25080/Majora-92bf1922-00a.

Xia, Y.; Kim, J.; Canny, J.; Zipser, K.; Canas-Bajo, T.; and Whitney, D. 2020. Periphery-fovea multi-resolution driving model guided by human attention. In The IEEE Winter Conference on Applications of Computer Vision, 1767–1775.

Zaharia, M.; Chen, A.; Davidson, A.; Ghodsi, A.; Hong, S. A.; Konwinski, A.; Murching, S.; Nykodym, T.; Ogilvie, P.; Parkhe, M.; et al. 2018. Accelerating the Machine Learning Lifecycle with MLflow.

Zhang, R.; Liu, Z.; Zhang, L.; Whritner, J. A.; Muller, K. S.; Hayhoe, M. M.; and Ballard, D. H. 2018. AGIL: Learning Attention from Human for Visuomotor Tasks. CoRR abs/1806.03960. URL http://arxiv.org/abs/1806.03960.