Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory

Arun Balajee Vasudevan · Dengxin Dai · Luc Van Gool

arXiv:1910.02029v2 [cs.CV] 3 Feb 2020

Abstract  The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. In order to keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset with verbal navigation instructions. To this end, we have developed an interactive visual navigation environment based on Google Street View; we further design an annotation method to highlight mined anchor landmarks and local directions between them in order to help annotators formulate typical, human references to those. The annotation task was crowdsourced on the AMT platform, to construct a new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions – one for matching the next upcoming visual landmark and the other for matching the local directions to the next landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of the advances in two lines of research: mental formalization of verbal navigational instructions and training neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset and code, please refer to our project page.

Keywords  Vision-and-language Navigation · Long-range Navigation · Spatial Memory · Dual Attention

Arun Balajee Vasudevan · Dengxin Dai · Luc Van Gool
ETH Zurich, Switzerland
E-mail: {arunv, dai, vangool}@vision.ee.ethz.ch

Luc Van Gool
KU Leuven, Belgium

1 Introduction

Consider that you are traveling as a tourist in a new city and are looking for a destination that you would like to visit. You ask the locals and get a directional description "go ahead for about 200 meters until you hit a small intersection, then turn left and continue along the street before you see a yellow building on your right". People give indications that are not purely directional, let alone metric. They mix in referrals to landmarks that you will find along your route. This may seem like a trivial ability, as humans do this routinely. Yet, this is a complex cognitive task that relies on the development of an internal, spatial representation that includes visual landmarks (e.g. "the yellow building") and possible, local directions (e.g. "going forward for about 200 meters"). Such a representation can support continuous self-localization as well as convey a sense of direction towards the goal.

Just as a human can navigate in an environment when provided with navigational instructions, our aim is to teach an agent to perform the same task to make the human-robot interaction more intuitive. The problem has recently been tackled as a Vision-and-Language Navigation (VLN) problem [5]. Although important progress has been made, e.g. in constructing good datasets [5, 16] and proposing effective learning methods [5, 23, 73, 54], this stream of work mainly focuses on synthetic worlds [35, 19, 15] or indoor room-to-room navigation [5, 23, 73]. Synthetic environments limit the complexity of the visual scenes, while room-to-room navigation comes with challenges different from those of outdoors.

The first challenge to learning a long-range wayfinding model lies in the creation of large-scale datasets. In order to be fully effective, the annotators providing the navigation instructions ought to know the environment like locals would. Training annotators to reach the same level of understanding
Fig. 1: (a) An illustration of an agent finding its way from a source point to a destination. (b) Status of the agent at the current location, including the segmented navigational instructions with red indicating visual landmark descriptions and blue the local directional instructions, the agent's panoramic view, and a pie chart depicting the action space. The bold text represents the currently attended description. The predicted action at the current location is highlighted. The landmark that the agent is looking for at the moment is indicated by a yellow box.

for a large number of unknown environments is inefficient – to create one verbal navigation instruction, an annotator needs to search through hundreds of street-view images, remember their spatial arrangement, and summarize them into a sequence of route instructions. This straightforward annotation approach would be very time-consuming and error-prone. Because of this challenge, the state-of-the-art work uses synthetic directional instructions [36] or works mostly on indoor room-to-room navigation. For indoor room-to-room navigation, this challenge is less severe, due to two reasons: 1) the paths in indoor navigation are shorter; and 2) indoor environments have a higher density of familiar 'landmarks'. This makes self-localization, route remembering and describing easier.

To our knowledge, there is only one other work, by Chen et al. [16], on natural language based outdoor navigation, which also proposes an outdoor VLN dataset. Although they have designed a great method for data annotation through gaming – to find a hidden object at the goal position – the method is difficult to apply to longer routes (see the discussion in Section 3.2.2). Again, it is very hard for the annotators to remember, summarize and describe long routes in an unknown environment.

To address this challenge, we draw on the studies in cognition and psychology on human visual navigation which state the significance of using visual landmarks in route descriptions [39, 43, 66]. It has been found that route descriptions consist of descriptions for visual landmarks and local directional instructions between consecutive landmarks [58, 56]. Similar techniques – a combination of visual landmarks, as rendered icons, and highlighted routes between consecutive landmarks – are constantly used for making efficient maps [67, 74, 26]. We divide the task of generating the description for a whole route into a number of sub-tasks consisting of generating descriptions for visual landmarks and generating local directional instructions between consecutive landmarks. This way, the annotation tasks become simpler.

We develop an interactive visual navigation environment based on Google Street View, and more importantly design a novel annotation method which highlights selected landmarks and the spatial transitions in between. This enhanced annotation method makes it feasible to crowdsource this complicated annotation task. By hosting the tasks on the Amazon Mechanical Turk (AMT) platform, this work has constructed a new dataset, Talk2Nav, with 10,714 long-range routes within New York City (NYC).

The second challenge lies in training a long-range wayfinding agent. This learning task requires accurate visual attention and language attention, accurate self-localization and a good sense of direction towards the goal. Inspired by the research on mental conceptualization of navigational instructions in spatial cognition [67, 56, 49], we introduce a soft attention mechanism defined over the segmented language instructions to jointly extract two partial instructions – one for matching the next coming visual landmark and the other for matching the spatial transition to the next landmark. Furthermore, the spatial transitions of the agent are encoded by an explicit memory framework which can be read from and written to as the agent navigates. One example of the outdoor VLN task can be found in Figure 1. Our work connects two lines of research that have been less explored together so far: mental formalization of verbal navigational instructions [67, 56, 49] and training neural network agents for automatic wayfinding [5, 36].

Extensive experiments show that our method outperforms previous methods by a large margin. We also show the contributions of the sub-components of our method, accompanied with their detailed ablation studies. The collected dataset will be made publicly available.
2 Related Works

Vision & Language. Research at the intersection of language and vision has been conducted extensively in the last few years. The main topics include image captioning [44, 78], visual question answering (VQA) [2, 6], object referring expressions [8, 10], and grounded language learning [35, 37]. Although the goals are different from ours, some of the fundamental techniques are shared. For example, it is a common practice to represent visual data with CNNs pre-trained for image recognition and to represent textual data with word embeddings pre-trained on large text corpora. The main difference is that the perceptual input to those systems is static while ours is active, i.e. the system's behavior changes the perceived input.

Vision Based Navigation. Navigation based on vision and reinforcement learning (RL) has become a very interesting research topic recently. The technique has proven quite successful in simulated environments [60, 83] and is being extended to more sophisticated real environments [59]. There has been active research on navigation-related tasks, such as localizing from only an image [75], finding the direction to the closest McDonald's using Google Street View images [46, 14], goal based visual navigation [30] and others. Gupta et al. [30] use a differentiable mapper which writes into a latent spatial memory corresponding to an egocentric map of the environment and a differentiable planner which uses this memory and the given goal to produce navigational actions in novel environments. There are a few other recent works on vision based navigation [64, 76]. Thoma et al. [64] formulate compact map construction and accurate self-localization for image-based navigation by a careful selection of suitable visual landmarks. Recently, Wortsman et al. [76] proposed a meta-reinforcement learning approach for visual navigation, where the agent learns to adapt to unseen environments in a self-supervised manner. There is another stream of work called end-to-end driving, aiming to learn vehicle navigation directly from video inputs [33, 34].

Vision-and-Language Navigation. The task is to navigate an agent in an environment to a particular destination based on language instructions. There are several recent works on the Vision-and-Language Navigation (VLN) task [5, 73, 23, 72, 61, 54, 45]. The general goal of these works is similar to ours – to navigate from a starting point to a destination in a visual environment with language directional descriptions. Anderson et al. [5] created the R2R dataset for indoor room-to-room navigation and proposed a learning method based on sequence-to-sequence neural networks. Subsequent methods [73, 72] apply reinforcement learning and cross-modal matching techniques on the same dataset. The same task was tackled by Fried et al. [23] using a speaker-follower technique to generate synthetic instructions for data augmentation and pragmatic inference. While sharing similarities, our work differs significantly from them. The environment domain is different; as discussed in Section 1, long-range navigation raises more challenges both in data annotation and model training. There are concurrent works aiming at extending Vision-and-Language Navigation to city environments [36, 16, 47]. The difference to Hermann et al. [36] lies in that our method works with real navigational instructions, instead of the synthetic ones summarized by Google Maps. This difference leads to different tasks and in turn to different solutions. Kim et al. [47] propose an end-to-end driving model that takes natural language advice to predict control commands to navigate in a city environment. Chen et al. [16] propose an outdoor VLN dataset similar to ours, where real instructions are created from Google Street View (https://developers.google.com/streetview/) images. However, our method differs in the way that it decomposes the navigational instructions to make the annotation easier. It draws inspiration from the spatial cognition field to specifically promote annotators' memory and thinking, making the task less energy-consuming and less error-prone. More details can be found in Section 3.

Attention for Language Modeling. The attention mechanism has been used widely for language [55] and visual modeling [78, 71, 4]. Language attention mechanisms have been shown to produce state-of-the-art results in machine translation [9] and other natural language processing tasks like VQA [42, 79, 40], image captioning [7], grounding referential expressions [40, 41] and others. The attention mechanism is one of the main components of top-performing algorithms such as the Transformer [68] and BERT [21] in NLP tasks. The MAC network by Hudson et al. [42] has a control unit which performs a weighted average of the question words based on a soft attention for the VQA task. Hu et al. [40] also proposed a similar language attention mechanism but they decomposed the reasoning into sub-tasks/modules and predicted modular weights from the input text. In our model, we adapt the soft attention proposed by Kumar et al. [50] by applying the soft attention over segmented language instructions to put the attention over two different sub-instructions: a) for landmarks and b) for local directions. The attention mechanism is named dual attention.

Memory. There are generally two kinds of memory used in the literature: a) implicit memory and b) explicit memory. Implicit memory learns to memorize knowledge in the hidden state vectors via back-propagation of errors. Typical examples include RNNs [44] and LSTMs [22]. Explicit memory, however, features explicit read and write modules with an attention mechanism. Notable examples include Neural Turing Machines [28] and Differentiable Neural Computers (DNCs) [29]. In our work, we use external explicit memory in the form of a memory image which can be read from and written to by its read and write modules. Training a soft attention mechanism over language segments coupled with
an explicit memory scheme makes our method more suitable for long-range navigation where the reward signals are sparse.

3 Talk2Nav Dataset

The target is to navigate using language descriptions in a real outdoor environment. Recently, a few datasets on the language based visual navigation task have been released, both for indoor [5] and outdoor [13, 16] environments. Existing datasets typically annotate one overall language description for the entire path/route. This poses challenges for annotating longer routes, as stated in Section 1. Furthermore, these annotations lack the correspondence between language descriptions and sub-units of a route. This in turn poses challenges in learning long-range vision-and-language navigation (VLN). To address the issue, this work proposes a new annotation method and uses it to create a new dataset, Talk2Nav.

Fig. 2: An illustrative route from a source node to a destination node with all landmarks labelled along the route. In our case, the destination is landmark 4. Sub-routes are defined as the routes between two consecutive landmarks.

Talk2Nav contains navigation routes at city level. A navigational city graph is created with nodes as locations in the city and connecting edges as the road segments between the nodes, which is similar to Mirowski et al. [59]. A route from a source node to a destination node is composed of densely sampled atomic unit nodes. Each node contains a) a street-view panoramic image, b) the GPS coordinates of that location, and c) the bearing angles to connecting edges (roads). Furthermore, we enrich the routes with intermediary visual landmarks, language descriptions for these visual landmarks, and the local directional instructions for the sub-routes connecting the landmarks. An illustration of a route is shown in Figure 2.

3.1 Data Collection

To retrieve city route data, we used OpenStreetMap to obtain the metadata information of locations and Google's APIs for maps and street-view images.

Path Generation. The path from a source to a destination is sampled from a city graph. To that aim, we used OpenStreetMap (OSM), which provides latitudes, longitudes and bearing angles of all the locations (waypoints) within a pre-defined region in the map. We use the Overpass API to extract all the locations from OSM, given a region. A city graph is defined by taking the locations of the atomic units as the nodes, and the directional connections between neighbourhood locations as the edges. The K-means clustering algorithm (k = 5) is applied to the GPS coordinates of all nodes. We then randomly pick two nodes from different clusters to ensure that the source node and the destination node are not too close. We then use the open-sourced OpenStreet Routing Machine to extract a route plan from every source node to the destination node, which are sampled from the set of all compiled nodes. The A* search algorithm is then used to generate a route by finding the shortest traversal path from the source to the destination in the city graph.

Street View Images. Once the routes are generated using OSM, we densely interpolate the latitudes and longitudes of the locations along the routes to extract more road locations. We then use the Google Street View APIs to get the closest Street View panorama and omit the locations which do not have a street-view image. We collect the 360° street-view images along with their heading angles by using the Google Street View APIs and the metadata (e.g. http://maps.google.com/cbk?output=xml&ll=40.735357,-73.918551&dm=1). The API allows for downloading tiles of a street-view panoramic image, which are then stitched together to get the equirectangular projection image. We use the heading angle to re-render the street-view panorama image such that the image is centre-aligned to the heading directions of the route.

We use OSM and open-sourced OSM-related APIs for the initial phase of extraction of road nodes and route plans. Later, however, we move to Google Street View for mining street-view images and their corresponding metadata. We noted that it was not straightforward to mine the locations/road nodes from Google Street View.
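To make the route-generation step concrete, the following is a minimal sketch of sampling a source/destination pair from different K-means clusters and then running A* over the city graph. It is illustrative only: the graph construction, the attribute names (latlon, length) and the use of networkx/scikit-learn are assumptions rather than the authors' actual tooling.

```python
# Illustrative route sampling for Section 3.1: cluster road nodes with K-means
# (k = 5), pick a source and a destination from different clusters, then run A*
# on the directed city graph. Node attributes and edge lengths are placeholders.
import math
import random

import networkx as nx
from sklearn.cluster import KMeans


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))


def sample_route(graph: nx.DiGraph, k: int = 5):
    nodes = list(graph.nodes)
    coords = [graph.nodes[n]["latlon"] for n in nodes]
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(coords)

    # Source and destination come from different clusters so they are not too close.
    c_src, c_dst = random.sample(range(k), 2)
    src = random.choice([n for n, c in zip(nodes, labels) if c == c_src])
    dst = random.choice([n for n, c in zip(nodes, labels) if c == c_dst])

    # A* shortest path; edge weights are road-segment lengths in meters.
    heuristic = lambda u, v: haversine_m(graph.nodes[u]["latlon"], graph.nodes[v]["latlon"])
    return nx.astar_path(graph, src, dst, heuristic=heuristic, weight="length")
```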
3.2 Directional Instruction Annotation

The main challenge of data annotation for automatic language-based wayfinding lies in the fact that the annotators need to play the role of an instructor as the local people do for tourists. This is especially challenging when the annotators do not know the environment well. The number of street-view images for a new environment is tremendous and searching through them can be costly, not to mention remembering and summarizing them into verbal directional instructions.

Inspired by the large body of work in cognitive science on how people mentally conceptualize route information and convey routes [48, 56, 67], our annotation method is designed to specifically promote memory or thinking of the annotators. For route directions, people usually refer to visual landmarks [39, 56, 65] along with a local directional instruction [67, 69]. Visualizing the route with highlighted salient landmarks and local directional transitions compensates for limited familiarity or understanding of the environment.

Fig. 3: The annotation interface used in Mechanical Turk. Box 1 represents the text box where the annotators write the descriptions. Box 2 denotes the perspective projected images of the left, front, right and rear view at the current location. Box 3 shows the Street View environment where the annotator can drag and rotate to get a 360° view. Box 4 represents the path to-be-annotated with the markers for the landmarks. The red line in Box 4 denotes the current sub-route. We can navigate forward and backward along the marked route (lines drawn in red and green), perceiving the Street View simultaneously on the left. Boxes are provided for describing the marked route for landmarks and directions between them. Zoom in for a better view.

3.2.1 Landmark Mining

The ultimate goal of this work is to make human-to-robot interaction more intuitive, thus the instructions need to be similar to those used in daily human communication. Simple methods of mining visual landmarks based on some CNN features may lead to images which are hard for humans to distinguish and describe. Hence, this may lead to low-quality or unnatural instructions.

We frame the landmark mining task as a summarization problem using sub-modular optimization to create summaries that take into account multiple objectives. In this work, three criteria are considered: 1) the selected images are encouraged to spread out along the route to support continuous localization and guidance; 2) images close to road intersections and the approaching side of the intersections are preferred for better guidance through the intersections; and 3) images which are easy to describe, remember and identify are preferred for effective communication.

Given the set of all images I along a path P, the problem is formulated as a subset selection problem that maximizes a linear combination of the three submodular objectives:

L = \arg\max_{L' \subseteq \wp(I)} \sum_{i=1}^{3} w_i f_i(L', P), \quad \text{s.t. } |L'| = l    (1)

where ℘(I) is the powerset of I, L' is a candidate solution of size l, w_i are non-negative weights, and f_i are the sub-modular objective functions. More specifically, f1 is the minimum travel distance between any two consecutive selected images along the route P; f2 = 1/(d + σ), with d the distance to the closest approaching intersection and σ set to 15 meters to avoid having an infinitely large value at intersection nodes; f3 is a learned ranking function which signals the easiness of describing and remembering the selected images. The weights w_i in Equation 1 are set empirically: w1 = 1, w2 = 1 and w3 = 3. l is set to 3 in this work, as we have the source and the destination nodes fixed and we choose three intermediate landmarks, as shown in Figure 2. Routes of this length with 3 intermediate landmarks are already long enough for current navigation algorithms. The function f3 is a learned ranking model, which is presented below.
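Equation 1 is a cardinality-constrained maximization of a weighted sum of submodular objectives, for which greedy selection is the standard heuristic. The sketch below is one such greedy solver; the callables f1 to f3 are placeholders standing in for the distance, intersection and ranking terms described above, not the paper's implementation.

```python
# Greedy subset selection for Eq. (1): repeatedly add the image with the largest
# marginal gain until l landmarks are chosen. Weights follow the text
# (w1 = w2 = 1, w3 = 3); f is a tuple of callables (f1, f2, f3), each taking
# (selected_subset, path) and returning a score.
def greedy_landmark_selection(images, path, f, weights=(1.0, 1.0, 3.0), l=3):
    def objective(subset):
        return sum(w * fi(subset, path) for w, fi in zip(weights, f))

    selected = []
    while len(selected) < l:
        best_img, best_gain = None, float("-inf")
        for img in images:
            if img in selected:
                continue
            gain = objective(selected + [img]) - objective(selected)
            if gain > best_gain:
                best_img, best_gain = img, gain
        selected.append(best_img)
    return selected
```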
Fig. 4: An illustration of our Talk2Nav dataset at multiple granularity levels: (a) the annotated region in NYC, (b) the created city graph, (c) a route with landmarks and sub-routes, and (d) the structure of a node.

Ranking model. In order to train the ranking model for images being 'visual landmarks' that are easier to describe and remember, we compile images from three cities: New York City (NYC), San Francisco (SFO) and London, covering different scenes such as high buildings, open fields, downtown areas, etc. We select 20,000 pairs of images from the compiled set. A pairwise comparison is performed over the 20,000 pairs to choose one image over the other as the preferred landmark. We crowd-sourced the annotation with the following criteria: a) Describability – how easy it is for the annotator to describe; b) Memorability – how easy it is for the agent (e.g. a traveler such as a tourist) to remember; and c) Recognizability – how easy it is for the agent to recognize. Again, our ultimate goal is to make human-to-robot interaction more intuitive, so these criteria are inspired by how humans select landmarks when formulating navigational descriptions in daily life.

We learn the ranking model with a Siamese network [17] by following [31]. The model takes a pair of images and scores the selected image higher than the other one. We use the Huber rank loss as the cost function. In the inference stage, the model (f3 in Equation 1) outputs the averaged score for all selected images, signalling their suitability as visual landmarks for the VLN task.

3.2.2 Annotation and Dataset Statistics

In the literature, a few datasets have been created for similar tasks [5, 16, 70, 80]. For instance, Anderson et al. [5] annotated the language description for the route by asking the user to navigate the entire path in an egocentric perspective. Incorporation of an overhead map of the navigated route as an aid for describing the route can be seen in [16, 70, 80].

In the initial stages of our annotation process, we followed the above works. We allowed the annotators to navigate the entire route and describe a single instruction for the complete navigational route. We observed that a) the descriptions are mostly about the ending part of the routes, indicating that the annotators forget the earlier stages of the routes; b) the annotators took a lot of time to annotate as they have to move back and forth multiple times; and c) the annotation errors are very high. These observations confirmed our conjecture that it is very challenging to create high-quality annotations for large-scale, long-range VLN tasks. For each annotation, the annotator needs to know the environment as well as the locals would in order to be accurate. This is time-consuming and error-prone. To address this issue, we simplify the annotation task.

In our route description annotation process, the mined visual landmarks are provided, along with an overhead topological map and the street-view interface. We crowd-source this annotation task on Amazon Mechanical Turk. In the interface, a pre-defined route on Google Maps is shown to the annotator. An example of the interface is shown in Figure 3. The green line denotes the complete route while the red line shows the current road segment. We use Move forward and Move backward buttons to navigate from the source node to the destination node. The annotator is instructed to watch the 360° street-view images on the left. Here, we have customized the Google Street View interface to allow the annotator to navigate along the street-view images simultaneously as they move forward/backward in the overhead map. The street view is aligned to the direction of the navigation such that forward is always aligned with the moving direction. To minimize the effort of understanding the street-view scenes, we also provide four perspective images for the left-view, the front-view, the right-view, and the rear-view.

The annotator can navigate through the complete route comprising m landmark nodes and m intermediate sub-routes and is asked to provide descriptions for all landmarks and descriptions for all sub-routes. In this work, we use m = 4, which means that we collect 4 landmark descriptions and 4 local directional descriptions for each route, as shown in Figure 2. We then append the landmark descriptions and the directional descriptions for the sub-routes one after the other to yield the complete route description. The whole route description is generated in a quite formulaic way, which may lead to not very fluent descriptions. The annotation method offers an efficient and reliable solution at a modest price of language fluency.
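As a concrete illustration of the pairwise ranking model from Section 3.2.1 (the function f3 in Equation 1), the following PyTorch sketch scores a pair of images with a shared backbone and applies a Huber-smoothed hinge on the score difference. The backbone choice and this particular reading of the "Huber rank loss" are assumptions, not the released model.

```python
# A minimal Siamese ranking sketch: a shared CNN produces a scalar suitability
# score per image, and the preferred ("landmark-like") image is pushed to score
# higher by a Huber-smoothed hinge on the score margin.
import torch
import torch.nn as nn
import torchvision.models as models


class LandmarkRanker(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # scalar score head
        self.scorer = backbone

    def forward(self, img_a, img_b):
        return self.scorer(img_a).squeeze(1), self.scorer(img_b).squeeze(1)


def huber_rank_loss(score_pref, score_other, margin=1.0):
    # Hinge on the score difference, smoothed with a Huber (smooth-L1) penalty.
    violation = torch.clamp(margin - (score_pref - score_other), min=0.0)
    return nn.functional.smooth_l1_loss(violation, torch.zeros_like(violation))


# Usage: s_pref, s_other = model(img_pref, img_other); loss = huber_rank_loss(s_pref, s_other)
```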
Table 1: A comparison of our Talk2Nav dataset with the R2R dataset [5] and the TouchDown dataset [16]. Avg stands for Average, desc for description, and (w) for (words).

Criteria | R2R [5] | TouchDown [16] | Talk2Nav
# of navigational routes | 7,189 | 9,326 | 10,714
# of panoramic views | 10,800 | 29,641 | 21,233
# of panoramic views per path | 6.0 | 35.2 | 40.0
# of navigational descriptions | 21,567 | 9,326 | 10,714
# of navigational desc per path | 3 | 1 | 1
Length of navigational desc (w) | 29 | 89.6 | 68.8
# of landmark desc | - | - | 34,930
Length of landmark desc (w) | - | - | 8
# of local directional desc | - | - | 27,944
Length of local directional desc (w) | - | - | 7.2
Vocabulary size | 3,156 | 4,999 | 5,240
Sub-unit correspondence | No | No | Yes

Fig. 5: The length distribution of (a) landmark instructions, (b) local directional instructions, and (c) complete navigational instructions.

Statistics. We gathered 43,630 locations, which include GPS coordinates (latitudes and longitudes), bearing angles, etc., in New York City (NYC), covering an area of 10 km × 10 km as shown in Figure 4. Out of all those locations, we managed to compile 21,233 street-view images (outdoor) – each for one road node. We constructed a city graph with 43,630 nodes and 80,000 edges. The average # of navigable directions per node is 3.4.

We annotated 10,714 navigational routes. These route descriptions are composed of 34,930 node descriptions and 27,944 local directional descriptions. Each navigational instruction comprises 5 landmark descriptions (we use only 4 in our work) and 4 directional instructions. These 5 landmark descriptions include the description of the starting road node, the destination node and the three intermediate landmarks. Since the agent starts the VLN task from the starting node, we use only 4 landmark descriptions along with 4 directional instructions. The starting point description can be used for automatic localization, which is left for future work. The average lengths of the navigational instructions, the landmark descriptions and the local directional instructions of the sub-routes are 68.8 words, 8 words and 7.2 words, respectively. In total, our dataset Talk2Nav contains 5,240 unique words. Figure 5 shows the distribution of the length (number of words) of the landmark descriptions, the local directional instructions and the complete navigational instructions.

In Table 1, we show a detailed comparison with the R2R dataset [5] and the TouchDown dataset [16] under different criteria. While the three datasets share similarities in multiple aspects, the R2R and TouchDown datasets only annotate one overall language description for the entire route. It is costly to scale that annotation method to longer routes and to a large number of unknown environments. Furthermore, the annotations in these two datasets lack the correspondence between language descriptions and sub-units of a route. Our dataset offers it without increasing the annotation effort. The detailed correspondence of the sub-units facilitates the learning task and enables an interesting cross-difficulty evaluation as presented in Section 5.2.

The average time taken for the landmark ranking task is 5 seconds and for the route description task is around 3 minutes. In terms of payment, we paid $0.01 and $0.50 for the landmark ranking task and the route description task, respectively. Hence, the hourly rate of payment for annotation is $10. We have a qualification test; only the annotators who passed the qualification test were employed for the real annotation. In total, 150 qualified workers were employed. During the whole course of the annotation process, we regularly sample a subset of the annotations from each worker and check their annotated route descriptions manually. If the annotation is of low quality, we reject the particular jobs and give feedback to the workers for improvement.

We restrict the natural language to English in our work. However, there are still biases in the collected annotations, as is generally studied in the works of Bender et al. [11, 12]. The language variety in our route descriptions comprises social dialects from the United States, India, Canada, the United Kingdom, Australia and others, showing diverse annotator demographics. Annotators are instructed to follow guidelines for descriptions such as: i) Do not describe based on street names or road names; ii) Try to use non-stationary objects mainly as visual landmarks. We also later apply certain curation techniques to the provided descriptions, such as a) removing all non-unicode characters, and b) correcting misspelled words using the open-sourced LanguageTool [1].
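A small sketch of the curation step just described (character cleanup plus automatic spelling correction) is given below. The language_tool_python wrapper is used here only as a stand-in for the LanguageTool setup referenced in the text; the exact tooling and options are assumptions.

```python
# Illustrative description curation: strip non-ASCII characters, normalize
# whitespace, and run an automatic spelling/grammar correction pass.
import re

import language_tool_python

_tool = language_tool_python.LanguageTool("en-US")


def curate_description(text: str) -> str:
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop odd characters
    text = re.sub(r"\s+", " ", text).strip()                      # normalize whitespace
    return _tool.correct(text)                                    # fix misspelled words
```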
4 Approach

Our system is a single agent traveling in an environment represented as a directed connected graph. We create a street-view environment on top of the compiled city graph of our Talk2Nav dataset. The city graph consists of 21,233 nodes/locations with the corresponding street-view images (outdoor) and 80,000 edges representing the travelable road segments between the nodes. Nodes are points on the road belonging to the annotated region of NYC. Each node contains a) a 360° street-view panoramic image, b) the GPS coordinates of the corresponding location (defined by latitude and longitude), and c) the bearing angles of this node in the road network. A valid route is a sequence of connected edges from a source node to a destination node. Please see Figure 4 for a visualization of these concepts.

During the training and testing stages, we use a simulator to navigate the agent in the street-view environment in the city graph. Based on the predicted action at each node, the agent moves to the next node in the environment. A successful navigation is defined as the agent correctly following the route referred to by the natural language instruction. For the sake of tractability, we assume no uncertainty (such as dynamic changes in the environment or noise in the actuators) in the environment, hence it is a deterministic goal setting.

Our underlying simulator is the same as the ones used by previous methods [5, 16, 36]. It, however, has a few different features. The simulator of [5] is compiled for indoor navigation. The simulators of [16, 36] are developed for outdoor (city) navigation but they differ in the region used for annotation. The action spaces are also different. The simulators of [5, 16, 36] use an action space consisting of left, right, up, down, forward and stop actions. We employ an action space of nine actions: eight moving actions in the eight equi-spaced angular directions between 0° and 360°, and the stop action. For each moving action, we select the road which has the minimum angular distance to the predicted moving direction. In the following sections, we present the method in detail.

4.1 Route Finding Task

The task defines the goal of an embodied agent to navigate from a source node to a destination node based on a navigational instruction. Specifically, given a natural language instruction X = {x1, x2, ..., xn}, the agent needs to perform a sequence of actions {a1, a2, ..., am} from the action space to hop over nodes in the environment space to reach the destination node. Here xi represents an individual word in an instruction while ai denotes an element from the set of nine actions (detailed in Section 4.1.5). When the agent takes an action, it interacts with the environment and receives a new visual observation. The agent performs actions sequentially until it reaches the destination node successfully or it fails the task because it exceeds the maximum episode length.

The agent learns to predict an action at each state to navigate in the environment. Learning long-range vision-and-language navigation (VLN) requires an accurate sequential matching between the navigation instruction and the route. As argued in the introduction, we observe that navigation instructions consist of two major classes: landmark descriptions and local directional instructions between consecutive landmarks. In this work, given a language instruction, our method segments it into a sequence of two interleaving classes: landmark descriptions and local directional instructions. Please see Figure 1 for an example of a segmented navigational instruction. As it moves, the agent learns an associated reference position in the language instruction to obtain a softly-attended local directional instruction and landmark description. When one landmark is reached, the agent updates its attention and moves towards the new goal, i.e. the next landmark. Two matching modules are used to score each state by matching: 1) the traversed path in memory and the local, softly-attended directional instruction, and 2) the visual scene the agent observes at the current node and the local, softly-attended landmark description. An explicit external memory is used to store the traversed path from the latest visited landmark to the current position.

Each time a visual observation is successfully matched to a landmark, a learned controller increments the reference position of the soft attention map over the language instruction to update both the landmark and directional descriptions. This process continues until the agent reaches the destination node or the episodic length of the navigation is exceeded. A schematic diagram of the method is shown in Figure 6. Below we detail all the models used by our method.

4.1.1 Language

We segment the given instruction into two classes: a) visual landmark descriptions and b) local directional instructions between the landmarks. We employ the BERT transformer model [21] to classify a given language instruction X into a sequence of token classes {c_i}_{i=1}^{n} where c_i ∈ {0, 1}, with 0 denoting the landmark descriptions and 1 the local directional instructions. By grouping consecutive tokens of the same class, the whole instruction is segmented into interleaving segments. An example of the segmentation is shown in Figure 1 and multiple more in Figure 8. Those segments are used as the basic units of our attention scheme rather than the individual words used by previous methods [5].
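The segmentation step can be sketched with an off-the-shelf BERT token classifier that labels every word piece as landmark (0) or directional (1) and then groups consecutive tokens of the same class. The HuggingFace checkpoint name below is a placeholder; the paper's own fine-tuned classifier is described in Section 4.3.

```python
# Illustrative instruction segmentation for Section 4.1.1
# (label 0 = landmark description, label 1 = local directional instruction).
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)


@torch.no_grad()
def segment_instruction(text: str):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    labels = model(**enc).logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())

    # Group consecutive (non-special) tokens that share the same predicted class.
    segments, current_label, current_tokens = [], None, []
    for tok, lab in zip(tokens, labels):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if lab != current_label and current_tokens:
            segments.append((current_label, tokenizer.convert_tokens_to_string(current_tokens)))
            current_tokens = []
        current_label, current_tokens = lab, current_tokens + [tok]
    if current_tokens:
        segments.append((current_label, tokenizer.convert_tokens_to_string(current_tokens)))
    return segments  # e.g. [(0, "we have a thai restaurant on our left"), (1, "proceed forward 4 more blocks"), ...]
```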
This segmentation converts the language description into a more structured representation, aiming to facilitate the language-vision matching problem and the language-trajectory matching problem. Denote by T(X) the sequence of segments of landmark and local directional instructions, then

T(X) = (L^1, D^1), (L^2, D^2), \ldots, (L^J, D^J)    (2)

where L^j denotes the feature representation of segment j of the landmark descriptions, D^j denotes the feature representation of segment j of the directional instructions, and J is the total number of segments in the route description.

Fig. 6: The illustrative diagram of our model. We have a soft attention mechanism over segmented language instructions, which is controlled by the Indicator Controller. The segmented landmark description and local directional instruction are matched with the visual image and the memory image respectively by two matching modules. The action module fetches features from the visual observation, the memory image, the features of the language segments, and the two matching scores to predict an action at each step. The agent then moves to the next node, updates the visual observation and the memory image, continues the movement, and so on until it reaches the destination.

As it moves in the environment, the agent is associated with a reference position in the language description in order to put the focus on the most relevant landmark descriptions and the most relevant local directional instructions. This is modeled by a differentiable soft attention map. Let us denote by η_t the reference position for time step t. The relevant landmark description at time step t is extracted as

\bar{L}_{\eta_t} = \sum_{j=1}^{J} L^j e^{-|\eta_t - j|}    (3)

and the relevant directional instruction as

\bar{D}_{\eta_t} = \sum_{j=1}^{J} D^j e^{-|\eta_t - j|},    (4)

where η_{t+1} = η_t + φ_t(.), η_0 = 1, and φ_t(.) is an Indicator Controller learned to output 1 when a landmark is reached and 0 otherwise. The controller is shown in Figure 6 and defined in Section 4.1.4. After a landmark is reached, η_t is incremented by 1 and the attention map then centers around the next pair of landmark and directional instructions. The agent stops the navigation task successfully when φ_t(.) outputs 1 and there is no more pair of landmark and directional instructions left to continue. We initialize η_0 as 1 to position the language attention around the first pair of landmark and directional instruction.

4.1.2 Visual Observation

The agent perceives the environment with an equipped 360° camera which obtains the visual observation I_t of the environment at time step t. From the image, a feature ψ1(I_t) is extracted and passed to the matching module, as shown in Figure 6, which estimates the similarity (matching score) between the visual observation I_t and the softly attended landmark description \bar{L}_{\eta_t}, which is modulated by our attention module. The visual feature is also passed to the action module to predict the action for the next step, as shown in Section 4.1.5.
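For concreteness, the soft dual attention of Eqs. (3)-(4) amounts to weighting every segment by exp(-|η_t - j|) and summing; a direct PyTorch transcription is given below (tensor shapes are illustrative).

```python
# Direct transcription of Eqs. (3)-(4): the attention weight for segment j is
# exp(-|eta_t - j|), applied to both the landmark and the directional features.
import torch


def dual_soft_attention(L, D, eta_t):
    """L, D: (J, d) segment features; eta_t: scalar reference position (1-indexed)."""
    J = L.shape[0]
    j = torch.arange(1, J + 1, dtype=L.dtype, device=L.device)
    w = torch.exp(-(eta_t - j).abs())          # (J,) soft attention map centred at eta_t
    L_bar = (w.unsqueeze(1) * L).sum(dim=0)    # attended landmark description, Eq. (3)
    D_bar = (w.unsqueeze(1) * D).sum(dim=0)    # attended directional instruction, Eq. (4)
    return L_bar, D_bar
```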
Fig. 7: Exemplar memory images. The blue square denotes the source node, and the blue disk denotes the current location of the agent. The red lines indicate the traversed paths. Each memory image is associated with its map scale (metres per pixel), which is shown in the text box below it.

4.1.3 Spatial Memory

Inspired by [25, 29], an external memory M_t explicitly memorizes the agent's traversed path from the latest visited landmark. When the agent reaches a landmark, the memory M_t is reinitialized to memorize the traversed path from the latest visited landmark. This reinitialization can be understood as a type of attention to focus on the recently traversed path in order to better localize and to better match against the relevant directional instructions \bar{D}_{\eta_t} modeled by the learned language attention module defined in Section 4.1.1.

As the agent navigates in the environment, we have a write module to write the traversed path to the memory image. For a real system navigating in the physical world, the traversed path can be obtained by an odometry system. In this work, we compute the traversed path directly from the road network in our simulated environment. Our write module traces the path travelled from the latest visited landmark to the current position. The path is rasterized and written into the memory image. In the image, the path is indicated by red lines, the starting point by a blue square marker, and the current location by a blue disk. The write module always writes from the centre of the memory image to make sure that there is room for all directions. Whenever the coordinates of a newly rasterized pixel are beyond the image dimensions, the module incrementally increases the scale of the memory image until the new pixel is inside the image and has a distance of 10 pixels to the boundary. An image of 200 × 200 pixels is used and the initial scale of the map is set to 5 meters per pixel. Please find examples of the memory images in Figure 7.

Each memory image is associated with the value of its scale (meters per pixel). Deep features ψ2(M_t) are extracted from the memory image M_t, which are then concatenated with its scale value to form the representation of the memory. The concatenated features are passed to the matching module. The matching module verifies the semantic similarity between the traversed path and the provided local directional instruction. The concatenated features are also provided to the action module along with the local directional instruction features to predict the action for the next step.

4.1.4 Matching Module

Our matching module is used to determine whether the aimed landmark is reached. As shown in Figure 6, the matching score is determined by two complementary matching modules: 1) between the visual scene ψ1(I_t) and the extracted landmark description \bar{L}_{\eta_t}, and 2) between the spatial memory ψ2(M_t) and the extracted directional instruction \bar{D}_{\eta_t}. For both cases, we use a generative image captioning model based on the transformer model proposed in [68] and compute the probability of reconstructing the language description given the image. Scores are the averaged generative probability over all words in the instruction. Let s_t^1 be the score for the pair (ψ1(I_t), \bar{L}_{\eta_t}) and s_t^2 be the score for the pair (ψ2(M_t), \bar{D}_{\eta_t}). We then compute the score feature s_t by concatenating the two: s_t = (s_t^1, s_t^2). The score feature s_t is fed into a controller φ(.) to decide whether the aimed landmark is reached:

\phi_t(s_t, h_{t-1}) \in \{0, 1\},    (5)

where 1 indicates that the aimed landmark is reached and 0 otherwise. φ_t(.) is an Adaptive Computation Time (ACT) LSTM [27] which allows the controller to learn to make decisions at variable time steps. h_{t-1} is the hidden state of the controller. In this work, φ_t(.) learns to identify the landmarks with a variable number of intermediate navigation steps.

4.1.5 Action Module

The action module takes the following inputs to decide the moving action: a) the pair (ψ1(I_t), \bar{L}_{\eta_t}), b) the pair (ψ2(M_t), \bar{D}_{\eta_t}), and c) the matching score feature s_t. The illustrative diagram is shown in Figure 6.

The inputs in group a) are used to predict a_t^e – a probability vector of moving actions over the action space. The inputs in group b) are used to predict a_t^m – a second probability vector of moving actions over the action space. The two probability vectors are adaptively averaged together, with weights learned from the score feature s_t. Specifically, s_t is fed to a fully-connected (FC) network to output the weights w_t ∈ R². For both a) and b), we use an encoder LSTM as used in [5] to encode the language segments. We then concatenate the encoder's hidden states with the image encodings (i.e. ψ1(I_t) and ψ2(M_t)) and pass them through an FC network to predict the probability distributions a_t^e and a_t^m. By adaptively fusing them, we get the final action prediction a_t:

a_t = \frac{1}{\sum_i w_t^i} \left( w_t^0 \cdot a_t^e + w_t^1 \cdot a_t^m \right).    (6)
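A minimal sketch of the adaptive fusion in Eq. (6) is given below: a small FC head maps the score feature s_t to two non-negative weights which blend the two action distributions. The softplus head is an assumption about a detail the text leaves unspecified.

```python
# Illustrative adaptive action fusion for Eq. (6).
import torch
import torch.nn as nn


class AdaptiveActionFusion(nn.Module):
    def __init__(self, score_dim: int = 2):
        super().__init__()
        # Maps the matching-score feature s_t to two non-negative fusion weights.
        self.weight_net = nn.Sequential(nn.Linear(score_dim, 2), nn.Softplus())

    def forward(self, a_e, a_m, s_t):
        """a_e, a_m: (num_actions,) action distributions; s_t: (score_dim,) score feature."""
        w = self.weight_net(s_t)               # w_t in R^2, learned from s_t
        fused = w[0] * a_e + w[1] * a_m
        return fused / w.sum()                 # normalized weighted average, Eq. (6)
```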
The action a_t at time t is thus defined as the weighted average of a_t^e and a_t^m. For different parts of the trajectory, one of the two predictions (a_t^e or a_t^m) or both of them are reliable. This is heavily dependent on the situation. For instance, when the next landmark is not visible, the prediction should rely more on a_t^m; when the landmark is clearly recognizable, the opposite holds. The learned matching scores decide adaptively at each time step which prediction is to be trusted and by how much. This adaptive fusion can be understood as a calibration system for the two complementary sub-systems for action prediction. The calibration method needs to be time- or situation-dependent. A simpler summation/normalization of the two predictions is rigid, and cannot distinguish between a confident true prediction and a confident false prediction. Confident false predictions are very common in deeply learned models.

Fig. 8: Examples of given instructions from the Talk2Nav dataset. Different colours denote the word token segmentation. Red denotes landmark descriptions and blue the local directional instructions.

The action space A is defined as follows. We divide the action space A into eight moving actions in eight directions and the stop action. Each direction is centred at {i·45° : i ∈ [0, ..., 7]} with a ±22.5° offset, as illustrated in Figure 6. When an action angle is predicted, the agent turns to the road which has the minimum angular distance to the predicted angle and moves forward to the next node along that road. The turning and moving-forward define an atomic action.

It is worth noting that the implementation of our stop action is different from previous VLN methods [5, 16, 36]. As pointed out in the deep learning book [24, p. 384], the stop action can be achieved in various ways. One can add a special symbol corresponding to the end of a sequence, i.e. the STOP used by previous VLN methods. When the STOP is generated, the sampling process stops. Another option is to introduce an extra Bernoulli output to the model that represents the decision to either continue generation or halt generation at each time step. In our work, we adopt the second approach. The agent uses the Indicator Controller (IC) to learn to determine whether it reaches landmarks. The IC outputs 1 when the agent reaches a landmark and updates the language attention for finding the next landmark until the destination is reached. The destination is the final landmark described in the instruction. The agent stops when the IC predicts 1 and there is no more language description left to continue. Thus, our model has a stop mechanism, but in a rather implicit manner.

4.2 Learning

The model is trained in a supervised way. We followed the student-forcing approach proposed in [5] to train our model. At each step, the action module is trained with a supervisory signal of the action in the direction of the next landmark. This is in contrast with previous methods [5], in which it is the direction to the final destination.

We use the cross-entropy loss to train the action module and the matching module, as they are formulated as classification tasks. For the ACT model, we use the weighted binary cross-entropy loss at every step. The supervision of the positive label ('1') for the ACT model comes into effect only if the agent reaches the landmark. The total loss is the sum of all module losses:

\text{Loss} = \text{Loss}_{\text{action}} + \text{Loss}_{\text{landmark matching}} + \text{Loss}_{\text{direction matching}} + \text{Loss}_{\text{ACT}}.    (7)

The losses of the two matching modules take effect only at the places of landmarks, which are much sparser than the road nodes where the action loss and the ACT loss are computed. Because of this, we first train the matching networks individually for the matching tasks, and then integrate them with other components for the overall training.
12 Arun Balajee Vasudevan et al. word2vec [57] and GloVe [62]. We use a word-piece tok- Network details. The SphereNet model consists of five blocks enizer to tokenize the sentence, by following [21, 77]. Out of convolution and max-pooling layers, followed by a fully- of vocabulary words are split into sub-words based on the connected layer. We use 32, 64, 128, 256, 128 filters in the available vocabulary words. For word token classification, 1 to 5 convolutional layers and each layer is followed by we first train BERT transformer with a token classification a Max pooling and ReLU activation. This is the backbone head. Here, we use the alignment between the given lan- structure of the SphereNet. For ResNet, we use ResNet-101 guage instruction X and their corresponding set of landmark as the backbone. Later, we have two branches: the fully con- and directional instruction segments in the training set of the nected layer has 8 neurons using a softmax activation func- Talk2Nav dataset. We then train the transformer model to tion in one branch of the network for the action module and classify each word token in the navigational instruction to 512 neurons with ReLU activation in the other branch for be a landmark description or a local directional instruction. the match module. Hence, input of image feature size from At the inference stage, the model predicts a binary la- these models are of 512 and attention feature size over the bel for each word token. We later convert the sequence of image is 512. The convolutional filter kernels are of size 5x5 word token classes into segments T (X) by simply group- and are applied with stride 1. Max pooling is performed with ing adjacent tokens which have the same class. We note that kernels of size 3x3 and a stride of 2. For language, we use this model has a classification accuracy of 91.4% on the test an input encoding size of 512 for each token in the vocabu- set. We have shown few word token segmentation results in lary. We use Adam optimizer with learning rate of 0.01 and Figure 8. alpha and beta for Adam as 0.999 and 10−8 . We trained for 20 epochs with a mini-batch size of 16. Visual inputs. The Google street-view images are acquired in the form of equirectangular projection. We experimented 5 Experiments with SphereNet [18] architecture pretrained on the MNIST [51] dataset and ResNet-101 [32] pretrained on ImageNet [20] Our experiments focus on 1) the overall performance of our to extract ψ1 (I). Since the SphereNet pre-trained on MNIST method when compared to the state-of-the-art (s-o-t-a) navi- is a relatively shallow network, we adapted the architecture gation algorithms, and 2) multiple ablation studies to further of SphereNet to suit Street View images by adding more understand our method. The ablation studies cover a) the convolutional blocks. We then define a pretext task using the importance of using explicit external memory, b) a perfor- Street View images from the Talk2Nav dataset to learn the mance evaluation of navigation at different levels of diffi- weights for SphereNet. Given two street-view images with culty, c) a comparison of visual features, d) a comparison of an overlap of their visible view, the task is to predict the dif- the performance with and without Indicator Controller (IC), ference between their bearing angles and the projection of and across variants of matching modules. We train and eval- line joining the locations on the bearing angle of second lo- uate our model on the Talk2Nav dataset. We use 70% of the cation. 
We frame the problem as a regression task of predict- dataset for training, 10% for validation and the rest for test- ing the two angles. This encourages SphereNet to learn the ing. By following [3], we evaluate our method under three semantics in the scene. We compiled the training set for this metrics: pre-training task from our own training split of the dataset. In the case of memory image, we use ResNet-101 pretrained – SPL: the success rate weighted by normalized inverse on ImageNet to extract ψ2 (M ) from the memory image M . path length. It penalizes the successes made with longer paths. – Navigation Error: the distance to the goal after finishing Other modules. We perform image captioning in the match- the episode. ing modules using the transformer model, by following [52, – Total Steps: the total number of steps required to reach 81,68]. We pretrain the transformer model for captioning in the goal successfully. the matching module using the landmark street-view images and their corresponding descriptions from the training split We compare with the following s-o-t-a navigation meth- of Talk2Nav for matching of the landmarks. For the other ods: Student Forcing [5], RPA [73], Speaker-Follower [23] matching module of local directions, we pretrain the trans- and Self-Monitoring [53]. We trained their models on the former model using the ground-truth memory images and same data that our model uses. The work [72] yields very their corresponding directional instructions. We synthesized good results on the Room-to-Room dataset for indoor navi- the ground truth memory image in the same way as our write gation. We, however, have not found publicly available code module writes to the memory image (as mentioned in Sec- for the method and thus can only compare with other top- tion 4.1.3). We finetune both the matching modules in the performing methods on our dataset. training stage. All the other models such as Indicator Con- We also study the performance of our complete model troller, Action module are trained in an end-to-end fashion. and other s-o-t-a methods at varying difficulty levels: from