Multi-Session Visual SLAM for Illumination Invariant Localization in
Indoor Environments
Mathieu Labbé and François Michaud
*This work is partly supported by the Natural Sciences and Engineering Research Council of Canada.
Mathieu Labbé and François Michaud are with the Interdisciplinary Institute of Technological Innovation (3IT), Department of Electrical Engineering and Computer Engineering, Université de Sherbrooke, Sherbrooke, Québec, Canada [Mathieu.M.Labbe | Francois.Michaud]@USherbrooke.ca

Abstract— For robots navigating using only a camera, illumination changes in indoor environments can cause localization failures during autonomous navigation. In this paper, we present a multi-session visual SLAM approach to create a map made of multiple variations of the same locations under different illumination conditions. The multi-session map can then be used at any hour of the day for improved localization capability. The approach presented is independent of the visual features used, and this is demonstrated by comparing localization performance between multi-session maps created using the RTAB-Map library with SURF, SIFT, BRIEF, FREAK, BRISK, KAZE, DAISY and SuperPoint visual features. The approach is tested on six mapping and six localization sessions recorded at 30-minute intervals during sunset using a Google Tango phone in a real apartment.
I. INTRODUCTION

Visual SLAM (Simultaneous Localization and Mapping) frameworks using hand-crafted visual features work relatively well in static environments, as long as there are enough discriminating visual features with moderate lighting variations. To be illumination invariant, a trivial solution could be to switch from vision to LiDAR (Light Detection and Ranging) sensors, but they are often too expensive for some applications in comparison to cameras. Illumination-invariant localization using a conventional camera is not trivial, as visual features taken during the day under natural light conditions may look quite different from those extracted during the night under artificial light conditions. In our previous work on visual loop closure detection [1], we observed that when traversing the same area multiple times while atmospheric conditions change periodically (like day-night cycles), loop closures are more likely to be detected with locations of past mapping sessions that have similar illumination levels or atmospheric conditions. Based on that observation, in this paper we present a multi-session approach to derive illumination-invariant maps using a full visual SLAM approach (not only loop closure detection). Mapping the same area multiple times to improve localization in dynamic environments has been tested more often outdoors with cars on road-constrained trajectories [2], rather than indoors on unconstrained trajectories where the full six Degrees of Freedom (DoF) transformations must also be recovered. The main questions this paper addresses are:

• Which visual feature is the most robust to illumination variations?
• How many sessions are required for robust localization through day and night, based on the visual feature chosen?
• To avoid having a human teleoperate a robot at different times of the day to create the multi-session map, would it be possible to acquire the consecutive maps simply by having localization be done using the map acquired from the previous session?

The paper is organized as follows. Section II presents works similar to our multi-session map approach, which is described in Section III. Section IV presents comparative results between the visual features used and the number of sessions required to expect the best localization performance. Section V concludes with limitations and possible improvements of the approach.
II. RELATED WORK

The most similar approach to the one presented in this paper exploits what is called an Experience Map [2]. An experience refers to an observation of a location at a particular time. A location can have multiple experiences describing it. New experiences of the same location are added to the experience map if localization fails during each traversal of the same environment. Localization of the current frame is done against all experiences in parallel, thus requiring multi-core CPUs to do it in real time as more and more experiences are added to a location. To avoid using multiple experiences of the same locations, SeqSLAM [3], [4] matches sequences of visual frames instead of trying to robustly localize each individual frame against the map. The approach assumes that the robot takes relatively the same route (with the same viewpoints) at the same velocity, thus seeing the same sequence of images across time. This is a fair assumption for cars (or trains), as they are constrained to follow a lane at regular velocity. However, for indoor robots having to deal with obstacles, the path can change over time, thus not always replicating the same sequences of visual frames. In Cooc-Map [5], local features taken at different times of the day are quantized in both the feature and image spaces, and discriminating statistics can be generated on the co-occurrences of features. This produces a more compact map instead of having multiple images representing the same location, while still having local features to recover the full motion transformation.
A similar approach is done in [6], where a fine vocabulary method is used to cluster descriptors by tracking the corresponding 3D landmark in 2D images across multiple sequences under different illumination conditions. For feature matching with this learned vocabulary, instead of using a standard descriptor distance approach (e.g., Euclidean distance), a probability distance is evaluated to improve feature matching. Other approaches rely on pre-processing the input images to make them illumination-invariant before feature extraction, by removing shadows [7], [8] or by trying to predict them [9]. This improves feature matching robustness in strong and changing shadows.

At the local visual feature level, common hand-crafted features like SIFT [10] and SURF [11] have been compared outdoors across multiple seasons and illumination conditions [12], [13]. To overcome limitations caused by the illumination variance of hand-crafted features, machine learning approaches have also been used to extract descriptors that are more illumination-invariant. In [14], a neural network has been trained by tracking interest points in time-lapse videos, so that it outputs similar descriptors for the same tracked points independently of illumination. However, only descriptors are learned, and the approach still relies on hand-crafted feature detectors. More recently, SuperPoint [15] introduced an end-to-end local feature detection and descriptor extraction approach based on a neural network. The illumination invariance comes from carefully making a training dataset with images showing the same visual features under large illumination variations.
III. MULTI-SESSION SLAM FOR ILLUMINATION INVARIANT LOCALIZATION

Compared to an experience map, our approach also appends new locations to the map during each traversal, but it estimates the motion transformation only against a maximum number of the best candidates based on their likelihood. Our approach is also independent of the visual features used, which can be hand-crafted or neural network based, thus making it easier to compare them in full SLAM conditions using the RTAB-Map library, for instance. RTAB-Map [16] is a Graph-SLAM [17] approach that can be used with camera(s) and/or with a LiDAR. This paper focuses on the first case, where only a camera is available for localization. The structure of the map is a pose-graph with nodes representing each image acquired at a fixed rate and links representing the 6DoF transformations between them. Figure 1 presents an example of a multi-session map created from three sessions taken at three different hours. Two additional localization sessions are also shown, one during the day (16:00) and one at night (01:00). The dotted links represent the nodes in the graph on which the corresponding localization frame has been localized. The goal is to have new frames localize on nodes taken at a similar time; if the localization time falls between two mapping times, localizations could jump between two or more sessions inside the multi-session map.

Fig. 1. Structure of the multi-session map. Three sessions taken at different times have been merged together by finding loop closures between them (yellow links). Each map has its own coordinate frame. Localization Session A (16:00) is localized in relation to both day sessions, while Localization Session B (01:00) is localized only in relation to the night session (00:00).

Inside the multi-session map, each individual map must be transformed into the same global coordinate frame so that, when the robot localizes on a node of a different session, it does not jump between different coordinate frames. To do so, the GTSAM [18] graph optimization approach is used to implicitly transform each session into the coordinate frame of the oldest map, as long as there is at least one loop closure found between the sessions.
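As an illustration of this alignment step, here is a minimal sketch using the GTSAM Python bindings (API names can differ between versions), assuming two hypothetical sessions linked by a single inter-session loop closure; the node ids, relative poses and noise values are purely illustrative and are not taken from RTAB-Map.

```python
import numpy as np
import gtsam

# Hypothetical example: session 1 uses node ids 0..2, session 2 uses 100..102.
graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

# Noise sigmas are ordered rotation (rad) then translation (m); values are illustrative.
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([1e-6] * 6))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.01, 0.01, 0.01, 0.05, 0.05, 0.05]))
loop_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.02, 0.02, 0.02, 0.10, 0.10, 0.10]))

# Anchor the first node of the oldest session: its frame becomes the global frame.
graph.add(gtsam.PriorFactorPose3(0, gtsam.Pose3(), prior_noise))

# Odometry links inside each session (1 m forward steps, for illustration only).
step = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(1.0, 0.0, 0.0))
for i, j in [(0, 1), (1, 2), (100, 101), (101, 102)]:
    graph.add(gtsam.BetweenFactorPose3(i, j, step, odom_noise))

# A single inter-session loop closure is enough to express session 2
# in the coordinate frame of session 1 after optimization.
closure = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(0.2, 0.0, 0.0))
graph.add(gtsam.BetweenFactorPose3(2, 100, closure, loop_noise))

# Initial guesses: session 2 starts in its own arbitrary frame (offset on purpose).
for key, x in [(0, 0.0), (1, 1.0), (2, 2.0), (100, 10.0), (101, 11.0), (102, 12.0)]:
    initial.insert(key, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(x, 0.0, 0.0)))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose3(100))  # now expressed in the global frame of node 0
```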
RTAB-Map uses two methods to find loop closures: a global one called loop closure detection, and a local one called proximity detection. For multi-session maps, loop closure detection works exactly the same as with single-session maps. The bag-of-words (BOW) approach [1] is used with a Bayesian filter to evaluate loop closure hypotheses over all previous images from all sessions. When a loop closure hypothesis reaches a pre-defined threshold, a loop closure is detected. Visual features can be any of the ones implemented in OpenCV [19], which are SURF (SU) [11], SIFT (SI) [10], BRIEF (BF) [20], BRISK (BK) [21], KAZE (KA) [22], FREAK (FR) [23] and DAISY (DY) [24]. Specifically for this paper, the neural network based feature SuperPoint (SP) [15] has also been integrated for comparison. The BOW vocabulary is incremental, based on FLANN's KD-Trees [25]. Quantization of features to visual words is done using the Nearest Neighbor Distance Ratio (NNDR) approach [10].
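The NNDR test used for quantization can be illustrated with a direct frame-to-frame matching example in OpenCV: a match is kept only when the nearest neighbour is clearly closer than the second nearest. SIFT and the 0.8 ratio are used here only as plausible defaults (the ratio follows [10]); they are not necessarily RTAB-Map's parameters.

```python
import cv2

def nndr_matches(img1, img2, ratio=0.8):
    """Match local features between two frames with the NNDR (ratio) test."""
    detector = cv2.SIFT_create()  # any OpenCV feature with descriptors would do
    kp1, desc1 = detector.detectAndCompute(img1, None)
    kp2, desc2 = detector.detectAndCompute(img2, None)

    # For each descriptor of frame 1, find its two nearest neighbours in frame 2.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc1, desc2, k=2)

    # Keep a match only if the best neighbour is clearly better than the second:
    # this is the Nearest Neighbor Distance Ratio criterion.
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    return kp1, kp2, good

if __name__ == "__main__":
    # Hypothetical file names for two frames of the same location.
    a = cv2.imread("frame_day.png", cv2.IMREAD_GRAYSCALE)
    b = cv2.imread("frame_night.png", cv2.IMREAD_GRAYSCALE)
    _, _, matches = nndr_matches(a, b)
    print(len(matches), "NNDR matches")
```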
In our previous work [26], proximity detection was introduced in RTAB-Map primarily for LiDAR localization. A slight modification is made for this work to use it with a camera. Previously, the closest nodes in a fixed radius around the current position of the robot were sorted by distance, then proximity detection was done against the top three closest nodes. However, in a visual multi-session map, the closest nodes may not have images with illumination conditions similar to the current one. By using the likelihood computed during loop closure detection, nodes inside the proximity radius are instead sorted from most to least visually similar (in terms of BOW's inverted index score). Visual proximity detection is then done using the three most similar images around the current position. If proximity detection fails because the robot's odometry drifted too much since the last localization, loop closure detection is still done in parallel to localize the robot when it is lost.
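The candidate-selection logic described above can be sketched as follows; the node structure, radius and likelihood values are hypothetical placeholders standing in for the BOW inverted-index scores already computed during loop closure detection, not RTAB-Map's internal data structures.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    x: float           # position in the global map frame
    y: float
    likelihood: float  # BOW inverted-index score against the current frame

def proximity_candidates(nodes, robot_xy, radius=2.0, top_k=3):
    """Keep nodes within the proximity radius, sorted by visual similarity.

    Instead of sorting by metric distance (as done for LiDAR), the nodes are
    ranked by the likelihood already computed during loop closure detection,
    so frames with the most similar illumination are tried first.
    """
    rx, ry = robot_xy
    in_radius = [n for n in nodes if math.hypot(n.x - rx, n.y - ry) <= radius]
    in_radius.sort(key=lambda n: n.likelihood, reverse=True)
    return in_radius[:top_k]

# Example: nodes from different sessions around the same corridor position.
nodes = [Node(12, 1.0, 0.2, 0.04), Node(57, 0.8, 0.1, 0.21), Node(93, 1.2, -0.1, 0.09)]
print([n.node_id for n in proximity_candidates(nodes, (1.0, 0.0))])  # -> [57, 93, 12]
```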
For both loop closure and proximity detection, 6DoF transformations are computed in two steps: 1) a global feature matching is done using NNDR between the frames to compute a first transformation using the Perspective-n-Point (PnP) approach [19]; then 2) using that previous transformation as a motion estimate, 3D features from the first frame are projected into the second frame for local feature matching using a fixed-size window. This second step generates better matches to compute a more accurate transformation using PnP. The resulting transform is further refined using a local bundle adjustment approach [27].
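A simplified sketch of this two-step registration with OpenCV is shown below, assuming the keypoints of the first frame already have 3D coordinates (from the registered depth image) and that the camera matrix K is known. The guided second pass simply re-matches descriptors around the projections of the 3D points within a fixed-size window; the window size and ratio are illustrative, a brute-force search is used for brevity, and the final local bundle adjustment refinement is omitted.

```python
import numpy as np
import cv2

def two_step_pnp(pts3d_1, desc1, kp2, desc2, K, window=20, ratio=0.8):
    """Estimate the 6DoF pose of frame 2 w.r.t. frame 1 with two PnP passes."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)  # L2 for float descriptors (SIFT-like)

    # Step 1: global NNDR matching, then PnP RANSAC on the 3D-2D correspondences.
    knn = matcher.knnMatch(desc1, desc2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    if len(good) < 4:
        return None
    obj = np.float32([pts3d_1[m.queryIdx] for m in good])
    img = np.float32([kp2[m.trainIdx].pt for m in good])
    ok, rvec, tvec, _ = cv2.solvePnPRansac(obj, img, K, None)
    if not ok:
        return None

    # Step 2: project the 3D features of frame 1 with the first estimate and
    # re-match locally, keeping only candidates inside a fixed-size window.
    proj, _ = cv2.projectPoints(np.float32(pts3d_1), rvec, tvec, K, None)
    proj = proj.reshape(-1, 2)
    obj2, img2 = [], []
    for i, p in enumerate(proj):
        best_j, best_d = None, np.inf
        for j, kp in enumerate(kp2):
            if abs(kp.pt[0] - p[0]) < window and abs(kp.pt[1] - p[1]) < window:
                d = np.linalg.norm(desc1[i] - desc2[j])
                if d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            obj2.append(pts3d_1[i])
            img2.append(kp2[best_j].pt)
    if len(obj2) < 4:
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(np.float32(obj2), np.float32(img2), K, None)
    return (rvec, tvec, inliers) if ok else None
```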
IV. RESULTS

Figure 2 illustrates how the dataset has been acquired in a home in Sherbrooke, Quebec, in March. An ASUS Zenfone AR phone (with Google Tango technology) running the RTAB-Map Tango app has been used to record data for each session, following the same yellow trajectory, similarly to what a robot would do when patrolling the environment. The poses are estimated using Google Tango's visual inertial odometry approach, with RGB and registered depth images recorded at 1 Hz. The trajectory started and finished in front of a highly visually descriptive location (i.e., first and last positions shown as green and red arrows, respectively) to make sure that each consecutive mapping session is able to localize on start from the previous session. This makes sure that all maps are transformed into the coordinate frame of the first map after graph optimization. For each run between 16:45 (daylight) and 19:45 (nighttime), two sessions were recorded back-to-back to get a mapping session and a localization session taken roughly at the same time. The time delay between each run is around 30 minutes during sunset. Overall, the resulting dataset has six mapping sessions (1-16:46, 2-17:27, 3-17:54, 4-18:27, 5-18:56, 6-19:35) and six localization sessions (A-16:51, B-17:31, C-17:58, D-18:30, E-18:59, F-19:42).

The top blue boxes of Fig. 2 show images of the same location taken during each mapping session. To evaluate the influence of natural light coming from the windows during the day, all lights in the apartment were on during all sessions except for one in the living room that we turned on when the room was getting darker (see top images at 17:54 and 18:27). Besides natural illumination changing over the sessions, the RGB camera had auto-exposure and auto-white balance enabled (which could not be disabled by the Google Tango API on that phone), causing additional illumination changes based on where the camera was pointing, as shown in the purple box of Fig. 2. The left sequence illustrates what happens when greater lighting comes from outside, with auto-exposure making the inside very dark when the camera passed by the window. In comparison, doing so at night (shown in the right sequence) did not result in any changes. Therefore, for this dataset, most illumination changes come either from natural lighting or auto-exposure variations.

Fig. 2. Top view of the testing environment with the followed trajectory in yellow. Start and end positions correspond to green and red triangles respectively, which are both oriented toward the same picture on the wall shown in the green frame. Circles represent waypoints where the camera rotated in place. Windows are located in the dotted orange rectangles. The top blue boxes are pictures taken from a similar point of view (located by the blue triangle) during the six mapping sessions. The purple box shows three consecutive frames from two different sessions taken at the same position (purple triangle), illustrating the effect of auto-exposure.

A. Single Session Localization

The first experiment done with this dataset examines localization performance of different visual features for a single mapping session. Figure 3 shows, for each single-session map (1 to 6), the resulting localized frames over time of the six localization sessions (A to F) using the visual features listed in Section III. As expected by looking at the diagonals, localization performance is best (fewer gaps) when localization is done using a map taken at the same time of the day (i.e., with very similar illumination conditions). In contrast, localization performance is worst when localizing at night using a map taken during the day, and vice versa. SuperPoint is the most robust to large illumination variations, while binary descriptors like FREAK, BRIEF and BRISK are the most sensitive.

Fig. 3. Localized frames over time of the A to F localization sessions (x-axis) over the 1 to 6 single-session maps (y-axis), in relation to the visual features used.

B. Multi-Session Localization

The second experiment evaluates localization performance using multi-session maps created from maps generated at different times from the six mapping sessions. Different combinations are tested: 1+6 combines the two mapping sessions with the highest illumination variations (time 16:46 with time 19:35); 1+3+5 and 2+4+6 are multi-session maps generated by assuming that mapping would occur every hour instead of every 30 minutes; and 1+2+3+4+5+6 is the combination of all maps. Figure 4 illustrates, from top to bottom and for each multi-session map, the resulting localized frames over time. Except for SuperPoint, which shows similar performance for all multi-session maps, merging sessions with different illumination conditions increases localization performance. For the case 1+2+3+4+5+6 (fourth line), localization sessions at 16:51 and at 19:42 localize mostly with sessions taken at the same time (1 and 6), while for the other sessions, localizations are mixed between more than one map. This confirms that these two are the most different ones in the set of multi-map combinations.

Fig. 4. Localized frames over time of the A to F localization sessions (x-axis) over the four multi-session maps 1+6, 1+3+5, 2+4+6 and 1+2+3+4+5+6 (y-axis, ordered from top to bottom), in relation to the visual features used.
Table I presents cumulative localization performance over single-session and multi-session maps for each visual feature. As expected, multi-session maps improve localization performance. The map 1|2|3|4|5|6 corresponds to the case of only using the session map taken at the time closest to that of the localization session. While this seems to result in similar performance compared to the multi-session cases, it could be difficult to implement robustly over multiple seasons (where general illumination variations would not always happen at the same time) and during weather changes (cloudy, sunny or rainy days would result in changes in illumination conditions for the same time of day). Another challenge is how to make sure that the maps are correctly aligned in the same coordinate frame during the whole trajectory. In contrast, with the multi-session approach, the selection of which mapping session to use is done implicitly by selecting the best candidates of loop closure detection across all sessions. There is therefore no need to have a priori knowledge of the illumination conditions during mapping before doing localization.

TABLE I
Cumulative localization performance (%) and average localization jumps (mm) of the six localization sessions on each map for each visual feature used.

                 Localization (%)                  Visual Inliers (%)                Jumps (mm)
Map              SU  SI  BF  BK  KA  FR  DY  SP    SU  SI  BF  BK  KA  FR  DY  SP    SU  SI  BF  BK  KA  FR  DY  SP
1                67  65  65  58  65  51  66  87    11  10  15  10  12  14  11  19    56  52  49  57  52  50  45  44
2                74  69  70  64  69  54  74  89    11  10  14  10  12  13  10  19    40  38  41  47  45  42  36  37
3                86  82  85  79  81  70  86  95    13  11  16  11  15  15  13  23    42  45  45  46  47  44  41  35
4                89  84  87  83  84  76  88  96    13  12  16  11  15  15  13  24    38  48  43  45  45  42  36  33
5                79  76  79  71  74  65  79  91    13  12  16  12  15  16  13  22    40  46  45  51  46  46  39  32
6                76  74  74  67  72  62  76  90    13  12  17  12  16  17  13  21    42  47  40  52  41  48  39  36
1+6              94  91  91  87  90  84  90  97    14  13  18  12  17  16  14  25    35  38  40  44  38  40  36  29
1+3+5            97  95  96  94  95  91  97  99    16  14  18  13  18  18  16  27    32  33  39  42  37  39  31  28
2+4+6            98  95  96  93  95  91  97  98    16  14  19  13  18  18  16  27    31  37  35  37  38  35  30  27
1+2+3+4+5+6      99  98  99  98  97  95  99  99    18  16  20  15  20  20  18  29    26  30  33  39  29  31  26  25
1|2|3|4|5|6      97  95  96  96  94  92  96  98    17  14  19  14  19  18  16  27    28  30  31  36  31  32  27  27

However, the multi-session approach requires more memory, as the map is at most six times larger in our experiment than a single session of the same environment. Table II presents the memory usage (RAM) required for localization using the different map configurations, along with the constant RAM overhead (e.g., loading libraries and feature detector initialization) shown separately at the bottom. Table III presents the average localization time (on an Intel Core i7-9750H CPU and a GeForce GTX 1650 GPU for SuperPoint), which is the same for all maps with each feature. Thanks to BOW's inverted index search, loop closure detection does not require significantly more time to process for multi-session maps. However, the multi-session maps require more memory, which could be a problem on small robots with limited RAM. Comparing the visual features used, BRIEF requires the least processing time and memory, while SuperPoint seems to be the most illumination-invariant one. SuperPoint however requires more processing time and significantly more memory because of a constant overhead of NVidia's CUDA libraries in RAM.

TABLE II
RAM usage (MB) computed with Valgrind's Massif tool of each map for each visual feature used.

Map                  SU    SI    BF    BK    KA    FR    DY    SP
1 (206 nodes)        113   177   64    80    99    68    233   213
2 (205 nodes)        110   171   63    79    90    66    227   196
3 (210 nodes)        112   174   64    80    95    67    232   202
4 (193 nodes)        102   159   60    74    87    62    211   190
5 (176 nodes)        92    142   55    66    78    56    189   167
6 (240 nodes)        120   185   71    86    104   72    251   208
1+6                  237   365   135   168   206   140   486   423
1+3+5                320   490   182   225   272   190   648   569
2+4+6                338   516   195   240   284   200   685   587
1+2+3+4+5+6          675   1015  382   468   566   393   1319  1150
Constant Overhead    90    90    90    225   90    90    155   1445

TABLE III
Average localization time and words per frame for each visual feature used, along with descriptor dimension and number of bytes per element in the descriptor.

              SU    SI    BF    BK    KA    FR    DY    SP
Time (ms)     87    203   67    148   449   92    152   353
Words         847   770   889   939   640   580   847   472
Dimension     64    128   32    64    64    64    200   256
Bytes/Elem    4     4     1     1     4     1     4     4

Finally, even if we did not have access to ground truth data recorded with an external global localization system, the odometry from Google Tango for this kind of trajectory and environment does not drift very much. Thus, evaluating localization jumps can provide an estimate of localization accuracy. The last columns of Table I show the average distance of the jumps caused during localization. The maps 1+2+3+4+5+6 and 1|2|3|4|5|6 produce similarly small jumps, which can be explained by the increase of visual inliers when computing the transformation between two frames. With these two configurations, localization frames can be matched with map frames taken roughly at the same time, thus giving more inliers.
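One plausible way to compute such a jump metric, assuming the corrected map poses and the raw odometry poses of each accepted localization are available as 4x4 homogeneous matrices, is sketched below; it measures how much the map-to-odometry correction changes between consecutive localizations, which is an assumption about how the numbers in Table I could be obtained rather than the exact implementation.

```python
import numpy as np

def correction(map_pose, odom_pose):
    """Map->odom correction implied by a localization (4x4 homogeneous matrices)."""
    return map_pose @ np.linalg.inv(odom_pose)

def average_jump(map_poses, odom_poses):
    """Average translation change of the correction between consecutive localizations."""
    corrections = [correction(m, o) for m, o in zip(map_poses, odom_poses)]
    jumps = []
    for prev, curr in zip(corrections[:-1], corrections[1:]):
        delta = np.linalg.inv(prev) @ curr
        jumps.append(np.linalg.norm(delta[:3, 3]))  # translation part of the change
    return float(np.mean(jumps)) if jumps else 0.0
```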
C. Consecutive Session Localization

Section IV-B suggests that the best localization results are obtained when using the six mapping sessions merged together. Having six maps to record before doing navigation can be a tedious task if an operator has to teleoperate the robot many times and at the right times. It would be better to "teach" the robot the trajectory once and have it repeat the process autonomously to do these mapping sessions. Figure 5 shows localization performance using a previous mapping session. The diagonal values of the figure correspond to localization occurring every 30 minutes using the previous map for localization. Results just over the main diagonal correspond to localization done each hour using a map taken one hour earlier (e.g., for the 1+3+5 and 2+4+6 multi-session cases). The top-right lines are for the 1+6 multi-session case, during which the robot would be activated only at night while trying to localize using the map learned during the day. Having low localization performance is not that bad, but localizations should be evenly distributed, otherwise the robot may get lost before being able to localize after having to navigate using dead-reckoning over a small distance. The maximum distance that the robot can robustly recover from depends on the odometry drift: if it is high, frequent localizations would be required to correctly follow the planned path. SURF, SIFT, KAZE, DAISY and SuperPoint are features that do not give large gaps if maps are taken 30 minutes apart. For maps taken one hour apart, only KAZE, SuperPoint and DAISY do not show large gaps. Finally, SuperPoint may be the only one that could be used to map the environment only twice (e.g., once during the day and once at night) and robustly localize using the first map.

Fig. 5. Localization performance of the last five mapping sessions (x-axis) over preceding mapping sessions (y-axis), in relation to the visual features used.
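As a rough, purely hypothetical illustration of this trade-off (the numbers below are assumptions, not measurements from this dataset): with an odometry drift of about 1% of the distance travelled and a tolerated path error of 0.2 m, localizations would be needed roughly every 20 m.

```python
def max_dead_reckoning_distance(drift_rate, path_tolerance_m):
    """Largest gap between localizations before accumulated drift exceeds the tolerance."""
    return path_tolerance_m / drift_rate

# Assumed values: 1% drift per metre travelled, 0.2 m tolerated path error.
print(max_dead_reckoning_distance(0.01, 0.2))  # -> 20.0 m
```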
V. DISCUSSION AND CONCLUSION

Results in this paper suggest that regardless of the visual features used, similar localization performance is possible using a multi-session approach. The choice of the visual features could then be based on computation and memory cost, specific hardware requirements (like a GPU) or licensing conditions. The more illumination-invariant the visual features are, the fewer sessions are required to reach the same level of performance.

The dataset used in this paper is however limited to one day. Depending on whether it is sunny, cloudy or rainy, on variations of artificial lighting conditions in the environment, or on whether curtains are open or closed, more mapping sessions would have to be taken to keep localization performance high over time. During weeks or months, changes in the environment (e.g., furniture changes, items that are moved, removed or added) could also influence performance.

Continuously updating the multi-session map could be a solution [26] which, as the results suggest, would require more RAM. A solution using RTAB-Map could be to enable its memory management or graph reduction approaches, which would limit the size of the map in RAM. Graph reduction is similar to the approach in [2] where, if localization is successful, no new experiences are added. Another complementary approach could be to remove, offline, the nodes on which the robot did not localize for a while (like weeks or months). As new nodes are added because the environment changed, some nodes would also be removed.
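The kind of offline pruning policy suggested above could look like the following sketch, where nodes that have not been localized on for a given period are dropped unless explicitly protected (e.g., to keep the pose-graph connected). This is only a conceptual illustration of the idea, not RTAB-Map's graph reduction or memory management implementation.

```python
from datetime import datetime, timedelta

def prune_stale_nodes(nodes, last_localized_on, keep_ids, max_age=timedelta(weeks=4), now=None):
    """Remove map nodes not localized on for `max_age`, keeping protected nodes.

    `nodes` maps node id -> node data, `last_localized_on` maps node id -> datetime
    of the last successful localization on that node, and `keep_ids` protects nodes
    that must stay (e.g., nodes needed to keep the pose-graph connected).
    """
    now = now or datetime.now()
    stale = [nid for nid in nodes
             if nid not in keep_ids
             and now - last_localized_on.get(nid, now) > max_age]
    for nid in stale:
        del nodes[nid]
    return stale

# Example with hypothetical timestamps.
nodes = {1: "kitchen", 2: "corridor", 3: "living room"}
last = {1: datetime.now() - timedelta(weeks=8), 2: datetime.now(), 3: datetime.now() - timedelta(days=2)}
print(prune_stale_nodes(nodes, last, keep_ids={3}))  # -> [1]
```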
In future works, we plan to test this on larger indoor datasets like [28] and on a real robot, to study whether multiple consecutive sessions could indeed be robustly recorded autonomously with standard navigation algorithms. Testing over multiple days and weeks could also give a better idea of the approach's robustness on a real autonomous robot.

REFERENCES

[1] M. Labbé and F. Michaud, “Appearance-based loop closure detection for online large-scale and long-term operation,” IEEE Trans. on Robotics, vol. 29, no. 3, pp. 734–745, 2013.
[2] W. Churchill and P. Newman, “Experience-based navigation for long-term localisation,” The Int. J. Robotics Research, vol. 32, no. 14, pp. 1645–1661, 2013.
[3] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in IEEE Int. Conf. Robotics and Automation, 2012, pp. 1643–1649.
[4] N. Sünderhauf, P. Neubert, and P. Protzel, “Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons,” in Proc. of Workshop on Long-Term Autonomy, IEEE Int. Conf. Robotics and Automation, 2013, pp. 1–3.
[5] E. Johns and G.-Z. Yang, “Feature co-occurrence maps: Appearance-based localisation throughout the day,” in IEEE Int. Conf. Robotics and Automation, 2013, pp. 3212–3218.
[6] A. Ranganathan, S. Matsumoto, and D. Ilstrup, “Towards illumination invariance for visual localization,” in IEEE Int. Conf. Robotics and Automation, 2013, pp. 3791–3798.
[7] C. McManus, W. Churchill, W. Maddern, A. D. Stewart, and P. Newman, “Shady dealings: Robust, long-term visual localisation using illumination invariance,” in IEEE Int. Conf. Robotics and Automation, 2014, pp. 901–906.
[8] P. Corke, R. Paul, W. Churchill, and P. Newman, “Dealing with shadows: Capturing intrinsic scene appearance for image-based outdoor localisation,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2013, pp. 2085–2092.
[9] S. M. Lowry, M. J. Milford, and G. F. Wyeth, “Transforming morning to afternoon using linear regression techniques,” in IEEE Int. Conf. Robotics and Automation, 2014, pp. 3950–3955.
[10] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[11] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[12] P. Ross, A. English, D. Ball, B. Upcroft, G. Wyeth, and P. Corke, “A novel method for analysing lighting variance,” in Australian Conf. Robotics and Automation, 2013.
[13] C. Valgren and A. J. Lilienthal, “SIFT, SURF and seasons: Long-term outdoor localization using local features,” in 3rd European Conf. Mobile Robots, 2007, pp. 253–258.
[14] N. Carlevaris-Bianco and R. M. Eustice, “Learning visual feature descriptors for dynamic lighting conditions,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2014, pp. 2769–2776.
[15] D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-supervised interest point detection and description,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, 2018, pp. 224–236.
[16] M. Labbé and F. Michaud, “RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation,” J. Field Robotics, vol. 36, no. 2, pp. 416–446, 2019. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/rob.21831
[17] G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard, “A tutorial on graph-based SLAM,” IEEE Intelligent Transportation Systems Magazine, vol. 2, no. 4, pp. 31–43, 2010.
[18] F. Dellaert, “Factor graphs and GTSAM: A hands-on introduction,” Georgia Institute of Technology, Tech. Rep., 2012.
[19] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly Media, Inc., 2008.
[20] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in European Conf. Computer Vision. Springer, 2010, pp. 778–792.
[21] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in IEEE Int. Conf. Computer Vision, 2011, pp. 2548–2555.
[22] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, “KAZE features,” in European Conf. Computer Vision. Springer, 2012, pp. 214–227.
[23] A. Alahi, R. Ortiz, and P. Vandergheynst, “FREAK: Fast retina keypoint,” in IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 510–517.
[24] E. Tola, V. Lepetit, and P. Fua, “DAISY: An efficient dense descriptor applied to wide-baseline stereo,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 815–830, 2009.
[25] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in Proc. Int. Conf. Computer Vision Theory and Application, 2009, pp. 331–340.
[26] M. Labbé and F. Michaud, “Long-term online multi-session graph-based SPLAM with memory management,” Autonomous Robots, Nov 2017.
[27] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in IEEE Int. Conf. Robotics and Automation, May 2011, pp. 3607–3613.
[28] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, et al., “Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM,” in IEEE Int. Conf. Robotics and Automation, 2020, pp. 3139–3145.