Multi-Session Visual SLAM for Illumination Invariant Localization in
Indoor Environments
Mathieu Labbé and François Michaud
*This work is partly supported by the Natural Sciences and Engineering Research Council of Canada.
Mathieu Labbé and François Michaud are with the Interdisciplinary Institute of Technological Innovation (3IT), Department of Electrical Engineering and Computer Engineering, Université de Sherbrooke, Sherbrooke, Québec, Canada [Mathieu.M.Labbe | Francois.Michaud]@USherbrooke.ca

Abstract— For robots navigating using only a camera, illumination changes in indoor environments can cause localization failures during autonomous navigation. In this paper, we present a multi-session visual SLAM approach to create a map made of multiple variations of the same locations under different illumination conditions. The multi-session map can then be used at any hour of the day for improved localization capability. The approach presented is independent of the visual features used, and this is demonstrated by comparing localization performance between multi-session maps created using the RTAB-Map library with SURF, SIFT, BRIEF, FREAK, BRISK, KAZE, DAISY and SuperPoint visual features. The approach is tested on six mapping and six localization sessions recorded at 30-minute intervals during sunset using a Google Tango phone in a real apartment.
I. INTRODUCTION

Visual SLAM (Simultaneous Localization and Mapping) frameworks using hand-crafted visual features work relatively well in static environments, as long as there are enough discriminating visual features with moderate lighting variations. To be illumination invariant, a trivial solution could be to switch from vision to LiDAR (Light Detection and Ranging) sensors, but they are often too expensive for some applications in comparison to cameras. Illumination-invariant localization using a conventional camera is not trivial, as visual features taken during the day under natural light conditions may look quite different from those extracted during the night under artificial light conditions. In our previous work on visual loop closure detection [1], we observed that when traversing the same area multiple times while atmospheric conditions change periodically (like day-night cycles), loop closures are more likely to be detected with locations of past mapping sessions that have similar illumination levels or atmospheric conditions. Based on that observation, in this paper we present a multi-session approach to derive illumination-invariant maps using a full visual SLAM approach (not only loop closure detection). Mapping the same area multiple times to improve localization in dynamic environments has been tested more often outdoors with cars on road-constrained trajectories [2], rather than indoors on unconstrained trajectories where the full six Degrees of Freedom (DoF) transformations must also be recovered. The main questions this paper addresses are:

• Which visual feature is the most robust to illumination variations?
• How many sessions are required for robust localization through day and night, based on the visual feature chosen?
• To avoid having a human teleoperate a robot at different times of the day to create the multi-session map, would it be possible to acquire the consecutive maps simply by having localization be done using the map acquired from the previous session?

The paper is organized as follows. Section II presents works similar to our multi-session map approach, which is described in Section III. Section IV presents comparative results between the visual features used and the number of sessions required to expect the best localization performance. Section V concludes with limitations and possible improvements of the approach.
II. RELATED WORK

The most similar approach to the one presented in this paper exploits what is called an Experience Map [2]. An experience refers to an observation of a location at a particular time. A location can have multiple experiences describing it. New experiences of the same location are added to the experience map if localization fails during each traversal of the same environment. Localization of the current frame is done against all experiences in parallel, thus requiring multi-core CPUs to do it in real time as more and more experiences are added to a location. To avoid using multiple experiences of the same locations, SeqSLAM [3], [4] matches sequences of visual frames instead of trying to robustly localize each individual frame against the map. The approach assumes that the robot takes relatively the same route (with the same viewpoints) at the same velocity, thus seeing the same sequence of images across time. This is a fair assumption for cars (or trains), as they are constrained to follow a lane at regular velocity. However, for indoor robots having to deal with obstacles, the path can change over time, thus not always replicating the same sequences of visual frames. In Cooc-Map [5], local features taken at different times of the day are quantized in both the feature and image spaces, and discriminating statistics can be generated on the co-occurrences of features. This produces a more compact map instead of having multiple images representing the same location, while still having local features to recover the full motion transformation.
A similar approach is done in [6], where a fine vocabulary method is used to cluster descriptors by tracking the corresponding 3D landmark in 2D images across multiple sequences under different illumination conditions. For feature matching with this learned vocabulary, instead of using a standard descriptor distance approach (e.g., Euclidean distance), a probability distance is evaluated to improve feature matching. Other approaches rely on pre-processing the input images to make them illumination-invariant before feature extraction, by removing shadows [7], [8] or by trying to predict them [9]. This improves feature matching robustness in strong and changing shadows.

At the local visual feature level, common hand-crafted features like SIFT [10] and SURF [11] have been compared outdoors across multiple seasons and illumination conditions [12], [13]. To overcome limitations caused by the illumination variance of hand-crafted features, machine learning approaches have also been used to extract descriptors that are more illumination-invariant. In [14], a neural network has been trained by tracking interest points in time-lapse videos, so that it outputs similar descriptors for the same tracked points independently of illumination. However, only descriptors are learned, and the approach still relies on hand-crafted feature detectors. More recently, SuperPoint [15] introduced an end-to-end local feature detection and descriptor extraction approach based on a neural network. The illumination invariance comes from carefully making a training dataset with images showing the same visual features under large illumination variations.
III. MULTI-SESSION SLAM FOR ILLUMINATION INVARIANT LOCALIZATION

Compared to an experience map, our approach also appends new locations to the map during each traversal, but it estimates the motion transformation only against a maximum number of the best candidates based on their likelihood. Our approach is also independent of the visual features used, which can be hand-crafted or neural network based, thus making it easier to compare them in full SLAM conditions using the RTAB-Map library, for instance. RTAB-Map [16] is a Graph-SLAM [17] approach that can be used with camera(s) and/or with a LiDAR. This paper focuses on the first case, where only a camera is available for localization. The structure of the map is a pose-graph with nodes representing each image acquired at a fixed rate and links representing the 6DoF transformations between them. Figure 1 presents an example of a multi-session map created from three sessions taken at three different hours. Two additional localization sessions are also shown, one during the day (16:00) and one at night (01:00). The dotted links represent the nodes in the graph on which the corresponding localization frame has been localized. The goal is to have new frames localize on nodes taken at a similar time; if the localization time falls between two mapping times, localizations could jump between two or more sessions inside the multi-session map.

Fig. 1. Structure of the multi-session map. Three sessions taken at different times have been merged together by finding loop closures between them (yellow links). Each map has its own coordinate frame. Localization Session A (16:00) is localized in relation to both day sessions, while Localization Session B (01:00) is localized only in relation to the night session (00:00).

Inside the multi-session map, each individual map must be transformed into the same global coordinate frame so that, when the robot localizes on a node of a different session, it does not jump between different coordinate frames. To do so, the GTSAM [18] graph optimization approach is used to implicitly transform each session into the coordinate frame of the oldest map, as long as there is at least one loop closure found between the sessions.
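As an illustration of this alignment step, here is a minimal sketch using the GTSAM Python bindings (API names can differ between versions), assuming two hypothetical sessions linked by a single inter-session loop closure; the node ids, relative poses and noise values are purely illustrative and are not taken from RTAB-Map.

```python
import numpy as np
import gtsam

# Hypothetical example: session 1 uses node ids 0..2, session 2 uses 100..102.
graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

# Noise sigmas are ordered rotation (rad) then translation (m); values are illustrative.
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([1e-6] * 6))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.01, 0.01, 0.01, 0.05, 0.05, 0.05]))
loop_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.02, 0.02, 0.02, 0.10, 0.10, 0.10]))

# Anchor the first node of the oldest session: its frame becomes the global frame.
graph.add(gtsam.PriorFactorPose3(0, gtsam.Pose3(), prior_noise))

# Odometry links inside each session (1 m forward steps, for illustration only).
step = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(1.0, 0.0, 0.0))
for i, j in [(0, 1), (1, 2), (100, 101), (101, 102)]:
    graph.add(gtsam.BetweenFactorPose3(i, j, step, odom_noise))

# A single inter-session loop closure is enough to express session 2
# in the coordinate frame of session 1 after optimization.
closure = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(0.2, 0.0, 0.0))
graph.add(gtsam.BetweenFactorPose3(2, 100, closure, loop_noise))

# Initial guesses: session 2 starts in its own arbitrary frame (offset on purpose).
for key, x in [(0, 0.0), (1, 1.0), (2, 2.0), (100, 10.0), (101, 11.0), (102, 12.0)]:
    initial.insert(key, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(x, 0.0, 0.0)))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose3(100))  # now expressed in the global frame of node 0
```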
RTAB-Map uses two methods to find loop closures: a global one called loop closure detection, and a local one called proximity detection. For multi-session maps, loop closure detection works exactly the same as with single-session maps. The bag-of-words (BOW) approach [1] is used with a Bayesian filter to evaluate loop closure hypotheses over all previous images from all sessions. When a loop closure hypothesis reaches a pre-defined threshold, a loop closure is detected. Visual features can be any of the ones implemented in OpenCV [19], which are SURF (SU) [11], SIFT (SI) [10], BRIEF (BF) [20], BRISK (BK) [21], KAZE (KA) [22], FREAK (FR) [23] and DAISY (DY) [24]. Specifically for this paper, the neural network based feature SuperPoint (SP) [15] has also been integrated for comparison. The BOW vocabulary is incremental, based on FLANN's KD-Trees [25]. Quantization of features to visual words is done using the Nearest Neighbor Distance Ratio (NNDR) approach [10].
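The NNDR test used for quantization can be illustrated with a direct frame-to-frame matching example in OpenCV: a match is kept only when the nearest neighbour is clearly closer than the second nearest. SIFT and the 0.8 ratio are used here only as plausible defaults (the ratio follows [10]); they are not necessarily RTAB-Map's parameters.

```python
import cv2

def nndr_matches(img1, img2, ratio=0.8):
    """Match local features between two frames with the NNDR (ratio) test."""
    detector = cv2.SIFT_create()  # any OpenCV feature with descriptors would do
    kp1, desc1 = detector.detectAndCompute(img1, None)
    kp2, desc2 = detector.detectAndCompute(img2, None)

    # For each descriptor of frame 1, find its two nearest neighbours in frame 2.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc1, desc2, k=2)

    # Keep a match only if the best neighbour is clearly better than the second:
    # this is the Nearest Neighbor Distance Ratio criterion.
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    return kp1, kp2, good

if __name__ == "__main__":
    # Hypothetical file names for two frames of the same location.
    a = cv2.imread("frame_day.png", cv2.IMREAD_GRAYSCALE)
    b = cv2.imread("frame_night.png", cv2.IMREAD_GRAYSCALE)
    _, _, matches = nndr_matches(a, b)
    print(len(matches), "NNDR matches")
```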
In our previous work [26], proximity detection was introduced in RTAB-Map primarily for LiDAR localization. A slight modification is made for this work to use it with a camera. Previously, the closest nodes in a fixed radius around the current position of the robot were sorted by distance, then proximity detection was done against the top three closest nodes. However, in a visual multi-session map, the closest nodes may not have images with illumination conditions similar to the current one. By using the likelihood computed during loop closure detection, nodes inside the proximity radius are instead sorted from most to least visually similar (in terms of BOW's inverted index score). Visual proximity detection is then done using the three most similar images around the current position. If proximity detection fails because the robot's odometry drifted too much since the last localization, loop closure detection is still done in parallel to localize the robot when it is lost.
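The candidate-selection logic described above can be sketched as follows; the node structure, radius and likelihood values are hypothetical placeholders standing in for the BOW inverted-index scores already computed during loop closure detection, not RTAB-Map's internal data structures.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    x: float           # position in the global map frame
    y: float
    likelihood: float  # BOW inverted-index score against the current frame

def proximity_candidates(nodes, robot_xy, radius=2.0, top_k=3):
    """Keep nodes within the proximity radius, sorted by visual similarity.

    Instead of sorting by metric distance (as done for LiDAR), the nodes are
    ranked by the likelihood already computed during loop closure detection,
    so frames with the most similar illumination are tried first.
    """
    rx, ry = robot_xy
    in_radius = [n for n in nodes if math.hypot(n.x - rx, n.y - ry) <= radius]
    in_radius.sort(key=lambda n: n.likelihood, reverse=True)
    return in_radius[:top_k]

# Example: nodes from different sessions around the same corridor position.
nodes = [Node(12, 1.0, 0.2, 0.04), Node(57, 0.8, 0.1, 0.21), Node(93, 1.2, -0.1, 0.09)]
print([n.node_id for n in proximity_candidates(nodes, (1.0, 0.0))])  # -> [57, 93, 12]
```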
For both loop closure and proximity detection, 6DoF transformations are computed in two steps: 1) a global feature matching is done using NNDR between the frames to compute a first transformation using the Perspective-n-Point (PnP) approach [19]; then 2) using that previous transformation as a motion estimate, 3D features from the first frame are projected into the second frame for local feature matching using a fixed-size window. This second step generates better matches to compute a more accurate transformation using PnP. The resulting transform is further refined using a local bundle adjustment approach [27].
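A simplified sketch of this two-step registration with OpenCV is shown below, assuming the keypoints of the first frame already have 3D coordinates (from the registered depth image) and that the camera matrix K is known. The guided second pass simply re-matches descriptors around the projections of the 3D points within a fixed-size window; the window size and ratio are illustrative, a brute-force search is used for brevity, and the final local bundle adjustment refinement is omitted.

```python
import numpy as np
import cv2

def two_step_pnp(pts3d_1, desc1, kp2, desc2, K, window=20, ratio=0.8):
    """Estimate the 6DoF pose of frame 2 w.r.t. frame 1 with two PnP passes."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)  # L2 for float descriptors (SIFT-like)

    # Step 1: global NNDR matching, then PnP RANSAC on the 3D-2D correspondences.
    knn = matcher.knnMatch(desc1, desc2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    if len(good) < 4:
        return None
    obj = np.float32([pts3d_1[m.queryIdx] for m in good])
    img = np.float32([kp2[m.trainIdx].pt for m in good])
    ok, rvec, tvec, _ = cv2.solvePnPRansac(obj, img, K, None)
    if not ok:
        return None

    # Step 2: project the 3D features of frame 1 with the first estimate and
    # re-match locally, keeping only candidates inside a fixed-size window.
    proj, _ = cv2.projectPoints(np.float32(pts3d_1), rvec, tvec, K, None)
    proj = proj.reshape(-1, 2)
    obj2, img2 = [], []
    for i, p in enumerate(proj):
        best_j, best_d = None, np.inf
        for j, kp in enumerate(kp2):
            if abs(kp.pt[0] - p[0]) < window and abs(kp.pt[1] - p[1]) < window:
                d = np.linalg.norm(desc1[i] - desc2[j])
                if d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            obj2.append(pts3d_1[i])
            img2.append(kp2[best_j].pt)
    if len(obj2) < 4:
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(np.float32(obj2), np.float32(img2), K, None)
    return (rvec, tvec, inliers) if ok else None
```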
IV. RESULTS

Figure 2 illustrates how the dataset has been acquired in a home in Sherbrooke, Quebec, in March. An ASUS Zenfone AR phone (with Google Tango technology) running the RTAB-Map Tango app has been used to record data for each session, following the same yellow trajectory, similarly to what a robot would do when patrolling the environment. The poses are estimated using Google Tango's visual inertial odometry approach, with RGB and registered depth images recorded at 1 Hz. The trajectory started and finished in front of a highly visually descriptive location (i.e., first and last positions shown as green and red arrows, respectively) to make sure that each consecutive mapping session is able to localize on start from the previous session. This makes sure that all maps are transformed into the coordinate frame of the first map after graph optimization. For each run between 16:45 (daylight) and 19:45 (nighttime), two sessions were recorded back-to-back to get a mapping session and a localization session taken roughly at the same time. The time delay between each run is around 30 minutes during sunset. Overall, the resulting dataset has six mapping sessions (1-16:46, 2-17:27, 3-17:54, 4-18:27, 5-18:56, 6-19:35) and six localization sessions (A-16:51, B-17:31, C-17:58, D-18:30, E-18:59, F-19:42).

The top blue boxes of Fig. 2 show images of the same location taken during each mapping session. To evaluate the influence of natural light coming from the windows during the day, all lights in the apartment were on during all sessions except for one in the living room that we turned on when the room was getting darker (see top images at 17:54 and 18:27). Besides natural illumination changing over the sessions, the RGB camera had auto-exposure and auto-white balance enabled (which could not be disabled by the Google Tango API on that phone), causing additional illumination changes based on where the camera was pointing, as shown in the purple box of Fig. 2. The left sequence illustrates what happens when greater lighting comes from outside, with auto-exposure making the inside very dark when the camera passed by the window. In comparison, doing so at night (shown in the right sequence) did not result in any changes. Therefore, for this dataset, most illumination changes come either from natural lighting or auto-exposure variations.

Fig. 2. Top view of the testing environment with the followed trajectory in yellow. Start and end positions correspond to green and red triangles respectively, which are both oriented toward the same picture on the wall shown in the green frame. Circles represent waypoints where the camera rotated in place. Windows are located in the dotted orange rectangles. The top blue boxes are pictures taken from a similar point of view (located by the blue triangle) during the six mapping sessions. The purple box shows three consecutive frames from two different sessions taken at the same position (purple triangle), illustrating the effect of auto-exposure.

A. Single Session Localization

The first experiment done with this dataset examines localization performance of different visual features for a single mapping session. Figure 3 shows, for each single-session map (1 to 6), the resulting localized frames over time of the six localization sessions (A to F) using the visual features listed in Section III. As expected by looking at the diagonals, localization performance is best (fewer gaps) when localization is done using a map taken at the same time of the day (i.e., with very similar illumination conditions). In contrast, localization performance is worst when localizing at night using a map taken during the day, and vice versa. SuperPoint is the most robust to large illumination variations, while binary descriptors like FREAK, BRIEF and BRISK are the most sensitive.

Fig. 3. Localized frames over time of the A to F localization sessions (x-axis) over the 1 to 6 single-session maps (y-axis), in relation to the visual features used.

B. Multi-Session Localization

The second experiment evaluates localization performance using multi-session maps created from maps generated at different times from the six mapping sessions. Different combinations are tested: 1+6 combines the two mapping sessions with the highest illumination variations (time 16:46 with time 19:35); 1+3+5 and 2+4+6 are multi-session maps generated by assuming that mapping would occur every hour instead of every 30 minutes; and 1+2+3+4+5+6 is the combination of all maps. Figure 4 illustrates, from top to bottom and for each multi-session map, the resulting localized frames over time. Except for SuperPoint, which shows similar performance for all multi-session maps, merging sessions with different illumination conditions increases localization performance. For the case 1+2+3+4+5+6 (fourth line), localization sessions at 16:51 and at 19:42 localize mostly with sessions taken at the same time (1 and 6), while for the other sessions, localizations are mixed between more than one map. This confirms that these two are the most different ones in the set of multi-map combinations.

Fig. 4. Localized frames over time of the A to F localization sessions (x-axis) over the four multi-session maps 1+6, 1+3+5, 2+4+6 and 1+2+3+4+5+6 (y-axis, ordered from top to bottom), in relation to the visual features used.
Table I presents cumulative localization performance over single-session and multi-session maps for each visual feature. As expected, multi-session maps improve localization performance. The map 1|2|3|4|5|6 corresponds to the case of only using the session map taken at the time closest to that of the localization session. While this seems to result in similar performance compared to the multi-session cases, it could be difficult to implement robustly over multiple seasons (where general illumination variations would not always happen at the same time) and during weather changes (cloudy, sunny or rainy days would result in changes in illumination conditions for the same time of day). Another challenge is how to make sure that the maps are correctly aligned in the same coordinate frame during the whole trajectory. In contrast, with the multi-session approach, the selection of which mapping session to use is done implicitly by selecting the best candidates of loop closure detection across all sessions. There is therefore no need to have a priori knowledge of the illumination conditions during mapping before doing localization.

TABLE I
Cumulative localization performance (%) and average localization jumps (mm) of the six localization sessions on each map for each visual feature used.

                 Localization (%)                  Visual Inliers (%)                Jumps (mm)
Map              SU  SI  BF  BK  KA  FR  DY  SP    SU  SI  BF  BK  KA  FR  DY  SP    SU  SI  BF  BK  KA  FR  DY  SP
1                67  65  65  58  65  51  66  87    11  10  15  10  12  14  11  19    56  52  49  57  52  50  45  44
2                74  69  70  64  69  54  74  89    11  10  14  10  12  13  10  19    40  38  41  47  45  42  36  37
3                86  82  85  79  81  70  86  95    13  11  16  11  15  15  13  23    42  45  45  46  47  44  41  35
4                89  84  87  83  84  76  88  96    13  12  16  11  15  15  13  24    38  48  43  45  45  42  36  33
5                79  76  79  71  74  65  79  91    13  12  16  12  15  16  13  22    40  46  45  51  46  46  39  32
6                76  74  74  67  72  62  76  90    13  12  17  12  16  17  13  21    42  47  40  52  41  48  39  36
1+6              94  91  91  87  90  84  90  97    14  13  18  12  17  16  14  25    35  38  40  44  38  40  36  29
1+3+5            97  95  96  94  95  91  97  99    16  14  18  13  18  18  16  27    32  33  39  42  37  39  31  28
2+4+6            98  95  96  93  95  91  97  98    16  14  19  13  18  18  16  27    31  37  35  37  38  35  30  27
1+2+3+4+5+6      99  98  99  98  97  95  99  99    18  16  20  15  20  20  18  29    26  30  33  39  29  31  26  25
1|2|3|4|5|6      97  95  96  96  94  92  96  98    17  14  19  14  19  18  16  27    28  30  31  36  31  32  27  27

However, the multi-session approach requires more memory, as the map is at most six times larger in our experiment than a single session of the same environment. Table II presents the memory usage (RAM) required for localization using the different map configurations, along with the constant RAM overhead (e.g., loading libraries and feature detector initialization) shown separately at the bottom. Table III presents the average localization time (on an Intel Core i7-9750H CPU and a GeForce GTX 1650 GPU for SuperPoint), which is the same for all maps with each feature. Thanks to BOW's inverted index search, loop closure detection does not require significantly more time to process for multi-session maps. However, the multi-session maps require more memory, which could be a problem on small robots with limited RAM. Comparing the visual features used, BRIEF requires the least processing time and memory, while SuperPoint seems to be the most illumination-invariant one. SuperPoint however requires more processing time and significantly more memory because of a constant overhead of NVidia's CUDA libraries in RAM.

TABLE II
RAM usage (MB) computed with Valgrind's Massif tool of each map for each visual feature used.

Map                  SU    SI    BF    BK    KA    FR    DY    SP
1 (206 nodes)        113   177   64    80    99    68    233   213
2 (205 nodes)        110   171   63    79    90    66    227   196
3 (210 nodes)        112   174   64    80    95    67    232   202
4 (193 nodes)        102   159   60    74    87    62    211   190
5 (176 nodes)        92    142   55    66    78    56    189   167
6 (240 nodes)        120   185   71    86    104   72    251   208
1+6                  237   365   135   168   206   140   486   423
1+3+5                320   490   182   225   272   190   648   569
2+4+6                338   516   195   240   284   200   685   587
1+2+3+4+5+6          675   1015  382   468   566   393   1319  1150
Constant Overhead    90    90    90    225   90    90    155   1445

TABLE III
Average localization time and words per frame for each visual feature used, along with descriptor dimension and number of bytes per element in the descriptor.

              SU    SI    BF    BK    KA    FR    DY    SP
Time (ms)     87    203   67    148   449   92    152   353
Words         847   770   889   939   640   580   847   472
Dimension     64    128   32    64    64    64    200   256
Bytes/Elem    4     4     1     1     4     1     4     4

Finally, even if we did not have access to ground truth data recorded with an external global localization system, the odometry from Google Tango for this kind of trajectory and environment does not drift very much. Thus, evaluating localization jumps can provide an estimate of localization accuracy. The last columns of Table I show the average distance of the jumps caused during localization. The maps 1+2+3+4+5+6 and 1|2|3|4|5|6 produce similarly small jumps, which can be explained by the increase of visual inliers when computing the transformation between two frames. With these two configurations, localization frames can be matched with map frames taken roughly at the same time, thus giving more inliers.
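One plausible way to compute such a jump metric, assuming the corrected map poses and the raw odometry poses of each accepted localization are available as 4x4 homogeneous matrices, is sketched below; it measures how much the map-to-odometry correction changes between consecutive localizations, which is an assumption about how the numbers in Table I could be obtained rather than the exact implementation.

```python
import numpy as np

def correction(map_pose, odom_pose):
    """Map->odom correction implied by a localization (4x4 homogeneous matrices)."""
    return map_pose @ np.linalg.inv(odom_pose)

def average_jump(map_poses, odom_poses):
    """Average translation change of the correction between consecutive localizations."""
    corrections = [correction(m, o) for m, o in zip(map_poses, odom_poses)]
    jumps = []
    for prev, curr in zip(corrections[:-1], corrections[1:]):
        delta = np.linalg.inv(prev) @ curr
        jumps.append(np.linalg.norm(delta[:3, 3]))  # translation part of the change
    return float(np.mean(jumps)) if jumps else 0.0
```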
C. Consecutive Session Localization

Section IV-B suggests that the best localization results are obtained when using the six mapping sessions merged together. Having six maps to record before doing navigation can be a tedious task if an operator has to teleoperate the robot many times and at the right times. It would be better to "teach" the robot the trajectory once and have it repeat the process autonomously to do these mapping sessions. Figure 5 shows localization performance using a previous mapping session. The diagonal values of the figure correspond to localization occurring every 30 minutes using the previous map for localization. Results just over the main diagonal correspond to localization done each hour using a map taken one hour earlier (e.g., for the 1+3+5 and 2+4+6 multi-session cases). The top-right lines are for the 1+6 multi-session case, during which the robot would be activated only at night while trying to localize using the map learned during the day. Having low localization performance is not that bad, but localizations should be evenly distributed, otherwise the robot may get lost before being able to localize after having to navigate using dead-reckoning over a small distance. The maximum distance that the robot can robustly recover from depends on the odometry drift: if it is high, frequent localizations would be required to correctly follow the planned path. SURF, SIFT, KAZE, DAISY and SuperPoint are features that do not give large gaps if maps are taken 30 minutes apart. For maps taken one hour apart, only KAZE, SuperPoint and DAISY do not show large gaps. Finally, SuperPoint may be the only one that could be used to map the environment only twice (e.g., once during the day and once at night) and robustly localize using the first map.

Fig. 5. Localization performance of the last five mapping sessions (x-axis) over preceding mapping sessions (y-axis), in relation to the visual features used.
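As a rough, purely hypothetical illustration of this trade-off (the numbers below are assumptions, not measurements from this dataset): with an odometry drift of about 1% of the distance travelled and a tolerated path error of 0.2 m, localizations would be needed roughly every 20 m.

```python
def max_dead_reckoning_distance(drift_rate, path_tolerance_m):
    """Largest gap between localizations before accumulated drift exceeds the tolerance."""
    return path_tolerance_m / drift_rate

# Assumed values: 1% drift per metre travelled, 0.2 m tolerated path error.
print(max_dead_reckoning_distance(0.01, 0.2))  # -> 20.0 m
```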
V. DISCUSSION AND CONCLUSION

Results in this paper suggest that regardless of the visual features used, similar localization performance is possible using a multi-session approach. The choice of the visual features could then be based on computation and memory cost, specific hardware requirements (like a GPU) or licensing conditions. The more illumination-invariant the visual features are, the fewer sessions are required to reach the same level of performance.

The dataset used in this paper is however limited to one day. Depending on whether it is sunny, cloudy or rainy, on variations of artificial lighting conditions in the environment, or on whether curtains are open or closed, more mapping sessions would have to be taken to keep localization performance high over time. During weeks or months, changes in the environment (e.g., furniture changes, items that are moved, removed or added) could also influence performance.

Continuously updating the multi-session map could be a solution [26] which, as the results suggest, would require more RAM. A solution using RTAB-Map could be to enable its memory management or graph reduction approaches, which would limit the size of the map in RAM. Graph reduction is similar to the approach in [2] where, if localization is successful, no new experiences are added. Another complementary approach could be to remove, offline, the nodes on which the robot did not localize for a while (like weeks or months). As new nodes are added because the environment changed, some nodes would also be removed.
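The kind of offline pruning policy suggested above could look like the following sketch, where nodes that have not been localized on for a given period are dropped unless explicitly protected (e.g., to keep the pose-graph connected). This is only a conceptual illustration of the idea, not RTAB-Map's graph reduction or memory management implementation.

```python
from datetime import datetime, timedelta

def prune_stale_nodes(nodes, last_localized_on, keep_ids, max_age=timedelta(weeks=4), now=None):
    """Remove map nodes not localized on for `max_age`, keeping protected nodes.

    `nodes` maps node id -> node data, `last_localized_on` maps node id -> datetime
    of the last successful localization on that node, and `keep_ids` protects nodes
    that must stay (e.g., nodes needed to keep the pose-graph connected).
    """
    now = now or datetime.now()
    stale = [nid for nid in nodes
             if nid not in keep_ids
             and now - last_localized_on.get(nid, now) > max_age]
    for nid in stale:
        del nodes[nid]
    return stale

# Example with hypothetical timestamps.
nodes = {1: "kitchen", 2: "corridor", 3: "living room"}
last = {1: datetime.now() - timedelta(weeks=8), 2: datetime.now(), 3: datetime.now() - timedelta(days=2)}
print(prune_stale_nodes(nodes, last, keep_ids={3}))  # -> [1]
```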
In future works, we plan to test this on larger indoor datasets like [28] and on a real robot, to study whether multiple consecutive sessions could indeed be robustly recorded autonomously with standard navigation algorithms. Testing over multiple days and weeks could also give a better idea of the approach's robustness on a real autonomous robot.

REFERENCES

[1] M. Labbé and F. Michaud, “Appearance-based loop closure detection for online large-scale and long-term operation,” IEEE Trans. on Robotics, vol. 29, no. 3, pp. 734–745, 2013.
[2] W. Churchill and P. Newman, “Experience-based navigation for long-term localisation,” The Int. J. Robotics Research, vol. 32, no. 14, pp. 1645–1661, 2013.
[3] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in IEEE Int. Conf. Robotics and Automation, 2012, pp. 1643–1649.
[4] N. Sünderhauf, P. Neubert, and P. Protzel, “Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons,” in Proc. of Workshop on Long-Term Autonomy, IEEE Int. Conf. Robotics and Automation, 2013, pp. 1–3.
[5] E. Johns and G.-Z. Yang, “Feature co-occurrence maps: Appearance-based localisation throughout the day,” in IEEE Int. Conf. Robotics and Automation, 2013, pp. 3212–3218.
[6] A. Ranganathan, S. Matsumoto, and D. Ilstrup, “Towards illumination invariance for visual localization,” in IEEE Int. Conf. Robotics and Automation, 2013, pp. 3791–3798.
[7] C. McManus, W. Churchill, W. Maddern, A. D. Stewart, and P. Newman, “Shady dealings: Robust, long-term visual localisation using illumination invariance,” in IEEE Int. Conf. Robotics and Automation, 2014, pp. 901–906.
[8] P. Corke, R. Paul, W. Churchill, and P. Newman, “Dealing with shadows: Capturing intrinsic scene appearance for image-based outdoor localisation,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2013, pp. 2085–2092.
[9] S. M. Lowry, M. J. Milford, and G. F. Wyeth, “Transforming morning to afternoon using linear regression techniques,” in IEEE Int. Conf. Robotics and Automation, 2014, pp. 3950–3955.
[10] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[11] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “Speeded Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[12] P. Ross, A. English, D. Ball, B. Upcroft, G. Wyeth, and P. Corke, “A novel method for analysing lighting variance,” in Australian Conf. Robotics and Automation, 2013.
[13] C. Valgren and A. J. Lilienthal, “SIFT, SURF and seasons: Long-term outdoor localization using local features,” in 3rd European Conf. Mobile Robots, 2007, pp. 253–258.
[14] N. Carlevaris-Bianco and R. M. Eustice, “Learning visual feature descriptors for dynamic lighting conditions,” in IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2014, pp. 2769–2776.
[15] D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-supervised interest point detection and description,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, 2018, pp. 224–236.
[16] M. Labbé and F. Michaud, “RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation,” J. Field Robotics, vol. 36, no. 2, pp. 416–446, 2019. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/rob.21831
[17] G. Grisetti, R. Kümmerle, C. Stachniss, and W. Burgard, “A tutorial on graph-based SLAM,” IEEE Intelligent Transportation Systems Magazine, vol. 2, no. 4, pp. 31–43, 2010.
[18] F. Dellaert, “Factor graphs and GTSAM: A hands-on introduction,” Georgia Institute of Technology, Tech. Rep., 2012.
[19] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library. O’Reilly Media, Inc., 2008.
[20] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in European Conf. Computer Vision. Springer, 2010, pp. 778–792.
[21] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in IEEE Int. Conf. Computer Vision, 2011, pp. 2548–2555.
[22] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, “KAZE features,” in European Conf. Computer Vision. Springer, 2012, pp. 214–227.
[23] A. Alahi, R. Ortiz, and P. Vandergheynst, “FREAK: Fast retina keypoint,” in IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 510–517.
[24] E. Tola, V. Lepetit, and P. Fua, “DAISY: An efficient dense descriptor applied to wide-baseline stereo,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 815–830, 2009.
[25] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in Proc. Int. Conf. Computer Vision Theory and Application, 2009, pp. 331–340.
[26] M. Labbé and F. Michaud, “Long-term online multi-session graph-based SPLAM with memory management,” Autonomous Robots, Nov 2017.
[27] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in IEEE Int. Conf. Robotics and Automation, May 2011, pp. 3607–3613.
[28] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, et al., “Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM,” in IEEE Int. Conf. Robotics and Automation, 2020, pp. 3139–3145.