Revisiting animal photo-identification using deep metric learning and network analysis

Received: 19 June 2020    |   Accepted: 28 January 2021

DOI: 10.1111/2041-210X.13577

RESEARCH ARTICLE

Vincent Miele¹ | Gaspard Dussert¹ | Bruno Spataro¹ | Simon Chamaillé-Jammes²,³,⁴ | Dominique Allainé¹,⁴ | Christophe Bonenfant¹,⁴

¹ Université de Lyon, Université Lyon 1, CNRS UMR5558, Laboratoire de Biométrie et Biologie Évolutive, Villeurbanne, France
² CEFE, Univ Montpellier, CNRS, EPHE, IRD, Univ Paul Valéry Montpellier 3, Montpellier, France
³ Department of Zoology & Entomology, Mammal Research Institute, University of Pretoria, Pretoria, South Africa
⁴ LTSER France, Zone Atelier ‘Hwange’, Hwange National Park, Dete, Zimbabwe

Correspondence
Vincent Miele
Email: vincent.miele@univ-lyon1.fr

Funding information
French National Center for Scientific Research (CNRS); Statistical Ecology Research Group (EcoStat)

Handling Editor: Robert Freckleton

Abstract
1. An increasing number of ecological monitoring programmes rely on photographic capture–recapture of individuals to study distribution, demography and abundance of species. Photo-identification of individuals can sometimes be done using idiosyncratic coat or skin patterns, instead of using tags or loggers. However, when performed manually, the task of going through photographs is tedious and rapidly becomes too time-consuming as the number of pictures grows.
2. Computer vision techniques are an appealing and unavoidable help to tackle this apparently simple task in the big-data era. In this context, we propose to revisit animal re-identification using image similarity networks and metric learning with convolutional neural networks (CNNs), taking the giraffe as a working example.
3. We first developed an end-to-end pipeline to retrieve a comprehensive set of re-identified giraffes from about 4,000 raw photographs. To do so, we combined CNN-based object detection, SIFT pattern matching and image similarity networks. We then quantified the performance of deep metric learning to retrieve the identity of known individuals, and to detect unknown individuals never seen in the previous years of monitoring.
4. After a data augmentation procedure, the re-identification performance of the CNN reached a Top-1 accuracy of about 90%, despite the very small number of images per individual in the training dataset. While the complete pipeline succeeded in re-identifying known individuals, it slightly under-performed with unknown individuals.
5. Fully based on open-source software packages, our work paves the way for further attempts to build automatic pipelines for re-identification of individual animals, not only in giraffes but also in other species.

KEYWORDS
deep metric learning, image similarity networks, individual identification, open-source software

Methods Ecol Evol. 2021;00:1–11. wileyonlinelibrary.com/journal/mee3 © 2021 British Ecological Society

1 | INTRODUCTION

In many respects, population and behavioural ecology have immensely benefited from individual-based, long-term monitoring of animals in wild populations (Clutton-Brock & Sheldon, 2010; Hayes & Schradin, 2017). At the heart of such monitoring is the ability to recognize individuals. Individual identification is often achieved by actively marking animals, such as deploying ear-tags or leg rings, cutting fingers or feathers, or scratching scales in reptiles (Silvy et al., 2005). In some species, however, individuals display natural marks that make them uniquely identifiable. For instance, many large African mammals such as leopard Panthera pardus, zebra Equus sp., kudu Tragelaphus strepsiceros, wildebeest Connochaetes taurinus or giraffe Giraffa camelopardalis, all present idiosyncratic fur and coat patterns. Non-invasive and reliable identification of individuals in the wild has long been known to be feasible from comparisons of these distinctive coat patterns (Estes, 1991). As the number of individuals to identify increases, however, people-based visual comparisons of pictures can rapidly become overwhelming. With the recent move to digital technologies (namely digital cameras and camera traps), the problem becomes even more acute as the number of pictures to process can easily reach the thousands or tens of thousands.

Over the last decade, the use of computer vision rapidly spread into biological sciences to become a standard tool in animal ecology for many repetitive tasks (Weinstein, 2018). In a seminal publication, Bolger et al. (2012) first presented computer-aided photo-identification, initially for giraffes but more recently applied to dolphins (Renó et al., 2019). The underlying computer technique is a feature matching algorithm, the Scale Invariant Feature Transform operator (SIFT; Lowe, 2004), where each image is associated with the k-nearest best matches. The current use of SIFT for ecologists requires human intervention to validate the proposed candidate images within a graphical interface (Bolger et al., 2011). In the same vein, other feature-based proposals were developed in the last decade to apply computer vision to different types of idiosyncrasies (Hartog & Reijns, 2014; Moya et al., 2015). A drawback of the method frequently arises when two images are considered similar not because of similar skin or coat patterns of animals, but because of similarities in the backgrounds (presence of a distinctive tree for instance), hence leading to false-positive matches. For the best results with computer vision, all images should be cropped beforehand so that only the relevant part of the animal appears in the images to be analysed and compared (e.g. excluding most of the neck, head, legs and background for large herbivores). Until now, this cropping operation was most often done manually (Halloran et al., 2015), despite being a highly time-consuming task when processing thousands of images.

Meanwhile, the Deep Learning (DL) revolution was underway in computer vision, showing breakthrough performance improvements (Christin et al., 2019). In particular, convolutional neural networks (CNNs) are now the front-line computer technique to deal with a large range of image processing questions in ecology and environmental sciences (Lamba et al., 2019). Many recent studies tackle the general problem of re-identification using CNNs, which has been mostly developed and extensively used for humans (Wu et al., 2019). Technically, re-identification consists in using a CNN to classify images of different individuals, some of them being not necessarily seen before, that is, unknown individuals. However, despite the availability of proven and efficient techniques (Zheng et al., 2016), and several successful attempts to apply the method to non-human species (Bogucki et al., 2019; Bouma et al., 2019; Chen et al., 2020; Ferreira et al., 2020; Hansen et al., 2018; He et al., 2019; Körschens et al., 2018; Moskvyak et al., 2019; Schneider et al., 2020; Schofield et al., 2019), re-identification remains a challenging task when applied to animals in the wild where re-observations are limited in number to train the model satisfactorily sensu largo (Schneider et al., 2019).

In practice, current CNN-based approaches have to be tailored to the needs of field ecologists interested in using these tools for individual recognition. For instance, batches of new images are regularly added to the reference database following yearly fieldwork sessions because of the recruitment of newborns or of immigrants if the study population is demographically open. Therefore, we expect the re-sighting of known individuals, as well as the observation of individuals never seen before. In other words, this standard sampling design requires solving the re-identification problem for a mixture of known and unknown individuals. Chen et al. (2020) referred to this problem as the ‘open set’ identification problem, and they proposed to identify images from unknown individuals and to assign them a single ‘unknown’ label. Automatically identifying currently unknown individuals speeds up the picture sorting process, and facilitates adding them to the database of individuals whose life history is monitored.

A classical CNN classifier can re-identify already known individuals (usually with a softmax last layer) but will fail to identify new individuals because the number of predicted classes must match the number of known individuals. We therefore crucially need a CNN-based approach that can filter out individuals unknown at the time of the analysis. We propose to rely on deep metric learning (DML, see Hoffer & Ailon, 2015) as an ideal candidate to solve the ‘open set’ identification problem. DML consists in training a CNN model to embed the input data (input images) into a multidimensional Euclidean space such that data from a common class (e.g. images of a given individual) are, in terms of Euclidean distance, much closer to each other than to the rest of the data.

Here, we addressed the problem of photo-identification with an updated, open-source and end-to-end automatic pipeline applied to the case of the iconic, endangered giraffe. In the first step, we applied state-of-the-art techniques for object detection with CNNs (Lin et al., 2017) to automatically crop giraffe flanks from about 4,000 raw photographs shot in the field at Hwange National Park, Zimbabwe. Indeed, the most recent CNN approaches clearly outperformed other approaches (Girshick et al., 2014), including the Histogram of Oriented Gradients (HOG) approach that was recently used with giraffes too (Buehler et al., 2019). Second, following Bolger et al. (2012), we used the SIFT operator to calculate a numeric distance between all pairs of giraffe flanks. From the n × n calculated distances, we followed the new framework of image similarity network (Wang et al., 2018) and applied unsupervised learning to retrieve different clusters of images coming from different individuals, hence removing any human intervention in the process of individual identification. Third, we manually validated a subset of our results to build a ground-truth dataset of different individuals (n = 82). Using this dataset as a training set, we developed a supervised learning strategy using CNNs and evaluated its predictive accuracy with a cross-validation procedure.

2 | MATERIALS AND METHODS

2.1 | Photograph database

We carried out this study in the northeast of Hwange National Park (HNP), Zimbabwe. HNP covers an area of 14,650 km² (Chamaillé-Jammes et al., 2009). The giraffe sub-species currently present in HNP could be either G. c. angolensis or G. c. giraffa according to the IUCN (Muller et al., 2018). Here, we used data from a regular monitoring of individuals conducted between 2014 and 2018. Each year for at least three consecutive weeks, we drove the road network daily within

Transfer learning is a specific method that aims to train a CNN on a small number of images: it does not start CNN training ‘from scratch’ with random model parameters, but uses the parameters of a model previously trained on a large dataset and for tasks similar to the one of interest (Willi et al., 2019). This approach works because the pre-trained model has already learnt a wide range of relevant and generic features.

We manually prepared our training dataset by cropping bounding boxes around giraffe flanks, excluding most of the neck, head, legs and background, with the labelImg open-source program for image annotation (https://github.com/tzutalin/labelImg). We performed transfer learning with RetinaNet to detect a single object class, the giraffe flank, from a pre-trained model shipped with RetinaNet, that is, a ResNet50 backbone trained on the COCO dataset (80 different classes of common objects including giraffes among a few other animal species; see Lin et al., 2014). We trained the model with 30

2.3.2 | Image similarity network, community                                can be used for machine learning tasks. In this context, we trained
detection and clusters of images                                           a CNN model using the triplet loss (Hermans et al., 2017), in line
                                                                           with recent studies on other species (Bouma et al., 2019; Moskvyak
Following the computation of distances between all pairs of giraffe        et al., 2019). The triplet loss principle relies on triplets of images
flanks obtained with the SIFT operator approach, we searched for           composed by a first image called anchor and another positive image
clusters of flank images that should come from one single individ-         of the same class (same giraffe here) and a third negative image of an-
ual giraffe. We first defined a network made of nodes and repre-           other class (any different giraffe; see Bouma et al., 2019, for details).
senting giraffe flank images, and of edges: we considered that two         The training step consists in optimizing the CNN model such that
nodes were connected by an edge, that is, two flanks were similar          the Euclidean distance computed using the last CNN layer (hereaf-
and came from the same giraffe if the SIFT-­based distance between         ter called CNN-­based distance) between any anchor and its positive
paired images felt below a given threshold (see below for more de-         image is minimal while maximizing the distance between this anchor
tails). Therefore, the so-­called connected components of this network     image with its negative counterpart. We used an improved algorithm
should associate images from different individuals.                        called semi-­hard triplet loss (Schroff et al., 2015) that deals only with
        We estimated this distance threshold value by taking advantage     triplets where the positive and negative images are close (in other
of a property of complex networks called the explosive percolation         words, the ‘hard’ cases), using the TripletSemiHardLoss function in
(Achlioptas et al., 2009). The explosive percolation predicts a phase      TensorFlow Addons. After training completion, we computed the
transition of the network just above a distance threshold point. At        Euclidean distances between any pair of giraffe flank photographs,
this point, adding a small number of edges in the network, for ex-         again using the vector composing the last layer of our CNN model.
ample by slightly increasing the distance threshold (Hayasaka, 2016),
leads to the sudden appearance of a giant component encompassing
the majority of nodes. In other words, at some point, a small increase     2.4.2 | Data augmentation, training and test datasets
in the distance threshold leads to considering that almost all images
come from the same giraffe. We determined this threshold value             We derived the training and test datasets required for the CNN
graphically, selecting the transition point where the giant component      approach from the photograph clusters identified by the SIFT al-
starts to increase dramatically (Supporting Information Figure S2).        gorithm. We retained only those clusters fulfilling the following
        An additional issue arose when different nodes were erroneously    conditions: (a) the cluster contains a minimum of two sequences of
connected (example in Figure S1), that is, when two flanks were errone-    images shot at least 1 hr apart; (b) the cluster can be divided into a
ously considered similar. Moreover, in some cases, the body of two or      first set of sequences large enough to perform training (we imposed
more giraffes could overlap in one photograph. In this situation, two or   at least five images), and a second set of sequences; (c) the cluster
more nodes might be linked by edges, when we actually should consider      demonstrated a perfect and verified consistency. We used the first
different giraffes. To solve this problem, we applied a network cluster-   set of sequences for CNN training, and the second as an independ-
ing algorithm called community detection, developed in network science     ent test dataset to assess the model performance. The first condi-
(Fortunato, 2010), to split—­only when relevant—­any connected com-        tion ensured that we have complete independence between training
ponent into different groups of nodes that are significantly much more     and test datasets, that is, giraffes being seen under different con-
connected between themselves than with the others, the so-­called com-     ditions (time, season or location). The third condition is of upmost
munity. Indeed, the presence of many edges inside a group of images        importance because errors in the dataset would lead to sub-­optimal
suggested it was consistent and taken from the same individual, whereas    performances of the machine learning approach. We therefore care-
the absence of many edges between two groups clearly informed about        fully checked, manually, that the SIFT-­based clusters we used in the
their inconsistency and heterogeneity (i.e. from two different individu-   CNN were perfectly unambiguous. We achieved this high level of
als). We applied the community detection with the InfoMap algorithm        data quality by discarding all cases where two or more giraffes over-
(Rosvall & Bergstrom, 2008). The final product of the community detec-     lapped on the same frame, or when giraffes were indifferently ori-
tion algorithm was a set of clusters of images corresponding either to a   ented from the back to the front (orientation ambiguities).
connected component or to a community retrieved by InfoMap.                   We cropped all flank images to focus on the central part of the
                                                                           flank, keeping 80% of the original width and 60% of the height (in
                                                                           particular excluding the neck and its background). By doing so, we
2.4 | Re-­identification of individuals, using                             wanted to prevent our CNN model from capturing background noise.
supervised learning                                                        Additionally, we homogenized contrast of images by normalizing the
                                                                           three colour channels using the    imagemagick   package (normalize op-
2.4.1 | Deep metric learning and triplet loss                              tion; https://image​magick.org). In a final step, we resized all images
with CNN                                                                   to 224 × 224 pixels.
                                                                              We ended up with five flanks per individual at least, and a me-
The principle of deep metric learning is to find an optimal way to pro-    dian of seven (Table 1) in the training set. This particularly low number
ject images into an Euclidean space such that the Euclidean distance       of images available to train the CNN led us to consider the few shot
MIELE et al.                                                                                                  Methods in Ecology and Evolu on    |   5

TA B L E 1 Flank images were selected to ensure independence               between test images and representative ones when they came from
of observation, and then used for individual giraffe re-­identification    the same known individual. Similarly, we calculated the CNN-­based
from coat patterns with a convolutional neural network. We
                                                                           distance between representative images and images of the so-­called
tabulated the average number (and the associated range in squared
brackets) of images and sequences (i.e. separated by at least 1-­hr        unknown individuals. We also considered that two images can come
interval) per individual in the train, test and unknown datasets over      from the same individual if their distance was below a given thresh-
10 trials                                                                  old. This distance threshold was a stringency condition that arbi-
                                                                           trarily varied between 0 and 1.
                                            Nb.            Nb.
                                  Nb.       images         sequences          We quantified the predictive performance of the trained CNN
               Nb. images         indiv.    per indiv.     per indiv.      model on the range of distance threshold values. First, we computed
 Train         503 [479–­529]     62        7 [5–­24]      2 [1–­5]        Top-­1 accuracy for known individuals, consisting in checking for each
                                                                           query image if a representative image from the same individual was the
 Test          121 [118–­126]     62        2 [1–­5]       1 [1–­4]
                                                                           one with smallest distance (i.e. the Top-­1 image) and with a distance
 Unknown       40                 20        2              2
  indiv.                                                                   below the threshold. In the following, Top-­1 accuracy was also called
                                                                           true-­positive (TP) rate. Then, we computed the false-­positive rate (FP),
learning framework, a class of problems where only a few images are        checking cases where the Top-­1 image was from a different individ-
available for training. We implemented a 10-­fold data augmentation        ual. Finally, we quantified the CNN ability to sort out images from un-
procedure where we made extensive use of image augmentation                known individuals. Again, over the range of distance threshold values,
using the imgaug Python library (https://github.com/aleju/​imgaug).        we checked if Top-­1 image of unknown individual images felt below
For each image in the training dataset, we performed a random set of       the threshold. If not, we considered that we successfully detected an
transformations such as modifying orientation and size, adding blur,       unknown individual, hence computing the true-­negative (TN) rate.
performing edge detection, adding Gaussian noise and modifying co-
lours or brightness (details in the available Python code). We finally
used this set of 11 images per original image to train our CNN model,      3 | R E S U LT S
that is, the original one and ten modified versions of this image.
                                                                           3.1 | From thousands of photographs to thousands
                                                                           of images of giraffe flank
2.5 | Evaluation of CNN-­based re-­identification
                                                                           We trained the object detection method with RetinaNet (Lin
To quantify the overall predictive performance of our CNN deep met-        et al., 2017) on a set of 400 photographs for which the cropping of
ric learning, we replicated the following procedure 10 times. We first     the 469 giraffe flanks have been previously done manually. Training
randomly selected 25% of the individuals of the dataset and, for the       took approximately 30 min on a Titan X card. When applying the
purpose of the evaluation here, considered these as unknown indi-          automatic cropping procedure on our 3,940 photographs (see
viduals. Then, for each of them, we randomly selected two images,          Figure 1a), we retrieved 5,019 images with associated bounding
one in each of the sequences (see above). With this dataset, we aimed      boxes, supposed to contain a single giraffe flank (see Figure 2a). The
to test the ability of the CNN model to detect unknown individuals.        cropping failed for 186 photographs (failure rate: 4.7%), mostly due
The remaining 75% individuals were considered known individuals.           to foreground vegetation and, unusual and difficult orientation of gi-
For these known individuals, we selected all photographs from the          raffes in the photograph (see examples on Figure 1b). In a few cases,
first sequence and used it to built a training dataset for the CNN. We     a bounding box could contain the bodies of two overlapping giraffes,
kept all images from the remaining sequences as the test dataset for       one being partially in front of the other (see Figure 2a). Similarly, in
known individuals. This ensured a good independence between train-         some rare instances, giraffes were standing very close to each other
ing and test data, mostly thanks to the 1 hr (at least) time lag between   on a photograph, a situation where RetinaNet could fail in retrieving
observations. Once the selection of individuals was completed, we          the exact boundaries of each giraffe flank (see the worst case that we
performed transfer learning using the pre-­trained model ResNetV2          experienced, from a partially blurry photograph in Figure 2b).
readily available in Keras. We estimated the model parameters using
the augmented training dataset with 80 epochs with batches of size
42. We used the stochastic gradient descent optimizer with a rate of       3.2 | From thousands of images down to
0.2. Our pipeline was implemented with Keras 2.3.0.                        hundreds of identified individuals
    To mimic re-­identification per se, literally re-­seeing known in-
dividuals, we considered that we had a ‘reference book’ with five          Running the SIFT algorithm (Lowe, 2004) to compare all pairs of
representative images per known individuals: these images were             flanks took about 800 CPU hours of heterogeneous computing re-
randomly drawn out of the training dataset. We then calculated the         sources. We estimated the threshold value for the giant component
CNN-­based distance between these representative images and each           (see Section 2) at a distance of 340 (see Figure S2a), and obtained
image from the test dataset. In essence, we expected small distances       an image similarity network composed of 5,019 nodes and 11,249
6   |       Methods in Ecology and Evolu on                                                                                                    MIELE et al.

[Figure 1: (a) Occurrences (y-axis, 0 to 2,500) against the number of identified giraffe flanks/image (x-axis, 0 to 8); (b) categories of cropping problems: Head, Backlight, No giraffe, Fuzzy, Back-side, Not explained, Front-side, Too far, Foreground vegetation]

FIGURE 1 Performance of RetinaNet flank detection of giraffes from a set of 3,940 photographs taken at Hwange National Park, Zimbabwe, between 2014 and 2018. In total, we could extract 5,019 images of giraffe flanks automatically. (a) Number of identified flanks per image; (b) Manual classification of cropping problems encountered in 186 images where RetinaNet failed to identify a giraffe flank in the photographs

[Figure 2: photographs, panels (a) and (b)]

FIGURE 2 Examples of automatic cropping of giraffe photographs with RetinaNet to retrieve the flank of the animal body (red squares). Photographs were shot at Hwange National Park, Zimbabwe, between 2014 and 2018. In (a) the best-case scenario where all giraffes stand separately on the photograph, and RetinaNet successfully finds the flanks of the four individuals; (b) Worst-case, but rare, scenario where the bodies of the different individuals overlap, combined with a blur caused by the car window on the right-hand side of the photograph. In this case, RetinaNet missed two individuals, and cropped the body of two giraffes into one single image

edges, yielding 1,417 connected components among which 781 were singletons of one image.

Our network-based approach, relying on community detection, retrieved consistent clusters of flank images (different colours in Figure 3). The cluster size distribution is by definition more concentrated after network clustering (see Figure S3) with a maximal size of 35 instead of 373. Indeed, this very large connected component was clearly an artefact due to a chain of giraffe overlaps, and has been successfully split by our procedure (see Figure S4). We detected 316 clusters with more than 5 images, and 105 with more than 10 images. However, in rare cases, some images from the same individuals were found in different clusters (see Figure S4). Because these clusters arose from a single connected component, we could a posteriori check for consistencies by comparing clusters of the same component manually (such as performed for Figure S4).

3.3 | From identified individuals to a deep learning approach for re-identification

To perform a fair evaluation of the CNN performance, we saved 82 human-validated, unambiguous SIFT-based clusters that contained at least two different sequences of photographs shot at least 1 hr apart (see Section 2). Those 82 clusters were made of 822 images of giraffe flanks from which we evaluated the performance of our re-identification pipeline based on deep metric learning. Once trained using data augmentation (Figure 4), the CNN returned a Top-1 accuracy (TP rate) of about 85% on average (Figure 5) for images of known individuals. However, 11 images were found to be repeatedly impossible to classify because of bad orientation of the giraffe body on the photograph, or because of the presence of conspicuous and disturbing elements in the foreground (Figure S6). Without these problematic images, we achieved a Top-1 accuracy > 90%, on average. Interestingly, the associated false-positive rate was close to 0 (Figure 5). In other words, when a Top-1 image existed below a given threshold (here 1 at most), this Top-1 image was almost always from the correct known individual (Figure S5a).

With our deep metric learning approach, images were projected into a Euclidean space. We expected images from the same known individual to be close in this space, whereas images from unknown individuals should be distant from those of

known individuals. This prediction was only partly supported. For small distance threshold values (d ≤ 0.1), the true-negative rate was TN > 95%, but TN decreased markedly as the distance threshold increased (Figure 5). At the same time, the true-positive rate started from TP < 70% for d ≤ 0.1 but rapidly levelled off at 80% as the distance threshold increased (Figure 5). Hence, our CNN often predicted an unexpectedly small distance between an image of an unknown individual and an image of a known individual (Figure S5b). Interestingly, a particular threshold value (d = 0.25; crossing point in Figure 5), where both TP and TN rates reached 80%, offered the best compromise.
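
The trade-off between TP and TN rates described above can be illustrated with a minimal sketch. It assumes that each image has already been projected to a point in a Euclidean space and that a query is assigned to the individual of its closest reference image (its Top-1) only when that distance is at most the threshold. The two-dimensional points, the labels and the function names (`nearest`, `rates`) are invented for illustration; they are not taken from our code or data.

```python
import math

def nearest(query, references):
    """Return (distance, label) of the reference embedding closest to the query."""
    return min((math.dist(query, emb), label) for emb, label in references)

def rates(known_queries, unknown_queries, references, threshold):
    """TP and FP rates on known individuals, TN rate on unknown individuals."""
    tp = fp = 0
    for emb, true_label in known_queries:
        dist, label = nearest(emb, references)
        if dist <= threshold:          # a Top-1 candidate exists below the threshold
            if label == true_label:
                tp += 1                # correct known individual
            else:
                fp += 1                # wrong known individual
    # An unknown individual is correctly rejected when no reference is close enough.
    tn = sum(nearest(emb, references)[0] > threshold for emb in unknown_queries)
    return tp / len(known_queries), fp / len(known_queries), tn / len(unknown_queries)

# Toy embeddings (hypothetical): two reference individuals "A" and "B".
references = [((0.0, 0.0), "A"), ((1.0, 0.0), "B")]
known_queries = [((0.05, 0.0), "A"), ((0.9, 0.1), "B"), ((0.4, 0.0), "B")]
unknown_queries = [(0.0, 0.2), (0.5, 0.8)]

for d in (0.1, 0.25, 0.5):
    tp_rate, fp_rate, tn_rate = rates(known_queries, unknown_queries, references, d)
    print(f"d={d}: TP={tp_rate:.2f} FP={fp_rate:.2f} TN={tn_rate:.2f}")
```

In this toy example, raising the threshold accepts more Top-1 candidates, which raises the TP rate and lowers the TN rate, the same qualitative pattern as in Figure 5.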

[Figure 3: network diagram with nodes and images labelled 1 to 6]

FIGURE 3 Example of a connected component split into four clusters using the InfoMap algorithm (see Section 2) to assign images of giraffe flank to a given individual for re-identification. Each cluster, representing one individual giraffe, is delineated by an ellipse of different colour. Node 2 is an image with two giraffes that we also have in images 1 and 3, respectively, accounting for why their two respective clusters (on the left) are connected. Clusters can sometimes be connected even if the flanks belong to two different giraffes. We illustrate this case with images 3 and 4, which are considered similar because of the presence of the same tree in the background. The same issue arises for images 5 and 6. We applied this method to re-identify giraffes from coat patterns on a collection of photographs taken at Hwange National Park, Zimbabwe, between 2014 and 2018

[Figure 5: Performance (%) (y-axis, 0 to 100) against Threshold (x-axis, 0.00 to 1.00); curves: TP (Top-1 acc), FP, TN (unkn. indiv.)]

FIGURE 5 Performance of our convolutional neural network (CNN) pipeline for the re-identification of giraffes at Hwange National Park, Zimbabwe (between 2014 and 2018). We decided that two flank images came from the same giraffe using the Euclidean distance between the two images defined by our deep metric learning method. If the distance between the two images fell below a certain threshold distance, it was concluded they belonged to the same individual. Here, we report on the true-positive rate (TP), or Top-1 accuracy, as a function of the distance threshold and calculated on images of known individuals in the test dataset, with (plain) or without (dashed) 11 problematic images. We also report the corresponding false-positive rate (FP), and the true-negative rate (TN) calculated on images of unknown individuals. True-negative rate displays the performance of the CNN model to detect new giraffes entering the dataset, that is, those individuals never seen before when training the CNN

FIGURE 4 Training a convolutional neural network (CNN) requires a large and varied set of images (here giraffe flanks) to achieve reasonable performance when applied to new cases. In this study, we took giraffe photographs at Hwange National Park, Zimbabwe, between 2014 and 2018, but in the field the opportunity to shoot pictures of the same giraffe in a variety of situations in terms of location or light condition is very limited. Therefore, we performed image data augmentation by randomly changing orientation and size, adding blur, performing edge detection, adding noise and modifying colours or brightness using the imgaug Python library (see Section 2). Here, we show an example of data augmentation, with the original image (left) and four different modified versions used to train our CNN for giraffe re-identification
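
The principle of data augmentation can be sketched in a few lines. The example below is a deliberately simplified stand-in written with the Python standard library only: it does not use imgaug and does not reproduce our augmentation pipeline. It applies three of the listed transformations (orientation change by horizontal flip, brightness change, additive noise) to a grayscale image stored as rows of 0 to 255 integers. The function names and parameter ranges are assumptions chosen for illustration.

```python
import random

def augment(image, rng):
    """Return one randomly modified copy of a grayscale image (rows of 0-255 ints)."""
    out = [row[:] for row in image]          # work on a copy, keep the original intact
    if rng.random() < 0.5:                   # random horizontal flip (orientation)
        out = [row[::-1] for row in out]
    gain = rng.uniform(0.7, 1.3)             # random brightness change
    sigma = rng.uniform(0.0, 10.0)           # standard deviation of the additive noise
    return [
        [min(255, max(0, round(p * gain + rng.gauss(0.0, sigma)))) for p in row]
        for row in out
    ]

def augmented_set(image, n_copies, seed=0):
    """Generate n_copies modified versions of one image, reproducibly."""
    rng = random.Random(seed)
    return [augment(image, rng) for _ in range(n_copies)]

# Toy 3 x 4 image (hypothetical pixel values) and four modified versions.
original = [[10, 50, 90, 130], [20, 60, 100, 140], [30, 70, 110, 150]]
copies = augmented_set(original, n_copies=4)
```

Each call produces a different variant of the same flank, so a handful of field photographs per individual can yield a much larger training set.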

4 | DISCUSSION

We propose two complementary approaches to re-identify individual giraffes from a set of photographs taken in the field. Based on the new framework of image similarity networks, our unsupervised method goes one step further compared to previous solutions from the literature since its end product is a comprehensive list of clusters of images, one cluster per identified individual. Our supervised method, which relies on deep metric learning, achieves a very good re-identification of giraffes from a 'reference book' of known individuals despite the rather small number of photographs per individual available to train the model.

As a first step, we took advantage of the most recent computer vision techniques to perform object detection and crop the giraffe flanks before comparing coat patterns of giraffes. Image cropping proves to be particularly efficient when the bodies of several giraffes do not overlap in photographs. However, a cascade of problems arises when overlapping occurs, including erroneous cropping and difficulty in assigning a bounding box to a single individual because, in this case, the coat patterns of two individuals are mixed. We show that a limited number of labelled photographs (a few hundred) is needed to train RetinaNet with a very good performance on new photographs. To what extent our RetinaNet model parameters could be efficient in other study sites with different background vegetation (in 'Terra Incognita', quoting Beery et al., 2018) remains an open question. Nevertheless, fine-tuning RetinaNet for a particular task and dataset is within the reach of many researchers dealing with animal photographs thanks to the associated code we provide. Further perspectives now arise with contour segmentation methods (He et al., 2017) that can extract contours of an object such as the whole body or any part of an animal by creating the so-called segmentation mask (Brodrick et al., 2019). Giraffe body contouring could possibly help individual re-identification by removing background residual noise, but building a training set by manually contouring hundreds of animal bodies remains a huge effort.

We then recast the animal identification problem from photographs into a statistical one, namely a clustering problem in an image similarity network. In other words, given a network that we build using a distance between pairs of images, we can efficiently retrieve the image set of a given individual as a cluster in a network. We computed a distance based on pattern matching between flanks with the well-known SIFT operator (Bellavia & Colombo, 2020) as used by Bolger et al. (2012). The proposed network-based approach was particularly useful and efficient to deal with false-positive matches. False-positive matches are a recurrent issue occurring when two images have very similar backgrounds. This situation is often found when the same tree appears on two images (see nodes 3 and 4 in Figure 3), when giraffe orientation perfectly matches (see Figure S1), or when the bodies of two giraffes overlap on the same image, which is the most frequent configuration we faced (see node 2 in Figure 3). In this latter case, this image linked two sets of images corresponding to the two overlapping individuals. Our network-based approach also handles false-negative cases (e.g. two images of the same animal are declared different because of differences in lighting conditions or animal orientation) since community detection is robust to possibly missing edges: indeed, a missing edge can be compensated by the other edges inside a cluster. This step is fully reproducible and applicable to other animal species, as long as a feature matching algorithm can be used, be it SIFT or any alternative method such as Oriented FAST and rotated BRIEF (ORB; Rublee et al., 2011), or deep features (Dusmanu et al., 2019; Ma et al., 2020).

We tackled the problem of animal re-identification, literally detecting and identifying previously seen animals, considering that we had a 'reference book' with photographs of these known individuals. This fits the needs of field researchers who want to monitor the fate of animals by regularly adding new observations in time, for instance by collecting photographs with camera traps. To do so, we evaluated the possibility of using the rapidly developing convolutional neural networks in a supervised learning framework to achieve deep metric learning. Solving this problem was particularly challenging because of the small size of our dataset. Previous studies on animal re-identification with CNN indeed relied on a high number of photographs per individual (Ferreira et al., 2020; Schneider et al., 2020). In our case, we had to train the CNN with only a few images per individual (see Snell et al., 2017, on few-shot learning methods), shot in the field with contrasting environmental and light conditions. This situation corresponds to many field studies, particularly on large mammals (possibly with the exception of primates), for which population density and animal detection rate are low, limiting the expected number of photographs per individual. To circumvent this problem, we developed a data augmentation strategy to increase artificially the variability of observation conditions encountered in the training dataset, and improved the model performance substantially.

In terms of overall predictive performance, we reached about 90% Top-1 accuracy, which is comparable to the previously reported performance in animal re-identification of known individuals (see Schneider et al., 2019, for a review) but usually achieved with a much higher number of photographs. The combination of recent deep learning algorithms and data augmentation appears very competitive and efficient, with possible application to difficult practical cases such as endangered or elusive species living at very low abundance, for example the leopard Panthera pardus or the Iberian lynx Lynx pardinus. Compared to the more robust SIFT operator, we found that the performance of the CNN is affected by the orientation of the giraffe body and noticeably by deviations from a perfect side shot. In terms of computing requirements, training our CNN remained time-consuming because the number of images to process is increased dramatically by the data augmentation. This problem is partially counter-balanced by the more computationally efficient calculation of CNN-based distances, whose cost increases linearly with the number of photographs (computing one projection per image), compared to the SIFT-based approach for which the computing time is proportional to the square of the number of photographs (computing one matching per image pair). For instance, we got all distances in a minute with the CNN and about 2 hr with the SIFT operator when applied to the same test dataset (see Table 2).
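
The recast of identification as a clustering problem in an image similarity network, and the quadratic number of pairwise comparisons it implies, can be made concrete with a minimal sketch. It assumes a generic pairwise distance function and links two images whenever their distance is at most a cut-off, then returns the connected components of the resulting network. The scalar 'signatures', the absolute-difference distance and the function name `similarity_components` are hypothetical stand-ins for SIFT-based matching. The subsequent community detection step (InfoMap) that splits large components is not reproduced here.

```python
from itertools import combinations

def similarity_components(n_images, distance, max_distance):
    """Link image pairs with distance <= max_distance.

    Returns the connected components (lists of image indices) and the
    number of pairwise comparisons that were needed.
    """
    parent = list(range(n_images))           # union-find forest, one tree per component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    n_pairs = 0
    for i, j in combinations(range(n_images), 2):
        n_pairs += 1                         # one matching per image pair
        if distance(i, j) <= max_distance:
            parent[find(i)] = find(j)        # merge the two components

    groups = {}
    for i in range(n_images):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), n_pairs

# Toy example: six images summarised by an invented scalar signature.
signatures = [0.0, 0.1, 0.15, 5.0, 5.05, 9.0]
components, n_pairs = similarity_components(
    len(signatures),
    distance=lambda i, j: abs(signatures[i] - signatures[j]),
    max_distance=0.2,
)
print(components, n_pairs)
```

With n images the loop performs n(n - 1)/2 comparisons, whereas a projection-based distance needs only one projection per image before distances are computed in the Euclidean space.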

TABLE 2 Computing time needed to compare 310 representative images versus 121 test images (CNN training with about 5,500 images) extracted from giraffe photographs shot at Hwange National Park, Zimbabwe, between 2014 and 2018. The hardware we used for these calculations was an Intel Xeon CPU E5-2650 v4 2.30 GHz (CPU) and Nvidia Titan X card (GPU)

Task                    Avg. computing time
SIFT-based distance     About 1 hr 45 min
CNN-based distance      About 1 min
CNN training            About 3 hr 45 min (with GPU)

Our approach was also designed to deal with datasets where known and unknown individuals were present. Dealing with unknown individuals is extremely challenging because no image of these new individuals is available in the training dataset. Indeed, most classical CNN-based approaches solve classification problems where the number of classes, the number of individuals for us, is fixed. We showed here that it was possible to filter out unknown from known individuals while re-identifying a large fraction of known individuals at the same time, with a success rate of 80% (for both TP and TN). However, this trade-off came at the cost of a lower Top-1 accuracy, which we acknowledge is not fully satisfying and has already been experienced by other authors (Ferreira et al., 2020). Still, in most cases, we could validate the proposed identification by examining the Top-1 for each query image (i.e. checking its closest image) for both known and unknown individuals. Despite not being fully automated, our CNN approach would require little human intervention.

To what extent could the performance of our CNN-based pipeline be improved with more data? Since it is suitable for any species, further data analysis on other species will help answer this question. However, additional strategies would help, including the integration of contextual information (Beery et al., 2019; Terry et al., 2020) such as time, GPS positioning or animal social context. Using accurate segmentation of the animal body (Brodrick et al., 2019; He et al., 2017) will undoubtedly be a solution against side effects of rectangular cropping. Moreover, this pipeline can be used in an active learning strategy where the machine learning model is assisted by human intervention on some specific cases (Norouzzadeh et al., 2021). Indeed, using the proposed distance threshold in the Euclidean space, one can iteratively enrich the training dataset after manual checking of the most confident Top-1 candidates (below a small distance threshold, to guarantee optimal TN rate) and re-run the estimation procedure.

Finally, this inter-disciplinary work provides guidelines about best practices to collect identification images in the field, if they are to be used later with an automated pipeline such as the one presented here. Better results can be achieved with simple rules for framing animals with cameras. First, the field operator should try to avoid as much as possible overlapping bodies of two or more individuals, as this was the most acute issue in our giraffe experience. Note that several well-separated individuals in the same photograph are not a problem at all thanks to the CNN cropping performed at the preliminary stage. Another point to pay attention to is the background which, if too similar across images (e.g. photographs shot from the very same spot) with obvious structures (tree, pond, rocks, etc.), will likely mislead the computer vision algorithm, even on cropped images, because cropping is rectangular and does not delineate the animal body. This situation often arises while photographing animals moving in line, as giraffes and many others often do. A last point is the heterogeneity of situations under which animals were observed. We did our best to improve the training dataset with data augmentation; however, photographing animals in as many different conditions as possible could improve the results. This includes light conditions (dawn, dusk, noon), orientation of the individual or background (open vs. more densely vegetated areas). More specific to CNN re-identification is the need to have a greater number of photographs per individual (>50) than what is currently available, so particular attention should be given, in the field under optimal shooting conditions, to the opportunity to take more photographs of each observed individual.

ACKNOWLEDGEMENTS
We thank Jeanne Duhayer for her considerable help in analysing our preliminary findings, and Laurent Jacob and Franck Picard for their insights on deep learning. This work was performed using the computing facilities of the CC LBBE/PRABI. Funding was provided by the French National Center for Scientific Research (CNRS) and the Statistical Ecology Research Group (EcoStat) of the CNRS. We are also grateful to Derek Lee for his kind advice in processing photographs, and for sharing with us his experience in the monitoring of giraffes. Finally, we acknowledge the director of the Zimbabwe Parks and Wildlife Management Authority for authorizing this research, and support from the CNRS Zone Atelier/LTSER program for fieldwork and some of the photographs (collection by P.A. Seeber).

AUTHORS' CONTRIBUTIONS
V.M., D.A. and C.B. conceived the study with some inputs from S.C.-J.; V.M. and G.D. developed the approach and performed the analysis; V.M. and S.C.-J. supervised G.D.; D.A. and C.B. provided the photographs; B.S. set up the computing architecture. All authors contributed to the writing of the manuscript.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13577.

DATA AVAILABILITY STATEMENT
The curated dataset of re-identified giraffe individuals is freely available at ftp://pbil.univ-lyon1.fr/pub/datasets/miele2021. The code to reproduce the analysis is available at https://plmlab.math.cnrs.fr/vmiele/animal-reid/ with explanations and test cases.

ORCID
Vincent Miele https://orcid.org/0000-0001-7584-0088
Simon Chamaillé-Jammes https://orcid.org/0000-0003-0505-6620
Christophe Bonenfant https://orcid.org/0000-0002-9924-419X

REFERENCES

Achlioptas, D., D'Souza, R. M., & Spencer, J. (2009). Explosive percolation in random networks. Science, 323, 1453–1455. https://doi.org/10.1126/science.1167782
Beery, S., Van Horn, G., & Perona, P. (2018). Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV) (pp. 456–473). Springer.
Beery, S., Wu, G., Rathod, V., Votel, R., & Huang, J. (2019). Context R-CNN: Long term temporal context for per-camera object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13075–13085.
Bellavia, F., & Colombo, C. (2020). Is there anything new to say about sift matching? International Journal of Computer Vision, 128, 1847–1866.
Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
Bogucki, R., Cygan, M., Khan, C. B., Klimek, M., Milczek, J. K., & Mucha, M. (2019). Applying deep learning to right whale photo identification. Conservation Biology, 33, 676–684. https://doi.org/10.1111/cobi.13226
Bolger, D. T., Morrison, T. A., Vance, B., Lee, D., & Farid, H. (2012). A computer-assisted system for photographic mark–recapture analysis. Methods in Ecology and Evolution, 3, 813–822. https://doi.org/10.1111/j.2041-210X.2012.00212.x
Bolger, D., Vance, B., Morrison, T., & Farid, H. (2011). Wild id user guide: Pattern extraction and matching software for computer-assisted photographic mark. Retrieved from https://github.com/ConservationInternational/Wild.ID/
Bouma, S., Pawley, M. D. M., Hupman, K., & Gilman, A. (2019). Individual common dolphin identification via metric embedding learning. arXiv preprint arXiv:1901.03662.
Bradski, G. (2000). The OpenCV library. Dr Dobb's Journal of Software Tools.
Brodrick, P. G., Davies, A. B., & Asner, G. P. (2019). Uncovering ecological patterns with convolutional neural networks. Trends in Ecology & Evolution, 34(8), 734–745. https://doi.org/10.1016/j.tree.2019.03.006
Buehler, P., Carroll, B., Bhatia, A., Gupta, V., & Lee, D. E. (2019). An automated program to find animals and crop photographs for individual recognition. Ecological Informatics, 50, 191–196. https://doi.org/10.1016/j.ecoinf.2019.02.003
Chamaillé-Jammes, S., Valeix, M., Bourgarel, M., Murindagomo, F., & Fritz, H. (2009). Seasonal density estimates of common large herbivores in Hwange national park, Zimbabwe. African Journal of Ecology, 47, 804–808. https://doi.org/10.1111/j.1365-2028.2009.01077.x
Chen, P., Swarup, P., Wojciech, M. M., Kong, A. W. K., Han, S., Zhang, Z., & Rong, H. (2020). A study on giant panda recognition based on images of a large proportion of captive pandas. Ecology and Evolution, 10(7), 3561–3573. https://doi.org/10.1002/ece3.6152
Christin, S., Hervet, E., & Lecomte, N. (2019). Applications for deep learning in ecology. Methods in Ecology and Evolution, 10, 1632–1644.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486, 75–174. https://doi.org/10.1016/j.physrep.2009.11.002
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE conference on computer vision and pattern recognition (CVPR). IEEE.
Halloran, K. M., Murdoch, J. D., & Becker, M. S. (2015). Applying computer-aided photo-identification to messy datasets: A case study of Thornicroft's giraffe (Giraffa camelopardalis thornicrofti). African Journal of Ecology, 53, 147–155.
Hansen, M. F., Smith, M. L., Smith, L. N., Salter, M. G., Baxter, E. M., Farish, M., & Grieve, B. (2018). Towards on-farm pig face recognition using convolutional neural networks. Computers in Industry, 98, 145–152. https://doi.org/10.1016/j.compind.2018.02.016
Hartog, J., & Reijns, R. (2014). Interactive individual identification system (I3S). Free Software Foundation Inc.
Hayasaka, S. (2016). Explosive percolation in thresholded networks. Physica A: Statistical Mechanics and its Applications, 451, 1–9. https://doi.org/10.1016/j.physa.2016.01.001
Hayes, L. D., & Schradin, C. (2017). Long-term field studies of mammals: What the short-term study cannot tell us. Journal of Mammalogy, 98, 600–602. https://doi.org/10.1093/jmammal/gyx027
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969). IEEE.
He, Q., Zhao, Q., Liu, N., Chen, P., Zhang, Z., & Hou, R. (2019). Distinguishing individual red pandas from their faces. In Chinese conference on pattern recognition and computer vision (PRCV) (pp. 714–724). Springer.
Hermans, A., Beyer, L., & Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Hoffer, E., & Ailon, N. (2015). Deep metric learning using triplet network. In International workshop on similarity-based pattern recognition (pp. 84–92). Springer.
Körschens, M., Barz, B., & Denzler, J. (2018). Towards automatic identification of elephants in the wild. arXiv preprint arXiv:1812.04418.
Lamba, A., Cassey, P., Segaran, R. R., & Koh, L. P. (2019). Deep learning for environmental conservation. Current Biology, 29, R977–R982. https://doi.org/10.1016/j.cub.2019.08.016
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980–2988). IEEE.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740–755). Springer.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
Ma, J., Jiang, X., Fan, A., Jiang, J., & Yan, J. (2020). Image matching from handcrafted to deep features: A survey. International Journal of
    https://doi.org/10.1111/2041-­210X.13256
                                                                                        Computer Vision, 129, 23–­79.
Clutton-­Brock, T., & Sheldon, B. C. (2010). Individuals and populations:
                                                                                   Moskvyak, O., Maire, F., Armstrong, A. O., Dayoub, F., & Baktashmotlagh,
    The role of long-­term, individual-­based studies of animals in ecology
                                                                                        M. (2019). Robust re-­identification of manta rays from natural
    and evolutionary biology. Trends in Ecology & Evolution, 25, 562–­573.
                                                                                        markings by learning pose invariant embeddings. arXiv preprint
    https://doi.org/10.1016/j.tree.2010.08.002
                                                                                        arXiv:1902.10847.
Dusmanu, M., Rocco, I., Pajdla, T., Pollefeys, M., Sivic, J., Torii, A., &
                                                                                   Moya, Ó., Mansilla, P. L., Madrazo, S., Igual, J. M., Rotger, A., Romano, A.,
    Sattler, T. (2019). D2-­Net: A trainable CNN for joint description and
                                                                                        & Tavecchia, G. (2015). Aphis: A new software for photo-­matching
    detection of local features. In Proceedings of the IEEE conference on
                                                                                        in ecological studies. Ecological Informatics, 27, 64–­70. https://doi.
    computer vision and pattern recognition (pp. 8092–­8101). IEEE.
                                                                                        org/10.1016/j.ecoinf.2015.03.003
Estes, R. D. (1991). The behavior guide to African mammals: Including
                                                                                   Muller, Z., Bercovitch, F., Brand, R., Brown, D., Brown, M., Bolger,
    hoofed mammals, carnivores. In Primates (pp. 509–­519). Univ of
                                                                                        D., Carter, K., Deacon, F., Doherty, J., Fennessy, J., Fennessy, S.,
    California Press.
                                                                                        Hussein, A., Lee, D., Marais, A., Strauss, M., Tutchings, A., & Wube,
Ferreira, A. C., Silva, L. R., Renna, F., Brandl, H. B., Renoult, J. P., Farine,
                                                                                        T. (2018). Giraffa camelopardalis (amended version of 2016 assess-
    D. R., Covas, R., & Doutrelant, C. (2020). Deep learning-­based meth-
                                                                                        ment). The IUCN Red List of threatened species 2018: e.t9194a1362​
    ods for individual recognition in small birds. Methods in Ecology and
                                                                                        66699.
    Evolution, 11, 1072–­1085. https://doi.org/10.1111/2041-­210X.13436

Norouzzadeh, M. S., Morris, D., Beery, S., Joshi, N., Jojic, N., & Clune, J. (2021). A deep active learning system for species identification and counting in camera trap images. Methods in Ecology and Evolution, 12, 150–161. https://doi.org/10.1111/2041-210X.13504
Parham, J., Stewart, C., Crall, J., Rubenstein, D., Holmberg, J., & Berger-Wolf, T. (2018). An animal detection pipeline for identification. In 2018 IEEE winter conference on applications of computer vision (WACV) (pp. 1075–1083). IEEE.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788). IEEE.
Renó, V., Dimauro, G., Labate, G., Stella, E., Fanizza, C., Cipriano, G., Carlucci, R., & Maglietta, R. (2019). A sift-based software system for the photo-identification of the Risso's dolphin. Ecological Informatics, 50, 95–101. https://doi.org/10.1016/j.ecoinf.2019.01.006
Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105, 1118–1123. https://doi.org/10.1073/pnas.0706851105
Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). Orb: An efficient alternative to sift or surf. In 2011 International conference on computer vision (pp. 2564–2571). IEEE.
Sadegh Norouzzadeh, M., Morris, D., Beery, S., Joshi, N., Jojic, N., & Clune, J. (2019). A deep active learning system for species identification and counting in camera trap images. arXiv preprint arXiv:1910.09716.
Schneider, S., Taylor, G. W., & Kremer, S. (2018). Deep learning object detection methods for ecological camera trap data. In 2018 15th conference on computer and robot vision (CRV) (pp. 321–328). IEEE.
Schneider, S., Taylor, G. W., & Kremer, S. C. (2020). Similarity learning networks for animal individual re-identification-beyond the capabilities of a human observer. In Proceedings of the IEEE winter conference on applications of computer vision workshops (pp. 44–52). IEEE.
Schneider, S., Taylor, G. W., Linquist, S., & Kremer, S. C. (2019). Past, present and future approaches using computer vision for animal re-identification from camera trap data. Methods in Ecology and Evolution, 10, 461–470. https://doi.org/10.1111/2041-210X.13133
Schofield, D., Nagrani, A., Zisserman, A., Hayashi, M., Matsuzawa, T., Biro, D., & Carvalho, S. (2019). Chimpanzee face recognition from videos in the wild using deep learning. Science Advances, 5, eaaw0736. https://doi.org/10.1126/sciadv.aaw0736
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 815–823). IEEE.
Shin, H. C., Roth, H. R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., & Summers, R. M. (2016). Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35, 1285–1298. https://doi.org/10.1109/TMI.2016.2528162
Silvy, N. J., Lopez, R. R., & Peterson, M. J. (2005). Wildlife marking techniques. In Techniques for wildlife investigations and management (pp. 339–376). The Wildlife Society.
Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. In Advances in neural information processing systems (pp. 4077–4087). MIT Press.
Terry, J. C. D., Roy, H. E., & August, T. A. (2020). Thinking like a naturalist: Enhancing computer vision of citizen science images by harnessing contextual data. Methods in Ecology and Evolution, 11, 303–315. https://doi.org/10.1111/2041-210X.13335
Wang, B., Pourshafeie, A., Zitnik, M., Zhu, J., Bustamante, C. D., Batzoglou, S., & Leskovec, J. (2018). Network enhancement as a general method to denoise weighted biological networks. Nature Communications, 9, 1–8. https://doi.org/10.1038/s41467-018-05469-x
Weinstein, B. G. (2018). A computer vision for animal ecology. Journal of Animal Ecology, 87, 533–545. https://doi.org/10.1111/1365-2656.12780
Willi, M., Pitman, R. T., Cardoso, A. W., Locke, C., Swanson, A., Boyer, A., Veldthuis, M., & Fortson, L. (2019). Identifying animal species in camera trap images using deep learning and citizen science. Methods in Ecology and Evolution, 10, 80–91. https://doi.org/10.1111/2041-210X.13099
Wu, D., Zheng, S. J., Zhang, X. P., Yuan, C. A., Cheng, F., Zhao, Y., Lin, Y. J., Zhao, Z. Q., Jiang, Y. L., & Huang, D. S. (2019). Deep learning-based methods for person re-identification: A comprehensive review. Neurocomputing, 337, 354–371. https://doi.org/10.1016/j.neucom.2019.01.079
Zheng, L., Yang, Y., & Hauptmann, A. G. (2016). Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section.

How to cite this article: Miele V, Dussert G, Spataro B, Chamaillé-Jammes S, Allainé D, Bonenfant C. Revisiting animal photo-identification using deep metric learning and network analysis. Methods Ecol Evol. 2021;00:1–11. https://doi.org/10.1111/2041-210X.13577