ODANet: Online Deep Appearance Network for Identity-Consistent Multi-Person Tracking
Guillaume Delorme¹, Yutong Ban², Guillaume Sarrazin¹, Xavier Alameda-Pineda¹
¹ Inria, LJK, Univ. Grenoble Alpes, France
² MIT CSAIL, Distributed Robotics Lab
10/01/2021, MPRSS 2020, Milano, Italy
Tracking by detection
Multiple Object Tracking
• Our tracker can be seen as a generalization of a Kalman filter dealing with multiple objects at once.
• Dealing with multiple objects/targets introduces a detection-to-target assignment problem. The tracker alternates between a GMM responsibility computation and a weighted Kalman forward pass, as sketched below.
• To fully disambiguate the assignment problem, we need a discriminative appearance model that adapts to the situation at hand.
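As an illustration, here is a minimal sketch of this alternation: a GMM-style E-step computes soft detection-to-target responsibilities, and each target's Kalman correction is driven by the responsibility-weighted detections. Shapes, noise models, and the weighting scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def responsibilities(detections, mus, covs, priors):
    """E-step: soft assignment of each detection to each target (GMM responsibilities).

    detections: (K, d) array; mus, covs, priors: per-target Gaussian parameters.
    """
    r = np.zeros((len(detections), len(mus)))
    for i, y in enumerate(detections):
        for n, (mu, S) in enumerate(zip(mus, covs)):
            d = y - mu
            r[i, n] = priors[n] * np.exp(-0.5 * d @ np.linalg.solve(S, d))
    return r / r.sum(axis=1, keepdims=True)

def weighted_kalman_update(mu, P, detections, r_n, H, R):
    """Kalman correction using the responsibility-weighted mean of the detections."""
    w = max(r_n.sum(), 1e-8)
    y_bar = (r_n[:, None] * detections).sum(axis=0) / w   # weighted pseudo-observation
    S = H @ P @ H.T + R / w                                # noise inflated when weights are soft
    K = P @ H.T @ np.linalg.inv(S)
    mu_new = mu + K @ (y_bar - H @ mu)
    P_new = (np.eye(len(mu)) - K @ H) @ P
    return mu_new, P_new
```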
Appearance modeling
The observation model
A key component of our model is the definition of the observation model, which determines which track the observation $o_t$ should be associated with:

$$p(o_t \mid x_t, Z_t = n) = \underbrace{p(y_t \mid x_t, Z_t = n)}_{\text{geometric model: the closest one}} \times \underbrace{p(u_t \mid Z_t = n)}_{\text{appearance model: the most similar one}} \quad (1)$$
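In log space, Eq. (1) becomes a sum of a geometric score and an appearance score. The sketch below assumes a Gaussian geometric term around the predicted position and a Gaussian appearance term around the track's embedding template; both distributional choices are assumptions made for illustration.

```python
import numpy as np

def observation_loglik(y, u_embed, track_mu, track_cov, track_embed, sigma_a=1.0):
    """log p(o_t | x_t, Z_t = n) up to constants: geometric term + appearance term.

    y: detected box geometry; u_embed: embedded appearance psi(u_t).
    """
    d = y - track_mu
    geo = -0.5 * d @ np.linalg.solve(track_cov, d)                   # geometric model
    app = -np.sum((u_embed - track_embed) ** 2) / (2 * sigma_a**2)   # appearance model
    return geo + app                                                  # log of the product in Eq. (1)
```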
Histogram-based appearance model
• Previous strategy: use hand-crafted descriptors and metrics to compute appearance similarity, as in the sketch below.
• Such descriptors lack discriminative power and robustness under appearance variations (illumination, pose, background, occlusions, ...).
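For concreteness, a typical instance of this previous strategy is a color histogram compared with the Bhattacharyya coefficient. The bin count and color space here are assumptions, not the exact descriptor used by the baseline.

```python
import numpy as np

def color_histogram(crop, bins=8):
    """crop: (H, W, 3) uint8 person patch -> normalized joint RGB histogram."""
    hist, _ = np.histogramdd(crop.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def bhattacharyya_similarity(h1, h2):
    """Similarity of two normalized histograms; 1.0 means identical."""
    return np.sum(np.sqrt(h1 * h2))
```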
Deep appearance model
Inspired by model-based tracking methods [1], our strategy is to learn the descriptor instead.
• Deep appearance models are generally trained offline, on large manually annotated datasets [2].
• While they seek generality, they lack discriminative power for the specific scene at hand.
• We want to train a network ψω using past detections annotated by the tracker itself, and to update it every few frames.

[1] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. "Deep metric learning for visual tracking". IEEE Transactions on Circuits and Systems for Video Technology 26.11 (2016), pp. 2056–2068.
[2] Siyu Tang et al. "Multiple People Tracking by Lifted Multicut and Person Re-identification". CVPR 2017, pp. 3701–3710.
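One possible shape for ψω is sketched below: a convolutional trunk pretrained offline and kept frozen, with a small trainable embedding head updated online. The backbone choice and layer sizes are assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
import torchvision

class AppearanceNet(nn.Module):
    """Hypothetical psi_omega: frozen pretrained conv features + trainable top layers."""

    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # conv trunk + pooling
        for p in self.features.parameters():
            p.requires_grad = False          # frozen: pretrained offline
        self.head = nn.Sequential(           # trainable: updated online by the tracker
            nn.Flatten(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, u):                    # u: (B, 3, H, W) person crops
        return nn.functional.normalize(self.head(self.features(u)), dim=1)
```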
Appearance model update
We update only the top layers of ψω in a siamese setting, using the contrastive loss

$$\mathcal{J}(\omega) = \frac{1}{2} \sum_{i,j} \max\big(0,\, 1 - l_{ij}\,(\tau - \|\psi_\omega(u_i) - \psi_\omega(u_j)\|^2)\big),$$

which needs to be supervised with binary (+/−) labels. Since we have access to the past posterior estimate q(z_t), we use soft labeling instead:

$$\gamma_{ij} = p(Z_{t_i k_i} = Z_{t_j k_j} \mid o_{1:t-1}) \approx \sum_{n=1}^{N} q(Z_{t_i k_i} = n)\, q(Z_{t_j k_j} = n).$$

We label positive pairs with $l_{ij} = \gamma_{ij}$ and negative pairs with $l_{ij} = -(1 - \gamma_{ij})$.
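A minimal PyTorch sketch of this soft-label contrastive loss follows. Splitting pairs into positives and negatives at γij > 0.5 and the margin τ = 1 are assumptions for illustration.

```python
import torch

def soft_contrastive_loss(emb, gamma, tau=1.0):
    """emb: (K, D) embeddings psi_omega(u_i); gamma: (K, K) soft same-identity probabilities."""
    d2 = torch.cdist(emb, emb) ** 2                        # ||psi(u_i) - psi(u_j)||^2
    l = torch.where(gamma > 0.5, gamma, -(1.0 - gamma))    # l_ij = gamma_ij or -(1 - gamma_ij)
    return 0.5 * torch.clamp(1.0 - l * (tau - d2), min=0).sum()
```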
Implementation details
• To run smoothly, we use two models in parallel, one for training and one for inference (see the sketch below). Our implementation reaches 10 FPS in our framework.
• The convolutional layers of ψ are pretrained on an external person Re-ID dataset, using a standard training framework [3].

[3] Ergys Ristani et al. "Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking". ECCV Workshops, 2016.
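One plausible reading of the two-model scheme is sketched below: one network accumulates online gradient updates while a frozen copy serves the tracker, with the inference weights refreshed periodically. The sync interval is an assumption.

```python
import copy
import torch.nn as nn

SYNC_EVERY = 10   # frames between weight refreshes (an assumption)

train_model = nn.Linear(512, 128)                 # stands in for psi_omega's trainable head
infer_model = copy.deepcopy(train_model).eval()   # frozen copy used by the tracker

def maybe_sync(frame_idx: int) -> None:
    """Periodically copy the freshly trained weights into the inference model."""
    if frame_idx % SYNC_EVERY == 0:
        infer_model.load_state_dict(train_model.state_dict())
```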
Results
Quantitative results: evaluation settings
We evaluate our tracker in a multi-party conversation scenario: the robot has a narrow field of view and often changes position, which increases identity switches. We also want to evaluate on a standard dataset, so we use the MOT16 dataset [4], which we divide into two evaluation settings:
• moving surveillance camera, for the sequences with a fixed camera: we simulate camera movement to increase identity switches;
• robot navigating in the crowd, for the sequences where the camera is moving.
We use the CLEAR metrics [5] to evaluate the quality of the tracking results (see the MOTA sketch below).

[4] A. Milan et al. "MOT16: A Benchmark for Multi-Object Tracking". arXiv:1603.00831 [cs], Mar. 2016. url: http://arxiv.org/abs/1603.00831.
[5] Keni Bernardin and Rainer Stiefelhagen. "Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics". EURASIP Journal on Image and Video Processing (2008).
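As a reminder, the headline CLEAR metric reported in the tables below is MOTA, which folds misses, false positives, and identity switches over all frames into a single score:

```python
def mota(false_negatives: int, false_positives: int, id_switches: int,
         total_gt_boxes: int) -> float:
    """CLEAR MOT accuracy: MOTA = 1 - (FN + FP + IDs) / GT, summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / total_gt_boxes
```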
Quantitative results: moving surveillance setting

Model           Rcll (↑)   Prcn (↑)   IDs (↓)   FM (↓)   MOTA (↑)
CH [6]          49.41      88.20      266       759      42.49
ODA-FR          49.53      88.66      195       702      42.97
ODA-UP (Ours)   54.72      86.68      591       976      45.63

Table 1: Results on the moving surveillance camera setting.
• ODA-UP stands for our online deep appearance update.
• ODA-FR refers to the same appearance architecture, but frozen (FR), after training on an external person Re-ID dataset.
• CH stands for the color-histogram-based appearance model.

[6] Yutong Ban et al. "Tracking a varying number of people with a visually-controlled robotic head". IROS 2017, pp. 4144–4151.
Quantitative results: robot navigating in the crowd

Model           Rcll (↑)   Prcn (↑)   IDs (↓)   FM (↓)   MOTA (↑)
CH [6]          45.81      91.80      698       1704     41.15
ODA-FR          45.78      93.12      516       1524     41.97
ODA-UP (Ours)   52.29      90.48      782       1499     46.15

Table 2: Results on the robot navigating in the crowd setting (legend as in Table 1).
Conclusion
Thank you for your attention.