FPN-based Network for Panoptic Segmentation - Caribbean Team 2 Institute of Automation, CAS - COCO dataset

FPN-based Network for Panoptic Segmentation

Caribbean Team
Yanwei Li*1,2, Naiyu Gao*1,2, Chaoxu Guo1,2, Xinze Chen1, Qian Zhang1,
Guan Huang1, Xin Zhao2, Kaiqi Huang2, Dalong Du1, Chang Huang1

1 Horizon Robotics
2 Institute of Automation, CAS
[Figure: overall pipeline. Labels: Input; Shared FPN backbone (FPN-based backbone); Instance head; Function Head; Semantic head; Result Choose; merge.]
[Background image: Figure 1 of Lin et al., with panels (a) Featurized image pyramid, (b) Single feature map, (c) Pyramidal feature hierarchy, (d) Feature Pyramid Network.]

Kirillov A, He K, Girshick R B, et al. Panoptic Segmentation.[J]. arXiv: Computer Vision and Pattern Recognition, 2018.
Lin, Tsung-Yi, et al. "Feature Pyramid Networks for Object Detection." CVPR. Vol. 1. No. 2. 2017.
FPN Architecture

Panoptic Segmentation (Thing)
       Unified Framework
       • Mask R-CNN architecture
       • Shared FPN backbone

[Figure: Feature Pyramid Network (FPN) [3]. Bottom-up stages at scales 1 (image), 1/4, 1/8, 1/16, 1/32 with 3, 256, 512, 1024, 2048 channels. Each pyramid level has 256 channels, ranging from low resolution, strong features (1/32) to high resolution, strong features (1/4). Outputs: Classification, Regression, Mask. Panels: Shared FPN backbone; Seg Head for Thing.]

[1] He, K., Zhang, X., Ren, S., & Sun, J. Deep residual learning for image recognition. CVPR 2016.
[2] Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. Aggregated residual transformations for deep neural networks. CVPR 2017.
[3] Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. Feature pyramid networks for object detection. CVPR 2017.
He K, Gkioxari G, Dollar P, et al. Mask R-CNN[J]. ICCV 2017
Lin, Tsung-Yi, et al. "Feature Pyramid Networks for Object Detection." CVPR. Vol. 1. No. 2. 2017.
Hu J, Shen L, Sun G, et al. Squeeze-and-Excitation Networks[J]. CVPR 2018.
Cai Z, Vasconcelos N. Cascade R-CNN: Delving Into High Quality Object Detection[J]. CVPR 2018.
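
The FPN figure on this slide can be illustrated with a minimal NumPy sketch of the top-down pathway. It is not the team's implementation: the function names, the random weights, and the 64-pixel input are assumptions made for illustration, and the 3x3 smoothing convolution that FPN applies after each merge is omitted for brevity. Only the scales (1/4 to 1/32), the backbone channel counts (256 to 2048), and the 256-channel pyramid width come from the slide.

```python
import numpy as np


def lateral_1x1(feat, weight):
    """1x1 convolution: feat is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, feat)


def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(feats, weights):
    """Build pyramid levels; feats are ordered from the 1/4 to the 1/32 scale."""
    laterals = [lateral_1x1(f, w) for f, w in zip(feats, weights)]
    outputs = [laterals[-1]]  # the coarsest level starts the pathway
    for lateral in reversed(laterals[:-1]):
        # Add the upsampled coarser level to the lateral connection.
        outputs.append(lateral + upsample2x(outputs[-1]))
    return outputs[::-1]  # back to fine-to-coarse order


# Hypothetical toy input: random backbone features for a 64x64 image.
rng = np.random.default_rng(0)
image_size = 64
strides = [4, 8, 16, 32]
channels = [256, 512, 1024, 2048]
feats = [
    rng.standard_normal((c, image_size // s, image_size // s))
    for c, s in zip(channels, strides)
]
weights = [rng.standard_normal((256, c)) * 0.01 for c in channels]

pyramid = fpn_top_down(feats, weights)
pyramid_shapes = [p.shape for p in pyramid]
print(pyramid_shapes)
```

Every level ends up with 256 channels, which is what lets the thing and stuff heads share one backbone output.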
Panoptic Segmentation (Thing)
        Unified Framework
        • Mask R-CNN architecture
        • Shared FPN backbone
        Stronger Network
        • SENet154
        • Deformable Conv.
        • Nonlocal Conv.
        Bbox Head
        • Cascade R-CNN

[Figure: a Squeeze-and-Excitation block, shown as a page excerpt from Hu et al., Squeeze-and-Excitation Networks.]

He K, Gkioxari G, Dollar P, et al. Mask R-CNN[J]. ICCV 2017
Hu J, Shen L, Sun G, et al. Squeeze-and-Excitation Networks[J]. CVPR 2018.
Cai Z, Vasconcelos N. Cascade R-CNN: Delving Into High Quality Object Detection[J]. CVPR 2018.
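
As a rough illustration of the Squeeze-and-Excitation block referenced on this slide, the sketch below recalibrates channels in NumPy: global average pooling (squeeze), two fully connected layers with ReLU and sigmoid (excitation), then channel-wise scaling. The weights, the reduction ratio of 4, and the tensor sizes are assumptions for the example; SENet154 as used by the team is far larger and is not reproduced here.

```python
import numpy as np


def se_block(x, w1, w2):
    """Recalibrate a (C, H, W) feature map; w1 is (C // r, C), w2 is (C, C // r)."""
    squeezed = x.mean(axis=(1, 2))                     # squeeze: one value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)            # excitation: FC + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))       # FC + sigmoid, values in (0, 1)
    return x * gates[:, None, None], gates             # scale each channel by its gate


# Hypothetical toy tensors: 8 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))

y, gates = se_block(x, w1, w2)
print(y.shape, gates.round(3))
```

The block keeps the input shape, which is why it can be dropped into an existing backbone without changing the heads that follow.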
Panoptic Segmentation (Thing)
  Unified Framework
  • Mask R-CNN architecture
  • Shared FPN backbone
  Stronger Network
  • SENet154
  • Deformable Conv.
  • Nonlocal Conv.
  Bbox Head
  • Cascade R-CNN
  Mask Head
  • Cascade Mask
    Exploits RoIs from the box head
    to refine the mask results.
  Test-tricks

[Figure: Cascade Mask head. Labels: RoIs of first stage from Cascade R-CNN; RoIs of second stage from Cascade R-CNN; Share params; RoI Mask; Concat.; RoI Feature.]
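
The cascade idea behind the Bbox Head bullet can be sketched as follows. Cascade R-CNN trains successive box heads with increasing IoU thresholds; the values 0.5, 0.6, and 0.7 below are those of the original Cascade R-CNN paper, and the slides do not say which thresholds the team used. The function names and the toy boxes are hypothetical, and the mask refinement step of Cascade Mask is not modelled.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(proposals, gt_box, thresholds=(0.5, 0.6, 0.7)):
    """For each cascade stage, flag the proposals that count as positives."""
    return [[box_iou(p, gt_box) >= t for p in proposals] for t in thresholds]


# Hypothetical ground-truth box and proposals of decreasing quality.
gt_box = (0.0, 0.0, 10.0, 10.0)
proposals = [
    (0.0, 0.0, 10.0, 10.0),   # IoU 1.00
    (0.0, 0.0, 10.0, 6.5),    # IoU 0.65
    (0.0, 0.0, 10.0, 5.5),    # IoU 0.55
    (5.0, 5.0, 15.0, 15.0),   # IoU about 0.14
]
flags = positives_per_stage(proposals, gt_box)
for threshold, stage_flags in zip((0.5, 0.6, 0.7), flags):
    print(threshold, stage_flags)
```

Later stages accept fewer, better-localized proposals, which is the property the Cascade Mask head relies on when it reuses RoIs from the box head.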
Shared FPN Inference Thing

[Chart built up over successive slides; axis labels and most values did not survive extraction. Legible values: 45.9%, 51.7%.]

Steps shown cumulatively across the slides:
• ResNet50 baseline
• Using multi-scale training
• Adopt DCN & Nonlocal & Adaptive RoI pooling
• Cascade RCNN
• Cascade Mask
• Using test tricks
• Using larger model & BN
Panoptic Segmentation (Stuff)

        Unified Framework
        • FCN-based architecture
        • Shared FPN backbone
        Stronger Network
        • SENet154
        • Deformable Conv.
        • Nonlocal Conv.
        Test-tricks
        • Flip
        • Multi-Scale testing
        • Other tricks
[Figure: Shared FPN backbone; Seg Head for Stuff]

Lin, Tsung-Yi, et al. "Feature Pyramid Networks for Object Detection." CVPR. Vol. 1. No. 2. 2017.
Hu J, Shen L, Sun G, et al. Squeeze-and-Excitation Networks[J]. CVPR 2018.
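
The "Flip" test trick listed on this slide is commonly implemented as below: run the network on the image and on its horizontal mirror, mirror the second set of scores back, and average. This is a generic sketch rather than the team's code; `toy_predict` is a hypothetical stand-in for the stuff segmentation network, and multi-scale testing and the unnamed "other tricks" are not covered.

```python
import numpy as np


def flip_tta(predict, image):
    """Average per-class scores over an (H, W) image and its horizontal flip."""
    scores = predict(image)                       # (K, H, W)
    flipped_scores = predict(image[:, ::-1])      # run on the mirrored image
    unflipped = flipped_scores[:, :, ::-1]        # mirror the scores back
    return (scores + unflipped) / 2.0


def toy_predict(image):
    """Hypothetical 2-class model; channel 1 is deliberately not flip-symmetric."""
    left_context = np.cumsum(image, axis=1) / image.shape[1]
    return np.stack([image, left_context])


image = np.arange(12, dtype=float).reshape(3, 4) / 12.0
avg_scores = flip_tta(toy_predict, image)
labels = avg_scores.argmax(axis=0)                # per-pixel stuff label
print(avg_scores.shape, labels.shape)
```

Averaging scores before the argmax, rather than voting on labels, keeps the confidence information from both passes.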
Shared FPN Inference Stuff

[Chart built up over successive slides; axis labels and most values did not survive extraction. Legible values: 45.9%, 61.95%, 34.7%.]

Steps shown cumulatively across the slides:
• ResNet50 baseline
• Using all test skills
• Adopt larger model
Panoptic Segmentation Merge

      Merge method
      • Sort thing results with scores
      • Thing first, stuff second
      • Merge tricks
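
The merge steps listed above can be sketched as follows. This is an illustrative reading of the bullets, not the team's implementation: instances are pasted in descending score order, stuff fills the pixels that remain, and the rule that drops an instance when less than half of its mask is still free is an assumption borrowed from the heuristic in Kirillov et al. The unnamed "merge tricks" are not modelled, and the category ids are made up.

```python
import numpy as np


def merge_panoptic(instances, semantic, stuff_ids, min_free_ratio=0.5):
    """Combine thing masks and a stuff label map into one segment-id map."""
    panoptic = np.zeros(semantic.shape, dtype=int)    # 0 means unassigned
    segments = []
    next_id = 1
    # Thing first: highest-scoring instances claim pixels before weaker ones.
    for inst in sorted(instances, key=lambda d: d["score"], reverse=True):
        area = int(inst["mask"].sum())
        free = inst["mask"] & (panoptic == 0)
        if area == 0 or free.sum() / area < min_free_ratio:
            continue                                  # mostly hidden by better instances
        panoptic[free] = next_id
        segments.append({"id": next_id, "category": inst["category"], "isthing": True})
        next_id += 1
    # Stuff second: fill only the pixels no thing has claimed.
    for category in stuff_ids:
        region = (semantic == category) & (panoptic == 0)
        if region.any():
            panoptic[region] = next_id
            segments.append({"id": next_id, "category": category, "isthing": False})
            next_id += 1
    return panoptic, segments


def rect_mask(rows, cols, shape=(4, 4)):
    """Boolean mask that is True on the given row and column slices."""
    mask = np.zeros(shape, dtype=bool)
    mask[rows, cols] = True
    return mask


# Hypothetical toy scene: three overlapping instances on a 4x4 image.
instances = [
    {"mask": rect_mask(slice(0, 2), slice(0, 2)), "score": 0.3, "category": 1},
    {"mask": rect_mask(slice(0, 2), slice(1, 4)), "score": 0.8, "category": 2},
    {"mask": rect_mask(slice(0, 2), slice(0, 2)), "score": 0.9, "category": 1},
]
semantic = np.full((4, 4), 100)                       # one stuff class everywhere
panoptic, segments = merge_panoptic(instances, semantic, stuff_ids=[100])
print(panoptic)
```

In the toy scene the lowest-scoring instance is fully covered by a better one and is dropped, while the second instance keeps only its unclaimed pixels.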

Kirillov A, He K, Girshick R B, et al. Panoptic Segmentation.[J]. arXiv: Computer Vision and Pattern Recognition, 2018.
Visualize Results
Failure Results

[Figure columns: Input, Baseline, Ensemble]
Thanks & questions

For more questions, please contact:
      liyanwei2017@ia.ac.cn