FPN-based Network for Panoptic Segmentation - Caribbean Team 2 Institute of Automation, CAS - COCO dataset
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
FPN-based Network for Panoptic Segmentation Caribbean Team Yanwei Li* 1, 2, Naiyu Gao* 1, 2, Chaoxu Guo 1, 2, Xinze Chen1, Qian Zhang1, Guan Huang1, Xin Zhao2, Kaiqi Huang2, Dalong Du1, Chang Huang1 1 Horizon Robotics, 2 Institute of Automation, CAS
ár1 , Ross Girshick1 , n1 , and Serge Belongie2 archFPN-based (FAIR) Network d Cornell Tech Input Shared FPN backbone predict predict predict predict FPN-based predict backbone ) Featurized image pyramid (b) Single feature map predict predict predict predict predict predict ) Pyramidal feature hierarchy (d) Feature Pyramid Network gure 1. (a) Kirillov Using A, He an image K, Girshick R B, etpyramid to Segmentation.[J]. al. Panoptic build a featurearXiv: pyramid. Computer Vision and Pattern Recognition, 2018. atures are computed Lin, Tsung-Yi, on each et al. "Feature of theNetworks Pyramid image for scales independently, Object Detection." CVPR. Vol. 1. No. 2. 2017. hich is slow to compute. (b) Recent detection systems have opted use only single scale features for faster detection. (c) An alter- tive is to reuse the pyramidal feature hierarchy computed by a
ár1 , Ross Girshick1 , n1 , and Serge Belongie2 archFPN-based (FAIR) Network d Cornell Tech Input Shared FPN backbone predict predict predict Instance head predict FPN-based Function Head predict backbone Semantic head ) Featurized image pyramid (b) Single feature map predict predict predict predict predict predict ) Pyramidal feature hierarchy (d) Feature Pyramid Network gure 1. (a) Kirillov Using A, He an image K, Girshick R B, etpyramid to Segmentation.[J]. al. Panoptic build a featurearXiv: pyramid. Computer Vision and Pattern Recognition, 2018. atures are computed Lin, Tsung-Yi, on each et al. "Feature of theNetworks Pyramid image for scales independently, Object Detection." CVPR. Vol. 1. No. 2. 2017. hich is slow to compute. (b) Recent detection systems have opted use only single scale features for faster detection. (c) An alter- tive is to reuse the pyramidal feature hierarchy computed by a
ár1 , Ross Girshick1 , n1 , and Serge Belongie2 archFPN-based (FAIR) Network d Cornell Tech Input Shared FPN backbone Result Choose predict predict predict Instance head predict FPN-based Function Head merge predict backbone Semantic head ) Featurized image pyramid (b) Single feature map predict predict predict predict predict predict ) Pyramidal feature hierarchy (d) Feature Pyramid Network gure 1. (a) Kirillov Using A, He an image K, Girshick R B, etpyramid to Segmentation.[J]. al. Panoptic build a featurearXiv: pyramid. Computer Vision and Pattern Recognition, 2018. atures are computed Lin, Tsung-Yi, on each et al. "Feature of theNetworks Pyramid image for scales independently, Object Detection." CVPR. Vol. 1. No. 2. 2017. hich is slow to compute. (b) Recent detection systems have opted use only single scale features for faster detection. (c) An alter- tive is to reuse the pyramidal feature hierarchy computed by a
FPN Architecture Feature Pyramid Panoptic Segmentation (Thing) Network (FPN) [3] Unified Framework 1 low resolution • Mask R-CNN architecture 32 2048 256 strong features • Shared FPN backbone 1 16 1024 256 Classification 1 Regression 8 256 512 Mask 1 4 high resolution 256 256 strong features 1 image 3 [1] He, K., Zhang, X., Ren, S., & Sun, J. Deep residual learning for image recognition. CVPR 2016. [2] Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. Aggregated residual transformations for deep neural networks. CVPR 2017. [3] Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. Feature pyramid networks for object detection. CVPR 2017. Shared FPN backbone Seg Head for Thing He K, Gkioxari G, Dollar P, et al. Mask R-CNN[J]. ICCV 2017 Lin, Tsung-Yi, et al. "Feature Pyramid Networks for Object Detection." CVPR. Vol. 1. No. 2. 2017. Hu J, Shen L, Sun G, et al. Squeeze-and-Excitation Networks[J]. CVPR 2018. Cai Z, Vasconcelos N. Cascade R-CNN: Delving Into High Quality Object Detection[J]. CVPR 2018.
Panoptic Segmentation (Thing) Unified Framework • Mask R-CNN architecture • Shared FPN backbone Stronger Network • SENet154 • Deformable Conv. Figure 1: A Squeeze-and-Excitation block. • Nonlocal Conv. features in a class agnostic manner, bolstering the quality of dinality (the size of the set of transformations) [ the shared lower level representations. In later layers, the Multi-branch convolutions can be interpreted as a g SE block becomes increasingly specialised, and responds sation of this concept, enabling more flexible comp to different inputs in a highly class-specific manner. Con- of operators [16, 42, 43, 44]. Recently, composition sequently, the benefits of feature recalibration conducted by have been learned in an automated manner [26, SE blocks can be accumulated through the entire network. have shown competitive performance. Cross-chan The development of new CNN architectures is a chal- relations are typically mapped as new combination lenging engineering task, typically involving the selection tures, either independently of spatial structure [6 of many new hyperparameters and layer configurations. By jointly by using standard convolutional filters [24] w contrast, the design of the SE block outlined above is sim- convolutions. Much of this work has concentrate ple, and can be used directly with existing state-of-the-art objective of reducing model and computational com architectures whose modules can be strengthened by direct reflecting an assumption that channel relationship replacement with their SE counterparts. He K, Gkioxari G, Dollar P, et al. Mask R-CNN[J]. ICCV 2017 formulated as a composition of instance-agnostic f Moreover, Hu J, Shen L, Sun G, et al. Squeeze-and-Excitation Networks[J]. CVPR as shown 2018. in Sec. 4, SE blocks are computa- with local receptive fields. In contrast, we claim tha tionally lightweight and impose only a slight increase in ing the unit with a mechanism to explicitly model d Cai Z, Vasconcelos N. Cascade R-CNN: Delving Into High Quality Object Detection[J]. CVPR 2018. model complexity and computational burden. To support non-linear dependencies between channels using g these claims, we develop several SENets and provide an formation can ease the learning process, and sign extensive evaluation on the ImageNet 2012 dataset [34]. enhance the representational power of the network. To demonstrate their general applicability, we also present Attention and gating mechanisms. Attention results beyond ImageNet, indicating that the proposed ap- viewed, broadly, as a tool to bias the allocation of a proach is not restricted to a specific dataset or a task.
Panoptic Segmentation (Thing) Unified Framework • Mask R-CNN architecture • Shared FPN backbone Stronger Network • SENet154 • Deformable Conv. Figure 1: A Squeeze-and-Excitation block. • Nonlocal Conv. Bbox Head features in a class agnostic manner, bolstering the quality of the shared lower level representations. In later layers, the dinality (the size of the set of transformations) [ Multi-branch convolutions can be interpreted as a g • Cascade R-CNN SE block becomes increasingly specialised, and responds sation of this concept, enabling more flexible comp to different inputs in a highly class-specific manner. Con- of operators [16, 42, 43, 44]. Recently, composition sequently, the benefits of feature recalibration conducted by have been learned in an automated manner [26, SE blocks can be accumulated through the entire network. have shown competitive performance. Cross-chan The development of new CNN architectures is a chal- relations are typically mapped as new combination lenging engineering task, typically involving the selection tures, either independently of spatial structure [6 of many new hyperparameters and layer configurations. By jointly by using standard convolutional filters [24] w contrast, the design of the SE block outlined above is sim- convolutions. Much of this work has concentrate ple, and can be used directly with existing state-of-the-art objective of reducing model and computational com architectures whose modules can be strengthened by direct reflecting an assumption that channel relationship replacement with their SE counterparts. He K, Gkioxari G, Dollar P, et al. Mask R-CNN[J]. ICCV 2017 formulated as a composition of instance-agnostic f Moreover, Hu J, Shen L, Sun G, et al. Squeeze-and-Excitation Networks[J]. CVPR as shown 2018. in Sec. 4, SE blocks are computa- with local receptive fields. In contrast, we claim tha tionally lightweight and impose only a slight increase in ing the unit with a mechanism to explicitly model d Cai Z, Vasconcelos N. Cascade R-CNN: Delving Into High Quality Object Detection[J]. CVPR 2018. model complexity and computational burden. To support non-linear dependencies between channels using g these claims, we develop several SENets and provide an formation can ease the learning process, and sign extensive evaluation on the ImageNet 2012 dataset [34]. enhance the representational power of the network. To demonstrate their general applicability, we also present Attention and gating mechanisms. Attention results beyond ImageNet, indicating that the proposed ap- viewed, broadly, as a tool to bias the allocation of a proach is not restricted to a specific dataset or a task.
Panoptic Segmentation (Thing) Unified Framework • Mask R-CNN architecture RoIs of first stage from Cascade R-CNN • Shared FPN backbone Stronger Network • SENet154 • Deformable Conv. Share params • Nonlocal Conv. Bbox Head • Cascade R-CNN RoI Mask Concat. Mask Head RoI Feature • Cascade Mask RoIs of second stage from Cascade R-CNN Exploits RoIs of from box head to refine the mask results. Test-tricks
Shared FPN Inference Thing +0 6 34 3 0 6 .56 4 3 8 08% %% • ResNet50 baseline • Using multi-scale training % % 06 6 4 )0 386 3 % % 0 (+ )1 A (+
Shared FPN Inference Thing 2 B4 6 6 B2B 08 .6A BA 12 • ResNet50 baseline • Using multi-scale training • Adopt DCN & Nonlocal & Adaptive RoI pooling & 54 42 252 B 6 AB 2 %& & +2A6 6 ) % %) ( %& % 2A +3
Shared FPN Inference Thing .3 5 078 7 3 1 8 7B B 23 • ResNet50 baseline • Using multi-scale training • Adopt DCN & Nonlocal 53B5367 A5 & Adaptive RoI pooling & ) 65 53 363 7 A • Cascade RCNN 8 %& & B A3 8 3B7 7 % % ( %& % 3B +. 4 +.
Shared FPN Inference Thing .3 5 078 7 3 1 8 7B B 23 • ResNet50 baseline • Using multi-scale training 53B5367 3B • Adopt DCN & Nonlocal & & Adaptive RoI pooling & 53B5367 A5 ) • Cascade RCNN 65 53 363 7 A & % %& 8 • Cascade Mask B A3 8 % 3B7 7 % ( %& % 3B +. 4 +.
Shared FPN Inference Thing .3 5 078 7 3 1 8 7B B 23 • ResNet50 baseline • Using multi-scale training 7B A5 B • Adopt DCN & Nonlocal & 53B5367 3B & Adaptive RoI pooling & ) 53B5367 A5 • Cascade RCNN ( & % %& 65 53 363 7 A • Cascade Mask 8 B A3 8 • Using test tricks % % ( 3B7 7 %& % 3B +. 4 +.
Shared FPN Inference Thing 4 A 6 18 8 4 2 08 34 • ResNet50 baseline 51.7% 4B 8B 78 . • Using multi-scale training % 8 B6 • Adopt DCN & Nonlocal 45.9% & & Adaptive RoI pooling & 64 6478 4 % ) • Cascade RCNN 64 6478 B6 ( & % %& • Cascade Mask 76 64 474A 8 B A • Using test tricks % B4 % ( • Using larger model & BN %& 4 8 8 % 4 + 5 +
Panoptic Segmentation (Stuff) Unified Framework • FCN-based architecture • Shared FPN backbone Stronger Network • SENet154 • Deformable Conv. • Nonlocal Conv. Test-tricks • Flip • Multi-Scale testing • Other tricks Shared FPN backbone Seg Head for Stuff Lin, Tsung-Yi, et al. "Feature Pyramid Networks for Object Detection." CVPR. Vol. 1. No. 2. 2017. Hu J, Shen L, Sun G, et al. Squeeze-and-Excitation Networks[J]. CVPR 2018.
Shared FPN Inference Stuff 2 73 4694 2 7 55 .4 8 28 2 487 4 .4 • ResNet50 baseline % % ( ( 9 0 9)33 55
Shared FPN Inference Stuff .4 A95 168 6 A4A9 1AB77 06 B A 34 4 6 9 6 06 4 A6 A 9 • ResNet50 baseline • Using all test skills % % ) ( % ) ( ( 2 +55 . 1AB77
Shared FPN Inference Stuff .4 5 179 7 4 1 88 07B B 34 4B7 7 07B 4 7B B B 4A97A 67 • ResNet50 baseline 61.95% ( • Using all test skills 45.9% • Adopt larger model % % ) ( 34.7% % ) ( ( 2 +55 . 1 88
Panoptic Segmentation Merge Merge method • Sort thing results with scores • Thing first, stuff second • Merge tricks Kirillov A, He K, Girshick R B, et al. Panoptic Segmentation.[J]. arXiv: Computer Vision and Pattern Recognition, 2018.
Visualize Results
Failure Results Input Baseline Ensemble
Thanks & questions For more questions, please contact: liyanwei2017@ia.ac.cn
You can also read