AdaBins: Depth Estimation using Adaptive Bins - Shariq Farooq Bhat (KAUST) et al., CVPR 2021 - Slides
Copyright of figures and other materials in the paper belongs to the original authors.

AdaBins: Depth Estimation using Adaptive Bins
Shariq Farooq Bhat (KAUST) et al. | CVPR 2021
Presented by JIN HONGYU | 2021.06.10 | Computer Graphics @ Korea University
Outline
• Introduction
• Related Works
• Methodology
• Experiments
• Conclusion
Introduction
• Depth Estimation using Adaptive Bins
  ▪ Produces a high-quality dense depth map from a single RGB input image
  ▪ Starts from an encoder-decoder CNN architecture
  ▪ Proposes a transformer-based architecture block
    • Divides the depth range into adaptive bins
Introduction: Motivation – A Conjecture
• Conjecture:
  ▪ Current architectures do not perform enough global analysis of the output values
  ▪ Convolutional layers only process global information once the tensors reach a very low spatial resolution, at or near the bottleneck
Figure: "TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation", Vladimir Iglovikov (Lyft Inc.) and Alexey Shvets (MIT) | arXiv:1801.05746, 2018
Introduction: Motivation – Global Processing
• Global processing should be far more powerful when done at high resolution
• General idea:
  ▪ Perform a global statistical analysis of the output of a traditional encoder-decoder architecture
  ▪ Refine the output with a learned post-processing building block that operates at the highest resolution
Introduction: Depth Distribution
• The depth distribution corresponding to different RGB inputs can vary to a large extent
  ▪ This makes end-to-end depth regression an even more difficult task
• Approach: adaptive focus
  ▪ Let the network learn to adaptively focus on the regions of the depth range that are more probable to occur in the scene of the input image
Figure 1 (part)
Introduction: Contributions
• Propose an architecture building block that performs global processing of the scene's information
  ▪ Divide the predicted depth range into bins whose widths change per image
  ▪ The final depth estimate is a linear combination of the bin center values
• Decisive improvement for supervised single-image depth estimation across all metrics
  ▪ On the two most popular datasets: NYU and KITTI
• Analyze and investigate different modifications of the proposed AdaBins block
  ▪ Study their effect on the accuracy of the depth estimation
Related Works: Monocular Depth Estimation
• BTS: "From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation"
  ▪ Jin Han Lee (Hanyang University) et al. | arXiv:1907.10326, 2019
• DAV: "Guiding Monocular Depth Estimation Using Depth-Attention Volume"
  ▪ Lam Huynh (University of Oulu) et al. | ECCV 2020
Related Works: Encoder-Decoder
• Used in many vision-related problems
  ▪ Image segmentation, optical flow estimation, image restoration
  ▪ Has shown great success in both the supervised and the unsupervised setting of the depth estimation problem
• This paper adapts the baseline encoder-decoder network architecture of DenseDepth
  ▪ "High Quality Monocular Depth Estimation via Transfer Learning"
    • Ibraheem Alhashim and Peter Wonka (KAUST) | arXiv:1812.11941, 2018
Related Works: Transformer
• Traditionally used in Natural Language Processing (NLP)
  ▪ "Attention Is All You Need"
    • Ashish Vaswani (Google Brain) et al. | NIPS 2017
• Recently also used in computer vision tasks
  ▪ "End-to-End Object Detection with Transformers"
    • Nicolas Carion et al. | ECCV 2020
Figures: the Transformer in "Attention Is All You Need" and in "End-to-End Object Detection with Transformers"
Methodology
Methodology: Motivation
• Performance can improve by transforming the depth regression task into a classification task
  ▪ "Deep Ordinal Regression Network for Monocular Depth Estimation"
    • Huan Fu (The University of Sydney) et al. | CVPR 2018
  ▪ That work divides the depth range into a fixed number of bins of predetermined width
• This paper addresses several limitations of that approach by:
  ▪ Computing adaptive bins
    • Bins dynamically change depending on the input features
  ▪ Predicting the final depth values as a linear combination of bin centers
    • Combines the advantages of classification with the advantages of depth-map regression
  ▪ Computing information globally at a high resolution
    • In contrast to other architectures such as DAV
Methodology: AdaBins Design – Discretized Depth
• Discretize the depth interval D = (d_min, d_max) into N bins
  ▪ The bin widths b are adaptively computed for each image (see the sketch below)
    • Better than a fixed bin width or a trained-but-fixed bin width
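A minimal sketch of the difference between fixed and adaptive bins, assuming an NYU-style depth range and N = 256 bins (all names and values here are illustrative, not taken from the paper's code):

```python
import torch

d_min, d_max, N = 1e-3, 10.0, 256            # assumed depth range and bin count

fixed_widths = torch.full((N,), 1.0 / N)     # fixed bins: all widths equal
adaptive_widths = torch.softmax(torch.randn(N), dim=0)  # stand-in for a predicted width vector b (sums to 1)

def bin_edges(widths, d_min, d_max):
    """Map normalized widths onto the depth interval (d_min, d_max)."""
    edges = d_min + torch.cumsum(widths, dim=0) * (d_max - d_min)
    return torch.cat([torch.tensor([d_min]), edges])

print(bin_edges(fixed_widths, d_min, d_max)[:5])     # evenly spaced edges
print(bin_edges(adaptive_widths, d_min, d_max)[:5])  # per-image, unevenly spaced edges
```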
Methodology: AdaBins Design – Linear Combination
• Discretizing the depth interval causes depth discretization artifacts
• Therefore, predict the final depth as a linear combination of bin centers
  ▪ Enables the model to estimate smoothly varying depth values
Methodology: AdaBins Design – Attention Block
• Perform global processing using an attention block
  ▪ In this paper:
    • Applied at high resolution
    • Encoder -> Decoder -> Attention
  ▪ In other architectures:
    • Applied at low resolution
    • Encoder -> Attention -> Decoder
Figure: "Guiding Monocular Depth Estimation Using Depth-Attention Volume"
Methodology: AdaBins Design – Encoder-Decoder
• Build on the simplest possible architecture
  ▪ To isolate the effects of the proposed AdaBins concept
• Build on a modern encoder-decoder, using EfficientNet-B5 as the encoder backbone
  ▪ "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks"
    • Mingxing Tan and Quoc Le | ICML 2019
Figure: EfficientNet-B5 architecture, image from Vardan Agarwal's "Complete Architectural Details of all EfficientNet Models"
Methodology: Architecture – Overview
• The architecture consists of two major components:
  ▪ An encoder-decoder block
    • Encoder: pretrained EfficientNet-B5
    • Decoder: a standard feature up-sampling decoder with 4 up-sampling layers
  ▪ AdaBins: the adaptive bin-width estimator block
Methodology: Architecture – Encoder-Decoder
• Primarily based on a simple depth regression network
  ▪ "High Quality Monocular Depth Estimation via Transfer Learning"
    • Ibraheem Alhashim and Peter Wonka (KAUST) | arXiv:1812.11941
• Modifications:
  ▪ Encoder: DenseNet -> EfficientNet-B5
  ▪ A different, more appropriate loss function
  ▪ Decoder output: final depth map -> decoded features, i.e. h × w × 1 -> h × w × C_d
Methodology: Architecture – AdaBins
• The AdaBins module is the key contribution of this paper
  ▪ Input: decoded features, h × w × C_d
  ▪ Output: depth map, h × w
• Due to hardware limitations, h = H/2 and w = W/2
  ▪ Facilitates better learning with larger batch sizes
• The final depth map is simply bilinearly up-sampled to H × W × 1 (see the pipeline sketch below)
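A high-level PyTorch sketch of the overall pipeline under these shape conventions; the two stand-in convolutions below are placeholders for the real EfficientNet-B5 encoder-decoder and the mini-ViT AdaBins block, used only to make the shapes concrete:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C_d = 128                                          # decoded feature channels (assumed)

# Placeholders: a strided conv stands in for the encoder-decoder (H x W input,
# H/2 x W/2 features), a 1x1 conv stands in for the AdaBins depth head.
encoder_decoder = nn.Conv2d(3, C_d, 3, stride=2, padding=1)
adabins_block   = nn.Conv2d(C_d, 1, 1)

rgb   = torch.randn(1, 3, 480, 640)                # NYU-sized input, B x 3 x H x W
feats = encoder_decoder(rgb)                       # B x C_d x H/2 x W/2
depth = adabins_block(feats)                       # B x 1 x H/2 x W/2
depth = F.interpolate(depth, scale_factor=2,       # bilinear up-sampling back to H x W
                      mode='bilinear', align_corners=True)
print(depth.shape)                                 # torch.Size([1, 1, 480, 640])
```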
Methodology: AdaBins – Mini-ViT
• Estimating sub-intervals within the depth range D requires:
  ▪ Local structural information
  ▪ Global distributional information
• Global attention methods are usually expensive in memory and complexity
  ▪ Especially at higher resolutions
• Mini-ViT: a more efficient alternative
Methodology: AdaBins – Vision Transformer
• Vision Transformer (ViT):
  ▪ "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
    • Alexey Dosovitskiy (Google Brain) et al. | ICLR 2021
  ▪ Applies a standard NLP Transformer directly to images
    • Splits an image into patches
    • Uses the sequence of linear embeddings of the patches as input
    • Patches are treated the same way as tokens (words) in NLP
Figures: the Vision Transformer from the ViT paper, and the Mini-ViT used in this paper
Methodology: AdaBins – Patch Embedding
• Patch embedding:
  ▪ Transforms the decoded features into fixed-sized patches
    • The Transformer requires fixed-sized input
  ▪ Pass the decoded features through an embedding convolution
    • Kernel size: p × p, stride p
    • Number of output channels: E
    • Thus, the embedding conv output size is h/p × w/p × E
  ▪ Reshape into a flattened tensor x_p ∈ ℝ^(S × E)
    • S = hw / p² is the effective sequence length
  ▪ Positional embedding: learnable 1D position embeddings
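A minimal sketch of this embedding step, with illustrative values h = 240, w = 320, C_d = 128, p = 16, and E = 128 (assumptions for the example, not the paper's exact hyper-parameters):

```python
import torch
import torch.nn as nn

h, w, C_d, p, E = 240, 320, 128, 16, 128

embedding_conv = nn.Conv2d(C_d, E, kernel_size=p, stride=p)   # p x p conv, E channels
S = (h // p) * (w // p)                                       # S = hw / p^2 = 300
pos_embedding = nn.Parameter(torch.zeros(S, E))               # learnable 1D positions

feats   = torch.randn(1, C_d, h, w)             # decoded features from the decoder
patches = embedding_conv(feats)                 # 1 x E x h/p x w/p
x_p     = patches.flatten(2).transpose(1, 2)    # 1 x S x E flattened patch embeddings
x_p     = x_p + pos_embedding                   # add positional embeddings
print(x_p.shape)                                # torch.Size([1, 300, 128])
```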
Methodology: AdaBins – Transformer Encoder
• Transformer encoder:
  ▪ Input: patch embeddings
  ▪ Output: output embeddings x_o ∈ ℝ^(S × E)
• Pass the first row of the output embeddings into an MLP head
  ▪ 3 fully connected layers, with LeakyReLU between the layers
  ▪ MLP output: an N-dimensional vector b'
• Normalize b' into b (Eq. 1): b_i = (b'_i + ε) / Σ_j (b'_j + ε)
  ▪ b sums to 1, forcing the network to focus on the depth range
  ▪ ε = 10⁻³ ensures each bin width is > 0
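A sketch of the MLP head and the Eq. 1 normalization; the hidden width of 256 and the final ReLU (to keep b' non-negative before normalizing) are assumptions for illustration:

```python
import torch
import torch.nn as nn

E, N, eps = 128, 256, 1e-3

mlp_head = nn.Sequential(                 # 3 fully connected layers, LeakyReLU between them
    nn.Linear(E, 256), nn.LeakyReLU(),
    nn.Linear(256, 256), nn.LeakyReLU(),
    nn.Linear(256, N), nn.ReLU(),         # assumed: keep the raw widths b' non-negative
)

x_o = torch.randn(1, 300, E)              # output embeddings from the transformer encoder
b_prime = mlp_head(x_o[:, 0, :])          # first output embedding -> N-vector b'

# Eq. 1: b_i = (b'_i + eps) / sum_j (b'_j + eps); eps = 1e-3 keeps every width > 0
b = (b_prime + eps) / (b_prime + eps).sum(dim=-1, keepdim=True)
print(b.sum())                            # ~1.0: widths partition the depth range
```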
Methodology: AdaBins – Range Attention Maps
• Use part of the output embeddings as 1×1 conv kernels
  ▪ Specifically, the 2nd through (C+1)th rows
• Pass the decoded features through a 3×3 conv layer, then convolve with those 1×1 kernels
  ▪ Equivalent to computing a dot product
    • Pixel-wise features are treated as 'keys'
    • Output embeddings act as 'queries'
  ▪ Integrates adaptive global information into the local information
• The result is the Range Attention Maps ℛ
  ▪ ℛ and b are then used to obtain the final depth
Figure: implementation details of Mini-ViT
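A sketch of this dot-product view, with assumed dimensions E = C = 128 so that the embedding rows can act directly as 1×1 kernels over the convolved features:

```python
import torch
import torch.nn as nn

C_d, E, C, h, w = 128, 128, 128, 240, 320

key_conv = nn.Conv2d(C_d, E, kernel_size=3, padding=1)  # 3x3 conv on the decoded features

feats   = torch.randn(1, C_d, h, w)
keys    = key_conv(feats)                    # pixel-wise 'keys': 1 x E x h x w
x_o     = torch.randn(1, 300, E)             # transformer output embeddings
queries = x_o[:, 1:C + 1, :]                 # rows 2 .. C+1 act as 'queries': 1 x C x E

# Convolving with C kernels of size 1x1xE == per-pixel dot product of queries and keys
R = torch.einsum('bce,behw->bchw', queries, keys)   # Range Attention Maps: 1 x C x h x w
print(R.shape)
```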
Methodology: AdaBins – Hybrid Regression
• Pass the Range Attention Maps through a 1×1 conv layer to obtain N channels
  ▪ N: number of bins
• Apply a Softmax activation
  ▪ p_k: the Softmax score of each pixel, k = 1, …, N
  ▪ p_k is treated as probabilities over the depth-bin centers c(b) = {c(b_1), c(b_2), …, c(b_N)}
  ▪ The centers are computed from the bin widths (Eq. 2): c(b_i) = d_min + (d_max − d_min)(b_i/2 + Σ_{j<i} b_j)
• The final depth value d̃ is the linear combination of p_k and c(b) (Eq. 3): d̃ = Σ_k c(b_k) p_k
  ▪ Avoids discretization artifacts
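A sketch of Eqs. 2-3 with illustrative shapes (N = 256 bins, C = 128 attention channels); the random tensors stand in for the Range Attention Maps and the predicted bin widths:

```python
import torch
import torch.nn as nn

d_min, d_max, N, C, h, w = 1e-3, 10.0, 256, 128, 240, 320

R = torch.randn(1, C, h, w)                   # stand-in for the Range Attention Maps
logits = nn.Conv2d(C, N, kernel_size=1)(R)    # 1x1 conv -> N channels
p = torch.softmax(logits, dim=1)              # per-pixel probabilities over the N bins

b = torch.softmax(torch.randn(1, N), dim=1)   # stand-in for the normalized bin widths

# Eq. 2: c(b_i) = d_min + (d_max - d_min) * (b_i / 2 + sum_{j<i} b_j)
centers = d_min + (d_max - d_min) * (torch.cumsum(b, dim=1) - 0.5 * b)   # 1 x N

# Eq. 3: final depth = sum_k c(b_k) * p_k (a smooth, per-pixel linear combination)
depth = (p * centers.view(1, N, 1, 1)).sum(dim=1, keepdim=True)          # 1 x 1 x h x w
print(depth.shape)
```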
Methodology: Loss Function
• Pixel-wise depth loss (Eq. 4): a scaled version of the Scale-Invariant (SI) loss
  ▪ L_pixel = α · sqrt( (1/T) Σ_i g_i² − (λ/T²) (Σ_i g_i)² )
  ▪ g_i = log d̃_i − log d_i, where d_i is the ground truth
  ▪ T: number of pixels having a valid ground-truth value
  ▪ λ = 0.85, α = 10
• Bin-center density loss (Eq. 5): the bi-directional Chamfer loss between the bin centers c(b) and the set of ground-truth depth values
  ▪ "A Point Set Generation Network for 3D Object Reconstruction from a Single Image"
    • Haoqiang Fan (Tsinghua University) et al. | CVPR 2017
• Final loss (Eq. 6): L_total = L_pixel + β · L_bins, with β = 0.1
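A sketch of these losses, assuming pred and gt are flattened depth values at the T valid pixels and that the Chamfer terms use a mean reduction (an assumption; the slides do not specify the reduction):

```python
import torch

alpha, lam, beta = 10.0, 0.85, 0.1

def si_loss(pred, gt):
    """Eq. 4: scaled scale-invariant loss with g_i = log(pred_i) - log(gt_i)."""
    g = torch.log(pred) - torch.log(gt)
    return alpha * torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)

def chamfer_1d(a, b):
    """One direction of the Chamfer distance between two 1D point sets."""
    d = (a.unsqueeze(1) - b.unsqueeze(0)) ** 2   # pairwise squared distances
    return d.min(dim=1).values.mean()            # each point in a to its nearest in b

def bin_loss(centers, gt):
    """Eq. 5: bi-directional Chamfer loss between bin centers and GT depths."""
    return chamfer_1d(centers, gt) + chamfer_1d(gt, centers)

pred = torch.rand(1000) * 9 + 1                  # dummy predictions in (1, 10)
gt   = torch.rand(1000) * 9 + 1                  # dummy ground truth
centers = torch.linspace(1.0, 10.0, 256)         # dummy bin centers

total = si_loss(pred, gt) + beta * bin_loss(centers, gt)   # Eq. 6
print(total)
```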
Experiments
Experiments: Datasets
• NYU Depth v2: indoor scenes
  ▪ Images & depth maps (640 × 480)
  ▪ A 50K-image subset is used for training in this paper
• KITTI: outdoor scenes (captured from a moving vehicle)
  ▪ Stereo images (1241 × 376) & 3D laser-scanned data (low density)
  ▪ A 26K-image subset is used for training in this paper
• SUN RGB-D: indoor scenes
  ▪ Data captured by 4 different sensors
  ▪ Not used for training
Experiments: Evaluation Metrics
• The standard six metrics used in prior work:
  ▪ Average relative error
  ▪ Root mean squared error
  ▪ Average log10 error
  ▪ Threshold accuracies (δ_i), with thresholds 1.25, 1.25², 1.25³
• Two more for KITTI:
  ▪ Squared relative difference
  ▪ RMSE log
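A sketch of these metrics as they are commonly computed, assuming pred and gt are flattened valid-pixel depths:

```python
import torch

def depth_metrics(pred, gt):
    ratio = torch.max(pred / gt, gt / pred)      # per-pixel max ratio for the delta accuracies
    return {
        'abs_rel':  ((pred - gt).abs() / gt).mean(),                      # average relative error
        'rmse':     torch.sqrt(((pred - gt) ** 2).mean()),                # root mean squared error
        'log10':    (torch.log10(pred) - torch.log10(gt)).abs().mean(),   # average log10 error
        'delta1':   (ratio < 1.25).float().mean(),                        # threshold accuracies
        'delta2':   (ratio < 1.25 ** 2).float().mean(),
        'delta3':   (ratio < 1.25 ** 3).float().mean(),
        'sq_rel':   (((pred - gt) ** 2) / gt).mean(),                     # KITTI-only
        'rmse_log': torch.sqrt(((torch.log(pred) - torch.log(gt)) ** 2).mean()),  # KITTI-only
    }

print(depth_metrics(torch.rand(100) * 9 + 1, torch.rand(100) * 9 + 1))
```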
Experiments: Implementation Details
• Platform: PyTorch
• Optimizer: AdamW, with weight decay 10⁻²
• 1-cycle policy for the learning rate, with max_lr = 3.5 × 10⁻⁴
  ▪ For the first 30% of iterations: linear warm-up from max_lr/25 to max_lr
  ▪ Then cosine annealing down to max_lr/75
• Total number of epochs: 25; batch size: 16
• ~20 min per epoch on a single node with 4 NVIDIA V100 32GB GPUs
• Main model: 78M parameters
  ▪ CNN encoder: 28M; CNN decoder: 44M; AdaBins module: 5.8M
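A sketch of this schedule with PyTorch's built-in OneCycleLR. The mapping of the slide's numbers onto the scheduler's arguments (div_factor=25 for the max_lr/25 start, final_div_factor=3 so the final LR lands near max_lr/(25·3) = max_lr/75) is an assumption; note also that OneCycleLR applies one anneal strategy to both phases, so 'cos' only approximates the linear warm-up described above:

```python
import torch

model = torch.nn.Linear(8, 1)                 # placeholder for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=3.5e-4, weight_decay=1e-2)

epochs, steps_per_epoch = 25, 1000            # steps_per_epoch depends on the dataset
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=3.5e-4,
    epochs=epochs, steps_per_epoch=steps_per_epoch,
    pct_start=0.3,                            # warm-up phase covers the first 30%
    div_factor=25,                            # start at max_lr / 25
    final_div_factor=3,                       # end near max_lr / 75
    anneal_strategy='cos',                    # cosine annealing after the peak
)
```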
Experiments: Results
Experiments: Ablation Study – AdaBins & Bin Types
Experiments: Ablation Study – Number of Bins
Experiments: Ablation Study – Loss Function
Experiments: Test on a Webcam
• Although AdaBins is not designed for real-time applications, it is relatively faster than many other non-real-time architectures
  ▪ Intel Core i7-7700K
  ▪ NVIDIA GeForce GTX 1080
Conclusion
Conclusion
• We introduced a new architecture block, called AdaBins, for depth estimation from a single RGB image
• AdaBins leads to a decisive improvement in the state of the art on the two most popular datasets, NYU and KITTI
• Future work:
  ▪ Investigate whether global processing of information at high resolution can also improve performance on other tasks
    • Segmentation, normal estimation, and 3D reconstruction from multiple images
Thank you for your attention!