Learning Pixel Trajectories with Multiscale Contrastive Random Walks
Learning Pixel Trajectories with Multiscale Contrastive Random Walks

Zhangxing Bian1,3   Allan Jabri2   Alexei A. Efros2   Andrew Owens1
University of Michigan1   UC Berkeley2   Johns Hopkins University3
https://jasonbian97.github.io/flowwalk
arXiv:2201.08379v1 [cs.CV] 20 Jan 2022

Abstract

A range of video modeling tasks, from optical flow to multiple object tracking, share the same fundamental challenge: establishing space-time correspondence. Yet, approaches that dominate each space differ. We take a step towards bridging this gap by extending the recent contrastive random walk formulation to much denser, pixel-level space-time graphs. The main contribution is introducing hierarchy into the search problem by computing the transition matrix between two frames in a coarse-to-fine manner, forming a multiscale contrastive random walk when extended in time. This establishes a unified technique for self-supervised learning of optical flow, keypoint tracking, and video object segmentation. Experiments demonstrate that, for each of these tasks, the unified model achieves performance competitive with strong self-supervised approaches specific to that task.

1. Introduction

Temporal correspondence underlies a range of video understanding tasks, from optical flow to object tracking. At the core, the challenge is to estimate the motion of some entity as it persists in the world, by searching in space and time. For historical reasons, the practicalities differ substantially across tasks: optical flow aims for dense correspondences but only between neighboring pairs of frames, whereas tracking cares about longer-range correspondences but is spatially sparse. We argue that the time might be right to try and re-unify these different takes on temporal correspondence.

An emerging line of work in self-supervised learning has shown that generic representations pretrained on unlabeled images and video can lead to strong performance across a range of tracking tasks [20, 25, 34, 72, 74]. The key idea is that if tracking can be formulated as label propagation [83] on a space-time graph, all that is needed is a good measure of similarity between nodes. Indeed, the recent contrastive random walk (CRW) formulation [25] shows how such a similarity measure can be learned for temporal correspondence problems, suggesting a path towards a unified solution. However, scaling this perspective to pixel-level space-time graphs holds challenges. Since computing similarity between frames is quadratic in the number of nodes, estimating dense motion is prohibitively expensive. Moreover, there is no way of explicitly estimating the motion in ambiguous cases, like occlusion. In parallel, the unsupervised optical flow community has adopted highly effective methods for dense matching [80], which use multiscale representations [9, 43, 59, 81] to reduce the large search space, and smoothness priors to deal with ambiguity and occlusion. But, in contrast to the self-supervised tracking methods, they rely on hand-crafted distance functions, such as the Census Transform [31, 45]. Furthermore, because they focus on producing point estimates of motion, they may be less robust under long-term dynamics.

In this work, we take a step toward bridging the gap between tracking and optical flow by extending the CRW formulation [25] to much denser, pixel-level space-time graphs. The main contribution is introducing hierarchy into the search problem, i.e. the multiscale contrastive random walk. By integrating local attention in a coarse-to-fine manner, the model can efficiently consider a distribution over pixel-level trajectories. Through experiments across optical flow and video label propagation benchmarks, we show:

• This provides a unified technique for self-supervised optical flow, pose tracking, and video object segmentation.
• For optical flow, the model is competitive with many recent unsupervised optical flow methods, despite using a novel loss function (without hand-crafted features).
• For tracking, the model outperforms existing self-supervised approaches on pose tracking, and is competitive on video object segmentation.
• Cycle-consistency provides a complementary learning signal to photo-consistency.
• Multi-frame training improves two-frame performance.

2. Related work

Space-time representation learning. Recent work has proposed methods for tracking objects in video through self-supervised representation learning. Vondrick et al. [68]
t t+k t level 1 coarse X level 2 X fine level 3 X Figure 1. Multiscale contrastive random walks. We learn representations for dense, fine-grained matching using multiscale contrastive random walks. At each scale, we create a space-time graph in which each pixel is a node, and where nodes close in space-time are connected. Transition probabilities are based on similarity in a learned representation. We train the network to maximize the probability of obtaining a cycle-consistent random walk. Here, we illustrate a pixel’s walk over 3 spatial scales and 3 frames. The walker is initialized using its position at the previous, coarser scale. posed tracking as a colorization problem, by training a net- tional methods. Later research improved flow estimation work to match pixels in grayscale images that have the using robust penalties [4, 6, 38, 59, 60], coarse-to-fine esti- same held-out colors. The assumption that matching pixels mation [8], discrete optimization [13,53,56], feature match- have the same color may break down over long time hori- ing [5, 51, 75], bilateral filtering [1], and segmentation [84]. zons, and the method is limited to grayscale inputs. This In contrast, our model estimates flow using neural networks. approach was extended to obtain higher-accuracy matches Unsupervised optical flow. Early work [47] used Boltz- with two-stage matching [34, 36]. In contrast to our ap- mann machines to learn transformations between images. proach, the predictions are coarse, patch-level associations. More recently, Yu et al. [80] train a neural network to mini- Another line of work learns representations by maximizing mize a loss very similar to that of optimization-based ap- cycle consistency. These methods track patches forward, proaches. Later work extended this approach by adding then backward, in time and test whether they end up where edge-aware smoothing [73], hand-crafted features [46], oc- they began. Wang et al. [71] proposed a method based clusion handling [27, 73], learned upsampling [44], and on hard attention and spatial transformers [26]. Jabri et depth and camera pose [79, 82]. Another approach learns al. [25] posed cycle-consistency as a random walk problem, flow by matching augmented image pairs [39, 40, 42]. Re- allowing the model to obtain dense supervision from multi- cently, Jonschkowski et al. [31] surveyed previous litera- frame video. Tang et al. [65] proposed an extension that ture and conducted an exhaustive search to find the best allowed for fully convolutional training. These approaches combination of methods and hyperparameters, resulting in are trained on sparse patches and learn coarse-grained cor- a method that combines occlusion reasoning, augmenta- respondences. We combine these representation learning tion, smoothness, flow dropout, and census transform fea- methods with matching techniques from the optical flow lit- tures. In contrast to these works, our goal is to learn self- erature, such as multiscale matching and smoothness, and supervised representations for matching, in lieu of hand- obtain dense, pixel-to-pixel correspondences. Other work crafted features. Moreover, we aim to produce a distri- has encouraged patches in the same position in neighboring bution over motion trajectories suitable for label transfer, frames to be close in an embedding space [17, 76]. rather than motion estimates alone. In very recent work, Optimization-based optical flow. Lucas and Kanade [2, Stone et al. 
[58] adapted unsupervised flow methods to the 43] used Gauss-Newton optimization to minimize a bright- RAFT architecture [66] and proposed new augmentation, ness constancy objective. In their seminal work, Horn and self-distillation, and multi-frame occlusion inpainting meth- Schunck [21] combined brightness constancy with a spa- ods. By contrast, we use PWC-net [63], since it is the stan- tial smoothness criteria, and estimated flow using varia- dard architecture considered in prior work, and since it can 2
obtain strong performance with careful training [62].

Cycle consistency. Cycle consistency has long been used to detect occlusions [3, 23, 35, 77], and is used to discount the loss of occluded pixels in unsupervised flow [27, 31, 73]. Zou et al. [85] used a cycle consistency loss as part of a system that jointly estimated depth, pose, and flow. Recently, Huang et al. [22] combined cycle-consistency with epipolar matching, but their method is weakly supervised with camera pose and assumes egomotion. Without extra constraints, these cycle consistency formulations have a trivial solution (e.g., all-zero flow). In contrast, our model is entirely unsupervised and is capable of working solely with cycle-consistency and smoothness losses. A random walk formulation of cycle consistency has also been used in semi-supervised learning [18], using labels to test consistency.

Multi-frame matching. Many methods use a third frame to obtain more local evidence for matching. Classic methods assume approximately constant velocity [29, 61, 67, 69] or acceleration [32, 69] and measure the photo-consistency over the full set of frames. Recently, Janai et al. [27] proposed an unsupervised multi-frame flow method that used a photometric loss with a low-acceleration assumption and explicit occlusion handling. In contrast to these approaches, our method can be deployed using two frames at test time. We use subsequent frames as a training signal. There have also been a variety of approaches that track over long time horizons, often using sparse (or semi-dense) keypoints [54, 55]. Other work chains together optical flow [7, 64, 70], typically after removing low-texture regions. In contrast, our method also learns “soft” per-pixel tracks, which convey the probability that pairs of pixels match.

Supervised optical flow. Early work learned optical flow with probabilistic models, such as graphical models [15]. Other work learns parameters for smoothness and brightness constancy [60] or robust penalties [4, 37]. More recent methods have used neural networks. Fischer et al. [12, 24] proposed architectures with built-in correlation layers. Sun et al. [63] introduced a network with built-in coarse-to-fine matching. Recent work [66] iteratively updates a flow with multiscale features, in lieu of coarse-to-fine matching.

3. Method

We first show how to learn dense space-time correspondences using multiscale contrastive random walks, resulting in a model that obtains high quality motion estimates via simple nonparametric matching. We then describe how the learned representation can be combined with regression to handle occlusions and ambiguity, for improved optical flow.

3.1. Multiscale contrastive random walks

We review the single-scale contrastive random walk, then extend the approach to multiscale estimation.

3.1.1 Preliminaries: Contrastive random walks

We build on the contrastive random walk (CRW) formulation of Jabri et al. [25]. Given an input video with k frames, we extract n patches from each frame and assign each an embedding using a learned encoder φ. These patches form the vertices of a graph that connects all patches in temporally adjacent frames. A random walker steps through the graph, forward in time from frames 1, 2, ..., k, then backward in time from k − 1, k − 2, ..., 1. The transition probabilities are determined by the similarity of the learned representations:

A_{s,t} = softmax(X_s X_t^⊤ / τ),   (1)

for a pair of frames s and t, where X_i ∈ R^{n×d} is the matrix of d-dimensional embedding vectors, τ is a small constant, and the softmax is performed along each row. We train the model to maximize the likelihood of cycle consistency, i.e. the event that the walker returns to the node it started from:

L_{CRW} = −tr(log(Ā_{t,t+k} Ā_{t+k,t})),   (2)

where Ā_{t,t+k} are the transition probabilities from frame t to t + k: Ā_{t,t+k} = Π_{i=t}^{t+k−1} A_{i,i+1}.
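To make Eqs. 1 and 2 concrete, the following is a minimal PyTorch sketch of the single-scale contrastive random walk loss. The helper names and tensor shapes are our own illustration rather than the released implementation, and the loss is averaged over start nodes rather than summed.

```python
import torch
import torch.nn.functional as F

def transition(x_s, x_t, tau=0.07):
    """Eq. 1: row-stochastic affinity between the n patch embeddings of two frames.
    x_s, x_t: (n, d) L2-normalized embedding matrices."""
    return F.softmax(x_s @ x_t.t() / tau, dim=1)                 # (n, n)

def crw_loss(embeddings, tau=0.07):
    """Eq. 2: negative log-probability that a walker stepping forward through
    frames 1..k and back again returns to its starting node.
    embeddings: list of k tensors, each (n, d)."""
    n = embeddings[0].shape[0]
    walk = torch.eye(n, device=embeddings[0].device)
    for i in range(len(embeddings) - 1):                         # forward in time
        walk = walk @ transition(embeddings[i], embeddings[i + 1], tau)
    for i in range(len(embeddings) - 1, 0, -1):                  # backward in time
        walk = walk @ transition(embeddings[i], embeddings[i - 1], tau)
    return -torch.log(torch.diagonal(walk) + 1e-12).mean()
```

Note that the backward transition matrices are recomputed from the embeddings in the reverse direction, since each row-wise softmax in Eq. 1 is normalized over the target frame; they are not simply transposes of the forward matrices.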
3.1.2 Optical flow as a random walk

After training, the transition matrix contains the probability that a pair of patches is in space-time correspondence. We can estimate the optical flow f_{s,t} ∈ R^{n×2} of a patch between frames s and t by taking the expected value of the change in spatial position:

g_avg(A_{s,t}) = E_{A_{s,t}}[f_{s,t}] = A_{s,t} D − D,   (3)

where D ∈ R^{n×2} is the (constant) matrix containing the pixel coordinates of each patch, and A_{s,t} D is the walker’s expected position in frame t.

In contrast to widely-used forward-backward cycle consistency formulations [22, 85], which measure the deviation of a predicted motion from a starting point, there is no trivial solution (e.g., all-zero flow). This is because cycle consistency is measured in an embedding space defined solely from the visual content of image regions.
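Eq. 3 can be read directly off the transition matrix. Below is a minimal sketch, assuming the n patches are enumerated row-major on an h × w grid and that flow is stored as (dx, dy); both conventions are our own.

```python
import torch

def expected_flow(A, h, w):
    """Eq. 3: optical flow as the walker's expected change in position.
    A: (n, n) row-stochastic transition matrix between frames s and t, n = h * w."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    D = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float().to(A.device)  # (n, 2) coords
    return A @ D - D                                                       # (n, 2) expected (dx, dy)
```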
3.1.3 Multiscale random walk

As presented so far, this formulation is expensive to scale to high resolutions because computing the transition matrix is quadratic in the number of nodes. We overcome this by introducing hierarchy into the search problem. Instead of comparing all pairs, we only attend over a local neighborhood. By integrating local search across scales in a coarse-to-fine manner, the model can efficiently consider a distribution over pixel-level trajectories.

Figure 2. Coarse-to-fine matching. Our model performs a contrastive random walk across spatial scales: it computes a transition matrix (Eq. 5), uses it to obtain flow (Eq. 6), and recurses to the next level by using the upsampled flow estimate to align the finer scale for matching (Eq. 4). warp samples the grid using the flow.

Coarse-to-fine local attention. Computing the transition matrix closely resembles cost volume estimation in optical flow [6, 14, 59]. This inspires us to draw on the classic spatial pyramid commonly used for multiscale search in optical flow, by iteratively computing the dense transition matrix from coarse to fine spatial scales l ∈ [1..L].

For frames s and t, we compute feature pyramids X_s^l ∈ R^{h(l)w(l)×d}, where h(l) and w(l) are the height and width of the feature map at scale l. To match each level efficiently, we warp the features of the target frame X_t^l into the coordinate frame of X_s^l using the coarse flow from the previous level f_{s,t}^l; we then compute local transition probabilities on the warped features to account for the remaining motion. Thus, we estimate the transition matrix and flow in a coarse-to-fine manner (Fig. 2), computing levels recurrently:

X̃_t^l = warp(X_t^l, f_{s,t}^l)   (4)
A_{s,t}^l = masked_softmax(X_s^l X̃_t^{l⊤} / τ)   (5)
f_{s,t}^{l+1} = upsample(g_avg(A_{s,t}^l) + f_{s,t}^l),   (6)

where warp(X, f) samples features X with flow f using bilinear sampling, and f^1 = 0. For notational convenience, we write the local transition constraint using masked_softmax, which sets values beyond a local spatial window to zero. In practice, we use optimized correlation filtering kernels to compute Eq. 5 efficiently, and represent the transition matrices A_{s,t}^l as sparse matrices.

Loss. After computing the transition matrices between all pairs of adjacent frames, we sum the contrastive random walk loss over all levels:

L_{msCRW} = −Σ_{l=1}^{L} tr(log(Ā^l_{t,t+k} Ā^l_{t+k,t})),   (7)

where Ā^l_{s,t} is defined as in Eq. 2. In our experiments, we use L = 5 scales and consider cycles of length k ∈ [2..4].
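The recursion of Eqs. 4–6 amounts to a warped local correlation followed by an expected-displacement readout at each level. The sketch below is illustrative only: it materializes the local cost volume with unfold rather than the optimized correlation kernels and sparse matrices described above, and the function names, window radius, and level spacing are our own assumptions.

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Bilinearly sample feature map x (B, C, H, W) at locations displaced by
    flow (B, 2, H, W), given in pixels as (dx, dy)."""
    B, _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=x.device),
                            torch.arange(W, device=x.device), indexing="ij")
    gx = (xs[None] + flow[:, 0]) / (W - 1) * 2 - 1        # normalize to [-1, 1]
    gy = (ys[None] + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)                  # (B, H, W, 2), (x, y) order
    return F.grid_sample(x, grid, align_corners=True)

def local_expected_flow(x_s, x_t_warped, radius=5, tau=0.07):
    """Eqs. 5-6 (residual part): softmax over a (2r+1)^2 window around each pixel,
    then the expected displacement under those local transition probabilities."""
    B, C, H, W = x_s.shape
    k = 2 * radius + 1
    neigh = F.unfold(x_t_warped, kernel_size=k, padding=radius)       # (B, C*k*k, H*W)
    neigh = neigh.view(B, C, k * k, H * W)
    logits = (x_s.view(B, C, 1, H * W) * neigh).sum(1) / tau          # (B, k*k, H*W)
    probs = F.softmax(logits, dim=1)
    dy, dx = torch.meshgrid(torch.arange(-radius, radius + 1),
                            torch.arange(-radius, radius + 1), indexing="ij")
    offsets = torch.stack([dx, dy], 0).reshape(2, k * k, 1).float().to(x_s.device)
    return (probs[:, None] * offsets[None]).sum(2).view(B, 2, H, W)   # expected (dx, dy)

def coarse_to_fine_flow(pyr_s, pyr_t):
    """Eqs. 4-6: recurse from the coarsest to the finest scale.
    pyr_s, pyr_t: per-scale embeddings, ordered coarse to fine, each (B, C_l, H_l, W_l)."""
    flow = torch.zeros_like(pyr_s[0][:, :2])                          # f^1 = 0
    for x_s, x_t in zip(pyr_s, pyr_t):
        if flow.shape[-2:] != x_s.shape[-2:]:
            scale = x_s.shape[-1] / flow.shape[-1]                    # flow is in pixel units
            flow = scale * F.interpolate(flow, size=x_s.shape[-2:],
                                         mode="bilinear", align_corners=True)
        x_t_w = warp(x_t, flow)                                       # Eq. 4
        flow = flow + local_expected_flow(x_s, x_t_w)                 # Eqs. 5-6
    return flow
```

In this sketch, flow is kept in the pixel units of each level, so the upsampled estimate is rescaled by the resolution ratio before being refined at the next scale.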
3.1.4 Smooth random walks

Since natural motions tend to be smooth [52], we follow work in optical flow [59] and incorporate smoothness as an additional desideratum for our random walks. We use the edge-aware loss of Jonschkowski et al. [31]:

L_{smooth} = E_p [ Σ_{d∈{x,y}} exp(−λ_c I_d(p)) |∂²f_{s,t}(p)/∂d²| ],   (8)

where I_d(p) = (1/3) Σ_c |∂I_c(p)/∂d| is the spatial derivative averaged over all color channels I_c in direction d. The parameter λ_c controls the influence of similarly colored pixels. We apply this loss to each scale of the model.

3.2. Handling occlusion

While effective for most image content, nonparametric matching has no mechanism for estimating the motion of pixels that become occluded, since it requires a corresponding patch in the next frame. We propose a variation of the model that combines the multiscale contrastive random walk with a regression module that directly predicts the flow values at each pixel.

The architecture of our regressor closely follows the refinement module of PWC-net [63]. We learn a function g_reg(·) that regresses the flow at each pixel from the multiscale contrastive random walk cost volume and convolutional features. These features are obtained from the same shared backbone that is used to compute the embeddings.

Regression loss. We choose a loss that approximates the contrastive random walk objective, such that pixels that are well-matched by the nonparametric model (e.g., non-occluded pixels) are unlikely to change their flow values, while the flow values of occluded pixels are obtained using a smoothness prior. To this end, we use the same smoothness loss, L_{smooth}, as the nonparametric model (Eq. 8), and penalize the model for deviating from the nonparametric flow estimate (Eq. 3). We also use our learned embeddings as features for a learned photometric loss, i.e. incurring loss if the model puts two pixels with dissimilar embedding vectors into correspondence. This results in an additional loss:

L_{feat} = ‖X_s − warp(X_t, f_{s,t})‖² + λ_a ‖f_{s,t} − g_avg(A_{s,t})‖²,   (9)

where f_{s,t} is the predicted flow and λ_a is a constant. To prevent the regression loss from influencing the learned embeddings that it is based on, we do not propagate gradients from the regressor to the embeddings X during training. As in Eq. 7, we apply the loss to each scale and sum.

Augmentation and masking. We follow [31] and improve the regressor’s handling of occlusions through augmentation, and by discounting the loss of pixels that fail a consistency check [40]. To handle pixels that move off-screen, we compute flow, then randomly crop the input images and compute it again, penalizing deviations between the flows. This results in a new loss, L_{bound}. We also remove the contribution of pixels in the photometric loss (Eq. 9) if they have no correspondence in the backwards-in-time flow estimate from t to s. Both are implemented exactly as in [31]. These losses apply only to the regressor, and thus do not directly affect the nonparametric matching.
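For concreteness, here is a sketch of two of the losses above: the edge-aware second-order smoothness of Eq. 8 and the feature-space regression loss of Eq. 9, including the stop-gradient that keeps the regressor from reshaping the embeddings. It reuses the warp helper from the coarse-to-fine sketch above; the border handling, tensor layouts, and default weights (λ_c taken from the supplement’s Tab. 9) are illustrative assumptions rather than the released code.

```python
import torch

def smoothness_loss(flow, image, lambda_c=150.0):
    """Eq. 8: edge-aware second-order smoothness. flow: (B, 2, H, W); image: (B, 3, H, W).
    Differences wrap around the border here; a real implementation would crop them."""
    loss = 0.0
    for dim in (-1, -2):                                         # x, then y direction
        d_img = (image.roll(-1, dims=dim) - image).abs().mean(1, keepdim=True)
        weight = torch.exp(-lambda_c * d_img)                    # exp(-lambda_c * I_d(p))
        d2_flow = (flow.roll(-1, dims=dim) + flow.roll(1, dims=dim) - 2 * flow).abs()
        loss = loss + (weight * d2_flow).mean()
    return loss

def feature_regression_loss(x_s, x_t, flow_pred, flow_nonparam, lambda_a=1.0):
    """Eq. 9: penalize the regressed flow for matching pixels whose learned embeddings
    disagree, and for straying from the nonparametric estimate g_avg(A). Gradients are
    blocked so the regressor does not influence the embeddings it is scored against."""
    x_s, x_t = x_s.detach(), x_t.detach()                        # stop-gradient to embeddings
    photometric = (x_s - warp(x_t, flow_pred)).pow(2).mean()
    anchor = (flow_pred - flow_nonparam.detach()).pow(2).mean()
    return photometric + lambda_a * anchor
```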
3.3. Training

Objective. The pure nonparametric model (Section 3.1) can be trained by simply minimizing the multiscale contrastive random walk loss with a smoothness penalty:

L_{non} = L_{msCRW} + L_{smooth}.   (10)

Adding the regressor results in the following loss:

L_{reg} = L_{msCRW} + L_{smooth} + L_{feat} + L_{bound}.   (11)

We include weighting factors to control the relative importance of each loss (specified in the supplement).

Architecture. To provide a simple comparison with unsupervised optical flow methods, we use the PWC-net architecture [63] as our network backbone, after reducing the number of filters [31]. This network uses the feature hierarchy from a convolutional network to provide the features at each scale. We use the cost volume from this approach as our embedding representation for the random walk, after performing ℓ2 normalization. We adapt the cost volume estimation X_s^l and the regressor. We provide architecture details in the supplementary material.

Subcycles. We follow [25] and include subcycles in our contrastive random walks: when training the model on k-frame videos, we include losses for walks of length k, k − 1, ..., 2. These losses can be estimated efficiently by reusing the transition matrices for the full walk, as sketched below.

Multi-frame training. When training with k > 2 frames, we use curriculum learning to speed up and stabilize training. We train the model to convergence with 2, 3, ..., k frame cycles in succession.
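The subcycle losses can share the running forward product of transition matrices, so that walks of length 2, ..., k add little extra cost. The following is a sketch under our own naming conventions; the backward chain is recomputed here for clarity.

```python
import torch

def multi_cycle_crw_loss(forward_A, backward_A):
    """Cycle losses for all sub-walks of length 2..k, reusing the running forward
    product of the full walk. forward_A[i] maps frame i -> i+1; backward_A[i] maps
    frame i+1 -> i (both row-stochastic, as in Eq. 1)."""
    n = forward_A[0].shape[0]
    loss = 0.0
    fwd = torch.eye(n, device=forward_A[0].device)
    for j in range(len(forward_A)):
        fwd = fwd @ forward_A[j]                 # reused prefix of forward transitions
        walk = fwd
        for i in range(j, -1, -1):               # close the cycle back to the first frame
            walk = walk @ backward_A[i]
        loss = loss - torch.log(torch.diagonal(walk) + 1e-12).mean()
    return loss
```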
Optimization. To implement the contrastive random walk, we exploit the sparsity of our coarse-to-fine formulation, and represent the transition matrices A_{s,t}^l as sparse matrices. This significantly improved training times and reduced memory requirements, especially at the finest scales. It takes approximately 3 days to train the full model on one GTX 2080 Ti. We train our network with PyTorch [48], using the Adam [33] optimizer with a cyclic learning rate schedule [57] with a base learning rate of 10^{-4} and a max learning rate of 5 × 10^{-4}. We provide training hyperparameters for each dataset in Section B.

4. Results

Our model produces two outputs: the optical flow fields and the pixel trajectories (which are captured in the transition matrices). We evaluated these predictions on label transfer and motion estimation tasks. We compare them to space-time correspondence learning methods, and with unsupervised optical flow methods.

4.1. Datasets

For simple comparison with other methods, we train on standard optical flow datasets. We note that the training protocols used by the unsupervised optical flow literature are not standardized. We therefore follow the evaluation setup of [31]. We pretrain models on unlabeled videos from the Flying Chairs dataset [11]. We then train on the KITTI-2015 [16] multi-view extension and Sintel [10]. To evaluate our model’s ability to learn from internet video, we also trained the model on YouTube-VOS [78], without pretraining on any other datasets.

We also evaluate our model on standard label transfer tasks. The JHMDB benchmark [30] transfers 15 body parts to future frames over long time horizons. The DAVIS benchmark [50] transfers object masks.

4.2. Label propagation

We evaluate our learned model’s ability to perform video label propagation, a task widely studied in video representation learning work. The goal is to propagate a label map provided in an initial video frame, which might describe keypoints or object segments, to the rest of the video.

We follow Jabri et al. [25] and use our model’s probabilistic motion trajectories to guide label propagation. We infer the labels for each target frame t auto-regressively. For each previous source frame s, we have a predicted label map L_s ∈ R^{n×c}, where n is the number of pixels and c is the number of classes. As in [25], we compute K_{t,s}^l, the matrix of weights for attention on the source frames, by keeping the top-k logits in each row of Ā^l_{t,s}. We use K_{t,s} as an attention matrix for each label, i.e. L_t = K_{t,s} L_s. Using several source frames as context allows for overcoming occlusion.

We use variations of our model that were trained on the unlabeled Sintel and YouTube-VOS datasets, and use the transition matrix and flow fields at the penultimate level of the pyramid, i.e. level 4. Since the transition matrix describes residual motion, we warp (i.e. with f_{t,s}^4) each label map before querying. Finally, since the features in level 4 have only 16 channels, we stack features from levels 3 and 4 to obtain hypercolumns [19] before computing attention.

Evaluation. We compared our model to recent video representation learning work, including single-scale CRW [25], UVC [36], and the state-of-the-art VFS [76] (Tab. 1). We also report a baseline that uses a state-of-the-art unsupervised flow method, UFlow [31], to obtain motion trajectories by chaining flow.

For pose propagation, we evaluate our model on JHMDB [30] and report the PCK metric, which measures the fraction of keypoints within various distances to the ground truth. Our approach outperforms existing self-supervised approaches.
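Concretely, the label propagation step can be written as top-k masked attention over one or more source frames. The function below is our own illustration of the protocol of [25] described above; the top-k value and the simple averaging over context frames are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def propagate_labels(affinity_logits, source_labels, topk=10):
    """Propagate label maps to the target frame with top-k masked attention.
    affinity_logits: list of (n_t, n_s) logits from target-frame nodes to each source frame.
    source_labels:   list of (n_s, c) label maps (e.g. one-hot masks or keypoint heatmaps)."""
    out = 0.0
    for logits, labels in zip(affinity_logits, source_labels):
        vals, idx = logits.topk(topk, dim=1)
        masked = torch.full_like(logits, float("-inf")).scatter_(1, idx, vals)
        K = F.softmax(masked, dim=1)             # K_{t,s}: attention over source pixels
        out = out + K @ labels                   # L_t = K_{t,s} L_s
    return out / len(affinity_logits)            # average over the context frames
```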
Figure 3. Propagating segments and pose along motion trajectories. We show qualitative results for JHMDB pose (left) and object masks on DAVIS (right). For DAVIS scenes, we show examples of mask propagations (top) and soft propagated label distributions (bottom).

                          Pose (PCK)             Segments
Method          Params    @0.05   @0.1   @0.2    J&Fm
TimeCycle [72]  25.6M     –       57.3   78.1    50.1
UVC [36]        11M       –       58.6   79.6    59.5
CRW [25]        11M       29.1    59.3   80.6    67.6
VFS [76]        25.6M     –       60.9   80.7    68.8
UFlow [31]†

model obtains significantly better performance on all metrics using 3-frame and 4-frame cycles.

4.3.2 Nonparametric motion estimation

Our model is able to estimate motion solely through nonparametric matching, as can be seen qualitatively in Fig. 4.
Ground Truth Nonparametric (Ours) Regression (Ours) Census Census + Regression (Ours) KITTI Sintel-Final Figure 4. Optical flow qualitative results. We show results on images not seen during training, using our nonparametric-only and regression-based models. The highlighted regions show significant differences between the regression-based models. Table 2. Nonparametric motion estimation. We single-scale, Table 3. Model configurations. All models are 2-frame. nonparametric, and regression-based methods. Regressor-only finds a shortcut solution, predicting zero flow. Method KITTI-15 train KITTI-15 train noc all ER% (occ) Configuration noc all ER% Jabri et al. [25] 12.63 19.41 64.50 Ours - Nonparametric 2.18 9.42 27.98 Full 2.09 3.86 12.45 Ours - With Regression 2.09 3.86 12.45 Nonparametric only 2.18 9.42 27.98 No feature consistency in Lfeat 2.20 5.02 17.53 Regressor only 14.21 21.45 41.34 our features obtained competitive performance on non- No regressor constraint in Lfeat 5.54 10.44 26.43 occluded pixels, but that there was a significant advantage No Lbound 2.14 4.88 16.54 to Census features on the all metric. This is understand- No Lsmooth 10.98 17.43 34.85 able since the contrastive random walk does not have a way of learning features for occluded pixels. Interestingly, we results but outperforms the nonparametric model on the all found that combining the two features together improved metric. We also see that Lbound loss improves the accu- performance, and that the gap improves further when 3- racy of occluded pixels (due to image boundaries). We also frame walks are used, obtaining the overall best results. try different window sizes for the contrastive random walk Finally, we used our learned features as part of the pho- (Tab. 6), finding that larger sizes improve results. tometric loss for ARFlow [39], a very recent unsupervised Training on internet video. We found that our model flow model. We added the Lfeat loss to their model and generalized well to benchmark datasets when training retrained it (while, as in ours, jointly learning the features solely on YouTube-VOS [78] (Tab. 7). For comparison, through a contrastive random walk). The resulting model we also trained ARFlow [39] on YouTube-VOS. Our model improves performance on all metrics, with a larger gain on obtained better performance on KITTI, while ARFlow per- Sintel (Tab. 4(a)). formed better on Sintel. 4.3.4 Motion estimation ablations Robustness on jittered images. To help understand the To help understand which properties of our model con- flexibility of our self-supervised model, we trained a vari- tribute to its performance, we perform an ablation study on ation of our model on images with large, simulated bright- the KITTI-15 dataset (Table 3). We asked how the differ- ness and hue variations, inspired by the challenges of rapid ent losses contribute to the performance. We ablated the exposure changes (e.g., in HDR photography). During smoothness loss (Eq. 8), the self-supervision loss (Sec. 3.2), training and testing, we randomly jitter the brightness and and removing the constraint on the regressor in Eq. 9 by set- hue of the second image in KITTI by a factor of up to 0.6 ting λa = 0. We see that the smoothness loss significantly and 0.3 respectively, using PyTorch’s [48] built-in augmen- improves results. Similarly, we discarded the feature con- tation. We finetuned the variation of our model that com- sistency term from Eq. 
9, which reduces the quality of the bines our learned features with Census features (Tab. 5), 7
Sintel KITTI-15 Clean Final train test train test train test Method EPE EPE EPE EPE all noc ER % ER % Supervised FlowNetC [12] (3.78) 6.85 (5.28) 8.51 - - - - FlowNet2 [24] (1.45) 4.16 (2.01) 5.74 (2.30) - (8.61) 11.48 PWC-Net [63] (1.70) 3.86 (2.21) 5.13 (2.16) - (9.80) 9.60 RAFT [66] (0.76) 1.94 (1.22) 3.18 (0.63) - (1.50) 5.10 MFOccFlow [28]∗ {3.89} 7.23 {5.52} 8.81 [6.59] [3.22] - 22.94 EPIFlow [82] 3.94 7.00 5.08 8.51 5.56 2.56 - 16.95 Unsupervised DDFlow [41] {2.92} 6.18 {3.98} 7.40 [5.72] [2.73] - 14.29 SelFlow [42]∗ [2.88] [6.56] {3.87} {6.57} [4.84] [2.40] - 14.19 UFlow [31] {2.50} 5.21 {3.39} 6.50 {2.71} {1.88} {9.05} 11.13 SMURF-PWC [58] 2.63 - 3.66 - 2.73 - 9.33 - SMURF-RAFT [58] {1.71} 3.15 {2.58} 4.18 {2.00} {1.41} {6.42} 6.83 ARFlow [39] [2.79] [4.78] [3.73] [5.89] [2.85] - - [11.80] Ours + ARFlow [2.71] [4.70] [3.61] [5.76] [2.81] [2.17] [11.25] [11.67] Ours (2-cycle) {2.84} 5.68 {3.82} 6.72 {3.86} {2.09} {12.45} 13.10 Ours Ours (3-cycle) {2.77} 5.42 {3.61} 6.63 {3.46} {2.05} {12.28} 12.78 Ours (4-cycle) {2.72} 5.37 {3.55} 6.59 {3.39} {2.04} {12.14} 12.62 (a) Optical flow benchmarks (b) Robustness to jittering Table 4. (a) Our model is trained on the train/test splits of the corresponding datasets. We reprint numbers from [31] and adopt their convention that “{}” are trained on unlabeled evaluation data, “[]” are trained on related data (e.g., full Sintel movie) and “()” are supervised. Methods that use 3 frames at test time are marked with ∗. (b) We evaluate robustness to brightness and hue jittering. Table 5. Photometric features. We evaluate the effectiveness of Table 7. Training on internet video. We train on YouTube- our features when they are used to define a photometric loss. VOS [78] and evaluate on optical flow benchmarks. KITTI-15 train Sintel train KITTI-15 train Clean Final noc all ER% Losses noc all ER% Ours (2-cycle) 3.01 4.02 2.20 3.83 13.17 Charbonnier 2.28 5.69 19.30 ARFlow [39] 2.91 3.85 2.32 4.02 14.04 Census 2.05 3.14 11.04 Our feats. 2.09 3.86 12.45 include models that use different numbers of frames for the Our feats. + Charbonnier 2.21 4.51 14.25 random walk, and a variation of ARFlow [39] that uses our Our feats. + Census 2.04 3.10 10.89 self-supervised features to augment its photometric loss. Our feats. (3-frame) + Census 2.02 3.03 10.67 We found that our models outperform many recent un- supervised optical flow methods, including the (3-frame) Table 6. Contrastive random walk ablations. We evaluate dif- MFOccFlow [28] and EPIFlow. In particular, we signif- ferent model parameters, including cycle length. icantly outperform the recent SelFlow method [42], despite Sintel train KITTI-15 train the fact that it takes 3 frames as input at test time and Clean Final noc all ER% uses Census Transform features. By contrast, our model uses no hand-crafted image features. The highest perform- Win. size Cycle len. 2 2.84 3.82 2.09 3.68 12.45 ing method is the very recent, highly optimized SMURF 3 2.77 3.61 2.05 3.46 12.28 4 2.72 3.55 2.04 3.39 12.14 model [58], which uses the RAFT [66] architecture instead 3×3 8.43 11.43 5.18 7.21 30.85 of PWC-net and which extends UFlow [31]. This model 7×7 3.02 4.13 2.24 4.40 14.53 uses a variety of additional training signals, such as ex- 11 × 11 2.84 3.82 2.09 3.68 12.45 tensive data augmentation, occlusion inpainting with multi- stage training, self-distillation, and hand-crafted features. since it obtained strong performance on KITTI. We also finetuned a model with only Census features. We found that 5. 
Discussion the model with our learned features performed significantly We have proposed a method for learning dense mo- better, and that the gap increased with the magnitude of the tion estimation using multiscale contrastive random walks. jittering (Tab. 4b). We show qualitative results in Fig. 5. We see our work as a step toward unifying self-supervised tracking and optical flow. By extending methods from 4.3.5 Comparison to recent optical flow methods the former, our model obtains a useful learning signal for To help understand our model’s overall performance, we dense temporal correspondence from multi-frame video, compare it to recent optical flow methods (Table 4). We and learns representations well-suited for nonparametric 8
[8] Andrés Bruhn, Joachim Weickert, and Christoph Schnörr. Lucas/kanade meets horn/schunck: Combining local and global optic flow methods. International journal of computer vision, 61(3):211–231, 2005. 2 [9] Peter J. Burt and Edward H. Adelson. The laplacian pyramid as a compact image code. IEEE Trans. Commun., 31:532– 540, 1983. 1 Figure 5. Varying brightness and hue. Qualitative comparison [10] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A nat- between the hand-crafted Census transform feature and a model uralistic open source movie for optical flow evaluation. In that combines these features with our self-supervised features. A. Fitzgibbon et al. (Eds.), editor, European Conf. on Com- puter Vision (ECCV), Part IV, LNCS 7577, pages 611–625. matching and label transfer. Through its use of multiscale Springer-Verlag, October 2012. 5 matching and architectures from the latter, it obtains dense [11] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, motion estimates and handles occlusion. As a result, we V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: obtain a single model for unsupervised optical flow, pose Learning optical flow with convolutional networks. In IEEE tracking, and video object segmentation. International Conference on Computer Vision (ICCV), 2015. Limitations and impact. Motion analysis has many ap- 5 plications, such as in health monitoring, surveillance, and [12] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip security. There is also a potential for the technology to Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van be used for harmful purposes if weaponized. The released Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: models are limited in scope to the benchmark datasets used Learning optical flow with convolutional networks. In Pro- in training. ceedings of the IEEE international conference on computer vision, 2015. 3, 8 Acknowledgements. We thank David Fouhey for the [13] PF Felzenszwalb and DR Huttenlocher. Efficient belief prop- helpful feedback. AO thanks Rick Szeliski for introduc- agation for early vision. In Proceedings of the 2004 IEEE ing him to multi-frame optical flow. This research was sup- Computer Society Conference on Computer Vision and Pat- ported in part by Toyota Research Institute, Cisco Systems, tern Recognition, 2004. CVPR 2004., volume 1, pages I–I. and Berkeley Deep Drive. IEEE, 2004. 2 References [14] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick Van der [1] Robert Anderson, David Gallup, Jonathan T Barron, Janne Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learn- Kontkanen, Noah Snavely, Carlos Hernández, Sameer Agar- ing optical flow with convolutional networks. arXiv, 2015. wal, and Steven M Seitz. Jump: virtual reality video. ACM 4 Transactions on Graphics (TOG), 35(6):1–13, 2016. 2 [15] William T Freeman, Egon C Pasztor, and Owen T [2] Simon Baker and Iain Matthews. Lucas-kanade 20 years on: Carmichael. Learning low-level vision. International jour- A unifying framework. In International Journal of Computer nal of computer vision, 40(1):25–47, 2000. 3 Vision, pages 221–255, February 2004. 2 [16] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel [3] Simon Baker, Daniel Scharstein, JP Lewis, Stefan Roth, Urtasun. Vision meets robotics: The kitti dataset. Interna- Michael J Black, and Richard Szeliski. A database and eval- tional Journal of Robotics Research (IJRR), 2013. 5 uation methodology for optical flow. IJCV, 2011. 
3 [17] Daniel Gordon, Kiana Ehsani, Dieter Fox, and Ali Farhadi. [4] Michael J Black and Paul Anandan. The robust estimation Watching the world go by: Representation learning from un- of multiple motions: Parametric and piecewise-smooth flow labeled videos, 2020. 2 fields. Computer vision and image understanding, 63(1):75– [18] Philip Haeusser, Alexander Mordvintsev, and Daniel Cre- 104, 1996. 2, 3 mers. Learning by association–a versatile semi-supervised [5] Thomas Brox, Christoph Bregler, and Jitendra Malik. Large training method for neural networks. In Proceedings of the displacement optical flow. In CVPR, 2009. 2 IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 89–98, 2017. 3 [6] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a [19] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Ji- theory for warping. In ECCV, 2004. 2, 4 tendra Malik. Simultaneous detection and segmentation. In Eur. Conf. Comput. Vis., pages 297–312, 2014. 5 [7] Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. In European con- [20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross ference on computer vision, pages 282–295. Springer, 2010. Girshick. Momentum contrast for unsupervised visual rep- 3 resentation learning. In CVPR, 2020. 1 9
[21] Berthold Horn and Brain Schunck. Determining optical flow. [36] Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang, In Artificial Intelligence, pages 185–203, 1981. 2 Jan Kautz, and Ming-Hsuan Yang. Joint-task self-supervised [22] Zhaoyang Huang, Xiaokun Pan, Runsen Xu, Yan Xu, learning for temporal correspondence. In Advances in Neu- Kachun Cheung, Guofeng Zhang, and Hongsheng Li. ral Information Processing Systems, pages 317–327, 2019. Life: Lighting invariant flow estimation. arXiv preprint 2, 5, 6 arXiv:2104.03097, 2021. 3 [37] Yunpeng Li and Daniel P Huttenlocher. Learning for optical [23] Junhwa Hur and Stefan Roth. Mirrorflow: Exploiting sym- flow using stochastic optimization. In European Conference metries in joint optical flow and occlusion estimation. In on Computer Vision, pages 379–391. Springer, 2008. 3 Proceedings of the IEEE International Conference on Com- [38] Ce Liu et al. Beyond pixels: exploring new representa- puter Vision, pages 312–321, 2017. 3 tions and applications for motion analysis. PhD thesis, Mas- [24] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, sachusetts Institute of Technology, 2009. 2 Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolu- [39] Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao tion of optical flow estimation with deep networks. In Pro- Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, ceedings of the IEEE conference on computer vision and pat- and Feiyue Huang. Learning by analogy: Reliable super- tern recognition, pages 2462–2470, 2017. 3, 8 vision from transformations for unsupervised optical flow [25] Allan Jabri, Andrew Owens, and Alexei A Efros. Space- estimation. In Proceedings of the IEEE/CVF Conference time correspondence as a contrastive random walk. Neural on Computer Vision and Pattern Recognition, pages 6489– Information Processing Systems (NeurIPS), 2020. 1, 2, 3, 5, 6498, 2020. 2, 7, 8, 12 6, 7 [40] Pengpeng Liu, Irwin King, Michael R Lyu, and Jia Xu. [26] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Ddflow: Learning optical flow with unlabeled data distilla- Koray Kavukcuoglu. Spatial transformer networks, 2015. 2 tion. In Proceedings of the AAAI Conference on Artificial [27] Joel Janai, Fatma Güney, Anurag Ranjan, Michael Black, Intelligence, volume 33, pages 8770–8777, 2019. 2, 4 and Andreas Geiger. Unsupervised learning of multi-frame [41] Pengpeng Liu, Irwin King, Michael R. Lyu, and Jia Xu. optical flow with occlusions. In ECCV, 2018. 2, 3 DDFlow: Learning optical flow with unlabeled data distil- [28] Joel Janai, Fatma Güney, Anurag Ranjan, Michael J. Black, lation. AAAI, 2019. 8 and Andreas Geiger. Unsupervised learning of multi-frame [42] Pengpeng Liu, Michael Lyu, Irwin King, and Jia Xu. Self- optical flow with occlusions. ECCV, 2018. 8 low: Self-supervised learning of optical flow. In Proceedings [29] Joel Janai, Fatma Guney, Jonas Wulff, Michael J Black, and of the IEEE/CVF Conference on Computer Vision and Pat- Andreas Geiger. Slow flow: Exploiting high-speed cameras tern Recognition, pages 4571–4580, 2019. 2, 6, 8 for accurate and diverse optical flow reference data. In Pro- ceedings of the IEEE Conference on Computer Vision and [43] Bruce D Lucas and Takeo Kanade. An iterative image regis- Pattern Recognition, 2017. 3 tration technique with an application to stereo vision. IJCAI, 1981. 1, 2 [30] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. 
Towards understanding ac- [44] Kunming Luo, Chuan Wang, Shuaicheng Liu, Haoqiang Fan, tion recognition. In Proceedings of the IEEE international Jue Wang, and Jian Sun. Upflow: Upsampling pyramid conference on computer vision, pages 3192–3199, 2013. 5 for unsupervised optical flow learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern [31] Rico Jonschkowski, Austin Stone, Jonathan T Barron, Ariel Recognition, pages 1045–1054, 2021. 2 Gordon, Kurt Konolige, and Anelia Angelova. What matters in unsupervised optical flow. arXiv preprint [45] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Un- arXiv:2006.04902, 2020. 1, 2, 3, 4, 5, 6, 8, 12 supervised learning of optical flow with a bidirectional cen- [32] Ryan Kennedy and Camillo J Taylor. Optical flow with geo- sus loss. AAAI, 2018. 1, 6 metric occlusion estimation and fusion of multiple frames. In [46] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Un- International Workshop on Energy Minimization Methods in supervised learning of optical flow with a bidirectional cen- Computer Vision and Pattern Recognition, pages 364–377. sus loss. In Proceedings of the AAAI Conference on Artificial Springer, 2015. 3 Intelligence, volume 32, 2018. 2 [33] Diederik P Kingma and Jimmy Ba. Adam: A method for [47] Roland Memisevic and Geoffrey Hinton. Unsupervised stochastic optimization. arXiv, 2014. 5, 12 learning of image transformations. In 2007 IEEE Confer- [34] Zihang Lai, Erika Lu, and Weidi Xie. Mast: A ence on Computer Vision and Pattern Recognition, pages 1– memory-augmented self-supervised tracker. arXiv preprint 8. IEEE, 2007. 2 arXiv:2002.07793, 2020. 1, 2 [48] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, [35] Cheng Lei and Yee-Hong Yang. Optical flow estimation on James Bradbury, Gregory Chanan, Trevor Killeen, Zeming coarse-to-fine region-trees using discrete optimization. In Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- 2009 IEEE 12th International Conference on Computer Vi- perative style, high-performance deep learning library. arXiv sion, pages 1562–1569. IEEE, 2009. 3 preprint arXiv:1912.01703, 2019. 5, 7, 12 10
[49] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, [63] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. M. Gross, and A. Sorkine-Hornung. A benchmark dataset Pwc-net: Cnns for optical flow using pyramid, warping, and and evaluation methodology for video object segmentation. cost volume, 2017. 2, 3, 4, 5, 8 In Computer Vision and Pattern Recognition, 2016. 6 [64] Narayanan Sundaram, Thomas Brox, and Kurt Keutzer. [50] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar- Dense point trajectories by gpu-accelerated large displace- beláez, Alexander Sorkine-Hornung, and Luc Van Gool. ment optical flow. In European conference on computer vi- The 2017 davis challenge on video object segmentation. sion, pages 438–451. Springer, 2010. 3 arXiv:1704.00675, 2017. 5, 6 [65] Yansong Tang, Zhenyu Jiang, Zhenda Xie, Yue Cao, Zheng [51] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Zhang, Philip HS Torr, and Han Hu. Breaking shortcut: Cordelia Schmid. Epicflow: Edge-preserving interpolation Exploring fully convolutional cycle-consistency for video of correspondences for optical flow. In CVPR, 2015. 2 correspondence learning. arXiv preprint arXiv:2105.05838, 2021. 2 [52] Dan Rosenbaum, Daniel Zoran, and Yair Weiss. Learning the local statistics of optical flow. Advances in Neural Infor- [66] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field mation Processing Systems, 26:2373–2381, 2013. 4 transforms for optical flow, 2020. 2, 3, 8 [53] Stefan Roth, Victor Lempitsky, and Carsten Rother. [67] Sebastian Volz, Andres Bruhn, Levi Valgaerts, and Henning Discrete-continuous optimization for optical flow estimation. Zimmer. Modeling temporal coherence for optical flow. In In Statistical and Geometrical Approaches to Visual Motion 2011 International Conference on Computer Vision, pages Analysis, pages 1–22. Springer, 2009. 2 1116–1123. IEEE, 2011. 3 [68] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio [54] Michael Rubinstein, Ce Liu, and William T Freeman. To- Guadarrama, and Kevin Murphy. Tracking emerges by col- wards longer long-range motion trajectories. 2012. 3 orizing videos. In ECCV, 2017. 1 [55] Peter Sand and Seth Teller. Particle video: Long-range mo- [69] Chia-Ming Wang, Kuo-Chin Fan, Cheng-Tzu Wang, and tion estimation using point trajectories. ICCV, 2008. 3 Tong-Yee Lee. Estimating optical flow by integrating multi- [56] Alexander Shekhovtsov, Ivan Kovtun, and Václav Hlaváč. frame information. Journal of Information Science & Engi- Efficient mrf deformation model for non-rigid image match- neering, 24(6), 2008. 3 ing. Computer Vision and Image Understanding, 112(1):91– [70] Heng Wang, Alexander Kläser, Cordelia Schmid, and 99, 2008. 2 Cheng-Lin Liu. Dense trajectories and motion boundary [57] Leslie N Smith and Nicholay Topin. Super-convergence: descriptors for action recognition. International journal of Very fast training of neural networks using large learning computer vision, 103(1):60–79, 2013. 3 rates. In Artificial Intelligence and Machine Learning for [71] Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Multi-Domain Operations Applications, volume 11006, page Liu, and Houqiang Li. Unsupervised deep tracking. In Pro- 1100612. International Society for Optics and Photonics, ceedings of the IEEE Conference on Computer Vision and 2019. 5, 12 Pattern Recognition, pages 1308–1317, 2019. 2 [58] Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia An- [72] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learn- gelova, and Rico Jonschkowski. 
Smurf: Self-teaching multi- ing correspondence from the cycle-consistency of time. In frame unsupervised raft with full-image warping. In Pro- CVPR, 2019. 1, 6 ceedings of the IEEE/CVF Conference on Computer Vision [73] Yang Wang, Yi Yang, and Wei Xu. Occlusion aware unsu- and Pattern Recognition, pages 3887–3896, 2021. 2, 8 pervised learning of optical flow. 2018. 2, 3 [59] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of [74] Zhongdao Wang, Hengshuang Zhao, Ya-Li Li, Shengjin optical flow estimation and their principles. In CVPR, 2010. Wang, Philip Torr, and Luca Bertinetto. Do different tracking 1, 2, 4, 6 tasks require different appearance models? NeruIPS, 2021. [60] Deqing Sun, Stefan Roth, JP Lewis, and Michael J Black. 1 Learning optical flow. In European Conference on Computer [75] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Vision, pages 83–97. Springer, 2008. 2, 3 Cordelia Schmid. Deepflow: Large displacement optical [61] Deqing Sun, Erik Sudderth, and Michael Black. Layered im- flow with deep matching. In Proceedings of the IEEE inter- age motion with explicit occlusions, temporal consistency, national conference on computer vision, pages 1385–1392, and depth ordering. Advances in Neural Information Pro- 2013. 2 cessing Systems, 23:2226–2234, 2010. 3 [76] Jiarui Xu and Xiaolong Wang. Rethinking self-supervised [62] Deqing Sun, Daniel Vlasic, Charles Herrmann, Varun correspondence learning: A video frame-level similarity per- Jampani, Michael Krainin, Huiwen Chang, Ramin Zabih, spective. arXiv preprint arXiv:2103.17263, 2021. 2, 5, 6 William T Freeman, and Ce Liu. Autoflow: Learning a better [77] Li Xu, Jianing Chen, and Jiaya Jia. A segmentation based training set for optical flow. In Proceedings of the IEEE/CVF variational model for accurate optical flow estimation. In Conference on Computer Vision and Pattern Recognition, European Conference on Computer Vision, pages 671–684. pages 10093–10102, 2021. 3 Springer, 2008. 3 11
[78] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Table 8. Number of scales in the multiscale random walk. We Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: perform the contrastive random walk on only # of levels from A large-scale video object segmentation benchmark. arXiv coarse to fine. preprint arXiv:1809.03327, 2018. 5, 7, 8 [79] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learn- # levels KITTI-15 train ing of dense depth, optical flow and camera pose. In CVPR, noc all ER% (occ) 2018. 2 1 4.45 8.98 24.52 [80] Jason J. Yu, Adam W. Harley, and Konstantinos G. Derpanis. 2 2.70 6.31 17.64 Back to basics: Unsupervised learning of optical flow via 3 2.36 4.55 14.35 brightness constancy and motion smoothness. In Computer 4 2.12 3.97 12.79 Vision - ECCV 2016 Workshops, Part 3, 2016. 1, 2 5 2.09 3.86 12.45 [81] Jean yves Bouguet. Pyramidal implementation of the lucas kanade feature tracker. Intel Corporation, Microprocessor Table 9. Hyperparameter settings Research Labs, 2000. 1 [82] Yiran Zhong, Pan Ji, Jianyuan Wang, Yuchao Dai, and Hong- Hyperparameter Values dong Li. Unsupervised deep epipolar flow for stationary or Learning rate schedule CyclicLR dynamic scenes. In Proceedings of the IEEE/CVF Confer- Base learning rate 10−4 ence on Computer Vision and Pattern Recognition, pages Max learning rate 5 × 10−4 12095–12104, 2019. 2, 8 Temperature τ 0.07 [83] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled Video length 2, 3, 4 and unlabeled data with label propagation. Technical report, Window Size (k × k) 3,7,11 2002. 1 Loss weight Values [84] CW Zitnick, Nebojsa Jojic, and Sing Bing Kang. Consistent Contrastive random walk loss 1 segmentation for optical flow estimation. In Tenth IEEE In- Smoothness loss 30 ternational Conference on Computer Vision (ICCV’05) Vol- Boundary loss 1 ume 1, volume 2, pages 1308–1315. IEEE, 2005. 2 Regressor constraint loss 1 [85] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Un- feature consistency loss 0.1 supervised joint learning of depth and flow using cross-task λ in smoothness loss 150 consistency. ECCV, 2018. 3 by randomly shifting the hue and brightness and randomly A. Ablating Scales in the Random Walk flipped left/right. The augmentation is kept the same across Our model learns feature representations by performing frame in a pair of images. In contrast to other work [31], we a multiscale contrastive random walk on L = 5 spatial did not modify the model per dataset. scales. Here we ask how the number of scales affects perfor- mance. Instead of performing the contrastive random walk Training details. We train our network with Py- on all scales, we perform it only the first (coarse) scales. Torch [48], using the Adam [33] optimizer with a cyclic Subsequent (finer) scales use the untrained (random) em- learning rate schedule [57] with a base learning rate of 10−4 beddings, but are otherwise unchanged. We found that re- and a max learning rate of 5 × 10−4 . We use batch size of ducing the number of scales greatly reduced performance, 4 for 2-cycle model and 2 for 3- and 4-cycle models (due suggesting that multiscale estimation and feature learning to memory constraints). The total training takes approxi- are critical components of our model. mately four days on two GTX 2080Ti, two days for training on Flying Chairs and two days for training on Sintel/KITTI. B. Hyperparameters Experiments on Sintel and KITTI start from a model that was first trained on Flying Chairs, as in [31]. 
We list the hyperparameters and ranges considered dur- ing our experiments. Weights for the boundary loss, learned C. Architecture photometric loss are hand-chosen. Parameters in bold are the ones that were systematically explored via ablations. We provide additional details about the network archi- For the image resolution, we follow the experimental setup tecture, which closely resembles PWC-Net with the simpli- of Jonschkowski et al. [31] i.e., Flying Chairs: 384 × 512, fications introduced by ARflow [39]. We attach a 1 × 1 Sintel: 448 × 1024, KITTI: 640 × 640. We use the same convolutional layer to the each scale to obtain the embed- loss weight across different scales for a specific type of loss. dings (of 32 channels) for contrastive random walk at each RGB image values are scaled to [−1, 1] and augmented scale. 12
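A minimal sketch of the embedding head just described: one 1 × 1 convolution per pyramid level, mapping backbone features to 32-channel embeddings that are ℓ2-normalized before the contrastive random walk (Sec. 3.3). The module and argument names are our own illustration, not the released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHeads(nn.Module):
    """One 1x1 convolution per pyramid scale, mapping backbone features to
    32-channel, L2-normalized embeddings for the contrastive random walk."""
    def __init__(self, in_channels_per_level, emb_dim=32):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv2d(c, emb_dim, kernel_size=1) for c in in_channels_per_level])

    def forward(self, pyramid):
        # pyramid: list of backbone feature maps, one per scale, coarse to fine
        return [F.normalize(head(feat), dim=1)
                for head, feat in zip(self.heads, pyramid)]
```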
Figure 6. Additional qualitative results for KITTI and Sintel. (Panel labels: KITTI, Sintel-Final; Ground Truth, Nonparametric (Ours), Regression (Ours), Census, Census + Regression (Ours).)

D. Additional Qualitative Results

We provide additional qualitative results for optical flow in Figure 6.