Learning Pixel Trajectories with Multiscale Contrastive Random Walks


Zhangxing Bian (1,3), Allan Jabri (2), Alexei A. Efros (2), Andrew Owens (1)
(1) University of Michigan, (2) UC Berkeley, (3) Johns Hopkins University

 https://jasonbian97.github.io/flowwalk

Abstract

A range of video modeling tasks, from optical flow to multiple object tracking, share the same fundamental challenge: establishing space-time correspondence. Yet, approaches that dominate each space differ. We take a step towards bridging this gap by extending the recent contrastive random walk formulation to much denser, pixel-level space-time graphs. The main contribution is introducing hierarchy into the search problem by computing the transition matrix between two frames in a coarse-to-fine manner, forming a multiscale contrastive random walk when extended in time. This establishes a unified technique for self-supervised learning of optical flow, keypoint tracking, and video object segmentation. Experiments demonstrate that, for each of these tasks, the unified model achieves performance competitive with strong self-supervised approaches specific to that task.

1. Introduction

Temporal correspondence underlies a range of video understanding tasks, from optical flow to object tracking. At the core, the challenge is to estimate the motion of some entity as it persists in the world, by searching in space and time. For historical reasons, the practicalities differ substantially across tasks: optical flow aims for dense correspondences but only between neighboring pairs of frames, whereas tracking cares about longer-range correspondences but is spatially sparse. We argue that the time might be right to try and re-unify these different takes on temporal correspondence.

An emerging line of work in self-supervised learning has shown that generic representations pretrained on unlabeled images and video can lead to strong performance across a range of tracking tasks [20, 25, 34, 72, 74]. The key idea is that if tracking can be formulated as label propagation [83] on a space-time graph, all that is needed is a good measure of similarity between nodes. Indeed, the recent contrastive random walk (CRW) formulation [25] shows how such a similarity measure can be learned for temporal correspondence problems, suggesting a path towards a unified solution. However, scaling this perspective to pixel-level space-time graphs holds challenges. Since computing similarity between frames is quadratic in the number of nodes, estimating dense motion is prohibitively expensive. Moreover, there is no way of explicitly estimating the motion in ambiguous cases, like occlusion. In parallel, the unsupervised optical flow community has adopted highly effective methods for dense matching [80], which use multiscale representations [9, 43, 59, 81] to reduce the large search space, and smoothness priors to deal with ambiguity and occlusion. But, in contrast to the self-supervised tracking methods, they rely on hand-crafted distance functions, such as the Census Transform [31, 45]. Furthermore, because they focus on producing point estimates of motion, they may be less robust under long-term dynamics.

In this work, we take a step toward bridging the gap between tracking and optical flow by extending the CRW formulation [25] to much more dense, pixel-level space-time graphs. The main contribution is introducing hierarchy into the search problem, i.e. the multiscale contrastive random walk. By integrating local attention in a coarse-to-fine manner, the model can efficiently consider a distribution over pixel-level trajectories. Through experiments across optical flow and video label propagation benchmarks, we show:

• This provides a unified technique for self-supervised optical flow, pose tracking, and video object segmentation.
• For optical flow, the model is competitive with many recent unsupervised optical flow methods, despite using a novel loss function (without hand-crafted features).
• For tracking, the model outperforms existing self-supervised approaches on pose tracking, and is competitive on video object segmentation.
• Cycle-consistency provides a complementary learning signal to photo-consistency.
• Multi-frame training improves two-frame performance.

2. Related work

Space-time representation learning. Recent work has proposed methods for tracking objects in video through self-supervised representation learning.

Figure 1. Multiscale contrastive random walks. We learn representations for dense, fine-grained matching using multiscale contrastive
random walks. At each scale, we create a space-time graph in which each pixel is a node, and where nodes close in space-time are
connected. Transition probabilities are based on similarity in a learned representation. We train the network to maximize the probability
of obtaining a cycle-consistent random walk. Here, we illustrate a pixel’s walk over 3 spatial scales and 3 frames. The walker is initialized
using its position at the previous, coarser scale.

Vondrick et al. [68] posed tracking as a colorization problem, by training a network to match pixels in grayscale images that have the same held-out colors. The assumption that matching pixels have the same color may break down over long time horizons, and the method is limited to grayscale inputs. This approach was extended to obtain higher-accuracy matches with two-stage matching [34, 36]. In contrast to our approach, the predictions are coarse, patch-level associations. Another line of work learns representations by maximizing cycle consistency. These methods track patches forward, then backward, in time and test whether they end up where they began. Wang et al. [71] proposed a method based on hard attention and spatial transformers [26]. Jabri et al. [25] posed cycle-consistency as a random walk problem, allowing the model to obtain dense supervision from multi-frame video. Tang et al. [65] proposed an extension that allowed for fully convolutional training. These approaches are trained on sparse patches and learn coarse-grained correspondences. We combine these representation learning methods with matching techniques from the optical flow literature, such as multiscale matching and smoothness, and obtain dense, pixel-to-pixel correspondences. Other work has encouraged patches in the same position in neighboring frames to be close in an embedding space [17, 76].

Optimization-based optical flow. Lucas and Kanade [2, 43] used Gauss-Newton optimization to minimize a brightness constancy objective. In their seminal work, Horn and Schunck [21] combined brightness constancy with a spatial smoothness criterion, and estimated flow using variational methods. Later research improved flow estimation using robust penalties [4, 6, 38, 59, 60], coarse-to-fine estimation [8], discrete optimization [13, 53, 56], feature matching [5, 51, 75], bilateral filtering [1], and segmentation [84]. In contrast, our model estimates flow using neural networks.

Unsupervised optical flow. Early work [47] used Boltzmann machines to learn transformations between images. More recently, Yu et al. [80] train a neural network to minimize a loss very similar to that of optimization-based approaches. Later work extended this approach by adding edge-aware smoothing [73], hand-crafted features [46], occlusion handling [27, 73], learned upsampling [44], and depth and camera pose [79, 82]. Another approach learns flow by matching augmented image pairs [39, 40, 42]. Recently, Jonschkowski et al. [31] surveyed previous literature and conducted an exhaustive search to find the best combination of methods and hyperparameters, resulting in a method that combines occlusion reasoning, augmentation, smoothness, flow dropout, and census transform features. In contrast to these works, our goal is to learn self-supervised representations for matching, in lieu of hand-crafted features. Moreover, we aim to produce a distribution over motion trajectories suitable for label transfer, rather than motion estimates alone. In very recent work, Stone et al. [58] adapted unsupervised flow methods to the RAFT architecture [66] and proposed new augmentation, self-distillation, and multi-frame occlusion inpainting methods. By contrast, we use PWC-net [63], since it is the standard architecture considered in prior work, and since it can obtain strong performance with careful training [62].
Cycle consistency. Cycle consistency has long been used to detect occlusions [3, 23, 35, 77], and is used to discount the loss of occluded pixels in unsupervised flow [27, 31, 73]. Zou et al. [85] used a cycle consistency loss as part of a system that jointly estimated depth, pose, and flow. Recently, Huang et al. [22] combined cycle-consistency with epipolar matching, but their method is weakly supervised with camera pose and assumes egomotion. In contrast, our model is entirely unsupervised and is capable of working solely with cycle-consistency and smoothness losses. Without extra constraints, their cycle consistency formulations have a trivial solution (e.g., all-zero flow). A random walk formulation of cycle consistency has also been used in semi-supervised learning [18], using labels to test consistency.

Multi-frame matching. Many methods use a third frame to obtain more local evidence for matching. Classic methods assume approximately constant velocity [29, 61, 67, 69] or acceleration [32, 69] and measure the photo-consistency over the full set of frames. Recently, Janai et al. [27] proposed an unsupervised multi-frame flow method that used a photometric loss with a low-acceleration assumption and explicit occlusion handling. In contrast to these approaches, our method can be deployed using two frames at test time. We use subsequent frames as a training signal. There have also been a variety of approaches that track over long time horizons, often using sparse (or semi-dense) keypoints [54, 55]. Other work chains together optical flow [7, 64, 70], typically after removing low-texture regions. In contrast, our method also learns "soft" per-pixel tracks, which convey the probability that pairs of pixels match.

Supervised optical flow. Early work learned optical flow with probabilistic models, such as graphical models [15]. Other work learns parameters for smoothness and brightness constancy [60] or robust penalties [4, 37]. More recent methods have used neural networks. Fischer et al. [12, 24] proposed architectures with built-in correlation layers. Sun et al. [63] introduced a network with built-in coarse-to-fine matching. Recent work [66] iteratively updates a flow with multiscale features, in lieu of coarse-to-fine matching.

3. Method

We first show how to learn dense space-time correspondences using multiscale contrastive random walks, resulting in a model that obtains high quality motion estimates via simple nonparametric matching. We then describe how the learned representation can be combined with regression to handle occlusions and ambiguity, for improved optical flow.

3.1. Multiscale contrastive random walks

We review the single-scale contrastive random walk, then extend the approach to multiscale estimation.

3.1.1 Preliminaries: Contrastive random walks

We build on the contrastive random walk (CRW) formulation of Jabri et al. [25]. Given an input video with k frames, we extract n patches from each frame and assign each an embedding using a learned encoder φ. These patches form the vertices of a graph that connects all patches in temporally adjacent frames. A random walker steps through the graph, forward in time from frames 1, 2, ..., k, then backward in time from k − 1, k − 2, ..., 1. The transition probabilities are determined by the similarity of the learned representations:

    A_{s,t} = softmax(X_s X_t^⊤ / τ),    (1)

for a pair of frames s and t, where X_i ∈ R^{n×d} is the matrix of d-dimensional embedding vectors, τ is a small constant, and the softmax is performed along each row. We train the model to maximize the likelihood of cycle consistency, i.e. the event that the walker returns to the node it started from:

    L_CRW = −tr(log(Ā_{t,t+k} Ā_{t+k,t})),    (2)

where Ā_{t,t+k} are the transition probabilities from frame t to t + k: Ā_{t,t+k} = ∏_{i=t}^{t+k−1} A_{i,i+1}.

3.1.2 Optical flow as a random walk

After training, the transition matrix contains the probability that a pair of patches is in space-time correspondence. We can estimate the optical flow f_{s,t} ∈ R^{n×2} of a patch between frames s and t by taking the expected value of the change in spatial position:

    g_avg(A_{s,t}) = E_{A_{s,t}}[f_{s,t}] = A_{s,t} D − D,    (3)

where D ∈ R^{n×2} is the (constant) matrix containing the pixel coordinates of each patch, and A_{s,t} D is the walker's expected position in frame t.

In contrast to widely-used forward-backward cycle consistency formulations [22, 85], which measure the deviation of a predicted motion from a starting point, there is no trivial solution (e.g., all-zero flow). This is because cycle consistency is measured in an embedding space defined solely from the visual content of image regions.
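To make Eqs. (1)–(3) concrete, the following PyTorch sketch computes the pairwise transitions, the palindrome cycle-consistency loss, and the expected displacement used as flow. It only illustrates the equations above, not the released implementation; the helper names, tensor shapes, and the epsilon added for numerical stability are our own choices.

```python
import torch
import torch.nn.functional as F

def transition(X_s, X_t, tau=0.07):
    # Eq. 1: A_{s,t} = softmax(X_s X_t^T / tau); each row is a distribution
    # over the nodes of frame t. X_s, X_t: (n, d) patch embeddings.
    return F.softmax(X_s @ X_t.t() / tau, dim=1)

def crw_cycle_loss(embeddings, tau=0.07):
    # Eq. 2: chain transitions forward through frames 1..k and back to 1,
    # then maximize the probability of returning to the starting node.
    # embeddings: list of (n, d) tensors, one per frame.
    frames = embeddings + embeddings[-2::-1]          # palindrome walk
    A_bar = None
    for X_a, X_b in zip(frames[:-1], frames[1:]):
        A = transition(X_a, X_b, tau)
        A_bar = A if A_bar is None else A_bar @ A
    return -torch.log(torch.diagonal(A_bar) + 1e-8).mean()

def expected_flow(A, coords):
    # Eq. 3: g_avg(A_{s,t}) = A_{s,t} D - D, with coords D of shape (n, 2).
    return A @ coords - coords
```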

3.1.3 Multiscale random walk

As presented so far, this formulation is expensive to scale to high resolutions because computing the transition matrix is quadratic in the number of nodes. We overcome this by introducing hierarchy into the search problem. Instead of comparing all pairs, we only attend on a local neighborhood. By integrating local search across scales in a coarse-to-fine manner, the model can efficiently consider a distribution over pixel-level trajectories.

Figure 2. Coarse-to-fine matching. Our model performs a contrastive random walk across spatial scales: it computes a transition matrix (Eq. 5), uses it to obtain flow (Eq. 6), and recurses to the next level by using the upsampled estimated flow to align the finer scale for matching (Eq. 4). warp samples the grid using the flow.

Coarse-to-fine local attention. Computing the transition matrix closely resembles cost volume estimation in optical flow [6, 14, 59]. This inspires us to draw on the classic spatial pyramid commonly used for multiscale search in optical flow, by iteratively computing the dense transition matrix, from coarse to fine spatial scales l ∈ [1..L].

For frames s and t, we compute feature pyramids X_s^l ∈ R^{h(l)w(l)×d}, where h(l) and w(l) are the height and width of the feature map at scale l. To match each level efficiently, we warp the features of the target frame X_t^l into the coordinate frame of X_s^l using the coarse flow from the previous level f_{s,t}^l; we then compute local transition probabilities on the warped features to account for the remaining motion. Thus, we estimate the transition matrix and flow in a coarse-to-fine manner (Fig. 2), computing levels recurrently:

    X̃_t^l = warp(X_t^l, f_{s,t}^l)    (4)
    A_{s,t}^l = masked_softmax(X_s^l (X̃_t^l)^⊤ / τ)    (5)
    f_{s,t}^{l+1} = upsample(g_avg(A_{s,t}^l) + f_{s,t}^l),    (6)

where warp(X, f) samples features X with flow f using bilinear sampling, and f^1 = 0. For notational convenience, we write the local transition constraint using masked_softmax, which sets values beyond a local spatial window to zero. In practice, we use optimized correlation filtering kernels to compute Eq. 5 efficiently, and represent the transition matrices A_{s,t}^l as sparse matrices.
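The sketch below illustrates one level of the recurrence in Eqs. (4)–(6): warp the target features with the current flow, compute similarities only within a local window, take a softmax over that window, and add the expected residual displacement before upsampling to the next level. It is a simplified reading of the method (dense tensors rather than the optimized correlation kernels and sparse matrices mentioned above), and the function names, window radius, and flow layout (B, 2, H, W) with x/y displacements in pixels are our assumptions.

```python
import torch
import torch.nn.functional as F

def warp(X, flow):
    # Bilinearly sample features X (B, C, H, W) at locations shifted by flow (B, 2, H, W).
    B, _, H, W = X.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=X.device),
                            torch.arange(W, device=X.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()            # (2, H, W), channel 0 = x
    coords = grid[None] + flow                             # target sampling locations
    gx = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0          # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    return F.grid_sample(X, torch.stack((gx, gy), dim=-1), align_corners=True)

def coarse_to_fine_level(X_s, X_t, flow, radius=5, tau=0.07):
    B, C, H, W = X_s.shape
    X_t_warped = warp(X_t, flow)                           # Eq. 4
    k = 2 * radius + 1
    # Local cost volume: each source pixel vs. a (2r+1)^2 window of warped target pixels.
    windows = F.unfold(X_t_warped, k, padding=radius).view(B, C, k * k, H * W)
    sim = (X_s.view(B, C, 1, H * W) * windows).sum(dim=1) / tau
    A = F.softmax(sim, dim=1)                              # Eq. 5 (local attention)
    offsets = torch.arange(-radius, radius + 1, device=X_s.device, dtype=torch.float)
    off_x = offsets.repeat(k)                              # x offset of each window slot
    off_y = offsets.repeat_interleave(k)                   # y offset of each window slot
    res_x = (A * off_x[None, :, None]).sum(dim=1).view(B, 1, H, W)
    res_y = (A * off_y[None, :, None]).sum(dim=1).view(B, 1, H, W)
    flow = flow + torch.cat((res_x, res_y), dim=1)         # Eq. 6: add expected residual
    # Upsample to the next (finer) level; flow values double with the resolution.
    return 2.0 * F.interpolate(flow, scale_factor=2, mode="bilinear", align_corners=True), A
```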
Loss. After computing the transition matrices between all pairs of adjacent frames, we sum the contrastive random walk loss over all levels:

    L_msCRW = −∑_{l=1}^{L} tr(log(Ā_{t,t+k}^l Ā_{t+k,t}^l)),    (7)

where Ā_{s,t}^l is defined as in Eq. 2. In our experiments, we use L = 5 scales and consider cycles of length k ∈ [2..4].

3.1.4 Smooth random walks

Since natural motions tend to be smooth [52], we follow work in optical flow [59] and incorporate smoothness as an additional desideratum for our random walks. We use the edge-aware loss of Jonschkowski et al. [31]:

    L_smooth = E_p[ ∑_{d∈{x,y}} exp(−λ_c I_d(p)) |∂²f_{s,t}(p)/∂d²| ],    (8)

where I_d(p) = (1/3) ∑_c |∂I_c(p)/∂d| is the spatial derivative averaged over all color channels I_c in direction d. The parameter λ_c controls the influence of similarly colored pixels. We apply this loss to each scale of the model.

3.2. Handling occlusion

While effective for most image content, nonparametric matching has no mechanism for estimating the motion of pixels that become occluded, since it requires a corresponding patch in the next frame. We propose a variation of the model that combines the multiscale contrastive random walk with a regression module that directly predicts the flow values at each pixel.

The architecture of our regressor closely follows the refinement module of PWC-net [63]. We learn a function g_reg(·) that regresses the flow at each pixel from the multiscale contrastive random walk cost volume and convolutional features. These features are obtained from the same shared backbone that is used to compute the embeddings.

Regression loss. We choose a loss that approximates the contrastive random walk objective, such that pixels that are well-matched by the nonparametric model (e.g., non-occluded pixels) are unlikely to change their flow values, while the flow values of occluded pixels are obtained using a smoothness prior. To this end, we use the same smoothness loss, L_smooth, as the nonparametric model (Eq. 8), and penalize the model for deviating from the nonparametric flow estimate (Eq. 3). We also use our learned embeddings as features for a learned photometric loss, i.e. incurring loss if the model puts two pixels with dissimilar embedding vectors into correspondence. This results in an additional loss:

    L_feat = ‖X_s − warp(X_t, f_{s,t})‖_2 + λ_a ‖f_{s,t} − g_avg(A_{s,t})‖_2,    (9)

where f_{s,t} is the predicted flow and λ_a is a constant. To prevent the regression loss from influencing the learned embeddings that it is based on, we do not propagate gradients from the regressor to the embeddings X during training. As in Eq. 7, we apply the loss to each scale and sum.
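As a rough illustration of the two auxiliary terms, the sketch below writes out the edge-aware smoothness loss of Eq. (8) and the regression loss of Eq. (9). It reuses the warp helper from the coarse-to-fine sketch above; the default λ_c = 150 follows the hyperparameter table in the supplement, while the λ_a default and the alignment of the edge weights with the second differences are simplifications on our part.

```python
import torch

def smoothness_loss(flow, image, lambda_c=150.0):
    # Eq. 8: penalize second-order flow variation, downweighted at image edges.
    # flow: (B, 2, H, W); image: (B, 3, H, W).
    def d1(x, dim):   # first difference along dim
        return x.narrow(dim, 1, x.size(dim) - 1) - x.narrow(dim, 0, x.size(dim) - 1)
    def d2(x, dim):   # second difference along dim
        n = x.size(dim) - 2
        return x.narrow(dim, 0, n) - 2 * x.narrow(dim, 1, n) + x.narrow(dim, 2, n)
    loss = 0.0
    for dim in (2, 3):                                         # 2: y direction, 3: x direction
        I_d = d1(image, dim).abs().mean(dim=1, keepdim=True)   # averaged over color channels
        weight = torch.exp(-lambda_c * I_d)
        f_dd = d2(flow, dim).abs()
        weight = weight.narrow(dim, 0, f_dd.size(dim))         # crop to matching size
        loss = loss + (weight * f_dd).mean()
    return loss

def regression_loss(X_s, X_t, flow_pred, flow_nonparam, lambda_a=1.0):
    # Eq. 9: learned photometric term plus a penalty for deviating from the
    # nonparametric flow; gradients are blocked so this loss cannot alter the embeddings.
    X_s, X_t, flow_nonparam = X_s.detach(), X_t.detach(), flow_nonparam.detach()
    photometric = (X_s - warp(X_t, flow_pred)).norm(dim=1).mean()
    deviation = (flow_pred - flow_nonparam).norm(dim=1).mean()
    return photometric + lambda_a * deviation
```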

Augmentation and masking. We follow [31] and improve the regressor's handling of occlusions through augmentation, and by discounting the loss of pixels that fail a consistency check [40]. To handle pixels that move off-screen, we compute flow, then randomly crop the input images and compute it again, penalizing deviations between the flows. This results in a new loss, L_bound. We also remove the contribution of pixels in the photometric loss (Eq. 9) if they have no correspondence in the backwards-in-time flow estimation from t to s. Both are implemented exactly as in [31]. These losses apply only to the regressor, and thus do not directly affect the nonparametric matching.

3.3. Training

Objective. The pure nonparametric model (Section 3.1) can be trained by simply minimizing the multiscale contrastive random walk loss with a smoothness penalty:

    L_non = L_msCRW + L_smooth.    (10)

Adding the regressor results in the following loss:

    L_reg = L_msCRW + L_smooth + L_feat + L_bound.    (11)

We include weighting factors to control the relative importance of each loss (specified in the supplement).

Architecture. To provide a simple comparison with unsupervised optical flow methods, we use the PWC-net architecture [63] as our network backbone, after reducing the number of filters [31]. This network uses the feature hierarchy from a convolutional network to provide the features at each scale. We use the cost volume from this approach as our embedding representation for the random walk, after performing ℓ2 normalization. We adapt the cost volume estimation X_s^l and the regressor. We provide architecture details in the supplementary material.

Subcycles. We follow [25] and include subcycles in our contrastive random walks: when training the model on k-frame videos, we include losses for walks of length k, k − 1, ..., 2. These losses can be estimated efficiently by reusing the transition matrices for the full walk.

Multi-frame training. When training with k > 2 frames, we use curriculum learning to speed up and stabilize training. We train the model to convergence with 2, 3, ..., k frame cycles in succession.

Optimization. To implement the contrastive random walk, we exploit the sparsity of our coarse-to-fine formulation, and represent the transition matrices A_{s,t}^l as sparse matrices. This significantly improved training times and reduced memory requirements, especially at the finest scales. It takes approximately 3 days to train the full model on one GTX 2080 Ti. We train our network with PyTorch [48], using the Adam [33] optimizer with a cyclic learning rate schedule [57] with a base learning rate of 10^−4 and a max learning rate of 5 × 10^−4. We provide training hyperparameters for each dataset in Section B.

4. Results

Our model produces two outputs: the optical flow fields and the pixel trajectories (which are captured in the transition matrices). We evaluated these predictions on label transfer and motion estimation tasks. We compare them to space-time correspondence learning methods, and with unsupervised optical flow methods.

4.1. Datasets

For simple comparison with other methods, we train on standard optical flow datasets. We note that the training protocols used in the unsupervised optical flow literature are not standardized. We therefore follow the evaluation setup of [31]. We pretrain models on unlabeled videos from the Flying Chairs dataset [11]. We then train on the KITTI-2015 [16] multi-view extension and Sintel [10]. To evaluate our model's ability to learn from internet video, we also trained the model on YouTube-VOS [78], without pretraining on any other datasets.

We also evaluate our model on standard label transfer tasks. The JHMDB benchmark [30] transfers 15 body parts to future frames over long time horizons. The DAVIS benchmark [50] transfers object masks.

4.2. Label propagation

We evaluate our learned model's ability to perform video label propagation, a task widely studied in video representation learning work. The goal is to propagate a label map provided in an initial video frame, which might describe keypoints or object segments, to the rest of the video.

We follow Jabri et al. [25] and use our model's probabilistic motion trajectories to guide label propagation. We infer the labels for each target frame t auto-regressively. For each previous source frame s, we have a predicted label map L_s ∈ R^{n×c}, where n is the number of pixels and c is the number of classes. As in [25], we compute K_{t,s}^l, the matrix of weights for attention on the source frames, by keeping the top-k logits in each row of Ā_{t,s}^l. We use K_{t,s} as an attention matrix for each label, i.e. L_t = K_{t,s} L_s. Using several source frames as context allows for overcoming occlusion.

We use variations of our model that were trained on the unlabeled Sintel and YouTube-VOS datasets, and use the transition matrix and flow fields at the penultimate level of the pyramid, i.e. level 4. Since the transition matrix describes residual motion, we warp (i.e. with f_{t,s}^4) each label map before querying. Finally, since the features in level 4 have only 16 channels, we stack features from levels 3 and 4 to obtain hypercolumns [19], before computing attention.

Evaluation. We compared our model to recent video representation learning work, including single-scale CRW [25], UVC [36], and the state-of-the-art VFS [76] (Tab. 1). We also report a baseline that uses a state-of-the-art unsupervised flow method, UFlow [31], to obtain motion trajectories by chaining flow.

For pose propagation, we evaluate our model on JHMDB [30] and report the PCK metric, which measures the fraction of keypoints within various distances to the ground truth. Our approach outperforms existing self-supervised
Figure 3. Propagating segments and pose along motion trajectories. We show qualitative results for JHMDB pose (left) and object
masks on DAVIS (right). For DAVIS scenes, we show examples of mask propagations (top) and soft propagated label distributions (bottom).
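The propagation rule described in Section 4.2 amounts to masked attention over source-frame labels. The sketch below is one plausible reading of it (keeping the top-k entries per row of the transition matrix and renormalizing them); the renormalization choice and the handling of multiple source frames by concatenation are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def propagate_labels(A_ts, labels_s, topk=10):
    # A_ts: (n_t, n_s) transition probabilities from target frame t to source frame s
    # (several source frames can be handled by concatenating along the source axis).
    # labels_s: (n_s, c) label map of the source frame(s).
    vals, idx = A_ts.topk(topk, dim=1)                 # keep the top-k logits per row
    K = torch.zeros_like(A_ts).scatter_(1, idx, F.softmax(vals, dim=1))
    return K @ labels_s                                # L_t = K_{t,s} L_s
```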

                            Pose (PCK)              Segments
Method            Params    @0.05   @0.1   @0.2     J&Fm
TimeCycle [72]    25.6M     –       57.3   78.1     50.1
UVC [36]          11M       –       58.6   79.6     59.5
CRW [25]          11M       29.1    59.3   80.6     67.6
VFS [76]          25.6M     –       60.9   80.7     68.8
UFlow [31]†

model obtains significantly better performance on all metrics using 3-frame and 4-frame cycles.

4.3.2 Nonparametric motion estimation

Our model is able to estimate motion solely through nonparametric matching, as can be seen qualitatively in Fig-
Figure 4. Optical flow qualitative results. We show results on images not seen during training, using our nonparametric-only and regression-based models. The highlighted regions show significant differences between the regression-based models. (Columns: Ground Truth, Nonparametric (Ours), Regression (Ours), Census, Census + Regression (Ours); rows: KITTI, Sintel-Final.)
Table 2. Nonparametric motion estimation. We compare single-scale, nonparametric, and regression-based methods.

                              KITTI-15 train
Method                        noc      all      ER% (occ)
Jabri et al. [25]             12.63    19.41    64.50
Ours - Nonparametric          2.18     9.42     27.98
Ours - With Regression        2.09     3.86     12.45

Table 3. Model configurations. All models are 2-frame. Regressor-only finds a shortcut solution, predicting zero flow.

                                       KITTI-15 train
Configuration                          noc      all      ER%
Full                                   2.09     3.86     12.45
Nonparametric only                     2.18     9.42     27.98
No feature consistency in L_feat       2.20     5.02     17.53
Regressor only                         14.21    21.45    41.34
No regressor constraint in L_feat      5.54     10.44    26.43
No L_bound                             2.14     4.88     16.54
No L_smooth                            10.98    17.43    34.85

our features obtained competitive performance on non-occluded pixels, but that there was a significant advantage to Census features on the all metric. This is understandable since the contrastive random walk does not have a way of learning features for occluded pixels. Interestingly, we found that combining the two features together improved performance, and that the gap improves further when 3-frame walks are used, obtaining the overall best results.

Finally, we used our learned features as part of the photometric loss for ARFlow [39], a very recent unsupervised flow model. We added the L_feat loss to their model and retrained it (while, as in ours, jointly learning the features through a contrastive random walk). The resulting model improves performance on all metrics, with a larger gain on Sintel (Tab. 4(a)).

4.3.4 Motion estimation ablations

To help understand which properties of our model contribute to its performance, we perform an ablation study on the KITTI-15 dataset (Table 3). We asked how the different losses contribute to the performance. We ablated the smoothness loss (Eq. 8) and the self-supervision loss (Sec. 3.2), and removed the constraint on the regressor in Eq. 9 by setting λ_a = 0. We see that the smoothness loss significantly improves results. Similarly, we discarded the feature consistency term from Eq. 9, which reduces the quality of the results but outperforms the nonparametric model on the all metric. We also see that the L_bound loss improves the accuracy of occluded pixels (due to image boundaries). We also try different window sizes for the contrastive random walk (Tab. 6), finding that larger sizes improve results.

Training on internet video. We found that our model generalized well to benchmark datasets when training solely on YouTube-VOS [78] (Tab. 7). For comparison, we also trained ARFlow [39] on YouTube-VOS. Our model obtained better performance on KITTI, while ARFlow performed better on Sintel.

Robustness on jittered images. To help understand the flexibility of our self-supervised model, we trained a variation of our model on images with large, simulated brightness and hue variations, inspired by the challenges of rapid exposure changes (e.g., in HDR photography). During training and testing, we randomly jitter the brightness and hue of the second image in KITTI by a factor of up to 0.6 and 0.3 respectively, using PyTorch's [48] built-in augmentation. We finetuned the variation of our model that combines our learned features with Census features (Tab. 5), since it obtained strong performance on KITTI. We also finetuned a model with only Census features. We found that the model with our learned features performed significantly better, and that the gap increased with the magnitude of the jittering (Tab. 4b). We show qualitative results in Fig. 5.
                                 Sintel Clean        Sintel Final            KITTI-15
                                 train     test      train     test      train                     test
Method                           EPE       EPE       EPE       EPE       all      noc     ER%      ER%
Supervised
  FlowNetC [12]                  (3.78)    6.85      (5.28)    8.51      -        -       -        -
  FlowNet2 [24]                  (1.45)    4.16      (2.01)    5.74      (2.30)   -       (8.61)   11.48
  PWC-Net [63]                   (1.70)    3.86      (2.21)    5.13      (2.16)   -       (9.80)   9.60
  RAFT [66]                      (0.76)    1.94      (1.22)    3.18      (0.63)   -       (1.50)   5.10
Unsupervised
  MFOccFlow [28]*                {3.89}    7.23      {5.52}    8.81      [6.59]   [3.22]  -        22.94
  EPIFlow [82]                   3.94      7.00      5.08      8.51      5.56     2.56    -        16.95
  DDFlow [41]                    {2.92}    6.18      {3.98}    7.40      [5.72]   [2.73]  -        14.29
  SelFlow [42]*                  [2.88]    [6.56]    {3.87}    {6.57}    [4.84]   [2.40]  -        14.19
  UFlow [31]                     {2.50}    5.21      {3.39}    6.50      {2.71}   {1.88}  {9.05}   11.13
  SMURF-PWC [58]                 2.63      -         3.66      -         2.73     -       9.33     -
  SMURF-RAFT [58]                {1.71}    3.15      {2.58}    4.18      {2.00}   {1.41}  {6.42}   6.83
  ARFlow [39]                    [2.79]    [4.78]    [3.73]    [5.89]    [2.85]   -       -        [11.80]
Ours
  Ours + ARFlow                  [2.71]    [4.70]    [3.61]    [5.76]    [2.81]   [2.17]  [11.25]  [11.67]
  Ours (2-cycle)                 {2.84}    5.68      {3.82}    6.72      {3.86}   {2.09}  {12.45}  13.10
  Ours (3-cycle)                 {2.77}    5.42      {3.61}    6.63      {3.46}   {2.05}  {12.28}  12.78
  Ours (4-cycle)                 {2.72}    5.37      {3.55}    6.59      {3.39}   {2.04}  {12.14}  12.62

(a) Optical flow benchmarks    (b) Robustness to jittering
Table 4. (a) Our model is trained on the train/test splits of the corresponding datasets. We reprint numbers from [31] and adopt their convention that "{}" are trained on unlabeled evaluation data, "[]" are trained on related data (e.g., full Sintel movie) and "()" are supervised. Methods that use 3 frames at test time are marked with *. (b) We evaluate robustness to brightness and hue jittering.
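For the robustness experiment in Table 4(b), the described corruption can be reproduced with torchvision's built-in color jitter; the sketch below jitters only the second frame of each pair, with brightness up to 0.6 and hue up to 0.3 as stated in the text. The function name is ours.

```python
import torchvision.transforms as T

# Brightness jitter by a factor of up to 0.6 and hue jitter of up to 0.3,
# applied to the second image of each pair only.
jitter = T.ColorJitter(brightness=0.6, hue=0.3)

def jitter_pair(frame1, frame2):
    return frame1, jitter(frame2)
```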

Table 5. Photometric features. We evaluate the effectiveness of our features when they are used to define a photometric loss.

                                     KITTI-15 train
Losses                               noc      all      ER%
Charbonnier                          2.28     5.69     19.30
Census                               2.05     3.14     11.04
Our feats.                           2.09     3.86     12.45
Our feats. + Charbonnier             2.21     4.51     14.25
Our feats. + Census                  2.04     3.10     10.89
Our feats. (3-frame) + Census        2.02     3.03     10.67

Table 6. Contrastive random walk ablations. We evaluate different model parameters, including cycle length.

                           Sintel train          KITTI-15 train
                           Clean     Final       noc      all      ER%
Cycle len.    2            2.84      3.82        2.09     3.68     12.45
              3            2.77      3.61        2.05     3.46     12.28
              4            2.72      3.55        2.04     3.39     12.14
Win. size     3 × 3        8.43      11.43       5.18     7.21     30.85
              7 × 7        3.02      4.13        2.24     4.40     14.53
              11 × 11      2.84      3.82        2.09     3.68     12.45

Table 7. Training on internet video. We train on YouTube-VOS [78] and evaluate on optical flow benchmarks.

                      Sintel train          KITTI-15 train
                      Clean     Final       noc      all      ER%
Ours (2-cycle)        3.01      4.02        2.20     3.83     13.17
ARFlow [39]           2.91      3.85        2.32     4.02     14.04

4.3.5 Comparison to recent optical flow methods

To help understand our model's overall performance, we compare it to recent optical flow methods (Table 4). We include models that use different numbers of frames for the random walk, and a variation of ARFlow [39] that uses our self-supervised features to augment its photometric loss.

We found that our models outperform many recent unsupervised optical flow methods, including the (3-frame) MFOccFlow [28] and EPIFlow. In particular, we significantly outperform the recent SelFlow method [42], despite the fact that it takes 3 frames as input at test time and uses Census Transform features. By contrast, our model uses no hand-crafted image features. The highest performing method is the very recent, highly optimized SMURF model [58], which uses the RAFT [66] architecture instead of PWC-net and which extends UFlow [31]. This model uses a variety of additional training signals, such as extensive data augmentation, occlusion inpainting with multi-stage training, self-distillation, and hand-crafted features.

5. Discussion

We have proposed a method for learning dense motion estimation using multiscale contrastive random walks. We see our work as a step toward unifying self-supervised tracking and optical flow. By extending methods from the former, our model obtains a useful learning signal for dense temporal correspondence from multi-frame video, and learns representations well-suited for nonparametric matching and label transfer.
Through its use of multiscale matching and architectures from the latter, it obtains dense motion estimates and handles occlusion. As a result, we obtain a single model for unsupervised optical flow, pose tracking, and video object segmentation.

Figure 5. Varying brightness and hue. Qualitative comparison between the hand-crafted Census transform feature and a model that combines these features with our self-supervised features.

Limitations and impact. Motion analysis has many applications, such as in health monitoring, surveillance, and security. There is also a potential for the technology to be used for harmful purposes if weaponized. The released models are limited in scope to the benchmark datasets used in training.

Acknowledgements. We thank David Fouhey for the helpful feedback. AO thanks Rick Szeliski for introducing him to multi-frame optical flow. This research was supported in part by Toyota Research Institute, Cisco Systems, and Berkeley Deep Drive.

References

[1] Robert Anderson, David Gallup, Jonathan T Barron, Janne Kontkanen, Noah Snavely, Carlos Hernández, Sameer Agarwal, and Steven M Seitz. Jump: virtual reality video. ACM Transactions on Graphics (TOG), 35(6):1–13, 2016. 2
[2] Simon Baker and Iain Matthews. Lucas-kanade 20 years on: A unifying framework. In International Journal of Computer Vision, pages 221–255, February 2004. 2
[3] Simon Baker, Daniel Scharstein, JP Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. A database and evaluation methodology for optical flow. IJCV, 2011. 3
[4] Michael J Black and Paul Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer vision and image understanding, 63(1):75–104, 1996. 2, 3
[5] Thomas Brox, Christoph Bregler, and Jitendra Malik. Large displacement optical flow. In CVPR, 2009. 2
[6] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004. 2, 4
[7] Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. In European conference on computer vision, pages 282–295. Springer, 2010. 3
[8] Andrés Bruhn, Joachim Weickert, and Christoph Schnörr. Lucas/kanade meets horn/schunck: Combining local and global optic flow methods. International journal of computer vision, 61(3):211–231, 2005. 2
[9] Peter J. Burt and Edward H. Adelson. The laplacian pyramid as a compact image code. IEEE Trans. Commun., 31:532–540, 1983. 1
[10] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, October 2012. 5
[11] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015. 5
[12] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, 2015. 3, 8
[13] PF Felzenszwalb and DR Huttenlocher. Efficient belief propagation for early vision. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 1, pages I–I. IEEE, 2004. 2
[14] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick Van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. arXiv, 2015. 4
[15] William T Freeman, Egon C Pasztor, and Owen T Carmichael. Learning low-level vision. International journal of computer vision, 40(1):25–47, 2000. 3
[16] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013. 5
[17] Daniel Gordon, Kiana Ehsani, Dieter Fox, and Ali Farhadi. Watching the world go by: Representation learning from unlabeled videos, 2020. 2
[18] Philip Haeusser, Alexander Mordvintsev, and Daniel Cremers. Learning by association: a versatile semi-supervised training method for neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 89–98, 2017. 3
[19] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In Eur. Conf. Comput. Vis., pages 297–312, 2014. 5
[20] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 1
[21] Berthold Horn and Brian Schunck. Determining optical flow. In Artificial Intelligence, pages 185–203, 1981. 2
[22] Zhaoyang Huang, Xiaokun Pan, Runsen Xu, Yan Xu, Kachun Cheung, Guofeng Zhang, and Hongsheng Li. Life: Lighting invariant flow estimation. arXiv preprint arXiv:2104.03097, 2021. 3
[23] Junhwa Hur and Stefan Roth. Mirrorflow: Exploiting symmetries in joint optical flow and occlusion estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 312–321, 2017. 3
[24] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017. 3, 8
[25] Allan Jabri, Andrew Owens, and Alexei A Efros. Space-time correspondence as a contrastive random walk. Neural Information Processing Systems (NeurIPS), 2020. 1, 2, 3, 5, 6, 7
[26] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks, 2015. 2
[27] Joel Janai, Fatma Güney, Anurag Ranjan, Michael Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In ECCV, 2018. 2, 3
[28] Joel Janai, Fatma Güney, Anurag Ranjan, Michael J. Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. ECCV, 2018. 8
[29] Joel Janai, Fatma Guney, Jonas Wulff, Michael J Black, and Andreas Geiger. Slow flow: Exploiting high-speed cameras for accurate and diverse optical flow reference data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 3
[30] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In Proceedings of the IEEE international conference on computer vision, pages 3192–3199, 2013. 5
[31] Rico Jonschkowski, Austin Stone, Jonathan T Barron, Ariel Gordon, Kurt Konolige, and Anelia Angelova. What matters in unsupervised optical flow. arXiv preprint arXiv:2006.04902, 2020. 1, 2, 3, 4, 5, 6, 8, 12
[32] Ryan Kennedy and Camillo J Taylor. Optical flow with geometric occlusion estimation and fusion of multiple frames. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 364–377. Springer, 2015. 3
[33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, 2014. 5, 12
[34] Zihang Lai, Erika Lu, and Weidi Xie. Mast: A memory-augmented self-supervised tracker. arXiv preprint arXiv:2002.07793, 2020. 1, 2
[35] Cheng Lei and Yee-Hong Yang. Optical flow estimation on coarse-to-fine region-trees using discrete optimization. In 2009 IEEE 12th International Conference on Computer Vision, pages 1562–1569. IEEE, 2009. 3
[36] Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang, Jan Kautz, and Ming-Hsuan Yang. Joint-task self-supervised learning for temporal correspondence. In Advances in Neural Information Processing Systems, pages 317–327, 2019. 2, 5, 6
[37] Yunpeng Li and Daniel P Huttenlocher. Learning for optical flow using stochastic optimization. In European Conference on Computer Vision, pages 379–391. Springer, 2008. 3
[38] Ce Liu et al. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009. 2
[39] Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6489–6498, 2020. 2, 7, 8, 12
[40] Pengpeng Liu, Irwin King, Michael R Lyu, and Jia Xu. Ddflow: Learning optical flow with unlabeled data distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8770–8777, 2019. 2, 4
[41] Pengpeng Liu, Irwin King, Michael R. Lyu, and Jia Xu. DDFlow: Learning optical flow with unlabeled data distillation. AAAI, 2019. 8
[42] Pengpeng Liu, Michael Lyu, Irwin King, and Jia Xu. Selflow: Self-supervised learning of optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4571–4580, 2019. 2, 6, 8
[43] Bruce D Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. IJCAI, 1981. 1, 2
[44] Kunming Luo, Chuan Wang, Shuaicheng Liu, Haoqiang Fan, Jue Wang, and Jian Sun. Upflow: Upsampling pyramid for unsupervised optical flow learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1045–1054, 2021. 2
[45] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. AAAI, 2018. 1, 6
[46] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. 2
[47] Roland Memisevic and Geoffrey Hinton. Unsupervised learning of image transformations. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007. 2
[48] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019. 5, 7, 12
[49] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016. 6
[50] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017. 5, 6
[51] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015. 2
[52] Dan Rosenbaum, Daniel Zoran, and Yair Weiss. Learning the local statistics of optical flow. Advances in Neural Information Processing Systems, 26:2373–2381, 2013. 4
[53] Stefan Roth, Victor Lempitsky, and Carsten Rother. Discrete-continuous optimization for optical flow estimation. In Statistical and Geometrical Approaches to Visual Motion Analysis, pages 1–22. Springer, 2009. 2
[54] Michael Rubinstein, Ce Liu, and William T Freeman. Towards longer long-range motion trajectories. 2012. 3
[55] Peter Sand and Seth Teller. Particle video: Long-range motion estimation using point trajectories. ICCV, 2008. 3
[56] Alexander Shekhovtsov, Ivan Kovtun, and Václav Hlaváč. Efficient mrf deformation model for non-rigid image matching. Computer Vision and Image Understanding, 112(1):91–99, 2008. 2
[57] Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, page 1100612. International Society for Optics and Photonics, 2019. 5, 12
[58] Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, and Rico Jonschkowski. Smurf: Self-teaching multi-frame unsupervised raft with full-image warping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2021. 2, 8
[59] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical flow estimation and their principles. In CVPR, 2010. 1, 2, 4, 6
[60] Deqing Sun, Stefan Roth, JP Lewis, and Michael J Black. Learning optical flow. In European Conference on Computer Vision, pages 83–97. Springer, 2008. 2, 3
[61] Deqing Sun, Erik Sudderth, and Michael Black. Layered image motion with explicit occlusions, temporal consistency, and depth ordering. Advances in Neural Information Processing Systems, 23:2226–2234, 2010. 3
[62] Deqing Sun, Daniel Vlasic, Charles Herrmann, Varun Jampani, Michael Krainin, Huiwen Chang, Ramin Zabih, William T Freeman, and Ce Liu. Autoflow: Learning a better training set for optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10093–10102, 2021. 3
[63] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume, 2017. 2, 3, 4, 5, 8
[64] Narayanan Sundaram, Thomas Brox, and Kurt Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In European conference on computer vision, pages 438–451. Springer, 2010. 3
[65] Yansong Tang, Zhenyu Jiang, Zhenda Xie, Yue Cao, Zheng Zhang, Philip HS Torr, and Han Hu. Breaking shortcut: Exploring fully convolutional cycle-consistency for video correspondence learning. arXiv preprint arXiv:2105.05838, 2021. 2
[66] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow, 2020. 2, 3, 8
[67] Sebastian Volz, Andres Bruhn, Levi Valgaerts, and Henning Zimmer. Modeling temporal coherence for optical flow. In 2011 International Conference on Computer Vision, pages 1116–1123. IEEE, 2011. 3
[68] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In ECCV, 2017. 1
[69] Chia-Ming Wang, Kuo-Chin Fan, Cheng-Tzu Wang, and Tong-Yee Lee. Estimating optical flow by integrating multi-frame information. Journal of Information Science & Engineering, 24(6), 2008. 3
[70] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. International journal of computer vision, 103(1):60–79, 2013. 3
[71] Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, and Houqiang Li. Unsupervised deep tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1308–1317, 2019. 2
[72] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019. 1, 6
[73] Yang Wang, Yi Yang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. 2018. 2, 3
[74] Zhongdao Wang, Hengshuang Zhao, Ya-Li Li, Shengjin Wang, Philip Torr, and Luca Bertinetto. Do different tracking tasks require different appearance models? NeurIPS, 2021. 1
[75] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the IEEE international conference on computer vision, pages 1385–1392, 2013. 2
[76] Jiarui Xu and Xiaolong Wang. Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. arXiv preprint arXiv:2103.17263, 2021. 2, 5, 6
[77] Li Xu, Jianing Chen, and Jiaya Jia. A segmentation based variational model for accurate optical flow estimation. In European Conference on Computer Vision, pages 671–684. Springer, 2008. 3
[78] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018. 5, 7, 8
[79] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018. 2
[80] Jason J. Yu, Adam W. Harley, and Konstantinos G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Computer Vision - ECCV 2016 Workshops, Part 3, 2016. 1, 2
[81] Jean-Yves Bouguet. Pyramidal implementation of the lucas kanade feature tracker. Intel Corporation, Microprocessor Research Labs, 2000. 1
[82] Yiran Zhong, Pan Ji, Jianyuan Wang, Yuchao Dai, and Hongdong Li. Unsupervised deep epipolar flow for stationary or dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12095–12104, 2019. 2, 8
[83] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, 2002. 1
[84] CW Zitnick, Nebojsa Jojic, and Sing Bing Kang. Consistent segmentation for optical flow estimation. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pages 1308–1315. IEEE, 2005. 2
[85] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. ECCV, 2018. 3

A. Ablating Scales in the Random Walk

Our model learns feature representations by performing a multiscale contrastive random walk on L = 5 spatial scales. Here we ask how the number of scales affects performance. Instead of performing the contrastive random walk on all scales, we perform it on only the first (coarse) scales. Subsequent (finer) scales use the untrained (random) embeddings, but are otherwise unchanged. We found that reducing the number of scales greatly reduced performance, suggesting that multiscale estimation and feature learning are critical components of our model.

Table 8. Number of scales in the multiscale random walk. We perform the contrastive random walk on only the first # levels, from coarse to fine.

            KITTI-15 train
# levels    noc      all      ER% (occ)
1           4.45     8.98     24.52
2           2.70     6.31     17.64
3           2.36     4.55     14.35
4           2.12     3.97     12.79
5           2.09     3.86     12.45

B. Hyperparameters

We list the hyperparameters and ranges considered during our experiments. Weights for the boundary loss and the learned photometric loss are hand-chosen. Parameters in bold are the ones that were systematically explored via ablations. For the image resolution, we follow the experimental setup of Jonschkowski et al. [31], i.e., Flying Chairs: 384 × 512, Sintel: 448 × 1024, KITTI: 640 × 640. We use the same loss weight across different scales for a specific type of loss. RGB image values are scaled to [−1, 1] and augmented by randomly shifting the hue and brightness and randomly flipping left/right. The augmentation is kept the same across the frames in a pair of images. In contrast to other work [31], we did not modify the model per dataset.

Table 9. Hyperparameter settings

Hyperparameter                     Values
Learning rate schedule             CyclicLR
Base learning rate                 10^−4
Max learning rate                  5 × 10^−4
Temperature τ                      0.07
Video length                       2, 3, 4
Window size (k × k)                3, 7, 11

Loss weight                        Values
Contrastive random walk loss       1
Smoothness loss                    30
Boundary loss                      1
Regressor constraint loss          1
Feature consistency loss           0.1
λ in smoothness loss               150

Training details. We train our network with PyTorch [48], using the Adam [33] optimizer with a cyclic learning rate schedule [57] with a base learning rate of 10^−4 and a max learning rate of 5 × 10^−4. We use a batch size of 4 for the 2-cycle model and 2 for the 3- and 4-cycle models (due to memory constraints). The total training takes approximately four days on two GTX 2080 Ti GPUs: two days for training on Flying Chairs and two days for training on Sintel/KITTI. Experiments on Sintel and KITTI start from a model that was first trained on Flying Chairs, as in [31].

C. Architecture

We provide additional details about the network architecture, which closely resembles PWC-Net with the simplifications introduced by ARFlow [39]. We attach a 1 × 1 convolutional layer to each scale to obtain the embeddings (of 32 channels) for the contrastive random walk at each scale.
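A minimal sketch of the embedding heads described above: one 1 × 1 convolution per pyramid scale producing 32-channel embeddings, ℓ2-normalized before similarities are computed. The per-scale channel counts are placeholders for whatever the reduced PWC-Net backbone outputs, and the normalization step is our assumption based on the main text.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHeads(nn.Module):
    """One 1x1 conv per pyramid scale, mapping backbone features to 32-d embeddings."""
    def __init__(self, channels_per_scale, emb_dim=32):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(c, emb_dim, kernel_size=1)
                                   for c in channels_per_scale)

    def forward(self, pyramid):
        # pyramid: list of (B, C_l, H_l, W_l) feature maps, coarse to fine.
        return [F.normalize(head(feat), dim=1) for head, feat in zip(self.heads, pyramid)]

# Example (channel counts are illustrative only):
# heads = EmbeddingHeads(channels_per_scale=[196, 128, 96, 64, 32])
```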

D. Additional Qualitative Results

We provide additional qualitative results for optical flow in Figure 6.

Figure 6. Additional qualitative results for KITTI and Sintel. (Columns: Ground Truth, Nonparametric (Ours), Regression (Ours), Census, Census + Regression (Ours); rows: KITTI, Sintel-Final.)
