Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding
Shentong Mo (Carnegie Mellon University, Pittsburgh, United States, shentonm@andrew.cmu.edu), Daizong Liu (Peking University, Beijing, China, dzliu@stu.pku.edu.cn), Wei Hu* (Peking University, Beijing, China, forhuwei@pku.edu.cn)
* Corresponding author.
arXiv:2203.03838v1 [cs.CV] 8 Mar 2022

Abstract

Query-based video grounding is an important yet challenging task in video understanding, which aims to localize the target segment in an untrimmed video according to a sentence query. Most previous works achieve significant progress by addressing this task in a fully-supervised manner with segment-level labels, which require high labeling cost. Although some recent efforts develop weakly-supervised methods that only need the video-level knowledge, they generally match multiple pre-defined segment proposals with the query and select the best one, which lacks fine-grained frame-level details for distinguishing frames with high repeatability and similarity within the entire video. To alleviate the above limitations, we propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting. Firstly, instead of utilizing redundant segment proposals, we propose a new grounding scheme that learns frame-wise matching scores referring to the query semantics to predict the possible foreground frames by only using the video-level annotations. Secondly, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm to learn more discriminative frame-wise representations for distinguishing the false positive frames. In particular, we iteratively explore multi-scale hard negative samples that are close to positive samples in the representation space for distinguishing fine-grained frame-wise details, thus enforcing more accurate segment grounding. Extensive experiments on two challenging benchmarks demonstrate the superiority of our proposed method compared with the state-of-the-art methods.

1. Introduction

Query-based video grounding has attracted increasing attention due to its wide spectrum of applications in video understanding [5, 11, 24, 30]. This task aims to determine the start and end timestamps of a target segment in an untrimmed video that contains an activity semantically corresponding to a given sentence description, as shown in Figure 1. Most previous works [6, 10, 19, 20, 36, 39] have achieved significant performance by addressing query-based video grounding in a fully-supervised manner, which however requires a large amount of segment-level annotations (the location of the target segment in the video according to the semantics of the matched query). Such manual annotation is quite labor-intensive and time-consuming, thus limiting the wide applicability of query-based video grounding.

Recently, some weakly-supervised works [8, 14, 25, 26, 47] have been proposed to alleviate the above issue by only leveraging the video-level knowledge of matched video-query pairs without detailed segment labels. These methods generally pre-define multiple segment proposals, and employ video-level annotations as supervision to learn the segment-query matching scores for selecting the best one. However, the generated segment proposals are redundant and contain many negative (i.e., false) samples, resulting in inferior effectiveness and efficiency of the models. Further, as for the positive (i.e., correct) proposals covering the accurate foreground frames, they are of high similarity [41] and require more sophisticated intra-modal recognition capabilities to distinguish. Especially for the boundary frames which exhibit similar visual appearance to the foreground frames in a certain segment, some of them are background frames that are hard to recognize. Once such a segment is selected as the best one, the grounding performance will be degenerated due to the background noise.

[Figure 1. Illustration of the proposed multi-scale self-contrastive learning for weakly-supervised query-based video grounding. The example query "Person walks through the doorway" is grounded to the 1.90s-8.10s segment of an untrimmed video; frame-scale and segment-scale score curves are refined by iteratively mining harder negative frames and segments over steps 1 to n.]

To this end, we propose a novel Multi-scale Self-Contrastive Learning (MSCL) paradigm with a hard negative mining strategy for weakly-supervised query-based video grounding, aiming to learn fine-grained frame-wise semantic matching by progressively sampling harder negative-positive frames for discriminative feature learning. In particular, instead of relying on redundant segment proposals for matching and selection, we propose to learn more fine-grained frame-wise matching scores to predict whether each frame is a foreground frame. Once the scores of successive frames are larger than a learnable threshold, they are taken to construct the predicted segment, leading to more efficient grounding. The threshold is acquired from the frame-wise scores learned and enhanced by our developed multi-scale self-contrastive learning paradigm. We achieve such fine-grained frame-wise representation learning by the following twofold novelties.

Firstly, since we only resort to video-level annotations with no access to the frame-level knowledge, we propose frame-wise matching score prediction to estimate the score of each frame matched to the video-level annotations in a weakly-supervised manner, as well as the frame-wise matching weight via an attention mechanism, in order to choose possible foreground frames with query semantics. Secondly, in order to improve the frame-wise score prediction by enhancing the frame-wise representations, we propose a self-contrastive learning with multi-scale hard negative mining strategy, which especially discriminates frames adjacent to the ground truth target segment, referred to as hard negative samples as shown in Figure 1. In particular, we dynamically set a range according to the previously estimated scores so as to select positive and negative samples, where samples with scores within the range serve as negative frames and those above the range as positive ones. The range is progressively updated to exploit harder negative samples that are more similar to the positive ones, which leads to more discriminative feature learning. Note that our contrastive learning strategy is quite different from the previous vanilla ones [28, 43, 47] in this task, since they all utilize a one-step algorithm to define constant negative samples with coarse frame-level representations. Compared to them, we employ a multi-step process to iteratively mine the negative samples in a coarse-to-fine manner, defining harder negative samples and thus leading to more discriminative frame-wise representation learning. Further, we explore hard negative samples from different hierarchies, i.e., the local frame-scale and the nonlocal segment-scale, thus learning multi-scale intrinsic features.

Specifically, given the input video and query, we first encode both visual and textual features, and align their semantics by a video-query attention mechanism to learn the cross-modal interactions. Next, we predict frame-wise matching scores referring to the interacted features for foreground frame (inside the target segment) localization, as well as the frame-wise contribution weights for their importance estimation, which are aggregated to compute the overall semantic dependency between the video-query pair. Further, we enhance frame-wise representations by learning more discriminative features via the proposed multi-scale self-contrastive learning strategy, where both frame-scale and segment-scale hard negative sampling are deployed in a coarse-to-fine manner.
Our main contributions are summarized as follows:

• We propose a novel self-contrastive learning framework for weakly-supervised query-based video grounding, which predicts fine-grained frame-wise matching scores referring to the query semantics for more accurate segment localization.

• We propose a multi-scale hard negative mining in the self-contrastive learning to learn discriminative frame-wise representations by adaptively sampling hard negatives in the frame-scale and the segment-scale respectively, which captures both local and nonlocal intrinsic patterns.

• Extensive experiments demonstrate that the proposed MSCL model outperforms the state-of-the-art methods significantly over two challenging benchmarks.

2. Related Work

Fully-supervised query-based video grounding. Most existing works address the video grounding task in a fully-supervised manner, where both the annotations of video-sentence pairs and the corresponding segment boundaries are given. Traditional methods [6, 10] utilize the proposal-based framework that samples video segment proposals through dense sliding windows and subsequently integrates the query with these proposal representations via a matrix operation. To further mine the cross-modal interaction more effectively, some works [15–18, 21–23, 36, 37, 39, 42, 45, 46] integrate the sentence representation with those pre-defined segment proposals individually, and then evaluate their matching relationships. The proposal with the highest matching score is selected as the target segment. Although the proposal-based methods can achieve significant performance, they severely rely on the quality of the proposals and are very time-consuming. Instead of utilizing the segment proposals, recent proposal-free methods [1, 2, 27, 40] directly regress the temporal locations of the target segment. Specifically, they either regress the start/end timestamps based on the entire video representation [1, 27], or predict at each frame to determine whether this frame is a start or end boundary [2, 40]. These works are much more efficient than the proposal-based ones, but achieve relatively lower performance. However, both proposal-based and proposal-free methods heavily rely on a large amount of human annotations that are hard to collect in practice.

Weakly-supervised query-based video grounding. As manually annotating temporal boundaries of target moments is time-consuming, recent research attention has shifted to developing weakly-supervised video grounding models [4, 26, 32, 33], which only require video-level annotations. [26] proposed the first weakly-supervised model to learn a joint embedding space for video and query representations. [8] developed a two-stream structure to measure the moment-query consistency and conduct moment selection simultaneously. Although the above methods have achieved promising performance, they are two-stage approaches that utilize multi-scale sliding windows to generate moment candidates, therefore suffering from inferior effectiveness and efficiency. To address this issue, [14, 25, 47] further improve the segment-sentence matching accuracy, and score all the moments sampled at different scales in a single pass. [38] employs a reinforcement learning framework to refine the segment boundary. However, almost all of the existing methods rely on the segment proposals for matching and selection, which fail to capture and distinguish more fine-grained details among visually similar frames for acquiring more accurate segment boundaries.

Contrastive Learning. Contrastive learning [3, 9] is a self-supervised learning paradigm that has demonstrated its effectiveness in many tasks, such as image classification, object detection and point cloud classification. Previous works [28, 43, 47] also showed promising results of contrastive learning in video grounding. Typically, [28] proposed a dual contrastive learning loss function by utilizing video-level samples for video-to-video and video-to-query representation learning. A Counterfactual Contrastive Learning framework [47] is designed to distinguish video-level embeddings between counterfactual positive and negative samples for hard negative sampling. Our model differs significantly from these methods: 1) We leverage the input single video only to perform contrastive learning over frames and segments, without resorting to different instances of videos as in previous works. 2) We propose multi-scale hard negative sampling at the frame scale and the segment scale to iteratively capture both local and non-local intrinsic feature representations.

3. Methodology

3.1. Overview

We focus on weakly-supervised query-based video grounding. Given an untrimmed video and the language query, the goal is to localize the start and end time of the temporal moment corresponding to the language query. As illustrated in Figure 2, the proposed MSCL model mainly consists of four modules:

• Multi-modal encoding. Given the multi-modal input, we first employ video and query encoders to encode both visual and textual features, and then interact the cross-modal information for semantic alignment.
• Frame-wise matching score prediction. After generating query-specific video representations by the cross-modal interaction, we predict frame-wise matching scores and frame-wise matching weights referring to the aligned multi-modal features for choosing the possible foreground frames within the video.

• Multi-scale self-contrastive learning. In order to improve the frame-wise score prediction by enhancing the frame-wise representations, we perform self-contrastive learning at both the frame-scale and the segment-scale with progressive hard negative mining to distinguish more fine-grained frame-wise details, thus enforcing more accurate segment grounding.

• Segment localization. At the inference time, we first construct possible segments by choosing the consecutive frames with scores higher than the threshold, and then select the best segment by comparing the average scores of the internal frames within each segment.

We elaborate on the four modules in order as follows.

[Figure 2. The overall framework of our proposed MSCL model, consisting of multi-modal encoding (shared video and query encoders with video-query interaction), frame-wise matching score prediction (frame score and weight heads), multi-scale self-contrastive learning (iterative frame-scale and segment-scale negative mining over the frame-wise and segment-wise score curves), and segment localization at inference.]

3.2. Multi-Modal Encoding

Video and query encoders. For each video, following previous works, we first employ a pre-trained C3D model [34] to extract its frame-level features V = {v_i}_{i=1}^n ∈ R^{n×D_v}, where n is the number of frames and D_v is the feature dimension. For each query, we deploy the GloVe model [29] to obtain word-level embeddings Q = {q_i}_{i=1}^m ∈ R^{m×D_q}, where m is the number of words and D_q is the feature dimension. Then both video and query features are projected into the same latent space by two fully-connected layers to generate V' and Q' with the same dimension D, where V' ∈ R^{n×D} and Q' ∈ R^{m×D}. After that, we feed V', Q' into the modality-specific encoders f_v(·), f_q(·) to generate the final visual representations Ṽ and query embeddings Q̃. That is, Ṽ = f_v(V'), Q̃ = f_q(Q'). Here, f_v(·), f_q(·) share weights and consist of four convolution layers, followed by a multi-head attention layer [35].

Video-query interaction. We further apply a video-query attention mechanism to learn the cross-modal interactions, where we calculate the similarity score S ∈ R^{n×m} between video and query features, and use the SoftMax operation along the row and column to generate S_r and S_c, respectively. Next, we compute the video-to-query (V̄) and query-to-video (Q̄) attention contexts [44] as:

\bar{V} = S_r \cdot \tilde{Q} \in R^{n \times D}, \quad \bar{Q} = S_r \cdot S_c^{\top} \cdot \tilde{V} \in R^{n \times D}.   (1)

Then a single feed-forward layer FFN (composed of multiple linear layers) is applied to generate the interacted output features V^q ∈ R^{n×D} as:

V^q = \mathrm{FFN}([\tilde{V};\, \bar{V};\, \tilde{V} \odot \bar{V};\, \tilde{V} \odot \bar{Q}]),   (2)

where ⊙ denotes the Hadamard product and [ · ; · ] denotes concatenation.
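As a rough reference for how this interaction can be wired up, the following PyTorch-style sketch mirrors Eqs. (1)-(2). It is not the authors' released code: the dot-product similarity used for S, the tensor shapes, and the concrete FFN architecture are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def video_query_interaction(V_tilde, Q_tilde, ffn):
    """Sketch of Eqs. (1)-(2). V_tilde: (n, D) frame features, Q_tilde: (m, D) word features.

    ffn is assumed to map (n, 4*D) -> (n, D); the paper only states that it is composed
    of multiple linear layers.
    """
    S = V_tilde @ Q_tilde.t()                      # (n, m) video-word similarity scores
    S_r = F.softmax(S, dim=1)                      # softmax along each row
    S_c = F.softmax(S, dim=0)                      # softmax along each column
    V_bar = S_r @ Q_tilde                          # video-to-query context, (n, D)
    Q_bar = S_r @ S_c.t() @ V_tilde                # query-to-video context, (n, D)
    fused = torch.cat([V_tilde, V_bar, V_tilde * V_bar, V_tilde * Q_bar], dim=-1)
    return ffn(fused)                              # interacted features V^q, (n, D)

# Example usage with D = 512 (the encoded feature dimension reported in Sec. 4.2).
ffn = nn.Sequential(nn.Linear(4 * 512, 512), nn.ReLU(), nn.Linear(512, 512))
Vq = video_query_interaction(torch.randn(128, 512), torch.randn(12, 512), ffn)
```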
Algorithm 1: Multi-scale contrastive learning algorithm
Input: video and query features V and Q, iteration number L
1: Initialize the parameters f_v(·), f_q(·), h_s(·), h_w(·), b_l, b_u
2: Warm-up our model for 50 epochs without hard negative sampling
3: for iteration l ← 1 to L do
4:     Encode features Ṽ, Q̃ and calculate V^q as in Eq. 2
5:     Predict frame scores and weights as in Eq. 3
6:     Calculate the score loss as in Eq. 4
7:     Update b_l, b_u as in Eq. 5
8:     Calculate the frame and segment losses as in Eqs. 6 and 7
9:     Compute the total loss as in Eq. 8
10:    Update the parameters of f_v(·), f_q(·), h_s(·), h_w(·)
11: end for
Output: f_v(·), f_q(·), h_s(·), h_w(·)

3.3. Frame-wise Matching Score Prediction

In the weakly-supervised setting, we only have access to the knowledge of the matched video-query pair without corresponding detailed segment-level annotations. In order to determine which frame is matched with the query semantics and how much the frame contributes to the final grounding, we introduce a score-based self-supervised branch to predict frame-wise matching scores and frame-wise matching weights for choosing the most possible foreground frames. Specifically, we devise a frame score head h_s(·) and a frame weight head h_w(·) to predict the corresponding matching score s_i and weight w_i for each frame i, respectively. Here, both h_s(·), h_w(·) are composed of three linear layers. The scores S = {s_i}_{i=1}^n and weights W = {w_i}_{i=1}^n are formulated as:

S = \mathrm{Sigmoid}(h_s(V^q)); \quad W = \mathrm{Softmax}(h_w(V^q)).   (3)

Then, for the k-th video and k-th query in each batch, their final semantic matching score ŝ_{k,k} is calculated as ŝ_{k,k} = \sum_{i=1}^{n} s_i \cdot w_i. In addition, we also estimate the similarity score (utilizing dot-product attention) between video features Ṽ and query features Q̃ to measure their distance. The overall score objective is defined as:

L_{score} = -\log \frac{\sum_{k=1}^{K} (\hat{s}_{k,k} + \tilde{V}_k \cdot \tilde{Q}_k)}{\sum_{k=1}^{K} \sum_{j=1}^{K} (\hat{s}_{k,j} + \tilde{V}_k \cdot \tilde{Q}_j)},   (4)

where ŝ_{k,j} represents the overall video score corresponding to the j-th query features and the k-th video features in the same batch, ŝ_{k,k} denotes the score of the matched video-query pair, and K denotes the batch size. In this way, we maximize the overall score of video and query features from correct pairs while minimizing the score of false pairs. After getting the matching scores of all frames, we take them as pseudo labels to provide better supervision for iteratively training the following contrastive learning module, and the learned discriminative features in turn further lead to more precise matching score prediction.
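To make the score branch concrete, here is a minimal PyTorch-style sketch of Eqs. (3)-(4) under our own simplifying assumptions: the head shapes, the pooled features used for the Ṽ_k · Q̃_k term, and the reuse of the matched score ŝ_{k,k} as a stand-in for ŝ_{k,j} (which would otherwise require re-running the interaction per pair) are not specified by the paper in this form.

```python
import torch
import torch.nn as nn

def frame_scores_and_weights(Vq, score_head, weight_head):
    """Eq. (3): frame-wise matching scores S and weights W from interacted features Vq: (K, n, D)."""
    S = torch.sigmoid(score_head(Vq)).squeeze(-1)            # (K, n), per-frame scores in [0, 1]
    W = torch.softmax(weight_head(Vq).squeeze(-1), dim=-1)   # (K, n), weights summing to 1 per video
    return S, W

def score_loss(S, W, V_pool, Q_pool, eps=1e-8):
    """Eq. (4): maximize the score of matched video-query pairs within a batch.

    V_pool, Q_pool: (K, D) pooled video/query features (assumption: mean pooling of
    Ṽ and Q̃ before the dot product).
    """
    s_hat = (S * W).sum(dim=-1)                              # ŝ_{k,k}, shape (K,)
    sim = V_pool @ Q_pool.t()                                # (K, K) pairwise feature similarities
    numer = (s_hat + sim.diagonal()).sum()
    denom = (s_hat.unsqueeze(1) + sim).sum()                 # diagonal score reused for mismatched pairs
    return -torch.log(numer.clamp(min=eps) / denom.clamp(min=eps))

# Example heads with three linear layers (D = 512), following the paper's description.
score_head = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))
weight_head = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))
```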
3.4. Multi-scale Self-contrastive Learning

In order to discriminate the frame-wise representations for more accurate prediction of matching scores, we propose a multi-scale self-contrastive learning paradigm with hard negative mining to capture more discriminative frame-wise representations in a coarse-to-fine manner. Specifically, we iteratively mine hard negative samples that are close to positive samples in the representation space with a multi-step strategy. In each step, we first choose the positive and negative frames according to their predicted frame-wise scores at the frame scale, and then consider one positive segment with the highest segment score while taking the other segments as negative samples at the segment scale. Then, we perform both frame-scale and segment-scale contrastive learning to learn more discriminative fine-grained frame-wise details. The updated frame-wise features in turn provide more accurate matching scores to mine harder negative samples in the next step of learning. By performing multi-scale self-contrastive learning with such an iterative strategy, our model is able to enforce more accurate segment grounding. We will illustrate the details of both frame- and segment-scale negative mining of each step in the following.

Frame-scale. In order to mine hard negative frames that are close to positive frames in the representation space, we iteratively assign a lower bound b_l and an upper bound b_u to select positive and negative frames. The lower bound b_l and the upper bound b_u are defined from the frame-wise scores as:

b_l = b_l^0 \cdot \delta^{\lfloor (e - e_0)/e_0 \rfloor}, \quad b_u = \frac{1}{n} \sum_{i=1}^{n} s_i,   (5)

where δ is the increasing step, e and e_0 denote the current epoch and the update cycle of epochs respectively, and b_l^0 is the initial value of b_l. We set b_l^0 = 1e-8, e_0 = 50, δ = 10 during training; that is, we increase b_l exponentially by a factor of 10 every 50 epochs after the warm-up stage. Accordingly, we consider frames with scores greater than b_u as positive frames, and other frames with scores ranging from b_l to b_u as negative frames. The loss function of the frame-scale contrastive learning is defined as:

L_{fra} = -\log \frac{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot p_k^f}{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot n_k^f},   (6)

where p_k^f and n_k^f denote the binary index masks of positive and negative frames at batch index k, respectively. The entries of the corresponding indices are 1 and the others are 0.
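The frame-scale mining step of Eqs. (5)-(6) could be sketched as below. Treating the bound update as a stepwise schedule and pooling the query features before the frame-query dot product are our reading of the text rather than verified implementation details.

```python
import torch

def frame_scale_masks(S, epoch, e0=50, b_l0=1e-8, delta=10.0):
    """Eq. (5): lower/upper bounds and the resulting positive/negative frame masks.

    S: (K, n) frame-wise matching scores. After the warm-up (e0 epochs), b_l grows by a
    factor of `delta` every e0 epochs, so harder negatives are mined over time.
    """
    b_l = b_l0 * delta ** max((epoch - e0) // e0, 0)
    b_u = S.mean(dim=-1, keepdim=True)               # (K, 1), mean frame score per video
    pos = (S > b_u).float()                          # p^f: confident foreground frames
    neg = ((S >= b_l) & (S <= b_u)).float()          # n^f: hard negatives near the boundary
    return pos, neg

def frame_contrastive_loss(V_tilde, Q_tilde, pos, neg, eps=1e-8):
    """Eq. (6): contrast query-conditioned scores of positive vs. negative frames.

    V_tilde: (K, n, D) frame features, Q_tilde: (K, D) pooled query features (assumption).
    """
    sim = torch.einsum('knd,kd->kn', V_tilde, Q_tilde)   # (K, n) frame-query similarity
    numer = (sim * pos).sum()
    denom = (sim * neg).sum()
    return -torch.log(numer.clamp(min=eps) / denom.clamp(min=eps))
```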
Segment-scale. In order to enforce more accurate segment grounding predictions, we locate the predicted segments {g_t}_{t=1}^T with consecutive indices and calculate the segment score by averaging the scores of the internal frames located in each segment. Then we consider the segment with the highest segment score as the positive segment and the other segments as negative samples. The loss function of the segment-scale contrastive learning is formulated as:

L_{seg} = -\log \frac{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot p_k^g}{\sum_{k=1}^{K} \tilde{V}_k \cdot \tilde{Q}_k \cdot n_k^g},   (7)

where p_k^g and n_k^g denote the binary index masks of frames located in positive and negative segments at batch index k, separately. The entries of the corresponding indices are 1 and the others are 0.

The overall objective of our model is minimized in an end-to-end manner and formulated as:

L = L_{score} + \lambda_{fra} \cdot L_{fra} + \lambda_{seg} \cdot L_{seg},   (8)

where λ_fra and λ_seg denote the weighting hyper-parameters of the frame loss and the segment loss, respectively. In the experiments, we set λ_fra = 10 and λ_seg = 5.

The overall algorithm of our training approach is summarized in Algorithm 1, where we utilize an iterative strategy to gradually mine the hard negative samples. In order to encourage the model to predict accurate positive samples with higher confidence, we first warm-up our model in the first 50 epochs without the procedure of hard negative sampling. Then, we iteratively mine the hard negative samples with high similarity to the positive ones.
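A sketch of the segment-scale selection and the overall objective (Eqs. (7)-(8)) under the same assumptions might look as follows; how candidate segments are enumerated during training is our guess (consecutive frames whose scores exceed b_u, mirroring the inference rule in Sec. 3.5).

```python
import torch

def consecutive_segments(above):
    """Group consecutive True entries of a 1-D boolean mask into (start, end) index pairs."""
    segs, start = [], None
    for i, flag in enumerate(above.tolist()):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segs.append((start, i - 1)); start = None
    if start is not None:
        segs.append((start, len(above) - 1))
    return segs

def segment_scale_masks(scores, b_u):
    """Positive mask covers the candidate segment with the highest mean score; the
    remaining candidate segments provide the negatives (segment-scale mining sketch)."""
    segs = consecutive_segments(scores > b_u)
    pos, neg = torch.zeros_like(scores), torch.zeros_like(scores)
    if not segs:
        return pos, neg
    best = max(segs, key=lambda se: scores[se[0]:se[1] + 1].mean().item())
    for s, e in segs:
        (pos if (s, e) == best else neg)[s:e + 1] = 1.0
    return pos, neg

def total_loss(L_score, L_fra, L_seg, lam_fra=10.0, lam_seg=5.0):
    """Eq. (8) with the weights reported in the paper."""
    return L_score + lam_fra * L_fra + lam_seg * L_seg
```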
3.5. Segment Localization

At the inference time, we select the segment with the highest segment score as the final prediction. Specifically, we extract all possible segments by choosing consecutive frames with scores higher than the upper bound b_u, and calculate the segment score by averaging the scores of the internal frames located in the segment. Then we take the segment with the highest segment score as the final output.
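For inference, the localization rule above can be sketched as follows. Mapping frame indices to seconds by assuming the n sampled frames uniformly cover the video duration, and falling back to the single best frame when no score exceeds b_u, are assumptions on our part.

```python
import torch

def localize_segment(scores, b_u, video_duration):
    """Pick consecutive frames whose scores exceed b_u, then return the candidate segment
    with the highest mean score as (start_time, end_time) in seconds."""
    n = scores.numel()
    above = scores > b_u
    best, best_score, start = None, float('-inf'), None
    for i in range(n + 1):
        flag = bool(above[i]) if i < n else False    # sentinel step closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            seg_score = scores[start:i].mean().item()
            if seg_score > best_score:
                best, best_score = (start, i - 1), seg_score
            start = None
    if best is None:                                  # no frame above the bound
        j = int(scores.argmax()); best = (j, j)
    to_sec = video_duration / n
    return best[0] * to_sec, (best[1] + 1) * to_sec
```

With the score curves in Figure 3, b_u corresponds to the red dotted line, and the returned segment is the run of consecutive frames above it with the highest average score.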
4. Experiments

4.1. Datasets and Evaluation Metrics

Charades-STA. The Charades-STA dataset [7] is built based on the Charades [31] dataset, which contains 6,672 videos of indoor activities and involves 16,128 query-video pairs. There are 12,408 pairs used for training and 3,720 used for testing. The average duration of each video is 29.76 seconds. Each video has 2.4 annotated moments and each annotated moment lasts for 8 seconds on average.

ActivityNet-Caption. The ActivityNet-Caption dataset [13] contains 20,000 videos with 100,000 queries, where 37,421 query-video pairs are used for training and 34,536 are used for testing. The average duration of the videos is 1 minute and 50 seconds. On average, each video in ActivityNet-Caption has 3.65 annotated moments and each annotated moment lasts for 36 seconds.

Evaluation metrics. Following previous works, we adopt the metrics "R@n, IoU=m" to evaluate our model, where "R@n, IoU=m" denotes the proportion of the top n moment candidates with IoU larger than m. Specifically, we set n as 1, 5 and set m as 0.3, 0.5, 0.7 for the Charades-STA dataset and 0.1, 0.3, 0.5 for the ActivityNet-Caption dataset.

4.2. Experimental Settings

To make a fair comparison with previous methods like [47], we extract video features from the pre-trained C3D network [34] and query features from the 300-d GloVe embedding [29]. We train the model for 200 epochs with a batch size of 16. We use a warm-up training without hard negative sampling for 50 epochs. The dimension of encoded features is set to 512. Our model is optimized by Adam [12] with an initial learning rate of 0.01 and linear decay of the learning rate. All experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU.

4.3. Comparison with State-of-the-art Methods

Method | Charades-STA R@1 (IoU=0.3 / 0.5 / 0.7) | Charades-STA R@5 (IoU=0.3 / 0.5 / 0.7) | ActivityNet-Caption R@1 (IoU=0.1 / 0.3 / 0.5) | ActivityNet-Caption R@5 (IoU=0.1 / 0.3 / 0.5)
TGA | 32.14 / 19.94 / 8.84 | 86.58 / 65.52 / 33.51 | - / - / - | - / - / -
CTF | 39.80 / 27.30 / 12.90 | - / - / - | 74.20 / 44.30 / 23.60 | - / - / -
ReLoCLNet | - / - / - | - / - / - | - / 42.65 / 28.54 | - / - / -
SCN | 42.96 / 23.58 / 9.97 | 95.56 / 71.80 / 38.87 | 71.48 / 47.23 / 29.22 | - / 71.45 / 55.69
MARN | - / 31.94 / 14.81 | - / 70.00 / 37.40 | - / 47.01 / 29.95 | - / 72.02 / 57.49
RTBPN | 60.04 / 32.36 / 13.24 | 97.48 / 71.85 / 41.18 | 73.73 / 49.77 / 29.63 | 93.89 / 79.89 / 60.56
VGN+CCL | - / 33.21 / 15.68 | - / 73.50 / 41.87 | - / 50.12 / 31.07 | - / 77.36 / 61.29
Ours | 58.92 / 43.15 / 23.49 | 98.02 / 81.23 / 48.45 | 75.61 / 55.05 / 38.23 | 95.26 / 82.72 / 68.05
Table 1. Comparison results on the Charades-STA and ActivityNet-Caption datasets.

Charades-STA. We compare our method against the current state-of-the-art methods under weakly-supervised settings on the Charades-STA dataset in Table 1. As can be seen, we achieve the best performance over all baselines. Particularly, our MSCL outperforms VGN+CCL [47], the current state-of-the-art method, by a large margin of 9.94%, 7.81% and 7.73%, 6.58% in terms of R@1, IoU=0.5, 0.7 and R@5, IoU=0.5, 0.7, respectively. This indeed shows the superiority of our multi-scale hard negative sampling strategy in weakly-supervised video grounding.

ActivityNet-Caption. Table 1 also reports the comparison results with the state-of-the-art methods on the ActivityNet-Caption dataset. We can observe that our MSCL achieves superior performance against all previous works. This further demonstrates the advantage of the multi-scale hard negative sampling on distinguishing fine-grained frame-wise details and enforcing more accurate segment grounding.

4.4. Ablation Study

In this part, we conduct extensive ablation studies on each module in our MSCL, including frame-wise matching score prediction and frame-/segment-scale hard negative sampling, the effect of the batch size, and the hyper-parameters. Unless specified, we perform all ablation studies on the Charades-STA benchmark.

Effect of each module. In order to understand how each module in our MSCL affects the final performance, we explore the effect of each proposed loss as shown in Table 2. Our model with frame-wise matching score prediction outperforms the baseline by 6.72%, 1.18%, and 0.94% in terms of the three criteria, which shows the effectiveness of this module. Introducing frame-scale and segment-scale hard negative sampling separately boosts the performance of our model with frame-wise matching score prediction only. Furthermore, by adding frame-scale and segment-scale hard negative sampling together, we observe the highest increase of 7.74%, 21.79%, and 13.71%. This demonstrates the superiority of our multi-scale hard negative sampling over baselines.

L_score | L_fra | L_seg | R@1 IoU=0.3 | R@1 IoU=0.5 | R@1 IoU=0.7
✗ | ✗ | ✗ | 44.19±0.25 | 19.95±0.21 | 8.60±0.18
✓ | ✗ | ✗ | 50.91±0.19 | 21.13±0.18 | 9.54±0.16
✓ | ✓ | ✗ | 55.22±0.12 | 29.73±0.09 | 14.52±0.08
✓ | ✗ | ✓ | 51.94±0.16 | 33.66±0.11 | 16.77±0.09
✓ | ✓ | ✓ | 58.65±0.08 | 42.92±0.05 | 23.25±0.03
Table 2. Ablation study for the effect of each module.

λ_fra | λ_seg | R@1 IoU=0.3 | R@1 IoU=0.5 | R@1 IoU=0.7
1 | 1 | 58.65±0.08 | 42.92±0.05 | 23.25±0.03
5 | 1 | 58.01±0.07 | 42.58±0.05 | 22.75±0.03
10 | 1 | 58.22±0.07 | 42.75±0.04 | 22.96±0.02
1 | 5 | 58.85±0.05 | 43.07±0.03 | 23.36±0.01
1 | 10 | 58.52±0.07 | 42.87±0.04 | 23.18±0.02
10 | 5 | 58.92±0.04 | 43.15±0.02 | 23.49±0.01
Table 3. Ablation study for λ_fra and λ_seg.

Analysis of frame and segment loss. Furthermore, we conduct extensive experiments to explore the weighting hyper-parameters of the frame and segment losses used in multi-scale hard negative sampling in Table 3. Specifically, we take the values of λ_fra and λ_seg from (1, 5, 10) for different control settings. With the increase of λ_fra, the performance of our model decreases since we fail to consider the more global information of segments in the whole video. However, with the increase of λ_seg, more global information of segments is introduced such that the results of our MSCL are improved. As can be seen, our MSCL reaches the best performance when λ_fra = 10 and λ_seg = 5, which implies the importance of balancing the weights of the frame-wise and segment-wise hard negative sampling during training.

Robustness to the batch size. In this part, we analyze the effect of the batch size on the final performance of our MSCL, as shown in Table 4, where we set the batch size as 4, 8, 16, 32, 48. From Table 4, we observe that our MSCL with the batch size of 16 achieves the best results in terms of all metrics. Meanwhile, the performance of our model does not change much with the altering of the batch size. This further validates the robustness of our MSCL to the choice of the batch size. In other words, we do not need a large batch size that is desired in previous contrastive learning based methods [43, 47] for weakly-supervised video grounding.

4.5. Visualization Results

In this section, we provide more detailed visualization results on how our MSCL predicts more accurate segment grounding results given frame score curves. Qualitative examples of hard negative sampling and foreground frame predictions on two benchmarks are visualized to validate the superiority of our MSCL.

Frame-wise matching scores. In order to better understand the effectiveness of the segment localization in our MSCL, we plot the frame-wise score curves with respect to the frame index among the positive frames in Figure 3. As can be seen, many hard negative samples with high frame scores appeared in the training process. With the multi-scale hard negative sampling, our MSCL predicts the segment with the highest score greater than the upper bound b_u as the final output, which matches the target segment. This shows the importance of the segment localization in more accurate segment grounding.

Hard Negative Mining. In Figure 4, we also visualize the hard negative samples at epoch = 0, 50, 100, 150. We observe that the hard negative samples with high scores are progressively closer to the positive frames, which validates the effectiveness of our multi-scale hard negative sampling.

Qualitative Results. To validate the superiority of our MSCL in a qualitative manner, we visualize qualitative examples of the Charades-STA and ActivityNet-Caption benchmarks in Figure 5. By comparison, our MSCL achieves better performance than SCN [14] and VGN+CCL [47]. Particularly, we achieve more accurate results on the boundary of the ground-truth segment due to the effectiveness of our multi-scale hard negative sampling strategy.
Batch Size | R@1 (IoU=0.3 / 0.5 / 0.7) | R@5 (IoU=0.3 / 0.5 / 0.7)
4 | 58.53±0.08 / 42.81±0.06 / 23.19±0.04 | 97.88±0.05 / 80.97±0.04 / 48.12±0.04
8 | 58.81±0.05 / 43.03±0.03 / 23.38±0.02 | 97.96±0.04 / 81.14±0.04 / 48.36±0.03
16 | 58.92±0.03 / 43.15±0.01 / 23.49±0.01 | 98.02±0.02 / 81.23±0.02 / 48.45±0.01
32 | 58.91±0.02 / 43.12±0.01 / 23.45±0.01 | 97.98±0.01 / 81.19±0.01 / 48.42±0.01
48 | 58.85±0.02 / 43.09±0.02 / 23.41±0.01 | 97.92±0.02 / 81.13±0.02 / 48.36±0.02
Table 4. Exploration study on the effect of batch size.

[Figure 3. Visualization of frame score curves used for segment localization. Red dotted lines denote the upper bound b_u. The ground-truth frame indices are 49-60, 53-62, 6-23, and 26-37.]

[Figure 4 shows the per-frame scores of a video with ground-truth segment boundaries at frames 27 and 112, with the bounds (b_l, b_u) growing from (0, 1e-6) at epoch 0 to (1e-6, 1e-1) at epoch 150.]
Figure 4. An illustration of the dynamic process of hard negative sampling (gray shadow) at epoch = 0, 50, 100, 150. GT denotes the ground-truth (red indices denote the segment boundaries of the ground-truth). It shows that our iterative mining strategy can mine harder negative samples as the steps go on, leading to more discriminative frame-wise representation learning.

5. Conclusion

In this work, we propose a novel multi-scale self-contrastive learning model for weakly-supervised query-based video grounding. Instead of utilizing redundant segment proposals for semantic matching, we predict frame-wise scores and weights for matching fine-grained frame-wise features with query semantics. In order to learn more discriminative frame-wise representations for predicting accurate frame-wise scores, we further introduce a multi-scale self-contrastive learning with multi-step hard negative mining strategy to progressively discriminate hard negative samples that are close to positive samples in the representation space. This iterative approach captures fine-grained frame-scale details as well as segment-scale semantics for distinguishing frames with high repeatability and similarity within the entire video. Experimental results show that our proposed model outperforms state-of-the-art methods on two challenging benchmarks.
[Figure 5. Visualization of examples on the Charades-STA and ActivityNet-Caption benchmarks. GT denotes the ground-truth. Example 1 (Query: "The person immediately opened a window."): GT 8.00s-15.20s, SCN 7.23s-16.33s, VGN+CCL 8.00s-15.78s, Ours 8.00s-15.23s. Example 2 (Query: "Next, the man raises the woman and they turn around and continue dancing."): GT 167.33s-209.17s, SCN 95.12s-189.51s, VGN+CCL 153.67s-209.17s, Ours 167.21s-209.17s.]

References

[1] Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. Localizing natural language in videos. In Proceedings of the American Association for Artificial Intelligence, 2019.
[2] Long Chen, Chujie Lu, Siliang Tang, Jun Xiao, Dong Zhang, Chilie Tan, and Xiaolin Li. Rethinking the bottom-up framework for query-based video localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10551–10558, 2020.
[3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
[4] Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, and Kwan-Yee K Wong. Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. arXiv preprint arXiv:2001.09308, 2020.
[5] Yu Cheng, Quanfu Fan, Sharath Pankanti, and Alok Choudhary. Temporal sequence modeling for video event detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2227–2234, 2014.
[6] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5267–5275, 2017.
[7] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5267–5275, 2017.
[8] Mingfei Gao, Larry Davis, Richard Socher, and Caiming Xiong. Wslln: Weakly supervised natural language localization networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1481–1487, 2019.
[9] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9729–9738, 2020.
[10] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with temporal language. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1380–1390, 2018.
[11] Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, and Tong Zhang. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 499–515, 2018.
[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 706–715, 2017.
[14] Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, and Huasheng Liu. Weakly-supervised video moment retrieval via semantic completion network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11539–11546, 2020.
[15] Daizong Liu, Xiang Fang, Wei Hu, and Pan Zhou. Exploring optical-flow-guided motion and detection-based appearance for temporal sentence grounding. arXiv preprint, 2022.
[16] Daizong Liu, Xiaoye Qu, Xing Di, Yu Cheng, Zichuan Xu, and Pan Zhou. Memory-guided semantic learning network for temporal sentence grounding. In AAAI, 2022.
[17] Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. Reasoning step-by-step: Temporal sentence localization in videos via deep rectification-modulation network. In COLING, 2020.
[18] Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. Adaptive proposal generation network for temporal sentence localization in videos. In EMNLP, 2021.
[19] Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, and Yulai Xie. Context-aware biaffine localizing network for temporal sentence grounding. In CVPR, 2021.
[20] Daizong Liu, Xiaoye Qu, Xiao-Yang Liu, Jianfeng Dong, Pan Zhou, and Zichuan Xu. Jointly cross- and self-modal graph attention network for query-based moment localization. In ACM MM, 2020.
[21] Daizong Liu, Xiaoye Qu, Yinzhen Wang, Xing Di, Kai Zou, Yu Cheng, Zichuan Xu, and Pan Zhou. Unsupervised temporal video grounding with deep semantic clustering. In AAAI, 2022.
[22] Daizong Liu, Xiaoye Qu, and Pan Zhou. Progressively guide to attend: An iterative alignment framework for temporal sentence grounding. In EMNLP, 2021.
[23] Daizong Liu, Xiaoye Qu, Pan Zhou, and Yang Liu. Exploring motion and appearance information for temporal sentence grounding. In AAAI, 2022.
[24] Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, and Jingjing Liu. Violin: A large-scale dataset for video-and-language inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10900–10910, 2020.
[25] Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, and Chang D Yoo. Vlanet: Video-language alignment network for weakly-supervised video moment retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 156–171, 2020.
[26] Niluthpol Chowdhury Mithun, Sujoy Paul, and Amit K Roy-Chowdhury. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11592–11601, 2019.
[27] Jonghwan Mun, Minsu Cho, and Bohyung Han. Local-global video-text interactions for temporal grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[28] Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, and Wei Lu. Interventional video grounding with dual contrastive learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2765–2775, 2021.
[29] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[30] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1049–1058, 2016.
[31] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision (ECCV), pages 510–526. Springer, 2016.
[32] Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, and Jun Yu. Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. arXiv preprint arXiv:2003.07048, 2020.
[33] Reuben Tan, Huijuan Xu, Kate Saenko, and Bryan A Plummer. Logan: Latent graph co-attention network for weakly-supervised video moment retrieval. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2083–2092, 2021.
[34] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008, 2017.
[36] Jingwen Wang, Lin Ma, and Wenhao Jiang. Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12168–12175, 2020.
[37] Jingwen Wang, Lin Ma, and Wenhao Jiang. Temporally grounding language queries in videos by contextual boundary-aware prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[38] Jie Wu, Guanbin Li, Xiaoguang Han, and Liang Lin. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1283–1291, 2020.
[39] Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[40] Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. Dense regression network for video grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[41] Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, and Zheng Qin. Multi-modal relational graph for cross-modal video moment retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2215–2224, 2021.
[42] Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1247–1257, 2019.
[43] Hao Zhang, Aixin Sun, Wei Jing, Guoshun Nan, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. Video corpus moment retrieval with contrastive learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 685–695, 2021.
[44] Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6543–6554, 2020.
[45] Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12870–12877, 2020.
[46] Zhu Zhang, Zhijie Lin, Zhou Zhao, and Zhenxin Xiao. Cross-modal interaction networks for query-based moment retrieval in videos. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 655–664, 2019.
[47] Zhu Zhang, Zhou Zhao, Zhijie Lin, Xiuqiang He, et al. Counterfactual contrastive learning for weakly-supervised vision-language grounding. Advances in Neural Information Processing Systems (NIPS), 33:18123–18134, 2020.