In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2022. The final version is available at https://doi.org/10.1109/TPAMI.2022.3157198

arXiv:2203.15778v1 [cs.CV] 29 Mar 2022

Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method

Washington Ramos, Michel Silva, Edson Araujo, Victor Moura, Keller Oliveira, Leandro Soriano Marcolino, and Erickson R. Nascimento

• Washington Ramos, Edson Araujo, Victor Moura, Keller Oliveira, and Erickson R. Nascimento are with the Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, MG 31270-901, Brazil. E-mail: {washington.ramos, edsonroteia, victorhugomoura, kellermartins, erickson}@dcc.ufmg.br
• Michel Silva is with the Department of Informatics, Universidade Federal de Viçosa, Viçosa, MG 36570-900, Brazil. E-mail: michel.m.silva@ufv.br
• Leandro Soriano Marcolino is with Lancaster University, Lancaster, LA1 4YW, U.K. E-mail: l.marcolino@lancaster.ac.uk

Abstract—The growth of videos in our digital age and the users' limited time raise the demand for processing untrimmed videos to produce shorter versions conveying the same information. Despite the remarkable progress that summarization methods have made, most of them can only select a few frames or skims, creating visual gaps and breaking the video context. This paper presents a novel weakly-supervised methodology based on a reinforcement learning formulation to accelerate instructional videos using text. A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length without creating gaps in the final video. We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space to represent both textual and visual data. Our experiments show that our method achieves the best performance in Precision, Recall, and F1 Score against the baselines while effectively controlling the video's output length. Visit https://www.verlab.dcc.ufmg.br/semantic-hyperlapse/tpami2022/ for code and extra results.

Index Terms—Untrimmed videos, Fast-forward, Instructional video, Cross-modal data, Reinforcement Learning

1 INTRODUCTION

THE development of new digital technologies, especially over the last two decades, has allowed humans to overcome physical barriers in exchanging information. With the emergence of several tools such as portable devices (e.g., smartphones and wearables) and social multimedia services like Twitter, Facebook, Instagram, and YouTube, storing and sharing data of any kind has become effortless. Moreover, due to the exponential growth in processing power, storage, and bandwidth of digital devices, we are witnessing a substantial increase in the volume of data, in particular, textual and visual data such as images and videos. Every day a plethora of textual tutorials and instructional videos is published on the Internet teaching a variety of tasks, from how to cook burritos and tacos all the way to how to solve partial differential equations (PDEs).

Although textual tutorials and instructional videos share this growth in availability as well as much of their content, they differ in a key aspect for users: how long they take to consume. In general, the information encoded by producers is more concise in the textual domain than in the visual domain. For instance, a recipe for tacos or a tutorial explaining how to solve a PDE is described in a few sentences. Instructional videos, in turn, might spend several minutes showing information that is not relevant to the task, such as a person opening a refrigerator, picking up a pencil, or erasing the blackboard. Such segments could be fast-forwarded without losing the message to be conveyed by the input video. Thus, ideally, instructional videos should be concise, similar to a textual description, but still contain visually rich demonstrations of all main steps of the task.

In this paper, we address the problem of accelerating temporally untrimmed videos by using text documents. For example, a recipe for cooking tacos could be used as a guide to select relevant frames from videos of cooking tacos. Note that this problem is different from moment localization [1], procedure segmentation [2], and video summarization [3], [4], since non-relevant frames are still necessary for a user to understand the flow and temporal coherence of a task, i.e., some segments should be accelerated, but not eliminated.

Although identifying and fast-forwarding non-relevant segments may be a trivial task for humans, it poses significant challenges to fully automated systems. First, identifying relevant segments requires reasoning about the semantic concepts in the scene (e.g., person, tools, ingredients) and their interactions as they evolve over time. Second, in fast-forwarding techniques [5], [6], the overall length of the output video is usually defined by the user. Therefore, the method must adjust a varying speed-up rate along the footage to attend to the user's needs.

Over the past few years, algorithms for semantic fast-forwarding have emerged as effective approaches to retrieve meaningful segments without losing temporal continuity [7], [8], [9], [10], [11], [12], [13], [14], [15]. In our previous work [14], we introduced an algorithm to create fast-forward videos guided by text documents. We formulated the fast-forwarding task as a sequential decision-making process: a reinforcement learning (RL) agent locally decides to increase, decrease, or keep the video's speed-up rate based on the encoded text and video frame. An embedding space is generated by a Visually-guided Document Attention Network (VDAN), which creates representative feature vectors for the textual and visual modalities.
Although this RL agent and VDAN presented relevant results regarding the relevance of the output content, they hold two significant drawbacks. First, the user does not control the output video's duration, which is one of the key aspects of most fast-forward algorithms. This control is also essential for systems where storage or time resources are limited. Second, VDAN can only encode static frames; however, many sentences in the input document, such as the ones with actions describing an instruction step, may require modeling the scene's temporal dynamics.

To overcome these drawbacks, we introduce the Skip-Aware Fast-Forwarding Agent (SAFFA), which can adapt its actions to drop frames according to their relevance and subject to a target speed-up rate defined by the user. SAFFA is trained with a novel joint reward function that aims at semantics and speed-up, allowing us to create videos of different lengths without retraining. We also extend the representation capabilities of VDAN with the novel Extended Visually-guided Document Attention Network (VDAN+), which creates a joint embedding space for documents and video segments instead of single static frames. Finally, thorough experiments are conducted on challenging instructional video datasets [2], [16]. We show quantitative and qualitative results with in-depth discussions, showing that our method achieves the best performance in F1 Score against the baselines while effectively controlling the video's output length.

This work takes a step towards fast-forwarding videos to a required length based on the semantics of the video content and the speed-up rate defined by the user. Using textual data to guide an agent that seeks the best set of frames to be removed, our method emphasizes segments with high semantic content while preserving the temporal continuity and the target output length. Our approach is weakly supervised since it does not require any temporal annotation at training or inference time, i.e., only the textual instructions are needed. The problem of text-driven video acceleration naturally fits the sequential decision-making formulation via a Markov Decision Process (MDP) since no information about previous states is required for taking subsequent actions. Unlike planning solutions such as dynamic programming, which would require going through all possible states multiple times to calculate the value function, our RL-based solution processes each state at most once. Thus, a trained agent acts similarly to a knowledgeable human facing a new document and unseen video segments while controlling the video playback speed to achieve the required length.

Our contributions can be summarized as follows: i) a new fast-forward method based on a reinforcement learning formulation, which is able to accelerate videos according to clip similarity scores with textual data subject to a target speed-up rate; ii) a new state representation and joint reward function that guide our agent to attend to the multiple goals, i.e., semantics and speed-up; iii) an extended version of VDAN, the VDAN+, which can create a joint embedding space for documents and video clips instead of static frames; and iv) extensive experiments with in-depth discussions, including an ablation study, using the YouCook2 and COIN datasets, as opposed to the small subset of YouCook2 used in our previous work.

2 RELATED WORK

2.1 Video Summarization

Over the past several years, video summarization methods have figured prominently in the task of shortening a video [17], [18], [19], [20]. The shorter version is usually a summary of the input video composed of a storyboard of keyframes or video skims with the most distinguishable segments [21]. The strategies adopted to summarize the videos vary from clustering frame features [19] and training neural networks that infer the representativeness of a video segment [20], [22] to employing additional information such as user queries, external sensors [17], or textual annotations [3].

Lee et al. [17], in the context of first-person videos, analyzed properties such as social interaction, gaze, and object detection to create a storyboard summary of the video. Zhang et al. [20] proposed a method to create a storyboard or skims by modeling long-range dependencies among the video frames using a Bi-LSTM network. Yao et al. [22] performed the selection of the relevant segments by fusing information from spatial and temporal CNNs to identify highlight moments in sports videos. Plummer et al. [3] created an approach that selects the best subset of video segments by analyzing their visual features along with vision-language modeling.

Reinforcement Learning (RL) has also been applied to video summarization [23], [24], motivated by its success in many challenging tasks, such as mastering complex games like Go [25] and achieving human-level performance in Atari [26]. Additionally, it has found application in vision tasks such as visual tracking [27] and image cropping [28]. Zhou et al. [23] presented an end-to-end unsupervised framework also based on RL. Their method summarizes videos by applying a diversity-representativeness reward that guides the agent to create more diverse and representative summaries. Li et al. [24] introduced an additional video-level semantic reward to guide the unsupervised RL procedure proposed by Zhou et al. Their approach uses a semantic supervisor network to provide the reward signals based on the similarity between the original video and the generated summary. As a drawback, their video-level semantic reward may favor summaries that yield higher global semantic similarities, which is not the case for instructional videos, where the steps may represent a small portion of the video. Our approach, on the other hand, provides local rewards given by the similarity between video segments and text.

Despite impressive advances in shortening videos while retaining relevant segments, most summarization methods ignore temporal aspects or use a relaxed temporal restriction, resulting in visual gaps and breaking the video context.

2.2 Semantic Fast-forwarding

The lack of context that emerges from the gaps generated by summarization techniques is a nuisance to video consumers, particularly consumers of instructional videos. The existence of a gap might confuse the user about the whole process. For instance, the user would be unaware whether an
important step was missed within the gap if the original video is unknown. Semantic fast-forward methods add a time constraint to the frame sampling, which results in a shorter and contiguous version of the input video. They usually apply different speed-up rates, where lower rates are assigned to the relevant segments and higher rates to the others.

Okamoto and Yanai [7] proposed a method to accelerate guidance videos emphasizing video segments with pedestrian crosswalks or turning movements at street corners. Ramos et al. [8] presented a semantic fast-forwarding method for first-person videos dealing with visual stability constraints with emphasis on human faces. Silva et al. [11] extended the work of Ramos et al., including a CNN to assign frame scores based on Internet users' preferences and a video stabilizer suited to fast-forwarded videos [29]. Silva et al. [12] proposed modeling the frame sampling in semantic fast-forwarding as a Minimum Sparse Reconstruction problem. The authors also proposed an extension [15] that aims to remove visual gaps that could break the continuity of the output video and to smooth the speed-up transitions between video segments. A drawback of the aforementioned works is their pre-processing steps, such as detecting objects and computing optical flow, which are time-consuming and rely on the accuracy of third-party methods.

Lan et al. [10] introduced the Fast Forward Network (FFNet). The network can summarize videos on the fly using an RL agent that selects frames with the most memorable views according to human-labeled data. Similar to FFNet and Zhou et al., we also apply an agent trained by the RL paradigm; however, our approach is a step towards training agents to work in a cross-modal embedding space in a weakly-supervised manner. Unlike FFNet, our approach does not demand human-labeled data to indicate the relevant frames at training time.

Most recently, methods in semantic fast-forwarding have introduced the usage of language to guide the frame selection. The approach proposed by Ramos et al. [13] personalizes the fast-forwarding process by using social network posts as textual input. Their method uses a dense captioning algorithm on image frames to match positive tweets from Twitter in the embedding space. Although achieving a high F1 Score, the whole approach is too sensitive to errors from each of its components.

2.3 Cross-modal Embedding for Instructional Videos

Recently, algorithms for cross-modal embedding have emerged as promising and effective approaches to deal with tasks like video description [30] and text-based image/video retrieval [31], [32], [33]. These methods rely on creating a shared embedding space where features from multiple modalities can be compared.

There is a growing body of research on instructional video and image analysis. Marin et al. [34] applied a multi-modal neural model to learn a shared embedding space for images and recipes and tackled the task of retrieving recipes from image queries. Wang et al. [35] proposed an adversarial learning strategy to align both modalities. Weakly supervised and unsupervised methods have been proposed to tackle different tasks such as activity segmentation [36], localizing key steps [37], and parsing [38] of instructional videos. Leveraging cross-modal information, Alayrac et al. [37] proposed a method to discover the main steps of a task by linking visual and textual clustering tasks to connect the modalities via joint constraints. Zhukov et al. [39] proposed a new representation for instructional video steps by learning a set of classifiers per component (e.g., nouns and verbs). Sener et al. [38] use visual and language atoms to create a multimodal representation to parse a video into semantic steps.

In our previous approach [14], we built an end-to-end trainable embedding space for text and image, which is further used by an RL agent to accelerate an input video. However, unlike most other fast-forward methods [5], [6], [8], [11], [12], [15], [40], [41], [42], the agent cannot optimize the output speed-up rate of the video, which is essential in several applications. In this work, we extend our previous approach [14] by introducing the Skip-Aware Fast-Forward Agent (SAFFA) and an Extended Visually-guided Document Attention Network (VDAN+). While our new agent can jointly optimize the semantics and the speed-up objectives, the new cross-modal embedding space, VDAN+, provides the semantic distance between each snippet in the instructional video and the textual steps described in a document, i.e., the recipe.

3 METHODOLOGY

As stated, we formulate the fast-forwarding task as a sequential decision-making process, where an agent is trained to create fast-forwarded videos. Our methodology combines an RL paradigm and an encoder-decoder cross-modal framework to reduce the video's size by keeping the most relevant frames and avoiding temporal gaps. Specifically, after creating an embedding space for encoding documents and videos using a novel Extended Visually-guided Document Attention Network (VDAN+), we train an RL agent to decide which frames should be removed. The agent observes the encoded text and video snippets, its position in the video, and its current average skip rate, then outputs a distribution over the actions for increasing, decreasing, or maintaining the current speed-up rate. Figure 1 shows the main steps of our approach.

3.1 Extended Visually-guided Document Attention Network (VDAN+)

In this paper, we extend VDAN [14] with the novel Extended Visually-guided Document Attention Network (VDAN+). Although the embedding space created by VDAN has shown promising results providing semantically meaningful representations for image and text, it is limited by the lack of proper temporal modeling. For instance, some actions like "open" and "close" are inversely related; therefore, a single static frame would be ambiguous when associated with the text description.

To overcome this limitation, VDAN+ takes a document and a video clip as input and, guided by the video features, creates representative feature vectors for both modalities. By training VDAN+, we aim at creating an embedding space in which textual and visual features are semantically aligned. We argue that the aligned embedding vectors help our agent identify which video segments are related to the input document.
Fig. 1. Overview of our methodology. Given an input document (e.g., a recipe with steps such as "Cook the bacon in a sauce pan") and an input video, VDAN+ encodes each sentence with word embeddings, Bi-RNNs, and attention layers, and encodes the current video clip with a 3D CNN, producing aligned document and clip embeddings. SAFFA receives these embeddings together with the NRPE encoding of its position in the video and a one-hot encoding of its current average speed-up rate (given the target and maximum speed-up rates); its policy and state-value networks output a probability distribution over the actions (decelerate, do nothing, accelerate) and the expected return, respectively.
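The detailed description of the VDAN+ document and video encoders is not fully legible in this transcription, so the sketch below is only a rough illustration assembled from Figure 1 and the dimensions reported in Section 4.2 (GloVe word vectors, hierarchical recurrent encoders with attention, a 3D CNN feature φ(v) used as the initial hidden state, projections into a d = 128 space, and ℓ2 normalization). All class, method, and variable names are ours, the use of plain GRUs instead of Bi-RNNs is a simplification, and the exact attention and initialization details are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VDANPlusEncoderSketch(nn.Module):
    """Illustrative two-branch document/clip encoder; layout and names are assumptions."""

    def __init__(self, word_dim=300, hidden=512, ctx=1024, embed_dim=128, visual_dim=512):
        super().__init__()
        self.word_rnn = nn.GRU(word_dim, hidden, batch_first=True)   # word-level recurrence
        self.sent_rnn = nn.GRU(hidden, hidden, batch_first=True)     # sentence-level recurrence
        # Additive attention scorers playing the role of the context vectors c_p and c_d.
        self.word_att = nn.Sequential(nn.Linear(hidden, ctx), nn.Tanh(), nn.Linear(ctx, 1))
        self.sent_att = nn.Sequential(nn.Linear(hidden, ctx), nn.Tanh(), nn.Linear(ctx, 1))
        # Projections f_D and f_v (single hidden layer of 512 units) into the joint d-dim space.
        self.f_D = nn.Sequential(nn.Linear(hidden, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.f_v = nn.Sequential(nn.Linear(visual_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        # The paper also uses a batch normalization layer before the l2 normalization; omitted here.

    @staticmethod
    def attend(states, scorer):
        alpha = torch.softmax(scorer(states), dim=1)  # attention weights over the sequence
        return (alpha * states).sum(dim=1)

    def forward(self, word_vecs, clip_feat):
        # word_vecs: (S, W, word_dim) GloVe vectors of S sentences; clip_feat: (1, visual_dim) = phi(v).
        h0 = clip_feat.expand(word_vecs.size(0), -1).unsqueeze(0).contiguous()  # h0 = phi(v)
        word_states, _ = self.word_rnn(word_vecs, h0)
        sent_vecs = self.attend(word_states, self.word_att).unsqueeze(0)        # (1, S, hidden)
        sent_states, _ = self.sent_rnn(sent_vecs, clip_feat.unsqueeze(0))       # visual guidance again
        doc_vec = self.attend(sent_states, self.sent_att)                       # (1, hidden)
        e_D = F.normalize(self.f_D(doc_vec), dim=-1)    # unit-norm document embedding
        e_v = F.normalize(self.f_v(clip_feat), dim=-1)  # unit-norm clip embedding
        return e_D, e_v
```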
do to decode the captions [47], [48]. It is noteworthy that, although D remains the same, eD is not unique throughout the input video since φ(v) may change.

Both the document and video encoding modules include an ℓ2 normalization layer to make eD and ev unit-norm vectors. We found in our experimentation that using a batch normalization layer [49] preceding the ℓ2 normalization layer increased the performance of our model.

3.1.3 Training

We follow a pairwise training strategy to build the cross-modal embedding space. For each clip v in the training set, we create a positive and a negative document, D+ and D−, to compose the training pairs ⟨D+, v⟩ and ⟨D−, v⟩. A positive document D+ comprises sentences that describe the clip v and, additionally, sentences describing a randomly selected clip v′. The strategy of adding sentences that do not describe the clip assists the document-level attention module in attending to the proper sentences at training time. To create the negative document, D−, we randomly select two other clips, v′′ and v′′′, and collect their respective sentences. At each training step, we shuffle all the sentences in the document for generalization purposes.

In order to create more aligned embeddings, we optimize θenc = {θRp, Wp, cp, θRd, Wd, cd, θD, θv} by minimizing the cosine embedding loss:

Lenc(eD, ev; y) = 1 − cos(eD, ev) if y = 1, and max(0, cos(eD, ev) − η) otherwise,   (1)

where y is equal to 1 if eD and ev were generated via a positive pair, and η is a margin parameter, which is set to 0 in our problem.

3.2 Skip-Aware Fast-Forward Agent (SAFFA)

After building the cross-modal embedding space, we train an agent to observe the encoded vectors eD and ev, its position in the input video, and its current average speed-up rate, and then sample an action from the action space to adjust its speed-up rate accordingly.

We formulate the problem of selecting frames as a Markov Decision Process (MDP). In our formulation, we train an agent to maximize the expected sum of discounted rewards: Rt = E[Σ_{n=0..T−t} γ^n r_{t+n}], where t is the current timestep, r_{t+n} is the reward n timesteps into the future, and T is the total number of timesteps. At each timestep, one frame is selected; therefore, t also indicates the current number of selected frames, and T the total number of selected frames. γ ∈ (0, 1] is a discount factor.

In our problem, the agent observes a long input video and a text document, and takes actions to create an accelerated version of the video. Since we aim to keep the overall coherence of the video instead of trimming it, we restrict the agent to navigate the video space observing short segments and skipping them accordingly, limited to a maximum skip length. The agent has velocity ν and acceleration ω, and based on the current ν, the next frame is selected. Therefore, the agent goes through the whole video, but skips frames according to a dynamically changing velocity. At each timestep, the agent can increase, decrease, or keep its current acceleration, which will, in turn, affect the velocity. Since we apply model-free reinforcement learning, the transition function does not need to be pre-defined nor learned, as the agent focuses directly on learning the best policy. In the following sections, we define all elements used in our MDP formulation.

3.2.1 Action Space and States Composition

To use text documents as a guide while accelerating a video, our agent adaptively adjusts the skip rate ν (velocity) to ensure a lower speed for the video segments semantically similar to the input text and a higher speed otherwise. The agent's action space A has three elements: i) decelerate; ii) do nothing; and iii) accelerate. These actions update the skip rate as follows: decelerate and accelerate update the velocity and acceleration states of the agent as ν = ν − ω and ω = ω − 1 for decelerate, and ν = ν + ω and ω = ω + 1 for accelerate, while do nothing keeps the current ν and ω. Acceleration and velocity saturate at values empirically set as ωmax = 5 and νmax = 25, respectively, and are always greater than or equal to 1. Note that ω does not correspond to a physical acceleration, which allows the agent to quickly adjust the velocity to collect more rewards when the semantic level changes or to focus on the target speed-up rate.

To allow our agent to navigate through the video space while attending to the requirements effectively, the state needs to encode information about the current semantics and the agent's location in the video. Thus, the agent is able to reason whether the skip rate should be higher or lower, aiming to create an output with a length as close as possible to the desired one while maximizing the exhibition of scenes related to the document. A straightforward state composition to achieve these requirements would be to concatenate the document's features, the current video segment, and the agent's average skip rate. However, by observing such a state, the agent would be unaware of its distance to the end of the video, while a successful decision to increase or decrease the skip rate depends on it. For example, at the beginning of the video, the agent has more freedom to attend to both objectives. However, as the agent approaches the end of the video, attending to the target speed-up rate becomes more challenging since changes in the current skip rate have a negligible effect on the average skip rate. Therefore, inspired by the successful usage of positional encoding in recent works [4], [50], [51], we propose the Normalized Reversed Positional Encoding (NRPE) to encode the agent's current location in the video. Let q be the NRPE embedding size, F be the number of frames in the input video, and f the position of the agent in the video, i.e., f ∈ {1, 2, · · · , F}. Then, the dimensions 2k and 2k + 1 (with k ∈ {1, 2, · · · , q/2}) of our NRPE embedding, ep, are:

NRPE(f, 2k) = sin((F − f) / F^(2k/q)),  NRPE(f, 2k+1) = cos((F − f) / F^(2k/q)).   (2)

The rationale of NRPE is that an agent in videos of different lengths but at the same relative position and under the same relevance profile should behave the same.
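To make the skip-rate mechanics and Eq. (2) concrete, the snippet below sketches the velocity/acceleration updates and the NRPE computation. It is our own illustration written from the description above, not the authors' code; the function names are ours, and the NRPE size q is not stated in the transcribed text, so the default of 128 is an arbitrary assumption.

```python
import numpy as np

OMEGA_MAX, NU_MAX = 5, 25  # saturation values reported in the paper


def apply_action(nu, omega, action):
    """Update velocity (skip rate) nu and acceleration omega for one of the three actions."""
    if action == "accelerate":
        nu, omega = nu + omega, omega + 1
    elif action == "decelerate":
        nu, omega = nu - omega, omega - 1
    # "do nothing" keeps nu and omega unchanged; both saturate and never drop below 1.
    return int(np.clip(nu, 1, NU_MAX)), int(np.clip(omega, 1, OMEGA_MAX))


def nrpe(f, F, q=128):
    """Normalized Reversed Positional Encoding of position f in a video with F frames (Eq. 2)."""
    k = np.arange(q // 2)
    angle = (F - f) / np.power(F, 2.0 * k / q)
    e_p = np.empty(q)
    e_p[0::2] = np.sin(angle)   # even dimensions
    e_p[1::2] = np.cos(angle)   # odd dimensions
    return e_p
```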
To encode the agent's average skip rate, we use the one-hot vector given by es = Im(⌊Ŝt⌋ − S∗ + νmax), where Ŝt is the average skip rate at timestep t, S∗ ≤ νmax ∈ N+ is the target speed-up rate, and Im(ι) denotes the ιth row of an identity matrix of size m, set as m = 2νmax in our experiments. We use the average relative skip rate, ⌊Ŝt⌋ − S∗, instead of the raw average skip, Ŝt, to allow the agent to decide to increase or decrease its speed according to the deviation from the target speed-up rate. Hence, a single agent can accelerate an input video at different speed-up rates with no need to be re-trained. The final state composition at a given timestep t is defined as: st = [eD; ev; ep; es]⊤ ∈ S.

3.2.2 Reward Function

The goal of our agent is to learn a policy π(a|st, θπ) that represents the probability of taking a certain action a ∈ A given the state st and the parameters θπ. The reward should encourage the agent to increase, decrease, or keep its skip rate w.r.t. the semantic similarity between the textual and visual data in the upcoming video segment while considering the overall speed-up rate objective in the long term. Therefore, we design an immediate reward proportional to the alignment of the textual and visual features in non-terminal states and proportional to the overall speed-up rate deviation at the terminal state (final frame of a video). Thus, at training time, after taking the action at ∼ π(a|st, θπ) in the tth step, the agent receives the following reward signal:

rt = eD · ev if t < T, and λ exp(−0.5 ((ŜT − S∗)/σ)²) otherwise,   (3)

where λ controls the relative importance of the overall speed-up rate w.r.t. the frames' relevance in the output video. Note that the terminal reward is similar to a Gaussian function centered at S∗. Thus, lower σ ∈ R values force the agent to achieve the desired skip more accurately. Semantically, the agent receives higher rewards if eD and ev point in the same direction in the embedding space, which encourages the agent to reduce the speed and accumulate more rewards, since the neighboring temporal frames are more likely to yield higher reward values.

Recall that our final objective is to maximize the expected sum of discounted rewards. Therefore, although established as a terminal reward, the speed-up rate signal may potentially affect all the agent's decisions.

3.2.3 Policy Learning

Apart from aligning the textual and visual features produced by VDAN+, the overall objective of our methodology is also to maximize the expected cumulative reward Rt at each timestep t. We follow the REINFORCE algorithm [52] to learn the parameters θπ that maximize the expected utility: J(θπ) = Σ_{a∈A} π(a|st, θπ) Rt.

To improve learning performance, we use the advantage function approach [53] and maximize the expected advantage:

J′(θπ) = Σ_{a∈A} π(a|st, θπ)(Rt − v(st|θv)),   (4)

where v(st|θv) is a function parameterized by θv, which predicts our expected cumulative reward at state st. The gradient of J′, ∇θπ J′(θπ), is given by:

∇θπ J′(θπ) = Σ_{a∈A} π(a|st, θπ) (∇θπ log π(a|st, θπ)) (Rt − v(st|θv)).   (5)

Usually, Monte Carlo sampling is applied, due to the high dimension of the action sequence space, leading to the following approximation for the gradient:

∇J′(θπ) ≈ Σ_t ∇θπ log π(at|st, θπ)(Rt − v(st|θv)),   (6)

where at is the action taken at time t. Hence, we minimize the following loss function:

L′(θπ) = − Σ_t (log π(at|st, θπ)) (Rt − v(st|θv)).   (7)

Moreover, in order to have greater action diversity, we add the entropy of the policy output, H(π(at|st, θπ)), to the loss [28]. Therefore, our final policy loss is given by

Ldec(θπ) = L′(θπ) − Σ_t β · H(π(at|st, θπ)),   (8)

where β is a constant that balances the entropy importance. In our experiments, we set β to 0.01.

Additionally, we also need to learn the state-value function δ(st|θδ). We do that by minimizing the mean squared error: Lδ(θδ) = Σ_t (δ(st|θδ) − Rt)². Both losses Lδ and Ldec can now be minimized using stochastic gradient descent. At test time, we use arg max_a π(a|st, θπ) as the chosen action for the agent at a given timestep t.
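As an illustration of how the pieces of Section 3.2 fit together, the snippet below sketches the reward of Eq. (3) and the REINFORCE-with-baseline losses of Eqs. (7)–(8) together with the value regression Lδ. It is a schematic reading of the equations, not the released implementation; function names, argument shapes, and the episode-level batching are assumptions.

```python
import math
import torch
import torch.nn.functional as F


def step_reward(e_D, e_v, terminal, S_hat_T=None, S_star=None, lam=1.0, sigma=0.5):
    """Reward of Eq. (3): dot-product alignment at non-terminal steps,
    Gaussian speed-up-deviation reward at the terminal step."""
    if not terminal:
        return float(torch.dot(e_D, e_v))
    return lam * math.exp(-0.5 * ((S_hat_T - S_star) / sigma) ** 2)


def saffa_losses(log_probs, entropies, returns, values, beta=0.01):
    """Policy loss of Eqs. (7)-(8) with entropy bonus, plus the value MSE loss L_delta.
    All arguments are 1-D tensors collected over one episode."""
    advantage = returns - values.detach()               # R_t - v(s_t | theta_v)
    policy_loss = -(log_probs * advantage).sum()        # Eq. (7)
    policy_loss = policy_loss - beta * entropies.sum()  # entropy regularization, Eq. (8)
    value_loss = F.mse_loss(values, returns, reduction="sum")  # L_delta
    return policy_loss, value_loss
```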
4 EXPERIMENTS

4.1 Datasets, Baselines, and Evaluation Metrics

We conducted our experiments on the YouCook2 and COIN datasets [2], [16]. YouCook2 is a large-scale dataset composed of unconstrained YouTube cooking videos, where the footage includes a variety of cuisines and cooking styles. It consists of 2,000 videos distributed across 89 recipes with a total length of 176 hours. COIN is a large-scale dataset composed of 11,827 instructional videos distributed across 180 tasks organized hierarchically into 12 domains, with a total length of 476 hours. Videos in both datasets are annotated with temporal boundaries and natural language descriptions. Because YouCook2 lacks a test set with available textual instructions, we used its validation set as our test set. To optimize the hyperparameters and evaluate our model, we used a strategy commonly applied in the reinforcement learning literature (e.g., DQN [26], A3C [54], Rainbow [55], and DDQN [56]). First, we randomly selected a subset of the recipes (in our case, 10%) from the entire dataset and used their videos to tune the hyperparameters. Then, using all videos from the training and validation sets, including the ones used in the hyperparameter tuning, we train and evaluate our model.

To evaluate the performance of each method, we use the F1 Score (F1), which is the harmonic mean of Precision and Recall. The higher the F1, the better the method retrieves relevant frames with low false positive and false negative rates. Note, however, that due to the temporal-connection constraints imposed on fast-forwarding techniques, they naturally select some non-relevant frames, since the relevant segments are usually distant from one another. We also measure the methods' capability of accelerating the input videos w.r.t. the target speed-up rate. For that, we report the Output Speed-up (OS) rate, Ŝ. The closer Ŝ is to S∗, the better.

There is a clear trade-off between F1 and OS. Selecting more frames may increase the F1 score while failing to attend to the speed-up. Therefore, we propose the Overall Performance (OP) metric, which summarizes both F1 and OS in a single value. We compute the OP as the harmonic mean between the F1 and the OS accuracy, which is given by a Gaussian function centered at S∗ with a standard deviation σOS. In the literature, acceptable speed-up rate errors are on average 5.31%, with a standard deviation of 8.38% w.r.t. the target [5], [40]. We use this standard deviation in our evaluation, i.e., σOS = 0.0838 × S∗.

TABLE 1
Comparison with baselines. Precision (P), Recall (R), F1 Score, Output Speed-up (OS), and Overall Performance (OP) in the YouCook2 and COIN datasets, with estimated running times on the right. The target OS values are S∗ = 12 for YouCook2 and S∗ = 16 for COIN.

Method     | No Dense Supervision | Controls Speed-up | Uses Language | YouCook2: P / R / F1 / OS / OP        | COIN: P / R / F1 / OS / OP            | Running Time (ms)
FFNet [10] | ✗                    | ✗                 | ✗             | 58.83 / 11.70 / 18.86 / 11.90 / 31.71 | 45.06 / 14.08 / 17.66 / 16.45 / 29.75 | 8.65
SAS [12]   | ✓                    | ✓                 | ✗             | 49.22 / 8.61 / 14.44 / 11.64 / 25.02  | 40.17 / 9.07 / 13.20 / 16.10 / 23.32  | 344.15
SASv2 [15] | ✓                    | ✓                 | ✗             | 49.69 / 9.87 / 16.20 / 10.32 / 19.60  | 41.29 / 9.79 / 13.90 / 14.09 / 20.07  | 344.43
BoT [13]   | ✓                    | ✓                 | ✓             | 48.66 / 8.37 / 14.04 / 12.13 / 24.40  | 39.80 / 9.21 / 13.01 / 16.07 / 23.02  | 1048.01
Ours       | ✓                    | ✓                 | ✓             | 53.20 / 12.81 / 17.86 / 11.68 / 30.07 | 42.58 / 15.94 / 17.18 / 14.99 / 27.98 | 75.58

Precision, Recall, F1, and OP: higher is better (%). OS: better when closer to S∗. Running time: lower is better.

Fig. 2. Average output speed-up rate error in the YouCook2 dataset. For target speed-up rates from 2 to 20×, SAFFA generates videos with a small relative error, indicating its accuracy in the speed-up objective.

We compared our method with the following fast-forwarding techniques: Sparse Adaptive Sampling (SAS) [12] and its extension SASv2 [15], which are state of the art in terms of semantics retained in the final video; the Bag-of-Topics (BoT) technique [13], which, like ours, also takes texts and videos as input; and our previous approach [14] with the semantic encoder VDAN and the Semantic Fast-Forward Reinforcement Learning (SFF-RL) agent. We also report results for the Fast-Forward Network (FFNet), an RL agent proposed by Lan et al. [10]. It is worth noting that FFNet is used as a gold standard for semantics since it disregards the output speed-up as a target. During experiments on the COIN dataset, the FFNet agent did not converge, acting as a uniform selection method due to the high variability of labels from different domains. For the sake of a fair comparison, we performed the experiments on COIN in a domain-wise manner.

4.2 Implementation Details

We adopted a stage-wise training strategy, where VDAN+ and SAFFA are trained in two distinct stages. With this strategy, we avoid sending different reward signals to the agent during training and make the convergence of the model easier [57]. Therefore, we first train VDAN+ using pairs of documents and clips from the VaTeX dataset [58]. All the available English descriptions in the dataset are used to compose D+ and D−. We chop the sentences to be no longer than 20 words and use a set of GloVe embeddings pre-trained on the Wikipedia 2014 and Gigaword 5 sets [59] to represent each word. Note that, although a set of annotated descriptions for a given clip may not fairly represent the structure of an instructional document, the first training stage is mainly dedicated to creating a highly discriminative embedding space for aligning documents and video clips.

We tested the sizes {128, 512, 1,024} for the dimension of the embedding space d, and we chose d = 128 as it showed similar performance with fewer parameters. We define 512 to be the size of the hidden state vectors hij and hi, and 1,024 to be the size of cp and cd. Note that, in our semantic encoder, the hi vectors must lie in the same dimension as the visual features since h0 = φ(v). The functions fv and fD are implemented as two independent fully connected neural networks composed of a single hidden layer with 512 neurons.

We train VDAN+ for 100 epochs with a batch size of 64 and keep the model with the best performance on the validation set. During training, we rescale the videos to 128 × 171, then randomly crop them spatially to 112 × 112 [46]. We also apply a clip-wise horizontal random flip with 0.5 probability and temporal jittering, obtaining clips of 32 continuous frames.

The policy network π(at|st, θπ) and the state-value function δ(st|θδ) were implemented as two Multilayer Perceptrons (MLPs) with two hidden layers (256 and 128 neurons), each using the ReLU activation function. We added a final layer with |A| = 3 neurons and softmax activation to π, and one with a single neuron and linear activation for the δ network.
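For illustration, the policy and state-value heads described above could look like the following minimal PyTorch sketch. It reproduces the reported layer sizes but is not the authors' released code; in particular, the state dimension depends on the VDAN+ embedding size, the NRPE size q, and m = 2νmax, so the value used in the usage example is an assumption.

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """pi(a | s_t, theta_pi): hidden layers of 256 and 128 units with ReLU, softmax over |A| = 3 actions."""

    def __init__(self, state_dim, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)


class ValueNet(nn.Module):
    """delta(s_t | theta_delta): same trunk with a single linear output neuron."""

    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state):
        return self.net(state)


# Hypothetical state size: s_t = [e_D; e_v; e_p; e_s] with d = 128, q = 128 (assumed), m = 2 * 25 = 50.
state_dim = 128 + 128 + 128 + 50
policy, value = PolicyNet(state_dim), ValueNet(state_dim)
action_probs = policy(torch.randn(1, state_dim))
```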
We ran a grid search on the YouCook2 dataset to find the best policy learning rate lr ∈ {1e−5, 5e−5, 1e−4}, σ ∈ {0.5, 1, 2}, and γ ∈ {0.8, 0.9, 0.99}, with λ = F∗, where F∗ is the desired number of frames. We selected the configuration with a policy learning rate of 5e−5, σ of 0.5, and γ of 0.99, which provided the highest F1 Score such that round(Ŝ) = S∗. This configuration was fixed for all experiments. In our experiments, γ and σ have shown to be the most influential hyperparameters. Lower γ values reduce the terminal reward's strength in the first input video frames, making it difficult to control the overall speed-up, and higher σ values permit the agent to select more frames, reducing the overall speed-up without additional punishment. Random initialization did not influence the results.

We trained SAFFA and its variants for 100 epochs using the Adam optimizer [60]. In this stage, all VDAN+ parameters remain frozen. For faster convergence, we use a learning rate of 1e−3 for the state-value approximator. Our approach is fully implemented in the PyTorch library, and the experiments were conducted on an Intel Core i7-3770K CPU @ 3.50GHz machine with 32GB of RAM and an NVIDIA Titan RTX GPU.

4.3 Results

4.3.1 Quantitative Results

Table 1 shows the average Precision, Recall, F1 Score (F1), and Output Speed-up (OS), as well as the Overall Performance (OP), achieved by each method in both datasets. The methods are grouped according to whether they use labels for training (if applicable), optimize the video's output length, and use natural language as input. The gold standard method, FFNet, is listed separately as it needs frame-level (dense) labels for supervision. Because FFNet is the only method that does not allow a target speed-up rate as input, we used its average output speed-up rates, Ŝ = 11.90 and Ŝ = 16.45, as targets for all methods, i.e., S∗ = 12 for the YouCook2 dataset and S∗ = 16 (on average) for the COIN dataset, since S∗ ∈ N+. We removed the COIN domains "Nursing and Care" and "Leisure and Performance" from the evaluation since S∗ = 1 and they would not require any acceleration for the other approaches. We also present the estimated running times in milliseconds (ms), where we can see that our approach is superior to the competitors. We refer the reader to the supplementary materials for a detailed time analysis. The best values in Table 1 are in bold.

Fig. 3. Qualitative results. The colored bars (right) represent the frames selected by each method for each instructional video, and the contiguous black blocks represent the ground-truth segments. Note that, in general, our agent performs a denser frame sampling in the regions representing an instruction step and a sparser one in the other regions (see the images outlined in green and the instructions). Exceptions are the regions where our agent decides to increase its skip rate to attend to the target speed-up or matches the video snippet with the instructions (images outlined in red). The three examples are: a) Vietnamese Veggie Spring Rolls (YouCook2), with instructions "mix tamarind ginger spice rub and garlic; dip the wrappers in water; place the kale cucumber bell peppers carrots and radishes on the wrapper; place cilantro and scallions on the filling; roll the spring roll up" and output speed-ups of 14.85× (FFNet), 11.92× (SAS), 11.31× (SASv2), 12.00× (BoT), and 12.93× (Ours); b) Spicy Tuna Rolls (YouCook2), with instructions "chop some tuna meat into small pieces; mix the tuna pieces and some fish eggs in a bowl; add hot sauce and mayo and then mix them together; spread some cooked rice on seaweed sushi wrapper; add the tuna mix on top and season with some sesame seeds; roll it up and press firmly; cut the roll into small rolls" and output speed-ups of 5.83× (FFNet), 11.73× (SAS), 10.43× (SASv2), 12.02× (BoT), and 12.89× (Ours); c) Graft (COIN), with instructions "cut branches; connect branches; wind the junction to fasten the connection" and output speed-ups of 2.34× (FFNet), 4.00× (SAS), 4.05× (SASv2), 4.00× (BoT), and 3.98× (Ours).

The results indicate that our approach outperforms all competitors by a significant margin in Precision, Recall, and F1, while not compromising the output speed-up rate significantly in either dataset, which is reflected in the OP metric. This means that SAFFA could use the instruction steps described in natural language and match them to what is currently being observed in the scene. On the other hand, the semantics encoded by the SAS and SASv2 methods are based on the YOLO detector. Consequently, the encoded information may not contain details such as interactions with the objects, but only which objects are present in the scene. For that reason, the distribution of the selected frames becomes more similar to a uniform selection, since such objects tend to be present in most frames of the video. Besides, these encoders are limited to a small set of objects, while natural language sentences can potentially describe a much larger number of objects and interactions.

The BoT method did not perform well, despite using the instructions as input. We argue that the BoT method depends on the accuracy of multiple external components such as the dense captioning and the video saliency algorithms; therefore, it is prone to errors. For that same reason, it yields the worst processing time among the methods. We noticed that the saliency is the major contributor to BoT's semantic encoder. The errors in saliency estimation lead the method to present higher frame scores in segments that do not represent an instruction step (see images 2 and 1 in Figure 3-a and c, respectively) or to disregard relevant segments. In many cases, higher saliency values are assigned to the person carrying out the task and lower values to the task itself.
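The OS and OP numbers discussed above follow the definitions of Section 4.1. As an illustration of how OP can be computed (our own rendering of the stated formula, with hypothetical inputs):

```python
import math


def overall_performance(f1, s_hat, s_star, sigma_ratio=0.0838):
    """Overall Performance: harmonic mean of F1 and the OS accuracy,
    where the OS accuracy is a Gaussian centered at the target speed-up S*."""
    sigma_os = sigma_ratio * s_star
    os_accuracy = math.exp(-0.5 * ((s_hat - s_star) / sigma_os) ** 2)
    if f1 + os_accuracy == 0:
        return 0.0
    return 2 * f1 * os_accuracy / (f1 + os_accuracy)


# Example with the values reported for our method on YouCook2 (F1 = 17.86%, OS = 11.68x, S* = 12):
# prints approximately 0.3007, i.e., the 30.07% OP listed in Table 1.
print(overall_performance(0.1786, s_hat=11.68, s_star=12))
```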
Besides outperforming the competitors, our method is also on par with FFNet, even presenting a better Recall on YouCook2. Note that FFNet has no speed-up control optimization at the video level. Therefore, its agent can conveniently reduce and increase its frame skip according to the number of relevant frames present in the video; i.e., output videos with more relevant frames present lower output speed-up rates, while those with fewer relevant frames present higher rates. Moreover, in order to avoid abrupt speed-up rate changes along the video and keep transition smoothness, our agent is constrained via ν and ω to gradually change its navigation speed, while FFNet does not have this constraint.

To assess the significance of the results presented in Table 1, we ran a Student's t-test considering each metric and each pair of methods. We verified that, with 99% confidence (p-value < 0.01), our method is statistically superior to SAS, BoT, and SASv2 in the Precision, Recall, and F1 metrics in both datasets. The test also shows that the average values for F1 do not differ significantly when comparing our approach to FFNet. Our approach is statistically tied with the FFNet and SAS methods concerning the OS metric on YouCook2. We refer the reader to the supplementary material for tables presenting all the p-values.

Figure 2 depicts our average OS error on YouCook2 for different target rates (from 2× to 20×), i.e., the y-axis shows Ŝ − S∗. Note that the errors in the output rates are no higher than 1.75×, indicating that our agent can effectively control the output video's length without re-training. We attribute these results to our joint reward signal, which leads the agent to drop even relevant frames in favor of balancing both objectives. The task of addressing the target speed-up becomes even more challenging when higher target values are imposed. This can also be observed in Figure 2, where the agent creates output videos with less accurate lengths as the target values increase.

4.3.2 Qualitative Results

Figure 3 shows qualitative results for three videos in the YouCook2 and COIN test sets. The vertical colored bars represent each method's selected frames, and the contiguous black blocks represent the ground-truth segments. Each method's Ŝ and S∗ are shown on the right. The numbered images outlined in green represent frames from segments where the agent takes the correct decision w.r.t. the ground-truth. Those outlined in red represent frames where the agent fails to increase its skip in segments with no recipe steps or to decrease it in the ground-truth segments.

In general, SAFFA reduces its skip rate in the relevant segments and increases it otherwise. This behavior indicates that the VDAN+ space is useful for the agent to determine when the video segments match the input document. Some segments not in the ground-truth are semantically similar to the instructions, leading our agent to reduce its skip erroneously. That is the case illustrated in image 3 in Figure 3-b. It is noteworthy that, to attend to the target input speed-up rate, the agent decides to increase its skip rate even though the segment contains relevant frames. This is illustrated in images 4 in Figure 3-a and c. However, note that image 4 in Figure 3-c shows a transition scene where the person takes the tape to wind the junction. Although this scene is in the ground-truth segment, speeding it up does not entail losing the message conveyed by the video. Moreover, discarding relevant frames is necessary once a single ground-truth segment (e.g., the third step in Figure 3-c) might represent a significant portion of the input video.

We assessed the usefulness of the VDAN+ representation for the agent regarding its capability of navigating throughout videos. For each sample in the VaTeX validation set, we created a positive document, as described in Section 3.1.3, and paired it with a single video, i.e., the same video was used in all pairs. These pairs were projected into the VDAN+ space and further projected into a 2-dimensional space via t-SNE [61]. In Figure 4-left, we illustrate the document embeddings (colored points) and some frames of the video. The colors represent the alignment between the documents and the video. On the right, we show the document that describes the video (shaded in blue) and a random document (shaded in gray) from a region that contains the documents most similar to the video. Note that the sentences in the documents with the highest attention weights are semantically related to the video. This indicates that our weakly-supervised approach is useful for the agent to navigate not only instructional videos, as demonstrated in previous sections, but also videos from other domains.

4.3.3 Ablation Studies

VDAN+ versus VDAN [14]. The embedding space created by the semantic encoder is crucial for the good performance of SAFFA, since the agent directly observes the video and document features to infer the current semantic load. To demonstrate the superiority of VDAN+ over VDAN [14], we approach the alignment of embeddings as a binary classification problem. First, we label the frames in the ground-truth segments of YouCook2's training set as positive and the remaining ones as negative. Then, we apply multiple thresholds to the dot product values between each video frame/clip and the corresponding input recipe and calculate the Area Under Curve (AUC) computed from the Receiver Operating Characteristic (ROC) curve. For the experiments with VDAN, we used the same hyperparameters and training data used in [14]. Figure 5 shows the ROC curve and the AUC scores for each semantic encoder. We see that VDAN+ is clearly superior to VDAN at correctly classifying a pair of frame/clip and document, yielding an AUC of 65 against 61 for VDAN. We claim that this superiority is related to the fact that VDAN+ creates embeddings at the segment level. Therefore, it avoids several false positives, such as correlating static scenes and objects under no action.

The effect of h0 = φ(v). Apart from initializing the first hidden state vector of our semantic encoder as h0 = φ(v), we evaluated two other variants: h0 = 0 and h0 = Ws, where Ws ∈ Rz is a learnable parameter vector. We computed the Mean Reciprocal Rank (MRR) using the VaTeX validation set, as described in Section 3.1.3, to compare the variants with our approach.

The MRR is given by (1/Q) Σ_{q=1..Q} 1/rank_q, where Q is the number of documents composing the query set and rank_q is the rank at which the relevant document was retrieved. As expected, the MRR values for h0 = φ(v) are far superior when compared to both the h0 = 0 and h0 = Ws variants, presenting 0.718, 0.065, and 0.061, respectively.
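A minimal sketch of the MRR computation used in this comparison is shown below; it is our own illustration (the paper does not publish this code), written generically for any query/candidate ranking by similarity score, and it ignores tie-breaking.

```python
import numpy as np


def mean_reciprocal_rank(similarities, relevant_idx):
    """similarities: (Q, N) score matrix between each of Q queries and N candidates;
    relevant_idx: for each query, the index of its relevant candidate."""
    reciprocal_ranks = []
    for scores, target in zip(similarities, relevant_idx):
        # Rank of the relevant item when candidates are sorted by decreasing score.
        rank = 1 + np.sum(scores > scores[target])
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```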
We observed that using either of the two variants degraded the model's capacity to produce semantically relevant eD vectors. Basically, after training for 100 epochs, the attention module could still not attend to the correct sentences because the attention weights became roughly evenly distributed. This outcome was expected, since φ(v) lies in a well-defined semantic space, making it easier to create eD given the trajectory {h0 = φ(v), h1, . . . , hN} in the 512-dimensional space.

Fig. 4. Qualitative results for VDAN+. The figure shows VDAN+ document embeddings from the VaTeX validation set projected into two dimensions via t-SNE [61] and colored according to their alignment with a given input video. Note that the documents with the highest alignment with the input video (blue points) are composed of semantically related sentences, and these sentences present the highest attention weights (Att.).

Fig. 5. VDAN+ versus VDAN [14]. The Receiver Operating Characteristic (ROC) curves on YouCook2 when using all dot products as thresholds (VDAN+: AUC = 0.65; VDAN: AUC = 0.61). VDAN+ is superior in deciding whether a frame/segment is relevant.

Method's composition. We evaluated the impact of using different configurations to compose variants of our approach. Table 2 presents the results for Precision, Recall, F1, OS, and OP. The column SE shows the Semantic Encoder used; SA shows the Skip-Aware component, which indicates the agent's current average skip rate; and NRPE is the Normalized Reversed Positional Encoding, which informs the agent of its relative location in the video. All agent variations were trained with the same hyperparameters as detailed in Section 4.2, except for the terminal reward weight, λ (Equation 3). In the variations using VDAN as the semantic encoder, we set λ = F, since we observed that local rewards were dominating the signal sent to the agent. It is noteworthy that the only difference between the SAFFA-based variants without SA and NRPE and the method proposed in our previous work [14] is the reward: the SFF-RL agent disregards the speed-up control, i.e., λ = 0. We also included a variant that uses BoT's frame sampler, a non-RL solution, which takes the frame-document alignments throughout the video as input.

As can be seen in Table 2, the SA element is crucial for the agent to attend to the target speed-up rate; note the low OP values. In the cases without SA and NRPE, due to the lack of distinction among states with the same semantic features, the agent ends up taking the same actions under varied conditions, e.g., at the beginning/end of the video or with the average speed-up rate close to/far from the ideal. Adding only the NRPE is insufficient, since the agent remains "blind" about its skip rate. It is worth noting that using different encoders produced opposite behaviors w.r.t. the OS metric. These results relate to the distribution of VDAN and VDAN+ embeddings, which present average dot products of 0.175 ± 0.812 and 0.833 ± 0.358, respectively, in the YouCook2 training set. The higher semantic rewards of VDAN+ lead the agent to reduce its skip rate and accumulate more rewards.

The variations with the SA vector and no NRPE aid the agent in attending to the target speed-up rate. However, in general, the agent becomes cautious in terms of its skip rate, preferring not to aggressively increase or decrease the video's speed-up rate in order to favor the relevance of the frames. For comparison, the agents' F1 Score using this state composition is 14.09 with VDAN and 15.18 with VDAN+, against 14.33 and 17.86 when using all the state's components. The agent using all components proposed in this paper presents the best trade-off between relevance and target speed-up rate in the output videos, as confirmed by the OP metric. By using the information of its position in the video, the agent can freely decide to attend to the semantics objective at the beginning of the video and increase its concern about the speed-up rate as it reaches the end of the video.

It is worth mentioning the Non-RL variant result. Its F1, OS, and OP values are better than those achieved by the VDAN-based variants, confirming the superiority of VDAN+ over VDAN. Although this variant is on par with the "VDAN+ with SA" variant, the BoT frame sampler takes over 60× more processing time.
TABLE 2
Ablation study. Precision, Recall, F1 Score, Output Speed-up (OS), and Overall Performance (OP) on YouCook2 w.r.t. the method's composition. SE stands for Semantic Encoder, SA for Skip-Aware, NRPE for Normalized Reversed Positional Encoding, and FS for Frame Sampler.

SE    | SA  | NRPE | FS          | Precision | Recall | F1 Score | OS    | OP
VDAN  | ✗   | ✗    | SAFFA       | 49.17     | 4.00   | 7.34     | 24.87 | 0.00
VDAN  | ✗   | ✓    | SAFFA       | 49.68     | 4.26   | 7.77     | 23.67 | 0.00
VDAN  | ✓   | ✗    | SAFFA       | 49.32     | 8.37   | 14.09    | 11.93 | 24.69
VDAN  | ✓   | ✓    | SAFFA       | 51.84     | 8.45   | 14.33    | 12.42 | 24.79
VDAN+ | ✗   | ✗    | SAFFA       | 59.11     | 79.08  | 64.96    | 1.94  | 0.00
VDAN+ | ✗   | ✓    | SAFFA       | 59.63     | 47.67  | 50.98    | 3.02  | 0.00
VDAN+ | ✓   | ✗    | SAFFA       | 52.39     | 9.03   | 15.18    | 11.86 | 26.33
VDAN+ | ✓   | ✓    | SAFFA       | 53.20     | 12.81  | 17.86    | 11.68 | 30.07
VDAN  | ✗   | ✗    | SFF-RL [14] | 53.78     | 42.07  | 41.55    | 6.09  | 0.00
VDAN+ | N/A | N/A  | Non-RL [13] | 52.00     | 9.33   | 15.56    | 11.48 | 26.42

Precision, Recall, F1, and OP: higher is better (%). OS: better when closer to S∗ = 12.

Robustness to Automatic Speech Recognition (ASR). A representative aspect of instructional videos is that the person recording the instructions usually narrates the steps being performed. Thus, using an ASR system, one could apply our methodology to create an accelerated video with no need for text input from the user. To demonstrate this application, we selected a video from YouCook2's validation set and automatically created a document using the English subtitles generated by the YouTube ASR system. We ran our method using that input and compared it against our main result. Figure 6 shows the result (S∗ = 5) with some representative frames on top, the frame selection in the middle, and the input documents for each approach at the bottom.

Fig. 6. Robustness to Automatic Speech Recognition (ASR). The image illustrates the difference between using the annotated recipe and the ASR text as input (both Ours and Ours+ASR achieve an output speed-up of 5.35×). Note that the distribution of the selected frames is similar for either input (colored bars represent the selected frames).

In general, our method performed a similar frame sampling using either input document. The reason is that, although the input documents differ in length, they have similar semantic content. In Figure 6, images 1 and 3 illustrate some cases where our method was accurate w.r.t. the ground-truth: it keeps a sparser frame selection during the recorder's introductory speech (see image 1) and a denser frame selection when the food is cooked (see image 3). We see a difference between the frame selections when using the instructions and the recorder's speech in the region illustrated in image 2. SAFFA using ASR decided to reduce the skip rate in a segment that does not represent a recipe step, but notice that it presents the ingredients used. This behavior is understandable since the ingredients are described in some sentences.

5 DISCUSSIONS

In this paper, we proposed a novel methodology based on a reinforcement learning formulation to accelerate instructional videos. The agent, namely the Skip-Aware Fast-Forwarding Agent (SAFFA), is trained to decide which frames to remove based on textual data and a target speed-up rate. We also presented an extension to the Visually-guided Document Attention Network (VDAN) [14], called VDAN+, which creates a highly discriminative embedding space to represent both textual and visual data. Our approach outperforms several competitors in F1 Score while allowing the user to define the video output length, which can be crucial for applications where time and storage resources are scarce.

Despite achieving the best results, our methodology suffers from certain limitations. The agent may take incorrect actions and emphasize segments that are not the recipe's steps. Figure 7 illustrates an example of such a case, where the agent reduces its skip rate to emphasize a person putting food in the oven (image 1) and some snails getting out of the container (image 2). The major reason for this outcome is the design of our reward function, which is sparse w.r.t. the speed-up rate deviation. Thus, at the beginning of the video, the sum of discounted rewards will mostly be composed of semantic signals, while as t gets higher, the speed-up deviation signal gets stronger. Suppose the agent mistakenly reduces its skip rate at the beginning of the video due to a potentially relevant segment. In that case, it must skip the final frames even if they are also relevant (image 3). In this example, VDAN+ contributed negatively, producing highly aligned vectors due to the high semantic relation between the segments represented in images 1 and 2 and instructions like "Remove the snail from the shell" and "Chop and cook the pancetta in a pan" from the recipe.

Fig. 7. Limitations. The colored bars, contiguous black blocks, and images outlined in red represent, respectively, the selected frames, the ground-truth, and the regions where the agent took a wrong decision.

Nonetheless, our approach takes a step towards understanding the training of RL agents in a multi-modal environment. To accelerate long untrimmed videos while emphasizing the relevant content, we trained an agent to collect rewards proportional to the video-text similarity provided by VDAN+, which removed the demand for temporal markings. This strategy allows the agent to run with freely available texts on the web and requires less annotation effort. We believe that the results may benefit further work in the direction of weakly supervised reinforcement learning when multi-modal data are available and annotations are scarce.

ACKNOWLEDGMENTS

We thank the agencies CAPES, CNPq, FAPEMIG, and Petrobras for funding different parts of this work. We also thank NVIDIA Corporation for the donation of a GPU.