In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2022. The final version is available at https://doi.org/10.1109/TPAMI.2022.3157198

arXiv:2203.15778v1 [cs.CV] 29 Mar 2022

Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method

Washington Ramos, Michel Silva, Edson Araujo, Victor Moura, Keller Oliveira, Leandro Soriano Marcolino, and Erickson R. Nascimento

• Washington Ramos, Edson Araujo, Victor Moura, Keller Oliveira, and Erickson R. Nascimento are with the Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, MG 31270-901, Brazil. E-mail: {washington.ramos, edsonroteia, victorhugomoura, kellermartins, erickson}@dcc.ufmg.br
• Michel Silva is with the Department of Informatics, Universidade Federal de Viçosa, Viçosa, MG 36570-900, Brazil. E-mail: michel.m.silva@ufv.br
• Leandro Soriano Marcolino is with Lancaster University, Lancaster, LA1 4YW, U.K. E-mail: l.marcolino@lancaster.ac.uk

Abstract—The growth of videos in our digital age and the users' limited time raise the demand for processing untrimmed videos to produce shorter versions conveying the same information. Despite the remarkable progress that summarization methods have made, most of them can only select a few frames or skims, creating visual gaps and breaking the video context. This paper presents a novel weakly-supervised methodology based on a reinforcement learning formulation to accelerate instructional videos using text. A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length without creating gaps in the final video. We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space to represent both textual and visual data. Our experiments show that our method achieves the best performance in Precision, Recall, and F1 Score against the baselines while effectively controlling the video's output length. Visit https://www.verlab.dcc.ufmg.br/semantic-hyperlapse/tpami2022/ for code and extra results.

Index Terms—Untrimmed videos, Fast-forward, Instructional video, Cross-modal data, Reinforcement Learning

1 INTRODUCTION

THE development of new digital technologies, especially over the last two decades, has allowed humans to overcome physical barriers in exchanging information. With the emergence of several tools such as portable devices (e.g., smartphones and wearables) and social multimedia services like Twitter, Facebook, Instagram, and YouTube, storing and sharing data of any kind has become effortless. Moreover, due to the exponential growth in processing power, storage, and bandwidth of digital devices, we are witnessing a substantial increase in the volume of data, in particular, textual and visual data such as images and videos. Every day a plethora of textual tutorials and instructional videos is published on the Internet teaching a variety of tasks, from how to cook burritos and tacos all the way to how to solve partial differential equations (PDEs).

Although textual tutorials and instructional videos share this growth in availability as well as much of their content, they differ in a key aspect for users: how long they take to consume. In general, the information encoded by producers is more concise in the textual domain than in the visual domain. For instance, a recipe for tacos or a tutorial explaining how to solve a PDE is described in a few sentences. Instructional videos, in turn, might spend several minutes showing information that is not relevant to the task, such as a person opening a refrigerator, picking up a pencil, or erasing the blackboard. Such segments could be fast-forwarded without losing the message to be conveyed by the input video. Thus, ideally, instructional videos should be concise, similar to a textual description, but still contain visually rich demonstrations of all main steps of the task.

In this paper, we address the problem of accelerating temporally untrimmed videos by using text documents. For example, a recipe for cooking tacos could be used as a guide to select relevant frames from videos of cooking tacos. Note that this problem is different from moment localization [1], procedure segmentation [2], and video summarization [3], [4], since non-relevant frames are still necessary for a user to understand the flow and temporal coherence of a task, i.e., some segments should be accelerated, but not eliminated.

Although identifying and fast-forwarding non-relevant segments may be a trivial task for humans, it poses significant challenges to fully automated systems. First, identifying relevant segments requires reasoning about the semantic concepts in the scene (e.g., person, tools, ingredients) and their interactions as they evolve over time. Second, in fast-forwarding techniques [5], [6], the overall length of the output video is usually defined by the user. Therefore, the method must adjust a varying speed-up rate along the footage to attend to the user's needs.

Over the past few years, algorithms for semantic fast-forwarding have emerged as effective approaches to retrieve meaningful segments without losing temporal continuity [7], [8], [9], [10], [11], [12], [13], [14], [15]. In our previous work [14], we introduced an algorithm to create fast-forward videos guided by text documents. We formulated the fast-forwarding task as a sequential decision-making process: a reinforcement learning (RL) agent locally decides to increase, decrease, or keep the video's speed-up rate based on the encoded text and video frame. An embedding space is generated by a Visually-guided Document Attention Network (VDAN), which creates representative feature vectors for the textual and visual modalities.
Although this RL agent and VDAN presented relevant results regarding the relevance of the output content, they hold two significant drawbacks. First, the user does not control the output video's duration, which is one of the key aspects of most fast-forward algorithms. This control is also essential for systems where storage or time resources are limited. Second, VDAN can only encode static frames; however, many sentences in the input document, such as the ones with actions describing an instruction step, may require modeling the scene's temporal dynamics.

To overcome these drawbacks, we introduce the Skip-Aware Fast-Forwarding Agent (SAFFA), which can adapt its actions to drop frames according to their relevance and subject to a target speed-up rate defined by the user. SAFFA is trained with a novel joint reward function that aims at semantics and speed-up, allowing us to create videos of different lengths without retraining. We also extend the representation capabilities of VDAN with the novel Extended Visually-guided Document Attention Network (VDAN+), which creates a joint embedding space for documents and video segments instead of single static frames. Finally, thorough experiments are conducted on challenging instructional video datasets [2], [16]. We show quantitative and qualitative results with in-depth discussions, showing that our method achieves the best performance in F1 Score against the baselines while effectively controlling the video's output length.

This work takes a step towards fast-forwarding videos to a required length based on the semantics of the video content and the speed-up rate defined by the user. Using textual data to guide an agent that seeks the best set of frames to be removed, our method emphasizes segments with high semantic content while preserving the temporal continuity and the target output length. Our approach is weakly supervised since it does not require any temporal annotation at training or inference time, i.e., only the textual instructions are needed. The problem of text-driven video acceleration naturally fits the sequential decision-making formulation via a Markov Decision Process (MDP) since no information about previous states is required for taking subsequent actions. Unlike planning solutions such as dynamic programming, which would require going through all possible states multiple times to calculate the value function, our RL-based solution processes each state at most once. Thus, a trained agent acts similarly to a knowledgeable human facing a new document and unseen video segments while controlling the video playback speed to achieve the required length.

Our contributions can be summarized as follows: i) a new fast-forward method based on a reinforcement learning formulation, which is able to accelerate videos according to clip similarity scores with textual data subject to a target speed-up rate; ii) a new state representation and joint reward function that guide our agent to attend to the multiple goals, i.e., semantics and speed-up; iii) an extended version of VDAN, the VDAN+, which can create a joint embedding space for documents and video clips instead of static frames; and iv) extensive experiments with in-depth discussions, including an ablation study, using the YouCook2 and COIN datasets, as opposed to the small subset of YouCook2 used in our previous work.

2 RELATED WORK

2.1 Video Summarization

Over the past several years, video summarization methods have figured prominently in the task of shortening a video [17], [18], [19], [20]. The shorter version is usually a summary of the input video composed of a storyboard of keyframes or video skims with the most distinguishable segments [21]. The strategies adopted to summarize the videos vary from clustering frame features [19] and training neural networks that infer the representativeness of a video segment [20], [22] to employing additional information such as user queries, external sensors [17], or textual annotations [3].

Lee et al. [17], in the context of first-person videos, analyzed properties such as social interaction, gaze, and object detection to create a storyboard summary of the video. Zhang et al. [20] proposed a method to create a storyboard or skims by modeling long-range dependencies among the video frames using a Bi-LSTM network. Yao et al. [22] performed the selection of the relevant segments by fusing information from spatial and temporal CNNs to identify highlight moments in sports videos. Plummer et al. [3] created an approach that selects the best subset of video segments by analyzing their visual features along with vision-language modeling.

Reinforcement Learning (RL) has also been applied to video summarization [23], [24], motivated by its success in many challenging tasks, such as mastering complex games like Go [25] and achieving human-level performance in Atari [26]. Additionally, it has found application in vision tasks such as visual tracking [27] and image cropping [28]. Zhou et al. [23] presented an end-to-end unsupervised framework also based on RL. Their method summarizes videos by applying a diversity-representativeness reward that guides the agent to create more diverse and representative summaries. Li et al. [24] introduced an additional video-level semantic reward to guide the unsupervised RL procedure proposed by Zhou et al. Their approach uses a semantic supervisor network to provide the reward signals based on the similarity between the original video and the generated summary. As a drawback, their video-level semantic reward may favor summaries that yield higher global semantic similarities, which is not the case for instructional videos, where the steps may represent a small portion of the video. Our approach, on the other hand, provides local rewards given by the similarity between video segments and text.

Despite impressive advances in shortening videos while retaining relevant segments, most summarization methods ignore temporal aspects or use a relaxed temporal restriction, resulting in visual gaps and breaking the video context.

2.2 Semantic Fast-forwarding

The lack of context that emerges from the gaps generated by summarization techniques is a nuisance to video consumers, particularly consumers of instructional videos. The existence of a gap might confuse the user about the whole process. For instance, the user would be unaware whether an
important step was missed within the gap if the original video is unknown. Semantic fast-forward methods add a time constraint to the frame sampling, which results in a shorter and contiguous version of the input video. They usually apply different speed-up rates, where lower rates are assigned to the relevant segments and higher rates to the others.

Okamoto and Yanai [7] proposed a method to accelerate guidance videos emphasizing video segments with pedestrian crosswalks or turning movements at street corners. Ramos et al. [8] presented a semantic fast-forwarding method for first-person videos dealing with visual stability constraints with emphasis on human faces. Silva et al. [11] extended the work of Ramos et al., including a CNN to assign frame scores based on Internet users' preferences and a video stabilizer suited to fast-forwarded videos [29]. Silva et al. [12] proposed modeling the frame sampling in semantic fast-forwarding as a Minimum Sparse Reconstruction problem. The authors also proposed an extension [15] that aims to remove visual gaps that could break the continuity of the output video and to smooth the speed-up transitions between video segments. A drawback of the aforementioned works is their pre-processing steps, such as detecting objects and computing optical flow, which are time-consuming and rely on the accuracy of third-party methods.

Lan et al. [10] introduced the Fast Forward Network (FFNet). The network can summarize videos on the fly using an RL agent that selects frames with the most memorable views according to human-labeled data. Similar to FFNet and Zhou et al., we also apply an agent trained by the RL paradigm; however, our approach is a step towards training agents to work in a cross-modal embedding space in a weakly-supervised manner. Unlike FFNet, our approach does not demand human-labeled data to indicate the relevant frames at training time.

Most recently, methods in semantic fast-forwarding have introduced the usage of language to guide the frame selection. The approach proposed by Ramos et al. [13] personalizes the fast-forwarding process by using social network posts as textual input. Their method uses a dense captioning algorithm on image frames to match positive tweets from Twitter in the embedding space. Although achieving a high F1 Score, the whole approach is too sensitive to errors from each of its components.

2.3 Cross-modal Embedding for Instructional Videos

Recently, algorithms for cross-modal embedding have emerged as promising and effective approaches to deal with tasks like video description [30] and text-based image/video retrieval [31], [32], [33]. These methods rely on creating a shared embedding space where features from multiple modalities can be compared.

There is a growing body of research on instructional video and image analysis. Marin et al. [34] applied a multi-modal neural model to learn a shared embedding space for images and recipes and tackled the task of retrieving recipes from image queries. Wang et al. [35] proposed an adversarial learning strategy to align both modalities. Weakly supervised and unsupervised methods have been proposed to tackle different tasks such as activity segmentation [36], localizing key steps [37], and parsing [38] of instructional videos. Leveraging cross-modal information, Alayrac et al. [37] proposed a method to discover the main steps of a task by linking visual and textual clustering tasks to connect the modalities via joint constraints. Zhukov et al. [39] proposed a new representation for instructional video steps by learning a set of classifiers per component (e.g., nouns and verbs). Sener et al. [38] use visual and language atoms to create a multimodal representation to parse a video into semantic steps.

In our previous approach [14], we built an end-to-end trainable embedding space for text and image, which is further used by an RL agent to accelerate an input video. However, unlike most other fast-forward methods [5], [6], [8], [11], [12], [15], [40], [41], [42], the agent cannot optimize the output speed-up rate of the video, which is essential in several applications. In this work, we extend our previous approach [14] by introducing the Skip-Aware Fast-Forward Agent (SAFFA) and an Extended Visually-guided Document Attention Network (VDAN+). While our new agent can jointly optimize the semantics and the speed-up objectives, the new cross-modal embedding space, VDAN+, provides the semantic distance between each snippet in the instructional video and the textual steps described in a document, i.e., the recipe.

3 METHODOLOGY

As stated, we formulate the fast-forwarding task as a sequential decision-making process, where an agent is trained to create fast-forwarded videos. Our methodology combines an RL paradigm and an encoder-decoder cross-modal framework to reduce the video's size by keeping the most relevant frames and avoiding temporal gaps. Specifically, after creating an embedding space for encoding documents and videos using a novel Extended Visually-guided Document Attention Network (VDAN+), we train an RL agent to decide which frames should be removed. The agent observes the encoded text and video snippets, its position in the video, and its current average skip rate, then outputs a distribution over the actions for increasing, decreasing, or maintaining the current speed-up rate. Figure 1 shows the main steps of our approach.

3.1 Extended Visually-guided Document Attention Network (VDAN+)

In this paper, we extend VDAN [14] with the novel Extended Visually-guided Document Attention Network (VDAN+). Although the embedding space created by VDAN has shown promising results providing semantically meaningful representations for image and text, it is limited by the lack of proper temporal modeling. For instance, some actions like "open" and "close" are inversely related; therefore, a single static frame would be ambiguous when associated with the text description.

To overcome this limitation, VDAN+ takes a document and a video clip as input and, guided by the video features, creates representative feature vectors for both modalities. By training VDAN+, we aim at creating an embedding space in which textual and visual features are semantically aligned. We argue that the aligned embedding vectors help our agent identify which video segments are related to the input document.
Fig. 1. Overview of our methodology. Given an input document (e.g., a recipe with steps such as "Cook the bacon in a sauce pan") and an input video, VDAN+ encodes each sentence with word embeddings, Bi-RNNs, and attention layers, and encodes the current video clip with a 3D CNN, producing aligned document and clip embeddings. SAFFA receives these embeddings together with the NRPE encoding of its position in the video and a one-hot encoding of its current average speed-up rate (given the target and maximum speed-up rates); its policy and state-value networks output a probability distribution over the actions (decelerate, do nothing, accelerate) and the expected return, respectively.
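The detailed description of the VDAN+ document and video encoders is not fully legible in this transcription, so the sketch below is only a rough illustration assembled from Figure 1 and the dimensions reported in Section 4.2 (GloVe word vectors, hierarchical recurrent encoders with attention, a 3D CNN feature φ(v) used as the initial hidden state, projections into a d = 128 space, and ℓ2 normalization). All class, method, and variable names are ours, the use of plain GRUs instead of Bi-RNNs is a simplification, and the exact attention and initialization details are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VDANPlusEncoderSketch(nn.Module):
    """Illustrative two-branch document/clip encoder; layout and names are assumptions."""

    def __init__(self, word_dim=300, hidden=512, ctx=1024, embed_dim=128, visual_dim=512):
        super().__init__()
        self.word_rnn = nn.GRU(word_dim, hidden, batch_first=True)   # word-level recurrence
        self.sent_rnn = nn.GRU(hidden, hidden, batch_first=True)     # sentence-level recurrence
        # Additive attention scorers playing the role of the context vectors c_p and c_d.
        self.word_att = nn.Sequential(nn.Linear(hidden, ctx), nn.Tanh(), nn.Linear(ctx, 1))
        self.sent_att = nn.Sequential(nn.Linear(hidden, ctx), nn.Tanh(), nn.Linear(ctx, 1))
        # Projections f_D and f_v (single hidden layer of 512 units) into the joint d-dim space.
        self.f_D = nn.Sequential(nn.Linear(hidden, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        self.f_v = nn.Sequential(nn.Linear(visual_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim))
        # The paper also uses a batch normalization layer before the l2 normalization; omitted here.

    @staticmethod
    def attend(states, scorer):
        alpha = torch.softmax(scorer(states), dim=1)  # attention weights over the sequence
        return (alpha * states).sum(dim=1)

    def forward(self, word_vecs, clip_feat):
        # word_vecs: (S, W, word_dim) GloVe vectors of S sentences; clip_feat: (1, visual_dim) = phi(v).
        h0 = clip_feat.expand(word_vecs.size(0), -1).unsqueeze(0).contiguous()  # h0 = phi(v)
        word_states, _ = self.word_rnn(word_vecs, h0)
        sent_vecs = self.attend(word_states, self.word_att).unsqueeze(0)        # (1, S, hidden)
        sent_states, _ = self.sent_rnn(sent_vecs, clip_feat.unsqueeze(0))       # visual guidance again
        doc_vec = self.attend(sent_states, self.sent_att)                       # (1, hidden)
        e_D = F.normalize(self.f_D(doc_vec), dim=-1)    # unit-norm document embedding
        e_v = F.normalize(self.f_v(clip_feat), dim=-1)  # unit-norm clip embedding
        return e_D, e_v
```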
do to decode the captions [47], [48]. It is noteworthy that, although D remains the same, eD is not unique throughout the input video since φ(v) may change.

Both the document and video encoding modules include an ℓ2 normalization layer to make eD and ev unit-norm vectors. We found in our experimentation that using a batch normalization layer [49] preceding the ℓ2 normalization layer increased the performance of our model.

3.1.3 Training

We follow a pairwise training strategy to build the cross-modal embedding space. For each clip v in the training set, we create a positive and a negative document, D+ and D−, to compose the training pairs ⟨D+, v⟩ and ⟨D−, v⟩. A positive document D+ comprises sentences that describe the clip v and, additionally, sentences describing a randomly selected clip v′. The strategy of adding sentences that do not describe the clip assists the document-level attention module in attending to the proper sentences at training time. To create the negative document, D−, we randomly select two other clips, v′′ and v′′′, and collect their respective sentences. At each training step, we shuffle all the sentences in the document for generalization purposes.

In order to create more aligned embeddings, we optimize θenc = {θRp, Wp, cp, θRd, Wd, cd, θD, θv} by minimizing the cosine embedding loss:

Lenc(eD, ev; y) = 1 − cos(eD, ev) if y = 1, and max(0, cos(eD, ev) − η) otherwise,   (1)

where y is equal to 1 if eD and ev were generated via a positive pair, and η is a margin parameter, which is set to 0 in our problem.

3.2 Skip-Aware Fast-Forward Agent (SAFFA)

After building the cross-modal embedding space, we train an agent to observe the encoded vectors eD and ev, its position in the input video, and its current average speed-up rate, and then sample an action from the action space to adjust its speed-up rate accordingly.

We formulate the problem of selecting frames as a Markov Decision Process (MDP). In our formulation, we train an agent to maximize the expected sum of discounted rewards: Rt = E[Σ_{n=0..T−t} γ^n r_{t+n}], where t is the current timestep, r_{t+n} is the reward n timesteps into the future, and T is the total number of timesteps. At each timestep, one frame is selected; therefore, t also indicates the current number of selected frames, and T the total number of selected frames. γ ∈ (0, 1] is a discount factor.

In our problem, the agent observes a long input video and a text document, and takes actions to create an accelerated version of the video. Since we aim to keep the overall coherence of the video instead of trimming it, we restrict the agent to navigate the video space observing short segments and skipping them accordingly, limited to a maximum skip length. The agent has velocity ν and acceleration ω, and based on the current ν, the next frame is selected. Therefore, the agent goes through the whole video, but skips frames according to a dynamically changing velocity. At each timestep, the agent can increase, decrease, or keep its current acceleration, which will, in turn, affect the velocity. Since we apply model-free reinforcement learning, the transition function does not need to be pre-defined nor learned, as the agent focuses directly on learning the best policy. In the following sections, we define all elements used in our MDP formulation.

3.2.1 Action Space and States Composition

To use text documents as a guide while accelerating a video, our agent adaptively adjusts the skip rate ν (velocity) to ensure a lower speed for the video segments semantically similar to the input text and a higher speed otherwise. The agent's action space A has three elements: i) decelerate; ii) do nothing; and iii) accelerate. These actions update the skip rate as follows: decelerate and accelerate update the velocity and acceleration states of the agent as ν = ν − ω and ω = ω − 1 for decelerate, and ν = ν + ω and ω = ω + 1 for accelerate, while do nothing keeps the current ν and ω. Acceleration and velocity saturate at values empirically set as ωmax = 5 and νmax = 25, respectively, and are always greater than or equal to 1. Note that ω does not correspond to a physical acceleration, which allows the agent to quickly adjust the velocity to collect more rewards when the semantic level changes or to focus on the target speed-up rate.

To allow our agent to navigate through the video space while attending to the requirements effectively, the state needs to encode information about the current semantics and the agent's location in the video. Thus, the agent is able to reason whether the skip rate should be higher or lower, aiming to create an output with a length as close as possible to the desired one while maximizing the exhibition of scenes related to the document. A straightforward state composition to achieve these requirements would be to concatenate the document's features, the current video segment, and the agent's average skip rate. However, by observing such a state, the agent would be unaware of its distance to the end of the video, while a successful decision to increase or decrease the skip rate depends on it. For example, at the beginning of the video, the agent has more freedom to attend to both objectives. However, as the agent approaches the end of the video, attending to the target speed-up rate becomes more challenging since changes in the current skip rate have a negligible effect on the average skip rate. Therefore, inspired by the successful usage of positional encoding in recent works [4], [50], [51], we propose the Normalized Reversed Positional Encoding (NRPE) to encode the agent's current location in the video. Let q be the NRPE embedding size, F be the number of frames in the input video, and f the position of the agent in the video, i.e., f ∈ {1, 2, · · · , F}. Then, the dimensions 2k and 2k + 1 (with k ∈ {1, 2, · · · , q/2}) of our NRPE embedding, ep, are:

NRPE(f, 2k) = sin((F − f) / F^(2k/q)),  NRPE(f, 2k+1) = cos((F − f) / F^(2k/q)).   (2)

The rationale of NRPE is that an agent in videos of different lengths but at the same relative position and under the same relevance profile should behave the same.
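To make the skip-rate mechanics and Eq. (2) concrete, the snippet below sketches the velocity/acceleration updates and the NRPE computation. It is our own illustration written from the description above, not the authors' code; the function names are ours, and the NRPE size q is not stated in the transcribed text, so the default of 128 is an arbitrary assumption.

```python
import numpy as np

OMEGA_MAX, NU_MAX = 5, 25  # saturation values reported in the paper


def apply_action(nu, omega, action):
    """Update velocity (skip rate) nu and acceleration omega for one of the three actions."""
    if action == "accelerate":
        nu, omega = nu + omega, omega + 1
    elif action == "decelerate":
        nu, omega = nu - omega, omega - 1
    # "do nothing" keeps nu and omega unchanged; both saturate and never drop below 1.
    return int(np.clip(nu, 1, NU_MAX)), int(np.clip(omega, 1, OMEGA_MAX))


def nrpe(f, F, q=128):
    """Normalized Reversed Positional Encoding of position f in a video with F frames (Eq. 2)."""
    k = np.arange(q // 2)
    angle = (F - f) / np.power(F, 2.0 * k / q)
    e_p = np.empty(q)
    e_p[0::2] = np.sin(angle)   # even dimensions
    e_p[1::2] = np.cos(angle)   # odd dimensions
    return e_p
```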
To encode the agent's average skip rate, we use the one-hot vector given by es = Im(⌊Ŝt⌋ − S∗ + νmax), where Ŝt is the average skip rate at timestep t, S∗ ≤ νmax ∈ N+ is the target speed-up rate, and Im(ι) denotes the ιth row of an identity matrix of size m, set as m = 2νmax in our experiments. We use the average relative skip rate, ⌊Ŝt⌋ − S∗, instead of the raw average skip, Ŝt, to allow the agent to decide to increase or decrease its speed according to the deviation from the target speed-up rate. Hence, a single agent can accelerate an input video at different speed-up rates with no need to be re-trained. The final state composition at a given timestep t is defined as: st = [eD; ev; ep; es]⊤ ∈ S.

3.2.2 Reward Function

The goal of our agent is to learn a policy π(a|st, θπ) that represents the probability of taking a certain action a ∈ A given the state st and the parameters θπ. The reward should encourage the agent to increase, decrease, or keep its skip rate w.r.t. the semantic similarity between the textual and visual data in the upcoming video segment while considering the overall speed-up rate objective in the long term. Therefore, we design an immediate reward proportional to the alignment of the textual and visual features in non-terminal states and proportional to the overall speed-up rate deviation at the terminal state (final frame of a video). Thus, at training time, after taking the action at ∼ π(a|st, θπ) in the tth step, the agent receives the following reward signal:

rt = eD · ev if t < T, and λ exp(−0.5 ((ŜT − S∗)/σ)²) otherwise,   (3)

where λ controls the relative importance of the overall speed-up rate w.r.t. the frames' relevance in the output video. Note that the terminal reward is similar to a Gaussian function centered at S∗. Thus, lower σ ∈ R values force the agent to achieve the desired skip more accurately. Semantically, the agent receives higher rewards if eD and ev point in the same direction in the embedding space, which encourages the agent to reduce the speed and accumulate more rewards, since the neighboring temporal frames are more likely to yield higher reward values.

Recall that our final objective is to maximize the expected sum of discounted rewards. Therefore, although established as a terminal reward, the speed-up rate signal may potentially affect all the agent's decisions.

3.2.3 Policy Learning

Apart from aligning the textual and visual features produced by VDAN+, the overall objective of our methodology is also to maximize the expected cumulative reward Rt at each timestep t. We follow the REINFORCE algorithm [52] to learn the parameters θπ that maximize the expected utility: J(θπ) = Σ_{a∈A} π(a|st, θπ) Rt.

To improve learning performance, we use the advantage function approach [53] and maximize the expected advantage:

J′(θπ) = Σ_{a∈A} π(a|st, θπ)(Rt − v(st|θv)),   (4)

where v(st|θv) is a function parameterized by θv, which predicts our expected cumulative reward at state st. The gradient of J′, ∇θπ J′(θπ), is given by:

∇θπ J′(θπ) = Σ_{a∈A} π(a|st, θπ) (∇θπ log π(a|st, θπ)) (Rt − v(st|θv)).   (5)

Usually, Monte Carlo sampling is applied, due to the high dimension of the action sequence space, leading to the following approximation for the gradient:

∇J′(θπ) ≈ Σ_t ∇θπ log π(at|st, θπ)(Rt − v(st|θv)),   (6)

where at is the action taken at time t. Hence, we minimize the following loss function:

L′(θπ) = − Σ_t (log π(at|st, θπ)) (Rt − v(st|θv)).   (7)

Moreover, in order to have greater action diversity, we add the entropy of the policy output, H(π(at|st, θπ)), to the loss [28]. Therefore, our final policy loss is given by

Ldec(θπ) = L′(θπ) − Σ_t β · H(π(at|st, θπ)),   (8)

where β is a constant that balances the entropy importance. In our experiments, we set β to 0.01.

Additionally, we also need to learn the state-value function δ(st|θδ). We do that by minimizing the mean squared error: Lδ(θδ) = Σ_t (δ(st|θδ) − Rt)². Both losses Lδ and Ldec can now be minimized using stochastic gradient descent. At test time, we use arg max_a π(a|st, θπ) as the chosen action for the agent at a given timestep t.
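As an illustration of how the pieces of Section 3.2 fit together, the snippet below sketches the reward of Eq. (3) and the REINFORCE-with-baseline losses of Eqs. (7)–(8) together with the value regression Lδ. It is a schematic reading of the equations, not the released implementation; function names, argument shapes, and the episode-level batching are assumptions.

```python
import math
import torch
import torch.nn.functional as F


def step_reward(e_D, e_v, terminal, S_hat_T=None, S_star=None, lam=1.0, sigma=0.5):
    """Reward of Eq. (3): dot-product alignment at non-terminal steps,
    Gaussian speed-up-deviation reward at the terminal step."""
    if not terminal:
        return float(torch.dot(e_D, e_v))
    return lam * math.exp(-0.5 * ((S_hat_T - S_star) / sigma) ** 2)


def saffa_losses(log_probs, entropies, returns, values, beta=0.01):
    """Policy loss of Eqs. (7)-(8) with entropy bonus, plus the value MSE loss L_delta.
    All arguments are 1-D tensors collected over one episode."""
    advantage = returns - values.detach()               # R_t - v(s_t | theta_v)
    policy_loss = -(log_probs * advantage).sum()        # Eq. (7)
    policy_loss = policy_loss - beta * entropies.sum()  # entropy regularization, Eq. (8)
    value_loss = F.mse_loss(values, returns, reduction="sum")  # L_delta
    return policy_loss, value_loss
```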
4 EXPERIMENTS

4.1 Datasets, Baselines, and Evaluation Metrics

We conducted our experiments on the YouCook2 and COIN datasets [2], [16]. YouCook2 is a large-scale dataset composed of unconstrained YouTube cooking videos, where the footage includes a variety of cuisines and cooking styles. It consists of 2,000 videos distributed across 89 recipes with a total length of 176 hours. COIN is a large-scale dataset composed of 11,827 instructional videos distributed across 180 tasks organized hierarchically into 12 domains, with a total length of 476 hours. Videos in both datasets are annotated with temporal boundaries and natural language descriptions. Because YouCook2 lacks a test set with available textual instructions, we used its validation set as our test set. To optimize the hyperparameters and evaluate our model, we used a strategy commonly applied in the reinforcement learning literature (e.g., DQN [26], A3C [54], Rainbow [55], and DDQN [56]). First, we randomly selected a subset of the recipes (in our case, 10%) from the entire dataset and used their videos to tune the hyperparameters. Then, using all videos from the training and validation sets, including the ones used in the hyperparameter tuning, we train and evaluate our model.

To evaluate the performance of each method, we use the F1 Score (F1), which is the harmonic mean of Precision and Recall. The higher the F1, the better the method retrieves relevant frames with low false positive and false negative rates. Note, however, that due to the temporal-connection constraints imposed on fast-forwarding techniques, they naturally select some non-relevant frames, since the relevant segments are usually distant from one another. We also measure the methods' capability of accelerating the input videos w.r.t. the target speed-up rate. For that, we report the Output Speed-up (OS) rate, Ŝ. The closer Ŝ is to S∗, the better.

There is a clear trade-off between F1 and OS. Selecting more frames may increase the F1 score while failing to attend to the speed-up. Therefore, we propose the Overall Performance (OP) metric, which summarizes both F1 and OS in a single value. We compute the OP as the harmonic mean between the F1 and the OS accuracy, which is given by a Gaussian function centered at S∗ with a standard deviation σOS. In the literature, acceptable speed-up rate errors are on average 5.31%, with a standard deviation of 8.38% w.r.t. the target [5], [40]. We use this standard deviation in our evaluation, i.e., σOS = 0.0838 × S∗.

TABLE 1
Comparison with baselines. Precision (P), Recall (R), F1 Score, Output Speed-up (OS), and Overall Performance (OP) in the YouCook2 and COIN datasets, with estimated running times on the right. The target OS values are S∗ = 12 for YouCook2 and S∗ = 16 for COIN.

Method     | No Dense Supervision | Controls Speed-up | Uses Language | YouCook2: P / R / F1 / OS / OP        | COIN: P / R / F1 / OS / OP            | Running Time (ms)
FFNet [10] | ✗                    | ✗                 | ✗             | 58.83 / 11.70 / 18.86 / 11.90 / 31.71 | 45.06 / 14.08 / 17.66 / 16.45 / 29.75 | 8.65
SAS [12]   | ✓                    | ✓                 | ✗             | 49.22 / 8.61 / 14.44 / 11.64 / 25.02  | 40.17 / 9.07 / 13.20 / 16.10 / 23.32  | 344.15
SASv2 [15] | ✓                    | ✓                 | ✗             | 49.69 / 9.87 / 16.20 / 10.32 / 19.60  | 41.29 / 9.79 / 13.90 / 14.09 / 20.07  | 344.43
BoT [13]   | ✓                    | ✓                 | ✓             | 48.66 / 8.37 / 14.04 / 12.13 / 24.40  | 39.80 / 9.21 / 13.01 / 16.07 / 23.02  | 1048.01
Ours       | ✓                    | ✓                 | ✓             | 53.20 / 12.81 / 17.86 / 11.68 / 30.07 | 42.58 / 15.94 / 17.18 / 14.99 / 27.98 | 75.58

Precision, Recall, F1, and OP: higher is better (%). OS: better when closer to S∗. Running time: lower is better.

Fig. 2. Average output speed-up rate error in the YouCook2 dataset. For target speed-up rates from 2 to 20×, SAFFA generates videos with a small relative error, indicating its accuracy in the speed-up objective.

We compared our method with the following fast-forwarding techniques: Sparse Adaptive Sampling (SAS) [12] and its extension SASv2 [15], which are state of the art in terms of semantics retained in the final video; the Bag-of-Topics (BoT) technique [13], which, like ours, also takes texts and videos as input; and our previous approach [14] with the semantic encoder VDAN and the Semantic Fast-Forward Reinforcement Learning (SFF-RL) agent. We also report results for the Fast-Forward Network (FFNet), an RL agent proposed by Lan et al. [10]. It is worth noting that FFNet is used as a gold standard for semantics since it disregards the output speed-up as a target. During experiments on the COIN dataset, the FFNet agent did not converge, acting as a uniform selection method due to the high variability of labels from different domains. For the sake of a fair comparison, we performed the experiments on COIN in a domain-wise manner.

4.2 Implementation Details

We adopted a stage-wise training strategy, where VDAN+ and SAFFA are trained in two distinct stages. With this strategy, we avoid sending different reward signals to the agent during training and make the convergence of the model easier [57]. Therefore, we first train VDAN+ using pairs of documents and clips from the VaTeX dataset [58]. All the available English descriptions in the dataset are used to compose D+ and D−. We chop the sentences to be no longer than 20 words and use a set of GloVe embeddings pre-trained on the Wikipedia 2014 and Gigaword 5 sets [59] to represent each word. Note that, although a set of annotated descriptions for a given clip may not fairly represent the structure of an instructional document, the first training stage is mainly dedicated to creating a highly discriminative embedding space for aligning documents and video clips.

We tested the sizes {128, 512, 1,024} for the dimension of the embedding space d, and we chose d = 128 as it showed similar performance with fewer parameters. We define 512 to be the size of the hidden state vectors hij and hi, and 1,024 to be the size of cp and cd. Note that, in our semantic encoder, the hi vectors must lie in the same dimension as the visual features since h0 = φ(v). The functions fv and fD are implemented as two independent fully connected neural networks composed of a single hidden layer with 512 neurons.

We train VDAN+ for 100 epochs with a batch size of 64 and keep the model with the best performance on the validation set. During training, we rescale the videos to 128 × 171, then randomly crop them spatially to 112 × 112 [46]. We also apply a clip-wise horizontal random flip with 0.5 probability and temporal jittering, obtaining clips of 32 continuous frames.

The policy network π(at|st, θπ) and the state-value function δ(st|θδ) were implemented as two Multilayer Perceptrons (MLPs) with two hidden layers (256 and 128 neurons), each using the ReLU activation function. We added a final layer with |A| = 3 neurons and softmax activation to π, and one with a single neuron and linear activation for the δ network.
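For illustration, the policy and state-value heads described above could look like the following minimal PyTorch sketch. It reproduces the reported layer sizes but is not the authors' released code; in particular, the state dimension depends on the VDAN+ embedding size, the NRPE size q, and m = 2νmax, so the value used in the usage example is an assumption.

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """pi(a | s_t, theta_pi): hidden layers of 256 and 128 units with ReLU, softmax over |A| = 3 actions."""

    def __init__(self, state_dim, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)


class ValueNet(nn.Module):
    """delta(s_t | theta_delta): same trunk with a single linear output neuron."""

    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state):
        return self.net(state)


# Hypothetical state size: s_t = [e_D; e_v; e_p; e_s] with d = 128, q = 128 (assumed), m = 2 * 25 = 50.
state_dim = 128 + 128 + 128 + 50
policy, value = PolicyNet(state_dim), ValueNet(state_dim)
action_probs = policy(torch.randn(1, state_dim))
```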
We ran a grid search on the YouCook2 dataset to find the best policy learning rate lr ∈ {1e−5, 5e−5, 1e−4}, σ ∈ {0.5, 1, 2}, and γ ∈ {0.8, 0.9, 0.99}, with λ = F∗, where F∗ is the desired number of frames. We selected the configuration with a policy learning rate of 5e−5, σ of 0.5, and γ of 0.99, which provided the highest F1 Score such that round(Ŝ) = S∗. This configuration was fixed for all experiments. In our experiments, γ and σ have shown to be the most influential hyperparameters. Lower γ values reduce the terminal reward's strength in the first input video frames, making it difficult to control the overall speed-up, and higher σ values permit the agent to select more frames, reducing the overall speed-up without additional punishment. Random initialization did not influence the results.

We trained SAFFA and its variants for 100 epochs using the Adam optimizer [60]. In this stage, all VDAN+ parameters remain frozen. For faster convergence, we use a learning rate of 1e−3 for the state-value approximator. Our approach is fully implemented in the PyTorch library, and the experiments were conducted on an Intel Core i7-3770K CPU @ 3.50GHz machine with 32GB of RAM and an NVIDIA Titan RTX GPU.

4.3 Results

4.3.1 Quantitative Results

Table 1 shows the average Precision, Recall, F1 Score (F1), and Output Speed-up (OS), as well as the Overall Performance (OP), achieved by each method in both datasets. The methods are grouped according to whether they use labels for training (if applicable), optimize the video's output length, and use natural language as input. The gold standard method, FFNet, is listed separately as it needs frame-level (dense) labels for supervision. Because FFNet is the only method that does not allow a target speed-up rate as input, we used its average output speed-up rates, Ŝ = 11.90 and Ŝ = 16.45, as targets for all methods, i.e., S∗ = 12 for the YouCook2 dataset and S∗ = 16 (on average) for the COIN dataset, since S∗ ∈ N+. We removed the COIN domains "Nursing and Care" and "Leisure and Performance" from the evaluation since S∗ = 1 and they would not require any acceleration for the other approaches. We also present the estimated running times in milliseconds (ms), where we can see that our approach is superior to the competitors. We refer the reader to the supplementary materials for a detailed time analysis. The best values in Table 1 are in bold.

Fig. 3. Qualitative results. The colored bars (right) represent the frames selected by each method for each instructional video, and the contiguous black blocks represent the ground-truth segments. Note that, in general, our agent performs a denser frame sampling in the regions representing an instruction step and a sparser one in the other regions (see the images outlined in green and the instructions). Exceptions are the regions where our agent decides to increase its skip rate to attend to the target speed-up or matches the video snippet with the instructions (images outlined in red). The three examples are: a) Vietnamese Veggie Spring Rolls (YouCook2), with instructions "mix tamarind ginger spice rub and garlic; dip the wrappers in water; place the kale cucumber bell peppers carrots and radishes on the wrapper; place cilantro and scallions on the filling; roll the spring roll up" and output speed-ups of 14.85× (FFNet), 11.92× (SAS), 11.31× (SASv2), 12.00× (BoT), and 12.93× (Ours); b) Spicy Tuna Rolls (YouCook2), with instructions "chop some tuna meat into small pieces; mix the tuna pieces and some fish eggs in a bowl; add hot sauce and mayo and then mix them together; spread some cooked rice on seaweed sushi wrapper; add the tuna mix on top and season with some sesame seeds; roll it up and press firmly; cut the roll into small rolls" and output speed-ups of 5.83× (FFNet), 11.73× (SAS), 10.43× (SASv2), 12.02× (BoT), and 12.89× (Ours); c) Graft (COIN), with instructions "cut branches; connect branches; wind the junction to fasten the connection" and output speed-ups of 2.34× (FFNet), 4.00× (SAS), 4.05× (SASv2), 4.00× (BoT), and 3.98× (Ours).

The results indicate that our approach outperforms all competitors by a significant margin in Precision, Recall, and F1, while not compromising the output speed-up rate significantly in either dataset, which is reflected in the OP metric. This means that SAFFA could use the instruction steps described in natural language and match them to what is currently being observed in the scene. On the other hand, the semantics encoded by the SAS and SASv2 methods are based on the YOLO detector. Consequently, the encoded information may not contain details such as interactions with the objects, but only which objects are present in the scene. For that reason, the distribution of the selected frames becomes more similar to a uniform selection, since such objects tend to be present in most frames of the video. Besides, these encoders are limited to a small set of objects, while natural language sentences can potentially describe a much larger number of objects and interactions.

The BoT method did not perform well, despite using the instructions as input. We argue that the BoT method depends on the accuracy of multiple external components such as the dense captioning and the video saliency algorithms; therefore, it is prone to errors. For that same reason, it yields the worst processing time among the methods. We noticed that the saliency is the major contributor to BoT's semantic encoder. The errors in saliency estimation lead the method to present higher frame scores in segments that do not represent an instruction step (see images 2 and 1 in Figure 3-a and c, respectively) or to disregard relevant segments. In many cases, higher saliency values are assigned to the person carrying out the task and lower values to the task itself.
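The OS and OP numbers discussed above follow the definitions of Section 4.1. As an illustration of how OP can be computed (our own rendering of the stated formula, with hypothetical inputs):

```python
import math


def overall_performance(f1, s_hat, s_star, sigma_ratio=0.0838):
    """Overall Performance: harmonic mean of F1 and the OS accuracy,
    where the OS accuracy is a Gaussian centered at the target speed-up S*."""
    sigma_os = sigma_ratio * s_star
    os_accuracy = math.exp(-0.5 * ((s_hat - s_star) / sigma_os) ** 2)
    if f1 + os_accuracy == 0:
        return 0.0
    return 2 * f1 * os_accuracy / (f1 + os_accuracy)


# Example with the values reported for our method on YouCook2 (F1 = 17.86%, OS = 11.68x, S* = 12):
# prints approximately 0.3007, i.e., the 30.07% OP listed in Table 1.
print(overall_performance(0.1786, s_hat=11.68, s_star=12))
```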
Besides outperforming the competitors, our method is also on par with FFNet, even presenting a better Recall on YouCook2. Note that FFNet has no speed-up control optimization at the video level. Therefore, its agent can conveniently reduce and increase its frame skip according to the number of relevant frames present in the video; i.e., output videos with more relevant frames present lower output speed-up rates, while those with fewer relevant frames present higher rates. Moreover, in order to avoid abrupt speed-up rate changes along the video and keep transition smoothness, our agent is constrained via ν and ω to gradually change its navigation speed, while FFNet does not have this constraint.

To assess the significance of the results presented in Table 1, we ran a Student's t-test considering each metric and each pair of methods. We verified that, with 99% confidence (p-value < 0.01), our method is statistically superior to SAS, BoT, and SASv2 in the Precision, Recall, and F1 metrics in both datasets. The test also shows that the average values for F1 do not differ significantly when comparing our approach to FFNet. Our approach is statistically tied with the FFNet and SAS methods concerning the OS metric on YouCook2. We refer the reader to the supplementary material for tables presenting all the p-values.

Figure 2 depicts our average OS error on YouCook2 for different target rates (from 2× to 20×), i.e., the y-axis shows Ŝ − S∗. Note that the errors in the output rates are no higher than 1.75×, indicating that our agent can effectively control the output video's length without re-training. We attribute these results to our joint reward signal, which leads the agent to drop even relevant frames in favor of balancing both objectives. The task of addressing the target speed-up becomes even more challenging when higher target values are imposed. This can also be observed in Figure 2, where the agent creates output videos with less accurate lengths as the target values increase.

4.3.2 Qualitative Results

Figure 3 shows qualitative results for three videos in the YouCook2 and COIN test sets. The vertical colored bars represent each method's selected frames, and the contiguous black blocks represent the ground-truth segments. Each method's Ŝ and S∗ are shown on the right. The numbered images outlined in green represent frames from segments where the agent takes the correct decision w.r.t. the ground-truth. Those outlined in red represent frames where the agent fails to increase its skip in segments with no recipe steps or to decrease it in the ground-truth segments.

In general, SAFFA reduces its skip rate in the relevant segments and increases it otherwise. This behavior indicates that the VDAN+ space is useful for the agent to determine when the video segments match the input document. Some segments not in the ground-truth are semantically similar to the instructions, leading our agent to reduce its skip erroneously. That is the case illustrated in image 3 in Figure 3-b. It is noteworthy that, to attend to the target input speed-up rate, the agent decides to increase its skip rate even though the segment contains relevant frames. This is illustrated in images 4 in Figure 3-a and c. However, note that image 4 in Figure 3-c shows a transition scene where the person takes the tape to wind the junction. Although this scene is in the ground-truth segment, speeding it up does not entail losing the message conveyed by the video. Moreover, discarding relevant frames is necessary once a single ground-truth segment (e.g., the third step in Figure 3-c) might represent a significant portion of the input video.

We assessed the usefulness of the VDAN+ representation for the agent regarding its capability of navigating throughout videos. For each sample in the VaTeX validation set, we created a positive document, as described in Section 3.1.3, and paired it with a single video, i.e., the same video was used in all pairs. These pairs were projected into the VDAN+ space and further projected into a 2-dimensional space via t-SNE [61]. In Figure 4-left, we illustrate the document embeddings (colored points) and some frames of the video. The colors represent the alignment between the documents and the video. On the right, we show the document that describes the video (shaded in blue) and a random document (shaded in gray) from a region that contains the documents most similar to the video. Note that the sentences in the documents with the highest attention weights are semantically related to the video. This indicates that our weakly-supervised approach is useful for the agent to navigate not only instructional videos, as demonstrated in previous sections, but also videos from other domains.

4.3.3 Ablation Studies

VDAN+ versus VDAN [14]. The embedding space created by the semantic encoder is crucial for the good performance of SAFFA, since the agent directly observes the video and document features to infer the current semantic load. To demonstrate the superiority of VDAN+ over VDAN [14], we approach the alignment of embeddings as a binary classification problem. First, we label the frames in the ground-truth segments of YouCook2's training set as positive and the remaining ones as negative. Then, we apply multiple thresholds to the dot product values between each video frame/clip and the corresponding input recipe and calculate the Area Under Curve (AUC) computed from the Receiver Operating Characteristic (ROC) curve. For the experiments with VDAN, we used the same hyperparameters and training data used in [14]. Figure 5 shows the ROC curve and the AUC scores for each semantic encoder. We see that VDAN+ is clearly superior to VDAN at correctly classifying a pair of frame/clip and document, yielding an AUC of 65 against 61 for VDAN. We claim that this superiority is related to the fact that VDAN+ creates embeddings at the segment level. Therefore, it avoids several false positives, such as correlating static scenes and objects under no action.

The effect of h0 = φ(v). Apart from initializing the first hidden state vector of our semantic encoder as h0 = φ(v), we evaluated two other variants: h0 = 0 and h0 = Ws, where Ws ∈ Rz is a learnable parameter vector. We computed the Mean Reciprocal Rank (MRR) using the VaTeX validation set, as described in Section 3.1.3, to compare the variants with our approach.

The MRR is given by (1/Q) Σ_{q=1..Q} 1/rank_q, where Q is the number of documents composing the query set and rank_q is the rank at which the relevant document was retrieved. As expected, the MRR values for h0 = φ(v) are far superior when compared to both the h0 = 0 and h0 = Ws variants, presenting 0.718, 0.065, and 0.061, respectively.
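A minimal sketch of the MRR computation used in this comparison is shown below; it is our own illustration (the paper does not publish this code), written generically for any query/candidate ranking by similarity score, and it ignores tie-breaking.

```python
import numpy as np


def mean_reciprocal_rank(similarities, relevant_idx):
    """similarities: (Q, N) score matrix between each of Q queries and N candidates;
    relevant_idx: for each query, the index of its relevant candidate."""
    reciprocal_ranks = []
    for scores, target in zip(similarities, relevant_idx):
        # Rank of the relevant item when candidates are sorted by decreasing score.
        rank = 1 + np.sum(scores > scores[target])
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```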
We observed that using either of the two variants degraded the model's capacity to produce semantically relevant eD vectors. Basically, after training for 100 epochs, the attention module could still not attend to the correct sentences because the attention weights became roughly evenly distributed. This outcome was expected, since φ(v) lies in a well-defined semantic space, making it easier to create eD given the trajectory {h0 = φ(v), h1, . . . , hN} in the 512-dimensional space.

Fig. 4. Qualitative results for VDAN+. The figure shows VDAN+ document embeddings from the VaTeX validation set projected into two dimensions via t-SNE [61] and colored according to their alignment with a given input video. Note that the documents with the highest alignment with the input video (blue points) are composed of semantically related sentences, and these sentences present the highest attention weights (Att.).

Fig. 5. VDAN+ versus VDAN [14]. The Receiver Operating Characteristic (ROC) curves on YouCook2 when using all dot products as thresholds (VDAN+: AUC = 0.65; VDAN: AUC = 0.61). VDAN+ is superior in deciding whether a frame/segment is relevant.

Method's composition. We evaluated the impact of using different configurations to compose variants of our approach. Table 2 presents the results for Precision, Recall, F1, OS, and OP. The column SE shows the Semantic Encoder used; SA shows the Skip-Aware component, which indicates the agent's current average skip rate; and NRPE is the Normalized Reversed Positional Encoding, which informs the agent of its relative location in the video. All agent variations were trained with the same hyperparameters as detailed in Section 4.2, except for the terminal reward weight, λ (Equation 3). In the variations using VDAN as the semantic encoder, we set λ = F, since we observed that local rewards were dominating the signal sent to the agent. It is noteworthy that the only difference between the SAFFA-based variants without SA and NRPE and the method proposed in our previous work [14] is the reward: the SFF-RL agent disregards the speed-up control, i.e., λ = 0. We also included a variant that uses BoT's frame sampler, a non-RL solution, which takes the frame-document alignments throughout the video as input.

As can be seen in Table 2, the SA element is crucial for the agent to attend to the target speed-up rate; note the low OP values. In the cases without SA and NRPE, due to the lack of distinction among states with the same semantic features, the agent ends up taking the same actions under varied conditions, e.g., at the beginning/end of the video or with the average speed-up rate close to/far from the ideal. Adding only the NRPE is insufficient, since the agent remains "blind" about its skip rate. It is worth noting that using different encoders produced opposite behaviors w.r.t. the OS metric. These results relate to the distribution of VDAN and VDAN+ embeddings, which present average dot products of 0.175 ± 0.812 and 0.833 ± 0.358, respectively, in the YouCook2 training set. The higher semantic rewards of VDAN+ lead the agent to reduce its skip rate and accumulate more rewards.

The variations with the SA vector and no NRPE aid the agent in attending to the target speed-up rate. However, in general, the agent becomes cautious in terms of its skip rate, preferring not to aggressively increase or decrease the video's speed-up rate in order to favor the relevance of the frames. For comparison, the agents' F1 Score using this state composition is 14.09 with VDAN and 15.18 with VDAN+, against 14.33 and 17.86 when using all the state's components. The agent using all components proposed in this paper presents the best trade-off between relevance and target speed-up rate in the output videos, as confirmed by the OP metric. By using the information of its position in the video, the agent can freely decide to attend to the semantics objective at the beginning of the video and increase its concern about the speed-up rate as it reaches the end of the video.

It is worth mentioning the Non-RL variant result. Its F1, OS, and OP values are better than those achieved by the VDAN-based variants, confirming the superiority of VDAN+ over VDAN. Although this variant is on par with the "VDAN+ with SA" variant, the BoT frame sampler takes over 60× more processing time.
TABLE 2
Ablation study. Precision, Recall, F1 Score, Output Speed-up (OS), and Overall Performance (OP) on YouCook2 w.r.t. the method's composition. SE stands for Semantic Encoder, SA for Skip-Aware, NRPE for Normalized Reversed Positional Encoding, and FS for Frame Sampler.

SE    | SA  | NRPE | FS          | Precision | Recall | F1 Score | OS    | OP
VDAN  | ✗   | ✗    | SAFFA       | 49.17     | 4.00   | 7.34     | 24.87 | 0.00
VDAN  | ✗   | ✓    | SAFFA       | 49.68     | 4.26   | 7.77     | 23.67 | 0.00
VDAN  | ✓   | ✗    | SAFFA       | 49.32     | 8.37   | 14.09    | 11.93 | 24.69
VDAN  | ✓   | ✓    | SAFFA       | 51.84     | 8.45   | 14.33    | 12.42 | 24.79
VDAN+ | ✗   | ✗    | SAFFA       | 59.11     | 79.08  | 64.96    | 1.94  | 0.00
VDAN+ | ✗   | ✓    | SAFFA       | 59.63     | 47.67  | 50.98    | 3.02  | 0.00
VDAN+ | ✓   | ✗    | SAFFA       | 52.39     | 9.03   | 15.18    | 11.86 | 26.33
VDAN+ | ✓   | ✓    | SAFFA       | 53.20     | 12.81  | 17.86    | 11.68 | 30.07
VDAN  | ✗   | ✗    | SFF-RL [14] | 53.78     | 42.07  | 41.55    | 6.09  | 0.00
VDAN+ | N/A | N/A  | Non-RL [13] | 52.00     | 9.33   | 15.56    | 11.48 | 26.42

Precision, Recall, F1, and OP: higher is better (%). OS: better when closer to S∗ = 12.

Robustness to Automatic Speech Recognition (ASR). A representative aspect of instructional videos is that the person recording the instructions usually narrates the steps being performed. Thus, using an ASR system, one could apply our methodology to create an accelerated video with no need for text input from the user. To demonstrate this application, we selected a video from YouCook2's validation set and automatically created a document using the English subtitles generated by the YouTube ASR system. We ran our method using that input and compared it against our main result. Figure 6 shows the result (S∗ = 5) with some representative frames on top, the frame selection in the middle, and the input documents for each approach at the bottom.

Fig. 6. Robustness to Automatic Speech Recognition (ASR). The image illustrates the difference between using the annotated recipe and the ASR text as input (both Ours and Ours+ASR achieve an output speed-up of 5.35×). Note that the distribution of the selected frames is similar for either input (colored bars represent the selected frames).

In general, our method performed a similar frame sampling using either input document. The reason is that, although the input documents differ in length, they have similar semantic content. In Figure 6, images 1 and 3 illustrate some cases where our method was accurate w.r.t. the ground-truth: it keeps a sparser frame selection during the recorder's introductory speech (see image 1) and a denser frame selection when the food is cooked (see image 3). We see a difference between the frame selections when using the instructions and the recorder's speech in the region illustrated in image 2. SAFFA using ASR decided to reduce the skip rate in a segment that does not represent a recipe step, but notice that it presents the ingredients used. This behavior is understandable since the ingredients are described in some sentences.

5 DISCUSSIONS

In this paper, we proposed a novel methodology based on a reinforcement learning formulation to accelerate instructional videos. The agent, namely the Skip-Aware Fast-Forwarding Agent (SAFFA), is trained to decide which frames to remove based on textual data and a target speed-up rate. We also presented an extension to the Visually-guided Document Attention Network (VDAN) [14], called VDAN+, which creates a highly discriminative embedding space to represent both textual and visual data. Our approach outperforms several competitors in F1 Score while allowing the user to define the video output length, which can be crucial for applications where time and storage resources are scarce.

Despite achieving the best results, our methodology suffers from certain limitations. The agent may take incorrect actions and emphasize segments that are not the recipe's steps. Figure 7 illustrates an example of such a case, where the agent reduces its skip rate to emphasize a person putting food in the oven (image 1) and some snails getting out of the container (image 2). The major reason for this outcome is the design of our reward function, which is sparse w.r.t. the speed-up rate deviation. Thus, at the beginning of the video, the sum of discounted rewards will mostly be composed of semantic signals, while as t gets higher, the speed-up deviation signal gets stronger. Suppose the agent mistakenly reduces its skip rate at the beginning of the video due to a potentially relevant segment. In that case, it must skip the final frames even if they are also relevant (image 3). In this example, VDAN+ contributed negatively, producing highly aligned vectors due to the high semantic relation between the segments represented in images 1 and 2 and instructions like "Remove the snail from the shell" and "Chop and cook the pancetta in a pan" from the recipe.

Fig. 7. Limitations. The colored bars, contiguous black blocks, and images outlined in red represent, respectively, the selected frames, the ground-truth, and the regions where the agent took a wrong decision.

Nonetheless, our approach takes a step towards understanding the training of RL agents in a multi-modal environment. To accelerate long untrimmed videos while emphasizing the relevant content, we trained an agent to collect rewards proportional to the video-text similarity provided by VDAN+, which removed the demand for temporal markings. This strategy allows the agent to run with freely available texts on the web and requires less annotation effort. We believe that the results may benefit further work in the direction of weakly supervised reinforcement learning when multi-modal data are available and annotations are scarce.

ACKNOWLEDGMENTS

We thank the agencies CAPES, CNPq, FAPEMIG, and Petrobras for funding different parts of this work. We also thank NVIDIA Corporation for the donation of a GPU.