Scaling up sign spotting through sign language dictionaries
International Journal of Computer Vision
Scaling up sign spotting through sign language dictionaries
Gül Varol1,2* · Liliane Momeni1* · Samuel Albanie1,3* · Triantafyllos Afouras1 · Andrew Zisserman1
Received: 1 May 2021 / Accepted: 21 January 2022
arXiv:2205.04152v1 [cs.CV] 9 May 2022

*Equal contribution
1 Visual Geometry Group, University of Oxford, UK
2 LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
3 Department of Engineering, University of Cambridge, UK
E-mail: {gul,liliane,albanie,afourast,az}@robots.ox.ac.uk
https://www.robots.ox.ac.uk/~vgg/research/bsldict/
1 Co-articulation refers to changes in the appearance of the current sign due to neighbouring signs.

Abstract The focus of this work is sign spotting—given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) watching existing footage which is sparsely labelled using mouthing cues; (2) reading associated subtitles (readily available translations of the signed content) which provide additional weak-supervision; (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on low-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BslDict, to facilitate study of this task. The dataset, models and code are available at our project page.

1 Introduction

The objective of this work is to develop a sign spotting model that can identify and localise instances of signs within sequences of continuous sign language. Sign languages represent the natural means of communication for deaf communities [56] and sign spotting has a broad range of practical applications. Examples include: indexing videos of signing content by keyword to enable content-based search; gathering diverse dictionaries of sign exemplars from unlabelled footage for linguistic study; automatic feedback for language students via an "auto-correct" tool (e.g. "did you mean this sign?"); making voice activated wake word devices available to deaf communities; and building sign language datasets by automatically labelling examples of signs.

Recently, deep neural networks, equipped with large-scale, labelled datasets, produced considerable progress in audio [23, 59] and visual [42, 53] keyword spotting in spoken languages. However, a direct replication of these keyword spotting successes in sign language requires a commensurate quantity of labelled data (note that modern audiovisual spoken keyword spotting datasets contain millions of densely labelled examples [2, 19]), but such datasets are not available for sign language.

It might be thought that a sign language dictionary would offer a relatively straightforward solution to the sign spotting task, particularly to the problem of covering only a limited vocabulary in existing large-scale corpora. But, unfortunately, this is not the case due to the severe domain differences between dictionaries and continuous signing in the wild. The challenges are that sign language dictionaries typically: (1) consist of isolated signs which differ in appearance from the co-articulated1 sequences of continuous signs (for which we ultimately wish to perform spotting); and (2) differ in speed (are performed more slowly) relative to co-articulated signing. Furthermore, (3) dictionaries only possess a few examples of each sign (so learning must be low shot); and as one more challenge, (4) there can be multiple signs corresponding to a single keyword, for example due to regional variations of the sign language [50].
Fig. 1: We consider the task of sign spotting in co-articulated, continuous signing. Given a query dictionary video of an isolated sign (e.g., "apple"), we aim to identify whether and where it appears in videos of continuous signing. The wide domain gap between dictionary examples of isolated signs and target sequences of continuous signing makes the task extremely challenging.

We show through experiments in Sec. 4 that directly training a sign spotter for continuous signing on dictionary examples, obtained from an internet-sourced sign language dictionary, does indeed perform poorly.

To address these challenges, we propose a unified framework in which sign spotting embeddings are learned from the dictionary (to provide broad coverage of the lexicon) in combination with two additional sources of supervision. In aggregate, these multiple types of supervision include: (1) watching sign language and learning from existing sparse annotations obtained from mouthing cues [5]; (2) exploiting weak-supervision by reading the subtitles that accompany the footage and extracting candidates for signs that we expect to be present; (3) looking up words (for which we do not have labelled examples) in a sign language dictionary. The recent development of a large-scale, subtitled dataset of continuous signing providing sparse annotations [5] allows us to study this problem setting directly. We formulate our approach as a Multiple Instance Learning problem in which positive samples may arise from any of the three sources and employ Noise Contrastive Estimation [32] to learn a domain-invariant (valid across both isolated and co-articulated signing) representation of signing content.

Our loss formulation is an extension of InfoNCE [46, 63] (and in particular the multiple instance variant MIL-NCE [41]). The novelty of our method lies in the batch formulation that leverages the mouthing annotations, subtitles, and visual dictionaries to define positive and negative bags. Moreover, this work specifically focuses on computing similarities across two different domains to learn matching between isolated and co-articulated signing.

We make the following contributions, originally introduced in [43]: (1) We provide a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BslDict, to facilitate study of the sign spotting task; (2) We propose a unified Multiple Instance Learning framework for learning sign embeddings suitable for spotting from three supervisory sources; (3) We validate the effectiveness of our approach on a co-articulated sign spotting benchmark for which only a small number (low-shot) of isolated signs are provided as labelled training examples, and (4) achieve state-of-the-art performance on the BSL-1K sign spotting benchmark [5] (closed vocabulary). We show qualitatively that the learned embeddings can be used to (5) automatically mine new signing examples, and (6) discover "faux amis" (false friends) between sign languages. In addition, we extend these contributions with (7) the demonstration that our framework can be effectively deployed to obtain large numbers of sign examples, enabling state-of-the-art performance to be reached on the BSL-1K sign recognition benchmark [5], and on the recently released BOBSL dataset [4].
2 Related Work

Our work relates to several themes in the literature: sign language recognition (and more specifically sign spotting), sign language datasets, multiple instance learning and low-shot action localization. We discuss each of these themes next.

Sign language recognition. The study of automatic sign recognition has a rich history in the computer vision community stretching back over 30 years, with early methods developing carefully engineered features to model trajectories and shape [30, 37, 54, 57]. A series of techniques then emerged which made effective use of hand and body pose cues through robust keypoint estimation encodings [10, 22, 45, 49]. Sign language recognition has also been considered in the context of sequence prediction, with HMMs [3, 31, 37, 54], LSTMs [11, 35, 66, 68], and Transformers [12] proving to be effective mechanisms for this task. Recently, convolutional neural networks have emerged as the dominant approach for appearance modelling [11], and in particular, action recognition models using spatio-temporal convolutions [16] have proven very well-suited for video-based sign recognition [5, 36, 39]. We adopt the I3D architecture [16] as a foundational building block in our studies.

Sign language spotting. The sign language spotting problem—in which the objective is to find performances of a sign (or sign sequence) in a longer sequence of signing—has been studied with Dynamic Time Warping and skin colour histograms [60] and with Hierarchical Sequential Patterns [26]. Different from our work which learns representations from multiple weak supervisory cues, these approaches consider a fully-supervised setting with a single source of supervision and use hand-crafted features to represent signs [27]. Our proposed use of a dictionary is also closely tied to one-shot/few-shot learning, in which the learner is assumed to have access to only a handful of annotated examples of the target category. One-shot dictionary learning was studied by [49] – different to their approach, we explicitly account for variations in the dictionary for a given word (and validate the improvements brought by doing so in Sec. 4). Textual descriptions from a dictionary of 250 signs were used to study zero-shot learning by [9] – we instead consider the practical setting in which a handful of video examples are available per-sign and work with a much larger vocabulary (9K words and phrases).

The use of dictionaries to locate signs in subtitled video also shares commonalities with domain adaptation, since our method must bridge differences between the dictionary and the target continuous signing distribution. A vast number of techniques have been proposed to tackle distribution shift, including several adversarial feature alignment methods that are specialised for the few-shot setting [44, 67]. In our work, we explore the domain-specific batch normalization (DSBN) method of [18], finding ultimately that simple batch normalization parameter re-initialization is instead most effective when jointly training on two domains after pretraining on the bigger domain. The concurrent work of [40] also seeks to align representations of isolated and continuous signs. However, our work differs from theirs in several key aspects: (1) rather than assuming access to a large-scale labelled dataset of isolated signs, we consider the setting in which only a handful of dictionary examples may be used to represent a word; (2) we develop a generalised Multiple Instance Learning framework which allows the learning of representations from weakly-aligned subtitles whilst exploiting sparse labels from mouthings [5] and dictionaries (this integrates cues beyond the learning formulation in [40]); (3) we seek to label and improve performance on co-articulated signing (rather than improving recognition performance on isolated signing). Also related to our work, [49] uses a "reservoir" of weakly labelled sign footage to improve the performance of a sign classifier learned from a small number of examples. Different to [49], we propose a multiple instance learning formulation that explicitly accounts for signing variations that are present in the dictionary.

Sign language datasets. A number of sign language datasets have been proposed for studying Finnish [60], German [38, 61], American [7, 36, 39, 62] and Chinese [17, 35] sign recognition. For British Sign Language (BSL), [51] gathered the BSL Corpus which represents continuous signing, labelled with fine-grained linguistic annotations. More recently [5] collected BSL-1K, a large-scale dataset of BSL signs that were obtained using a mouthing-based keyword spotting model. Further details on this method are given in Sec. 3.1. In this work, we contribute BslDict, a dictionary-style dataset that is complementary to the datasets of [5, 51] – it contains only a handful of instances of each sign, but achieves a comprehensive coverage of the BSL lexicon with a 9K English vocabulary (vs a 1K vocabulary in [5]). As we show in the sequel, this dataset enables a number of sign spotting applications. While BslDict does not represent a linguistic corpus, as the correspondences to English words and phrases are not carefully annotated with glosses2, it is significantly larger than its linguistic counterparts (e.g., 4K videos corresponding to 2K words in BSL SignBank [29], as opposed to 14K videos of 9K words in BslDict), therefore BslDict is particularly suitable to be used in conjunction with subtitles.

2 Glosses are atomic lexical units used to annotate sign languages.
Fig. 2: The proposed Watch, Read and Lookup framework trains sign spotting embeddings with three cues: (1) watching videos and learning from sparse annotation in the form of localised signs obtained from mouthings [5] (lower-left); (2) reading subtitles to find candidate signs that may appear in the source footage (top); (3) looking up corresponding visual examples in a sign language dictionary and aligning the representation against the embedded source segment (lower-right).

Multiple instance learning. Motivated by the readily available sign language footage that is accompanied by subtitles, a number of methods have been proposed for learning the association between signs and words that occur in the subtitle text [10, 20, 21, 49]. In this work, we adopt the framework of Multiple Instance Learning (MIL) [24] to tackle this problem, previously explored by [10, 48]. Our work differs from these works through the incorporation of a dictionary, and a principled mechanism for explicitly handling sign variants, to guide the learning process. Furthermore, we generalise the MIL framework so that it can learn to further exploit sparse labels. We also conduct experiments at significantly greater scale to make use of the full potential of MIL, considering more than two orders of magnitude more weakly supervised data than [10, 48].

Low-shot action localization. This theme investigates semantic video localization: given one or more query videos the objective is to localize the segment in an untrimmed video that corresponds semantically to the query video [13, 28, 65]. Semantic matching is too general for the sign-spotting considered in this paper. However, we build on the temporal ordering ideas explored in this theme.

3 Learning Sign Spotting Embeddings from Multiple Supervisors

In this section, we describe the task of sign spotting and the three forms of supervision we assume access to. Let X_L denote the space of RGB video segments containing a frontal-facing individual communicating in sign language L and denote by X_L^single its restriction to the set of segments containing a single sign. Further, let T denote the space of subtitle sentences and V_L = {1, . . . , V} denote the vocabulary—an index set corresponding to an enumeration of written words that are equivalent to signs that can be performed in L3.

Our objective, illustrated in Fig. 1, is to discover all occurrences of a given keyword in a collection of continuous signing sequences. To do so, we assume access to: (i) a subtitled collection of videos containing continuous signing, S = {(x_i, s_i) : i ∈ {1, . . . , I}, x_i ∈ X_L, s_i ∈ T}; (ii) a sparse collection of temporal sub-segments of these videos that have been annotated with their corresponding word, M = {(x_k, v_k) : k ∈ {1, . . . , K}, v_k ∈ V_L, x_k ∈ X_L^single, ∃(x_i, s_i) ∈ S s.t. x_k ⊆ x_i}; (iii) a curated dictionary of signing instances D = {(x_j, v_j) : j ∈ {1, . . . , J}, x_j ∈ X_L^single, v_j ∈ V_L}. To address the sign spotting task, we propose to learn a data representation f : X_L → R^d that maps video segments to vectors such that they are discriminative for sign spotting and invariant to other factors of variation. Formally, for any labelled pair of video segments (x, v), (x′, v′) with x, x′ ∈ X_L and v, v′ ∈ V_L, we seek a data representation, f, that satisfies the constraint δ_{f(x) f(x′)} = δ_{v v′}, where δ represents the Kronecker delta.

3 Sign language dictionaries provide a word-level or phrase-level correspondence (between sign language and spoken language) for many signs but no universally accepted glossing scheme exists for transcribing languages such as BSL [56].
Fig. 3: Mouthing-based sign annotation from [5]: (Left, the annotation pipeline): Stage 1: for a given sign (e.g. "happy"), each instance of the word in the subtitles provides a candidate temporal segment where the sign may occur (the subtitle timestamps are padded by several seconds to account for the asynchrony between audio-aligned subtitles and signing interpretation); Stage 2: a mouthing visual keyword spotter uses the lip movements of the signer to perform precise localisation of the sign within this window. (Right): Examples of localised signs through mouthings from the BSL-1K dataset—produced by applying keyword spotting for a vocabulary of 1K words.

3.1 Sparse annotations from mouthing cues

As the source of temporal video segments with corresponding word annotations, M, we make use of automatic annotations that were collected as part of our prior work on visual keyword spotting with mouthing cues [5], which we briefly recap here. Signers sometimes mouth a word while simultaneously signing it, as an additional signal [8, 55, 56], performing similar lip patterns as for the spoken word. Fig. 3 presents an overview of how we use such mouthings to spot signs.

As a starting point for this approach, we assume access to TV footage that is accompanied by: (i) a frontal facing sign language interpreter, who provides a translation of the spoken content of the video, and (ii) a subtitle track, representing a direct transcription of the spoken content. The method of [5] first searches among the subtitles for any occurrences of "keywords" from a given vocabulary. Subtitles containing these keywords provide a set of candidate temporal windows in which the interpreter may have produced the sign corresponding to the keyword (see Fig. 3, Left, Stage 1).

However, these temporal windows are difficult to make use of directly since: (1) the occurrence of a keyword in a subtitle does not ensure the presence of the corresponding sign in the signing sequence, (2) the subtitles themselves are not precisely aligned with the signing, and can differ in time by several seconds. To address these issues, [5] demonstrated that the sign corresponding to a particular keyword can be localised within a candidate temporal window – given by the padded subtitle timings to account for the asynchrony between the audio-aligned subtitles and signing interpretation – by searching for its spoken components [56] amongst the mouth movements of the interpreter. While there are challenges associated with using spoken components as a cue (signers do not typically mouth continuously and may only produce mouthing patterns that correspond to a portion of the keyword [56]), it has the significant advantage of transforming the general annotation problem from classification (i.e., "which sign is this?") into the much easier problem of localisation (i.e., "find a given token amongst a short sequence"). In [5], the visual keyword spotter uses the candidate temporal window with the target keyword to estimate the probability that the sign was mouthed at each time step. If the peak probability over time is above a threshold parameter, the predicted location of the sign is taken as the 0.6 second window starting before the position of the peak probability (see Fig. 3, Left, Stage 2). For building the BSL-1K dataset, [5] uses a probability threshold of 0.5 and runs the visual keyword spotter with a vocabulary of 1,350 keywords across 1,000 hours of signing. A further filtering step is performed on the vocabulary to ensure that each word included in the dataset is represented with high confidence (at least one instance with confidence 0.8) in the training partition, which produces a final dataset vocabulary of 1,064 words. The resulting BSL-1K dataset has 273K mouthing annotations, some of which are illustrated in Fig. 3 (right). We employ these annotations directly to form the set M in this work.
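As a concrete illustration of this two-stage pipeline, the sketch below builds a padded candidate window from a subtitle occurrence and applies the peak-probability acceptance rule. It assumes per-frame mouthing probabilities from a visual keyword spotter are already available; the function and parameter names (e.g. pad_secs) are illustrative rather than taken from the released code.

```python
# Sketch of the two-stage mouthing-based annotation described above (illustrative).
import numpy as np

def candidate_window(sub_start, sub_end, pad_secs=4.0):
    """Stage 1: pad the subtitle timing to form a search window (in seconds)."""
    return max(0.0, sub_start - pad_secs), sub_end + pad_secs

def localise_sign(mouthing_probs, window_start, fps=25, threshold=0.5, sign_secs=0.6):
    """Stage 2: accept the keyword if the peak mouthing probability exceeds the
    threshold; return a 0.6 s window ending at the peak, otherwise None."""
    peak = int(np.argmax(mouthing_probs))
    if mouthing_probs[peak] < threshold:
        return None
    t_peak = window_start + peak / fps
    return t_peak - sign_secs, t_peak  # window placed just before the peak
```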
3.2 Integrating cues through multiple instance learning

To learn f, we must address several challenges. First, as noted in Sec. 1, there may be a considerable distribution shift between the dictionary videos of isolated signs in D and the co-articulated signing videos in S. Second, sign languages often contain multiple sign variants for a single written word (e.g., resulting from regional variations and synonyms). Third, since the subtitles in S are only weakly aligned with the sign sequence, we must learn to associate signs and words from a noisy signal that lacks temporal localisation. Fourth, the localised annotations provided by M are sparse, and therefore we must make good use of the remaining segments of subtitled videos in S if we are to learn an effective representation.

Given full supervision, we could simply adopt a pairwise metric learning approach to align segments from the videos in S with dictionary videos from D by requiring that f maps a pair of isolated and co-articulated signing segments to the same point in the embedding space if they correspond to the same sign (positive pairs) and apart if they do not (negative pairs). As noted above, in practice we do not have access to positive pairs because: (1) for any annotated segment (x_k, v_k) ∈ M, we have a set of potential sign variations represented in the dictionary (annotated with the common label v_k), rather than a single unique sign; (2) since S provides only weak supervision, even when a word is mentioned in the subtitles we do not know where it appears in the continuous signing sequence (if it appears at all). These ambiguities motivate a Multiple Instance Learning [24] (MIL) objective. Rather than forming positive and negative pairs, we instead form positive bags of pairs, P^bags, in which we expect at least one pairing between a segment from a video in S and a dictionary video from D to contain the same sign, and negative bags of pairs, N^bags, in which we expect no (video segment, dictionary video) pair to contain the same sign. To incorporate the available sources of supervision into this formulation, we consider two categories of positive and negative bag formations, described next (a formal mathematical description of the positive and negative bags described below is deferred to Appendix C.2).

Watch and Lookup: using sparse annotations and dictionaries. Here, we describe a baseline where we assume no subtitles are available. To learn f from M and D, we define each positive bag as the set of possible pairs between a labelled (foreground) temporal segment of a continuous video from M and the examples of the corresponding sign in the dictionary (green regions in Fig. A.2). The key assumption here is that each labelled sign segment from M matches at least one sign variation in the dictionary. Negative bags are constructed by (i) anchoring on a continuous foreground segment and selecting dictionary examples corresponding to different words from other batch items; (ii) anchoring on a dictionary foreground set and selecting continuous foreground segments from other batch items (red regions in Fig. A.2). To maximize the number of negatives within one minibatch, we sample a different word per batch item.

Watch, Read and Lookup: using sparse annotations, subtitles and dictionaries. Using just the labelled sign segments from M to construct bags has a significant limitation: f is not encouraged to represent signs beyond the initial vocabulary represented in M. We therefore look at the subtitles (which contain words beyond M) to construct additional bags. We determine more positive bags between the set of unlabelled (background) segments in the continuous footage and the set of dictionaries corresponding to the background words in the subtitle (green regions in Fig. 4, right-bottom). Negatives (red regions in Fig. 4) are formed as the complements to these sets by (i) pairing continuous background segments with dictionary samples that can be excluded as matches (through subtitles) and (ii) pairing background dictionary entries with the foreground continuous segment. In both cases, we also define negatives from other batch items by selecting pairs where the word(s) have no overlap, e.g., in Fig. 4, the dictionary examples for the background word 'speak' from the second batch item are negatives for the background continuous segments from the first batch item, corresponding to the unlabelled words 'name' and 'what' in the subtitle.
Fig. 4: Batch sampling and positive/negative pairs: We illustrate the formation of a batch when jointly training on continuous signing video (squares) and dictionaries of isolated signing (circles). Left: For each continuous video, we sample the dictionaries corresponding to the labelled word (foreground), as well as to the rest of the subtitles (background). Right: We construct positive/negative pairs by anchoring at 4 different portions of a batch item: continuous foreground/background and dictionary foreground/background. Positives and negatives (defined across continuous and dictionary domains) are green and red, respectively; anchors have a dashed border (see Appendix C.2 for details).

To assess the similarity of two embedded video segments, we employ a similarity function ψ : R^d × R^d → R whose value increases as its arguments become more similar (in this work, we use cosine similarity). For notational convenience below, we write ψ_ij as shorthand for ψ(f(x_i), f(x_j)). To learn f, we consider a generalization of the InfoNCE loss [46, 63] (a non-parametric softmax loss formulation of Noise Contrastive Estimation [32]) recently proposed by [41] as MIL-NCE loss:

L = −E_i [ log ( Σ_{(j,k)∈P(i)} e^{ψ_{jk}/τ} / ( Σ_{(j,k)∈P(i)} e^{ψ_{jk}/τ} + Σ_{(l,m)∈N(i)} e^{ψ_{lm}/τ} ) ) ],   (1)

where P(i) ∈ P^bags, N(i) ∈ N^bags, and τ, often referred to as the temperature, is set as a hyperparameter (we explore the effect of its value in Sec. 4).

3.3 Implementation details

In this section, we provide details for the learning framework covering the embedding architecture, sampling protocol and optimization procedure.

Embedding architecture. The architecture comprises an I3D spatio-temporal trunk network [16] to which we attach an MLP consisting of three linear layers separated by leaky ReLU activations (with negative slope 0.2) and a skip connection. The trunk network takes as input 16 frames from a 224 × 224 resolution video clip and produces 1024-dimensional embeddings which are then projected to 256-dimensional sign spotting embeddings by the MLP. More details about the embedding architecture can be found in Appendix C.1.

Joint pretraining. The I3D trunk parameters are initialised by pretraining for sign classification jointly over the sparse annotations M of a continuous signing dataset (BSL-1K [5]) and examples from a sign dictionary dataset (BslDict) which fall within their common vocabulary. Since we find that dictionary videos of isolated signs tend to be performed more slowly, we uniformly sample 16 frames from each dictionary video with a random shift and random frame rate n times, where n is proportional to the length of the video, and pass these clips through the I3D trunk then average the resulting vectors before they are processed by the MLP to produce the final dictionary embeddings. We find that this form of random sampling performs better than sampling 16 consecutive frames from the isolated signing videos (see Appendix C.1 for more details). During pretraining, minibatches of size 4 are used; and colour, scale and horizontal flip augmentations are applied to the input video, following the procedure described in [5]. The trunk parameters are then frozen and the MLP outputs are used as embeddings. Both datasets are described in detail in Sec. 4.1.

Minibatch sampling. To train the MLP given the pretrained I3D features, we sample data by first iterating over the set of labelled segments comprising the sparse annotations, M, that accompany the dataset of continuous, subtitled signing to form minibatches. For each continuous video, we sample 16 consecutive frames around the annotated timestamp (more precisely a random offset within 20 frames before, 5 frames after, following the timing study in [5]). We randomly sample 10 additional 16-frame clips from this video outside of the labelled window, i.e., continuous background segments. For each subtitled sequence, we sample the dictionary entries for all subtitle words that appear in V_L (see Fig. 4 for a sample batch formation). Our minibatch comprises 128 sequences of continuous signing and their corresponding dictionary entries (we investigate the impact of batch size in Sec. 4.4).
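A compact PyTorch rendering of the MIL-NCE objective in Eqn. (1) is given below, assuming the positive and negative pair similarities for each anchor have already been gathered into fixed-size tensors (variable-sized bags would additionally require masking). It is a sketch for illustration, not the released training code.

```python
# Sketch of the MIL-NCE loss of Eqn. (1) in PyTorch (illustrative, fixed-size bags).
import torch

def mil_nce_loss(pos_sim, neg_sim, tau=0.07):
    """pos_sim: (B, P) similarities psi_jk for pairs (j, k) in P(i).
       neg_sim: (B, N) similarities psi_lm for pairs (l, m) in N(i).
       Returns the mean over anchors i of -log(sum_pos / (sum_pos + sum_neg))."""
    pos = torch.exp(pos_sim / tau).sum(dim=1)   # numerator of Eqn. (1)
    neg = torch.exp(neg_sim / tau).sum(dim=1)
    return (-torch.log(pos / (pos + neg))).mean()

# Usage with cosine similarities of L2-normalised embeddings (random placeholders):
pos_sim = torch.rand(128, 4) * 2 - 1   # 128 anchors, 4 positive pairs each
neg_sim = torch.rand(128, 64) * 2 - 1  # 64 negative pairs each
loss = mil_nce_loss(pos_sim, neg_sim)
```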
The embeddings are then trained by minimising the loss defined in Eqn. (1) in conjunction with positive bags, P^bags, and negative bags, N^bags, which are constructed on-the-fly for each minibatch (see Fig. 4).

Optimization. We use a SGD optimizer with an initial learning rate of 10^-2 to train the embedding architecture. The learning rate is decayed twice by a factor of 10 (at epochs 40 and 45). We train all models, including baselines and ablation studies, for 50 epochs at which point we find that learning has always converged.

Test time. To perform spotting, we obtain the embeddings learned with the MLP. For the dictionary, we have a single embedding averaged over the video. Continuous video embeddings are obtained with sliding window (stride 1) on the entire sequence. We show the importance of using such a dense stride for a precise localisation in our ablations (Sec. 4.4). However, for simplicity, all qualitative visualisations are performed with continuous video embeddings obtained with a sliding window of stride 8.

We calculate the cosine similarity score between the continuous signing sequence embeddings and the embedding for a given dictionary video. We determine the location with the maximum similarity as the location of the queried sign. We maintain embedding sets of all variants of dictionary videos for a given word and choose the best match as the one with the highest similarity.

4 Experiments

In this section, we first present the datasets used in this work (including the contributed BslDict dataset) in Sec. 4.1, followed by the evaluation protocol in Sec. 4.2. We then illustrate the benefits of the Watch, Read and Lookup learning framework for sign spotting against several baselines (Sec. 4.3) with a comprehensive ablation study that validates our design choices (Sec. 4.4). Next, we investigate three applications of our method in Sec. 4.5, showing that it can be used to (i) not only spot signs, but also identify the specific sign variant that was used, (ii) label sign instances in continuous signing footage given the associated subtitles, and (iii) discover "faux amis" between different sign languages. We then provide experiments on sign language recognition, significantly improving the state of the art by applying our labelling technique to obtain more training examples automatically (Sec. 4.6 and Sec. 4.7). Finally, we discuss limitations of our sign spotting technique using dictionaries (Sec. 4.8).

4.1 Datasets

Although our method is conceptually applicable to a number of sign languages, in this work we focus primarily on BSL, the sign language of British deaf communities. We use BSL-1K [5], a large-scale, subtitled and sparsely annotated dataset of more than 1,000 hours of continuous signing which offers an ideal setting in which to evaluate the effectiveness of the Watch, Read and Lookup sign spotting framework. To provide dictionary data for the lookup component of our approach, we also contribute BslDict, a diverse visual dictionary of signs. These two datasets are summarised in Tab. 1 and described in more detail below. We further include experiments on a new dataset, BOBSL [4], which we describe in Sec. 4.7 together with results. The BOBSL dataset has similar properties to BSL-1K.

BSL-1K [5] comprises over 1,000 hours of video of continuous sign-language-interpreted television broadcasts, with accompanying subtitles of the audio content. In [5], this data is processed for the task of individual sign recognition: a visual keyword spotter is applied to signer mouthings giving a total of 273K sparsely localised sign annotations from a vocabulary of 1,064 signs (169K in the training partition as shown in Tab. 1). Please refer to Sec. 3.1 and [5] for more details on the automatic annotation pipeline. We refer to Sec. 4.6 for a description of the BSL-1K sign recognition benchmark (Test^Rec_2K and Test^Rec_37K in Tab. 1).

In this work, we process this data for the task of retrieval, extracting long videos with associated subtitles. In particular, we pad ±2 seconds around the subtitle timestamps and we add the corresponding video to our training set if there is a sparse annotation from mouthing falling within this time window – we assume this constraint indicates that the signing is reasonably well-aligned with its subtitles. We further consider only the videos whose subtitle duration is longer than 2 seconds. For testing, we use the automatic test set (corresponding to mouthing locations with confidences above 0.9). Thus we obtain 78K training (Train^ReT) and 2K test (Test^ReT) videos as shown in Tab. 1, each of which has a subtitle of 8 words on average and 1 sparse mouthing annotation.

BslDict. BSL dictionary videos are collected from a BSL sign aggregation platform signbsl.com [1], giving us a total of 14,210 video clips for a vocabulary of 9,283 signs. Each sign is typically performed several times by different signers, often in different ways. The dictionary videos are linked from 28 known website sources and each source has at least 1 signer. We used face embeddings computed with SENet-50 [34] (trained on VGGFace2 [14]) to cluster signer identities and manually verified that there are a total of 124 different signers.
Dataset    | Split        | #Videos | Vocabulary | #Signers
BSL-1K [5] | Train^Rec    | 169K    | 1,064      | 36
           | Test^Rec_2K  | 2,103   | 334        | 4
           | Test^Rec_37K | 36,854  | 950        | 4
           | Train^ReT    | 78,211  | 1,064      | 36
           | Test^ReT     | 1,834   | 264        | 4
BslDict    | Full         | 14,210  | 9,283      | 124
           | 1K-subset    | 2,963   | 1,064      | 70

Table 1: Datasets: We provide (i) the number of individual signing videos, (ii) the vocabulary size of the annotated signs, and (iii) the number of signers for several subsets of BSL-1K and BslDict. BSL-1K is large in the number of annotated signs whereas BslDict is large in the vocabulary size. Note that BSL-1K is constructed differently depending on whether it is used for the task of recognition or retrieval: for retrieval, longer signing sequences are used around individual localised signs as described in Sec. 4.1.

The dictionary videos are of isolated signs (as opposed to co-articulated in BSL-1K): this means (i) the start and end of the video clips usually consist of a still signer pausing, and (ii) the sign is performed at a much slower rate for clarity. We first trim the sign dictionary videos, using body keypoints estimated with OpenPose [15] which indicate the start and end of wrist motion, to discard frames where the signer is still. With this process, the average number of frames per video drops from 78 to 56 (still significantly larger than co-articulated signs). To the best of our knowledge, BslDict is the first curated BSL sign dictionary dataset for computer vision research. A collection of metadata associated with the BslDict dataset is made publicly available, as well as our pre-computed video embeddings from this work.

For the experiments in which BslDict is filtered to the 1,064 vocabulary of BSL-1K, we have 3K videos as shown in Tab. 1. Within this subset, each sign has between 1 and 10 examples (average of 3).

4.2 Evaluation protocols

Protocols. We define two settings: (i) training with the entire 1,064 vocabulary of annotations in BSL-1K; and (ii) training on a subset with 800 signs. The latter is needed to assess the performance on novel signs, for which we do not have access to co-articulated labels at training. We thus use the remaining 264 words for testing. This test set is therefore common to both training settings; it is either 'seen' or 'unseen' at training. However, we do not limit the vocabulary of the dictionary as a practical assumption, for which we show benefits.

Metrics. The performance is evaluated based on ranking metrics as in retrieval. For every sign s_i in the test vocabulary, we first select the BSL-1K test set clips which have a mouthing annotation of s_i and then record the percentage of times that a dictionary clip of s_i appears in the first 5 retrieved results; this is the 'Recall at 5' (R@5). This is motivated by the fact that different English words can correspond to the same sign, and vice versa. We also report mean average precision (mAP). For each video pair, the match is considered correct if (i) the dictionary clip corresponds to s_i and the BSL-1K video clip has a mouthing annotation of s_i, and (ii) if the predicted location of the sign in the BSL-1K video clip, i.e., the time frame where the maximum similarity occurs, lies within certain frames around the ground truth mouthing timing. In particular, we determine the correct interval to be defined between 20 frames before and 5 frames after the labelled time (based on the study in [5]). Finally, because the BSL-1K test set is class-unbalanced, we report performances averaged over the test classes.

4.3 Comparison to baselines

In this section, we evaluate different components of our approach. We first compare our contrastive learning approach with classification baselines. Then, we investigate the effect of our multiple-instance loss formulation. Finally, we report performance on a sign spotting benchmark.

I3D baselines. We start by evaluating baseline I3D models trained with classification on the task of spotting, using the embeddings before the classification layer. We have three variants in Tab. 2: (i) I3D_BSL-1K provided by [5] which is trained only on the BSL-1K dataset, and we also train (ii) I3D_BslDict and (iii) I3D_BSL-1K,BslDict.
Embedding arch.         | Supervision       | Train (1064), Seen (264): mAP / R@5 | Train (800), Unseen (264): mAP / R@5
I3D_BslDict             | Classification    | 2.68 / 3.57                         | 1.21 / 1.29
I3D_BSL-1K [5]          | Classification    | 13.09 / 17.25                       | 6.74 / 8.94
I3D_BSL-1K,BslDict      | Classification    | 19.81 / 25.57                       | 4.81 / 6.89
I3D_BSL-1K,BslDict +MLP | Classification    | 37.13 ± 0.29 / 39.68 ± 0.57         | 10.33 ± 0.43 / 13.33 ± 1.11
I3D_BSL-1K,BslDict +MLP | InfoNCE           | 43.59 ± 0.76 / 52.59 ± 0.75         | 11.40 ± 0.42 / 14.76 ± 0.40
I3D_BSL-1K,BslDict +MLP | Watch-Lookup      | 44.72 ± 0.85 / 55.51 ± 2.17         | 11.02 ± 0.27 / 15.03 ± 0.45
I3D_BSL-1K,BslDict +MLP | Watch-Read-Lookup | 47.93 ± 0.20 / 60.76 ± 1.45         | 14.86 ± 1.29 / 19.85 ± 1.94

Table 2: The effect of the loss formulation: Embeddings learned with the classification loss are suboptimal since they are not trained for matching the two domains. Contrastive-based loss formulations (NCE) significantly improve, particularly when we adopt the multiple-instance variant introduced as our Watch-Read-Lookup framework of multiple supervisory signals. We train the relatively cheaper MLP-based models with three random seeds for each model and report the mean and the standard deviation.

Training only on BslDict (I3D_BslDict) performs significantly worse due to the few examples available per class and the domain gap that must be bridged to spot co-articulated signs, suggesting that dictionary samples alone do not suffice to solve the task. We observe improvements with fine-tuning I3D_BSL-1K jointly on the two datasets (I3D_BSL-1K,BslDict), which becomes our base feature extractor for the remaining experiments to train a shallow MLP.

Loss formulation. We first train the MLP parameters on top of the frozen I3D trunk with classification to establish a baseline in a comparable setup. Note that this shallow architecture can be trained with larger batches than I3D. Next, we investigate variants of our loss to learn a joint sign embedding between BSL-1K and BslDict video domains: (i) the standard single-instance InfoNCE [46, 63] loss which pairs each BSL-1K video clip with one positive BslDict clip of the same sign, (ii) Watch-Lookup which considers multiple positive dictionary candidates, but does not consider subtitles (therefore limited to the annotated video clips). Tab. 2 summarises the results. Our Watch-Read-Lookup formulation which effectively combines multiple sources of supervision in a multiple-instance framework outperforms the other baselines in both seen and unseen protocols.

Extending the vocabulary. The results presented so far were using the same vocabulary for both continuous and dictionary datasets. In reality, one can assume access to the entire vocabulary in the dictionary, but obtaining annotations for the continuous videos is prohibitive. Tab. 3 investigates removing the vocabulary limit on the dictionary side, but keeping the continuous annotations vocabulary at 800 signs. We show that using the full 9k vocabulary from BslDict improves the results on the unseen setting.

Supervision        | Dictionary Vocab   | mAP          | R@5
Watch-Read-Lookup  | 800 training vocab | 14.86 ± 1.29 | 19.85 ± 1.94
Watch-Read-Lookup  | 9k full vocab      | 15.82 ± 0.48 | 21.67 ± 0.72

Table 3: Extending the dictionary vocabulary: We show the benefits of sampling dictionary videos outside of the sparse annotations, using subtitles. Extending the lookup to the dictionary from the subtitles to the full vocabulary of BslDict brings significant improvements for novel signs (the training uses sparse annotations for the 800 words, and the remaining 264 for test).

BSL-1K sign spotting benchmark. Although our learning framework primarily targets good performance on unseen continuous signs, it can also be naively applied to the (closed-vocabulary) sign spotting benchmark proposed by [5]. The sign spotting benchmark requires a model to localise every instance belonging to a given set of sign classes (334 in total) within long sequences of untrimmed footage. The benchmark is challenging because each sign appears infrequently (corresponding to approximately one positive instance in every 90 minutes of continuous signing). We evaluate the performance of our Watch-Read-Lookup model and achieve a score of 0.170 mAP, outperforming the previous state-of-the-art performance of 0.160 mAP [5].
Fig. 5: The effect of (a) the batch size that determines the number of negatives across sign classes and (b) the temperature hyper-parameter for the MIL-NCE loss in Watch-Lookup against mAP and R@5 (trained on the full 1064 vocab.)

4.4 Ablation study

We provide ablations for the learning hyperparameters, such as the batch size and the temperature; the mouthing confidence threshold as the training data selection parameter; and the stride parameter of the sliding window at test time.

Batch size. Next, we investigate the effect of increasing the number of negative pairs by increasing the batch size when training with Watch-Lookup on 1,064 categories. We observe in Fig. 5(a) an improvement in performance with a greater number of negatives before saturating. Our final Watch-Read-Lookup model has high memory requirements, for which we use 128 batch size. Note that the effective size of the batch with our sampling is larger due to sampling extra video clips corresponding to subtitles.

Temperature. Finally, we analyze the impact of the temperature hyperparameter τ on the performance of Watch-Lookup. We conclude from Fig. 5(b) that setting τ to values between [0.04 - 0.10] does not impact the performance significantly; therefore, we keep τ = 0.07 following the previous work [33, 63] for all other experiments. However, values outside this range negatively impact the performance, especially for high values, i.e., {0.50, 1.00}; we observe a major decrease in performance when τ approaches 1.

Mouthing confidence threshold at training. As explained in Sec. 3.1, the sparse annotations from the BSL-1K dataset are obtained automatically by running a visual keyword spotting method based on mouthing cues. The dataset provides a confidence value associated with each label ranging between 0.5 and 1.0. Similar to [5], we experiment with different thresholds to determine the training set. Lower thresholds result in a noisier but larger training set. From Tab. 4, we conclude that 0.5 mouthing confidence threshold performs the best. This is in accordance with the conclusion from [5].

Mouthing confidence | Training size | mAP   | R@5
0.9                 | 10K           | 37.55 | 47.54
0.8                 | 21K           | 39.49 | 48.84
0.7                 | 33K           | 41.87 | 51.15
0.6                 | 49K           | 42.44 | 52.42
0.5                 | 78K           | 43.65 | 53.03

Table 4: Mouthing confidence threshold: The results suggest that lower confidence automatic annotations of BSL-1K provide better training, by increasing the amount of data (training on the full 1064 vocabulary with Watch-Lookup).

Effect of the sliding window stride. As explained in Sec. 3.3, at test time, we extract features from the continuous signing sequence using a sliding window approach with 1 frame as the stride parameter. In Tab. 5, we investigate the effect of the stride parameter. Our window size is 16 frames, i.e., the number of input frames for the I3D feature extractor. A standard approach when extracting features from longer videos is to use a sliding window with 50% overlap (i.e., stride of 8 frames). However, this means the temporal resolution of the search space is reduced by a factor of 8, and a stride of 8 may skip the most discriminative moment since a sign duration is typically between 7-13 frames (but can be shorter) [48] in continuous signing video. In Tab. 5, we see that we can gain a significant localisation improvement by computing the similarities more densely, e.g., stride of 4 frames may be sufficiently dense. In our experiments, we use stride 1.

We refer to Appendix B for additional ablations.

4.5 Applications

In this section, we investigate three applications of our sign spotting method.

Sign variant identification. We show the ability of our model to spot specifically which variant of the sign was used. In Fig. 6, we observe high similarity scores when the variant of the sign matches in both BSL-1K and BslDict videos. Identifying such sign variations
Fig. 6: Sign variant identification: We plot the similarity scores between BSL-1K test clips and BslDict variants of the sign "animal" (left) and "before" (right) over time. A high similarity occurs for the first two rows, where the BslDict examples match the variant used in BSL-1K. The labelled mouthing times from [5] are shown by red vertical lines and approximate windows for signing times are shaded. Note that neither the mouthing annotations (ground truth) nor the dictionary spottings provide the duration of the sign, but only a point in time where the response is highest. The mouthing peak (red vertical line) tends to appear at the end of the sign (due to the use of LSTM in visual keyword spotter). The dictionary peak (blue curve) tends to appear in the middle of the sign.

Fig. 7: Densification: We plot the similarity scores between BSL-1K test clips and BslDict examples over time, by querying only the words in the subtitle. We visually inspect the quality of the dictionary spottings with which we obtain cases of multiple words per subtitle spotted. The predicted locations of the signs correspond to the peak similarity scores. Note that unlike in Fig. 6, we cannot overlay the ground truth since the annotations using the mouthing cues are not dense enough to provide ground truth sign locations for 3 words per subtitle.

allows a better understanding of regional differences and can potentially help standardisation efforts of BSL.

Dense annotations. We demonstrate the potential of our model to obtain dense annotations on continuous sign language video data. Sign spotting through the use of sign dictionaries is not limited to mouthings as in [5] and therefore is of great importance to scale up datasets for learning more robust sign language models. In Fig. 7, we show qualitative examples of localising multiple signs in a given sentence in BSL-1K, where we only query the words that occur in the subtitles, reducing the search space. In fact, if we assume the word to be known, we obtain 83.08% sign localisation accuracy on BSL-1K with our best model. This is defined as the number of times the maximum similarity occurs within -20/+5 frames of the end label time provided by [5].
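As a concrete illustration of the test-time procedure of Sec. 3.3 and the localisation criterion used above, the sketch below scores stride-1 sliding-window embeddings of a continuous video against every dictionary variant of the queried word and checks whether the peak falls in the accepted -20/+5 frame interval; the array shapes and names are assumptions for illustration only.

```python
# Illustrative spotting + localisation check (not the released evaluation code).
import numpy as np

def spot_sign(cont_embs, dict_variant_embs):
    """cont_embs: (T, d) L2-normalised sliding-window embeddings (stride 1).
       dict_variant_embs: (V, d) L2-normalised embeddings, one per dictionary variant.
       Returns the frame of maximum cosine similarity over all variants and its score."""
    scores = dict_variant_embs @ cont_embs.T          # (V, T) cosine similarities
    v, t = np.unravel_index(np.argmax(scores), scores.shape)
    return int(t), float(scores[v, t])

def is_correct(pred_frame, label_frame, before=20, after=5):
    """Localisation is accepted within -20/+5 frames of the labelled time."""
    return label_frame - before <= pred_frame <= label_frame + after
```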
Fig. 8: "Faux amis" in BSL/ASL: Same/similar manual features for different English words (left), as well as for the same English words (right), are identified between BslDict and WLASL isolated sign language datasets.

Stride | mAP   | R@5
16     | 31.96 | 38.98
8      | 38.46 | 47.38
4      | 44.92 | 54.65
2      | 45.39 | 55.63
1      | 43.65 | 53.03

Table 5: Stride parameter of sliding window: A small stride at test time, when extracting embeddings from the continuous signing video, allows us to temporally localise the signs more precisely. The window size is 16 frames and the typical co-articulated sign duration is 7-13 frames at 25 fps. (testing the 1064-class model trained with Watch-Lookup)

"Faux Amis". There are works investigating lexical similarities between sign languages manually [6, 52]. We show qualitatively the potential of our model to discover similarities, as well as "faux-amis" between different sign languages, in particular between British (BSL) and American (ASL) Sign Languages. We retrieve nearest neighbors according to visual embedding similarities between BslDict which has a 9K vocabulary and WLASL [39], an ASL isolated sign language dataset with a 2K vocabulary. We provide some examples in Fig. 8. We automatically identify several signs with similar manual features, some of which correspond to different meanings in English (left), as well as same meanings, such as "ball", "stand", "umbrella" (right).

4.6 Sign language recognition

As demonstrated qualitatively in Sec. 4.5, we can reliably obtain automatic annotations using our sign spotting technique when the search space is reduced to candidate words in the subtitle. A natural way to exploit our method is to apply it on the BSL-1K training set in conjunction with the weakly-aligned subtitles to collect new localised sign instances. This allows us to train a sign recognition model: in this case, to retrain the I3D architecture from [5] which was previously supervised only with signs localised through mouthings.

BSL-1K automatic annotation. Similar to our previous work using mouthing cues [5], where words in the subtitle were queried within a neighborhood around the subtitle timestamps, we query each subtitle word if it falls within a predefined set of vocabulary. In particular, we query words and phrases from the 9K BslDict vocabulary if they occur in the subtitles. To determine whether a query from the dictionary occurs in the subtitle, we apply several checks. We look for the original word or phrase as it appears in the dictionary, as well as its text-normalised form (e.g., "20" becomes "twenty"). For the subtitle, we look for its original, text-normalised, and lemmatised forms. Once we find a match between any form of the dictionary text and any form of the subtitle text, we query the dictionary video feature within the search window in the continuous video features. We use search windows of ±4 seconds padding around the subtitle timestamps. We compute the similarity between the continuous signing search window and each of the dictionary variants for a given word: we record the frame location of maximum similarity for all variants and choose the best match as the one with the highest similarity score. The final sign localisations are obtained by filtering the peak similarity scores to those above a 0.7 threshold – resulting in a vocabulary of 4K signs – and taking 32 frames centered around the peak location. Fig. 9 summarises several statistics computed over the training set. We note that sign spotting with dictionaries (D) is more effective than with mouthing (M) in terms of the yield (510K versus 169K localised signs). Since D can include duplicates from M, we further report the number of instances for which a mouthing spotting for the same keyword query exists within the same search window. We find that the majority of our D spottings represent new, not previously localised instances (see Fig. 9, right).
Fig. 9: Statistics on the yield from the automatic annotations (panels: yield for the 1K query vocabulary; yield for the 9K query vocabulary): We plot the vocabulary size (left) and the number of localised sign instances (middle) and (right) over several similarity thresholds for the new automatic annotations in the training set that we obtain through dictionaries. While we obtain a large number of localised signs (783K at 0.7 threshold) for the full 9K vocabulary, in our recognition experiments we use a subset of 510K annotations that correspond to the 1K vocabulary. To approximately quantify the amount of annotations that represent duplicates from those found through mouthing cues, we count those localisations for which the same keyword exists for mouthing annotations within the same search window. We observe that the majority of the annotations are new (783K vs 122K).

BSL-1K sign recognition benchmark. We use the BSL-1K manually verified recognition test set with 2K samples [5], which we denote with Test^Rec_2K, and significantly extend it to 37K samples as Test^Rec_37K. We do this by (a) running our dictionary-based sign spotting technique on the BSL-1K test set and (b) verifying the predicted sign instances with human annotators using the VIA tool [25] as in [5]. Our goal in keeping these two divisions is three-fold: (i) Test^Rec_2K is the result of annotating "mouthing" spottings above 0.9 confidence, which means the models can largely rely on mouthing cues to recognise the signs. The new Test^Rec_37K annotations have both "mouthing" (10K) and "dictionary" (27K) spottings. The dictionary annotations are the result of annotating dictionary spottings above 0.7 confidence from this work; therefore, models are required to recognise the signs even in the absence of mouthing, reducing the bias towards signs with easily spotted mouthing cues. (ii) Test^Rec_37K spans a much larger fraction of the training vocabulary as seen in Tab. 1, with 950 out of 1,064 sign classes (vs only 334 classes in the original benchmark Test^Rec_2K of [5]). (iii) We wish to maintain direct comparison to our previous work [5]; therefore, we report on both sets in this work.

(Test^Rec_2K [5]: 2K instances / 334 classes; Test^Rec_37K: 37K instances / 950 classes)
Training | #ann. | Test^Rec_2K per-inst. top-1/top-5 | Test^Rec_2K per-class top-1/top-5 | Test^Rec_37K per-inst. top-1/top-5 | Test^Rec_37K per-class top-1/top-5
M [5]§   | 169K  | 76.6 / 89.2                       | 54.6 / 71.8                       | 26.4 / 41.3                        | 19.4 / 33.2
D        | 510K  | 70.8 / 84.9                       | 52.7 / 68.1                       | 60.9 / 80.3                        | 34.7 / 53.5
M+D      | 678K  | 80.8 / 92.1                       | 60.5 / 79.9                       | 62.3 / 81.3                        | 40.2 / 60.1

Table 6: An improved I3D sign recognition model: We find signs via automatic dictionary spotting (D), significantly expanding the training and testing data obtained from mouthing cues by [5] (M). We also significantly expand the test set by manually verifying these new automatic annotations from the test partition (Test^Rec_2K vs Test^Rec_37K). By training on the extended M+D data, we obtain state-of-the-art results, outperforming the previous work of [5]. §The slight improvement in the performance of [5] over the original results reported in that work is due to our denser test-time averaging when applying sliding windows (8-frame vs 1-frame stride).

Comparison to prior work. In Tab. 6, we compare three I3D models trained on mouthing annotations (M), dictionary annotations (D), and their combination (M+D). First, we observe that the D-only model significantly outperforms the M-only model on Test^Rec_37K (60.9% vs 26.4%), while resulting in lower performance on Test^Rec_2K (70.8% vs 76.6%). This may be due to the strong bias towards mouthing cues in the small test set Test^Rec_2K. Second, the benefits of combining annotations from both can be seen in the sign classifier trained using 678K automatic annotations. This obtains state-of-the-art performance on Test^Rec_2K, as well as the more challenging test set Test^Rec_37K. All three models in the table (M, D, M+D) are pretrained on Kinetics [16], followed by video pose distillation as described in [5]. We observed no improve-