SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding

Yu-An Chung1*, Chenguang Zhu2*, Michael Zeng2
1 MIT Computer Science and Artificial Intelligence Laboratory
2 Microsoft Cognitive Services Group
andyyuan@mit.edu, {chezhu,nzeng}@microsoft.com

* Equal contribution. The work was done when Yu-An Chung was interning at Microsoft.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1897–1907, June 6–11, 2021. ©2021 Association for Computational Linguistics

Abstract

Spoken language understanding (SLU) requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.

1 Introduction

Spoken language understanding (SLU) tackles the problem of comprehending audio signals and making predictions related to the content. SLU has been widely employed in various areas such as intent understanding (Tur and De Mori, 2011; Bhargava et al., 2013; Ravuri and Stolcke, 2015; Lugosch et al., 2019), question answering (Lee et al., 2018; Chuang et al., 2020), and sentiment analysis (Zadeh et al., 2018). Early approaches leverage a two-step pipeline: use automatic speech recognition (ASR) to transcribe input audio into text, and then employ language understanding models to produce results. However, such a cascaded system has several drawbacks. First, the transcription produced by the ASR module often contains errors, which adversely affects the language understanding module's prediction accuracy. Second, even if the transcription is perfect, the rich prosodic information of speech (e.g., tempo, pitch, and intonation) is inevitably lost after ASR. In comparison, humans often leverage this information to better understand and disambiguate the content. Therefore, there has been a rising trend of end-to-end approaches that retain information from audio signals to carry out the understanding task (Serdyuk et al., 2018; Chen et al., 2018; Haghani et al., 2018).

While end-to-end SLU methods are effective, they often suffer from a shortage of labeled training data, especially when the target task is in a novel domain. One solution is to leverage self-supervised training as is done in pre-trained language models. Examples like BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and RoBERTa (Liu et al., 2019) are first pre-trained on large-scale unannotated text in a self-supervised fashion to learn rich textual representations before being fine-tuned on downstream tasks with a modest amount of labeled data. Borrowing this idea, several pre-training methods have been proposed for speech, e.g., wav2vec (Schneider et al., 2019; Baevski et al., 2020a), contrastive predictive coding (Oord et al., 2018; Rivière et al., 2020), autoregressive predictive coding (Chung et al., 2019a, 2020; Chung and Glass, 2020b), and DeCoAR (Ling et al., 2020; Ling and Liu, 2020), to capture contextual representations from unlabeled speech data. Nevertheless, these methods leverage only acoustic data and mainly focus on modeling the acoustic information during pre-training. As a result, the produced representations may not be optimal for language understanding tasks.

To solve these problems, we propose a novel SPeech-LAnguage joint pre-Training framework,
SPLAT. SPLAT contains a speech module and a language module for multi-modal understanding. The speech module is a Transformer encoder trained from scratch and the language module is initialized from BERT. Both modules leverage large-scale unannotated data for pre-training via masked language modeling. In the speech module, each frame is seen as a token and is replaced with a zero vector with a certain probability. For each masked frame, we minimize the L1-distance between the predicted frame and the original frame.

Figure 1: Overview of SPLAT. First, the speech and language modules are separately pre-trained using speech and text data via masked language modeling (MLM). In practice, we directly employ the BERT$_{\text{BASE}}$ model released by Devlin et al. (2019) to be the language module. Then, by leveraging a small amount of paired speech and text data, either a sequence-level alignment loss $\mathcal{L}_{\text{seq}}$ or a token-level alignment loss $\mathcal{L}_{\text{tok}}$ is applied to align the representations from both modules in a shared latent space (only $\mathcal{L}_{\text{seq}}$ is shown here). During alignment, the language module is kept frozen and only the speech module is updated. Before aligning the two modules, there is an optional step to update the BERT$_{\text{BASE}}$-initialized language module via MLM using the text portion from the paired data. This optional step aims to adapt the language module to the speech domain to facilitate later alignment. After pre-training, the language module is discarded and only the speech module is used in downstream tasks.

Then, to make the speech module aware of the contextual information extracted from the language module, we design an alignment loss to align the representations from both modules in a shared latent semantic space. In detail, we propose two alignment methods, a sequence-level one and a token-level one, that leverage a small amount of paired speech and text to minimize the disparity between the acoustic representations from the speech module and the textual representations from the language module. In this way, the speech representations will carry not only the acoustic information but also the contextual knowledge from the text. After this alignment, when text input is absent during fine-tuning, the speech module alone can produce representations that bridge the speech input and the language understanding output.

We conduct extensive evaluations on several downstream SLU tasks, including Fluent Speech Commands for intent detection, Switchboard for dialog act classification, CMU-MOSEI for spoken sentiment analysis, and Spoken SQuAD for spoken question answering. SPLAT achieves superior results on all datasets. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%. Furthermore, we show that SPLAT can perform well even given just a tiny portion of the labeled training data in downstream tasks.

2 Related Work

Spoken language understanding. In recent years, due to its flexibility and effectiveness, end-to-end spoken language understanding (SLU) has been proposed and applied to various tasks (Qian et al., 2017; Serdyuk et al., 2018; Lugosch et al., 2019). For instance, Qian et al. (2017) use an auto-encoder to initialize the SLU model. Lugosch et al. (2019) pre-train the model to recognize words and
phonemes, and then fine-tune it on downstream tasks. Chen et al. (2018) pre-train the model to categorize graphemes, and the logits are fed into the classifier. In most of these approaches, the model pre-training requires annotated speech, e.g., words or phonemes corresponding to audio signals. As a result, the massive unlabeled speech data cannot be utilized by these models.

Self-supervised pre-training for language. Pre-trained models have achieved great success in both language and speech domains. In language, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), UniLM (Dong et al., 2019), and BART (Lewis et al., 2020) have been successfully applied to natural language inference (Zhang et al., 2020b), question answering (Zhu et al., 2018), and summarization (Zhu et al., 2019). These pre-trained models leverage self-supervised tasks such as masked language modeling (MLM), next sentence prediction, and de-noising autoencoding.

Self-supervised pre-training for speech. In speech, wav2vec (Schneider et al., 2019) leverages contrastive learning to produce contextual representations for audio input; vq-wav2vec (Baevski et al., 2020a) and wav2vec 2.0 (Baevski et al., 2020b) further propose to discretize the original continuous audio signals in order to enable more efficient MLM training with Transformer (Vaswani et al., 2017). Pre-trained speech models have been applied to ASR (Ling et al., 2020; Chung and Glass, 2020a; Baevski et al., 2020b), phoneme recognition (Song et al., 2020; Liu et al., 2020a), speech translation (Nguyen et al., 2020; Chung et al., 2019c), and speech synthesis (Chung et al., 2019b), to name a few.

Nevertheless, an SLU model must incorporate both acoustic and language understanding capabilities to project speech signals to semantic outputs. Thus, a pre-trained model for SLU needs to address tasks beyond a single modality.

Speech and language joint pre-training. Recently, SLU applications have prompted joint pre-training on both speech and text data. SpeechBERT (Chuang et al., 2020) applies MLM to pairs of audio and transcripts. However, there are several crucial differences compared to our work. First, SpeechBERT contains a phonetic-semantic embedding module that requires forced alignment to first segment speech into word segments. Second, both the pre-training and fine-tuning phases of SpeechBERT require both speech and text input, since it is designed for a specific spoken question answering task. However, many SLU tasks only take speech as input, which does not align with the design of SpeechBERT. In contrast, our model can learn to align acoustic and textual representations using just (a small amount of) paired data during pre-training, and only needs speech input for downstream tasks.

Denisov and Vu (2020) propose to align speech and language embeddings in a method similar to ours. However, there are several key differences. First, Denisov and Vu (2020) employ the encoder of a pre-trained ASR model, which already requires plenty of annotated speech to obtain. Our model, on the other hand, conducts self-supervised learning to pre-train the speech module using unannotated speech. Secondly, besides sequence-level alignment, we propose a token-level alignment method, which is suitable for token-level downstream tasks. Last but not least, our model uses a much smaller amount of paired speech and text for alignment (10 hours) than Denisov and Vu (2020) (1,453 hours), yet still largely outperforms their method in intent detection and dialog act classification.

3 Method

In this section we present SPLAT, a framework for learning joint contextual representations of speech and language. The model consists of a speech module and a language module that share a similar architecture and learning algorithm. The pre-training of SPLAT is divided into two steps. First, we individually pre-train the speech and language modules using unannotated speech and text, respectively. Then, we leverage a simple yet effective alignment task that uses only a small amount of paired speech and text data to align the representations from both modules in a shared latent semantic space such that the information learned by the language module is transferred to the speech module. After pre-training, the language module is discarded and only the speech module is used in downstream tasks.

Below we formally describe the procedures for pre-training the speech (§3.1) and language modules (§3.2), and the alignment loss (§3.3) for aligning the representations from the two modules. Figure 1 provides an overview of the pre-training procedures of SPLAT.
3.1 Speech module pre-training

The goal of this module is to leverage unlabeled speech data to learn representations that capture meaningful acoustic information about speech utterances such as their phonetic content and speaker characteristics. Formally, the input to the speech module is an 80-dimensional log Mel spectrogram, $(x_1, ..., x_n)$, where $x_i \in \mathbb{R}^{80}$, $1 \le i \le n$. The speech module, which is implemented as a Transformer architecture, then produces hidden representations $(s_1, ..., s_n)$ and predictions $(\hat{x}_1, ..., \hat{x}_n)$, where $s_i \in \mathbb{R}^{768}$ and $\hat{x}_i \in \mathbb{R}^{80}$.

To boost its capacity for contextual understanding, we borrow the idea of masked language modeling (MLM) (Devlin et al., 2019; Liu et al., 2020c; Wang et al., 2020; Liu et al., 2020b). Specifically, each audio frame $x_i$ is replaced with a zero vector with a probability of 15%. The corresponding output $\hat{x}_i$ is trained to be close to the original frame $x_i$ via minimizing their L1-distance. Additionally, since consecutive frames are highly correlated, it is possible that the model simply utilizes the local smoothness of speech signals for reconstructing a single frame and thus fails to capture useful information. To avoid this issue, when a frame $x_i$ is selected to be masked, its following three frames $x_{i+1}$, $x_{i+2}$, and $x_{i+3}$ are also masked, and the model is asked to reconstruct all these masked frames.

Furthermore, according to SpecAugment (Park et al., 2019), the input features $(x_1, ..., x_n)$ can be seen as comprising two dimensions: time, i.e., the subscript $i$, and channel, i.e., the elements in each $x_i$. While conventional MLM masks along certain time steps, the input signals can also be masked along the channel dimension. In other words, each column vector $[x_{1,j}, ..., x_{n,j}]$ for $1 \le j \le 80$ has a 15% chance of being masked, i.e., replaced with a zero vector. This channel masking is combined with temporal masking to reinforce the model's capability to utilize contextual information from both time and channel, and to reduce the impact of co-adaptation between acoustic frames. The final pre-training objective for the speech module is to reconstruct the entire input sequence from the altered version of it:

$$\mathcal{L}_{\text{sp}} = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert_1 \tag{1}$$

We use the speech portion of the train-clean-360 subset from the LibriSpeech corpus (Panayotov et al., 2015) to pre-train the speech module, i.e., to minimize $\mathcal{L}_{\text{sp}}$. This subset contains 360 hours of read speech produced by 921 speakers. We follow the standard Kaldi setting, using a frame size of 25 ms and a time shift of 10 ms for generating the 80-dimensional log Mel spectrograms. The spectrograms are normalized to zero mean and unit variance per speaker.

3.2 Language module pre-training

The language module aims to offer contextual understanding for text input. We directly employ the BERT$_{\text{BASE}}$ model released by Devlin et al. (2019), which is pre-trained on a large text corpus with the MLM task and contains rich textual representations, as the language module. We denote the cross-entropy loss for the language MLM task as $\mathcal{L}_{\text{text}}$. Given input token embeddings $(y_1, ..., y_m)$, where $y_1$ corresponds to the [CLS] token, the module produces contextual representations $(t_1, ..., t_m)$, where $t_j \in \mathbb{R}^{768}$, $1 \le j \le m$.

3.3 Aligning speech and language representations

The input to most SLU tasks consists of only audio signals, but the model is required to conduct semantic understanding, which can be best handled when

[Figure 2 shows three stages for the example transcript "What day is today?": pairwise cosine similarity between speech and language embeddings, maximum similarity per text token, and loss computation. The similarity matrix in the figure is:

        s1    s2    s3    s4    s5
t1      0.7   0.6   0.4   0.4   0.3
t2      0.5   0.4   0.5   0.7   0.3
t3      0.6   0.9   0.4   0.4   0.6
t4      0.5   0.3   0.9   0.5   0.2
]

Figure 2: Token-level alignment between speech and language modules. $(s_1, ..., s_5)$ are the output embeddings of the speech module and $(t_1, ..., t_4)$ are those of the language module.
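The masking scheme and reconstruction objective of §3.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `mask_spectrogram` and `l1_reconstruction_loss` are hypothetical, each frame is assumed to be selected by an independent draw, and a masked span that would run past the end of the utterance is assumed to be truncated.

```python
import random


def mask_spectrogram(frames, p_time=0.15, span=4, p_channel=0.15, rng=None):
    """Zero out time spans and channels of a spectrogram (list of frames).

    Assumption: each frame is selected independently with probability
    p_time; a selected frame and its following span - 1 frames are masked.
    Each channel (column) is masked independently with probability p_channel.
    """
    if rng is None:
        rng = random.Random()
    n = len(frames)
    dim = len(frames[0]) if n else 0

    # Temporal masking: frame i plus its following three frames (span = 4).
    time_mask = [False] * n
    for i in range(n):
        if rng.random() < p_time:
            for k in range(i, min(i + span, n)):
                time_mask[k] = True

    # Channel masking: whole columns are replaced with zeros.
    channel_mask = [rng.random() < p_channel for _ in range(dim)]

    masked = [
        [0.0 if (time_mask[i] or channel_mask[j]) else frames[i][j]
         for j in range(dim)]
        for i in range(n)
    ]
    return masked, time_mask, channel_mask


def l1_reconstruction_loss(frames, predictions):
    """Equation (1): sum of L1 distances over all frames of the sequence."""
    return sum(
        abs(x - x_hat)
        for frame, pred in zip(frames, predictions)
        for x, x_hat in zip(frame, pred)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy input: 20 frames of 80 values each (stand-in for log Mel features).
    toy = [[rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(20)]
    masked, time_mask, channel_mask = mask_spectrogram(toy, rng=rng)
    print(sum(time_mask), "of", len(toy), "frames masked in time")
    print(sum(channel_mask), "of 80 channels masked")
```

Because every selected frame drags three neighbours with it, the fraction of frames that end up masked is noticeably higher than the 15% selection probability. The loss follows Equation (1) in summing over the entire sequence rather than over the masked positions only.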
textual information is present. Therefore, we propose to align the pre-trained speech and language representations in a shared semantic latent space.

Suppose a pair of speech and text data consists of an acoustic feature sequence $(x_1, ..., x_n)$ and its transcript $(y_1, ..., y_m)$. The speech and language modules separately produce the output representations $(s_1, ..., s_n)$ and $(t_1, ..., t_m)$. We then propose two methods to align the embeddings from the modules: sequence-level and token-level alignment.

Sequence-level alignment. For sequence-level alignment, we treat the first embeddings from the two output representations, i.e., $s_1$ and $t_1$, as the sequence-level representations of their respective sequences, and minimize their L1-distance:

$$\mathcal{L}_{\text{seq}} = \lVert s_1 - t_1 \rVert_1 \tag{2}$$

Since our goal is to transfer the textual knowledge contained in the language module to the speech module, we only update the speech module to minimize $\mathcal{L}_{\text{seq}}$ and keep the language module fixed.

After pre-training, when the transcript is absent in downstream tasks, the first output embedding of the speech module $s_1$ will still be close to its corresponding text embedding $t_1$ from the language module, as if the transcript were given. It follows that $s_1$ can then be used to predict a property of the whole audio input, e.g., its intent.

Token-level alignment. To achieve a finer level of alignment, each audio feature should be compared with its corresponding text token. Although forced alignment (Gorman et al., 2011) can establish this correspondence between audio signals and individual words, it requires a pre-trained ASR system. Here we propose a method that automatically aligns audio features with textual tokens.

Inspired by BERTScore (Zhang et al., 2020a), for each output text embedding $t_j$, we first compute its cosine similarity with each output acoustic embedding $s_i$, and select the acoustic feature with the highest similarity. Then, the alignment is performed by maximizing the sum of these maximum similarities over all tokens, weighted by each token's inverse document frequency (idf) to reduce the impact of common words, i.e., by minimizing:

$$\mathcal{L}_{\text{tok}} = -\frac{\sum_{j=1}^{m} \text{idf}(t_j)\, \max_{i} \text{cossim}(s_i, t_j)}{\sum_{j=1}^{m} \text{idf}(t_j)} \tag{3}$$

The token-level alignment loss is illustrated in Figure 2. As with $\mathcal{L}_{\text{seq}}$, when minimizing $\mathcal{L}_{\text{tok}}$, the language module is kept fixed and only the speech module is updated.

To minimize the alignment loss, we randomly sample 10 hours of audio paired with its transcripts from the train-clean-360 subset, of which the speech portion is used to pre-train the speech module (§3.1). In practice, before minimizing the alignment loss, we find it beneficial to train (i.e., minimize $\mathcal{L}_{\text{text}}$) the language module initialized with BERT$_{\text{BASE}}$ on the 10-hour LibriSpeech transcripts with the MLM task. This step allows the model to adapt to the speech domain and facilitates the following alignment task.

We summarize the complete procedure of pre-training SPLAT in Algorithm 1. After pre-training, the language module is discarded and only the speech module is used in downstream tasks.

4 Experiment Setup

4.1 Baselines

We include a number of strong baselines from recent literature for each downstream task (Lugosch et al., 2019; Duran and Battle, 2018; Ghosal et al., 2018; Chuang et al., 2020). We also compare with another speech-language joint pre-training framework (Denisov and Vu, 2020). For each baseline, the reported performance is achieved by a system that uses a similar or larger amount of data than our model.

To verify the effectiveness of each component in SPLAT, we experiment with the following variants of it, including whether to pre-train the model,

Algorithm 1 Pre-training SPLAT
Input: An unlabeled speech corpus $X = \{x^{(p)}\}_{p=1}^{N}$, an unlabeled text corpus $Y = \{y^{(q)}\}_{q=1}^{M}$, and a paired speech-text corpus $Z = \{(x^{(k)}, y^{(k)})\}_{k=1}^{K}$, where $K \ll N, M$.
1: Use $X$ to train the speech module by minimizing $\mathcal{L}_{\text{sp}}$ (Equation 1).
2: Use $Y$ to train the language module by minimizing $\mathcal{L}_{\text{text}}$ (we directly employ BERT$_{\text{BASE}}$ from Devlin et al. (2019) for this step).
3: Use $\{y^{(k)}\}_{k=1}^{K}$ from $Z$ to train the language module by minimizing $\mathcal{L}_{\text{text}}$.
4: Use $Z$ to align the two modules by minimizing $\mathcal{L}_{\text{seq}}$ (Equation 2) or $\mathcal{L}_{\text{tok}}$ (Equation 3).
5: Discard the language module.
Output: The final speech module.
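The two alignment losses used in step 4 of Algorithm 1 can be sketched in plain Python as below. This is an illustrative sketch, not the authors' code: the function names are hypothetical, embeddings are plain lists of floats, and the idf weights in the worked example are assumed uniform because the paper does not list them. The similarity values are the ones shown in Figure 2, with rows for $t_1, ..., t_4$ and columns for $s_1, ..., s_5$.

```python
import math


def l1_distance(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sequence_alignment_loss(speech_emb, text_emb):
    """Equation (2): L1 distance between the first embeddings s_1 and t_1."""
    return l1_distance(speech_emb[0], text_emb[0])


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_loss_from_similarities(sim, idf):
    """Equation (3), given sim[j][i] = cossim(s_i, t_j) and idf[j] = idf(t_j)."""
    weighted = sum(w * max(row) for w, row in zip(idf, sim))
    return -weighted / sum(idf)


def token_alignment_loss(speech_emb, text_emb, idf):
    """Equation (3) computed from raw speech and text embeddings."""
    sim = [[cosine_similarity(s, t) for s in speech_emb] for t in text_emb]
    return token_loss_from_similarities(sim, idf)


# Similarity matrix from Figure 2 (rows t_1..t_4, columns s_1..s_5).
FIGURE2_SIM = [
    [0.7, 0.6, 0.4, 0.4, 0.3],
    [0.5, 0.4, 0.5, 0.7, 0.3],
    [0.6, 0.9, 0.4, 0.4, 0.6],
    [0.5, 0.3, 0.9, 0.5, 0.2],
]

if __name__ == "__main__":
    uniform_idf = [1.0] * len(FIGURE2_SIM)  # assumption: equal token weights
    print(token_loss_from_similarities(FIGURE2_SIM, uniform_idf))
```

With uniform weights, the row maxima are 0.7, 0.7, 0.9, and 0.9, so the loss is $-(0.7 + 0.7 + 0.9 + 0.9)/4 = -0.8$ up to floating-point rounding. In actual training only the speech embeddings would receive gradients, since the language module is kept fixed.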
whether to use the language module, and which alignment task to apply. Table 1 summarizes the considered model variants.

Table 1: Variants of SPLAT. An ✗ indicates that the variant does not incorporate this step during pre-training. The step numbers correspond to those listed in Algorithm 1. Step 1: pre-train speech module. Step 2: pre-train language module. Step 3: adapt language module before alignment. Step 4: type of alignment loss.

Model variant     Step 1   Step 2   Step 3   Step 4
SPLAT-Scratch     ✗        ✗        ✗        ✗
SPLAT-Speech      ✓        ✗        ✗        ✗
SPLAT-Seq         ✓        ✓        ✗        L_seq
SPLAT-Seq-MLM     ✓        ✓        ✓        L_seq
SPLAT-Tok         ✓        ✓        ✗        L_tok
SPLAT-Tok-MLM     ✓        ✓        ✓        L_tok

• SPLAT-Scratch: No pre-training is conducted at all. The speech module is trained from scratch on downstream tasks.
• SPLAT-Speech: Only the speech module is pre-trained. The language module and alignment loss are not incorporated.
• SPLAT-Seq: SPLAT with sequence-level alignment loss $\mathcal{L}_{\text{seq}}$, but the language module is not trained on LibriSpeech transcripts with MLM before alignment.
• SPLAT-Seq-MLM: SPLAT with sequence-level alignment loss $\mathcal{L}_{\text{seq}}$, and the language module is trained on LibriSpeech transcripts with MLM before alignment.
• SPLAT-Tok: SPLAT with token-level alignment loss $\mathcal{L}_{\text{tok}}$, but the language module is not trained on LibriSpeech transcripts with MLM before alignment.
• SPLAT-Tok-MLM: SPLAT with token-level alignment loss $\mathcal{L}_{\text{tok}}$, and the language module is trained on LibriSpeech transcripts with MLM before alignment.

The speech module of SPLAT is a 3-layer Transformer encoder where each layer has a hidden size of 768 and 12 self-attention heads. The language module is directly initialized from the pre-trained BERT$_{\text{BASE}}$ released by Devlin et al. (2019).

4.2 Downstream SLU Tasks

We evaluate our model on four different SLU applications: intent detection, dialog act classification, spoken sentiment analysis, and spoken question answering. The first three are multi-class classification tasks, and the last one is a span prediction problem, which will be described in more detail below. Table 2 summarizes the dataset used for each application. For all datasets, we use 80-dimensional log Mel spectrograms as input acoustic features, as in the pre-training stage.

Intent detection. We use the Fluent Speech Commands corpus (FSC) (Lugosch et al., 2019) for intent detection, where the goal is to correctly predict the intent of an input utterance. In this dataset, each utterance is annotated with three slots: action, object, and location, where each slot can take one of multiple values. The combination of slot values is defined as the intent of the utterance, and there are 31 unique intents in total. In this work we follow the original paper and formulate intent detection as a simple 31-class classification task.

Dialog act classification. We use the NXT-format Switchboard corpus (SwDA, denoted SwBD in Tables 2 and 3) (Calhoun et al., 2010), a dialog corpus of 2-speaker conversations. The goal is to correctly classify an input

Table 2: Summary of SLU datasets. In the Train/val/test row, the numbers indicate the number of utterances in each split.

Task              Intent detection   Dialog act classification   Spoken sentiment analysis   Spoken question answering
Dataset           FSC                SwBD                        CMU-MOSEI                   Spoken SQuAD
Num. of classes   31                 42                          7                           -
Train/val/test    23.1k/3.1k/3.8k    97.8k/8.6k/2.5k             16.2k/1.8k/4.6k             35.1k/2.0k/5.4k
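The FSC formulation in §4.2, where each distinct combination of the three slot values becomes one class, can be illustrated with a short sketch. The helper `build_intent_index` and the slot triples below are hypothetical and serve only to show the mapping; on the full corpus the same procedure would yield the 31 intents reported in Table 2.

```python
def build_intent_index(slot_triples):
    """Assign one class id to each distinct (action, object, location) triple."""
    intents = sorted(set(slot_triples))
    return {intent: class_id for class_id, intent in enumerate(intents)}


# Hypothetical annotations, for illustration only.
annotations = [
    ("activate", "lights", "kitchen"),
    ("deactivate", "lights", "kitchen"),
    ("activate", "lights", "kitchen"),
    ("increase", "volume", "none"),
]

intent_index = build_intent_index(annotations)
labels = [intent_index[triple] for triple in annotations]

if __name__ == "__main__":
    print(len(intent_index), "distinct intents")
    print(labels)
```

Treating the triple as a single label keeps the task a flat classification problem, which matches the single MLP classifier used for fine-tuning, at the cost of not sharing information between intents that differ in only one slot.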
utterance into one of the 42 dialog acts.

Spoken sentiment analysis. We use the CMU-MOSEI dataset (Zadeh et al., 2018), where each utterance is annotated with a sentiment score on a $[-3, 3]$ Likert scale: [-3: highly negative, -2: negative, -1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive]. We treat the task as a 7-class classification problem, and we use only the audio signals in the input data.

For the above three tasks, during fine-tuning, an MLP network with one hidden layer of 512 units is appended on top of the speech module. It converts the output representation of the first frame, i.e., $s_1$, into a class prediction. Both the pre-trained speech module and the randomly initialized MLP are fine-tuned on the training set for 10 epochs with a batch size of 64 and a fixed learning rate of 3e-4. We compute classification accuracy after each training epoch and pick the best-performing checkpoint on the validation set to report results on the test set.

Spoken question answering. We use the Spoken SQuAD dataset (Li et al., 2018), which is augmented from SQuAD (Rajpurkar et al., 2016) for spoken question answering (footnote 1). The model is given an article in the form of speech and a question in the form of text. The goal is to predict a time span in the spoken article that answers the question. In other words, the model outputs an audio segment extracted from the spoken article as the answer. The model is evaluated by Audio Overlapping Score (AOS) (Li et al., 2018): the greater the overlap between the predicted span and the ground-truth answer span, the higher the score will be.

Footnote 1: Li et al. (2018) used Google text-to-speech to generate the spoken version of the articles in SQuAD.

During fine-tuning, given a spoken article and a question in text form, the pre-trained speech module extracts audio representations of the article and passes them to a randomly initialized 3-layer Transformer encoder along with the tokenized textual question as input. The Transformer then uses the self-attention mechanism to implicitly align elements of the input audio and textual features. For each time step of the audio input, the Transformer is trained to predict whether this is the start of the span with a simple logistic regression. A separate classifier is used for predicting the end of the span.

5 Results and Analysis

5.1 Main results

Table 3 shows the performance of models on all four downstream tasks. Each number from our model is an average over three runs. Based on the results, we make the following observations.

Table 3: Results on all downstream datasets. All numbers of our models are an average of three runs, of which variances are negligibly small and not included. The metric is classification accuracy for FSC, SwBD and CMU-MOSEI. The metric for Spoken SQuAD is Audio Overlapping Score (AOS).

Model                      FSC    SwBD   CMU-MOSEI   Spoken SQuAD
Ours
SPLAT-Scratch              97.6   65.8   68.8        30.4
SPLAT-Speech               99.5   67.5   69.0        57.7
SPLAT-Seq                  99.5   74.6   72.5        62.7
SPLAT-Seq-MLM              99.5   76.3   74.7        65.9
SPLAT-Tok                  99.2   71.2   70.4        58.0
SPLAT-Tok-MLM              99.2   72.7   71.2        63.8
SPLAT-Seq-MLM 1-hour       99.5   75.8   65.3        65.3
Baselines
Lugosch et al. (2019)      98.8   -      -           -
Duran and Battle (2018)    -      75.5   -           -
Ghosal et al. (2018)       -      -      75.9        -
Chuang et al. (2020)       -      -      -           59.7
Denisov and Vu (2020)      95.5   60.2   -           -

Firstly, compared with SPLAT-Scratch, all pre-trained models achieve superior results, with gains of 27.3 to 35.5 points on Spoken SQuAD, demonstrating the effectiveness of pre-training.

Secondly, the inclusion of the language module and the alignment task during pre-training is very beneficial. For instance, on CMU-MOSEI, SPLAT-Seq-MLM outperforms SPLAT-Speech by 5.7%, and outperforms several baseline systems from recent literature. We argue that as SLU tasks require the model to interpret acoustic signals and their underlying semantics, the language module will guide the speech module towards a mutual understanding of both modalities via our alignment task.

Thirdly, updating the language module using MLM during pre-training is helpful. Although the language module has been initialized with BERT, adaptation to the speech domain can help with semantic understanding in the downstream task.

Types of alignment. Comparing SPLAT-Seq against SPLAT-Tok, we find that sequence-level alignment outperforms token-level alignment on all four tasks, although the latter is supposed to learn more fine-grained multi-modal representations. We leave the investigation of the reasons for this phenomenon and of more advanced token-level alignment approaches for future work.

Low-resource scenario. We experiment with a version of SPLAT that uses only 1 hour of transcribed speech randomly sampled from the LibriSpeech train-clean-360 subset for aligning speech and language modules, denoted as SPLAT-Seq-MLM 1-hour. The language module of SPLAT-Seq-MLM 1-hour, after being initialized with BERT$_{\text{BASE}}$, is trained on the 1-hour LibriSpeech transcripts before minimizing the alignment loss. It achieves comparable results with the best variant SPLAT-Seq-MLM: same accuracy on FSC, 0.5% less on SwBD, and 0.6% less on Spoken SQuAD. This shows that with a small amount of labeled speech data, our pre-training framework can achieve good results on downstream tasks.

5.2 Robustness to the size of downstream training data

As human labeling is time-consuming and labor-intensive, the amount of labeled training data for downstream tasks is often small and insufficient. In this section, we show that with effective pre-training, the model will be less dependent on the amount of downstream labeled data.

We randomly sample 50%, 10%, 5%, and 1% of the training data in the downstream tasks, and evaluate the performance of different variants of SPLAT when fine-tuned on the sampled data.

[Figure 3 has four panels (FSC, MOSEI, Spoken SQuAD, SwBD), each plotting Accuracy against Training data size (%) for SPLAT-Seq-MLM, SPLAT-Speech, and SPLAT-Scratch.]

Figure 3: Performance on downstream tasks with varying training data sizes. All numbers are an average of three runs, of which variances are negligibly small and not included.

Figure 3 shows the performance on all four downstream tasks with varying training data sizes. We observe that among the variants, SPLAT-Seq-MLM is least sensitive to training data sizes. For instance, in FSC, with only 10% of the training data, its accuracy only drops 0.4 points. In comparison, both SPLAT-Scratch and SPLAT-Speech drop about 10 points. The gaps are in general larger when the size of training data further shrinks. Therefore, our proposed joint pre-training of speech and language modules can help the model quickly adapt to downstream tasks given a modest amount of training data.

5.3 The geometry of the speech latent space before and after alignment

So far we have empirically demonstrated the effectiveness of SPLAT for learning multi-modal speech-language representations that are useful in various SLU tasks. Here we further show that our sequence-level alignment loss (Equation 2) can help project two speech utterances that have similar textual embeddings to nearby points in the speech latent space.

Recall that we use the embedding of the first token/feature to represent an utterance and conduct sequence-level alignment (Equation 2). Sup-
