MULTIMODAL TRANSFORMERS FOR UTTERANCE-LEVEL CODE-SWITCHING DETECTION
Krishna D N
Freshworks Inc.

arXiv:2011.02132v1 [eess.AS] 4 Nov 2020

ABSTRACT

An utterance that contains speech from multiple languages is known as a code-switched sentence. In this work, we propose a novel technique to predict whether a given audio recording is monolingual or code-switched. We propose a multi-modal learning approach that utilizes phoneme information along with audio features for code-switch detection. Our model consists of a Phoneme Network (PN), which processes the phoneme sequence, and an Audio Network (AN), which processes the MFCC features. We fuse the representations learned by both networks to predict whether the utterance is code-switched. The Audio Network and Phoneme Network consist of initial convolutional, Bi-LSTM, and transformer encoder layers. The transformer encoder layer uses self-attention to select important and relevant features for better classification. We show that utilizing the phoneme sequence of the utterance along with the MFCC features improves code-switch detection performance significantly. We train and evaluate our model on Microsoft's code-switching challenge dataset for the Telugu, Tamil, and Gujarati languages. Our experiments show that the multi-modal learning approach significantly improves accuracy over the uni-modal approaches for the Telugu-English, Gujarati-English, and Tamil-English datasets. We also study the system performance with different neural layers and show that the transformers help obtain better performance.

Index Terms— multimodal learning, transformers, code-switching
1. INTRODUCTION

A continuous change between two or more languages within a single utterance is known as code-switching [1]. Code-switching detection is the task of identifying whether an utterance contains multiple languages. Low-resource languages such as Indian languages most commonly code-switch with English. Code-switched utterances usually cause problems for automatic speech recognition systems that are built using monolingual data only, and many researchers in the speech community are therefore developing systems that can handle code-switched speech. Robust acoustic modeling using bilingual deep neural networks is used in [2] for the Frisian code-switched automatic speech recognition task. [3] proposes a shared acoustic model between the code-switched languages to obtain a robust German speech recognition system. [4] proposes a one-pass recognition strategy and shows that it can replace the classical multi-pass technique of language identification, language boundary detection, and language-independent speech recognition. Recently, [5] proposed an end-to-end neural network framework based on an attention-CTC seq2seq model that jointly predicts the language label and the CTC labels for robust code-switched speech recognition in Chinese. Language modeling techniques have also been used for code-switched speech recognition [7], and [6] proposed a recurrent neural network that predicts code-switches from textual features such as part of speech. Language identification is one of the main components of code-switch identification and code-switched speech recognition, and it has been shown to provide good improvements in system performance [8, 9, 10, 11]. Several works [1, 2, 3, 4] have proposed different methods to detect code-switching within a single conversation or utterance. Multilingual training of DNN-based speech recognition systems has shown improvements for low-resource languages [13, 14, 15, 16]. Various other techniques, such as multilingual DNNs [12], latent space language models [13], untranscribed data [14], and data-augmented acoustic and language models [15], have also been proposed for code-switch detection.

Recent developments in deep learning have brought tremendous improvements in many areas of speech processing, including speech recognition [16, 17, 18, 19, 20], language identification [21, 22, 23, 24], speaker identification [25, 26, 27], emotion recognition [28, 29, 30], and many others. The attention mechanism [31] has become a dominant approach in both natural language processing and speech due to its ability to focus on the parts of the input signal that are relevant for a particular task. Sequence-to-sequence models become more powerful and efficient when built from attention layers alone: [32] builds an end-to-end architecture using only attention layers, eliminating the need for convolutional or LSTM layers. These models, popularly known as Transformers, are the state-of-the-art models for most natural language processing problems. In speech recognition, state-of-the-art systems also use attention-based sequence-to-sequence models due to their efficiency and their capacity to align and predict, and such models are used extensively in problems like emotion recognition [28] and language identification.
It has been shown that the language identification problem can be tackled using phoneme sequences extracted from a speech recognition system with some post-processing [38]. [38] describes how an English phoneme recognizer can be used to extract phoneme sequences for utterances in different languages, after which N-gram statistics are applied as post-processing to classify test sentences. Motivated by the work of M. A. Zissman et al. [38], we propose to use phoneme sequences as an additional guiding signal along with MFCC features.

We propose a multi-modal transformer model for utterance-level code-switch detection. Our model consists of two streams, an Audio Network (AN) and a Phoneme Network (PN). The Audio Network consists of a sequence of CNN and LSTM layers followed by a Transformer encoder to process MFCC features. The Phoneme Network likewise consists of a sequence of CNN and LSTM layers followed by a Transformer encoder layer to process phoneme embeddings. We pool the output features of the transformer layers of both the Audio Network and the Phoneme Network using a statistics pooling layer and concatenate them to create an utterance-level multi-modal feature representation. We project this representation to a lower-dimensional space using a fully connected layer and classify it with a softmax layer to predict the class label. We conduct all our experiments on the Microsoft code-switching challenge dataset for three Indic language pairs: Telugu-English, Tamil-English, and Gujarati-English. Our experimental study shows that our approach significantly improves system performance compared to the baseline model and that the multi-modal approach outperforms the uni-modal methods. Our open-source implementation can be found at https://github.com/KrishnaDN/Code-Switch-Detection.

The organization of the paper is as follows. In Section 2, we explain our proposed approach in detail. In Section 3, we give a detailed analysis of the dataset, and in Section 4, we explain our experimental setup. Finally, in Section 5, we describe our results.
2. MULTI-MODAL LEARNING

This paper proposes a multi-modal learning approach for utterance-level code-switch detection using audio features and phoneme information. The model is a two-stream architecture containing an Audio Network and a Phoneme Network, as shown in Fig. 1. The Audio Network takes MFCC features as input and consists of Audio Encoder, Audio Transformer, and statistics pooling blocks. The Audio Encoder converts the MFCC features into high-level representations using convolutional neural networks followed by a Bi-LSTM layer. This high-level feature representation is fed into the Audio Transformer block, which consists of two multi-head self-attention layers that select important and relevant features for code-switching detection through attention weighting. The statistics pooling layer of the Audio Network aggregates the frame-level features from the Audio Transformer into an utterance-level feature vector. Similarly, the Phoneme Network consists of Phoneme Encoder, Phoneme Transformer, and statistics pooling blocks. The Phoneme Encoder takes phoneme embeddings as input and produces a high-level contextual representation using convolutional neural networks and a Bi-LSTM. These features are processed by the Phoneme Transformer, which selects relevant features for better code-switch classification using multi-head self-attention layers. The Phoneme Network's statistics pooling layer acts as an aggregation layer that pools the variable-length feature sequence from the Phoneme Transformer into a single feature vector for classification. We use early fusion to concatenate the utterance-level feature vectors from the Audio Network and the Phoneme Network, combining information from both audio features and phoneme sequences for better classification performance. We explain each block in detail in the following subsections.

Fig. 1. Proposed model architecture.

2.1. Phoneme Recognition

Our proposed method consists of two streams, the Audio Network and the Phoneme Network. The Phoneme Network takes as input a sequence of phonemes corresponding to the audio data. Due to the lack of ground-truth phoneme labels in the code-switch detection dataset, we generate phoneme labels by training GMM-HMM-based speech recognition models for the Telugu, Tamil, and Gujarati languages using Microsoft's speech recognition challenge dataset. Using these recognizers, we create phoneme labels for the Telugu-English, Tamil-English, and Gujarati-English code-switching datasets. These phoneme sequences and their corresponding MFCC features are used when training the code-switch identification models. The Audio Network takes the MFCC features of each audio file, while the Phoneme Network takes the phoneme sequence generated by the GMM-HMM speech recognition model for that audio. All the phoneme recognizers are trained in Kaldi.
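As an illustrative sketch (not part of the paper), the decoded phoneme sequences produced by the recognizers could be mapped to integer indices before being fed to the Phoneme Network. The file format, function names, and the choice of reserving index 0 for padding below are assumptions.

```python
# Hypothetical sketch: build a phoneme vocabulary from decoded phoneme
# sequences and convert each utterance into a tensor of indices for the
# Phoneme Network. The file format (one space-separated sequence per line)
# is an assumption, not something specified in the paper.
from collections import Counter
import torch

def build_phoneme_vocab(path):
    counter = Counter()
    with open(path) as f:
        for line in f:
            counter.update(line.strip().split())
    # Reserve index 0 for padding.
    return {ph: i + 1 for i, ph in enumerate(sorted(counter))}

def encode_utterance(phoneme_line, vocab):
    ids = [vocab[ph] for ph in phoneme_line.strip().split() if ph in vocab]
    return torch.tensor(ids, dtype=torch.long)

# Example usage (paths and phoneme symbols are placeholders):
# vocab = build_phoneme_vocab("train_phoneme_sequences.txt")
# seq = encode_utterance("sil k a sil", vocab)
```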
2.2. Audio Encoder

The Audio Network consists of Audio Encoder, Audio Transformer, and statistics pooling blocks, as shown in Fig. 1. The Audio Encoder takes a sequence of MFCC frames as input and produces a frame-level feature representation. We extract 13-dimensional MFCC feature vectors every 10 ms using a 25 ms window. These low-level features are passed to the Audio Encoder, followed by the Audio Transformer, to learn a feature representation that carries information about code-switching in the utterance.

The Audio Encoder consists of two 1D convolutional layers followed by a Bi-LSTM layer. Each convolutional block consists of a 1D convolution followed by batch normalization and a ReLU activation, as shown in Fig. 1. Each convolutional layer has its own kernel size and number of filters: the kernel sizes are 1x7 and 1x5 for Conv-1 and Conv-2, respectively, both layers have 64 filters, and both use a stride of 1x3. Let X = [x_1, x_2, ..., x_T] be a sequence of T MFCC feature vectors of dimension 13, where x_t is the MFCC frame extracted from the raw audio for each 25 ms window with a 10 ms shift.

F^A = Convolution(X)    (1)

where Convolution denotes the sequence of two 1D convolutional blocks applied to the MFCC sequence X, as shown in Fig. 1. After the convolution operation, we obtain a feature sequence F^A = [f_1, f_2, ..., f_{T'}] of length T'. In the feature matrix F^A, the x-axis is the time dimension and the y-axis is the feature dimension. The feature dimension equals the number of filters in the last convolutional layer, which is 64 in our case. Since convolution cannot capture temporal information, we use a Bi-LSTM layer to learn a temporal representation of the feature matrix F^A:

H^A = Bi-LSTM(F^A)    (2)

where Bi-LSTM denotes a bidirectional LSTM layer with hidden dimension 128. H^A = [h_1, h_2, ..., h_{T'}] is the output sequence of the final Bi-LSTM layer, and h_t is the hidden activation at time step t of the last layer. Since the LSTM operates in both directions, the final hidden output feature is twice the size of the LSTM hidden vector.
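The sketch below is one possible PyTorch realization of the Audio Encoder described above, using the hyperparameters stated in the paper (13-dimensional MFCC input, kernel sizes 7 and 5, 64 filters, stride 3, Bi-LSTM with hidden size 128); the class and argument names are ours and other details are assumptions.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch of the Audio Encoder: two 1D conv blocks (Conv + BatchNorm +
    ReLU) followed by a bidirectional LSTM, as described in Section 2.2."""
    def __init__(self, n_mfcc=13, n_filters=64, lstm_hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, n_filters, kernel_size=7, stride=3),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5, stride=3),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(n_filters, lstm_hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, x):
        # x: (batch, T, n_mfcc) sequence of MFCC frames
        f = self.conv(x.transpose(1, 2))       # F^A: (batch, n_filters, T')
        h, _ = self.bilstm(f.transpose(1, 2))  # H^A: (batch, T', 2 * lstm_hidden)
        return h

# Example: a ~3-second utterance at a 10 ms frame shift gives ~300 frames.
# enc = AudioEncoder(); out = enc(torch.randn(4, 300, 13))  # -> (4, T', 256)
```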
2.3. Phoneme Encoder

The Phoneme Network consists of Phoneme Encoder, Phoneme Transformer, and statistics pooling blocks. The Phoneme Encoder consists of a learnable phoneme embedding layer and two convolutional layers followed by a Bi-LSTM layer, as shown in Fig. 1. The Phoneme Encoder takes a phoneme sequence as input to the embedding layer, and these fixed-dimensional embeddings are learned during training. The phoneme embeddings are passed into a convolutional network consisting of 1D convolutional layers. Each convolutional layer has its own kernel size and number of filters: the kernel sizes are 1x3 and 1x5 for Conv-1 and Conv-2, respectively, and both layers have 64 filters. Since the convolutional layers do not capture temporal information, we place a Bi-LSTM right after the convolution block; it learns contextual representations of the phoneme sequence and helps in detecting code-switching. Let P = [p_1, p_2, ..., p_N] be a sequence of N phoneme embeddings of dimension 128, where p_n is the embedding of the n-th phoneme.

F^P = Convolution(P)    (3)

where Convolution denotes the sequence of two convolutional blocks applied to the phoneme embedding sequence P, as shown in Fig. 1. F^P = [f_1, f_2, ..., f_{N'}] is the high-level feature representation learned by the convolution operation. Since convolution cannot capture temporal information, we use Bi-LSTM layers to learn a temporal representation of the feature matrix F^P:

H^P = Bi-LSTM(F^P)    (4)

where Bi-LSTM denotes a two-layer bidirectional LSTM network with hidden dimension 128. H^P = [h_1, h_2, ..., h_{N'}] is the output sequence of the final Bi-LSTM layer, and h_n is the hidden activation at time step n of the last layer.
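A corresponding sketch of the Phoneme Encoder is given below, again using the stated hyperparameters (128-dimensional learnable embeddings, kernel sizes 3 and 5, 64 filters, stride 1, Bi-LSTM hidden size 128); the vocabulary size, padding index, and class name are assumptions.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch of the Phoneme Encoder: learnable phoneme embeddings, two
    1D conv blocks, and a bidirectional LSTM, as described in Section 2.3."""
    def __init__(self, vocab_size, emb_dim=128, n_filters=64, lstm_hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, n_filters, kernel_size=3, stride=1),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=5, stride=1),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(n_filters, lstm_hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, phonemes):
        # phonemes: (batch, N) integer phoneme indices
        p = self.embedding(phonemes)           # P: (batch, N, emb_dim)
        f = self.conv(p.transpose(1, 2))       # F^P: (batch, n_filters, N')
        h, _ = self.bilstm(f.transpose(1, 2))  # H^P: (batch, N', 2 * lstm_hidden)
        return h
```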
2.4. Transformer Network

Attention is a well-known concept in deep learning, especially for temporal sequences. Attention models are widely used in speech and NLP applications due to their ability to attend to and select the features that are relevant for a particular task. Recently, multi-head self-attention has become one of the most widely used models in NLP and speech due to its ability to attend to different parts of the input using multiple attention heads. Various attention mechanisms have been proposed over the years, including dot-product attention, locality-sensitive attention, and additive attention. In this work, we use multi-head self-attention. A detailed multi-head self-attention block is shown in Fig. 2.

Fig. 2. Multi-head self-attention: Audio Transformer (left) and Phoneme Transformer (right).

The multi-head attention (MHA) layer consists of multi-head scaled dot-product attention, a matrix multiplication, and a position-wise feed-forward layer. The MHA block contains M linear layers for the Query, Key, and Value matrices, where M is the number of attention heads. It takes the feature representations from the Bi-LSTM layer and applies the i-th linear transform to create Q_i, K_i, and V_i, where i = 1, 2, ..., M. The Q_i, K_i, and V_i are fed into a scaled dot-product attention layer followed by a matrix multiplication between the value matrix and the attention weights. The scaled dot-product attention A_i for the i-th head is defined as

A_i = Softmax(Q_i K_i^T / √d_q) V_i    (5)

where d_q is the dimension of the query vector. We combine the attention outputs from all the heads by simple concatenation and feed them into the feed-forward layer:

A = Concat(A_1, A_2, ..., A_i, ..., A_M) W_0    (6)

Our proposed model consists of an Audio Transformer and a Phoneme Transformer. The Audio Transformer takes the output feature representation H^A = [h_1, h_2, ..., h_{T'}] of the Audio Encoder as input, while the Phoneme Transformer takes the output feature representation H^P = [h_1, h_2, ..., h_{N'}] of the Phoneme Encoder as input. The Query, Key, and Value are obtained from H^A for the Audio Transformer and from H^P for the Phoneme Transformer. Both the Audio Transformer and the Phoneme Transformer consist of two multi-head self-attention layers with four attention heads each.
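A minimal sketch of one such self-attention layer is shown below, assuming a 256-dimensional input (the Bi-LSTM output, 2 x 128), four heads, and a feed-forward hidden dimension of 256 as stated in Section 4.2. The residual connections and the use of PyTorch's built-in nn.MultiheadAttention are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Sketch of one multi-head self-attention layer: layer normalization,
    4-head scaled dot-product self-attention (Eqs. (5)-(6)), and a
    position-wise feed-forward layer with hidden dimension 256."""
    def __init__(self, d_model=256, n_heads=4, d_ff=256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, h):
        # h: (batch, T, d_model) frame-level features from the encoder
        x = self.norm(h)
        attn_out, _ = self.mha(x, x, x)  # queries, keys, values all from x
        x = h + attn_out                 # residual connection (assumed)
        return x + self.ff(x)

# Each stream stacks two such blocks, e.g.:
# audio_transformer = nn.Sequential(SelfAttentionBlock(), SelfAttentionBlock())
```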
2.5. Statistics Pooling

The statistics pooling layer pools the frame-level features from the Audio Transformer and the Phoneme Transformer. It computes the mean and standard deviation across time of a feature sequence to generate a single utterance-level feature representation for the Audio Network and the Phoneme Network. The mean and standard deviation vectors are concatenated to obtain second-order statistics:

U^A = Concat(mean(A^a), std(A^a))    (7)

U^P = Concat(mean(A^p), std(A^p))    (8)

where A^a and A^p are the outputs of the Audio Transformer and the Phoneme Transformer, respectively. U^A is the utterance-level feature vector obtained from A^a, and U^P is the utterance-level feature vector obtained from A^p; both have the same dimension.

2.6. Early Fusion

The utterance-level feature vectors U^A and U^P are representations learned from different modalities. The early fusion block combines the information from the audio and phoneme modalities by simple concatenation:

C = Concat(U^A, U^P)    (9)

where C is the fused feature representation.
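For concreteness, a small sketch of the pooling and fusion steps in Eqs. (7)-(9) is given below; the function names and tensor shapes are our assumptions.

```python
import torch

def statistics_pooling(frames):
    """Pool frame-level features (batch, T, D) into an utterance-level
    vector (batch, 2*D) by concatenating the mean and standard deviation
    over time, as in Eqs. (7) and (8)."""
    mean = frames.mean(dim=1)
    std = frames.std(dim=1)
    return torch.cat([mean, std], dim=-1)

def early_fusion(audio_frames, phoneme_frames):
    """Concatenate the pooled audio and phoneme representations (Eq. (9))."""
    u_a = statistics_pooling(audio_frames)    # U^A
    u_p = statistics_pooling(phoneme_frames)  # U^P
    return torch.cat([u_a, u_p], dim=-1)      # C

# Example: audio features (4, 32, 256) and phoneme features (4, 20, 256)
# yield a fused vector of shape (4, 1024).
```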
3. DATASET

3.1. Speech Recognition dataset

The proposed model uses both audio features and phoneme sequences to train the multi-modal architecture for code-switching identification. We therefore first build LVCSR speech recognition models for the Telugu, Tamil, and Gujarati languages to extract the phoneme sequences of the code-switching dataset. We use the dataset provided for the low-resource speech recognition challenge by Microsoft, which consists of 40 hours of audio data for each of the languages. All audio files are sampled at 16 kHz for training the speech recognition models. The dataset contains a training and an evaluation split for each language. We first train a GMM-HMM-based speech recognition model for each language using its training split and then evaluate each model on the corresponding test split.

Table 1. Code-switching dataset: train and evaluation splits for different languages (durations in hours).

            Train                     Evaluation
Dataset     Duration   Utterances     Duration   Utterances
Gujarati    31.59      16780          3.59       2091
Telugu      31.59      16991          4.0        2135
Tamil       30.24      10933          7.312      2642

3.2. Code-switching dataset

In this section, we discuss the code-switching dataset in detail. One of the shared task's main goals is to detect whether an audio file is monolingual or code-switched. The shared task contains two subtasks: 1) utterance-level identification of monolingual vs. code-switched utterances, and 2) frame-level identification of the language in a code-switched utterance. We focus on subtask 1 in this paper. The dataset for subtask 1 covers three Indic language pairs: Telugu-English, Tamil-English, and Gujarati-English. Each language pair is provided with training and testing sets, and both sets contain labels indicating whether an audio file is monolingual or code-switched. The statistics of the dataset are given in Table 1. To use our approach, we need a phoneme sequence for every utterance in the code-switching identification dataset. To obtain these sequences, we use the prebuilt GMM-HMM speech recognition models of the corresponding languages: the Telugu, Tamil, and Gujarati recognizers extract phoneme sequences for the Telugu-English, Tamil-English, and Gujarati-English code-switched datasets, respectively.

4. EXPERIMENTS

4.1. Speech Recognition

Our speech recognition dataset consists of 40 hours of labeled audio data for each of the three Indic languages Telugu, Tamil, and Gujarati. We extract 13-dimensional MFCC features for every 25 ms window with a 10 ms shift and use these features to train the acoustic models. We build three speech recognition models, Telugu-ASR, Tamil-ASR, and Gujarati-ASR. The acoustic model for Telugu-ASR is trained on ∼40 hours of audio data containing ∼44K utterances, and we train a bigram language model on the training utterances. Similarly, the acoustic models for Tamil-ASR and Gujarati-ASR are trained on ∼40 hours of data each, with ∼39K utterances for Tamil and ∼22K for Gujarati. The lexicons for Telugu-ASR, Tamil-ASR, and Gujarati-ASR contain ∼48K, ∼58K, and ∼43K words, respectively. We evaluate the speech recognition models on ∼5 hours of evaluation data per language. We use Kaldi¹ [34] to train the speech recognition models and KenLM² [35] to train the language models.

¹ www.kaldi.net
² www.kenLM.com
4.2. Code-switching Identification

Our proposed approach consists of two streams, the Audio Network and the Phoneme Network. The input to the Audio Network is a sequence of 13-dimensional MFCC features extracted with a 25 ms window length and a 10 ms hop length. The Audio Encoder has two convolutional layers with [1x7, 1x5] kernels and 64 filters each, and each convolutional layer has a stride of 3. Each of these layers consists of a 1D convolution, 1D batch normalization, and a ReLU activation. The Audio Encoder's convolutional layers reduce the temporal dimension of the input and extract higher-level features. The output of the convolutional block is fed to a Bi-LSTM with a hidden size of 128 and a dropout of 0.3. Similarly, the Phoneme Encoder has two 1D convolutional layers with kernel sizes [1x3, 1x5] and 64 filters, each with stride 1. The input to the Phoneme Encoder is a sequence of phoneme embeddings of dimension 128, learned as part of the training. The output of the Phoneme Encoder goes to a Bi-LSTM with a hidden size of 128 and a dropout of 0.3. Both the Audio Transformer and the Phoneme Transformer contain two multi-head attention (MHA) blocks. Each MHA block uses four attention heads and applies layer normalization before the scaled dot-product operation, and the feed-forward layer inside each MHA block has a hidden dimension of 256. We use the Adam optimizer [36] with a scheduled learning rate. We implement the models in PyTorch [37], and all models are trained on RTX 2080Ti graphics cards.
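The sketch below shows how the pieces from the earlier sketches could be assembled and trained, assuming the hypothetical names AudioEncoder, PhonemeEncoder, SelfAttentionBlock, and early_fusion defined above are in scope. The size of the projection layer and the binary class count are assumptions; the paper only states that a fully connected layer and a softmax follow the fused representation.

```python
import torch
import torch.nn as nn

class CodeSwitchDetector(nn.Module):
    """Sketch of the full two-stream model: encoder -> transformer ->
    statistics pooling -> early fusion -> fully connected classifier."""
    def __init__(self, vocab_size, n_classes=2):
        super().__init__()
        self.audio_enc = AudioEncoder()
        self.phone_enc = PhonemeEncoder(vocab_size)
        self.audio_tf = nn.Sequential(SelfAttentionBlock(), SelfAttentionBlock())
        self.phone_tf = nn.Sequential(SelfAttentionBlock(), SelfAttentionBlock())
        # Fused vector is 2 * (2 * 256) = 1024-dimensional (mean + std per stream).
        self.classifier = nn.Sequential(nn.Linear(4 * 256, 256), nn.ReLU(),
                                        nn.Linear(256, n_classes))

    def forward(self, mfcc, phonemes):
        a = self.audio_tf(self.audio_enc(mfcc))      # (B, T', 256)
        p = self.phone_tf(self.phone_enc(phonemes))  # (B, N', 256)
        fused = early_fusion(a, p)                   # (B, 1024)
        return self.classifier(fused)                # class logits

# Typical training step with Adam, as mentioned in Section 4.2 (learning
# rate and batch contents are placeholders):
# model = CodeSwitchDetector(vocab_size=60)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# logits = model(mfcc_batch, phoneme_batch)
# loss = nn.CrossEntropyLoss()(logits, labels)
# loss.backward(); opt.step(); opt.zero_grad()
```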
5. RESULTS

In this section, we describe the experimental results in detail. We first evaluate the speech recognition performance for Telugu, Tamil, and Gujarati on the standard evaluation data. We then compare our multi-modal approach with the uni-modal approaches, and finally we examine the effectiveness of transformers for the code-switching identification task. We train GMM-HMM-based speech recognition models on 40 hours of audio data per language and evaluate the word error rates on the corresponding test sets, obtaining ∼24%, ∼20%, and ∼17% WER for Telugu, Tamil, and Gujarati, respectively.

We train our multi-modal model on the Telugu-English, Tamil-English, and Gujarati-English code-switching datasets and evaluate it on the evaluation set of each language pair. The comparison between the multi-modal and uni-modal approaches is shown in Table 2. Multi-Modal is our approach, which uses both the Audio Network and the Phoneme Network. Uni-Modal-Audio is a classifier trained using only the Audio Network with MFCC features as input, and Uni-Modal-Phoneme is a classifier trained using only the Phoneme Network with phoneme sequences as input. Our approach outperforms the uni-modal approaches: it improves over Uni-Modal-Audio by 6.8%, 7.75%, and 4.14% for Telugu, Tamil, and Gujarati, respectively, and over Uni-Modal-Phoneme by 5.46%, 5.47%, and 13.85%, respectively.

Table 2. Comparison of multi-modal and uni-modal approaches (bold indicates the best performance).

System                 Accuracy
Telugu-English
  Multi-Modal          85.71%
  Uni-Modal-Audio      78.91%
  Uni-Modal-Phoneme    80.25%
Tamil-English
  Multi-Modal          86.02%
  Uni-Modal-Audio      78.27%
  Uni-Modal-Phoneme    80.55%
Gujarati-English
  Multi-Modal          88.85%
  Uni-Modal-Audio      84.71%
  Uni-Modal-Phoneme    75.80%

We also experiment to assess the effectiveness of the transformer for the code-switching identification task in the multi-modal setting. First, we remove the Audio Transformer and the Phoneme Transformer from the original model; we call this experiment CNN+Bi-LSTM. Similarly, we remove the Bi-LSTM from the original architecture to see its effect on system performance; this experiment is called CNN+Transformer. In the final experiment, we keep all the blocks of the original architecture; this experiment is called CNN+Bi-LSTM+Transformer. We conduct these experiments for all the language pairs, and the results are shown in Table 3. The CNN+Bi-LSTM+Transformer model performs best overall, which shows that the transformers help improve overall system performance.

Table 3. Comparison of different neural architectures (bold indicates the best performance).

System                          Accuracy
Telugu-English
  CNN+Bi-LSTM                   85.00%
  CNN+Transformer               85.30%
  CNN+Bi-LSTM+Transformer       85.71%
Tamil-English
  CNN+Bi-LSTM                   85.70%
  CNN+Transformer               86.02%
  CNN+Bi-LSTM+Transformer       85.60%
Gujarati-English
  CNN+Bi-LSTM                   88.01%
  CNN+Transformer               87.90%
  CNN+Bi-LSTM+Transformer       88.85%

6. CONCLUSIONS

Code-switching detection is one of the most challenging and still unsolved problems in speech processing. This work proposes a new approach that improves code-switching detection performance using multi-modal learning combined with transformer networks to exploit phonetic information. Our method consists of a two-stream neural network in which one stream processes audio features such as MFCCs while the other processes phoneme sequences. Multi-modal learning helps capture the information shared between the two modalities and yields a better code-switching identification model. We also show that transformer models improve code-switching detection performance due to their ability to attend to and select features using the self-attention mechanism. Our experiments show that our multi-modal learning approach outperforms the uni-modal approaches on Microsoft's code-switching identification dataset.

7. ACKNOWLEDGEMENTS

We want to thank Microsoft India for providing the dataset for this research work. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Freshworks Inc.
8. REFERENCES

[1] P. Auer, "Code-Switching in Conversation: Language, Interaction and Identity," Routledge, 2013.
[2] E. Yılmaz, H. van den Heuvel, and D. A. van Leeuwen, "Investigating bilingual deep neural networks for automatic speech recognition of code-switching Frisian speech," in Proc. Workshop on Spoken Language Technology for Under-resourced Languages (SLTU), May 2016, pp. 159–166.

[3] G. Stemmer, E. Noth, and H. Niemann, "Acoustic modeling of foreign words in a German speech recognition system," in Proc. EUROSPEECH, 2001, pp. 2745–2748.

[4] D.-C. Lyu, R.-Y. Lyu, Y.-C. Chiang, and C.-N. Hsu, "Speech recognition on code-switching among the Chinese dialects," in Proc. ICASSP, vol. 1, May 2006, pp. 1105–1108.

[5] N. Luo, D. Jiang, S. Zhao, C. Gong, W. Zou, and X. Li, "Towards end-to-end code-switching speech recognition," arXiv preprint arXiv:1810.13091, 2018.

[6] H. Adel, N. T. Vu, F. Kraus, T. Schlippe, H. Li, and T. Schultz, "Recurrent neural network language modeling for code switching conversational speech," in Proc. ICASSP, 2013, pp. 8411–8415.

[7] H. Adel, K. Kirchhoff, D. Telaar, N. T. Vu, T. Schlippe, and T. Schultz, "Features for factored language models for code-switching speech," in Proc. Spoken Language Technologies for Under-Resourced Languages, 2014.

[8] J. Weiner, N. T. Vu, D. Telaar, F. Metze, T. Schultz, D.-C. Lyu, E.-S. Chng, and H. Li, "Integration of language identification into a recognition system for spoken conversations containing code-switches," in Proc. SLTU, May 2012.

[9] K. R. Mabokela, M. J. Manamela, and M. Manaileng, "Modeling code-switching speech on under-resourced languages for language identification," in Proc. SLTU, 2014, pp. 225–230.

[10] D.-C. Lyu, E.-S. Chng, and H. Li, "Language diarization for code-switch conversational speech," in Proc. ICASSP, May 2013, pp. 7314–7318.

[11] K. Bhuvanagiri and S. Kopparapu, "An approach to mixed language automatic speech recognition," in Oriental COCOSDA, Kathmandu, Nepal, 2010.

[12] E. Yılmaz, H. van den Heuvel, and D. van Leeuwen, "Code-switching detection using multilingual DNNs," in 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, 2016, pp. 610–616, doi: 10.1109/SLT.2016.7846326.

[13] C. Wu, H. Shen, and C. Hsu, "Code-switching event detection by using a latent language space model and the Delta-Bayesian information criterion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1892–1903, Nov. 2015, doi: 10.1109/TASLP.2015.2456417.

[14] E. Yılmaz, H. van den Heuvel, and D. A. van Leeuwen, "Exploiting untranscribed broadcast data for improved code-switching detection," in Proc. INTERSPEECH, Aug. 2017, pp. 42–46.

[15] E. Yılmaz, H. van den Heuvel, and D. van Leeuwen, "Code-switching detection with data augmented acoustic and language models," in the Sixth International Workshop on Spoken Language Technology for Under-resourced Languages (SLTU), Procedia Computer Science, 2018, pp. 127–131.

[16] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP, 2016.

[17] S. Kim, T. Hori, and S. Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. ICASSP, 2017, pp. 4835–4839.

[18] C. Wang, Y. Wu, Y. Du, J. Li, S. Liu, L. Lu, S. Ren, G. Ye, S. Zhao, and M. Zhou, "Semantic mask for transformer based end-to-end speech recognition," in Proc. Interspeech, 2020.

[19] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Proc. Interspeech, 2019.

[20] R. Collobert, C. Puhrsch, and G. Synnaeve, "Wav2Letter: An end-to-end ConvNet-based speech recognition system," CoRR, vol. abs/1609.03193, 2016.

[21] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, "Spoken language recognition using x-vectors," in Odyssey: The Speaker and Language Recognition Workshop, 2018.

[22] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, "Automatic language identification using long short-term memory recurrent neural networks," in Proc. INTERSPEECH, 2014, pp. 2155–2159.

[23] S. Ling, J. Salazar, Y. Liu, and K. Kirchhoff, "BERTphone: Phonetically-aware encoder representations for utterance-level speaker and language recognition."

[24] W. Cai, D. Cai, S. Huang, and M. Li, "Utterance-level end-to-end language identification using attention-based CNN-BLSTM," in Proc. ICASSP, 2019, pp. 5991–5995.

[25] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018.

[26] Krishna D N, A. Patil, M. S. P. Raj, Sai Prasad H S, and P. A. Garapati, "Identification of Indian languages using Ghost-VLAD pooling," arXiv:2002.01664.

[27] S. Wang, J. Rohdin, L. Burget, O. Plchot, Y. Qian, K. Yu, and J. Cernocky, "On the usage of phonetic information for text-independent speaker embedding extraction," in Proc. INTERSPEECH, 2019, pp. 1148–1152.

[28] M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma, and N. Dehak, "Emotion identification from raw speech signals using DNNs," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 3097–3101.

[29] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, "Learning factorized multimodal representations," in Proc. ICLR, 2019.

[30] Krishna D N and Sai Reddy, "Multi-modal speech emotion recognition using speech embeddings and audio features," in Proc. AVSP, Melbourne, Australia, 2019.

[31] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[33] D. Su, X. Wu, and L. Xu, "GMM-HMM acoustic model training by a two level procedure with Gaussian components determined by automatic model selection," in Proc. ICASSP, Dallas, 2010.

[34] D. Povey et al., "The Kaldi speech recognition toolkit," in Proc. Workshop on Automatic Speech Recognition and Understanding (ASRU), Hawaii, US, 2011, pp. 1–4.

[35] K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn, "Scalable modified Kneser-Ney language model estimation," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, Aug. 2013.

[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. International Conference on Learning Representations (ICLR), 2014.

[37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS, 2017.

[38] M. A. Zissman and E. Singer, "Automatic language identification of telephone speech messages using phoneme recognition and N-gram modeling," in Proc. ICASSP, Adelaide, SA, 1994, pp. I/305–I/308, doi: 10.1109/ICASSP.1994.389377.