ATTENTION BASED ON-DEVICE STREAMING SPEECH RECOGNITION WITH LARGE SPEECH CORPUS
Kwangyoun Kim*, Kyungmin Lee*, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

Samsung Electronics Co., Ltd., Korea

arXiv:2001.00577v1 [eess.AS] 2 Jan 2020

ABSTRACT

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained around a 90% word recognition rate for the general domain mainly by using joint training with connectionist temporal classification (CTC) and cross-entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation. In addition, we compressed our models to less than 1/3.4 of their original size using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, and we achieved a relative 36% improvement in word error rate (WER) on average for target domains including the general domain.

Index Terms— Online speech recognition, end-to-end, attention, compression

1. INTRODUCTION

Recently, end-to-end (E2E) neural network architectures based on sequence-to-sequence (seq2seq) learning for automatic speech recognition (ASR) have been gaining a lot of attention [1, 2], mainly because they learn the acoustic information, the linguistic information, and the alignments between them simultaneously, unlike conventional ASR systems based on hybrid hidden Markov model (HMM) and deep neural network (DNN) models. Moreover, E2E models are better suited to compression, since they do not need separate phonetic dictionaries and language models, making them one of the best candidates for on-device ASR systems.

Among the various E2E ASR architectures, such as attention-based encoder-decoder models [3] and recurrent neural network transducer (RNN-T) models [4, 5], we chose the attention-based approach since its accuracy has surpassed that of conventional HMM-DNN based state-of-the-art ASR systems [6]. Despite their high accuracy, attention models that require a full alignment between the input and output sequences cannot provide streaming ASR services. Several studies have addressed this lack of streaming capability [7, 8, 9]. In [7], an online neural transducer was proposed that applies full attention over chunks of the input and is trained with an additional end-of-chunk symbol. In [8], a hard monotonic attention based model was proposed for streaming decoding with acceptable accuracy degradation. Furthermore, in [9], the monotonic chunk-wise attention (MoChA) method was proposed, which achieved promising accuracy by loosening the hard monotonic alignment constraint and applying soft attention over a few speech chunks.

In this paper, we explain how we improved our MoChA-based ASR system into a solution ready for on-device commercialization. First, we trained the MoChA models jointly with connectionist temporal classification (CTC) and cross-entropy (CE) losses to learn alignment information precisely. A minimum word error rate (MWER) method, a type of sequence-discriminative training, was adopted to optimize the models [10]. Also, for better stability and convergence during training, we applied a layer-wise pre-training mechanism [11]. Furthermore, in order to compress the models, we present a hyper low-rank matrix approximation (hyper-LRA) method employing DeepTwist [12] with minimal accuracy degradation. Another important requirement for commercializing ASR solutions is to boost the recognition accuracy for user context-specific keywords. In order to bias the ASR system at inference time, we fused the MoChA models with statistical n-gram based personalized language models (LMs).

The main contribution of this paper is, to the best of our knowledge, the first successful attention-based streaming ASR system capable of running on devices. We succeeded not only in training MoChA models with large corpora for Korean and English, but also in satisfying the needs of commercial on-device applications.

* Equal contribution. The authors would like to thank Jiyeon Kim, Mehul Kumar and Abhinav Garg for their constructive comments.
The rest of this paper is organized as follows: Section 2 explains the attention-based speech recognition models. Section 3 describes how the optimization methods improved recognition accuracy, and Section 4 presents the compression algorithm for the MoChA models. Section 5 describes the n-gram LM fusion for on-demand adaptation, and Sections 6 and 7 present the related experimental results and discussion.

2. MODEL ARCHITECTURE

Attention-based encoder-decoder models are composed of an encoder, a decoder, and an attention block between the two [13]. The encoder converts an input sequence into a sequence of hidden vector representations referred to as encoder embeddings. The decoder is a generative model which predicts a sequence of target labels. The attention is used to learn the alignment between the two sequences, the encoder embeddings and the target labels.

2.1. Attention-based speech recognition

The attention-based models can be applied to ASR systems [14, 15] using the following equations (1)-(4):

h_t = Encoder(h_{t-1}, x_t)    (1)

where x = {x_1, x_2, ..., x_T} is the speech feature vector sequence and h = {h_1, h_2, ..., h_T} is the sequence of encoder embeddings. The Encoder can be constructed of bi- or uni-directional long short-term memory (LSTM) layers [16]. Due to the difference in length between the input and output sequences, the model often has difficulty converging. To compensate for this, pooling along the time axis is applied to the outputs of intermediate encoder layers, effectively reducing the length of h to T' < T.

e_{t,l} = Energy(h_t, s_l) = v^T tanh(W_h h_t + W_s s_l + b)    (2)
a_{t,l} = Softmax(e_{t,l})

An attention weight a_{t,l}, often referred to as the alignment, represents how the encoder embedding of each frame h_t and the decoder state s_l are correlated [13]. We employed an additive attention method to compute the correlations. A softmax function converts the attention energies into a probability distribution which is used as the attention weights. A weighted sum of the encoder embeddings is computed using the attention weights as

c_l = Σ_{t=1}^{T'} a_{t,l} h_t    (3)

where c_l denotes the context vector. Since the encoder embeddings of the entire input sequence are used to compute the context, we refer to this attention method as full attention. The Decoder, which consists of uni-directional LSTM layers, computes the current decoder state s_l from the previously predicted label y_{l-1} and the previous context c_{l-1}, and the output label y_l is calculated by a Prediction block using the current decoder state, the context, and the previous output label:

s_l = Decoder(s_{l-1}, y_{l-1}, c_{l-1})    (4)
y_l = Prediction(s_l, y_{l-1}, c_l)

Typically, the Prediction block consists of one or two fully connected layers and a softmax layer that generates a probability score for the target labels. We applied a max-pooling layer between the two fully connected layers. The probability of the predicted output sequence y = {y_1, y_2, ..., y_L} for a given x is calculated as in equation (5),

P(y|x) = Π_{l=1}^{L} P(y_l | x, y_{1:l-1})    (5)

where P(y_l | x, y_{1:l-1}) is the probability of each output label. Even though attention-based models have shown state-of-the-art performance, they are not a suitable choice for streaming ASR, particularly because they must compute the alignment between the current decoder state and the encoder embeddings of the entire input sequence.
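As a concrete illustration of equations (2)-(3), the following is a minimal NumPy sketch of additive (full) attention for a single decoder step. The array shapes and variable names are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def full_attention_step(h, s, W_h, W_s, b, v):
    """Additive (full) attention for one decoder step, as in Eqs. (2)-(3).

    h:   (T', d_h) encoder embeddings for the whole utterance
    s:   (d_s,)    current decoder state
    W_h: (d_a, d_h), W_s: (d_a, d_s), b: (d_a,), v: (d_a,) attention parameters
    Returns the attention weights a (T',) and the context vector c (d_h,).
    """
    # Eq. (2): energy e_t = v^T tanh(W_h h_t + W_s s + b) for every frame t
    e = np.tanh(h @ W_h.T + s @ W_s.T + b) @ v   # (T',)
    a = softmax(e)                               # attention weights over all frames
    # Eq. (3): context is the attention-weighted sum of encoder embeddings
    c = a @ h                                    # (d_h,)
    return a, c
```

Because every frame of h is needed before the first weight can be computed, this full-attention step is what prevents streaming decoding.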
2.2. Monotonic Chunk-wise Attention

A monotonic chunk-wise attention (MoChA) model was introduced to resolve the streaming incapability of attention-based models, under the assumption that the alignment between the speech input and the output text sequence should be monotonic [8, 9]. A MoChA model computes the context using two kinds of attention: a hard monotonic attention and a soft chunkwise attention. The hard monotonic attention is computed as

e^{monotonic}_{t,l} = MonotonicEnergy(h_t, s_l)
a^{monotonic}_{t,l} = σ(e^{monotonic}_{t,l})
z_{t,l} ∼ Bernoulli(a^{monotonic}_{t,l}) = 1 if a^{monotonic}_{t,l} ≥ 0.5, and 0 otherwise    (6)

where z_{t,l} is the hard monotonic attention used to determine whether to attend to the encoder embedding h_t. The Decoder attends to the u-th encoder embedding to predict the next label only if z_{u,l} = 1. Equation (6) is evaluated for t ≥ u_{l-1}, where u_{l-1} denotes the index of the encoder embedding attended to for the previous output label prediction.
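A minimal sketch of the hard monotonic selection rule in equation (6) at inference time is shown below; the sigmoid threshold of 0.5 follows the equation above, while the MonotonicEnergy network itself is abstracted into a callable, and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_attention_point(h, s_l, monotonic_energy, u_prev):
    """Scan encoder embeddings from the previously attended index u_prev
    and stop at the first frame whose monotonic attention fires (Eq. (6))."""
    T = h.shape[0]
    for t in range(u_prev, T):
        a = sigmoid(monotonic_energy(h[t], s_l))  # a^{monotonic}_{t,l}
        if a >= 0.5:                              # hard decision z_{t,l} = 1
            return t                              # new attention point u_l
    return None  # nothing selected yet; wait for more frames when streaming
```

Because the scan starts at u_prev and never moves backwards, the decoder only needs the encoder embeddings seen so far, which is what makes streaming decoding possible.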
The soft chunkwise attention is computed as

e^{chunk}_{t,l} = ChunkEnergy(h_t, s_l)
a^{chunk}_{t,l} = Softmax(e^{chunk}_{t,l})    (7)
c_l = Σ_{t=u-w+1}^{u} a^{chunk}_{t,l} h_t

where u is the attending point chosen by the monotonic attention, w is the pre-determined chunk size, a^{chunk}_{t,l} is the chunkwise soft attention weight, and c_l is the chunkwise context used to predict the output label.

We used a modified additive attention for computing the attention energies in order to ensure model stability [9]:

Energy'(h_t, s_l) = g (v^T / ||v||) tanh(W_h h_t + W_s s_{l-1} + b) + r    (8)

where g and r are additional trainable scalars, and the other variables are the same as in Energy() in equation (2). MonotonicEnergy() and ChunkEnergy() are both computed using equation (8), each with its own trainable variables.

Fig. 1. Model architecture.

3. TRAINING AND OPTIMIZATION

The main objective of the attention-based encoder-decoder model is to find the parameters which minimize the cross-entropy (CE) between the predicted sequences and the ground truth sequences,

L_CE = − Σ_l log P(y*_l | x, y*_{1:l-1})    (9)

where y* is the ground truth label sequence. We trained the MoChA models with the CTC and CE losses jointly to learn the alignment better, and MWER-based sequence-discriminative training was employed to further improve the accuracy. Moreover, in order to ensure stability while training the MoChA models, we adopted a pre-training scheme.

3.1. Joint CTC-CE training

Despite the difference in length between the speech feature sequences and the corresponding text sequences, the CTC loss drives the model to maximize the total probability of all possible alignments between the input and the output sequence [17]. The CTC loss is defined as

L_CTC = − log Σ_{π ∈ Π(y*)} Π_{t=1}^{T} P(y^π_t | x_t)    (10)

where Π(y*) is the set of all alignments of the same length as the input speech frames, generated from y* by inserting the {Blank} symbol and repeating output units, π is one such alignment, and P(y^π_t | x_t) is the probability that the t-th predicted label is the t-th label of alignment π.

The CTC loss can readily be applied to training the MoChA model, especially the encoder, because it also drives the alignment between input and output to be monotonic. Moreover, the CTC loss has the advantage of learning alignments in noisy environments and can help the attention-based model learn its alignment quickly through joint training [18]. The joint training loss is defined as

L_Total = λ L_CE + (1 − λ) L_CTC,    λ ∈ [0, 1]    (11)

where L_Total is the joint loss of the two losses.

3.2. MWER training

In this paper, byte-pair encoding (BPE) based sub-word units were used as the output units of the decoder [19]. The model is therefore optimized to generate individual BPE output sequences well. However, the eventual goal of speech recognition is to reduce the word error rate (WER). Also, since the decoder is used with a beam search during inference, it is effective to improve accuracy by directly defining a loss that lowers the expected WER of the candidate beam search results. The MWER loss is defined as

L_MWER = Σ_{b ∈ B} P(y_b | x) (W_b − W̄)    (12)

where B is the set of candidate beam-search results and W_b is the number of word errors of each beam result sequence y_b. Subtracting the average word error over all beams, W̄, helps the model converge by reducing the variance.

L_Total = λ L_CE + (1 − λ) L_MWER,    λ ∈ [0, 1]    (13)

The MWER loss L_MWER can also be easily combined with the CE loss by linear interpolation, as shown in equation (13).
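To make equations (12)-(13) concrete, here is a small NumPy sketch that evaluates the MWER objective for one utterance from an N-best list; the renormalization of the beam scores and all variable names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def mwer_loss(beam_log_probs, beam_word_errors, lambda_ce=0.6, ce_loss=None):
    """Expected word-error objective over an N-best list, as in Eqs. (12)-(13).

    beam_log_probs:   (B,) model log-probabilities log P(y_b | x) of each beam
    beam_word_errors: (B,) number of word errors W_b of each beam vs. the reference
    """
    # Renormalize the beam scores so they form a distribution over the N-best list
    # (a common choice; the paper writes P(y_b | x) directly).
    p = np.exp(beam_log_probs - np.max(beam_log_probs))
    p = p / p.sum()

    w = np.asarray(beam_word_errors, dtype=float)
    w_bar = w.mean()                   # average word error over the beams
    l_mwer = np.sum(p * (w - w_bar))   # Eq. (12): expected relative word error

    if ce_loss is None:
        return l_mwer
    # Eq. (13): linear interpolation with the CE loss (lambda = 0.6 in Sec. 6.2)
    return lambda_ce * ce_loss + (1.0 - lambda_ce) * l_mwer
```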
3.3. Layer-wise pre-training

Layer-wise pre-training of the encoder was proposed in [11] to ensure that the model converges well and achieves better performance. The initial encoder consists of 2 LSTM layers with a max-pool layer with a pooling factor of 32 in between. After sub-epochs of training, a new LSTM layer and a new max-pool layer are added, and the total pooling factor of 32 is split into 2 for the lower layer and 16 for the newly added higher layer. This strategy is repeated until the entire network is completely built, with 6 encoder LSTM layers and 5 max-pool layers each having a pooling factor of 2. Finally, the total time-reduction factor is changed to 8, with only the lower 3 max-pool layers keeping a pooling factor of 2.

During pre-training of our MoChA models, the training and validation errors shot up whenever a new LSTM layer and max-pool layer were piled on at a new stage. In order to address this, we employed a learning-rate warm-up for every new pre-training stage.

3.4. Spec augmentation

Because an end-to-end ASR model learns from a transcribed corpus, a large dataset is one of the most important factors for achieving better accuracy. Data augmentation methods have been introduced to generate additional data from the original data, and recently SpecAugment showed state-of-the-art results on a public dataset [20]. SpecAugment masks parts of the spectrogram along the time and frequency axes, so the model learns to recognize masked speech with incomplete information.
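A minimal sketch of the time/frequency masking described above is given below; the mask widths follow the values quoted later in Section 6.2 (at most 13 mel channels and 50 frames, one mask of each type), but the function itself is our own illustrative assumption rather than the paper's implementation.

```python
import numpy as np

def spec_augment(features, max_freq_mask=13, max_time_mask=50, rng=None):
    """Mask one frequency band and one time span of a (T, F) feature matrix."""
    rng = rng or np.random.default_rng()
    x = features.copy()
    T, F = x.shape

    # Frequency mask: zero out f consecutive mel channels
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, max(F - f, 0) + 1)
    x[:, f0:f0 + f] = 0.0

    # Time mask: zero out t consecutive frames
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, max(T - t, 0) + 1)
    x[t0:t0 + t, :] = 0.0
    return x
```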
4. LOW-RANK MATRIX APPROXIMATION

We adopted a low-rank matrix approximation (LRA) algorithm based on singular value decomposition (SVD) to compress our MoChA model [21]. Given a weight matrix W ∈ R^{m×n}, its SVD is U Σ V^T, where Σ ∈ R^{m×n} is a diagonal matrix of singular values, and U ∈ R^{m×m} and V ∈ R^{n×n} are unitary matrices. If we specify a rank r < mn/(m+n), the truncated SVD is W̃ = Ũ Σ̃ Ṽ^T ∈ R^{m×n}, where Ũ ∈ R^{m×r}, Σ̃ ∈ R^{r×r}, and Ṽ ∈ R^{n×r} are the top-left submatrices of U, Σ, and V, respectively. For an LRA, W is replaced by the product of Ũ' = Ũ Σ̃ and Ṽ^T, and the number of weight parameters is reduced since r(m+n) < mn. Hence we obtain a compression ratio ρ = mn / (r(m+n)) in memory footprint and in the computational complexity of the matrix-vector multiplication. The LRA introduces a weight distortion

ΔW = Ũ' Ṽ^T − W.    (14)

For each layer, given an input x, the resulting output error is

e = σ((W + ΔW) x + b) − σ(W x + b)    (15)

where b and σ(·) denote a bias vector and a non-linear activation function, respectively. This error propagates through the layers and increases the training loss. In the LRA, Ũ' and Ṽ^T are updated in the backward pass, constraining the weight space to r(m+n) dimensions. However, for large ρ, it is difficult to find optimal weights because of the reduced dimension of the weight space. Instead, we find an optimal LRA by employing the DeepTwist method [12], which we call hyper-LRA. The hyper-LRA algorithm modifies the retraining process by adding the LRA distortion to the weights, and the corresponding errors to the outputs of the layers, every D iterations, where D is the distortion period. After retraining, the product of Ũ' and Ṽ^T is used in the inference model instead of W.

Algorithm 1: The hyper-LRA algorithm

Procedure TrainModelWeights()
  Result: weight matrices {W_i}
  for each iteration N do
    for each layer i do
      x_{i+1} ← σ(W_i x_i + b_i)          // x_i: layer input
      if N mod D = 0 then
        x_{i+1} ← x_{i+1} + e_i           // e_i as in (15)
        W_i ← W_i + ΔW_i                  // ΔW_i as in (14)
      end
    end
    compute the loss L
    for each layer i do
      W_i ← W_i − η ∂L/∂W_i               // η: learning rate
    end
  end
  for each layer i do
    W_i ← W_i + ΔW_i
  end

Procedure InferenceModelWeights()
  Result: weight matrices {Ũ'_i}, {Ṽ_i^T}
  for each layer i do
    Ũ'_i, Ṽ_i^T ← truncatedSVD(W_i)
  end

Note that the hyper-LRA algorithm optimizes W rather than Ũ' and Ṽ^T. In other words, the hyper-LRA method performs weight optimization in the hyperspace of the truncated weight space, which has the same dimension as the original weight space. Therefore, the hyper-LRA approach can provide a much higher compression ratio than conventional LRA, although it requires more computation for the retraining process.
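The following NumPy sketch shows the core of the scheme just described: the rank-r truncated SVD behind equation (14) and the DeepTwist-style step that snaps the weights onto their low-rank approximation every D iterations. It is a simplified illustration under our own naming assumptions, not the authors' training code.

```python
import numpy as np

def truncated_svd(W, r):
    """Return U' = U_r S_r and V_r^T so that U' @ V_r^T is the best rank-r
    approximation of W in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_prime = U[:, :r] * s[:r]        # fold the singular values into U
    return U_prime, Vt[:r, :]

def lra_distortion(W, r):
    """Delta W = U' V^T - W, as in Eq. (14)."""
    U_prime, Vt = truncated_svd(W, r)
    return U_prime @ Vt - W

def hyper_lra_step(weights, ranks, iteration, D):
    """Every D iterations, replace each weight matrix by its rank-r projection
    (the weight update of Algorithm 1); gradient updates keep acting on the full W."""
    if iteration % D == 0:
        for name, W in weights.items():
            weights[name] = W + lra_distortion(W, ranks[name])
    return weights
```

At inference time only Ũ' and Ṽ^T are stored, which is what yields the compression ratio ρ = mn / (r(m+n)) discussed above.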
5. ON-DEMAND ADAPTATION

On-demand adaptation is an important requirement not only for personal devices such as mobile phones but also for home appliances such as televisions. We adopted a shallow fusion [22] method that incorporates n-gram LMs at inference time. By interpolating both a general LM and domain-specific LMs, we were able to boost the accuracy on the target domains while minimizing the degradation on the general domain. The probabilities computed by the LMs and the E2E model are interpolated at each beam step as follows:

P'(y_l | x, y_{1:l-1}) = log P(y_l | x, y_{1:l-1}) + Σ_{n=1}^{N} α_n log P_{LM_n}(y_l | y_{1:l-1})    (16)

where N is the number of n-gram LMs and P_{LM_n} is the posterior distribution computed by the n-th n-gram LM. The LM distribution is calculated by looking up a probability for each BPE unit given the context.
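As an illustration of the shallow-fusion score in equation (16), the sketch below combines the model's log-probabilities with several n-gram LM log-probabilities at one beam step; the LM lookup interface and the interpolation weights are illustrative assumptions.

```python
import numpy as np

def fused_scores(model_log_probs, lm_lookups, history, lm_weights):
    """Shallow-fusion scores for one beam step, as in Eq. (16).

    model_log_probs: (V,) log P(y_l | x, y_{1:l-1}) from the MoChA decoder
    lm_lookups:      list of callables, each returning (V,) log P_LM(y_l | history)
    history:         previously decoded BPE units y_{1:l-1}
    lm_weights:      list of interpolation weights alpha_n, one per LM
    """
    score = np.array(model_log_probs, dtype=float)
    for alpha, lm in zip(lm_weights, lm_lookups):
        score += alpha * lm(history)   # add each n-gram LM's log-probabilities
    return score                        # used to rank hypotheses in the beam search
```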
6. EXPERIMENT

6.1. Experimental setup

We evaluated first on the Librispeech corpus, which consists of 960 hours of data, and also on internal usage data. The usage corpus consists of around 10K hours of transcribed speech each for Korean and English, recorded on mobile phones and televisions. We used one randomly sampled hour of usage data per language as our validation set. We doubled the speech corpus by adding random noise, both for training and for validation. The decoding speed was evaluated on a Samsung Galaxy S10+ equipped with a Samsung Exynos 9820 chipset, a Mali-G76 MP12 GPU, and 12 GB of DRAM.

The speech data were sampled at 16 kHz with 16 bits per sample. They were coded into 40-dimensional power mel-frequency filterbank features, computed by applying a power operation with coefficient 1/15 to the mel-spectrogram [23]. The frames were computed every 10 ms and windowed with a 25 ms Hanning window. We split words into 10K word pieces using the byte-pair encoding (BPE) method for both the Korean and the English normalized text corpora. Especially for Korean, we reduced the total number of Korean character units from 11,172 to 68 by decomposing each Korean character into two or three graphemes depending on the presence of a final consonant.

We constructed our ASR system based on ReturNN [24]. To speed up training, we used multi-GPU training based on the Horovod [25, 26] all-reduce method, and for better convergence of the model, a ramp-up strategy for both the learning rate and the number of workers was used [6]. We used uniform label smoothing on the output label distribution of the decoder for regularization, and scheduled sampling was applied at a later stage of training to reduce the mismatch between training and decoding. The initial total pooling factor in the Encoder is 32 for Librispeech, but 16 is used for the internal usage data due to training sensitivity; both are reduced to 8 after the pre-training stage. The n-gram LMs were stored in a const arpa structure [27].
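A minimal sketch of this feature frontend (40-dimensional power-mel filterbank features with a 1/15 power coefficient, 10 ms shift, 25 ms window) is given below, using librosa purely as an illustrative choice of library; the FFT size and the exact implementation used in the paper are not specified and are our assumptions.

```python
import numpy as np
import librosa

def power_mel_features(wav_path, sr=16000, n_mels=40, power_coeff=1.0 / 15.0):
    """40-dim power mel-filterbank features: 25 ms window, 10 ms shift,
    followed by a 1/15 power compression of the mel-spectrogram."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512,                    # assumption: FFT size covering the 25 ms window
        win_length=int(0.025 * sr),   # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms frame shift
        n_mels=n_mels)
    return np.power(mel, power_coeff).T   # (frames, 40)
```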
6.2. Performance

We performed several experiments to build the baseline E2E model on each dataset; the evaluated accuracies are shown in Table 1. In the table, Bi- and Uni- mean bi-directional and uni-directional, respectively, and Cell size denotes the size of the encoder LSTM cells. The attention dimension is the same as the encoder cell size, and 1000 was used as the size of the decoder. The chunk size of MoChA is 2 for all the experiments, since we did not see any significant improvement in accuracy from extending the chunk size beyond two.

Fig. 2. Comparison of the alignments produced by each attention method: (a) Bi-LSTM full attention, (b) Uni-LSTM full attention, (c) Uni-LSTM MoChA.

As shown in Fig. 2, compared with the bi-directional LSTM case, the uni-directional model's alignment shows some time delay because a uni-directional LSTM cannot use backward information from the input sequence [28]. Alignment calculation with soft full attention may have the advantage of seeing more information and utilizing better context, but for speech recognition, since the alignment of speech utterances is monotonic, that advantage may not be so great.

The accuracy of each trained model with the various optimization methods is shown in Table 2. The joint weight λ for joint CTC training was 0.8, and it was gradually increased during training. We used 13 and 50 as the maximum sizes of the frequency-axis mask and the time (frame)-axis mask, respectively, and one mask of each type was applied. For joint MWER training, we used λ = 0.6 and a beam size of 4. Spec augmentation brought a large improvement, especially on test-other, and after joint MWER training the WERs on test-clean and test-other were finally improved by a relative 16.41% and 17.71%, respectively, compared to the baseline.

Table 1. Performance of the attention-based models depending on the directionality and the cell size of the encoder LSTM layers. Joint CTC and label smoothing are applied for all the results, and data augmentation is only used on the usage data. The beam size for beam-search decoding is 12.

Encoder    Attention  Cell size  Librispeech test-clean  test-other  Usage KOR WER  Usage ENG WER
Bi-LSTM    Full       1024       4.38%                   14.34%      8.58%          8.25%
Uni-LSTM   Full       1536       6.27%                   18.42%      -              -
Uni-LSTM   MoChA      1024       6.88%                   19.11%      11.34%         10.77%
Uni-LSTM   MoChA      1536       6.30%                   18.41%      9.33%          8.82%

Table 2. Accuracy improvement from the optimizations (Librispeech).

                                 Test-clean  Test-other
MoChA (baseline)                 6.70%       18.86%
+ Joint CTC & label smoothing    6.30%       18.41%
+ Spec augmentation              5.93%       15.98%
+ Joint MWER                     5.60%       15.52%

6.3. Compression

We applied hyper-LRA to the weight matrices of each layer, with the rank r chosen empirically per layer. For the encoder LSTM layers, the ranks of the first and last layers were set larger than those of the internal layers due to the severity of their accuracy degradation. The distortion period D was set to the total number of iterations in one sub-epoch divided by 16. The compressed model was retrained with the whole training data. In addition, we adopted 8-bit quantization using TensorFlow Lite [29, 30], both to compress the model and to speed it up. As shown in Table 3, the model sizes were reduced by at least 3.4 times by applying hyper-LRA, and by more than 13.68 times in total after 8-bit quantization, with minimal performance degradation. Furthermore, we were able to compensate for the lost accuracy with joint MWER training. At the same time, the decoding speeds of the Korean and English models became 13.97 and 9.81 times faster than the baseline models, respectively. The average latencies of the final models were 140 ms and 156 ms, and the memory usage during decoding (code + data) was 230 MB and 235 MB for Korean and English, respectively.

Table 3. Performance of hyper-LRA. Model sizes are given in megabytes (MB), and the beam size is 4. xRT denotes the real-time factor of decoding speed.

                       Korean                     English
Bits   Hyper-LRA       WER     xRT    Size        WER     xRT    Size
32     no              9.37    4.89   530.56      9.03    4.32   530.50
32     yes             9.85    0.99   140.18      8.91    1.15   153.98
32     +MWER           9.60    1.26   140.18      8.64    1.48   153.98
8      no              9.64    1.18   132.88      9.07    0.94   132.87
8      yes             10.21   0.33   35.34       9.24    0.38   38.77
8      +MWER           9.80    0.35   35.34       8.88    0.44   38.77

6.4. Personalization

We evaluated our on-demand adaptation method on three domains in Korean. Names of contacts, IoT devices, and applications were used to synthesize utterances from pattern sentences in which the names were replaced with a class name, e.g., "call @name". Individual n-gram LMs were built for each domain using the synthesized corpus; the LMs for the specific domains were built within 5 seconds, as shown in Table 4.

Table 4. Building times for the n-gram LMs (in seconds).

Domain    Entities  Patterns  Utterances  Time
Contact   2307      23        53061       4.37
App       699       25        17475       1.78
IoT       441       4         1764        0.74

As shown in Table 5, the WER for the App domain dropped dramatically from 12.76% to 6.78% without any accuracy degradation on the general domain. The additional xRT for the LM fusion was less than 0.15 on average, even though the number of LM look-ups reached into the millions. The LM sizes were around 43 MB for the general LM and 2 MB for all three domain LMs. All test sets were recorded on mobile devices.

Table 5. Performance improvement from on-demand adaptation. xRTs were evaluated on-device, while WERs were evaluated on servers with the uncompressed MoChA model from Table 1.

          Length     MoChA            Adapted
Domain    (hours)    WER      xRT     WER      xRT
General   1.0        9.33     0.35    9.30     0.61
Contact   3.1        15.59    0.34    11.08    0.42
App       1.2        12.76    0.34    6.78     0.48
IoT       1.5        38.83    0.43    21.92    0.52
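The adaptation corpus described in Section 6.4 can be synthesized with something as simple as the sketch below, which expands pattern sentences containing a class tag with the user's entity names; the pattern format and the function are illustrative assumptions, and the resulting text would then be fed to a standard n-gram LM toolkit.

```python
def synthesize_adaptation_corpus(patterns, entities, tag="@name"):
    """Expand pattern sentences such as "call @name" with user-specific entities,
    producing the text corpus from which a domain n-gram LM is built."""
    corpus = []
    for pattern in patterns:
        for entity in entities:
            corpus.append(pattern.replace(tag, entity))
    return corpus

# Example: 23 contact patterns x 2307 contact names ~ 53K utterances (cf. Table 4)
utterances = synthesize_adaptation_corpus(
    patterns=["call @name", "send a message to @name"],
    entities=["Alice", "Bob"])
```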
7. DISCUSSION

We have built the first on-device streaming ASR system based on MoChA models trained with a large corpus. To overcome the difficulties in training the MoChA models, we adopted various training strategies such as joint loss training with CTC and MWER, layer-wise pre-training, and data augmentation. Moreover, by introducing hyper-LRA, we could reduce the size of our MoChA models to fit on devices without sacrificing recognition accuracy. For personalization, we used a shallow fusion method with n-gram LMs, which showed improved results on the target domains without sacrificing accuracy on the general domain.
8. REFERENCES

[1] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, and Zhenyao Zhu, "Exploring neural transducers for end-to-end speech recognition," 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 206-213, 2017.

[2] Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly, "A comparison of sequence-to-sequence models for speech recognition," in Proc. Interspeech 2017, 2017, pp. 939-943.

[3] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell," ArXiv, vol. abs/1508.01211, 2015.

[4] Alex Graves, "Sequence transduction with recurrent neural networks," ArXiv, vol. abs/1211.3711, 2012.

[5] Kanishka Rao, Hasim Sak, and Rohit Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer," 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 193-199, 2017.

[6] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, "State-of-the-art speech recognition with sequence-to-sequence models," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774-4778, 2018.

[7] Navdeep Jaitly, David Sussillo, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, and Samy Bengio, "A neural transducer," ArXiv, vol. abs/1511.04868, 2016.

[8] Colin Raffel, Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck, "Online and linear-time attention by enforcing monotonic alignments," in ICML, 2017.

[9] Chung-Cheng Chiu and Colin Raffel, "Monotonic chunkwise attention," in International Conference on Learning Representations, 2018.

[10] Rohit Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan, "Minimum word error rate training for attention-based sequence-to-sequence models," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4839-4843, 2018.

[11] Albert Zeyer, André Merboldt, Ralf Schlüter, and Hermann Ney, "A comprehensive analysis on attention models," in Interpretability and Robustness in Audio, Speech, and Language (IRASL) Workshop, Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, Dec. 2018.

[12] D. Lee, P. Kapoor, and B. Kim, "Deeptwist: Learning model compression via occasional weight distortion," ArXiv, vol. abs/1810.12823, 2018.

[13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[14] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 577-585.

[15] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, "Improved training of end-to-end attention models for speech recognition," in Proc. Interspeech 2018, 2018, pp. 7-11.

[16] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.

[17] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.

[18] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint ctc-attention based end-to-end speech recognition using multi-task learning," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835-4839, 2017.

[19] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.

[20] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech 2019, 2019, pp. 2613-2617.

[21] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network," in ICASSP, 2014.

[22] Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N. Sainath, Zhifeng Chen, and Rohit Prabhavalkar, "An analysis of incorporating an external language model into a sequence-to-sequence model," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[23] Chanwoo Kim, Minkyu Shin, Abhinav Garg, and Dhananjaya Gowda, "Improved vocal tract length perturbation for a state-of-the-art end-to-end speech recognition system," in Proc. Interspeech 2019, 2019, pp. 739-743.

[24] Patrick Doetsch, Albert Zeyer, Paul Voigtlaender, Ilya Kulikov, Ralf Schlüter, and Hermann Ney, "Returnn: The rwth extensible training framework for universal recurrent neural networks," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[25] Alexander Sergeev and Mike Del Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," ArXiv, vol. abs/1802.05799, 2018.

[26] C. Kim, S. Kim, K. Kim, M. Kumar, J. Kim, K. Lee, C. Han, A. Garg, E. Kim, M. Shin, S. Singh, L. Heck, and D. Gowda, "End-to-end training of a large vocabulary end-to-end speech recognition system," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, accepted.

[27] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.

[28] Hasim Sak, Andrew W. Senior, Kanishka Rao, and Françoise Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," ArXiv, vol. abs/1507.06947, 2015.

[29] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.

[30] Google Inc., "Tensorflow Lite," online documentation, https://www.tensorflow.org/lite, 2018, [Online; accessed 2018-02-07].