Interesting Papers from ICASSP 2020 Conference

Introduction

In May, I attended my first (virtual) conference, the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). I presented my paper entitled Layer-normalized LSTM for Hybrid-HMM and End-to-end ASR, in which we investigated different variants of applying layer normalization inside the internal recurrence of the LSTM. Beyond that, there were some quite interesting papers (also relevant to my research) that I would like to highlight here.

Papers

  1. Hybrid Autoregressive Transducer (HAT)
    • Time-synchronous (and label-synchronous) transducer model: it is also considered label-synchronous because it allows emitting multiple labels at one time step
    • What makes it different from other related models?
      • Different formulation of the local posterior probabilities (blank and label)
      • Provides a measure of the quality of the internal LM
      • Subtract the internal LM during inference in order to add an external LM
    • Uses a Bernoulli distribution to model the decision of taking the blank edge
    • It has been observed that shallow fusion (interpolation with the log posterior) does not help RNN-T models significantly, so interpolation is done with the log likelihood instead: by Bayes' rule, \(\log p(x|y)\) equals \(\log p(y|x) - \log p(y)\) up to a term independent of \(y\). \(\log p(y|x)\) is HAT's posterior, so to apply Bayes' decision rule correctly we need to subtract the prior \(\log p(y)\), i.e. the internal LM. The question is then how to compute this prior (a small score-combination sketch follows below)
    • They tried adding an extra loss to minimize the prior cost (multi-task learning), but this gave no gain in WER
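A minimal sketch of how such an internal-LM subtraction could look when combining scores during beam search, assuming per-label log scores are already available; the weights and the function name are my own illustration, not the paper's exact recipe:

```python
def hat_fused_score(log_posterior: float,
                    log_internal_lm: float,
                    log_external_lm: float,
                    ilm_weight: float = 0.5,
                    lm_weight: float = 0.7) -> float:
    """Score of one candidate label during beam search.

    log_posterior:   HAT's label posterior log p(y|x)
    log_internal_lm: the model's internal LM score for the label
    log_external_lm: log probability from a separately trained LM

    Subtracting the internal LM approximates a log likelihood, on top
    of which the external LM can be added (weights are illustrative).
    """
    return log_posterior - ilm_weight * log_internal_lm + lm_weight * log_external_lm
```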
  2. RNN-Transducer with Stateless Prediction Network
    • They claim that the prediction network does not act as an LM; rather, it helps the model align the input features to the output labels, while the encoder plus joint network learns both acoustic and linguistic features
    • The prediction network is usually an RNN. They investigate removing this recurrence and making the network stateless (or first-order), so that the joint network takes as input the audio features and only the last output symbol
    • RNN-T overfits heavily with small data
    • Using a pre-trained RNN LM to initialize the prediction network (whether frozen afterwards or not) makes the model worse. Does this mean that the prediction network is not analogous to an LM?
    • Interesting experiment on checking the importance (or relevance) of each model component: if the prediction network is trained on only 1% of the data and then frozen, while the encoder and joint network are trained on everything, the model still performs quite well. How important, then, is the prediction network?
    • At the grapheme level there is a drawback with repeated letters, since the stateless PN does not remember how many repeated characters it should output (e.g. in "food"). With word-pieces, the stateless PN matches the model with an RNN PN, though, it seems, not on English
    • Conclusion: the PN is not analogous to an LM in ASR. A stateless PN gives a smaller model, simpler decoding with beam search, and faster prediction (a minimal sketch of a stateless PN with its joint network follows below)
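A minimal PyTorch-style sketch of what a stateless (first-order) prediction network plus joint network could look like; the class names and dimensions are my own illustration, not taken from the paper:

```python
import torch
import torch.nn as nn


class StatelessPredictionNetwork(nn.Module):
    """First-order prediction network: an embedding of only the most
    recently emitted label, with no recurrence over the label history."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, last_label: torch.Tensor) -> torch.Tensor:
        # last_label: (batch,) index of the previously emitted symbol
        return self.embed(last_label)


class JointNetwork(nn.Module):
    """Combines one encoder frame with the prediction network output."""

    def __init__(self, enc_dim: int, pred_dim: int, joint_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc_frame: torch.Tensor, pred_out: torch.Tensor) -> torch.Tensor:
        joint = torch.tanh(self.proj(torch.cat([enc_frame, pred_out], dim=-1)))
        return self.out(joint)  # logits over output labels (incl. blank)
```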
  3. Improving Sequence-To-Sequence Speech Recognition Training With On-The-Fly Data Augmentation
    • Presented some on-the-fly data augmentation methods for attention-based end-to-end models
    • Time stretch: stretch every window of w feature vectors by a factor s drawn from a uniform distribution over [low, high], producing a new window of size w * s (a small sketch follows below)
    • Sub-sequence sampling (to strengthen the decoder): use forced alignments (via an HMM) to split the training sentences into sub-sequences and use them during training
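A small sketch of the time-stretch idea on a feature matrix; the window size and stretch range below are placeholders rather than the paper's tuned values, and nearest-neighbour resampling is just the simplest choice:

```python
import numpy as np


def time_stretch(features: np.ndarray, window: int = 50,
                 low: float = 0.8, high: float = 1.25) -> np.ndarray:
    """Stretch each window of `window` feature vectors by a random
    factor s ~ U[low, high] via nearest-neighbour resampling.

    features: (num_frames, feat_dim) array, e.g. log-mel frames.
    """
    out = []
    for start in range(0, len(features), window):
        chunk = features[start:start + window]
        s = np.random.uniform(low, high)
        new_len = max(1, int(round(len(chunk) * s)))
        # map each new frame back to a source frame in the original chunk
        idx = np.minimum((np.arange(new_len) / s).astype(int), len(chunk) - 1)
        out.append(chunk[idx])
    return np.concatenate(out, axis=0)
```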
  4. High Accuracy and Low-latency Speech Recognition with Two-head Contextual Layer Trajectory LSTM Model
    • In hybrid systems, we use an LSTM to capture both temporal and classification information. This paper decouples the tasks of temporal modelling and target classification, using a time-LSTM and a depth-LSTM respectively (layer-trajectory LSTM, ltLSTM); a rough sketch follows below
    • They also add future context to the depth-LSTM (cltLSTM), but this increases latency
    • To solve the latency issue, they propose a two-head cltLSTM
    • They add a teacher-student learning stage after the frame-wise cross-entropy and sequence-discriminative training stages. It is done at the sequence level, by minimizing a KL divergence over hypothesis sequences rather than over the frame-level output posteriors
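A rough PyTorch-style sketch of the layer-trajectory idea (time-LSTMs over time, a depth-LSTM across layers at every frame); class names, dimensions, and the exact wiring are my own simplification, without the future context or the two heads:

```python
import torch
import torch.nn as nn


class LayerTrajectoryLSTM(nn.Module):
    """Time-LSTMs handle temporal modelling per layer; a depth-LSTM runs
    across the stacked layer outputs at each frame and feeds the
    frame-wise target classifier."""

    def __init__(self, input_dim: int, hidden_dim: int,
                 num_layers: int, num_targets: int):
        super().__init__()
        self.time_lstms = nn.ModuleList(
            nn.LSTM(input_dim if i == 0 else hidden_dim, hidden_dim, batch_first=True)
            for i in range(num_layers)
        )
        self.depth_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_targets)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim)
        layer_outputs = []
        h = x
        for time_lstm in self.time_lstms:
            h, _ = time_lstm(h)                  # temporal modelling at this layer
            layer_outputs.append(h)
        g = torch.stack(layer_outputs, dim=2)    # (batch, time, layers, hidden)
        b, t, l, d = g.shape
        g, _ = self.depth_lstm(g.reshape(b * t, l, d))  # scan over the layer axis
        top = g[:, -1, :].reshape(b, t, d)       # trajectory state at the top layer
        return self.classifier(top)              # frame-wise target logits
```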
  5. A Streaming On-Device E2E Model Surpassing Server-side Conventional Model Quality and Latency
    • Two-pass training: the 1st pass consists of a shared encoder + RNN-T decoder, and the 2nd pass consists of an additional encoder + LAS decoder
    • Decoding: use the RNN-T decoder to generate the hypotheses, then rescore them with the LAS decoder
    • Using an external LM with multiple domains: text normalization might be a problem ("$100" vs. "one hundred dollars"). Solution: feed a one-hot vector of the domain id to the RNN-T encoder, so the model can learn from domain-specific data
    • Robustness to accents: for hybrid systems, using the lexicon is one solution. The end-to-end model works at the word-piece level, so we can let it decide how to break up the output sequence
    • Use warm-up for some epochs and then keep it constant
    • Latency improvements: we want the model to output the end token (EOQ or </s>) as close as possible to when the speaker finishes speaking, in order to reduce latency (there is also still the rescoring latency). For that, the RNN-T can be made to predict the end token, with the loss penalized so that this prediction lands as close to the last word as possible. The penalty is the difference between the predicted time of the end token and its ground-truth time (a small sketch of such a penalty follows below)
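A small sketch of what such an end-of-query latency penalty could look like as an extra term on top of the RNN-T loss; the function name, the absolute-difference form, and the weight are my own assumptions:

```python
import torch


def eos_latency_penalty(pred_eos_frame: torch.Tensor,
                        ref_eos_frame: torch.Tensor,
                        weight: float = 1.0) -> torch.Tensor:
    """Penalize the gap between the frame where the model emits the end
    token and the frame where the speaker actually stopped (e.g. taken
    from a forced alignment). Both inputs have shape (batch,)."""
    return weight * (pred_eos_frame.float() - ref_eos_frame.float()).abs().mean()
```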
  6. Synchronous Transformers For End-to-end Speech Recognition
    • Proposes a sync-transformer model, combining the advantages of the transformer and the transducer
    • The encoder is trained with chunk-wise, future-masked self-attention (with overlapping chunks); a sketch of such a mask follows below
    • Decoder: at each step, the decoder predicts a symbol conditioned on the previous output and the current encoder chunk. Once a special token is predicted, it switches to the next chunk
    • Training: first, the sync-transformer is initialized from an already trained transformer model. The output probabilities are represented as a lattice over all target-sequence labels (since we cannot know which target labels correspond to a given chunk). The probability of an output sequence is then computed by summing over all alignment paths with the forward-backward algorithm: horizontal transitions move to the next chunk, vertical transitions emit output labels, and more than one label can be produced per chunk. Training minimizes an RNN-T-style loss
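A small sketch of how such a chunk-wise, future-masked self-attention mask could be built (ignoring the chunk overlap for simplicity); the function name and the "no look-ahead past the current chunk" convention are my own reading:

```python
import torch


def chunkwise_causal_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Boolean (num_frames, num_frames) mask where entry (i, j) is True
    if frame i may attend to frame j, i.e. j lies in the same chunk as i
    or in an earlier chunk. Future chunks stay masked, so the encoder
    can be run chunk by chunk at inference time."""
    chunk_id = torch.arange(num_frames) // chunk_size
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)
```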
  7. E2E-SINCNET: Toward Fully End-to-end Speech Recognition
    • Model: attention-based encoder-decoder with joint CTC decoding (the losses are interpolated during training)
    • Operates on raw audio by using SincNet to learn the signal filters, instead of plain CNNs, which would require a huge number of parameters (a sketch of the sinc filter parametrization follows below)
    • Experiments were done only on the Wall Street Journal task
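A small sketch of the SincNet-style band-pass parametrization: each filter is defined only by its two cutoff frequencies (learnable in the real model), built as the difference of two low-pass sinc kernels and then windowed. Amplitude normalization and the exact window choice are simplified here:

```python
import numpy as np


def sinc_bandpass_kernel(f_low: float, f_high: float,
                         kernel_size: int = 251,
                         sample_rate: int = 16000) -> np.ndarray:
    """Band-pass FIR kernel parametrized by two cutoff frequencies (Hz):
    difference of two low-pass sinc filters, Hamming-windowed.
    Amplitude normalization is omitted for brevity."""
    t = np.arange(-(kernel_size // 2), kernel_size // 2 + 1) / sample_rate

    def lowpass(fc: float) -> np.ndarray:
        return 2.0 * fc * np.sinc(2.0 * fc * t)

    return (lowpass(f_high) - lowpass(f_low)) * np.hamming(kernel_size)
```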
  8. SpecAugment on Large Scale Datasets
    • Introduces adaptive policies for the size of the applied masks. This is needed because utterances vary in length, and applying a wide mask to a short utterance may not be a good idea. The policies adapt the mask size, the multiplicity, or both (FullAdapt); a small sketch follows below
    • WER goes from 5.7% to 5.2% on LibriSpeech with a big external LM when using the FullAdapt policy
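A small sketch of an adaptive time-masking step where both the maximum mask width and the number of masks scale with the utterance length; the ratios below are placeholders, not the paper's tuned values:

```python
import numpy as np


def adaptive_time_mask(features: np.ndarray,
                       size_ratio: float = 0.05,
                       mult_ratio: float = 0.04) -> np.ndarray:
    """Zero out random time spans of a (num_frames, feat_dim) feature
    matrix. Maximum mask width and number of masks both grow with the
    utterance length instead of being fixed constants."""
    num_frames = len(features)
    out = features.copy()
    max_width = max(1, int(size_ratio * num_frames))
    num_masks = max(1, int(mult_ratio * num_frames))
    for _ in range(num_masks):
        width = np.random.randint(0, max_width + 1)
        start = np.random.randint(0, max(1, num_frames - width))
        out[start:start + width] = 0.0
    return out
```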
  9. SNDCNN: Self-Normalizing Deep CNNs With Scaled Exponential Linear Units for Speech Recognition
    • Batch normalization is commonly used, together with residual connections; this leads to additional memory and computational cost
    • They use the SELU (scaled exponential linear unit) activation function to obtain a self-normalizing network. However, as the network gets deeper, the effect of SELU becomes weaker (the SELU definition is sketched below)
    • To reduce latency: computation is done only on every 3rd frame (with a sliding window), which reduces latency by 47% with no WER impact (shallow CNNs?)
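For reference, the SELU activation itself is just a scaled ELU with fixed constants; a minimal sketch:

```python
import numpy as np

# Constants from the self-normalizing networks paper (Klambauer et al., 2017)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: np.ndarray) -> np.ndarray:
    """Scaled exponential linear unit. With suitable initialization it keeps
    activations close to zero mean and unit variance, which is what lets
    the network drop batch normalization."""
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))
```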
Written on June 7, 2020