Listen, Understand and Translate:
Triple Supervision Decouples End-to-end Speech-to-text Translation
Qianqian Dong¹,², Rong Ye³, Mingxuan Wang³, Hao Zhou³, Shuang Xu¹, Bo Xu¹,², Lei Li³
1Institute of Automation, Chinese Academy of Sciences,   
2School of Artificial Intelligence, University of Chinese Academy of Sciences,   
3ByteDance AI Lab
Overview
LUT, Listen-Understand-Translate, is a unified framework with triple supervision that decouples the end-to-end speech-to-text translation task. In addition to the translation loss on the target-language sentence, LUT includes two auxiliary supervision signals: one guides the acoustic encoder to extract acoustic features from the input, and the other guides the semantic encoder to extract semantic features relevant to the source transcription text.
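The three supervision signals above can be combined into a single training objective. The following is a minimal PyTorch sketch of such a triple-supervision loss; the specific loss choices (CTC for the acoustic signal, MSE for matching semantic states to transcription embeddings) and all names and weights are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TripleSupervisionLoss(nn.Module):
    """Sketch of a triple-supervision objective in the spirit of LUT:
    - an ASR-style loss on the acoustic encoder (CTC here, assumed),
    - a feature-matching loss pulling semantic-encoder states toward
      embeddings of the source transcription (MSE here, assumed),
    - cross-entropy on the target-language translation.
    """

    def __init__(self, w_acoustic=1.0, w_semantic=1.0, w_translation=1.0):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.mse = nn.MSELoss()
        self.ce = nn.CrossEntropyLoss()
        self.weights = (w_acoustic, w_semantic, w_translation)

    def forward(self, ctc_log_probs, input_lengths,
                transcript_ids, transcript_lengths,
                semantic_states, text_embeddings,
                translation_logits, translation_ids):
        # Acoustic supervision: align encoder frames with the transcription.
        # ctc_log_probs: (T, B, V); transcript_ids: (B, S).
        l_acoustic = self.ctc(ctc_log_probs, transcript_ids,
                              input_lengths, transcript_lengths)
        # Semantic supervision: match semantic-encoder states (B, S, D)
        # to embeddings of the source transcription text.
        l_semantic = self.mse(semantic_states, text_embeddings)
        # Translation supervision: cross-entropy on target tokens.
        # translation_logits: (B, L, V) -> (B, V, L) for CrossEntropyLoss.
        l_translation = self.ce(translation_logits.transpose(1, 2),
                                translation_ids)
        wa, ws, wt = self.weights
        return wa * l_acoustic + ws * l_semantic + wt * l_translation
```

In practice the three weights trade off how strongly the auxiliary signals shape the acoustic and semantic encoders against the main translation objective.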
Results
We evaluate LUT on the Augmented English-French, IWSLT2018 English-German, and TED English-Chinese speech translation benchmarks, and the results demonstrate its effectiveness. The table below shows BLEU scores on the Augmented English-French dataset under greedy and beam-search decoding.

| Method | greedy | beam |
|---|---|---|
| MT system | | |
| Transformer MT (Liu et al. 2019) | 21.35 | 22.91 |
| Base ST setting | | |
| LSTM ST (Bérard et al. 2018) | 12.30 | 12.90 |
| +pre-train+multitask (Bérard et al. 2018) | 12.60 | 13.40 |
| LSTM ST+pre-train (Inaguma et al. 2020) | - | 16.68 |
| Transformer+pre-train (Liu et al. 2019) | 13.89 | 14.30 |
| +knowledge distillation (Liu et al. 2019) | 14.96 | 17.02 |
| TCEN-LSTM (Wang et al. 2020a) | - | 17.05 |
| Transformer+ASR pre-train (Wang et al. 2020b) | - | 15.97 |
| Transformer+curriculum pre-train (Wang et al. 2020b) | - | 17.66 |
| LUT without pre-training | 16.70 | 17.75 |
| Expanded ST setting | | |
| LSTM+pre-train+SpecAugment (Bahar et al. 2019) | - | 17.00 |
| Multilingual ST+PT (Inaguma et al. 2019) | - | 17.60 |
| Transformer+ASR pre-train (Wang et al. 2020b) | - | 16.90 |
| Transformer+curriculum pre-train (Wang et al. 2020b) | - | 18.01 |
| LUT with pre-training | 17.55 | 18.34 |
Cases
Our method is fault-tolerant to errors made during acoustic modeling, such as incorrect recognition, missed words, and repeated words, as the cases below illustrate.

Case 1
Reference (Transcription) it was mister jack maldon
Hypothesis (Transcription) it was mister jack mal
Reference (Translation) c'était m. jack maldon
Hypothesis (Translation) c'était m. jack maldon

Case 2
Reference (Transcription) cried the old soldier
Hypothesis (Transcription) cried the soldier
Reference (Translation) s'écria le vieux soldat,
Hypothesis (Translation) s'écria le vieux soldat,


Case 3
Reference (Transcription) chapter seventeen the abbes chamber
Hypothesis (Transcription) chapter seventeen teen the abbey chamber
Reference (Translation) chapitre xvii la chambre de l'abbé.
Hypothesis (Translation) chapitre xvii la chambre de l'abbé.

BibTeX
@inproceedings{dong2021listen,
  title={Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation},
  author={Dong, Qianqian and Ye, Rong and Wang, Mingxuan and Zhou, Hao and Xu, Shuang and Xu, Bo and Li, Lei},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2021}
}