Listen, Understand and Translate:
Triple Supervision Decouples End-to-end Speech-to-text Translation
Qianqian Dong¹,², Rong Ye³, Mingxuan Wang³, Hao Zhou³, Shuang Xu¹, Bo Xu¹,², Lei Li³
1Institute of Automation, Chinese Academy of Sciences,   
2School of Artificial Intelligence, University of Chinese Academy of Sciences,   
3ByteDance AI Lab
Overview
LUT, Listen-Understand-Translate, is a unified framework with triple supervision that decouples the end-to-end speech-to-text translation task. In addition to the translation loss on the target-language sentence, LUT includes two auxiliary supervision signals: one guides the acoustic encoder to extract acoustic features from the input, and the other guides the semantic encoder to extract semantic features relevant to the source transcription text.
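The three supervision signals above can be combined into a single training objective. The following is a minimal PyTorch sketch of such a triple-supervision loss; the specific loss choices (CTC for the acoustic signal, MSE for matching semantic states to transcription embeddings) and all names and weights are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TripleSupervisionLoss(nn.Module):
    """Sketch of a triple-supervision objective in the spirit of LUT:
    - an ASR-style loss on the acoustic encoder (CTC here, assumed),
    - a feature-matching loss pulling semantic-encoder states toward
      embeddings of the source transcription (MSE here, assumed),
    - cross-entropy on the target-language translation.
    """

    def __init__(self, w_acoustic=1.0, w_semantic=1.0, w_translation=1.0):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.mse = nn.MSELoss()
        self.ce = nn.CrossEntropyLoss()
        self.weights = (w_acoustic, w_semantic, w_translation)

    def forward(self, ctc_log_probs, input_lengths,
                transcript_ids, transcript_lengths,
                semantic_states, text_embeddings,
                translation_logits, translation_ids):
        # Acoustic supervision: align encoder frames with the transcription.
        # ctc_log_probs: (T, B, V); transcript_ids: (B, S).
        l_acoustic = self.ctc(ctc_log_probs, transcript_ids,
                              input_lengths, transcript_lengths)
        # Semantic supervision: match semantic-encoder states (B, S, D)
        # to embeddings of the source transcription text.
        l_semantic = self.mse(semantic_states, text_embeddings)
        # Translation supervision: cross-entropy on target tokens.
        # translation_logits: (B, L, V) -> (B, V, L) for CrossEntropyLoss.
        l_translation = self.ce(translation_logits.transpose(1, 2),
                                translation_ids)
        wa, ws, wt = self.weights
        return wa * l_acoustic + ws * l_semantic + wt * l_translation
```

In practice the three weights trade off how strongly the auxiliary signals shape the acoustic and semantic encoders against the main translation objective.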
Results
We evaluate LUT on the Augmented English-French, IWSLT2018 English-German, and TED English-Chinese speech translation benchmarks, and the results demonstrate its effectiveness. The table below shows BLEU scores on the Augmented English-French dataset under greedy and beam-search decoding.

| Method | greedy | beam |
|---|---|---|
| MT system | | |
| Transformer MT (Liu et al. 2019) | 21.35 | 22.91 |
| Base ST setting | | |
| LSTM ST (Bérard et al. 2018) | 12.30 | 12.90 |
| +pre-train+multitask (Bérard et al. 2018) | 12.60 | 13.40 |
| LSTM ST+pre-train (Inaguma et al. 2020) | - | 16.68 |
| Transformer+pre-train (Liu et al. 2019) | 13.89 | 14.30 |
| +knowledge distillation (Liu et al. 2019) | 14.96 | 17.02 |
| TCEN-LSTM (Wang et al. 2020a) | - | 17.05 |
| Transformer+ASR pre-train (Wang et al. 2020b) | - | 15.97 |
| Transformer+curriculum pre-train (Wang et al. 2020b) | - | 17.66 |
| LUT without pre-training | 16.70 | 17.75 |
| Expanded ST setting | | |
| LSTM+pre-train+SpecAugment (Bahar et al. 2019) | - | 17.00 |
| Multilingual ST+PT (Inaguma et al. 2019) | - | 17.60 |
| Transformer+ASR pre-train (Wang et al. 2020b) | - | 16.90 |
| Transformer+curriculum pre-train (Wang et al. 2020b) | - | 18.01 |
| LUT with pre-training | 17.55 | 18.34 |
Cases
Our method is fault-tolerant to errors made during acoustic modeling, such as incorrect recognition, missed words, and repeated words, as the cases below illustrate.

Case 1
Reference (Transcription) it was mister jack maldon
Hypothesis (Transcription) it was mister jack mal
Reference (Translation) c'était m. jack maldon
Hypothesis (Translation) c'était m. jack maldon

Case 2
Reference (Transcription) cried the old soldier
Hypothesis (Transcription) cried the soldier
Reference (Translation) s'écria le vieux soldat,
Hypothesis (Translation) s'écria le vieux soldat,


Case 3
Reference (Transcription) chapter seventeen the abbes chamber
Hypothesis (Transcription) chapter seventeen teen the abbey chamber
Reference (Translation) chapitre xvii la chambre de l'abbé.
Hypothesis (Translation) chapitre xvii la chambre de l'abbé.

BibTeX
@inproceedings{dong2021listen,
  title={Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation},
  author={Dong, Qianqian and Ye, Rong and Wang, Mingxuan and Zhou, Hao and Xu, Shuang and Xu, Bo and Li, Lei},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2021}
}