Consecutive Decoding for Speech-to-text Translation
Qianqian Dong [1,2], Mingxuan Wang [3], Hao Zhou [3], Shuang Xu [1], Bo Xu [1,2], Lei Li [3]
[1] Institute of Automation, Chinese Academy of Sciences
[2] School of Artificial Intelligence, University of Chinese Academy of Sciences
[3] ByteDance AI Lab
COSTT is a unified training framework with consecutive decoding that combines the benefits of cascaded and end-to-end models. Thanks to its explicit multi-phase modeling, COSTT can exploit parallel bilingual text corpora, which traditional end-to-end speech translation (ST) models struggle to use.
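One way to picture consecutive decoding is a single output sequence that carries the transcript first and the translation second. The sketch below assumes the `<asr>` and `<ast>` tags that appear in the sample outputs on this page, and shows how a bilingual text pair could be laid out in that format. The function name `build_consecutive_target` is hypothetical, and this is an illustration rather than the authors' preprocessing code.

```python
# Tags taken from the sample outputs on this page; their exact role in
# training is an assumption made for this illustration.
ASR_TAG = "<asr>"  # precedes the source-language transcript
AST_TAG = "<ast>"  # precedes the target-language translation


def build_consecutive_target(transcript: str, translation: str) -> str:
    """Join a transcript and its translation into one tagged sequence."""
    return f"{ASR_TAG} {transcript.strip()} {AST_TAG} {translation.strip()}"


example = build_consecutive_target("i rushed aboard", "je me précipitai à bord .")
print(example)
```

Under this layout, a plain text pair from a parallel corpus has the same shape as the text side of a speech translation example, which is one plausible reason such corpora become easier to use.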

We conduct experiments on the Augmented English-French, IWSLT2018 English-German, and TED English-Chinese speech translation benchmarks and achieve state-of-the-art results on all three. The table below shows the results on the Augmented English-French dataset.

Method                                                  Enc Pre-train  Dec Pre-train  BLEU
MT system
  Transformer MT (Liu et al. 2019)                      -              -              22.91
Base ST setting
  LSTM ST (Bérard et al. 2018)                                                        12.90
  +pre-train+multitask (Bérard et al. 2018)                                           13.40
  LSTM ST+pre-train (Inaguma et al. 2020)                                             16.68
  Transformer+pre-train (Liu et al. 2019)                                             14.30
  +knowledge distillation (Liu et al. 2019)                                           17.02
  TCEN-LSTM (Wang et al. 2020a)                                                       17.05
  Transformer+ASR pre-train (Wang et al. 2020b)                                       15.97
  Transformer+curriculum pre-train (Wang et al. 2020b)                                17.66
  COSTT without pre-training                                                          17.83
Expanded ST setting
  LSTM+pre-train+SpecAugment (Bahar et al. 2019)                                      17.00
  Multilingual ST+PT (Inaguma et al. 2019)                                            17.60
  Transformer+ASR pre-train (Wang et al. 2020b)                                       16.90
  Transformer+curriculum pre-train (Wang et al. 2020b)                                18.01
  COSTT with pre-training                                                             18.23
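To make the size of the gains concrete, the following sketch computes COSTT's BLEU margin over the strongest listed ST baseline in each setting. The scores are copied from the table above; the helper name `margin_over_best_baseline` is hypothetical.

```python
# BLEU scores of the ST baselines, copied from the table above.
base_baselines = [12.90, 13.40, 16.68, 14.30, 17.02, 17.05, 15.97, 17.66]
expanded_baselines = [17.00, 17.60, 16.90, 18.01]


def margin_over_best_baseline(baselines: list[float], costt_bleu: float) -> float:
    """Return the BLEU difference between COSTT and the best baseline."""
    return round(costt_bleu - max(baselines), 2)


base_margin = margin_over_best_baseline(base_baselines, 17.83)
expanded_margin = margin_over_best_baseline(expanded_baselines, 18.23)
print(base_margin, expanded_margin)
```

Both margins are computed against the Transformer+curriculum pre-train rows, which are the strongest ST baselines in their respective settings.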
The structure of our method gives it clear advantages in avoiding missed translations and mistranslations and in tolerating errors, as the following cases illustrate.

Case 1
Transcript: said the doctor yes
Target: dit le docteur , oui .
Base ST: dit le docteur .
COSTT: <asr> said the doctor yes <ast> dit le docteur , oui .

Case 2
Transcript: i rushed aboard
Target: je me précipitai à bord.
Base ST: je me précipitai vers l’ avant .
COSTT: <asr> i rushed aboard <ast> je me précipitai à bord .

Case 3
Transcript: is there any news today
Target: y a-t-il des nouvelles aujourd’ hui ?
Base ST: est-ce que j’ ai déjà utilisé aujourd’ hui ?
COSTT: <asr> is there any news to day <ast> y a-t-il des nouvelles aujourd’ hui ?
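The COSTT outputs in the cases above carry both the recognized transcript and the translation in one sequence. A minimal sketch of separating the two, assuming the output always starts with `<asr>` and contains `<ast>` (the names `ConsecutiveOutput` and `split_consecutive_output` are hypothetical):

```python
from typing import NamedTuple


class ConsecutiveOutput(NamedTuple):
    transcript: str   # text between <asr> and <ast>
    translation: str  # text after <ast>


def split_consecutive_output(decoded: str) -> ConsecutiveOutput:
    """Split a '<asr> ... <ast> ...' string into its two segments."""
    asr_tag, ast_tag = "<asr>", "<ast>"
    decoded = decoded.strip()
    if not decoded.startswith(asr_tag) or ast_tag not in decoded:
        raise ValueError("expected a sequence of the form '<asr> ... <ast> ...'")
    # Drop the leading tag, then split once at the translation tag.
    transcript, translation = decoded[len(asr_tag):].split(ast_tag, 1)
    return ConsecutiveOutput(transcript.strip(), translation.strip())


case1 = split_consecutive_output("<asr> said the doctor yes <ast> dit le docteur , oui .")
print(case1.transcript)
print(case1.translation)
```

Applied to Case 3, the transcript segment would read "is there any news to day" while the translation segment still matches the target, which is the kind of fault tolerance the cases are meant to show.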

  title={Consecutive Decoding for Speech-to-text Translation},
  author={Qianqian Dong and Mingxuan Wang and Hao Zhou and Shuang Xu and Bo Xu and Lei Li},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},