The Multitarget TED Talks Task (MTTT)

This is a collection of multitarget bitexts based on TED Talks (https://www.ted.com).
The data is extracted from WIT3, which is also used for the IWSLT Machine Translation Evaluation Campaigns.

We use a different train/dev/test split from IWSLT. Here, all the dev and test sets share the same English side and come from the same talks.
There are 20 languages in total (English plus the 19 target languages listed below), so the dev and test sets are 20-way parallel.
This can support the evaluation of standard bilingual translation as well as multi-target and multi-source translation.

The dev and test sets have roughly 2000 sentences each, extracted from 30 talks, and are multi-way parallel.
The train sets for different languages may have different English sides, ranging from 77k to 188k "sentences" (1.5M to 3.9M English tokens). These train sets are not 20-way parallel; each represents the largest bitext we can extract for that language pair.
The data is preprocessed and tokenized with the Moses tokenizer by default, or with a language-specific tokenizer where available (PyArabic for Arabic, KyTea for Japanese, MeCab-ko for Korean, Jieba for Chinese).
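
For illustration, the snippet below sketches this kind of preprocessing in Python, assuming the sacremoses and jieba packages; it is not the exact script used to build MTTT, and the Arabic, Japanese, and Korean cases would use PyArabic, KyTea, and MeCab-ko analogously.

```python
# Illustrative sketch only: the released MTTT data is already tokenized,
# and the exact preprocessing scripts may differ.
from sacremoses import MosesTokenizer  # pip install sacremoses
import jieba                            # pip install jieba

def tokenize(line, lang):
    """Whitespace-tokenize one sentence for the given language code."""
    if lang == "zh":
        # Chinese: segment with Jieba rather than the Moses tokenizer.
        return " ".join(jieba.cut(line.strip()))
    # Default case: the Moses tokenizer with language-specific rules.
    return MosesTokenizer(lang=lang).tokenize(line.strip(), return_str=True)

print(tokenize("Thank you so much, Chris.", "en"))
print(tokenize("非常感谢,克里斯。", "zh"))
```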

Additionally, metadata about talk ids and seekvideo counters is retained, so that document-level processing or speech translation experiments are possible.
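
As a hedged sketch of how the talk ids could be used for document-level work, the snippet below regroups sentences into talks; it assumes a hypothetical metadata file with one whitespace-separated talk id and seekvideo id per line, aligned line-by-line with the tokenized text, which may not match the actual released file layout.

```python
# Hypothetical sketch: assumes a metadata file with one "talk_id seekvideo_id"
# pair per line, aligned with the sentences in the tokenized text file.
# The actual MTTT file layout may differ.
from collections import defaultdict

def group_by_talk(text_path, meta_path):
    """Return {talk_id: [sentences...]}, preserving the original order."""
    talks = defaultdict(list)
    with open(text_path, encoding="utf-8") as text_f, \
         open(meta_path, encoding="utf-8") as meta_f:
        for sentence, meta in zip(text_f, meta_f):
            talk_id = meta.split()[0]
            talks[talk_id].append(sentence.strip())
    return talks

docs = group_by_talk("test.en", "test.meta")  # hypothetical file names
print(len(docs), "talks reconstructed")
```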

The languages are: ar (Arabic), bg (Bulgarian), cs (Czech), de (German), fa (Farsi), fr (French), he (Hebrew), hu (Hungarian), id (Indonesian), ja (Japanese), ko (Korean), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), tr (Turkish), uk (Ukrainian), vi (Vietnamese), zh (Chinese). Note that all talks are originally spoken and transcribed in English, then translated by TED translators.

Data


Terms of Use

TED makes its collection available under the Creative Commons BY-NC-ND license. Please acknowledge TED when using this data. We acknowledge the authorship of TED Talks (BY condition). We are not redistributing the transcripts for commercial purposes (NC condition) nor making derivative works of the original contents (ND condition).


Leaderboard

The goal here is to create a standard way for researchers to compare and improve their machine translation systems. We are doing so in a friendly competition format. Feel free to email your BLEU results to x@cs.jhu.edu (x=kevinduh) for inclusion in the tables below (ideally, also provide a link to a paper or a comment about your system). BLEU is computed with the Moses toolkit's multi-bleu.perl on the provided tokenization. The tables below are sorted first by task, then by BLEU.
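
For reference, the snippet below is a minimal sketch of how a submission score could be computed by calling multi-bleu.perl from Python; it assumes a local Moses checkout, and the script path and file names are illustrative.

```python
# Minimal sketch: score a tokenized hypothesis against a tokenized reference
# with Moses multi-bleu.perl. Assumes a local Moses checkout; the script path
# and the file names below are illustrative.
import subprocess

def multi_bleu(hyp_path, ref_path,
               script="mosesdecoder/scripts/generic/multi-bleu.perl"):
    with open(hyp_path, encoding="utf-8") as hyp:
        result = subprocess.run(["perl", script, ref_path],
                                stdin=hyp, capture_output=True, text=True)
    return result.stdout.strip()  # e.g. "BLEU = 28.28, ..."

print(multi_bleu("hyp.test1.en", "ref.test1.en"))
```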

Translation into English (xx->en)

Task | Date | System Name | Submitter | test1 BLEU | Comment
ar-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 29.93 | Combining BPE subunits with character CNN for addressing source morphology
ar-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 28.28 | 6-layer transformer in sockeye-recipes
ar-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 27.50 | 2-layer mid-size RNN in sockeye-recipes
bg-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.93 | 6-layer transformer in sockeye-recipes
bg-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 35.84 | 2-layer mid-size RNN in sockeye-recipes
cs-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 26.05 | 6-layer transformer in sockeye-recipes
cs-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 25.80 | 2-layer mid-size RNN in sockeye-recipes
de-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 32.74 | Combining BPE subunits with character CNN for addressing source morphology
de-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 32.46 | 6-layer transformer in sockeye-recipes
de-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 31.34 | 2-layer mid-size RNN in sockeye-recipes
fa-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 22.08 | 6-layer transformer in sockeye-recipes
fa-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 21.21 | 2-layer mid-size RNN in sockeye-recipes
fr-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 35.49 | Combining BPE subunits with character CNN for addressing source morphology
fr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.09 | 6-layer transformer in sockeye-recipes
fr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 35.01 | 2-layer mid-size RNN in sockeye-recipes
he-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.09 | 6-layer transformer in sockeye-recipes
he-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 32.76 | 2-layer mid-size RNN in sockeye-recipes
he-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 30.81 | Combining BPE subunits with character CNN for addressing source morphology
hu-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 22.62 | Combining BPE subunits with character CNN for addressing source morphology
hu-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 21.14 | 6-layer transformer in sockeye-recipes
hu-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 20.64 | 2-layer mid-size RNN in sockeye-recipes
id-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 27.47 | 6-layer transformer in sockeye-recipes
id-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 26.85 | 2-layer mid-size RNN in sockeye-recipes
ja-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 10.90 | 6-layer transformer in sockeye-recipes
ja-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 10.42 | 2-layer mid-size RNN in sockeye-recipes
ko-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 15.23 | 6-layer transformer in sockeye-recipes
ko-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 14.30 | 2-layer mid-size RNN in sockeye-recipes
pl-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 23.56 | 6-layer transformer in sockeye-recipes
pl-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 21.97 | 2-layer mid-size RNN in sockeye-recipes
pt-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 41.80 | 6-layer transformer in sockeye-recipes
pt-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 41.67 | Combining BPE subunits with character CNN for addressing source morphology
pt-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 40.80 | 2-layer mid-size RNN in sockeye-recipes
ro-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 36.97 | Combining BPE subunits with character CNN for addressing source morphology
ro-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 34.96 | 6-layer transformer in sockeye-recipes
ro-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 34.56 | 2-layer mid-size RNN in sockeye-recipes
ru-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 24.14 | Combining BPE subunits with character CNN for addressing source morphology
ru-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 24.03 | 6-layer transformer in sockeye-recipes
ru-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 22.58 | 2-layer mid-size RNN in sockeye-recipes
tr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 22.40 | 6-layer transformer in sockeye-recipes
tr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 18.78 | 2-layer mid-size RNN in sockeye-recipes
uk-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 17.87 | 6-layer transformer in sockeye-recipes
uk-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 16.99 | 2-layer mid-size RNN in sockeye-recipes
vi-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 25.39 | 6-layer transformer in sockeye-recipes
vi-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 24.15 | 2-layer mid-size RNN in sockeye-recipes
zh-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 16.63 | 6-layer transformer in sockeye-recipes
zh-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 15.83 | 2-layer mid-size RNN in sockeye-recipes

Translation from English (en->xx)

TODO (set up leaderboard for en->xx)

Related Resources and Reference

We gratefully acknowledge WIT3, which provides ready-to-use versions of the TED data for research purposes. For a detailed description of WIT3, see:

Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of EAMT 2012.

You may also be interested in a related dataset from Ye et al. (NAACL 2018), which packages TED Talks in even more languages. The main difference is that its dev/test sets are not multi-way parallel as they are here; they differ for each language.

If you would like to cite this task:

@misc{duh18multitarget,
  author = {Kevin Duh},
  title = {The Multitarget TED Talks Task},
  howpublished = {\url{http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/}},
  year = {2018},
}
