The Multitarget TED Talks Task (MTTT)

This is a collection of multitarget bitexts based on TED Talks (https://www.ted.com).
The data is extracted from WIT3, which is also used for the IWSLT Machine Translation Evaluation Campaigns.

We use a different train/dev/test split from IWSLT. Here, all the dev and test sets share the same English side and come from the same talks.
There are 20 languages in total (English plus the 19 target languages listed below), so the dev and test sets are 20-way parallel.
This can support the evaluation of standard bilingual translation as well as multi-target and multi-source translation.

The dev and test sets have roughly 2000 sentences each, extracted from 30 talks, and are multi-way parallel.
The train sets for different languages may have different English sides, ranging from 77k to 188k "sentences" (1.5M to 3.9M English tokens). These train sets are not 20-way parallel; each represents the largest bitext we can extract for that language pair.
The data is preprocessed and tokenized with the Moses tokenizer by default, or with a language-specific tokenizer where available (PyArabic for Arabic, KyTea for Japanese, MeCab-ko for Korean, Jieba for Chinese).
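
For illustration, the snippet below sketches this kind of preprocessing in Python, assuming the sacremoses and jieba packages; it is not the exact script used to build MTTT, and the Arabic, Japanese, and Korean cases would use PyArabic, KyTea, and MeCab-ko analogously.

```python
# Illustrative sketch only: the released MTTT data is already tokenized,
# and the exact preprocessing scripts may differ.
from sacremoses import MosesTokenizer  # pip install sacremoses
import jieba                            # pip install jieba

def tokenize(line, lang):
    """Whitespace-tokenize one sentence for the given language code."""
    if lang == "zh":
        # Chinese: segment with Jieba rather than the Moses tokenizer.
        return " ".join(jieba.cut(line.strip()))
    # Default case: the Moses tokenizer with language-specific rules.
    return MosesTokenizer(lang=lang).tokenize(line.strip(), return_str=True)

print(tokenize("Thank you so much, Chris.", "en"))
print(tokenize("非常感谢,克里斯。", "zh"))
```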

Additionally, metadata about talk ids and seekvideo counters is retained, so that document-level processing or speech translation experiments are possible.
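
As a hedged sketch of how the talk ids could be used for document-level work, the snippet below regroups sentences into talks; it assumes a hypothetical metadata file with one whitespace-separated talk id and seekvideo id per line, aligned line-by-line with the tokenized text, which may not match the actual released file layout.

```python
# Hypothetical sketch: assumes a metadata file with one "talk_id seekvideo_id"
# pair per line, aligned with the sentences in the tokenized text file.
# The actual MTTT file layout may differ.
from collections import defaultdict

def group_by_talk(text_path, meta_path):
    """Return {talk_id: [sentences...]}, preserving the original order."""
    talks = defaultdict(list)
    with open(text_path, encoding="utf-8") as text_f, \
         open(meta_path, encoding="utf-8") as meta_f:
        for sentence, meta in zip(text_f, meta_f):
            talk_id = meta.split()[0]
            talks[talk_id].append(sentence.strip())
    return talks

docs = group_by_talk("test.en", "test.meta")  # hypothetical file names
print(len(docs), "talks reconstructed")
```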

The languages are: ar (Arabic), bg (Bulgarian), cs (Czech), de (German), fa (Farsi), fr (French), he (Hebrew), hu (Hungarian), id (Indonesian), ja (Japanese), ko (Korean), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), tr (Turkish), uk (Ukrainian), vi (Vietnamese), zh (Chinese). Note that all talks are originally spoken and transcribed in English, then translated by TED translators.

Data


Terms of Use

TED makes its collection available under the Creative Commons BY-NC-ND license. Please acknowledge TED when using this data. We acknowledge the authorship of TED Talks (BY condition). We are not redistributing the transcripts for commercial purposes (NC condition) nor making derivative works of the original contents (ND condition).


Leaderboard

The goal here is to create a standard way for researchers to compare and improve their machine translation systems. We are doing so in a friendly competition format. Feel free to email your BLEU results to x@cs.jhu.edu (x=kevinduh) for inclusion in the tables below (ideally, also provide a link to a paper or a comment about your system). BLEU is computed with the Moses toolkit's multi-bleu.perl on the provided tokenization. The tables below are sorted first by task, then by BLEU.
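
For reference, the snippet below is a minimal sketch of how a submission score could be computed by calling multi-bleu.perl from Python; it assumes a local Moses checkout, and the script path and file names are illustrative.

```python
# Minimal sketch: score a tokenized hypothesis against a tokenized reference
# with Moses multi-bleu.perl. Assumes a local Moses checkout; the script path
# and the file names below are illustrative.
import subprocess

def multi_bleu(hyp_path, ref_path,
               script="mosesdecoder/scripts/generic/multi-bleu.perl"):
    with open(hyp_path, encoding="utf-8") as hyp:
        result = subprocess.run(["perl", script, ref_path],
                                stdin=hyp, capture_output=True, text=True)
    return result.stdout.strip()  # e.g. "BLEU = 28.28, ..."

print(multi_bleu("hyp.test1.en", "ref.test1.en"))
```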

Translation into English (xx->en)

Task | Date | System Name | Submitter | test1 BLEU | Comment
ar-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 29.93 | Combining BPE subunits with character CNN for addressing source morphology
ar-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 28.28 | 6-layer transformer in sockeye-recipes
ar-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 27.50 | 2-layer mid-size RNN in sockeye-recipes
bg-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.93 | 6-layer transformer in sockeye-recipes
bg-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 35.84 | 2-layer mid-size RNN in sockeye-recipes
cs-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 26.05 | 6-layer transformer in sockeye-recipes
cs-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 25.80 | 2-layer mid-size RNN in sockeye-recipes
de-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 32.74 | Combining BPE subunits with character CNN for addressing source morphology
de-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 32.46 | 6-layer transformer in sockeye-recipes
de-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 31.34 | 2-layer mid-size RNN in sockeye-recipes
fa-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 22.08 | 6-layer transformer in sockeye-recipes
fa-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 21.21 | 2-layer mid-size RNN in sockeye-recipes
fr-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 35.49 | Combining BPE subunits with character CNN for addressing source morphology
fr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.09 | 6-layer transformer in sockeye-recipes
fr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 35.01 | 2-layer mid-size RNN in sockeye-recipes
he-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 35.09 | 6-layer transformer in sockeye-recipes
he-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 32.76 | 2-layer mid-size RNN in sockeye-recipes
he-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 30.81 | Combining BPE subunits with character CNN for addressing source morphology
hu-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 22.62 | Combining BPE subunits with character CNN for addressing source morphology
hu-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 21.14 | 6-layer transformer in sockeye-recipes
hu-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 20.64 | 2-layer mid-size RNN in sockeye-recipes
id-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 27.47 | 6-layer transformer in sockeye-recipes
id-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 26.85 | 2-layer mid-size RNN in sockeye-recipes
ja-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 10.90 | 6-layer transformer in sockeye-recipes
ja-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 10.42 | 2-layer mid-size RNN in sockeye-recipes
ko-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 15.23 | 6-layer transformer in sockeye-recipes
ko-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 14.30 | 2-layer mid-size RNN in sockeye-recipes
pl-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 23.56 | 6-layer transformer in sockeye-recipes
pl-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 21.97 | 2-layer mid-size RNN in sockeye-recipes
pt-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 41.80 | 6-layer transformer in sockeye-recipes
pt-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 41.67 | Combining BPE subunits with character CNN for addressing source morphology
pt-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 40.80 | 2-layer mid-size RNN in sockeye-recipes
ro-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 36.97 | Combining BPE subunits with character CNN for addressing source morphology
ro-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 34.96 | 6-layer transformer in sockeye-recipes
ro-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 34.56 | 2-layer mid-size RNN in sockeye-recipes
ru-en | 2018-11-20 | BPE+CharCNN | Pamela Shapiro (JHU) | 24.14 | Combining BPE subunits with character CNN for addressing source morphology
ru-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 24.03 | 6-layer transformer in sockeye-recipes
ru-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 22.58 | 2-layer mid-size RNN in sockeye-recipes
tr-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 22.40 | 6-layer transformer in sockeye-recipes
tr-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 18.78 | 2-layer mid-size RNN in sockeye-recipes
uk-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 17.87 | 6-layer transformer in sockeye-recipes
uk-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 16.99 | 2-layer mid-size RNN in sockeye-recipes
vi-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 25.39 | 6-layer transformer in sockeye-recipes
vi-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 24.15 | 2-layer mid-size RNN in sockeye-recipes
zh-en | 2018-11-19 | SockeyeNMT tm1 | Kevin Duh (JHU) | 16.63 | 6-layer transformer in sockeye-recipes
zh-en | 2018-11-19 | SockeyeNMT rm1 | Kevin Duh (JHU) | 15.83 | 2-layer mid-size RNN in sockeye-recipes

Translation from English (en->xx)

TODO (set up leaderboard for en->xx)

Related Resources and Reference

We gratefully acknowledge WIT3, which provides ready-to-use versions of the TED data for research purposes. For a detailed description of WIT3, see:

Mauro Cettolo, Christian Girardi, and Marcello Federico. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of EAMT 2012.

You may also be interested in a related dataset from Ye et al. (NAACL 2018), which packages TED Talks in even more languages. The main difference is that its dev/test sets are not multi-way parallel as they are here; they differ for each language.

If you would like to cite this task:

@misc{duh18multitarget,
  author = {Kevin Duh},
  title = {The Multitarget TED Talks Task},
  howpublished = {\url{http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/}},
  year = {2018},
}
