For our 2014 ACL paper on multilingual distributed representations, we developed a corpus for a cross-lingual document classification task across a large number of languages with a multi-label classification setting. This corpus is based on the 2013 WIT3 edition of TED talk transcripts, which is available here.
Our CLDC corpus can easily be re-created from the original WIT3 data. We used the following keywords as our classifier labels: technology, culture, science, global issues, design, business, entertainment, arts, politics, education, art, health, creativity, economics, biology.
The WIT3 data already provides development data. We split the WIT3 training data into a new training/test split by treating all talks with a talk-id >= 1500 as test data. To keep the datasets separate and clean, we do not use the WIT3 test data in this task.
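The split rule can be sketched in a few lines of Python. This is only an illustration: the representation of a talk as a dict with a `talk_id` key and the function name `split_by_talk_id` are assumptions made for this example, not part of the WIT3 distribution.

```python
def split_by_talk_id(talks, threshold=1500):
    """Split talks into (train, test) using the talk-id rule described above.

    `talks` is assumed to be an iterable of dicts with an integer "talk_id"
    key (a hypothetical representation chosen for this sketch).
    """
    talks = list(talks)
    # Talks with talk-id >= threshold become test data; the rest stay in training.
    train = [t for t in talks if t["talk_id"] < threshold]
    test = [t for t in talks if t["talk_id"] >= threshold]
    return train, test


if __name__ == "__main__":
    example = [{"talk_id": 12}, {"talk_id": 1499}, {"talk_id": 1500}, {"talk_id": 1731}]
    train, test = split_by_talk_id(example)
    print(len(train), "training talks,", len(test), "test talks")
```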
In our paper, only the body of each talk was used during training and testing; all meta-information such as author names, talk title, date and location was discarded. We further pre-processed the data by lowercasing all text and replacing unique tokens with an unknown type “UNK”. Finally, to avoid lexical clashes, all words were suffixed according to their language.
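A minimal sketch of these pre-processing steps follows. Several details are assumptions made for illustration: whitespace tokenisation, reading "unique tokens" as tokens that occur exactly once in the given documents, the `_<lang>` suffix format, and suffixing the UNK token as well. The released corpus may differ on any of these points.

```python
from collections import Counter


def preprocess(documents, lang, unk="UNK"):
    """Lowercase, replace singleton tokens with `unk`, and add a language suffix.

    Assumptions (not confirmed by the corpus description): tokens are split
    on whitespace, a "unique" token is one occurring exactly once across
    `documents`, and the suffix has the form "_<lang>".
    """
    # Step 1: lowercase and tokenise on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Step 2: count tokens over all documents to find singletons.
    counts = Counter(tok for doc in tokenized for tok in doc)
    # Step 3: map singletons to the unknown type, then suffix every token.
    return [
        [f"{tok if counts[tok] > 1 else unk}_{lang}" for tok in doc]
        for doc in tokenized
    ]


if __name__ == "__main__":
    for doc in preprocess(["The cat sat", "the dog sat"], "en"):
        print(" ".join(doc))
```

With the suffix in place, a word form shared by two languages is treated as two distinct vocabulary entries, one per language.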
A pre-processed version of the corpus is available for download here:
The directory structure is designed to make the classification evaluation simple. For instance, to evaluate on the keyword arts in the Arabic-English language setting, you
- train a classifier on the data in ted-cldc/ar-en/train/arts/[positive|negative]
- evaluate that classifier on ted-cldc/en-ar/test/arts/[positive|negative]
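The path pattern in the two steps above can be generated programmatically. The sketch below only mirrors the Arabic-English example: the helper name `cldc_dirs` and its parameters are hypothetical, and it assumes that every language pair and keyword follows the same layout, with the language order reversed for the test directory as in the example.

```python
from pathlib import PurePosixPath


def cldc_dirs(first, second, keyword, root="ted-cldc"):
    """Return train/test directories for one language pair and keyword.

    Follows the pattern of the example above: training data under
    <root>/<first>-<second>/train/<keyword>/ and test data under
    <root>/<second>-<first>/test/<keyword>/, each with "positive" and
    "negative" subdirectories.
    """
    base = PurePosixPath(root)
    train = base / f"{first}-{second}" / "train" / keyword
    test = base / f"{second}-{first}" / "test" / keyword
    labels = ("positive", "negative")
    return {
        "train": {label: str(train / label) for label in labels},
        "test": {label: str(test / label) for label in labels},
    }


if __name__ == "__main__":
    dirs = cldc_dirs("ar", "en", "arts")
    print(dirs["train"]["positive"])
    print(dirs["test"]["negative"])
```

Looping such a helper over the keywords and language pairs yields one binary classification problem per combination.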
When using this corpus, please cite the following paper:
- Multilingual Models for Compositional Distributional Semantics. Karl Moritz Hermann, Phil Blunsom. In Proceedings of ACL. bibtex
For the original WIT3 corpus, please cite:
- WIT$^3$: Web Inventory of Transcribed and Translated Talks. Mauro Cettolo, Christian Girardi, Marcello Federico. In Proceedings of the 16$^{th}$ Conference of the European Association for Machine Translation (EAMT). pp. 261–268. Trento, Italy. bibtex
- 15 September 2014: Updated version. Thanks to Geert Heyman, KU Leuven, for pointing out an issue with the data originally uploaded.
- 1 July 2014: First version of corpus released.