TED CLDC Corpus

For our 2014 ACL paper on multilingual distributed representations, we developed a corpus for cross-lingual document classification (CLDC) that covers a large number of languages in a multi-label setting. The corpus is based on the 2013 WIT3 edition of TED talk transcripts, which is available here.

Format

Our CLDC corpus can easily be re-created from the original WIT3 data. We used the following keywords as our classifier labels: technology, culture, science, global issues, design, business, entertainment, arts, politics, education, art, health, creativity, economics, biology.
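As an illustration of the multi-label setting, the sketch below turns the keyword list of a single talk into one binary flag per label. The function name `binary_labels` and the representation of a talk's keywords as a list of strings are assumptions made for this example, not part of the released corpus.

```python
# The 15 keywords used as classifier labels (taken from the list above).
LABELS = [
    "technology", "culture", "science", "global issues", "design",
    "business", "entertainment", "arts", "politics", "education",
    "art", "health", "creativity", "economics", "biology",
]


def binary_labels(talk_keywords):
    """Map a talk's keywords to a 0/1 flag for each corpus label.

    `talk_keywords` is assumed to be a list of keyword strings;
    the real WIT3 metadata format may differ.
    """
    present = {k.strip().lower() for k in talk_keywords}
    return {label: int(label in present) for label in LABELS}


# Hypothetical talk tagged with two of the labels and one other keyword.
flags = binary_labels(["Technology", "design", "robots"])
```

Each label can then be treated as its own positive/negative classification problem.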

The WIT3 data already provides development data. We split the WIT3 training data into new training and test sets by treating all talks with a talk-id >= 1500 as test data. To keep the datasets separate and clean, we do not use the WIT3 test data in this task.
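The split rule can be sketched as follows. This is a minimal example that assumes each talk is a dictionary with an integer `talk_id` field; that field name and the helper `split_train_test` are hypothetical and only serve the illustration.

```python
TEST_ID_THRESHOLD = 1500  # talks with talk-id >= 1500 become test data


def split_train_test(talks):
    """Split WIT3 training talks into new training and test sets by talk-id."""
    train = [t for t in talks if t["talk_id"] < TEST_ID_THRESHOLD]
    test = [t for t in talks if t["talk_id"] >= TEST_ID_THRESHOLD]
    return train, test


# Hypothetical talks; only the id matters for the split.
talks = [{"talk_id": 12}, {"talk_id": 1499}, {"talk_id": 1500}, {"talk_id": 1720}]
train, test = split_train_test(talks)
```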

In our paper, only the body of each talk was used during training and testing; all meta-information such as author names, talk title, date and location was discarded. We further pre-processed the data by lowercasing all text and replacing unique tokens with an unknown type “UNK”. Finally, to avoid lexical clashes between languages, all words were suffixed according to their language.
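The pre-processing steps might look roughly like the sketch below. It rests on several assumptions that the description above does not confirm: whitespace tokenization, "unique tokens" read as tokens that occur exactly once in the given documents, a suffix of the form `_en`, and the suffix also being applied to the UNK token. The function name `preprocess` is hypothetical.

```python
from collections import Counter


def preprocess(documents, lang, unk="UNK"):
    """Lowercase, replace singleton tokens with UNK, add a language suffix.

    Assumptions: whitespace tokenization, "unique" = occurs exactly once,
    and a "_<lang>" suffix format. The released corpus may differ.
    """
    tokenized = [doc.lower().split() for doc in documents]
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [
        [f"{tok if counts[tok] > 1 else unk}_{lang}" for tok in doc]
        for doc in tokenized
    ]


processed = preprocess(["The cat sat", "the dog sat"], "en")
```

In this toy input, "cat" and "dog" each occur once and are mapped to the unknown type, while "the" and "sat" are kept and suffixed.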

Download

A pre-processed version of the corpus is available for download here:

Download

The directory structure is designed to make classification evaluation simple. For instance, to evaluate on the keyword arts in the Arabic-English language setting, you

Publications

When using this corpus, please cite the following paper:

For the original WIT3 corpus, please cite:

History