Europarl Parallel Corpus / [no name] baa188b6-1618-4d7a-bfda-917aa350649b

Last updated
Unknown
Format
Unknown
License
Other (Open) [Open Data]
Quality
[unrecognised content type]
From the dataset:

Description

Overview from home page:

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish, and slightly smaller corpora for 10 other languages (Bulgarian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, and Slovene).

The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence aligned the data using a tool based on the Church and Gale algorithm.
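To illustrate the alignment step described above, here is a minimal sketch of length-based sentence alignment in the style of Gale and Church (1993). It is not the aligner shipped with the corpus. The function names (`match_cost`, `align`) are hypothetical, the constants are the values reported in that paper, and the variance term uses the mean of the two lengths (a common variant) so that insertions and deletions do not divide by zero.

```python
import math

# Assumed constants, as reported by Gale and Church (1993):
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the length difference per character

# Prior probability of each bead type: (source sentences, target sentences).
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}


def norm_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def match_cost(len_src, len_tgt, prior):
    """Negative log probability of pairing len_src with len_tgt characters."""
    mean = (len_src + len_tgt / C) / 2.0
    delta = 0.0 if mean == 0 else (len_tgt - len_src * C) / math.sqrt(mean * S2)
    p_delta = 2.0 * (1.0 - norm_cdf(abs(delta)))
    return -math.log(max(p_delta, 1e-300)) - math.log(prior)


def align(src, tgt):
    """Return a list of (source sentences, target sentences) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Dynamic programming over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            for (di, dj), prior in PRIORS.items():
                if i < di or j < dj or cost[i - di][j - dj] == inf:
                    continue
                len_src = sum(len(s) for s in src[i - di:i])
                len_tgt = sum(len(t) for t in tgt[j - dj:j])
                c = cost[i - di][j - dj] + match_cost(len_src, len_tgt, prior)
                if c < cost[i][j]:
                    cost[i][j] = c
                    back[i][j] = (di, dj)
    # Walk the back-pointers from the end to recover the cheapest path.
    beads = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

Calling `align(english_sentences, french_sentences)` on the sentences of one matched document returns beads in document order. Each bead pairs zero, one, or two sentences on each side.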

...

On 4 February 2011 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.

Openness: OPEN

No explicit license is given, but the corpus appears to be open (possibly public domain) according to the terms of use section, which says:

We are not aware of any copyright restrictions of the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk. Please let us know if you find problems with the data or if you want the data for other language pairs. We recommend using the last quarter of 2000 for testing (2000-10 until 2000-12) for consistency in reporting research results on this data.
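The recommended test split can be sketched as a filter over document IDs. This assumes that each document ID or file name embeds the session date in the form `ep-YY-MM-DD` (for example `ep-00-10-02.txt`). That naming scheme is an assumption here and should be checked against the release you download. The helper names (`doc_date`, `split_docs`) are hypothetical.

```python
import re
from datetime import date

# Assumed ID layout: "ep-YY-MM-DD", optionally followed by a suffix.
ID_PATTERN = re.compile(r"ep-(\d{2})-(\d{2})-(\d{2})")

TEST_START = date(2000, 10, 1)
TEST_END = date(2000, 12, 31)


def doc_date(doc_id):
    """Parse the session date from a document ID, or return None."""
    match = ID_PATTERN.search(doc_id)
    if match is None:
        return None
    yy, mm, dd = (int(part) for part in match.groups())
    # Two-digit years: treat 90-99 as 19xx, everything else as 20xx.
    year = 1900 + yy if yy >= 90 else 2000 + yy
    try:
        return date(year, mm, dd)
    except ValueError:
        return None


def split_docs(doc_ids):
    """Split IDs into (train, test); test is the last quarter of 2000."""
    train, test = [], []
    for doc_id in doc_ids:
        d = doc_date(doc_id)
        if d is not None and TEST_START <= d <= TEST_END:
            test.append(doc_id)
        else:
            train.append(doc_id)
    return train, test
```

IDs that do not match the assumed pattern fall into the training list, so it is worth inspecting that list for unexpected names before use.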

Additional information

Field Value
cache_last_updated
cache_url
created
datastore_active False
format
hash
id baa188b6-1618-4d7a-bfda-917aa350649b
last_modified
mimetype
mimetype_inner
name
position 0
resource_group_id 5a7dd4ac-7e6b-258c-1a20-b05102270ebf
resource_type
revision_id 01129dc3-7a20-4677-9c4c-a72afd3b06f9
revision_timestamp 2011-01-18T15:54:13.631611
size
state active
tracking_summary totalrecent
url http://www.statmt.org/europarl/v3/
webstore_last_updated
webstore_url