The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish, and slightly smaller corpora for 10 other languages (Bulgarian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, and Slovene).
The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence aligned the data using a tool based on the Church and Gale algorithm.
On 4 February 2011 we released a further expanded and improved version of the corpus. Previous versions are available here. The corpus is released as a source release with the document files and a sentence aligner, and parallel corpora of language pairs that include English.
We are not aware of any copyright restrictions of the material. If you use this data in your research, please contact email@example.com. Please let us know if you find problems with the data or if you want the data for other language pairs. We recommend using the last quarter of 2000 for testing (2000-10 until 2000-12) for consistency in reporting research results on this data.