This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
File sizes: approx. 24 GB compressed (gzip'ed) text files
| Statistic | Count |
|---|---|
| Number of tokens | 1,024,908,267,229 |
| Number of sentences | 95,119,665,584 |
| Number of unigrams | 13,588,391 |
| Number of bigrams | 314,843,401 |
| Number of trigrams | 977,069,902 |
| Number of fourgrams | 1,313,818,354 |
| Number of fivegrams | 1,176,470,663 |
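To illustrate how frequency counts like these can feed a statistical language model, here is a minimal Python sketch that reads gzip'ed count files and computes a maximum-likelihood bigram probability. It assumes each line has the form `w1 w2 ... wn<TAB>count`, which is not stated on this page and should be checked against the dataset's own documentation. The sample counts are made up for illustration, and the helper names `read_ngram_counts`, `load_counts`, and `conditional_probability` are hypothetical.

```python
import gzip
import io


def read_ngram_counts(stream):
    """Yield (words, count) pairs from lines shaped like 'w1 w2 ... wn<TAB>count'.

    The tab-separated layout is an assumption, not taken from this page.
    """
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, _, count = line.rpartition("\t")
        yield tuple(ngram.split(" ")), int(count)


def load_counts(gz_bytes):
    """Decompress gzip'ed bytes and return a dict mapping n-gram tuples to counts."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as f:
        return dict(read_ngram_counts(f))


def conditional_probability(bigram_counts, unigram_counts, history, word):
    """Maximum-likelihood estimate: P(word | history) = c(history, word) / c(history)."""
    denominator = unigram_counts.get((history,), 0)
    if denominator == 0:
        return 0.0
    return bigram_counts.get((history, word), 0) / denominator


# Toy data with invented counts, compressed in memory to stand in for real files.
unigram_bytes = gzip.compress("the\t100\ncat\t10\n".encode("utf-8"))
bigram_bytes = gzip.compress("the cat\t5\nthe dog\t3\n".encode("utf-8"))

unigram_counts = load_counts(unigram_bytes)
bigram_counts = load_counts(bigram_bytes)

p = conditional_probability(bigram_counts, unigram_counts, "the", "cat")
print(p)
```

With the real files, the in-memory bytes would be replaced by the file paths passed to `gzip.open`. At roughly 24 GB compressed, loading every count into a dictionary is unlikely to fit in memory, so a practical setup would stream the files or use an on-disk index instead. A usable model would also add smoothing or backoff rather than rely on the raw estimate shown here.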
Additional information
| Field | Value |
|---|---|
| Source | http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 |
| Author | Thorsten Brants, Alex Franz |
| Maintainer | Linguistic Data Consortium, Philadelphia |
| Version | 1 |
Cite this
Web 1T 5-gram Version 1. Thorsten Brants, Alex Franz.
Retrieved 00:13, May 23, 2013 (UTC).
the Data Hub