Tag: nlp

21 packages are registered with the tag nlp:

  • Beautiful Data Natural Language Corpus and Code
    • 35 views
    • None (no open license)
    Ngrams and code from Dr. Peter Norvig's chapter for Beautiful Data (2009), edited by Segaran and Hammerbacher. Data files are derived from the Google Web Trillion Word Corpus, as...
  • This is a publicly available, tokenized version of the Reuters RCV1 corpus by David D Lewis et al. The creator requests attribution.
  • Reuters Corpus Volume 2 (RCV2)
    • 779 views
    • None (no open license)
    Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 (Release date 2005-05-31, Format version 1, correction level 0) This is distributed on one CD and contains over...
  • DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through...
  • Reuters Corpus Volume 1 (RCV1)
    • 133 views
    • None (no open license)
    RCV1 is a dataset of 810,000 documents (2.5GB uncompressed), which is available on request from NIST. The documents are distributed on CD. For derivative data that is publicly...
  • New Zealand Digital Library
    • 11 views
    • None (no open license)
    The library is a collection of machine-readable texts and metadata, especially relating to New Zealand and the Asia/Pacific Region. From the website: [The library] provides several...
  • From the website: WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical...
  • Wikisource
    • 31 views
    • [Open Data]
    Wikisource is a repository of English-language text. As of October 2011, it contains over 240,000 pages. From the website: Wikisource is an online library of free content publications...
  • The New York Times Annotated Corpus
    • 137 views
    • None (no open license)
    From the website: The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with...
  • Read the Web
    • 29 views
    • None (no open license)
    This data includes facts extracted from 500 million web pages. From the project's website: To build a never-ending machine learning system that acquires the ability to extract...
  • The Ontologies of Linguistic Annotations (OLiA) provide an OWL/DL taxonomy of data categories as a reference for linguistic annotation (OLiA Reference Model), plus OWL/DL models for a...
  • GeoWordNet is a semantic resource built from the full integration of WordNet, GeoNames and the Italian part of MultiWordNet. GeoWordNet Public Dataset contains 3,698,238 entities,...
  • Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated...
  • This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares dictionaries...
  • Microsoft Web N-Gram Service
    • 29 views
    • None (no open license)
    Microsoft has developed services on the basis of ngrams from all of Bing's en_US corpus. The raw public data available include two files with the top 100k words from this corpus. The...
  • The Ontos News Portal extracts facts (objects such as persons or organizations, as well as relations between them, e.g. a person working for an organization or living at a location)....
  • Reuters-21578
    • 33 views
    • None (no open license)
    A set of documents from Reuters' 1986 newswire which have been classified. This dataset is appropriate for testing natural language processing and information retrieval algorithms....
  • The ClueWeb09 Dataset
    • 41 views
    • None (no open license)
    The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were...
  • OpenCalais
    • 1068 views
    • None (no open license)
    OpenCalais is a web service that extracts semantic metadata from text content, such as web pages.
  • Web 1T 5-gram Version 1
    • 189 views
    • None (no open license)
    This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams....
  • WordNet 2.0 (W3C)
    • 485 views
    • None (no open license)
    Presents a standard conversion of Princeton WordNet to RDF/OWL. It describes how it was converted and gives examples of how it may be queried for use in Semantic Web applications....