Tag: nlp

There are 21 datasets tagged with nlp:

  • Ngrams and code from Dr. Peter Norvig's chapter for Beautiful Data (2009), edited by Segaran and Hammerbacher. Data files are derived from the Google Web Trillion Word Corpus, as...
  • This is a publicly available, tokenized version of the Reuters RCV1 corpus by David D. Lewis et al. The creator requests attribution.
  • Reuters Corpus Volume 2 (RCV2)
    • 779 views
    • None (non-free license)
    Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 (Release date 2005-05-31, Format version 1, correction level 0) This is distributed on one CD and contains over...
  • DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through...
  • Reuters Corpus Volume 1 (RCV1)
    • 133 views
    • None (non-free license)
    RCV1 is a dataset of 810,000 documents (2.5GB uncompressed), available on request from NIST. The documents are distributed on CD. For derivative data that is publicly...
  • New Zealand Digital Library
    • 11 views
    • None (non-free license)
    The library is a collection of machine-readable texts and metadata, especially relating to New Zealand and the Asia/Pacific Region. From the website: [The library] provides several...
  • From the website: WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical...
  • Wikisource
    • 31 views
    • [Open Data]
    Wikisource is a repository of English-language text. As of October 2011, it contains over 240,000 pages. From the website: Wikisource is an online library of free content publications...
  • The New York Times Annotated Corpus
    • 137 views
    • None (non-free license)
    From the website: The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with...
  • Read the Web
    • 29 views
    • None (non-free license)
    This data includes facts extracted from 500 million web pages. From the project's website: To build a never-ending machine learning system that acquires the ability to extract...
  • The Ontologies of Linguistic Annotations (OLiA) provide an OWL/DL taxonomy of data categories as a reference for linguistic annotation (OLiA Reference Model), plus OWL/DL models for a...
  • GeoWordNet is a semantic resource built from the full integration of WordNet, GeoNames and the Italian part of MultiWordNet. GeoWordNet Public Dataset contains 3,698,238 entities,...
  • Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated...
  • This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares dictionaries...
  • Microsoft Web N-Gram Service
    • 29 views
    • None (non-free license)
    Microsoft has developed services on the basis of ngrams from all of Bing's en_US corpus. The raw public data available include two files with the top 100k words from this corpus. The...
  • The Ontos News Portal extracts facts (objects such as persons or organizations, as well as relations between them, e.g. a person working for an organization or living at a location)....
  • Reuters-21578
    • 33 views
    • None (non-free license)
    A set of documents from the 1987 Reuters newswire that have been categorized by topic. This dataset is appropriate for testing natural language processing and information retrieval algorithms....
  • The ClueWeb09 Dataset
    • 41 views
    • None (non-free license)
    The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were...
  • OpenCalais is a web service that extracts semantic metadata from text content, such as web pages.
  • Web 1T 5-gram Version 1
    • 189 views
    • None (non-free license)
    This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams....
  • Presents a standard conversion of Princeton WordNet to RDF/OWL. It describes how it was converted and gives examples of how it may be queried for use in Semantic Web applications....