Ci sono 21 dataset taggati con nlp:
-
Ngrams and code from Dr. Peter Norvig's chapter for Beautiful Data (2009), edited by Segaran and Hammerbacher. Data files are derived from the Google Web Trillion Word Corpus, as...
-
This is a publicly available, tokenized version of the Reuters RCV1 corpus by David D Lewis et al. The creator requests attribution.
-
Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 (Release date 2005-05-31, Format version 1, correction level 0) This is distributed on one CD and contains over...
-
DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through...
-
RCV1 is a dataset of 810,000 documents (2.5GB uncompressed), which is available by request from the NIST. Those documents are distributed by CD. For derivative data that is publicly...
-
The library is a collection of machine-readable texts and metadata, especially relating to New Zealand and the Asia/Pacific Region. From the website: [The library] provides several...
-
About From website: WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical...
-
Wikisource is a repository of English language text. As of October 2011, it contains over 240,000 pages. From the website Wikisource is an online library of free content publications...
-
About From website: The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with...
-
This data includes facts extracted from 500 million web pages. From the project's website: To build a never-ending machine learning system that acquires the ability to extract...
-
The Ontologies of Linguistic Annotations (OLiA) provide an OWL/DL taxonomy of data categories as a reference for linguistic annotation (OLiA Reference Model), plus OWL/DL models for a...
-
GeoWordNet is a semantic resource built from the full integration of WordNet, GeoNames and the Italian part of MultiWordNet. GeoWordNet Public Dataset contains 3,698,238 entities,...
-
Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated...
-
This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares dictionaries...
-
Microsoft has developed services on the basis of ngrams from all of Bing's en_US corpus. The raw public data available include two files with the top 100k words from this corpus. The...
-
The Ontos News Portal extracts facts (objects as e. g. persons or organizations as well as relations between them, e. g. a person is working for an organization or living at a location)....
-
A set of documents from Reuters' 1986 newswire which have been classified. This dataset is appropriate for testing natural language processing and information retrieval algorithms....
-
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were...
-
OpenCalais is a web service that extracts semantic metadata from text content, such as web pages.
-
This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams....
-
Presents a standard conversion of Princeton WordNet to RDF/OWL. It describes how it was converted and gives examples of how it may be queried for use in Semantic Web applications....