Registry of Core Datasets

Files: 2 · Size: 51kB · Formats: csv, zip · Last updated: 5 days ago

Data Files

core-list [csv], 10kB (also available as JSON, 24kB)
datapackage_zip [zip], 6kB: compressed version of the dataset; includes normalized CSV and JSON data together with the original data and datapackage.json

core-list

Field information

Field Name                  Order  Type (Format)  Description
name                        1      string         Name of the dataset
github_url                  2      string         Location of the dataset on GitHub
run_date                    3      string         Date of the last processing run
modified                    4      string         Frequency information (year-A, quarter-Q, month-M, day-D, no-N)
validated_metadata          5      string         Metadata validation status
validated_data              6      string         Data validation status
published                   7      string         Published location on DataHub
ok_on_datahub               8      string         Status on DataHub
validated_metadata_message  9      string         Error messages if metadata validation fails
validated_data_message      10     string         Error messages if data validation fails
auto_publish                11     string         Whether the dataset is published by DataHub automatically
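
As a rough illustration, the fields above can be read straight from core-list.csv with Python's standard csv module. This is a minimal sketch; the raw URL below is an assumption, so substitute your local copy or wherever you downloaded the file:

import csv
import io
import urllib.request

# Hypothetical raw URL for core-list.csv; adjust to your own copy.
URL = "https://raw.githubusercontent.com/datasets/registry/master/core-list.csv"

with urllib.request.urlopen(URL) as response:
    reader = csv.DictReader(io.TextIOWrapper(response, encoding="utf-8"))
    for row in reader:
        # Each row carries the fields described in the table above.
        print(row["name"], row["github_url"], row["validated_data"])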

datapackage_zip  

This is a preview version. There might be more data in the original version.

Read me

Core data registry and tooling.

Registry

The registry is maintained as a Tabular Data Package, with the list of datasets in core-list.csv.

To add a dataset, add it to core-list.csv; we recommend the fork-and-pull workflow.

Discussion of proposals for new datasets and for incorporation of prepared datasets takes place in the issues.

To propose a new dataset for inclusion, please create a new issue.

Core Dataset Tools

Installation

$ npm install

Usage

  • Environment variables:

    DOMAIN - testing or production environment. For example: https://datahub.io
    TYPE - type of dataset. For example: examples or core

node index.js [COMMAND] [PATH]

# PATH - path to the CSV file
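
For example, assuming the values above, a run against the production site could be invoked like this (a hypothetical invocation; adjust the path to your copy of core-list.csv):

DOMAIN=https://datahub.io TYPE=core node index.js check core-list.csv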

Clone datasets

To clone all core datasets, run the following command:

node index.js clone [PATH]

It will clone each core dataset into the directory data/${pkg_name}.

Check datasets

To check all core datasets, run the following command:

node index.js check [PATH]

It will validate the metadata and data of each dataset against the latest spec.
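
For a rough sense of what such validation covers, here is a minimal sketch using the Python datapackage library (the same one used in the import examples further down). It illustrates descriptor and data checks in general, not this tool's actual implementation:

from datapackage import Package

# Load the registry Data Package; `valid` and `errors` report on
# descriptor (metadata) validation.
package = Package('http://datahub.io/core/registry/datapackage.json')

if package.valid:
    print('metadata OK')
else:
    for error in package.errors:
        print('metadata error:', error)

# Reading a resource casts every row against its schema, so bad data
# surfaces as an exception here.
try:
    package.resources[0].read()
    print('data OK')
except Exception as error:
    print('data error:', error)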

Normalize datasets

To normalize all core datasets, run the following command:

node index.js norm [PATH]

It will normalize each core dataset into the directory data/${pkg_name}.

Push datasets

To publish all core data packages, run the following command:

node index.js push [PATH]

Running tests

We use AVA for our tests. To run them, use:

$ [sudo] npm test

To run tests in watch mode:

$ [sudo] npm run watch:test

Import into your tool

If you are using R, here's how to quickly load the data:

install.packages("jsonlite")
library("jsonlite")

json_file <- "http://datahub.io/core/registry/datapackage.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

# access the csv file by index, starting from 1
# (by default jsonlite simplifies the resources array to a data frame)
path_to_file <- json_data$resources$path[1]
data <- read.csv(url(path_to_file))
print(data)

In order to work with Data Packages in Pandas you need to install the Frictionless Data data package library and the pandas extension:

pip install datapackage
pip install jsontableschema-pandas

To get the data, run the following code:

import datapackage

data_url = "http://datahub.io/core/registry/datapackage.json"

# to load Data Package into storage
storage = datapackage.push_datapackage(data_url, 'pandas')

# data frames available (corresponding to data files in original dataset)
storage.buckets

# you can access datasets inside storage, e.g. the first one:
storage[storage.buckets[0]]

For Python, first install the `datapackage` library (all the datasets on DataHub are Data Packages):

pip install datapackage

To get the Data Package into your Python environment, run the following code:

from datapackage import Package

package = Package('http://datahub.io/core/registry/datapackage.json')

# get list of resources:
resources = package.descriptor['resources']
resourceList = [resource['name'] for resource in resources]
print(resourceList)

data = package.resources[0].read()
print(data)
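
If you'd rather have each row keyed by field name, the library's read method also accepts a keyed flag:

# rows as dicts keyed by the schema's field names
keyed_rows = package.resources[0].read(keyed=True)
print(keyed_rows[0])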

If you are using JavaScript, follow the instructions below:

Install data.js module using npm:

  $ npm install data.js

Once the package is installed, use the following code snippet:

const {Dataset} = require('data.js')

const path = 'http://datahub.io/core/registry/datapackage.json';

// We're using a self-invoking async function here so we can use async-await syntax.
// (Note the semicolon above: without it, the opening parenthesis would be parsed
// as a call on the previous line.)
(async () => {
  const dataset = await Dataset.load(path)

  // Get the first data file in this dataset
  const file = dataset.resources[0]
  // Get a raw stream
  const stream = await file.stream()
  // entire file as a buffer (be careful with large files!)
  const buffer = await file.buffer
})()

Install the datapackage library for Ruby using gem:

gem install datapackage

Now get the dataset and read the data:

require 'datapackage'

path = 'http://datahub.io/core/registry/datapackage.json'

package = DataPackage::Package.new(path)
# The package variable now contains the metadata. You can print it:
puts package

# Read data itself:
resource = package.resources[0]
data = resource.read
puts data