Core Data: Essential Datasets for Data Wranglers and Data Scientists

rufuspollock
mikanebu
anuveyatsu

The "Core Data" project provides essential data for the data wranglers and data science community. Its online home is on the DataHub:

https://datahub.io/docs

https://datahub.io/docs/core-data

This post introduces you to the Core Data, presents a couple of examples and shows you how you can access and use core data easily from your own tools and systems including R, Python, Pandas and more.

Why Core Data

If you build data driven applications or data driven insight you regularly find yourself wanting common "core" data, things like lists of countries, populations, geographic boundaries and more.

However, finding good quality data has always been challenging. Professionals can spend lots of time finding and preparing data before they get to do any real work analysing or presenting it.

To address this, a few years ago we started the "core data" project as part of the Frictionless Data initiative. Its purpose was to curate important, commonly used datasets including reference data like country codes, indicators like population and GDP, and geodata like country boundaries. It provides them in a high-quality, easy-to-use, and standard form.

Recently the Core Data project has got even better with a new home on the newly upgraded DataHub and has expanded thanks to new partners like Datopian and John Snow Labs (more on this in a future post!).

Examples

There are dozens of core datasets already available and many more being worked on, including a list of countries and their 2 digit codes, and a more extensive version.

List of Countries

Ever needed to build a drop-down list of countries in a web application? Or ever needed to add country name labels for a graph and only had country codes?

Then these datasets are for you!

First up is the very simple "country-list" dataset:

https://datahub.io/core/country-list

You can see a preview table for the dataset on the showcase page:

You can download it in either CSV or JSON formats:

Country Codes

Maybe the simple list of countries is not enough for you. Perhaps you need phone codes for each country, or want to know their currencies?

We've got you covered with the more extensive country codes dataset:

https://datahub.io/core/country-codes

All the countries from Country List including number of associated codes - ISO 3166 codes, ITU dialing codes, ISO 4217 currency codes, and many others. This dataset includes 26 different codes and associated information.

You can also preview the data and download in different formats just like it is described for Country List dataset above:

Population

This is another useful dataset for people: you regularly need population in order to do normalisations and calculate per capita figures as part of a statistical analysis.

This dataset includes population figures for countries, regions (e.g. Asia) and the world. Data comes originally from World Bank and has been converted into standard tabular data package with CSV data and a table schema:

https://datahub.io/core/population

Preview the data on the showcase page:

Get the data in CSV or JSON formats just like for any other Core Datasets:

Use Core Data from your favorite language or tool

We have made Core Data easy-to-use from various programming languages and tools. We will walk through using our Country List example. But you can apply these instructions to any other Core Data in the DataHub.

CSV and JSON

If you just need to get data, you have a direct link usable from any tool or app e.g. for the country list:

For more read our "Getting Data" tutorial:

cURL

Following commands help you to get the data using "cURL" tool. Use -L flag so "cURL" follows redirects:

# Get the data:
curl -L https://datahub.io/core/country-list/r/data.csv

# datapackage.json provides metadata and a list of all data files
curl -L https://datahub.io/core/country-list/datapackage.json

# See just the available data files (resources):
curl -L https://datahub.io/core/country-list/datapackage.json | jq ".resources"

R

If you are using R here's how to get the data you want quickly loaded:

install.packages("jsonlite")
library("jsonlite")

json_file <- 'https://datahub.io/core/country-list/datapackage.json'
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

# get list of all resources:
print(json_data$resources$name)

# print all tabular data(if exists any)
for(i in 1:length(json_data$resources$datahub$type)){
  if(json_data$resources$datahub$type[i]=='derived/csv'){
    path_to_file = json_data$resources$path[i]
    data <- read.csv(url(path_to_file))
    print(data)
  }
}

Python

Here we take a look at how to get Country List in Python programming language:

For Python, first install the datapackage library (all the datasets on DataHub are Data Packages):

pip install datapackage

Again, we'll use the country-list dataset:

from datapackage import Package

package = Package('https://datahub.io/core/country-list/datapackage.json')

# get list of all resources:
resources = package.descriptor['resources']
resourceList = [resources[x]['name'] for x in range(0, len(resources))]
print(resourceList)

# print all tabular data(if exists any)
resources = package.resources
for resource in resources:
    if resource.tabular:
        print(resource.read())

Pandas

In order to work with Data Packages in Pandas you need to install the Frictionless Data data package library and the pandas:

pip install datapackage
pip install pandas

To get the data run following code:

import datapackage
import pandas as pd

data_url = 'https://datahub.io/core/country-list/datapackage.json'

# to load Data Package into storage
package = datapackage.Package(data_url)

# to load only tabular data
resources = package.resources
for resource in resources:
    if resource.tabular:
        data = pd.read_csv(resource.descriptor['path'])
        print (data)

JavaScript and many more

We also have support for JavaScript, SQL, and PHP. See our "Getting Data" tutorial for more: https://datahub.io/docs/getting-started/getting-data

Conclusion

This post has shown how you can import datasets in a high quality, standard form quickly and easily.

There are many more datasets to explore than the three we showed you here. You can find a full list here:

https://datahub.io/docs/core-data

Finally, we would love collaborators to help us curate even more core datasets. If you're interested you can find out more about the Core Data Curator program here:

https://datahub.io/docs/core-data/curators

Built with DataHub LogoDataHub Cloud