Datahub metrics

datahq

Files: 4
Size: 162kB
Format: csv, zip
Created: 11 months ago
Updated: 11 months ago
License: Open Data Commons Public Domain Dedication and License

Data Files

Download files in this dataset

  • biweekly_stats: 1kB; download: csv (1kB), json (5kB)
  • daily_stats: 23kB; download: csv (23kB), json (214kB)
  • weekly_stats: 1kB; download: csv (1kB), json (10kB)
  • datahub-metrics_zip: 41kB; download: zip (41kB). Compressed version of the dataset, including normalized CSV and JSON data with the original data and datapackage.json.

biweekly_stats  


Field information

Field Name Order Type (Format) Description
Date 1 date (%Y-%m-%d)
Total Unique Visitors 2 integer (default)
Total Users 3 any (default)
Total new users 4 integer (default)
Downloads CLI (npm) 5 integer (default)
Downloads CLI (GA) 6 integer (default)
cli-windows 7 integer (default)
cli-linux 8 integer (default)
cli-macos 9 integer (default)
Number of (new = last 2w) users who publish any dataset 10 integer (default)
How many of these push more than one dataset? 11 integer (default)
Number of first runs of 'data' 12 integer (default)
help 13 integer (default)
noArgs 14 any (default)
validate 15 integer (default)
push 16 integer (default)
get 17 integer (default)
cat 18 integer (default)
info 19 integer (default)
init 20 integer (default)
login 21 integer (default)
Site traffic (daily average) 22 number (default)
Total published (public) datasets 23 integer (default)
Total number of new datasets in last 2w 24 integer (default)
Number of pushes (daily average) 25 number (default)
Number of members on datahubio chat on gitter 26 integer (default)
Number of data requests (daily average) 27 any (default)
Number of unique visits 28 integer (default)

daily_stats  


Field information

Field Name Order Type (Format) Description
Date 1 date (%Y-%m-%d)
Total Users 2 integer (default)
Total new users 3 integer (default)
Published datasets (metastore) 4 integer (default)
Published datasets (DB) 5 integer (default)
Unlisted datasets 6 string (default)
Unlisted datasets (extracting our datasets) 7 string (default)
Private datasets 8 string (default)
Private datasets (extracting our datasets) 9 string (default)
Total datasets 10 integer (default)
Number of pushes 11 integer (default)
Number of pushes (excluding us) 12 string (default)
Speed of a 1Mb packaged dataset push (in seconds) 13 string (default)
Speed of a 5kb packaged dataset push (in seconds) 14 string (default)
Clicks on download link (csv + json + zip) 15 integer (default)
cli-macos 16 string (default)
cli-linux 17 string (default)
cli-windows 18 string (default)
data-desktop 19 string (default)
Total downloads 20 string (default)
Download page (unique pageviews) 21 string (default)
sign-in-pricing-page 22 string (default)
sign-up-pricing-page 23 string (default)
contact-us-pricing-page 24 string (default)
Pricing page (unique pageviews) 25 string (default)
Site Traffic 26 integer (default)
Data stored (Published) 27 string (default)
Data growth 28 string (default)
Total number of data requests 29 string (default)
Comment 30 string (default)

weekly_stats  


Field information

Field Name Order Type (Format) Description
Date 1 date (%Y-%m-%d)
Site traffic weekly (measured every monday for the last week) 2 integer (default)
Number of members on datahubio chat on gitter (every monday) 3 string (default)
Total number of data requests per week 4 string (default)

Integrate this dataset into your favourite tool

Use our data-cli tool designed for data wranglers:

data get https://datahub.io/datahq/datahub-metrics
data info datahq/datahub-metrics
tree datahq/datahub-metrics
# Get a list of dataset's resources
curl -L -s https://datahub.io/datahq/datahub-metrics/datapackage.json | grep path

# Get resources
curl -L https://datahub.io/datahq/datahub-metrics/r/0.csv
curl -L https://datahub.io/datahq/datahub-metrics/r/1.csv
curl -L https://datahub.io/datahq/datahub-metrics/r/2.csv
curl -L https://datahub.io/datahq/datahub-metrics/r/3.zip

If you are using R, here's how to quickly load the data you want:

install.packages("jsonlite", repos="https://cran.rstudio.com/")
library("jsonlite")

json_file <- 'https://datahub.io/datahq/datahub-metrics/datapackage.json'
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

# get list of all resources:
print(json_data$resources$name)

# print all tabular data (if any exists)
for(i in 1:length(json_data$resources$datahub$type)){
  if(json_data$resources$datahub$type[i]=='derived/csv'){
    path_to_file = json_data$resources$path[i]
    data <- read.csv(url(path_to_file))
    print(data)
  }
}

Note: You might need to run the script with root permissions if you are running on a Linux machine.

Install the Frictionless Data `datapackage` library and pandas itself:

pip install datapackage
pip install pandas

Now you can use the Data Package in pandas:

import datapackage
import pandas as pd

data_url = 'https://datahub.io/datahq/datahub-metrics/datapackage.json'

# to load Data Package into storage
package = datapackage.Package(data_url)

# to load only tabular data
resources = package.resources
for resource in resources:
    if resource.tabular:
        data = pd.read_csv(resource.descriptor['path'])
        print (data)

For Python, first install the `datapackage` library (all the datasets on DataHub are Data Packages):

pip install datapackage

To get the Data Package into your Python environment, run the following code:

from datapackage import Package

package = Package('https://datahub.io/datahq/datahub-metrics/datapackage.json')

# print list of all resources:
print(package.resource_names)

# print processed tabular data (if any exists)
for resource in package.resources:
    if resource.descriptor['datahub']['type'] == 'derived/csv':
        print(resource.read())

If you are using JavaScript, please follow the instructions below:

Install the data.js module using npm:

  $ npm install data.js

Once the package is installed, use the following code snippet:

const {Dataset} = require('data.js')

const path = 'https://datahub.io/datahq/datahub-metrics/datapackage.json'

// We're using a self-invoking function here as we want to use async/await syntax:
;(async () => {
  const dataset = await Dataset.load(path)
  // get list of all resources:
  for (const id in dataset.resources) {
    console.log(dataset.resources[id]._descriptor.name)
  }
  // get all tabular data (if any exists)
  for (const id in dataset.resources) {
    if (dataset.resources[id]._descriptor.format === "csv") {
      const file = dataset.resources[id]
      // Get a raw stream
      const stream = await file.stream()
      // entire file as a buffer (be careful with large files!)
      const buffer = await file.buffer
      // print data
      stream.pipe(process.stdout)
    }
  }
})()

Read me

DataHub Metrics

This repo automates daily, weekly and biweekly stats collection for datahub.io. Stats are collected from a PostgreSQL database, the metastore API service, Google Analytics and Gitter. The script also uploads test CSV files via data-cli to measure data processing time on the website on a daily basis.
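
As a rough sketch of that timing step (the repo's requirements list pexpect; this simplified version uses subprocess instead, and the test file name is a placeholder):

import subprocess
import time

# Placeholder name: the real script pushes prepared test CSV files of known size.
TEST_FILE = "test_5kb.csv"

start = time.time()
# Time how long `data push` takes to upload the test file.
result = subprocess.run(
    ["data", "push", TEST_FILE],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
elapsed = time.time() - start

if result.returncode == 0:
    print("Pushed {} in {:.1f} seconds".format(TEST_FILE, elapsed))
else:
    print("Push failed:", result.stderr)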

Once collected, the stats are inserted into a Google spreadsheet. Stats collection is automated via Travis. There are three scripts that run automatically at specified times:

  • dailyStats.py runs every Tuesday, Wednesday, Thursday, Friday and Saturday at 00:05 UTC and collects daily stats for the previous day
  • weeklyStats.py runs every Monday at 00:05 UTC and collects stats for the previous week
  • biWeeklyStats.py runs every other Thursday at 00:05 UTC and collects stats for the previous 14 days, which is the duration of our sprint
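
For the spreadsheet step described above, here is a minimal sketch using gspread and oauth2client (both listed under Requirements below); the credentials file, spreadsheet and worksheet names, and the row values are placeholders, not the repo's actual configuration:

import datetime

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Placeholder credentials file and spreadsheet/worksheet names.
SCOPE = [
    "https://spreadsheets.google.com/feeds",
    "https://www.googleapis.com/auth/drive",
]
creds = ServiceAccountCredentials.from_json_keyfile_name("service-account.json", SCOPE)
client = gspread.authorize(creds)
worksheet = client.open("DataHub Metrics").worksheet("daily_stats")

# Append one collected row; the real values come from the database, metastore API,
# Google Analytics and Gitter queries.
yesterday = datetime.date.today() - datetime.timedelta(days=1)
row = [yesterday.strftime("%Y-%m-%d"), 1234, 5]  # Date, Total Users, Total new users (example values)
worksheet.append_row(row)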

Requirements

The scripts are written for Python 3.6+. Required modules:

  • pexpect
  • python-dotenv
  • psycopg2
  • httplib2
  • google-api-python-client
  • oauth2client
  • gspread

To install the requirements, run `pip install -r requirements.txt`.

License

Public Domain Dedication and License (PDDL)
