Data Load Design

Key point: this is classic ETL so let's reuse those patterns and tooling.

Logic

User has CSV, XLS etc → (1. some steps) → Tabular Data Resource → (2. load) → DataStore
(Alternatively, go direct: load the raw file straight into the DataStore.)

In more detail, dividing E and T (extract and transform) from L (load):

Prepare (ET): Raw CSV → Tidy → Infer types → Tabular Data Resource (csv/json + Table Schema); see the Prepare sketch below.
Loader (L): DataStore Create (and DataStore Delete where needed) → Load to CKAN via DataStore API or direct copy.
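
To make the Prepare (ET) step concrete, here is a minimal sketch of type inference over a raw CSV using only the standard library. The function names and the "my-resource" name are illustrative; a real implementation would more likely use a Frictionless library for inference.

```python
import csv
import json

def infer_field_type(values):
    """Very rough type inference over a column's sampled values (illustration only)."""
    def all_cast(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False
    if all_cast(int):
        return "integer"
    if all_cast(float):
        return "number"
    return "string"

def prepare(raw_csv_path, sample_size=100):
    """ET step: read the raw CSV, infer types, emit a Tabular Data Resource descriptor."""
    with open(raw_csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        sample = [row for _, row in zip(range(sample_size), reader)]
    fields = []
    for i, name in enumerate(header):
        column = [row[i] for row in sample if i < len(row)]
        fields.append({"name": name, "type": infer_field_type(column)})
    return {
        "name": "my-resource",                 # hypothetical resource name
        "path": raw_csv_path,
        "profile": "tabular-data-resource",
        "schema": {"fields": fields},
    }

if __name__ == "__main__":
    print(json.dumps(prepare("data.csv"), indent=2))
```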

Load step in even more detail

Input: Tabular Data Resource on disk (created from a CSV in the FileStore of a resource)
1. Load the Tabular Data Resource metadata
2. Create the table in the DataStore if it does not exist
3. Clear the DataStore table if it has existing content
4. Load to the DataStore via PG COPY
Output: Data in the DataStore, with each step logging to the LogStore. (A sketch of steps 2–4 follows below.)
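
A minimal sketch of steps 2–4, assuming direct access to the DataStore Postgres database via psycopg2. The connection string, the table-named-after-the-resource convention and the type mapping are assumptions, and identifiers are interpolated naively here rather than safely quoted.

```python
import psycopg2  # assumes direct access to the DataStore Postgres database

def load_to_datastore(descriptor, csv_path, dsn="postgresql://ckan:pass@localhost/datastore"):
    """Create the table if needed, clear existing content, then bulk load via COPY."""
    table = descriptor["name"]                  # hypothetical: table named after the resource
    type_map = {"integer": "bigint", "number": "numeric", "string": "text"}
    columns = ", ".join(
        '"%s" %s' % (f["name"], type_map.get(f["type"], "text"))
        for f in descriptor["schema"]["fields"]
    )
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            # Create table in DS if not exists
            cur.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, columns))
            # Clear DS table if existing content
            cur.execute('TRUNCATE "%s"' % table)
            # Load to DS via PG COPY
            with open(csv_path, encoding="utf-8") as f:
                cur.copy_expert(
                    'COPY "%s" FROM STDIN WITH (FORMAT csv, HEADER true)' % table, f
                )
    finally:
        conn.close()
```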

Runner

We will use Airflow.
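
As a rough illustration of what the runner could look like (not the final DAG), here is a minimal Airflow 1.x-style DAG wiring the prepare and load steps together. The DAG id, task ids and the ckan_load module are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from ckan_load import prepare, load_to_datastore  # hypothetical module holding the sketches above

default_args = {"owner": "ckan", "start_date": datetime(2019, 9, 1)}

with DAG("ckan_data_load", default_args=default_args, schedule_interval=None) as dag:
    prepare_task = PythonOperator(
        task_id="prepare_tabular_data_resource",
        python_callable=lambda: prepare("data.csv"),
    )
    load_task = PythonOperator(
        task_id="load_to_datastore",
        python_callable=lambda: load_to_datastore(prepare("data.csv"), "data.csv"),
    )
    prepare_task >> load_task
```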

Research

What is a Tabular Data Resource?

See the Frictionless Specs. For our purposes: a Tabular Data Resource is the data file (e.g. a CSV) plus a descriptor carrying metadata such as encoding, dialect and a Table Schema (field names and types).

NB: even if you want to go the direct loading route (a la XLoader) and forget about types, you still need encoding etc. sorted – and it still fits in the diagram above (the Table Schema is just trivial – everything is strings).
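
For concreteness, a minimal Tabular Data Resource descriptor might look like the following (shown as a Python dict for consistency with the other sketches; the field names and values are illustrative, not taken from the spec):

```python
descriptor = {
    "name": "sales-2019",                      # hypothetical resource name
    "path": "sales-2019.csv",
    "profile": "tabular-data-resource",
    "encoding": "utf-8",
    "dialect": {"delimiter": ","},
    "schema": {
        "fields": [
            {"name": "date", "type": "date"},
            {"name": "region", "type": "string"},
            {"name": "amount", "type": "number"},
        ]
    },
}
```

In the direct / XLoader-style case the schema degenerates to every field having type "string".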

What is the DataStore and how to create a DataStore entry

https://github.com/ckan/ckan/tree/master/ckanext/datastore

Create an entry

```bash
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{
    "resource": {"package_id": "{PACKAGE-ID}"},
    "fields": [ {"id": "a"}, {"id": "b"} ]
  }'
```

https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create
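
The same call from Python, as a rough sketch using requests; the URL and API key placeholders mirror the curl example above.

```python
import requests

CKAN_URL = "http://127.0.0.1:5000"   # assumed local CKAN instance, as in the curl example
API_KEY = "{YOUR-API-KEY}"

def datastore_create(package_id, fields):
    """Create a DataStore entry (table) for a new resource via the Action API."""
    response = requests.post(
        CKAN_URL + "/api/3/action/datastore_create",
        headers={"Authorization": API_KEY},
        json={"resource": {"package_id": package_id}, "fields": fields},
    )
    response.raise_for_status()
    return response.json()["result"]

# Usage, mirroring the curl call above:
# datastore_create("{PACKAGE-ID}", [{"id": "a"}, {"id": "b"}])
```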

Options for Loading

There are 3 different paths we could take:

All of them start by loading in Python in streaming mode and clearing the DS table if it has existing content, and all of them end with the data in the DataStore; they differ in how the rows get written:

1. Load to DS via PG COPY
2. Load to DS via the DataStore API (a sketch of this option follows the list)
3. Load to DS via SQL over the PG API (e.g. the DataFlows SQL loader)
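
A minimal sketch of option 2, streaming the CSV and pushing rows in batches through the datastore_upsert action with method "insert". The batch size, URL and API key placeholders are assumptions.

```python
import csv
import itertools
import requests

CKAN_URL = "http://127.0.0.1:5000"   # assumed local CKAN instance
API_KEY = "{YOUR-API-KEY}"

def load_via_datastore_api(resource_id, csv_path, batch_size=1000):
    """Option 2: stream the CSV and push rows in batches via the DataStore Write API."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            response = requests.post(
                CKAN_URL + "/api/3/action/datastore_upsert",
                headers={"Authorization": API_KEY},
                json={
                    "resource_id": resource_id,
                    "method": "insert",   # plain inserts, no primary-key matching
                    "records": batch,
                    "force": True,
                },
            )
            response.raise_for_status()
```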

Pros and Cons of different approaches

| Criteria | DataStore Write API | PG Copy | DataFlows |
| --- | --- | --- | --- |
| Speed | Low | High | ??? |
| Error reporting | Yes | Yes | No (?) |
| Ease of implementation | Yes | No (?) | Yes |
| Works with big data | No | Yes | Yes (?) |
| Works well in parallel | No | Yes (?) | Yes (?) |

DataFlows

https://github.com/datahq/dataflows

DataFlows is a framework for loading, processing and manipulating data.
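
A minimal sketch, assuming the load and dump_to_sql processors behave as documented in the DataFlows repo; the table name and engine URL are placeholders.

```python
from dataflows import Flow, load, dump_to_sql

# Stream a CSV and write it to a Postgres table (e.g. the DataStore database).
Flow(
    load("data.csv", name="my-resource"),
    dump_to_sql(
        {"my_datastore_table": {"resource-name": "my-resource"}},
        engine="postgresql://ckan:pass@localhost/datastore",
    ),
).process()
```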

Notes and Q&A (Sep 2019)

Raw insert: ~15m (on 1m rows). Insert with begin/commit: ~5m. COPY: ~82s (though it may be limited by bandwidth – and what happens if the pipe breaks?).

Q: Is it better to put everything in the DB as strings and cast later, or to cast first and then insert into the DB? A: Probably cast first and insert after.

Q: Why do we rush to insert the data into the DB? We will have to wait until it is cast anyway before use. A: It is much faster to do operations in the DB than outside it.