Data Load Design

Key point: this is classic ETL, so we should reuse existing ETL patterns and tooling.

Logic

In more detail, dividing E(xtract) and T(ransform) from L(oad):

Load step in even more detail

Runner

We will use Airflow.

Research

What is a Tabular Data Resource?

See Frictionless Specs. For our purposes:

NB: even if you want to go the direct-loading route (à la XLoader) and skip typing, you still need encoding etc. sorted out – and it still fits in the diagram above (the Table Schema is just trivial – every field is a string).
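A minimal sketch of such a trivial schema, assuming the Frictionless field layout of `{"name": ..., "type": ...}`; the function name and sample CSV are illustrative, not from CKAN or XLoader:

```python
import csv
import io

def trivial_table_schema(csv_text):
    # Direct-loading shortcut: read only the header row and declare
    # every field as type "string" (a "trivial" Table Schema).
    header = next(csv.reader(io.StringIO(csv_text)))
    return {"fields": [{"name": name, "type": "string"} for name in header]}

schema = trivial_table_schema("id,name,amount\n1,apple,3.5\n")
# schema == {"fields": [{"name": "id", "type": "string"},
#                       {"name": "name", "type": "string"},
#                       {"name": "amount", "type": "string"}]}
```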

What is datastore and how to create the DataStore entry

https://github.com/ckan/ckan/tree/master/ckanext/datastore

Create an entry

curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
  -H "Authorization: {YOUR-API-KEY}" \
  -d '{
  "resource": {"package_id": "{PACKAGE-ID}"},
  "fields": [ {"id": "a"}, {"id": "b"} ]
  }'

https://docs.ckan.org/en/2.8/maintaining/datastore.html#ckanext.datastore.logic.action.datastore_create
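The same call from Python, here only building the request object rather than sending it (the helper name is illustrative; placeholders for the API key and package id are kept as placeholders):

```python
import json
import urllib.request

def datastore_create_request(base_url, api_key, package_id, fields):
    # Build (but do not send) the datastore_create POST request.
    payload = {
        "resource": {"package_id": package_id},
        "fields": fields,
    }
    return urllib.request.Request(
        base_url + "/api/3/action/datastore_create",
        data=json.dumps(payload).encode(),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = datastore_create_request(
    "http://127.0.0.1:5000", "{YOUR-API-KEY}", "{PACKAGE-ID}",
    [{"id": "a"}, {"id": "b"}],
)
# urllib.request.urlopen(req) would actually send it.
```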

Options for Loading

There are 3 different paths we could take: the Datastore Write API, PG COPY, and Dataflows.

Pros and Cons of different approaches

| Criteria | Datastore Write API | PG Copy | Dataflows |
| --- | --- | --- | --- |
| Speed | Low | High | ??? |
| Error reporting | Yes | Yes | No(?) |
| Ease of implementation | Yes | No(?) | Yes |
| Works with big data | No | Yes | Yes(?) |
| Works well in parallel | No | Yes(?) | Yes(?) |

DataFlows

https://github.com/datahq/dataflows

Dataflows is a framework for loading, processing, and manipulating data.
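A pure-Python sketch of the idea: a flow chains row processors over a stream of dict rows. This mimics the pattern only; the real `dataflows` library on PyPI has a richer API, and the names below are illustrative:

```python
def flow(rows, *steps):
    # Run each processing step over the row stream in order.
    for step in steps:
        rows = step(rows)
    return list(rows)

def set_type(field, cast):
    # A processor that casts one field on every row.
    def step(rows):
        for row in rows:
            row[field] = cast(row[field])
            yield row
    return step

result = flow([{"a": "1"}, {"a": "2"}], set_type("a", int))
# result == [{"a": 1}, {"a": 2}]
```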

Notes and Q&A (Sep 2019)

Raw insert: ~15m (on 1m rows). Insert with begin/commit: ~5m. COPY: ~82s (though it may be limited by bandwidth – and what happens if the pipe breaks?)
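The begin/commit speedup comes from batching all rows into one transaction instead of paying a flush per row. A sketch of the pattern, using SQLite as a stand-in for Postgres (COPY itself is Postgres-specific and not shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
rows = [(i,) for i in range(1000)]

# Slow pattern ("raw insert"): commit after every row, paying a
# transaction flush each time.
# Fast pattern ("insert with begin/commit"): one transaction for
# the whole batch.
with conn:
    conn.executemany("INSERT INTO t VALUES (?)", rows)

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
# count == 1000
```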

Q: Is it better to put everything into the DB as strings and cast later, or to cast first and then insert into the DB? A: Probably cast first and insert after.

Q: Why do we rush to insert the data into the DB? We will have to wait until it's cast anyway before use. A: It's much faster to do operations in the DB than outside it.
