Data Load Design
Key point: this is classic ETL, so let's reuse those patterns and tooling.
Logic
In more detail, dividing ET (extract and transform) from L (load):
Load step in even more detail
Runner
We will use Airflow.
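To make this concrete, here is a minimal sketch of what the runner DAG could look like, with one ET task feeding one Load task. This assumes Airflow 1.x style imports; the DAG id, task names and callables are hypothetical placeholders, not a settled interface.

```python
# Minimal sketch of an Airflow DAG for the ET -> L split (Airflow 1.x imports).
# DAG id, task ids and callables are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_transform(**context):
    # Fetch the source file, sort out encoding, attach a Table Schema,
    # and write a clean Tabular Data Resource to a staging location.
    pass


def load_to_datastore(**context):
    # Create the DataStore entry and load the rows (API, COPY or dataflows).
    pass


dag = DAG(
    "data_load",
    start_date=datetime(2019, 9, 1),
    schedule_interval=None,  # triggered per resource, not on a schedule
)

et = PythonOperator(task_id="extract_transform", python_callable=extract_transform,
                    provide_context=True, dag=dag)
load = PythonOperator(task_id="load", python_callable=load_to_datastore,
                      provide_context=True, dag=dag)

et >> load
```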
Research
What is a Tabular Data Resource?
See Frictionless Specs. For our purposes:
- A "good" CSV file: valid CSV with one header row, no blank headers, etc.
- Encoding worked out – usually we should have already converted to UTF-8
- Dialect - https://frictionlessdata.io/specs/csv-dialect/
- Table Schema https://frictionlessdata.io/specs/table-schema
NB: even if you want to go the direct loading route (a la XLoader) and ignore types, you still need encoding etc. sorted out – and it still fits in the diagram above (the Table Schema is just trivial – everything is a string).
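For orientation, a rough sketch of what a Tabular Data Resource descriptor carries, written as a Python dict. The path, dialect options and field names below are made-up illustrations, not examples taken from the specs.

```python
# Sketch of a Tabular Data Resource descriptor (illustrative values only).
tabular_data_resource = {
    "name": "my-resource",
    "path": "data/my-resource.csv",
    "format": "csv",
    "encoding": "utf-8",        # encoding already sorted out upstream
    "dialect": {                # https://frictionlessdata.io/specs/csv-dialect/
        "delimiter": ",",
        "header": True,
    },
    "schema": {                 # https://frictionlessdata.io/specs/table-schema
        "fields": [
            {"name": "a", "type": "string"},
            {"name": "b", "type": "integer"},
        ],
    },
}

# Direct-load variant (a la XLoader): same descriptor, but every field is a string.
all_strings_schema = {"fields": [{"name": "a", "type": "string"},
                                 {"name": "b", "type": "string"}]}
```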
What is the DataStore and how to create a DataStore entry
https://github.com/ckan/ckan/tree/master/ckanext/datastore
- provides an ad hoc database for storage of structured data from CKAN resources
- Connection with Datapusher: https://docs.ckan.org/en/2.8/maintaining/datastore.html#datapusher-automatically-add-data-to-the-datastore
- Datastore API: https://docs.ckan.org/en/2.8/maintaining/datastore.html#the-datastore-api
- Making Datastore API requests: https://docs.ckan.org/en/2.8/maintaining/datastore.html#making-a-datastore-api-request
Create an entry
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create \
    -H "Authorization: {YOUR-API-KEY}" \
    -d '{
        "resource": {"package_id": "{PACKAGE-ID}"},
        "fields": [ {"id": "a"}, {"id": "b"} ]
    }'
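The same flow from Python, using the documented datastore_create and datastore_upsert actions. The URL, API key, package id, field names and records below are placeholders.

```python
# Sketch: create a DataStore table and push rows via the Datastore Write API.
# URL, API key, package id, field names and records are placeholders.
import requests

CKAN_URL = "http://127.0.0.1:5000"
headers = {"Authorization": "{YOUR-API-KEY}"}

# 1. Create the DataStore entry (same as the curl call above).
created = requests.post(
    f"{CKAN_URL}/api/3/action/datastore_create",
    json={
        "resource": {"package_id": "{PACKAGE-ID}"},
        "fields": [{"id": "a", "type": "text"}, {"id": "b", "type": "int"}],
    },
    headers=headers,
).json()
resource_id = created["result"]["resource_id"]

# 2. Insert rows with datastore_upsert (method=insert skips the upsert key lookup).
requests.post(
    f"{CKAN_URL}/api/3/action/datastore_upsert",
    json={
        "resource_id": resource_id,
        "method": "insert",
        "records": [{"a": "x", "b": 1}, {"a": "y", "b": 2}],
    },
    headers=headers,
)
```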
Options for Loading
There are 3 different paths we could take: the Datastore Write API, PostgreSQL COPY, or Dataflows.
Pros and Cons of different approaches
| Criteria | Datastore Write API | PG Copy | Dataflows |
|---|---|---|---|
| Speed | Low | High | ??? |
| Error reporting | Yes | Yes | No (?) |
| Ease of implementation | Yes | No (?) | Yes |
| Works with big data | No | Yes | Yes (?) |
| Works well in parallel | No | Yes (?) | Yes (?) |
DataFlows
https://github.com/datahq/dataflows
Dataflows is a framework for loading, processing, and manipulating data.
- Loader (loading from an external source or disk): https://github.com/datahq/dataflows/blob/master/dataflows/processors/load.py
- Load to an SQL DB (dump processed data): https://github.com/datahq/dataflows/blob/master/dataflows/processors/dumpers/to_sql.py (see the sketch after this list)
- What about error reporting? What is the runner system? Does it have a UI? Does it have a queue system?
- Think data package pipelines takes care of all of these: https://github.com/frictionlessdata/datapackage-pipelines
- DPP itself is also an ETL framework, just much heavier and a bit more complicated.
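A minimal dataflows sketch wiring together the two processors linked above (load and dump_to_sql). The CSV path, resource/table names and database URL are placeholders.

```python
# Sketch: load a CSV and dump it to a Postgres table with dataflows.
# File path, resource/table names and DB URL are placeholders.
from dataflows import Flow, load, dump_to_sql

Flow(
    load("data/my-resource.csv", name="my-resource"),   # infers a Table Schema
    dump_to_sql(
        {"my_resource": {"resource-name": "my-resource"}},  # target table -> source resource
        engine="postgresql://user:pass@localhost/datastore",
    ),
).process()
```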
Notes and QA (Sep 2019)
- Note: the TDR needs info on the source CKAN Resource so we can create the right DataStore entry …
- No need to validate as we assume the data is good …
  - We might still want to do that …
- Pros and Cons
  - Speed
  - Error reporting …
    - What happens with COPY if you hit an error (e.g. a bad cast)?
    - https://infinum.co/the-capsized-eight/superfast-csv-imports-using-postgresqls-copy
    - https://wiki.postgresql.org/wiki/Error_logging_in_COPY
  - Ease of implementation
  - Good for inserting big data
    - Create as strings and cast later … ?
- xloader implementation with COPY command: https://github.com/ckan/ckanext-xloader/blob/fb17763fc7726084f67f6ebd640809ecc055b3a2/ckanext/xloader/loader.py#L40
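For reference, the core of the COPY path (roughly what the xloader code linked above does) comes down to something like this with psycopg2. The DSN, table and column names are placeholders. Note that COPY is all-or-nothing: one bad row (e.g. a failed cast) aborts the whole load, which is why xloader loads every column as text and leaves casting for later.

```python
# Sketch of the PG COPY path with psycopg2 (DSN, table and columns are placeholders).
# COPY aborts on the first bad row, so loading as text sidesteps cast errors.
import psycopg2

connection = psycopg2.connect("postgresql://user:pass@localhost/datastore")
with connection, connection.cursor() as cursor, open("data/my-resource.csv") as f:
    cursor.copy_expert(
        'COPY "my_resource" ("a", "b") FROM STDIN WITH (FORMAT CSV, HEADER TRUE)',
        f,
    )
connection.close()
```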
Rough benchmarks on ~1m rows: raw insert ~15m; insert with begin/commit ~5m; COPY ~82s (though it may be limited by bandwidth) – and what happens if the pipe breaks?
Q: Is it better to put everything in the DB as strings and cast later, or to cast first and then insert? A: Probably cast first and insert after.
Q: Why do we rush to insert the data into the DB? We will have to wait until it's cast anyway before use. A: It's much faster to do operations in the DB than outside it.
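A small sketch of the "cast first, insert after" answer, using the tableschema library to cast rows against a Table Schema before they hit the DB. The schema and rows are illustrative only.

```python
# Sketch: cast rows against a Table Schema before inserting into the DB
# (illustrative schema and rows).
from tableschema import Schema

schema = Schema({
    "fields": [
        {"name": "a", "type": "string"},
        {"name": "b", "type": "integer"},
    ]
})

raw_rows = [["x", "1"], ["y", "2"]]                      # everything arrives as strings
cast_rows = [schema.cast_row(row) for row in raw_rows]   # e.g. [['x', 1], ['y', 2]]
# cast_rows can then go to the DB via the Write API, COPY or dataflows with proper types.
```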