Using DuckDB for data wrangling
Good two part series:
https://duckdb.org/2024/08/19/duckdb-tricks-part-1 https://duckdb.org/2024/10/11/duckdb-tricks-part-2
Here are some of the key items:
Operation | Snippet |
---|---|
Pretty-printing floats | SELECT (10 / 9)::DECIMAL(15, 3) |
Copying the schema | CREATE TABLE tbl AS FROM example LIMIT 0 |
Shuffling data | FROM example ORDER BY hash(rowid + 42) |
Specifying types when reading CSVs | FROM read_csv('example.csv', types = {'x': 'DECIMAL(15, 3)'}) |
Updating CSV files in-place | COPY (SELECT s FROM 'example.csv') TO 'example.csv' |
Operation | SQL instructions |
---|---|
Fixing timestamps in CSV files | regexp_replace() and strptime() |
Filling in missing values | CROSS JOIN , LEFT JOIN and coalesce() |
Repeated transformation steps | CREATE OR REPLACE TABLE t AS ... FROM t ... |
Computing checksums for columns | bit_xor(md5_number(COLUMNS(*)::VARCHAR)) |
Creating a macro for checksum | CREATE MACRO checksum(tbl) AS TABLE ... |
CREATE TABLE schedule_cleaned AS SELECT timeslot .regexp_replace(' (\d+)(am|pm)$', ' \1.00\2') .strptime('%Y-%m-%d %H.%M%p') AS timeslot, location, event FROM schedule_raw;
Repeated Data Transformation Steps
Nice tip on how to run processes.
Data cleaning and transformation usually happens as a sequence of transformations that shape the data into a form that’s best fitted to later analysis. These transformations are often done by defining newer and newer tables using
CREATE TABLE ... AS SELECT
statements.For example, in the sections above, we created
schedule_raw
,schedule_cleaned
, andschedule_filled
. If, for some reason, we want to skip the cleaning steps for the timestamps, we have to reformulate the query computingschedule_filled
to useschedule_raw
instead ofschedule_cleaned
. This can be tedious and error-prone, and it results in a lot of unused temporary data – data that may accidentally get picked up by queries that we forgot to update!In interactive analysis, it’s often better to use the same table name by running
CREATE OR REPLACE
statements:CREATE OR REPLACE TABLE table_name AS ... FROM table_name ...;
Questions
One question I still have is:
How I would run this in e.g. GitHub Actions? How would i store the code, chain steps together. Do I use something like dbt or can I just use a simpler way?
Ok, let's try it out …
Have grants table here …
create table grants2 as select Grant, Amount.replace('$', '').replace(',', '') as Amount from grants;
alter table grants2 alter Amount type int;
select sum(Amount) from grants2;
total is: 3,703,705,877 ($3.7bn)