David Gasquez + Rufus Pollock 2023-03-09 and 2023-03-13
2023-03-13
Present: David, Rufus
Agenda
- Check-in
- Create agenda
- Review agenda
- David's notes https://hackmd.io/EzmOTpRfQpeYmiwkGeslEA?both
- Review our actions from last time
- Check-out and next steps
- David is going to dump notes about data companies (https://github.com/datopian/datahub-next/issues/49)
- Look at patterns or tools for maintaining github.com/datasets
- @davidgasquez - https://github.com/datasets/awesome-data/issues/375
Notes
- Q: What is your pain point and do you think others have it?
- A scratchpad for data, especially one that encourages me to then publish/package it
- What do people actually want to do? They want to tell a data story, make a point (some people want to be a librarian, but that's rare!)
- Is it intrinsic to open data? Not intrinsic, but reuse and shared maintenance really make sense around open, or at least shared, data
- Not much incentive to do "data packages" in companies
- Start with publishing and then walk backwards
- A community (even if 3 people!)
- David started to play with different tooling
- What's the MVP?
- Data Factory is changing fast, and it's very personal. Complicated, especially to run in the cloud
- Feedback on Datopian docs: Hard to know what is obsolete and what is current from the docs.
- Running things as a community: https://docs.bacalhau.org/getting-started/docker-workload-onboarding
- How do you see things like IPFS/HyperCore/Bacalhau/ODF helping the open data movement?
Asides
- "Small data" is larger now (Arrow, DuckDB, …)
💡 Reproduce Our World in Data posts …
- README with pre-computed data and graphs (i.e. any tables etc are pre-computed)
- README with live data, i.e. ```sql select xxx from yyy```
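For example, a live-data README could embed a query like the sketch below. This is only a hypothetical illustration: the table and column names are placeholders (not an existing DataHub or OWID schema), just to show the shape of a query that would be rendered into a table or chart on view.
```sql
-- Hypothetical query a data-rich README could embed and render as a table/chart.
-- Table and column names are placeholders, not a real published dataset.
select
    year,
    country,
    co2_per_capita
from owid_co2
where country = 'Spain'
order by year
```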
David's improvements for github.com/datasets
- Default GH Actions
- Docker images
- Makefile or something similar as a consistent entrypoint, e.g. `make data`
2023-03-09
Present: David, Rufus
- Introductions / Hello
- David's thoughts on open data: https://publish.obsidian.md/davidgasquez/Open+Data
- Discussion of vision
- Data Scratch/Canvas
- Data Integration
Next steps
- https://datahub.io/notes/plan and vision ✅ 2023-03-13 David looked at this; see his notes
- https://datahub.io/docs/core-data/ and github.com/datasets/ 🚧 2023-03-13 David looked at this
- Brainstorm individual write-ups from David's research so far 🚚 MOVED to https://github.com/datopian/datahub-next/issues/49
- Identify collaboration opportunities on "DataHub Next"
- What does David like doing / want to do
- David give feedback on anything there ✅2023-03-13 Feedback here: https://hackmd.io/EzmOTpRfQpeYmiwkGeslEA?edit
David's Story
I'm in a company. I have a hunch I want to investigate with data.
In a company this is relatively easy to do:
- I know where to look.
- If the data is there, it is probably easily queryable (schemas, tests, …), as there is a team (the data team) managing it.
- Getting external data is also easy. Run your Singer or Airbyte tap: https://airbyte.com/
- For each part of the stack, there are interoperable standards and tools. e.g. warehouses (S3, Redshift, Clickhouse), modeling (getdbt, Airflow + python), orchestration (prefect, dagster, airflow, …) etc
With public data, it is much more painful:
e.g. I want to check how the number of graduates from COUNTRY relates to its GDP.
I want to be able to grab and work on a public data source as easily as I can inside a company, and reuse the same tools a data company uses.
select
    university_data.date,
    university_data.graduates,
    country_data.gdp
from university_data
left join country_data
    -- join key assumed for illustration; the original sketch omitted the join condition
    on university_data.country = country_data.country
In a company, models compound: you don't have to derive the same models over and over. In Open Data, we usually only have the raw data. We could collaborate on data models the way people are doing in some DAOs (https://github.com/duneanalytics/spellbook/tree/main/models, https://github.com/MetricsDAO/near_dbt/pull/98) and reuse these models in a permissionless way.
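As a rough sketch of what that permissionless reuse could look like, here is a hypothetical dbt-style model; the upstream model names (`stg_university_data`, `stg_country_data`) are illustrative, not taken from spellbook or near_dbt.
```sql
-- graduates_vs_gdp.sql: a downstream model that builds on shared, already-cleaned
-- upstream models via ref() instead of re-deriving everything from raw data.
-- All model names here are hypothetical.
select
    u.date,
    u.country,
    u.graduates,
    c.gdp
from {{ ref('stg_university_data') }} as u
left join {{ ref('stg_country_data') }} as c
    on u.country = c.country
    and u.date = c.date
```
Anyone could then ref() this model in turn, which is exactly the compounding effect companies get internally.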
Aside: [Rufus] idea of a data collective where people get paid based on their contribution to data (a real connection with DAOs; the key point is that we are not storing the data on a blockchain, only the contributions and membership).
If we could connect this with virtual impact certificates, and all the impact DAO and public goods work.
Sketches of collaboration
Workflow
- I come across an interesting dataset online
- I want to create a page for it (and package it …)
- 2 options
- Option 1: just add a page to the "wiki" (quick and dirty)
- Option 2: its own folder (and its own repo at some point) 👈 preferred
- Link it to datahub: a small workflow for linking your dataset (see https://datahub.io/notes/design-publish-ui), or even just have datahub proxy it
- i.e. datahub.io/@david/my-dataset automatically proxies to github.com/david/my-dataset (and gives an error if it's not there or not data …)
- Showcase: turn README into a nice page maybe with extra features - https://datahub.io/notes/design-showcase
- MDX+Data aka data-rich documents: markdown + MDX + data components for tables, graphs, etc.
- can even proxy issues
Next: add useful workflows (or a tutorial on how people can add their own)
Notes
- Have you read datahub.io/docs/dms/
- https://github.com/davidgasquez/
- Splitgraph's Splitfiles: https://www.splitgraph.com/docs/sgr-advanced/concepts/splitfiles
- Working on adapters and packaging (Debian-style)
- Like PostgreSQL FDWs (foreign data wrappers)
- DataHub Next vision aka Data Canvas/Scratch (extending to Data Project)
- Projects related to Data Project/Data Canvas:
- https://rath.kanaries.net/
- evidence.dev - a kind of markdown + SQL
- Re package managers:
Data Project
Turn GitHub into a DataHub. Easy, fast, reliable data publishing.
Data Canvas
What would be an amazing experience? An experience I personally would love …?
- Home screen
- Sign in
- Straight onto a canvas where I can drop things, especially data files or URLs to sites, and get previews, e.g. I can add a data.csv and immediately get a preview, I can drop a URL and get a screenshot preview of that site, I can add an image
- data.csv is uploaded in the background for me
- can link things together
- can group things inside larger "boxes"
- can click on any object and start adding metadata
- can split out groups to their own separate canvas and keep that canvas embedded as a sub-canvas
- NB: a reduced version of this (much simpler) is that it is not a canvas but more of a flowing page. Everything here is the same except:
- No visual layout
- No linking things with arrows etc
- Grouping would have to be by sections
- Embedding of bigger canvas can be link outs or transclusions
- And … I can jump out of this as a power user and go into the backend, which is a GitHub repo + cloud (for assets) + API (for data viewing)
A bit of history
- 2005-2007: Dream of data package management
- https://okfnlabs.org/projects/dpm/
- ckan.net => thedatahub.org => datahub.io (old.datahub.io)
- 2011: https://blog.okfn.org/2011/02/11/as-coder-is-for-code-x-is-for-data/ => this has become data engineering
- https://github.com/datasets - one of the first takes on the "datasets on GitHub" idea
- data in git / hg / svn in the 2004-2007 Open Economics project
Sorry for the message out of the blue. Turns out that I was recently exploring moving out of Obsidian Publish for my small handbook and discovered https://github.com/flowershow/flowershow. It looked very cool so I wanted to check out the folks behind it. Was surprised to see yet another amazing project coming from Datopian/Life Itself!
Been following what you're doing for a while and you've been one of the greatest inspirations for some of the thoughts I have around Open Data and Open Knowledge: https://publish.obsidian.md/davidgasquez/Open+Data
The main reason I'm reaching out is to check if you're still looking for a Data Engineer: https://github.com/datopian/hiring/blob/main/README.md?plain=1#L10.
Would love to hear more if you are but no pressure at all. Thanks for doing all the projects you do Rufus. Keep rocking!
Email from David 2023-02-23
I've been following and getting inspired by your work for a while. Wanted to reach out since I recently spotted a potential Data Engineering opening at Datopian. Do you know who would be the best person to reach out to about that?
I was recently affected by layoffs at Protocol Labs, where I was working with Juan Benet (who you might remember from old interactions) on data management on top of IPFS/Filecoin. As silly as it sounds, I've been passionate about building yet another git for Open Data. In that journey, I've discovered that you've been thinking about that much longer than me and even tried things out in a distributed way with the Dat project.
If you're up for it, I'd love to chat and learn more about your current thoughts on OKFN, frictionless specs, Open Data, and networks like IPFS! I've recently been talking with folks at OWID and Catalyst Coop and feeling (again) that the time has come for a standard protocol or improved interoperability for open datasets, making organizations working with open data more effective.
Thanks for all the work you've done there and sorry for the long email!