Rufus notes

Re 2431-storage-layer-and-metastore-v0.3-git-inspired
- What is needed to finish

Outflow

Be useful to have a map of products and how they connect / overlap
- Then some analysis of which to prioritize
Want examples on the site
Want a demo / trial without signing up?
🧊 Add a marketplace section to capture all the people who are coming for data …
What's the experience we want
Find all the previous write-ups of UX options and post them …
Create a clear ⏭️ next list

Finishing the storage layer design

Diagram in excalidraw showing …

Explanation of how we copy from source

Source:
- git(hub)
- Local on disk

Why can't we use github as our direct storage layer?

GitHub is unsuitable primarily because its API rate limit for accessing files is very low (~5k requests/hour). We hit the storage layer on every read request for a page (perhaps multiple times).

In addition:

we need additional storage anyway for computed material etc. In that case we may as well have one consolidated place for storage.
for large files GitHub would work poorly or not at all (even for e.g. image files)
one single storage layer no matter what the original source (one day we will support sources other than GitHub)
permissions and processing may be simpler: we just need the user to give read access to copy over once and don't need it for every anonymous read (in essence, we can separate our DataHub permissions from GitHub permissions more cleanly)
using our storage layer is probably faster (R2 is close to the edge, we don't go through GitHub's API layer etc)
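
A back-of-envelope sketch of the rate-limit point, assuming the ~5k requests/hour figure above and that every page view reads from GitHub directly. The reads-per-page values are hypothetical, just to show how quickly the budget shrinks.

```python
# Back-of-envelope: page views per hour that fit inside the GitHub API limit
# if every page view reads straight from GitHub.
GITHUB_LIMIT_PER_HOUR = 5000  # approximate figure from the notes above


def max_page_views_per_hour(reads_per_page: int, limit: int = GITHUB_LIMIT_PER_HOUR) -> int:
    """Whole page views per hour that fit in the limit."""
    if reads_per_page < 1:
        raise ValueError("reads_per_page must be at least 1")
    return limit // reads_per_page


for reads in (1, 3, 5):  # hypothetical number of reads per page view
    views = max_page_views_per_hour(reads)
    print(f"{reads} read(s)/page -> {views} views/hour (~{views / 3600:.2f}/s)")
```

Even at one read per page this is under 1.5 page views per second across the whole site.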

Why not use project database for all content?

Don't want to store large files in database
So, for consistency, may as well keep all content files out of the database
Cleaner to have database "rebuildable" (see database as an index rather than source of truth)
- ASIDE: do we store the project info file in the project itself, with its owner? (that would be cool)
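
A minimal sketch of "database as a rebuildable index". Assumptions: the storage layer is faked as a dict of path to bytes, and the `files` table and its columns are hypothetical. The point is only that the index can be dropped and recreated from storage alone.

```python
import sqlite3

# Hypothetical stand-in for the storage layer: path -> raw bytes.
storage = {
    "me/my-project/README.md": b"# My project\n",
    "me/my-project/data.csv": b"a,b\n1,2\n",
}


def rebuild_index(storage: dict[str, bytes]) -> sqlite3.Connection:
    """Recreate the index database from scratch using only storage contents."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER, ext TEXT)")
    for path, blob in storage.items():
        ext = path.rsplit(".", 1)[-1] if "." in path else ""
        db.execute("INSERT INTO files VALUES (?, ?, ?)", (path, len(blob), ext))
    db.commit()
    return db


db = rebuild_index(storage)
rows = db.execute("SELECT path, size FROM files ORDER BY path").fetchall()
print(rows)
```

Nothing in the database is a source of truth here: losing it costs a rebuild, not data.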

Current sequence from github to storage layer for a request

Have an architectural separation between "import/sync" of data/content into storage layer and then read from it …

Copy from github into storage layer
- Raw-ish copy of files and the tree info
- May do some additional processing e.g. adding metadata from markdowndb
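
The copy step above could be sketched roughly as follows. Everything here is an assumption for illustration: the source is already fetched into a dict, the storage layer is a dict, the key layout (`raw/`, `tree.json`) is made up, and `first_heading` is a toy stand-in for metadata extraction, not the markdowndb API.

```python
import hashlib
import json
from typing import Optional


def first_heading(blob: bytes) -> Optional[str]:
    """Toy metadata step: return the first '# ' heading of a markdown file."""
    for line in blob.decode("utf-8", errors="replace").splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return None


def sync_to_storage(source_files: dict[str, bytes], storage: dict[str, bytes], prefix: str) -> int:
    """Copy raw files under `prefix` and write the tree info next to them."""
    tree = []
    for path, blob in sorted(source_files.items()):
        storage[f"{prefix}/raw/{path}"] = blob  # raw-ish copy, no transformation
        entry = {"path": path, "size": len(blob), "sha256": hashlib.sha256(blob).hexdigest()}
        if path.endswith(".md"):  # optional additional processing
            entry["title"] = first_heading(blob)
        tree.append(entry)
    storage[f"{prefix}/tree.json"] = json.dumps(tree).encode()
    return len(tree)


storage: dict[str, bytes] = {}
copied = sync_to_storage({"README.md": b"# Hello\n", "data.csv": b"a\n"}, storage, "me/my-project")
print(copied, sorted(storage))
```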

Contrast this with the simple design

Request for @me/my-project/myfile => app => app requests source file from github => app renders it
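
To make the contrast concrete, a small sketch that counts calls to the source in each design. `FakeGitHub`, `render`, and the handler names are hypothetical stand-ins, not real APIs.

```python
class FakeGitHub:
    """Hypothetical stand-in for the GitHub API that counts calls."""

    def __init__(self, files: dict[str, bytes]):
        self.files = files
        self.calls = 0

    def get_file(self, path: str) -> bytes:
        self.calls += 1
        return self.files[path]


def render(blob: bytes) -> str:
    return blob.decode()


def handle_simple(github: FakeGitHub, path: str) -> str:
    """Simple design: every request goes to the source."""
    return render(github.get_file(path))


def sync(github: FakeGitHub, storage: dict[str, bytes]) -> None:
    """Import/sync step: copy everything from the source once."""
    for path in list(github.files):
        storage[path] = github.get_file(path)


def handle_with_storage(storage: dict[str, bytes], path: str) -> str:
    """Storage-layer design: reads never touch the source."""
    return render(storage[path])


simple = FakeGitHub({"myfile": b"hello"})
for _ in range(100):
    handle_simple(simple, "myfile")

synced = FakeGitHub({"myfile": b"hello"})
storage: dict[str, bytes] = {}
sync(synced, storage)
for _ in range(100):
    handle_with_storage(storage, "myfile")

print(simple.calls, synced.calls)
```

In the simple design source calls grow with traffic; with the storage layer they grow only with the number of syncs and files.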

Extra things to discuss

Computing stuff …
Indexing stuff …

Markdown based product ideas

Here are the sketch and notes from March 2023 - datahub-next-direction-march-2023
- Worth re-reading
Discussion issue with David Gasquez (could turn into a post) #todo find that

List

Data Project
Data Story
Data Scratchpad
Markdown-based wiki
Markdown-based blog
Markdown-based website
Markdown-based single page site (home page)
Simple visualization app

What should we do for each of these?

What criteria do we evaluate with?

Stuff for David Gasquez

Organize a regular chat. Use the chat to drive a write-up
Organize a short free course (and use the course to drive)

Stuff to post from DataHub v2 days …

Old deck
Old SCQA
Various notes about Data Experience vs Developer Experience