Rufus notes
- Re 2431-storage-layer-and-metastore-v0.3-git-inspired
- What is needed to finish
Outflow
- Be useful to have a map of products and how they connect / overlap
- Then some analysis of which to prioritize
- Want examples on the site
- Want a demo / trial without signing up?
- 🧊 Add a marketplace section to capture all the people who are coming for data …
- What's the experience we want
- Find all the previous write-ups of UX options and post them …
- Create a clear ⏭️ next list
Finishing the storage layer design
Diagram in excalidraw showing …
Explanation of how we copy from source
- Source:
- git(hub)
- Local on disk
Why can't we use github as our direct storage layer?
Github is unsuitable primarily because its API limit on accessing files is very low (~5k requests/hour), and we hit the storage layer for every read request for a page (perhaps multiple times).
In addition:
- we need additional storage anyway for computed material etc. In that case we may as well have one consolidated place for storage.
- for large files github would work poorly or not at all (even for e.g. image files)
- one single storage layer no matter what the original source (one day we may support sources other than github)
- permissions and processing may be simpler: we just need the user to give read access once to copy over, and don't need it for every anonymous read (in essence, we can separate our DataHub permissions from github permissions more cleanly).
- using our storage layer is probably faster (r2 is close to the edge, we don't go through github's api layer etc)
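A rough back-of-envelope sketch of the rate limit point above. It assumes the ~5k/h figure from these notes and a hypothetical number of storage reads per page view. The function name and constant are illustrative only, not part of any existing code.

```python
# Assumption: mirrors the ~5k requests/hour figure quoted in the notes.
GITHUB_LIMIT_PER_HOUR = 5000


def max_page_views_per_hour(reads_per_page: int,
                            limit: int = GITHUB_LIMIT_PER_HOUR) -> int:
    """Upper bound on page views per hour if every storage read is a github API call."""
    if reads_per_page < 1:
        raise ValueError("reads_per_page must be at least 1")
    return limit // reads_per_page


if __name__ == "__main__":
    # Hypothetical read counts per page view.
    for reads in (1, 2, 5):
        print(f"{reads} read(s) per page: at most {max_page_views_per_hour(reads)} page views/hour")
```

Even at one read per page, the ceiling is shared across all projects and users behind the same credentials, which is why a separate storage layer looks necessary.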
Why not use project database for all content?
- Don't want to store large files in database
- So may as well not store all content files in database for consistency
- Cleaner to have database "rebuildable" (see database as an index rather than source of truth)
- ASIDE: do we store the project info file in the project itself, along with its owner? (that would be cool)
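A minimal sketch of the "database as a rebuildable index" idea, assuming the storage layer can be listed as path/content pairs. The in-memory dict standing in for storage, the `files` table, and `rebuild_index` are all hypothetical names for illustration.

```python
import sqlite3


def rebuild_index(storage: dict[str, bytes]) -> sqlite3.Connection:
    """Build a fresh index from storage; the database holds no content of its own."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER)")
    db.executemany(
        "INSERT INTO files (path, size) VALUES (?, ?)",
        [(path, len(content)) for path, content in storage.items()],
    )
    db.commit()
    return db


if __name__ == "__main__":
    # Hypothetical storage contents: the source of truth.
    storage = {"README.md": b"# Hello", "data/cities.csv": b"name,pop\n"}
    index = rebuild_index(storage)
    for row in index.execute("SELECT path, size FROM files ORDER BY path"):
        print(row)
```

If the database is lost or its schema changes, running `rebuild_index` again recovers it, since it only stores what can be derived from storage.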
Current sequence from github to storage layer for a request
Have an architectural separation between "import/sync" of data/content into the storage layer and reads from it …
- Copy from github into storage layer
- Raw-ish copy of files and the tree info
- May do some additional processing e.g. adding metadata from markdowndb
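The import/sync vs read separation above could be sketched roughly as follows. This assumes an in-memory stand-in for the storage layer and a plain dict standing in for the github source. `StorageLayer`, `sync_project` and `read_file` are hypothetical names, and the markdowndb metadata step is left out.

```python
from dataclasses import dataclass, field


@dataclass
class StorageLayer:
    """In-memory stand-in for the real storage layer (an assumption for this sketch)."""
    files: dict[str, bytes] = field(default_factory=dict)
    tree: dict[str, list[str]] = field(default_factory=dict)  # project -> sorted paths


def sync_project(source: dict[str, bytes], project: str, storage: StorageLayer) -> None:
    """Import/sync step: raw-ish copy of the files plus the tree info."""
    for path, content in source.items():
        storage.files[f"{project}/{path}"] = content
    storage.tree[project] = sorted(source)


def read_file(storage: StorageLayer, project: str, path: str) -> bytes:
    """Read step: touches only the storage layer, never the original source."""
    return storage.files[f"{project}/{path}"]


if __name__ == "__main__":
    storage = StorageLayer()
    sync_project({"myfile": b"# My file"}, "@me/my-project", storage)
    print(read_file(storage, "@me/my-project", "myfile"))
```

The point of the split is that `read_file` never needs source credentials or source API quota; only `sync_project` does.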
Contrast this with the simple design
- Request for @me/my-project/myfile => app => app requests source file from github => app renders it
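For comparison, a sketch of the simple design, where every request goes straight through to the source. `CountingSource` and `handle_request` are hypothetical, and the rendering is a trivial stand-in.

```python
class CountingSource:
    """Stand-in for github that counts how often it is called (illustrative only)."""

    def __init__(self, files: dict[str, str]) -> None:
        self.files = files
        self.calls = 0

    def fetch(self, path: str) -> str:
        self.calls += 1
        return self.files[path]


def handle_request(source: CountingSource, path: str) -> str:
    """Simple design: fetch from the source on every request, then render."""
    raw = source.fetch(path)
    return f"<article>{raw}</article>"  # trivial stand-in for real rendering


if __name__ == "__main__":
    source = CountingSource({"@me/my-project/myfile": "# My file"})
    for _ in range(3):
        handle_request(source, "@me/my-project/myfile")
    print(source.calls)  # one source call per request
```

Here source calls grow one-for-one with page requests, which is what runs into the API limit discussed earlier.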
Extra things to discuss
- Computing stuff …
- Indexing stuff …
Markdown based product ideas
- Here are the sketch and notes from March 2023 - datahub-next-direction-march-2023
- Worth re-reading
- Discussion issue with David Gasquez (could turn into a post) #todo find that
List
- Data Project
- Data Story
- Data Scratchpad
- Markdown-based wiki
- Markdown-based blog
- Markdown-based website
- Markdown-based single page site (home page)
- Simple visualization app
What should we do for each of these?
What criteria do we evaluate with?
Stuff for David Gasquez
- Organize a regular chat. Use the chat to drive a write-up
- Organize a short free course (and use the course to drive)
Stuff to post from DataHub v2 days …
- Old deck
- Old SCQA
- Various notes about Data Experience vs Developer Experience