Design and Architecture
Design of DataHub Next. Design here means analysis and architecture (not visual design).
Concepts
- Showcase
- Data rich documents (aka: data literate documents)
- Data Project
- Data API
Components
MVP components sketch v2023-03-09
David + Rufus (more expansive)
This is a different take with a focus on naming domains of the application.
Domain Model
See ../docs/dms/dms#Domain Model which has a fairly complete DMS-oriented domain model (structured in GraphQL format).
We can probably start with something simpler, e.g. just two items:
- Project: a data project, corresponding to /@{owner}/{name}
- Account, corresponding to /@{owner}
Over time we can imagine a project having things like:
- apis
- readme
- files
- datasets
- issues
Simple TypeScript schema
interface Project {
  id: string;
  name: string;
  title: string;
  owner: Account;
  github_repo: string; // URL of the source repository
  readme: string;
  datapackage: object;
}
// TODO: copy datahub v2?
interface Account {
  id: string;
  name: string; // username
  fullname: string;
}
Typesense version:
let project = {
  'name': 'projects',
  'enable_nested_fields': true, // needed for the 'object' type on datapackage
  'fields': [
    { 'name': 'id', 'type': 'string' },
    { 'name': 'name', 'type': 'string' },
    { 'name': 'title', 'type': 'string' },
    { 'name': 'owner_id', 'type': 'string', 'facet': true },
    { 'name': 'owner_name', 'type': 'string', 'facet': true },
    { 'name': 'github_repo', 'type': 'string' },
    { 'name': 'readme', 'type': 'string' },
    { 'name': 'datapackage', 'type': 'object' }
  ]
  // NB: Typesense's default_sorting_field must be a numeric (int32/float) field,
  // so sorting by title would need a numeric proxy or a sort_by clause at query time.
}
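For reference, a minimal usage sketch with the official typesense JS client; the node details, API key and example document values are assumptions:

import Typesense from 'typesense';

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }], // assumed local dev node
  apiKey: 'xyz', // assumed dev API key
});

async function main() {
  // Create the collection from the `project` schema object defined above
  await client.collections().create(project);

  // Index one (made-up) project document
  await client.collections('projects').documents().upsert({
    id: 'datahub/awesome-data',
    name: 'awesome-data',
    title: 'Awesome Data',
    owner_id: 'acc-1',
    owner_name: 'datahub',
    github_repo: 'https://github.com/datahub/awesome-data',
    readme: '# Awesome Data ...',
    datapackage: { name: 'awesome-data', resources: [] },
  });

  // Search the catalog
  const results = await client
    .collections('projects')
    .documents()
    .search({ q: 'awesome', query_by: 'title,readme' });
  console.log(results.found);
}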
Notes (to process)
What is the aggregation model? 2023-02-22
- projects: which aggregate the blobs below (equivalent to a repo or project on GitLab); see the type sketch after these lists
- blobs
- documents: metadata + text stream
- assets: filepointer + metadata
- [groups of files e.g. a dataset]
Hub
- orgs: which aggregate projects
Presentation/processing of these
- workflows
- views
- apis
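A rough TypeScript sketch of this aggregation model, just to make the shapes concrete; the names and fields are assumptions, not a settled schema (ProjectAggregate is named to avoid clashing with the simpler Project interface above):

// Note: Document and Blob shadow DOM lib types here; fine for a sketch.
interface Document { path: string; metadata: Record<string, unknown>; text: string }      // metadata + text stream
interface Asset { path: string; metadata: Record<string, unknown>; filePointer: string }  // file pointer + metadata
type Blob = Document | Asset;

interface Dataset { name: string; blobs: Blob[] }  // a group of files

interface ProjectAggregate {  // equivalent to a repo/project on GitLab
  name: string;
  blobs: Blob[];
  datasets: Dataset[];
  // presentation / processing of the above
  workflows: object[];
  views: object[];
  apis: object[];
}

interface Org { name: string; projects: ProjectAggregate[] }  // the Hub aggregates projects via orgs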
Question tree
- What is our priority use case?
- Why do we need a datapackage.json in github repos? → 2023-02-19 - see answer below
- How do we do stuff in response to updates in source repo? → 2023-02-19 - see answer below
- How to do workflows? 🚧 2023-02-20 - see below
Why do we need a datapackage.json in github repos?
You need something like a datapackage.json when you need the following (an illustrative example is given below):
- descriptions of data files (b/c they are not self-describing)
- title, nicer name, description
- schema
- views
- api specs
Things that datapackage.json provides that we probably don't want / need:
- general metadata like title and description, because we can get that from the GitHub API (assuming people have described their repo) or even use the README and its title and description.
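For illustration, a minimal datapackage.json covering the items in the first list might look like the object below; the dataset and fields are illustrative, and the views entry follows the old datahub.io views spec, so treat it as a sketch rather than a required format:

// Illustrative contents of a datapackage.json for a hypothetical CO2 dataset
const datapackage = {
  name: 'co2-ppm',
  title: 'CO2 PPM - Trends in Atmospheric Carbon Dioxide',
  resources: [
    {
      name: 'co2-mm-mlo',
      path: 'data/co2-mm-mlo.csv',
      description: 'Monthly mean CO2 measured at Mauna Loa', // data files are not self-describing
      schema: {
        fields: [
          { name: 'Date', type: 'date' },
          { name: 'Average', type: 'number' }
        ]
      }
    }
  ],
  views: [
    {
      title: 'CO2 over time',
      resources: ['co2-mm-mlo'],
      specType: 'simple',
      spec: { type: 'line', group: 'Date', series: ['Average'] }
    }
  ]
};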
How do we do stuff in response to updates in source repo?
When a user updates material in their GitHub repo, they want the associated data project on DataHub to update (e.g. showcase page updates, data APIs refresh, etc.) so that the new material is available to others (and to themselves).
Assumptions
- user has pushed to a github repo they control
There are two parts to this:
- Getting notified and triggering some action → (easy) use a webhook, i.e. install a trigger on the GitHub repo that notifies DataHub on commit; see the sketch after this list. (NB: for simple cases where processing can be done on the fly you could use a hacky pull model and simply re-fetch content on each request, but this is not recommended.)
- Taking that action, which itself involves two parts:
  - Accessing content in their repo, e.g. to import it, run processing on it, or render it → as part of the project creation flow the user grants DataHub access to their repo (cf. Vercel, Netlify, Cloudflare etc). For other things like storage, imagine a similar setup.
  - Running the action itself (assuming content is accessible) → assume the work required is "substantive", e.g. importing content, running an API build or similar. In that case this is an action/workflow. See the separate answer on this below.
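A minimal sketch of the notification part, assuming a Node/TypeScript endpoint; triggerProjectBuild() is a hypothetical helper and GITHUB_WEBHOOK_SECRET is an assumed environment variable. The signature check uses GitHub's x-hub-signature-256 header:

import { createHmac, timingSafeEqual } from 'node:crypto';

const WEBHOOK_SECRET = process.env.GITHUB_WEBHOOK_SECRET ?? '';

// Hypothetical helper that kicks off the "substantive" work (import, API build, showcase rebuild...)
async function triggerProjectBuild(repoFullName: string, commitSha: string): Promise<void> {
  // enqueue a workflow run, call a build service, etc.
}

// Verify GitHub's x-hub-signature-256 header against the raw request body
function isValidSignature(rawBody: string, signatureHeader: string): boolean {
  const expected = 'sha256=' + createHmac('sha256', WEBHOOK_SECRET).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  return a.length === b.length && timingSafeEqual(a, b);
}

// Handle a GitHub "push" event and trigger a rebuild of the associated project
export async function handleGithubWebhook(rawBody: string, headers: Record<string, string>) {
  if (!isValidSignature(rawBody, headers['x-hub-signature-256'] ?? '')) {
    return { status: 401 };
  }
  if (headers['x-github-event'] !== 'push') {
    return { status: 204 }; // ignore other events for now
  }
  const payload = JSON.parse(rawBody);
  await triggerProjectBuild(payload.repository.full_name, payload.after);
  return { status: 202 }; // accepted - the heavy work continues out of band
}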
Brainstorm re Simplest possible approach (NB: won't work)
- run live from the server, i.e. use GitHub as our "content" API → this will run into GitHub API rate limits, plus it is slow. Also, how do we do things like keep a database of projects for the catalog on DataHub itself?
- access GitHub directly (either because the repo is public or with a token); sketched below
- no Data API, i.e. rendering is "data literate" style → problems with larger data, more complexity in the frontend, etc.
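For context, a rough sketch of the "direct access" option; owner/repo/path are placeholders, and without a token GitHub allows only around 60 requests per hour, which is where the rate-limit concern above comes from:

// Fetch a file's raw contents straight from the GitHub API (the "GitHub as content API" option)
async function fetchRepoFile(owner: string, repo: string, path: string, token?: string): Promise<string> {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/contents/${path}`, {
    headers: {
      Accept: 'application/vnd.github.v3.raw', // return the raw file content
      ...(token ? { Authorization: `Bearer ${token}` } : {}), // unauthenticated: ~60 requests/hour
    },
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.text();
}

// e.g. fetchRepoFile('some-owner', 'some-repo', 'datapackage.json') - hypothetical repo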
How to do workflows?
By workflows we mean processing that needs to happen "out of band".
- could we just use workers, e.g. Cloudflare Workers, for this?
- what are Cloudflare Workers?
- how would we monitor tasks and executions? maybe we just wing it to start with, or build our own?
- we don't really have DAGs at the start - just one task, one action (maybe something on data validation)
- would there be an issue with timeouts?
- how long can Workers run for?
- can you use Workers in Cloudflare Pages? → 2023-02-23 yes, they are called Functions: https://developers.cloudflare.com/pages/platform/functions/ (see the sketch after this list)
- Another option could be Next.js ISR with on-demand revalidation?
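To make the Cloudflare Pages option concrete, a sketch of a Pages Function (a file under functions/) that receives the GitHub webhook and hands longer work off via waitUntil; processProjectUpdate() is a hypothetical helper, and the file path is an assumption:

// functions/api/github-webhook.ts - Cloudflare Pages Function

// Hypothetical helper doing the actual work (import content, validate data, rebuild showcase...)
async function processProjectUpdate(payload: unknown): Promise<void> {
  // ...
}

// Pages Functions export onRequest* handlers; the context provides the request and waitUntil
export async function onRequestPost(context: {
  request: Request;
  waitUntil: (promise: Promise<unknown>) => void;
}) {
  const payload = await context.request.json();

  // Respond immediately and let processing continue after the response is sent.
  // NB: Workers/Functions have CPU-time limits, so heavy imports or DAG-style
  // workflows would still need a separate queue/runner.
  context.waitUntil(processProjectUpdate(payload));

  return new Response('accepted', { status: 202 });
}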
Notes
- "Is there a fully-OSS control plane for defining, scheduling, and managing serverless (container img) invocations and corresponding dag? With an Airflow-ish UI but using lambda/ cloud functions instead of worker pool? Or anything remotely close" https://twitter.com/aerialfly/status/1599443538990682113