Extracting and storing metadata for faster loads, richer metadata, non frictionless datasets etc.
Summary
- Situation: we do all computation to render a project root page in frontend code
- Problem: this is slow and has a bad smell (in the sense that this work should be done once, in the backend)
- Solution: part one (for now) is to refactor the logic to compute all info for dataset projects on the "way in" (i.e. when we sync) and save it in a single file in the backend (R2 or DB). Once we have done this we will reflect on next steps.
- Appetite: 5d (probably will need less)
Situation
- Rendering the main page of a project may involve up to 8 reads from R2
- No project title in the database
- The title of a project home page is computed in code at runtime with complex logic
- The site is quite slow (e.g. 3s+ to render a page)
- Working out what "type" of page we are rendering happens at runtime on every page load
- We don't have information like file sizes (available to us from the GitHub API) rendered in the frontend unless it has been copied manually into datapackage.json
(Future) Needs - some examples
- README + csv works: able to render less structured projects, e.g. just a README.md and data.csv. TODO: see previous pitch
- Richer summary info: provide summary information about a dataset, e.g. counts of data files, total size, data information (e.g. number of rows etc)
- Show backlinks: for more PKM-style setups we want to render a network map
- Given an SQLite file, show me its tables and the information in them
- Same for Excel
- Generate SEO previews:
- …
Problem
Poor performance and/or complex business logic in presentation layer
- Slow site b/c rendering can use multiple requests, see appendix (from a minimum of 3 up to 8 (more?) DB/R2 requests).
- The frontend is doing lots of business logic (e.g. computing the title field, consolidating size information), which is error-prone and hard to reason about long-term
- And it may also be slow/expensive: on top of multiple R2/DB requests we're also performing further computations before passing the result to the renderer component (MdxPage) e.g. TODO
- The data layer does not contain relevant information and/or it is not nicely consolidated, e.g. we don't have key metadata computed such as backlinks or size information
There's some overlap here between these points.
Let's take one example: we are doing multiple requests because we don't know if datapackage.json/yml even exists. This is, in some regards, a consolidation issue.
Solution
KISS:
- Let's focus just on the project root for now (we will fix other files later …)
- We compute a single JSON file (we can call it project.json or datapackage.json or whatever) that is the consolidated version of what we want:
{
  title: "My cool dataset",
  description: "My short description for this stuff.",
  _content: ..., // markdown string (even better if links are resolved)
  pagetype: "dataset" | "story", // optional
  // dataset stuff following the Frictionless pattern
  resources: [
    { path: "data.csv", size: ... },
    { path: "data2.csv", size: ... } // size comes from the github file tree etc
  ],
  views: ...,
  licenses: ...,
  sources: ...,
  // FUTURE: stuff we compute
  links: ...,
  backLinks: ...,
  tags: ...,
  tasks: ...,
  // other frontmatter?
  // raw stuff???
  _frontmatter: ... // e.g. original frontmatter
}
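To pin the structure down, here is a rough TypeScript shape for this file. It simply mirrors the example above; field names and optionality are assumptions, nothing is final:

```ts
// Hypothetical sketch of the consolidated project metadata shape.
interface ConsolidatedProjectMetadata {
  title: string;
  description?: string;
  _content?: string; // markdown string, ideally with links resolved
  pagetype?: "dataset" | "story";
  // Frictionless-style dataset fields
  resources?: { path: string; size?: number }[];
  views?: unknown[];
  licenses?: unknown[];
  sources?: unknown[];
  // FUTURE: computed fields
  links?: string[];
  backLinks?: string[];
  tags?: string[];
  _frontmatter?: Record<string, unknown>; // original frontmatter, kept raw
}
```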
Algorithm: This info is computed as follows
- Look for datapackage.json, datapackage.yml, README.md
- Extract title, description etc. from the README if needed (use markdowndb code, I think)
- Consolidate frontmatter
- Consolidate file sizes from the FileTree info
- FUTURE: compute more stuff about each in the resources and add that info if needed …
This algorithm runs once at the end of a sync.
? Do we update anything in the DB, e.g. the project title and description?? Maybe …
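A minimal sketch of this consolidation step, assuming hypothetical helpers (fetchFile, parseFrontmatter, extractTitle) and a GH-tree-like input as described in the appendix:

```ts
// Hypothetical sketch of the end-of-sync consolidation; helper signatures are assumptions.
import yaml from "js-yaml";

declare function fetchFile(path: string): Promise<string>;
declare function parseFrontmatter(md: string): Record<string, any>;
declare function extractTitle(md: string): string | undefined;

async function consolidateProjectMetadata(tree: { path: string; size?: number }[]) {
  // 1. Look for datapackage.json/yml and README.md in the tree
  const pkgEntry = tree.find((f) =>
    ["datapackage.json", "datapackage.yaml", "datapackage.yml"].includes(f.path)
  );
  const readmeEntry = tree.find((f) => f.path === "README.md");

  // 2. Parse them (yaml.load also handles JSON, since JSON is valid YAML)
  const pkg: any = pkgEntry ? yaml.load(await fetchFile(pkgEntry.path)) : {};
  const readme = readmeEntry ? await fetchFile(readmeEntry.path) : "";
  const frontmatter = parseFrontmatter(readme);

  // 3. Consolidate file sizes from the FileTree info into resources
  const resources = (pkg.resources ?? []).map((r: { path: string }) => ({
    ...r,
    size: tree.find((f) => f.path === r.path)?.size,
  }));

  // 4. Consolidate title/description with the same precedence we use at render time
  return {
    title: pkg.title ?? frontmatter.title ?? extractTitle(readme) ?? "",
    description: pkg.description ?? frontmatter.description,
    _content: readme,
    resources,
    _frontmatter: frontmatter,
  };
}
```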
Metadata storage options:
- Store computed metadata for the whole project in our current Postgres DB, in JSON, with structure similar to GH tree but extended with more metadata for each file in the tree.
- SQLite file per project with e.g. an mddb index, stored in R2.
- Put all file metadata in a new table in our current Postgres DB.
- …
Appendix: Solution options
Have factored tables etc
Tables
- Project/Site table
- File/Blob table (very simple), e.g. what you get from github:
  - project_id
  - path
- MarkdownFile:
  - frontmatter
  - links
  - tags
Option 1
Easiest to implement, could be the first iteration of the solution.
Let's go with this one as it's already solving the major part of the problem, i.e. separation of metadata+content preparation from rendering.
We'd add a tree (or files, or similar) column (type JsonB for more efficient queries) to the Site (Project) table in our current Postgres DB, like this:
model Site {
  id            String    @id @default(cuid())
  gh_scope      String
  gh_repository String
  gh_branch     String
  subdomain     String?   @unique
  customDomain  String?   @unique
  projectName   String
  createdAt     DateTime  @default(now())
  updatedAt     DateTime  @updatedAt
  user          User?     @relation(fields: [userId], references: [id], onDelete: Cascade, onUpdate: Cascade)
  userId        String?
  synced        Boolean   @default(false)
  syncedAt      DateTime?
  // NEW: consolidated/extended GH tree
  tree          Json?     @db.JsonB

  @@unique([userId, projectName])
  @@index([userId])
}
With the tree schema being basically a modified/extended GH tree, e.g. like this one:
Tree {
"sha": "9fb037999f264ba9a7fc6274d15fa3ae2ab98312", // sha from original gh repo tree
"tree": [
{
"path": "README.md",
"size": 30,
"sha": "44b4fc6d56897b048c772eb4087f854f46256132",
"url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132",
"metadata": {
"title": "Title, set in frontmatter, datapackage file or computed.",
"description": "Description, set in frontmatter, datapackage file or computed.",
"tags": ["a"],
"datapackage": {
...datapackage.json
// or ...datapackage.yaml/yml parsed to json
},
"type": "datapackage" // OR "story", "other" etc.,
...other frontmatter fields
}
},
]
}
(Maybe also add a top-level permalinks field with a ready-to-use list of permalinks for the parsing stage.)
All this data will be saved to the DB / updated when we create or sync a site.
Later we can improve on this and think about a better storage solution.
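For illustration, a sketch of the write path at the end of a sync, assuming the Prisma client generated from the schema above and a hypothetical buildTree helper:

```ts
// Hypothetical sketch: persist the consolidated tree at the end of a sync.
// buildTree is an assumed helper that merges the GH tree with extracted metadata.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

declare function buildTree(siteId: string): Promise<object>;

async function onSyncComplete(siteId: string) {
  const tree = await buildTree(siteId);
  await prisma.site.update({
    where: { id: siteId },
    data: { tree, synced: true, syncedAt: new Date() },
  });
}
```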
Questions:
- do we want to already include metadata extracted at parse time, e.g. frontmatter? Or only combine the GH repo tree with datapackage files?
Option 2
??? Not sure about this
General idea: replace _tree with a per-project index in an SQLite file, e.g. generated with mddb on site creation and updated on each sync.
Let's consider this example GitHub repository:
/data
some-data-1.csv
some-data-2.csv
/stories
story-1.md
story-2.md
README.md
datapackage.yml
It could have the standard mddb tables, i.e. files, tags, links, with the File schema being:
interface File {
_id: string;
file_path: string; // or this could be bucket object id instead, but basically some pointer to a bucket object
extension: string;
url_path: string; // used as a key by which the app finds a corresponding file when rendering a given URL
metadata: string; // JSON with ALL metadata related to this file, i.e. combined frontmatter + potential datapackage file
type: "datapackage" | "story" | "blog" | "other"; // (or similar) useful for creating catalogs of datasets or lists of stories/blog posts etc.
}
The Project object in our main Postgres DB would stay largely the same.
…
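To make this concrete, a sketch of looking up a page in such an index once it's available locally, assuming it's queried with better-sqlite3 and the table/column names from the File schema above (mddb's actual API may differ):

```ts
// Hypothetical sketch: resolve a URL path against a per-project SQLite index.
import Database from "better-sqlite3";

function lookupFile(indexPath: string, urlPath: string) {
  const db = new Database(indexPath, { readonly: true });
  // Table and column names mirror the File schema above; they are assumptions.
  const row = db.prepare("SELECT * FROM files WHERE url_path = ?").get(urlPath);
  db.close();
  return row; // undefined if no matching file
}
```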
What about a repo like this one, i.e. a catalog of datasets?
/stories
story-1.md
story-2.md
/datasets
/abc
/data
some-data-1.csv
some-data-2.csv
README.md
datapackage.yml
/xyz
/data
some-data-3.csv
some-data-4.csv
README.md
datapackage.json
README.md
Rabbit holes
Are there any issues related to loading the SQLite file for a given project from R2?
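One way to probe this: R2 is S3-compatible, so the index could be fetched to a temp file before being opened. A sketch under that assumption (bucket and key naming are hypothetical):

```ts
// Hypothetical sketch: download a project's SQLite index from R2 to a temp file.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

const r2 = new S3Client({
  region: "auto",
  endpoint: process.env.R2_ENDPOINT, // e.g. https://<account-id>.r2.cloudflarestorage.com
});

async function fetchProjectIndex(siteId: string): Promise<string> {
  // Bucket and key layout are assumptions, not our real naming scheme.
  const res = await r2.send(
    new GetObjectCommand({ Bucket: "content", Key: `${siteId}/main/_index.db` })
  );
  const local = join(tmpdir(), `${siteId}-index.db`);
  await writeFile(local, await res.Body!.transformToByteArray());
  return local; // ready to open with better-sqlite3
}
```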
No-goes
Appendices
Appendix: Current Architecture
DataHub Cloud architecture currently consists of the following main components:
- Next.js App
- GitHub as an origin content source
- R2 bucket as our main content store
- Postgres DB on Vercel for user accounts and site storage.
This is how the site creation flow currently looks:
This is how the site rendering flow currently looks:
Rendering uses Multiple DB/R2 requests
The current site rendering flow (the server-side part, i.e. excluding e.g. fetching data for charts) can use from a minimum of 3 up to 8 (more?) DB/R2 requests.
Best case scenario (3 requests):
- sb accesses datahub.io/@olayway/abc-project/stories/story-1 and there is a stories/story-1.md file in the content store
- (1) get site ID and branch from the DB
- (2) try fetching /siteid123456/main/raw/stories/story-1(md) –> found!
- (3) fetch /siteid123456/main/_tree(json) –> found! (needed at the md parsing stage to correctly resolve wiki-links)
Worst (?) case scenario (8 requests):
- sb accesses datahub.io/@olayway/abc-project/stories and there is a stories/index.md file in the content store (not stories.md) and a same-level datapackage.yml
- (1) get site ID and branch from the DB
- (2) try fetching /siteid123456/main/raw/stories(md) –> NOT found!
- (3) try fetching /siteid123456/main/raw/stories/README(md) –> NOT found!
- (4) try fetching /siteid123456/main/raw/stories/index(md) –> found!
- …and since we're fetching index or README, we're also trying to fetch a same-level datapackage(json/yaml/yml)…
- (5) try fetching /siteid123456/main/raw/datapackage.json –> NOT found!
- (6) try fetching /siteid123456/main/raw/datapackage.yaml –> NOT found!
- (7) try fetching /siteid123456/main/raw/datapackage.yml –> found!
- (8) fetch /siteid123456/main/_tree(json) –> found! (needed at the md parsing stage to correctly resolve wiki-links)
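A condensed sketch of the fallback logic behind this fan-out (the candidate ordering and the tryFetch helper are assumptions, not our exact code):

```ts
// Hypothetical sketch of the render-time fallback chain described above.
// tryFetch is an assumed helper returning the file body or null on a miss.
declare function tryFetch(key: string): Promise<string | null>;

async function resolvePage(siteId: string, branch: string, slug: string) {
  const base = `/${siteId}/${branch}/raw`;
  const candidates = [
    `${base}/${slug}.md`,
    `${base}/${slug}/README.md`,
    `${base}/${slug}/index.md`,
  ];
  let page: string | null = null;
  for (const key of candidates) {
    page = await tryFetch(key); // one R2 request per miss
    if (page) break;
  }
  // index/README pages also look for a sibling datapackage file
  let pkg: string | null = null;
  for (const ext of ["json", "yaml", "yml"]) {
    pkg = await tryFetch(`${base}/datapackage.${ext}`);
    if (pkg) break;
  }
  const tree = await tryFetch(`/${siteId}/${branch}/_tree.json`); // for wiki-links
  return { page, pkg, tree };
}
```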
Appendix: GitHub Repo fetch response schema
https://docs.github.com/en/rest/git/trees?apiVersion=2022-11-28#get-a-tree
{
"title": "Git Tree",
"description": "The hierarchy between files in a Git repository.",
"type": "object",
"properties": {
"sha": {
"type": "string"
},
"url": {
"type": "string",
"format": "uri"
},
"truncated": {
"type": "boolean"
},
"tree": {
"description": "Objects specifying a tree structure",
"type": "array",
"items": {
"type": "object",
"properties": {
"path": {
"type": "string",
"examples": [
"test/file.rb"
]
},
"mode": {
"type": "string",
"examples": [
"040000"
]
},
"type": {
"type": "string",
"examples": [
"tree", "blob"
]
},
"sha": {
"type": "string",
"examples": [
"23f6827669e43831def8a7ad935069c8bd418261"
]
},
"size": {
"type": "integer",
"examples": [
12
]
},
"url": {
"type": "string",
"examples": [
"https://api.github.com/repos/owner-482f3203ecf01f67e9deb18e/BBB_Private_Repo/git/blobs/23f6827669e43831def8a7ad935069c8bd418261"
]
}
}
},
"examples": [
{
"path": "file.rb",
"mode": "100644",
"type": "blob",
"size": 30,
"sha": "44b4fc6d56897b048c772eb4087f854f46256132",
"url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132",
"properties": {
"path": {
"type": "string"
},
"mode": {
"type": "string"
},
"type": {
"type": "string"
},
"size": {
"type": "integer"
},
"sha": {
"type": "string"
},
"url": {
"type": "string"
}
},
"required": [
"path",
"mode",
"type",
"sha",
"url",
"size"
]
}
]
}
},
"required": [
"sha",
"url",
"tree",
"truncated"
]
}
Notes
This is how the site update flow currently looks:
Appendix: Domain model
- Project hasMany Files
- RawBlob: this is what we get from github pretty much
- ComputedBlobInfo: this is info we compute re each blob e.g. links, frontmatter extraction etc
- ConsolidatedBlobInfo: this consolidates all of that info into the form consumable by the frontend for a given page (e.g. ComputedBlobInfo from README.md plus datapackage.yml plus some extra info about the RawBlob, e.g. file size)
RawBlob
What is a consolidated TreeItem/File?
class TreeItem {
"path": "file.rb",
"mode": "100644",
"type": "blob",
"size": 30,
"sha": "44b4fc6d56897b048c772eb4087f854f46256132",
"url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132",
"project_id": link to the parent project,
"metadata": {
...
}
}
From github tree API
Note: The limit for the tree array is 100,000 entries with a maximum size of 7 MB when using the recursive parameter. (https://docs.github.com/en/rest/git/trees?apiVersion=2022-11-28#get-a-tree)
class TreeItem {
"path": "file.rb",
"mode": "100644",
"type": "blob",
"size": 30,
"sha": "44b4fc6d56897b048c772eb4087f854f46256132",
"url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132"
}
What is the item object in markdowndb?
interface File {
_id: string;
file_path: string; // or this could be bucket object id instead, but basically some pointer to a file
extension: string;
url_path: string; // used as a key by which the app finds a corresponding file when rendering a given URL
metadata: string; // JSON with ALL metadata related to this file, i.e. combined frontmatter + potential datapackage file
type: string; // could be enum e.g. "datapackage" | "story" | "blog" | "other" // (or similar) useful for creating catalogs of datasets or list of stories/blog posts etc.
}
Questions
- Where do tags, links, tasks etc. go? Separate tables: links, tags, file_tags
Appendix: Use Cases of computing title etc
When rendering a dataset page … what do I need?
What is the computed object I actually want to work with in the frontend …
class ConsolidatedForDatasetCase {
title: ...
description:
readme:
datapackage:
licenses:
sources:
resources:
views:
// array of all site permalinks
// so that i can resolve links
fileTreeNames: string[] // an entry looks like ["@olayway/abc/story1", "https://r2-sth.path-to-image.png", ...]
// maybe?
owner:
}
What do I want to extract and store once, up front …
- title
- description
- files
What are the specific use cases we can imagine and have right now?
- having a title - for a project at least
- having a description - for home page at least
- show a list of images for the project
- ❓ showing the list of files by grouping e.g. images, data files, markdown files, other (e.g. pdf)
Showing the title
- For a Project => info from README / datapackage.yml for that project
- For root index of a Project => as previous
- For an index page in general => README/index in that folder and/or datapackage stuff in that folder
- For a given file in a Project (that is not a README/index page) => that file and its frontmatter
Situation for an index file:
- README.md/index.md with the title being the first heading
- README.md with frontmatter with a title attribute
- README.md with frontmatter with a datapackage attribute containing a title attribute
- README.md with a datapackage.yml with a title attribute
In code:
const title =
datapackage?.title ||
frontMatter.datapackage?.title ||
frontMatter.title ||
extractTitle(source) ||
"";
Aside: ❓ Why do we want to compute this on the way in rather than at render time? Because we want this title when we e.g. show project/dataset listings, or when we do search. Plus it's cleaner to compute once in one place (easier to reason about etc). Plus currently we need to do this computation twice: 1) for generating page SEO metadata in the generateMetadata function and 2) in the main slug Page code (we could probably find a workaround for that though).
Algorithm
for (const filePath of fileTree) {
  const { title, description } = computeInfo(filePath);
  // ...store per-file info
}
Rendering a dataset page
Comment: a dataset page requires the following info
- Project
- README File
- datapackage
Motivating goal: I can give you the following:
README.md
data.csv
data2.csv
Where README.md is:
# My cool dataset
My short description for this stuff.
Then lots more of the README.
And I end up rendering the equivalent of a datapackage.yml (extended version):
{
  title: "My cool dataset",
  description: "My short description for this stuff.",
  readme: ...,
  resources: [
    { path: "data.csv", size: ... },
    { path: "data2.csv", size: ... } // size comes from the github file tree etc
  ]
}
🔮 INSIGHT
- We want a ComputedMetadata table
{
  project_id: ..., // id of the project
  path: "", // the rendered url path; empty / default "" as we only compute for the root url of a project for now
  title: ...,
  description: ...,
  readme: ..., // it's not really needed, is it?
  metadata: ...
}
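If we had such a table, rendering a project root would become a single read. A hypothetical Prisma sketch (the computedMetadata model and its compound unique key are assumptions):

```ts
// Hypothetical sketch: one DB read to render a project root page.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function getRootPageData(projectId: string) {
  // Assumes @@unique([project_id, path]) on the hypothetical ComputedMetadata model.
  return prisma.computedMetadata.findUnique({
    where: { project_id_path: { project_id: projectId, path: "" } },
  });
}
```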
Rufus NTS
- 3 parts here:
  - What we store
  - How we compute that
  - How we retrieve it
- Let's start from what makes the retrieval experience and rendering great …
- And then if that is too painful to do … we can loop back
- Rendering a page, especially an index page for a project (or an index page for a sub-part of the tree)