Extracting and storing metadata for faster loads, richer metadata, non-frictionless datasets, etc.

Summary

  • Situation: we do all the computation needed to render a project root page in frontend code
  • Problem: this is slow and smells bad (in the sense that this work should be done once, in the backend)
  • Solution: part one (for now) is to refactor the logic so that all info for dataset projects is computed on the "way in" (i.e. when we sync) and saved in a single file in the backend (R2 or DB). Once we have done this we will reflect on next steps.
  • Appetite: 5d (probably will need less)

Situation

  • Rendering the main page of a project may involve up to 8 reads from R2
  • No project title in the database
  • The title of a project home page is computed in code at runtime with complex logic
  • The site is quite slow (e.g. 3s+ to render a page)
  • Working out what "type" of page we are rendering happens at runtime on every request
  • Information like file sizes (available to us from the GitHub API) is not actually rendered in the frontend unless it has been copied manually into datapackage.json

(Future) Needs - some examples

  • README + CSV works: able to render from less structured projects, e.g. just README.md and data.csv. TODO: see previous pitch
  • Richer summary info: provide summary information about a dataset, e.g. counts of data files, total size, data information (e.g. number of rows etc.)
  • Show backlinks: for more PKM-style setups we want to render a network map
  • Given an SQLite file, show me its tables and the information in it
  • Same for Excel
  • Generate SEO previews

Problem

Poor performance and/or complex business logic in the presentation layer

  • Slow site, because rendering can use multiple requests: from a minimum of 3 up to 8 (or more?) DB/R2 requests (see appendix).
  • The frontend is doing lots of business logic (e.g. computing the title field, consolidating size information), which is error-prone and hard to reason about long-term
    • And may also be slow/expensive: on top of multiple R2/DB requests we're also performing the following computations before passing the result to the renderer component (MdxPage) e.g. TODO
  • The data layer does not contain the relevant information and/or it is not nicely consolidated, e.g. key metadata such as backlinks or size information is not computed

There's some overlap between these points.

Let's take one example: we are doing multiple requests because we don't know whether datapackage.json/yml even exists. This is, in some regards, a consolidation issue.

Solution

KISS:

  • Let's focus just on the project root for now (we will fix other files later …)
  • We compute a single JSON file (call it project.json or datapackage.json or whatever) that is the consolidated version of what we want:
{
  title: My cool dataset
  description: My short description for this stuff.
  _content: // markdown string (even better if links are resolved)
  pagetype: `dataset | story` // optional
  
  // dataset stuff following Frictionless pattern
  resources:
    -
      path: data.csv
      size: 
    -
      path: data2.csv
      size: // this comes from github file tree etc
  views:
    ...
  licenses:
  sources:
  
  // FUTURE stuff we compute
  links:
  backLinks:
  tags:
  tasks: ...
  
  // other frontmatter?
    
  // raw stuff???
  _frontmatter // e.g. original frontmatter
}

Algorithm: this info is computed as follows:

  • Look for datapackage.json, datapackage.yml, README.md
  • Extract title, description etc. from the README if needed (use the markdowndb code, I think)
  • Consolidate frontmatter
  • Consolidate file sizes from the FileTree info
  • FUTURE: compute more stuff about each entry in resources and add that info if needed …

This algorithm runs once at the end of a sync.
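
A minimal sketch of this consolidation step, assuming hypothetical helpers (fetchFileTree, readDatapackageIfAny, readFileIfAny, extractFirstHeading) and gray-matter for frontmatter parsing; none of this is existing code:

// Hypothetical end-of-sync consolidation step. All helper functions
// below are assumed stubs, not existing code.
import matter from "gray-matter";

declare function fetchFileTree(
  siteId: string
): Promise<{ path: string; size?: number }[]>;
declare function readDatapackageIfAny(
  siteId: string
): Promise<Record<string, any> | undefined>; // tries .json, .yml, .yaml
declare function readFileIfAny(
  siteId: string,
  path: string
): Promise<string | undefined>;
declare function extractFirstHeading(md: string): string | undefined;

interface ProjectMeta {
  title: string;
  description?: string;
  _content: string; // raw README markdown
  resources: { path: string; size?: number }[];
}

export async function consolidateProjectMeta(siteId: string): Promise<ProjectMeta> {
  const tree = await fetchFileTree(siteId);
  const datapackage = await readDatapackageIfAny(siteId);
  const readme = await readFileIfAny(siteId, "README.md");

  // Consolidate frontmatter + README body
  const { data: frontmatter, content } = matter(readme ?? "");

  // Same precedence as the current runtime title logic
  const title =
    datapackage?.title ||
    frontmatter.datapackage?.title ||
    frontmatter.title ||
    extractFirstHeading(content) ||
    "";

  // Consolidate file sizes from the FileTree info
  const resources = (datapackage?.resources ?? []).map((r: any) => ({
    ...r,
    size: r.size ?? tree.find((f) => f.path === r.path)?.size,
  }));

  return {
    title,
    description: datapackage?.description || frontmatter.description,
    _content: content,
    resources,
  };
}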

? Do we update anything in the DB, e.g. update the project title and description?? Maybe …


Metadata storage options:

  1. Store computed metadata for the whole project in our current Postgres DB, as JSON, with a structure similar to the GH tree but extended with more metadata for each file in the tree.
  2. A SQLite file per project with e.g. an mddb index, stored in R2.
  3. Put all file metadata in a new table in our current Postgres DB.

Appendix: Solution options

Have factored tables etc

Tables

Project/Site table

File/Blob table (very simple), e.g. what you get from GitHub
  project_id
  path
  
MarkdownFile
  frontmatter
  links
  tags

Option 1

Easiest to implement; could be the first iteration of the solution.

Let's go with this one, as it already solves the major part of the problem, i.e. separating metadata+content preparation from rendering.

We'd add a tree (or files, or similar) column (type JSONB, for more efficient queries) to the Site (Project) table in our current Postgres DB, like this:

model Site {
  id            String    @id @default(cuid())
  gh_scope      String
  gh_repository String
  gh_branch     String
  subdomain     String?   @unique
  customDomain  String?   @unique
  projectName   String
  createdAt     DateTime  @default(now())
  updatedAt     DateTime  @updatedAt
  user          User?     @relation(fields: [userId], references: [id], onDelete: Cascade, onUpdate: Cascade)
  userId        String?
  synced        Boolean   @default(false)
  syncedAt      DateTime?

  >>>>>>>>>>>>>>>>>>>>>>>>>>>>
  tree          Json?     @db.JsonB
  <<<<<<<<<<<<<<<<<<<<<<<<<<<<

  @@unique([userId, projectName])
  @@index([userId])
}

With the tree schema being basically a modified/extended GH tree, e.g. like this one:

Tree {
  "sha": "9fb037999f264ba9a7fc6274d15fa3ae2ab98312", // sha from original gh repo tree
  "tree": [
    {
      "path": "README.md",
      "size": 30,
      "sha": "44b4fc6d56897b048c772eb4087f854f46256132",
      "url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132",
      "metadata": {
        "title": "Title, set in frontmatter, datapackage file or computed.",
        "description": "Description, set in frontmatter, datapackage file or computed.",
        "tags": ["a"],
        "datapackage": {
          ...datapackage.json
          // or ...datapackage.yaml/yml parsed to json
        },
        "type": "datapackage", // OR "story", "other" etc.
        ...other frontmatter fields
      }
    },
  ]
}

(Maybe also add a top-level permalinks field with a ready-to-use list of permalinks for the parsing stage.)

All this data will be saved to the DB / updated when we create or sync a site.

Later we can improve on this and think about a better storage solution.
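
At render time, a single DB read could then replace today's multiple R2 round trips. A minimal sketch, assuming the tree column above and standard Prisma client usage (the TreeItem shape here is hypothetical):

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Hypothetical shape of one entry in the stored tree
interface TreeItem {
  path: string;
  size?: number;
  sha: string;
  metadata?: { title?: string; description?: string; type?: string };
}

async function getPageMeta(siteId: string, path: string) {
  // One DB read replaces several R2 round trips
  const site = await prisma.site.findUnique({
    where: { id: siteId },
    select: { tree: true },
  });
  const stored = site?.tree as unknown as { sha: string; tree: TreeItem[] } | null;
  return stored?.tree.find((item) => item.path === path)?.metadata;
}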

Questions:

  • Do we want to already include metadata extracted at parsing time, e.g. frontmatter? Or only combine the GH repo tree with datapackage files?

Option 2

??? Not sure about this

General idea: replace _tree with a per-project index in an SQLite file, e.g. generated with mddb on site creation and updated on each sync.

Let's consider this example GitHub repository:

/data
	some-data-1.csv
	some-data-2.csv
/stories
	story-1.md
	story-2.md
README.md
datapackage.yml

It could have the standard mddb tables, i.e. files, tags, links, with the File schema being:

interface File {
	_id: string;
	file_path: string; // or this could be bucket object id instead, but basically some pointer to a bucket object
	extension: string;
	url_path: string; // used as a key by which the app finds a corresponding file when rendering a given URL
	metadata: string; // JSON with ALL metadata related to this file, i.e. combined frontmatter + potential datapackage file
	type: "datapackage" | "story" | "blog" | "other" // (or similar) useful for creating catalogs of datasets or list of stories/blog posts etc. 
}

The Project object in our main Postgres DB would stay largely the same.

What about a repo like this one, i.e. a catalog of datasets?

/stories
	story-1.md
	story-2.md
/datasets
	/abc
		/data
			some-data-1.csv
			some-data-2.csv
		README.md
		datapackage.yml
	/xyz
		/data
			some-data-3.csv
			some-data-4.csv
		README.md
		datapackage.json
README.md

Rabbit holes

Are there any issues related to loading the SQLite file for a given project from R2?
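
One way to probe this question: fetch the per-project SQLite file from R2 and open it entirely in memory with sql.js (a pure-wasm build that accepts a byte array). The bucket URL, key layout, and the files table below are assumptions:

import initSqlJs from "sql.js";

async function openProjectIndex(siteId: string) {
  // Hypothetical R2 key layout; <r2-bucket> is a placeholder
  const res = await fetch(`https://<r2-bucket>/${siteId}/main/index.db`);
  const bytes = new Uint8Array(await res.arrayBuffer());

  const SQL = await initSqlJs();
  const db = new SQL.Database(bytes); // the whole DB lives in memory

  // e.g. list all dataset-type files for a catalog page
  const result = db.exec(
    "SELECT file_path, url_path, metadata FROM files WHERE type = 'datapackage'"
  );
  return result[0]?.values ?? [];
}

The obvious concern is that the whole index file has to be downloaded (and held in memory) on every request unless we cache it somewhere.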

No-goes

Appendices

Appendix: Current Architecture

DataHub Cloud architecture currently consists of the following main components:

  • Next.js App
  • GitHub as an origin content source
  • R2 bucket as our main content store
  • Postgres DB on Vercel for user accounts and sites storage.

This is how the site creation flow currently looks:

This is how the site rendering flow currently looks:

Rendering uses multiple DB/R2 requests

The current site rendering flow (the server-side part, i.e. excluding e.g. fetching data for charts) can take from a minimum of 3 up to 8 (or more?) DB/R2 requests.

Best case scenario (3 requests):

  • somebody accesses datahub.io/@olayway/abc-project/stories/story-1 and there is a stories/story-1.md file in the content store
  • (1) get site ID and branch from the DB
  • (2) try fetching /siteid123456/main/raw/stories/story-1(md) –> found!
  • (3) fetch /siteid123456/main/_tree(json) –> found! (needed at md parsing stage to correctly resolve wiki-links)

Worst (?) case scenario (8 requests):

  • somebody accesses datahub.io/@olayway/abc-project/stories and there is a stories/index.md file in the content store (not stories.md) plus a same-level datapackage.yml
  • (1) get site ID and branch from the DB
  • (2) try fetching /siteid123456/main/raw/stories(md) –> NOT found!
  • (3) try fetching /siteid123456/main/raw/stories/README(md) –> NOT found!
  • (4) try fetching /siteid123456/main/raw/stories/index(md) –> found!
  • …and since we're fetching index or README, we also try to fetch a same-level datapackage(json/yaml/yml)
  • (5) try fetching /siteid123456/main/raw/datapackage.json –> NOT found!
  • (6) try fetching /siteid123456/main/raw/datapackage.yaml –> NOT found!
  • (7) try fetching /siteid123456/main/raw/datapackage.yml –> found!
  • (8) fetch /siteid123456/main/_tree(json) –> found! (needed at md parsing stage to correctly resolve wiki-links)
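
A simplified sketch of the fallback logic behind these request counts (the helper name r2Get is hypothetical; each failed candidate costs one round trip):

// r2Get is a hypothetical helper standing for one R2 GET request
declare function r2Get(key: string): Promise<string | null>;

async function resolvePage(prefix: string, slug: string) {
  // (2)-(4): try slug.md, then slug/README.md, then slug/index.md
  let page: string | null = null;
  for (const key of [
    `${prefix}/raw/${slug}.md`,
    `${prefix}/raw/${slug}/README.md`,
    `${prefix}/raw/${slug}/index.md`,
  ]) {
    page = await r2Get(key);
    if (page) break;
  }

  // (5)-(7): index/README pages also look for a same-level datapackage file
  let datapackage: string | null = null;
  for (const ext of ["json", "yaml", "yml"]) {
    datapackage = await r2Get(`${prefix}/raw/datapackage.${ext}`);
    if (datapackage) break;
  }

  // (8): the _tree file, needed to resolve wiki-links at parse time
  const tree = await r2Get(`${prefix}/_tree.json`);

  return { page, datapackage, tree };
}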

Appendix: GitHub Repo fetch response schema

https://docs.github.com/en/rest/git/trees?apiVersion=2022-11-28#get-a-tree

{
  "title": "Git Tree",
  "description": "The hierarchy between files in a Git repository.",
  "type": "object",
  "properties": {
    "sha": {
      "type": "string"
    },
    "url": {
      "type": "string",
      "format": "uri"
    },
    "truncated": {
      "type": "boolean"
    },
    "tree": {
      "description": "Objects specifying a tree structure",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "examples": [
              "test/file.rb"
            ]
          },
          "mode": {
            "type": "string",
            "examples": [
              "040000"
            ]
          },
          "type": {
            "type": "string",
            "examples": [
              "tree", "blob"
            ]
          },
          "sha": {
            "type": "string",
            "examples": [
              "23f6827669e43831def8a7ad935069c8bd418261"
            ]
          },
          "size": {
            "type": "integer",
            "examples": [
              12
            ]
          },
          "url": {
            "type": "string",
            "examples": [
              "https://api.github.com/repos/owner-482f3203ecf01f67e9deb18e/BBB_Private_Repo/git/blobs/23f6827669e43831def8a7ad935069c8bd418261"
            ]
          }
        }
      },
      "examples": [
        {
          "path": "file.rb",
          "mode": "100644",
          "type": "blob",
          "size": 30,
          "sha": "44b4fc6d56897b048c772eb4087f854f46256132",
          "url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132",
          "properties": {
            "path": {
              "type": "string"
            },
            "mode": {
              "type": "string"
            },
            "type": {
              "type": "string"
            },
            "size": {
              "type": "integer"
            },
            "sha": {
              "type": "string"
            },
            "url": {
              "type": "string"
            }
          },
          "required": [
            "path",
            "mode",
            "type",
            "sha",
            "url",
            "size"
          ]
        }
      ]
    }
  },
  "required": [
    "sha",
    "url",
    "tree",
    "truncated"
  ]
}

Notes

This is how the site update flow currently looks:

Appendix: Domain model

  • Project hasMany Files
  • RawBlob: this is what we get from github pretty much
  • ComputedBlobInfo: this is info we compute re each blob e.g. links, frontmatter extraction etc
  • ConsolidatedBlobInfo: this consolidates all that info into a form consumable by the frontend for a given page (e.g. ComputedBlobInfo from README.md plus datapackage.yml plus some extra info about the RawBlob, e.g. file size)

RawBlob

What is a consolidated TreeItem/File?

class TreeItem {
  "path": "file.rb",
  "mode": "100644",
  "type": "blob",
  "size": 30,
  "sha": "44b4fc6d56897b048c772eb4087f854f46256132",
  "url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132",
  "project_id": link to the parent project,
  "metadata": {
    ...
  }
}

From github tree API

Note: The limit for the tree array is 100,000 entries with a maximum size of 7 MB when using the recursive parameter. (https://docs.github.com/en/rest/git/trees?apiVersion=2022-11-28#get-a-tree)

class TreeItem {
  "path": "file.rb",
  "mode": "100644",
  "type": "blob",
  "size": 30,
  "sha": "44b4fc6d56897b048c772eb4087f854f46256132",
  "url": "https://api.github.com/repos/octocat/Hello-World/git/blobs/44b4fc6d56897b048c772eb4087f854f46256132"
}

What is the item object in markdowndb?

interface File {
	_id: string;
	file_path: string; // or this could be bucket object id instead, but basically some pointer to a file
	extension: string;
	url_path: string; // used as a key by which the app finds a corresponding file when rendering a given URL
	metadata: string; // JSON with ALL metadata related to this file, i.e. combined frontmatter + potential datapackage file
	type: string; // could be enum e.g. "datapackage" | "story" | "blog" | "other" // (or similar) useful for creating catalogs of datasets or list of stories/blog posts etc. 
}
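
For reference, building such an index with mddb looks roughly like this per its quick start (treat exact option and method names as assumptions to verify against the mddb README):

// Rough shape of mddb's quick-start usage; option and method names
// are assumptions to double-check against the mddb README.
import { MarkdownDB } from "mddb";

const client = new MarkdownDB({
  client: "sqlite3",
  connection: { filename: "markdown.db" }, // the per-project index file
});

const mddb = await client.init();

// Walk the synced repo content and (re)build the files/tags/links tables
await mddb.indexFolder({ folderPath: "." });

// e.g. all markdown files, for a stories listing
const mdFiles = await mddb.getFiles({ extensions: ["md", "mdx"] });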

Questions

  • Where do tags, links, tasks etc. go? Separate tables: links, tags, file_tags

Appendix: Use Cases of computing title etc

When rendering a dataset page … what do I need?

What is the computed object I actually want to work with in the frontend …

class ConsolidatedForDatasetCase {
  title: ...
  description:
  readme: 
  datapackage:
    licenses:
    sources:
    resources:
    views:
  // array of all site permalinks
  // so that i can resolve links
  fileTreeNames: string[] // an entry looks like ["@olayway/abc/story1", "https://r2-sth.path-to-image.png", ...]
  
  // maybe?
  owner:
}

What do I want to extract and store once-off …

  • title
  • description
  • files

What are the specific use cases we can imagine and have right now?

  • having a title - for a project at least
  • having a description - for home page at least
  • show a list of images for the project
  • ❓ showing the list of files by grouping e.g. images, data files, markdown files, other (e.g. pdf)

Showing the title

  • For a Project => info from README / datapackage.yml for that project
  • For root index of a Project => as previous
  • For an index page in general => README/index in that folder and/or datapackage stuff in that folder
  • For a given file in a Project (that is not a README/index page) => that file and its frontmatter

Situation for an index file:

  • README.md/index.md with title being the first heading
  • README.md with frontmatter with title attribute
  • README.md with frontmatter with datapackage attribute with title attribute
  • README.md with datapackage.yml with title attribute

In code:

const title =
  datapackage?.title ||
  frontMatter.datapackage?.title ||
  frontMatter.title ||
  extractTitle(source) ||
  "";

Aside: ❓ Why do we want to compute this on the way in rather than at render time? Because we want this title when we e.g. show project/dataset listings, or when we do search. Plus it's cleaner to compute once in one place (easier to reason about etc.). Plus currently we need to do this computation twice: 1) for generating page SEO metadata in the generateMetadata function and 2) in the main slug Page code (though we could probably find a workaround for that).

Algorithm

for (const filePath of fileTree) {
  const { title, description /* , … */ } = computeInfo(filePath);
  // store the computed info for this path
}

Rendering a dataset page

Comment: a dataset page requires the following info

  • Project
  • README File
  • datapackage

Motivating goal, I can give you the following:

README.md
data.csv
data2.csv

Where README.md is:

# My cool dataset

My short description for this stuff.

Then lots more of the README.

And I end up with rendering the equivalent of datapackage.yml (an extended version):

{
  title: My cool dataset
  description: My short description for this stuff.
  readme: ...
  resources:
    -
      path: data.csv
      size: 
    -
      path: data2.csv
      size: // this comes from github file tree etc
  
}

🔮 INSIGHT

  • We want a ComputedMetadata table:
{
  project_id: // id of the project
  path: ''  // the rendered url path; default to '' as we only compute for the root url of a project for now
  title:
  description: 
  readme: // it's not really needed, is it?
  metadata: 
}
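
A hypothetical Prisma sketch of that table, following the Site model above (field names/types are assumptions; Site would also need a matching back-relation):

// Hypothetical sketch; field names/types are assumptions, and Site
// would need a computedMetadata ComputedMetadata[] back-relation.
model ComputedMetadata {
  id          String  @id @default(cuid())
  site        Site    @relation(fields: [siteId], references: [id], onDelete: Cascade)
  siteId      String
  path        String  @default("") // rendered url path; '' = project root for now
  title       String?
  description String?
  metadata    Json?   @db.JsonB

  @@unique([siteId, path])
}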

Rufus NTS

  • 3 parts here
    • What we store
    • How we compute that
    • How we retrieve it
  • Let's start from what makes retrieval experience and rendering great …
    • And then if that is too painful to do … we can loop back
  • Rendering a page, especially an index page for a project (or an index page for a sub part of the tree)
