Separate file processing from syncing and new Blob table to make querying easier

Situation

  • Current structure of the R2 bucket:
    /{projectId}/{branch}/raw/{pathtofile}
    /{projectId}/{branch}/_tree
    
  • Syncing and file processing are tied together
  • DB structure - see https://github.com/datopian/datahub-next/blob/main/prisma/schema.prisma
    • files JSON blob
    • How does that differ from _tree? _tree is exactly the GitHub API tree object, unmodified
  • The MetaStore at the moment is the Site table with the files JSON blob on it.

How rendering etc currently works:

getPageMetadata: https://github.com/datopian/datahub-next/blob/8b70a29bfc9303dea92e65f11ba43322a7eb3e98/server/api/routers/site.ts#L640

getPageContent (from R2): https://github.com/datopian/datahub-next/blob/8b70a29bfc9303dea92e65f11ba43322a7eb3e98/server/api/routers/site.ts#L689

Render MarkdownPage

renderMarkdownPage(file: FileObject, siteConfig: SiteConfig /* e.g. for setting the title */) {
  ...
}

Component responsible for rendering: https://github.com/datopian/datahub-next/blob/main/components/MDX.tsx
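
Putting those pieces together, the current flow is roughly the following (a minimal sketch with hypothetical signatures; getPageMetadata/getPageContent stand in for the tRPC procedures linked above, and FileObject/SiteConfig are assumed shapes, not the exact types in the repo):

// Assumed shapes - the real ones live in server/api/routers/site.ts.
type FileObject = { path: string; metadata: Record<string, unknown>; content: string };
type SiteConfig = { title: string };

declare function getPageMetadata(projectId: string, path: string): Promise<Record<string, unknown>>;
declare function getPageContent(projectId: string, branch: string, path: string): Promise<string>;
declare function renderMarkdownPage(file: FileObject, config: SiteConfig): unknown;

async function renderPath(projectId: string, branch: string, path: string, config: SiteConfig) {
  const metadata = await getPageMetadata(projectId, path);        // from the files JSON blob on Site
  const content = await getPageContent(projectId, branch, path);  // from R2: /{projectId}/{branch}/raw/{path}
  return renderMarkdownPage({ path, metadata, content }, config); // handed to the MDX component
}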

Complication

  • Syncing and file processing are tied together
  • File/Blob metadata lives on the Site object in the files JSON blob, which is hard to query - e.g. it is hard to get a list of blog posts.

We want to separate file processing from syncing

Currently, syncing and file processing (i.e. extraction of structured info) happen in the same steps in inngest-driven code - see https://github.com/datopian/datahub-next?tab=readme-ov-file#sync-process-details

We want to separate file processing from syncing because

  • easier to reason about (syncing and processing are separate, so e.g. errors in one don't affect the other)
  • can handle direct addition of files to R2
  • syncing is faster (time to first render on updates is lower - even if processing has not completed, we show something …)
  • currently, if we make any changes to metadata computation, we need to re-sync the site, which means copying stuff over from GitHub once again …
  • suppose I want to add new processing, e.g. full text search, that runs in parallel per file entry - this is now another stage in an already slow process
  • we can switch to Cloudflare infrastructure, especially Workflows and Workers, for processing whilst still using NextJS and inngest for syncing …

I want to get a list of blog posts or other stuff and it is painful

e.g. we want to be able to do …

getFiles(query)
getFilesByPath(path, sort=datedesc)
getFiles

Currently we have to fetch the files blob and iterate through everything, which is painful.

SELECT * FROM files WHERE path LIKE '/blog%' is easier.
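
With a Blob table and Prisma this becomes a one-line query (a sketch; blob as the model name and getFilesByPath as a helper are assumptions, not existing code):

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Hypothetical helper over the new Blob table.
async function getFilesByPath(projectId: string, prefix: string) {
  const files = await prisma.blob.findMany({
    where: { projectId, path: { startsWith: prefix } },
  });
  // sort=datedesc: while date lives inside the metadata JSON we sort in
  // memory; promoting date to a real column would let the DB do it.
  return files.sort((a, b) =>
    String((b.metadata as any)?.date ?? "").localeCompare(String((a.metadata as any)?.date ?? ""))
  );
}

// e.g. all blog posts, newest first
const posts = await getFilesByPath("my-project", "/blog");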

Hypothesis

  • Add a Blob table
  • Refactor so that ingest and processing are separate, with processing on Cloudflare

Metastore structure

Today

  • Site table
  • with a files attribute on the Site object
  • _tree object

Future

Site table

File/Blob table (very simple): e.g. what you get from GitHub, keyed by project_id and path

What is a consolidated TreeItem/File/Blob?

class Blob {
  project_id: string  // link to the parent project
  path: string        // e.g. "my cool page.md", i.e. /abc/my page.md
  app_path: string    // e.g. "my+cool+page", i.e. /abc/my+page
  size: number        // e.g. 30
  sha: string         // e.g. "44b4fc6d56897b048c772eb4087f854f46256132"
  metadata: {
    title
    image       // ?
    description
    date
    authors
    layout
    tags
    links       // forward links
  }
}
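
For illustration, the path → app_path mapping implied by the example values might look like this (an assumption; the real slug rules may differ):

// Hypothetical slugification: "my cool page.md" -> "my+cool+page"
function toAppPath(path: string): string {
  return path
    .replace(/\.mdx?$/, "") // drop the markdown extension
    .replace(/ /g, "+");    // spaces become '+'
}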

New ingest and processing approach

On ingest from GitHub, just fill in the Blob with what GitHub gives us (and so e.g. metadata will be empty).

Principle (?): the site pages should still render, even if somewhat broken, based on that alone …
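
A minimal sketch of that ingest step, assuming a Prisma model named blob with a compound unique key on (projectId, path) - both assumptions:

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Shape of an entry in the GitHub git/trees response.
type TreeItem = { path: string; sha: string; size?: number };

// Ingest writes only what GitHub gives us; metadata stays empty
// until the processing step fills it in.
async function ingestTreeItem(projectId: string, item: TreeItem) {
  await prisma.blob.upsert({
    where: { projectId_path: { projectId, path: item.path } }, // assumed @@unique([projectId, path])
    create: { projectId, path: item.path, sha: item.sha, size: item.size ?? 0, metadata: {} },
    update: { sha: item.sha, size: item.size ?? 0 },
  });
}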

Then, after ingest, a Cloudflare Worker or Workflow kicks off for each file that is created/updated or deleted (see the sketch after this list):

  • It will fetch the file from R2
  • Fetch the corresponding entry in the database: SELECT * FROM Blob WHERE project_id = ? AND path = ?
  • It will start extracting stuff from the markdown file (or other type of file)
  • And add that to the DB entry …
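
A sketch of that processing step as a Cloudflare Queue consumer (all names here - the BUCKET binding, the message shape, updateBlobMetadata - are assumptions, and frontmatter extraction via gray-matter stands in for the real extraction logic):

// Types like R2Bucket/MessageBatch come from @cloudflare/workers-types.
import matter from "gray-matter";

interface Env {
  BUCKET: R2Bucket; // binding for the /{projectId}/{branch}/raw/... objects
}

type FileEvent = { projectId: string; branch: string; path: string };

// How the Worker reaches the database is left open (e.g. Prisma with a
// driver adapter, or an HTTP call back into the app).
declare function updateBlobMetadata(
  projectId: string,
  path: string,
  metadata: Record<string, unknown>
): Promise<void>;

export default {
  // Queue consumer: one message per created/updated file.
  async queue(batch: MessageBatch<FileEvent>, env: Env) {
    for (const msg of batch.messages) {
      const { projectId, branch, path } = msg.body;
      const obj = await env.BUCKET.get(`${projectId}/${branch}/raw/${path}`); // 1. fetch from R2
      if (!obj) { msg.ack(); continue; }                                      // deleted in the meantime
      const { data } = matter(await obj.text());                              // 2. extract structured info
      await updateBlobMetadata(projectId, path, data);                        // 3. write onto the Blob row
      msg.ack();
    }
  },
};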

Sequence

  • Ingest code (whether Cloudflare or inngest): copies files over to R2 (and updates the DB with file entries)
  • Processing code is triggered from the event queue and starts operating immediately … (sketch below)
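
If the event queue were a Cloudflare Queue, the ingest side only needs to enqueue one message per changed file (a sketch; the binding name and message shape are assumptions):

interface Env {
  PROCESSING_QUEUE: Queue<{ projectId: string; branch: string; path: string }>;
}

// Called at the end of ingest for each file copied to R2; the consumer
// in the previous sketch picks it up and starts processing immediately.
async function notifyProcessing(env: Env, projectId: string, branch: string, path: string) {
  await env.PROCESSING_QUEUE.send({ projectId, branch, path });
}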

Plan of work

  • Update the inngest code to write file entries to the new Blob table
  • Create the processing code on Cloudflare, reading from R2
    • Can we run Cloudflare Workers/Workflows off a GitHub repo … (I hope so)
  • (Disable the processing code in inngest) - can be done later
  • Check it works …
  • Then start refactoring the frontend code to use the new Blob table
