Separate file processing from syncing and new Blob table to make querying easier
Situation
- Structure on R2 bucket atm:
  - /{projectId}/{branch}/raw/{pathtofile}
  - /{projectId}/{branch}/_tree
- Syncing and file processing are tied together
- DB structure - see https://github.com/datopian/datahub-next/blob/main/prisma/schema.prisma
  - `Site` has a `files` json blob. How does that differ from `_tree`? `_tree` is exactly the github API tree object, unmodified.
- MetaStore atm is a `Site` table with a `files` json blob on it.
How rendering etc currently works:
getPageMetadata: https://github.com/datopian/datahub-next/blob/8b70a29bfc9303dea92e65f11ba43322a7eb3e98/server/api/routers/site.ts#L640
getPageContent (from R2): https://github.com/datopian/datahub-next/blob/8b70a29bfc9303dea92e65f11ba43322a7eb3e98/server/api/routers/site.ts#L689
Render MarkdownPage
renderMarkdownPage(fileObject, siteConfig /* e.g. for setting the title */) {
}
Component responsible for rendering: https://github.com/datopian/datahub-next/blob/main/components/MDX.tsx
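The current flow above can be sketched roughly as follows. This is a minimal sketch with hypothetical, simplified types and signatures - the real implementations live in the `site.ts` router and `MDX.tsx` component linked above:

```typescript
// Sketch of the current render path (simplified hypothetical types; the real
// code lives in server/api/routers/site.ts and components/MDX.tsx).

interface FileObject {
  path: string;
  content: string;                    // raw markdown fetched from R2
  metadata: Record<string, unknown>;  // looked up in the Site.files json blob
}

interface SiteConfig {
  title: string;
}

// getPageMetadata: look the page up in the Site.files json blob by path.
function getPageMetadata(
  files: Record<string, Record<string, unknown>>,
  path: string
): Record<string, unknown> | undefined {
  return files[path];
}

// renderMarkdownPage: combine the file with site config (e.g. fall back to the
// site title when the page has none) into rendered output.
function renderMarkdownPage(file: FileObject, site: SiteConfig): string {
  const title = (file.metadata.title as string) ?? site.title;
  return `# ${title}\n\n${file.content}`;
}

const files = { "blog/hello.md": { title: "Hello" } };
const metadata = getPageMetadata(files, "blog/hello.md");
const page = renderMarkdownPage(
  { path: "blog/hello.md", content: "Hi there.", metadata: metadata ?? {} },
  { title: "My Site" }
);
```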
Complication
- Syncing and file processing are tied together
- File/Blob metadata is on the `Site` object in a `files` json blob, which is hard to query - e.g. hard to get a list of blog posts.
We want to separate file processing from syncing
Currently syncing and file processing (i.e. extraction of structured info) happen in the same steps in inngest-driven code - see https://github.com/datopian/datahub-next?tab=readme-ov-file#sync-process-details
We want to separate file processing from syncing because
- easier to reason about (syncing and processing are separate, so e.g. errors in one don't affect the other)
- can handle direct add of files to R2
- syncing is faster (time to first render on updates is lower - even if processing has not completed we show something …)
- currently, if we make any changes to metadata computation, we need to re-sync the site, which means copying over stuff from github once again …
- suppose I want to add new processing, e.g. full text search, that runs in parallel to file entry - this is now another stage in a slow process
- we can switch to CF infrastructure, especially Workflows and Workers, for processing whilst still using NextJS and inngest for syncing …
I want to get a list of blog posts or other stuff and it is painful.
e.g. want to do …
getFiles(query)
getFilesByPath(path, sort=datedesc)
Currently have to get `files` and iterate through everything, which is painful.
SELECT * FROM files WHERE path LIKE '/blog%' is easier.
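With a Blob table, those helpers become simple filters and sorts. A minimal in-memory sketch of the wished-for API above - `getFilesByPath` and its `sort=datedesc` behaviour are hypothetical names from the wishlist, and in production this would be a SQL/Prisma query against the Blob table rather than an array scan:

```typescript
// In-memory sketch of the query API we want over Blob rows. In production this
// would be a `WHERE path LIKE '/blog%' ORDER BY ...` query on the Blob table.

interface BlobRow {
  projectId: string;
  path: string;
  metadata: { date?: string };
}

// getFilesByPath(path, sort=datedesc): filter by path prefix, newest first.
// Rows with no date sort last (empty string compares lowest).
function getFilesByPath(rows: BlobRow[], prefix: string): BlobRow[] {
  return rows
    .filter((r) => r.path.startsWith(prefix))
    .sort((a, b) =>
      (b.metadata.date ?? "").localeCompare(a.metadata.date ?? "")
    );
}

const blobRows: BlobRow[] = [
  { projectId: "p1", path: "/blog/a.md", metadata: { date: "2023-01-01" } },
  { projectId: "p1", path: "/blog/b.md", metadata: { date: "2024-06-01" } },
  { projectId: "p1", path: "/docs/c.md", metadata: {} },
];

const posts = getFilesByPath(blobRows, "/blog"); // newest first
```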
Hypothesis
- Add a `Blob` table
- Refactor to have ingest and processing separate, with processing on cloudflare
Metastore structure
Today
- Site table
  - with `files` attribute on the site object
  - `_tree` object
Future
- Site table
- File/Blob table (very simple), e.g. what you get from github: project_id, path
What is a consolidated TreeItem/File/Blob?
class Blob {
  "project_id": ..., // link to the parent project
  "path": "my cool page.md", // /abc/my page.md
  "app_path": "my+cool+page", // /abc/my+page
  "size": 30,
  "sha": "44b4fc6d56897b048c772eb4087f854f46256132",
  "metadata": {
    title: ...,
    image: ?,
    description: ...,
    date: ...,
    authors: ...,
    layout: ...,
    tags: ...,
    links: ... // forward links
  }
}
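As a TypeScript type, the consolidated record above might look like the following. This is a sketch, not a final schema - field names are taken from the shape above, and all `metadata` fields are optional since ingest creates the row empty and processing fills it in later:

```typescript
// Sketch of the consolidated Blob record. All metadata fields are optional:
// ingest creates the row with empty metadata; processing fills it in later.

interface BlobMetadata {
  title?: string;
  image?: string;
  description?: string;
  date?: string;
  authors?: string[];
  layout?: string;
  tags?: string[];
  links?: string[]; // forward links
}

interface Blob {
  projectId: string; // link to the parent project
  path: string;      // "my cool page.md"
  appPath: string;   // "my+cool+page"
  size: number;
  sha: string;
  metadata: BlobMetadata;
}

// Freshly ingested row: only the github-provided fields, metadata empty.
const fresh: Blob = {
  projectId: "abc",
  path: "my cool page.md",
  appPath: "my+cool+page",
  size: 30,
  sha: "44b4fc6d56897b048c772eb4087f854f46256132",
  metadata: {},
};
```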
New ingest and processing approach
On ingest from Github just fill in Blob with stuff from github (and so e.g. metadata will be empty).
Principle: (?) the site pages should still render even if somewhat broken based on that …
Then, after ingest, a cloudflare worker or workflow kicks off for each file that is created, updated, or deleted:
- It will fetch the file from R2
- Fetch the corresponding entry in the database: SELECT * FROM blob WHERE project_id = ? AND path = ?
- It will start extracting stuff from the markdown file (or other type of file)
- And add that to the db entry …
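The extraction step could start as a small pure function over the file contents. A sketch assuming `key: value` frontmatter delimited by `---` (the worker would run this on the fetched file and write the result back to the Blob row's metadata; a real implementation would use a proper frontmatter/YAML parser):

```typescript
// Sketch of the processing step: parse simple `key: value` frontmatter from a
// markdown file into the metadata shape stored on the Blob row. A real worker
// would use a proper frontmatter/YAML parser instead of this minimal version.

function extractMetadata(markdown: string): Record<string, string> {
  // Frontmatter block at the very start of the file, delimited by ---.
  const match = markdown.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {};
  const metadata: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    metadata[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return metadata;
}

const doc = "---\ntitle: Hello\ndate: 2024-06-01\n---\n\nBody text.";
const meta = extractMetadata(doc); // { title: "Hello", date: "2024-06-01" }
```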
Sequence
- Ingest code (whether cloudflare or inngest): copies files over to R2 and updates the DB with file entries
- Processing code can be triggered from the event queue and starts operating immediately …
Plan of work
- Update inngest code to write to DB in Blob table
- Create the processing code on cloudflare
- Can we run cloudflare workers/workflows off a github repo … (I hope so)
- (Disable processing code in inngest) - can be done later
- Check it works …
- Then start refactoring frontend code to use new Blob table