Separate file processing from syncing and new Blob table to make querying easier

Situation

  • Current structure of the R2 bucket:
    /{projectId}/{branch}/raw/{pathtofile}
    /{projectId}/{branch}/_tree
    
  • Syncing and file processing are tied together
  • DB structure - see https://github.com/datopian/datahub-next/blob/main/prisma/schema.prisma
    • files JSON blob
    • How does that differ from _tree? _tree is exactly the GitHub API tree object, unmodified
  • The MetaStore at the moment is the Site table with the files JSON blob on it.

How rendering etc currently works:

getPageMetadata: https://github.com/datopian/datahub-next/blob/8b70a29bfc9303dea92e65f11ba43322a7eb3e98/server/api/routers/site.ts#L640

getPageContent (from R2): https://github.com/datopian/datahub-next/blob/8b70a29bfc9303dea92e65f11ba43322a7eb3e98/server/api/routers/site.ts#L689

Render MarkdownPage

renderMarkdownPage(file: FileObject, siteConfig: SiteConfig /* e.g. for setting the title */) {
  ...
}

Component responsible for rendering: https://github.com/datopian/datahub-next/blob/main/components/MDX.tsx
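
Putting those pieces together, the current flow is roughly the following (a minimal sketch with hypothetical signatures; getPageMetadata/getPageContent stand in for the tRPC procedures linked above, and FileObject/SiteConfig are assumed shapes, not the exact types in the repo):

// Assumed shapes - the real ones live in server/api/routers/site.ts.
type FileObject = { path: string; metadata: Record<string, unknown>; content: string };
type SiteConfig = { title: string };

declare function getPageMetadata(projectId: string, path: string): Promise<Record<string, unknown>>;
declare function getPageContent(projectId: string, branch: string, path: string): Promise<string>;
declare function renderMarkdownPage(file: FileObject, config: SiteConfig): unknown;

async function renderPath(projectId: string, branch: string, path: string, config: SiteConfig) {
  const metadata = await getPageMetadata(projectId, path);        // from the files JSON blob on Site
  const content = await getPageContent(projectId, branch, path);  // from R2: /{projectId}/{branch}/raw/{path}
  return renderMarkdownPage({ path, metadata, content }, config); // handed to the MDX component
}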

Complication

  • Syncing and file processing are tied together
  • File/Blob metadata lives on the Site object in the files JSON blob, which is hard to query - e.g. it is hard to get a list of blog posts.

We want to separate file processing from syncing

Currently, syncing and file processing (i.e. extraction of structured info) happen in the same steps in inngest-driven code - see https://github.com/datopian/datahub-next?tab=readme-ov-file#sync-process-details

We want to separate file processing from syncing because

  • easier to reason about (syncing and processing are separate, so e.g. errors in one don't affect the other)
  • can handle direct addition of files to R2
  • syncing is faster (time to first render on updates is lower - even if processing has not completed, we show something …)
  • currently, if we make any changes to metadata computation, we need to re-sync the site, which means copying stuff over from GitHub once again …
  • suppose I want to add new processing, e.g. full text search, that runs in parallel per file entry - this is now another stage in an already slow process
  • we can switch to Cloudflare infrastructure, especially Workflows and Workers, for processing whilst still using NextJS and inngest for syncing …

I want to get a list of blog posts or other stuff and it is painful

e.g. we want to be able to do …

getFiles(query)
getFilesByPath(path, sort=datedesc)
getFiles

Currently we have to fetch the files blob and iterate through everything, which is painful.

SELECT * FROM files WHERE path LIKE '/blog%' is easier.
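
With a Blob table and Prisma this becomes a one-line query (a sketch; blob as the model name and getFilesByPath as a helper are assumptions, not existing code):

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Hypothetical helper over the new Blob table.
async function getFilesByPath(projectId: string, prefix: string) {
  const files = await prisma.blob.findMany({
    where: { projectId, path: { startsWith: prefix } },
  });
  // sort=datedesc: while date lives inside the metadata JSON we sort in
  // memory; promoting date to a real column would let the DB do it.
  return files.sort((a, b) =>
    String((b.metadata as any)?.date ?? "").localeCompare(String((a.metadata as any)?.date ?? ""))
  );
}

// e.g. all blog posts, newest first
const posts = await getFilesByPath("my-project", "/blog");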

Hypothesis

  • Add a Blob table
  • Refactor so that ingest and processing are separate, with processing on Cloudflare

Metastore structure

Today

  • Site table
  • with a files attribute on the Site object
  • _tree object

Future

Site table

File/Blob table (very simple): e.g. what you get from GitHub, keyed by project_id and path

What is a consolidated TreeItem/File/Blob?

class Blob {
  project_id: string  // link to the parent project
  path: string        // e.g. "my cool page.md", i.e. /abc/my page.md
  app_path: string    // e.g. "my+cool+page", i.e. /abc/my+page
  size: number        // e.g. 30
  sha: string         // e.g. "44b4fc6d56897b048c772eb4087f854f46256132"
  metadata: {
    title
    image       // ?
    description
    date
    authors
    layout
    tags
    links       // forward links
  }
}
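
For illustration, the path → app_path mapping implied by the example values might look like this (an assumption; the real slug rules may differ):

// Hypothetical slugification: "my cool page.md" -> "my+cool+page"
function toAppPath(path: string): string {
  return path
    .replace(/\.mdx?$/, "") // drop the markdown extension
    .replace(/ /g, "+");    // spaces become '+'
}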

New ingest and processing approach

On ingest from GitHub, just fill in the Blob with what GitHub gives us (and so e.g. metadata will be empty).

Principle (?): the site pages should still render, even if somewhat broken, based on that alone …
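
A minimal sketch of that ingest step, assuming a Prisma model named blob with a compound unique key on (projectId, path) - both assumptions:

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Shape of an entry in the GitHub git/trees response.
type TreeItem = { path: string; sha: string; size?: number };

// Ingest writes only what GitHub gives us; metadata stays empty
// until the processing step fills it in.
async function ingestTreeItem(projectId: string, item: TreeItem) {
  await prisma.blob.upsert({
    where: { projectId_path: { projectId, path: item.path } }, // assumed @@unique([projectId, path])
    create: { projectId, path: item.path, sha: item.sha, size: item.size ?? 0, metadata: {} },
    update: { sha: item.sha, size: item.size ?? 0 },
  });
}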

Then, after ingest, a Cloudflare Worker or Workflow kicks off for each file that is created/updated or deleted (see the sketch after this list):

  • It will fetch the file from R2
  • Fetch the corresponding entry in the database: SELECT * FROM Blob WHERE project_id = ? AND path = ?
  • It will start extracting stuff from the markdown file (or other type of file)
  • And add that to the DB entry …
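
A sketch of that processing step as a Cloudflare Queue consumer (all names here - the BUCKET binding, the message shape, updateBlobMetadata - are assumptions, and frontmatter extraction via gray-matter stands in for the real extraction logic):

// Types like R2Bucket/MessageBatch come from @cloudflare/workers-types.
import matter from "gray-matter";

interface Env {
  BUCKET: R2Bucket; // binding for the /{projectId}/{branch}/raw/... objects
}

type FileEvent = { projectId: string; branch: string; path: string };

// How the Worker reaches the database is left open (e.g. Prisma with a
// driver adapter, or an HTTP call back into the app).
declare function updateBlobMetadata(
  projectId: string,
  path: string,
  metadata: Record<string, unknown>
): Promise<void>;

export default {
  // Queue consumer: one message per created/updated file.
  async queue(batch: MessageBatch<FileEvent>, env: Env) {
    for (const msg of batch.messages) {
      const { projectId, branch, path } = msg.body;
      const obj = await env.BUCKET.get(`${projectId}/${branch}/raw/${path}`); // 1. fetch from R2
      if (!obj) { msg.ack(); continue; }                                      // deleted in the meantime
      const { data } = matter(await obj.text());                              // 2. extract structured info
      await updateBlobMetadata(projectId, path, data);                        // 3. write onto the Blob row
      msg.ack();
    }
  },
};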

Sequence

  • Ingest code (whether Cloudflare or inngest): copies files over to R2 (and updates the DB with file entries)
  • Processing code is triggered from the event queue and starts operating immediately … (sketch below)
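
If the event queue were a Cloudflare Queue, the ingest side only needs to enqueue one message per changed file (a sketch; the binding name and message shape are assumptions):

interface Env {
  PROCESSING_QUEUE: Queue<{ projectId: string; branch: string; path: string }>;
}

// Called at the end of ingest for each file copied to R2; the consumer
// in the previous sketch picks it up and starts processing immediately.
async function notifyProcessing(env: Env, projectId: string, branch: string, path: string) {
  await env.PROCESSING_QUEUE.send({ projectId, branch, path });
}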

Plan of work

  • Update the inngest code to write file entries to the new Blob table
  • Create the processing code on Cloudflare, reading from R2
    • Can we run Cloudflare Workers/Workflows off a GitHub repo … (I hope so)
  • (Disable the processing code in inngest) - can be done later
  • Check it works …
  • Then start refactoring the frontend code to use the new Blob table
