R2 as main storage and cache

Summary

Appetite: 3d
Problem:

Problem

Github as (sole) content backend does not work: API rate limits on github make it problematic. See previous shaping on github scaling
Serving large-ish files off github won't work well anyway
We need somewhere to compute and store derived information such as markdowndb etc. This implies we want a store we can easily and regularly access (and potentially write to)

Sketch of the solution

Project create

Add to queue / start a worker
Worker gets list of all files OR gets zip/tar of repo
- OR use google cloud containers or similar which have no time limit and have storage they can use
Pushes contents onto R2 {project-id}/{branch}/raw/
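
A rough sketch of the "gets list of all files" step, assuming the worker calls the github git trees endpoint recursively and gets back entries with `path`, `type` and `sha` plus a `truncated` flag. `TreeEntry`, `TreeResponse` and `listFilePaths` are hypothetical names used for illustration only.

```typescript
// Subset of the fields github returns for each entry of a git tree.
interface TreeEntry {
  path: string;
  type: "blob" | "tree" | "commit";
  sha: string;
}

interface TreeResponse {
  tree: TreeEntry[];
  truncated: boolean;
}

// Return the paths of all files (blobs); directories and submodules are skipped.
function listFilePaths(response: TreeResponse): string[] {
  if (response.truncated) {
    // Large repos can come back truncated; out of scope while we assume small repos.
    throw new Error("Tree listing was truncated; repo is too large for this sketch");
  }
  return response.tree
    .filter((entry) => entry.type === "blob")
    .map((entry) => entry.path);
}

const exampleResponse: TreeResponse = {
  truncated: false,
  tree: [
    { path: "README.md", type: "blob", sha: "a1" },
    { path: "docs", type: "tree", sha: "b2" },
    { path: "docs/intro.md", type: "blob", sha: "c3" },
  ],
};

const filePaths = listFilePaths(exampleResponse);
console.log(filePaths);
```

Each returned path would then be copied under the `{project-id}/{branch}/raw/` prefix.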

Project updates (come from github app)

For Later: we have a github app that pushes updates to R2 as they happen on github
For NOW = HACK: could just ask the user to add a ?refresh=true query string if they want to refresh an item and we handle updating R2 as part of the request
- ISSUES: does not handle delete
- ISSUES: does not handle updating the file tree …
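
A minimal sketch of the `?refresh=true` hack, assuming a plain object stands in for the R2 bucket and a callback stands in for reading from github. `Bucket`, `ReadFromGithub`, `wantsRefresh` and `getFileForRequest` are hypothetical names. It is kept synchronous for brevity; real store calls would be async.

```typescript
// Hypothetical stand-ins: a plain object for the R2 bucket, a callback for github.
type Bucket = Record<string, string>;
type ReadFromGithub = (path: string) => string;

// True when the request URL carries ?refresh=true.
function wantsRefresh(requestUrl: string): boolean {
  return new URL(requestUrl).searchParams.get("refresh") === "true";
}

// Re-copy the single requested file when asked to, then serve from the bucket.
function getFileForRequest(
  requestUrl: string,
  path: string,
  bucket: Bucket,
  readFromGithub: ReadFromGithub
): string | undefined {
  if (wantsRefresh(requestUrl)) {
    bucket[path] = readFromGithub(path);
  }
  return bucket[path];
}

const demoBucket: Bucket = { "README.md": "old content" };
const demoGithub: ReadFromGithub = () => "new content";

const cachedResult = getFileForRequest(
  "https://example.com/README.md", "README.md", demoBucket, demoGithub);
const refreshedResult = getFileForRequest(
  "https://example.com/README.md?refresh=true", "README.md", demoBucket, demoGithub);
console.log(cachedResult, refreshedResult);
```

Only the requested path is touched, which is why deletes and the file tree are not covered by this hack.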

Pseudo-code

import createContentStore from "lib/db.js"

const contentStore = createContentStore(r2ConfigInfo)

// get list of files
const fileTree = await getFileTree(repo, branch)
for (const file of fileTree) {
  await contentStore.copyFileFromGithubToR2(project, branch, file)
}
await writeFileTreeToR2(project, branch, fileTree)

// in lib/db.js (contentStore)

async function copyFileFromGithubToR2(project, branch, file) {
  // get content for file
  const fileContent = await githubStore.getFileContent(project.user.token, file.path, project.repo.url)
  // same layout as the bucket: /project/{project-id}/{branch}/raw/{path}
  const destPath = `/project/${project.id}/${branch}/raw/${file.path}`
  await r2.store(destPath, fileContent)
}

In our render code:

// get a file
const fileTree = getFileTree()
const fileContent = getRawFileContents()  // now retrieves from R2

return Page(fileTree, fileContent)

R2 has one bucket, e.g. rawstore.datahub.io, with rough layout:

/project/{project-id}/{branch}/raw/{path}
/project/{project-id}/{branch}/files.json # github tree list
/project/{project-id}/{branch}/markdowndb.json
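
The layout above can be captured in a few key-building helpers so every caller produces the same paths. This is a sketch only: the function names are hypothetical, and stripping leading slashes from file paths is an assumption, not something decided in this doc.

```typescript
// Common prefix for everything belonging to one project branch.
function projectPrefix(projectId: string, branch: string): string {
  return `/project/${projectId}/${branch}`;
}

// Key of a raw file copied from the repo.
function rawFileKey(projectId: string, branch: string, path: string): string {
  // Assumption: strip leading slashes so "/README.md" and "README.md" give the same key.
  const cleanPath = path.replace(/^\/+/, "");
  return `${projectPrefix(projectId, branch)}/raw/${cleanPath}`;
}

// Key of the stored github tree list.
function fileTreeKey(projectId: string, branch: string): string {
  return `${projectPrefix(projectId, branch)}/files.json`;
}

// Key of the derived markdowndb output.
function markdownDbKey(projectId: string, branch: string): string {
  return `${projectPrefix(projectId, branch)}/markdowndb.json`;
}

console.log(rawFileKey("abc123", "main", "docs/intro.md"));
console.log(fileTreeKey("abc123", "main"));
console.log(markdownDbKey("abc123", "main"));
```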

Two "store" objects

GithubStore - largely read-only
R2Store - read/write
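
One way the two stores could be typed, as a sketch: the interface names, method names and in-memory classes below are hypothetical stand-ins, not the real github or R2 clients. Methods are synchronous here only to keep the example short; real implementations would be async.

```typescript
// Read side: anything we can fetch file content from (GithubStore).
interface ReadableStore {
  getFileContent(path: string): string | undefined;
}

// Read/write side (R2Store).
interface WritableStore extends ReadableStore {
  putFileContent(path: string, content: string): void;
}

// In-memory stand-in for the github side: read-only.
class InMemoryGithubStore implements ReadableStore {
  constructor(private files: Record<string, string>) {}
  getFileContent(path: string): string | undefined {
    return this.files[path];
  }
}

// In-memory stand-in for the R2 side: read/write.
class InMemoryR2Store implements WritableStore {
  private objects: Record<string, string> = {};
  getFileContent(path: string): string | undefined {
    return this.objects[path];
  }
  putFileContent(path: string, content: string): void {
    this.objects[path] = content;
  }
}

// Copy one file from the read-only store to the read/write store.
// Returns false when the source file does not exist.
function copyFile(
  source: ReadableStore,
  dest: WritableStore,
  sourcePath: string,
  destKey: string
): boolean {
  const content = source.getFileContent(sourcePath);
  if (content === undefined) return false;
  dest.putFileContent(destKey, content);
  return true;
}

const githubStore = new InMemoryGithubStore({ "README.md": "# Hello" });
const r2Store = new InMemoryR2Store();
const copied = copyFile(githubStore, r2Store, "README.md", "/project/abc123/main/raw/README.md");
console.log(copied, r2Store.getFileContent("/project/abc123/main/raw/README.md"));
```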

Idea

Use a github app to sync someone's repo to R2 on every push. Can be intelligent at some point and just sync the changes. (At the start we can just re-copy everything.)
Then we can run whatever processes we want against R2, e.g. build markdowndb
Frontend app just runs off r2. i.e. we use r2 as our api for now 😉

Design

Github App that listens for changes. A github app apparently receives all webhooks
Need a worker that receives the webhook requests and uses the relevant token to pull content onto R2
boom we're done
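
A sketch of how the worker could turn a push webhook into "just sync the changes", assuming it reads the `ref` and per-commit `added` / `modified` / `removed` path lists from the push payload and that commits arrive oldest first. `SyncPlan` and `planSync` are hypothetical names, and the actual copy/delete calls against R2 are left out.

```typescript
// Subset of the github push webhook payload used here.
interface PushCommit {
  added: string[];
  modified: string[];
  removed: string[];
}

interface PushPayload {
  ref: string; // e.g. "refs/heads/main"
  commits: PushCommit[];
}

interface SyncPlan {
  branch: string;
  upsert: string[]; // paths to copy from github to R2
  remove: string[]; // paths to delete from R2
}

function planSync(payload: PushPayload): SyncPlan {
  // Replay commits in order so the last change to each path wins.
  const state: Record<string, "upsert" | "remove"> = {};
  for (const commit of payload.commits) {
    for (const path of commit.added.concat(commit.modified)) state[path] = "upsert";
    for (const path of commit.removed) state[path] = "remove";
  }
  const paths = Object.keys(state).sort();
  return {
    branch: payload.ref.replace(/^refs\/heads\//, ""),
    upsert: paths.filter((p) => state[p] === "upsert"),
    remove: paths.filter((p) => state[p] === "remove"),
  };
}

const plan = planSync({
  ref: "refs/heads/main",
  commits: [
    { added: ["a.md", "b.md"], modified: [], removed: [] },
    { added: ["c.md"], modified: ["a.md"], removed: ["b.md"] },
  ],
});
console.log(plan);
```

Unlike the refresh hack, a plan like this also covers deletes, and the stored file tree could be rewritten after applying it.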

Now I know there is a chunk of work here so this is not for next week … but this is pretty nice for us.

Qus

Are we using github oauth app or a github app? 🔑2024-02-21 oauth app atm
- Seems like github apps are better overall e.g. re rate limits etc https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/differences-between-github-apps-and-oauth-apps
  - Re rate limits: https://docs.github.com/en/apps/creating-github-apps/about-creating-github-apps/deciding-when-to-build-a-github-app#github-apps-have-scalable-rate-limits
    
    The rate limit for GitHub Apps using an installation access token scales with the number of repositories and number of organization users. Conversely, OAuth apps have lower rate limits and do not scale. For more information, see "Rate limits for GitHub Apps."

Risks / Rabbit holes

Copying everything to R2 at the start may time out in workers or vercel. For now we will assume the repo is small and that we are able to copy files quickly
- Potential solution is to use queues but this complicates the solution significantly.

No gos

Optimizing the process of getting the initial git repo by using the source tar/zip - see https://docs.github.com/en/repositories/working-with-files/using-files/downloading-source-code-archives. For now, we'll just iterate over the repo tree and copy files manually
Pushing updates from the repo to R2. For now we live with forced refreshes