R2 as main storage and cache
Summary
- Appetite: 3d
Problem
- GitHub as (sole) content backend does not work: using GitHub as our content backend is problematic because of GitHub's API rate limits. See previous shaping on GitHub scaling
- Serving large-ish files off GitHub won't work well anyway
- We need somewhere to compute and store derived information such as the markdowndb output etc. This implies we want a store we can easily and regularly access (and potentially write to)
Sketch of the solution
Project create
- Add to queue / start a worker
- Worker gets list of all files OR gets zip/tar of repo
- OR use Google Cloud containers or similar, which have no time limit and come with storage they can use
- Pushes contents onto R2
{project-id}/{branch}/raw/
Project updates (come from github app)
- For LATER: we have a GitHub App that pushes updates to R2 as they happen on GitHub
- For NOW = HACK: just ask the user to add a
?refresh=true
query string when they want to refresh an item, and we handle updating R2 as part of that request
- ISSUES: does not handle deletes
- ISSUES: does not handle updating the file tree …
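The refresh hack above only needs a small check on the request path. A minimal sketch, assuming the handler has access to the request URL; the helper name is hypothetical:

```javascript
// Hypothetical helper for the ?refresh=true hack: decide whether this
// request should trigger a re-copy of the file from GitHub to R2.
// The base URL is only needed to parse relative request paths.
function shouldRefresh(requestUrl) {
  const url = new URL(requestUrl, "http://localhost");
  return url.searchParams.get("refresh") === "true";
}
```

The render code would call this before serving from R2 and, if true, re-run the GitHub-to-R2 copy for that file first.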
Pseudo-code
import { createContentStore } from "lib/db.js"

const contentStore = createContentStore(r2ConfigInfo)

// get list of files
const fileTree = await getFileTree(repo, branch)
for (const file of fileTree) {
  await contentStore.copyFileFromGithubToR2(project, file)
}
await contentStore.writeFileTreeToR2(project, fileTree)

// on the content store:
async function copyFileFromGithubToR2(project, file) {
  // get content for the file from GitHub
  const fileContent = await githubStore.getFileContent(
    project.user.token, file.path, project.repo.url
  )
  const destPath = `/${project.id}/raw/${branch}/${file.path}`
  await r2.store(destPath, fileContent)
}
In our render code:
// get a file
const fileTree = await getFileTree()
const fileContent = await getRawFileContents() // now retrieves from R2
return Page(fileTree, fileContent)
R2 has one bucket: e.g. rawstore.datahub.io
with rough layout
/project/{project-id}/{branch}/raw/{path}
/project/{project-id}/{branch}/files.json # github tree list
/project/{project-id}/{branch}/markdowndb.json
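The layout above could be centralized in small key-builder helpers so the copy worker and the render code agree on paths. A sketch with hypothetical function names:

```javascript
// Hypothetical key builders matching the rough bucket layout above.
// Raw file content lives under .../raw/, derived artifacts sit alongside.
function rawKey(projectId, branch, path) {
  return `/project/${projectId}/${branch}/raw/${path}`;
}
function fileTreeKey(projectId, branch) {
  return `/project/${projectId}/${branch}/files.json`;
}
function markdowndbKey(projectId, branch) {
  return `/project/${projectId}/${branch}/markdowndb.json`;
}
```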
Two "store" objects
GithubStore
- largely read-only
R2Store
- read/write
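A rough sketch of the two store objects as JS classes; the method names and shapes here are assumptions, not a settled API:

```javascript
// GithubStore: largely read-only access to repo content via the GitHub API.
class GithubStore {
  constructor(token) {
    this.token = token;
  }
  async getFileTree(repoUrl, branch) {
    // fetch the git tree for the branch via the GitHub API (sketch)
  }
  async getFileContent(repoUrl, branch, path) {
    // fetch raw file content via the GitHub API (sketch)
  }
}

// R2Store: read/write access to the single R2 bucket.
class R2Store {
  constructor(r2Config) {
    this.config = r2Config;
  }
  async get(key) {
    // read an object from R2 (sketch)
  }
  async put(key, body) {
    // write an object to R2 (sketch)
  }
}
```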
Idea
- Use a GitHub App to sync someone's repo to R2 on every push. Can be intelligent at some point and just sync the changes (at the start we can just re-copy everything)
- Then we can run whatever processes we want against R2 e.g. build markdowndb, whatever
- Frontend app just runs off r2. i.e. we use r2 as our api for now 😉
Design
- GitHub App that listens for changes. GitHub Apps get all webhooks apparently
- Need a worker that receives the webhook requests and uses the relevant token to pull content onto R2
- boom we're done
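One fiddly part of that worker is turning a push-event payload into a minimal set of R2 writes and deletes. GitHub's push payload does carry `commits[].added/modified/removed`; everything else here (names, shape of the surrounding worker) is an assumption:

```javascript
// Collect which paths to copy to R2 and which to delete, from a GitHub
// push-event payload. Later commits in the push override earlier ones.
function changedPaths(pushEvent) {
  const copy = new Set();
  const remove = new Set();
  for (const commit of pushEvent.commits ?? []) {
    for (const p of [...(commit.added ?? []), ...(commit.modified ?? [])]) {
      remove.delete(p); // re-added later in the push: copy wins
      copy.add(p);
    }
    for (const p of commit.removed ?? []) {
      copy.delete(p); // removed later in the push: delete wins
      remove.add(p);
    }
  }
  return { copy: [...copy], remove: [...remove] };
}
```

The worker would then fetch each `copy` path from GitHub and `put` it to R2, and delete each `remove` path, which also closes the delete gap the refresh hack leaves open.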
Now, I know there is a chunk of work here, so this is not for next week … but this is pretty nice for us.
Questions
- Are we using a GitHub OAuth App or a GitHub App? 🔑 2024-02-21: OAuth App atm
- Seems like github apps are better overall e.g. re rate limits etc https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/differences-between-github-apps-and-oauth-apps
  > The rate limit for GitHub Apps using an installation access token scales with the number of repositories and number of organization users. Conversely, OAuth apps have lower rate limits and do not scale. For more information, see "Rate limits for GitHub Apps."
Risks / Rabbit holes
- Copying content to R2 at project creation may hit time limits in Workers or Vercel. For now we assume the repo is small and that we can copy files quickly
- Potential solution is to use queues but this complicates the solution significantly.
No gos
- Optimizing the process of getting the initial git repo by using the source tar/zip - see https://docs.github.com/en/repositories/working-with-files/using-files/downloading-source-code-archives. For now, we'll just iterate over the repo tree and copy files manually
- Push updates of R2 content from the repo. For now we live with forced refreshes