R2 as main storage and cache

Summary

  • Appetite: 3d
  • Problem:

Problem

  • GitHub as (sole) content backend does not work: GitHub's API rate limits make it problematic as our content backend. See previous shaping on GitHub scaling
  • Serving large-ish files off GitHub won't work well anyway
  • We need somewhere to compute and store derived information such as markdowndb etc. This implies we want a store we can easily and regularly access (and potentially write to)

Sketch of the solution

Project create

  • Add to queue / start a worker
  • Worker gets list of all files OR gets zip/tar of repo
    • OR use Google Cloud containers or similar, which have no time limit and have storage they can use
  • Pushes contents to R2 under {project-id}/{branch}/raw/

Project updates (come from github app)

  • For Later: we have a GitHub app that pushes updates to R2 as they happen on GitHub
  • For NOW = HACK: could just ask the user to add a ?refresh=true query string if they want to refresh an item, and we handle updating R2 as part of the request
    • ISSUES: does not handle deletes
    • ISSUES: does not handle updating the file tree …
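
A minimal sketch of the ?refresh=true hack, to make the two issues concrete. Everything here is an assumption for illustration: `r2` and `github` are in-memory Maps standing in for the real bucket and the GitHub API, and `getFileForRequest` is a hypothetical name.

```javascript
// Hypothetical in-memory stand-ins for the R2 bucket and the GitHub API.
const r2 = new Map();
const github = new Map([["README.md", "# Hello v1"]]);

// Serve a file from R2. When the request URL carries ?refresh=true,
// re-copy that one file from GitHub into R2 before serving it.
async function getFileForRequest(requestUrl, projectId, branch, path) {
  const url = new URL(requestUrl);
  const refresh = url.searchParams.get("refresh") === "true";
  const key = `/project/${projectId}/${branch}/raw/${path}`;

  if (refresh) {
    const content = github.get(path); // stands in for a GitHub API call
    if (content !== undefined) r2.set(key, content);
    // A file deleted on GitHub is NOT removed from R2 here, and
    // files.json is not touched: these are the two ISSUES above.
  }
  return r2.has(key) ? r2.get(key) : null;
}
```

A request without the query string keeps serving whatever is already in R2; only a request with ?refresh=true pays the cost of a GitHub round trip.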

Pseudo-code

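import contentStore from "lib/db.js"

// get list of files
const fileTree = await getFileTree(repo, branch)
for (const file of fileTree) {
  await contentStore.copyFileFromGithubToR2(project, branch, file)
}
await writeFileTreeToR2(project, branch, fileTree)

// lib/db.js

// createContentStore is a placeholder name for whatever wraps the R2 client
const contentStore = createContentStore(r2ConfigInfo)

contentStore.copyFileFromGithubToR2 = async function (project, branch, file) {
  // get content for file from GitHub
  const fileContent = await githubStore.getFileContent(project.user.token, file.path, project.repo.url)
  // key follows the bucket layout: /project/{project-id}/{branch}/raw/{path}
  const destPath = `/project/${project.id}/${branch}/raw/${file.path}`
  await r2.store(destPath, fileContent)
}

export default contentStore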
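
For a version that actually runs, here is a rough sketch of the same copy loop against in-memory stand-ins. `githubStore`, `r2Store`, their method names, and `copyRepoToR2` are all assumptions made for illustration; the real stores would call the GitHub API and R2.

```javascript
// Hypothetical in-memory stand-ins so the sketch runs without network access.
const githubStore = {
  files: { "README.md": "# Demo", "data/a.csv": "x,y\n1,2\n" },
  async getFileTree() {
    return Object.keys(this.files).map((path) => ({ path }));
  },
  async getFileContent(path) {
    return this.files[path];
  },
};

const r2Store = {
  objects: new Map(),
  async put(key, body) {
    this.objects.set(key, body);
  },
};

// Copy every file in the tree into R2, then write the tree listing itself.
async function copyRepoToR2(projectId, branch) {
  const prefix = `/project/${projectId}/${branch}`;
  const fileTree = await githubStore.getFileTree();
  for (const file of fileTree) {
    const content = await githubStore.getFileContent(file.path);
    await r2Store.put(`${prefix}/raw/${file.path}`, content);
  }
  // Written last, so files.json only ever lists files that were copied.
  await r2Store.put(`${prefix}/files.json`, JSON.stringify(fileTree));
  return fileTree.length;
}
```

The loop is deliberately sequential. That keeps it simple, but it is also why a large repo could run into the timeout risk noted under Risks.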

In our render code:

// get a file
const fileTree = await getFileTree()
const fileContent = await getRawFileContents()  // now retrieves from R2

return Page(fileTree, fileContent)

R2 has one bucket, e.g. rawstore.datahub.io, with this rough layout:

/project/{project-id}/{branch}/raw/{path}
/project/{project-id}/{branch}/files.json # github tree list
/project/{project-id}/{branch}/markdowndb.json
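
One way to keep this layout in a single place is a few small key-builder helpers. The function names below are hypothetical; the paths follow the layout above.

```javascript
// Hypothetical helpers that build object keys for the layout above.
function projectPrefix(projectId, branch) {
  return `/project/${projectId}/${branch}`;
}

function rawKey(projectId, branch, path) {
  // Strip leading slashes so "/docs/a.md" and "docs/a.md" map to the same key.
  const cleanPath = path.replace(/^\/+/, "");
  return `${projectPrefix(projectId, branch)}/raw/${cleanPath}`;
}

function fileTreeKey(projectId, branch) {
  return `${projectPrefix(projectId, branch)}/files.json`;
}

function markdownDbKey(projectId, branch) {
  return `${projectPrefix(projectId, branch)}/markdowndb.json`;
}
```

For example, `rawKey("p1", "main", "/docs/a.md")` gives `/project/p1/main/raw/docs/a.md`. Git branch names can contain slashes (e.g. `feature/x`), which this sketch does not escape.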

Two "store" objects

  • GithubStore - largely read-only
  • R2Store - read/write

Idea

  • Use a GitHub app to sync someone's repo to R2 on every push. Can be intelligent at some point and just sync the changes. (At the start we can just re-copy everything)
  • Then we can run whatever processes we want against R2, e.g. build markdowndb, whatever
  • Frontend app just runs off R2, i.e. we use R2 as our API for now 😉
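
A rough sketch of what "just sync the changes" could look like: compare the previously stored tree listing with the new one and only touch what differs. It assumes each tree entry has a `path` and a `sha`, as entries from the GitHub tree API do; `diffFileTrees` is a hypothetical name.

```javascript
// Compare two tree listings and work out which paths to re-copy and
// which to delete from R2. Entries are assumed to look like { path, sha }.
function diffFileTrees(oldTree, newTree) {
  const oldShas = new Map(oldTree.map((f) => [f.path, f.sha]));
  const newPaths = new Set(newTree.map((f) => f.path));

  // New or changed: the path is missing from the old tree, or its sha differs.
  const toCopy = newTree
    .filter((f) => oldShas.get(f.path) !== f.sha)
    .map((f) => f.path);

  // Deleted: the path was in the old tree but is gone from the new one.
  const toDelete = oldTree
    .filter((f) => !newPaths.has(f.path))
    .map((f) => f.path);

  return { toCopy, toDelete };
}
```

Because files.json already holds the last synced tree, the old listing can be read from R2 rather than kept anywhere else.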

Design

  • GitHub App that listens for changes. A GitHub App apparently receives all webhooks
  • Need a worker that receives the webhook requests and uses the relevant token to pull content onto R2
  • Boom, we're done
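
As a sketch of the worker's first step, a GitHub push webhook payload carries `ref`, `repository.full_name`, and a `commits` array whose entries list `added`, `modified`, and `removed` paths. The function below (a hypothetical name) collapses those into one set of R2 operations. It assumes the commits are listed oldest first and that the push is to a branch.

```javascript
// Turn a GitHub push event payload into the list of paths to copy to R2
// and the list of paths to delete from R2.
function changesFromPushEvent(payload) {
  const branch = payload.ref.replace(/^refs\/heads\//, "");

  // path -> "copy" | "delete"; a later commit overrides an earlier one.
  const actions = new Map();
  for (const commit of payload.commits) {
    for (const path of [...commit.added, ...commit.modified]) {
      actions.set(path, "copy");
    }
    for (const path of commit.removed) {
      actions.set(path, "delete");
    }
  }

  const toCopy = [];
  const toDelete = [];
  for (const [path, action] of actions) {
    (action === "copy" ? toCopy : toDelete).push(path);
  }
  return { repo: payload.repository.full_name, branch, toCopy, toDelete };
}
```

The worker would then fetch each `toCopy` path with the installation's token and write it to R2, remove each `toDelete` key, and rewrite files.json.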

Now I know there is a chunk of work, so this is not for next week … but this is pretty nice for us.

Qus

Risks / Rabbit holes

  • Copying stuff to R2 at the start could time out in workers or on Vercel. For now we will assume the repo is small and that we can copy its files quickly
    • Potential solution is to use queues, but this complicates the solution significantly.
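
If we did go the queue route, the core of it could be as small as splitting the file tree into batches, with each batch becoming one queue message that a worker copies within its time limit. This is only a sketch: `toBatches` and the batch size are assumptions, and the queue itself is left out.

```javascript
// Split a file tree into fixed-size batches; each batch would be sent as
// one queue message and copied by a single worker invocation.
function toBatches(fileTree, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < fileTree.length; i += batchSize) {
    batches.push(fileTree.slice(i, i + batchSize));
  }
  return batches;
}
```

The extra complexity comes from what surrounds this: knowing when all batches have finished before writing files.json, and retrying a failed batch.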

No gos

© 2024 All rights reserved
