Ola Rubaj
Rufus Pollock

Situation

  • We currently use GitHub directly as our backend for data, content and metadata.

Problem

  • That means every single page view results in (possibly multiple) requests to the GitHub API
  • GitHub's API rate limits are:
    • 60 requests / hour / visitor IP for unauthenticated requests,
    • 5k requests / hour for authenticated requests.
  • Thus, with increased traffic, DataHub Cloud risks hitting GitHub's API rate limits, which would affect service availability and user experience.

Appetite

Not yet specified, but given the complexity of the problem, a multi-sprint effort is probably required.

Solution

We need a caching strategy that balances content freshness with API usage efficiency.

Iteration 1 (Immediate fix)

  • Switch to authenticated requests using user access tokens as an immediate improvement, raising rate limits from 60/hour/visitor IP to 5,000/hour/user (site creator); see the sketch below.
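A minimal sketch of what this could look like, assuming the site creator's OAuth token is already stored somewhere we can look it up (getSiteOwnerToken and the fallback env var below are hypothetical):

```ts
// Hypothetical lookup of the site creator's stored OAuth token; replace with
// however tokens are actually persisted in the app.
async function getSiteOwnerToken(owner: string): Promise<string> {
  return process.env.GITHUB_FALLBACK_TOKEN ?? "";
}

// Fetch a file from a user's repo with an authenticated request, which raises
// the rate limit from 60/hour/visitor IP to 5,000/hour/user.
export async function fetchRepoFile(owner: string, repo: string, path: string) {
  const token = await getSiteOwnerToken(owner);
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/contents/${path}`,
    {
      headers: {
        Accept: "application/vnd.github.raw+json",
        Authorization: `Bearer ${token}`,
      },
    }
  );
  if (!res.ok) throw new Error(`GitHub API responded with ${res.status}`);
  return res.text();
}
```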

Iteration 2 (Core caching implementation)

  • Time-based cache revalidation with conditional requests as the base revalidation mechanism (see the sketch below).
  • Switch to ISR for a better user experience.
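A minimal sketch of the time-based part, assuming a 1-hour TTL and a repo:owner/name tag scheme (both placeholders, not decided values); the tag is what later lets webhooks invalidate these entries in Iteration 3, and conditional requests with ETags are sketched separately in Appendix 4:

```ts
// Fetch helper whose result is stored in the Next.js Data Cache.
export async function getRepoFile(owner: string, repo: string, path: string) {
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/contents/${path}`,
    {
      headers: { Accept: "application/vnd.github.raw+json" },
      next: {
        revalidate: 3600,                // time-based: re-fetch at most once per hour
        tags: [`repo:${owner}/${repo}`], // enables on-demand invalidation later
      },
    }
  );
  if (!res.ok) throw new Error(`GitHub API responded with ${res.status}`);
  return res.text();
}
```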

(Shaping and integration of the Content Store before moving to the next iteration.)

Iteration 3 (Instant On-Change Cache invalidation with Webhooks & Content Store integration)

  • Update the Content Store (mddb, content index) dynamically anytime new content is fetched from GitHub (i.e. anytime we revalidate the Data Cache).
  • Integrate GitHub Webhooks for immediate invalidation upon repo changes, ensuring page content and Content Store freshness (see the handler sketch below).
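A minimal sketch of the webhook handler as a Next.js route handler; the route path, tag scheme, verifySignature and updateContentIndex helpers are all placeholders:

```ts
// app/api/github-webhook/route.ts
import { revalidateTag } from "next/cache";
import { NextResponse } from "next/server";

// Hypothetical helpers, stubbed for the sketch: HMAC check of the
// X-Hub-Signature-256 header and the Content Store (mddb/content index) update.
async function verifySignature(payload: string, signature: string | null) {
  return signature !== null;
}
async function updateContentIndex(repoFullName: string): Promise<void> {}

export async function POST(req: Request) {
  const payload = await req.text();

  if (!(await verifySignature(payload, req.headers.get("x-hub-signature-256")))) {
    return NextResponse.json({ error: "invalid signature" }, { status: 401 });
  }

  const event = JSON.parse(payload);
  const repoFullName: string = event.repository?.full_name ?? "";

  // Invalidate every cached fetch tagged with this repo; the affected pages
  // are rebuilt on their next request.
  revalidateTag(`repo:${repoFullName}`);

  // Keep the Content Store in sync with the newly pushed content.
  await updateContentIndex(repoFullName);

  return NextResponse.json({ ok: true });
}
```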

Rabbit holes

  • Can we even use a user's access token to fetch content not for them, but for their site's visitors?
  • Over-reliance on webhooks for cache invalidation could lead to missed updates if webhook deliveries fail. We would probably use TTL as the fallback/base strategy. There also seems to be a GitHub endpoint we could poll to detect missed webhook deliveries (not yet investigated).
  • Long build times with ISR if we wanted to pre-build any user sites' pages: pre-building all user pages could lead to long build times. We could selectively pre-build only high-traffic pages or premium users' pages, but for now let's not pre-render any user pages at all and just render on demand. Also, without a proper Content Store from which we can get a list of a site's pages, we probably don't want to spend time on a hacky solution that fetches and parses GitHub trees.
  • Ensuring up-to-date automated navigation and catalog components: we need to invalidate any cached pages that list other pages, data files or other assets (or otherwise use their metadata). This may not be an issue at all, as ANY fetched data (both from external APIs and from our own) used for rendering such an index/catalog page will be cached, not only the page's own markdown content, and so will be subject to time-based revalidation after the TTL has expired.
  • Ensuring up-to-date data visualisations: same issue as above, i.e. how do we trigger cache invalidation (if we don't use Webhooks or if they fail) of a page with charts displaying data from a dataset that has just been updated? Again, maybe this is not an issue at all, as during the page build the fetched dataset will be cached in the Data Cache; if the dataset has changed, the fetch result cache will be revalidated, which will trigger a full page rebuild.
  • Content Store integration, index population and index updates. Should be shaped separately before we move to "Iteration 3"
  • What pages do we pre-build on initial build if we don't have a Content Store yet? We can just pre-build our own marketing pages, or not pre-build anything at all, or actually try fetching the GitHub trees of user site repositories and pre-build all md files found, or just the README.md files.
  • How does Cloudflare trigger site deploys anyway? And do we even want/need the same thing?
  • How do we handle a surge of repository change events in a webhook handler?
  • It seems the Data Cache cannot cache fetch results larger than 2 MB.
  • Maybe Webhooks would best be used only for Content Store updates and TTL for pages?

No gos

  • Pre-building user sites (or even only some of their pages) without having a proper Content Store. Let's not spend time working out a hacky solution that would require fetching GitHub trees (an additional request, by the way) to get a list of pages to pre-render.
  • GitHub Webhooks as the first invalidation method, before we even have a Content Store implemented and integrated, since that is where webhooks would be most beneficial.
  • GitHub Webhooks as the only invalidation method. Handling failed or never-sent webhook events is another rabbit hole we could easily fall into, and one that can be avoided by using time-based revalidation as a fallback.

Appendix 1: Alternative solutions

  1. A pure SSR approach with Data Cache only, without using ISR, for serving dynamic content directly on demand.

Appendix 2: Rufus's original sketch of want

We want to "shape" caching and cache invalidation strategies so that we avoid hitting GitHub API rate limits (Initial analysis done here https://github.com/datopian/datahub-next/issues/139)

Options:

  1. Webhooks with caching
  2. Polling with caching

My guess here is that we can resolve this with some combination of:

  • Building "static sites" (infinite cache if you like) and invalidating on a github repo commit
  • A content db/layer in between our frontend and github e.g. we build a combination of file index (e.g. from markdowndb) plus content cache that sits between our site and github

For the latter imagining something like …

Appendix 3: Research

Caching mechanisms available in Next.js App Router

Here are the caching mechanisms available in Next.js projects built with the App Router (as DataHub Cloud is). The two relevant ones here are the Data Cache and the Full Route Cache.

https://nextjs.org/docs/app/building-your-application/caching

Data Cache

  • persists the result of data fetches
  • persists them across server requests and deployments

https://nextjs.org/docs/app/building-your-application/caching#data-cache

Revalidation:

  1. Time-based Revalidation: Revalidate data after a certain amount of time has passed and a new request is made. This is useful for data that changes infrequently and freshness is not as critical.
  2. On-demand Revalidation: Revalidate data based on an event (e.g. form submission). On-demand revalidation can use a tag-based or path-based approach to revalidate groups of data at once. This is useful when you want to ensure the latest data is shown as soon as possible (e.g. when content from your headless CMS is updated).
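For reference, a compact illustration of the two revalidation styles as a sketch (the 60-second TTL, the acme/site repo and the tag/path values are placeholders):

```ts
import { revalidatePath, revalidateTag } from "next/cache";

// 1. Time-based: this fetch result is revalidated at most once per minute.
export async function getReadme() {
  const res = await fetch(
    "https://api.github.com/repos/acme/site/contents/README.md",
    { next: { revalidate: 60, tags: ["repo:acme/site"] } }
  );
  return res.text();
}

// 2. On-demand: called from an event handler (e.g. a webhook route or a
//    server action), invalidating by tag or by rendered path.
export function invalidateSite() {
  revalidateTag("repo:acme/site");
  revalidatePath("/@acme/site");
}
```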

Full Route Cache

  • persists built static pages
  • persists them across server requests BUT NOT across deployments
  • can be used for dynamic paths, but only with generateStaticParams, which returns the initial set of paths to build

Invalidation:

  1. Revalidating Data Cache
  2. Redeploying: Unlike the Data Cache, which persists across deployments, the Full Route Cache is cleared on new deployments.

Potential caching approaches

Option 1: Data Cache Only Scenario (Server-Side Rendering with only GitHub API data cached)

  1. A user request comes into the server for a specific page.
  2. The Next.js app checks the Data Cache (fetch results cache) for relevant data to build that page.
  3. If there's a cache hit, the Next.js app retrieves the data from the cache.
  4. The Next.js app then performs server-side rendering using the cached data to dynamically generate the HTML for the requested page.
  5. The generated HTML is sent back to the user's browser.
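A sketch of this flow as a dynamic route, assuming a catch-all route shape and a bare-bones renderer (the segment config that keeps the route out of the Full Route Cache is omitted here):

```tsx
// app/[owner]/[repo]/[...path]/page.tsx
export default async function Page({
  params,
}: {
  params: { owner: string; repo: string; path: string[] };
}) {
  // Steps 2-3: fetch() consults the Data Cache first; the GitHub API is only
  // hit on a cache miss or after the TTL has expired.
  const res = await fetch(
    `https://api.github.com/repos/${params.owner}/${params.repo}/contents/${params.path.join("/")}`,
    {
      headers: { Accept: "application/vnd.github.raw+json" },
      next: { revalidate: 3600, tags: [`repo:${params.owner}/${params.repo}`] },
    }
  );
  const markdown = await res.text();

  // Steps 4-5: server-side render the HTML and send it to the browser
  // (markdown-to-HTML conversion omitted in this sketch).
  return <pre>{markdown}</pre>;
}
```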

Option 2: Full Route Cache Scenario (ISR)

https://nextjs.org/docs/app/api-reference/functions/generate-static-params

In the Full Route Cache scenario, we would use generateStaticParams to generate static path segments, statically build pages for them, and cache them in the Full Route Cache.

  1. During the build process (e.g., when running next build), Next.js pre-renders pages as static HTML files.
  2. Detailed page paths and required data are determined through the generateStaticParams function for dynamic routes.
  3. Static HTML and JSON are generated and stored.
  4. During the build process, any fetch results are also cached in the Data Cache.
  5. When a user requests a page, Next.js serves the pre-built static file from the cache, bypassing the need for server rendering.
  6. If Data Cache is revalidated, the Full Route Cache is revalidated as well, triggering a rebuild of specific pages.
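A sketch of step 2, assuming the same catch-all route shape as the earlier sketch, a hypothetical listSitesFromDb helper, and the "only pre-build site home pages" variant mentioned in the comparison below:

```ts
// app/[owner]/[repo]/[...path]/page.tsx
export async function generateStaticParams() {
  // Hypothetical helper listing sites from our own database rather than
  // GitHub, so the build itself costs no GitHub API calls.
  const sites = await listSitesFromDb();

  // Pre-build only each site's home page; other pages are rendered on first
  // visit and then cached in the Full Route Cache.
  return sites.map((site) => ({
    owner: site.owner,
    repo: site.repo,
    path: ["README.md"],
  }));
}

// Stub for the sketch.
async function listSitesFromDb(): Promise<{ owner: string; repo: string }[]> {
  return [];
}
```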

Options comparison

Content Freshness:

Both strategies rely on the Data Cache for their source of truth and will serve the latest content as per the cache's state. However, the mechanics and wait time for initial content load are different:

  • SSR: always renders pages on demand, hence always using the latest data (within the TTL) from the Data Cache at request time and only ever serving pages with the latest content.
  • ISR: serves pre-built static pages, which might not always represent the very latest data until a revalidation (triggered by TTL or webhook), rebuild, and page refresh/revisit occurs.

GitHub API Rate Limits:

  • SSR: GitHub API calls occur only on cache miss or revalidation.
  • ISR: API calls are concentrated during the build/rebuild process, which may include many unnecessary API calls (even for pages that will possibly never be visited). Maybe there is a way to prevent Data Cache revalidations on (re)builds and use what's currently in the Data Cache? And maybe we could only return paths for sites' home pages, or only for premium user sites' pages?

Server Processing Times/Build Times:

  • SSR: There are no upfront build times since pages are rendered on demand. However, this could lead to increased server load per request, especially on cache miss or revalidation.
  • ISR: Pages are pre-built, leading to potentially extensive build times, especially for large numbers of pages. Server load per request is minimized since pre-built pages are served from the cache. I don't think we could do this for all user sites and pages. Maybe just pre-build home pages (index.md/README.md), or pre-build all pages but only for premium users?

User Experience:

  • SSR: The page visitor has to wait for the page to render before seeing it at all, i.e. no cached page version is returned (unless it's cached locally, of course). On cache miss or revalidation, users may experience slower response times due to on-demand rendering.
  • ISR: Users typically enjoy faster response times due to the serving of pre-built static pages, unless a new page build is taking place.

Scalability:

  • SSR: Generally scalable, as long as the infrastructure can handle the load. Scaling vertically or horizontally to accommodate more server-side rendering can address increased traffic.
  • ISR: Highly scalable for serving content due to the static nature of content delivery. Limits in scalability come from the build process and the ability to revalidate efficiently.

Cache invalidation: TTL-only vs. TTL + Webhooks hybrid

  • TTL alone might be sufficient when:
    • Data freshness is less critical, and a delay in reflecting updates is acceptable.
    • Reducing complexity and avoiding the infrastructure overhead of handling webhooks is a priority.
  • A combination of TTL and Webhooks might be better when:
    • Real-time updates are important, but we also want to ensure consistency.
    • We wish to minimize redundant API calls but also account for the possibility of missed webhook deliveries.

Maybe instant updates with webhooks could be included in the paid plan though? Or at least implemented second.

User sites indexing

The db should be updated each time we receive new content from GitHub, i.e. on each Data Cache revalidation, no matter how it was triggered (TTL or webhook).

Appendix 4: Initial research on GitHub Rate Limits

Situation

DataHub Cloud currently serves users' markdown content directly from their GitHub repositories. Update 2024-02-02: we've agreed to stick to this approach.

Complication

The key complication is the potential of hitting GitHub’s rate limits with an increase in traffic, as the application needs to fetch content from GitHub repositories frequently. This could affect service availability and user experience.

GitHub REST API rate limits:

  • 60 requests per hour for unauthenticated requests
    • it's per IP address
    • we're currently using unauthenticated requests for fetching user sites' content
  • 5,000 per hour for authenticated ones
    • it's per GitHub user, not per token (so other requests made by the user e.g. using their PAT count towards this limit as well)
    • We're currently making authenticated requests to fetch GitHub scopes and repositories in the project creation wizard.
  • 15,000 per hour for authenticated ones for OAuth apps owned or approved by a GitHub Enterprise Cloud organization
    • only applies if the user is a member of such an organization, so it's irrelevant here

Question

How can we run DataHub Cloud off GitHub without hitting GitHub API rate limits, which could potentially lead to some user sites being unavailable for periods of time?

Hypothesis

For the best scalability, I recommend server-side caching with a hybrid cache invalidation approach combining webhooks with a sensible time-based fallback and conditional requests:

  1. Use authenticated requests only, using user access tokens.
  2. Use webhooks to invalidate the cache as soon as changes occur in GitHub repositories. This ensures that content is updated rapidly. Implement this last. Maybe it could even be offered only in a paid plan? In any case, it's not crucial at the moment in my opinion.
  3. As a fallback, set a reasonable time-to-live (TTL) for cache entries. This way, if a webhook fails or we haven't received an update notification for some reason, the cache will still be refreshed after the TTL expires and a new request is made.
  4. On top of that, use conditional requests that leverage ETags or Last-Modified headers. These requests count against the GitHub API rate limit only when they result in actual data being served (see the sketch after this list).
  5. If for some reason the user API rate limit has been hit (e.g. user's massive GitHub API usage for own purposes), serve stale cached content.
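A sketch of point 4, using an If-None-Match header with a stored ETag; the in-memory map stands in for whatever store (Redis, DB) we would actually use. A 304 Not Modified response does not count against the rate limit:

```ts
// ETag + body cache, keyed by URL (in-memory stand-in for a real store).
const etagStore = new Map<string, { etag: string; body: string }>();

export async function fetchWithEtag(url: string, token: string) {
  const cached = etagStore.get(url);

  const res = await fetch(url, {
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/vnd.github.raw+json",
      ...(cached ? { "If-None-Match": cached.etag } : {}),
    },
    cache: "no-store", // caching is handled by the etagStore in this sketch
  });

  // 304: nothing changed, serve the cached body; this request was "free".
  if (res.status === 304 && cached) return cached.body;

  const body = await res.text();
  const etag = res.headers.get("etag");
  if (etag) etagStore.set(url, { etag, body });
  return body;
}
```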

More information on webhook-triggered and TTL cache invalidation in the options below.

Option 1: Webhook-Driven Invalidation

How It Works:
  1. At the time a user creates a site and links a GitHub repository, the app, using the admin:repo_hook scope granted through OAuth, programmatically sets up a webhook on that user's repository (see the sketch after this list).
  2. The webhook sends a POST request to a specified app endpoint whenever changes are pushed to the repository. The app's webhook handler invalidates the relevant cache entry to ensure the updated content is fetched on the next request.
  3. The application maintains a cache (in-memory, Redis, etc.) of recently fetched content. When a user accesses a specific URL, the app first checks the cache before reaching out to the GitHub API.
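A sketch of step 1, creating a push webhook through the GitHub REST API with the OAuth token granted by the site owner; the endpoint URL and the secret env var are placeholders:

```ts
export async function createRepoWebhook(
  owner: string,
  repo: string,
  token: string
) {
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/hooks`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: "application/vnd.github+json",
      },
      body: JSON.stringify({
        name: "web",
        events: ["push"],
        config: {
          url: "https://datahub.example/api/github-webhook", // placeholder endpoint
          content_type: "json",
          secret: process.env.GITHUB_WEBHOOK_SECRET, // used to verify deliveries
        },
      }),
    }
  );
  if (!res.ok) throw new Error(`Failed to create webhook: ${res.status}`);
  return res.json();
}
```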
Pros:
  • Real-time Updates: Immediately updates the cache when there's new content, ensuring users always see the latest information.
  • Efficiency: More efficient use of the GitHub API as it minimizes unnecessary requests.
Cons:
  • Complexity: More complex to set up since it requires handling incoming webhook requests and securing them.
  • Reliability: If the webhook fails or there's a delay in the notification, the cache might serve stale content until the issue is resolved.
  • Scalability: Managing and scaling webhooks can become challenging as the number of tenants and frequency of content updates grow.
Risks and mitigations:
  • High volume of webhook events: Implement a queuing system (sketched below) to process webhook events asynchronously, and scale the processing infrastructure if needed. Additionally, throttle webhook processing to prevent overloading the application.
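A minimal in-process sketch of that idea; a real deployment (especially on serverless infrastructure) would use a managed queue, and processRepoUpdate is a placeholder:

```ts
// Pending repos to refresh; duplicate events for the same repo collapse into
// one entry, which absorbs bursts of pushes.
const pending = new Set<string>();
let draining = false;

export function enqueueRepoUpdate(repoFullName: string) {
  pending.add(repoFullName);
  if (!draining) void drain();
}

async function drain() {
  draining = true;
  while (pending.size > 0) {
    const [repo] = pending;
    pending.delete(repo);
    await processRepoUpdate(repo); // e.g. revalidate tags, update the index
    await new Promise((r) => setTimeout(r, 500)); // crude throttle
  }
  draining = false;
}

// Stub for the sketch.
async function processRepoUpdate(repoFullName: string): Promise<void> {}
```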

Option 2: Time-based Invalidation (TTL)

How It Works:

The application relies on timed expiry based on the poll interval to determine when to fetch new content.

Pros:
  • Simplicity: Easy to implement as it doesn't require a complex infrastructure setup.
  • Predictability: Provides a predictable pattern for when the cache will be refreshed.
Cons:
  • Stale Content: Content may be stale up until the TTL expires.
  • Unnecessary API Calls: May lead to unnecessary API calls if content isn't changing frequently.
Risks and Mitigations
  • Intelligent Polling: Implement logic that only polls at intelligent intervals or in response to user activity to reduce unnecessary API calls and manage load.
  • Adjustable Frequency: Allow configuration for polling intervals based on user preference or repository activity levels to economize on API usage.
