Reliably create/sync sites from large repos

Summary

  • Situation: In DataHub Cloud, any repository, regardless of its size, can be used as the base for a new site.
  • Problem: Sites built from large repositories fail to create and sync correctly because the work exceeds Vercel's function timeout of 300s.
  • Solution: Use Inngest to orchestrate long-running functions in the background.
  • Appetite: 2-3d

Situation

  • In DataHub Cloud, any repository, regardless of its size, can be used as the base for a new site.
  • On site creation, DataHub Cloud traverses the entire GitHub repo tree, processes every item and copies its content to the content store (excluding a few unsupported files and extensions).
  • On site sync, DataHub Cloud updates the content store to reflect new, updated and deleted files.
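
To make the sync step concrete, here is a minimal sketch of how new, updated and deleted files could be derived by comparing the two trees. The `TreeEntry` shape, the `diffTrees` name and the use of a content hash (for example the GitHub blob SHA) are assumptions made for illustration, not the existing implementation.

```typescript
// Hypothetical shape of a tree entry: a file path plus a content hash.
interface TreeEntry {
  path: string;
  sha: string;
}

interface TreeDiff {
  added: string[];
  updated: string[];
  deleted: string[];
}

// Compare the repository tree with what is already in the content store.
function diffTrees(repoTree: TreeEntry[], storeTree: TreeEntry[]): TreeDiff {
  const stored = new Map<string, string>();
  for (const entry of storeTree) stored.set(entry.path, entry.sha);

  const diff: TreeDiff = { added: [], updated: [], deleted: [] };
  for (const entry of repoTree) {
    const storedSha = stored.get(entry.path);
    if (storedSha === undefined) diff.added.push(entry.path);
    else if (storedSha !== entry.sha) diff.updated.push(entry.path);
    // Anything still left in `stored` after the loop is gone from the repo.
    stored.delete(entry.path);
  }
  diff.deleted = Array.from(stored.keys());
  return diff;
}

// Made-up example data.
const exampleDiff = diffTrees(
  [
    { path: "README.md", sha: "a1" },
    { path: "data/new.csv", sha: "b2" },
    { path: "index.md", sha: "c3" },
  ],
  [
    { path: "README.md", sha: "a0" },
    { path: "old.md", sha: "d4" },
    { path: "index.md", sha: "c3" },
  ],
);
```

With this example data, `README.md` is reported as updated, `data/new.csv` as added and `old.md` as deleted, while the unchanged `index.md` is skipped.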

Problem

Sites built from large repositories fail to create and sync correctly. The work exceeds Vercel's function timeout of 300s, which is the highest value that can be set on the Pro plan. Vercel's limits by plan:

  • Pro: 15s (default) - configurable up to 300s
  • Enterprise: 15s (default) - configurable up to 900s

Appetite

1-2d

Solution

Process the work asynchronously with task queues, using Inngest for workflow orchestration. This lets us break the long-running repository sync operation into smaller, independent tasks that are queued and executed asynchronously (each triggered as a separate HTTP request), avoiding the Vercel timeout and improving overall performance.

The three most time-consuming TRPC endpoints are:

  • sync (specifically the processGitHubTree function inside it)
  • create (specifically the processGitHubTree function inside it)
  • delete

Solution steps:

  1. Break up each long-running TRPC endpoint into smaller logical parts.
  2. Define a function for each separate logical part.
  3. Create Inngest workflows that encapsulate and chain these functions together.
  4. Modify TRPC endpoints to trigger the Inngest workflows, making them run in the background.

Here is a draft of an Inngest workflow extracted from and triggered by the sync TRPC endpoint:
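
The draft itself is not reproduced in this copy of the document. As an illustration only (this is not the original draft), the sketch below shows the general shape such a workflow could take. The `Step` interface is a local stand-in that mirrors the `step.run(id, fn)` call an Inngest function handler receives. The event name, `SyncDeps`, `syncSiteWorkflow` and `triggerSync` are hypothetical names. In real code the handler would presumably be registered with `inngest.createFunction` and the endpoint would call `inngest.send`.

```typescript
// Local stand-in for the `step` object passed to an Inngest function handler.
interface Step {
  run<T>(id: string, fn: () => Promise<T>): Promise<T>;
}

// Hypothetical event sent by the sync endpoint.
interface SyncRequested {
  name: "site/sync.requested";
  data: { siteId: string };
}

// Hypothetical dependencies; real ones would call GitHub, the content store and the DB.
interface SyncDeps {
  fetchRepoPaths(siteId: string): Promise<string[]>;
  copyToStore(siteId: string, paths: string[]): Promise<number>;
  setStatus(siteId: string, value: string): Promise<void>;
}

// The workflow: every step.run call is a separate, short unit of work.
async function syncSiteWorkflow(event: SyncRequested, step: Step, deps: SyncDeps): Promise<number> {
  const { siteId } = event.data;
  await step.run("mark-in-progress", () => deps.setStatus(siteId, "IN_PROGRESS"));
  const paths = await step.run("fetch-repo-tree", () => deps.fetchRepoPaths(siteId));

  const batchSize = 2; // illustrative value only
  let copied = 0;
  for (let i = 0; i < paths.length; i += batchSize) {
    const batch = paths.slice(i, i + batchSize);
    copied += await step.run(`copy-batch-${i / batchSize}`, () => deps.copyToStore(siteId, batch));
  }

  await step.run("mark-success", () => deps.setStatus(siteId, "SUCCESS"));
  return copied;
}

// The endpoint only enqueues an event and returns immediately.
function triggerSync(siteId: string, send: (event: SyncRequested) => void): { queued: boolean } {
  send({ name: "site/sync.requested", data: { siteId } });
  return { queued: true };
}

// Demo wiring with in-memory fakes.
const statusLog: string[] = [];
const demoDeps: SyncDeps = {
  fetchRepoPaths: async () => ["a.md", "b.md", "c.md"],
  copyToStore: async (_siteId, paths) => paths.length,
  setStatus: async (_siteId, value) => {
    statusLog.push(value);
  },
};
const directStep: Step = {
  async run<T>(_id: string, fn: () => Promise<T>): Promise<T> {
    return fn();
  },
};
const eventQueue: SyncRequested[] = [];
triggerSync("site-1", (event) => {
  eventQueue.push(event);
});
// Drain the queue, as a background worker would.
const workflowDemo = Promise.all(eventQueue.map((event) => syncSiteWorkflow(event, directStep, demoDeps)));
```

The point of the split is that the TRPC endpoint returns as soon as the event is queued, while the copying happens in steps that each stay well under the timeout.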

Other considerations

Implementation questions/details

Should steps such as fetching the GitHub and content store trees be included in the workflow or handled by the TRPC method?

  • Including these steps in the workflow will ensure that the entire process is managed within the Inngest framework, allowing for better error handling and retries.
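
A rough sketch of why this helps with retries: if step results are remembered between attempts (which is how we understand Inngest's steps to behave), a failure in a later step does not force the earlier fetch to run again. `MemoizingSteps` and `runWithRetries` below are hypothetical in-memory stand-ins, not Inngest's API.

```typescript
// In-memory stand-in for a step runner that remembers finished steps.
class MemoizingSteps {
  private results = new Map<string, unknown>();
  readonly executions: string[] = [];

  async run<T>(id: string, fn: () => Promise<T>): Promise<T> {
    // A step that already succeeded returns its saved result.
    if (this.results.has(id)) return this.results.get(id) as T;
    this.executions.push(id);
    const result = await fn();
    this.results.set(id, result);
    return result;
  }
}

// Re-run the whole workflow on failure; finished steps are skipped.
async function runWithRetries<T>(
  workflow: (steps: MemoizingSteps) => Promise<T>,
  maxAttempts: number,
): Promise<{ value: T; executions: string[] }> {
  const steps = new MemoizingSteps();
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = await workflow(steps);
      return { value, executions: steps.executions };
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Demo: the content store call fails once, then succeeds.
let storeCalls = 0;
const retryDemo = runWithRetries(async (steps) => {
  const repoTree = await steps.run("fetch-github-tree", async () => ["a.md", "b.md"]);
  const storeTree = await steps.run("fetch-content-store-tree", async () => {
    storeCalls += 1;
    if (storeCalls === 1) throw new Error("transient content store error");
    return ["a.md"];
  });
  return repoTree.length - storeTree.length;
}, 3);
```

In the demo the GitHub tree is fetched once, while the content store fetch runs twice (one failure, one success).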

How can the sync status be communicated to users?

  • The database will be updated with the current sync status at each step. The UI will periodically check this status to provide real-time feedback to users (as it does now to check if the site is outdated).
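
One possible shape for that status record, as a sketch. The status values, the field names and the in-memory `siteSyncStates` map (standing in for a database table) are all assumptions.

```typescript
// Hypothetical set of sync states.
type SyncStatus = "PENDING" | "IN_PROGRESS" | "SUCCESS" | "ERROR";

interface SyncState {
  status: SyncStatus;
  step: string | null; // last workflow step that reported in
  error: string | null; // error message when status is "ERROR"
  updatedAt: number; // epoch milliseconds
}

// In-memory stand-in for the database table holding one state per site.
const siteSyncStates = new Map<string, SyncState>();

// Called by the workflow at each step.
function recordSyncState(
  siteId: string,
  syncStatus: SyncStatus,
  step: string | null = null,
  error: string | null = null,
): SyncState {
  const state: SyncState = { status: syncStatus, step, error, updatedAt: Date.now() };
  siteSyncStates.set(siteId, state);
  return state;
}

// Called by the endpoint that the UI polls.
function getSyncState(siteId: string): SyncState {
  return siteSyncStates.get(siteId) ?? { status: "PENDING", step: null, error: null, updatedAt: 0 };
}

// The UI can stop polling once a terminal state is reached.
function isTerminal(syncStatus: SyncStatus): boolean {
  return syncStatus === "SUCCESS" || syncStatus === "ERROR";
}

recordSyncState("site-1", "IN_PROGRESS", "fetch-github-tree");
const midSyncState = getSyncState("site-1");
recordSyncState("site-1", "SUCCESS");
const finalSyncState = getSyncState("site-1");
```

Storing the last step name alongside the status would also give the UI something more specific to show than a plain "in progress" label.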

Should we process the GitHub tree in batches or one item at a time?

  • Processing the GitHub tree in batches is more efficient than processing each item one by one, especially when dealing with a large number of items. Batching helps to reduce the overhead associated with initiating many small tasks and can improve overall performance.
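
A minimal sketch of the batching itself. The `toBatches` helper and the batch size of 2 are illustrative only; a suitable size would need to be found by measurement.

```typescript
// Split a list of tree items into batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const treePaths = ["a.md", "b.md", "c.md", "d.md", "e.md"];
const pathBatches = toBatches(treePaths, 2);
// pathBatches is [["a.md", "b.md"], ["c.md", "d.md"], ["e.md"]]
```

Each batch could then be handed to one background task, so five files become three tasks instead of five.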

Inngest vs other tools

Your code all runs on Vercel - you keep the same repo, the same platform, and the same tooling. There is no need to set up another runtime for your functions and trust another platform to run your code (which likely has access to your database).

Inngest seems to be a great choice if you want a straightforward setup that keeps everything within the same codebase and repository, especially if you prefer avoiding additional platforms. It’s particularly useful for simpler use cases where ease of integration and quick setup are prioritised, like in our case. Alternatives like Temporal or BullMQ might be better suited when more advanced features or complex workflow management are needed. But those tools add extra setup and maintenance complexity.

(See Appendix B for a full comparison of Inngest with five potential alternatives.)

https://www.inngest.com/blog/vercel-long-running-background-functions
https://www.inngest.com/blog/nextjs-trpc-inngest

Rabbit holes

  • Progress bar showing X/Y processed files

No-goes

Appendix A: Original issue description

Atm can't sync/create sites built on large repos. Likely because times out in syncing …

➕ 2024-06-02: a sub-part (or separate but related item): showing progress on syncs. I just did a sync of a largish repo (life-itself/community) that succeeded but it took several minutes and i could easily have exited or similar and any other user would have had no idea what was going on … I think it would work a lot better after create to redirect to the new project page and say "sync in progress please wait …" (like we would do on any other sync …)

Example: https://github.com/datahubio/example-fivethirtyeight

This will fail to sync (too large in number of files). My guess is it times out …

../assets/Pasted image 20240607212740.png

Note the site may show as created in my DataHub Cloud account's dashboard but it will show as being outdated:

../assets/Pasted image 20240607212754.png

Appendix B: Inngest vs Other Tools

(Comparison prepared with help of ChatGPT)

1. Inngest

  • Description: Inngest is a platform for building serverless workflows and background jobs in your existing codebase.
  • Pros:
    • Same Repo Integration: Allows you to keep your code in the same repository.
    • Ease of Setup: Designed for quick setup and integration with existing frameworks like Next.js.
    • Event-Driven: Supports event-driven architecture which is ideal for background processing and workflows.
    • Simple API: Intuitive API for defining and running workflows.
  • Cons:
    • Vendor Lock-In: Depending on Inngest might mean relying on a specific service.
    • Features: Might lack some advanced features offered by more mature platforms like Temporal or Airflow.
  • Best For: Developers looking for an easy-to-setup, integrated solution that doesn’t require managing separate infrastructure.

2. BullMQ

  • Description: A job queue and message queue built on top of Redis.
  • Pros:
    • Performance: High performance and efficient job processing.
    • Feature-Rich: Includes job prioritization, scheduling, concurrency, and more.
    • Integration: Can be easily integrated with your Next.js app and works well with Redis.
  • Cons:
    • Redis Dependency: Requires a Redis server to be set up and maintained.
    • Manual Status Tracking: You’ll need to implement your own status tracking and updates.
  • Best For: Applications that need high-performance job processing and are already using or can easily use Redis.

3. Temporal

  • Description: A workflow orchestration engine that handles stateful applications.
  • Pros:
    • Complex Workflows: Can handle very complex workflows with ease.
    • Fault Tolerant: Highly fault-tolerant and scalable.
    • Language Support: Supports multiple languages.
  • Cons:
    • Complex Setup: More complex to set up compared to Inngest or BullMQ.
    • Separate Infrastructure: Requires running a separate Temporal server.
  • Best For: Applications with complex, stateful workflows requiring high reliability.

4. n8n.io

  • Description: An open-source workflow automation tool.
  • Pros:
    • Visual Workflow Editor: Provides a visual interface for designing workflows.
    • Integration: Integrates with a wide variety of services.
    • Self-Hosted Option: Can be self-hosted, giving you full control over the environment.
  • Cons:
    • Separate Platform: Adds another platform to your stack.
    • Performance: May not be as performant for very high-scale applications.
  • Best For: Users looking for a visual interface for workflow automation and extensive integration options.

5. RabbitMQ with NestJS

  • Description: A message broker with powerful routing capabilities, integrated with a modern framework like NestJS.
  • Pros:
    • Messaging: Excellent for handling messaging patterns and complex routing.
    • Integration: Can be tightly integrated with your app for more complex job processing.
    • Community Support: Robust community and support.
  • Cons:
    • Complex Setup: Requires setting up and maintaining RabbitMQ server.
    • Learning Curve: Steeper learning curve due to the messaging patterns.
  • Best For: Applications requiring complex messaging and job processing patterns.

6. AWS Step Functions

  • Description: Managed service for orchestrating workflows using AWS.
  • Pros:
    • Managed Service: Fully managed, reducing operational overhead.
    • AWS Integration: Integrates seamlessly with other AWS services.
    • Visual Workflow: Provides a visual interface for designing workflows.
  • Cons:
    • AWS Dependency: Ties you into the AWS ecosystem.
    • Cost: Can become expensive depending on the volume and complexity of workflows.
  • Best For: Applications already using AWS services extensively and needing managed workflow orchestration.

© 2024 All rights reserved
