Appendix: Cloudflare materials
https://developers.cloudflare.com/reference-architecture/diagrams/serverless/serverless-etl/ (archived copy)
This diagram matches our architecture and needs almost exactly.
Question tree
- What are examples of the kind of processing we'd want to do in Flowershow/DataHub?
- What advantages / disadvantages does using Cloudflare have over our existing approach using e.g. inngest?
- What would be the architecture we would use?
- What actual code examples / demos are there we can draw on?
The Concept
- Data Anywhere with Pipelines, Event Notifications, and Workflows - original CF post announcing workflows in April 2024
Our needs
- Copy files from github into R2 (currently handled by inngest)
- Build a metastore, i.e. an index of those files, likely with additional metadata
- Build other things e.g. full text search
Consumption needs
- Get me a list of files that are blog posts
- Get me for those blog posts their title, description and image
- Get me pages that match these text search criteria provided by a user
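The consumption needs above can be sketched as functions over a hypothetical metastore index. Everything here (the `FileMeta` shape, the `type` values, a naive substring match standing in for real full-text search) is an assumption for illustration, not an existing API:

```typescript
// Hypothetical shape of one metastore entry; field names are assumptions.
interface FileMeta {
  path: string;
  type: string;            // e.g. "blog-post", "page", "dataset"
  title?: string;
  description?: string;
  image?: string;
  body?: string;           // raw text, stand-in for a real FTS index
}

// "Get me a list of files that are blog posts"
function listBlogPosts(index: FileMeta[]): FileMeta[] {
  return index.filter((f) => f.type === "blog-post");
}

// "Get me for those blog posts their title, description and image"
function blogPostCards(index: FileMeta[]) {
  return listBlogPosts(index).map(({ title, description, image }) => ({
    title,
    description,
    image,
  }));
}

// "Get me pages that match these text search criteria" (naive substring
// match here; a real implementation would use a proper FTS backend).
function searchPages(index: FileMeta[], query: string): FileMeta[] {
  const q = query.toLowerCase();
  return index.filter((f) => (f.body ?? "").toLowerCase().includes(q));
}
```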
How to actually do this
Roughly we need …
- A workflow to get stuff into R2
- Workflow(s) once in R2 to do processing
How do we trigger workflows from events in R2?
- Reference: https://developers.cloudflare.com/workflows/get-started/guide/ (example of setting up a workflow. Note the trigger here is via an HTTP API provided by a Worker)
- How to trigger workflows: https://developers.cloudflare.com/workflows/build/trigger-workflows/
- My understanding is that for our kind of use case you need: R2 event => CF Queue => CF Worker that is the queue consumer => Workflow
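That chain could look roughly like the sketch below. The R2 event field names follow the event notification docs but should be double-checked, and `ETL_WORKFLOW` is an assumed Workflow binding name, not an existing one:

```typescript
// Shape of an R2 event notification message as delivered to a queue
// consumer (per CF docs; treat the exact fields as an assumption).
interface R2Event {
  account: string;
  action: string;              // e.g. "PutObject", "DeleteObject"
  bucket: string;
  object: { key: string; size?: number; eTag?: string };
  eventTime: string;
}

// Pure helper: turn an event into workflow params, skipping deletes.
function eventToWorkflowParams(ev: R2Event): { key: string } | null {
  if (!ev.action.startsWith("Put") && !ev.action.startsWith("Copy")) {
    return null;
  }
  return { key: ev.object.key };
}

// Queue-consumer Worker that spawns one Workflow instance per uploaded
// object. `ETL_WORKFLOW` would be a Workflow binding in wrangler.toml.
export default {
  async queue(
    batch: { messages: { body: R2Event; ack(): void }[] },
    env: { ETL_WORKFLOW: { create(opts: { params: unknown }): Promise<unknown> } }
  ) {
    for (const msg of batch.messages) {
      const params = eventToWorkflowParams(msg.body);
      if (params) await env.ETL_WORKFLOW.create({ params });
      msg.ack(); // explicitly acknowledge so the message isn't retried
    }
  },
};
```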
FAQs
- Can you have multiple API endpoints in one worker? Yes. But it is a bit hacky … https://community.cloudflare.com/t/how-can-add-two-endpoints-apis-in-same-cloudflare-worker/200758
- To handle events from R2 do you have to use a queue? Yes, you have to create a queue and then consume events from that queue. See https://developers.cloudflare.com/r2/buckets/event-notifications/
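As a sketch of the wiring: the consumer Worker declares the queue in its own config, while the bucket-to-queue notification is created separately via wrangler. All names below are placeholders:

```toml
# wrangler.toml for the queue-consumer Worker (placeholder names).
name = "r2-event-consumer"
main = "src/index.ts"

# Consume R2 event notifications from this queue.
[[queues.consumers]]
queue = "r2-events"
```

The notification itself would then be created with something like `npx wrangler r2 bucket notification create <bucket> --event-type object-create --queue r2-events` (see the event notifications docs linked above for the exact command).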
Prompts for code design
Design the layout on R2
Create a cloudflare worker that syncs from github to R2
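A starting-point sketch for that second prompt, assuming a pull from GitHub's raw content endpoint, an R2 binding named `CONTENT_BUCKET`, and an `owner/repo/path` key layout (all three are assumptions, cf. "Design the layout on R2" above):

```typescript
// Pure helper: map a repo-relative path to an R2 key under a per-repo
// prefix. The layout is an assumption, not a settled design.
function r2KeyFor(owner: string, repo: string, path: string): string {
  return `${owner}/${repo}/${path}`;
}

// Worker that copies one file from GitHub's raw content endpoint into R2.
// `CONTENT_BUCKET` is an assumed R2 binding; error handling is minimal.
export default {
  async fetch(
    req: Request,
    env: { CONTENT_BUCKET: { put(key: string, body: unknown): Promise<unknown> } }
  ): Promise<Response> {
    const params = new URL(req.url).searchParams;
    const owner = params.get("owner") ?? "";
    const repo = params.get("repo") ?? "";
    const path = params.get("path") ?? "";
    const src = `https://raw.githubusercontent.com/${owner}/${repo}/HEAD/${path}`;
    const res = await fetch(src);
    if (!res.ok) {
      return new Response(`fetch failed: ${res.status}`, { status: 502 });
    }
    await env.CONTENT_BUCKET.put(r2KeyFor(owner, repo, path), res.body);
    return new Response("synced " + path);
  },
};
```

A real version would sync a whole tree (and handle deletes), which is where a Workflow with retryable steps fits better than a single fetch handler.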
Appendix: current materials
- Content sync: https://github.com/datopian/datahub-next?tab=readme-ov-file#content-synchronization-architecture
- NB: this currently includes file processing
- 2410-metadata-store has a bunch of good material on how sync etc. works.
What does the MetaStore look like?
This is the main question …
And how is it accessed …
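One candidate shape, as a sketch only: a row-per-object table (e.g. in D1) that the processing workflow writes and that consumers query instead of listing R2. Every table and column name here is an assumption:

```typescript
// Hypothetical MetaStore schema, one row per R2 object (all names are
// assumptions). Could live in D1 and be written by the processing Workflow.
const CREATE_METASTORE = `
  CREATE TABLE IF NOT EXISTS metastore (
    r2_key      TEXT PRIMARY KEY,  -- object key in R2
    project     TEXT NOT NULL,     -- e.g. "owner/repo"
    type        TEXT,              -- "blog-post", "page", "dataset", ...
    title       TEXT,
    description TEXT,
    image       TEXT,
    updated_at  TEXT               -- ISO timestamp from the R2 event
  );
`;

// Access pattern: consumers hit this index, never R2 directly.
const LIST_BLOG_POSTS = `
  SELECT r2_key, title, description, image
  FROM metastore
  WHERE project = ? AND type = 'blog-post'
  ORDER BY updated_at DESC;
`;
```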
Algorithm for home page when a dataset
See ../pitchs/2410-metadata-store#Appendix Current Architecture
Appendix: Cloudflare materials
- https://developers.cloudflare.com/r2/tutorials/upload-logs-event-notifications/ - set up logging events for r2 using queues
- Docs on event notifications https://developers.cloudflare.com/r2/buckets/event-notifications/