DataHub Cloud
UPDATE 2024-02-09: here's a short overview deck https://link.excalidraw.com/p/readonly/uuoQlgbF9aDuiXMrOMeI
Outline of product
DataHub Cloud turns GitHub repositories into elegant data-driven sites.
It enables you to seamlessly turn your existing GitHub datasets into visually appealing, interactive sites, deployed within seconds.
Publish datasets, data stories, and data portals with a few clicks, and share with others.
Features
DataHub Cloud has the following features:
- Enhanced Data Presentation: create an elegant page with beautiful tables and visualizations, directly from data stored on GitHub
- Effortless Sharing: share your datasets and data stories easily with anyone
- Seamless GitHub Integration: Directly leverage your GitHub repositories to feed data into DataHub Cloud
- Catalog Functionality: publish catalogs as well as individual datasets and data stories to help users quickly find the data they need
- Flexible Data Handling: Effortlessly handle all types of data (without the necessity for a datapackage.json)
- Guided User Experience: Access a suite of guides and walkthroughs for creating your datasets / data stories
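For context on the "without the necessity for a datapackage.json" point above: a datapackage.json is the small descriptor file that Frictionless-style tooling conventionally reads from a repo root to find a dataset's files and metadata. A minimal sketch of one (the dataset name and file path below are made up for illustration, not a required schema):

```python
import json

# A minimal Frictionless-style data package descriptor. Tools that do
# require a datapackage.json typically expect at least a name and a
# list of resources pointing at data files in the repo.
descriptor = {
    "name": "co2-ppm",                  # hypothetical dataset name
    "title": "CO2 concentration (ppm)",
    "resources": [
        {
            "name": "co2",
            "path": "data/co2.csv",     # hypothetical file in the repo
            "format": "csv",
        }
    ],
}

# Write it to the repo root as datapackage.json
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```

The "Flexible Data Handling" feature means DataHub Cloud can render a repo even when no such file exists, inferring the file listing instead.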
Imagined users
- Individuals who have a GitHub repository containing data
- Anu, Rufus, and other open data folks who want to share data, and people already sharing data on GitHub
- The open data crowd: people who want to publicly share data and data-driven insights
Data professionals and enthusiasts (scientists, researchers, analysts, journalists) using GitHub who seek to present their data more effectively and elegantly, and who also want to share their insights broadly with ease.
These data enthusiasts require a platform that not only showcases their data compellingly but also provides practical tools to manage and interact with large datasets, APIs, and catalogs.
Example persona: Anu building an open data catalog on github
Anuar is a data enthusiast who currently uses GitHub to present and share his data, creating a repo for each dataset (e.g. see https://github.com/open-data-kazakhstan). He is seeking a better solution for:
- collecting and organizing his datasets in a catalog for easier discovery and sharing (publicly or privately) (data lake need)
- creating views of his data, e.g. geospatial; having automatic previews/visualizations of the data and being able to easily embed a view or an image
- creating content (an article or a report) out of his datasets.
Rufus: the data curator/hoarder
Example of what Rufus does today: pastes stuff into GitHub issues. See e.g. K12 shooting database - https://github.com/datasets/awesome-data/issues/371
Or pastes stuff into GitHub discussions, e.g.:
- OpenCorporates.com: is it no longer open? https://github.com/orgs/datasets/discussions/386
- Has snow level been declining in the alps and will it affect winter sports like skiing? https://github.com/orgs/datasets/discussions/390
Or makes collections of data in github.com/datasets/awesome-data
What we are building for start of March
A stunning showcase page that will replicate the current datahub.io layout. Key focus:
- Showcases data beautifully - the emphasis will be on presenting data through beautiful tables and other components (eg. lists of data files) and possibly visualizations (depending on feasibility)
- Offers guidance and tools - users will be provided with guides/walkthroughs for creating datasets and/or data stories
BONUS
- (Maybe) Has a catalog feature designed to streamline the organization and discovery of datasets
- Handles random data efficiently, without the prerequisite of a datapackage.json
Update 2024-02-28
🎯 DataHub.io runs entirely on a single new DataHub Cloud including content e.g. (docs/blog/collections) and data (e.g. core datasets)
Context
Situation: we have achieved our basic goal of working cloud publishing flow 🎉
Complication: what do we build next? There's so much we could build, so how do we prioritize?
Hypothesis:
- Technical goal: Get the site on one system so we eat our dogfood
- Principle: Focus on one flow for now (publishing a dataset from GitHub) and keep optimizing it.
- Through that we will find new features we want to add
What we could build after
v2 Post-March
Still github-based but extended by adding the following features:
- Large File Support: Implementation of APIs or similar technologies to manage large datasets
- API: Offering an API for users to interact with their data programmatically, enabling more complex analyses and integrations
- Private Repos: Users can create private repositories where they can store their data securely
- Custom Domain Support: Users have the option to use their own custom domain name, allowing them to personalize their data portals
User flow
- Get started (homepage)
- Log in / Sign up
- Choose a repo
- Published or private page that is shareable with others
Appendix: The Editor Option
[still probably publishing to github]
User flow
- Get started (homepage)
- Log in / Sign up
- Editor
- Drag and drop a dataset (or skip)
- Add descriptions / notes / metadata
- Publish (or save as draft)
Appendix: Commentary on our product choices
- Build on GitHub: We are adopting an approach that leverages existing content and data stored in GitHub. We want to solve a pain or provide a gain that drives adoption beyond what people already have on GitHub.
- Simpler: We are pursuing a strategy focused on less features, exceptionally executed (e.g., prioritizing a beautifully crafted table over multiple mediocre visualization components)
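To make the "build on GitHub" choice concrete: with GitHub as the backend, any file in a public repo already resolves to an ordinary raw.githubusercontent.com URL, so a published dataset page can serve data without hosting a separate file store. A minimal sketch (the owner, repo, and path below are hypothetical examples):

```python
def raw_github_url(owner: str, repo: str, path: str, ref: str = "main") -> str:
    """Build the raw.githubusercontent.com URL for a file in a public
    GitHub repo. This URL pattern is GitHub's standard raw-content
    endpoint; the example arguments below are illustrative only.
    """
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"

# Hypothetical dataset file in an open-data repo
url = raw_github_url("open-data-kazakhstan", "dataset-covid-19", "data/covid-19.csv")
```

A published page could link tables and downloads straight to URLs like this, which is part of why the GitHub-backed flow needs no upload step of its own.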
Appendix with some info on also-rans
Other ideas we have not agreed on:
Data Viz & Management
- Automatic data preview (eg. geospatial)
- Automatic visualizations
- Automatic data file listing: catalog
- Embedded collections, eg. https://datahub.io/collections but able to embed my datasets in there
Platform Capabilities
- AI integration of tools that assist in writing and data analysis to generate content based on data
- Overview of the data/content available or navigation bar or side bar or similar
- Tool integration - easy integration with various tools and platforms
- Real-time collaboration - allow multiple users to edit / work / collaborate on a doc at the same time
- Built-in SEO optimization tools and analytics so that users can track the performance of their published sites, e.g. how visitors interact with the data
- Rate limiting?
Appendix: David Gasquez comments re DataPublish.dev idea Nov 2023
https://github.com/datopian/product/discussions/191#discussioncomment-7508184
- This is the option that resonates the most with me. A simple website where I can upload a file and get an endpoint and website from it.
- Initially, it could be as simple as that, although I can see multiple ways it could add more value:
  - Store the asset in multiple backends (R2, S3, GitHub Releases, …)
  - Automatically generate a `datapackage.yaml` from the README and data.
  - Related datasets…
  - API endpoints to add/update datasets
  - GitHub Actions to add/update datasets
  - Extra endpoints to explore the data, e.g. `datapublish.dev/user/dataset/explore` pointing to a flat-file table or a little Datasette instance.
- Would be great to see how this can be surfaced in other places.
  - Perhaps something like a GitHub badge for `datapublish.dev`?
- Could `datapublish.dev` be backed on GitHub?
  - Create a repo for each dataset.
  - Keep the repo's README up to date with static charts (a static version of the website).
  - Sync issues and discussions on GitHub.
  - I view this similar to how I view Cloudflare Pages or Netlify: you can interact with it without GitHub/git, but it is there and can be surfaced. I need to improve the analogy and think further on this path, as it is something I feel strongly about but am not very clear why.
- To recap, the user flow I have in mind is something like:
  - Upload a dataset
  - Write a small README. Can use Portal.js stuff.
  - Click publish and get a few links: one to the dataset (`datapublish.dev/user/dataset.csv`) and another to the rendered README (`datapublish.dev/user/dataset`).
  - All the other things are optional and can be done in the future.
- I'll write a follow-up with the architecture this project could use!
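David's "automatically generate a datapackage.yaml from the README and data" idea can be sketched roughly as header-plus-sample type inference over a CSV. The function name and the crude integer/number/string guessing below are illustrative assumptions, not an agreed design; a real implementation would inspect more than one row:

```python
import csv
import io

def infer_resource(csv_text: str, name: str) -> dict:
    """Infer a minimal resource descriptor (field names plus guessed
    types) from CSV content, as a rough sketch of auto-generating a
    data package descriptor from the data itself.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    sample = rows[1] if len(rows) > 1 else []

    def guess(value: str) -> str:
        # Very crude type guessing from a single sample value
        for type_name, parse in (("integer", int), ("number", float)):
            try:
                parse(value)
                return type_name
            except ValueError:
                pass
        return "string"

    fields = [
        {"name": col, "type": guess(sample[i]) if i < len(sample) else "string"}
        for i, col in enumerate(header)
    ]
    return {"name": name, "schema": {"fields": fields}}

resource = infer_resource("year,co2\n1960,316.91\n", "co2")
```

The resulting dict could then be serialized to `datapackage.yaml` and committed back to the repo, e.g. by one of the GitHub Actions mentioned above.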
References
- Product sync 2024-02-02 📺 https://drive.google.com/file/d/1pRynxHCMUxgOAA-oRoh1gnqlD4j351Dl/view - especially from minute
- DataHub Pages outline
Notes
2024-02-02
Key questions we had:
- Who would use Datahub Cloud in current state and what is their pain?
- Why would someone use Datahub Cloud? When and where in their flow would they use it?
At the moment:
- Dataset: You need to produce a dataset, add it to GitHub, add a datapackage.json, and publish it.
- We need to believe (for someone to use our product): you are able to do all this work but you don't know how to publish your data story. Is that really plausible? We found it plausible.
Our answer to this was:
2024-02-09 - Rufus
Almost identical to Pages from 2021/2022, with a sole focus (for now) on the cloud option.
=> we can reuse that content significantly.
🔑 Our HYPOTHESIS about the key VALUE ADD
Tagline:
- Publish data and data-driven stories with github as your backend.
- Turn data and data-driven content on GitHub into an elegant website