2022-02-14

Present: Leo, Rufus

Summary: clarified the product vision, drafted the feature list, and started on the architecture issue tree.

#todo/process

Agenda (with summary notes)

  • PRODUCT: Reviewed the Value Proposition for DataHub Pages (GDocs)
    • Clarified and refined some of the items esp the first one
    • 💡 AGREED: chose the first option in the list as our focus for the product: focus on "data literate" documents as the most natural publishing method to start with. You write a "README.md" (or any other markdown file) and it gets published.
  • ARCHITECTURE: analysed the technical architecture quite a bit …
    • Created an issue tree (see below in "On architecture")
    • Got insight about key limitations for the basic product e.g. file size
  • PRODUCT: hypothesis tree for the product. See below for a first pass.
  • PRODUCT: walked through a getting-started tutorial to hammer out how things work. See the Tutorial section below. One major insight, which reinforces the "write a README.md" approach:
    • Does it start with a README and then add data … OR does it start with a CSV and then add a README?
    • 💡🚩 START with the README (you can add links to the CSV - with that rendering in some nice way). Starting with text is the right way. AND allowing insertion of data links and for those to "render" would be amazing.
  • PRODUCT: feature analysis and summary. Largely done; see below.
  • PRODUCT: brainstormed initial "getting started path"
  • AOB

ACTION

  • ?? Summarize the value proposition analysis (archiving the background material so that we can return to it if we need to later …)
  • ?? Move "On architecture" stuff to On architecture and refine there …
  • 2022-02-14 @rufuspollock finish feature coggle
  • 2022-02-14 (?) Convert features coggle into spreadsheet so we can prioritize and track
  • 2022-02-14 @Leo: start investigating the data literate documents and the getting started path

On ../areas/pages

AGREED: focus on "data literate" documents as the most natural publishing method to start with.

You write a "README.md" (or any other markdown file) and it gets published.

Product Hypothesis

#todo/integrate

Hypothesis:

  • Target audience: people using GitHub to "publish / store" their data who want EITHER to quickly present that data better (i.e. data-oriented) OR to create data-literate documents (mixed content and data)
    • Their data is located on github
    • Their key pains are
      • you can't see the data (in e.g. a table)
      • you can't visualize the data
      • you can't weave text & metadata & data tables/visualizations into an overall document
      • you need to write your own visualization tools/algorithms
      • [extension: they can't extend from there into more complex apps …?]
  • Product offers: Publish your data with speed, ease and elegance. Turn a (README +) CSV into an explorable data table … and add graphs and more. Weave data and content together.
    • self-publishing via instructions and open-source components ("community edition")
    • Cloud service for 1-click publishing
  • Key features of self-service
    • Documentation of how to use, customize and deploy (on popular hosting providers)
    • Pre-built templates :+1:
    • Pre-built set of data presentation components (see the sketch after this list)
      • Data tables
      • Visualizations: plots, maps?, …
      • We've made the choices for you in terms of library, configuration etc
      • Open-source so you can extend
    • Data pre-processing (?) e.g. computing data summary
  • Key features of cloud
    • Publish on every push
    • Larger datasets
    • Private data
    • FUTURE: complex data processing in the build
  • The product is initially intentionally limited in key ways (features as well as bugs)
    • Small-medium data: KBs to MBs (under 25MB or so)
      • Why? Data rendering runs entirely in the browser, which allows for a simple, static-only system with interactivity
    • Markdown-based with web components, which requires a reasonably savvy audience
      • Why? Text-based language well supported by git(hub) and with extensions like MDX
    • Data formats are only CSV, JSON (plus maybe Excel + SQLite)
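
To make the "pre-built components" idea concrete, here is a minimal sketch of what a data table component could look like. This is an illustration only: the `DataTable` name, its `url` prop, and the use of PapaParse for in-browser CSV parsing are all assumptions, not a settled API.

```tsx
// Hypothetical pre-built <DataTable> component: fetches a CSV and renders it
// entirely in the browser (consistent with the static-only constraint above).
import { useEffect, useState } from 'react'
import Papa from 'papaparse'

export function DataTable({ url }: { url: string }) {
  const [rows, setRows] = useState<string[][]>([])

  useEffect(() => {
    // download + parse the CSV client-side; no API backend involved
    Papa.parse<string[]>(url, {
      download: true,
      complete: (result) => setRows(result.data),
    })
  }, [url])

  if (rows.length === 0) return <p>Loading data…</p>
  const [header, ...body] = rows
  return (
    <table>
      <thead>
        <tr>{header.map((h, i) => <th key={i}>{h}</th>)}</tr>
      </thead>
      <tbody>
        {body.map((row, i) => (
          <tr key={i}>{row.map((cell, j) => <td key={j}>{cell}</td>)}</tr>
        ))}
      </tbody>
    </table>
  )
}
```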

Feature tree

https://coggle.it/diagram/YgpxWTL-yUb82LfW/t/datahub-pages-feature-tree

Qu: What features are we building in what order such that … we address the key product needs …

Getting started path

#done/moved Getting Started

```bash
npx create-next-app@latest --example https://github.com/datopian/portal.js/examples/data-literate my-app
cd my-app

# create your markdown and edit a bit ...
vi content/README.md

# let's add a csv file ...
cp ~/mycsv.csv public

npm run dev
```
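
The file you end up editing might look something like this (a sketch only; the `<DataTable>` component and its `url` prop are the hypothetical component from the product hypothesis above, not a shipped API):

```mdx
# My cool dataset

Some context: where the data came from and why it matters ...

<DataTable url="/mycsv.csv" />
```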

On pages-architecture

  • What format do we write the file in? 🔑 Markdown
    • What base text format could we use? 🔑 Markdown, WYSIWYG (e.g. Notion, Google Docs …)
    • Which format do we use? 🔑 Markdown because … ubiquitous, raw text, used by our target audience …
    • How do we include data (tables), how do we include graphs etc? 🔑 Markdown processor
      • What are the options
        • 🔑 Pandoc (great at transforming between formats but lacking good integration with JS environments)
        • 🔑 MDX already extends markdown with the power of JS components.
      • Which do we choose? 🔑 MDX. MDX, or MDX-like tooling, already does what we want.
  • What data table presentation do we have?
    • What data table library can we use?
    • What are criteria?
    • What is our ranking?
  • What graphing library(s) should we use?
    • What graph options are there? 🔑 Vega, Vega-Lite, Plotly, …
    • What is our evaluation? 🔑 Keep it simple with what we know works well: Vega-Lite
  • What data formats do we want to be able to present? 🔑 CSV, XLSX, JSON, SQLite (in rough order)
  • What is the key functionality to support?
  • How does render system know what data there is?
    • What options are there?
      • auto-discovery
      • user-provided metadata (e.g. datapackage.json; see the discovery sketch after this tree)
        • user provided inline in the README etc
  • How does the render system have access to the data?
    • What do we render "server side" (or in the processing step) vs rendering in browser?
    • How does this work statically (or does it require an API)?
    • What data processing is done "server side"?
  • What is the technical architecture …?
  • Is it worth doing metadata extraction so we can x-connect datasets? 🔑 not for now - may be useful when we are moving towards the full hub later
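
As a sketch of the two data-discovery options above, assuming the build step runs in Node (function names are illustrative, not an existing API):

```ts
import { promises as fs } from 'fs'
import path from 'path'

// Option 1: auto-discovery. Scan a directory for data files by extension.
async function discoverDataFiles(dir: string): Promise<string[]> {
  const entries = await fs.readdir(dir)
  return entries.filter((f) => /\.(csv|json)$/i.test(f))
}

// Option 2: user-provided metadata. Read a datapackage.json if present.
async function loadDataPackage(dir: string): Promise<unknown | null> {
  try {
    const raw = await fs.readFile(path.join(dir, 'datapackage.json'), 'utf8')
    return JSON.parse(raw)
  } catch {
    return null // no metadata file: fall back to auto-discovery
  }
}
```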

How does the data table get rendered?

  • Frontend = the rendered HTML page or app (with JS in it) produced by DataHub Pages
  • Backend = anything else, but specifically static storage and any API

Issue tree

  • Where does the data come from? (i.e. where is it loaded from?)
    • 🔑 In a pure "SSG" setup data can only be static so has to be "rendered" at build time
    • 🔑 In a dynamic setup can use an API etc
  • Is the table rendered on the frontend or backend? 🔑 frontend b/c we need it to be interactive?
  • Do we render all of the data at once or only some of it?
    • Can we show all of the data (via pagination / scrolling) even if we don't render all of it at once? (see the pagination sketch after this tree)
  • How do we visually render large datasets? (in tables, in graphs)
    • What is our limit for tables
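
On the "render only some of it" question: a common pattern is to keep the full parsed dataset in memory but only render one page of rows at a time. A rough sketch (the page size is an arbitrary illustration):

```ts
// Paginate parsed rows client-side: all data stays in memory,
// but only one slice is rendered at a time.
function getPage<Row>(rows: Row[], page: number, pageSize = 100): Row[] {
  const start = page * pageSize
  return rows.slice(start, start + pageSize)
}

// Example: 1000 rows, 100 per page
const allRows = Array.from({ length: 1000 }, (_, i) => ({ id: i }))
console.log(getPage(allRows, 3).length) // 100 (rows 300-399)
```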

Static Backend vs Dynamic Backend

Note: the frontend is always statically rendered for now.¹

| Features | Static Backend | Dynamic "API" Backend |
| --- | --- | --- |
| Local data | ✅ | ✅ |
| Remote data | ❌ (could be done via pre-fetch, but then cached and may go stale) | ✅ (via CORS proxy) |
| Really large data | ??² | ✅ (via CORS proxy) |
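
On the "really large data" cell (and footnote 2): with a static backend that honours HTTP Range requests, the frontend could fetch just a preview slice of a big file. A sketch, assuming the host sets Accept-Ranges and permissive CORS headers (the URL is a placeholder):

```ts
// Fetch only the first 64 KB of a large CSV from static storage.
const res = await fetch('https://example.com/data/big.csv', {
  headers: { Range: 'bytes=0-65535' },
})

// 206 Partial Content means the range was honoured.
if (res.status === 206) {
  const chunk = await res.text()
  // Drop the last, possibly truncated, line before parsing.
  const preview = chunk.slice(0, chunk.lastIndexOf('\n'))
  console.log(`preview has ${preview.split('\n').length} lines`)
}
```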

Static setup (built once)

  • Frontend: HTML + JS generated during build
    • Rendering (of data) is dynamic
    • But branding etc is fixed (set at build time, for example)
  • Backend: data files (including derived ones) plus any computed "data / metadata" generated during build

Dynamic setup

  • Some data is stored somewhere fancier, e.g. a DB, which allows us to offload querying and other functionality to the backend

💡🚩🚀 CHOICE: we are focused on data files that are small to medium, e.g. KBs to MBs (and generally (well) under 20-25MB), that can be rendered in the browser. For bigger data we need a dynamic (API) backend and/or fancier processing in the build step.
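
One way the renderer could enforce this limit is a HEAD request before committing to an in-browser render. A sketch only: the constant mirrors the ~25MB choice above, and the fallback behaviour is illustrative:

```ts
const MAX_BYTES = 25 * 1024 * 1024 // the ~25MB ceiling chosen above

// Check the file size before deciding to render client-side.
async function canRenderInBrowser(url: string): Promise<boolean> {
  const head = await fetch(url, { method: 'HEAD' })
  const size = Number(head.headers.get('content-length') ?? 0)
  return size > 0 && size <= MAX_BYTES
}

// If too big: show a download link instead of an interactive table.
```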

How does a visualization get rendered?

Visualizations have an effective size limit (the rendered image size). Plotting a million points makes no sense for a time series that can only occupy, say, 1000px on screen, and it is also heavy on the user's processor.
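
A naive illustration of the point: if the chart can only show ~1000 horizontal pixels, downsample before handing data to the plotting library. Every-nth-point sampling is shown here; a real implementation would likely use proper aggregation such as min/max binning:

```ts
// Keep at most maxPoints rows by taking every nth point.
function downsample<T>(rows: T[], maxPoints = 1000): T[] {
  if (rows.length <= maxPoints) return rows
  const step = Math.ceil(rows.length / maxPoints)
  return rows.filter((_, i) => i % step === 0)
}

// A million-point series shrinks to ~1000 points before plotting.
console.log(downsample(Array.from({ length: 1_000_000 }, (_, i) => i)).length)
```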

Tutorial / Getting-Started Try-out to Walk Through the Experience

How does it start? Key meta-question …

Does it start with a README and then add data … OR does it start with a CSV and then add a README?

💡🚩 START with the README (you can add links to the CSV - with that rendering in some nice way). Starting with text is the right way. AND allowing insertion of data links and for those to "render" would be amazing.
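
One plausible way to get the "data links render nicely" behaviour in an MDX pipeline is a remark plugin that rewrites bare CSV links into the data table component. A sketch only: the plugin name, the `DataTable` target, and the exact node shapes are assumptions layered on the mdast/MDX ecosystem:

```ts
import { visit } from 'unist-util-visit'

// Hypothetical remark plugin: replace links ending in .csv with a
// <DataTable url="..." /> MDX element so the linked data renders inline.
export function remarkCsvLinksToTables() {
  return (tree: any) => {
    visit(tree, 'link', (node: any, index: any, parent: any) => {
      if (parent == null || index == null) return
      if (!/\.csv$/i.test(node.url)) return
      parent.children[index] = {
        type: 'mdxJsxFlowElement', // node type from mdast-util-mdx-jsx
        name: 'DataTable',
        attributes: [{ type: 'mdxJsxAttribute', name: 'url', value: node.url }],
        children: [],
      }
    })
  }
}
```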

Why? What's our experience?

Leo

  • I sat down with some data I had previously analysed and I wanted to tell the story about this data so that the people reading it understand what I'm thinking and how and why I'd made these choices
    • this would be the flow when I'm writing a research paper or blog post
    • intro: this is the data, this is where it came from
    • then: samples from the data that you show
    • then: more explanation
    • => some text with some data and some plots interpolated
  • I will sit down with some data I've just downloaded about a population and I will make an analysis and share some ideas that come to my mind
    • I will put the CSV in
    • make a first few plots (as if using a spreadsheet)
    • then: here are the results I obtained with this kind of processing
    • [Like Jupyter but without the complexity of going through the code]
    • [Later: I might come back and tidy up the analysis]
  • Rufus: [scratchpad case] Actually 2 cases:
    • scratchpad for some research (i.e. I'm exploring a question like "how much energy can solar produce?")
    • scratchpad for data archiving / curation, e.g. I've found this interesting dataset and I want to "archive it / write it up" ("add it to my potential reference library") e.g. https://github.com/datasets/awesome-data/issues/339 or https://github.com/datasets/awesome-data/issues/340
    • How these both proceed usually is that I am accumulating links, text or sometimes just raw data files …
      • Want to quickly dump links, notes and data files
        • Often quite a pain to capture key info about those data files.
      • 🚩 Surprisingly often it's not easy to quickly explore the data files (even just getting a look can be painful, e.g. having to open some Excel file, or it's a large CSV with many columns …). I often use less / grep or other tools. (used to use data cat ... but it's a bit unreliable now …)

Tutorial flow

Getting started with DataHub Pages …

Another story would be publishing a "dataset" with existing metadata (or where you want metadata)

  • Given a datapackage.json
  • Give me a nice rendered page

```md
---
datapackage.json metadata goes here
---

README type content goes here ...
```

Or even

```md
---
metadata_path: datapackage.json
---

markdown stuff goes here ...
```
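
Either variant is easy to pick apart at build time with a frontmatter parser. A sketch using the common gray-matter library (an assumed dependency; the `metadata_path` key is the proposal above, not an existing convention):

```ts
import matter from 'gray-matter'
import { promises as fs } from 'fs'

// Split the document into frontmatter (data) and markdown body (content).
const source = await fs.readFile('content/README.md', 'utf8')
const { data, content } = matter(source)

// Variant 1: the datapackage metadata is inlined in the frontmatter.
// Variant 2: the frontmatter only points at it via metadata_path.
const metadata = data.metadata_path
  ? JSON.parse(await fs.readFile(data.metadata_path, 'utf8'))
  : data
```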

Step by Step

1. Create a README.md (it starts out empty)

2. Launch the local server/app: it renders a blank page!

3. Add a title:

```md
# My cool dataset or data literate document
```

4. Show some data:

```md
# My cool dataset or data literate document

<DataTable url={dat-url?} />
```

5. Result: the rendered page shows the title plus an interactive table of the data.

Footnotes

  1. #aside NB: discussion to be had here long-term about whether we also want to make the frontend dynamically rendered so we can update branding etc; pure static rendering could be problematic

  2. Depends on whether you do fancy range-header stuff or do file splitting in the build step (but that's very complex …). Either way you need a static backend that supports range headers etc.
