Re Data Presentation in DataHub incl Literate etc

../people/rufuspollock

#todo/process

Summary

Data presentation in our apps can be organized in two major ways:

  • Direct load: raw data is loaded (and queried) by JS code in the frontend and then presented
  • Via Data API: raw data is first imported to data storage with a data API around it and then the frontend presentation accesses that API

See below for details.

Choosing whether we want to support one or both of these is a crucial implementation choice.

Key choices to make:

  • Do we go a "data api" only route? 🚧2023-02-15 initial KISS intuition is yes …
  • If we do, do we bother to cleanly "factor" out the views stuff e.g. to have a TableView which takes a well defined object like a Frictionless Resource? 🚧2023-02-15 probably not that much

Implications

  • If we go data API only we don't have support in e.g. flowershow (though it could be reasonably easy to add i guess at least for very simple case like csv?)

Common stuff

  • What table library do we use? ✅2023-02-15 react-table seems definitely best headless now. let's use it
  • What chart library do we use? 🚧2023-02-15 choose between vega, plotly and chartjs …?
  • Do we do mapping (at all)? 🚧2023-02-15 probably yes, at least a bit. it's quite easy.

Thoughts

  • even if we do data api only will want to be able to develop against data so suspect we want a simple way to load csv with a schema??
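To make the "load csv with a schema" idea concrete, here is a minimal sketch in TypeScript. It assumes a simple Frictionless-Table-Schema-like shape (field name + type); it does not handle quoted fields, escapes or streaming — illustration only, not existing portaljs/datahub code.

```typescript
// Minimal CSV-with-schema loader sketch. The Schema shape loosely
// follows Frictionless Table Schema (an assumption, not a full impl).
type FieldType = "string" | "number" | "boolean";

interface Field {
  name: string;
  type: FieldType;
}

interface Schema {
  fields: Field[];
}

// Coerce a raw CSV cell to the type declared in the schema.
function coerce(raw: string, type: FieldType): string | number | boolean {
  if (type === "number") return Number(raw);
  if (type === "boolean") return raw === "true";
  return raw;
}

// Parse a CSV string into typed row objects. Naive split on commas
// and newlines — no quoting support, dev/demo use only.
function loadCsv(csv: string, schema: Schema): Record<string, unknown>[] {
  const [header, ...lines] = csv.trim().split("\n");
  const columns = header.split(",");
  return lines.map((line) => {
    const cells = line.split(",");
    const row: Record<string, unknown> = {};
    schema.fields.forEach((field) => {
      const i = columns.indexOf(field.name);
      row[field.name] = coerce(cells[i], field.type);
    });
    return row;
  });
}
```

Something this small would already be enough to develop table/chart components against local files before a data API exists.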

Question tree

  • What do we need to build?
    • what have we already built? ✅2023-02-14 extensive answer in https://github.com/datopian/product/issues/51. summary: 2 remaining "live" items: a) explorer package in datahub-core that runs off graphql data api https://github.com/datopian/datahub-core/tree/main/packages/data-explorer-graphql b) portaljs (which is not very used)
      • data literate is a bit special as strictly not using portaljs component code but lives in that repo. See working demo of data literate here https://portaljs.org/data-literate/demo
        • where is the code for this? ✅2023-02-14 code is bespoke to site in fact with a handcrafted table view etc i.e. a bit hacky. lives in site/components/
        • is it worth salvaging? ✅2023-02-14 🤷‍♀️ not so clear. some stuff in it may be useful e.g. the xls code loading. however, probably nicer way to do the table e.g. using react-table etc.
        • did this get ported to next.datahub.io? 🚧2023-02-14 i'm pretty sure this had been ported to next.datahub.io but might have got lost in the flowershow upgrade
      • What's not working about it? 🚧2023-02-14 we don't have a chart or maps display. not sure about state of data importing into the data api.
        • What's there in terms of views and what's missing? 🚧2023-02-15 in explorer we have a table and a slightly hacky chart based on PlotlyChart in portal.js.
        • What's there in terms of importing and what's missing?
        • What's there in terms of query support and what's missing?
        • What's there in builders and what's missing?
    • What are the needs in terms of display?
      • Do we need tables? ✅2023-02-14 Yes
      • Do we need charts? ✅2023-02-14 Yes?
      • Do we need maps? 🚧2023-02-14 ❓
    • what are needs in terms of data sources? 🚧2023-02-15 anything we can load in data api. for data literate i'd say csv, xlsx, json
      • Do we need to import xlsx? ✅2023-02-14 almost certainly yes
    • Do we care about (pre)viewing data from disk e.g. previewing CSV files directly (or xlsx or whatever). Conversely, can we assume that data is first imported to a data API of some kind? 🚧2023-02-14 UNCLEAR ❗
      • 🚧2023-02-14 yes for data literate stuff (though even there in the build stuff we could convert to e.g. sqlite). but for … datahub not so much i think as we can count on always importing.
        • what about DataHub Open Data? doesn't that need previews of stuff?

Two ways data can flow into data presentation

2 ways raw data can flow into data presentation:

  • Direct load: raw data is loaded (and queried) by JS code in the frontend
  • Via Data API: raw data is first imported to data storage with a data API around it

Raw data is a data file (e.g. csv, xls, etc) on disk or online somewhere. We emphasize raw data because, of course, the frontend renderer will ultimately have to load data from somewhere in some format.
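The two flows could sit behind one frontend-facing contract, which is roughly what factoring out views would require. A hypothetical TypeScript sketch (all names here are mine, not existing code; the Data API protocol is stubbed out):

```typescript
// Hypothetical common contract for the two data flows.
interface DataSource {
  rows(): Promise<Record<string, string>[]>;
}

// Direct load: raw CSV is parsed by JS in the frontend itself.
// Naive parsing (no quoting) — sketch only.
class DirectCsvSource implements DataSource {
  constructor(private csv: string) {}
  async rows(): Promise<Record<string, string>[]> {
    const [header, ...lines] = this.csv.trim().split("\n");
    const cols = header.split(",");
    return lines.map((line) => {
      const cells = line.split(",");
      return Object.fromEntries(cols.map((c, i) => [c, cells[i]]));
    });
  }
}

// Via Data API: importing/typing/querying already happened in the
// backend; the frontend just fetches structured rows. The fetcher is
// injected here because the actual API protocol is an open question.
class DataApiSource implements DataSource {
  constructor(
    private endpoint: string,
    private fetcher: (url: string) => Promise<Record<string, string>[]>
  ) {}
  rows(): Promise<Record<string, string>[]> {
    return this.fetcher(this.endpoint);
  }
}
```

A table or chart view written against `DataSource` would not care which route the data took — that's the upside of factoring; the downside (per the note above) is the query layer rarely stays this cleanly abstracted.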

Direct load

Via Data API

The essence of the difference

The essence of the difference is whether it is frontend or backend that is responsible for:

  • Loader: importing, i.e. converting the raw data to a standardized, well structured form
  • Query: querying over data i.e. providing some way to filter etc

Note: for large data the Data API route is the only option, as the direct route simply isn't feasible (other than for a sample preview where you load only part of the file). e.g. loading a 1GB CSV into memory in your browser won't go well. However, handling a 1GB CSV in postgres is easy, so a Data API can work.
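The sample-preview escape hatch mentioned above is cheap to implement: take only the first N data rows and never materialise the rest. A sketch (in a real browser you'd fetch a byte range rather than hold the whole string, so treat this as illustrative):

```typescript
// Truncate a CSV string to its header plus the first maxRows data
// rows, scanning for newlines without parsing. Sketch of the
// "sample preview" fallback for large files.
function samplePreview(csv: string, maxRows: number): string {
  let dataRowsSeen = 0; // newlines after the header line
  let end = csv.length;
  for (let i = 0; i < csv.length; i++) {
    if (csv[i] === "\n") {
      dataRowsSeen++;
      // first newline ends the header, so allow maxRows + 1 newlines
      if (dataRowsSeen > maxRows) {
        end = i;
        break;
      }
    }
  }
  return csv.slice(0, end);
}
```

Feeding the truncated string into whatever parser the view uses gives a bounded-memory preview even when the underlying file is huge.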

Pros and cons of Data API (and direct load)

Pros:

  • data importing and prep can be done in specialist backend code
  • clean separation: frontend has clear guarantees about how to access data
  • large data is no problem: many highly developed backends for data ranging from MBs to PBs. frontend can interact with even very large datasets via querying.
  • data integration: merging or integrating datasets can be done in the backend

Cons: a data API has to be created and maintained. =>

  • no simple static apps (and more complexity in general)
  • if backend code fails the presentation fails
  • no live interaction with data (data has to be loaded first to API and then becomes live in frontend)

NB: from a technical/architecture perspective the issue is that a) the query components in the frontend are usually quite tied to the backend API and b) the query stuff is a large part of the value add (it's what makes something an "explorer" rather than just a sample view). Whilst a simple view is nice, what everyone ends up wanting is an explorer.

To summarize:

  • if you can go with a data API, do it
  • If you are handling largish data you have no choice other than a Data API
  • BUT: if you want something simple that works locally and is easy to publish especially statically then you want direct …

Comments:

  • if i were doing direct today i would probably try and get stuff into sqlite and then run against that (rather than try and build my own query layer). there are now some pretty cool ways to run sqlite in the browser.
    • you would then build the sqlite as part of the static build
    • OR: i would try alasql or something
  • for the Data API route i'd strongly examine just wholesale adopting some cloud provider solution off the shelf e.g. google bigquery is great.
  • also: extracting a small sample (e.g. first 500 rows), converting it to json and caching that separately from the data API can still support some simple direct load and provides a good fallback if the data api has an issue (or just to save queries, e.g. the default view just uses that until a user explicitly queries)
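The cached-sample fallback in the last bullet is essentially a try/catch around the data API call. A hedged sketch (the function names and the injected fetchers are assumptions, kept abstract since the actual API protocol is undecided):

```typescript
// Rows as produced by either the data API or the cached JSON sample.
type Rows = Record<string, unknown>[];

// Try the live data API first; on any failure, degrade gracefully to
// a pre-extracted sample (e.g. first 500 rows cached at build time).
// The flag lets the UI tell the user they're seeing a sample.
async function rowsWithFallback(
  queryApi: () => Promise<Rows>,
  cachedSample: () => Promise<Rows>
): Promise<{ rows: Rows; fromSample: boolean }> {
  try {
    return { rows: await queryApi(), fromSample: false };
  } catch {
    // Data API down or errored: serve the cached sample instead.
    return { rows: await cachedSample(), fromSample: true };
  }
}
```

The same function also covers the "save queries" variant: point the default view at `cachedSample` directly and only call `queryApi` once the user actually queries.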

Job Stories

Data Literate case

I have markdown + data (csv, json, xlsx, sqlite?) and want to put it online with the minimum of fuss and complexity so that others can see and use my work

  • I have a dataset i want to publish consisting of README plus a data file (or files) e.g. in CSV
  • I have a data-driven story where i want to display data and/or charts in my markdown doc
  • I have a data science analysis i quickly want to publish

On disk you have:

README.md
data.csv

In README i have something like:

# My Amazing Dataset

This is my awesome dataset / data driven analysis.

Here's the data

<DataTable src="mydata.csv" />

You can create a nice chart like this:

<LineChart config={} data="mydata.csv" />

And pfft in a puff of smoke this turns into a nice rendered page with a table and Line Chart. Here was the demo:

https://portaljs.org/data-literate/demo
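For the "puff of smoke" to work, the build step has to find those component tags in the markdown and wire them to the data files they reference. A minimal regex-based sketch (hypothetical — a real implementation would use an MDX/JSX parser, and the tag/attribute names are just those from the example above):

```typescript
// Scan markdown for self-closing data components like
// <DataTable src="mydata.csv" /> or <LineChart data="mydata.csv" />
// and collect the referenced files so the build can pre-load or
// convert them. Regex-based sketch only; breaks on nested/multiline JSX.
function findDataSources(
  markdown: string
): { component: string; src: string }[] {
  // capture the component name and the src= or data= attribute value
  const tag = /<(\w+)\s[^>]*(?:src|data)="([^"]+)"[^>]*\/>/g;
  const out: { component: string; src: string }[] = [];
  let m: RegExpExecArray | null;
  while ((m = tag.exec(markdown)) !== null) {
    out.push({ component: m[1], src: m[2] });
  }
  return out;
}
```

Given the list of referenced files, the build can then convert xlsx to csv, extract samples, or import to sqlite before rendering — whichever loading strategy the earlier sections settle on.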

Inbox
