Handle "random data files" (no datapackage etc)

For now, this is another "do nothing until we understand user needs better" item (including our own needs as users).

Summary

  • Appetite: 5d?
  • Situation: we only render the dataset layout properly when Frictionless Data Package metadata is present.
  • Problem: it's a hassle to add Frictionless metadata. Users want to just dump a data.csv and go.
  • Solution: do nothing for now, until we have a clearer idea of what is wanted. The benefit is not obvious, and implementing this involves us guessing at the desired behaviour quite a bit.
    • To get a sense of why, see the user journey walkthrough in the appendix. Our sense is that we are having to guess quite a lot about what people want …

Situation

Currently, to render a nice "dataset" showcase we need one of the following:

  • any markdown + Frictionless Data Package frontmatter field
  • any index.md/README.md + same level datapackage.{json/yaml/yml}

Data files that are not explicitly listed in the datapackage are ignored.
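As a rough illustration of the two conditions above, here is a minimal sketch. The function name, the `datapackage` frontmatter key, and the flat list of repository paths are all assumptions made for this example, not the actual implementation.

```python
from pathlib import PurePosixPath

INDEX_NAMES = {"README.md", "index.md"}
DATAPACKAGE_NAMES = {"datapackage.json", "datapackage.yaml", "datapackage.yml"}


def qualifies_for_dataset_showcase(page_path, frontmatter, tree_paths):
    """Return True if the page meets one of the two conditions listed above."""
    # Condition 1: any markdown file with a Data Package frontmatter field.
    # The key name "datapackage" is an assumption for this sketch.
    if "datapackage" in frontmatter:
        return True
    # Condition 2: README.md/index.md with a datapackage file at the same level.
    page = PurePosixPath(page_path)
    if page.name not in INDEX_NAMES:
        return False
    return any(
        PurePosixPath(p).parent == page.parent
        and PurePosixPath(p).name in DATAPACKAGE_NAMES
        for p in tree_paths
    )
```

Note that a data.csv sitting next to the README plays no part in this check, which is exactly the gap this note is about.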

What is in a Frictionless Data Package?

  • Very general metadata e.g. title, description
  • Bit more specialist metadata e.g. licenses and sources
  • Resources ❗
  • Views

Problem

A user has to create Frictionless metadata to get started, which is a PITA (a hassle, error-prone, etc.). People should only create this when they actually need it to get some kind of functionality.

Put differently: having to create a datapackage to get very basic, GitHub-like functionality (e.g. listing all data files, showing tabular views of them) is cumbersome.

All the more so because the information in the datapackage is not needed to support these features: basic metadata about the repository and its files can be inferred from information provided by the GitHub API.
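To make that concrete, here is a hedged sketch of inferring resource entries from GitHub tree items. It assumes items shaped like those returned by the GitHub Git Trees API (`path`, `type`, `size`). The function name and the set of extensions treated as "data files" are assumptions for illustration only.

```python
from pathlib import PurePosixPath

# Assumed set of "data file" extensions; the document does not define one.
DATA_EXTENSIONS = {".csv", ".tsv", ".xlsx"}


def infer_resources(tree_items):
    """Build Frictionless-style resource entries from GitHub tree items."""
    resources = []
    for item in tree_items:
        # Directories appear as type "tree"; only blobs are files.
        if item.get("type") != "blob":
            continue
        suffix = PurePosixPath(item["path"]).suffix.lower()
        if suffix not in DATA_EXTENSIONS:
            continue
        resources.append(
            {
                "path": item["path"],
                "format": suffix.lstrip("."),
                "bytes": item.get("size"),  # blob size as reported by the tree
            }
        )
    return resources
```

This covers path, format and bytes only. Anything richer, such as a Table Schema, would still need the file contents or hand-written metadata.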

What may be useful is the Table Schema in the resources field.

Solution

Enhance/extend current page metadata computation function so that it calculates datapackage fields:

  • (Option 1)
    • for each README.md/index.md
    • based on data files at same dir level in the GH tree OR
    • (maybe) based on data files from data dir specified in data frontmatter field
  • (Option 2)
    • for root README.md/index.md only
    • based on all data files found in the tree OR
    • (maybe) based on data files from data dir specified in data frontmatter field
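The difference between the two options is easiest to see side by side. The sketch below is illustrative only: the function name, the `option` and `data_dir` parameters, and the extension list are assumptions, and it further assumes that a data dir is resolved relative to the README and searched recursively.

```python
from pathlib import PurePosixPath

# Assumed set of "data file" extensions; the document does not define one.
DATA_EXTENSIONS = {".csv", ".tsv", ".xlsx"}


def select_data_files(readme_path, tree_paths, option, data_dir=None):
    """Pick the data files a README.md/index.md would pull in under each option."""
    readme = PurePosixPath(readme_path)
    is_root = readme.parent == PurePosixPath(".")
    # Option 2 only applies to the root README.md/index.md.
    if option == 2 and not is_root:
        return []

    def in_scope(path):
        if data_dir is not None:
            # "(maybe)" variant: files under the data dir from frontmatter.
            return readme.parent / data_dir in path.parents
        if option == 1:
            # Option 1: data files at the same directory level.
            return path.parent == readme.parent
        # Option 2: every data file found in the tree.
        return True

    return [
        p
        for p in tree_paths
        if PurePosixPath(p).suffix.lower() in DATA_EXTENSIONS
        and in_scope(PurePosixPath(p))
    ]
```

With a tree containing `data.csv` and `docs/extra.csv`, Option 1 gives the root README only `data.csv`, while Option 2 gives it both.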

Also, we need an option to opt in/out of auto-inferring datapackage, e.g.:

  • global, project-level opt-in/out toggle, and/or
  • local, README.md/index.md-level opt-in/out frontmatter field
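If both toggles existed, one plausible way to combine them is for the local setting to override the global one. The sketch below assumes exactly that, plus a hypothetical `inferDatapackage` key in both the project config and the frontmatter, and a default of "off". None of these choices are settled in this note.

```python
def should_infer_datapackage(project_config, frontmatter):
    """Decide whether to auto-infer a datapackage for a given page."""
    # Local README.md/index.md frontmatter wins when it is set explicitly
    # (assumed precedence; "inferDatapackage" is a hypothetical key name).
    local = frontmatter.get("inferDatapackage")
    if local is not None:
        return bool(local)
    # Otherwise fall back to the project-level toggle, off by default (assumed).
    return bool(project_config.get("inferDatapackage", False))
```

This lets a project opt in globally while a single page opts out, or the reverse.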

Notes

Current page metadata calculation flow:

if file is README.md or index.md:
  find same level datapackage.json/yaml/yml in GH tree
  fetch the datapackage from GH
  computePageMetadata(file, datapackage)

Proposed solution metadata calculation flow:

if file is README.md or index.md:
  find same level datapackage.json/yaml/yml in GH tree
  fetch the datapackage from GH
  +++
  find data files in GH tree at the same dir level
  +++
  computePageMetadata(file, datapackage, dataFileGHTreeItems)
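The proposed flow could look roughly like the sketch below. `compute_page_metadata` is only a stand-in for the real computePageMetadata, whose behaviour is not shown here. `fetch_datapackage`, the extension list, and the rule that an explicit datapackage's resources take precedence over discovered files are all assumptions.

```python
from pathlib import PurePosixPath

INDEX_NAMES = {"README.md", "index.md"}
DATAPACKAGE_NAMES = {"datapackage.json", "datapackage.yaml", "datapackage.yml"}
DATA_EXTENSIONS = {".csv", ".tsv", ".xlsx"}  # assumed, not defined in this note


def compute_page_metadata(file_path, datapackage, data_file_tree_items):
    """Stand-in for computePageMetadata (hypothetical behaviour)."""
    metadata = dict(datapackage or {})
    # Assumption: only fall back to discovered files when none are listed.
    if "resources" not in metadata and data_file_tree_items:
        metadata["resources"] = [
            {"path": PurePosixPath(i["path"]).name, "bytes": i.get("size")}
            for i in data_file_tree_items
        ]
    return metadata


def page_metadata_flow(file_path, tree_items, fetch_datapackage):
    page = PurePosixPath(file_path)
    if page.name not in INDEX_NAMES:
        return None  # other pages are not covered by this flow
    # Files at the same directory level as the README.md/index.md.
    siblings = [
        i for i in tree_items
        if i.get("type") == "blob" and PurePosixPath(i["path"]).parent == page.parent
    ]
    # Existing step: find and fetch a same-level datapackage, if any.
    dp_item = next(
        (i for i in siblings if PurePosixPath(i["path"]).name in DATAPACKAGE_NAMES),
        None,
    )
    datapackage = fetch_datapackage(dp_item["path"]) if dp_item else None
    # New step: find data files in the tree at the same directory level.
    data_items = [
        i for i in siblings
        if PurePosixPath(i["path"]).suffix.lower() in DATA_EXTENSIONS
    ]
    return compute_page_metadata(file_path, datapackage, data_items)
```

For a repo with just README.md and data.csv, this yields metadata with a single inferred resource and no fetch at all.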

Appendix: an example user journey

1. README.md

I add a README.md

OK, I expected a rendered home page based on README.md ✅ and that's what we do.

1B. Just add a data.csv

Blank home page atm.

IGNORE this case: we can always assume user should add a README.md of some kind

2. README + data.csv

Then I add a data file … e.g. data.csv

What should we do? Suggest they do this in their README.md

# Hey my readme

blah blah

<Table src="data.csv" />

3. I want a dataset layout

What happens if I want a dataset layout? Well, I can add the following to the README.md:

---
layout: dataset
---

And what would I expect to happen now? I would expect to get the dataset layout with data.csv auto-magically discovered and presented (as if it were listed in the datapackage.json).

🚩 not clear we want to go this route …

3B. I want to share a list of data files etc


<FileList />

Appendix: Original notes of Rufus (Feb 2024)

I want to get the features of e.g. listing data files, showing tabular views of them etc that i get when i have a datapackage.json/yml.

In more detail

I want to be able to create a "dataset" project without a datapackage.json and have it auto-inferred for me, e.g. I can create a project with this repo:

README.md
data.csv
data2.csv

And have it show up as a "dataset" including with a files list.

Note

How do we know this wants to show up as a "dataset" project? My guess is we need this to be an option you get to set when creating your project, and we can usually auto-infer it from the existence of datapackage.json or similar.

Note the equivalent in proper Frictionless form would be:

README.md
data.csv
data2.csv
datapackage.yml

Where datapackage.yml contains:

...
resources:
  - 
    path: data.csv
    format: csv
    bytes: xxx
    ...
  - 
    path: data2.csv
    format: csv
    bytes: xxx
    ...
...

© 2024 All rights reservedBuilt with Find, Share and Publish Quality Data with Datahub