Handle "random data files" (no datapackage etc)
For now, this is another "do nothing until we understand user needs better" item (including our own needs as users).
Summary
- Appetite: 5d?
- Situation: the dataset layout only renders properly when Frictionless Data Package metadata is present.
- Problem: it's a hassle to add Frictionless metadata. Users want to just dump a data.csv and go.
- Solution: Do nothing for now, until we have a clearer idea of what is wanted. The benefit isn't obvious, and any implementation involves us "guessing" at behaviour quite a bit.
- To get a sense of why, see the appendix user journey walkthrough. Our sense is that we would be guessing quite a lot about what people want …
Situation
Currently, to render a nice "dataset" showcase we need one of the following:
- any markdown file + a Frictionless Data Package frontmatter field
- any index.md/README.md + a same-level datapackage.{json/yaml/yml}
Data files that are not specifically listed in the datapackage are ignored.
What is in a Frictionless Data Package?
- Very general metadata e.g. title, description
- Slightly more specialist metadata, e.g. licenses and sources
- Resources ❗
- Views
Problem
A user has to create Frictionless metadata to get started, which is a PITA (a hassle, prone to error, etc.). People should only have to create this when they actually need it to get some kind of functionality.
Put differently: having to create a datapackage to get very basic, GitHub-like functionality (e.g. listing all data files, showing tabular views of them) is cumbersome.
All the more so because information from the datapackage is not needed to support these features: basic metadata about the repository and its files can be inferred from information provided by the GitHub API.
What may be useful is the Table Schema in the resources field.
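To make the Table Schema point concrete, here is a rough sketch of what inferring one from a CSV could look like. It is deliberately naive (only integer, number and string types, sampled from the first rows), and the function names `infer_field_type` and `infer_table_schema` are hypothetical, not something that exists in our codebase.

```python
import csv
import io


def infer_field_type(values):
    """Guess a Table Schema type from sample string values (very naive)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"

    def all_parse(cast):
        # True only if every sampled value can be converted with `cast`.
        for v in non_empty:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


def infer_table_schema(csv_text, sample_size=100):
    """Build a minimal Table Schema ({"fields": [...]}) from CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Only look at the first `sample_size` data rows.
    rows = [row for _, row in zip(range(sample_size), reader)]
    fields = []
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]
        fields.append({"name": name, "type": infer_field_type(column)})
    return {"fields": fields}


# Made-up sample content, just to exercise the sketch.
schema = infer_table_schema("year,price,country\n2020,1.5,UK\n2021,2,FR\n")
```

Even this small sketch has to guess (how many rows to sample, what to do with empty cells), which is the kind of guessing the Solution section is wary of.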
Solution
Enhance/extend the current page metadata computation function so that it calculates datapackage fields:
- (Option 1)
  - for each README.md/index.md
  - based on data files at the same dir level in the GH tree OR
  - (maybe) based on data files from the data dir specified in the data frontmatter field
- (Option 2)
  - for the root README.md/index.md only
  - based on all data files found in the tree OR
  - (maybe) based on data files from the data dir specified in the data frontmatter field
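The difference between the two options is easier to see as a file-selection rule. The sketch below is illustrative only: `data_files_for_page`, the `DATA_EXTENSIONS` list, and the choice to resolve the data dir relative to the page (direct children only) are all assumptions, since none of that is decided yet.

```python
import posixpath

# Assumption: which extensions count as "data files" is not defined in this doc.
DATA_EXTENSIONS = (".csv", ".tsv", ".xlsx")


def is_data_file(path):
    return path.lower().endswith(DATA_EXTENSIONS)


def data_files_for_page(page_path, tree_paths, option, data_dir=None):
    """Pick the data files that would feed a page's inferred datapackage.

    option 1: data files in the same directory as the page.
    option 2: every data file in the tree, for the root page only.
    data_dir: value of the page's `data` frontmatter field, if set
              (assumed relative to the page's directory).
    """
    page_dir = posixpath.dirname(page_path)
    candidates = [p for p in tree_paths if is_data_file(p)]

    if data_dir is not None:
        base = posixpath.normpath(posixpath.join(page_dir, data_dir))
        base = "" if base == "." else base  # root dir is "" in tree paths
        return [p for p in candidates if posixpath.dirname(p) == base]
    if option == 1:
        return [p for p in candidates if posixpath.dirname(p) == page_dir]
    if option == 2:
        return candidates if page_dir == "" else []
    raise ValueError("option must be 1 or 2")


# Hypothetical repo tree.
tree = ["README.md", "data.csv", "data2.csv",
        "docs/README.md", "docs/prices.csv", "docs/raw/a.csv"]
option1_root = data_files_for_page("README.md", tree, option=1)
option2_root = data_files_for_page("README.md", tree, option=2)
```

With this tree, Option 1 gives the root page only data.csv and data2.csv, while Option 2 gives it every CSV in the repo and gives docs/README.md nothing.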
Also, we need a way to opt in/out of auto-inferring the datapackage, e.g.:
- a global, project-level opt-in/out toggle AND/OR
- a local, README.md/index.md-level opt-in/out frontmatter field
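If we ended up with both toggles, one plausible rule is "local setting wins, otherwise fall back to the project setting". A minimal sketch of that, assuming a hypothetical `inferDatapackage` key in both the project config and the page frontmatter, and assuming inference is off by default:

```python
def should_infer_datapackage(project_config, frontmatter):
    """Decide whether to auto-infer a datapackage for one page.

    `inferDatapackage` is a made-up key name used for illustration.
    The page-level frontmatter value, when present, overrides the
    project-level toggle; with neither set, inference stays off.
    """
    local = frontmatter.get("inferDatapackage")
    if local is not None:
        return bool(local)
    return bool(project_config.get("inferDatapackage", False))


# Project opts in globally, one page opts out locally.
project = {"inferDatapackage": True}
page_default = should_infer_datapackage(project, {})
page_opted_out = should_infer_datapackage(project, {"inferDatapackage": False})
```

The precedence and the default are both open questions; the sketch just shows that the two levels can coexist without much machinery.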
Notes
Current page metadata calculation flow:
if file is README.md or index.md:
find same level datapackage.json/yaml/yml in GH tree
fetch the datapackage from GH
computePageMetadata(file, datapackage)
Proposed solution metadata calculation flow:
if file is README.md or index.md:
find same level datapackage.json/yaml/yml in GH tree
fetch the datapackage from GH
+++
find data files in GH tree at the same dir level
+++
computePageMetadata(file, datapackage, dataFileGHTreeItems)
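The proposed flow could be sketched roughly as follows. This is not the real implementation: `compute_page_metadata` is a stand-in that only records its inputs, `fetch_datapackage` is an injected placeholder for the GH fetch, and `DATA_EXTENSIONS` is an assumed definition of "data file". Whether computePageMetadata also runs for files other than README.md/index.md is not clear from the flow above, so this sketch simply skips them.

```python
import posixpath

PAGE_NAMES = {"README.md", "index.md"}
DATAPACKAGE_NAMES = ("datapackage.json", "datapackage.yaml", "datapackage.yml")
DATA_EXTENSIONS = (".csv", ".tsv", ".xlsx")  # assumption, not defined in this doc


def compute_page_metadata(file_path, datapackage, data_file_paths):
    """Stand-in for computePageMetadata: just records what it was given."""
    return {"path": file_path, "datapackage": datapackage,
            "dataFiles": data_file_paths}


def page_metadata_for(file_path, tree_paths, fetch_datapackage):
    directory, name = posixpath.split(file_path)
    if name not in PAGE_NAMES:
        return None  # other files are out of scope for this sketch

    siblings = [p for p in tree_paths if posixpath.dirname(p) == directory]

    # Existing step: find and fetch a same-level datapackage file, if any.
    dp_path = next(
        (p for p in siblings if posixpath.basename(p) in DATAPACKAGE_NAMES),
        None,
    )
    datapackage = fetch_datapackage(dp_path) if dp_path else None

    # New step (+++): find data files at the same dir level.
    data_files = [p for p in siblings if p.lower().endswith(DATA_EXTENSIONS)]

    return compute_page_metadata(file_path, datapackage, data_files)


# Hypothetical tree with no datapackage file at the root.
tree = ["README.md", "data.csv", "data2.csv", "docs/notes.md", "docs/other.csv"]
meta = page_metadata_for("README.md", tree, fetch_datapackage=lambda path: {})
```

Here `meta["datapackage"]` is None and `meta["dataFiles"]` lists the two root CSVs, which is the input computePageMetadata would need in order to infer the missing fields.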
Appendix: an example user journey
1. README.md
I add a README.md
OK, i expected a rendered home page based on README.md ✅ and that's what we do.
1B. Just add a data.csv
Blank home page atm.
IGNORE this case: we can always assume user should add a README.md of some kind
2. README + data.csv
Then i add a data file … e.g. data.csv
What should we do? Suggest they do this in their README.md
# Hey my readme
blah blah
<Table src="data.csv" />
3. I want a dataset layout
What happens if i want a dataset layout? Well i can add the following to the README.md?
---
layout: dataset
---
And what would i expect to happen now? I would expect to get the dataset layout with the data.csv auto-magically discovered and presented (as if it were listed in the datapackage.json).
🚩 not clear we want to go this route …
3B. I want to share a list of data files etc
<FileList />
Appendix: Original notes of Rufus (Feb 2024)
I want to get the features of e.g. listing data files, showing tabular views of them etc that i get when i have a datapackage.json/yml.
In more detail
I want to be able to create a "dataset" project without a datapackage.json and have it auto-inferred for me, e.g. i can create a project with this repo:
README.md
data.csv
data2.csv
And have it show up as a "dataset" including with a files list.
Note: How do we know this wants to show up as a "dataset" project? My guess is we need this to be an option you get to set when creating your project, and we can usually auto-infer it from the existence of a datapackage.json or similar.
Note the equivalent in proper Frictionless form would be:
README.md
data.csv
data2.csv
datapackage.yml
Where datapackage.yml contains:
...
resources:
  - path: data.csv
    format: csv
    bytes: xxx
    ...
  - path: data2.csv
    format: csv
    bytes: xxx
    ...
...
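For the three-file repo above, auto-inferring the resources list could be as small as the sketch below. It assumes each tree item carries a `path` and a `size` in bytes (as GitHub's Git Trees API reports for blobs), only picks up CSV files, and uses a hypothetical function name `infer_resources`. The sizes are placeholder values.

```python
import posixpath


def infer_resources(tree_items):
    """Turn tree items ({"path": ..., "size": ...}) into resource entries
    with the same path/format/bytes keys as the datapackage.yml above."""
    resources = []
    for item in tree_items:
        ext = posixpath.splitext(item["path"])[1].lstrip(".").lower()
        if ext != "csv":  # assumption: only CSV counts as data in this sketch
            continue
        resources.append(
            {"path": item["path"], "format": ext, "bytes": item["size"]}
        )
    return resources


# Placeholder tree items; the sizes are invented for illustration.
items = [
    {"path": "README.md", "size": 120},
    {"path": "data.csv", "size": 2048},
    {"path": "data2.csv", "size": 512},
]
datapackage = {"resources": infer_resources(items)}
```

This only covers the resources part; the rest of the datapackage (title, description, licenses, views) would still need to come from somewhere else or be left empty.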