Alpha Testing - DataHub
[Originally in a HackMD. Notes from 2017. Copied here on 5 July 2021 by Rufus Pollock]
Notes on publishing Open Data for Tax Justice datasets with Stephen Abbot (2017?)
We used data-cli tools for publishing datasets, since our data-desktop was not ready for it. During publication we faced several issues:
- The biggest challenge was using the CLI as a non-technical user
- Tried to use data-desktop, but we hit an issue with tableschema: it generated an invalid schema (fields). Reported at https://github.com/frictionlessdata/tableschema-js/issues/111
- When we published an Excel file, we encountered the same table schema issue, so I created the dataset manually and pushed it under their organization:
  - exported the Excel sheets into CSV files
  - packaged them
  - published under the opendatafortaxjustice organization: https://datahub.io/opendatafortaxjustice/blacklist
- Since our user did not understand the pipeline errors, I explained how to fix them:
  - bad data: he saved all the files in an encoding our system did not recognize. The fix was to save them in utf-8.
- In total we published 30 datasets (https://datahub.io/opendatafortaxjustice), with more coming.
Feature Request from Paul Walsh - Nov 16 19:58
@rufuspollock @Mikanebu @akariv I have a bunch of data processing steps, accompanying source data files, as Data Package Pipeline specs, mostly using the built-in standard lib. Is there any way I can push these specs with a data package to datahub.io and have those processing steps run?
Related question for @rufuspollock @akariv :: How far away are we from having datahub.io function as a hub in a data transport flow to other data analysis backends? Use case: I have tons of CSV data that I want to load into Elasticsearch to auto-generate an API using the Table Schema descriptors. The flow can be supported trivially by using https://github.com/frictionlessdata/tableschema-elasticsearch-py in a Data Package Pipeline, but then datahub.io would also have to be able to do one of:
- write into an Elasticsearch server managed as a service provided by datahub
- accept some credentials (encrypted) to write to some public Elasticsearch endpoint
I have this exact use case right now for a Frictionless Data pilot - large amounts of UK household energy data.
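The core of the flow Paul describes is translating a Table Schema descriptor into an Elasticsearch index mapping. A minimal sketch of that translation, for illustration only: the type table and function name here are assumptions, not the actual behaviour of tableschema-elasticsearch-py.

```python
# Illustrative sketch: derive an Elasticsearch "properties" mapping
# from a Table Schema descriptor. The type correspondences below are
# assumptions for demonstration, not the library's real mapping.

TS_TO_ES = {
    "string": "text",
    "number": "double",
    "integer": "long",
    "boolean": "boolean",
    "date": "date",
    "datetime": "date",
}

def es_mapping_from_schema(table_schema):
    """Build an Elasticsearch mapping dict from a Table Schema dict."""
    properties = {}
    for field in table_schema.get("fields", []):
        es_type = TS_TO_ES.get(field.get("type", "string"), "text")
        properties[field["name"]] = {"type": es_type}
    return {"properties": properties}

schema = {"fields": [{"name": "household_id", "type": "integer"},
                     {"name": "reading", "type": "number"},
                     {"name": "taken_at", "type": "datetime"}]}
print(es_mapping_from_schema(schema))
```

With such a mapping in hand, datahub.io would only need the write-side plumbing (a managed ES service, or encrypted credentials to a user endpoint) that Paul lists above.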
UX analysis on electron app v2
- The icon on the bottom bar is not clickable; probably we do not need it.
  - it is supposed to be removed in the future if we don't need it
- [minor]: rename `data 0.2.2.dmg` to `data-0.2.0.dmg`. I checked Skype, RStudio, Postgres; all have this convention in the name.
  - I don't think it is important
- When there is invalid JSON, it does nothing. It may be a good idea to validate and show an error message saying `incorrect JSON format`.
  - great - create an issue for it please
- The `Data Files` section has missing values; it has probably not been implemented yet.
  - it is just a mockup - links not working on purpose
- No short readme, readme, graph, or preview table; probably they will be implemented soon
- FIXED NOW
- No validation
- please open an issue for it
- [nice-to-have] datahub icon on the top left as on datahub.io
  - not for now… maybe in the future
- After the field information, the `datapackage_zip` link is missing
  - INVALID - we're using just the original dp.json there
- At the end, `datapackage.json` is missing
  - INVALID - no reason to have it
- [nice-to-have] several windows, so we can compare with another dataset's showcase page? If I drag and drop another dataset or file, the old one disappears.
  - not sure about this one - for now it's not in the user stories…
- Would it be nice to make it full screen?
- I am disabling full screen for now, I don't see a decent reason for having it…
- Very nice and super fast
Rufus' UX analysis
- data cli should point to tutorials …
- data help is out of date, e.g. it mentions `data get core/finance-vix`
- Help messages could be a lot better
- data command -h does not work [misc improvements]
- I'm wondering whether we should use commander or something like that for parsing help …
- http://datahub.io/docs/getting-started/pushing-data should point to where I can get some data to start with (some people may not know how to get a file)
Minor (?)
- I've changed my primary email on GitHub, and when I logged in I ended up with the username rufuspollock1 (why wouldn't it connect me with my existing username of rufuspollock??)
- [TODO: (just for me) merge these two accounts together …]
UX analysis and cli and electron
Plan
- Check main help message
- Check each help message for commands
- `data push`
  - CWD
  - with path
  - csv file from URL
  - excel file (xlsx)
  - options: `--findability`, `--schedule`, `--sheets`, `--format`
  - invalid data
  - invalid format
  - invalid json
  - invalid metadata
- `data info`
  - cwd
  - local path
  - remote path
- `data get`
- `data cat`
  - input data formats
  - output data formats
  - reading from stdin
- `data login`
- electron app
Output datahub cli
Feedback [name=Rufus Pollock]
- Thorough list
Improvements
- Need to prioritize the items - what's a big deal or a big bug vs not …
- Note small things, e.g. the main help lists `data get core/finance-vix` but that fails …
- As a user I want to log out from the CLI; I think this is a useful command. The Docker and now CLIs have a `logout` command. Also, Paul Walsh experienced this situation.
  - Why is this important? So what that docker has this? Why would you want to use this and so what that Paul Walsh mentioned this (he mainly mentioned logging out on the main site). I'm not saying this is not important but I'd want you to think about and explain why (and whether this is important vs other things) [name=Rufus Pollock]
  - Paul Walsh mentioned this; he tried to log out from the CLI as well. Also useful if I want to log in again from a different GitHub account, or in the future, when we implement signing in with a Google account. [name=Meiran]
- [nice-to-have] It would be nice to have an update-availability message in the main help message, so users know whether they are using the latest CLI version. - https://github.com/datahq/datahub-cli/issues/198
  `Update available! 0.5.0 → 0.6.1 │ Changelog: https://github.com/datahq/datahub-cli/releases/tag/v0.6.1 │ Please download binaries from https://datahub.io/download`
  - Is this a "nice to have" or something essential? Please can you prioritize comments into priorities and less important items [name=Rufus Pollock]
- In the `help` message, the `about` section is extra. I compared with the `heroku`, `docker`, and `now` CLIs.
  - AGREED. And fixed. [name=Rufus Pollock]
data push
- [improvements] when there is no `datapackage.json`, the error message is not super helpful: `Error! ENOENT: no such file or directory, open '/Users/Zhiyenbayev_mirza/Desktop/Datopian/src/pm/qa-data/datapackage.json'`. Could we make it more human-readable, like `datapackage.json not found in current directory; please make sure the directory contains a datapackage.json file, or see data push --help for more info`?
  - [name=Rufus Pollock] compare with git push in a directory that is not a git repo …
  - [name=Meiran] Similar to the git push message: `fatal: No datapackage.json at destination. Either add datapackage.json or, if you want to push a single file, read http://datahub.io/docs/getting-started/pushing-data`
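The check discussed above could look something like this; a sketch only (the real datahub-cli is JavaScript, and the function name and wording here are illustrative):

```python
# Sketch of a friendlier missing-datapackage.json error, in the spirit
# of the git-style message suggested above. Illustrative only.
import os

def check_datapackage(directory):
    """Return a human-readable error message, or None if datapackage.json exists."""
    path = os.path.join(directory, "datapackage.json")
    if not os.path.exists(path):
        return ("Error: no datapackage.json found in '%s'.\n"
                "Either add a datapackage.json, or to push a single file see\n"
                "http://datahub.io/docs/getting-started/pushing-data" % directory)
    return None
```

The key point is reporting the user-level cause (no data package here) rather than surfacing the raw `ENOENT` from the file open.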
- [nice-to-have] It would be nice to split the `data push` help message into `data push dataset` and `data push file` categories.
  - Minor?
- [bug] if the pipeline fails, dataset findability is always `public`. - now irrelevant
  - OK, that's a significant bug.
- [bug] when I pushed a csv file from a URL, it shows nothing and the command just stops: `data push https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv`
  - MAJOR bug
  - this is a known bug - https://github.com/frictionlessdata/tableschema-js/issues/109
- [bug] Incorrect error message when I use an invalid URL: `data push https://docs.google.com/spreadsheets/d/14kJluhePaMOx6vYBic0poVjDYK3I8-v_xF3sv4Focac/edit#gid=697227580` gives `Error: You can push only local datasets.`
  - GREAT find. Another important bug.
  - please open an issue in the CLI repo
- [bug] Non-human-readable error message when I run `data push invalid-data-format-xlsx/data.xlsx --sheets=2` where sheet 2 does not exist: `> Error! sheets.split is not a function`
  - GREAT! Should have tests for this and a decent error message.
  - Please open an issue in the cli
- [improvements] The invalid-metadata error message needs improvement: `> Error! Unexpected end of JSON input`
data info
- [duplicate] When there is no datapackage.json in the given directory, the error message needs improvement: `Error! ENOENT: no such file or directory, open '/Users/Zhiyenbayev_mirza/Desktop/Datopian/src/pm/qa-data/datapackage.json'`
- [bug] Invalid error message when the remote URL is invalid: `> Error! File is not in known tabular format.`
  - open an issue in the cli please
data get
- [bug] After getting a dataset, I could not open the downloaded zip file: `data get https://datahub.io/core/co2-ppm` - https://github.com/datahq/datahub-cli/issues/200
- [minor] The `--format` option is extra in the help message
  - it works only with push for now - remove from help message
- [minor-bug] The `data get` help message has an unprocessed heading `## Example:`
data cat
- [improvements] When I run an invalid url, it gives me this error: `Error! Invalid opening quote at line 1`
data login
- [minor-bug] The help message has unprocessed markdown elements: `## Options:`
Output Electron app
- if I open the GitHub page to create an issue using the link on the main page, I cannot navigate back to the main page
- [bug] Then I tried to open the docs page `Pushing a data file`; a GitHub page appears and disappears quickly
- If I go to datahub.io, I cannot download any data
- If I go to the tutorial page, I cannot close the window.
Chat with Paul Walsh
- suggestion to use `~/.config` and namespace under there: `~/.config/datahub.io/config.json`
- `config/datahub/config.json`
When following the instructions at https://datahub.io/docs/getting-started/installing-data I consciously decided to run data info instead of data info https://datahub.io/core/finance-vix.
- [name=Meiran] As I understand it, people might have a wrong mental model of the `data info` command. It feels like `data help` or another command for getting info about datahub-cli itself. Suggest replacing `data info` in the docs with other commands like `data get` or `data cat` etc.
As a user of many CLI tools, I expect to be able to run a command without arguments and get some type of context-driven help, or, at least an error for the missing argument.
The error I do receive is related to a missing configuration step, I suspect, and this is confusing to me, especially because the instructions on this page make no mention of configuration.
Error! ENOENT: no such file or directory, open '/Users/pwalsh/datapackage.json'
- [name=Meiran] That error needs to be fixed with a clearer error message
I then went to http://datahub.io/docs/features/data-cli which is linked from the above page, and still do not have any idea how to configure the CLI. data help works, but has no info I am looking for.
- [name=Meiran] We might want to rethink this page: http://datahub.io/docs/features/data-cli. Move the `Login before pushing:` section after the `data help` section
I then ran data push which tells me to log in. This was successful (according to CLI messaging), but then I ran data info again and got the same original error.
The current config file has a data structure designed around a single account on datahub. I have just started interacting with the system as a user and I already want two accounts (a pseudo org account, and my own account).
Many config files for CLIs support this elegantly. I suggest taking a good look at the gcloud CLI and the aws CLI in terms of user experience if running commands with different users, and the config file itself.
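A multi-account config along the lines Paul suggests (similar to gcloud/aws named profiles) might look roughly like this. Purely an illustrative sketch, not the actual datahub config format; every key here is hypothetical:

```json
{
  "default_profile": "personal",
  "profiles": {
    "personal": {"username": "pwalsh", "token": "<jwt-for-personal-account>"},
    "my-org": {"username": "my-org", "token": "<jwt-for-org-account>"}
  }
}
```

A hypothetical `data config use <profile>` command could then switch the active account without re-authenticating, much as `gcloud config configurations activate` does.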
- [name=Meiran] Adding a configuration section to http://datahub.io/docs/getting-started/installing-data.
- Pushing metadata with inline data???
- `.txt` format throws an unclear error
- Bug in the Safari browser with logout and dashboard pages
Chat with Jonathan Gray
Questions
For datahub, what about:
- dataset for each spreadsheet
- group for the whole collection (which might later expand to include other guardian datasets)
- a tag for the ones obtained from this specific page (so they are easy to group together)
- readme with URL to guardian article (people can always use wayback machine if article is gone), plus page title – and (if you think copyright permits) snippet/preview of first x lines / sentences of text
Answers
- You can create one dataset for several similar spreadsheets; you can specify them in the `resources` property
- Yes, you can group them into several datasets. For example: economics, geography, sports etc.
- A dataset must have a readme. The readme might consist of sections such as `USAGE`, `SOURCE`, `LICENCES` etc. Also, it would be great to add the scraping script so the data stays up to date later.
- About the snippet preview: yes, you can do it, but on datahub.io we have a built-in table preview, and you can also build graphs out of the data. There is a `views` property in the dataset descriptor file for building graphs. Take a look at some of our core datasets: http://datahub.io/core/finance-vix http://datahub.io/core/gdp-uk This is the entire list of core datasets: http://datahub.io/core
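Putting the answers above together, a descriptor combining several spreadsheets via `resources` plus a graph via `views` might look roughly like this. A sketch only: resource names and the view spec details are illustrative; see the Frictionless Data Package spec and the datahub views docs for the authoritative format.

```json
{
  "name": "guardian-datablog-collection",
  "resources": [
    {"name": "economics", "path": "data/economics.csv"},
    {"name": "sports", "path": "data/sports.csv"}
  ],
  "views": [
    {
      "title": "Economics indicator over time",
      "resources": ["economics"],
      "specType": "simple",
      "spec": {"type": "line", "group": "year", "series": ["value"]}
    }
  ]
}
```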
Analysis on estimation for Jonathan Gray source
Entire source of data:
https://docs.google.com/spreadsheets/d/14gdRgcb_4cIRWrIlJRHgWQDGRZczC1Vbn-Hz2TfBL0A/edit?hl=en_US&hl=en_US#gid=0 Total: 1412 rows/sources
Scraping: 3-4h Packaging: 1h Check for specification: 1h Publishing: 1h
For this url: http://www.tandfonline.com/loi/rdij20 I do not have access to the source data, so I cannot estimate.
Sample spreadsheets from guardian page:
https://docs.google.com/spreadsheets/d/1MTN2cMoXzscuueG_cwyP0zX2__uBiAnK4vi7_8t8VoE/edit#gid=0 https://docs.google.com/spreadsheets/d/1cZQurWp7q1y_yVKEpRe_iXi8Kn7vrAx1J2LiWJ7V-ug/edit?copiedFromTrash#gid=12
- Scraping:
- unpivot to normalize data
- remove extra headers
- clean data
Scraping: 5-6h Packaging: 1h Check for specification: 1h Publishing: 1h
Chat with JohnSnowLabs(Ali) on October 5, 2017
- Does the validation have to be so strict?
  - it should be strict, otherwise it will not pass the pipelines
- when I say multiple processes, it is equivalent to two terminals, both running the cli on two separate datasets
Alis-MacBook-Pro-3:NYC Social Media Usage alinaqvi$ data push --published
> Error! request to https://api.datahub.io/auth/authorize?service=rawstore failed, reason: socket hang up
Chat with JohnSnowLabs(Ali) on October 4, 2017
- there seems to be a discrepancy with datahub-cli vs the spec.
Alis-MacBook-Pro-3:FDA CVX Code Vaccines Administered alinaqvi$ data validate
> Error! Error: Descriptor validation error:
Missing required property: name
at "/contributors/0" in descriptor and
at "/properties/contributors/items/required/0" in profile
- fixed in `data validate` by updating to the schema of the new spec v1. Reason: it used the `pre-v1` spec schema for validation
- also it would be great if all of the errors could be made part of the data validate function, like when a date column can't be cast to date because its format in datapackage.json is not specified correctly:
  `stream_remote_resources ERROR: Failed to cast row: Field "Birth_Year" can't cast value "2011" for type "date" with format "default"`
  `Traceback (most recent call last):`
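The "Birth_Year" cast failure above happens because the default `date` format is ISO 8601 (`YYYY-MM-DD`), while the column holds bare years. A stdlib-only sketch of the distinction (tableschema itself is not used here; in the descriptor the fix is declaring the right format, or using a `year`/`integer` type):

```python
# Demonstrates why "2011" cannot be cast as a date with the default
# (ISO 8601) format, and why declaring the actual format fixes it.
from datetime import datetime, date

def cast_date(value, fmt="%Y-%m-%d"):
    """Cast a string to a date with the given format; None on failure."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None

assert cast_date("2011") is None                    # default format rejects a bare year
assert cast_date("2011", "%Y") == date(2011, 1, 1)  # explicit format accepts it
```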
Tasks
- created an issue on the `data validate` command: https://github.com/datahq/datahub-cli/issues/175
- Adam is fixing the pipeline error; it is a bug
- new datahub cli release: either v0.5.0 or v0.4.1
- Simpler way to get started and test it is working => issue in frontend
  - ACTION: add `data info {some dataset on datahub}` to getting started
  - As Brook I want to run a simple non-side-effects command to see everything is working before I do something bigger, so that I don't damage anything and feel more certain …
  - I don't want to write to my disk straight away … (where will it get written!)
  - data info https://datahub.io/core/finance-vix
- `info` command: encountered a poor error message => an issue https://gist.github.com/brew/eb47b4a6a3df01caa25b38079de47444 : should be "no data package found" …
- XS: data push help message is not clear and short => an issue
  - add a real example, providing a path, and explain the path options
- data validate doesn't check data (and it should …) => an issue (big and significant)
  - Checked the `data validate` command; it checks data against the schema as well, so we do not need to handle this
  - Improve the help message by mentioning data validation
- XS: data normalize - poor CLI documentation at the basic help screen (need more info - normalizes data package against spec 1.0)
  - `data normalize` help message is good, with `Usage, Options, Examples` sections
- [minor] http://docs.datahub.io/developers/publish/ - sequence diagram needs a bit of fixing … (what fixing??)
FEEDBACK
Chat with Brook on September 19, 2017
Feedback on new v0.4.1 DataHub CLI
- I still have a problem running data validate: `Error! Error: Descriptor validation error: Missing required property: name at "/contributors/0" in descriptor and at "/properties/contributors/items/required/0" in profile`. I don't have a `name` attribute in my contributor; I have `title`.
- I also tried data push, which was successful, but my showcase page has a pipeline error: https://datahub.io/brew/multiple-item/pipelines
- Okay, I've removed the contributor property from my datapackage.json completely and reran data push. I now have a showcase page! https://datahub.io/brew/multiple-item
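Brook's error is consistent with the CLI validating against the pre-v1 profile, which required `name` on contributor entries, while the v1 Data Package spec uses `title` (as Brook had). A minimal sketch of the v1-style shape (check the spec for the authoritative definition):

```json
{
  "contributors": [
    {"title": "Brook Elgie", "role": "author"}
  ]
}
```

Under the pre-v1 profile the same entry would have been keyed `name` instead of `title`, which is why a v1-style descriptor failed validation until the CLI's schema was updated (as noted in the JohnSnowLabs chat above).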
Few more notes
A few more notes and issues about the Showcase page as they relate to my interpretation of the available spec at https://specs.frictionlessdata.io/data-package/:
- Will the `keywords` property be used?
  - yes
- Will the `description` property in datapackage.json be used to populate the Showcase description? How does that relate to the Readme section?
- Will the `homepage` property be used? I'm using it to link back to the original dataset page on UKDS.
- Related to `homepage`: a single `source` is printed to the showcase page as part of the dataset summary. Currently, this doesn't link to a `source.path`, and it uses the first item's `source.title` as a title attribute for the link. The datapackage spec allows for multiple `sources`, and I've used it to link back to all resource files associated with my datapackage on UKDS (including non-tabular support files), so it seems odd to display just the first one.
- Will additional admin tools be available for the Showcase page? Currently the page is `unlisted`; is there a way to change this to `public`? Are my datasets listed anywhere? I can't see my `unlisted` datasets on my profile page or dashboard.
Call with Brook on September 15, 2017
`data push` and `data get` did not work since the DataHub CLI binary was too old; it was last released on 9 August.
Ali feedback on September 13, 2017
- He pushed data, but the pipeline is not working; it is being debugged by Adam. `data push` returned a URL and a success message, but the frontend kept showing an "updating" message
- The push help message, again, was short and not clear
  - It did not say that data push works with a provided path
- He was interested in `views`: how it works and what needs to be done. Provided documentation at http://datahub.io/docs/features/views.
- Could not install using npm
- Also, the DataHub CLI binary instructions were a little unclear about putting it in a `bin` folder and making it executable with `chmod +x`. He handled it by himself
- Agreed to chat tomorrow morning, when the pipeline will work in a stable mode
Misc issues from Brook feedback on September 12, 2017
Docs - getting started
- As Brook I want to run a simple non-side-effects command to see everything is working before I do something bigger, so that I don't damage anything and feel more certain …
- I don’t want to write to my disk straight away … (where will it get written!)
- data info https://datahub.io/core/finance-vix
- Also encountered a really poor error message - https://gist.github.com/brew/eb47b4a6a3df01caa25b38079de47444 : should be no data package found …
- data validate doesn’t check data (and it should …)
- data normalize - poor CLI documentation at basic help screen (need more info - normalizes data package against spec 1.0)
- [minor] http://docs.datahub.io/developers/publish/ - sequence diagram needs a bit of fixing …
As Brook I want to get the pipeline status after pushing a data package, so that I can see whether it failed or succeeded
As Brook I want to get the hash version of datapackage.json if the pipeline fails, so I can modify it later with changes
NOTE: this was because pipelines were not working and that is now fixed