Alpha Testing - DataHub

[Originally in a HackMD. Notes from 2017. Copied here on 5 July 2021 by Rufus Pollock]

Notes on publishing Open Data for Tax Justice datasets with Stephen Abbot (2017?)

We used the data CLI tools for publishing datasets, since data-desktop was not ready for it. During publication we faced several issues:

  • The biggest challenge was asking a non-technical user to use a CLI
  • Tried to use data-desktop, but hit an issue with tableschema: it generated an invalid schema (fields), which was reported at https://github.com/frictionlessdata/tableschema-js/issues/111
  • When we published an Excel file, we encountered the same table schema issue, so I created the dataset manually and pushed it under their organization.
  • Since our user did not understand the pipeline errors, I explained how to fix them:
    • He had saved the files in an encoding our system did not recognize; the fix was to re-save them as UTF-8.
    • bad data
  • In total we published 30 datasets (https://datahub.io/opendatafortaxjustice), with more coming.
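The UTF-8 fix above can be scripted. This is a minimal hypothetical helper (not part of the data CLI), assuming the source encoding is one of a few common candidates:

```python
# Hypothetical helper: re-save a file as UTF-8 so the pipeline can read it.
# The candidate-encoding list is an assumption; latin-1 acts as a last resort
# since it accepts any byte sequence.
def to_utf8(path, candidates=("utf-8", "cp1252", "latin-1")):
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("could not detect encoding of %s" % path)
    # Rewrite the file in UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return enc  # the encoding the file was converted from
```

Running it on a Windows-1252 file rewrites it in place and reports the detected source encoding.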

Feature Request from Paul Walsh - Nov 16 19:58

@rufuspollock @Mikanebu @akariv I have a bunch of data processing steps, accompanying source data files, as Data Package Pipeline specs, mostly using the built-in standard lib. Is there any way I can push these specs with a data package to datahub.io and have those processing steps run?
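For context, a Data Package Pipelines spec of the kind described is a pipeline-spec.yaml along these lines. This is a sketch using built-in std-lib processors; the dataset name, file path, and parameters are illustrative:

```yaml
# pipeline-spec.yaml -- illustrative pipeline using std-lib processors
uk-household-energy:
  pipeline:
    - run: add_metadata
      parameters:
        name: uk-household-energy
    - run: add_resource
      parameters:
        name: readings
        url: data/readings.csv
    - run: stream_remote_resources
    - run: dump.to_path
      parameters:
        out-path: out
```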

Related question for @rufuspollock @akariv :: How far away are we from having datahub.io function as a hub in a data transport flow to other data analysis backends? Use case: I have tons of CSV data that I want to load into Elasticsearch to auto-generate an API using the Table Schema descriptors. The flow can be supported trivially by using https://github.com/frictionlessdata/tableschema-elasticsearch-py in a Data Package Pipeline, but then datahub.io would also have to be able to do one of: write into an Elasticsearch server managed as a service provided by datahub, or accept some credentials (encrypted) to write to some public Elasticsearch endpoint.

I have this exact use case right now for a Frictionless Data pilot - large amounts of UK household energy data.
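The heart of that flow, deriving an Elasticsearch index mapping from a Table Schema descriptor, can be sketched in pure Python. This is an illustration rather than the actual tableschema-elasticsearch-py implementation, and the type-translation table is an assumption:

```python
# Assumed translation from Table Schema field types to Elasticsearch types.
TYPE_MAP = {
    "string": "text",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
    "date": "date",
    "datetime": "date",
    "year": "integer",
}

def schema_to_es_mapping(table_schema):
    """Translate Table Schema fields into an Elasticsearch index mapping."""
    props = {}
    for field in table_schema["fields"]:
        es_type = TYPE_MAP.get(field.get("type", "string"), "text")
        props[field["name"]] = {"type": es_type}
    return {"mappings": {"properties": props}}
```

A mapping produced this way could then be passed to an Elasticsearch client when creating the index for the CSV rows.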

UX analysis on electron app v2

  • the icon on the bottom bar is not clickable; probably we do not need it.
    • it is supposed to be removed in the future if we don't need it
  • [minor]: rename data 0.2.2.dmg to data-0.2.2.dmg. I checked Skype, RStudio, Postgres; all have this convention in the name.
    • I don't think it is important
  • When there is invalid JSON, it does nothing. It might be a good idea to validate and show an error message saying the JSON format is incorrect
  • The Data Files section has missing values; probably it has not been implemented yet
    • it is just a mockup - links not working on purpose
  • No short readme, readme, graph, or preview table; probably they will be implemented soon
    • FIXED NOW
  • No validation
  • [nice-to-have] datahub icon on the top left as in datahub.io
    • not for now.. may be in the future
  • after field information, datapackage_zip link is missing
    • INVALID - we're using just original dp.json there
  • in the end, datapackage.json is missing
    • INVALID - no reason to have it
  • [nice-to-have] several windows, so we can compare with another dataset's showcase page? If I drag and drop another dataset or file, the old one disappears
    • not sure about this one - for now it's not in user stories..
  • Would it be nice to make it full screen?
    • I am disabling full screen for now, I don't see a decent reason for having it…
  • Very nice and super fast

Rufus' UX analysis

  • data cli should point to tutorials …
  • data help is out of date, e.g. it mentions data get core/finance-vix
  • Help messages could be a lot better
    • data command -h does not work [misc improvements]
      • I'm wondering whether we should use commander or something like that for parsing help …
  • http://datahub.io/docs/getting-started/pushing-data should point to where i can get some data to start with (some people may not know how to get a file)

Minor (?)

  • I've changed my primary email on GitHub and when I logged in I ended up with username rufuspollock1 (why wouldn't it connect me with my existing username of rufuspollock??)
    • [TODO: (just for me) merge these two accounts together …]

UX analysis and cli and electron

Plan

  • Check main help message
  • Check each help message for commands
  • data push
    • CWD
    • with path
    • csv file from URL
    • excel file(xlsx)
    • options: --findability, --schedule, --sheets, --format
    • invalid data
    • invalid format
    • invalid json
    • invalid metadata
  • data info
    • cwd
    • local path
    • remote path
  • data get
  • data cat
    • input data formats
    • output data formats
    • reading from stdin
  • data login
  • electron app

Output datahub cli

Feedback [name=Rufus Pollock]

  • Thorough list

Improvements

  • Need to prioritize the items - what's a big deal or a big bug vs not …
  • Note small things e.g. the main help list data get core/finance-vix but that fails …

  • As a user I want to log out from the CLI; I think this is a useful command. Docker and now CLI both have a logout command. Also, Paul Walsh experienced this situation.
    • Why is this important? So what that docker has this? Why would you want to use this, and so what that Paul Walsh mentioned this (he mainly mentioned logging out on the main site)? I'm not saying this is not important, but I'd want you to think about and explain why (and whether this is important vs other things) [name=Rufus Pollock]

    • Paul Walsh mentioned this; he tried to log out from the CLI as well. It is needed if I want to log in again from a different GitHub account, or in the future when we implement signing in with a Google account. [name=Meiran]

  • [nice-to-have] It would be nice to have an update-availability message in the main help message, so users know whether they are using the latest CLI version. - https://github.com/datahq/datahub-cli/issues/198
    Update available! 0.5.0 → 0.6.1
    Changelog: https://github.com/datahq/datahub-cli/releases/tag/v0.6.1
    Please download binaries from https://datahub.io/download
    • Is this a "nice to have" or something essential? Please can you prioritize comments into priorities and less important items [name=Rufus Pollock]

  • In the help message the About section is extra. I compared with the heroku, docker, and now CLIs.
    • AGREED. And fixed. [name=Rufus Pollock]

  • data push
    • [improvements] when there is no datapackage.json, the error message is not super helpful: Error! ENOENT: no such file or directory, open '/Users/Zhiyenbayev_mirza/Desktop/Datopian/src/pm/qa-data/datapackage.json'. Could we make it more human-readable, like datapackage.json not found in current directory, please make sure the directory contains a datapackage.json file, or see data push --help for more info?
      • [name=Rufus Pollock] compare with git push in a directory where not git repo …

      • [name=Meiran] Similar to git push message
        fatal: No datapackage.json at destination.
        Either add datapackage.json or if you want to push single file read http://datahub.io/docs/getting-started/pushing-data

    • [nice-to-have] It would be nice to split the data push help message into data push dataset and data push file categories.
      • Minor?
    • [bug] if pipeline fails, dataset findability is always public. - now irrelevant
      • OK, that's a significant bug.
    • [bug] when I pushed a csv file from a URL, it shows nothing and just stops: data push https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv
    • [bug] Incorrect error message when I use an invalid URL: data push https://docs.google.com/spreadsheets/d/14kJluhePaMOx6vYBic0poVjDYK3I8-v_xF3sv4Focac/edit#gid=697227580 > Error: You can push only local datasets.
    • [bug] Non-human-readable error message when I run data push invalid-data-format-xlsx/data.xlsx --sheets=2 where sheet 2 does not exist > Error! sheets.split is not a function
    • [improvements] The invalid-metadata error message needs improvement > Error! Unexpected end of JSON input
  • data info
    • [duplicate] When there is no datapackage.json in the given directory, the error message needs improvement: Error! ENOENT: no such file or directory, open '/Users/Zhiyenbayev_mirza/Desktop/Datopian/src/pm/qa-data/datapackage.json'
    • [bug] Invalid error message when remote URL is invalid > Error! File is not in known tabular format.
  • data get
    • [bug] After getting dataset, could not open downloaded zip file. https://github.com/datahq/datahub-cli/issues/200 data get https://datahub.io/core/co2-ppm -
    • [minor] the --format option is extra in the help message
      • it works only with push for now - remove it from the help message
    • [minor-bug] the data get help message has an unprocessed markdown heading ## Example:
  • data cat
    • [improvements] When I run it with an invalid URL, it gives me this error: Error! Invalid opening quote at line 1
  • data login
    • [minor-bug] the help message has unprocessed markdown elements ## Options:
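As a sketch of the friendlier data push error suggested above, here is a hypothetical helper (not the CLI's actual code); the wording mimics git's "fatal:" style:

```python
import os

def missing_descriptor_message(directory):
    """Return a helpful error string if datapackage.json is absent, else None."""
    if os.path.exists(os.path.join(directory, "datapackage.json")):
        return None
    # Mirror the proposed wording: say what is missing and where to read more.
    return (
        "fatal: No datapackage.json found in %s.\n"
        "Either add a datapackage.json, or to push a single file see\n"
        "http://datahub.io/docs/getting-started/pushing-data" % directory
    )
```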

Output Electron app

  • if I open the GitHub page to create an issue by using the link on the main page, I cannot navigate back to the main page
  • [bug] Then I tried to open the docs Pushing a data file; the GitHub page appears and disappears quickly
  • If I go to datahub.io, I cannot download any data
  • If I go to the tutorial page, I cannot close the window.

Chat with Paul Walsh

  • suggestion to use ~/.config and namespace under there: ~/.config/datahub.io/config.json
    • ~/.config/datahub/config.json

When following the instructions at https://datahub.io/docs/getting-started/installing-data I consciously decided to run data info instead of data info https://datahub.io/core/finance-vix.

  • [name=Meiran] As I understand it, people might have a wrong understanding of the data info command. It feels like data help or some other command for getting info about datahub-cli itself. I suggest replacing data info in the docs with other commands like data get or data cat, etc.

As a user of many CLI tools, I expect to be able to run a command without arguments and get some type of context-driven help, or, at least an error for the missing argument.

The error I do receive is related to a missing configuration step, I suspect, and this is confusing to me, especially because the instructions on this page make no mention of configuration.

Error! ENOENT: no such file or directory, open '/Users/pwalsh/datapackage.json'
  • [name=Meiran] That error needs to be fixed with a clearer error message

I then went to http://datahub.io/docs/features/data-cli which is linked from the above page, and still do not have any idea how to configure the CLI. data help works, but has none of the info I am looking for.

I then ran data push which tells me to log in. This was successful (according to CLI messaging), but then I ran data info again and got the same original error.

The current config file has a data structure designed around a single account on datahub. I have just started interacting with the system as a user and I already want two accounts (a pseudo org account, and my own account).

Many config files for CLIs support this elegantly. I suggest taking a good look at the gcloud CLI and the aws CLI in terms of user experience if running commands with different users, and the config file itself.
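A profile-aware config along the lines suggested might look like this; the profiles key and field names are hypothetical, loosely modelled on aws/gcloud named profiles:

```json
{
  "profiles": {
    "default": { "username": "pwalsh", "token": "..." },
    "my-org":  { "username": "my-org", "token": "..." }
  },
  "active-profile": "default"
}
```

A command could then switch accounts by selecting a profile instead of overwriting the single stored credential.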

Chat with Jonathan Gray

Questions

For datahub, what about:

  • dataset for each spreadsheet
  • group for the whole collection (which might later expand to include other guardian datasets)
  • a tag for the ones obtained from this specific page (so they are easy to group together)
  • readme with URL to guardian article (people can always use wayback machine if article is gone), plus page title – and (if you think copyright permits) snippet/preview of first x lines / sentences of text

Answers

  • You can create one dataset for several similar spreadsheets; you can specify them in the resources property
  • Yes, you can group them into several datasets. For example: economics, geography, sports, etc.
  • A dataset must have a readme. The readme might consist of sections such as Usage, Source, Licences, etc. Also, it would be great to add the scraping script so the data can be kept up to date later.
  • About the snippet preview: yes, you can do it, but on datahub.io we have a built-in table preview, and you can also build graphs out of the data. There is a views property in the dataset descriptor file for building graphs. Take a look at some of our core datasets: http://datahub.io/core/finance-vix http://datahub.io/core/gdp-uk This is the entire list of core datasets: http://datahub.io/core
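A minimal datapackage.json combining several spreadsheets as resources, plus a views entry for a graph, might look like this; all names, paths, and spec values are illustrative:

```json
{
  "name": "guardian-datablog-sample",
  "title": "Guardian datablog spreadsheets (sample)",
  "resources": [
    { "name": "economics", "path": "data/economics.csv" },
    { "name": "geography", "path": "data/geography.csv" }
  ],
  "views": [
    {
      "title": "Indicator over time",
      "resources": ["economics"],
      "specType": "simple",
      "spec": { "type": "line", "group": "year", "series": ["indicator"] }
    }
  ]
}
```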

Analysis on estimation for Jonathan Gray source

Entire source of data:

https://docs.google.com/spreadsheets/d/14gdRgcb_4cIRWrIlJRHgWQDGRZczC1Vbn-Hz2TfBL0A/edit?hl=en_US&hl=en_US#gid=0 Total: 1412 rows/sources

Scraping: 3-4h
Packaging: 1h
Check for specification: 1h
Publishing: 1h

For this URL: http://www.tandfonline.com/loi/rdij20 I do not have access to the source and data, so I cannot estimate.

Sample spreadsheets from guardian page:

https://docs.google.com/spreadsheets/d/1MTN2cMoXzscuueG_cwyP0zX2__uBiAnK4vi7_8t8VoE/edit#gid=0 https://docs.google.com/spreadsheets/d/1cZQurWp7q1y_yVKEpRe_iXi8Kn7vrAx1J2LiWJ7V-ug/edit?copiedFromTrash#gid=12

  • Scraping:
    • unpivot to normalize data
    • remove extra headers
    • clean data

Scraping: 5-6h
Packaging: 1h
Check for specification: 1h
Publishing: 1h

Chat with JohnSnowLabs(Ali) on October 5, 2017

  • Does the validation have to be so strict?
    • it should be strict, otherwise it will not pass the pipelines
  • when I say multiple processes, it is equivalent to two terminals with both running the cli on two separate datasets
Alis-MacBook-Pro-3:NYC Social Media Usage alinaqvi$ data push --published
> Error! request to https://api.datahub.io/auth/authorize?service=rawstore failed, reason: socket hang up

Chat with JohnSnowLabs(Ali) on October 4, 2017

  • there seems to be a discrepancy with datahub-cli vs the spec.
Alis-MacBook-Pro-3:FDA CVX Code Vaccines Administered alinaqvi$ data validate
> Error! Error: Descriptor validation error:
          Missing required property: name
          at "/contributors/0" in descriptor and
          at "/properties/contributors/items/required/0" in profile
  • fixed in data validate by updating the schema to the new spec v1. Reason: it used the pre-v1 spec schema for validation

  • also it would be great if all of the errors could be made part of the data validate function, e.g. when a date column can't be cast to a date because its format in datapackage.json is not specified correctly:

stream_remote_resources
ERROR   :Failed to cast row: Field "Birth_Year" can't cast value "2011" for type "date" with format "default"
Traceback (most recent call last):
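A fix for that cast error is to declare the field correctly in the descriptor's schema, for example as a year field (or as a date with an explicit format pattern); this fragment is illustrative:

```json
{
  "schema": {
    "fields": [
      { "name": "Birth_Year", "type": "year" }
    ]
  }
}
```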

Tasks

  • created issue on data validate command https://github.com/datahq/datahub-cli/issues/175
  • Adam is fixing the pipeline error; it is a bug
  • new datahub-cli release: either v0.5.0 or v0.4.1
  • Simpler way to get started and test it is working => issue in frontend
    • ACTION: add data info {some dataset on datahub} to getting started
    • As Brook I want to run a simple non-side-effects command to see everything is working before I do something bigger so that I don’t damage anything and feel more certain …
  • info command: encountered a poor error message => an issue https://gist.github.com/brew/eb47b4a6a3df01caa25b38079de47444 : should be no data package found …
  • XS data push help message is not clear and short => an issue
    • add real example, providing path and explain path options
  • data validate doesn't check data (and it should …) => an issue (big and significant)
    • Checked the data validate command; it checks data against the schema as well, so we do not need to handle this
    • Improve the help message by mentioning data validation
  • XS data normalize - poor CLI documentation at basic help screen (need more info - normalizes data package against spec 1.0)
    • the data normalize help message is good, with Usage, Options, and Examples sections
  • [minor] http://docs.datahub.io/developers/publish/ - sequence diagram needs a bit of fixing … (what fixing??)

FEEDBACK

Chat with Brook on September 19, 2017

Feedback on new v0.4.1 DataHub CLI

  • I still have a problem running data validate: Error! Error: Descriptor validation error: Missing required property: name at "/contributors/0" in descriptor and at "/properties/contributors/items/required/0" in profile I don't have a name attribute in my contributor. I have title.
  • I also tried data push, which was successful, but my showcase page has a pipeline error: https://datahub.io/brew/multiple-item/pipelines
  • Okay, I've removed the contributor property from my datapackage.json completely and reran data push. I now have a showcase page! https://datahub.io/brew/multiple-item

Few more notes

A few more notes and issues about the Showcase page as they relate to my interpretation of the available spec at https://specs.frictionlessdata.io/data-package/:

  • Will keywords property be used?
    • yes
  • Will description property in datapackage.json be used to populate the Showcase description? How does that relate to the Readme section?
  • Will homepage property be used? I'm using it to link back to the original dataset page on UKDS.
  • Related to homepage, a single source is printed to the showcase page as part of the dataset summary. Currently, this doesn't link to a source.path, and it uses the first item's source.title as a title attribute for the link. The datapackage spec allows for multiple sources and I've used it to link back to all resource files associated with my datapackage on UKDS (including non-tabular support files), so it seems odd to display just the first one.
  • Will additional admin tools be available for the Showcase page? Currently the page is unlisted; is there a way to change this to public? Are my datasets listed anywhere? I can't see my unlisted datasets on my profile page or dashboard.

Call with Brook on September 15, 2017

data push and data get did not work since the DataHub CLI binary was too old; it was last released on 9 August.

Ali feedback on September 13, 2017

  • He pushed data, but the pipeline is not working; it is being debugged by Adam.
    • data push returned a URL and success message, but the frontend kept showing the update message
  • The push help message, again, was short and not clear
    • It did not say that data push works with a provided path
  • He was interested in views, how they work and what needs to be done. Provided him with documentation at http://datahub.io/docs/features/views.
  • Could not install using npm
  • Also, the DataHub CLI binary instructions were a little unclear about putting it in a bin folder and making it executable with chmod +x. He handled it by himself
  • Agreed to chat tomorrow morning, when the pipeline will be working in stable mode

misc issue from Brook feedback on September 12, 2017

Docs - getting started

  • As Brook I want to run a simple non-side-effects command to see everything is working before I do something bigger so that I don't damage anything and feel more certain …
  • data validate doesn’t check data (and it should …)
  • data normalize - poor CLI documentation at basic help screen (need more info - normalizes data package against spec 1.0)
  • [minor] http://docs.datahub.io/developers/publish/ - sequence diagram needs a bit of fixing …

As Brook I want to get the pipeline status after pushing a data package, so that I can see whether it failed or succeeded

As Brook I want to get a hashed version of datapackage.json if the pipeline fails, so I can modify it later with changes

NOTE: this was because pipelines were not working and that is now fixed
