Users can now import online data files directly into the DataHub using the
data command line tool – and set up scheduled re-imports at the same time.
We’re very excited about this feature as it is the first step towards supporting automated scraping on a regular schedule. We have long wanted this for our own Core Data work, and we’re already using the feature.
We’ll use the example of “Energy consumption by sector” from the US Energy Information Administration. This data is updated on a monthly basis, so we want it to be re-imported every 30 days (~1 month):
data push "https://www.eia.gov/totalenergy/data/browser/csv.php?tbl=T02.01" --schedule="every 30d" --format=csv
By default, datasets you push to DataHub are “unlisted”, so only people with the link can see them. If you wish to make your dataset “published”, pass the --published flag:
data push URL --published
Once the process has completed, open your browser and check it out! The tool generates a URL based on your username and copies it to the clipboard, so you can just paste it into your browser. You should see something similar to this page:
Note: We’ve decided to still use the push command even though, unlike with local data, you are not “pushing” the file but rather importing it. Read more about the push command in our getting started guide.
You can set up a schedule so the DataHub will automatically re-import the remote file on a regular basis, for example every 2d. The number is always an integer, the unit selector is one of s/m/h/d/w (seconds, minutes, hours, days, weeks), and you can’t schedule for less than 60 seconds.
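The schedule rules above can be checked locally before pushing. What follows is only a minimal sketch: schedule_to_seconds is a hypothetical helper, not part of the data tool, and it assumes the string is exactly the word every, one space, a whole number and a single unit letter. The DataHub may well accept or reject strings differently.

```shell
# Hypothetical helper (not part of the data CLI): convert a schedule string
# such as "every 30d" into seconds, or fail if it breaks the rules above.
# Assumes the exact shape "every <integer><unit>" with a single space.
schedule_to_seconds() {
  spec=$1

  # The string must start with the word "every" followed by one space.
  case $spec in
    "every "*) rest=${spec#every } ;;
    *) echo "schedule must start with 'every '" >&2; return 1 ;;
  esac

  count=${rest%?}           # everything except the last character
  unit=${rest#"$count"}     # the last character (the unit selector)

  # The number must be a positive integer; leading zeros are rejected so
  # the shell does not read the value as octal.
  case $count in
    ''|*[!0-9]*|0*) echo "interval must be a positive integer" >&2; return 1 ;;
  esac

  # Map the unit selector (s/m/h/d/w) to a number of seconds.
  case $unit in
    s) factor=1 ;;
    m) factor=60 ;;
    h) factor=3600 ;;
    d) factor=86400 ;;
    w) factor=604800 ;;
    *) echo "unit must be one of s, m, h, d, w" >&2; return 1 ;;
  esac

  seconds=$((count * factor))

  # Schedules shorter than 60 seconds are not allowed.
  if [ "$seconds" -lt 60 ]; then
    echo "schedule must be at least 60 seconds" >&2
    return 1
  fi

  echo "$seconds"
}
```

For instance, schedule_to_seconds "every 30d" prints 2592000, while schedule_to_seconds "every 59s" fails because the interval is under 60 seconds.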
In our example above the dataset is updated on a monthly basis, so we set a schedule that runs every 30 days with --schedule="every 30d". The data file will then be re-imported about once a month.
To read more, including full details of the schedule format, see our full docs on importing online data to the DataHub.
If the file URL does not contain a conventional file name, you need to provide the format explicitly using the --format option, as we did with --format=csv in our example above.