My journey with DataHub Cloud

A personal narrative of my evolution in publishing data stories on this innovative platform

By César Heredia, data journalist

My journey with DataHub Cloud began in April 2024, when Rufus Pollock contacted me through Upwork. My profile there says I'm a data journalist, so I think that clicked with him. At the time, I didn't know who Pollock was. When I Googled his name, I found out.

I couldn't believe such an important person in the open data world would contact this humble Venezuelan who uses data for journalistic purposes, especially considering that Venezuela is not exactly known for its open data policies. Au contraire. It's hard to work with data in my country because of the lack of open and updated datasets. So, data people have to either build their own datasets or look for them on foreign websites and at NGOs.

But that's a topic for another occasion.

When Rufus contacted me, I was honored but also intrigued. Then came the interview, and I learned about the DataHub platform. What I understood (at the time) was that a lot of the things people usually do with data could be done within DataHub.

But then I discovered, with practice, that the process wasn't as automatic as I initially thought. I still had to clean and analyze datasets outside of the platform, so I still had to rely on Excel or Google Sheets and, if necessary, Python or R.

Then one thing became a frustrating part of my learning curve: my charts weren't showing up on the DataHub platform, even though I was doing everything right. It took around two weeks and a couple of video calls with a DataHub developer to figure it out.

Since then, my journey with DataHub has been easy peasy.

I've written articles with a journalistic approach on several topics, such as the 50 most followed YouTube channels, the 100 worst movies according to Rotten Tomatoes, a data analysis of the Paris Olympic Games, and the state of press freedom in the world in 2024, among others.

What I love about DataHub

Two things I find great about DataHub are the ease of use and version control through GitHub.

It's as easy as uploading all the datasets you need to a previously defined GitHub repository. In the same repository, there is a README.md file where you write the data story in Markdown and place the components (charts) that help illustrate your work.

The components are documented on the PortalJS components guide website. The available ones are:

  • Catalog (with and without facets).
  • Tabular.
  • IFrame.
  • PDF Viewer.
  • Line.
  • Bar (vertical).
  • Vega.
  • Map.

If I make a mistake, I go through the GitHub version history and fix the issue. I can also edit any dataset or the README file and commit (save) the changes. It's that easy.

What I'd improve

Although DataHub is awesome for working with data, I ran into some challenges I had to overcome while using the platform. However, I understand this is perfectly normal on a platform that keeps evolving incrementally.

For example, the FlatUITable component, ideal for tables, is designed to take the first column as the index column, which means the component will sort the data by the first column.

The catch is that sometimes I want to sort the data by another column, and not in the preset order (descending for numbers, A to Z for text).

Faced with this scenario, I found two alternatives: add an index column as the first column of the dataset, or move the field I want to sort by into the first column.
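Both alternatives can be prepared before uploading the file to GitHub. Here is a minimal sketch using only Python's standard library. The function name `reorder_for_table`, the `rank` column, and the sample data are hypothetical, and it assumes the table component keeps sorting by whatever ends up in the first column.

```python
import csv
import io


def reorder_for_table(csv_text, sort_field, numeric=False,
                      descending=False, index_name=None):
    """Sort CSV rows by sort_field and prepare the first column.

    If index_name is given, a 1-based index column is prepended
    (alternative 1). Otherwise sort_field is moved to the first
    column (alternative 2).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames
    if not fields or sort_field not in fields:
        raise ValueError(f"column {sort_field!r} not found")

    rows = list(reader)
    if numeric:
        rows.sort(key=lambda r: float(r[sort_field]), reverse=descending)
    else:
        rows.sort(key=lambda r: r[sort_field], reverse=descending)

    if index_name:
        # Alternative 1: an explicit index that records the desired order.
        for position, row in enumerate(rows, start=1):
            row[index_name] = position
        ordered = [index_name] + list(fields)
    else:
        # Alternative 2: the sort field becomes the first column.
        ordered = [sort_field] + [f for f in fields if f != sort_field]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=ordered, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = "channel,subscribers\nB,200\nA,300\nC,100\n"
    print(reorder_for_table(sample, "subscribers", numeric=True,
                            descending=True, index_name="rank"))
```

Whether the index column or the moved field works better depends on how the component orders the first column, so it is worth checking the rendered table after each change.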

Another issue I found is that Spanish speakers (like me) usually separate decimals with a comma. If the numbers in a CSV dataset use decimal commas, those commas get mixed up with the commas that separate the fields, and the file ends up corrupted. That's why it's so important to use a dot instead of a comma in numbers, as English speakers naturally do. Besides, the system interprets the dot as the default separator between integers and decimals.
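This cleanup can be scripted before the file goes into the repository. The sketch below is one possible approach, not something DataHub provides. It assumes the source file is semicolon-delimited (a common export format when decimals use commas) and that every value shaped like `3,14` or `1.234,56` is a Spanish-style number. The function names are hypothetical.

```python
import csv
import io
import re

# Matches Spanish-style numbers that contain a decimal comma,
# with optional dot-separated thousands: "3,14", "1.234,56", "-0,5".
DECIMAL_COMMA = re.compile(r"-?(\d{1,3}(\.\d{3})+|\d+),\d+")


def normalize_decimal(value):
    """Turn '1.234,56' into '1234.56'; leave anything else untouched."""
    text = value.strip()
    if DECIMAL_COMMA.fullmatch(text):
        return text.replace(".", "").replace(",", ".")
    return value


def convert_csv(text, in_delimiter=";"):
    """Read delimiter-separated text and return comma-separated CSV
    with decimal commas converted to decimal dots."""
    reader = csv.reader(io.StringIO(text), delimiter=in_delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([normalize_decimal(cell) for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    source = "country;value\nVenezuela;1.234,56\nChile;3,14\n"
    print(convert_csv(source))
```

Text cells that merely contain a comma, such as `Caracas, Venezuela`, are left alone and are quoted by the CSV writer, so they no longer break the columns.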

A similar thing happens with dates. DataHub uses the DD/MM/YYYY format instead of the MM/DD/YYYY format Americans use. The thing is that many datasets come in the American format, so I keep this in mind every time a dataset includes a date field.
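Converting a date column from the American format is a small step with Python's `datetime` module. This is only a sketch: the function name `us_to_day_first` is hypothetical, and it assumes every value in the column really is MM/DD/YYYY.

```python
from datetime import datetime


def us_to_day_first(value):
    """Convert an MM/DD/YYYY string to DD/MM/YYYY.

    Raises ValueError if the value is not a valid MM/DD/YYYY date,
    for example when the first number is greater than 12.
    """
    parsed = datetime.strptime(value.strip(), "%m/%d/%Y")
    return parsed.strftime("%d/%m/%Y")


if __name__ == "__main__":
    print(us_to_day_first("07/26/2024"))  # prints 26/07/2024
```

A date such as 03/04/2024 is valid in both formats, so the script cannot tell which one the dataset uses. That still has to be checked against the source.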

An issue not related to data formatting comes up when I update a dataset: DataHub Cloud doesn't show the updated values. How do I overcome this? I update and rename the dataset, and then, in the README.md file, I replace the old dataset name with the new one wherever it appears.
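The rename-and-replace workaround can be sketched in a few lines of Python. The `-v2`, `-v3` naming scheme and both function names are hypothetical, chosen only for illustration. Any new name works as long as the README points to it.

```python
import re


def next_version_name(filename):
    """Return the next versioned name: 'medals.csv' -> 'medals-v2.csv',
    'medals-v2.csv' -> 'medals-v3.csv'."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        # No extension: the whole name is the stem.
        stem, ext = filename, ""
    match = re.fullmatch(r"(.*)-v(\d+)", stem)
    if match:
        base, number = match.group(1), int(match.group(2)) + 1
    else:
        base, number = stem, 2
    suffix = f".{ext}" if dot else ""
    return f"{base}-v{number}{suffix}"


def update_readme(readme_text, old_name, new_name):
    """Replace every mention of old_name in the README text."""
    return readme_text.replace(old_name, new_name)


if __name__ == "__main__":
    old = "medals.csv"
    new = next_version_name(old)
    readme = "The chart below reads medals.csv from this repository."
    print(update_readme(readme, old, new))
```

The replacement is a plain text substitution, so a name that is part of a longer filename (for example `medals.csv` inside `all-medals.csv`) would be changed too. It is worth reviewing the diff before committing.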

A component I haven't explored yet, and that I'm sure will improve the quality of my reports, is the GeoJSON points map. I expect to use it shortly. I owe that one to the DataHub platform.

What do I think is missing on DataHub?

For future developments of DataHub, I would like to use components that are not currently available, such as horizontal bar charts, line charts with more than one line, stacked bar charts, or scatter plots, just to mention four specific examples. The latter is ideal for showing the relationship between two numeric variables, e.g., the correlation between gross domestic product and life expectancy.

It would also be interesting to be able to format chart components: the color (only blue is available at the moment), the thickness of the bar or line, and extras such as a legend, a summary beyond the chart title, or a footer that mentions the source of the data.

Summary

I see great potential in DataHub Cloud for data lovers (like me). I'm proud of using this innovative platform for my data stories/insights while contributing to its growth. I hope to keep having the opportunity to publish many more things on DataHub since it's a platform I really enjoy.

© 2024 All rights reserved. Built with DataHub Cloud