Leo Thoughts re datahub-next ideas, a possibility for the forward strategy


Comment from Anu

I think all the ideas here are really valuable, as Leo has a background in data science/analysis.

It is probably worth emphasising that we'd like to build the next version of datahub so that we keep the community and grow it further. We believe that datahub.io already has great traffic, but we think it is missing something to become a great product.

We should consider the existing market and see where we are at the moment, build some hypotheses and work against them. I'm not sure if the current draft of the roadmap has been reviewed, but if not it is worth reading quickly so that we are on the same page.

Leo Email

Hi,

I've been thinking about the product from the user's point of view: the value is basically in letting people share data analysis in an easy and meaningful way.

I've come up with a strategy that includes several stages, starting from the user's point of view instead of from datahub's. I write down here the basic idea and paths for a later discussion.

I'll start with some context and a personal example.

Context

From the data point of view, the creation of spreadsheets was historically a market- and world-changing technology. It lets people handle, compute and graph data, but above all it lets them share and show conclusions in visual plots, which is the big deal.

Nevertheless, sharing spreadsheets is not practical: there are limits not only on the amount of data that can be handled but also on which systems the file can be used, depending on the spreadsheet format.

From the programming perspective, on the other hand, there are better tools for dealing with data and visualizing changes, among them RStudio and SAS. One of the industry-transforming technologies has been the notebook, with Jupyter currently one of the winners thanks to its support for multiple language backends (including Julia, Python, R, Scala, Bash, etc.).

One of the current ways of sharing data and analysis is, for example, exporting HTML from a Jupyter notebook, like here: https://leomrocha.github.io/ud_conllu_v2.6/index.html

Nevertheless, making the graphs takes time and effort.

Currently everything lives on the web, and there are many ways to share data and visualizations, but nothing makes sharing visualizations on any platform stupidly easy.

In another dimension we can look at how data is shared and integrated. Several platforms exist for this, but integration and sharing remain a problem. One of the pain points is having to create an account and do things the platform's way instead of something easy. There are also privacy issues for restricted data, which means these platforms can't necessarily be used for internal reports in many corporations and governments.

If we also check where things are being shared, from the popularization point of view medium.com is one of the most popular places.

From the scientific point of view there is arxiv.org (and others focused on different communities, like medrxiv.org and biorxiv.org).

So the question is:

How can we make it stupidly easy to create, share and integrate graphs, in a way that allows a community to grow around it?

To sum up this first part:

  • from one side, there is software like spreadsheets that makes creating graphs easy for limited amounts of data, but is rather more complex for data transformations and limited in data size.

  • from the other side, there is the programmatic way of dealing with data, which allows complex data manipulation but makes graphing more complex.

  • everything is web first

  • there are diverse platforms, each doing its own thing

  • We need something stupidly easy to make, integrate, share and search.

  • We also need to remove any barriers to use (i.e. account creation and so on).

Strategy

In this text the strategy comes first: the aim is to build something that is useful from the first version, getting the following things right from the start:

  • Make it really easy to create graphs from CSV data

  • Make it easy to export and create graphs

  • NO need for any type of account (at least at the beginning)

  • allow a community to form

Each step also contains iterations, which means none of them is terminal and each can always be improved. The idea is to create the first version of each step and then iterate simultaneously within them.

Step 1 - MVP

The first MVP should allow for the following:

Version 0.1 alpha:

  • From a tabular source of data (anywhere), select the graphs to draw and the conditions for showing the data; only a few graph types should be offered at first (like timeseries, pie charts and some others)

  • This can be exported as HTML and saved in GitHub Pages (I choose GitHub Pages because it's easy, popular and already available to the developer and scientist community)

  • Command line: just select a data source and graph it; maybe be able to create multiple graphs by selecting either different CSV input files or

  • a command to auto-crop a CSV file (like head -n XXX on Linux) to make iteration quick while building graphs
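To make the 0.1 alpha concrete, here is a minimal Python sketch of the two core pieces above: a `head -n`-style crop and a CSV-to-standalone-HTML chart export. The function names and the inline-SVG approach are my own assumptions, not an agreed design; a real version would likely use a proper charting library.

```python
import csv
import io

def crop_csv(text, n):
    """Keep only the header plus the first n data rows (like `head -n` on Linux)."""
    lines = text.splitlines()
    return "\n".join(lines[: n + 1])

def csv_to_html_chart(text, x_col, y_col, width=600, height=300):
    """Render two CSV columns as a standalone HTML page with an inline SVG polyline."""
    rows = list(csv.DictReader(io.StringIO(text)))
    xs = [float(r[x_col]) for r in rows]
    ys = [float(r[y_col]) for r in rows]

    def scale(vals, size):
        # Scale values into the SVG viewport.
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0
        return [(v - lo) / span * size for v in vals]

    # SVG's origin is top-left, so flip the y coordinate.
    pts = " ".join(
        f"{px:.1f},{height - py:.1f}"
        for px, py in zip(scale(xs, width), scale(ys, height))
    )
    return (
        f"<html><body><svg width='{width}' height='{height}'>"
        f"<polyline fill='none' stroke='black' points='{pts}'/>"
        f"</svg></body></html>"
    )
```

The output is a single self-contained HTML file, which is exactly what makes the GitHub Pages export path above trivial: commit the file and it is published.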

Version 0.2 alpha:

  • A (static?) website that allows building these graphs from a webpage. This website can be run locally or on a datahub.io helper page.

  • The website follows a graph-building helper workflow:

  • ask for ONE data source

  • extract the column names (if there are no column names, ask for them or automatically name them col1, col2 …)

  • present the available graphs; the user selects one

  • present how to name each graph axis, plus some other options (for example, what kind of tooltip to show, whether to enable dynamic exploration, etc.)

  • the user selects Generate -> this creates a file and allows the user to save it; it can then be added to a GitHub repository or shared as a report (HTML, PDF)
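The column-name step of the workflow above can be sketched in a few lines. This is only an illustration of the "ask for names or auto-name col1, col2 …" idea; the numeric-first-row heuristic for detecting a missing header is my assumption.

```python
import csv
import io

def extract_columns(text):
    """Return the column names for a CSV, falling back to generated names
    (col1, col2, ...) when the first row looks like data rather than a header."""
    first = next(csv.reader(io.StringIO(text)))

    def looks_numeric(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    if any(looks_numeric(c) for c in first):
        # First row contains numbers, so treat it as data, not a header.
        return [f"col{i}" for i in range(1, len(first) + 1)]
    return first
```

In the website workflow, the generated names would be shown to the user as editable defaults rather than applied silently.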

Version 0.3 alpha:

  • Better integration with git[hub]

  • Better integration with datahub.io

  • Better integration with medium.com

  • JSON data integration (and possibly other formats)

  • load zipped data

  • other data sources (not CSV)

  • start working on other features requested by the community
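For the JSON integration point, the simplest path is to flatten a JSON array of objects into the same header-plus-rows shape the CSV pipeline already consumes. This is a sketch under the assumption of flat records with scalar values; nested objects would need more thought.

```python
import json

def json_to_table(text):
    """Flatten a JSON array of objects into (header, rows) so the CSV
    graphing pipeline can consume it. Missing keys become empty cells."""
    records = json.loads(text)
    # Collect the union of keys, preserving first-seen order across records.
    header = []
    for rec in records:
        for key in rec:
            if key not in header:
                header.append(key)
    rows = [[rec.get(key, "") for key in header] for rec in records]
    return header, rows
```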

Version 0.4 alpha:

  • Allow generating and visualizing data diffs

  • Graph from multiple data sources (multiple files)

  • More complex integrations like data versioning
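The data-diff bullet can be made concrete with a row-level diff keyed on one column. This is a sketch of the idea only; the key-column approach and the added/removed/changed split are assumptions about what a datahub diff would report.

```python
def diff_rows(old, new, key=0):
    """Row-level diff between two tables, keyed on the column at index `key`.
    Returns (added, removed, changed) where changed pairs old and new rows."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}
    added = [row for k, row in new_by_key.items() if k not in old_by_key]
    removed = [row for k, row in old_by_key.items() if k not in new_by_key]
    changed = [
        (old_by_key[k], new_by_key[k])
        for k in old_by_key
        if k in new_by_key and old_by_key[k] != new_by_key[k]
    ]
    return added, removed, changed
```

Visualizing the diff is then a rendering problem on top of these three lists, which fits the graph-export machinery from Step 1.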

Step 2 - Communication & Community

  • From the beginning we could start writing about data visualization, analysis and storage; I would gladly do this. We start using our own tools and showing that it's amazingly easy, with no setup or hassle.

  • Writing periodically (once a week or every two weeks) will in due time grow a community, or at least get the datahub.io name seen.

Step 3 - datahub integration

Version 0.1 Beta:

  • Graph generation is available as part of datahub.io. It does NOT need authentication, but at the end of the graph creation the user is offered the option to create an account.

  • If the user is logged in, the actions and "dashboard" configurations can be saved for later on the datahub.io server (or in a GitHub repository)

  • While the graph is being built, automatically create an entry in Datahub.io (let the user turn this on or off) pointing to the user's GitHub page/username, and allow the user to create a datahub.io account or not (it doesn't really matter; the user can later claim the data with his/her GitHub account). This allows for distributed data integration with other sources and metadata search. This is where some of the future value can come from: data-source search and integration services.

Advanced (paid) features:

  • Being able to integrate multiple data sources

  • query multiple data sources (via GraphQL) as if they were one

  • notifications on data changes
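The "query multiple sources as one" feature can be illustrated with a plain-Python federation sketch. This is only the core idea; a real version would presumably expose it through a GraphQL schema with resolvers per source, and the class and method names here are hypothetical.

```python
class FederatedSource:
    """Query several tabular sources as if they were one."""

    def __init__(self, sources):
        # sources: mapping of source name -> list of {column: value} records
        self.sources = sources

    def query(self, column, value):
        """Find all records matching column == value across every source,
        tagging each hit with the source it came from."""
        hits = []
        for name, records in self.sources.items():
            for rec in records:
                if rec.get(column) == value:
                    hits.append({"source": name, **rec})
        return hits
```

The tagging matters for the paid tier: provenance is what lets a user trust and trace a record that came from a community-collected data list.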

Step 4 - Monetization strategy

Now, how does this produce money? Well, at the beginning it doesn't, and this is the main problem for a self-funded company.

The main idea from the start is to get as much market share as possible at the least cost. For this, there are several points to tackle:

  • start communicating from the beginning and try to build a user base

  • communication should go through the channels most used in the data community, for example medium.com and arxiv.org (look for other channels too)

  • get as much publicity and feedback as possible

The following strategies can be used for monetization:

  • consulting

  • data integration (of multiple data sources; having a community-collected data list is amazing in this regard)

  • data versioning hosting

  • data visualization (here we enter a somewhat new market for Datopian, as it isn't mainly focused here, but it does bring a lot of value)

  • Datahub.io on-premise installation and support

  • ad-hoc paid data analysis service pipelines

Well, this is my take on what I'd build and how: a product around data visualization, with data versioning along the way, where the main value proposition is being able to quickly build and share information in graph form while integrating the output in different places, on public and private servers, without compromising any private data.
