Inbox
- list product options 🚧2023-02-21 see below
- review ../areas/datahub-v3-2021 and https://datahub.io/docs/dms/datahub/v3
- review old work from 2020 on vision for git + storage = ❤️
- analyse basics of "DataHub as Github" option
Day 3 II - briefing for team etc
Day 3 - 27 Feb 2023
Present: rufuspollock anu
Intention: create technical roadmap for next month
- Choose between showcase first, publish first, or both ✅2023-02-27 go for showcase first
- Datasets or data literate documents? ✅2023-02-27 doing data literate first.
Agenda
- Check-in
- Create agenda
- Review agenda
- Design and roadmap
- Review milestones
- Create domain model (basic and more advanced)
- Briefing for the team
- 🔥 What does Ola need? (what does team need) ✅2023-02-27 see briefing below which goes from top level down to immediate next steps
- Does she need a high level vision? ❓ maybe
- Does she need immediate steps? ✅2023-02-27 yes she does
- What background material does she need?
- Material for Ola?
- What is immediate goal?
- What are steps to that?
- What is the high level vision?
- Visual version ?? https://app.excalidraw.com/s/9u8crB2ZmUo/6GYoM18s0XX
- AOB
- Vercel vs Cloudflare Pages
- Naming: git-dms vs datahub v3 vs … ✅2023-02-27 datahub-next
- Meta question: how best to list features against milestones? It may be better just to have a big backlog and then assign. But at some point we want to see what is pulled into a given milestone.
Parking lot
- Where do we put material / ideas for the project atm? e.g. in datahub-core/docs
- Where do i publish notes for this work?
- Answer the data literate issue https://github.com/flowershow/flowershow/issues/286
- 🔥 Decide on quickest way to get started e.g. start a new repo for spike solution or use datahub.io?
- How do we converge flowershow planning and datahub? ✅2023-02-27 we don't for now. just focus on datahub core
Someday
- Plugins
- Why have them? so that people can expand our functionality for us ✅2023-02-27
- How would it work?
- What are the hook points?
- 🔥 Post data literate stuff on datahub.io/docs/
- 🚭 Sort out datahub.io/docs e.g. move stuff for v2 to datahub.io/docs/v2/
- Do a plan
- Review what is there
- Implement
- Write up the history
- https://github.com/datopian/datahub-git-based
- ..
- …
- => datahub-next
Actions
- Create mermaid diagram of backlog with sequencing
- Define "data rich" documents and their levels
Showcase
Note and Vote: ideas and questions
Questions
- Datasets or data literate documents to start with? 💬 ? Rufus inclines to data rich documents to start with
Anu
- Design
- README first with quick navigation to data content
- Old layout/design is good so probably don't change it too much. But might update fonts and structure.
- Check similar sites and what they are doing: kaggle, statista (?),
- Data is great but adding README driven analysis / docs might be something cool / boost popularity vs what we had before. Note previously we also had README but it was purely tech docs about preparing data etc. We could start deriving new datasets with a bit of insights?
Misc
- Can we have a way to start getting support already? We have enough traffic that it might generate some revenue, who knows.
- Add buy a coffee button :smiley:
- e.g. obsidian have their supporter/insider model etc
Showcase v0.1
Design
- Pick best current showcase
- List current showcases with screenshot and url e.g. datahub.io/core/finance-vix, bayanat etc
- Sketch in figma??
Implement
- Assumptions: repo is public, has datapackage.json in root, data is in repo (not git lfs or remote?). data is csv or json.
- Create a "Data Layer" that encapsulates retrieving stuff
- Review existing libraries e.g. metastore-lib-js, frictionless-js
- Spec simplest approach possible
- Data Explorer
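
A minimal sketch of what such a "Data Layer" could look like under the assumptions listed above (public repo, `datapackage.json` in the root, CSV or JSON data stored in the repo itself). This is illustrative only, not the agreed design: the `DataLayer` class, its method names and the injectable `fetch` parameter are hypothetical, and it assumes each resource's `path` is a single relative string.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable
from urllib.request import urlopen


def http_fetch(url: str) -> str:
    """Default fetcher: plain HTTP GET returning the body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


@dataclass
class DataLayer:
    """Hypothetical wrapper that hides how files are retrieved from a public GitHub repo."""

    owner: str
    repo: str
    ref: str = "main"  # assumption: the default branch is called "main"
    fetch: Callable[[str], str] = http_fetch  # injectable so it can be swapped or faked

    def url_for(self, path: str) -> str:
        # Raw file URL for a path inside the repo (no git lfs or remote data).
        return f"https://raw.githubusercontent.com/{self.owner}/{self.repo}/{self.ref}/{path}"

    def descriptor(self) -> dict:
        # Assumption from the notes: datapackage.json sits in the repo root.
        return json.loads(self.fetch(self.url_for("datapackage.json")))

    def rows(self, resource_name: str) -> list:
        # Look the resource up by name in the descriptor, then load its data file.
        resource = next(
            r for r in self.descriptor()["resources"] if r["name"] == resource_name
        )
        text = self.fetch(self.url_for(resource["path"]))
        if resource["path"].endswith(".json"):
            return json.loads(text)
        # Anything else is treated as CSV with a header row.
        return list(csv.DictReader(io.StringIO(text)))
```

Usage would be along the lines of `DataLayer("example-owner", "example-repo").rows("example-resource")`, where the owner, repo and resource names are placeholders. Keeping retrieval behind one small interface means the showcase pages would not need to change if storage later moves to git lfs or a remote store.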
Day 2 - 24 Feb 2023
Present: rufuspollock anu
Working on ../projects/datahub.io-design-sprint-2023
Goal: complete defining the product and create backlog of work i.e. https://github.com/datopian/product/issues/139
Agenda
- Check-in
- Create agenda
- Review agenda
- New product overview in areas/git-dms
- Roadmap of tasks
- Brainstorm high level list of tasks
- Review tasks and material here from yesterday and today ✅2023-02-24 processed into ../areas/git-dms and related
- Review tasks in ../areas/datahub-v3-2021 to see what i can port ✅2023-02-25 processed all tasks from there.
- Review datahub pages projects and outline to see what to integrate
- ../areas/pages ✅2023-02-26 all processed. mainly pitch. not many tasks. This is worth reviewing in future perhaps
- ../projects/pages-v0.1.0-readme-with-csv-table-preview ✅2023-02-25 all processed across
- Prioritize tasks ✅2023-02-26 ❌ wontfix. go with milestones and work from there.
- identify initial goals / milestones ✅2023-02-27 see ../areas/git-dms#Milestones & OKRs
- Detail on first X prioritized tasks
- Search v0.1
- Showcase v0.1
- Publish
- Issues created for them for backlog
- Updated high level overview of all products so that we can show to team
- AOB
- DataHub.io naming: Resolve confusion over DataHub.io i.e. it is a site and a product. Maybe we use term DataHub Cloud or better we have a name like git-dms ✅2023-02-24 git-dms (and rename current git-dms to git-enterprise-catalog) or git-portal
- Introduce these ideas to team
- How much do we do in public? e.g. do we publish the product vision. do we open source and what? ✅2023-02-24 we can put general docs and vision in public datahub.io/docs/git-dms/ or similar. code itself is not open source for now
Extra
- Backpost datahub v3 outline from 2021 on datahub.io so i can then modify it and have it be authoritative vision for it
Day 1: 23 Feb 2023
#done/process to ../areas/git-dms and issue https://github.com/datopian/product/issues/139
Present: rufuspollock anu
Sprint goals: MVP for DataHub.io. Details moved to https://github.com/datopian/product/issues/139
Agenda
- Check-in
- Create agenda and goals
- Review agenda
- Where are we at? ✅2023-02-23 putting enterprise on backburner and focusing on DataHub.io as "publish your dataset quickly and easily"
- What is the MVP? ✅2023-02-23. Moved to ../areas/git-dms
- What were previous efforts / notes we could draw on? ✅2023-02-23 see list of previous efforts below
- What is the technical roadmap 🚧2023-02-23 - sketched publish ui flow below and did some analysis about pipelines. now moved to ../areas/git-dms and links therein
- Other questions
- How it could help with DataHub Open Data? ✅2023-02-23 a great DataHub.io is a demo for DataHub Open Data
- Is it a demo or more a subscription based product? ✅2023-02-23 it's both with the definite aim to be a product people pay for
Recap of where we are [Anu]
- Have Enterprise landing page oriented to general data management, and most people coming through are looking for metadata management
- Not going into metadata management space so much
- i.e. have lineage, just catalog with focus on importing metadata from multiple enterprise sources
- Rufus: do we need to justify this? No, not for now. Rufus has lot of notes in notebook if needed
- focus on datahub.io as "share your data analysis", "share your dataset", "make your team-mates discover your dataset"?
- instead of trying to create this enterprise product
- having datahub.io as a product people like
- user-driven development
- could lead to enterprise sales via passionate users in enterprise.
- NB: will keep enterprise offer around and will keep doing calls and see what happens. But not priority.
- Next: have a plan on datahub.io
- Anu has some high level plans
- What is the offer?
- What are the main features?
- Why would someone subscribe?
- Important for the developers as well
Appendix
Appendix: Learnings compared to DataHub v2
- 🔥🔥 hard to publish i.e. no github based publishing (focused on command line tool that was hard to install and buggy) => build off github
- 🔥🔥 Reinvented the wheel in processing e.g. created our own airflow system. if this broke (and it did) you couldn't even publish => use a standard framework and (at least at start) github actions as a runner (maybe prefect cloud later)
- 🔥🔥 no data APIs (so data size was an issue in presentation)
- 🔥 Did not (try to) monetize it
- Did not try very much e.g. no plans with a way to sign up for them
- not sure of the value proposition
- confused between data as a service and publishing as a service
- did not make value-add features obvious e.g.
- data validation
- views and embedding views
- versioning
- Reinvented the wheel in storage (rather than git lfs)
- Did not analyse (or stopped analysing) user behaviour, e.g. how many users there are and what they are doing.
Prior versions of some of these efforts 😜
- DataHub v3: (March 2021) https://datahub.io/docs/dms/datahub/v3/ (plus original gdoc that is mostly processed but still has some useful stuff)
- vercel for data (datahub) + next for data (portaljs)
- emphasis on publishing from command line e.g.
datahub publish
- Tagline still seems very strong: "Make it stupidly easy, fast and reliable to share your data in a useable* way**."
- Evolved / simplified to DataHub pages
- Pages different how?
- emphasis on being able to publish from command line. We
- DataHub Summer 2022: https://docs.google.com/document/d/126WZidR3bk2wvYDoMi8Z5p2wPlWU-XlySa32ErSyEUg/edit#
- Turn Github into a Datahub. The tool you're familiar with, with the data features you’ve been missing.
- Git + ☁️ = ❤️
Whiteboard
Product: DataHub.io aka DataHub Cloud
Subject: what is product direction for DataHub.io and DataHub suite in general?
Hypothesis: ??
- not sure we believe in the enterprise route. What has been our success so far in the last 6 months? How many enterprise customers have we successfully landed (and expanded) in the last 5-10y? Ans: ~5. Yes, they come through but we seem to have very limited stickiness. Every one has moved on in some way (we think, though maybe some are still using the solution?)
- we have had dozens of conversations. very few conversions. we have no enterprise sales team.
- i think you could get to enterprise but likely through a passionate user route.
- There are four options
- "Hub" (or DataHub for GitHub). Power data user oriented data publishing / sharing.
- MVP is heavily github oriented e.g. Connect github (+ storage) ⟹ presentable and queryable data ⟹ share your data
- Catalog: connect data sources ⟹ catalog ⟹ find your data
- Pages: connect docs and data ⟹ data driven website ⟹ share my insights
- Portal: classic (open) data portal.
- All of these have significant overlap but differ in feature emphasis and audience
- It comes down to Hub vs Enterprise
- We already have a solid portal product. The portal product overlaps.
- Pages can be subsumed under Hub and probably isn't substantial enough on its own.
- Hub is attractive because
- It can be "consumer" oriented so purchase individually (and also purchased by enterprise)
- It has network-like effects (like github): people show the product to others by using it
- It has close similarities to open data portals that we have built successfully for many years
- We have a connection with the data community that we can leverage
- We have an innovative approach (the github basis) which confers some immediate distinctive advantages e.g. versioning
- It is also a risk …
- Aren't going with Catalog because
- Hard to win and grow customers
- Have not been winning them.
- Experience in the last 6 months has not generated many solid leads and no conversions so far. we have had dozens of conversations. very few conversions.
- Relatively few (< 5) enterprise customers that we successfully landed (and expanded) in the last 5-10y. Plus the ones that come through have very limited stickiness. Every one has moved on in some way (we think, though maybe some are still using the solution?)
- Have no enterprise sales team.
- substantial work to have a competitive product (e.g. need quite a few ingestors, need a lineage system etc)
- tough to make money without going more proprietary (which makes sales and removes a major differentiator)
- limited overlap with portal product
- we don't have a strong base in enterprise sales or the sales capacity (resources etc) to sell well in that area
- Details
- Our catalog product is in development and we have enough of a demo to pitch it.
- Enterprise has tough sales cycle
- We don't have a leading product
- It's an increasingly competitive space where we don't have a particularly innovative take other than being open source and established (with ckan)
- We have now had quite a few enterprise clients. Our experience has been:
- on business side we get hired for professional services but at some point they take the work in-house or cut the project for political reasons
- on technical side
- NB: we think that Hub could be purchased by enterprise one day (in same way github built up to enterprise)
reflections (from last week with Anu)
- What could product be (high level)? there are two directions DataHub.io could go (and we could go in general with the DataHub product)
- Which direction do we choose?
- What are strengths / weaknesses of each option?
- This is some of the demand we have been seeing. however, …