Inbox

Day 3 II - briefing for team etc

See datahub-next-feb-2023

Day 3 - 27 Feb 2023

Present: rufuspollock anu

Intention: create technical roadmap for next month

  • Choose between showcase first, publish first, or both ✅2023-02-27 go for showcase first
  • Datasets or data literate documents? ✅2023-02-27 doing data literate first.

Agenda

  • Check-in
  • Create agenda
  • Review agenda
    • Design and roadmap
      • Review milestones
      • Create domain model (basic and more advanced)
    • Briefing for the team
      • 🔥 What does Ola need? (what does team need) ✅2023-02-27 see briefing below which goes from top level down to immediate next steps
        • Does she need a high level vision? ❓ maybe
        • Does she need immediate steps? ✅2023-02-27 yes she does
        • What background material does she need?
      • Material for Ola?
  • AOB
    • Vercel vs Cloudflare Pages
    • Naming: git-dms vs datahub v3 vs … ✅2023-02-27 datahub-next
    • Meta question: how best to list features against milestones? It may be better to keep one big backlog and then assign items to milestones, but at some point we will want to see what has been pulled into a given milestone.

Parking lot

  • Where do we put material / ideas for the project atm? e.g. in datahub-core/docs
    • Where do I publish notes for this work?
  • Answer the data literate issue https://github.com/flowershow/flowershow/issues/286
  • 🔥 Decide on quickest way to get started e.g. start a new repo for spike solution or use datahub.io?
  • How do we converge flowershow planning and datahub? ✅2023-02-27 we don't for now. just focus on datahub core

Someday

  • Plugins
    • Why have them? so that people can expand our functionality for us ✅2023-02-27
    • How would it work?
      • What are the hook points?
  • 🔥 Post data literate stuff on datahub.io/docs/
    • 🚭 Sort out datahub.io/docs e.g. move stuff for v2 to datahub.io/docs/v2/
      • Do a plan
        • Review what is there
      • Implement
  • Write up the history

Actions

  • Create mermaid diagram of backlog with sequencing
  • Define "data rich" documents and their levels

Showcase

Note and Vote: ideas and questions

Questions

  • Datasets or data literate documents to start with? 💬 ? Rufus inclines to data rich documents to start with

Anu

  • Design
    • README first with quick navigation to data content
    • Old layout/design is good so probably don't change it too much. But might update fonts and structure.
    • Check similar sites and what they are doing: Kaggle, Statista (?)
  • Data is great, but adding README-driven analysis / docs might be something cool and could boost popularity vs what we had before. Note that previously we also had a README, but it was purely tech docs about preparing data etc. We could start deriving new datasets with a bit of insight?

Misc

  • Can we have a way to start getting financial support already? We have enough traffic that it might generate some revenue, who knows.
    • Add a "buy me a coffee" button 😃
    • e.g. obsidian have their supporter/insider model etc

Showcase v0.1

Design

  • Pick best current showcase
    • List current showcases with screenshot and url e.g. datahub.io/core/finance-vix, bayanat etc
  • Sketch in figma??

Implement

  • Assumptions: repo is public, has datapackage.json in root, data is in repo (not git lfs or remote?). data is csv or json.
  • Create a "Data Layer" that encapsulates retrieving stuff
    • Review existing libraries e.g. metastore-lib-js, frictionless-js
    • Spec simplest approach possible
  • Data Explorer
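
A minimal sketch of what the "Data Layer" could look like under the assumptions listed above (public repo, datapackage.json in the root, CSV or JSON data stored in the repo itself). This is an illustration only, not the agreed design: the function names (raw_url, parse_resource, load_dataset), the default branch "main", and the injectable fetch parameter are all hypothetical, and it assumes each resource in datapackage.json has a single relative "path" string. It is written in Python for brevity; the libraries under review (metastore-lib-js, frictionless-js) are JavaScript.

```python
import csv
import io
import json
from urllib.parse import quote
from urllib.request import urlopen

# Raw file host for public GitHub repos (assumption: repo is public).
RAW_BASE = "https://raw.githubusercontent.com"


def raw_url(owner: str, repo: str, path: str, ref: str = "main") -> str:
    """Build the raw-content URL for a file in a public GitHub repo.

    Assumption: `ref` defaults to "main"; older repos may use "master".
    """
    return f"{RAW_BASE}/{owner}/{repo}/{ref}/{quote(path.lstrip('/'))}"


def resource_format(resource: dict) -> str:
    """Use the declared format, falling back to the file extension."""
    fmt = resource.get("format") or resource["path"].rsplit(".", 1)[-1]
    return fmt.lower()


def parse_resource(text: str, fmt: str) -> list:
    """Parse CSV or JSON text into a list of rows (dicts for CSV)."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")


def load_dataset(owner: str, repo: str, ref: str = "main", fetch=None):
    """Return (descriptor, {resource name: rows}) for one dataset repo.

    `fetch` maps a URL to text. It is injectable (hypothetical design
    choice) so the layer can be exercised without network access.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url).read().decode("utf-8")
    descriptor = json.loads(fetch(raw_url(owner, repo, "datapackage.json", ref)))
    tables = {}
    for res in descriptor.get("resources", []):
        # Assumption: "path" is one relative path (not a list, URL or git lfs pointer).
        text = fetch(raw_url(owner, repo, res["path"], ref))
        tables[res.get("name", res["path"])] = parse_resource(text, resource_format(res))
    return descriptor, tables
```

Keeping all retrieval behind one load_dataset call means the showcase page and the Data Explorer would only ever see a descriptor plus parsed rows, so remote storage or git lfs could be added later without touching them.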

Day 2 - 24 Feb 2023

Present: rufuspollock anu

Working on ../projects/datahub.io-design-sprint-2023

Goal: complete defining the product and create backlog of work i.e. https://github.com/datopian/product/issues/139

Agenda

  • Check-in
  • Create agenda
  • Review agenda
    • New product overview in ../ideas/git-dms
    • Roadmap of tasks
      • Brainstorm high level list of tasks
      • Prioritize tasks ✅2023-02-26 ❌ wontfix. go with milestones and work from there.
    • identify initial goals / milestones ✅2023-02-27 see ../ideas/git-dms#Milestones & OKRs
    • Detail on first X prioritized tasks
      • Search v0.1
      • Showcase v0.1
      • Publish
    • Issues created for them for backlog
    • Updated high level overview of all products so that we can show to team
  • AOB
    • DataHub.io naming: Resolve confusion over DataHub.io i.e. it is a site and a product. Maybe we use term DataHub Cloud or better we have a name like git-dms ✅2023-02-24 git-dms (and rename current git-dms to git-enterprise-catalog) or git-portal
    • Introduce these ideas to team
    • How much do we do in public? e.g. do we publish the product vision? Do we open source, and if so what? ✅2023-02-24 we can put general docs and vision in public at datahub.io/docs/git-dms/ or similar. Code itself is not open source for now

Extra

  • Backpost datahub v3 outline from 2021 on datahub.io so i can then modify it and have it be authoritative vision for it

Day 1: 23 Feb 2023

#done/process to ../ideas/git-dms and issue https://github.com/datopian/product/issues/139

Present: rufuspollock anu

Sprint goals: MVP for DataHub.io. Details moved to https://github.com/datopian/product/issues/139

Agenda

  • Check-in
  • Create agenda and goals
  • Review agenda
    • Where are we at? ✅2023-02-23 putting enterprise on backburner and focusing on DataHub.io as "publish your dataset quickly and easily"
    • What is the MVP? ✅2023-02-23. Moved to ../ideas/git-dms
      • What were previous efforts / notes we could draw on? ✅2023-02-23 see list of previous efforts below
    • What is the technical roadmap 🚧2023-02-23 - sketched publish ui flow below and did some analysis about pipelines. now moved to ../ideas/git-dms and links therein
    • Other questions
      • How it could help with DataHub Open Data? ✅2023-02-23 a great DataHub.io is a demo for DataHub Open Data
      • Is it a demo or more a subscription based product? ✅2023-02-23 it's both with the definite aim to be a product people pay for

Recap of where we are [Anu]

  • Have Enterprise landing page oriented to general data management, and most people coming through are interested in metadata
  • Not going into metadata management space so much
    • i.e. have lineage, just catalog with focus on importing metadata from multiple enterprise sources
    • Rufus: do we need to justify this? No, not for now. Rufus has lot of notes in notebook if needed
  • focus on datahub.io as "share your data analysis", "share your dataset", "make your team-mates discover your dataset"?
    • instead of trying to create this enterprise product
    • having datahub.io as a product people like
    • user-driven development
    • could lead to enterprise sales via passionate users in enterprise.
    • NB: will keep enterprise offer around and will keep doing calls and see what happens. But not priority.
  • Next: have a plan on datahub.io
    • Anu has some high level plans
    • What is the offer?
    • What are the main features?
    • Why would someone subscribe?
    • Important for the developers as well

Appendix

Appendix: Learnings compared to DataHub v2

  • 🔥🔥 hard to publish i.e. no github based publishing (focused on command line tool that was hard to install and buggy) => build off github
  • 🔥🔥 Reinvented the wheel in processing e.g. created our own airflow system. if this broke (and it did) you couldn't even publish => use a standard framework and (at least at start) github actions as a runner (maybe prefect cloud later)
  • 🔥🔥 no data APIs (issue with data size therefore in presentation)
  • 🔥 Did not (try to) monetize it
    • Did not try very much e.g. no plans with a way to sign up for them
    • not sure of the value proposition
    • confused between data as a service and publishing as a service
  • did not make value-add features obvious e.g.
    • data validation
    • views and embedding views
    • versioning
  • Reinvented the wheel in storage (rather than using git lfs)
  • Did not analyse (or stopped analysing) user behaviour, e.g. how many users there are and what they are doing.

Prior versions of some of these efforts 😜

  • DataHub v3: (March 2021) https://datahub.io/docs/dms/datahub/v3/ (plus original gdoc that is mostly processed but still has some useful stuff)
    • vercel for data (datahub) + next for data (portaljs)
    • emphasis on publishing from command line e.g. datahub publish
    • Tagline still seems very strong: "Make it stupidly easy, fast and reliable to share your data in a useable* way**."
  • Evolved / simplified to DataHub pages
    • Pages different how?
  • emphasis on being able to publish from command line. We
  • DataHub Summer 2022: https://docs.google.com/document/d/126WZidR3bk2wvYDoMi8Z5p2wPlWU-XlySa32ErSyEUg/edit#
    • Turn Github into a Datahub. The tool you're familiar with, plus the data features you’ve been missing.
    • Git + ☁️ = ❤️

Whiteboard

Product: DataHub.io aka DataHub Cloud

Subject: what is product direction for DataHub.io and DataHub suite in general?

Hypothesis: ??

  • not sure we believe in the enterprise route. What has been our success so far in the last 6 months? How many enterprise customers have we successfully landed (and expanded) in the last 5-10y? Ans: ~5. Yes, they come through but we seem to have very limited stickiness. Every one has moved on in some way (we think, though maybe some are still using the solution?)
    • we have had dozens of conversations. very few conversions. we have no enterprise sales team.
  • i think you could get to enterprise but likely through a passionate user route.

  • There are four options
    • "Hub" (or DataHub for GitHub). Power data user oriented data publishing / sharing.
      • MVP is heavily github oriented e.g. Connect github (+ storage) ⟹ presentable and queryable data ⟹ share your data
    • Catalog: connect data sources ⟹ catalog ⟹ find your data
    • Pages: connect docs and data ⟹ data driven website ⟹ share my insights
    • Portal: classic (open) data portal.
  • All of these have significant overlap but differ in feature emphasis and audience
  • It comes down to Hub vs Enterprise
    • We already have a solid portal product. The portal product overlaps.
    • Pages can be subsumed under Hub and probably isn't substantial enough on its own.
  • Hub is attractive because
    • It can be "consumer" oriented so purchase individually (and also purchased by enterprise)
    • It has network like effects (like github): people show the product to others in using it
    • It has close similarities to open data portals that we have built successfully for many years
    • We have a connection with the data community that we can leverage
    • We have an innovative approach (the github basis) which confers some immediate distinctive advantages e.g. versioning
      • It is also a risk …
  • Aren't going with Catalog because
    • Hard to win and grow customers
      • Have not been winning them.
        • Experience in the last 6 months has not generated many solid leads and no conversions so far. we have had dozens of conversations. very few conversions.
        • Relatively few (< 5) enterprise customers that we successfully landed (and expanded) in the last 5-10y. Plus the ones that come through have very limited stickiness. Every one has moved on in some way (we think, though maybe some are still using the solution?)
      • Have no enterprise sales team.
    • substantial work to have a competitive product (e.g. need quite a few ingestors, need a lineage system etc)
    • tough to make money without going more proprietary (which makes sales and removes a major differentiator)
    • limited overlap with portal product
    • we don't have a strong base in enterprise sales or the sales capacity (resources etc) to sell well in that area
    • Details
      • Our catalog product is in development and we have enough of a demo to pitch it.
      • Enterprise has tough sales cycle
      • We don't have a leading product
      • It's an increasingly competitive space where we don't have a particularly innovative take other than being open source and established (with ckan)
      • We have now had quite a few enterprise clients. Our experience has been:
        • on business side we get hired for professional services but at some point they in-house or cut the project for political reasons
        • on technical side
  • NB: we think that Hub could be purchased by enterprise one day (in same way github built up to enterprise)

reflections (from last week with Anu)

  • What could product be (high level)? there are two directions DataHub.io could go (and we could go in general with the DataHub product)
  • Which direction do we choose?
    • What are strengths / weaknesses of each option?
      • This is some of the demand we have been seeing. however, …

© 2024 All rights reserved. Built with Find, Share and Publish Quality Data with Datahub
