Data Projects Database
This is a list of data projects that might be of interest to the Datopian community.
Related Projects
Each project is listed under the category of the Datahub Data Management System (DMS) that it is closest to, but it might have interesting ideas for other categories as well.
Data Factory
- Kamu. A command-line tool for managing, transforming, and collaborating on structured data.
- Bacalhau. A platform for fast, cost-efficient, and secure computation that runs jobs where the data is generated and stored.
- At the moment, it is a free volunteer network.
- You can run arbitrary computations on the data (e.g. a simple image processing pipeline).
Package Management
- Open Data Fabric. An open protocol specification for the decentralized exchange and transformation of semi-structured data. It aims to holistically address many shortcomings of modern data management systems and workflows.
- The protocol takes care of some interesting aspects of data, such as reproducibility, a complete historical account (all history is preserved), verifiability (data is immutable), and provenance (data is linked to its source).
- It also has some strong opinions on the nature of data and transformations. The entire specification is worth reading.
- Datasets and transformations are defined in YAML files.
- Qri. A project that helped with dataset syncing, versioning, storage, and collaboration. Sadly, it came to an end early in 2022.
- Datalad. Distributed data management system that keeps track of your data, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure.
- Uses Git Annex (a distributed binary object tracking layer on top of Git) to provide a decentralized dataset management system.
- Can be extended to IPFS.
- Quilt.
- Works on top of S3.
- Oxen.
- LakeFS. More like Git for Data.
- DVC.
- XVC.
- Xetdata.
- Dud.
- Deep Lake.
- Dim.
- Juan Benet's data.
- Colah's data.
- Dolt.
- They also do data bounties!
- Ocean Protocol Market.
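The properties listed for Open Data Fabric above (a complete historical account, verifiability through immutability, and provenance) can be illustrated with a small hash-linked, append-only log. This is only a minimal sketch of the general idea, not the Open Data Fabric format or any project's API: the block layout, the `append_block` and `verify` helpers, and the example events and URLs are all hypothetical.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's canonical JSON form (sorted keys) with SHA-256."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, event: dict) -> list:
    """Return a new chain with one more block; earlier blocks are never modified."""
    prev = block_hash(chain[-1]) if chain else None
    return chain + [{"prev": prev, "event": event}]


def verify(chain: list) -> bool:
    """Check that every block points at the hash of the block before it."""
    if chain and chain[0]["prev"] is not None:
        return False
    return all(
        chain[i]["prev"] == block_hash(chain[i - 1]) for i in range(1, len(chain))
    )


# Hypothetical history: where the data came from, how it was transformed,
# and what was added. Every step stays in the log (complete history).
chain = []
chain = append_block(chain, {"kind": "source", "url": "https://example.org/raw.csv"})
chain = append_block(chain, {"kind": "transform", "query": "SELECT * FROM raw"})
chain = append_block(chain, {"kind": "data", "rows_added": 120})

# Rewriting an early block breaks the link stored in the next block.
tampered = list(chain)
tampered[0] = {"prev": None, "event": {"kind": "source", "url": "https://example.org/other.csv"}}

print(verify(chain))     # True
print(verify(tampered))  # False
```

Because each block stores the hash of its predecessor, changing any past entry invalidates every later link, which is one common way to make a dataset's history tamper-evident.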
Frontend
- Evidence.dev.
- Malloy Notebooks.
- Install the recommended extension, malloydata.malloy-vscode, and open the notebook. Everything runs in the browser.
Visualizations and Dashboards
Data APIs
- Splitgraph Data Delivery Network.
- Seafowl.
- Datasette Lite.
- ROAPI.
- Dozer.
- Huggingface Datasets.
- Integrates with the Arrow ecosystem.
- Automatically exposes datasets as Parquet files.