Data Lineage Research including on suppliers
- Informatica
- Collibra
- Ad-hoc
My thoughts
- The complexity, diversity and rapidity (ad-hoc-ness) of most data wrangling mean that formal metadata-based data lineage is probably not so useful …
- Better to try to formalize what people "actually do", i.e. write code.
- This is exemplified by our approach for core datasets:
  - Version code and data
  - Link these together
- Going a bit further would be to pattern and structure that code, e.g. model a data transformation as a DAG of processors.
- We could then extract that DAG information and use it to provide insight into data lineage.
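A minimal sketch of the "DAG of processors" idea, assuming each processor declares the datasets it reads and the single dataset it writes. The `Processor` class, the helper functions and the file names are all hypothetical, made up for illustration; this is not an existing tool or our current code. Once the structure is declared like this, lineage is just a walk over the graph:

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Processor:
    """One hypothetical transformation step: reads inputs, writes one output."""
    name: str
    inputs: tuple[str, ...]
    output: str


def upstream_graph(processors):
    """Map each dataset to the set of datasets it is directly derived from."""
    graph = {}
    for p in processors:
        graph.setdefault(p.output, set()).update(p.inputs)
        for src in p.inputs:
            graph.setdefault(src, set())  # make sure pure sources appear too
    return graph


def lineage(graph, dataset):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical pipeline, for illustration only.
PIPELINE = [
    Processor("clean", ("raw.csv",), "clean.csv"),
    Processor("enrich", ("clean.csv", "countries.csv"), "enriched.csv"),
    Processor("aggregate", ("enriched.csv",), "summary.csv"),
]

GRAPH = upstream_graph(PIPELINE)
# One valid build order of the datasets, sources first.
ORDER = list(TopologicalSorter(GRAPH).static_order())

print(sorted(lineage(GRAPH, "summary.csv")))
# ['clean.csv', 'countries.csv', 'enriched.csv', 'raw.csv']
```

The point of the sketch is that no separate metadata store is needed: the lineage information falls out of the code structure people already write, provided the inputs and outputs are declared rather than buried inside the processing logic.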
Overall, an incremental, realistic approach is better than an aspirational "metadata everywhere" approach.
Data enablement / empowerment over data governance
Collibra
https://www.collibra.com/data-lineage
No real information about what their system does.
Collibra Launches Data Lineage, an Automated Data Lifecycle Mapping Capability - Jan 2020 https://www.collibra.com/pressroom/collibra-launches-data-lineage-automated-data-lifecycle-mapping
By automatically mapping relationships between data points, Collibra Lineage shows how data sets are built, aggregated, sourced and used and provides complete, end-to-end lineage visualization. Collibra Lineage enables large enterprises to understand the full context of their data and ensure that the most trustworthy data available is used to inform business decisions.
Again, no real info about how this works. Digging through SQLDep, it looks like it visualizes SQL statements to reconstruct data flow, e.g. see the demo at https://app.sqldep.com/queryflow/demo/
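To make "reconstruct data flow from SQL statements" concrete, here is a rough sketch of the general idea: pull the target table and the source tables out of each statement and treat them as edges. This is an assumption about the kind of technique involved, not a description of how SQLDep or Collibra actually work. The regexes, the `table_edges` helper and the table names are hypothetical and deliberately naive; real SQL (subqueries, CTEs, quoted identifiers) needs a proper parser.

```python
import re

# Naive patterns for illustration only; they ignore CTEs, subqueries, quoting, etc.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def table_edges(sql):
    """Return (source_table, target_table) pairs for one SQL statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []  # statement writes nothing, so it contributes no lineage
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return [(src, target.group(1)) for src in sources]


# Hypothetical statements, made up for this example.
STATEMENTS = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
    "CREATE TABLE mart.revenue AS "
    "SELECT c.region, SUM(o.amount) FROM staging.orders o "
    "JOIN ref.customers c ON o.customer_id = c.id GROUP BY c.region",
]

EDGES = [edge for stmt in STATEMENTS for edge in table_edges(stmt)]
for src, dst in EDGES:
    print(f"{src} -> {dst}")
# raw.orders -> staging.orders
# ref.customers -> mart.revenue
# staging.orders -> mart.revenue
```

Note this only works where the transformations are expressed in SQL in the first place, which is exactly the limitation for ad-hoc wrangling done in scripts.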
Informatica
Looks like this doesn't do extraction. You have to move entirely into their ecosystem of products (the metadata manager?) to get the benefits …
This is the first hit for "informatica data lineage tool"
PR piece in Dec 2019 in ZDNet
Informatica aims to better track data lineage with AI-powered data catalog
Among Informatica's new features, its AI-powered data catalog, called Catalog of Catalogs is notable because it is trying to track data lineage across ecosystems.
What Connects Good Food and Good Data? The Importance of Provenance and Lineage - Oct 2019
Mainly a pitch for their enterprise data catalog.
Overcoming the Challenges of Deriving End-to-End Data Lineage
In a complex modern data environment, understanding end-to-end data lineage is not a trivial task. It requires metadata connectivity across the entire data landscape—across cloud and on-premises databases, ETL tools, BI tools, and enterprise applications. It requires the ability to automatically stitch together lineage from all of these sources including the ability to extract and infer lineage from the metadata. Lineage will often have to be automatically derived from different types of code—ETL jobs, SQL scripts and stored procedures—to understand how data gets transformed in each step. Lineage views have to be presented at different levels ranging from business and logical views to detailed field-level views with the ability to drill down into transformation logic. When there is no direct ability to extract lineage, it requires the ability to indirectly infer lineage through AI/ML-powered intelligent capabilities like data similarity and data relationship discovery. Informatica’s AI-powered Enterprise Data Catalog delivers automated, granular end-to-end lineage across cloud and on-premises with these capabilities.
Learn how Enterprise Data Catalog can help you discover and inventory data across your organization.
To hear from an enterprise customer about how data lineage helps them address different business use cases, view this video from our customer Maersk.