This article briefly describes almost a year of work by a team of engineers.

Introduction - Finding the right dataset in a massive data mess

In most companies there are several data sources: internal, external, open access, legal, and governmental.

Many problems require cross-referencing data among those sources, or at least being able to shorten a list of possibly related data points. Those sources can (and do) use different ways of referring to the same entity (synonyms) and of referring to related entities; all of this only makes it harder to search through such a setup.

The main reasons why it is difficult to find the right data and cross-reference it can be summarized as:

  • Heterogeneous data
  • Missing metadata
  • Heterogeneous data sources such as spreadsheets, RDBMSs, NoSQL stores, JSON, CSV, PDF and plain-text files, as well as sources that can be accessed only under NDAs, among others.

There is also the added complexity (at least in our case) that many data points are created, maintained and fixed by users. We'll dig further into this issue later.

We first give a brief description of the environment before going into the main issues of dealing with data search and updates.

User interfaces available

We used Kibana and internally developed a couple of UIs for specific purposes:

  • Kibana: For power users, developers and administrators
  • A Data Catalog which helps users find where data is available through a metadata search
  • An Entity Explorer that searches through already interconnected data from different databases (or DBs that at least contain some related information), with GIS referencing and search (including Google Maps integration and a search-while-scrolling feature)

The ad-hoc UIs contain the following functionalities:

  • Search and data filtering
  • Request forms for new data
  • Report issues (with a dataset or with the UI itself)
  • Manually edit data points (to fix metadata, add annotations)
  • Save user collections (these are used for tracking internal projects and related data points)
  • Show current "issues" from GitHub in the UI and via RSS/Atom
  • Show documentation about the code that runs the system

The first step in finding the right data is finding where that data is available. This is the function of the Data Catalog: the main search input accepts any valid Elasticsearch DSL query, and there are also a few ad-hoc filtering fields (built on domain-specific content).
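As an illustration, a catalog search could look roughly like the following minimal sketch (the host, the data_catalog index name and the field names are assumptions for illustration, not our actual schema):

```python
from elasticsearch import Elasticsearch

# elasticsearch-py 8.x style; host and index name are hypothetical.
es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="data_catalog",
    query={
        "bool": {
            "must": [
                # Free-text part: any valid Elasticsearch DSL query string.
                {"query_string": {"query": "population AND (state OR province)"}},
            ],
            "filter": [
                # One of the ad-hoc, domain-specific filter fields.
                {"term": {"source_type": "spreadsheet"}},
            ],
        }
    },
    size=20,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("table_name"))
```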

After finding where the data might be, it is possible to go further and look at the particular data instance(s) of interest, including cross-referencing those datasets.

Tech Stack

  • Python
  • Elasticsearch / Kibana
  • BigQuery + SQL
  • Cloud Run
  • Prefect and dbt

Building the Data Catalog - Extracting metadata from hundreds of databases

Extracting metadata from hundreds of DBs is one of the most compute- and memory-intensive tasks.

The goal of this step is, for each new table and database, to extract the different types of values that exist for the significant columns, where what counts as significant differs from DB to DB depending on the domain and the search goal.

Steps:

  1. Manually check column names to understand the patterns and what kind of content they hold
  2. From those column names, select a few names and string patterns to be blacklisted (anything that is a time series, geo-point, geo-shape, date or ID, for example). This first blacklist cuts down the processing time.
  3. Then, after this pre-selection of the fields, do a SELECT count() ... over the selected columns for each table.
  4. Select a threshold for the maximum number of distinct values per column
  5. Run the queries
  6. Store the query results in a new DB or file

These results now need to be post-processed to remove any undesired data.
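A rough sketch of steps 3 to 6 using the BigQuery Python client; the project, dataset, blacklist patterns and threshold below are made-up placeholders, not our production values:

```python
import json
from google.cloud import bigquery

client = bigquery.Client()

TABLE_ID = "my-project.my_dataset.my_table"                 # hypothetical table
BLACKLIST_SUBSTRINGS = ("_id", "date", "timestamp", "geo")  # step 2: names/patterns to skip
MAX_DISTINCT = 200                                          # step 4: distinct-value threshold

table = client.get_table(TABLE_ID)
candidate_cols = [
    f.name for f in table.schema
    if f.field_type == "STRING"
    and not any(s in f.name.lower() for s in BLACKLIST_SUBSTRINGS)
]

metadata = {}
for col in candidate_cols:
    # Step 3: count the distinct values of each pre-selected column.
    count_sql = f"SELECT COUNT(DISTINCT `{col}`) AS n FROM `{TABLE_ID}`"
    n = next(iter(client.query(count_sql).result())).n
    # Steps 4/5: only keep columns below the threshold and fetch their values.
    if n <= MAX_DISTINCT:
        values_sql = f"SELECT DISTINCT `{col}` AS v FROM `{TABLE_ID}` WHERE `{col}` IS NOT NULL"
        metadata[col] = [row.v for row in client.query(values_sql).result()]

# Step 6: store the results (here as a JSON file; a new table works as well).
with open("table_metadata.json", "w") as f:
    json.dump({"table": TABLE_ID, "columns": metadata}, f)
```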

Updating Metadata

Afterwards, to update the metadata, the same process can be run on the data that has been updated since the last run, followed by a post-processing step to join the new results with the previous ones.

This updating and post-processing can partly be done in SQL, but for our purposes we needed full-text search, so the post-processing includes converting the metadata to JSON and then injecting it into an Elasticsearch index.
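A minimal sketch of that last step, bulk-loading the post-processed JSON documents into an Elasticsearch index (the file, index name and document structure are illustrative assumptions):

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Post-processed metadata documents, one per table (hypothetical file from the previous step).
with open("table_metadata.json") as f:
    catalog_docs = [json.load(f)]

actions = (
    {
        "_index": "data_catalog",
        # Re-using the table id as the document id makes re-runs idempotent:
        # an updated table simply overwrites its previous metadata document.
        "_id": doc["table"],
        "_source": doc,
    }
    for doc in catalog_docs
)

helpers.bulk(es, actions)
```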

Data Paths

Automatically extracting metadata from the DB

BigQuery -> JSON -> Elasticsearch

This process runs linearly, as there is no performance or time issue going through these steps in series, and it has the advantage of not having to deal with concurrency issues [* 1](even if there is a problem, there is no mission-critical real-time constraint; at most we'll have a one-day delay on the user-facing interfaces).
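One way to express this linear path is as a small Prefect flow (Prefect is in our stack; the task bodies here are placeholders standing in for the snippets above):

```python
from prefect import flow, task


@task
def extract_from_bigquery() -> list[dict]:
    # Placeholder: run the metadata queries against BigQuery.
    return [{"table": "my-project.my_dataset.my_table", "columns": {}}]


@task
def to_json(docs: list[dict]) -> str:
    # Placeholder: post-process and serialize the extracted metadata.
    import json
    return json.dumps(docs)


@task
def index_into_elasticsearch(payload: str) -> None:
    # Placeholder: bulk-index the JSON documents into Elasticsearch.
    print(f"indexing {len(payload)} bytes")


@flow
def bigquery_to_json_to_elasticsearch():
    # The three stages run strictly in series, so there are no concurrency issues.
    docs = extract_from_bigquery()
    payload = to_json(docs)
    index_into_elasticsearch(payload)


if __name__ == "__main__":
    bigquery_to_json_to_elasticsearch()
```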


Daily tasks (in order of occurrence):

  1. Extract data from the changes in the datasets and tables stored in BigQuery and dump it to GCS (Google Cloud Storage) buckets
  2. Extract the manually introduced data (in a spreadsheet and through the UI forms) and dump it into a JSON file in GCS.
  3. Take those and many other reference datasets [* 2](static datasets built for reference, mapping different namings, geographies and aliases) and combine all of these into a single JSON file (sketched below)
  4. This JSON file represents a Metadata Catalog that can be searched later.
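A sketch of step 3, merging the automated dumps, the manual entries and the reference datasets from GCS into one catalog file; the bucket name, the prefixes and the assumption that each blob holds a JSON list are made up for illustration:

```python
import json
from google.cloud import storage

client = storage.Client()
BUCKET = "my-metadata-bucket"  # hypothetical bucket


def load_json_blobs(prefix: str) -> list[dict]:
    """Load every JSON blob under a prefix, assuming each blob holds a list of documents."""
    docs = []
    for blob in client.list_blobs(BUCKET, prefix=prefix):
        docs.extend(json.loads(blob.download_as_text()))
    return docs


catalog = {
    "automated": load_json_blobs("dumps/bigquery_changes/"),
    "manual": load_json_blobs("dumps/manual_entries/"),
    "reference": load_json_blobs("reference_datasets/"),
}

# Step 4: the combined file is the searchable Metadata Catalog.
client.bucket(BUCKET).blob("catalog/metadata_catalog.json").upload_from_string(
    json.dumps(catalog), content_type="application/json"
)
```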

Mixing manual data with the automated pathway

One of the main issues is having different sources for the same information, especially when there are collisions between heterogeneous sources.

Depending on the business use case, the policy used to handle this will change, but a general approach of keeping both sources available is better in the sense that we can later change the display policy if needed without the data itself needing any change. This is what we do here.

For this particular case we display the manual edit over the automated one (even if it is older).

In this case, the difference between manually added and automatically extracted metadata is the following: automatically extracted means the organization has ingested the data into BigQuery, while manually added means the organization only knows about its existence but does not have the data itself, just how/where to get it (mocking).

One big issue that we observed is what happens when manually added metadata finally gets ingested as a real data source: everything needs to be homogenized and updated. As this has happened only a couple of times, spending effort to automate this task was not worth it, and it would in any case need human intervention to check the metadata coherence.

Conflicting updates

Path A - Search path: the user-updated data takes priority over the automated data (this is the implemented policy, even though we could implement others, such as timestamp-based).

Path B - User write path: the new data is kept in a separate index. This avoids any data overwrite conflict and leaves the display policy to a later stage, which is configurable and can change without creating data conflicts.
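A sketch of the read-time resolution for Path A, assuming two indices named catalog_manual and catalog_automated (the names are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def get_display_doc(doc_id: str) -> dict | None:
    """Return the document to display: manual edits win over automated metadata."""
    # The display policy is just this priority order; because user writes land in
    # their own index (Path B), changing the policy later means reordering the tuple.
    for index in ("catalog_manual", "catalog_automated"):
        if es.exists(index=index, id=doc_id):
            return es.get(index=index, id=doc_id)["_source"]
    return None
```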

Building the MASTER Dataset - Cross Referencing Datasets

As much as we would all love to have an automated way of building everything, there is no other way than applying some level of domain expertise.

So, to build the first dataset that cross-references two different tables, we had to do it by hand.

The process that we used was (in short):

  1. For a particular subset of interesting data that was already in DB form: we exported it to BigQuery (without normalizing the data yet)
  2. Then we analyzed the data columns (types and what kind of information is there)
  3. For the columns that were similar and had specific data domains (like countries, states, or specific names where the cardinality was low), we did the necessary joins in BigQuery (see the sketch after this list)
  4. Then we injected each resulting table into Elasticsearch
  5. With Kibana we searched for specific keywords and started taking notes of similarities and differences, especially focusing on aliases for the same entity (like a state or country and its short forms, the ISO forms, the long forms, all normalized to lowercase)
  6. With this we could start refining the references.
  7. With a couple of tables cross-referenced we iterated through this process with more and more tables

While there are things that we cannot cross-reference, we now have a few tables that cross-reference many others, and with a full-text search over those we can produce a short list of results that might be related. These later get selected and saved by the end user in the respective document or list, to be used in another project or referenced later when needed.
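For step 3 above, such a join might look roughly like the following; the table names, column names and alias-mapping table are invented for illustration:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical setup: two sources plus an alias mapping table
# (e.g. "USA", "US", "United States" -> "united states").
sql = """
CREATE OR REPLACE TABLE `my-project.master.cross_referenced` AS
WITH normalized_a AS (
  SELECT a.*, COALESCE(m.canonical, LOWER(a.country)) AS country_key
  FROM `my-project.raw.source_a` AS a
  LEFT JOIN `my-project.reference.country_aliases` AS m
    ON LOWER(a.country) = m.alias
),
normalized_b AS (
  SELECT b.*, COALESCE(m.canonical, LOWER(b.country_name)) AS country_key
  FROM `my-project.raw.source_b` AS b
  LEFT JOIN `my-project.reference.country_aliases` AS m
    ON LOWER(b.country_name) = m.alias
)
SELECT country_key, a.population, b.gdp
FROM normalized_a AS a
JOIN normalized_b AS b USING (country_key)
"""

client.query(sql).result()  # the resulting table can then be injected into Elasticsearch
```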

This made the process easier, and now when a new table needs to be added we have two big elements:

  • A base MASTER database containing a consistent volume of data that can be used to match a new element
  • A set of mapping files which contain aliases for the same entity

This is in itself a Knowledge Graph for the particular domain in which our client works.
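As a toy illustration of the second element, a mapping file can be as simple as alias -> canonical entries, used to normalize the values of a new table before matching them against the MASTER data (the entries and entity names below are invented):

```python
# Hypothetical alias mapping: every known alias points to one canonical form.
ALIASES = {
    "usa": "united states",
    "us": "united states",
    "u.s.a.": "united states",
    "deutschland": "germany",
    "de": "germany",
}


def canonicalize(value: str) -> str:
    """Normalize a raw value to its canonical entity name (lowercased fallback)."""
    key = value.strip().lower()
    return ALIASES.get(key, key)


# Matching a new table's column against the MASTER entities:
master_entities = {"united states", "germany", "france"}
new_column = ["USA", "Deutschland", "Atlantis"]

matched = [v for v in new_column if canonicalize(v) in master_entities]
unmatched = [v for v in new_column if canonicalize(v) not in master_entities]
print(matched)    # ['USA', 'Deutschland']
print(unmatched)  # ['Atlantis']
```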

Improvements

  • Regarding data entity normalisation and annotation, we would like to expand automation and unlock more value. An initial investigation within the domain highlighted problems of data sparsity in common open-source triple stores as related to the client's domain. We would also expand on RDBMS techniques for generating candidate entities (e.g. low cardinality). Approaches such as unsupervised, semi-supervised and crowd-sourced labelling, entity resolution, normalisation and inference are good starting points for further exploration.

  • We would like to develop further, vis-à-vis traditional ACID/BASE notions, with respect to master data. With federated and heterogeneous data in lakes that collide, perhaps "domain authority" (DA), i.e. expertise + voting + recency, may be a useful approach.

By way of a toy example:

  1. FDR/1960/"Humans are not on the moon" - "DA=10", "assert=-1"
  2. New York Times/1969/"Humans are on the moon" - "DA=6", "assert=+1"

Algorithm

  • Roughly some functional form like x = Σ(p*assert + q*DA - r*lag)
  • Humans are on the moon where x > 0
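A back-of-the-envelope version of that scoring for the toy example, with weights p, q, r and the reference year chosen arbitrarily for illustration:

```python
from dataclasses import dataclass


@dataclass
class Claim:
    source: str
    year: int
    assertion: int         # +1 supports the statement, -1 contradicts it
    domain_authority: int  # DA


# The toy example above.
claims = [
    Claim("FDR", 1960, assertion=-1, domain_authority=10),
    Claim("New York Times", 1969, assertion=+1, domain_authority=6),
]

# Arbitrary illustrative weights for x = Σ(p*assert + q*DA - r*lag).
P, Q, R = 5.0, 1.0, 0.1
NOW = 2025  # lag = years since the claim was made

x = sum(
    P * c.assertion + Q * c.domain_authority - R * (NOW - c.year)
    for c in claims
)

print(x)  # the sign of x decides: "humans are on the moon" where x > 0
```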

