Links resolution system

Summary

  • Situation: We aim to support several kinds of links: CommonMark links, Obsidian Wiki-Links, media embeds, and links to data files in data visualisation components.
  • Problem: Many factors affect how relative and absolute links on DataHub Cloud site pages should be interpreted, and we lack a solid system for adjusting links accordingly, so we keep finding bugs.
  • Solution: Replace the current hacks and patches scattered across the codebase with a link resolution system (pipeline), backed by an extensive suite of unit tests.
  • Appetite: 1-2d

Situation

We currently support the following link formats in markdown:

  • CommonMark links
    • regular links, like [](some-md-page)
    • media embeds, like ![](some-image.jpg)
  • Obsidian Wiki-Links
    • regular wiki-links, like [[some-md-page]]
    • media embeds, like ![[some-image.jpg]]

Note that both regular CommonMark links and regular Obsidian Wiki-Links may include the .md file extension, but don't have to.

On top of that, we also support in-markdown data visualisation components, with links to data files passed to them, like: <LineChart data={./abc.csv} />

Note that each of the above links can be of one of the following link types:

  • external, e.g. https://r2-datahub.io/xyz/raw/abc.csv
  • absolute, e.g. /abc.csv
  • relative, e.g. abc.csv or ./abc.csv or ../abc.csv (or further up, like ../../abc.csv)
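
The three link types above can be told apart from the link text alone. A minimal sketch, assuming a link counts as external when it starts with a URI scheme or with "//"; the names LinkType and classifyLink are hypothetical and not part of the current codebase:

```typescript
// Hypothetical helper: LinkType and classifyLink are illustrative names only.
type LinkType = "external" | "absolute" | "relative";

function classifyLink(href: string): LinkType {
  // A URI scheme (https:, mailto:, ...) or a protocol-relative "//host"
  // prefix marks the link as external.
  if (/^[a-zA-Z][a-zA-Z\d+.-]*:/.test(href) || href.startsWith("//")) {
    return "external";
  }
  // A single leading slash means "from the site root".
  if (href.startsWith("/")) {
    return "absolute";
  }
  // Everything else (abc.csv, ./abc.csv, ../abc.csv) is relative
  // to the file the link is written in.
  return "relative";
}
```

Keeping this check in one small function means every later step can branch on the link type instead of re-testing prefixes.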

Also note that Obsidian Wiki-Links have an extra link type, "shortest path possible": the shortest path required to uniquely identify the linked file (usually just the name of that file).
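
To illustrate what "shortest path possible" implies for resolution, here is a rough sketch that matches a wiki-link target against the list of all file paths in the project. It is only an illustration of the idea, not the API or the actual behaviour of remark-wiki-link; resolveShortestPath and its inputs are hypothetical, and returning undefined for ambiguous targets is an assumption:

```typescript
// Hypothetical sketch: not the remark-wiki-link API.
// `allFiles` is assumed to hold root-based paths such as "/blog/may/post.md".
function resolveShortestPath(
  target: string,
  allFiles: string[]
): string | undefined {
  const wanted = target.replace(/^\/+/, "");
  const matches = allFiles.filter((file) => {
    const withExt = file.replace(/^\/+/, "");
    // Wiki-links may omit the .md extension, so compare both variants.
    const withoutExt = withExt.replace(/\.md$/, "");
    return [withExt, withoutExt].some(
      (candidate) => candidate === wanted || candidate.endsWith("/" + wanted)
    );
  });
  // Only a unique match identifies the file; ambiguous or missing
  // targets are left for the caller to handle.
  return matches.length === 1 ? matches[0] : undefined;
}
```

With files "/assets/logo.png" and "/docs/assets/logo.png", the target "logo.png" is ambiguous under this sketch, while "docs/assets/logo.png" identifies exactly one file.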

We want all combinations of the above to resolve to correct locations.

Problem

We keep having issues with links in DataHub Cloud's user sites.

The problem stems from the lack of a robust, tested link resolution system. Instead, we have many temporary patches scattered across the codebase, each applied in response to a very specific bug reported by users.

The main causes of the reported bugs are:

  • user sites can be published either at the default /@{username}/{projectname} path OR at a custom domain, and the two require different handling of absolute links
  • index/README files are special cases: they are interpreted as "home" pages in DataHub Cloud, so the index/README file name is trimmed from the end of their URLs
  • media files and data files are hosted in an R2 bucket

Solution

System logic

As described in the "Situation" section, the link resolution system needs to handle different combinations of link formats and link types (including some caveats related to Obsidian Wiki-Links).

On top of that, we need to take into account whether the origin markdown file (where the link is used) is a README.md or an index.md file. These files are interpreted as "home"/"index" pages for their parent directories, so the README/index at the end of the URL is removed. Without a compensating adjustment, a link in such a file would point to the wrong location in the published site (e.g. a file at /blog/may/README.md is published at /blog/may, so any unadjusted relative link in that file would start from the wrong origin).
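
One way to make the origin file irrelevant is to resolve every relative link against the origin file's path in the repo (not its published URL) and emit a root-based path. A minimal sketch under that assumption; resolveRelative is a hypothetical name, and clamping at the root is a choice made for the sketch, not a requirement of this document:

```typescript
// Hypothetical helper: resolves a relative link against the directory
// of the file it is written in. `originFilePath` is a root-based file
// path such as "/blog/may/README.md".
function resolveRelative(originFilePath: string, href: string): string {
  // Start from the origin file's directory (drop the file name).
  const segments = originFilePath.split("/").slice(0, -1);
  for (const part of href.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      // Never climb above the project root (an assumption of this sketch).
      if (segments.length > 1) segments.pop();
    } else {
      segments.push(part);
    }
  }
  return segments.join("/") || "/";
}
```

Because the result no longer depends on the published URL of the origin page, /blog/may/README.md and /blog/may/post.md both turn abc.csv into /blog/may/abc.csv.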

By the same token, any link pointing to a README(.md) or index(.md) file should have the file name removed, so that it doesn't point to a non-existent page in the user site (e.g. a link to /blog/README should be converted to /blog).
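
A minimal sketch of that destination-side adjustment, assuming root-based paths and a case-sensitive match on exactly README and index; stripIndexFile is a hypothetical name:

```typescript
// Matches a trailing "README", "README.md", "index" or "index.md"
// path segment (case-sensitive, an assumption of this sketch).
const INDEX_FILE = /(^|\/)(README|index)(\.md)?$/;

// Hypothetical helper: expects a root-based path such as "/blog/README".
function stripIndexFile(path: string): string {
  const stripped = path.replace(INDEX_FILE, "");
  // The root README/index maps to the site's home page.
  return stripped === "" ? "/" : stripped;
}
```

Names that merely contain one of these words, such as /blog/my-index, are left untouched because the pattern requires a full path segment.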

Another thing we need to consider is that user sites can be published either at the default /@{username}/{projectname} URL path OR at a custom domain. This matters for absolute links: without proper adjustment, on the default URL path they would start from the root datahub.io URL instead of datahub.io/@{username}/{projectname}/....
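
The adjustment for absolute links can be sketched as a prefix that is empty on custom domains. The PublishContext shape and the function names are assumptions made for this example, not existing types:

```typescript
// Hypothetical description of where a site is published.
interface PublishContext {
  customDomain: boolean;
  username: string;
  projectName: string;
}

// On a custom domain the site lives at the root, so no prefix is needed;
// otherwise absolute links must start from /@{username}/{projectname}.
function sitePrefix(ctx: PublishContext): string {
  return ctx.customDomain ? "" : `/@${ctx.username}/${ctx.projectName}`;
}

// `path` is expected to be a root-based path such as "/abc.csv".
function prefixAbsolute(path: string, ctx: PublishContext): string {
  return sitePrefix(ctx) + path;
}
```

For a user alice with a project demo, /abc.csv becomes /@alice/demo/abc.csv on the default path and stays /abc.csv on a custom domain.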

To sum this up, the system needs to handle combinations of the following link facets:

  • link format: 1) CommonMark, 2) Obsidian Wiki-Link, or 3) chart source data link
  • link type: 1) external 2) absolute 3) relative 4*) shortest-path-possible (Obsidian Wiki-Links only)
  • with or without extension (regular Obsidian Wiki-Links only)
  • origin file: 1) is a README/index file or 2) not
  • destination file: 1) is a README/index file or 2) not
  • publish type: 1) default single-site url 2) custom domain
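
To get a feel for the size of this matrix, and as a possible starting point for a table-driven test suite, the facets can be enumerated programmatically. This is only a sketch: the facet labels are made up for the example, the extension facet is left out for brevity, and some generated combinations (e.g. an external link to an index file) may not need a dedicated test:

```typescript
// Illustrative facet values; the labels are assumptions, not existing types.
const formats = ["commonmark", "wikilink", "chart-data"] as const;
const types = ["external", "absolute", "relative", "shortest-path"] as const;
const publishTypes = ["default-path", "custom-domain"] as const;
const flags = [true, false] as const;

interface LinkCase {
  format: typeof formats[number];
  type: typeof types[number];
  originIsIndex: boolean;
  destinationIsIndex: boolean;
  publish: typeof publishTypes[number];
}

function enumerateCases(): LinkCase[] {
  const cases: LinkCase[] = [];
  for (const format of formats) {
    for (const type of types) {
      // Shortest-path links exist for Obsidian Wiki-Links only.
      if (type === "shortest-path" && format !== "wikilink") continue;
      for (const originIsIndex of flags) {
        for (const destinationIsIndex of flags) {
          for (const publish of publishTypes) {
            cases.push({ format, type, originIsIndex, destinationIsIndex, publish });
          }
        }
      }
    }
  }
  return cases;
}

console.log(enumerateCases().length); // 80 = (3 * 4 - 2) * 2 * 2 * 2
```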

An important thing to note is that part of the link resolution (for Obsidian Wiki-Links) is done by the remark plugin remark-wiki-link. Thus the system will consist of the following two parts:

  1. Initial markdown link resolution: done by existing remark plugins, specifically our remark-wiki-link. This part of the system is mostly done, although some adjustments to the remark-wiki-link plugin may be required.
  2. Supplementary link resolution: additional adjustments to src/href values (and to data file links passed to charts). This is the main focus of this shaping document.

Here is the sketch of the logical flow of the "Supplementary" part:

https://app.excalidraw.com/s/9u8crB2ZmUo/8w4JbY07RFk

../assets/Pasted image 20240507153547.png

To sum up, in order to correctly resolve any link we need the following 4 pieces of information:

  1. the path of the link's origin file, i.e. the file that includes this link (the file path, not the URL, so that we know whether the link is on a special index/README page)
  2. the link itself
  3. whether the site is published at the default URL path or at a custom domain
  4. context, i.e. how the link is being used, e.g. image embed, link to a data file passed to a chart component, link to another page

Knowing all of these is sufficient for the sketched algorithm to produce an adjusted, correct link.
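
As a rough illustration of how these four inputs could feed a single function, here is one possible ordering of the steps. It is not a transcription of the sketch linked above; all names (LinkContext, ResolveInput, resolveLink, assetBaseUrl) are hypothetical, the way media and data files are addressed in the R2 bucket is an assumption, and query strings and #fragments are ignored:

```typescript
// How the link is used: link to a page, media embed, or chart data file.
type LinkContext = "page" | "media" | "data";

interface ResolveInput {
  originFilePath: string; // file path, not URL, e.g. "/blog/may/README.md"
  href: string;           // the link as written (after wiki-link resolution)
  sitePrefix: string;     // "" on a custom domain, "/@{username}/{projectname}" otherwise
  context: LinkContext;
  assetBaseUrl: string;   // ASSUMED base URL under which raw project files are served
}

function resolveLink(input: ResolveInput): string {
  const { originFilePath, href, sitePrefix, context, assetBaseUrl } = input;

  // 1. External links pass through untouched.
  if (/^[a-zA-Z][a-zA-Z\d+.-]*:/.test(href) || href.startsWith("//")) {
    return href;
  }

  // 2. Turn absolute and relative links into a root-based path.
  //    Relative links start from the origin file's directory.
  const segments = href.startsWith("/")
    ? [""]
    : originFilePath.split("/").slice(0, -1);
  for (const part of href.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") {
      if (segments.length > 1) segments.pop();
    } else {
      segments.push(part);
    }
  }
  const path = segments.join("/") || "/";

  // 3. Media and data files are served from file storage, not from the site.
  if (context !== "page") {
    return assetBaseUrl + path;
  }

  // 4. Pages: drop the .md extension and a trailing README/index,
  //    then add the site prefix for the default publish path.
  const page = path.replace(/\.md$/, "").replace(/(^|\/)(README|index)$/, "");
  const url = sitePrefix + page;
  return url === "" ? "/" : url;
}
```

For example, with originFilePath "/blog/may/README.md", href "../README.md", context "page" and sitePrefix "/@alice/demo", this sketch returns "/@alice/demo/blog". Each numbered step could be extracted into its own function so that it can be unit tested in isolation.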

Unit testing (TDD)

Implementation of the system should be accompanied by a detailed suite of unit tests, covering each part of the system that can be reasoned about separately. Thus the system should be split into as many standalone logical blocks as possible.

Appetite

1-2d

Rabbit holes

  • "safe-guarding" backward relative links so that they can't go back further than datahub.io/@{username}/{projectname}: nice to have, but not necessary; if the link is correct and points to a file within the published repo dir, it should resolve correctly under the proposed system
  • should we create a rehype plugin for this (not as a standalone package, just within DataHub Cloud)? For now let's not focus on that; let's first make it work well

No-gos

© 2024 All rights reserved