2022-04-01

Notes re Git, Git LFS and Data Lake

Present: Anu

Intention: capture ideas

Git, Git LFS and GitHub

Using Git + GitHub is extremely powerful for Data Experts (e.g., sophisticated data analysts, data scientists, data engineers, CTOs, etc.), and I think everyone appreciates it. However, when it comes to storing relatively large files (a few GBs, or even ~500 MB), it becomes tricky. Although GitHub and other Git servers may provide a Git LFS option, it is quite limited:

Every account using Git Large File Storage receives 1 GB of free storage and 1 GB a month of free bandwidth.

If you push a 500 MB file to Git LFS, you'll use 500 MB of your allotted storage and none of your bandwidth. If you make a 1 byte change and push the file again, you'll use another 500 MB of storage and no bandwidth, bringing your total usage for these two pushes to 1 GB of storage and zero bandwidth.

Full text: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage
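To make that accounting concrete, here is a tiny illustrative sketch (my own arithmetic, not GitHub's actual code) of how quickly revisions of a large file consume the free quota:

```python
# Minimal sketch of the Git LFS storage model quoted above: every push of a
# changed large file stores a brand new object, so storage usage grows by the
# full file size on each revision.

FREE_STORAGE_GB = 1.0  # free Git LFS storage per account (per the docs above)

def storage_used_gb(file_size_gb: float, revisions: int) -> float:
    """Each revision of an LFS-tracked file is stored as a separate object."""
    return file_size_gb * revisions

# A 500 MB file pushed twice (original + a 1-byte change) already
# exhausts the free 1 GB of storage.
used = storage_used_gb(0.5, revisions=2)
print(f"Used {used:.1f} GB of the {FREE_STORAGE_GB:.1f} GB free quota")
```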

Even with "GitHub Enterprise Cloud" plan your max file size is 5GB which is OK for most situations but still…with today's Cloud Storage possibilities it must not be an issue.

Giftless

Setting up your own Git LFS server and wiring it up with cloud-based blob storage such as AWS S3 or Google Cloud Storage sounds like a great solution - https://github.com/datopian/giftless. You can then make revisions/changes to your large files without worrying about the storage cost.
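
As a rough illustration (not an official Giftless recipe), here is how a repository might be pointed at a self-hosted LFS endpoint and told which files to treat as large. The server URL, file patterns and helper below are placeholders:

```python
# Sketch: point a Git repo at a self-hosted Git LFS server (e.g. Giftless)
# and track large data files. The actual wiring to S3/GCS happens in the
# Giftless server configuration, not in the repo.
import subprocess

def run(*cmd: str) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

LFS_SERVER = "https://lfs.example.com/my-org/my-datalake"  # hypothetical endpoint

# Store the LFS endpoint in .lfsconfig so collaborators pick it up on clone.
run("git", "config", "-f", ".lfsconfig", "lfs.url", LFS_SERVER)

# Tell Git LFS which files are "large": only lightweight pointer files go
# into Git history, the content itself goes to the LFS server's blob storage.
run("git", "lfs", "track", "*.parquet", "*.csv.gz")

run("git", "add", ".lfsconfig", ".gitattributes")
run("git", "commit", "-m", "Use self-hosted Git LFS server for large data files")
```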

Data Lake

Can this become our new pattern for creating Data Lakes? For instance:

  • Use Git for your data.
  • Use Cloud Storage as the blob storage backend, via Giftless.
  • Create your Data Lake with the powerful features of Git + GitHub/GitLab (a sketch of the underlying flow follows below).
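
Under the hood, this pattern rests on the Git LFS Batch API: git-lfs asks the LFS server where to put or fetch the actual bytes, and a server like Giftless typically answers with (pre-signed) URLs into the configured cloud bucket. A rough sketch of that request, with a placeholder endpoint and object:

```python
# Sketch of the Git LFS Batch API call that git-lfs makes behind the scenes
# and that a server such as Giftless answers. The endpoint, oid and size
# below are illustrative placeholders.
import json
import urllib.request

LFS_SERVER = "https://lfs.example.com/my-org/my-datalake"  # hypothetical endpoint

request_body = {
    "operation": "upload",          # or "download"
    "transfers": ["basic"],
    "objects": [
        {"oid": "a3f5deadbeef", "size": 524_288_000},  # placeholder sha256 + size
    ],
}

req = urllib.request.Request(
    f"{LFS_SERVER}/objects/batch",
    data=json.dumps(request_body).encode(),
    headers={
        "Accept": "application/vnd.git-lfs+json",
        "Content-Type": "application/vnd.git-lfs+json",
    },
    method="POST",
)

# The response tells the client where to PUT/GET the actual bytes, which is
# how Git history stays small while the data itself lives in S3/GCS.
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```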

(Diagram: Git + GitHub + Giftless data lake)

What if you already have S3 as a Data Lake?

Options:

  • Re-upload the data so that we start tracking it via Git.
  • Add metadata (Git LFS pointer files) for the existing data files, as sketched after this list.
  • TODO
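
For the second option, one possible approach is to generate Git LFS pointer files for data that already lives in the blob store, so Git tracks the metadata without re-uploading the content. This sketch assumes the LFS server's backing storage already holds each object under its Git LFS oid (sha256); paths and file names are purely illustrative:

```python
# Sketch: write Git LFS pointer files for existing data files so that Git
# can track them without pushing the content again.
import hashlib
from pathlib import Path

POINTER_TEMPLATE = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:{oid}\n"
    "size {size}\n"
)

def write_pointer(data_file: Path, repo_dir: Path) -> Path:
    """Compute the LFS oid/size for an existing file and write its pointer."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    size = data_file.stat().st_size
    repo_dir.mkdir(parents=True, exist_ok=True)
    pointer_path = repo_dir / data_file.name
    pointer_path.write_text(POINTER_TEMPLATE.format(oid=digest, size=size))
    return pointer_path

# Example with hypothetical paths: register an existing Parquet file.
print(write_pointer(Path("/data-lake/events-2022-03.parquet"), Path("./data")))
```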
