2022-04-01
Notes re Git, Git LFS and Data Lake
Present: Anu
Intention: capture ideas
Git, Git LFS and GitHub
Using Git + GitHub is extremely powerful for data experts (e.g., data analysts, data scientists, data engineers, CTOs, etc.) - I think everyone appreciates it. However, when it comes to storing relatively large files (a few GBs, or even ~500 MB), it becomes tricky. Although GitHub and other Git servers may offer a Git LFS option, it is quite limited:
Every account using Git Large File Storage receives 1 GB of free storage and 1 GB a month of free bandwidth.
If you push a 500 MB file to Git LFS, you'll use 500 MB of your allotted storage and none of your bandwidth. If you make a 1 byte change and push the file again, you'll use another 500 MB of storage and no bandwidth, bringing your total usage for these two pushes to 1 GB of storage and zero bandwidth.
Even with "GitHub Enterprise Cloud" plan your max file size is 5GB which is OK for most situations but still…with today's Cloud Storage possibilities it must not be an issue.
Giftless
Setting up your own Git LFS server and wiring it up with cloud-based blob storage such as AWS S3 or Google Cloud Storage sounds like a great solution - https://github.com/datopian/giftless. You can then revise and change your large files without worrying much about storage cost (a rough cost sketch follows the pricing links below):
- https://aws.amazon.com/s3/pricing/ ($0.023 per GB-month of storage, plus bandwidth; the first 100 GB/month of egress is free).
- Cloudflare's FREE egress bandwidth is coming: https://www.cloudflare.com/press-releases/2021/cloudflare-announces-r2-storage/.
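A back-of-the-envelope sketch of the storage-cost argument above. The file size, revision count, and the $0.023 per GB-month figure are illustrative assumptions taken from the S3 pricing link, not measured numbers; LFS stores each revision as a full object, not a delta.

```python
# Back-of-the-envelope cost sketch for versioning a large file in S3-backed LFS.
# Assumption (illustrative): S3 Standard storage at $0.023 per GB-month, and
# every revision of the file is stored in full.

S3_STORAGE_USD_PER_GB_MONTH = 0.023  # from https://aws.amazon.com/s3/pricing/

def monthly_storage_cost_usd(file_size_gb: float, revisions: int) -> float:
    """Cost of keeping `revisions` full copies of a file of `file_size_gb` GB."""
    total_gb = file_size_gb * revisions
    return total_gb * S3_STORAGE_USD_PER_GB_MONTH

if __name__ == "__main__":
    # e.g. a 0.5 GB data file revised 20 times -> 10 GB stored -> ~$0.23/month
    print(f"${monthly_storage_cost_usd(0.5, 20):.2f} per month")
```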
Data Lake
Can this become our new pattern for creating Data Lakes? For instance (a minimal setup sketch follows this list):
- Use Git for your data.
- Use cloud storage as the blob backend, via Giftless.
- Create your Data Lake with powerful features of Git + GitHub/GitLab.
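A minimal sketch of what the setup could look like in practice, assuming git and git-lfs are installed locally. The Giftless endpoint URL and the tracked file patterns are hypothetical placeholders, not a fixed convention.

```python
# Minimal sketch: initialise a Git repo whose large data files are tracked via
# Git LFS and pushed to a self-hosted LFS server (e.g. Giftless in front of S3/GCS).
# Requires git and git-lfs to be installed; the endpoint URL below is a placeholder.
import subprocess

LFS_ENDPOINT = "https://lfs.example.org/my-org/my-data-lake"  # hypothetical Giftless URL

def run(*args: str) -> None:
    subprocess.run(args, check=True)

run("git", "init", "my-data-lake")
# Point this repo at the self-hosted LFS server instead of GitHub's LFS.
run("git", "-C", "my-data-lake", "config", "-f", ".lfsconfig", "lfs.url", LFS_ENDPOINT)
run("git", "-C", "my-data-lake", "lfs", "install", "--local")
# Track large/binary data formats as LFS objects (patterns are examples).
run("git", "-C", "my-data-lake", "lfs", "track", "*.parquet", "*.csv.gz")
run("git", "-C", "my-data-lake", "add", ".gitattributes", ".lfsconfig")
run("git", "-C", "my-data-lake", "commit", "-m", "Set up LFS-backed data lake repo")
```

From here, data files matching the tracked patterns are committed as small pointer files in Git, while their content is uploaded to the blob store behind the LFS server.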
What if you already have S3 as a Data Lake?
Options:
- Re-upload so that we start tracking via Git.
- Add metadata for existing data files (see the pointer-file sketch below).
- TODO
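One hedged idea for the "add metadata for existing data files" option: generate Git LFS pointer files for objects that already sit in S3, so Git can start tracking them without the client re-uploading the bytes. This assumes boto3 credentials are configured, and the bucket and key names below are placeholders.

```python
# Sketch: build a Git LFS pointer file for an object that already lives in S3,
# so the repository can reference it without re-uploading the data through Git.
# Assumes boto3 credentials are configured; bucket and key below are placeholders.
import hashlib
import boto3

def lfs_pointer_for_s3_object(bucket: str, key: str) -> str:
    """Stream the S3 object, compute its sha256 oid and size, and return the
    Git LFS pointer text (format: https://git-lfs.github.com/spec/v1)."""
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    sha = hashlib.sha256()
    size = 0
    for chunk in obj["Body"].iter_chunks(chunk_size=8 * 1024 * 1024):
        sha.update(chunk)
        size += len(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha.hexdigest()}\n"
        f"size {size}\n"
    )

if __name__ == "__main__":
    pointer = lfs_pointer_for_s3_object("my-data-lake-bucket", "raw/events-2022-03.parquet")
    # Commit this small pointer file to Git in place of the large data file.
    with open("events-2022-03.parquet", "w") as f:
        f.write(pointer)
```

Caveat: for such a pointer to resolve on clone/checkout, the object's content still has to be reachable through the LFS server under that oid (e.g. copied server-side into the storage layout the LFS server expects); that wiring is still an open question.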