Large Files, Git and Ways to Save Storage—A Discussion #1910

Open · opened 2025-05-04 12:36:06 +02:00 by dbat · 3 comments

I hope it's okay to open an issue as a community discussion. I don't know where else to go; I tried on Mastodon, but it's not working there. Please let me know!

Why git-lfs is an "illusion"

I have been building game-dev tools for a while; this usually means big Blender files alongside small code files. I found git-lfs (like many do) and thought, "cool, it's got it sorted!" I did not look into the details; I just believed the pitch: it handles large files cleverly so that you don't have to worry.

But... I am fairly sure that git-lfs is not saving space; neither for me nor Codeberg.

I first noticed this when pushing a small repo that had a 7 MB blend file. I was making changes to it: add, commit, push, repeat. By chance I was also watching the Codeberg repo page and noticed my repo's size going up and up. 7 MB -> 14 MB -> 21 MB -> eek!

So, after some asking around on Mastodon and a bunch of surfing and reading, I learned that git-lfs *is* saving versions of my blend files on the server side, where I thought it was just saving one file. I also found that *locally*, in `.git/lfs`, there are just as many versions! So it's chewing local space too!

I have found that `git lfs prune`, run locally, will clean out the old local copies quite well. Is there an equivalent that Codeberg can run?
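
For anyone in the same boat, the local clean-up I mean is just standard git-lfs commands, roughly like this (exact savings will vary):

```sh
# See what LFS is tracking and how much space the local object store uses.
git lfs ls-files
du -sh .git/lfs

# Preview what would be removed, then actually prune old local copies.
git lfs prune --dry-run --verbose
git lfs prune
```

Note that this only frees disk space on my machine; the server-side copies stay where they are, which is the whole point of my question.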

(Git-lfs' actual use-case seems to be for those who clone your repo to get the smallest possible download of binary blobs.)

Please let me know if I'm wrong about lfs; I'd love to be wrong!

Let's Rewind: The Situation

Doing game-dev, it's impossible to avoid big blobby files. 3D assets, sounds, textures, videos, etc.

These files raise basic questions:

  1. How can I back up these huge files online without blowing up storage space?
  2. How can I work with them in a natural and easy way using git?

The most basic approach I can think of is a structure like:

```
📁 project
  📁 bigfiles
    char.blend
    texture.png
  📁 dev
    .git
    .gitattributes
    .gitignore
    somecode.gd
    char.blend (symlink into ../bigfiles/char.blend)
    texture.png (symlink into ../bigfiles/texture.png)
```

You work in `dev` as normal with git. When you add new large files, do so in `bigfiles` and make relative links in `dev` as needed.
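
To make it concrete, adding a new asset would look something like this (the file name here is just a placeholder):

```sh
# Run from the project root. The real file lives in bigfiles/;
# dev/ only gets a small relative symlink (resolved relative to dev/).
mv newmodel.blend bigfiles/
ln -s ../bigfiles/newmodel.blend dev/newmodel.blend
```

Git stores a symlink as a tiny blob containing only the target path, so versioning the links themselves costs next to nothing.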

How to back up the files in `bigfiles` is the main question. The use-case is to get those big files saved on the Codeberg server using minimum space (no duplicates because of git versioning).

Pushing

rsync (+ ssh)

If Codeberg could offer rsync space, then this would be a solved problem! I honestly think it would use less drive space than LFS currently does.

We could write some scripts, perhaps extending git, perhaps just a script one runs from the `bigfiles` dir when you want to send changed files up.

rsync is pretty magical and it can even do binary diffs etc.
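
If that space existed, the whole "push" side could be as small as something like this (the host and remote path are made up, since Codeberg offers nothing like this today):

```sh
# Hypothetical: mirror bigfiles/ to rsync-over-ssh space.
# -a preserves permissions and timestamps, -z compresses in transit,
# --delete removes remote files that were deleted locally.
rsync -az --delete --progress bigfiles/ user@rsync.example.org:myproject/bigfiles/
```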

Pulling/Cloning

rsync (+ ssh)

Since the project is now two directories, the git part comes down with a normal clone, which leaves all the symlinks broken until the big files arrive. I don't know yet how best to do it, but the README can explain how to pull down the `bigfiles` dir. Perhaps a simple script, perhaps a git extension again.
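
The pull direction would just be the reverse, run from the freshly cloned project root (again, the host is made up):

```sh
# Hypothetical: fetch the big files next to the cloned dev/ directory;
# once bigfiles/ exists, the relative symlinks in dev/ resolve again.
rsync -az --progress user@rsync.example.org:myproject/bigfiles/ bigfiles/
```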

Your Ideas?

What do you think about the general problem and these ideas? What else could/would you do?

File locking is probably going to be needed, but I am talking about me (a small dev; honestly, solo), and if others joined, it would be up to us to communicate so we don't stomp on each other's files.

Other Ideas

  • **Git-submodules**: I tried to understand these, but it's pretty obscure. I also think the same problem happens: if all the binary blobs are in a submodule repo, that repo is still going to grow by the size of each blob times every change.
    Unless... is there a way to keep squashing history, or whatever git voodoo, so that the sub-repo gets garbage-collected often and thus keeps the storage space down? (There's a rough sketch of what I mean below this list.)
  • **Git-subtrees**: I could not make head or tail of these. Are they an option?
  • **Git-LFS**: Is there a way to use it that results in actual minimal file versioning? I.e. I only want the latest version of each of the files in `bigfiles` to be saved on Codeberg *and* in my local repo.
  • **Git-Annex**: Seems it needs a server-side component, so I could not test it out.
  • A recent idea from Mastodon: Codeberg "Releases". I still have to look at it. https://docs.codeberg.org/git/using-tags/
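
About the submodule idea above: the "squashing" I have in mind would be periodically replacing the sub-repo's history with a single fresh commit. A rough, untested sketch (assuming the branch is called main):

```sh
# Inside the bigfiles sub-repo: drop all history, keep the current files.
git checkout --orphan fresh
git add -A
git commit -m "squash history, keep latest assets only"
git branch -D main            # delete the old, fat branch
git branch -m main            # rename the new single-commit branch to main
git push --force origin main
```

The catch is that the server only reclaims space whenever it next garbage-collects, and rewriting history breaks every existing clone and every superproject commit that points at the old submodule commits, so it's not something to do casually.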

What else?

Owner

Git LFS greatly improves the situation for Codeberg, because:

  • It provides the technical ability to remove old files without corrupting the Git history. With normal Git, there is no way to remove old file versions without rewriting history, but with Git LFS it is possible.
  • It reduces the overhead necessary for Codeberg. Git repos are regularly compacted and backup is "expensive". Git LFS is just normal files that are much easier to handle for us.

You can currently delete LFS files manually from the repo settings. There is a pending grant application for improving the LFS implementation in Forgejo (https://codeberg.org/forgejo/sustainability/src/branch/main/2025/2025-04-01-nlnet-ngi0-commons/tasks.md#improved-git-lfs-support-refactoring-170-hours), which will hopefully allow cleaning up more in the future.

Author

Thanks for your reply. I guess we've got to wait and see.

Author

I have one technique that I am using. If anyone wants to read about it: https://dbat.codeberg.page/posts/git-large-file-technique.html
