inside the mind of a linux admin

Optimizing a previously large and bloated git repository

In a 2012 blog post called Source control != File System, I ranted about why binaries have no place in a source-controlled repository. Fast forward nearly 4 years, and in a bout of déjà vu I’ve once again encountered a repository filled with network device firmware image (.bin) files.

I knew something was terribly wrong when I went to clone a fresh copy of the repo to look at some basic device startup configs, and it took me nearly 10 minutes:

Cloning into 'network'...
remote: Counting objects: 1014, done.
remote: Compressing objects: 100% (925/925), done.
remote: Total 1014 (delta 499), reused 155 (delta 61)
Receiving objects: 100% (1014/1014), 1.63 GiB | 2.67 MiB/s, done.
Resolving deltas: 100% (499/499), done.
Checking connectivity... done.

real 9m9.360s
user 1m13.431s
sys 0m22.595s

By the time I had grabbed another coffee and enjoyed a smoke, the cloning operation had finally completed. A bit of poking around quickly revealed a 2.9GB “Firmware” directory inside an otherwise organized and newly restructured repository. The logical fix, to reclaim what would likely add up to hours of my life over a few quarters of working with this repo, was to just git rm -rf the directory and move on. After pushing my changes, I quickly realized that the wasted space was still very much alive in git’s history data. This wasn’t much of a surprise, since keeping every past version of every file is exactly what git has to do to deliver one of my favorite and arguably most valuable features of source control: revision history.

So, how did I restore sanity to this repository?

No, you do not have to delete everything and start over – though this would be effective, it is a waste of time and energy.

However, I HIGHLY recommend that you back up or clone a copy of your bloated repo first, just in case you do something dumb.
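One way to take that safety copy is a mirror clone, which carries every branch and tag along with it. This is a minimal sketch, not a prescribed procedure: backup_repo is a hypothetical helper name, and both paths in the usage comment are placeholders.

```shell
# backup_repo is a hypothetical helper; it is not part of git.
# $1 = path (or URL) of the bloated repo, $2 = where to put the backup.
backup_repo() {
    src=$1
    dest=$2
    # --mirror creates a bare repository containing every ref
    # (branches, tags, notes), so nothing in the history is left behind.
    git clone --quiet --mirror "$src" "$dest"
}

# Usage (placeholder paths):
#   backup_repo /path/to/network /path/to/network-backup.git
```

If the rewrite goes sideways, the backup can be cloned or pushed from like any other bare repository.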

Next, you will need to find the files that you want removed from the repo. I stumbled upon a one-liner that leveraged the git rev-list command and piped its output to some ugly perl to chomp and print the largest files (source). While I am not a big fan of perl, it is certainly not the ugliest perl I’ve ever encountered, and it’s effective.

But, in the interest of my quest to avoid using perl, and to keep making fun of one of my perl-loving systems architects, I decided to find my own way. Just as with 90% of the lines you’ll find in an average perl script, you can accomplish nearly identical results in 10% of the code by using an alternative. In this case, I achieved the same thing with just a bit of cut|sed|sort|head action:

git rev-list master | while read -r rev; do git ls-tree -lr "$rev" | cut -c54- | sed -r 's/^ +//'; done | sort -u | sort -rnk1 | head -n 20

The output lists the size in bytes and the path of the largest nonsense ever pushed to your repo in any previous revision.
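On a repo with a long history, that loop runs git ls-tree once per commit, which can get slow. A possible alternative, sketched here, asks git for every object reachable from any ref in a single pass and ranks the blobs by size. top_blobs is a hypothetical helper name; the pipeline in the usage comment relies on git rev-list --objects and git cat-file --batch-check. Unlike the one-liner above, it covers all refs rather than just master, and it reports each blob once under one of its paths.

```shell
# top_blobs is a hypothetical helper; it is not part of git.
# stdin: lines of "<type> <object-id> <size> <path>"
# stdout: the N largest blobs as "<size><TAB><path>" (N defaults to 20)
top_blobs() {
    awk '$1 == "blob" {
             size = $3
             # strip the first three fields so paths with spaces survive
             sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")
             print size "\t" $0
         }' |
        sort -rn -k1,1 | head -n "${1:-20}"
}

# Usage inside the repository:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     top_blobs 20
```

The sizes are the uncompressed object sizes, so they will not add up exactly to what the clone transfers, but they are good enough to spot the offenders.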

Using the results, you need to determine which files you want removed. In the command below, simply replace DUMBFILE with the files or directories that you want removed.

git filter-branch --tag-name-filter cat --index-filter 'git rm -r --cached --ignore-unmatch DUMBFILE' --prune-empty -f -- --all
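To check that the rewrite did what you wanted, you can count the commits on your branches and tags that still touch the removed path. This is a minimal sketch, assuming for illustration that the removed path was the Firmware directory; commits_touching is a hypothetical helper name. It deliberately looks only at branches and tags, because filter-branch leaves backup refs under refs/original that still point at the old history.

```shell
# commits_touching is a hypothetical helper; it is not part of git.
# Prints how many commits reachable from any branch or tag
# added, changed, or removed something under the given path.
commits_touching() {
    git log --branches --tags --format=%H -- "$1" | wc -l | tr -d ' '
}

# Usage inside the rewritten repository (expect 0 for a removed path):
#   commits_touching Firmware
```

Anything other than 0 suggests the path was spelled differently in some revisions, or that a ref was missed.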

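Next up, I needed to drop the references that still pointed at the old history (filter-branch keeps a backup of the original refs under refs/original, and the reflog remembers the old commits too), then garbage collect to reclaim the lost space: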

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --aggressive --prune=now
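To see whether the space actually came back, one option is to compare the pack size before and after the cleanup. git count-objects -v reports it as size-pack, in KiB. The sketch below only pulls that one number out of the output; pack_size_kib is a hypothetical helper name.

```shell
# pack_size_kib is a hypothetical helper; it is not part of git.
# stdin: the output of `git count-objects -v`
# stdout: the size-pack value, which git reports in KiB
pack_size_kib() {
    awk '$1 == "size-pack:" { print $2 }'
}

# Usage inside the repository, before and after the cleanup:
#   git count-objects -v | pack_size_kib
```

If the number has not dropped, something is probably still referencing the old commits, such as a leftover ref or a stash.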

Lastly, I needed to push the history changes made back to the repository with the use of force.

Disclaimer: there are VERY few situations where the use of --force is recommended, and in my firsthand experience I have seen it destroy repos when used incorrectly or while others are pushing. USE CAUTION!

git push origin --force --all
git push origin --force --tags

And all was right with the world again…

After the successful push, demolish your bloated local repository and clone her slim and healthy self back home again…

Cloning into 'network'...
remote: Counting objects: 773, done.
remote: Compressing objects: 100% (317/317), done.
remote: Total 773 (delta 439), reused 773 (delta 439)
Receiving objects: 100% (773/773), 196.15 KiB | 0 bytes/s, done.
Resolving deltas: 100% (439/439), done.
Checking connectivity... done.

real 0m0.910s
user 0m0.088s
sys 0m0.032s
