inside the mind of a linux admin

Optimizing a previously large and bloated git repository

In a déjà vu scenario of a previous blog post I authored in 2012 called Source control != File System, I ranted about why binaries do not have any place in a source controlled repository. Fast forward nearly 4 years, and I’ve once again encountered a repository that was filled with network device firmware image (.bin) files.

I knew something was terribly wrong when I went to clone a fresh copy of the repo to look at some basic device startup configs, and it took me nearly 10 minutes:

Cloning into 'network'...
remote: Counting objects: 1014, done.
remote: Compressing objects: 100% (925/925), done.
remote: Total 1014 (delta 499), reused 155 (delta 61)
Receiving objects: 100% (1014/1014), 1.63 GiB | 2.67 MiB/s, done.
Resolving deltas: 100% (499/499), done.
Checking connectivity... done.

real 9m9.360s
user 1m13.431s
sys 0m22.595s

After grabbing another coffee and enjoying a smoke, the cloning operation had finally completed. A bit of poking around quickly revealed the 2.9GB “Firmware” directory inside an otherwise organized and newly restructured repository. The logical fix to reclaim what would likely be hours of my life over a few quarters of working with this repo was to just git rm -rf the directory and move on. After pushing my changes, I quickly realized that the wasted space was still very much alive in git’s history data. This wasn’t much of a surprise, considering git has to keep every old revision around to fulfill one of my favorite and arguably most valuable purposes of source control: revision history.

So, how did I restore sanity to this repository?

No, you do not have to delete everything and start over – though this would be effective, it is a waste of time and energy.

However, I HIGHLY recommend that you back up or clone a copy of your bloated repo, just in case you do something dumb.
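One way to take that safety copy is a mirror clone, which carries over every branch and tag rather than just the checked-out branch. This is only a sketch: backup_repo is a helper name I made up for illustration, and the source and destination paths are whatever suits your setup.

```shell
# Hypothetical helper: mirror-clone a repository as a safety net
# before rewriting its history.
# Usage: backup_repo <source-repo> <backup-dir>
backup_repo() {
    src=$1
    dest=$2
    # --mirror copies every ref (branches, tags, notes) into a bare repository
    git clone --quiet --mirror "$src" "$dest"
}
```

The result is a bare repository, so if things go sideways you can clone from it again like from any other remote.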

Next, you will need to find the files that you want removed from the repo. I stumbled upon a one-liner that leveraged the git rev-list command, and piped it to some ugly perl to chomp and print the largest files (source). While I am not a big fan of perl, it is certainly not the ugliest perl I’ve ever encountered, and it’s effective.

But, in the interest of my quest to avoid using perl and enable me to continue making fun of one of my perl loving systems architects, I decided to find my own way. Just as with 90% of the lines you’ll find in an average perl script, you can accomplish nearly identical results with 10% of the code by using an alternative. In this case, I achieved the same thing with just a bit of sed|sort|head action:

git rev-list master | while read rev; do git ls-tree -lr $rev | cut -c54- | sed -r 's/^ +//g;'; done | sort -u | sort -rnk1 | head -n 20

The output lists the byte size and path of the 20 largest files ever pushed to your repo in any previous revision of master.
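The one-liner above relies on fixed column offsets in the git ls-tree output and walks every revision one at a time. As an alternative sketch (not what I ran at the time), git can report object sizes directly through git cat-file --batch-check. The function name largest_blobs is hypothetical, and it scans every ref rather than only master:

```shell
# List the N largest blobs reachable from any ref, as "<size-in-bytes> <path>".
# largest_blobs is a hypothetical helper name; run it inside the repository.
largest_blobs() {
    count=${1:-20}
    # rev-list --objects prints "<sha> <path>" for every reachable object;
    # cat-file looks up each sha and echoes the path back through %(rest).
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -rn |
        head -n "$count"
}
```

Calling largest_blobs 20 from the top of the working copy gives a list comparable to the one above. Each blob is listed once, under one of the paths it was stored at.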

Using the results, you need to determine what files you want removed. In the command below, simply replace DUMBFILE with the files or directories that you want removed.

git filter-branch --tag-name-filter cat --index-filter 'git rm -r --cached --ignore-unmatch DUMBFILE' --prune-empty -f -- --all

Next up, I needed to do a bit of garbage collection on the references and reclaim the lost space:

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --aggressive --prune=now
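Before pushing anything, it is worth confirming that the rewrite actually did what you wanted. A minimal sketch, assuming you run it inside the rewritten repository; path_purged is a helper name invented for this example:

```shell
# Hypothetical helper: succeed only if no commit on any ref touches the path.
# Usage: path_purged <path>
path_purged() {
    [ -z "$(git log --all --oneline -- "$1")" ]
}

# Example (run inside the rewritten repository):
#   path_purged Firmware && echo "Firmware is gone from history"
#   git count-objects -vH    # "size-pack" reports the on-disk pack size
```

If path_purged fails, some ref still points at a commit containing the path, and the space will not be reclaimed by git gc.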

Lastly, I needed to push the history changes made back to the repository with the use of force.

Disclaimer: there are VERY few situations where the use of --force is recommended, and in my firsthand experience I have seen it destroy repos when used incorrectly or while others are pushing at the same time. USE CAUTION!

git push origin --force --all
git push origin --force --tags

And all was right with the world again…

After the successful push, demolish your bloated local repository and clone its slim and healthy self back home again…

Cloning into 'network'...
remote: Counting objects: 773, done.
remote: Compressing objects: 100% (317/317), done.
remote: Total 773 (delta 439), reused 773 (delta 439)
Receiving objects: 100% (773/773), 196.15 KiB | 0 bytes/s, done.
Resolving deltas: 100% (439/439), done.
Checking connectivity... done.

real 0m0.910s
user 0m0.088s
sys 0m0.032s
