Clean up Git history part 2

Sensitive data or excessive storage consumption: there are good reasons to want to change the Git history. In a previous blog post, I explained how to purge files from the Git history using BFG. A weak point of BFG is its lack of support for full paths, so you cannot specifically remove files or folders in subfolders from the history. That makes it time to look at alternative solutions.


In addition to git filter-branch, which is officially no longer recommended, git-filter-repo is one of the tools for cleaning up the history. After a short installation, we first analyze the repository to find, for example, the largest folders in the history:

git filter-repo --analyze

This generates a number of TXT files in the folder .git/filter-repo/analysis:

  • directories-all-sizes.txt
  • extensions-all-sizes.txt
  • path-all-sizes.txt
  • ...

The file directories-all-sizes.txt is worth a closer look:

=== All directories by reverse size ===

Format: unpacked size, packed size, date deleted, directory name

  4624417043 3796607988 <present> <toplevel>
  4475940396 3778033787 <present> wp-content
  4060236681 3694449320 <present> wp-content/uploads
   305163809   70576241 <present> wp-content/plugins
   123818107   15442735 <present> wp-includes
...
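
To pick out the heaviest directories from this report without reading it by hand, a small filter can help. The following is a minimal sketch, assuming the four-column layout shown above. The function name large_dirs and the byte threshold are illustrative and not part of git-filter-repo, and a directory name containing spaces would be cut off at the first space.

```shell
# Hypothetical helper (not part of git-filter-repo): print every directory
# whose packed size is at least MIN_PACKED bytes, with the size in MiB.
# Assumes the columns: unpacked size, packed size, date deleted, directory name.
large_dirs() {
  report=$1
  min_packed=$2
  awk -v min="$min_packed" '
    # Data rows start with two numeric columns; headings are skipped.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && ($2 + 0) >= (min + 0) && $4 != "<toplevel>" {
      printf "%.1f MiB  %s\n", $2 / 1048576, $4
    }
  ' "$report"
}
```

Called as `large_dirs .git/filter-repo/analysis/directories-all-sizes.txt 1000000000`, it would list wp-content and wp-content/uploads for the sample report above, since only those two directories exceed the threshold.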

It often happens that data which has long been ignored and removed from HEAD still lives on in the history (for example, the WordPress media folder wp-content/uploads/ or an accidentally pushed node_modules or vendor folder).
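
Such leftovers can also be spotted with plain Git. The following is a rough sketch that compares every path ever touched in the history with the paths still present in HEAD. The function name paths_only_in_history is made up for this example, and renamed files will show up under their old names.

```shell
# Hypothetical helper: print paths that appear in some commit on any ref
# but are no longer present in the HEAD commit.
paths_only_in_history() {
  all_paths=$(mktemp) || return 1
  head_paths=$(mktemp) || return 1
  # Every path touched by any commit on any ref (blank separator lines dropped).
  git log --all --format= --name-only | sed '/^$/d' | sort -u > "$all_paths"
  # Every path contained in the HEAD commit.
  git ls-tree -r --name-only HEAD | sort -u > "$head_paths"
  # Lines unique to the first list: in the history, but gone from HEAD.
  comm -23 "$all_paths" "$head_paths"
  rm -f "$all_paths" "$head_paths"
}
```

Run inside a clone, `paths_only_in_history | cut -d/ -f1-2 | sort | uniq -c | sort -rn` gives a quick overview of which folders account for most of the removed files.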

It is important to note that the major code hosting platforms GitHub and GitLab recommend procedures that differ in some respects. For example, on GitHub we remove wp-content/uploads/ from the history with git-filter-repo using the following steps:

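# work in a fresh clone inside a temporary directory
mkdir tmp-repo
cd tmp-repo
git clone git@github.com:foo/bar.git .
# git filter-repo removes the origin remote, so back up the config first
cp .git/config /tmp/config-backup
# drop wp-content/uploads from every commit
git filter-repo --invert-paths --path wp-content/uploads
# restore the config (and with it the origin remote)
mv /tmp/config-backup .git/config
# overwrite all branches and tags on the remote
git push origin --force --all
git push origin --force --tags
# check size locally
git gc && git count-objects -vH
cd ..
rm -rf tmp-repo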
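
Before force-pushing, it can be reassuring to confirm that the rewrite really removed the folder from every branch and tag. The check below is a small sketch using plain Git; the function name path_in_history is an assumption made for this example.

```shell
# Hypothetical helper: exit status 0 if any commit on any ref still touches
# the given path, non-zero otherwise.
path_in_history() {
  [ -n "$(git log --all -1 --format=%H -- "$1")" ]
}
```

Inside the rewritten clone, `path_in_history wp-content/uploads && echo "still in history"` should print nothing once the cleanup has worked.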

We can now also check the size remotely (it can take up to 24 hours for the new size to show up via the API and in the UI). To do this, open the repository settings (if the repository belongs to an organization, you must first add your own account to the organization). Now we see the size:

GitHub: disk space before cleanup
GitHub: disk space after cleanup

The procedure is slightly different on GitLab:

mkdir tmp-repo
cd tmp-repo
# Settings > General > Advanced > Export project > download tar.gz file into tmp-repo
tar xzf 20*.tar.gz
git clone --bare --mirror project.bundle
cd project.git
git filter-repo --invert-paths --path wp-content/uploads/
cp ./filter-repo/commit-map /tmp/commit-map-1
# copying the commit-map has to be done after every single command from git filter-repo
# you need the commit-map files later
git remote remove origin
git remote add origin git@gitlab.com:foo/bar.git
# Settings > Repository > Protected branches >
# enable "Allowed to force push to main/master"
git push origin --force 'refs/heads/*'
git push origin --force 'refs/tags/*'
git push origin --force 'refs/replace/*'
# Settings > Repository > Protected branches >
# disable "Allowed to force push to main/master"
cd ./../../
rm -rf tmp-repo
date
# wait 30 minutes (😱)
date
# Settings > Repository > upload /tmp/commit-map-X
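
Since the commit-map has to be saved after every single git filter-repo run, the copy step lends itself to a small helper. This is only a sketch: the function name save_commit_map and its default arguments are assumptions, chosen to match the paths used in the listing above (a bare repository and /tmp as the target).

```shell
# Hypothetical helper: copy the current commit-map to the next free
# numbered file (commit-map-1, commit-map-2, ...) and print its path.
# Defaults assume a bare repository and /tmp as the target directory.
save_commit_map() {
  src=${1:-filter-repo/commit-map}
  dest_dir=${2:-/tmp}
  n=1
  # Find the first number that is not taken yet.
  while [ -e "$dest_dir/commit-map-$n" ]; do
    n=$((n + 1))
  done
  cp "$src" "$dest_dir/commit-map-$n" && echo "$dest_dir/commit-map-$n"
}
```

Calling `save_commit_map` directly after each git filter-repo command in project.git keeps the earlier maps from being overwritten by later runs.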

After waiting another ~5 minutes, we can view the storage space under Settings > Usage Quotas:

GitLab: disk space before cleanup
GitLab: disk space after cleanup

After the cleanup, it is important that all developers involved take part in the final steps: if a user now performs a normal push from their own local copy, the large files would find their way back into the central repository. Therefore, one of the following 3 options is recommended:

  • rm -rf .git && git clone xxx temp && mv temp/.git ./.git && rm -rf temp && git add -A .
    ("poor man's fresh clone", re-clone into existing repository)
  • rm -rf repo && git clone xxx .
    ("start from scratch", the cleanest variant)
  • git pull -r
    ("pull with rebase", you still have the uncleaned history, but no longer accidentally overwrite)

Given the current quotas (especially the new GitLab restrictions), it is always worth checking the size of your repositories' history and cleaning it up if necessary:

GitHub Free / GitLab Free
Max file size limit: 100MB
Max repo size limit: 5,000MB
Max repo count limit
Max overall size limit: 5,000MB

Finally, it's also worth taking a look at a free, self-hosted alternative such as Gitea. With little effort, you can run your own Git instance (GUI secured by SSL, backups included, control via a powerful API) on a very lean server. It is also highly configurable and superior in terms of data protection.
