17 Jan 2023 - tsp
Last update 27 Jan 2023
10 mins
So who doesn’t know this? You’re running a git repository and have to store large files like assets in the repository - or you want to use it to keep track of, for example, some kind of log- or lab book that also includes larger files. In this case git is usually not the best solution and one uses other repositories - for binary artifacts this is most of the time a repository like Nexus or Artifactory. Don’t get me wrong - those tools are great. But sometimes you really want to keep that data associated and automatically versioned in sync with your git repository - for example when building static webpages or, as mentioned above, when hacking a lab book using git instead of some other custom, better suited solution.
Git itself is not suited for files larger than 100 MB, and even for moderate file sizes it gets pretty slow - it has been designed for textual content like source code anyways. To solve that problem a third party solution - git large file storage (LFS) - has been developed. Note that the LFS plugin does not store large content in the git repository. It only keeps references there and pushes the objects (identified by their SHA-256 hash) to a referenced external web service. Please note that there are drawbacks when using git LFS: the tracked files are not directly stored in your git repositories but on an external web service (a third component). Only references are kept in the repository itself.

Anyways, this should be a summary on how to run a server that supports git LFS and how to use it from your clients in a simple fashion. Please note that the described LFS server seems to be used in some particular production scenarios but might not be usable for production in small scale deployments the way described in this blog article.
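To illustrate the references mentioned above: a file tracked by LFS is replaced in the git repository by a small text pointer roughly like the following (oid and size of course depend on the actual file):

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345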
As mentioned above, using git LFS requires a separate web service that’s reachable by all clients under the same common name - which usually means it’s exposed to the web. There is a number of implementations of the git LFS protocol. Most of them are in an experimental, non-mature stage. In addition there is a number of commercial implementations out there (for example GitHub runs their own service). Anyways one really has to think about whether one wants to host such a service oneself. But nevertheless the simplest solution as of today is, in the author’s opinion, the usage of giftless.
Giftless is a WSGI application and thus should run in a WSGI application server such as uwsgi or gunicorn. giftless has been designed to support a variety of authentication backends (though not many of them are available besides a generic read-only one, a generic one that allows anyone to read and write, and a JWT based schema). It also supports a number of storage backends - besides storing to the local filesystem it allows one to use one of the major cloud storage providers (Amazon’s S3, Google Cloud Storage or Microsoft Azure Blob Storage) as backend.
In the following example local file storage will be used to illustrate how to use git LFS for an in-house solution. Still keep in mind that you have to make backups of your repository yourself. There won’t be an automatic copy of the whole repository on any client, and pushing to multiple remotes does not copy the large files!
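Since nothing backs up the LFS objects automatically, a minimal backup approach might look like the following - the paths simply refer to the example server and repository set up later in this article:

$ # Mirror the git repository itself (history, refs and the LFS pointers)
$ git clone --mirror /path/to/testrepo testrepo-backup.git
$ # Copy the LFS object store of the giftless server alongside it
$ rsync -a giftless-server/ giftless-server-backup/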
First one needs to install at least:

- git
- the git-lfs plugin, when one wants to create repositories
- uwsgi
- the giftless git LFS server

It’s assumed that Python and pip are already available since uwsgi and giftless are Python applications.
To install the software on FreeBSD one can use either ports or packages. When using packages for example:
pkg install git
pkg install git-lfs
pkg install www/uwsgi
pip install giftless
Since giftless does not declare all of its dependencies as one would expect, one has to bootstrap them oneself:
fetch https://raw.githubusercontent.com/datopian/giftless/master/requirements.txt
# Inspect the fetched file!
pip install -Ur requirements.txt
The next step is to run the server - here it’s illustrated how one could do this from the command line. In a usual deployment scenario one would of course launch the service using the rc.init system. For the sake of simplicity let’s first look at how one launches and configures it manually. Usually one configures the service using a configuration file that’s then referenced via the GIFTLESS_CONFIG_FILE environment variable.
Currently the configuration file allows one to configure:

- TRANSFER_ADAPTERS that interface the different storage backends. Those specify a storage class as well as options for the given storage class.
- AUTH_PROVIDERS that control authentication. The stock implementation only supports three providers that are usually of limited use for a publicly reachable service.
- MIDDLEWARE configuration for WSGI middleware (for example when running behind a proxy, so giftless knows under which URI it is served or which CORS headers should be set). An example is shown below.

There is some documentation for the configuration options available.
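As an illustration for the MIDDLEWARE section: when running behind a reverse proxy, the giftless documentation wraps werkzeug’s ProxyFix middleware around the application so that the externally visible host and path prefix are honoured. The exact keys (class, kwargs) follow that documentation and should be double checked against the installed version:

MIDDLEWARE:
  - class: werkzeug.middleware.proxy_fix.ProxyFix
    kwargs:
      x_host: 1
      x_prefix: 1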
The most simple - but never to be used for a publicly run service - authentication provider is giftless.auth.allow_anon:read_write. This just allows anonymous users to store and read arbitrarily large files on the service. To configure that authentication provider one would use the following YAML snippet in one’s configuration file:
AUTH_PROVIDERS:
- giftless.auth.allow_anon:read_write
This provider is useful for local deployments for testing purposes only. Usually one will use a JWT based provider or an anonymous read-only provider for an exposed service - for example, to use HMAC protected JWT tokens one could use giftless.auth.jwt:factory:
AUTH_PROVIDERS:
  - factory: giftless.auth.jwt:factory
    options:
      algorithm: HS256
      private_key: XXXX
Unfortunately documentation for authentication provider configuration is not really usable at the current point in time - it can be assumed that any production use of the service uses a cloud backend and JWT tokens for authentication.
Storage backends can be configured for the major cloud storage systems, which seems to be the typical use case for this LFS server. As of the time of writing this summary the official documentation only contains information for the Amazon S3, Google Cloud Storage and Microsoft Azure Blob Storage backends - and just mentions that there is a local file storage backend, but says nothing about how to configure it. When launching giftless with uwsgi though, the local storage backend is the default backend - it just uses the current working directory. Not a clean way to solve this, but enough to play around.
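If one wants to pin the storage location explicitly instead of relying on the working directory, a transfer adapter configuration along the following lines might work. Be aware that this is an unverified sketch: the factory and storage class paths (giftless.transfer.basic_streaming:factory, giftless.storage.local_storage:LocalStorage) and the path option are taken from the giftless sources and may change between versions:

TRANSFER_ADAPTERS:
  basic:
    factory: giftless.transfer.basic_streaming:factory
    options:
      storage_class: giftless.storage.local_storage:LocalStorage
      storage_options:
        path: /var/lib/giftless/objects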
So one can simply create a directory, create a configuration file in there and launch the service in the uwsgi container on an arbitrary local port - in this case the service should only listen on port 1234 on the local host:
$ mkdir giftless-server
$ cd giftless-server
$ cat > giftless.conf.yaml << EOL
AUTH_PROVIDERS:
- giftless.auth.allow_anon:read_write
EOL
$ env GIFTLESS_CONFIG_FILE=giftless.conf.yaml uwsgi -M -T --threads 2 -p 2 --manage-script-name --module giftless.wsgi_entrypoint --callable app --http 127.0.0.1:1234
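To verify that the server is up, one can talk to the LFS batch API by hand. The endpoint path below already uses the organization/repository pattern introduced later in this article, and the request body follows the git LFS batch API specification (oid and size are dummy values for illustration):

$ curl -X POST \
    -H "Content-Type: application/vnd.git-lfs+json" \
    -H "Accept: application/vnd.git-lfs+json" \
    -d '{"operation": "download", "transfers": ["basic"], "objects": [{"oid": "0000000000000000000000000000000000000000000000000000000000000000", "size": 1}]}' \
    http://127.0.0.1:1234/my-organization/test-repo/objects/batch

Any JSON answer - even one that just reports the object as missing - shows that the WSGI application is reachable and parsing requests.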
First one has to create a new (bare) git repository on the server as usual - this is the ordinary git remote and independent of the LFS service:
mkdir -p testrepo
cd testrepo
git init --bare
It’s assumed that git and git-lfs are already installed on the client. In case the client complains about an unknown lfs command, the git-lfs package is missing. First one has to clone the repository as usual:
git clone REPOURI
In this case REPOURI is the URI of the repository - this can reference the local filesystem, an SSH server or the git protocol - as usual it does not matter. Then one has to install lfs in the local repository and tell it which files to track (inside the repository directory) and also which LFS service to use - in this example it’s assumed to be running on 127.0.0.1:1234 and the repository is assumed to be referenced by the path my-organization/test-repo, which is a pattern suggested by the developers of the giftless server. The URI should be reachable from any client that is supposed to access files stored in the repositories using LFS - using the same pattern (i.e. no distinction when accessing from outside your network or from inside, etc.).
git lfs install
git config -f .lfsconfig lfs.url http://127.0.0.1:1234/my-organization/test-repo
git lfs track "*.bin"
This would track all files matching the *.bin pattern - which works by creating two local files in the repository. The first one is .lfsconfig, which tells the lfs module about its configuration. The second is the .gitattributes file, which prevents git from storing the matched files in the repository itself and instead redirects them to the lfs module for filter, diff and merge operations - which means that each client has to install the lfs extension! One should add .lfsconfig and .gitattributes to the repository and push them with the next commit after enabling lfs, as shown below.
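Before committing, one can inspect the two files - they should look roughly like this (the URL of course reflects this article’s example configuration):

$ cat .lfsconfig
[lfs]
    url = http://127.0.0.1:1234/my-organization/test-repo
$ cat .gitattributes
*.bin filter=lfs diff=lfs merge=lfs -text

Afterwards the files are committed and pushed: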
git add .gitattributes .lfsconfig
git commit
git push
Pushing now already transfers all newly tracked files into the LFS store - the *.bin files that one has added and committed / pushed are not found in the git directory on the remote but on the LFS server, identified by their SHA-256 hash. Do not forget to install lfs on each client after cloning the repository - LFS really is just a redirection wrapper around git that has some additional maintenance to do.
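To double check that this worked, one can list the LFS managed files on the client and look at the server’s object store - the file name data.bin is just a stand-in for whatever was committed, and the on-disk layout on the server depends on the storage backend configuration:

$ git lfs ls-files
4d7a214614 * data.bin
$ # On the server (default local storage backend): the content shows up
$ # as a file named after its SHA-256 hash below the working directory
$ find giftless-server -type f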
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/