Why one should use version control like GIT or SVN for nearly everything

07 Oct 2021 - tsp
Last update 07 Oct 2021
Reading time 9 mins

Introduction
Different models and basic operations
Problems solved by using these tools
Drawbacks

Introduction

First off - what is “everything” in the context of this blog entry, what is a version control system, and who is this article targeted at? It’s not targeted at the experienced software developer who already manages their code using git or SVN - for such readers it will be boring and sound somewhat strange. It’s targeted at people who currently don’t use SCM (source code management) for any task. By everything I mean stuff like:

Program source code (obviously)
Articles, theses, papers, document collections, your school homework, etc.
Your curriculum vitae
Your website or the sources your website is generated from
Your exam samples
Books
Configuration files

What I don’t mean are large binary files such as your media or photo collection, temporary files that can easily be regenerated at a later time, as well as large databases and scraped or extracted data that can be regenerated.

What is version control? Version control systems (sometimes also called revision control, source control or source code management systems) allow one to manage collections of files, each in many different versions, either centrally or in a distributed fashion. Imagine you change something in your program’s source code or in your thesis and want to look at the old version later on. Often one sees people calling their files thesis, thesis_final, thesis_finallyfinal, thesis_finallyfinal_really and so on. They then shift these files around on external hard disks or USB flash drives, often with colliding names, and later on overwrite much of their new work, fail to locate the most current version, cannot find their comments again, etc.

Version control systems solve that problem, including the moving around on USB sticks. They usually provide a blame feature that can even show who changed what and when in case one is working in a team. And they usually allow for seamless cooperation by including merge tools - if many people modify the same file at different positions, these tools are usually able to merge the differences automatically (if proper file formats are used) or at least highlight merge conflicts.

And you never lose any old content - so think about what you put inside a repository. If everything goes right nothing will ever be deleted, and most systems do not even support deletion without major hacking around in their internal representation.
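To make the idea of looking at what changed between an old and a new version a bit more concrete, here is a minimal sketch using Python's standard difflib module. It only illustrates the kind of line-based comparison that version control tools show you, it is not how git or SVN store their data. The two sample texts and the function name show_changes are made up for this example.

```python
import difflib

# Two hypothetical versions of the same document, one list entry per line.
OLD_VERSION = ["Introduction\n", "We measure X.\n", "Conclusion\n"]
NEW_VERSION = ["Introduction\n", "We measure X and Y.\n", "Conclusion\n"]


def show_changes(old_lines, new_lines):
    """Return a unified diff (the usual -/+ format) between two versions."""
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="thesis (old)",
        tofile="thesis (new)",
    )
    return "".join(diff)


if __name__ == "__main__":
    print(show_changes(OLD_VERSION, NEW_VERSION))
```

Lines starting with a minus sign only exist in the old version, lines starting with a plus sign only in the new one. Because the comparison works line by line, plain text formats play much better with version control than binary formats.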

These systems have mainly been developed for software development, but the problem of revision management is as old as writing itself - and they can be applied to all textual content in a highly efficient way. In fact this web page is built out of a source control system.

Different models and basic operations

There exist two different main models for source control (but only two really popular software packages, though there is a huge number of different tools out there).

First there are centralized version control systems. These are built around a central repository that’s usually hosted on a server reachable over the local network or the internet. A typical representative is Subversion (SVN). One creates a repository on the server (which should be backed up automatically) and then checks out (copies) the required version or branch from the server using the svn tool. Changes are made locally and then committed (copied back to the server) into the central storage. Locally one only stores the working copy in one fixed version. The main advantages of a centralized version control system are that one only has to check out a given version or a given subset of the project, and that rule checking and linting of commits can be performed centrally. To use SVN one usually only needs to know three commands:

Checkout creates a new working copy of a centralized repository, or of a subset of it, in a given revision. This is usually the first operation one performs after creating a repository on the server side.
Update pulls the most current version of the content from the server into the local working copy.
Commit pushes local changes into the remote repository. If there is a conflict that cannot be resolved automatically the commit fails, and one is able to merge the changes locally before trying again.

In addition SVN also supports locking and unlocking resources so one can negotiate who modifies which resources, but usually this is not needed. Another operation that one might need is Revert, which discards local modifications to a file and restores the revision that has been checked out. The blame utility helps to identify who made which modification.
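The interplay of checkout, update and commit in the centralized model can be sketched with a small toy model. This is not how SVN is implemented and it does not talk to a real server - the class CentralRepository, the exception OutOfDateError and the file name paper.tex are all made up for illustration. The only thing it tries to show is why a commit gets rejected when someone else has changed the same file in the meantime.

```python
class OutOfDateError(Exception):
    """Raised when a commit is based on an outdated revision of a file."""


class CentralRepository:
    """Toy stand-in for a central server that keeps only the newest content."""

    def __init__(self):
        self.head = 0            # newest revision number
        self.files = {}          # path -> content at the newest revision
        self.last_changed = {}   # path -> revision that last modified it

    def checkout(self):
        """Return the newest revision number and a private copy of all files."""
        return self.head, dict(self.files)

    # In this toy model an update simply fetches the newest state again.
    update = checkout

    def commit(self, base_revision, changes):
        """Store changed files unless one of them was modified after base_revision."""
        stale = [p for p in changes if self.last_changed.get(p, 0) > base_revision]
        if stale:
            raise OutOfDateError(f"update needed for: {', '.join(stale)}")
        self.head += 1
        for path, content in changes.items():
            self.files[path] = content
            self.last_changed[path] = self.head
        return self.head


def demo():
    repo = CentralRepository()
    repo.commit(0, {"paper.tex": "Introduction\n"})            # revision 1

    alice_rev, alice_files = repo.checkout()
    bob_rev, bob_files = repo.checkout()

    # Alice commits first, which creates revision 2.
    repo.commit(alice_rev, {"paper.tex": alice_files["paper.tex"] + "Methods\n"})

    try:
        # Bob still works on revision 1, so his commit is rejected.
        repo.commit(bob_rev, {"paper.tex": bob_files["paper.tex"] + "Results\n"})
    except OutOfDateError:
        # Bob updates, re-applies his change on top (the "merge") and retries.
        bob_rev, bob_files = repo.update()
        repo.commit(bob_rev, {"paper.tex": bob_files["paper.tex"] + "Results\n"})
    return repo


if __name__ == "__main__":
    print(demo().files["paper.tex"])
```

In real life the merge step is of course done by the tool and not by hand, but the order of events is the same: update first, sort out conflicts locally, then commit again.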

Then there are distributed version control systems such as the really popular Git (note that it is not the same thing as the well-known GitHub hosting service, though that’s a really easy starting point for newcomers) or the older, less well-known darcs. Git provides the ability to run in distributed mode by keeping a complete local repository of its own including all versions - but it also allows one to synchronize with remote repositories like in the centralized case. This makes using git a little more cumbersome and harder to think about than using SVN - but for source code in the open source environment it’s currently more popular than SVN due to its distributed nature. You can simply take the whole repository with you offline, and you always have a complete copy (which solves the backup problem if you simply clone / pull the repositories on different machines and keep them in sync).

To use git one requires at least the following commands:

Remote repository:
- clone is similar to checkout in SVN. It copies a remote repository - but in contrast to SVN it copies everything including all old revisions and branches. Later on when one uses nested repositories one will see that it only clones them recursively when instructed to, but this is nothing a beginner will usually have to worry about. It also registers the remote in the new local repository.
- pull fetches the latest version from the registered remote and integrates the latest changes into the local repository. Note that any changes to local files should be committed first; otherwise the pull will fail in case of a conflict in order to prevent data loss. In case the commit histories have diverged the system will try to merge them automatically.
- push uploads all local commits to the remote repository.
Local repository:
- add adds files to the local staging area. Staged data will be included in the next local commit.
- Operations such as rm and mv should also be done through the git utility and will be added to the staging area.
- checkout can be used to revert local changes that have not yet been committed.
- commit creates a new commit / revision in the commit hierarchy from all staged changes. A commit can also be signed using OpenPGP to prove the identity of the author even when using some untrusted repository storage.
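The commands listed above can be tried out without any server. The following is a rough sketch that drives the git command line tool from Python, assuming git is installed and on the PATH. A bare repository on the local disk stands in for a remote like GitHub, and the directory names laptop and desktop, the file notes.md, the author identity and the helper names git and demo are all made up for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(cwd, *args):
    """Run one git command inside cwd and return what it printed."""
    # Identity and signing are set per call so that the sketch neither
    # depends on nor modifies the global git configuration.
    cmd = ["git", "-c", "user.name=Example Author",
           "-c", "user.email=author@example.org",
           "-c", "commit.gpgsign=false", *args]
    done = subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout


def demo(base):
    """Walk through clone, add, commit, push, pull and checkout below base."""
    base = Path(base)
    # A bare repository on the local disk plays the role of the remote.
    git(base, "init", "--bare", "remote.git")

    # clone: create a complete local repository on the first "machine".
    git(base, "clone", "remote.git", "laptop")
    laptop = base / "laptop"

    # add + commit: stage a new file and record it in the local history.
    (laptop / "notes.md").write_text("First draft\n")
    git(laptop, "add", "notes.md")
    git(laptop, "commit", "-m", "Add first draft")

    # push: upload the local commits to the remote.
    git(laptop, "push", "origin", "HEAD")

    # A second "machine" clones the same remote and gets the full history.
    git(base, "clone", "remote.git", "desktop")
    desktop = base / "desktop"

    # More work on the laptop, committed and pushed again.
    (laptop / "notes.md").write_text("First draft\nSecond paragraph\n")
    git(laptop, "add", "notes.md")
    git(laptop, "commit", "-m", "Add second paragraph")
    git(laptop, "push", "origin", "HEAD")

    # pull: fetch the new commit on the desktop and integrate it.
    git(desktop, "pull", "--ff-only")

    # checkout: throw away an accidental edit that has not been committed.
    (desktop / "notes.md").write_text("accidental edit\n")
    git(desktop, "checkout", "--", "notes.md")

    subjects = git(desktop, "log", "--format=%s").splitlines()
    return subjects, (desktop / "notes.md").read_text()


if __name__ == "__main__":
    if shutil.which("git") is None:
        print("git does not seem to be installed")
    else:
        with tempfile.TemporaryDirectory() as tmp:
            print(demo(tmp))
```

In daily use one would of course type these commands directly into a terminal. The script just has the advantage that it runs inside a temporary directory and cleans up after itself.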

The previously mentioned GitHub service is a nice external storage solution for your git repositories if they are either public or should be shared only with a small number of collaborators or a small group.

Previously I’ve written a short git cheat-sheet that should provide a nice summary of how to do common stuff using git. Learning git is really worth it, and unlike centralized systems it does not require one to perform proper server administration for a central repository.

Problems solved by using these tools

You never ever have to worry about millions of filenames again. You have a single object that you write your changes into and can walk the log to see what has been changed when.
Collaboration gets easier since multiple people can in fact work on the same documents in parallel and merge their changes later on
You get a consistent copy of everything on all machines that you are using
When using build automation you just have to upload your changes and the build automation system builds your software package, book, documents, webpage or articles in a clean fashion. No more “this project only builds on developer X’s machine” or “I don’t know how to format the LaTeX document”.
You do not accidentally lose your hard work.
It makes backups easier
You get a central repository (even when using a distributed revision control system such as git when using a central remote such as GitHub or a GitLab instance). No more guessing which USB stick now has your current version. Just always push your changes to your remotes. And you can use multiple remotes to increase reliability.
When using stuff like GitHub it even renders your Markdown documents in a nice fashion, which is handy for documenting stuff. One can of course also build a full-blown wiki solution on top of version control if this is really required, but if you build your lab book around Markdown that’s already pretty efficient.
You can easily track progress and locate problematic changes later on.
In case you have some version that you want to remember for a given reason you can add a tag. For example if you have a pre-print of your paper or something that you handed in you can simply tag it to identify it later on. This is also done for software when a version is released into testing or to the general public. For software one can also decide which commits form a given release, for example to include only some of the features.
Version control systems integrate very well with build automation systems like Jenkins. Even though these systems are designed for continuous integration for software development one can extend this concept for example to book publishing or web publishing.
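As a small illustration of the tagging item in the list above, the following sketch marks one commit with a tag and later reads the tagged version back even though the file has changed since. It again drives the git command line tool from Python and assumes git is installed. The tag name preprint-v1, the file paper.md, the author identity and the helper names git and tag_demo are made up for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(cwd, *args):
    """Run one git command inside cwd and return what it printed."""
    # Identity and signing are set per call so that the sketch neither
    # depends on nor modifies the global git configuration.
    cmd = ["git", "-c", "user.name=Example Author",
           "-c", "user.email=author@example.org",
           "-c", "commit.gpgsign=false", "-c", "tag.gpgsign=false", *args]
    done = subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout


def tag_demo(repo):
    """Tag the version that was handed in, keep working, then look it up again."""
    repo = Path(repo)
    git(repo, "init")
    paper = repo / "paper.md"

    paper.write_text("Version handed in\n")
    git(repo, "add", "paper.md")
    git(repo, "commit", "-m", "Pre-print as handed in")

    # An annotated tag gives this exact commit a name one can remember.
    git(repo, "tag", "-a", "preprint-v1", "-m", "Pre-print handed in")

    # Work continues after the hand-in.
    paper.write_text("Version with later corrections\n")
    git(repo, "add", "paper.md")
    git(repo, "commit", "-m", "Apply corrections")

    tags = git(repo, "tag", "--list").split()
    # tag:path addresses the file content as it was at the tagged commit.
    tagged_text = git(repo, "show", "preprint-v1:paper.md")
    return tags, tagged_text


if __name__ == "__main__":
    if shutil.which("git") is None:
        print("git does not seem to be installed")
    else:
        with tempfile.TemporaryDirectory() as tmp:
            print(tag_demo(tmp))
```

The working copy contains the corrected text at the end, while the tag still points at exactly what was handed in.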

Drawbacks

I don’t consider this a drawback since I think this writing tool is a mistake anyway, but automatic merging of Microsoft Office files does of course not work.
You have to invest some time to learn and set up the tools.