Version control systems

Licenced under CC-BY 4.0 and OSI-approved licenses, see licensing.

Overview

Teaching: 20 min
Exercises: 5 min
Questions
  • How not to worry about making changes to a project?

  • What is version control?

  • What is Git and GitHub?

Objectives
  • Get a mental representation of the benefits of version control.

  • Understand the difference between Git and GitHub.

  • Why use version control systems.

Introduction to the lesson

This tutorial provides a brief but hands-on introduction to the concept of version control, Git in RStudio, and GitHub. Git is a version control system which allows files and data to be tracked and synchronized, which allows collaboration and reproducibility. GitHub is a popular website that stores Git repositories.

This tutorial avoids using the command line. Instead, we will practice the basics of Git using Rstudio and later on we will also introduce GitHub. In many cases, this is enough. Git in Rstudio and GitHub provide collaboration tools to all kinds of projects, and there are all kinds of good ways to use it.

This tutorial serves as an introduction to version control and Git. After this lesson, you will both be able to use Git, and feel much more confident taking a command-line Git course such as CodeRefinery’s git-intro course.

What we will not cover

  • Command line interface
  • Cloning using SSH protocol and SSH keys
  • Rebasing and squashing
  • Many Git tricks which can be explored later

Why GitHub?

We will do an exercise on GitHub but also GitLab and Bitbucket allow similar workflows and basically everything that we will discuss is transferable. With this material and these exercises we do not endorse the company GitHub. We have chosen to demonstrate a number of concepts using examples with GitHub because it is currently the most popular web platform for hosting Git repositories and the chance is high that you will interact with GitHub-based repositories even if you choose to host your Git repository on another platform.

We also encourage course participants to use CodeRefinery’s Nordic research software repository platform, for more information see https://coderefinery.org/repository/.

Version control in general

Figure 1. Versioning Hell
“Piled Higher and Deeper” by Jorge Cham (http://phdcomics.com/)

Exercise 1: Problems with change (5 min)

Which of these issues can you relate to?

  • I have fifteen versions of this file - which one do I use?
  • I can’t remake this figure from last year
  • I slightly modified my code in one place, everything stopped working
  • I have several copies of the same directory because I’m worried about breaking something
  • Somebody added erroneous records in a shared file with samples, cannot find who and why
  • You remember seeing a data file but cannot find it anymore: is it deleted? Moved away?
  • I tried multiple analyses and I don’t remember which one I chose to generate my output data
  • I have to merge changes to our manuscript from 10 different emails with collaborators
  • I accidentally deleted a part of my work
  • I came to an old project and forgot where I left/saved it
  • I reorganized my data files, but broke the analysis pipeline and don’t know how to fix it
  • I have trouble to find the source of a mistake in an experiment
  • My directory is polluted with a lot of unused/temporary/old folders because I’m afraid of losing something important
  • I made a lot of changes to my paper but want to bring back one paragraph

Version Control Systems

(13 min teaching) [fair-bio-practice](https://carpentries-incubator.github.io/fair-bio-practice/11-version-control/index.html#how-version-control-systems-work:~:text=Solution-,Version%20Control%20Systems,-(13%20min%20teaching)

How Version Control Systems Work

A version control system is a tool that keeps track of file changes for us and helps us version and merge our files. It allows you to decide which changes make up the next version, called a commit, and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository.

A version control system stores snapshots of a project’s files in a repository. Users modify their working copy of the project and then save changes to the repository when they wish to make a permanent record and/or share their work with colleagues. The version control system automatically records when the change was made and by whom along with the changes themselves.

Repositories can be kept in sync across different computers facilitating collaboration among different people.

Crucially, if several people have edited files simultaneously, the version control system will detect the collision and require them to resolve any conflicts.

Tool-based version control has several benefits over manual version control:

  • Instead of requiring users to backup copies of the whole project, version control safely stores just enough information to allow old versions of files to be re-created on demand.

  • Instead of relying on users to choose sensible names for backup copies the version control system timestamps all saved changes automatically.

  • Instead of requiring users to be disciplined about completing the changelog version control systems prompt them every time a change is saved. They also keep a 100% accurate record of what was actually changed, as opposed to what the user thought they changed, which can be invaluable when problems crop up later.

  • Instead of simply copying files to remote storage, version control checks to see whether doing that would overwrite anyone’s work. If so, they identify conflicts and merge changes.

Simplifying Git is a local program (like Word) that runs on your machine and knows how to track changes. Github is an online system that can communicate with Git and propagate your changes to other computers (like OneDrive can do for Word documents).

Keeping track of changes that you or your collaborators make to data and software is a critical part of research. Being able to reference or retrieve a specific version of the entire project aids in reproducibility for you leading up to publication, when responding to reviewer comments, and when providing supporting information for reviewers, editors, and readers.

Exercise 2: Using Version Control System (5 min)

Five reasons to use a version control system in research

  • Tell the story: The history of your commit messages will describe your project progress.
  • Travel back in time: a version control system makes it easy to compare different time points of your project smoothly. If you want to compare the stage of your project a year ago from now, it only takes one command-line of code.
  • Experiment with changes: if you want to make changes in a script you can first make a “snapshot” of the project status before experimenting with changes.
  • Backup your work: by being able to link your local repository (folder) to a distant online host (GitHub), a version control system backs up your precious work instantly.
  • Collaborate easily on projects: having a web-hosted synchronised version of your project will encourage collaboration with other researchers. Think about a colleague of yours being able to add a script to make a figure for your publication for instance. The repository will track their contribution and evidence their participation in the project.

Which two are the most important for you.

Version Control System and FAIR

Looking at the reasons to use a version control system in research, how does using version control system help in being FAIR?

Solution

  • Online respository is a Findable and Accessible resource (also Trackable).
  • Online repositories are also interoperable since they can be accessed from anywhere (with internet) and are not restricted to your local hard drive.
  • The ‘telling story’ and being able to travel back in time helps in reuse and reproducibility.

Extra Material

Commits, branches, repositories, forks, clones

  • repository: The project, contains all data and history (commits, branches, tags).
  • commit: Snapshot of the project, gets a unique identifier (e.g. c7f0e8bfc718be04525847fc7ac237f470add76e).
  • branch: Independent development line, often we call the main development line main.
  • tag: A pointer to one commit, to be able to refer to it later. Like a sticky note that you attach to a particular commit (e.g. phd-printed or paper-submitted).
  • cloning: Copying the whole repository to your laptop - the first time. It is not necessary to download each file one by one.
  • forking: Taking a copy of a repository (which is typically not yours) - your copy (fork) stays on GitHub and you can make changes to your copy.

fig

GitHub file view of the nf-core/viralrecon repository. This is the version of all files at a single point in time.

fig

Github history view of the viralrecon repository. This is the progression of the repository (with the commit message over time).

fig

Network graph of all commits in the nf-core/viralrecon repository. This shows the relationship between different forks of people who are contributing and sharing code.

Interesting repositories to explore these concepts

Why using repositories?

  • All changes are recorded.
  • We do not have to send changes via email.
  • We can experiment with several ideas which might not work out (using branches).
  • Several people can work on the same project at the same time (using branches).
  • We do not have to wait for others to send us “the latest version” over email.
  • We do not have to merge parallel developments by hand.
  • Group-based access model where shared access is the default, instead of everything fundamentally owned by individuals who manage sharing as-needed: with Git you can easily have collaboration be the default.
  • It is possible to serve websites directly from a repository.

Discussion

  • How have you solved these in the past without version control?