Working with Git¶

Git is open-source software for revision control. It keeps track of changes to your software project, allows you to roll back to any earlier version, and facilitates collaboration with colleagues and/or the global community. You can use git locally without depending on an external server, but to share your repository with a wider audience, a git hosting service (a so-called git forge) is often used. GitHub is one of the biggest commercial platforms that provides free hosting of (public) git repositories. Be aware that relying on external commercial parties has its drawbacks too.

On top of that, git forges typically offer useful functionality beyond what git itself provides, such as issue trackers, project management, and continuous integration/deployment.

Git forges¶

We have a presence on GitHub as knaw-huc, which groups many of our software projects. If you want to use GitHub, this is a good place to put your project:

https://github.com/knaw-huc

Other KNAW institutes may have their own namespaces on GitHub (which calls them organisations) or on other git forges. Larger projects may also have their own. Examples:

https://github.com/HuygensING - Huygens Institute
https://bitbucket.org/fryske-akademy/ - Fryske Akademy
https://github.com/CLARIAH - CLARIAH project

Some colleagues also keep many work-related projects under their personal namespace if they are the sole developer or sole maintainer. Example: https://github.com/proycon

Aside from cloud-based platforms like GitHub, Digital Infrastructure offers two git forges itself, powered by GitLab, an open-source solution. These two instances are:

https://gitlab.huc.knaw.nl/ - This is our public instance suitable for open source software projects. Source code can be offered publicly (read-only) to unauthenticated users. Private repositories are also possible. This can be an alternative (or complement) to https://github.com/knaw-huc and is a good spot to put your project in.
We also have a private internal instance. It is not suitable for open source software, but is used for things like infrastructure as code (i.e. deployment configurations that may contain secrets such as passwords/API keys). It is only available via the KNAW VPN and requires an account.

By putting your repository publicly on GitHub, on https://gitlab.huc.knaw.nl, or elsewhere, you make your software available to the global community: always make sure to choose and attach an appropriate open source license. Anyone can modify your software in their own copy, according to the terms of the license you chose. If you like their improvements, you can merge them back into your own version (they can ask you to do so via a pull/merge request). This is a great method for collaboration.

If your project is of a sensitive nature, use a private repository, preferably on our GitLab rather than on GitHub.

From a git perspective, the servers you push to or pull from are called git remotes. You're not limited to a single remote, because git is a distributed version control system. With git remote -v you can see the configured remotes for the local repository you're in.
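As a minimal sketch of working with more than one remote, the commands below use two empty local bare repositories as stand-ins for two git forges. The directory names and the remote name mirror are made up for illustration; with real forges you would use their HTTPS or SSH URLs instead.

```shell
# Scratch setup: two empty bare repositories stand in for two git forges.
work=$(mktemp -d)
git init -q --bare "$work/forge-a.git"
git init -q --bare "$work/forge-b.git"

# Clone from the first forge; git names this remote "origin" by default.
git clone -q "$work/forge-a.git" "$work/project"
cd "$work/project"

# Add the second forge as an extra remote ("mirror" is a hypothetical name).
git remote add mirror "$work/forge-b.git"

# List every configured remote with its fetch and push URL.
git remote -v
```

You can then name the remote explicitly when pushing or pulling, for example git push mirror $BRANCHNAME.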

Git basics¶

This section shows some of the basics of using git on the command line. Note that this is not a full git tutorial for beginners; plenty of those can be found elsewhere, such as on git-scm.com and w3schools.com. Feel free to skip this section or check elsewhere.

Clone¶

Cloning is the process that creates a local copy of a codebase. Clone the initial repository from Github:

git clone https://$USERNAME@github.com/$USERNAME/$REPOSITORY.git
cd $REPOSITORY

... or from our own public Gitlab instance:

git clone https://gitlab.huc.knaw.nl/$NAMESPACE/$PROJECT.git
cd $PROJECT

You can clone into as many directories or onto as many computers as you like. If you give others access to your repository by sharing it on GitHub, they can clone it like this:

git clone https://$FRIEND_USERNAME@github.com/$USERNAME/$REPOSITORY.git

You can use either HTTPS or SSH to clone. The latter is recommended if you use git a lot, as it gives you the benefit of using SSH keypairs so you don't have to authenticate each time.

git clone git@github.com:$USERNAME/$REPOSITORY.git
cd $REPOSITORY

Modify your files¶

You can now start creating or editing files in whatever editor you like.

Status¶

Check which files were modified or created:

git status

Diff¶

Git can also show you exactly which lines were changed in each file.

# Add --color to show additions in green, and deletions in red.
git diff
git diff --color

Or, for a single file:

git diff --color example.txt

Add¶

Add your files to the staging area (also called the index); these files are then staged for the next commit. This way you can decide which files should be tracked and stored in the repository. Skip unnecessary files.

git add main.py
git add modules/myclass.py
git add docs/*

Add all files¶

Or add all changes in your working tree at once: modified files, new files, and deletions are all staged.

git add -A

Be careful with this, especially if you don't have a .gitignore file set up yet.
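One way to be careful is to preview what would be staged first. The sketch below uses a throwaway repository with two made-up files, main.py and notes.tmp, and the --dry-run option of git add, which lists files without staging them.

```shell
# Scratch repository with one file worth tracking and one stray file.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'print("hello")\n' > main.py
printf 'scratch notes\n' > notes.tmp

# --dry-run only lists what "git add -A" would stage; nothing is added yet.
git add --dry-run -A

# After checking the list, stage only what belongs in the repository.
git add main.py
git status --short
```

If the preview looks right as a whole, running git add -A afterwards stages exactly the listed files.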

Reset (Undo add)¶

To remove a file from the staging area again, without discarding your changes to it, use reset.

git reset modules/tmpfile

Commit¶

Commit your changes to your local repository, along with a message describing what you changed.

git commit -m "$MESSAGE"

Add and commit¶

If you want to commit all changes to files that git already tracks, the -a flag stages them automatically as part of the commit. New (untracked) files will not be added.

git commit -am "$MESSAGE"

Push¶

Push all your commits to the remote (e.g. github.com or our own GitLab instance). Before you can push, your repository must be up-to-date with respect to the remote repository. It's recommended to always do a git pull first.

git push # You may need to enter your credentials (e.g. an access token)

Pull¶

Update your local repository from a remote.

# Get all updates that were pushed from others
# (by friends who have access or from your other computers)
git pull

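If you have no local commits of your own, git simply moves your branch forward to the latest commit on the remote (git calls this a fast-forward). If both you and others have made new commits, git merges the two lines of work, which normally happens automatically. Recent versions of git may first ask you to choose how to reconcile such diverged branches; git config pull.rebase false selects merging.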

Reset to a previous point in time¶

To reset a repository to a particular commit, you can do:

git reset --hard $COMMITHASH

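This discards all uncommitted changes to tracked files, both staged and unstaged, so only do this when you have none that you want to keep. It also removes any later commits from your current branch. It does not affect any remote.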
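The following sketch walks through a hard reset in a throwaway repository. The file name, commit messages, and e-mail address are placeholders. It first checks for uncommitted work, resets to an earlier commit, and then shows that the newer commit can still be reached by its hash for as long as git has not garbage-collected it.

```shell
# Scratch repository with two commits.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"

printf 'first\n' > notes.txt
git add notes.txt
git commit -q -m "Add first note"
target=$(git rev-parse HEAD)    # the commit to return to

printf 'second\n' >> notes.txt
git commit -q -am "Add second note"
newer=$(git rev-parse HEAD)

# Empty output here means there is no uncommitted work to lose.
git status --porcelain

# Move the branch and the working tree back to the earlier commit.
git reset -q --hard "$target"
cat notes.txt

# The newer commit left the branch but is still reachable by its hash
# (git reflog lists it too), so the reset itself can be undone.
git reset -q --hard "$newer"
cat notes.txt
```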

Merge conflict¶

After doing git pull, git normally automatically merges your edits with those on the remote repository.

If your edits conflict with the remote edits, then git cannot merge for you. Git will report which files had conflicts. Both your edits and remote edits will show up in the files. You need to reconcile these conflicting edits manually. Take a careful look and decide which lines should stay in the file and which shouldn't.

The file will contain a section similar to the following:

<<<<<<< HEAD:mergetest
This is my third line
=======
This is a fourth line I am adding
>>>>>>> 4e2b407f501b68f8588aa645acafffa0224b9b78:mergetest

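With a regular (merge-based) git pull, HEAD refers to your own local version: the section between <<<<<<< and ======= holds the edits you have made, and the section between ======= and >>>>>>> holds the incoming edits from the remote, identified by their commit hash. Delete or rewrite the conflicting lines, remove the section indicators (<<<<<<<, ======= and >>>>>>>), and re-add the file to the commit: git add example.txt. Use git status to see if git reports any other merge conflicts. Once all conflicts are resolved, conclude the merge with git commit.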
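To practise this safely, a conflict can be reproduced in a throwaway repository. The sketch below is an illustration only: it uses a second local branch (hypothetically named colleague) instead of a real remote, and git merge instead of git pull, which performs the same kind of merge after fetching. The file name and line contents follow the example above.

```shell
# Scratch repository with one committed file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"

printf 'line one\nline two\n' > mergetest
git add mergetest
git commit -q -m "Add mergetest"
mine=$(git symbolic-ref --short HEAD)    # name of the current branch

# A colleague's edit, simulated on a separate branch.
git checkout -q -b colleague
printf 'line one\nline two\nThis is a fourth line I am adding\n' > mergetest
git commit -q -am "Add a fourth line"

# Your own, conflicting edit on the original branch.
git checkout -q "$mine"
printf 'line one\nline two\nThis is my third line\n' > mergetest
git commit -q -am "Add a third line"

# The merge stops with a conflict, so the command exits non-zero.
git merge colleague || true
cat mergetest    # shows the <<<<<<< ======= >>>>>>> markers

# Resolve by writing the content you want to keep (here: both new lines).
printf 'line one\nline two\nThis is my third line\nThis is a fourth line I am adding\n' > mergetest
git add mergetest
git commit -q -m "Merge colleague, keeping both new lines"
git status --short    # prints nothing once the conflict is resolved
```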

Git guidelines¶

These are important guidelines to help you work with git.

1. Set your name and e-mail for proper attribution¶

It is strongly recommended to set your name and e-mail address so commits are properly attributed to you and show up properly on platforms like GitHub, GitLab, or https://tools.huc.knaw.nl. You can omit --global in the examples below to get repository-specific settings.

# Check current settings with
git config --global --list

# Add your details with
git config --global user.name "Jan Janssen"
git config --global user.email "jan.janssen@di.huc.knaw.nl"

You need to do this on all machines you're using git on. Please use consistent naming.

2. Pull before starting work¶

When working on a shared codebase, always do a git pull before you start editing and before pushing. This will pull in the latest commits done by others and reduces merge conflicts later on. This should be a recurring daily habit before you work on a codebase.

(If you push without pulling first and your local copy has diverged from the remote copy, git will reject the push and ask you to pull first.)

3. Use clear meaningful commit messages and reference reported issues¶

A commit encompasses a body of work, ideally some meaningful unit like implementing a particular feature or fixing a particular bug. In essence, a commit holds what has changed between versions. You may also hear the terms patch, changeset or diff to describe such a body of work.

The commit message describes this change. It should be clear and concise. If you need more space, keep in mind that commit messages can be multi-line. The first line is the most important one and holds the short summary; it can be followed by an empty line and further lines detailing the changes.

Commit messages should ideally start with a verb and briefly describe what has been done in this changeset. Conventions often prefer imperative mood over past tense. Example: "Fix crash in loading texts". Different projects may adhere to different commit message conventions, if they adhere to any convention at all. Some will be stricter than others. An established strict convention is for example Conventional Commits, but always follow the style of the project you are contributing to. Mature projects often have a CONTRIBUTING.md in the root of the repository that tells you what style is preferred.

Commit messages can also reference related issues on git forges like GitHub or GitLab. It's usually sufficient to reference the issue number at the end of your commit message, like #123, or with a user/repo prefix if you're referring to an issue in another project on the same platform (e.g. johndoe/some-tool#123). A fool-proof, more verbose convention that works across platforms is to add a line Ref: https://example.com/some-issue-tracker/issue-123 as one of the last lines of your commit message (and never as the first line!). Here you simply mention the full URL of the issue.
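A multi-line message in this shape can be written directly on the command line by repeating -m: each -m becomes its own paragraph, separated by an empty line. The sketch below uses a throwaway repository; the file name and the body sentence are invented for illustration, and the Ref: URL is the placeholder from the text above.

```shell
# Scratch repository with one staged file.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"
printf 'text\n' > loader.txt
git add loader.txt

# Summary line, then a body paragraph, then the issue reference last.
git commit -q \
  -m "Fix crash in loading texts" \
  -m "Empty input files are now skipped instead of raising an error." \
  -m "Ref: https://example.com/some-issue-tracker/issue-123"

git log -1 --format=%s    # prints only the summary line
git log -1 --format=%b    # prints the body, including the Ref: line
```

Running git commit without -m opens your editor instead, which is usually more comfortable for longer messages.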

4. Commit often, push often¶

Commit whenever you have some sensible body of work, before you move on to fixing the next bug or implementing the next feature. This makes it easier to track where problems have been fixed or introduced.

Push often if you work on a shared codebase with colleagues (in the same git branch), so they can pull in your changes and work on from there.

Some corollaries of this:

Never wait until your project is ready (by whatever definition) and then add everything in one big commit. Always commit incrementally, right from day one!
Better to commit too often than too rarely: multiple commits can easily be squashed into a single one later on, but the reverse is more difficult.
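As an illustration of squashing, the sketch below folds the last two commits of a throwaway repository into one using git reset --soft. This is only one of several ways to do it (git rebase -i is the interactive alternative). The file name and messages are invented. Since squashing rewrites history, it should only be done on commits that have not been pushed yet.

```shell
# Scratch repository with three small commits.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"

printf 'a\n' > parser.txt
git add parser.txt
git commit -q -m "Add parser"
printf 'b\n' >> parser.txt
git commit -q -am "Fix typo in parser"
printf 'c\n' >> parser.txt
git commit -q -am "Fix another typo in parser"

# Move the branch back two commits; their changes stay staged.
git reset -q --soft HEAD~2

# Record those staged changes again as one single commit.
git commit -q -m "Fix typos in parser"

git log --oneline    # two commits remain instead of three
```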

5. Review changes before committing¶

Your commit doesn't have to be perfect (see also point 12); you can always add commits later that fix earlier mistakes. However, it is a good idea to review your changes before you commit.

The most basic way to do this is to run git status and git diff on the command line. git diff shows the changes you have not staged yet, and git diff --staged shows what will go into the commit, which gives you a chance to spot errors. You can also use more interactive tools to inspect changes, such as tig (terminal interface), gitk (desktop GUI), or GitHub Desktop (desktop GUI).

If you use any compilers, linters, or auto-formatters, make sure to run them before committing. This prevents needless commits with syntax errors. Your IDE can often take care of this for you.

If you have a local test suite, run it before committing. It's also customary and good practice for larger test suites to run automatically upon commit and push on the server side as part of a continuous integration pipeline.
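Such checks can also be run automatically on your own machine through a git pre-commit hook. The sketch below is a deliberately tiny example in a throwaway repository: the hook only syntax-checks one hypothetical script, hello.sh, with sh -n, where a real project would call its linter or test suite instead. Hooks live in .git/hooks and are not shared when others clone the repository.

```shell
# Scratch repository.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"

# Install the hook; a non-zero exit status from it aborts the commit.
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Stand-in check: syntax-check the working-tree copy of hello.sh.
sh -n hello.sh
EOF
chmod +x .git/hooks/pre-commit

# The "fi" is missing, so the syntax check fails and the commit is refused.
printf 'if true; then echo hello\n' > hello.sh
git add hello.sh
git commit -q -m "Add hello script" || echo "commit rejected by the hook"

# After fixing the script, the same commit goes through.
printf 'if true; then echo hello; fi\n' > hello.sh
git add hello.sh
git commit -q -m "Add hello script"
```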

6. Not everything should be in git, use .gitignore to specify things to ignore¶

Not everything should be in git. Examples are:

secrets such as passwords (see point 9)
build artefacts such as binary executables (see point 10)
runtime environments such as Python Virtual Environments
very large files (e.g. >50MB), especially large binary files such as media. Git is optimised for text-based files. If you do need to include very large files, then this is supported through git extensions like LFS.
back-up files: git itself handles versioning, you don't need to add any copies of earlier versions in your repository.

To prevent accidentally adding certain files to git, you should have a file named .gitignore in the root of your repository that lists all files to ignore. See a collection of templates here. Also see the gitignore manual page (man gitignore) for syntax.
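As a small sketch, the commands below write a .gitignore covering some of the categories listed above and then ask git which paths it would ignore. The concrete file and directory names (.env, venv/, main.py.bak) are assumptions for illustration; adapt the patterns to your own project.

```shell
# Scratch repository.
repo=$(mktemp -d)
cd "$repo"
git init -q

cat > .gitignore <<'EOF'
# Secrets (hypothetical file name)
.env
# Python virtual environment and bytecode caches
venv/
__pycache__/
# Editor back-up files
*~
*.bak
EOF

# Some files that should and should not end up in git.
mkdir venv
touch .env main.py main.py.bak venv/pyvenv.cfg

# check-ignore -v reports which pattern matches each ignored path.
git check-ignore -v .env main.py.bak venv/pyvenv.cfg

# Only .gitignore and main.py remain as candidates for git add.
git status --short
```

The .gitignore file itself should be committed, so that everyone working on the repository ignores the same files.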

7. Use branches during development and invite review¶

By default, code resides in a main or master branch (or sometimes a develop branch, anything is possible), which is essentially the trunk of the tree.

Whenever you work on a particular feature or bugfix and you're not the sole developer in a project, you should create a git branch. A branch is a temporary fork in a git repository, usually with the intention to be merged back later. You can create a branch (git checkout -b $BRANCHNAME) to work on a particular feature or bugfix, which in turn may consist of multiple commits.

Branches also help you if you're working on multiple independent things at the same time, as you can switch between them at will (git checkout $BRANCHNAME).

If you don't have direct write permission to a project on GitHub or GitLab, you first create a fork of the project under your own account. Then, when you're done implementing the feature, you push your changes to a branch on your own remote (i.e. the fork you made) and then open a pull request or merge request (different words for the same thing).
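The sketch below shows creating a branch and switching between branches in a throwaway repository. The branch name fix-loading-crash and the file names are invented for illustration.

```shell
# Scratch repository with an initial commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"
printf 'My project\n' > README.md
git add README.md
git commit -q -m "Add README"
main_branch=$(git symbolic-ref --short HEAD)    # main or master, depending on your setup

# Create a branch for the bugfix and switch to it.
git checkout -q -b fix-loading-crash
printf 'fixed\n' > loader.txt
git add loader.txt
git commit -q -m "Fix crash in loading texts"

# Switch back: the work on the bugfix branch is not visible here.
git checkout -q "$main_branch"
ls

# List all local branches; the current one is marked with *.
git branch
```

To share the branch with others, git push -u origin $BRANCHNAME publishes it to the remote and sets it up for later pushes and pulls.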

This invites review of your commits on a git forge like GitHub or Gitlab (or by mailing a patch to a mailing list). This is essentially a request for your contributions to be merged into the main/master tree. A project maintainer will approve your request, or request changes, and eventually perform the merge.

This is essentially a way of doing peer review on code, which is important, especially in scientific contexts.

8. Tag releases¶

Use git tag to mark released versions of your software for the public. The tag is typically a version number; using a semantic versioning scheme is recommended here. Example: git tag v1.2.1.

If you create releases via Github or Gitlab, such git tags are created for you when you publish a release via their interface.
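When tagging from the command line, keep in mind that a plain git push does not send tags to the remote; they have to be pushed explicitly. The sketch below uses a local bare repository as a stand-in for the remote. The choice of an annotated tag (-a), which also records a tagger, date, and message, is an assumption made here rather than a requirement.

```shell
# Scratch setup: a bare repository stands in for the git forge.
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/project"
cd "$work/project"
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"
git remote add origin "$work/remote.git"

printf '1.2.1\n' > VERSION
git add VERSION
git commit -q -m "Release 1.2.1"

# Create an annotated tag on the current commit.
git tag -a v1.2.1 -m "Release v1.2.1"

# Tags are not pushed by default; push this one by name.
git push -q origin v1.2.1

# Verify that the tag arrived on the remote.
git ls-remote --tags origin
```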

9. Don't commit secrets¶

Be careful when dealing with things such as passwords, API keys, or other private/personal information. Those should NOT be a hard-coded part of the source code and configuration files containing such secrets should NOT be committed to a public git repository.

Also be aware that if you did accidentally commit a secret, then doing another commit that removes it is not a solution as the secret will still be in the history.

There are ways to fix this (essentially rewriting history, a more advanced use of git), but it is better to prevent than to cure. Contact someone with more git experience to help you out here if needed.
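The sketch below demonstrates the problem in a throwaway repository: a made-up password is committed and then removed in a second commit, yet it can still be found and read back from the history. The file name and the password string are invented for illustration.

```shell
# Scratch repository.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jan Janssen"
git config user.email "jan.janssen@example.org"

# Accidentally commit a file containing a (fake) secret.
printf 'password = "hunter2-not-real"\n' > config.ini
git add config.ini
git commit -q -m "Add config"

# "Fix" it by removing the file in a second commit.
git rm -q config.ini
git commit -q -m "Remove config"

# The search option -S still finds every commit that added or removed the string.
git log --oneline -S "hunter2-not-real"

# The old file content can simply be read back from the first commit.
first=$(git rev-list --max-parents=0 HEAD)
git show "$first:config.ini"
```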

The exception here is our private gitlab instance. Here you can commit deployment configurations WITH secrets. Make sure the project is set as private and minimize the number of people that have access, and be fully aware of the types of secrets that you are using.

10. Prevent adding build artefacts¶

Git is designed for source code, not for any build artefacts that can be constructed. Build artefacts are things that can be generated automatically from the source code, such as binary executables, or container images. These are platform and architecture specific and don't belong in git. The common rule is to add the source, and not anything generated from it.

Things like PDFs can also be considered build artefacts (from LaTeX or markdown sources for example) and should ideally not be added, although sometimes for convenience an exception can be made in such cases.

11. Respect shared history¶

One of the deadliest sins in git is to erase history, as keeping version history is the core of what git does. If you derive your project from another project that uses git, then make sure to always create a proper fork that retains the full git history.

This also allows you to contribute back any changes you made to the original project, in good open source spirit. Asking another party to incorporate your changes is called a Pull Request or Merge Request and is only possible if the git history is respected.

Don't copy a project's source code from another git repo into a new repository in one single commit!
Don't use version numbers in repository names. If you're building a v2 version of an earlier v1 project, you generally DO NOT need a new repository!

It's also a good idea to work on your own fork on GitHub or GitLab if you're unsure, and use a Pull Request/Merge Request to contribute back to the actual project (see also points 7 and 12).

12. Don't be afraid to screw up¶

Git has many options and may seem a bit daunting to the uninitiated. People are often worried they might mess up a codebase. The good news is that it is really hard for you to screw things up with git, as it was specifically designed to make it easy to roll back to any earlier version. Just stay clear of some of the advanced features such as force-pushing and you'll be fine.

Using remote branches (fork + branches), and doing merge/pull requests is always recommended if you're working on a shared codebase and are not sure (see point 7).

Don't be discouraged if your merge/pull requests on other projects get denied the first time; do not interpret that as a sign of failure. It is very common for requests to have to go through several rounds of changes before being approved.

In the same vein, beginners sometimes feel insecure sharing their code because they feel it is not good enough. This may lead to behaviour where they won't commit for a long time because they're waiting for things to be 'ready'. Don't do this. We were all beginners at some point and sharing and inviting review helps your learning practice more than anything.