Accelerating Oxford Nanopore basecalling

It looks innocuous sitting on the desk, an Oxford Nanopore MinION, but it can produce a huge amount of data from a single sequencing run. Since the nanopore works by inferring which base is in the pore from how much it reduces the flow of ions (and hence the current) through the pore, the raw data are commonly called “squiggles”. Converting each of these squiggles into a sequence of nucleotides is “base-calling”. Ideally, you want to do this in real-time, i.e. as the squiggles are being produced by the MinION. Interestingly, this is becoming a challenge for the molecular biologists and bioinformaticians in our group since the flow of data is now high enough that a high-spec laptop struggles to cope. It may be becoming obvious that this is not my field of expertise – and you’d be right – but I do know something about speeding up computational jobs through the use of GPUs and/or computing clusters. There appear to be two main use-cases we have for base-calling the squiggles. In both cases I’m only going to talk about nanonetcall.

1. Live basecalling.

Nick Sanderson is leading the charge here in our group (that is his hand in the photo above). He has built a workstation with a GPU and an SSD and has been playing around with the ability of nanonetcall to use multiple threads. Our base case is a single process with one OpenMP thread, which by definition has a speedup of 1.0x. The folder sample_data/ contains a large number of squiggle files (fast5 files).

OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 1 > /dev/null 

Let’s first try using 2 OpenMP threads and a single process.

OMP_NUM_THREADS=2 nanonetcall sample_data/ --jobs 1 > /dev/null 

This makes no difference whatsoever. In common with running GROMACS simulations, OpenMP doesn’t help much, in my hands at least. So let’s rule out additional OpenMP threads and instead simply try increasing the number of jobs run in parallel:

OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 2 > /dev/null    
OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 4 > /dev/null 
OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 8 > /dev/null 
OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 12 > /dev/null 

These lead to speedups of 1.8x, 3.0x, 3.2x and 3.2x respectively. These were all run on my early-2013 MacPro, which has a single 6-core 3.5 GHz Intel Xeon CPU and so can run up to 12 processes with hyperthreading. I don’t know exactly how nanonetcall is parallelised, but since a good chunk of it is Python it is no surprise that it doesn’t scale that well: multi-threaded Python will always struggle because of the global interpreter lock (GIL). Now, parts of nanonetcall are cleverly written using OpenCL so it can make use of a GPU if one is available. My MacPro has two AMD FirePro D300 cards – good graphics cards, but I would have chosen better GPUs if I could. Even so, using a GPU gives a speedup of 3.3x.

nanonetcall sample_data/ --platforms Apple:0:1 Apple:1:1 --exc_opencl > /dev/null

I suggested we try one of my favourite, apparently little-known unix tools, GNU Parallel. This is a simple but truly awesome command that you can install via a package manager, like apt-get on Ubuntu or MacPorts on a Mac. The simplest way to use it is in a pipe like this:

find sample-data/ -name '*.fast5' | parallel -j 12 nanonetcall {} > /dev/null

This needs explaining. The find command produces a long list of all the fast5 files. parallel then consumes this list: it first launches 12 nanonetcall jobs, each running on a single core, and as soon as one finishes it launches another to process the next fast5 file in the list. In this way parallel ensures there are 12 nanonetcall jobs running at all times and we rapidly work our way through the squiggles. This results in a speedup of 4.8x – not linear, but certainly better than trying to use the ‘internal’ parallelisation of nanonetcall.

But we can do better because we can use parallel to overload the GPU too.

find sample-data/ -name '*fast5' | parallel nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

This yields our fastest speedup yet of 5.7x. But perhaps the GPU is getting overloaded, so let’s try sharing the load between both GPUs:

find sample-data-1/ -name '*.fast5' | parallel -j 6 nanonetcall {} --platforms Apple:1:1 --exc_opencl > /dev/null &
find sample-data-2/ -name '*.fast5' | parallel -j 6 nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

where I’ve split the data equally into two folders. Sure enough, this now gives a speedup of 9.9x. Now, remember there are only 12 virtual cores, so if we try running more processes, we should start to see a performance penalty, but let’s try!
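For completeness, splitting the squiggle files between the two folders can be done with a short shell loop. This is only a sketch (the function name is mine), which deals the files out alternately so each GPU gets roughly half:

```shell
# Sketch: deal the fast5 files alternately into two folders, one per GPU.
split_fast5 () {
    src=$1; a=$2; b=$3
    mkdir -p "$a" "$b"
    i=0
    for f in "$src"/*.fast5; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        if [ $((i % 2)) -eq 0 ]; then mv "$f" "$a"; else mv "$f" "$b"; fi
        i=$((i + 1))
    done
}

split_fast5 sample-data sample-data-1 sample-data-2
```

Dealing alternately, rather than copying the first half and second half, also balances the folders if files are still being written.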

find sample-data-1/ -name '*.fast5' | parallel -j 12 nanonetcall {} --platforms Apple:1:1 --exc_opencl > /dev/null &
find sample-data-2/ -name '*.fast5' | parallel -j 12 nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

Unexpectedly, this ekes out a bit more speed at 10.9x! So by ignoring the inherently poor scalability built into nanonetcall and using GNU Parallel in harness with two GPUs, we have managed to speed up the base-calling by a factor of nearly eleven. I expect Nick to manage even higher speedups using more powerful GPUs.

2. Batch basecalling.

I sit in the same office as two of our bioinformaticians and, even with a good setup for live basecalling, it sounds like there are still occasions when they need to basecall a large dataset (~50GB) of squiggle files. This might be because part of the live basecalling process failed, or the MinION wrote files to a different folder after some software update, or perhaps you simply want to compare several different pieces of basecalling software, or even just compare across versions. You want to load the data onto some “computational resource”, press go and have it chew its way through the squiggle files as quickly and reliably as possible. There are clearly many ways to do this; here I am going to look at a simple linux computing cluster with a head node and a large number of slave nodes. Jobs are distributed across the slave nodes using a batch scheduler and queueing system (I use SLURM).

Hang Phan, another bioinformatician, had a large dataset of squiggles that needed 2D basecalling and she wanted to try nanonetcall. To demonstrate the idea, I simply installed nanonetcall on my venerable but perfectly serviceable cluster of Apple Xserves. Then it is just a matter of rsyncing the data over and writing a simple bit of Python to (a) group together sets of fast5 files (I chose groups of 50), (b) create a SLURM job submission file for each group and finally (c) submit the job to the queue.
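That bit of Python might look something like the sketch below. This is my reconstruction, not the actual script; the folder names, batch script contents and nanonetcall flags are placeholders:

```python
# Sketch of the three steps: (a) group the fast5 files into batches of 50,
# (b) write a SLURM submission file per batch, (c) submit it with sbatch.
import glob
import os


def make_batches(folder, batch_size=50):
    """(a) Return the fast5 files in `folder` grouped into lists of `batch_size`."""
    files = sorted(glob.glob(os.path.join(folder, "*.fast5")))
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def write_job_file(batch, index, outdir="jobs"):
    """(b) Write a simple SLURM submission script for one batch of files."""
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, "batch-%03i.sh" % index)
    with open(path, "w") as job:
        job.write("#!/bin/bash\n")
        job.write("#SBATCH --nodes=1\n")
        job.write("#SBATCH --job-name=batch-%03i\n" % index)
        for fast5 in batch:
            job.write("nanonetcall %s --jobs 1\n" % fast5)
    return path


if __name__ == "__main__":
    for i, batch in enumerate(make_batches("sample_data/")):
        job_file = write_job_file(batch, i)
        # (c) submit to the queue, e.g.
        # os.system("sbatch %s" % job_file)
```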

The advantages of this well-established “bare metal” approach are that
– it is truly scalable: if you want to speed up the process, you add more nodes
– it is reliable: my Ubuntu/SLURM/NFS cluster once ran for over half a year without being switched off (and this isn’t unusual)
– you can walk away and let the scheduler handle what runs when

As you can see from my other posts, I am a fan of cloud and distributed computing, but in this case a good old-fashioned computing cluster (preferably one with GPUs!) fits the bill nicely.

Alpha launch

I’m planning to launch a citizen science project in 2017 which has two distinct ways anyone can help combat antibiotic resistance. I’ve revamped and relaunched what will ultimately become the public-facing project website – please have a look.

The first strand is closer to seeing the light of day and will help the international tuberculosis consortium, CRyPTIC. This global group of researchers, of which I am a part, will be collecting over 100,000 samples from patients with TB. Each sample will be tested to see which antibiotics are effective, as well as having the genome of its M. tuberculosis bacterium sequenced. In practice, because each sample is measured at least three different times, that means looking at 300,000 96-well plates. Step forward Zooniverse! This type of large-scale image classification is exactly the sort of thing Zooniverse citizen science projects excel at. I hope to launch this project in early 2017.

The second citizen science project is more complex and I have recently applied for funding. As described in my Research pages, I am developing methods that can predict whether or not novel or rarely-observed mutations cause resistance to an antibiotic. These require a lot of computer resource, so the idea is to build a volunteer computing project, like, using the BOINC framework, so that volunteers can download a program onto their laptop or desktop. When they’re not using their computer, the program will retrieve part of a problem and run the simulations on their machine before returning the results over the internet. This type of project is more complicated and requires more infrastructure to be set up but, with some luck, I’d hope to have a soft launch late in 2017.

GROMACS in DOCKER: First Steps

DOCKER is cool. But what is it? From the DOCKER webpage:

Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries – anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in.

I like to think of it as somewhere in between virtualenv and a virtual machine. Although the DOCKER website is focussed on commercial software development, and so talks about building and shipping applications, DOCKER could be of huge use to me as a computational scientist. For example, rather than making available a series of input files for my simulations, along with a list of the software versions I used, I could instead simply make available a DOCKER image that contains all the compiled software along with all the input files. Then anyone should, in principle, be able to reproduce my research.


Make no mistake: reproducibility is, rightly, a coming trend. But surely all scientific results are reproduced? It turns out that if the experiment or simulation was difficult to do, the answer is: not so much. And when concerted efforts have been made to reproduce results reported in high-impact journals, the answer is often, well, disconcerting at the very least. In a now-famous study, Begley & Ellis from a pharmaceutical company, Amgen, reported that their in-house scientists were unable to reproduce 47 out of 53 landmark experimental studies in haematology and oncology. Admittedly, they were looking at novel, exciting findings, which are more likely to be challenging to reproduce (although the pressure to over-sell is also stronger). I have no reason to think computational studies are much better. Over the past few years there has been a flurry of papers, comments and best practices. One can even now make a DOCKER image available via GitHub with a DOI so it can be cited independently of an article.

As I’d like to do this in the future, I’ve started to play with DOCKER and GROMACS. Since my workstation is a Mac, the DOCKER host has to run within a lightweight Linux virtual machine. First I installed DOCKER. Then I opened a DOCKER Quick Terminal and checked everything was working by downloading the hello-world image and running it

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world

4276590986f6: Pull complete 
a3ed95caeb02: Pull complete 
Digest: sha256:4f32210e234b4ad5cac92efacc0a3d602b02476c754f13d517e1ada048e5a8ba
Status: Downloaded newer image for hello-world:latest

Hello from Docker.
This message shows that your installation appears to be working correctly.

Let’s try something more real, like an Ubuntu 16.04 Server image.

$ docker run -it ubuntu bash

This drops me inside the Ubuntu image. Let’s compile GROMACS!

root@4b511a41dbf0:/# apt-get update -y
root@4b511a41dbf0:/# apt-get upgrade -y
root@4b511a41dbf0:/# apt-get install build-essential cmake wget openssh-server -y
root@4b511a41dbf0:/# wget
root@4b511a41dbf0:/# tar zxvf gromacs-5.1.2.tar.gz 
root@4b511a41dbf0:/# cd gromacs-5.1.2
root@4b511a41dbf0:/# mkdir build
root@4b511a41dbf0:/# cd build
root@4b511a41dbf0:/# cmake .. -DGMX_BUILD_OWN_FFTW=ON
root@4b511a41dbf0:/# make 
root@4b511a41dbf0:/# make install
root@4b511a41dbf0:/# cd
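
As an aside, the same build could be captured in a Dockerfile so that the whole thing is reproducible with a single docker build. This is just a sketch, not what I actually ran; it assumes you have already downloaded gromacs-5.1.2.tar.gz into the directory containing the Dockerfile:

```dockerfile
# Sketch: the interactive GROMACS build above, captured as a Dockerfile.
FROM ubuntu:16.04

RUN apt-get update -y && apt-get upgrade -y && \
    apt-get install -y build-essential cmake wget openssh-server

# Assumes the GROMACS source tarball sits next to this Dockerfile.
COPY gromacs-5.1.2.tar.gz /
RUN tar zxvf gromacs-5.1.2.tar.gz && \
    cd gromacs-5.1.2 && mkdir build && cd build && \
    cmake .. -DGMX_BUILD_OWN_FFTW=ON && \
    make && make install
```

Building the image would then be a one-liner, e.g. docker build -t gromacs-5.1.2 .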

Now let’s copy over a TPR file to see how fast GROMACS is within a DOCKER container.

root@4b511a41dbf0:/# scp fowler@somewhere.else:benchmark.tpr .
root@4b511a41dbf0:/# source /usr/local/gromacs/bin/GMXRC
root@4b511a41dbf0:/# gmx mdrun -s benchmark -resethway -noconfout -maxh 0.1

Note that this is a single-CPU DOCKER image. I was worried that, since the DOCKER host runs inside a Linux VM, it would be slow compared to running natively in Mac OS X, so I ran three repeats of each: DOCKER was only 1.7% slower…

To save this DOCKER image locally, quit the session

$ docker commit -m "Installed GROMACS 5.1.2 for benchmarking" -a "Philip W Fowler" c5f1cf30c96b philipwfowler/gromacs-5.1.2
$ docker images
REPOSITORY                    TAG                 IMAGE ID            CREATED             SIZE
philipwfowler/gromacs-5.1.2   latest              73e44c120bfa        6 seconds ago       809 MB
ubuntu                        latest              c5f1cf30c96b        2 weeks ago         120.8 MB
hello-world                   latest              94df4f0ce8a4        3 weeks ago         967 B

Done. More soon on multiple cores, can-we-use-the-GPU? and using DOCKER on Amazon Web Services.

Setting up a GROMACS cluster

Recently I moved to the John Radcliffe Hospital and my old lab kindly let me have some old servers that had been switched off. This pushed me to learn how to set them up as a compute cluster with a scheduler for running GROMACS jobs. I’ve wanted to learn this for years, having used many clusters myself, but hadn’t plucked up the courage until now.

This post is a detailed walk-through on how I chose to do this. During the process I did a lot of Googling and have written the post I would have liked to have found; long, comprehensive and a bit verbose.

Since it is long, let’s break it down into four tasks.

Fingerless gloves and a woolly hat can be useful in a cold, noisy machine room


1. Install Ubuntu on each machine

2. Setup networking, including sharing directories on the headnode via NFS

3. Using environment modules, compile GROMACS into one of the shared directories so all the machines in the cluster can run gmx

4. Install SLURM

How to setup a Gramble

This is a Gramble, which of course is short for a GROMACS Bramble or, in other words, a Raspberry Pi 2 Model B cluster running GROMACS. Given that the ARM processor in a Raspberry Pi 2 does not support the SIMD instructions found on more complex (and expensive) Intel chips, why would I want to do such a thing? Well, I wanted to learn how to set up a simple compute cluster.

And this is what I did. Unless stated otherwise, you need to do this on both machines (or however many you are using).

1. Install Ubuntu 14.04 LTS Server onto the microSD cards

Each Raspberry Pi 2 runs off a microSD card; on a computer with a microSD slot (I used an iMac) download the Ubuntu 14.04 LTS image and copy it onto the microSD card as described here. Then you simply push it back into the slot on the Raspberry Pi and power it up. Note that Ubuntu does not run on the model A and, at the time of writing, only ran on the model B. If your microSD card is bigger than 2GB you might want to resize the partition.

2. Update

First, let’s update the installed software and also install the ssh server so we can remotely connect.

sudo apt-get update -y
sudo apt-get upgrade -y
sudo apt-get install openssh-server

3. Setup the network

Now my setup was a bit strange. I was using an old Apple Time Capsule I had; both Raspberry Pis were connected to it via ethernet cables and the Time Capsule itself was in “Extend Wireless Network” mode since our main wireless router is somewhere else in the house. Ideally I’d want a dual-homed headnode with one public IP address and then a private network for communication within the cluster. Instead, each of the two Raspberry Pis has its own IP address, dynamically assigned by my router, but this will do for now. On each machine, edit the hosts file

sudo nano /etc/hosts

so it reads something like the following, substituting the IP addresses your router has assigned to each Raspberry Pi: localhost
<ip-of-rasp0> rasp0
<ip-of-rasp1> rasp1

Also edit the hostname

sudo nano /etc/hostname

so it matches /etc/hosts


and reboot

sudo reboot

4. Add an MPI user

We will need a special user that can log in to all the nodes without a password; SLURM will use it later on. As I understand it, giving it a uid less than 1000 stops the user appearing in any login GUI.

sudo adduser mpiuser --uid 999

5. Install NFS

We will share a folder on the headnode with all the compute nodes using the NFS protocol. This means we’ll only need to install applications on the headnode and they will be accessible from any compute node. Also this is where the GROMACS output files will be written.

on the headnode (rasp0)

sudo apt-get install nfs-kernel-server

on the compute node (rasp1)

sudo apt-get install nfs-common

on the headnode (rasp0) add the following to /etc/exports

/home/mpiuser *(rw,sync,no_subtree_check)
/apps *(rw,sync,no_subtree_check)

This will export the folders /apps and /home/mpiuser on rasp0 to all the compute nodes (in this case just rasp1). You need to make sure all folders shared by NFS exist on both machines. So on both machines

sudo mkdir /apps

You don’t need to mkdir /home/mpiuser as creating this user will have automatically created a home directory for it. Now on the headnode

sudo service nfs-kernel-server start

On the compute node (rasp1),

sudo ufw allow from
sudo mount rasp0:/home/mpiuser /home/mpiuser
sudo mount rasp0:/apps /apps

The first line opens a port in the firewall, although I’m not entirely sure I needed to do this. The last two manually mount the NFS shares from the headnode (rasp0). To set it up so this happens automatically

sudo nano /etc/fstab

and add

rasp0:/home/mpiuser /home/mpiuser nfs
rasp0:/apps /apps nfs

then we can force a remount via

sudo mount -a

6. Create an SSH key pair to allow passwordless login

Because /apps and /home/mpiuser are now shared with all the nodes of the cluster (ok, just rasp1, but you know what I mean), we can simply create an ssh keypair as mpiuser on the headnode and it will be shared with all the compute nodes. So on rasp0

su mpiuser
ssh-keygen -t rsa
cd .ssh/
cat >> authorized_keys

I didn’t use a passphrase during key generation. I expect this is a bad thing, and I did read that you could use a key chain, but as this is a toy cluster I’m going to stick my fingers in my ears and pretend I didn’t read that. If you haven’t created an ssh keypair before, it is fairly simple – it creates a public key and a private key. This is described in more detail here. The key things are, first, that the private key (.ssh/id_rsa) should only be readable by mpiuser and no-one else – in Linux-land this means it should have permissions of 400, which is how it will be created – and, second, that a remote machine will allow a passwordless login if the user’s public key is in .ssh/authorized_keys; this explains the last line above.

Let’s test it. Since we are already the mpiuser and we are on rasp0

ssh rasp1

This should automatically log you in to rasp1. If you try the same thing as the default ubuntu user, it will prompt you for a password as that user doesn’t have an ssh keypair set up.

7. Compile GROMACS

As we are thinking about NFS, let’s compile GROMACS in /apps so the gmx binary can be run from any of the compute node(s). We need a few things before we begin.

sudo apt-get install build-essential cmake

The first package contains the compilers you’ll need to, well, compile GROMACS, and cmake is the build tool GROMACS uses. So, as the mpiuser,

cd /apps
mkdir src
cd src/
# (first download gromacs-5.1.2.tar.gz from the GROMACS website into src/)
tar zxvf gromacs-5.1.2.tar.gz
cd gromacs-5.1.2/
mkdir build-gcc48
cd build-gcc48
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/apps/gromacs/5.1.2 -DBUILD_SHARED_LIBS=OFF
make -j 4
sudo make install

Note the unusual -DBUILD_SHARED_LIBS=OFF flag in the GROMACS cmake command; this is to get around an error when compiling GROMACS on the Raspberry Pi. You shouldn’t normally need this flag. The make command will take at least ten minutes, so put the kettle on.

Because of dynamic linking, you’ll also need to install the compilers on the compute nodes via

sudo apt-get install build-essential

I suspect using environment modules might avoid this issue; I’m going to play with these and if I can get it to work will write another post.

Once you’ve done this, then on any machine you should be able to run GROMACS via

source /apps/gromacs/5.1.2/bin/GMXRC
gmx mdrun

8. Install a cluster management and job scheduling system (SLURM)

Despite the fact the machines I’ve used in the past have tended to use PBS or SGE (and so my fingers can type qstat really, really fast), I chose to use SLURM as

  • it is available as an Ubuntu package
  • our University high performance computing centre have recently started using it and they recommended it
  • it is actively developed and I can’t work out, or at least remember for more than five minutes, what is going on with SGE
  • it has documentation! and tutorials!
  • it is open source (GPL2)
  • I liked the name

To install on rasp0

sudo apt-get install slurm-llnl

This also installs MUNGE as a prerequisite (see the next section). SLURM appeared to want to use /usr/bin/mail, and complained when it couldn’t find it, so I also installed

sudo apt-get install mailutils

which drops you into a setup screen and I chose the “local” option.

Also on rasp1

sudo apt-get install slurm-llnl

9. Get MUNGE working

MUNGE creates and validates credentials and SLURM uses it. On the headnode

sudo /usr/sbin/create-munge-key

This creates a key /etc/munge/munge.key. Now copy this key to /etc/munge/ on all nodes (you may need to fiddle with permissions etc to use the NFS share). There appears to be a bug with Ubuntu and MUNGE, but the workaround is to do the following on all nodes

sudo nano /etc/default/munge

and add the line


now start the service

sudo service munge start

Check it is running

ps -e | grep munge

10. Get SLURM working

This was the bit I wasn’t looking forward to, as job schedulers have, frankly, always scared me. But it turns out this was one of the easiest steps. As the mpiuser on rasp0, copy across the very simple example configuration file that SLURM provides and edit it.

cp /usr/share/doc/slurm-llnl/examples/slurm.conf.simple.gz .
gunzip slurm.conf.simple.gz
nano slurm.conf.simple

All I did was change the lines so they read

ControlMachine=rasp0
NodeName=rasp[0-1] Procs=4 State=UNKNOWN
PartitionName=test Nodes=rasp[0-1] Default=YES MaxTime=INFINITE State=UP

You’ll notice I’ve identified rasp0 as both the ControlMachine (i.e. headnode) and also a Node (compute node) belonging to the test Partition. On a regular cluster you probably don’t want the headnode also being a compute node, but I only had two Raspberry Pis so I thought why not? This also shows the syntax for referring to multiple nodes. If you want a more complex configuration, an online configuration tool is provided, as is an even more complicated version. A note of caution: these may not work with the version of SLURM installed by apt-get (2.6.5) since the current version is 15.08. That doesn’t mean 2.6.5 is as old as it looks; they’ve recently changed the version numbering scheme.

Finally copy the file to the right place on all nodes

 sudo cp slurm.conf.simple /etc/slurm-llnl/slurm.conf

and (on the headnode, rasp0)

sudo service slurm-llnl start

on the compute node

sudo slurmd -c

To check everything is working, run a trivial job:

srun -N1 hostname

This should print the hostname of whichever node SLURM ran the job on.

11. Submit a GROMACS job to the queue

I’m going to assume you have prepared a TPR file called md.tpr and have copied it into /home/ubuntu (and we are now the default user, ubuntu).

Let’s do some simple benchmarking – remember a Raspberry Pi has 4 cores. So first, let’s create a series of TPR files

cp md.tpr md-1.tpr
cp md.tpr md-2.tpr
cp md.tpr md-4.tpr

Now let’s create some SLURM job submission files. This is the one for running on two cores – you’ll need to change the --cpus-per-task and --job-name SBATCH flags and the -deffnm and -ntmpi GROMACS flags depending on the number of cores.

 sudo nano

and copy in

 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=1
 #SBATCH --cpus-per-task=2
 #SBATCH --time=00:15:00
 #SBATCH --job-name=md-2

 source /apps/gromacs/5.1.2/bin/GMXRC

 srun gmx mdrun -deffnm md-2 -ntmpi 2 -ntomp 1 -maxh 0.1 -resethway -noconfout

This will run for six minutes, resetting the GROMACS timers after three minutes. It won’t write out a final GRO file as this can affect the timings. Hopefully you’ll find that, whilst useful and fun machines, Raspberry Pis are really slow at running GROMACS! To submit the job, pass the submission file (here I’ve assumed the script above was saved as md-2.slurm) to sbatch

sbatch md-2.slurm

To check the queue we can issue

squeue

and to cancel we can use scancel.
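Incidentally, rather than editing the submission file by hand for each core count, a short loop can generate all three. This is a sketch; the md-N.slurm file names are my own convention:

```shell
# Generate SLURM submission files for the 1-, 2- and 4-core benchmarks.
for n in 1 2 4; do
cat > md-$n.slurm <<EOF
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=$n
#SBATCH --time=00:15:00
#SBATCH --job-name=md-$n

source /apps/gromacs/5.1.2/bin/GMXRC

srun gmx mdrun -deffnm md-$n -ntmpi $n -ntomp 1 -maxh 0.1 -resethway -noconfout
EOF
done
```

Because the heredoc delimiter is unquoted, $n is expanded when each file is written, so every script carries its own core count.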

Ta da!

New Publication: Predicting affinities for peptide transporters

PepT1 is a nutrient transporter found in the cells that line your small intestine. It is not only responsible for the uptake of di- and tri-peptides, and therefore much of your dietary protein, but also the uptake of most β-lactam antibiotics. This serendipity ensures that we can take (many of) these important drugs orally.

Our ultimate goal is to develop the capability to predict modifications to drug scaffolds that will improve or enable their uptake by PepT1, thereby improving their oral bioavailability.

In this paper, just published online in the new journal Cell Chemical Biology (and free to download, thanks to the Wellcome Trust), we show that it is possible to predict how well a series of di- and tri-peptides bind to a bacterial homologue of PepT1 using a hierarchical approach that combines an end-point free energy method with thermodynamic integration. Since there is no structure of PepT1, we then tried our method on a homology model we published in 2015. We found that the method lost its predictive power. By studying a range of homology models of intermediate quality, we showed that it is highly likely an experimental structure of hPepT1 will be required for accurate in silico predictions of transport.

This is the second paper that Firdaus Samsudin has published as part of his DPhil here in Oxford.

GROMACS on AWS: compiling against CUDA

If you want to compile GROMACS to run on a GPU Amazon Web Services EC2 instance, please first read these instructions on how to compile GROMACS on an AMI without CUDA. These instructions then explain how to install the CUDA toolkit and compile GROMACS against it.

The first few steps are loosely based on these instructions, except rather than download the NVIDIA driver, we shall download the CUDA toolkit since this includes an NVIDIA driver. First we need to make sure the kernel is updated

sudo yum install kernel-devel-`uname -r`
sudo reboot

Safest to do a quick reboot here. Assuming you are in your HOME directory, move into your packages folder.

cd packages/

And download the CUDA toolkit (version 7.5 at present)

sudo /bin/bash

It will ask you to accept the license and then asks a series of questions; I answered Yes to everything except installing the CUDA samples. Now add the following to the end of your ~/.bash_profile using a text editor

export PATH; PATH="/usr/local/cuda-7.5/bin:$PATH"
export LD_LIBRARY_PATH; LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH"

Now we can build GROMACS against the CUDA toolkit. I’m assuming you’ve already downloaded a version of GROMACS and probably installed a non-CUDA version of GROMACS (so you’ll already have one build directory). Let’s make another build directory. You can call it what you want, but some kind of consistent naming can be helpful. The -j 4 flag assumes you have four cores to compile on – this will depend on the EC2 instance you have deployed. Obviously the more cores, the faster, but GROMACS only takes minutes, not hours.

mkdir build-gcc48-cuda75
cd build-gcc48-cuda75
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs/5.0.7-cuda/  -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
make -j 4
sudo make install

To load all the GROMACS tools into your $PATH, run this command and you are done!

source /usr/local/gromacs/5.0.7-cuda/bin/GMXRC

If you run this mdrun binary on a GPU instance it should automatically detect the GPU and run on it, assuming your MDP file options support this. If it does, you will see this whizz by in the log file as GROMACS starts up

1 GPU detected:
  #0: NVIDIA GRID K520, compute cap.: 3.0, ECC:  no, stat: compatible

1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0

Will do PME sum in reciprocal space for electrostatic interactions.

Depending on the size and forcefield you are using you should get a speedup of at least a factor two, and realistically three, using a GPU in combination with the CPUs. For example, see these benchmarks.

GROMACS on AWS: compiling GCC

These are some quick instructions on how to build a more recent version of GCC than is provided by the devel-tools package (currently GCC 4.8.3) on the CentOS-based Amazon Linux AMI. You may, for example, wish to use a more recent version to compile GROMACS – that is my interest. If so, these instructions assume you have done all the steps up to, but not including, compiling GROMACS in this post. Compiling GCC needs several GB of disk space, so if you use the default 8GB for an EC2 AMI it will run out of disk space; increasing this to 12GB is sufficient.

First let’s find out what versions of GCC are available.

[ec2-user@ip-172-30-0-42 ~]$ svn ls svn:// | grep gcc | grep release

As you can see, when I wrote this 5.3.0 was the most recent stable version, so let’s try that one. I’m going to compile everything inside a folder called packages/, so let’s create that and then use subversion to check out version 5.3.0 (this is going to download a lot of files so will take a minute or two).

[ec2-user@ip-172-30-0-42 ~]$ mkdir ~/packages
[ec2-user@ip-172-30-0-42 ~]$ cd ~/packages
[ec2-user@ip-172-30-0-42 packages]$ svn co svn://
A    gcc_5_3_0_release/
A    gcc_5_3_0_release/libitm
A    gcc_5_3_0_release/fixincludes/fixopts.c
A    gcc_5_3_0_release/install-sh
A    gcc_5_3_0_release/ylwrap
 U   gcc_5_3_0_release
Checked out revision 232268.
[ec2-user@ip-172-30-0-42 packages]$ cd gcc_5_3_0_release/

GCC needs some prerequisites which are installed by this script.

[ec2-user@ip-172-30-0-42 gcc_5_3_0_release]$ ./contrib/download_prerequisites 
--2016-01-12 13:24:23--
       => ‘mpfr-2.4.2.tar.bz2’
Resolving (
isl-0.14.tar.bz2    100%[=====================>]   1.33M   693KB/s   in 2.0s   

2016-01-12 13:24:39 (693 KB/s) - ‘isl-0.14.tar.bz2’ saved [1399896]

Go up a level, make a build directory and move there.

[ec2-user@ip-172-30-0-42 gcc_5_3_0_release]$ cd ..
[ec2-user@ip-172-30-0-42 packages]$ mkdir gcc_5_3_0_release_build/
[ec2-user@ip-172-30-0-42 packages]$ cd gcc_5_3_0_release_build/

Now we are in a position to compile GCC 5.3.0. This took about 50 minutes using all eight cores of a c3.2xlarge instance, so this is a good moment to go and have lunch. Note that since the instance I am compiling on has 8 virtual CPUs, I can use the -j 8 flag to tell make to use up to 8 threads during compilation, which speeds things up. If you are using a micro instance, just omit the -j 8 (but good luck, as that will take a long time).

[ec2-user@ip-172-30-0-42 gcc_5_3_0_release_build]$ ../gcc_5_3_0_release/configure && make -j 8 && sudo make install && echo "success" && date
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for a BSD-compatible install... /usr/bin/install -c
checking whether ln works... yes

Hopefully you now have a newer version of GCC to compile binaries with. With any luck it might even give you a performance boost.

GROMACS on AWS: Performance and Cost

So we have created an Amazon Machine Image (AMI) with GROMACS installed. In this post I will examine the sort of single-core performance you can expect and how much this is likely to cost compared to other compute options you might have.


To test the different types of instances you can deploy our GROMACS image on, we need a benchmark system. For this I’ve chosen a peptide MFS transporter in a simple POPC lipid bilayer solvated by water. This is very similar to the simulations found in this paper. Or to put it another way: 78,000 atoms in a cube, most of which belong to water, some to lipids and the rest to protein. It is fully atomistic and is described using the CHARMM27 forcefield.

Computing Resources Tested

I tried to use a range of compute resources to provide a good comparison for AWS. First, and most obviously, I used my workstation on my desk, which is a late-2013 MacPro which has 12 Intel Xeon cores. In our department we also have a small compute cluster, each node of which has 16 cores. Some of these nodes also have a K20 GPU. Then I also have access to a much larger computing cluster run by the University. Unfortunately, since the division I am in has decided not to contribute to its running, I have to pay for any significant usage.

Rather than test all the different types of instances available on EC2, I tested an example from each of the current (m4) and older generation (m3) of non-burstable general purpose instances. I also tested an example from the latest generation of compute instances (c4) and finally the smaller of the GPU instances (g2).



The performance, in nanoseconds per day for a single compute core, is shown on the left (bigger is better).

One worry about AWS EC2 is that, for a highly-optimised compute code like GROMACS, performance might suffer due to the layers of virtualisation, but, as you can see, even the current generation of general purpose instances is as fast as my MacPro workstation. The fastest machine, perhaps unsurprisingly, is the new University compute cluster. On AWS, the compute c4 class is faster than the current general purpose m4 class, which in turn is faster than the older generation general purpose m3 class. Finally, as you might expect, using a GPU boosts performance by slightly more than 3x.



I’m going to do a “real” comparison. If I buy a compute cluster and keep it in the department, I only have to pay the purchase cost and none of the running costs. So I’m assuming the workstation costs £2,500, a single 16-core node costs £4,000, and both have a five-year lifetime. Alternatively, I can use the university’s high-performance computing clusters at 2p per core-hour. This is obviously unfair on the university facility, as its price does include operational costs such as electricity and staff, and you can see that reflected in the difference in costs.
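Those amortised rates fall out of a one-line calculation (same figures as above: a £2,500 twelve-core workstation and a £4,000 sixteen-core node, each over a five-year lifetime with no running costs):

```shell
# Pence per core-hour = price / (cores * hours in five years) * 100
awk 'BEGIN {
  hours = 5 * 365 * 24                               # hours in five years
  printf "workstation:  %.2fp per core-hour\n", 2500 / (12 * hours) * 100
  printf "cluster node: %.2fp per core-hour\n", 4000 / (16 * hours) * 100
}'
```

which works out at roughly 0.48p and 0.57p per core-hour respectively, i.e. well under the university’s 2p rate once running costs are excluded.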

So is AWS EC2 more or less expensive? This hinges on whether you use it in the standard “on demand” manner or instead get access by bidding on the market. The latter is significantly cheaper, but you only have access whilst your bid price is above the current “spot price”, so you can’t guarantee access and your simulations have to be able to cope with restarts. Since the spot price varies with time, I took the average of two prices at different times on Wed 13 Jan 2016.

As you can see, AWS is more expensive per core-hour if you use it “on demand”, but is cheaper than the university facility if you are willing to surf the market. Really, though, we should be considering the cost efficiency, i.e. the cost per nanosecond, as this also takes the performance into account.


Cost efficiency



When we do this an interesting picture emerges: using AWS EC2 via bidding on the market is cheaper than using the university facility and can be as cheap as buying your own hardware, even if you don’t have to pay the running costs. Furthermore, as you’d expect, using a GPU decreases the cost and so should be a no-brainer for GROMACS.
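The conversion from price per core-hour to cost per nanosecond is simple; note that the 1.0 ns/day-per-core figure below is a made-up placeholder for illustration, not one of the measured benchmarks:

```shell
# cost per ns = pence per core-hour * 24 / (ns per day per core)
# 2p per core-hour is the university rate quoted above; 1.0 ns/day per
# core is purely illustrative.
awk 'BEGIN { printf "%.0fp per ns\n", 2.0 * 24 / 1.0 }'
```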

Of course, this assumes lots of people don’t start using the EC2 market, thereby driving the spot price up…


In this post I’m going to show how I created an Amazon Machine Image with GROMACS 5.0.7 installed for use in the Amazon Web Services cloud.

I’m going to assume that you have signed up for Amazon Web Services (AWS), created an Identity and Access Management (IAM) user (each AWS account can have multiple IAM users), created an SSH key pair for that user, downloaded it, given it an appropriate name with the correct permissions and placed it in ~/.ssh. Amazon have a good tutorial that covers the above actions. One thing that confused me: if you already have an Amazon account then you can use it to sign up to AWS. In other words, depending on your mood, you can order a book or 10,000 CPU hours of compute. I felt a bit nervous about setting up an account backed by my credit card; if you also feel nervous, Amazon offer a Free Tier which at present permits you to use up to 750 hours a month, as long as you only use the smallest virtual machine instance (t2.micro). If you use more than this, or use a more powerful instance, you will be billed.

First, log in to your AWS console. This will have a strange URL like

where 123456789012 is your AWS account number. You should get something that looks like this.

AWS Management Console


Next we need to create an EC2 (Elastic Compute Cloud) instance based on one of the standard virtual machine images, then download and compile GROMACS on it. In the AWS Management Console, choosing “EC2” in the top left should bring you here:

AWS EC2 dashboard


Now click the blue “Launch Instance” button.

Step 1. Choose an Amazon Machine Image (AMI).

Here we can choose one of the standard virtual machine images to compile GROMACS on. Let’s keep it simple and use the standard Amazon Linux AMI.


Step 2. Choose an Instance Type.

The important thing to remember here is that the image we create can be run on any instance type. So if we want to compile on multiple cores to speed things up, we can choose an instance with, say, 8 vCPUs; or, if we don’t want to be billed and are willing to wait, we can choose the t2.micro instance. Let’s choose a c4.2xlarge instance, which has 8 vCPUs. You could hit “Review and Launch” at this stage, but it is worth checking the amount of storage allocated to the instance, so hit “Next: Configure Instance Details”. I’m not going to fiddle with these options. Hit “Next: Add Storage”.


Step 4. Add storage.

What I have found here is that if you use the version of gcc installed via yum (4.8.3) then 8 GB is fine, but if you want to compile a more recent version you will need at least 12 GB.
I’m going to accept the defaults for the rest of the steps, so will click “Review and Launch” now.


Step 7. Review instance Launch.

Check it all looks OK and hit “Launch”. This will bring up a window. Here it is crucial that you choose the name of the key pair you created and downloaded. As you need a different key pair for each IAM user in each Amazon Region, it is worth naming them carefully, as you will otherwise rapidly get very confused. Also, Amazon don’t let you download a key pair again, so you have to be careful with them. You can see mine is called


which contains the name of my IAM user and the AWS region it will work for (here EU West, which is Ireland). Hit “Launch”.


Launch Status

This window gives you some links on how to connect to the AWS instance. Hit “View Instances”. It may take a minute or two for your instance to be created; during this time its status is given as “Initializing”. When it is finished, you can click on your new instance (you should have only one) and it will give you a whole host of information. We need the public IP address and the name of our SSH key pair so we can ssh to the instance (note that the default user is called ec2-user).


lambda 508 $ ssh -i "PhilFowler-key-pair-euwest.pem" ec2-user@
The authenticity of host ' (' can't be established.
ECDSA key fingerprint is SHA256:N+B3toLxLE3vRuuzLZWF44N9qb3ucUVVU/RD00W3iNo.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (ECDSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
11 package(s) needed for security, out of 27 available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-172-30-0-42 ~]$

Installing pre-requisites

Amazon Linux is based on CentOS and so uses the yum package manager. You might be more familiar with apt-get if you use Ubuntu, but the principles are similar. It is worth following their recommendation and applying all the updates; this will spew a lot of information to the terminal and ask you to confirm.

[ec2-user@ip-172-30-0-42 ~]$ sudo yum update
Loaded plugins: priorities, update-motd, upgrade-helper
Resolving Dependencies
--> Running transaction check
---> Package aws-cli.noarch 0:1.9.1-1.29.amzn1 will be updated
---> Package aws-cli.noarch 0:1.9.11-1.30.amzn1 will be an update
---> Package binutils.x86_64 0: will be updated
---> Package binutils.x86_64 0: will be an update
---> Package ec2-net-utils.noarch 0:0.4-1.23.amzn1 will be updated
sudo.x86_64 0:1.8.6p3-20.21.amzn1
vim-common.x86_64 2:7.4.944-1.35.amzn1
vim-enhanced.x86_64 2:7.4.944-1.35.amzn1
vim-filesystem.x86_64 2:7.4.944-1.35.amzn1
vim-minimal.x86_64 2:7.4.944-1.35.amzn1


This instance is fairly basic and has no version of gcc, cmake etc., but we can install them via yum:

[ec2-user@ip-172-30-0-42 ~]$ sudo yum install gcc gcc-c++ openmpi-devel mpich-devel cmake svn texinfo-tex flex zip libgcc.i686 glibc-devel.i686
texlive-xdvi.noarch 2:svn26689.22.85-27.21.amzn1
texlive-xdvi-bin.x86_64 2:svn26509.0-27.20130427_r30134.21.amzn1
zziplib.x86_64 0:0.13.62-1.3.amzn1


Next we need to add the OpenMPI executables to the $PATH. These settings will only persist for this session; to make them permanent, add them to your .bashrc.

export PATH=/usr/lib64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
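One way to make them permanent is to append the same two lines to ~/.bashrc, for example:

```shell
# Append the OpenMPI exports to ~/.bashrc so every new shell picks them up.
cat >> ~/.bashrc <<'EOF'
export PATH=/usr/lib64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
EOF
```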

Now we hit a potential problem: the version of gcc installed by yum is fairly old.

[ec2-user@ip-172-30-0-42 ~]$ gcc --version
gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9)
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO

Having said that, 4.8.3 should be good enough for GROMACS. I’ll push ahead using this version, but in a subsequent post I also detail how to download and install gcc 5.3.0.

Compiling GROMACS

First, let’s get the GROMACS source code using wget. I’m going to compile version 5.0.7 since I’ve got benchmarks for this one, but you could equally install 5.1.x.

[ec2-user@ip-172-30-0-42 ~]$ mkdir ~/packages
[ec2-user@ip-172-30-0-42 ~]$ cd ~/packages
[ec2-user@ip-172-30-0-42 packages]$ wget
[ec2-user@ip-172-30-0-42 packages]$ tar zxvf gromacs-5.0.7.tar.gz
[ec2-user@ip-172-30-0-42 packages]$ cd gromacs-5.0.7

Now let’s make a build directory, move there and then issue the cmake directive

[ec2-user@ip-172-30-0-42 gromacs-5.0.7]$ mkdir build-gcc48
[ec2-user@ip-172-30-0-42 gromacs-5.0.7]$ cd build-gcc48
[ec2-user@ip-172-30-0-42 build-gcc48]$ cmake .. -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs/5.0.7

The compilation step would take a good few minutes on a single-core machine, but as I’ve got 8 virtual CPUs to play with I can give make the -j 8 flag, which speeds things up.

[ec2-user@ip-172-30-0-42 build-gcc48]$ make -j 8
Building CXX object src/programs/CMakeFiles/gmx.dir/gmx.cpp.o
Building CXX object src/programs/CMakeFiles/gmx.dir/legacymodules.cpp.o
Linking CXX executable ../../bin/gmx
[100%] Built target gmx
Linking CXX executable ../../bin/template
[100%] Built target template

This took 90 seconds using all 8 cores. Now we can install the binaries. Note that I told cmake to install into /usr/local/gromacs/5.0.7, rather than just /usr/local/gromacs, so I can keep track of different versions.

[ec2-user@ip-172-30-0-51 build-gcc48]$ sudo make install
-- Installing: Creating symbolic link /usr/local/gromacs/5.0.7/bin/g_velacc
-- Installing: Creating symbolic link /usr/local/gromacs/5.0.7/bin/g_wham
-- Installing: Creating symbolic link /usr/local/gromacs/5.0.7/bin/g_wheel

To add this version of GROMACS to your $PATH, source the GMXRC file (add this to your .bashrc to avoid doing it each time):

[ec2-user@ip-172-30-0-51 build-gcc48]$ source /usr/local/gromacs/5.0.7/bin/GMXRC

Now you have all the GROMACS tools available!