GPUs

Accelerating Oxford Nanopore basecalling

It looks innocuous sitting on the desk, an Oxford Nanopore MinION, but it can produce a huge data of data from a single sequencing run. Since the nanowire works by inferring which base is in the pore by how much it reduces the flow of ions (and hence current) through the pore, the raw data is commonly called “squiggles”. Converting each of these squiggles to a sequence of nucleotides is “base-calling”. Ideally, you want to do this in real-time, i.e. as the squiggles are being produced by the MinION. Interestingly, this is becoming a challenge for the molecular biologists and bioinformaticians in our group since the flow of data is now high enough that a high-spec laptop struggles to cope. It may becoming obvious that this is not my field of expertise – and you’d be right – but I do know something about speeding up computational jobs through the use of GPUs and/or computing clusters. There appear to be two main use-cases we have for base-calling the squiggles. I’m only going to talk about nanonetcall,

1. Live basecalling.

Nick Sanderson is leading the charge here in our group (that is his hand in the photo above). He has built a workstation with a GPU and SSD disc and was playing around with the ability of nanonetcall to use multiple threads. This is our base case, which has a single process with one OpenMP thread, so by definition has a speedup of 1.0x. The folder sample-data/ contains a large number of squiggle files (fast5 files).

OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 1 > /dev/null 

Let’s first try using 2 OpenMP threads and a single process.

OMP_NUM_THREADS=2 nanonetcall sample_data/ --jobs 1 > /dev/null 

This makes no difference whatsoever. In common with running GROMACS simulations, OpenMP doesn’t help much, in my hands at least. Let’s rule out using additional OpenMP threads and simply increase
We can simply try increasing the number of jobs to run in parallel:

OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 2 > /dev/null    
OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 4 > /dev/null 
OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 8 > /dev/null 
OMP_NUM_THREADS=1 nanonetcall sample_data/ --jobs 12 > /dev/null 

These lead to speedups of 1.8x, 3.0x, 3.2x and 3.2x. These were all run on my early-2013 MacPro which has a single 6-core 3.5 GHz Intel Xeon CPU, so can run up to 12 processes with hyperthreading. I don’t know exactly how nanonetcall is parallelised, but at least a good chunk of it is Python, it is no surprise that it doesn’t scale that well since Python will always struggle due to the limitations inherent in being an interpreted language (which in Python’s case means the GIL). Now parts of nanonetcall are cleverly written using OpenCL so it can make use of a GPU if one is available. My MacPro has two AMD FirePro D300 cards. Good graphics cards, but I would have chosen better GPUs if I could. Even so using a single GPU gives a speedup of 3.3x.

nanonetcall sample_data/ --platforms Apple:0:1 Apple:1:1 --exc_opencl > /dev/null

I suggested we try one of my favourite apparently-little-known unix tools, GNU Parallel. This is simple but truly awesome command that you can install via a package manager like apt-get on Ubuntu and MacPorts on a Mac. The simplest way to use it is in a pipe like this;

find sample-data/ -name '*.fast5' | parallel -j 12 nanonetcall {} > /dev/null

This needs explaining. The find command will produce a long list of all the fast5 files. Parallel then consumes these, and will first launch 12 nanonetcall jobs, each running on a single core. As soon as one of these finishes, parallel will launch another nanonetcall job to process the next fast5 in the list. In this way parallel will ensure that there are 12 nanonetcall jobs running at all times and we rapidly work out way through the squiggles. This results in a speed up of 4.8x, so not linear, but certainly better than trying to use the ‘internal’ parallelisation of nanonetcall.

But we can do better because we can use parallel to overload the GPU too.

find sample-data/ -name '*fast5' | parallel nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

This yields our fastest speed-up of 5.7x. But perhaps the GPU is getting too overloaded, so let’s try sharing the loads

find sample-data-1/ -name '*.fast5' | parallel -j 6 nanonetcall {} --platforms Apple:1:1 --exc_opencl > /dev/null &
find sample-data-2/ -name '*.fast5' | parallel -j 6 nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

where I’ve split the data equally into two folders. Sure enough, this now gives a speedup of 9.9x. Now, remember there are only 12 virtual cores, so if we try running more processes, we should start to see a performance penalty, but let’s try!

find sample-data-1/ -name '*.fast5' | parallel -j 12 nanonetcall {} --platforms Apple:1:1 --exc_opencl > /dev/null &
find sample-data-2/ -name '*.fast5' | parallel -j 12 nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

Unexpectedly, this ekes out a bit more speed at 10.9x! So by ignoring the inherently poor scalability built-in to nanonetcall and using GNU Parallel in harness with two GPUs, we have managed to speedup the base-calling by a factor of nearly eleven. I expect Nick to manage even higher speedups using more powerful GPUs.

2. Batch basecalling.

I sit in the same office as two of our bioinformaticians and even with a good setup for live-basecalling, it sounds like there are still occasions when they need to baseball a large dataset (~50GB) of squiggle files. This might be because part of the live base calling process failed, or even the MinION writing files to a different folder due to some software update, or perhaps you simply want to compare several different pieces of basecalling software or even just compare across versions. You want to load the data onto some “computational resource”, press go and have it chew its way through the squiggle files as quickly and reliably as possible. There are clearly many ways to do this; here I am going to look at a simple linux computing cluster with a head node and a large number of slave nodes. Jobs are distributed across the slave nodes using a batch scheduler and queuing system (I use SLURM).

Hang Phan, another bioinformatician, had a large dataset of squiggles that needed 2D base calling and she wanted to try nanonetcall. To demonstrate the idea, I simply installed nanonetcall on my venerable but perfectly serviceable cluster of Apple Xserves. Then it is just a matter of rsync`ing the data over, writing a simple bit of Python to (a) group together sets of fast5 files (I chose groups of 50) and then (b) create a SLURM job submission file for each group and finally (c) submit the job to the queue.

The advantages of this well-established “bare metal” approach are that
– it is truly scalable: if you want to speed up the process, you add more nodes
– it is reliable; my Ubuntu/SLURM/NFS cluster wasn’t switched off for over half a year (and this isn’t unusual)
– you can walk away and let the scheduler handle what runs when

As you can see from my other posts, I am a fan of cloud and distributed computing, but in this case a good old-fashioned computing cluster (preferably one with GPUs!) fits the bill nicely.

GROMACS on AWS: compiling against CUDA

If you want to compile GROMACS to run on a GPU Amazon Web Services EC2 instance, please first read these instructions on how to compile GROMACS on an AMI without CUDA. These instructions then explain how to install the CUDA toolkit and compile GROMACS against it.

The first few steps are loosely based on these instructions, except rather than download the NVIDIA driver, we shall download the CUDA toolkit since this includes an NVIDIA driver. First we need to make sure the kernel is updated

sudo yum install kernel-devel-`uname -r`
sudo reboot

Safest to do a quick reboot here. Assuming you are in your HOME directory, move into your packages folder.

cd packages/

And download the CUDA toolkit (version 7.5 at present)

wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
sudo /bin/bash cuda_7.5.18_linux.run

It will ask you to accept the license and then asks you a series of questions. I answer Yes to everything except installing the CUDA samples. Now add the following to the end of your ~/.bash_profile using a text editor

export PATH; PATH="/usr/local/cuda-7.5/bin:$PATH"
export LD_LIBRARY_PATH; LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH"

Now we can build GROMACS against the CUDA toolkit. I’m assuming you’ve already downloaded a version of GROMACS and probably installed a non-CUDA version of GROMACS (so you’ll already have one build directory). Let’s make another build directory. You can call it what you want, but some kind of consistent naming can be helpful. The -j 4 flag assumes you have four cores to compile on – this will depend on the EC2 instance you have deployed. Obviously the more cores, the faster, but GROMACS only takes minutes, not hours.

mkdir build-gcc48-cuda75
cd build-gcc48-cuda75
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs/5.0.7-cuda/  -DGMX_GPU=ON -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
make -j 4
sudo make install

To load all the GROMACS tools into your $PATH, run this command and you are done!

source /usr/local/gromacs/5.0.7-cuda/bin/GMXRC

If you run this mdrun binary on a GPU instance it should automatically detect the GPU and run on it, assuming your MDP file options support this. If it does you will see this zip by in the log file as GROMACS starts up

1 GPU detected:
  #0: NVIDIA GRID K520, compute cap.: 3.0, ECC:  no, stat: compatible

1 GPU auto-selected for this run.
Mapping of GPU to the 1 PP rank in this node: #0

Will do PME sum in reciprocal space for electrostatic interactions.

Depending on the size and forcefield you are using you should get a speedup of at least a factor two, and realistically three, using a GPU in combination with the CPUs. For example, see these benchmarks.

GROMACS on AWS: Performance and Cost

So we have created an Amazon Machine Image (AMI) with GROMACS installed. In this post I will examine the sort of single core performance you can expect and much this is likely to cost compared to other compute options you might have.

Benchmark

To test the different types of instances you can deploy our GROMACS image on, we need a benchmark system to test. For this I’ve chosen a peptide MFS transporter in a simple POPC lipid bilayer solvated by water. This is very similar to the simulations found in this paper. Or to put it another way: 78,000 atoms in a cube, most of which are water, some belong to lipids and the rest, protein. It is fully atomistic and is described using the CHARMM27 forcefield.

Computing Resources Tested

I tried to use a range of compute resources to provide a good comparison for AWS. First, and most obviously, I used my workstation on my desk, which is a late-2013 MacPro which has 12 Intel Xeon cores. In our department we also have a small compute cluster, each node of which has 16 cores. Some of these nodes also have a K20 GPU. Then I also have access to a much larger computing cluster run by the University. Unfortunately, since the division I am in has decided not to contribute to its running, I have to pay for any significant usage.

Rather than test all the different types of instances available on EC2, I tested an example from each of the current (m4) and older generation (m3) of non-burstable general purpose instances. I also tested an example from the latest generation of compute instances (c4) and finally the smaller instance from the GPU instances (g2).

Performance

fig-aws-gromacs-performance

The performance, in nanoseconds per day for a single compute core, is shown on the left (bigger is better).

One worry about AWS EC2 is that for a highly-optimised compute code, like GROMACS, performance might suffer due to the layers of virtualisation, but, as you can see, even the current generation of general purpose instances is as fast as my MacPro workstation. The fastest machine, perhaps unsurprisingly, is the new University compute cluster. On AWS, the compute c2 class is faster than the current general purpose m4 class, which in turn is faster than the older generation general purpose m3 class. Finally, as you might expect, using a GPU boosts performance by slightly more than 3x.

 

Cost

fig-aws-gromacs-costI’m going to do a “real” comparison. So if I buy a compute cluster and keep it in the department I only have to pay the purchase cost but none of the running costs. So I’m assuming the workstation is £2,500 and a single 16-core node is £4,000 and both of these have a five year lifetime. Alternatively I can use the university’s high performance computing clusters at 2p per core hour. This obviously is unfair on the university facility as this does include operational costs, like electricity, staff etc, and you can see that reflected in the difference in costs.

So AWS EC2 more or less expensive? This hinges on whether you use it in the standard “on demand” manner or instead get access through bidding via the market. The later is significantly cheaper but you only have access whilst your bid price is above the current “spot price” and so you can’t guarantee access and your simulations have to be able to cope with restarts. Since the spot price varies with time, I took the average of two prices at different times on Wed 13 Jan 2016.

As you can see AWS is more expensive per core hour if you use it “on demand”, but is cheaper than the university facility if you are willing to surf the market. Really, though we should be considering the cost efficiency i.e. the cost per nanosecond as this also takes into account the performance.

 

Cost efficiency

fig-aws-gromacs-efficiency

 

When we do this an interesting picture emerges: using AWS EC2 via bidding on the market is cheaper than using the university facility and can be as cheap as buying your own hardware even if you don’t have to pay the running costs. Furthermore, as you’d expect, using a GPU decreases cost and so should be a no-brainer for GROMACS.

Of course, this assumes lots of people don’t start using the EC2 market, thereby driving the spot price up…

Running GROMACS on an AMD GPU using OpenCL

I first used an Apple Mac when I was eight. Apart from a brief period in the 1990s when I had a PC laptop I’ve used them ever since.

Until last year I had an old MacPro which had four PCI slots so you could add a GPU-capable NVIDIA card, although you were limited by the power supply. A GPU can accelerate the molecular dynamics code I use, GROMACS, by up to 2-3 times.

Unfortunately, when Apple designed the new MacPro, they put in AMD FirePro GPUs so although it is a lovely machine, you can’t run CUDA applications.

But this morning I saw that the next release candidate of GROMACS 5.1 supported OpenCL. Although OpenCL applications are usually a bit slower than CUDA applications, this would, in theory, allow me to accelerate GROMACS on my MacPro.

So I downloaded the code, compiled it with the appropriate OpenCL flag and it just works! I benchmarked the code on an atomistic and a coarse-grained benchmark that I use. Running on a single core, using a single AMD FirePro D300 accelerated GROMACS by 2.0 and 2.5x for the atomistic and coarse-grained benchmarks, respectively.

Here’s looking forward to the final release of GROMACS 5.1!fig-gromacs-5.1-amd

GROMACS 4.6: Running on GPUs

fig-blog-gpu-1I mentioned before that I would write something on running GROMACS on GPUs. Let’s imagine we want to simulate a solvated lipid bilayer containing 6,000 lipids for 5 µs. The total number of MARTINI coarse-grained beads is around 137,000 and the box dimensions are roughly 42x42x11 nm. Although this is smaller than the benchmark we looked at last time, it is still a challenge to run on a workstation. To see this let’s consider running it on my MacPro using GROMACS 4.6.1. The machine is an early 2008 MacPro and has 2 Intel Xeons, each with 4 cores. Using 8 MPI processes gets me 132 ns/day, so I would have to wait 38 days for 5 µs. Too slow!

 

fig-blog-gpu-2You have to be careful installing non-Apple supported NVidia GPUs into MacPros, not least because you are limited to 2x6pin power connectors. Taking all this into account, the best I can do without doing something drastic to the power supply is to install an NVIDIA GeForce GTX670. Since I only have one GPU, I can only run one MPI process, but this can spawn multiple OpenMP threads. For this particular system, I get the best performance with 4 threads (134 ns/day) which is the same performance I get using all 8 cores without the GPU. So when I am using just a single core, adding in the GPU increases the performance by a factor of 3.3x. But as I add additional cores, the increase afforded by the single GPU drops until the performance is about the same at 8 cores.

Now let’s try something bigger. Our lab hasfig-blog-gpu-3 a small Intel (Sandy Bridge) computing cluster. Each node has 12 cores, and there are 8 nodes, yielding a maximum of 96 cores. Running on the whole cluster would reduce the time down to 6 days, which is a lot better but not very fair on everyone else in the lab. We could try and get access to Tier-1 or Tier-0 supercomputers but, for this system, that is overkill. Instead let’s look at a Tier-2 machine that uses GPUs to accelerate the calculations.

 

fig-blog-gpu-5

The University of Oxford, through e-infrastructure South, has access to the machines owned by the Centre for Innovation. One of these, EMERALD, is a GPU-based cluster. We shall look at one of the partitions; this has 60 nodes, each with two 6-core Intel processors and 3 NVIDIA M2090 Tesla GPUs. For comparison, let’s run without the GPUs. The data shown are for simulations with only 1 OpenMP thread per MPI process. So now let’s run using the GPUs (which is the point of this cluster!). Again just using asingle OpenMP thread per MPI process/GPU (shown on graph) we again find a performance increase of 3-4x. Since there are 3 GPUs per node, and each node has 12 cores, we could run 3 MPI processes (each attached to a GPU) on each node and each process could spawn 1, 2, 3 or 4 OpenMP threads. This uses more cores,  but since they probably would be sitting idle, this is a more efficient use of the compute resource. Trying 2 or 3 OpenMP threads per MPI process/GPU lets us reach a maximum performance of 1.77 µs per day, so we would get our 5 µs in less than 3 days. Comparing back to our cluster, we can get the same performance of our 96-core local cluster using a total of 9 GPUs and 18 cores on EMERALD.

fig-blog-gpu-9Finally, let’s compare EMERALD to the Tier-1 PRACE supercomputer CURIE. CURIE was the 20th fastest supercomputer in the world in November 2013. For this comparison we will need to use a bigger benchmark, so let’s us the same one as before. It has 9x the number of lipids, but because I had to add extra water ends up being about 15x bigger at 2.1 million particles. Using 24 GPUs and 72 cores, EMERALD manages 130 ns/day. To get the same performance on CURIE requires 150 cores and ultimately CURIE tops out at 1,500 ns/day on 4,196 cores. Still, EMERALD is respectable and shows how it can serve as a useful bridge to Tier-1 and Tier-0 supercomputers. Interestingly, CURIE also has a “hybrid” partition that contains 144 nodes, each with 2 Intel Westmere processors and 2 NVIDIA M2090 Tesler GPUs. I was able to run on up to 128 GPUs with 4 OpenMP threads per MPI/GPU, making a total of 512 cores. This demonstrates that GROMACS can run on large numbers of GPU/CPUs and that such hybrid architectures are viable as supercomputers (for GROMACS at least).

GROMACS 4.6: Scaling of a very large coarse-grained system

So if I have a particular system I want to simulate, how many processing cores can I harness to run a single GROMACS version 4.6 job? If I only use a few then the simulation will take a long time to finish, if I use too many the cores will end up waiting for communications from other cores and so the simulation will be inefficient (and also take a long time to finish). In between is a regime where the code, in this case GROMACS, scales well. Ideally, of course, you’d like linear scaling i.e. if I run on 100 cores in parallel it is 100x faster than if I ran on just one.

The rule of thumb for GROMACS 4.6 is that it scales well until there are only ~130 atoms/core. In other words, the more atoms or beads in your system, the larger the number of computing cores you can run on before the scaling performance starts to degrade.

As you might imagine there is a hierarchy of computers we can run our simulations on; this starts at humble workstations, passes through departmental, university and regional computing clusters before ending up at national (Tier 1) and international (Tier 0) high performance computers (HPC).

In our lab we applied for, and got access to, a set of European Tier 0 supercomputers through PRACE. These are currently amongst the fastest and largest supercomputers in the world. We tested five supercomputers in all: CURIE (Paris, France; green lines), MareNostrum (Barcelona, Spain; black line), FERMI (Bologna, Italy; lilac), SuperMUC (Munich, Germany; blue) and HERMIT (Stuttgart, Germany; red ). Each has a different architecture and inevitably some are slightly newer than others. CURIE has three different partitions, called thin, fat and hybrid. The thin nodes constitute the bulk of the system; the fat nodes have more cores per node whilst the hybrid nodes combine conventional CPUs with GPUs.

We tested a coarse-grained 54,000 lipid bilayer (2.1 million MARTINI beads) on all seven different architectures and the performance is shown in the graph – note that the axes are logarithmic. Some machines did better than others; FERMI, which is an IBM BlueGene/Q, appears not to be well-suited to our benchmark system, but then one doesn’t expect fast per-core performance on a BlueGene as that is not how they are designed. Of the others, MareNostrum was fastest for small numbers of cores, but its performance began to suffer if more than 256 cores were used. SuperMUC and the Curie thin nodes were the fastest conventional supercomputers, with the Curie thin nodes performing better at large core counts. Interestingly, the Curie hybrid GPU nodes were very fast, especially bearing in mind the CPUs on these nodes are older and slower than those in the thin nodes. One innovation introduced into GROMACS 4.6 that I haven’t discussed previously is one can now run either using purely MPI processes or a combination of MPI processes and OpenMP threads. We were somewhat surprised to find, that, in nearly all cases, the pure MPI approach remained slightly faster than the new hybrid parallelisation.

Of course, you may see very different performance using your system with GROMACS 4.6. You just have to try and see what you get! In the next post I will show some detailed results on using GROMACS on GPUs.