GROMACS 4.6: Running on GPUs
Philip Fowler, 11th February 2014

I mentioned before that I would write something on running GROMACS on GPUs. Let’s imagine we want to simulate a solvated lipid bilayer containing 6,000 lipids for 5 µs. The total number of MARTINI coarse-grained beads is around 137,000 and the box dimensions are roughly 42 x 42 x 11 nm. Although this is smaller than the benchmark we looked at last time, it is still a challenge to run on a workstation.

To see this, let’s consider running it on my Mac Pro using GROMACS 4.6.1. The machine is an early 2008 Mac Pro and has 2 Intel Xeons, each with 4 cores. Using 8 MPI processes gets me 132 ns/day, so I would have to wait 38 days for 5 µs. Too slow!

You have to be careful installing NVIDIA GPUs that Apple does not support into Mac Pros, not least because you are limited to two 6-pin power connectors. Taking all this into account, the best I can do without doing something drastic to the power supply is to install an NVIDIA GeForce GTX670. Since I only have one GPU, I can only run one MPI process, but this can spawn multiple OpenMP threads. For this particular system, I get the best performance with 4 threads (134 ns/day), which is essentially the same performance I get using all 8 cores without the GPU. So when I am using just a single core, adding the GPU increases the performance by a factor of 3.3. But as I add more cores, the gain from the single GPU shrinks until, at 8 cores, the performance is about the same with or without it.

Now let’s try something bigger. Our lab has a small Intel (Sandy Bridge) computing cluster. Each node has 12 cores, and there are 8 nodes, giving a maximum of 96 cores. Running on the whole cluster would reduce the time to 6 days, which is a lot better but not very fair on everyone else in the lab. We could try to get access to Tier-1 or Tier-0 supercomputers but, for this system, that is overkill. Instead let’s look at a Tier-2 machine that uses GPUs to accelerate the calculations.
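The waiting times quoted above all come from the same piece of arithmetic: divide the target simulation length by the throughput. A minimal Python sketch of that conversion follows. The function names are my own, purely for illustration, and the throughput the local cluster would need is inferred from the "6 days" figure rather than measured.

```python
def days_to_simulate(target_us, rate_ns_per_day):
    """Wall-clock days needed to simulate target_us microseconds."""
    return target_us * 1000.0 / rate_ns_per_day


def rate_for_deadline(target_us, days):
    """Throughput in ns/day needed to finish target_us microseconds in `days`."""
    return target_us * 1000.0 / days


TARGET_US = 5.0  # the 5 µs bilayer simulation discussed in the post

# Throughputs quoted in the post for the Mac Pro.
measured = [
    ("Mac Pro, 8 MPI processes, no GPU", 132.0),
    ("Mac Pro, GTX670 + 4 OpenMP threads", 134.0),
]

for label, rate in measured:
    print(f"{label}: {days_to_simulate(TARGET_US, rate):.1f} days")

# Inferred, not measured: the rate implied by finishing in 6 days.
print(f"96-core cluster needs about {rate_for_deadline(TARGET_US, 6.0):.0f} ns/day")
```

Rounding the first result gives the 38 days quoted for the Mac Pro.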
The University of Oxford, through e-infrastructure South, has access to the machines owned by the Centre for Innovation. One of these, EMERALD, is a GPU-based cluster. We shall look at one of the partitions; this has 60 nodes, each with two 6-core Intel processors and 3 NVIDIA M2090 Tesla GPUs. For comparison, let’s first run without the GPUs. The data shown are for simulations with only 1 OpenMP thread per MPI process. Now let’s run using the GPUs (which is the point of this cluster!). Again using just a single OpenMP thread per MPI process/GPU (shown on graph), we find a performance increase of 3-4x. Since there are 3 GPUs per node, and each node has 12 cores, we could run 3 MPI processes (each attached to a GPU) on each node, and each process could spawn 1, 2, 3 or 4 OpenMP threads. This uses more cores, but since they would probably otherwise be sitting idle, it is a more efficient use of the compute resource. Trying 2 or 3 OpenMP threads per MPI process/GPU lets us reach a maximum performance of 1.77 µs per day, so we would get our 5 µs in less than 3 days. Comparing back to our cluster, we can match the performance of our 96-core local cluster using a total of 9 GPUs and 18 cores on EMERALD. Finally, let’s compare EMERALD to the Tier-1 PRACE supercomputer CURIE. CURIE was the 20th fastest supercomputer in the world in November 2013. For this comparison we will need to use a bigger benchmark, so let’s use the same one as before. It has 9x the number of lipids, but because I had to add extra water it ends up being about 15x bigger at 2.1 million particles. Using 24 GPUs and 72 cores, EMERALD manages 130 ns/day. Getting the same performance on CURIE requires 150 cores, and ultimately CURIE tops out at 1,500 ns/day on 4,196 cores. Still, EMERALD is respectable and shows how it can serve as a useful bridge to Tier-1 and Tier-0 supercomputers.
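The EMERALD node layout described above (one MPI process per GPU, each spawning a few OpenMP threads) is easy to tabulate. Here is a minimal Python sketch under that one-process-per-GPU assumption. The function names are hypothetical, and the only inputs are figures quoted in the post.

```python
def node_layouts(gpus_per_node, cores_per_node):
    """List (threads_per_process, cores_used, cores_idle) for one node,
    assuming exactly one MPI process is attached to each GPU."""
    max_threads = cores_per_node // gpus_per_node
    layouts = []
    for threads in range(1, max_threads + 1):
        used = gpus_per_node * threads
        layouts.append((threads, used, cores_per_node - used))
    return layouts


def total_cores(n_gpus, threads_per_process):
    """Cores occupied by a run using n_gpus MPI processes (one per GPU)."""
    return n_gpus * threads_per_process


# EMERALD partition from the post: 3 GPUs and 12 cores per node.
for threads, used, idle in node_layouts(gpus_per_node=3, cores_per_node=12):
    print(f"{threads} OpenMP thread(s) per process: {used} cores busy, {idle} idle")

# Consistency check against the core counts quoted in the post.
print(total_cores(9, 2))   # 9 GPUs at 2 threads each
print(total_cores(24, 3))  # 24 GPUs at 3 threads each
```

The two totals reproduce the "9 GPUs and 18 cores" and "24 GPUs and 72 cores" runs, which correspond to 2 and 3 OpenMP threads per MPI process respectively.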
Interestingly, CURIE also has a “hybrid” partition that contains 144 nodes, each with 2 Intel Westmere processors and 2 NVIDIA M2090 Tesla GPUs. I was able to run on up to 128 GPUs with 4 OpenMP threads per MPI process/GPU, making a total of 512 cores. This demonstrates that GROMACS can run on large numbers of GPUs and CPUs and that such hybrid architectures are viable as supercomputers (for GROMACS at least).