GROMACS2018 on NVIDIA DGX-1s

Back in April 2019, HECBioSim invited proposals to use JADE, the new Tier-2 UK GPU high performance computer. JADE is built around NVIDIA DGX-1s, each of which contains 8 Tesla V100 GPUs. Thanks to a HECBioSim grant, I’d previously run some alchemical free energy calculations on ARCHER, the Tier-1 UK academic supercomputer, which has a conventional architecture, so I wrote a short proposal asking for 500 GPU hours to answer this question:

Is JADE or ARCHER better for GROMACS alchemical free energy methods?

M. tuberculosis RNA polymerase

Figure 1. The M. tuberculosis RNA polymerase. The different protein subunits in the complex are coloured different shades of blue, whilst rifampicin is in yellow and the DNA is in grey. When placed in a cuboid and solvated, the system contains 396,776 atoms.

The RNA polymerase (RNAP) is a large protein complex that is the target of the rifamycins, an important class of antitubercular drugs. When solvated it has 396,776 atoms, so it is a good benchmark for HPC. First, though, we have to see how it behaves on JADE.

How to optimally use GROMACS on JADE

Before jumping in, I needed to work out how to run GROMACS optimally on JADE. All the simulations on JADE used `GROMACS2018.3`, as that was what was available in modules at the time. GROMACS associates each GPU with a single MPI process. However, each MPI process can run 1 or more OpenMP threads, and you can also “overload” the GPU by assigning a single GPU to multiple MPI processes. I shall only consider the former here.

In principle, adding more OpenMP threads to the MPI process should increase the performance until the GPU becomes the limiting factor. Due to the way the machine and queues were configured, we could only run on a single node and so only use 1, 4 or 8 GPUs, each with 1, 2 or 4 OpenMP threads. The performance, as measured by the number of nanoseconds simulated per day, is shown in this graph.
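To make the benchmark grid concrete, here is a minimal Python sketch that enumerates the nine GPU/thread combinations described above and the number of CPU cores each one occupies. The command line it builds is an assumption on my part: `-ntmpi`, `-ntomp` and `-deffnm` are standard `gmx mdrun` options, but the runs on JADE may well have used an MPI build launched differently, and the `rnap` file prefix is purely hypothetical.

```python
from itertools import product

GPU_COUNTS = (1, 4, 8)     # one MPI process per GPU, as described in the post
OMP_THREADS = (1, 2, 4)    # OpenMP threads per MPI process


def mdrun_command(n_gpus, n_threads, deffnm="rnap"):
    """Build a thread-MPI style mdrun command line.

    Assumption: a thread-MPI build of GROMACS; 'rnap' is a hypothetical
    file prefix, not the name used in the original benchmarks.
    """
    return f"gmx mdrun -ntmpi {n_gpus} -ntomp {n_threads} -deffnm {deffnm}"


def benchmark_grid():
    """Return one record per (GPUs, OpenMP threads) combination."""
    grid = []
    for n_gpus, n_threads in product(GPU_COUNTS, OMP_THREADS):
        grid.append({
            "gpus": n_gpus,
            "threads_per_gpu": n_threads,
            "cpu_cores": n_gpus * n_threads,  # cores kept busy by this run
            "command": mdrun_command(n_gpus, n_threads),
        })
    return grid


if __name__ == "__main__":
    for run in benchmark_grid():
        print(f"{run['gpus']} GPU(s) x {run['threads_per_gpu']} thread(s) "
              f"= {run['cpu_cores']} cores: {run['command']}")
```

The largest configuration, 8 GPUs with 4 threads each, occupies 32 CPU cores.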

Figure 2. Raw GROMACS performance in nanoseconds/day when running the RNAP (396k atoms) on a single DGX-1 node of JADE. The number of GPUs (and hence MPI processes) and OpenMP threads are varied.

What we see is that, as expected, using the V100 GPUs gives a dramatic boost (5-12x) to the performance and that, on the whole, assigning more OpenMP threads to each GPU improves speed, but not linearly. That said, the CPU cores would otherwise sit idle, so one might as well use them. In the following comparison, we will therefore compare JADE, using 1 and 4 threads per GPU, with ARCHER.

Comparing vanilla GROMACS on JADE and ARCHER

A note of caution: this is not quite like-with-like since although the ARCHER benchmarks used exactly the same protein input file, an older version of GROMACS (2016.3) was used. We do not expect this to make a large difference to the raw CPU speed of the code, however.

Figure 3. Comparing the raw performance of GROMACS on JADE, using a DGX-1, and ARCHER, which has a conventional CPU-based architecture. Note that the versions of GROMACS used are different, but in both cases the same system was used.

What we see is that the V100 GPUs within the DGX-1 give GROMACS a huge performance boost over ARCHER (note that the graph on the right is on a log-log scale to show the shifts in performance). To put it another way, we can get a bit over 20 ns/day of the RNAP using a single node of JADE (an entire DGX-1 and 32 cores). To get the same performance on ARCHER requires about 168 cores, i.e. 7 whole nodes. What I don’t know, but would be interesting to find out, is the relative size, cost and power consumption of a node in each system.

So, how fast are alchemical free energy calculations on JADE?

Finally, we are in a position to answer the original question. To simplify the calculations we choose the number of lambda simulations in the thermodynamic integration to be 8, which matches the hardware. On JADE therefore, we are always using all 8 GPUs; the only difference is how many OpenMP threads we allow each lambda simulation to use.
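As a rough illustration of this setup, the sketch below generates 8 lambda values and pairs each with one GPU. It assumes the values are evenly spaced (as the Figure 4 caption says) and that they run from 0 to 1 inclusive, which is my assumption; the function names are hypothetical and not part of GROMACS.

```python
N_LAMBDAS = 8  # number of lambda simulations chosen in the post
N_GPUS = 8     # GPUs in a single DGX-1


def evenly_spaced_lambdas(n):
    """Return n lambda values evenly spaced on [0, 1], endpoints included.

    Assumption: the schedule spans the full interval from 0 to 1.
    """
    if n < 2:
        raise ValueError("need at least two lambda values")
    return [i / (n - 1) for i in range(n)]


def assign_gpus(lambdas, n_gpus):
    """Map GPU index -> lambda value, one lambda simulation per GPU."""
    if len(lambdas) != n_gpus:
        raise ValueError("this sketch expects exactly one lambda per GPU")
    return dict(enumerate(lambdas))


if __name__ == "__main__":
    schedule = assign_gpus(evenly_spaced_lambdas(N_LAMBDAS), N_GPUS)
    for gpu_id, lam in schedule.items():
        print(f"GPU {gpu_id}: lambda = {lam:.3f}")
```

Choosing 8 lambdas means every GPU on the node is busy with exactly one simulation, so no GPU sits idle and none is shared.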

Unsurprisingly, we find again that the DGX-1 gives a huge performance boost over ARCHER, as shown by the increase in gradient in the graph on the left. If we run a single alchemical free energy transition using 8 lambda simulations on a single DGX-1 (i.e. each lambda simulation uses a single V100 GPU with 4 cores) then we get about 1.6 ns/day. To get the same performance on ARCHER would require about 32 cores per lambda simulation (i.e. 256 cores, which is 10.7 nodes). In practice, of course, one would use either 1 or 2 nodes per lambda, equivalent to 24 or 48 cores per lambda. Either way, we again have a speedup on JADE over ARCHER of 8-fold!
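The core and node counts quoted above are simple arithmetic, sketched here in Python. The figure of 24 cores per ARCHER node is taken from the numbers in this post (1 node per lambda is equivalent to 24 cores); the variable and function names are just for illustration.

```python
import math

ARCHER_CORES_PER_NODE = 24  # implied by the post: 1 node = 24 cores


def archer_nodes(cores):
    """Convert a core count into a (possibly fractional) number of ARCHER nodes."""
    return cores / ARCHER_CORES_PER_NODE


n_lambdas = 8
cores_per_lambda = 32  # ARCHER cores needed to match one V100 + 4 cores

total_cores = n_lambdas * cores_per_lambda      # 8 * 32 = 256 cores
fractional_nodes = archer_nodes(total_cores)    # 256 / 24, about 10.7 nodes
whole_nodes = math.ceil(fractional_nodes)       # nodes are allocated whole

# Practical allocations: 1 or 2 whole nodes per lambda simulation.
practical = {nodes: nodes * n_lambdas for nodes in (1, 2)}

if __name__ == "__main__":
    print(f"{total_cores} cores = {fractional_nodes:.1f} ARCHER nodes "
          f"(rounded up: {whole_nodes})")
    for per_lambda, total in practical.items():
        print(f"{per_lambda} node(s) per lambda -> {total} nodes in total")
```

The same helper reproduces the earlier vanilla comparison: 168 cores corresponds to exactly 7 ARCHER nodes.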

Figure 4. Comparing the performance of JADE and ARCHER when running an alchemical free energy calculation using GROMACS. Note that different versions of the code were used on each platform and 8 lambda simulations, evenly spaced, were used in both cases.
