Back in April 2019, HECBioSim advertised for proposals to use JADE, the new UK Tier-2 GPU high-performance computer. JADE is built around NVIDIA DGX-1s, each of which contains 8 Tesla V100 GPUs. I'd previously run some alchemical free energy calculations on ARCHER, the Tier-1 UK academic supercomputer that has a conventional CPU architecture, thanks to a HECBioSim grant, so I wrote a short proposal asking for 500 GPU hours to answer this question:
Is JADE or ARCHER better for GROMACS alchemical free energy methods?
M. tuberculosis RNA polymerase
The RNAP is a large protein complex that is the target of the rifamycins, an important class of antitubercular drugs. When solvated it has 396,776 atoms, which makes it a good benchmark for HPC. First, though, we have to see how it behaves on JADE.
How to optimally use GROMACS on JADE
Before jumping in, I needed to work out how best to run GROMACS on JADE. All the simulations on JADE used GROMACS 2018.3, as that was the version available in the modules at the time. GROMACS associates each GPU with a single MPI process; however, each MPI process can run one or more OpenMP threads, and you can also "overload" a GPU by assigning it to multiple MPI processes. I shall only consider varying the number of OpenMP threads here.
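To make this concrete, a single-node launch using GROMACS's built-in thread-MPI might look like the sketch below. The module name and the input file `rnap.tpr` are placeholders, not the actual files used here; check `module avail` on your system for the real module string.

```shell
# Hypothetical module name; site-specific.
module load gromacs/2018.3

# 8 thread-MPI ranks, one per V100, each running 4 OpenMP threads.
# -gpu_id lists the GPU ids available to the PP ranks, in rank order.
gmx mdrun -deffnm rnap \
          -ntmpi 8 \
          -ntomp 4 \
          -gpu_id 01234567
```

With `-ntmpi 8 -ntomp 4` all 32 CPU cores of the DGX-1 are in use; dropping `-ntomp` to 1 leaves most of them idle.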
In principle, adding more OpenMP threads to each MPI process should increase the performance until the GPU becomes the limiting factor. Due to the way the machine and queues were configured, we could only run on a single node and so could only use 1, 4, or 8 GPUs, each with 1, 2, or 4 OpenMP threads. The performance, as measured by the number of nanoseconds simulated per day, is shown in this graph.
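A benchmark sweep over these nine configurations can be scripted along the following lines. This is a sketch: `rnap.tpr` is a placeholder input, and `-nsteps`/`-resethway` simply keep each run short and reset the performance counters halfway so the reported ns/day excludes start-up costs.

```shell
# Sweep (number of GPUs) x (OpenMP threads per GPU); GROMACS prints a
# "Performance" line (ns/day) at the end of each log file.
for ngpu in 1 4 8; do
  for nt in 1 2 4; do
    # Build the GPU id string, e.g. "0123" for 4 GPUs.
    gpu_ids=$(printf '%s' $(seq 0 $((ngpu - 1))))
    gmx mdrun -s rnap.tpr -deffnm bench_${ngpu}g_${nt}t \
              -ntmpi ${ngpu} -ntomp ${nt} -gpu_id ${gpu_ids} \
              -nsteps 50000 -resethway
  done
done

# Collect the results.
grep -H "Performance" bench_*.log
</n```

Each run writes its own log (`bench_8g_4t.log` and so on), so the ns/day figures for the whole grid can be pulled out with a single `grep`.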
What we see is that, as expected, using the V100 GPUs gives a dramatic boost (5-12x) to performance and that, on the whole, assigning more OpenMP threads to each GPU improves speed, though not linearly. That said, the CPU cores would otherwise sit idle, so one might as well use them. In the comparison with ARCHER that follows, we will therefore use 1 and 4 threads per GPU.
Comparing vanilla GROMACS on JADE and ARCHER
A note of caution: this is not quite a like-with-like comparison since, although the ARCHER benchmarks used exactly the same protein input file, an older version of GROMACS (2016.3) was used. We do not expect this to make a large difference to the raw CPU speed of the code, however.
What we see is that the V100 GPUs within the DGX-1 give GROMACS a huge performance boost over ARCHER (note that the graph on the right is on a log-log scale to show the shifts in performance). To put it another way, we get a bit over 20 ns/day for the RNAP using a single node of JADE (an entire DGX-1 with its 32 CPU cores). To get the same performance on ARCHER requires about 168 cores, i.e. 7 whole nodes. What I don't know, but would be interesting, is the relative size, cost, and power consumption of a node in each system.
So, how fast are alchemical free energy calculations on JADE?
Finally, we are in a position to answer the original question. To simplify the calculations we chose the number of lambda windows in the thermodynamic integration to be 8, which matches the number of GPUs in a DGX-1. On JADE, therefore, we are always using all 8 GPUs; the only difference is how many OpenMP threads each lambda simulation is allowed to use.
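One way to run all 8 lambda windows concurrently on a single DGX-1 is GROMACS's `-multidir` mechanism, which needs an MPI build (`gmx_mpi`). The sketch below assumes directories `lambda_0` to `lambda_7`, each containing a `topol.tpr` prepared with a different `init-lambda-state`; the directory layout and launcher are illustrative, not the exact setup used here.

```shell
# 8 MPI ranks, one independent lambda window per rank. Each window is
# pinned to its own V100 via -gpu_id and runs 4 OpenMP threads, using
# all 32 CPU cores of the node.
mpirun -np 8 gmx_mpi mdrun -multidir lambda_{0..7} \
       -ntomp 4 -gpu_id 01234567
```

Each window then writes its `dhdl.xvg` into its own directory, ready for the thermodynamic integration analysis.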
Unsurprisingly, we find again that the DGX-1 gives a huge performance boost over ARCHER, as shown by the increase in gradient in the graph on the left. If we run a single alchemical free energy transition using 8 lambda simulations on a single DGX-1 (i.e. each lambda simulation uses a single V100 GPU with 4 cores), then we get about 1.6 ns/day. To get the same performance on ARCHER would require about 32 cores per lambda simulation (i.e. 256 cores in total, which is 10.7 nodes). In practice, of course, one would use either 1 or 2 nodes per lambda, equivalent to 24 or 48 cores per lambda. Either way, we again see a speedup of around 8-fold on JADE over ARCHER!