GROMACS 4.6: Scaling of a very large coarse-grained system

Philip Fowler, 23rd October 2013

So if I have a particular system I want to simulate, how many processing cores can I harness to run a single GROMACS version 4.6 job? If I only use a few then the simulation will take a long time to finish; if I use too many, the cores will end up waiting for communications from other cores and so the simulation will be inefficient (and also take a long time to finish). In between is a regime where the code, in this case GROMACS, scales well. Ideally, of course, you’d like linear scaling, i.e. if I run on 100 cores in parallel it is 100x faster than if I ran on just one.

The rule of thumb for GROMACS 4.6 is that it scales well until there are only ~130 atoms per core. In other words, the more atoms or beads in your system, the larger the number of computing cores you can run on before the scaling performance starts to degrade.

As you might imagine, there is a hierarchy of computers we can run our simulations on; this starts at humble workstations, passes through departmental, university and regional computing clusters before ending up at national (Tier 1) and international (Tier 0) high performance computers (HPC). In our lab we applied for, and got access to, a set of European Tier 0 supercomputers through PRACE. These are currently amongst the fastest and largest supercomputers in the world.

We tested five supercomputers in all: CURIE (Paris, France; green lines), MareNostrum (Barcelona, Spain; black line), FERMI (Bologna, Italy; lilac), SuperMUC (Munich, Germany; blue) and HERMIT (Stuttgart, Germany; red). Each has a different architecture and inevitably some are slightly newer than others. CURIE has three different partitions, called thin, fat and hybrid. The thin nodes constitute the bulk of the system; the fat nodes have more cores per node, whilst the hybrid nodes combine conventional CPUs with GPUs.
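To put a number on what "linear scaling" means, here is a minimal Python sketch of parallel efficiency: the measured simulation rate divided by the rate you would get if scaling from a reference run were perfectly linear. The function names are my own, and the timings in the example are made up purely for illustration; they are not taken from the benchmarks in this post.

```python
def ideal_rate(ref_cores, ref_rate, cores):
    """Rate (e.g. ns/day) expected on `cores` if scaling from the
    reference run were perfectly linear."""
    return ref_rate * cores / ref_cores


def parallel_efficiency(ref_cores, ref_rate, cores, rate):
    """Measured rate as a fraction of the ideal linear-scaling rate
    (1.0 means perfect linear scaling)."""
    return rate / ideal_rate(ref_cores, ref_rate, cores)


# Made-up illustrative numbers, NOT measurements from these benchmarks:
# 10 ns/day on 64 cores, then 18 ns/day on 128 cores.
eff = parallel_efficiency(ref_cores=64, ref_rate=10.0, cores=128, rate=18.0)
print(f"parallel efficiency: {eff:.0%}")
```

With these invented numbers, doubling the cores gives 1.8x the speed, i.e. 90% efficiency. Where you draw the line for "scales well" is a judgement call.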
We tested a coarse-grained 54,000 lipid bilayer (2.1 million MARTINI beads) on all seven different architectures (the five machines, counting CURIE’s three partitions separately) and the performance is shown in the graph; note that the axes are logarithmic. Some machines did better than others; FERMI, which is an IBM BlueGene/Q, appears not to be well-suited to our benchmark system, but then one doesn’t expect fast per-core performance on a BlueGene as that is not how they are designed. Of the others, MareNostrum was fastest for small numbers of cores, but its performance began to suffer if more than 256 cores were used. SuperMUC and the CURIE thin nodes were the fastest conventional supercomputers, with the CURIE thin nodes performing better at large core counts. Interestingly, the CURIE hybrid GPU nodes were very fast, especially bearing in mind that the CPUs on these nodes are older and slower than those in the thin nodes.

One innovation introduced in GROMACS 4.6 that I haven’t discussed previously is that one can now run using either purely MPI processes or a combination of MPI processes and OpenMP threads. We were somewhat surprised to find that, in nearly all cases, the pure MPI approach remained slightly faster than the new hybrid parallelisation.

Of course, you may see very different performance using your system with GROMACS 4.6. You just have to try and see what you get! In the next post I will show some detailed results on using GROMACS on GPUs.
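As a postscript, the ~130 atoms/core rule of thumb from the start of the post can be applied to this benchmark system. The sketch below is plain arithmetic in Python. The function names are hypothetical, and I am assuming the rule carries over unchanged from atoms to coarse-grained beads, so treat the result as a rough guide rather than a prediction.

```python
ATOMS_PER_CORE = 130   # rule of thumb quoted in the post for GROMACS 4.6
N_BEADS = 2_100_000    # approximate size of the benchmark bilayer


def max_cores(n_particles, limit=ATOMS_PER_CORE):
    """Largest core count that still leaves at least `limit` particles per core."""
    return n_particles // limit


def particles_per_core(n_particles, n_cores):
    """Average number of particles each core is responsible for."""
    return n_particles / n_cores


print(max_cores(N_BEADS))                       # core count at the rule-of-thumb limit
print(round(particles_per_core(N_BEADS, 256)))  # beads per core on 256 cores
```

For 2.1 million beads this puts the limit at roughly 16,000 cores. At 256 cores each core still has over 8,000 beads, far above the threshold, so the rule of thumb on its own does not account for a machine tailing off at that core count.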