GROMACS 4.6: Scaling of a very large coarse-grained system
Philip Fowler, 23rd October 2013

So if I have a particular system I want to simulate, how many processing cores can I harness to run a single GROMACS version 4.6 job? If I use only a few, the simulation will take a long time to finish; if I use too many, the cores will end up waiting for communications from other cores, so the simulation will be inefficient (and also take a long time to finish). In between is a regime where the code, in this case GROMACS, scales well. Ideally, of course, you'd like linear scaling, i.e. running on 100 cores in parallel is 100x faster than running on just one. The rule of thumb for GROMACS 4.6 is that it scales well until there are only ~130 atoms/core. In other words, the more atoms or beads in your system, the larger the number of computing cores you can use before the scaling performance starts to degrade.

As you might imagine, there is a hierarchy of computers we can run our simulations on; it starts at humble workstations, passes through departmental, university and regional computing clusters, and ends at national (Tier 1) and international (Tier 0) high-performance computers (HPC). In our lab we applied for, and were granted access to, a set of European Tier 0 supercomputers through PRACE. These are currently amongst the fastest and largest supercomputers in the world.

We tested five supercomputers in all: CURIE (Paris, France; green lines), MareNostrum (Barcelona, Spain; black line), FERMI (Bologna, Italy; lilac), SuperMUC (Munich, Germany; blue) and HERMIT (Stuttgart, Germany; red). Each has a different architecture, and inevitably some are slightly newer than others. CURIE has three different partitions, called thin, fat and hybrid. The thin nodes constitute the bulk of the system; the fat nodes have more cores per node, whilst the hybrid nodes combine conventional CPUs with GPUs.

We tested a coarse-grained 54,000-lipid bilayer (2.1 million MARTINI beads) on all seven architectures, and the performance is shown in the graph (note that the axes are logarithmic). Some machines did better than others. FERMI, which is an IBM BlueGene/Q, appears not to be well suited to our benchmark system, but then one doesn't expect fast per-core performance on a BlueGene, as that is not how they are designed. Of the others, MareNostrum was fastest for small numbers of cores, but its performance began to suffer if more than 256 cores were used. SuperMUC and the CURIE thin nodes were the fastest conventional supercomputers, with the CURIE thin nodes performing better at large core counts. Interestingly, the CURIE hybrid GPU nodes were very fast, especially bearing in mind that the CPUs on these nodes are older and slower than those in the thin nodes.

One innovation introduced in GROMACS 4.6 that I haven't discussed previously is that you can now run either purely with MPI processes or with a combination of MPI processes and OpenMP threads (a sketch of how the two launch modes differ is given at the end of this post). We were somewhat surprised to find that, in nearly all cases, the pure MPI approach remained slightly faster than the new hybrid parallelisation.

Of course, you may see very different performance with your own system and GROMACS 4.6. You just have to try it and see what you get! In the next post I will show some detailed results on using GROMACS on GPUs.
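To make the pure-MPI versus hybrid comparison concrete, here is a minimal sketch of the two launch modes for a GROMACS 4.6 run. The MPI launcher (mpirun), the binary name (mdrun_mpi), the core counts and the run name (bilayer) are all assumptions that depend on how GROMACS was built and on your cluster's batch system; -ntomp is the GROMACS 4.6 option setting the number of OpenMP threads per MPI rank.

# Pure MPI: one rank per core (here 512 cores; "bilayer" is a hypothetical run name)
mpirun -np 512 mdrun_mpi -deffnm bilayer

# Hybrid MPI/OpenMP: 128 ranks x 4 OpenMP threads, still 512 cores in total
mpirun -np 128 mdrun_mpi -ntomp 4 -deffnm bilayer

In both cases the product of MPI ranks and OpenMP threads should match the number of cores you have been allocated. Applying the ~130 beads/core rule of thumb to the 2.1-million-bead bilayer suggests it should have headroom up to roughly 16,000 cores before scaling degrades.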