Accelerating Oxford Nanopore basecalling

Philip Fowler, 26th January 2017 (updated 5th August 2018)

It looks innocuous sitting on the desk, an Oxford Nanopore MinION, but it can produce a huge amount of data from a single sequencing run. Since the MinION works by inferring which base is in each pore from how much it reduces the flow of ions (and hence current) through the pore, the raw data is commonly called “squiggles”. Converting each of these squiggles to a sequence of nucleotides is “base-calling”. Ideally, you want to do this in real time, i.e. as the squiggles are being produced by the MinION. Interestingly, this is becoming a challenge for the molecular biologists and bioinformaticians in our group, since the flow of data is now high enough that a high-spec laptop struggles to cope.

It may be becoming obvious that this is not my field of expertise, and you’d be right, but I do know something about speeding up computational jobs through the use of GPUs and/or computing clusters. We appear to have two main use-cases for base-calling the squiggles. I’m only going to talk about nanonetcall.

1. Live basecalling.

Nick Sanderson is leading the charge here in our group (that is his hand in the photo above). He has built a workstation with a GPU and an SSD and was playing around with the ability of nanonetcall to use multiple threads.

The folder sample-data/ contains a large number of squiggle files (fast5 files). Our base case is a single process with one OpenMP thread, which by definition has a speedup of 1.0x.

    OMP_NUM_THREADS=1 nanonetcall sample-data/ --jobs 1 > /dev/null

Let’s first try using 2 OpenMP threads and a single process.

    OMP_NUM_THREADS=2 nanonetcall sample-data/ --jobs 1 > /dev/null

This makes no difference whatsoever. As with running GROMACS simulations, OpenMP doesn’t help much, in my hands at least.
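Every speedup quoted in this post is the wall-clock time of the base case divided by the wall-clock time of the run in question. A minimal sketch of how such numbers could be measured, assuming Python 3 is to hand; the helper names `time_command` and `speedup` are hypothetical and are not part of nanonetcall:

```python
import os
import subprocess
import time


def time_command(cmd, extra_env=None):
    """Run cmd (a list of arguments) and return its wall-clock time in seconds."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    # Discard stdout, as the commands in the post do with '> /dev/null'.
    subprocess.run(cmd, env=env, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def speedup(baseline_seconds, run_seconds):
    """Speedup relative to the base case; the base case itself scores 1.0."""
    return baseline_seconds / run_seconds


# Illustrative usage with the two commands above (requires nanonetcall):
#   cmd = ["nanonetcall", "sample-data/", "--jobs", "1"]
#   base = time_command(cmd, {"OMP_NUM_THREADS": "1"})
#   two_threads = time_command(cmd, {"OMP_NUM_THREADS": "2"})
#   print(f"{speedup(base, two_threads):.1f}x")
```

Timing the whole command, rather than instrumenting nanonetcall itself, keeps the comparison fair when the same files are later processed by a different launcher.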
Let’s rule out using additional OpenMP threads and simply increase the number of jobs run in parallel:

    OMP_NUM_THREADS=1 nanonetcall sample-data/ --jobs 2 > /dev/null
    OMP_NUM_THREADS=1 nanonetcall sample-data/ --jobs 4 > /dev/null
    OMP_NUM_THREADS=1 nanonetcall sample-data/ --jobs 8 > /dev/null
    OMP_NUM_THREADS=1 nanonetcall sample-data/ --jobs 12 > /dev/null

These lead to speedups of 1.8x, 3.0x, 3.2x and 3.2x respectively. These were all run on my early-2013 MacPro, which has a single 6-core 3.5 GHz Intel Xeon CPU, so can run up to 12 processes with hyperthreading. I don’t know exactly how nanonetcall is parallelised, but since at least a good chunk of it is Python, it is no surprise that it doesn’t scale that well: Python code often struggles to scale across cores, not least because of the global interpreter lock (GIL).

Now, parts of nanonetcall are cleverly written using OpenCL so it can make use of a GPU if one is available. My MacPro has two AMD FirePro D300 cards. These are good graphics cards, but I would have chosen better GPUs if I could. Even so, using a single GPU gives a speedup of 3.3x.

    nanonetcall sample-data/ --platforms Apple:0:1 Apple:1:1 --exc_opencl > /dev/null

I suggested we try one of my favourite apparently-little-known unix tools, GNU Parallel. This is a simple but truly awesome command that you can install via a package manager like apt-get on Ubuntu or MacPorts on a Mac. The simplest way to use it is in a pipe like this:

    find sample-data/ -name '*.fast5' | parallel -j 12 nanonetcall {} > /dev/null

This needs explaining. The find command will produce a long list of all the fast5 files. Parallel then consumes these, and will first launch 12 nanonetcall jobs, each running on a single core. As soon as one of these finishes, parallel will launch another nanonetcall job to process the next fast5 in the list.
In this way parallel will ensure that there are 12 nanonetcall jobs running at all times and we rapidly work our way through the squiggles. This results in a speedup of 4.8x, so not linear, but certainly better than trying to use the ‘internal’ parallelisation of nanonetcall.

But we can do better because we can use parallel to overload the GPU too.

    find sample-data/ -name '*.fast5' | parallel nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

This yields our fastest speedup so far, 5.7x. But perhaps the GPU is getting too overloaded, so let’s try sharing the load between the two GPUs:

    find sample-data-1/ -name '*.fast5' | parallel -j 6 nanonetcall {} --platforms Apple:1:1 --exc_opencl > /dev/null &
    find sample-data-2/ -name '*.fast5' | parallel -j 6 nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

where I’ve split the data equally into two folders. Sure enough, this now gives a speedup of 9.9x. Now, remember there are only 12 virtual cores, so if we try running more processes, we should start to see a performance penalty, but let’s try!

    find sample-data-1/ -name '*.fast5' | parallel -j 12 nanonetcall {} --platforms Apple:1:1 --exc_opencl > /dev/null &
    find sample-data-2/ -name '*.fast5' | parallel -j 12 nanonetcall {} --platforms Apple:0:1 --exc_opencl > /dev/null

Unexpectedly, this ekes out a bit more speed at 10.9x! So by ignoring the inherently poor scalability built in to nanonetcall and using GNU Parallel in harness with two GPUs, we have managed to speed up the base-calling by a factor of nearly eleven. I expect Nick to manage even higher speedups using more powerful GPUs.

2. Batch basecalling.

I sit in the same office as two of our bioinformaticians and, even with a good setup for live basecalling, it sounds like there are still occasions when they need to basecall a large dataset (~50GB) of squiggle files.
This might be because part of the live basecalling process failed, or because the MinION wrote files to a different folder following some software update, or perhaps you simply want to compare several different pieces of basecalling software, or even just compare across versions. You want to load the data onto some “computational resource”, press go and have it chew its way through the squiggle files as quickly and reliably as possible.

There are clearly many ways to do this; here I am going to look at a simple Linux computing cluster with a head node and a large number of slave nodes. Jobs are distributed across the slave nodes using a batch scheduler and queuing system (I use SLURM).

Hang Phan, another bioinformatician, had a large dataset of squiggles that needed 2D base calling and she wanted to try nanonetcall. To demonstrate the idea, I simply installed nanonetcall on my venerable but perfectly serviceable cluster of Apple Xserves. Then it is just a matter of rsync’ing the data over and writing a simple bit of Python to (a) group together sets of fast5 files (I chose groups of 50), (b) create a SLURM job submission file for each group and finally (c) submit each job to the queue.

The advantages of this well-established “bare metal” approach are that

- it is truly scalable: if you want to speed up the process, you add more nodes
- it is reliable: my Ubuntu/SLURM/NFS cluster wasn’t switched off for over half a year (and this isn’t unusual)
- you can walk away and let the scheduler handle what runs when

As you can see from my other posts, I am a fan of cloud and distributed computing, but in this case a good old-fashioned computing cluster (preferably one with GPUs!) fits the bill nicely.
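Steps (a) to (c) of the batch approach could look something like the sketch below. It is a minimal illustration and not the original script: the function names, the layout of the job file, the choice of `#SBATCH` options and the bare `nanonetcall <file>` invocation are all assumptions, and step (c) assumes `sbatch` is on the PATH.

```python
from pathlib import Path
import subprocess


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_job_script(group, index, out_dir):
    """Write one SLURM submission file that basecalls every file in `group`."""
    script = Path(out_dir) / f"basecall_{index:05d}.sh"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=basecall_{index:05d}",
        "#SBATCH --ntasks=1",
        "",
    ]
    # One nanonetcall call per fast5 file (assumed command line); stdout is
    # left alone, so it lands in SLURM's default slurm-<jobid>.out file.
    lines += [f'nanonetcall "{path}"' for path in group]
    script.write_text("\n".join(lines) + "\n")
    return script


def make_jobs(data_dir, out_dir, group_size=50, submit=False):
    """(a) group the fast5 files, (b) write a job file per group, (c) optionally submit."""
    files = sorted(Path(data_dir).rglob("*.fast5"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    scripts = []
    for index, group in enumerate(chunk(files, group_size)):
        script = write_job_script(group, index, out_dir)
        if submit:
            subprocess.run(["sbatch", str(script)], check=True)
        scripts.append(script)
    return scripts
```

Calling `make_jobs("sample-data/", "jobs/")` only writes the job files, which makes it easy to inspect one before passing `submit=True`. Smaller groups give the scheduler more freedom to balance the nodes, at the cost of more jobs in the queue.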
When you overload your GPUs, what is your CPU doing? Sitting idle twiddling thumbs? If so, why not run jobs on both GPU and CPU:

    my_nano() {
      case $1 in
        0) nanonetcall "$2" --platforms Apple:$1:1 --exc_opencl ;;
        1) nanonetcall "$2" --platforms Apple:$1:1 --exc_opencl ;;
        2) nanonetcall "$2" ;;
      esac
    }
    export -f my_nano
    find sample-data/ -name '*fast5' | parallel -j36 my_nano '{= $_=$job->slot()%3 =}' {}

GNU Parallel version 20140822 or later is needed for the {= =} construct.

This will read jobs from a single folder and start 12 jobs on each GPU and 12 on the CPU. Every 3rd jobslot will be spawned on GPU1, GPU2, and CPU respectively. By using slot() we make sure that if a GPU1 job finishes, then another job will be started on GPU1, so if GPU1 is faster than the CPU, then it should get more work. This should distribute the jobs almost optimally: the worst case is that the very last job is spawned on the slowest worker.

(Your CMS screws up single quotes, and double dashes in comments).
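To see why tying each job slot to a fixed device hands more work to the faster device, here is a toy simulation of that greedy dispatch. It is only a sketch: each device is modelled as a single slot rather than twelve, and the per-job times are invented for illustration, not measurements.

```python
import heapq


def simulate(n_jobs, seconds_per_job):
    """Greedy dispatch: each slot takes the next job as soon as it is free.

    seconds_per_job maps a slot name to the (made-up) time one job takes there.
    Returns the number of jobs each slot completed and the total wall-clock time.
    """
    # Heap of (time the slot becomes free, slot name); all slots start idle.
    free_at = [(0.0, name) for name in sorted(seconds_per_job)]
    heapq.heapify(free_at)
    done = {name: 0 for name in seconds_per_job}
    finish = 0.0
    for _ in range(n_jobs):
        now, name = heapq.heappop(free_at)  # the slot that frees up first
        end = now + seconds_per_job[name]
        done[name] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, name))
    return done, finish


# Hypothetical timings: each GPU takes 1 s per file, the CPU takes 4 s.
counts, total = simulate(1000, {"GPU1": 1.0, "GPU2": 1.0, "CPU": 4.0})
print(counts, total)
```

With these invented timings each GPU ends up with roughly four times as many files as the CPU, and the total time is longer than the ideal split only by at most one job on the slowest worker.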