Numpy v Biopython

Philip Fowler, 25th July 2019 (updated 4th August 2019)

Having recently had to write bioinformatics Python code that, for example, interrogates GenBank files to find the sequences of specific genes, I've learnt a bit of Biopython. I've always wondered why (and I could be wrong) the bioinformatics community doesn't make more use of numpy. For example, the Seq class in Biopython appears to be built on the str built-in type, yet representing the genetic sequence as a numpy array of single characters would not only be faster, but would allow some analyses to be done much more simply (more on this later).

Much of the Python code I've written for the CRyPTIC project relies on Biopython and has always seemed a touch slow, but I was astonished to see just how much I could speed up certain operations by switching from Biopython to numpy internally at the first opportunity. All of the below was done in JupyterLab, using the %timeit magic to repeat each operation and so give an accurate timing.

First, let's load and parse a GenBank file for the M. tuberculosis reference genome, H37Rv:

[1]: from Bio import SeqIO

[2]: %timeit reference_genome=SeqIO.read("H37rV_v3.gbk",'genbank')
466 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not too shoddy: about 0.5 s to read in 4.4 million bases and parse the metadata for 4,000-odd genes. Now let's say we want to retrieve the genetic sequence of my current favourite M. tuberculosis gene, rpoB, by slicing the reference_genome object.

[3]: %timeit reference_genome[759807:763325]
12.4 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That doesn't look too bad, does it? But what if instead we access the seq attribute directly and slice that?

[4]: %timeit reference_genome.seq[759807:763325]
1.83 µs ± 92.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Woah, suddenly we get a 6,775x speedup! Could we speed things up even more with numpy? First, though, we have to spend some time storing the whole genome as a numpy array.

[5]: import numpy

[6]: %timeit numpy_genome=numpy.array(reference_genome.seq)
7.26 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ouch: at 7.3 s this is over 15x slower than simply parsing the genome with Biopython, so we'd have to do a lot of slicing to win that time back! Is there a better way? What if we convert the sequence to a list first?

[7]: %timeit numpy_genome=numpy.array(list(reference_genome.seq))
2.08 s ± 96.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A bit better, but still longer than reading the file in the first place. What about a list comprehension over the string? They are nicely Pythonic.

[8]: %timeit numpy_genome=numpy.array(numpy.array([i for i in str(reference_genome.seq)]))
426 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Much better: about the same time as reading the GenBank file in the first place, which makes some sense. Back to our original question: does this now speed up slicing?

[9]: %timeit numpy_genome[759807:763325]
252 ns ± 8.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Yes! Another 7x speedup over slicing the Seq object, which makes it almost 50,000x faster than slicing the Biopython SeqRecord directly. We have, however, roughly doubled the time taken to create our "genome" object, so, assuming we had already noticed that accessing the seq attribute is faster, we would need to do around 270,000 slice operations before the additional setup time is recovered.
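Pulling the steps above together, here is a minimal sketch of a helper that goes straight from a GenBank file to a numpy array of bases (the function name load_genome_as_array and the final slicing example are my own illustration; the calls to SeqIO.read and numpy.array are exactly as used above):

import numpy
from Bio import SeqIO

def load_genome_as_array(filename):
    # parse the GenBank file into a SeqRecord, then build the array
    # via a list comprehension over the string, which was the fastest
    # conversion found above
    record = SeqIO.read(filename, 'genbank')
    return numpy.array([base for base in str(record.seq)])

# e.g. slice rpoB out of the H37Rv reference
numpy_genome = load_genome_as_array("H37rV_v3.gbk")
rpoB = numpy_genome[759807:763325]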
Once a sequence is stored as a numpy array you can then do some neat things. Say we want to count the number of adenine bases. This is simply:

[10]: numpy.sum(numpy_genome=='A')
758552

In summary:

- Never slice reference_genome directly; always access reference_genome.seq.
- If you are going to fish in the genome multiple times, convert it to a numpy array as soon as possible (the sketch at the end of this post shows the kind of analysis this unlocks).

Now, I'm not involved in either project, but it would seem that numpy, which was developed from Numeric, matured after Biopython, and hence the Biopython developers were not able to take advantage of the optimisation and clever design that has gone into numpy. It also means there is an opportunity for someone to create a next-generation Python 3 bioinformatics package that builds on numpy, and perhaps also numba and pandas, that is not only (much) faster but more Pythonic and easier to use.
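As a final illustration of that second point (a hypothetical sketch: the second genome and all the variable names here are my own invention), comparing two equal-length sequences base by base collapses into a couple of vectorised operations once both are numpy arrays:

import numpy

# two toy equal-length sequences stored as arrays of single characters,
# e.g. as produced by the load_genome_as_array sketch earlier
genome_a = numpy.array(list("ACGTACGT"))
genome_b = numpy.array(list("ACGTACCT"))

# boolean mask of every position where the two sequences disagree
mask = genome_a != genome_b

# positions and the differing bases, all without an explicit Python loop
positions = numpy.nonzero(mask)[0]
print(positions, genome_a[mask], genome_b[mask])
# -> [6] ['G'] ['C']

This is exactly the kind of analysis that is awkward and slow with plain strings but near-instant with numpy arrays.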