Numpy v Biopython

Philip Fowler, 25th July 2019 (updated 4th August 2019)

Having only recently had to write bioinformatics Python code that, for example, interrogates GenBank files to find the sequences of specific genes, I've learnt a bit of Biopython. I've always wondered why (and I could be wrong) the bioinformatics community doesn't make more use of numpy. For example, the Seq class in Biopython appears to be built on the str built-in type, yet representing the genetic sequence as a numpy array of single characters would not only be faster but would also allow some analyses to be done much more simply (more on this later).

Much of the Python code I've written for the CRyPTIC project relies on Biopython and has always seemed a touch slow, but I was astonished to see just how much I could speed certain operations up by switching from Biopython to numpy at the first opportunity. All of the below was done in JupyterLab, using the %timeit magic to repeat each operation and so ensure an accurate timing.

First, let's load and parse a GenBank file for the M. tuberculosis reference genome, H37Rv:

[1]: from Bio import SeqIO

[2]: %timeit reference_genome=SeqIO.read("H37rV_v3.gbk",'genbank')
466 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not too shoddy: about half a second to read in 4.4 million bases and parse the metadata for some 4,000 genes. Let's say we want to retrieve the genetic sequence of my current favourite M. tuberculosis gene, rpoB, by slicing the reference_genome object.

[3]: %timeit reference_genome[759807:763325]
12.4 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Doesn't look too bad, does it? But how about if instead we directly access the seq attribute and slice that?

[4]: %timeit reference_genome.seq[759807:763325]
1.83 µs ± 92.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Woah, suddenly we get a 6,775x speedup! Could we speed things up even more if we use numpy?
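Why is [3] so much slower than [4]? Slicing the whole record has to build a new object and (as far as I can tell) check each of the ~4,000 feature annotations against the slice coordinates, whereas slicing the seq is essentially a string copy. As a self-contained illustration of how cheap the string-level operation is, here is a toy string standing in for the real genome, timed with the stdlib timeit module rather than the %timeit magic:

```python
import timeit

# toy stand-in for the 4.4 Mb H37Rv sequence (not the real genome)
genome_str = "ACGT" * 1_100_000

# str slicing copies only the 3,518 requested characters, so it is cheap
rpob = genome_str[759807:763325]
assert len(rpob) == 3518

# stdlib equivalent of %timeit: average the slice over many repetitions
per_call = timeit.timeit(lambda: genome_str[759807:763325], number=100_000) / 100_000
print(f"{per_call * 1e6:.2f} µs per slice")
```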
First, though, we have to spend some time storing the whole genome as a numpy array.

[5]: import numpy

[6]: %timeit numpy_genome=numpy.array(reference_genome.seq)
7.26 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ouch: at 7.3 s that is 15x slower than simply parsing the genome with Biopython, so we would have to do a lot of slicing to win that time back! Is there a better way? How about converting it to a list first?

[7]: %timeit numpy_genome=numpy.array(list(reference_genome.seq))
2.08 s ± 96.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A bit better, but that is still longer than reading the file in the first place. What about a list comprehension? They are nicely Pythonic.

[8]: %timeit numpy_genome=numpy.array([i for i in str(reference_genome.seq)])
426 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Much better: about the same length of time as reading the GenBank file in the first place, which kind of makes sense. Back to our original question: does this now speed up slicing?

[9]: %timeit numpy_genome[759807:763325]
252 ns ± 8.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Yes! Another 7x speedup over slicing reference_genome.seq, which makes it almost 50,000x faster than slicing the Biopython reference_genome object itself. We have had to roughly double the time taken to create our "genome" object; however, since each slice of the numpy array saves about 1.6 µs compared with slicing reference_genome.seq, we only have to do around 270,000 slice operations before we have recovered the additional setup time. (If we had not noticed that accessing the seq attribute is faster, the numpy array would pay for itself after only about 35 slices of the full record.)

Once a sequence is stored as a numpy array you can do some neat things. Say we want to count the number of adenine bases. That is as simple as:
[10]: numpy.sum(numpy_genome=='A')
758552

In summary:

- Never slice reference_genome directly; always access reference_genome.seq.
- If you are going to fish in the genome multiple times, convert it to a numpy array as soon as possible.

Now, I'm not involved in either project, but it would seem that numpy, which was developed from Numeric, matured after Biopython, and hence the Biopython developers were not able to take advantage of the optimisation and clever design that has gone into numpy. It also means there is an opportunity for someone to create a next-generation Python 3 bioinformatics package that builds on numpy, and perhaps also numba and pandas, and that is not only (much) faster but also more Pythonic and easier to use.
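As a postscript: the list-comprehension conversion in [8] can likely be beaten by going through the string's raw bytes, and the boolean-mask count in [10] generalises to all four bases in one pass. Here is a hedged, self-contained sketch on a toy sequence — numpy.frombuffer and numpy.unique are standard numpy, but the real genome string would slot in where seq_str is, and the speedups will vary by machine:

```python
import numpy

seq_str = "ACGT" * 1_000_000  # toy sequence standing in for str(reference_genome.seq)

# a possibly even faster conversion than the list comprehension in [8]:
# reinterpret the string's bytes as an array of single characters
genome = numpy.frombuffer(seq_str.encode("ascii"), dtype="S1")
assert genome.shape == (4_000_000,)

# the frombuffer array holds bytes, so compare against b'A', not 'A'
n_a = int(numpy.sum(genome == b"A"))
assert n_a == 1_000_000

# numpy.unique counts every distinct base in a single pass
bases, counts = numpy.unique(genome, return_counts=True)
print({b.decode(): int(c) for b, c in zip(bases, counts)})
# {'A': 1000000, 'C': 1000000, 'G': 1000000, 'T': 1000000}
```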