Numpy v Biopython

Having only recently having to write bioinformatics Python code that e.g. interrogate GenBank files to find out the sequence of specific genes I’ve learnt a bit of Biopython. I’ve always wondered why (and I could be wrong) the bioinformatics community doesn’t make more use of numpy? For example the Seq class in Biopython seems to be built off the str built-in type yet representing the genetic sequence as a numpy array of single characters would not only be faster, but would allow some analysis to be done much more simply (more on this later). Much of the Python code I’ve written for the CRyPTIC project relies on Biopython and has always seemed a touch slow, but I was astonished to see just how much I could speed certain operations up by switching from Biopython to numpy internally at the first opportunity. All of the below was done in a jupyter lab and used the magic %timeit to repeat operations to ensure an accurate time.

First, let’s load and parse a Genbank file for the M. tuberculosis reference, H37rV

[1]: from Bio import SeqIO
[2]: %timeit reference_genome=SeqIO.read("H37rV_v3.gbk",'genbank')
466 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not too shoddy: about 0.5sec to read in 4.4 million bases and parse the metadata for 4000 odd genes. Let’s say we want to retrieve the genetic sequence for my current favourite M. tuberculosis gene, rpoB, by slicing the reference_genome object.

[3]: %timeit reference_genome[759807:763325]
12.4 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Doesn’t look too bad, does it. But how about if instead we directly access the seq variable and try slicing that instead?

[4]: %timeit reference_genome.seq[759807:763325]
1.83 µs ± 92.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Woah, suddenly we get a 6,775x speedup! Could we speed things up even more if we use numpy? First though we have to spend some time storing the whole genome as a numpy array.

[5]: import numpy
[6]: %timeit numpy_genome=numpy.array(reference_genome.seq)
7.26 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ouch, 7.3 s is 15x slower than simply parsing the genome with BioPython, so we’ll have to do a lot of slicing to get that time back! But is there a better way? How about if we convert it into a list first?

[7]: %timeit numpy_genome=numpy.array(list(reference_genome.seq))
2.08 s ± 96.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A bit better, but that is still longer than reading the file in the first place. What about list comprehensions? They are nicely Pythonic.

[8]: %timeit numpy_genome=numpy.array(numpy.array([i for i in str(reference_genome.seq)]))
426 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Much better: about the same length of time as reading the GenBank file in the first place, which kind of makes sense. Back to our original question, does this now speed up slicing?

[9]: %timeit numpy_genome[759807:763325]
252 ns ± 8.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Yes! Another 7x speedup, which means it is almost 50,000x faster than just slicing the BioPython Genome object. But we’ve had to double the time taken to create our “genome” object, however, assuming we noticed that accessing the seq variable is faster, we only have to do 270 slice operations before we’ve recovered the additional setup time.

Once a sequence is stored as an array you can then do some neat things. Say we want to count the number of adenosine bases. This is simply.

[10]: numpy.sum(numpy_genome=='A')
758552

In summary

  • Never directly slice reference_genome, always access reference_genome.seq
  • If you are going to fish in the genome multiple times, convert as soon as possible to numpy arrays

Now, I’m not involved in either project, but it would seem that numpy, which was developed from numeric, matured after Biopython, and hence the Biopython developers were not able to take advantage of the optimisation and clever design that has gone into numpy. It also means there is an opportunity for someone to create a next generation Python3 bioinformatic package that builds off numpy, but perhaps also numba and pandas, that is not only (much) faster but more Pythonic and easier to use.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.