Skip to content
Fowler Lab
Fowler Lab

Predicting antibiotic resistance de novo

  • News
  • Research
    • Overview
    • Manifesto
    • Software
    • Reproducibility
    • Publications
  • Members
  • Teaching
  • Contact
    • PhDs
  • Wiki
Fowler Lab
Fowler Lab

Predicting antibiotic resistance de novo

Numpy v Biopython

Philip Fowler, 25th July 20194th August 2019

Having only recently having to write bioinformatics Python code that e.g. interrogate GenBank files to find out the sequence of specific genes I’ve learnt a bit of Biopython. I’ve always wondered why (and I could be wrong) the bioinformatics community doesn’t make more use of numpy? For example the Seq class in Biopython seems to be built off the str built-in type yet representing the genetic sequence as a numpy array of single characters would not only be faster, but would allow some analysis to be done much more simply (more on this later). Much of the Python code I’ve written for the CRyPTIC project relies on Biopython and has always seemed a touch slow, but I was astonished to see just how much I could speed certain operations up by switching from Biopython to numpy internally at the first opportunity. All of the below was done in a jupyter lab and used the magic %timeit to repeat operations to ensure an accurate time.

First, let’s load and parse a Genbank file for the M. tuberculosis reference, H37rV

[1]: from Bio import SeqIO
[2]: %timeit reference_genome=SeqIO.read("H37rV_v3.gbk",'genbank')
466 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not too shoddy: about 0.5sec to read in 4.4 million bases and parse the metadata for 4000 odd genes. Let’s say we want to retrieve the genetic sequence for my current favourite M. tuberculosis gene, rpoB, by slicing the reference_genome object.

[3]: %timeit reference_genome[759807:763325]
12.4 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Doesn’t look too bad, does it. But how about if instead we directly access the seq variable and try slicing that instead?

[4]: %timeit reference_genome.seq[759807:763325]
1.83 µs ± 92.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Woah, suddenly we get a 6,775x speedup! Could we speed things up even more if we use numpy? First though we have to spend some time storing the whole genome as a numpy array.

[5]: import numpy
[6]: %timeit numpy_genome=numpy.array(reference_genome.seq)
7.26 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Ouch, 7.3 s is 15x slower than simply parsing the genome with BioPython, so we’ll have to do a lot of slicing to get that time back! But is there a better way? How about if we convert it into a list first?

[7]: %timeit numpy_genome=numpy.array(list(reference_genome.seq))
2.08 s ± 96.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A bit better, but that is still longer than reading the file in the first place. What about list comprehensions? They are nicely Pythonic.

[8]: %timeit numpy_genome=numpy.array(numpy.array([i for i in str(reference_genome.seq)]))
426 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Much better: about the same length of time as reading the GenBank file in the first place, which kind of makes sense. Back to our original question, does this now speed up slicing?

[9]: %timeit numpy_genome[759807:763325]
252 ns ± 8.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Yes! Another 7x speedup, which means it is almost 50,000x faster than just slicing the BioPython Genome object. But we’ve had to double the time taken to create our “genome” object, however, assuming we noticed that accessing the seq variable is faster, we only have to do 270 slice operations before we’ve recovered the additional setup time.

Once a sequence is stored as an array you can then do some neat things. Say we want to count the number of adenosine bases. This is simply.

[10]: numpy.sum(numpy_genome=='A')
758552

In summary

  • Never directly slice reference_genome, always access reference_genome.seq
  • If you are going to fish in the genome multiple times, convert as soon as possible to numpy arrays

Now, I’m not involved in either project, but it would seem that numpy, which was developed from numeric, matured after Biopython, and hence the Biopython developers were not able to take advantage of the optimisation and clever design that has gone into numpy. It also means there is an opportunity for someone to create a next generation Python3 bioinformatic package that builds off numpy, but perhaps also numba and pandas, that is not only (much) faster but more Pythonic and easier to use.

Share this:

  • Click to share on X (Opens in new window) X
  • Click to share on Bluesky (Opens in new window) Bluesky
  • Click to email a link to a friend (Opens in new window) Email
  • Click to share on LinkedIn (Opens in new window) LinkedIn
  • Click to share on Mastodon (Opens in new window) Mastodon

Related

tuberculosis

Post navigation

Previous post
Next post

Related Posts

antimicrobial resistance

New publication: Automated detection of bacterial growth on 96-well plates for high-throughput drug susceptibility testing of M. tuberculosis

26th October 2018

In this Microbiology paper we show how a Python package, called the Automated Mycobacterial Detection Growth…

Share this:

  • Click to share on X (Opens in new window) X
  • Click to share on Bluesky (Opens in new window) Bluesky
  • Click to email a link to a friend (Opens in new window) Email
  • Click to share on LinkedIn (Opens in new window) LinkedIn
  • Click to share on Mastodon (Opens in new window) Mastodon
Read More
antimicrobial resistance

New preprint: looking at rifampicin-resistant subpopulations in clinical samples

10th April 202510th April 2025

Since clinical samples are usually grown in a MGIT tube for a while before some…

Share this:

  • Click to share on X (Opens in new window) X
  • Click to share on Bluesky (Opens in new window) Bluesky
  • Click to email a link to a friend (Opens in new window) Email
  • Click to share on LinkedIn (Opens in new window) LinkedIn
  • Click to share on Mastodon (Opens in new window) Mastodon
Read More
citizen science

Automated detection of bacterial growth on 96-well plates (AMyGDA)

11th December 20175th August 2018

I am involved in an international collaboration, the Comprehensive Resistance Prediction for Tuberculosis: an International Consortium…

Share this:

  • Click to share on X (Opens in new window) X
  • Click to share on Bluesky (Opens in new window) Bluesky
  • Click to email a link to a friend (Opens in new window) Email
  • Click to share on LinkedIn (Opens in new window) LinkedIn
  • Click to share on Mastodon (Opens in new window) Mastodon
Read More

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Privacy & Cookies: This site uses cookies. By continuing to use this website, you agree to their use.
To find out more, including how to control cookies, see here: Cookie Policy
    ©2025 Fowler Lab | WordPress Theme by SuperbThemes