Compressing FASTA files natively in Python

Philip Fowler, 23rd May 2019 (updated 26th May 2019)

The M. tuberculosis genome is pretty small, only 4.4 million nucleotides, so storing it as plaintext means each genome is 4.2MB. But when you have tens of thousands of genomes it starts to add up, particularly as I want to keep my data tree on my workstation so I can view the images produced by AMyGDA, some of which are then fed to BashTheBug.

I've always thought it neat that in Python you can write and read compressed text files "on the fly" using gzip or bzip2, so how do they perform? Both accept a compresslevel argument that runs from 1 to 9 and tells the algorithm how hard to try to compress the text.

How does that affect the time taken to compress a TB genome? I'd expected some kind of linearity, but neither algorithm behaves that way, on this data at least. bzip2 takes about the same time whatever the setting (these data were gathered using the %timeit magic, so are the mean of multiple repeats), whereas gzip suddenly slows down once you go past a compression level of 6.

What effect does that have on the achieved compression? For bzip2, none: the same level of compression is achieved whatever the setting. For gzip, it helps up to a point: returns diminish once you go past a compression level of 5 or 6, after which you are just slowing your code down and wasting electricity.

I was expecting bzip2 to 'win', but I've ended up concluding that gzip with a very low compression level (1 or 2) is a good compromise: it is very fast and you get most, but not all, of the compression you could otherwise get.
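For anyone wanting to try this comparison themselves, here is a minimal sketch of the kind of test described above. It is not the code used to produce the timings in this post: the helper names (make_fake_genome, round_trip_gzip, benchmark) are made up for illustration, the "genome" is a short random sequence rather than a real M. tuberculosis FASTA file, and it uses time.perf_counter for a single run rather than %timeit. Random sequence compresses differently from real DNA, so expect the numbers to differ.

```python
import bz2
import gzip
import os
import random
import tempfile
import time


def make_fake_genome(n_bases=200_000, seed=42):
    """Return a FASTA-formatted string of random bases (a stand-in for a real genome)."""
    rng = random.Random(seed)
    sequence = "".join(rng.choices("ACGT", k=n_bases))
    # Wrap the sequence at 70 characters per line, a common FASTA convention
    lines = [sequence[i:i + 70] for i in range(0, n_bases, 70)]
    return ">fake_genome\n" + "\n".join(lines) + "\n"


def round_trip_gzip(text, path, level=1):
    """Write text to a gzipped file 'on the fly', then read it straight back."""
    # Modes "wt" and "rt" give an ordinary text-file interface
    with gzip.open(path, "wt", compresslevel=level) as handle:
        handle.write(text)
    with gzip.open(path, "rt") as handle:
        return handle.read()


def benchmark(text, compress, levels=range(1, 10)):
    """Time compress(data, level) once per level; return {level: (seconds, size_in_bytes)}."""
    data = text.encode("ascii")
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = compress(data, level)
        elapsed = time.perf_counter() - start
        results[level] = (elapsed, len(compressed))
    return results


fasta = make_fake_genome()

# Check that writing then reading a compressed file returns the original text
with tempfile.TemporaryDirectory() as tmpdir:
    recovered = round_trip_gzip(fasta, os.path.join(tmpdir, "fake_genome.fasta.gz"))

# Compare both algorithms across all nine compression levels
gzip_results = benchmark(fasta, gzip.compress)
bz2_results = benchmark(fasta, bz2.compress)

for name, results in (("gzip", gzip_results), ("bzip2", bz2_results)):
    for level, (seconds, size) in results.items():
        ratio = size / len(fasta)
        print(f"{name} level {level}: {seconds * 1000:.1f} ms, {ratio:.1%} of original size")
```

To run this on a real genome, replace make_fake_genome() with the contents of your own FASTA file, and repeat each measurement several times (as %timeit does) before drawing any conclusions.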