Can medical microbiology become a big data science? Lessons from CRyPTIC

Philip Fowler, 11th March 2025

The CRyPTIC project ran from 2017 to around 2022 and in that time collected over 20,000 clinical samples of M. tuberculosis. Each sample underwent whole genome sequencing (WGS) and phenotypic drug susceptibility testing (pDST); the minimum inhibitory concentrations (MICs) of 13 different antibiotics were measured using a bespoke 96-well broth microdilution plate. The project also aggregated previously published samples which had pDST data and had undergone WGS. The resulting dataset of about 40,000 samples was passed in 2020 to the Seq&Treat project, run through FIND, and fed into the first catalogue of resistance-associated variants for M. tuberculosis, published by the World Health Organisation in July 2021.

This post is about the hard yards we had to go through assembling the dataset. The fundamental difficulty is that early on each sample is split: part of it goes through a sequencing machine and the other part is inoculated onto the 96-well plate. Both halves of the data should then be reunited a few weeks or months later. But this often didn't happen. The genetic side of the equation was easier since the data were already in a digital format (named FASTQ files). The phenotypic side threw up lots of problems we might have expected and many we didn't. I am recording them here as a warning to future large projects: if clinical microbiology is to become a Big Data science (and it needs to if it is to adopt whole genome sequencing) then it will have to get much better at data entry and data quality.

We had set up a web portal in which laboratory scientists could enter the MICs of each plate as they were reading it after the prescribed two weeks of incubation. This led to:

Problem 1. The web portal was confusing.

Take a look at the photograph. There is much good going on here: there are drop-downs to ensure the user can only choose from a limited number of options. But there is some poor design as well; the key identifiers are free text and are not validated or checked, and the visualisation of the 96-well plate is confusing, making it difficult to align the image with the physical plate the scientist would have been looking at. This was entirely within our ability to change, we didn't, and we had to suffer the consequences, which were mainly:

Mistakes in recording the MIC.

Non-standard characters. There were free-text comment fields and I think some labs kept standard comments in a Word document which they copied and pasted in, because they consistently used non-ASCII characters like the em-dash which, when you aren't expecting it, can crash your code (a minimal clean-up sketch follows this list).

Errors in the identifiers (we'll come back to this later).
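None of what follows is CRyPTIC code, but here is a minimal sketch, in Python, of the kind of defensive clean-up those comment fields forced on us: translate the troublesome punctuation, decompose accents, and drop anything that still isn't ASCII rather than letting it crash the parser. The function and the mapping are hypothetical, for illustration only.

```python
import unicodedata

# Hypothetical mapping of the non-ASCII punctuation we kept tripping over
PUNCTUATION_MAP = {
    "\u2014": "-",                   # em-dash
    "\u2013": "-",                   # en-dash
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u00a0": " ",                   # non-breaking space
}

def sanitise_comment(text: str) -> str:
    """Return an ASCII-safe version of a free-text comment field."""
    if text is None:
        return ""
    for bad, good in PUNCTUATION_MAP.items():
        text = text.replace(bad, good)
    # Decompose accented characters (é -> e + combining accent), then drop
    # whatever still isn't ASCII instead of letting it crash downstream code
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", errors="ignore").decode("ascii").strip()

print(sanitise_comment("Contaminated well — repeat café isolate"))
# -> "Contaminated well - repeat cafe isolate"
```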
Problem 2. The web portal asked for too much information.

You can't see this in the screenshot, but the laboratory scientist was asked to enter a lot of other information about each sample. Not all of it was mandatory but a lot of it was, and this made the job of the laboratory scientist very tedious; ultimately they requested we dramatically reduce the number of mandatory fields, which we did. In addition, the web portal wasn't ready in time for the start of data collection. As a result of these two factors several labs asked if they could send us their results as Excel spreadsheets. This introduced a LOT of problems, as the data were not validated at all.

Wrong MICs. Often we would find an MIC recorded at a concentration that simply does not exist on the plate design for that drug. If you can hypothesise how the mistake was made (e.g. it is ten times a real MIC) then perhaps you feel confident enough to catch and correct it, but that has to be a manual process.

Inconsistent drug abbreviations. TB is pretty good at using three-letter abbreviations for all the common antibiotics, but there is still degeneracy: moxifloxacin, for example, is referred to as MOX, MFX and MXF by different labs.

Accents. These are, of course, an essential part of language, but when you are trying to parse an Excel spreadsheet they can really complicate matters.

Decimal notation. Perhaps trivial, but one still has to catch and deal with it: some labs would write 0.06 and others 0,06.

Inconsistency in the MIC values. These form a doubling dilution series based around 1 mg/L, which means that descending the series you get 0.5, 0.25, 0.125, 0.0625 ... mg/L. Conventionally people truncate the decimals, so you tend to see it written as 0.5, 0.25, 0.12, 0.06, but not everyone does this, so you have to cope with both 0.06 and 0.0625, etc.

Inconsistency in the interval notation. If there is no growth in a lane and the lowest concentration is 0.06 mg/L then all you know is that the MIC is 0.06 mg/L or lower, hence this is usually written as ≤ 0.06 mg/L. But you can also write that as <= 0.06 mg/L, and some people (incorrectly) write it as < 0.06 mg/L. You get the idea. (A sketch of how values like these might be normalised follows this list.)

Leaving the example in. Later in the project, when we had an Excel spreadsheet with some validation, we also put some example data in the sheet to help users understand what we were asking for. But some people then returned the spreadsheet with the example data left in, so we had to catch it and remove it...

Uploading the wrong photograph. We asked the users to upload to the portal the photograph of the 96-well plate after two weeks of growth; this turned out to be hugely useful as we used image processing to double-check their measurements and did some other cool things. I calculated and stored a hash of each image (md5sum). This hash is unique to a particular photograph, and we found that one laboratory in particular had uploaded the same photograph for about 20 samples, which clearly wasn't possible. Whilst one of those records was presumably right, we didn't know which one, so we had to drop all of them from the final dataset.
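To make the last few items concrete, here is a minimal sketch (again, not CRyPTIC code; the function name and the canonical mapping are my own choices for illustration) of how the decimal commas, the truncated versus untruncated dilution values, and the interval notation might all be collapsed into one canonical form.

```python
import re

# Canonical truncated forms of the doubling-dilution series: 0.0625 is
# conventionally written 0.06, 0.125 as 0.12, and so on
CANONICAL = {0.0625: "0.06", 0.125: "0.12", 0.03125: "0.03"}

def normalise_mic(raw: str) -> str:
    """Collapse the many ways an MIC was written into one canonical string."""
    value = raw.strip().replace(",", ".")            # 0,06 -> 0.06
    # ≤, <= and the (strictly incorrect) < all collapse to "<="
    match = re.match(r"^(≤|<=|<)?\s*([0-9.]+)$", value)
    if match is None:
        raise ValueError(f"cannot parse MIC: {raw!r}")
    operator, number = match.groups()
    operator = "<=" if operator else ""
    numeric = float(number)
    # Map untruncated dilutions (0.0625) onto the truncated convention (0.06)
    return operator + CANONICAL.get(numeric, f"{numeric:g}")

for raw in ["0,06", "≤ 0.0625", "< 0.06", "0.125", "2"]:
    print(raw, "->", normalise_mic(raw))
# 0,06 -> 0.06   ≤ 0.0625 -> <=0.06   < 0.06 -> <=0.06   0.125 -> 0.12   2 -> 2
```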
Problem 3. The unique identifiers were manually entered and there was no validation.

If we can't join the WGS data back to the pDST data then that sample can't contribute to the research goals and all the work on the sample is wasted. Despite this being motivation in itself to ensure all the identifiers were correct, we had many, many problems. Most of the problems listed above are annoying but you can deal with them; not being able to join the WGS and pDST data back together was existential for the project, so this was a lot more serious.

Spaces at the end of cells. Because there was no validation, we sometimes found an (invisible) space at the end of an identifier which prevented the join.

UPPERCASE v lowercase. Apparently simple, but if the identifiers are UPPERCASE in half the pipeline and lowercase in the other they won't join...

Delimiter characters in the identifiers. We used the unique identifiers to form a sharded data structure to store the information, hence a "/" character in any of the identifiers would mess this up. Likewise we built a single UNIQUEID by concatenating the other unique identifiers together; we chose "." as the delimiter, which of course meant that a "." in any of the identifiers messed this up too. Both were caught and replaced with underscores (a sketch of this kind of clean-up is at the end of the post).

Inconsistent naming. Some sites just did things a little bit differently with the unique identifiers on the two sides of the process. One site used %03d for the genetics and %04d for the phenotypes, so 003 v 0003. That one is at least easy to spot.

Different identifiers. We found this one when one laboratory had zero matching samples. Upon inspection the WGS data had a completely different set of identifiers to the pDST data. When we spoke to them, they just hadn't realised we would want to join the datasets together.

The thing is, you might have read all of the above and thought "well, why couldn't you have caught the problem, talked to the lab and fixed it?". And that is true. When you are dealing with tens or hundreds of samples it is feasible. But when you have tens of thousands of samples it starts getting difficult simply due to the volume of data. And this is perhaps the main point: many research projects have got by using Excel and post-hoc fixing, but if we are really to become a Big Data science you can't do this any more; rather, you have to accept that any mistakes will likely never be caught and will remain in your dataset. This requires a change in mindset and, ideally, a move towards automated data collection (for example QR codes and automated plate reading) and away from humans entering data. Unfortunately that also requires strong computational skills, which is not something many clinical microbiology researchers have, and so training and upskilling will be required.
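To end on something concrete, here is a minimal sketch of the sort of identifier clean-up Problem 3 forced on us, and which, applied at the point of data entry rather than months later, would have prevented most of it. This is not CRyPTIC code: the function names, the choice of upper case, the validation pattern and the fields making up the UNIQUEID are all assumptions for illustration.

```python
import re

def clean_identifier(raw: str) -> str:
    """Normalise a manually-entered identifier so the WGS and pDST halves join."""
    identifier = raw.strip()            # invisible trailing spaces broke joins
    identifier = identifier.upper()     # UPPERCASE vs lowercase broke joins
    # "/" broke the sharded file layout and "." was the UNIQUEID delimiter,
    # so both get replaced with underscores, as we ended up doing
    identifier = identifier.replace("/", "_").replace(".", "_")
    # Applied at data entry, a strict (hypothetical) pattern like this would
    # reject anything unexpected instead of silently storing it
    if not re.fullmatch(r"[A-Z0-9_-]+", identifier):
        raise ValueError(f"invalid identifier: {raw!r}")
    return identifier

def build_uniqueid(site: str, lab: str, sample: str) -> str:
    """Concatenate the individual identifiers into a single UNIQUEID."""
    return ".".join(clean_identifier(part) for part in (site, lab, sample))

print(build_uniqueid("site01", "Lab/A ", "sample.0042"))
# -> "SITE01.LAB_A.SAMPLE_0042"
```

Whether you upper- or lower-case is immaterial; what matters is that both halves of the pipeline apply exactly the same rules, and apply them before the data are accepted.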