This study was performed by Josh Carter back in 2019; we uploaded a preprint to bioRxiv and submitted the manuscript for review. Unfortunately, the reviews came back just as the UK was going into lockdown in March 2020, and my memory was that the manuscript had been rejected. The editor, however, had in fact asked for major revisions, so this is a lesson for me in carefully reading one’s emails!
Josh had left the group and started a combined PhD / MD programme at Stanford University, and I got involved in our Covid response work, so it wasn’t until 2022 that we were able to think about this manuscript again. Fortunately, by this time the CRyPTIC Consortium had published its first dataset, which allowed us to roughly double the Train/Test dataset, thereby addressing one of the reviewer’s concerns. Another group had also published a model a few months after our preprint, so we were able to benchmark the performance of our best model against theirs. Finally, the original work was done in R, and we took the opportunity to rewrite everything in Python, making use of a Python package that Charlotte Lynch, Dylan Adlard and I had written (sbmlcore) to simplify adding the structural and chemical features to the different datasets.
This has enabled us to make all of the code publicly available: parsing the original datasets, aggregating them, adding the features, performing the Train/Test split, training the models and plotting all the graphs. We hope that any interested person can therefore reproduce our work.