Updating the Grammar for Antimicrobial Resistance Catalogues

Philip Fowler, 18th July 2024

This blog updates an old (and now out-of-date) post describing the grammar we’ve developed for tuberculosis resistance catalogues. It also includes elements from another post that advocated a set of requirements for all AMR catalogues.

Philip Fowler and Jeremy Westhead, 2024.

Assumptions

- Resistance to an antimicrobial can be predicted by genetics
- The genetics can be adequately described by variants as encapsulated by a VCF file (true for M. tuberculosis, not true more generally)

Principles

A catalogue must

- be a single object (e.g. computer file)
- be human- and computer-readable
- contain all the logic necessary to build an antibiogram
- be with respect to a specified version of a reference genome (currently H37Rv version 3 for M. tuberculosis)
- provide an estimate of the uncertainty underlying each association and, ideally, the evidence supporting the association
- be versioned to allow for bugs/mistakes to be corrected following publication
- ideally, be provided in a standard format so that the performance of different catalogues can be directly compared

Existing codebases and catalogues

Versions 1 and 2 of the WHO catalogue of resistance-associated variants for M. tuberculosis can be found here, encoded with this grammar; they are part of a larger GitHub repository containing a number of historical TB catalogues going back to 2015. The original (i.e. source) Excel and VCF files describing Version 2 of the WHO catalogue can be obtained from here, but remember that there are also additional rules that have to be extracted from the report, which is here. Note that the GitHub repo is silently updated, so you can be caught out by updates.

Our version of the WHO catalogue can be parsed by the ResistanceCatalogue object provided by piezo, a Python package. The object has a predict method that, when given a definite mutation, returns a dictionary of drug-level predictions.
The GitHub repository is here and it is pip installable.

Alternatively, one can pass (i) a VCF file, (ii) a GenBank file and (iii) a resistance catalogue, in the same format as the WHO catalogue above with mutations described using this grammar, to gnomonicus, which returns a range of outputs, including the detected mutations, their effects and the drug-level predictions. The GitHub repository is here and it is pip installable.

Implications

- The grammar needs to have wildcards, i.e. the ability in a single rule to specify “any frameshifting mutation in katG confers resistance”
- There has to be a hierarchy; this can be explicit (each rule given a score) or implicit (more specific rules override more general rules)
- The catalogue should be stored as a plaintext file to facilitate version control (e.g. GitHub)
- All genetic variants must be unambiguously described using the genes in the GenBank file, e.g. if a promoter mutation for gene X lies in the coding region of gene Y, it should be described as a gene Y mutation, as otherwise it cannot be determined computationally. Likewise, since promoters are not described in the GenBank file, assumed promoter regions can only be intergenic, i.e. they must stop before the next gene starts to avoid ambiguity.
- To cope with cases where a change at a single locus on the genome affects more than one gene (e.g. where genes overlap), any code which parses VCF files must report all resulting mutations (i.e. a one-to-many relationship)

Nomenclature

Genes

For core chromosomal genes, the name of the gene in the catalogue must match the gene name if available or, failing that, the locus tag in the GenBank reference. E.g.

    gene            759807..763325
                    /gene="rpoB"
                    /locus_tag="Rv0667"

For genes that are not present in the reference genome (GenBank file) for a species, for example those on plasmids or other mobile genetic elements, ensuring consistency of nomenclature is obviously more difficult since this is all based on a mapping paradigm.
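The gene-naming rule above (use the gene name if the reference provides one, otherwise the locus tag) can be sketched as a small lookup. This is only an illustration and is not code from piezo or gnomonicus; the function name `catalogue_gene_name` and the plain-dict representation of the GenBank qualifiers are assumptions made for the example.

```python
def catalogue_gene_name(qualifiers):
    """Return the name a catalogue should use for a GenBank gene feature.

    `qualifiers` is a hypothetical plain dict of GenBank qualifiers,
    e.g. {"gene": "rpoB", "locus_tag": "Rv0667"}.
    """
    # Prefer the gene name when the reference provides one...
    name = qualifiers.get("gene")
    if name:
        return name
    # ...otherwise fall back to the locus tag.
    tag = qualifiers.get("locus_tag")
    if tag:
        return tag
    raise ValueError("feature has neither a gene name nor a locus tag")


print(catalogue_gene_name({"gene": "rpoB", "locus_tag": "Rv0667"}))  # rpoB
print(catalogue_gene_name({"locus_tag": "Rv0678"}))  # Rv0678
```

Raising an error when neither qualifier is present mirrors the ‘fail loudly’ approach taken elsewhere in the grammar, but that choice is ours rather than something the catalogue format mandates.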
Amino acids and nucleotides

Amino acids are always given in UPPERCASE and nucleotides in lowercase. This should be checked by the code, which should fail if this is not the case (‘halt and catch fire’). Mixed calls at a locus/codon are denoted by the letter z, hence Z for an amino acid or z for a base. Likewise, null calls are denoted by the letter x, hence X for an amino acid or x for a base.

Wildcards, other special characters and reserved words

This is designed to be general and expandable (especially for indels).

- * is reserved to mean any residue (or base, depending on context) in a gene. Note that -* is expanded to mean ‘any promoter position’.
- ! is reserved for the STOP codon (rather than Stop or * as is conventional).
- @ is reserved as a delimiter between gene and mutation, e.g. rpoB@S450L. In TB, _ is often used, but _ is found in gene names, hence the difference.
- ? is a wildcard for any non-synonymous mutation and = is a wildcard for synonymous mutations.
- & is logical AND and allows variants to be combined.
- > is reserved for genome-level variants where base changes are specified w.r.t. a genome locus, e.g. 7611365a>t
- ^ was introduced to cater for epistatic rules, e.g. whilst a frameshift in Rv0678 would be expected to confer resistance to BDQ, if it co-occurs with a premature stop in mmpL5 then the net effect on BDQ would be susceptibility, e.g. ^Rv0678@*_fs&mmpL5@*!
- : allows us to (optionally) specify the minimum number of reads (if followed by an int) or fraction of reads (if followed by a float) required to support the specified mutation, thereby triggering the rule, e.g. gyrA@A90V:4 means “four or more reads supporting Ala90Val in gyrA”.
- indel means “any insertion or deletion at this nucleotide position” (cannot be a codon).
- ins means “any insertion at this position” and may be made more specific by appending (after a _) either the number of bases inserted or the sequence of bases inserted.
- del means “any deletion at this position” and may be made more specific by appending (after a _) the number of bases deleted. Alternatively, when used on its own, it is followed by the proportion (as a float) of the gene deleted.
- fs means “any frame shifting insertion or deletion at this position”.

Some examples

| Example of grammar | Description |
|---|---|
| rpoB@= | Any synonymous mutation in the coding region of rpoB |
| rpoB@*? | Any non-synonymous mutation in the coding region of rpoB |
| rpoB@-*? | Any mutation in the (assumed) promoter of rpoB. Typically we assume this is 100 bases upstream of the gene start or until the next gene, whichever comes first. |
| rpoB@*_indel | Any insertion or deletion in the coding region of rpoB |
| rpoB@*_ins | Any insertion in the coding region of rpoB |
| rpoB@*_ins_5 | Any insertion of five bases in the coding region of rpoB |
| rpoB@*_del_5 | Any deletion of five bases in the coding region of rpoB |
| rpoB@del_0.5 | Any deletion of over half the gene |
| rpoB@*_fs | Any frame shifting (`length % 3 != 0`) insertion or deletion in the coding region |
| rpoB@S450= | Any synonymous mutation to Ser450 in rpoB |
| rpoB@S450? | Any non-synonymous mutation to Ser450 in rpoB |
| rpoB@S450L | Any Ser450Leu mutation in rpoB |
| rpoB@S450L:0.1 | Any Ser450Leu in rpoB supported by 10% or more of the reads that map here. Note that the code that parses the VCF file must be aware that you are looking for a minor allele at this position as, depending on the variant caller, a mixed call may trigger a filter. |
| rpoB@S450L:3 | Any Ser450Leu in rpoB supported by three or more of the reads that map here |
| rpoB@1200_ins_ttg | Any insertion of ttg at nucleotide position 1200 of the gene rpoB |
| rpoB@1200_del_2 | Any deletion of two bases at nucleotide position 1200 of the gene rpoB |

More examples can be found in the NOMENCLATURE.md in the piezo code repository.

Backus–Naur form

This is a definition of the grammar acceptable to use within a catalogue, where <gene-name> is any valid gene or locus name (usually matching the regex [a-zA-Z0-9_]+).
It is missing the definition of epistatic rules but is otherwise complete. See here.

    <complete-mutation> ::= <mutation>
                          | <mutation>":"<number>
                          | <mutation>":0."<number>
                          | <complete-mutation>"&"<complete-mutation>

    <mutation> ::= <gene-name>"@"<nucleotide><position><nucleotide>
                 | <gene-name>"@"<amino-acid><number><amino-acid>
                 | <gene-name>"@"<position>"_ins_"<nucleotides>
                 | <gene-name>"@"<position>"_ins_"<number>
                 | <gene-name>"@"<position>"_ins"
                 | <gene-name>"@"<position>"_del_"<nucleotides>
                 | <gene-name>"@"<position>"_del_"<number>
                 | <gene-name>"@"<position>"_del"
                 | <gene-name>"@"<position>"_indel"
                 | <gene-name>"@"<position>"_fs"
                 | <gene-name>"@"<pos-wildcard><wildcard>
                 | <gene-name>"@"<nucleotide><pos>"?"
                 | <gene-name>"@"<amino-acid><number>"?"
                 | <gene-name>"@"<positive-position>"="
                 | <gene-name>"@del_0."<number>
                 | <gene-name>"@del_1.0"

    <wildcard> ::= "?" | "="

    <positive-position> ::= <number> | "*"

    <position> ::= <pos> | <pos-wildcard>

    <pos-wildcard> ::= "*" | "-*"

    <pos> ::= <number> | "-"<number>

    <nucleotides> ::= <nucleotide> | <nucleotide><nucleotides>

    <nucleotide> ::= "a" | "c" | "t" | "g" | "x" | "z"

    <amino-acid> ::= "A" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "V" | "W" | "X" | "Y" | "Z" | "!"

    <number> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | <number><number>

Outcomes

These are not fixed, but there is a column in the catalogue, PREDICTION_VALUES, that contains a string of the supported values, e.g. RUS means a ternary resistant / unknown / susceptible system is supported whilst RFUS indicates a quaternary system; the F indicates Fail and is for situations where there are no, or insufficient, reads to know the genetic sequence at a genetic locus we know is associated with resistance.
There is an argument for introducing an I (or equivalent) for mutations whose MIC distributions straddle the ECOFF and so will never be classified as either R or S, to distinguish these from mutations detected in genes known to be associated with resistance for which we have insufficient data to classify them properly (i.e. Us).

Finally, the outcomes have a hierarchy R > F > U > S, i.e. any resistance rule will override any failure rule, which will override any unknown outcome, etc. Epistasis breaks this, hence its introduction as a specific feature.
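The outcome hierarchy described above can be sketched as a small reduction over the outcomes of all the rules that fired for one drug. This is a minimal illustration under stated assumptions, not how piezo implements it: the names `HIERARCHY` and `combine_outcomes` are hypothetical, and epistatic rules (which break the hierarchy) are deliberately not handled.

```python
# Priority order taken from the text: R > F > U > S.
HIERARCHY = "RFUS"


def combine_outcomes(outcomes, prediction_values="RFUS"):
    """Collapse the outcomes of every rule that fired for one drug.

    `prediction_values` mimics the PREDICTION_VALUES string, e.g. "RUS".
    Epistatic rules are not handled in this sketch.
    """
    if not outcomes:
        raise ValueError("no outcomes to combine")
    # Only single letters that are both supported and ranked are allowed.
    allowed = set(prediction_values) & set(HIERARCHY)
    for outcome in outcomes:
        if outcome not in allowed:
            raise ValueError(f"unsupported outcome: {outcome!r}")
    # The outcome that appears earliest in the hierarchy wins.
    return min(outcomes, key=HIERARCHY.index)


print(combine_outcomes(["S", "U", "R"]))  # R
print(combine_outcomes(["U", "S", "F"]))  # F
```

What should be returned when no rule fires at all is a design decision for the catalogue parser; this sketch simply refuses to guess and raises an error.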