Updating the Grammar for Antimicrobial Resistance Catalogues

This blog updates an old (and now out of date) post describing the grammar we’ve developed for tuberculosis resistance catalogues. It also includes elements from another post that advocated a set of requirements for all AMR catalogues.

Philip Fowler and Jeremy Westhead, 2024.

Assumptions

Resistance to an antimicrobial can be predicted by genetics
The genetics can be adequately described by variants as encapsulated by a VCF file (true for M. tuberculosis, not true more generally)

Principles

A catalogue

must be a single object (e.g. computer file)
be human- and computer-readable
contain all the logic necessary to build an antibiogram
be with respect to a specified version of a reference genome (currently H37Rv version 3 for M. tuberculosis)
provide an estimate of the uncertainty underlying each association and, ideally, the evidence supporting the association
be versioned to allow for bugs/mistakes to be corrected following publication
ideally, be provided in a standard format so that the performance of different catalogues can be directly compared

Existing codebases and catalogues

Versions 1 and 2 of the WHO catalogue of resistance-associated variants for M. tuberculosis, encoded with this grammar, can be found here; they are part of a larger GitHub repository containing a number of historical TB catalogues going back to 2015. The original (i.e. source) Excel and VCF files describing Version 2 of the WHO catalogue can be downloaded from here, but remember that there are also additional rules that have to be extracted from the report, which is here. Note that the GitHub repo is silently updated, so you can be caught out by changes.

Our version of the WHO catalogue can be parsed by the ResistanceCatalogue object provided by piezo, a Python package. The object has a predict method that when given a definite mutation returns a dictionary of drug-level predictions. The GitHub repository is here and it is pip installable.

Alternatively, one can pass (i) a VCF file, (ii) a GenBank file and (iii) a resistance catalogue, in the same format as the WHO catalogue above with mutations described using this grammar, to gnomonicus, which returns a range of outputs, including the detected mutations, their effects and the drug-level predictions. The GitHub repository is here and it is pip installable.

Implications

The grammar needs to have wildcards i.e. the ability in a single rule to specify “any frameshifting mutation in katG confers resistance”
There has to be a hierarchy; this can be explicit (each rule is given a score) or implicit (more specific rules override more general rules)
The catalogue should be stored as a plaintext file to facilitate version control (e.g. GitHub)
All genetic variants must be unambiguously described using the genes in the GenBank file e.g. if a promoter mutation for gene X lies in the coding region of gene Y, it should be described as a gene Y mutation as otherwise it cannot be determined computationally.
Likewise since promoters are not described in the GenBank file, assumed promoter regions can only be intergenic i.e. must stop before the next gene starts to avoid ambiguity.
To cope with cases where a change at a single locus on the genome affects more than one gene (e.g. where genes overlap), any code which parses VCF files must report all resulting mutations (i.e. a one-to-many relationship)
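
The intergenic promoter constraint above can be sketched in a few lines. This is only an illustration, not how piezo or gnomonicus implement it: the function name `promoter_region` is hypothetical, the 100-base default is the typical value quoted in the examples table further down, the gene is assumed to lie on the forward strand, and the upstream gene coordinates in the usage lines are made up.

```python
def promoter_region(gene_start: int, upstream_gene_end: int,
                    max_length: int = 100) -> range:
    """Genome positions (1-based) of the assumed promoter of a forward-strand gene.

    The promoter extends at most `max_length` bases upstream of the gene start,
    but is truncated so that it never overlaps the upstream gene.
    """
    first = max(gene_start - max_length, upstream_gene_end + 1)
    # An empty range is returned if the upstream gene abuts or overlaps this gene
    return range(first, gene_start)


# rpoB starts at 759807 in the GenBank excerpt; the upstream ends are invented
full = promoter_region(759807, upstream_gene_end=759000)
clipped = promoter_region(759807, upstream_gene_end=759780)
print(len(full), len(clipped))
```

Because the region is clipped at the upstream gene, every position it contains maps to exactly one promoter and no coding sequence, which is what makes the description unambiguous.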

Nomenclature

Genes

For core chromosomal genes, the name of the gene in the catalogue must match the gene name if available, or failing that the locus tag in the GenBank reference. E.g.

gene 759807..763325
/gene="rpoB"
/locus_tag="Rv0667"

For genes that are not present in the reference genome (GenBank file) for a species, for example plasmid or other mobile genetic elements, ensuring consistency of nomenclature is obviously more difficult since this is all based on a mapping paradigm.

Amino acids and nucleotides

Amino acids are always given in UPPERCASE and nucleotides in lowercase. The code should check this and fail if it is not the case (‘halt and catch fire’).
Mixed calls at a locus/codon are denoted by the letter z, hence Z for an amino acid or z for a base.
Likewise, null calls are denoted by the letter x, hence X for an amino acid or x for a base.
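
A minimal sketch of such a check is given below. The function name `classify_residue` is hypothetical and the allowed alphabets are copied from the Backus–Naur form later in this post; the real packages may well do this differently.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNOPQRSTVWXYZ!")  # includes X (null), Z (mixed), ! (stop)
NUCLEOTIDES = set("acgtxz")                    # includes x (null), z (mixed)


def classify_residue(token: str) -> str:
    """Return 'amino_acid' or 'nucleotide'; raise ValueError for anything else."""
    if len(token) != 1:
        raise ValueError(f"expected a single character, got {token!r}")
    if token in AMINO_ACIDS:
        return "amino_acid"
    if token in NUCLEOTIDES:
        return "nucleotide"
    # Wrong case or unknown symbol: fail loudly rather than guess
    raise ValueError(f"{token!r} is neither an UPPERCASE amino acid nor a lowercase base")


print(classify_residue("S"), classify_residue("a"), classify_residue("z"))
```

Failing on an unexpected case means a lowercase `s` is rejected outright rather than silently being read as serine.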

Wildcards, other special characters and reserved words

This is designed to be general and expandable (especially for indels)

* is reserved to mean any residue (or base, depending on context) in a gene. Note that -* is expanded to mean ‘any promoter position’
! is reserved for the STOP codon (rather than Stop or *, as is conventional).
@ is reserved as a delimiter between gene and mutation e.g. rpoB@S450L. In TB, _ is often used, but _ is found in gene names, hence the difference.
? is a wildcard for any non-synonymous mutation and = is a wildcard for synonymous mutations.
& is logical AND and allows variants to be combined.
> is reserved for genome-level variants where base changes are specified w.r.t. a genome locus e.g. 7611365a>t
^ was introduced to cater for epistatic rules e.g. whilst a frameshift in Rv0678 would be expected to confer resistance to BDQ, if it co-occurs with a premature stop in mmpL5 then the net effect on BDQ would be susceptibility, e.g. ^Rv0678@*_fs&mmpL5@*! .
: allows us to (optionally) specify the minimum number of reads (if followed by an int) or fraction of reads (if followed by a float) required to support the specified mutation thereby triggering the rule e.g. gyrA@A90V:4 means “four or more reads supporting Ala90Val in gyrA”.
indel means “any insertion or deletion at this nucleotide position” (cannot be codon)
ins means “any insertion at this position” and may be made more specific by appending (after a _) either the number of bases inserted or the sequence of bases inserted.
del means “any deletion at this position” and may be made more specific by appending (after a _) the number of bases deleted. Alternatively, when used on its own (i.e. without a position), it is followed by the proportion (as a float) of the gene deleted.
fs means “any frame shifting insertion or deletion at this position”
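
To make the roles of `^`, `&`, `@` and `:` concrete, here is a rough sketch that splits a rule into its parts. The names `Variant` and `parse_rule` are hypothetical and this is not the piezo parser; it does not validate the mutation itself and it does not handle genome-level variants written with `>`, which have no gene prefix.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Variant:
    gene: str
    mutation: str
    min_reads: Optional[int] = None       # set when ':' is followed by an int
    min_fraction: Optional[float] = None  # set when ':' is followed by a float


def parse_rule(rule: str) -> Tuple[bool, List[Variant]]:
    """Split a catalogue rule into (is_epistatic, list of variants)."""
    epistatic = rule.startswith("^")
    if epistatic:
        rule = rule[1:]
    variants = []
    for part in rule.split("&"):              # '&' is logical AND
        body, colon, support = part.partition(":")
        gene, at, mutation = body.partition("@")
        if not (gene and at and mutation):
            raise ValueError(f"expected gene@mutation, got {part!r}")
        variant = Variant(gene, mutation)
        if colon:
            if "." in support:
                variant.min_fraction = float(support)
            else:
                variant.min_reads = int(support)
        variants.append(variant)
    return epistatic, variants


print(parse_rule("gyrA@A90V:4"))
```

Here `gyrA@A90V:4` yields a single, non-epistatic variant with `min_reads` set to 4, matching the "four or more reads" reading given above.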

Some examples

Example of grammar	Description
`rpoB@*=`	Any synonymous mutation in the coding region of rpoB
`rpoB@*?`	Any non-synonymous mutation in the coding region of rpoB
`rpoB@-*?`	Any mutation in the (assumed) promoter of rpoB. Typically we assume this is 100 bases upstream of the gene start or until the next gene, whichever comes first.
`rpoB@*_indel`	Any insertion or deletion in the coding region of rpoB
`rpoB@*_ins`	Any insertion in the coding region of rpoB
`rpoB@*_ins_5`	Any insertion of five bases in the coding region of rpoB
`rpoB@*_del_5`	Any deletion of 5 bases in the coding region of rpoB
`rpoB@del_0.5`	Any deletion of over half the gene.
`rpoB@*_fs`	Any frame shifting (`length % 3 != 0`) insertion or deletion in the coding region of rpoB
`rpoB@S450=`	Any synonymous mutation to Ser450 in rpoB
`rpoB@S450?`	Any non-synonymous mutation to Ser450 in rpoB
`rpoB@S450L`	Any Ser450Leu mutation in rpoB
`rpoB@S450L:0.1`	Any Ser450Leu in rpoB supported by 10% or more of the reads that map here. Note that the code that parses the VCF file must be aware that you are looking for a minor allele at this position as, depending on the variant caller, a mixed call may trigger a filter.
`rpoB@S450L:3`	Any Ser450Leu in rpoB supported by three or more of the reads that map here
`rpoB@1200_ins_ttg`	Any insertion of `ttg` at nucleotide position 1200 of the gene rpoB
`rpoB@1200_del_2`	Any deletion of 2 bases at nucleotide position 1200 of the gene rpoB.

More examples can be found in NOMENCLATURE.md in the piezo code repository.
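
As a small worked illustration of the indel rows in the table, the sketch below lists which gene-wide wildcard rules an observed indel would satisfy. The helper `general_indel_rules` is hypothetical and only covers the `*` (whole coding region) forms, not position-specific or sequence-specific rules.

```python
from typing import List


def general_indel_rules(gene: str, kind: str, length: int) -> List[str]:
    """Gene-wide wildcard rules matched by an indel of `length` bases.

    `kind` is "ins" for an insertion or "del" for a deletion.
    """
    if kind not in ("ins", "del"):
        raise ValueError("kind must be 'ins' or 'del'")
    if length < 1:
        raise ValueError("length must be a positive number of bases")
    rules = [
        f"{gene}@*_indel",            # any insertion or deletion
        f"{gene}@*_{kind}",           # any insertion (or any deletion)
        f"{gene}@*_{kind}_{length}",  # an indel of exactly this many bases
    ]
    if length % 3 != 0:
        rules.append(f"{gene}@*_fs")  # the reading frame is shifted
    return rules


print(general_indel_rules("rpoB", "ins", 5))
print(general_indel_rules("rpoB", "del", 3))
```

A single observed indel can therefore trigger several rules at once, which is why the catalogue needs a hierarchy to decide which one wins.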

Backus–Naur form

This is a definition of the grammar acceptable within a catalogue, where <gene-name> is any valid gene or locus name (usually matching the regex [a-zA-Z0-9_]+). It is missing the definition of epistatic rules but is otherwise complete. See here.

<complete-mutation> ::= <mutation> | <mutation>":"<number> | <mutation>":0."<number> | <complete-mutation>"&"<complete-mutation>
<mutation> ::= 
               <gene-name>"@"<nucleotide><position><nucleotide> | 
               <gene-name>"@"<amino-acid><number><amino-acid> |
               <gene-name>"@"<position>"_ins_"<nucleotides> |
               <gene-name>"@"<position>"_ins_"<number> |
               <gene-name>"@"<position>"_ins" |
               <gene-name>"@"<position>"_del_"<nucleotides> |
               <gene-name>"@"<position>"_del_"<number> |
               <gene-name>"@"<position>"_del" |
               <gene-name>"@"<position>"_indel" |
               <gene-name>"@"<position>"_fs" |
               <gene-name>"@"<pos-wildcard><wildcard> |
               <gene-name>"@"<nucleotide><pos>"?" |
               <gene-name>"@"<amino-acid><number>"?" |
               <gene-name>"@"<positive-position>"=" |
               <gene-name>"@del_0."<number> |
               <gene-name>"@del_1.0"

<wildcard> ::= "?" | "="

<positive-position> ::= <number> | "*"

<position> ::= <pos> | <pos-wildcard>

<pos-wildcard> ::= "*" | "-*"

<pos> ::= <number> | "-"<number>

<nucleotides> ::= <nucleotide> | <nucleotide><nucleotides>

<nucleotide> ::= "a" | "c" | "t" | "g" | "x" | "z"

<amino-acid> ::= "A" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "V" | "W" | "X" | "Y" | "Z" | "!"

<number> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | <number><number>
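
One way to sanity-check rules against this grammar is to transcribe it into a regular expression. The sketch below is a hand translation, not the validation used by piezo: `<nucleotides>` is taken to mean one or more bases, the large-deletion forms are read as `del_0.<number>` and `del_1.0`, and epistatic rules (the `^` prefix) are rejected because they are not covered by the grammar. The function name `is_valid` is hypothetical.

```python
import re

GENE = r"[a-zA-Z0-9_]+"
NUC = r"[acgtxz]"
AA = r"[ACDEFGHIKLMNOPQRSTVWXYZ!]"
NUM = r"[0-9]+"
POS = rf"-?{NUM}"                    # <pos>
POS_WILD = r"-?\*"                   # <pos-wildcard>
POSITION = rf"(?:{POS}|{POS_WILD})"  # <position>

# One alternative per line of <mutation>, in the same order as the grammar
MUTATION_FORMS = [
    rf"{NUC}{POSITION}{NUC}",
    rf"{AA}{NUM}{AA}",
    rf"{POSITION}_ins_{NUC}+",
    rf"{POSITION}_ins_{NUM}",
    rf"{POSITION}_ins",
    rf"{POSITION}_del_{NUC}+",
    rf"{POSITION}_del_{NUM}",
    rf"{POSITION}_del",
    rf"{POSITION}_indel",
    rf"{POSITION}_fs",
    rf"{POS_WILD}[?=]",
    rf"{NUC}{POS}\?",
    rf"{AA}{NUM}\?",
    rf"(?:{NUM}|\*)=",
    rf"del_0\.{NUM}",
    r"del_1\.0",
]
ALTERNATIVES = "|".join(MUTATION_FORMS)
MUTATION = rf"{GENE}@(?:{ALTERNATIVES})"
# Optional ':' suffix giving a minimum read count or a fraction of reads
SUPPORTED = rf"{MUTATION}(?::(?:0\.{NUM}|{NUM}))?"
COMPLETE = re.compile(rf"{SUPPORTED}(?:&{SUPPORTED})*")


def is_valid(rule: str) -> bool:
    """True if the whole rule matches <complete-mutation>."""
    return COMPLETE.fullmatch(rule) is not None


for rule in ["rpoB@S450L", "rpoB@1200_ins_ttg", "rpoB@s450l", "rpoB_S450L"]:
    print(rule, is_valid(rule))
```

The last two rules are rejected: the first uses lowercase letters that are not valid bases, and the second uses `_` rather than `@` as the delimiter.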

Outcomes

These are not fixed, but there is a column in the catalogue, PREDICTION_VALUES, that contains a string of the supported values e.g. RUS means a ternary resistant / unknown / susceptible system is supported whilst RFUS indicates a quaternary system. The F indicates Fail and is for situations where there are no, or insufficient, reads to know the genetic sequence at a locus we know is associated with resistance. There is an argument for introducing an I (or equivalent) for mutations whose MIC distributions straddle the ECOFF and so will never be classified as either R or S, to distinguish these from mutations detected in genes known to be associated with resistance for which we have insufficient data to classify them properly (i.e. Us).

Finally, the outcomes have a hierarchy

R > F > U > S

i.e. any resistance rule will override any failure rule which will override any unknown outcome etc. Epistasis breaks this, hence the introduction of it as a specific feature.
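
Ignoring epistasis, this hierarchy can be sketched as a simple reduction over the outcomes of all triggered rules for a drug. The function name `combine` is hypothetical, and returning S when no rule is triggered is an assumption made for this illustration rather than something the grammar prescribes.

```python
PRIORITY = "RFUS"  # highest priority first: R > F > U > S


def combine(predictions, default="S"):
    """Return the highest-priority outcome among the triggered rules.

    `default` is what is returned when no rule was triggered (an assumption).
    """
    # str.index gives the rank of each outcome; the smallest rank wins
    return min(predictions, key=PRIORITY.index, default=default)


print(combine(["S", "U", "R"]))
print(combine(["U", "F"]))
```

An epistatic rule would need to be applied before this reduction, since it can turn an R into an S regardless of the ordering.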
