During the CRyPTIC project it has become obvious that we need a grammar to describe genetic changes that is readable by both humans and code, avoids confusion, and still allows for the rather sophisticated rules that are currently being developed.
For example, in the supplement of the recent publication of the now quite extensive genetic resistance catalogue for the four front-line anti-TB compounds in New Eng J Med you will find rules such as “any frame shift in this gene”. Now how do you encode that?
To try and resolve this issue ahead of time, I propose the following grammar based on these assumptions.
All comments welcome!
- Resistance to an antimicrobial can be predicted by genetics, i.e. the presence or absence of specific genetic sequences. AMR catalogues therefore need to specify
  - the presence or absence of genes (i.e. coding sequences)
  - changes, insertions or deletions to coding sequences and also to promoter regions
  - which version of the reference sequence they are comparing to
- Any code that parses catalogues should
  - apply general rules first and specific rules last to allow specific entries to override general entries.
  - have unit testing using known, high-confidence mutations to provide confidence that the code is working as intended
- Where possible, catalogues should be complete and computer-readable, i.e. no logic should be written into any computer code and the catalogue should not contain rules written in sentences! For example, catalogues should specify the effect of synonymous mutations in the coding region of genes of interest.
- This grammar is designed with non-additive qualitative AMR catalogues in mind, i.e. if both mutations X and Y confer Resistance, then having both mutations present would also lead to a prediction of R. Using the grammar for quantitative, additive catalogues which predict the change in minimum inhibitory concentration (MIC) with respect to an arbitrary reference (e.g. the mode MIC for susceptible strains) would be simple. Since each row is a mutation, describing non-linearities is more challenging. This could be achieved by allowing an entry in the catalogue to be a list of mutations (or gene presences) which, being more specific, would then override any more general rules.
For core chromosomal genes, the name of the gene in the catalogue must match either the gene name or locus tag in the GenBank reference. E.g. rpoB in M. tuberculosis can also be referred to as Rv0667.
    gene            759807..763325
                    /gene="rpoB"
                    /locus_tag="Rv0667"
For genes that are not present in the reference genome for a species, for example those on plasmids or other mobile genetic elements, ensuring consistency of nomenclature is more difficult. The developer must ensure that the name given to specific genetic elements by the bioinformatics workflow matches the name found in the catalogue, otherwise there will be false negatives leading to very major errors.
Amino acids and nucleotides
- Amino acids are always given in UPPERCASE. Nucleotides in lowercase. This should be checked by the code and fail if this is not the case (‘halt and catch fire’).
- Het calls at present are given by the letter z, hence Z for an amino acid or z for a base.
- Likewise, null calls are given by the letter x, hence X for an amino acid or x for a base.
- Hence code should insist that
aminoacids in ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','X','Y','Z','!']
nucleotides in ['a','c','t','g','x','z']
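As a minimal sketch of the ‘halt and catch fire’ check described above, the two alphabets can be held as sets and any symbol outside them rejected. The names `AMINO_ACIDS`, `NUCLEOTIDES` and `check_residue` are hypothetical and only used for illustration.

```python
# Alphabets taken from the lists above; the names are illustrative only.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWXYZ!")
NUCLEOTIDES = set("actgxz")


def check_residue(symbol: str, is_cds: bool) -> None:
    """Raise ValueError unless symbol is a legal amino acid (CDS) or base."""
    allowed = AMINO_ACIDS if is_cds else NUCLEOTIDES
    if symbol not in allowed:
        kind = "amino acid" if is_cds else "nucleotide"
        raise ValueError(f"{symbol!r} is not a valid {kind}")


check_residue("S", is_cds=True)    # uppercase amino acid: accepted
check_residue("z", is_cds=False)   # het call at a base: accepted
```

Passing a lowercase letter for a CDS (or an uppercase one for a base) raises, so a catalogue with the wrong case fails early rather than silently matching nothing.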
Wildcards and other special characters
This is designed to be general and expandable (especially for indels)
* is reserved to mean any residue (or base, depending on context). Note that -* is expanded to mean ‘any promoter position’
! is reserved for the STOP codon. (rather than
? is a wildcard for any non-synonymous mutation and
= is a wildcard for synonymous mutations.
Presence or absence of genes
This is the highest level of the hierarchy and these rules are therefore applied first. A gene presence entry is simply
To avoid confusion with the wildcards, we use ~ to indicate logical NOT hence
indicates the absence of gene pncA.
- There are two types of mutations: single nucleotide polymorphisms (SNP) or insertions/deletions (INDEL), each of which can apply to a coding region of a protein (CDS) or the coding region of ribosomal RNA (RNA) or a promoter (PROM). A CDS is always translated into an amino acid sequence and hence the numeric position is the number of the amino acid residue, whereas a gene encoding rRNA needs to be treated as nucleotides, so is not translated.
- In the catalogue these are specified by the TYPE (SNP or INDEL) and AFFECTS (CDS, RNA or PROM). No other values are allowed.
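Since no other values are allowed, a parser can reject a catalogue row whose TYPE or AFFECTS field is unexpected. The sketch below assumes each row exposes those two fields as strings; `check_row` is a hypothetical helper name.

```python
ALLOWED_TYPES = {"SNP", "INDEL"}
ALLOWED_AFFECTS = {"CDS", "RNA", "PROM"}


def check_row(mutation_type: str, affects: str) -> None:
    """Raise ValueError if either field holds a value outside the grammar."""
    if mutation_type not in ALLOWED_TYPES:
        raise ValueError(f"TYPE must be one of {sorted(ALLOWED_TYPES)}, got {mutation_type!r}")
    if affects not in ALLOWED_AFFECTS:
        raise ValueError(f"AFFECTS must be one of {sorted(ALLOWED_AFFECTS)}, got {affects!r}")


check_row("SNP", "CDS")     # accepted
check_row("INDEL", "PROM")  # accepted
```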
Single nucleotide polymorphisms (SNPs)
The general format is
Splitting the mutation using an at symbol (@) therefore always gives you 2 strings, which is used to identify this as a SNP. Note historically we have used an underscore (_) but some of the gene names contain this symbol and therefore it cannot be used as a delimiter. Below are three examples for a CDS, RNA and PROM SNP:
Note that position is context dependent! Hence for the CDS mutation it is amino acid number, whilst for the RNA and PROM mutation it is nucleotide number! Since the reference amino acid or base is recorded in all cases, this is always checked against the supplied (H37Rv) GenBank file and a warning is logged if it is different; this likely indicates a different version of the reference was used to define the mutation. The catalogue can be checked for consistency against the reference using gemucator, so any warnings generated merit investigation.
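The splitting step can be sketched as below. This is not the project's parser: the regular expression is an assumption inferred from entries such as rpoB@S450T and rpoB@*? used elsewhere in this post, the promoter entry fabG1@c-15t is only an illustrative string, and `Snp` and `parse_snp` are hypothetical names. Checking the reference residue against the GenBank file is left out.

```python
import re
from typing import NamedTuple, Optional


class Snp(NamedTuple):
    gene: str
    ref: Optional[str]  # None when the position is a wildcard
    position: str       # e.g. "450", "-15", "*" or "-*"
    alt: str            # a residue/base, or one of the wildcards ? = *


# Either <ref><position><alt>, or a wildcard position (* or -*) followed by <alt>.
_SNP = re.compile(r"^(?:([A-Za-z!])(-?\d+)|(-?\*))([A-Za-z!?=*])$")


def parse_snp(mutation: str) -> Snp:
    parts = mutation.split("@")
    if len(parts) != 2 or not parts[0]:
        raise ValueError(f"expected gene@mutation, got {mutation!r}")
    gene, change = parts
    match = _SNP.match(change)
    if match is None:
        raise ValueError(f"cannot parse SNP {change!r}")
    ref, position, wildcard_position, alt = match.groups()
    return Snp(gene, ref, position or wildcard_position, alt)


print(parse_snp("rpoB@S450T"))   # CDS: position is an amino acid number
print(parse_snp("fabG1@c-15t"))  # PROM: position is a (negative) nucleotide number
print(parse_snp("rpoB@*?"))      # any non-synonymous change at any position
```

Keeping the position as a string preserves the difference between `*` and `-*`; converting it to an integer is left to the caller, who knows whether the entry is CDS, RNA or PROM.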
Any non-synonymous mutation can be encoded as
Any non-synonymous (which it has to be) mutation to the stop codon only makes sense for CDS mutations i.e.
The synonymous mutations are obviously
* to mean ‘any position’ then we can start to create more complex rules.
For example, any non-synonymous mutation at any amino acid in the coding sequences of these genes
Whilst any non-synonymous mutation in the promoter of these genes is
Note that the reference amino acid or base cannot, and therefore is not, specified. Likewise, all synonymous mutations in the coding sequences are
Lastly, these rules have to be applied in a descending hierarchy which is based on the assumption that more specific rules should override more general rules. This is a made-up example.
rpoB@*= S
rpoB@*? U
rpoB@S450? R
rpoB@S450T S
The first two rules say that synonymous mutations in the coding region of rpoB have no effect but any non-synonymous mutation has an unknown effect (U). The third rule overrides this at position 450, so adds “except at position 450 where any non-synonymous mutation is classified R”, then the final rule adds a final exception: “except if the alt/target amino acid is threonine, in which case it is classified S”.
An important implication of this is that the rules can INTERACT and this needs to be borne in mind (but this was always true...).
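The override behaviour of the made-up rpoB example can be sketched as a loop in which every matching rule replaces the prediction made so far. The sketch assumes the rules are already listed from general to specific (a real parser would have to sort them), handles CDS entries only, and uses the hypothetical names `RULES`, `rule_matches` and `predict`.

```python
import re

# The made-up example, ordered from most general to most specific.
RULES = [("rpoB@*=", "S"), ("rpoB@*?", "U"), ("rpoB@S450?", "R"), ("rpoB@S450T", "S")]

_RULE = re.compile(r"^(\w+)@(?:([A-Z!])(\d+)|\*)([A-Z!?=])$")


def rule_matches(rule: str, gene: str, ref: str, position: int, alt: str) -> bool:
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"cannot parse rule {rule!r}")
    r_gene, r_ref, r_position, r_alt = match.groups()
    if r_gene != gene:
        return False
    # A rule with an explicit position must agree on both position and reference.
    if r_position is not None and (int(r_position) != position or r_ref != ref):
        return False
    if r_alt == "=":          # synonymous wildcard
        return alt == ref
    if r_alt == "?":          # non-synonymous wildcard
        return alt != ref
    return alt == r_alt       # a specific alt amino acid


def predict(gene: str, ref: str, position: int, alt: str):
    prediction = None
    for rule, outcome in RULES:
        if rule_matches(rule, gene, ref, position, alt):
            prediction = outcome  # later, more specific rules override earlier ones
    return prediction


print(predict("rpoB", "S", 450, "L"))  # R: S450? overrides *?
print(predict("rpoB", "S", 450, "T"))  # S: S450T overrides S450?
print(predict("rpoB", "D", 435, "V"))  # U: only *? matches
```

Because the last matching rule wins, adding or reordering rows changes predictions elsewhere, which is the interaction between rules that needs watching.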
Insertions or deletions of nucleotides (INDELs)
INDELs require a more complex hierarchy, although only the top level is used for the time being. The top level is
which means any insertion or deletion of any length at this position. If position is positive it is the nucleotide number within the coding sequence. If negative, it is within the promoter.
This can be overridden with more specific rules, the first of which is
which means any frame-shifting mutation at the position. In other words the length of the insertion or deletion is not divisible by three. Then we have a more specific format
where the length is specified e.g.
means any insertion of two bases at position 1300 in rpoB. Logically, the numbering is again nucleotide (not amino acid residue). Finally, and not implemented yet, the specific bases inserted can be specified
Note that the deepest rule for deletions is the layer above, i.e. rpoB@1300_del_2, since we cannot specify which bases were deleted! Hence we end up with this descending hierarchy
rpoB@*_indel                        any insertion or deletion in the CDS of rpoB
rpoB@*_ins, rpoB@*_del              any insertion (or deletion) in the CDS
rpoB@*_ins_2, rpoB@*_del_2          any insertion of 2 bases (or deletion of 2 bases) in the CDS
rpoB@*_fs                           any frameshifting insertion or deletion in the CDS (notice that this is rpoB@*_ins_1 + rpoB@*_ins_2 + rpoB@*_ins_4 + rpoB@*_ins_5 +... and the same for deletions)
rpoB@1300_indel                     any insertion or deletion at nucleotide 1300 in the CDS
rpoB@1300_ins, rpoB@1300_del        any insertion (or deletion) at nucleotide 1300 in the CDS
rpoB@1300_ins_2, rpoB@1300_del_2    any insertion of length 2 (or deletion of length 2) in the CDS
rpoB@1300_ins_ca                    insertion of bases ca at position 1300 in the CDS (does not make sense for deletions)
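The levels of this hierarchy can be read off an entry by splitting first on @ and then on underscores. The sketch below is one possible reading of the formats listed above, not an existing implementation; `IndelRule`, `parse_indel` and `is_frameshift` are hypothetical names. Splitting on @ first means underscores in gene names do no harm.

```python
from typing import NamedTuple, Optional


class IndelRule(NamedTuple):
    gene: str
    position: str           # "1300", "-15", "*" or "-*"
    kind: str               # "indel", "ins", "del" or "fs"
    length: Optional[int]   # None when any length is allowed
    bases: Optional[str]    # only ever set for insertions


def parse_indel(rule: str) -> IndelRule:
    gene, separator, rest = rule.partition("@")
    fields = rest.split("_")
    if not gene or not separator or len(fields) not in (2, 3):
        raise ValueError(f"cannot parse INDEL {rule!r}")
    position, kind = fields[0], fields[1]
    if position not in ("*", "-*"):
        int(position)  # raises ValueError if the position is not an integer
    if kind not in ("indel", "ins", "del", "fs"):
        raise ValueError(f"unknown INDEL kind {kind!r}")
    length = bases = None
    if len(fields) == 3:
        detail = fields[2]
        if kind in ("ins", "del") and detail.isdigit():
            length = int(detail)
        elif kind == "ins" and detail and set(detail) <= set("acgt"):
            bases, length = detail, len(detail)
        else:
            raise ValueError(f"cannot parse INDEL detail {detail!r}")
    return IndelRule(gene, position, kind, length, bases)


def is_frameshift(length: int) -> bool:
    """A frame shift is any indel whose length is not divisible by three."""
    return length % 3 != 0


print(parse_indel("rpoB@1300_ins_ca"))
print(is_frameshift(2), is_frameshift(3))
```

Specifying bases for a deletion (e.g. rpoB@1300_del_ca) is rejected, in line with the note that the deepest rule for deletions is the length.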
For example, suppose that any insertion that doesn’t frame shift (i.e. introduces a whole number of amino acids) is susceptible, but any frame-shifting insertion or deletion confers resistance, whilst not enough deletions have been observed to classify them. In addition, insertions at position 1300 are classified as conferring resistance. This is encoded as
rpoB@*_ins S
rpoB@*_del U
rpoB@*_fs R
rpoB@1300_ins R
Here the hierarchy ensures that the susceptible insertions (or unknown deletions) that are actually frame shifts can be overridden with an R.
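A minimal sketch of how these four rules combine, again letting each later matching rule override the earlier prediction. It assumes the rows are listed from general to specific, only handles the levels used in this example (no lengths or bases in the rule), and `INDEL_RULES`, `indel_rule_matches` and `predict_indel` are hypothetical names.

```python
# The example above, in the order the rules are applied.
INDEL_RULES = [
    ("rpoB@*_ins", "S"),
    ("rpoB@*_del", "U"),
    ("rpoB@*_fs", "R"),
    ("rpoB@1300_ins", "R"),
]


def indel_rule_matches(rule: str, gene: str, position: int, kind: str, length: int) -> bool:
    r_gene, _, rest = rule.partition("@")
    r_position, r_kind = rest.split("_")[:2]
    if r_gene != gene:
        return False
    if r_position == "*":        # any position in the coding sequence
        if position < 0:
            return False
    elif r_position == "-*":     # any position in the promoter
        if position >= 0:
            return False
    elif int(r_position) != position:
        return False
    if r_kind == "indel":
        return True
    if r_kind == "fs":           # length not divisible by three
        return length % 3 != 0
    return r_kind == kind        # "ins" or "del"


def predict_indel(gene: str, position: int, kind: str, length: int):
    prediction = None
    for rule, outcome in INDEL_RULES:
        if indel_rule_matches(rule, gene, position, kind, length):
            prediction = outcome
    return prediction


print(predict_indel("rpoB", 500, "ins", 3))   # S: in-frame insertion
print(predict_indel("rpoB", 500, "del", 1))   # R: the frame shift overrides U
print(predict_indel("rpoB", 1300, "ins", 3))  # R: the position-specific rule wins
```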
For promoters the picture is a bit more straightforward since there is no concept of a frame shift.
fabG1@-*_indel                      any insertion or deletion in the promoter of fabG1
fabG1@-*_ins, fabG1@-*_del          any insertion (or deletion) in the promoter
fabG1@-15_indel                     any insertion or deletion at nucleotide -15 in the promoter
fabG1@-15_ins_2, fabG1@-15_del_2    any insertion of length 2 (or deletion of length 2) in the promoter
fabG1@-15_ins_ca                    insertion of bases ca at position -15 in the promoter