When strains are evolving at 1 to 10 nucleotide changes per year per genome, it is important to address as much of the genome sequence information as possible within the limits of what can be achieved accurately. This cannot be 100% of the genome with current sequencing technologies. For example, regions that are repeated in the genome, either identically or with minor variations (such as the ribosomal RNA sequences), are inherently difficult to address. Exactly how much of a genome can be robustly addressed varies by species and with differences in the underlying composition of their genomes.
Sequence Typing with the widely used MLST method typically addresses 3000 to 4000 nucleotides derived from sections of 5 to 7 highly conserved genes, which is well under 1% of an average genome sequence. Its WGS-based successor, cgMLST, typically addresses the whole sequence of 50 to 70% of the genes present in individual strains, representing up to 50 or 60% of the sequence. wgMLST is a more inclusive version of this analysis that is used to assess specific groups of strains rather than the whole species, and it can increase this by around 5 to 10%. However, in addition to excluding genes that are not universally present, non-protein-coding sequences, and genes in closely related families, none of these methods uses the SNP-level resolution and information available. Instead, they assign arbitrary numbers to the different versions of the included genes, so two versions that differ at one position are treated the same as two that differ at many; their real resolution is therefore less than the sequence length they address. SNP-level analysis requires separate, additional bioinformatic analysis of strains that have already been identified as related at the level of Sequence Types and Clonal Clusters. This is perfectly reasonable if it is remembered that the original purpose of Typing was either to provide supporting evidence for an already suspected set of connected strains or to exclude a connection by showing that strains were unrelated. It was never to proactively detect outbreaks or to direct infection prevention and control.
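The loss of resolution from allele numbering can be illustrated with a minimal sketch. The sequences, locus count, and function names below are invented for illustration only and do not come from any MLST scheme or tool; the point is simply that an allele profile records whether a gene version differs, not by how much.

```python
def allele_numbers(sequences):
    """Assign an arbitrary integer to each distinct sequence, in order of first appearance."""
    seen = {}
    numbers = []
    for seq in sequences:
        if seq not in seen:
            seen[seq] = len(seen) + 1
        numbers.append(seen[seq])
    return numbers


def build_profiles(strains):
    """Turn per-strain lists of locus sequences into per-strain allele profiles."""
    per_locus = [allele_numbers(locus_seqs) for locus_seqs in zip(*strains)]
    return [tuple(profile) for profile in zip(*per_locus)]


def allele_distance(profile_a, profile_b):
    """Count loci whose allele numbers differ (what an MLST-style comparison sees)."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def snp_distance(strain_a, strain_b):
    """Count differing nucleotide positions across equal-length, aligned loci."""
    return sum(
        x != y
        for seq_a, seq_b in zip(strain_a, strain_b)
        for x, y in zip(seq_a, seq_b)
    )


# Toy data (hypothetical): three strains, two loci of eight nucleotides each.
strain_a = ["ACGTACGT", "TTGACCAA"]
strain_b = ["ACGTACGA", "TTGACCAA"]  # one SNP relative to strain_a
strain_c = ["TCGAACCA", "TTGACCAA"]  # four SNPs relative to strain_a

profiles = build_profiles([strain_a, strain_b, strain_c])

# Both comparisons differ at exactly one locus, so the allele distance is the same...
print(allele_distance(profiles[0], profiles[1]), allele_distance(profiles[0], profiles[2]))
# ...although the underlying SNP counts are different.
print(snp_distance(strain_a, strain_b), snp_distance(strain_a, strain_c))
```

In this toy case both pairs are one allele apart, while the SNP counts are 1 and 4, which is the information a follow-up SNP-level analysis would have to recover.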
Analysis of SNPs within genomes is more complicated. With previous methods, resolution is limited by whether a high-quality, highly related reference is available, and for some species it is not. Where there is a reasonable reference, resolution depends on how similar that reference is, how many strains are compared, and how diverse they are. Whole genome sequencing information might be used, but that does not mean the whole genome is addressed. As stated above, some regions cannot be safely analyzed and, in addition, only the genome shared by the analyzed strains is included. This might be more clearly described as ‘Common genome SNP analysis’, rather than ‘Whole genome SNP analysis’ or ‘Core genome SNP analysis’ (the latter being the SNP-level analysis of what was included in cgMLST). The result for a small analysis using a closely related reference strain is that only 70 to 80% of the genome is addressed. When addressing more diverse strains, this can fall to as low as what is addressed by cgMLST.
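Why the addressed fraction shrinks as more diverse strains are added can be sketched as a set intersection. This is a simplified model under stated assumptions, not a description of any particular pipeline: each strain is represented by the set of reference positions at which it can be safely compared, and the genome length and position ranges are made-up toy values.

```python
def common_genome_fraction(callable_positions, genome_length):
    """Fraction of the reference covered by positions shared by every strain."""
    shared = set.intersection(*callable_positions)
    return len(shared) / genome_length


# Toy reference of 100 positions (hypothetical numbers chosen for illustration).
GENOME_LENGTH = 100

close_1 = set(range(0, 90))    # closely related strain
close_2 = set(range(5, 95))    # another closely related strain
diverse = set(range(30, 100))  # more distant strain sharing less of the reference

# Two close strains share most of the reference.
small_analysis = common_genome_fraction([close_1, close_2], GENOME_LENGTH)

# Adding one diverse strain can only keep or reduce the shared portion.
larger_analysis = common_genome_fraction([close_1, close_2, diverse], GENOME_LENGTH)

print(small_analysis, larger_analysis)
```

Because an intersection never grows when another set is added, the shared genome in this model is bounded by the least similar strain in the comparison, which matches the behaviour described above.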
The real-world challenge is to reduce noise, increase accuracy, and at the same time address the maximum amount of the safely analyzable genome, and to maintain that resolution regardless of whether a highly related reference genome is available and of the number and diversity of strains compared. Genpax achieves this. While reducing noise and increasing accuracy, the analysis typically achieves a resolution 5 to 10% above whole genome SNP analysis at its best (ranging from 75 to 90% of the genome, depending upon the species and its sequence composition), and maintains this regardless of the number and diversity of strains that are addressed. It is equally good at analyzing species for which reference genomes for most strains do not exist (e.g., Salmonella) or where recombination means that they cannot exist (e.g., Campylobacter).