With some variation by species and genome size, most bacterial pathogens change at a rate of up to 10 SNPs per genome per year. A clinical outbreak spanning weeks or months will include strains diverging from one another at double this rate, but it will also contain strains that differ by less, because 99% of strains replicate without error. Meanwhile, sequencing, assembly, and mapping errors can generate tens of errors per Mb, producing noise that exceeds the signal needed to identify recently connected strains, attribute sources, and infer transmission.
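The scale of the mismatch can be made concrete with a short calculation. The sketch below is illustrative only: the 5 Mb genome size, the 8-week outbreak window, and the figure of 10 errors per Mb (the low end of "tens") are assumptions chosen for the example, not measured values, and the function names are hypothetical.

```python
def expected_pairwise_snps(rate_per_genome_per_year, weeks):
    """Expected SNP distance between two strains that shared an ancestor `weeks` ago.

    Both strains accumulate changes independently, so the pair diverges
    at twice the per-genome rate.
    """
    return 2 * rate_per_genome_per_year * weeks / 52


def expected_errors(errors_per_mb, genome_size_mb):
    """Expected number of erroneous differences across one genome."""
    return errors_per_mb * genome_size_mb


# Assumed values for illustration only.
RATE = 10            # SNPs per genome per year (upper figure quoted in the text)
OUTBREAK_WEEKS = 8   # assumed outbreak duration
ERRORS_PER_MB = 10   # assumed low end of "tens of errors per Mb"
GENOME_MB = 5        # assumed genome size

signal = expected_pairwise_snps(RATE, OUTBREAK_WEEKS)
noise = expected_errors(ERRORS_PER_MB, GENOME_MB)

print(f"expected true divergence: {signal:.1f} SNPs")
print(f"expected error differences: {noise} per genome")
print(f"noise-to-signal ratio: {noise / signal:.1f}")
```

Under these assumptions the true divergence is about 3 SNPs, against roughly 50 erroneous differences per genome, so the noise is more than an order of magnitude larger than the signal it obscures.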
Various partial solutions exist, such as focusing on the most accurately addressed parts of the genome, using thresholds that allow a certain number of differences while still typing a strain similarly, and labelling gene versions rather than using the detail of the several differences that they may contain. All involve a sacrifice of resolution and have their own issues. None delivers what is required to proactively detect outbreaks and direct infection prevention and control responses.
The accuracy issues create another problem. To avoid excluding genuinely connected strains, the distance threshold used to identify them is loosened (for example from 10 to 15, to 20 to 30, sometimes more). However, strains that are not connected by transmission can also differ by distances within this range, especially the highly related strains that tend to exist within ‘local healthcare microbiomes’ (those of hospitals, associated care facilities, and healthcare workers). It is essential to discriminate between these unconnected but very similar strains and those that are genuinely transmission- or common source-associated, in order to target infection prevention responses properly and to avoid over-identifying and over-extending the set of strains considered part of an outbreak. Both the under- and over-detection of outbreaks can have significant costs and impacts.
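The effect of loosening the threshold can be sketched with a toy example. Everything here is hypothetical: the isolate names, the pairwise distances, and the use of simple single-linkage grouping are assumptions made for illustration, not a description of any particular typing scheme or platform. P1 to P3 stand for genuinely linked patient isolates whose distances have been inflated by noise, and W1 stands for an unconnected but closely related strain from the same healthcare setting.

```python
def clusters_at_threshold(distances, names, threshold):
    """Group isolates by single linkage: any pair within `threshold` is joined."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical isolates and observed (noisy) pairwise SNP distances.
names = ["P1", "P2", "P3", "W1"]
distances = {
    ("P1", "P2"): 4,
    ("P1", "P3"): 14,   # genuinely linked, distance inflated by noise
    ("P2", "P3"): 12,   # genuinely linked, distance inflated by noise
    ("P1", "W1"): 22,   # unconnected but closely related strain
    ("P2", "W1"): 25,
    ("P3", "W1"): 27,
}

for threshold in (10, 15, 30):
    print(threshold, clusters_at_threshold(distances, names, threshold))
```

With these made-up numbers, a threshold of 10 misses P3, a threshold of 15 recovers the linked group, and a threshold of 30 pulls the unconnected strain W1 into the same cluster. No single cut-off separates the two cases once noise is comparable in size to the true distances.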
The third issue is that the noise is neither constant nor completely random. As a result, it frequently re-orders the connections that can be inferred in the chain of transmission between patients or other sources. Reliably determining the detail of the distances and connections between strains is necessary if clinical genomics is to be used for proactive infection prevention and control as a leading source of information, rather than for confirmation or as an adjunct to epidemiological and other indicators.
Where the noise is coming from
Error is unavoidable in genome sequencing and has always been tolerated. Part of it is generated when the sequencing machine creates the raw sequence data, but most is generated and assimilated during genome assembly. Both result from differences in the data analyzed from the FASTQ file (the universal output file format from DNA sequencers). The properties of this file are influenced by many factors: the quality of the DNA extracted from the bacteria, the preparation of the sequencing library, the flow cell it is run on (even the lane of the flow cell), the sequencing protocol (e.g., fast, or slower with more wash steps), and which reads from the output are used in assembly. This is why some try to demonstrate analysis performance using the same FASTQ file rather than genuine replicate sequencing files, or even simulated read sets, which generate the fewest errors possible but do not represent the real-world situation. Some of these issues are overcome by mapping-based sequence analysis, but this depends on having a good-quality, highly related reference genome, which isn’t always available; for some species it is a practically impossible solution. In addition, some genomes and parts of genomes are inherently more accurately assembled and analyzed than others.
Lowest possible noise with Genpax
Genpax has developed strategies and analysis resources that address the sources of sequencing noise and that recognize the essential nature of a Near-Zero Noise solution. Samples that reflect real-world conditions of independently generated sequences consistently generate Zero distances. These include replicate cultures in true technical replicate DNA sequencing prepared from the same DNA sample, different DNA preparations from co-cultures, and very closely epidemiologically linked independent cultures. No previous platform has achieved this for strains representing clinical isolates and diversity. This is the necessary starting point for an impactful bacterial genomics solution that enables infection prevention and control to detect connected strains and direct the necessary responses.