Updated: May 22
Sequence Typing (ST) is a well-established and widely used methodology. It is most certainly better than its predecessors from the era before affordable and accessible DNA sequencing, and its development and implementation were an enormous step forward for pathogen identification and the study of bacterial population genetics. In its original form, multilocus sequence typing (MLST), it was very much a DNA-based version of its protein-based predecessor, only better in every meaningful way. That conceptual predecessor, multilocus enzyme electrophoresis (MLEE), involved separating proteins on gels in slow ways that were hard to do reproducibly in the same lab, let alone in different ones, and what could be studied was limited to enzymes that could be detected after they were separated. There were other DNA-based tests, such as PFGE and RAPD, but these were also far from ideal or reproducible, with the added disadvantage that you often didn’t know what you were looking at, or why strains were different.
MLST changed this dramatically. Pieces of any gene could be selected, amplified, and sequenced. The same sequences could be amplified in any lab, and the sequences could be shared, so that a lab could get the same result on different days, and different labs could generate results that could be reliably compared. When it started, it was quite expensive, but as amplification and sequencing costs came down, it became cheaper as well as better. It was not perfect: not everyone could afford it, and it uses genes that are always present and change at a useful rate (meaning slowly), so it is based on genes that are very highly conserved, and the changes within them don’t change what the bacteria do. For this last reason, I personally never liked it very much; nor did I like that the STs it assigns don’t readily tell you how closely related different bugs are.
In the new world of whole genome sequencing, ST has evolved, or at least got bigger. It is now cheaper to sequence a whole genome than to specifically amplify 6 to 8 genes and sequence them individually. You can get millions of bases of information for less than the cost of a few thousand, and if all you want are still those few thousand, it’s easier to pull them out of the few million than to get them separately. This is probably why there was little to no resistance to converting MLST-based reference laboratories from focused to whole genome sequencing. Since you are generating information on more than the original 8-ish genes used for MLST, you might as well use it. Core genome ST (cgST) and whole genome ST (wgST) are the result: cgST uses the genes thought to be in all strains, and wgST uses the larger number of genes commonly present in more closely related strains. But they are all ST, which means that the information from each gene looked at is reduced to a single number, a gene version, and it is the string of these versions which makes up ‘the Sequence Type’.
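The reduction described above can be sketched in a few lines of Python. This is a toy illustration, not how any real typing database is implemented: the `TypingScheme` class, the locus names, and the six-base "genes" are all made up for the example, and it assumes that allele and ST numbers are simply handed out in order of first appearance.

```python
class TypingScheme:
    """Toy ST scheme (hypothetical): numbers are assigned in order of first appearance."""

    def __init__(self, loci):
        self.loci = tuple(loci)
        # One lookup table per locus: sequence -> allele number.
        self.alleles = {locus: {} for locus in self.loci}
        # Lookup table for whole profiles: tuple of allele numbers -> ST number.
        self.profiles = {}

    def allele_number(self, locus, sequence):
        known = self.alleles[locus]
        # Any sequence not seen before gets the next free number,
        # however small or large the change from known alleles.
        return known.setdefault(sequence, len(known) + 1)

    def sequence_type(self, genome):
        # Reduce each gene to its allele number, then the string of numbers to one ST.
        profile = tuple(self.allele_number(locus, genome[locus]) for locus in self.loci)
        st = self.profiles.setdefault(profile, len(self.profiles) + 1)
        return st, profile


# Made-up loci and sequences; real loci are hundreds of nucleotides long.
scheme = TypingScheme(["geneA", "geneB", "geneC"])

strain1 = {"geneA": "ATGGCA", "geneB": "TTGACC", "geneC": "GGATCC"}
strain2 = {"geneA": "CCCTTT", "geneB": "AAAGGG", "geneC": "TTTAAA"}  # different everywhere
strain3 = {"geneA": "ATGGCA", "geneB": "TTGACT", "geneC": "GGATCC"}  # one base from strain1

result1 = scheme.sequence_type(strain1)
result2 = scheme.sequence_type(strain2)
result3 = scheme.sequence_type(strain3)

print(result1)  # (1, (1, 1, 1))
print(result2)  # (2, (2, 2, 2))
print(result3)  # (3, (1, 3, 1))
```

Once a gene has been turned into its number, the sequence itself plays no further part: everything downstream works on the short tuple of integers.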
This is the blessing and the curse of ST methods. By reducing each individual version of a gene (an allele) to a single number, it becomes possible to easily work out whether you are dealing with related strains or not. But the price is substantial: the detail of a gene, which averages about 800 or more nucleotides, is sacrificed to a single number, in a way that leaves the detail no longer accessible. It’s also not perfect in other ways: sequencing errors can generate a mistaken ST, and only some of the genome can be compared, even when it is the ‘core’ or a larger set of genes. The ST number used to describe a strain also says nothing about how similar strains are: ST1034 might be completely unrelated to ST1035. An additional problem is that strains often spread and evolve more quickly than the ST changes, and for the most successful clones, the number of strains within an ST can be huge. If only you could compare all strains in ways that enabled you to track and trace them at greater resolution, and which retained the computational and practical deliverability of STs.
The problem is … you can’t. You can’t because the very thing that is most annoying about ST systems is the same thing that enables them to work. Whole genome sequencing isn’t complete genome sequencing. This might sound like splitting hairs, but it isn’t. Just because you put the whole of a genome into your sequencing process, it doesn’t mean that you get all of it out the other end, nor that it’s all joined-up. In fact, it isn’t (a well sequenced genome might still be in 100 or more pieces at the end of the process). Also, different genomes contain different genes, and identifying which bits are equivalent and should be compared is an imperfect science. By focussing on a subset of parts that are more comparable than others (and that coincidentally happen to have a lower tendency to contain sequencing errors), and then reducing this to an ST system, it becomes possible to compare the information in simple, computationally feasible ways. Whether the typing number has 8 parts or a thousand, it is still computationally easy and linear. And this is where we are STuck. At least for now.
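To make the "easy and linear" point concrete, here is a minimal sketch of comparing two typing numbers in Python. The function name `profile_distance` and the example profiles are invented for illustration, and treating a missing gene (`None`) as "skip this position" is an assumption of this sketch rather than a description of any particular tool.

```python
def profile_distance(profile_a, profile_b):
    """Count the positions at which two allele profiles differ.

    One pass over the profiles, so the work grows linearly with the number
    of genes. Positions where either strain has no call (None) are skipped;
    that handling is an assumption made for this sketch.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genes")
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


# A 7-gene profile, MLST-style (made-up allele numbers).
mlst_a = (2, 3, 4, 3, 8, 4, 6)
mlst_b = (2, 3, 4, 3, 8, 4, 11)
mlst_distance = profile_distance(mlst_a, mlst_b)

# A 1000-gene profile: exactly the same function, just a longer tuple.
cg_a = tuple([1] * 1000)
cg_b = tuple([1] * 997 + [2, 5, None])  # two different alleles, one gene not recovered
cg_distance = profile_distance(cg_a, cg_b)

print(mlst_distance)  # 1
print(cg_distance)    # 2
```

The comparison never looks at a nucleotide, which is why it scales so easily, and also why a single-base change and a wholesale replacement of a gene both count as exactly one difference.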
Dr Nigel Saunders, Chief Scientific Officer