Monday, April 03, 2023

BLAST vs. custom tools for pathogen identification

Our analysis of the issues with using BLAST against NCBI databases for pathogen identification is out today: “Studying pathogens degrades BLAST-based pathogen identification.” This paper is the full published version of the preprint I posted about a few months ago. It investigates an emergent dynamic in which biological research and development ends up contaminating public databases with chimeric material, which can confound biosecurity systems that trust those databases.

The most important addition between the preprint and this final version is a set of direct head-to-head comparisons between BLAST against NCBI and two tools specifically designed for biosecurity analysis: our own FAST-NA Scanner and a free tool called SeqScreen. (There are other tools we would have liked to compare against as well, but they were not available for comparison.)

As predicted, the actual biosecurity tools completely dominated BLAST against NCBI, making more than an order of magnitude fewer mistakes. That is not a surprise, but it is nice to see experimentally validated. In fact, each biosecurity tool made only one mistake in judgement, and in both cases it was the same mistake that BLAST against NCBI made, which is an important lesson: the big NCBI databases aren't bad, they're just dirty, so they need a lot of care and refinement when they're being put to a use (like biosecurity determinations) where mistakes can be costly and dangerous.

This is important for biosecurity, but I think people in the larger scientific world need to be aware of it as well. In biology, curation quality really matters, and many people are far too blasé about the potential impact of dirty data on their applications. If you want to do biosecurity right, you need to use an actual biosecurity tool and not just trust the databases. I'm sure the same applies to many aspects of medicine, diagnostics, etc., and I fear that not enough people are taking these issues seriously.
