Tuesday, November 14, 2023

How do you describe genetic construction plans?

ACS Synthetic Biology has just published "Standardized Representation of Parts and Assembly for Build Planning", our new article on how to better communicate about building genetic constructs. The paper is basically a friendlier user manual for the best practice that we wrote up last year.

Fundamentally, this is all just about trying to reduce the confusion that commonly occurs when we're talking about build plans. If somebody shares a sequence, is it for the bit they want synthesized, what a vector will look like after the synthesized bit gets stuck in, what gets digested out of the vector, or what it looks like as part of the final construct after it gets ligated together with other constructs? 

When we were collaborating on building the new iGEM distribution, we ran into a lot of confusion amongst the many different participants along these lines, so we worked out a standard vocabulary for describing what we were talking about, with intuitive names for different stages in typical digestion/ligation assembly processes.
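
To make the ambiguity concrete, here is a minimal Python sketch of one part viewed at each of the stages listed above. Everything in it is an illustrative assumption: the stage names are hypothetical stand-ins rather than the vocabulary from the paper, the sequences are toy placeholders, and the lowercase "gg"/"cc" cut markers are not real restriction sites.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names for illustration; not the paper's terms.
    SYNTHESIZED_FRAGMENT = "the bit ordered for synthesis"
    PART_IN_VECTOR = "the vector after the fragment is inserted"
    DIGESTED_INSERT = "what gets digested out of the vector"
    IN_FINAL_CONSTRUCT = "the final construct after ligation"


@dataclass(frozen=True)
class StagedSequence:
    name: str
    stage: Stage
    sequence: str


def insert_into_backbone(fragment, left, right):
    """Toy model of cloning: surround the fragment with backbone sequence."""
    return StagedSequence(fragment.name, Stage.PART_IN_VECTOR,
                          left + fragment.sequence + right)


def digest(vector, left_cut, right_cut):
    """Toy model of digestion: keep only what lies between the cut markers."""
    start = vector.sequence.index(left_cut) + len(left_cut)
    end = vector.sequence.rindex(right_cut)
    return StagedSequence(vector.name, Stage.DIGESTED_INSERT,
                          vector.sequence[start:end])


def ligate(name, pieces):
    """Toy model of ligation: join the digested pieces end to end."""
    return StagedSequence(name, Stage.IN_FINAL_CONSTRUCT,
                          "".join(p.sequence for p in pieces))


# Placeholder sequences: lowercase flanks are cut markers, uppercase is payload.
fragment = StagedSequence("partA", Stage.SYNTHESIZED_FRAGMENT, "ggATGAAAcc")
vector = insert_into_backbone(fragment, "ttttt", "ttttt")
insert = digest(vector, "gg", "cc")
other = StagedSequence("partB", Stage.DIGESTED_INSERT, "TAATAA")
final = ligate("construct", [insert, other])

for item in (fragment, vector, insert, final):
    print(f"{item.name:10} {item.stage.name:22} {item.sequence}")
```

The same part shows up with a different sequence at each stage, which is why sharing a bare sequence without naming its stage invites confusion.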


And once we humans were clear on what we wanted to say to one another, it was easy enough to take the next step and use SBOL3 to describe it to the machines as well, including the exact reactions one would want to run to actually execute the plan. This is one of the nice things about SBOL, which you can't do with formats like GenBank, FASTA, or GFF: describe not just a construct, but its relationship with other constructs and your whole plan for how to use it.

We're still using this vocabulary quite extensively in the iGEM Engineering Committee, as well as using the representations in our software, and we hope that others will find it useful for clarifying their discussions as well.

Monday, April 03, 2023

BLAST vs. custom tools for pathogen identification

Our analysis of issues with using BLAST against the NCBI databases ("BLAST vs. NCBI" for short) for pathogen identification is out today: “Studying pathogens degrades BLAST-based pathogen identification.” This paper is the full published version of the preprint I posted about a few months ago, investigating an emergent dynamic in which biological research and development ends up contaminating public databases with chimeric material that can confound biosecurity systems that trust those databases.

The most important addition between the preprint and this final version was a direct head-to-head comparison of BLAST vs. NCBI against two tools specifically designed for biosecurity analysis, our own FAST-NA Scanner and a free tool called SeqScreen (there are other tools we would have liked to compare against as well, but they were not available for comparison).

As predicted, the actual biosecurity tools completely dominated BLAST vs. NCBI, making more than an order of magnitude fewer mistakes. That's not a surprise, but it's nice to see experimentally validated. In fact, each biosecurity tool made only one mistake in judgement, and in both cases it was the same mistake that NCBI made, which is an important lesson: the big NCBI databases aren't bad, they're just dirty, so they need a lot of care and refinement when they're being put to a use (like biosecurity determinations) where mistakes can be costly and dangerous.

This is important for biosecurity, but I think people in the larger scientific world need to be aware of it as well. In biology, curation quality really matters, and many people are far too blasé about the potential impact of dirty data on their applications. If you want to do biosecurity right, you need to use an actual biosecurity tool and not just trust the databases. I'm sure the same applies for many aspects of medicine, diagnostics, etc., and I fear that not enough people are taking these issues seriously.