Jake Beal's Next Step

Monday, May 06, 2024

Does your biosecurity screening work?

So, you've decided you don't want to help randos get ahold of smallpox, Ebola, and ricin. That's a good start! You've also decided to check whether the DNA, RNA, or protein sequences that people are asking you to build or work with are dangerous, and you've either built your own biosecurity screening system or obtained one from us or one of the other screening tool providers. Great!

Does it work?

How would you even know?

That's the question that our new paper, Progress and Prospects for a Nucleic Acid Screening Test Set, is beginning to answer.

We organized a collaboration of four tool providers and two synthesis companies to start with the simplest question of all: for which sequences do we agree about whether they're dangerous? Even that question wasn't straightforward, because all of the tools have qualitatively different ways of reporting their analysis, and some approach the question of sequence danger as "innocent until proven guilty" while others are instead "guilty until proven innocent."

Still, we were able to come up with a way of making their results all sufficiently comparable, and we did a first test on three controlled threat organism groups - one viral, one bacterial, and one fungal:

The good news is that we all basically agreed on the viral sequences.
The bad news is that we couldn't agree on about 10% of the bacterial sequences.
The challenging news is that there wasn't enough information to decide one way or another for more than 30% of the bacterial sequences and more than 80% of the fungal sequences.

That last bit, about not being able to decide, is not actually as bad as it might sound: if nobody knows enough about a sequence to decide whether it's dangerous, then it's also not likely that anybody knows enough to actually do something bad with it. And being able to agree on large numbers of sequences is a very good thing if you want to be able to do basic "competence tests" to make sure that a tool isn't making bad decisions.
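To make the three outcomes above concrete (agree, disagree, not enough information), here is a minimal sketch of how per-tool verdicts might be sorted into those bins. This is not the method from the paper: the `flag` / `pass` / `unknown` labels, the `consensus` function, and the example data are all hypothetical, and it assumes each tool's native report has already been mapped onto those three labels.

```python
from collections import Counter

# Hypothetical normalized verdicts: each tool's own report format is assumed
# to have already been mapped onto "flag", "pass", or "unknown".
def consensus(verdicts):
    """Sort one sequence into flag / pass / disagree / undecided."""
    counts = Counter(v for v in verdicts if v != "unknown")
    if not counts:
        return "undecided"           # no tool had enough information
    if len(counts) == 1:
        return next(iter(counts))    # every tool with an opinion agrees
    return "disagree"                # at least one flag and one pass

# Made-up results for four sequences across four tools.
results = {
    "seq_A": ["flag", "flag", "flag", "flag"],
    "seq_B": ["flag", "pass", "flag", "unknown"],
    "seq_C": ["unknown", "unknown", "unknown", "unknown"],
    "seq_D": ["pass", "pass", "unknown", "pass"],
}

labels = {name: consensus(v) for name, v in results.items()}
summary = Counter(labels.values())
print(labels)
print(summary)
```

Treating a sequence as agreed when some tools abstain is a design choice in this sketch; a stricter rule could require every tool to give an opinion.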

This might sound very down in the weeds, but being able to answer this question is actually a big deal, and an important part of making the biosecurity screening framework just put out by the White House actually work. So now over the next few months we're scaling up our "bronze standard" test set effort to cover all of the regulated pathogens and toxins out there, and collaborating with EBRC and NIST to make sure what we're doing can be used to benefit the whole community.

So much of civilization depends on little details of measurement and standards... I just hope that we can work quickly and effectively enough to help ward off the threats that are coming over the next few years.

Tuesday, November 14, 2023

How do you describe genetic construction plans?

ACS Synthetic Biology has just published "Standardized Representation of Parts and Assembly for Build Planning", our new article on how to better communicate about building genetic constructs. The paper is basically a more friendly user manual for the best practice that we wrote up last year.

Fundamentally, this is all just about trying to reduce the confusion that commonly occurs when we're talking about build plans. If somebody shares a sequence, is it for the bit they want synthesized, what a vector will look like after the synthesized bit gets stuck in, what gets digested out of the vector, or what it looks like as part of the final construct after it gets ligated together with other constructs?

When we were collaborating on building the new iGEM distribution, we ran into a lot of confusion amongst the many different participants along these lines, so we worked out a standard vocabulary for describing what we were talking about, with intuitive names for different stages in typical digestion/ligation assembly processes.

And once we humans were clear on what we wanted to say to one another, it was easy enough to take the next step and use SBOL3 to write a simple description for the machines as well, including the exact reactions one would want to run to actually execute the plan. This is one of the nice things about SBOL, which you can't do with formats like GenBank, FASTA, or GFF: describe not just a construct, but its relationship with other constructs and your whole plan for how to use it.

We're still using this vocabulary quite extensively in the iGEM Engineering Committee, as well as using the representations in our software, and we hope that others will find it useful for clarifying their discussions as well.

Monday, April 03, 2023

BLAST vs. custom tools for pathogen identification

Our analysis of issues with using BLAST against NCBI for pathogen identification is out today: “Studying pathogens degrades BLAST-based pathogen identification.” This paper is the full published version of the preprint I posted about a few months ago, investigating an emergent dynamic in which biological research and development ends up contaminating public databases with chimeric material that can confound biosecurity systems that trust those databases.

The most important addition between the preprint and this final version was to make direct head-to-head comparisons between BLAST against NCBI and two tools specifically designed for biosecurity analysis: our own FAST-NA Scanner and a free tool called SeqScreen (there are other tools we'd like to have compared with as well, but they were not available for comparison).

As predicted, the actual biosecurity tools completely dominated BLAST against NCBI, making more than an order of magnitude fewer mistakes. That's not a surprise, but it's nice to see experimentally validated. In fact, each biosecurity tool only made one mistake in judgement, and in both cases it was the same mistake that BLAST against NCBI made, which is an important lesson: the big NCBI databases aren't bad, they're just dirty, and so they need a lot of care and refinement when they're being put to a use (like biosecurity determinations) where mistakes can be costly and dangerous.

This is important for biosecurity, but I also think people need to be aware of this in the larger scientific world as well. In biology, curation quality really matters, and many people are far too blasé about the potential impact of dirty data on their applications. If you want to do biosecurity right, you need to use an actual biosecurity tool and not just trust the databases. I'm sure the same applies for many aspects of medicine, diagnostics, etc., and I fear that not enough people are taking these issues seriously.

Wednesday, July 27, 2022

Multicolor Plate Reader Fluorescence Calibration

Just out in OUP Synthetic Biology, "Multicolor Plate Reader Fluorescence Calibration" extends our prior work on calibrating green fluorescence and cell count to calibrate red and blue fluorescence as well. The results are no surprise (if we can use a green dye, we ought to be able to use other dyes too), but it's valuable to have specific recommendations for dyes to use and to have an interlab study validate that yes, they really do perform as well as the others.

So everybody out there listening, please start using sulforhodamine-101 to calibrate your red fluorescence and Cascade Blue to calibrate your blue fluorescence! Everybody who uses your data will thank you for providing equivalent molecule/cell estimates rather than irreproducible arbitrary or relative units.
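For anyone wondering what turning arbitrary units into equivalent molecules per cell looks like in arithmetic terms, here is a rough sketch. It is not the protocol from the paper: the dilution series, the readings, and the function names are all made up for illustration, and it assumes background-subtracted readings that are linear in dye amount.

```python
def fit_slope_through_origin(xs, ys):
    """Least-squares slope for the model y = slope * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical dye dilution series: molecules of dye per well, and the
# background-subtracted plate reader value (arbitrary units) for each well.
dye_molecules = [1.0e12, 5.0e11, 2.5e11, 1.25e11]
readings_au = [8000.0, 4000.0, 2000.0, 1000.0]

# Arbitrary units produced per molecule of calibrant dye.
au_per_molecule = fit_slope_through_origin(dye_molecules, readings_au)

def to_equivalent_molecules(sample_au):
    """Convert a background-subtracted reading into equivalent dye molecules."""
    return sample_au / au_per_molecule

def per_cell(sample_au, cell_count):
    """Equivalent dye molecules per cell, given an estimated cell count."""
    return to_equivalent_molecules(sample_au) / cell_count

# Example with made-up numbers: a well reading 1600 a.u. with 1e8 cells.
print(per_cell(1600.0, 1.0e8))
```

The point is only that the conversion factor comes from a reference material everyone can buy, so two labs with different instruments can land on comparable numbers.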

Red and blue fluorescence calibrants were just as precise as the prior green and cell-count calibrants

The paper also reports on some of the travails we ran into making the study work: some of the fluorescent proteins we wanted to try out didn't work in our hands, and there were miscellaneous other problems: a promoter sequence got messed up, some things wouldn't synthesize, one of the plasmids seemed problematic, and timing problems meant not all labs could run all constructs.

Problems like that are frustrating, but ultimately I'm happier reporting them than burying them. Remember: if you read a synthetic biology study with lab work and it doesn't talk about failures, it just means they either aren't aware of them or else they've pruned them from the narrative! Calibration methods like these help us see better when things go wrong and understand what's happened.

Thursday, July 14, 2022

Studying Pathogens Degrades BLAST-based Pathogen Identification

Using the BLAST algorithm to search the NCBI databases is the typical way one goes about identifying a DNA sequence, so it's been the typical way biosecurity systems decide if something is potentially a dangerous pathogen or toxin too. Problem is, that's not what BLAST and those databases were designed for, and we've observed that they aren't working as well for that purpose as they used to, as we report in our new preprint: "Studying Pathogens Degrades BLAST-based Pathogen Identification"

Specifically, we've found an inherent problem that is growing in seriousness due to a non-obvious emergent dynamic. Now that sequencing and bioengineering tools are getting much more accessible, lots of sequences are being studied by modifying them with "tool" sequences like purification tags, fluorescent proteins, stabilizing sequences, etc. Those sequences get (appropriately) classified based on what's being studied, and now you've got chimeric material that includes both the subject of study and the bioengineering tool. Then when you run BLAST on a sequence with that tool, you start finding that tools are classified as what they're used to study.

Example of BLAST classification failure: using a purification tag to study an Ebola protein means that now a fluorescent protein plus a purification tag gets mis-identified as Ebola.

This doesn't seem to be much of a problem for most uses of BLAST against NCBI, but it's poisonous for making biosecurity decisions, since it can cause benign sequences to be classified as dangerous or vice versa. Moreover, the effect gets stronger the more problematic a pathogen is (since more sequences are recorded) and the more useful a tool is (since more chimeric material is produced), meaning that the problem is most likely to occur in the most important cases. For example, over the last two years, quite a lot of stuff has started coming back as COVID-19, since everybody in the world is studying COVID-19 with all of the tools that they can get their hands on.
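Here is a toy illustration of that dynamic. It is deliberately not BLAST: it is a crude "best shared k-mer" lookup over a tiny hand-made database, and the sequences, labels, and function names are placeholders chosen for illustration only. The point is just to show how one chimeric record can make a tool sequence come back labeled as the thing it was used to study.

```python
def kmers(seq, k=8):
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_label(query, database, k=8):
    """Label of the record sharing the most k-mers with the query, or None."""
    q = kmers(query, k)
    scored = [(len(q & kmers(seq, k)), label) for label, seq in database]
    score, label = max(scored)
    return label if score > 0 else None

# Placeholder strings, not real database records.
TAG = "CATCACCATCACCATCACGGTTCT"                      # a bioengineering "tool"
PATHOGEN_GENE = "ATGGATTCTCGTCCTCAGAAAATCTGGATGGCG"   # subject of study
UNRELATED = "ATGAGCAAAGGCGAAGAACTGTTTACCGGC"          # some benign sequence

clean_db = [("pathogen", PATHOGEN_GENE), ("benign", UNRELATED)]

# Someone studies the pathogen gene with the tag attached and deposits the
# chimera, (appropriately) labeled by what was being studied.
contaminated_db = clean_db + [("pathogen", PATHOGEN_GENE + TAG)]

print(best_label(TAG, clean_db))          # the tag alone matches nothing
print(best_label(TAG, contaminated_db))   # now the tag alone looks like "pathogen"
```

Nothing in the lookup itself is wrong here, which is the uncomfortable part: the database entry is labeled reasonably, and the mistake only appears when the match is read as a biosecurity verdict.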

This is a serious problem, and it's not likely to get better, since NCBI and BLAST aren't doing the wrong thing: they're just getting less suitable to use as a short-cut for doing something that they were never designed to do.

So how do we fix it? Switch to tools that are actually designed for pathogen identification. We've got one (FAST-NA Scanner), and a whole bunch of other folks worked on the same problem in the FunGCAT program. The solutions are there, we just have to help folks switch to them.

Wednesday, July 13, 2022

pySBOL3: SBOL3 for Python Programmers

Our Python library for the SBOL3 standard now has an official citable publication in ACS Synthetic Biology, called "pySBOL3: SBOL3 for Python Programmers."

The article is a good short read, but for any Python programmers out there, I recommend just jumping straight in with the tutorial instead. Happy hacking, everyone!

Tuesday, July 05, 2022

Functional Synthetic Biology

Synthetic biology isn't about sequences. Don't agree? Tell me what this is without looking it up: atgcgtaaaggagaagaacttttcactggagttgtcccaattcttgttga

Tell you what, I'll give you a hint, make it easy. It's a coding sequence translating to MRKGEELFTGVVPILV. Everybody knows this one, right?
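If you'd like to check that hint yourself, here is a small self-contained sketch that translates the mystery sequence using the standard genetic code. The `translate` helper and the way the codon table is built are just my choices for illustration, and trailing bases that don't form a full codon are simply ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring any partial final codon."""
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

mystery = "atgcgtaaaggagaagaacttttcactggagttgtcccaattcttgttga"
protein = translate(mystery)
print(protein)  # MRKGEELFTGVVPILV
```

Fifty bases gives sixteen full codons plus two leftover bases, which is why the protein hint is sixteen letters long.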

How about this instead?

That's right. That mystery sequence up top is the first 50 bases of BBa_E0040, the widely used iGEM part with a coding sequence for GFPmut3. Now that one, a great many folks working in synthetic biology know, have used in their work, and maybe even have strong opinions about.

Notice that this is a description of biological function: the important thing is that the coding sequence makes a protein that emits a lot of green light when you hit it with a blue laser. There's a sequence in there somewhere but that's not what gets put on the whiteboard or what gets discussed.

Don't get me wrong, sequences are important. But right now we're living with a mis-match in synthetic biology, where most of our discussions about design are about function, but nearly all of our tooling is heavily focused on sequences (e.g., GenBank format), with any information about function tacked on as an afterthought or else confined to specialized databases that each pose their own sui generis integration problem.

We need a new focus on functional synthetic biology, and that's one of the things we've been working on in the iGEM Engineering Committee. We're trying to change how we do synthetic biology, so that we can pull together the work that lots of people have been doing on calibration, insulation, characterization, context effects, modeling, assembly, etc., in one place and make at least a small class of synthetic biology engineering really simple and predictable.

We aren't there yet, but we've gotten to the point where we think we've figured out some of the important shifts in thinking, representation, and tooling that need to happen in order to make functional synthetic biology possible. If you're interested in this too, I encourage you to read more in our newly available pre-print on Functional Synthetic Biology.

Thursday, May 05, 2022

AI for Synthetic Biology

Several of my colleagues have been organizing a series of "AI for SynBio" workshops over the last few years. I've been to some and they have been both stimulating and enjoyable. Now they have an article out in Communications of the ACM, along with a nice short video in which Aaron Adler introduces this increasingly important cross-disciplinary interaction for folks who aren't familiar with one or both of the subjects.

Friday, April 22, 2022

Talking measurement and standards with "The Living Revolution"

Yesterday I had an enjoyable conversation with Luke Roche and Sara Knurowska, who do a podcast called "The Living Revolution." They'd read some of my work on measurement, which led inevitably to a wide-ranging discussion including fundamental principles in engineering and science, when to standardize (or not), SBOL, etc.

Check out the podcast here (if it works in your browser), or on Spotify or Apple Podcasts.

Monday, February 14, 2022

"Meeting Measurement Precision" published in ACS Synthetic Biology

The pre-print that I wrote about in October, "Meeting Measurement Precision Requirements for Effective Engineering of Genetic Regulatory Networks", has just been published in ACS Synthetic Biology. Check out the official final version at: https://doi.org/10.1021/acssynbio.1c00488