Jake Beal's Next Step: February 2020

Wednesday, February 26, 2020

Looks like we found something significant in the coronavirus...

It looks like the unique sequences we found in the 2019-nCoV coronavirus were indeed significant!

In this article in last week's Science, the authors found key differences between this virus and SARS, focused most strongly on the N-terminal domain (NTD) and receptor binding domain (RBD) regions of the virus's spike glycoprotein. This is important to understand, because this protein is what the virus uses to actually infect cells, and it is also a primary target for antibodies to identify or neutralize the virus.

These regions are also right where we pointed our spotlight in our bioRxiv paper, with the surface glycoprotein region of interest that we identified! In particular, we identified the region from amino acids 9 to 275 as the largest unique sequence, and found it was part of a cluster spanning from amino acids 9 to 883. In the Science paper, the key NTD sequence goes from amino acids 17 to 305, nearly a perfect match to our largest unique sequence, and the RBD sequence goes from amino acids 330 to 521, meaning that together the two cover the majority of our identified cluster!
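For anyone who wants to check the arithmetic behind "nearly a perfect match" and "the majority of our identified cluster", here is a minimal sketch using only the residue ranges quoted above. The function and variable names are mine, chosen for illustration, and the ranges are treated as inclusive, which is an assumption about how both papers number residues.

```python
def length(start, end):
    """Number of residues in an inclusive amino acid range."""
    return end - start + 1


def overlap(a, b):
    """Number of residues shared by two inclusive ranges (0 if disjoint)."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max(0, end - start + 1)


# Ranges quoted in the post (inclusive numbering assumed).
ours = (9, 275)      # our largest unique sequence
cluster = (9, 883)   # the cluster it belongs to
ntd = (17, 305)      # NTD from the Science paper
rbd = (330, 521)     # RBD from the Science paper

# How much of our largest unique sequence falls inside the NTD?
shared = overlap(ours, ntd)
frac_ours_in_ntd = shared / length(*ours)

# How much of our cluster do the NTD and RBD cover together?
# The two domains do not overlap each other, so the counts can be summed.
cluster_covered = overlap(cluster, ntd) + overlap(cluster, rbd)
frac_cluster = cluster_covered / length(*cluster)

print(f"{shared} of {length(*ours)} residues ({frac_ours_in_ntd:.0%}) lie in the NTD")
print(f"{cluster_covered} of {length(*cluster)} cluster residues ({frac_cluster:.0%}) are in the NTD or RBD")
```

With these numbers, about 97% of our largest unique sequence sits inside the NTD, and the two domains together account for roughly 55% of the cluster.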

Now, these folks went a lot deeper than we could (not being protein modelers ourselves), and I'm sure they didn't use our research, given they were likely starting their investigation at the same time we started ours. That said, it's a nice confirmation of our methods and their potential significance to have rapidly and independently identified these regions with our FAST-NA method.

My next question for other researchers, however, is this: what about the other two domains we found?

Sunday, February 09, 2020

Congratulations to Cassandra Overney!

Congratulations to my former intern Cassandra Overney, who is a finalist for the National Center for Women & Information Technology (NCWIT) Collegiate Award!

Cassandra is an undergrad at Olin College who first began working for me at BBN in the summer of 2018, contributing to the NSF Expeditions “Living Computing Project” by improving our TASBE Flow Analytics software package for calibrated flow cytometry (which you may remember from a post last year). Flow cytometry is a method for measuring the fluorescence of large numbers of cells, often used as a “logic probe” for genetic engineering projects, and TASBE Flow Analytics allows precise and replicable interpretation of the results of complex experiments, and is being used in a number of laboratories and large-scale projects.

Cassandra's recognition by NCWIT is based on the critical contributions that she made for this project, most notably developing an Excel-based user interface that has proven to be much simpler and more intuitive for most of its biologist users. In developing this software, Cassandra worked closely with the biologists who would become her users, prototyping, testing, and adjusting in multiple rounds in order to provide a workflow that has significantly increased the adoption of TASBE Flow Analytics by bench scientists. Better, though, why not learn about it from the video that Cassandra made for her NCWIT award entry?

Although her internship is long over, Cassandra has continued to work part-time on this project, further improving the user interface she designed and addressing other issues as raised by users. Wearing my selfish principal investigator hat, I'd hire her full time if I could, but wearing my mentor hat, I expect that both she and science will be better served if she instead continues to explore her interests in different areas of potential research and goes off to graduate school. This is the bittersweet joy of a mentor: the better the student you work with, the faster they are likely to leave the nest!

So congratulations again, Cassandra!

Tuesday, February 04, 2020

Organizing genome engineering for the gigabase scale

Just out in Nature Communications, our new paper on "Organizing Genome Engineering for the Gigabase Scale"!

This perspective piece, a companion to the technology perspective last fall, analyzes the trends in the growing size of organisms getting their genomes re-engineered, and concludes that, while impressive, it's growing more slowly than one might think: big, complex organisms like mammals and plants are only likely to become tractable around 2050. Moreover, the complexity of the projects has been growing exponentially as well, as measured by the number of authors per paper.

The largest engineered genomes have grown exponentially, doubling approximately every 3 years (a), but the number of authors credited on projects has been growing exponentially as well (b).
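To get a feel for what a 3-year doubling time means, here is a small sketch of the arithmetic. Only the doubling time comes from the paper. Using 2020 as the starting year and a thousandfold target are my own choices for illustration, not data from the paper.

```python
import math

DOUBLING_TIME_YEARS = 3.0  # approximate doubling time quoted above


def growth_factor(years, doubling_time=DOUBLING_TIME_YEARS):
    """Multiplicative growth after `years` of steady exponential doubling."""
    return 2.0 ** (years / doubling_time)


def years_to_grow(factor, doubling_time=DOUBLING_TIME_YEARS):
    """Years of steady doubling needed to grow by `factor`."""
    return doubling_time * math.log2(factor)


# Illustrative assumption: count from 2020 (the date of this post) to 2050.
factor_2020_to_2050 = growth_factor(2050 - 2020)

# Illustrative assumption: how long does a thousandfold increase take?
years_for_1000x = years_to_grow(1000)

print(f"Growth over 30 years: about {factor_2020_to_2050:.0f}x")
print(f"Years for a 1000x increase: about {years_for_1000x:.1f}")
```

Thirty years at this pace is ten doublings, or roughly a thousandfold increase, which is why even exponential growth can feel slow when the target is several orders of magnitude away.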

We look at this problem and see not just a genome technology issue, but a massive organizational challenge as well: these projects are going to be big, and in order to manage them effectively we're going to need a lot of friction-reducing software tooling and automation. The bulk of the piece is then dedicated to looking at the design/build/test cycle and analyzing the sticking points and how to address them.

Bottom line: it's not going to be simple, but it looks quite tractable, and there are things that can be done right now that will likely have a significant impact on our ability to engineer ever-larger genomes.

Sunday, February 02, 2020

Unique sequences found in Wuhan coronavirus

Like many people, I have some concerns about the emerging virus in Wuhan. I am also fortunate enough to have some tools that might turn out to be helpful. For the past two years, I've been leading a project on improving pathogen screening in DNA orders by applying cybersecurity tools, and was, in fact, in the midst of writing up a paper on our improved ability to detect small virus fragments with high precision.

So it just so happens that I've got software to hand that's very good at detecting the unique aspects of a viral pathogen, and a pre-existing collection of organized coronavirus data, and it looks like we may have found something interesting---some chunks of the virus that look unlike any of its known relatives. We've written this up in a quick manuscript that's now under review and up on bioRxiv:

Highly Distinguished Amino Acid Sequences of 2019-nCoV (Wuhan Coronavirus)
Using a method for pathogen screening in DNA synthesis orders, we have identified a number of amino acid sequences that distinguish 2019-nCoV (Wuhan Coronavirus) from all other known viruses in Coronaviridae. We find three main regions of unique sequence: two in the 1ab polyprotein QHO60603.1, one in surface glycoprotein QHO60594.1.

Summary statistics of distinguishing amino acid sequences identified for 2019-nCoV (Wuhan coronavirus), organized by the identifiers of protein sequences in which we found unique content. The blue is the fraction of sequence that's judged unique and the red is the total amount: the left-most and right-most sequences look particularly interesting.
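To give a rough idea of what "fraction of sequence judged unique" can mean, here is a toy sketch based on a plain k-mer set difference. This is not our screening method, just a simple stand-in: the sequences, the window size `k`, and all function names are made up for illustration.

```python
def kmers(seq, k):
    """All length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_mask(target, relatives, k):
    """Flag each residue of `target` covered by a k-mer seen in no relative.

    Note: k-mers that straddle the edge of a novel stretch are also novel,
    so up to k - 1 flanking residues on each side get flagged as well.
    """
    known = set()
    for relative in relatives:
        known |= kmers(relative, k)
    mask = [False] * len(target)
    for i in range(len(target) - k + 1):
        if target[i:i + k] not in known:
            for j in range(i, i + k):
                mask[j] = True
    return mask


def regions(mask):
    """Convert a boolean mask into 1-based inclusive (start, end) ranges."""
    found, start = [], None
    for i, flagged in enumerate(mask):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            found.append((start + 1, i))
            start = None
    if start is not None:
        found.append((start + 1, len(mask)))
    return found


# Hypothetical toy sequences: the target has a 5-residue insertion (WHWHW)
# relative to its only "known relative".
relatives = ["MKTAYIAKQRGLSDEF"]
target = "MKTAYIAKQRWHWHWGLSDEF"

mask = unique_mask(target, relatives, k=4)
unique_regions = regions(mask)
fraction_unique = sum(mask) / len(target)

print(unique_regions, f"{fraction_unique:.0%} of residues flagged")
```

On this toy input, the single flagged region runs from residue 8 to 18, a bit wider than the 5-residue insertion itself because of the flanking effect noted in the comment. A real screening tool has to handle far larger reference collections and near-matches, which this sketch ignores.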

It's also been a fascinatingly fast project: we noticed the sequence and decided to evaluate it on Tuesday morning and got our first results that afternoon. On Wednesday, we refined and confirmed the results. Thursday, we checked with others that it might be interesting, and I wrote up the quick report. Friday was polishing and submission as a research letter to CDC Emerging Infectious Diseases and a bioRxiv preprint, and then it took 48 hours for bioRxiv to post it. At just under a week from project conception to submitted preprint with DOI, this is definitely my fastest experience with scientific publication, and it's been a strange experience.

I don't know just how important this might or might not be---I am definitely not a viral pathology specialist. And maybe the journal will just laugh at us and reject it all as naive. But I'm still happy that this is out there, no matter what, in case it may indeed be useful. More than anything else, I really hope that this gets in front of people who are, in fact, the right type of expert, so that they can evaluate it and see if they can put this information to effective use in helping diagnose, prevent, and mitigate this new disease.