Friday, October 18, 2019

Gigabase-scale genome engineering

Just out in today's Science: "Technological challenges and milestones for writing genomes." One of a pair of papers I've been working on with the GP-write consortium, both of which are asking the question: what, exactly, do we need in order to go from engineering millions of base-pairs of DNA in bacteria and yeast to the billions of base-pairs in complex organisms like mammals, plants, and people?

This paper focuses on the DNA-wrangling side of the problem, while its complement (on arXiv and under revision) focuses on the informational and coordination side of the problem. Both need to be addressed, and the complexity---while daunting---is tractable. Take a read-through and see our take on the matter!

Monday, October 14, 2019

Getting plate readers right

If you've ever used a plate reader to measure either OD or fluorescence, you'll want to check out the iGEM 2018 interlab preprint on bioRxiv!

We just submitted this manuscript, "Robust Estimation of Bacterial Cell Count from Optical Density," for review on Friday, but we think a lot of folks will want to make use of this information, and so we've gotten a preprint up early as well.  The big deal of this study is that we've now got a good calibration process for both optical density (OD) measurement, which is commonly used for estimating cell count in a sample, and fluorescence measurement, which is commonly used as a "debugging probe" for estimating cellular activity.  Both of these are usually reported in relative or arbitrary units right now, which causes lots of trouble interpreting what's even going on in your experiment, as well as greatly limiting how results can be shared and applied.

No more: we have protocols that are cheap (less than $0.10/run), easy (reliably executed by high school students just getting started in a lab), and, as this manuscript shows, both precise and accurate.  All you have to do is dilute little cell-sized silica beads and fluorescent dye, plug the measurements into a spreadsheet, and you're good to go.
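The arithmetic behind the calibration is simple enough to sketch. Here's a hypothetical illustration with invented numbers (not the actual protocol spreadsheet): fit a single conversion factor from a serial dilution of reference microspheres, then use it to turn arbitrary OD readings into absolute cell-count estimates.

```python
# Hypothetical sketch of the calibration arithmetic: convert arbitrary
# plate-reader units into absolute units using a serial dilution of a
# reference material. All numbers below are invented for illustration.

def conversion_factor(known_values, measured_values):
    """Mean ratio of known reference quantity to measured signal,
    pooled across the dilution series."""
    ratios = [k / m for k, m in zip(known_values, measured_values) if m > 0]
    return sum(ratios) / len(ratios)

# Two-fold serial dilution of silica microspheres (particles/well, invented)
beads = [3.0e8, 1.5e8, 7.5e7, 3.75e7]
# Corresponding background-subtracted OD600 readings (arbitrary units, invented)
od = [0.60, 0.30, 0.15, 0.075]

particles_per_od = conversion_factor(beads, od)

# Convert a culture's OD reading into an estimated particle (cell) count
sample_od = 0.42
estimated_cells = sample_od * particles_per_od
print(f"{particles_per_od:.3g} particles per OD unit")
print(f"~{estimated_cells:.3g} cells in the sample well")
```

The same pattern applies to the fluorescein dilution for fluorescence: one dilution series, one conversion factor, and from then on every reading comes out in absolute units.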

Serial dilution of fluorescein (from iGEM protocols page)

And here's the most important result from our paper: a nearly perfect match between per-cell fluorescence estimates from plate reader measurements and the ground truth captured from single-cell measurements in flow cytometers.
Plate reader (calibrated with microsphere dilution) vs. flow cytometry showing a 1.07-fold mean difference over 6 test devices.
In fact, this match is even better than we deserve: we know there are factors that should distort the plate reader measurements both up and down, but they're small and appear to be canceling one another out. The only device with a notable difference in measured value is the one that's got very low fluorescence---and even there it's not significant and conforms with our expectation that flow cytometers will be better able to measure extremely faint fluorescence than plate readers.

This is new science, so there's lots of caveats, of course: this has only been validated for E. coli, and probably won't work well for murky cultures with a lot of background or for biofilms or long filamentous strands.  Nevertheless, it's a big step forward, since a huge amount of what people use plate readers for is covered by this study already.  We'll see what the reviewers think, but I expect this paper is going to have a big impact because it's addressing a problem that so many people are encountering.

The next key challenge, however, is this: can we get somebody manufacturing plate readers to make calibration plates so that people don't have to prepare their reference materials themselves?

Friday, September 27, 2019

New aggregate programming survey!

Just out, a new survey entitled "From distributed coordination to field calculus and aggregate computing", which surveys aggregate programming work by my collaborators and myself. This paper expands on a conference version published last year, and gives a nice overview of how all of the different pieces of our work in this area fit together.

How the past, present, and future fit together in our view of aggregate programming.
One of the nice things about this survey was that we also were able to spend some time tracing out the roots of this work in the past, including something that I really like: a diagram of all the key different traces of past work coming together to form aggregate computing (not the one above, but something much more complicated).  We also spent half a dozen pages laying out our view on key problems to be addressed and the likely roadmap for near-term progress in the area. If you're interested in either making use of this work or getting involved in research in this area yourself, this paper is a great place to start reading!

Tuesday, September 10, 2019

Damn you, asparagine!

Deep inside big public databases, you can find quite curious things, especially when biology is involved.

For example, I spent several hours today hunting down a mysterious bug in the DNA screening project that I've been leading. We're working on improving the ability to detect when somebody orders DNA that they shouldn't be ordering (e.g., smallpox, ebola), and so it's really important to not let anything get past. So while most classification projects might be fine with getting nearly everything right, our system has to catch every single problematic sequence every time.

That means I get to drill down and try to classify every miss our system makes, and I learn some strange and interesting things while doing it. For instance, these pseudo-fascinating trivia are amongst the things that I have recently learned:

  • The same DNA sequence from the same publication is often uploaded twice and categorized differently each time.
  • Fish in fish farms get sick with a virus related to rabies.  It doesn't hurt humans, though.
  • Somebody is running automated systems to infer the organisms that DNA sequences are associated with, and that produces a lot of "unknown member of [family/order]" entries.
  • Somebody published a paper where they claimed to discover a bunch of new virus species by just sort of sequencing samples from healthy people and not actually checking in any way whether actual viruses were involved.
  • When NCBI updates its taxonomy of which organisms are related to which, the sequence records don't change to reflect the new taxonomy.

With these discoveries and a few other tweaks, I was able to categorize and plan mitigations covering all of the classes of failures that our system was encountering.  Almost.

There was one miss that I just could not explain: a short little snippet from a virus coat.  There were no related "safe" viruses that would cause us to overlook its sequence, nothing in the protein sequences and nothing that could even be mis-translated from other DNA sequences.  And I thought, "that's funny..."

I dug down and dug down and eventually found something both embarrassing and wonderful. You see, in DNA sequences, there's often parts that are unknown, and so instead of the standard "A", "C", "T", and "G" DNA bases, these bits of missing information get marked as "N" for an unknown "any" base.  These get used in ordering DNA too, to indicate places where you don't care what the sequence is. We've long been excluding these from matches, since it makes no sense to say, "Aha! Somebody once didn't know part of a virus, and you don't care what you get!"  So our detector throws out potential matches that include an unknown.

Only thing is, when you're working with proteins, the missing information letter isn't "N". There are a lot more amino acids than nucleotides, and so they use up more of the alphabet, including "N", which stands for the amino acid asparagine. With proteins, the missing information letter is "X" instead.

Most of our system knew that.  Most of our system was doing the right thing.  But one little part of one little script wasn't getting switched into protein mode at the right time.
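For concreteness, here's a hypothetical reconstruction of the bug in miniature (our real screening pipeline is more complex, and the sequence below is invented):

```python
# In DNA, "N" is the wildcard for an unknown base; in proteins, "N" is
# asparagine and the wildcard is "X". The correct wildcard per alphabet:
WILDCARD = {"dna": "N", "protein": "X"}

def has_unknowns_buggy(seq, mode):
    # The bug: always checks for the DNA wildcard, so any protein
    # containing asparagine looks like missing data and gets discarded.
    return "N" in seq

def has_unknowns_fixed(seq, mode):
    # The fix: pick the wildcard that matches the sequence's alphabet.
    return WILDCARD[mode] in seq

coat_fragment = "MKTNLLVS"  # invented protein snippet containing asparagine

assert has_unknowns_buggy(coat_fragment, "protein")      # wrongly excluded
assert not has_unknowns_fixed(coat_fragment, "protein")  # correctly kept
assert has_unknowns_fixed("ACGNNT", "dna")               # DNA "N" still caught
```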

We've been systematically excluding every protein pathogen signature with asparagine in it.

Our system: "Damn you, asparagine! Get out of my house!"

That's embarrassing.  Easy to fix, but still embarrassing.

And yet...

Asparagine is a pretty common amino acid, so we've been accidentally throwing away around one third of the detection power of our system.  And out of tens of thousands of tests, there was precisely one where this blatant and egregious error caused us to miss a detection.

The wonderful thing is that the system is still working almost perfectly, even while we've been unknowingly arbitrarily throwing away a vast amount of its ability to detect pathogens.  That speaks to its resilience, and how many alternative routes it explores to achieve its goal.  I can live with that, with a nice natural experiment accidentally conducted by a misbehaving script. We'll fix it, and move forward.

But such remarkable things you may find when you follow just one little thread of something funny in your data...

Sunday, August 11, 2019

Can we put an end to secret parenting?

Recently, one of my colleagues at BBN shared an article about "secret parenting," and the concept really struck a chord with me.  The basic idea is that people often feel that they will be judged for choosing parenting over putting in more hours at the office, and so they end up hiding these choices, making excuses, and generally having their work-life balance (or lack thereof) degraded further.

It's unfortunately easy to simply brush away one's parenting, to pretend that it's not happening, to pretend it's not important. And it's not just parenting, of course: people have all sorts of other things outside of work. Parenting, however, is something that's particularly strong and gendered in its impact in American society, at least.

In my group at BBN, I think we do pretty well on not hiding our parenting. The group mailing list is always abuzz with notifications of people saying they're going to be out or working from home for personal or family reasons: taking the kids to the doctor, dealing with child-care failure, going to see a kid's baseball game, helping out with the grand-kids, fixing an air conditioner, keeping their new dog company, etc. Also, importantly, I see it coming very much from both men and women.  I think that this visibility on the mailing list is really important, because it makes it much more comfortable to make those choices oneself, and to feel less pressure to engage in secret parenting. I definitely know that it matters for me.

With other colleagues outside of my home organization, however, I often do not feel such comfort. Whenever I make a choice that's driven by my desire to be a present and responsible parent (or other personal things, though parenting dominates in my life right now), I feel that I have to worry about things like:

  • Will this person think less of me professionally?
  • Will they worry I'm not sufficiently committed?
  • Will they feel like I'm putting them at a lower priority?

This shows up in lots of little micro choices.  Like, do I tell people I can't make it because I'm volunteering to drive for a field trip at my daughter's school, or just say that I have a conflict?  Do I say that I'm heading for the airport early because I want to see my kids in the morning, or just blame it on flight combinations to Iowa?

As I get to know somebody better, the barriers can come down, but in the world of science there are always new collaborators, new potential competitors, new program managers. I don't feel secure enough to expose myself in that way with people that I do not know well. And if I don't, as somebody who should probably be considered well established at this point in my career, how much more vulnerable are my younger colleagues, my colleagues who are female or minorities?

On this blog, on my online persona, you get to see the highlights of my life. You don't get to see my times of burnout and depression. You don't get to see me struggle with imposter syndrome. I'm still not going to post these here, in full public record, for all to see, because I do not want to make myself that vulnerable to judgement. But dear reader, I would encourage you to count the posts that are not there.

Writing posts like these is a good sign, for me, because it's showing that I'm finding time enough to sit down and reflect and find the things I want to share.  Posts show me operating at peak functionality in my life, and if I'm operating at Peak Jake, I'd probably post just about once a week.  Thus, if you don't see a post, it doesn't necessarily mean that things are bad in my life---but it means that I don't feel I have the luxury to indulge in these delightful pseudo-conversations. Not without neglecting things that are more important to me, at least, like parenting and career.

But I do think that in my professional interactions, I'm going to try to shift my boundaries a bit more, indulge my trust a bit more freely in my outside-of-BBN colleagues. I don't like hiding my life from my work, parenting or otherwise, and since I am indeed in a somewhat secure and privileged position in my career, I think that one of my responsibilities is to help to shape my professional environment to be more of the sort in which I would like to live and work.

And with that, dear reader, let me sign off by informing you that this post appears in the midst of a two week vacation. My older daughter is between school and camp, and I've decided that I should spend that time with her, prioritizing parenting over work for at least a little while. I just hope that I don't pay too much for this choice in the state of my email and my projects at the time when I return.

Sunday, August 04, 2019

Two Maxims of Project Management

I hold these two maxims of project management to be unwaveringly true:

  1. If it's not in the repository, it doesn't exist.
  2. If it's not running under continuous integration, it's broken.

These two maxims come to me through long and painful experience, which I'd like to pass on to you, in hopes that your learning process will be less long and less painful.

If it's not in the repository, it doesn't exist

The first maxim, "if it's not in the repository, it doesn't exist," is something that I first learned in writing LARPs but is just as true in scientific projects or any other form of collaboration.  For any project I am working with people on, I always, always set up some sort of shared storage repository, whether it be Dropbox, Google Drive, git, Subversion, etc. If something matters, it needs to be in that repository, because if it isn't, there are oh-so-many ways for it to get accidentally deleted.

More importantly, however, anything in the repository can be seen by other people on the team, which means there's some accountability for its content. I can't count the number of times that somebody has said they're working on something, but it's just not checked in yet, and then it turns out that they weren't working on it at all, or they were working on it but it was terrible and wrong. Some of the worst experiences of my professional life, like nearly-quit-your-job level of painful, have involved somebody I was counting on failing me in this way. If somebody's reluctant to put their work in the team repository, well, that's a pretty good hint that they are embarrassed by it in some way, and thus that their work might as well not exist.

Share your work with your team. Even if it's "messy" and "not ready," insulate yourself from disaster and give people evidence that you are on the right track---or a chance to help you and correct you if you aren't.

If it's not running under continuous integration, it's broken

The second maxim, "If it's not running under continuous integration, it's broken," appears on the surface to be more specific to software. Continuous integration is a type of software testing infrastructure, where on a regular basis a copy of your system gets checked out of the repository (see Maxim #1), and a batch of tests are run to see if it's still working or not. Typically, continuous integration gets run both every time something changes in the repository and also nightly (because something might have changed the external systems it depends on).

This makes a lot of sense to do for software, because software is complicated. When you improve one thing, it's easy to accidentally break another as a side effect. Building tests as you go is a way to make sure that you don't accidentally break anything (at least not anything you're testing for). If you don't test, it's a good bet that you will break things and not know it. Likewise, the environment is always changing too, as other people improve their software and hardware, so code tends to "rot" if left untouched and untested over time. So if you don't test, you won't know when it breaks, and if you don't automate the testing, you won't remember to run the tests, and then everything will break and it will be a pain.
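As a sketch of what this looks like in practice, here's the kind of tiny regression test a CI job might run on every commit and every night. The function and values are invented for illustration:

```python
# A minimal sketch of a regression test that continuous integration would
# run automatically. If a later "improvement" breaks this function, the
# test fails immediately instead of the breakage hiding for months.

def dilution_series(start, factor, n):
    """Concentrations from an n-step serial dilution."""
    return [start / factor ** i for i in range(n)]

def test_dilution_series():
    series = dilution_series(100.0, 2.0, 4)
    assert series == [100.0, 50.0, 25.0, 12.5]             # known-good answer
    assert all(a > b for a, b in zip(series, series[1:]))  # sanity: decreasing

test_dilution_series()  # CI runs this on every change, and nightly too
print("all checks passed")
```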

Surprisingly, I find that this applies not just to software, but to pretty much anything where there's a chance to make a mistake and a chance to check your work. Whenever I analyze data, for example, I always make sure that I automate the calculation so that I can easily re-run the analysis from scratch---and then I add "idiot checks" that give me numbers and graphs that I can look at to make sure that the analysis is actually working properly. Things often go wrong, even in routine experiments and analyses, and if I put these tests in, then I can notice when things go wrong and re-run the analysis to make it right.  I fear that I annoy my collaborators with these checks, sometimes, because they find embarrassing problems, but I'd much rather have a little bit of friction than a retraction due to easily avoidable mistakes in our interpretation of our experiments.

Even my personal finances use tests. In my spreadsheets, I always include check-sums that add things up two different ways so that I can make sure that they match. Otherwise I'm going to make some little cut-and-paste error or typo and then have some sort of unpleasant surprise when I figure out I've got ten thousand dollars less than I thought I did or something like that.
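The spreadsheet trick translates directly into code; here's a minimal sketch with invented numbers. The check-sum is just the grand total computed two independent ways, which must agree:

```python
# "Add things up two different ways" as a check-sum, on an invented
# table of monthly amounts per spending category.

table = {
    "rent":      [1200, 1200, 1200],
    "groceries": [ 430,  512,  467],
    "transit":   [  90,   85,  102],
}

row_totals = {k: sum(v) for k, v in table.items()}        # total per category
month_totals = [sum(vals) for vals in zip(*table.values())]  # total per month

# Check-sum: the grand total via rows must equal the grand total via columns.
# A cut-and-paste error or typo in one cell breaks the match and gets caught.
assert sum(row_totals.values()) == sum(month_totals)
```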

Check your work, and check it more than one way, and add a little bit of automation so that the checks run even when you don't think about them.  It takes a bit of extra time and thought, and it's easy to neglect it because it's hard to measure disasters that don't happen.  I promise you, though, investing in testing is worth it for the bigger mistakes that you'll avoid making and the crises that you'll avoid creating.

Wednesday, July 31, 2019

To go fast, you have to share

The paper I'm going to tell you about today is a nice advance for aggregate computing, but also one that I feel quite ambivalent about. On the one hand, the new "share" operator that it introduces is a nice piece of work that doubles the speed of pretty much every aggregate program that we've implemented. On the other hand, it's fixing a problem that I find embarrassing and wish we'd found a way to deal with long ago.

The origin of this problem goes back a long, long way, all the way to my first serious publication on engineering distributed systems, Infrastructure for Engineered Emergence on Sensor/Actuator Networks, published more than a decade ago in 2006 back in the middle of grad school. This is the publication that started my whole line of work with Proto, spatial computing, and aggregate computing. Unfortunately, it contains a subtle but important flaw: we separated memory and communication.

In principle, this makes a lot of sense, and it's the way that nearly every networking system has been constructed: sending information to your neighbors is, after all, a different sort of thing than remembering a piece of state for yourself. But this choice ended up injecting a subtle delay: when a program receives shared information with "nbr", it has to remember the value with "rep" before it can share the information onward in its next "nbr" execution. Every step of propagating a calculation thus gets an extra round of delay, though it never really mattered much when we were operating more in theory and simulation and assuming fast iterations.

Handling sharing ("nbr") and memory ("rep") separately injects an extra round of delay while information "loops around" to where it can be shared.  Combining them into the single "share" operator eliminates that delay.

Now that we're doing more deployments on real hardware, however, there are often good reasons to keep executions slower in order to save network capacity. And that, finally, has motivated us to fix the delay by combining the "nbr" and "rep" operations into a single unified "share" operation that sends the value stored to one's neighbors.

Theoretically, it's elegant, since this one operation can actually implement both of the previous separate functionalities. Pragmatically, it's a lifesaver, since pretty much every program we run just started converging at least twice as fast, if not faster.  I also wonder how many other distributed algorithms built by other people have this subtle flaw hiding inside of them---though most algorithms probably won't just because they're implemented so much more at the "assembly language" level in how they handle interactions, and the humans implementing them will likely have spotted the optimization opportunity and taken it.
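To see where the factor of two comes from, here's a toy synchronous simulation of hop-count distance estimation along a line of devices (a deliberate simplification of the real field calculus semantics, with invented details). With separate "rep" and "nbr", the value a device sends in a round is the one it computed the round before; with "share", the freshly computed value goes out immediately:

```python
# Toy model: devices in a line estimate hop distance from device 0.
# Each round, every device takes the minimum of the values its neighbors
# most recently broadcast, plus one.
INF = float("inf")

def rounds_to_converge(n, fused_share, max_rounds=200):
    computed = [0.0] + [INF] * (n - 1)   # device 0 is the distance source
    broadcast = [INF] * n                # nothing has been sent yet
    target = [float(i) for i in range(n)]
    for t in range(1, max_rounds + 1):
        received = broadcast             # messages from the previous round
        new = [0.0] + [
            min(received[j] for j in (i - 1, i + 1) if 0 <= j < n) + 1
            for i in range(1, n)
        ]
        # "share" sends this round's fresh value; rep+nbr can only send
        # the value remembered from the previous round.
        broadcast = list(new) if fused_share else list(computed)
        computed = new
        if computed == target:
            return t
    return None
```

In this toy model, on a line of 10 devices the split rep+nbr design converges in 18 rounds while the fused share design takes 10, close to the doubling we observed in practice.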

So, all's well that ends well, I guess. I just wish I'd thought of this a decade back.

Sunday, July 28, 2019

Communicating Structure and Function in Synthetic Biology Diagrams

Our new paper, "Communicating Structure and Function in Synthetic Biology Diagrams," has just been accepted in ACS Synthetic Biology, and is up online in their collection of just-accepted-manuscripts. This article provides a nice summary and introduction of how to draw genetic systems diagrams that are unambiguous and easy to understand.

Example diagram illustrating SBOL Visual, highlighting all the types of glyphs that are used in diagrams.
If you can't get at this article behind the ACS paywall, you can also find a nice slide show introducing SBOL Visual on our website or get the material in long form in the full SBOL Visual 2.1 specification.  If you can get at the paper, though, I definitely recommend it, as it has some nice examples showing how this can be used not just for circuits but for pretty much any synthetic biology project, including metabolic engineering and large-scale genome editing and insertions.  It's also got some newly accepted material (e.g., proteins no longer look like yeast "shmoos") that's available online but won't be bundled into a specification release until 2.2 (which is likely 6-12 months away).

I hope you'll find SBOL Visual useful and adopt these methods for all of your illustrations and tools. Now that we've got a good emerging graphical standard that's easy to use in most illustration tools, I see no reason for anyone to avoid embracing it.  And if you run into obstacles or have suggestions for how to improve the standard---get in touch! It's an open community process, and we've had lots of good stuff come in from people joining the community over time!

Sunday, July 21, 2019

Backyard Naturalists

When my older daughter decided she wanted "science" as the theme for her seventh birthday party, we all brainstormed up "experiment" contests for the kids to do. The kids had a blast doing things like geometry (constructions with gumdrops and toothpicks) and chemistry (who can get the biggest Mentos and soda explosion?), but my favorite was our biology experiment. 

We called it "backyard naturalists", and I'd gotten a little USB "microscope" (really a macro camera) so that the kids could go find interesting things in the garden and then look at their findings blown up huge on my computer screen.

There are wonderful things hiding in your yard, and even very little kids can become entranced with the intricacy and beauty of them, and all the new questions that can be revealed when you look at something up close and carefully.  Here are some of the best of the things that all the backyard naturalists brought in and crowded around the screen to see.  No animals were harmed in the making of these images: all were released safely into the backyard when their photo sessions were complete. Enjoy!

Petals on a clover flower
A bird's discarded feather
A small moth, gently contained within a plastic cup.

The interior of a flower, including little white shrimp-like mites (one is particularly visible in the lower center)

Japanese beetle, exploring possible food sources offered to it.

Interior of a flower looking beautifully spiky and crystalline.

Wolf spider, found scuttling along and hunting.

Friday, July 19, 2019

Patrick Winston and the Power of Imperfection

I learned today from colleagues that one of my mentors had passed away. Patrick Winston was one of my thesis advisors in graduate school, but also one of the people who first inspired me to seek to go there (along with Gerry Sussman). He was also my boss as a TA, a colleague and collaborator, and also someone that I think I may have disappointed with my choices.

At first, I didn't quite engage, but a colleague and friend asked me how I was doing, and I found myself writing back much more than I had thought to, as I thought more and more about how big an influence Patrick was on my whole life, as well as my career. I learned so much from him, both his good points and his flaws. And more than anything else, I think I learned about the courage to let go of perfectionism and worry less about being right than just being better than I was before.

Patrick, in my experience, was not particularly a natural leader or a charismatic speaker or a gifted teacher. And yet he excelled at all three, at leadership and speaking and teaching, not by dint of some amazing gift but by the fact that he deeply cared to do all of these things well. To do this, Patrick collected heuristics from observing people he admired, learning not their brilliance but their ways of avoiding disaster. And amazingly, it turns out that you can go a lot farther and get a lot more done just by carefully using a little checklist of heuristics to avoid pitfalls than you can get done by being a brilliant egotist with a fatal, unacknowledged flaw.  Patrick was one of the most humble people that I knew, with a quiet way of speaking and a careful attention to the big picture and just plain being effective and consistent at whatever he judged to be the most important things to do.

Patrick was generous with his knowledge too. Above all else, I knew Patrick as a teacher, a teacher in many different ways, who would smuggle extra lessons into all his actions, just because he thought his listeners might appreciate knowing these things too. I learned from him for years, and I'm still using many of the most important lessons that he taught me.

Just last week, I was talking about Patrick to a younger colleague, introducing yet another person to his wonderful heuristics on how to give a talk. I'm thinking about that as I write this, and how grateful I've been for all those lessons, and the trust and help he gave me when I was his student, even when I was on my way to make another set of interesting mistakes. But I'm also thinking about how I took his presence for granted and hadn't stopped by to visit him at MIT for years, which makes me sad.

He was not a Great Man, in the sense that I no longer believe in Greatness (partly due to the lessons I learned from him). But he was a person who achieved greatness in many different ways and, I believe, above all else in the ways he invested in teaching so many of us in so many different ways.

I am deeply grateful for the gifts that I received from him, and will continue to do my best to pass them on.