Friday, October 18, 2019

Gigabase-scale genome engineering

Just out in today's Science: "Technological challenges and milestones for writing genomes." One of a pair of papers I've been working on with the GP-write consortium, both of which are asking the question: what, exactly, do we need in order to go from engineering millions of base-pairs of DNA in bacteria and yeast to the billions of base-pairs in complex organisms like mammals, plants, and people?

This paper focuses on the DNA-wrangling side of the problem, while its complement (on arXiv and under revision) focuses on the informational and coordination side of the problem. Both need to be addressed, and the complexity---while daunting---is tractable. Take a read-through and see our take on the matter!

Monday, October 14, 2019

Getting plate readers right

If you've ever used a plate reader to measure either OD or fluorescence, you'll want to check out the iGEM 2018 interlab preprint on bioRxiv!

We just submitted this manuscript, "Robust Estimation of Bacterial Cell Count from Optical Density," for review on Friday, but we think a lot of folks will want to make use of this information, and so we've gotten a preprint up early as well.  The big deal of this study is that we've now got a good calibration process for both optical density (OD) measurement, which is commonly used for estimating cell count in a sample, and fluorescence measurement, which is commonly used as a "debugging probe" for estimating cellular activity.  Both of these are usually reported in relative or arbitrary units right now, which causes lots of trouble interpreting what's even going on in your experiment, as well as greatly limiting how results can be shared and applied.

No more: we have protocols that are cheap (less than $0.10/run), easy (reliably executed by high school students just getting started in a lab), and, as this manuscript shows, both precise and accurate.  All you have to do is dilute little cell-sized silica beads and fluorescent dye, plug the measurements into a spreadsheet, and you're good to go.

Serial dilution of fluorescein (from iGEM protocols page)
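
If you want a feel for the arithmetic, here's a toy version in Python with invented numbers---just a sketch of the idea, not the actual protocol or the analysis spreadsheet from the paper: fit a conversion factor from each dilution series, then use the two factors to turn a sample's readings into calibrated per-cell units.

```python
# Toy sketch of the calibration arithmetic; all numbers below are made up.
import numpy as np

def conversion_factor(measured, known):
    """Estimate calibrated-units-per-arbitrary-unit from a dilution series.

    measured: background-subtracted instrument readings (arbitrary units)
    known: corresponding known quantities (molecules of fluorescein, or
           number of silica microspheres) in each well
    """
    ratios = np.asarray(known, dtype=float) / np.asarray(measured, dtype=float)
    return np.median(ratios)  # median is robust to saturated or noisy wells

# Hypothetical dilution-series readings:
fluorescein_au = [52000, 26500, 13100, 6600]             # plate reader, arbitrary units
fluorescein_molecules = [6.0e13, 3.0e13, 1.5e13, 7.5e12]
beads_abs600 = [0.40, 0.20, 0.10, 0.05]                  # Abs600 of microsphere dilutions
beads_count = [2.4e8, 1.2e8, 6.0e7, 3.0e7]

mefl_per_au = conversion_factor(fluorescein_au, fluorescein_molecules)
particles_per_abs = conversion_factor(beads_abs600, beads_count)

# Apply both factors to a background-subtracted experimental sample:
sample_fluor_au, sample_abs600 = 8400.0, 0.12
mefl_per_cell = (sample_fluor_au * mefl_per_au) / (sample_abs600 * particles_per_abs)
print(f"~{mefl_per_cell:.2e} calibrated fluorescence units per cell")
```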

And here's the most important result from our paper: a nearly perfect match between per-cell fluorescence estimates from plate reader measurements and the ground truth captured from single-cell measurements in flow cytometers.
Plate reader (calibrated with microsphere dilution) vs. flow cytometry showing a 1.07-fold mean difference over 6 test devices.
In fact, this match is even better than we deserve: we know there are factors that should distort the plate reader measurements both up and down, but they're small and appear to be canceling one another out. The only device with a notable difference in measured value is the one that's got very low fluorescence---and even there it's not significant and conforms with our expectation that flow cytometers will be better able to measure extremely faint fluorescence than plate readers.

This is new science, so there's lots of caveats, of course: this has only been validated for E. coli, and probably won't work well for murky cultures with a lot of background or for biofilms or long filamentous strands.  Nevertheless, it's a big step forward, since a huge amount of what people use plate readers for is covered by this study already.  We'll see what the reviewers think, but I expect this paper is going to have a big impact because it's addressing a problem that so many people are encountering.

The next key challenge, however, is this: can we get somebody manufacturing plate readers to make calibration plates so that people don't have to prepare their reference materials themselves?

Friday, September 27, 2019

New aggregate programming survey!

Just out, a new survey entitled "From distributed coordination to field calculus and aggregate computing", which surveys aggregate programming work by my collaborators and myself. This paper expands on a conference version published last year, and gives a nice overview of how all of the different pieces of our work in this area fit together.

How the past, present, and future fit together in our view of aggregate programming.
One of the nice things about this survey was that we were also able to spend some time tracing out the roots of this work in the past, including something that I really like: a diagram of all the key different traces of past work coming together to form aggregate computing (not the one above, but something much more complicated).  We also spent half a dozen pages laying out our view on key problems to be addressed and the likely roadmap for near-term progress in the area. If you're interested in either making use of this work or getting involved in research in this area yourself, this paper is a great place to start reading!

Tuesday, September 10, 2019

Damn you, asparagine!

Deep inside big public databases, you can find quite curious things, especially when biology is involved.

For example, I spent several hours today hunting down a mysterious bug in the DNA screening project that I've been leading. We're working on improving the ability to detect when somebody orders DNA that they shouldn't be ordering (e.g., smallpox, ebola), and so it's really important to not let anything get past. So while most classification projects might be fine with getting nearly everything right, our system has to catch every single problematic sequence every time.

That means I get to drill down and try to classify every miss our system makes, and I learn some strange and interesting things while doing it. For instance, these pseudo-fascinating trivia are amongst the things that I have recently learned:

  • The same DNA sequence from the same publication is often uploaded twice and categorized differently each time.
  • Fish in fish farms get sick with a virus related to rabies.  It doesn't hurt humans, though.
  • Somebody is running automated systems to infer the organisms that DNA sequences are associated with, and that produces a lot of "unknown member of [family/order]" entries.
  • Somebody published a paper where they claimed to discover a bunch of new virus species by just sort of sequencing samples from healthy people and not actually checking in any way whether actual viruses were involved.
  • When NCBI updates its taxonomy of which organisms are related to which, the sequence records don't change to reflect their new classification.

With these discoveries and a few other tweaks, I was able to categorize and plan mitigations covering all of the classes of failures that our system was encountering.  Almost.

There was just one miss that I could not explain, a short little snippet from a virus coat protein.  There were no related "safe" viruses that would cause us to overlook its sequence, nothing in the protein sequences, and nothing that could even be mis-translated from other DNA sequences.  And I thought, "that's funny..."

I dug down and dug down and eventually found something both embarrassing and wonderful. You see, in DNA sequences, there are often parts that are unknown, and so instead of the standard "A", "C", "T", and "G" DNA bases, these bits of missing information get marked as "N" for an unknown "any" base.  These get used in ordering DNA too, to indicate places where you don't care what the sequence is. We've long been excluding these from matches, since it makes no sense to say, "Aha! Somebody once didn't know part of a virus, and you don't care what you get!"  So our detector throws out potential matches that include an unknown.

Only thing is, when you're working with proteins, the missing information letter isn't "N". There are a lot more amino acids than DNA bases, and so they use up more of the alphabet, including "N", which stands for the amino acid asparagine. With proteins, the missing information letter is "X" instead.

Most of our system knew that.  Most of our system was doing the right thing.  But one little part of one little script wasn't getting switched into protein mode at the right time.

We've been systematically excluding every protein pathogen signature with asparagine in it.

Our system: "Damn you, asparagine! Get out of my house!"
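
In code terms, the slip looked something like this (an illustrative Python sketch, not our actual screening pipeline):

```python
# Illustrative sketch of the bug; hypothetical code, not the real system.

DNA_WILDCARDS = {"N"}      # in DNA, "N" means "any base"
PROTEIN_WILDCARDS = {"X"}  # in proteins, "X" means "any residue"; "N" is asparagine

def has_wildcard(sequence, is_protein):
    wildcards = PROTEIN_WILDCARDS if is_protein else DNA_WILDCARDS
    return any(letter in wildcards for letter in sequence.upper())

def keep_signature_buggy(sequence):
    # The bug: never switching into protein mode, so any protein signature
    # containing asparagine ("N") gets silently discarded.
    return not has_wildcard(sequence, is_protein=False)

def keep_signature_fixed(sequence, is_protein):
    return not has_wildcard(sequence, is_protein)

coat_protein = "MKLNDRVQA"  # hypothetical signature containing one asparagine
print(keep_signature_buggy(coat_protein))         # False: signature thrown away
print(keep_signature_fixed(coat_protein, True))   # True: signature kept
```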

That's embarrassing.  Easy to fix, but still embarrassing.

And yet...

Asparagine is a pretty common amino acid, so we've been accidentally throwing away around one third of the detection power of our system.  And out of tens of thousands of tests, there was precisely one where this blatant and egregious error caused us to miss a detection.

The wonderful thing is that the system is still working almost perfectly, even while we've been unknowingly arbitrarily throwing away a vast amount of its ability to detect pathogens.  That speaks to its resilience, and how many alternative routes it explores to achieve its goal.  I can live with that, with a nice natural experiment accidentally conducted by a misbehaving script. We'll fix it, and move forward.

But such remarkable things you may find when you follow just one little thread of something funny in your data...

Sunday, August 11, 2019

Can we put an end to secret parenting?

Recently, one of my colleagues at BBN shared an article about "secret parenting," and the concept really struck a chord with me.  The basic idea is that people often feel that they will be judged for choosing parenting over putting in more hours at the office, and so they end up hiding these choices, making excuses, and generally having their work-life balance (or lack thereof) degraded further.

It's unfortunately easy to simply brush away one's parenting, to pretend that it's not happening, to pretend it's not important. And it's not just parenting, of course: people have all sorts of other things outside of work. Parenting, however, is something that's particularly strong and gendered in its impact in American society, at least.

In my group at BBN, I think we do pretty well on not hiding our parenting. The group mailing list is always abuzz with notifications of people saying they're going to be out or working from home for personal or family reasons: taking the kids to the doctor, dealing with child-care failure, going to see a kid's baseball game, helping out with the grand-kids, fixing an air conditioner, keeping their new dog company, etc. Also, importantly, I see it coming very much from both men and women.  I think that this visibility on the mailing list is really important, because it makes it much more comfortable to make those choices oneself, and to feel less pressure to engage in secret parenting. I definitely know that it matters for me.

With other colleagues outside of my home organization, however, I often do not feel such comfort. Whenever I make a choice that's driven by my desire to be a present and responsible parent (or other personal things, though parenting dominates in my life right now), I feel that I have to worry about things like:

  • Will this person think less of me professionally?
  • Will they worry I'm not sufficiently committed?
  • Will they feel like I'm putting them at a lower priority?

This shows up in lots of little micro choices.  Like, do I tell people I can't make it because I'm volunteering to drive for a field-trip at my daughter's school, or just say that I have a conflict?  Do I say that I'm heading for the airport early because I want to see my kids in the morning, or just blame it on flight combinations to Iowa?

As I get to know somebody better, the barriers can come down, but in the world of science there are always new collaborators, new potential competitors, new program managers. I don't feel secure enough to expose myself in that way with people that I do not know well. And if I don't, as somebody who should probably be considered well established at this point in my career, how much more vulnerable my younger colleagues, my colleagues who are female or minorities?

On this blog, on my online persona, you get to see the highlights of my life. You don't get to see my times of burnout and depression. You don't get to see me struggle with imposter syndrome. I'm still not going to post these here, in full public record, for all to see, because I do not want to make myself that vulnerable to judgement. But dear reader, I would encourage you to count the posts that are not there.

Writing posts like these is a good sign, for me, because it's showing that I'm finding time enough to sit down and reflect and find the things I want to share.  Posts show me operating at peak functionality in my life, and if I'm operating at Peak Jake, I'd probably post just about once a week.  Thus, if you don't see a post, it doesn't necessarily mean that things are bad in my life---but it means that I don't feel I have the luxury to indulge in these delightful pseudo-conversations. Not without neglecting things that are more important to me, at least, like parenting and career.

But I do think that in my professional interactions, I'm going to try to shift my boundaries a bit more and extend my trust a bit more freely to my outside-of-BBN colleagues. I don't like hiding my life from my work, parenting or otherwise, and since I am indeed in a somewhat secure and privileged position in my career, I think that one of my responsibilities is to help to shape my professional environment to be more of the sort in which I would like to live and work.

And with that, dear reader, let me sign off by informing you that this post appears in the midst of a two week vacation. My older daughter is between school and camp, and I've decided that I should spend that time with her, prioritizing parenting over work for at least a little while. I just hope that I don't pay too much for this choice in the state of my email and my projects at the time when I return.


Sunday, August 04, 2019

Two Maxims of Project Management

I hold these two maxims of project management to be unwaveringly true:

  1. If it's not in the repository, it doesn't exist.
  2. If it's not running under continuous integration, it's broken.


These two maxims come to me through long and painful experience, which I'd like to pass on to you, in hopes that your learning process will be less long and less painful.

If it's not in the repository, it doesn't exist


The first maxim, "if it's not in the repository, it doesn't exist," is something that I first learned in writing LARPs but is just as true in scientific projects or any other form of collaboration.  For any project I am working with people on, I always, always set up some sort of shared storage repository, whether it be DropBox, Google Drive, git, subversion, etc. If something matters, it needs to be in that repository, because if it isn't, there are oh-so-many ways for it to get accidentally deleted.

More importantly, however, anything in the repository can be seen by other people on the team, which means there's some accountability for its content. I can't count the number of times that somebody has said they're working on something, but it's just not checked in yet, and then it turns out that they weren't working on it at all, or they were working on it but it was terrible and wrong. Some of the worst experiences of my professional life, like nearly-quit-your-job level of painful, have involved somebody I was counting on failing me in this way. If somebody's reluctant to put their work in the team repository, well, that's a pretty good hint that they are embarrassed by it in some way, and thus that their work might as well not exist.

Share your work with your team. Even if it's "messy" and "not ready," insulate yourself from disaster and give people evidence that you are on the right track---or a chance to help you and correct you if you aren't.

If it's not running under continuous integration, it's broken


The second maxim, "If it's not running under continuous integration, it's broken," appears on the surface to be more specific to software. Continuous integration is a type of software testing infrastructure, where on a regular basis a copy of your system gets checked out of the repository (see Maxim #1), and a batch of tests are run to see if it's still working or not. Typically, continuous integration gets run both every time something changes in the repository and also nightly (because something might have changed the external systems it depends on).

This makes a lot of sense to do for software, because software is complicated. When you improve one thing, it's easy to accidentally break another as a side effect. Building tests as you go is a way to make sure that you don't accidentally break anything (at least not anything you're testing for). If you don't test, it's a good bet that you will break things and not know it. Likewise, the environment is always changing too, as other people improve their software and hardware, so code tends to "rot" if left untouched and untested over time. So if you don't test, you won't know when it breaks, and if you don't automate the testing, you won't remember to run the tests, and then everything will break and it will be a pain.

Surprisingly, I find that this applies not just to software, but to pretty much anything where there's a chance to make a mistake and a chance to check your work. Whenever I analyze data, for example, I always make sure that I automate the calculation so that I can easily re-run the analysis from scratch---and then I add "idiot checks" that give me numbers and graphs that I can look at to make sure that the analysis is actually working properly. Things often go wrong, even in routine experiments and analyses, and if I put these tests in, then I can notice when things go wrong and re-run the analysis to make it right.  I fear that I annoy my collaborators with these checks, sometimes, because they find embarrassing problems, but I'd much rather have a little bit of friction than a retraction due to easily avoidable mistakes in our interpretation of our experiments.
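
To make that concrete, here's the flavor of what I mean, sketched in Python around a made-up toy analysis rather than any real pipeline of ours: the "idiot checks" are just invariants that must hold if nothing upstream has silently broken.

```python
def analyze(raw_events):
    """Toy stand-in for a real analysis: count events above a threshold."""
    total = len(raw_events)
    positive = sum(1 for e in raw_events if e > 1000)
    return {"total": total, "positive": positive, "fraction": positive / total}

def idiot_checks(raw_events, result):
    # Nothing should be lost or double-counted between raw data and summary.
    assert result["total"] == len(raw_events), "event count changed during analysis"
    assert result["positive"] <= result["total"], "more positives than events"
    assert 0.0 <= result["fraction"] <= 1.0, "fraction out of range"

raw = [120, 2400, 980, 15000, 430]   # made-up measurements
summary = analyze(raw)
idiot_checks(raw, summary)           # blows up loudly if the analysis broke
print(summary)
```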

Even my personal finances use tests. In my spreadsheets, I always include check-sums that add things up two different ways so that I can make sure that they match. Otherwise I'm going to make some little cut-and-paste error or typo and then have some sort of unpleasant surprise when I figure out I've got ten thousand dollars less than I thought I did or something like that.
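
The same habit, sketched in code with invented numbers: total the table two different ways and insist that the answers agree.

```python
# Hypothetical monthly budget: rows are months, columns are categories.
table = [
    [1500, 400, 120],   # January: rent, food, utilities
    [1500, 450, 135],   # February
]
total_by_rows = sum(sum(row) for row in table)
total_by_columns = sum(sum(col) for col in zip(*table))
assert total_by_rows == total_by_columns, "checksum mismatch: some cell got mangled"
```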

Check your work, and check it more than one way, and add a little bit of automation so that the checks run even when you don't think about them.  It takes a bit of extra time and thought, and it's easy to neglect it because it's hard to measure disasters that don't happen.  I promise you, though, investing in testing is worth it for the bigger mistakes that you'll avoid making and the crises that you'll avoid creating.

Wednesday, July 31, 2019

To go fast, you have to share

The paper I'm going to tell you about today is a nice advance for aggregate computing, but also one that I feel quite ambivalent about. On the one hand, the new "share" operator that it introduces is a nice piece of work that doubles the speed of pretty much every aggregate program that we've implemented. On the other hand, it's fixing a problem that I find embarrassing and wish we'd found a way to deal with long ago.

The origin of this problem goes back a long, long way, all the way to my first serious publication on engineering distributed systems, Infrastructure for Engineered Emergence on Sensor/Actuator Networks, published more than a decade ago in 2006 back in the middle of grad school. This is the publication that started my whole line of work with Proto, spatial computing, and aggregate computing. Unfortunately, it contains a subtle but important flaw: we separated memory and communication.

In principle, this makes a lot of sense, and it's the way that nearly every networking system has been constructed: sending information to your neighbors is, after all, a different sort of thing than remembering a piece of state for yourself. But this choice ended up injecting a subtle delay: when a program receives shared information with "nbr", it has to remember the value with "rep" before it can share the information onward in its next "nbr" execution. Every step of propagating a calculation thus gets an extra round of delay, though it never really mattered much when we were operating more in theory and simulation and assuming fast iterations.

Handling sharing ("nbr") and memory ("rep") separately injects an extra round of delay while information "loops around" to where it can be shared.  Combining them into the single "share" operator eliminates that delay.
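
If you'd like to see the effect without wading into field calculus syntax, here's a little plain-Python toy (my own illustration, not real field calculus code): it runs a hop-count gradient down a hypothetical line of eight devices in synchronous rounds, once broadcasting last round's stored value (the old "rep"+"nbr" pattern) and once broadcasting the freshly computed value (the new "share" pattern).

```python
# Toy synchronous simulation of gradient propagation on a line of devices.
INF = float("inf")
N = 8  # hypothetical number of devices; device 0 is the source

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < N]

def rounds_to_converge(share_fresh_value):
    state = [0.0 if i == 0 else INF for i in range(N)]  # stored ("rep") value
    sent = [INF] * N                                    # what each device last broadcast
    target = list(range(N))                             # true hop counts
    for rnd in range(1, 10 * N):
        # Each device computes its new value from neighbors' last broadcasts.
        new_state = [0.0 if i == 0 else
                     min((sent[j] for j in neighbors(i)), default=INF) + 1
                     for i in range(N)]
        if share_fresh_value:
            sent = list(new_state)  # "share": broadcast the value computed this round
        else:
            sent = list(state)      # "rep"+"nbr": broadcast last round's stored value
        state = new_state
        if state == target:
            return rnd
    return None

print("rep + nbr :", rounds_to_converge(False), "rounds")  # 14 on this toy line
print("share     :", rounds_to_converge(True), "rounds")   # 8 on this toy line
```

On this toy line, the stored-value version needs nearly twice as many rounds to settle as the fresh-value version---the same delay that the figure above describes.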

Now that we're doing more deployments on real hardware, however, there's often good reasons to keep executions slower in order to save network capacity. And that, finally, has motivated us to fix the delay by combining the "nbr" and "rep" operations into a single unified "share" operation that sends the value stored to one's neighbors.

Theoretically, it's elegant, since this one operation can actually implement both of the previous separate functionalities. Pragmatically, it's a lifesaver, since pretty much every program we run just started converging at least twice as fast, if not faster.  I also wonder how many other distributed algorithms built by other people have this subtle flaw hiding inside of them---though most algorithms probably won't just because they're implemented so much more at the "assembly language" level in how they handle interactions, and the humans implementing them will likely have spotted the optimization opportunity and taken it.

So, all's well that ends well, I guess. I just wish I'd thought of this a decade back.

Sunday, July 28, 2019

Communicating Structure and Function in Synthetic Biology Diagrams

Our new paper, "Communicating Structure and Function in Synthetic Biology Diagrams," has just been accepted in ACS Synthetic Biology, and is up online in their collection of just-accepted manuscripts. This article provides a nice summary and introduction to how to draw genetic systems diagrams that are unambiguous and easy to understand.

Example diagram illustrating SBOL Visual, highlighting all the types of glyphs that are used in diagrams.
If you can't get at this article behind the ACS paywall, you can also find a nice slide show introducing SBOL Visual on our website or get the material in long form in the full SBOL Visual 2.1 specification.  If you can get at the paper, though, I definitely recommend it, as it has some nice examples showing how this can be used not just for circuits but for pretty much any synthetic biology project, including metabolic engineering and large-scale genome editing and insertions.  It's also got some newly accepted material (e.g., proteins no longer look like yeast "shmoos") that's available online but won't be bundled into a specification release until 2.2 (which is likely 6-12 months away).

I hope you'll find SBOL Visual useful and adopt these methods for all of your illustrations and tools. Now that we've got a good emerging graphical standard that's easy to use in most illustration tools, I see no reason for anyone to avoid embracing it.  And if you run into obstacles or have suggestions for how to improve the standard---get in touch! It's an open community process, and we've had lots of good stuff come in from people joining the community over time!

Sunday, July 21, 2019

Backyard Naturalists

When my older daughter decided she wanted "science" as the theme for her seventh birthday party, we all brainstormed up "experiment" contests for the kids to do. The kids had a blast doing things like geometry (constructions with gumdrops and toothpicks) and chemistry (who can get the biggest Mentos and soda explosion?), but my favorite was our biology experiment. 

We called it "backyard naturalists", and I'd gotten a little USB "microscope" (really a macro camera) so that the kids could go find interesting things in the garden and then look at their findings blown up huge on my computer screen.

There are wonderful things hiding in your yard, and even very little kids can become entranced with the intricacy and beauty of them, and all the new questions that can be revealed when you look at something up close and carefully.  Here are some of the best of the things that all the backyard naturalists brought in and crowded around the screen to see.  No animals were harmed in the making of these images: all were released safely into the backyard when their photo sessions were complete. Enjoy!

Petals on a clover flower
A bird's discarded feather
A small moth, gently contained within a plastic cup.

The interior of a flower, including little white shrimp-like mites (one is particularly visible in the lower center)

Japanese beetle, exploring possible food sources offered to it.

Interior of a flower looking beautifully spiky and crystalline.

Wolf spider, found scuttling along and hunting.

Friday, July 19, 2019

Patrick Winston and the Power of Imperfection


I learned today from colleagues that one of my mentors had passed away. Patrick Winston was one of my thesis advisors in graduate school, but also one of the people who first inspired me to seek to go there (along with Gerry Sussman). He was also my boss as a TA, a colleague and collaborator, and someone that I think I may have disappointed with my choices.

At first, I didn't quite engage, but a colleague and friend asked me how I was doing, and I found myself writing back much more than I had thought to, as I thought more and more about how big an influence Patrick was on my whole life, as well as my career. I learned so much from him, both his good points and his flaws. And more than anything else, I think I learned about the courage to let go of perfectionism and worry less about being right than just being better than I was before.

Patrick, in my experience, was not particularly a natural leader or a charismatic speaker or a gifted teacher. And yet he excelled at all three, at leadership and speaking and teaching, not by dint of some amazing gift but by the fact that he deeply cared to do all of these things well. To do this, Patrick collected heuristics from observing people he admired, learning not their brilliance but their ways of avoiding disaster. And amazingly, it turns out that you can go a lot farther and get a lot more done just by carefully using a little checklist of heuristics to avoid pitfalls than you can get done by being a brilliant egotist with a fatal, unacknowledged flaw.  Patrick was one of the most humble people that I knew, with a quiet way of speaking and a careful attention to the big picture and just plain being effective and consistent at whatever he judged to be the most important things to do.

Patrick was generous with his knowledge too. Above all else, I knew Patrick as a teacher, a teacher in many different ways, who would smuggle extra lessons into all his actions, just because he thought his listeners might appreciate knowing these things too. I learned from him for years, and I'm still using many of the most important lessons that he taught me.

Just last week, I was talking about Patrick to a younger colleague, introducing yet another person to his wonderful heuristics on how to give a talk. I'm thinking about that as I write this, and how grateful I've been for all those lessons, and the trust and help he gave me when I was his student, even when I was on my way to make another set of interesting mistakes. But I'm also thinking about how I took his presence for granted and hadn't stopped by to visit him at MIT for years, which makes me sad.


He was not a Great Man, in the sense that I no longer believe in Greatness (partly due to the lessons I learned from him). But he was a person who achieved greatness in many different ways and, I believe, above all else in the ways he invested in teaching so many of us in so many different ways.

I am deeply grateful for the gifts that I received from him, and will continue to do my best to pass them on.

Wednesday, July 03, 2019

Introducing the SBOL Industrial Consortium

We're officially announcing the launch of the SBOL Industrial Consortium today, which makes me very happy.

This is something that's been a long time coming: there's a number of us at companies that are using, invested in, or interested in SBOL. Pretty much every synthetic biology company grapples with the problems that SBOL aims to solve, and either we have to roll our own or else we need to make sure that something is out there that's close enough to our needs to make it work. Or, as I like to put it: if SBOL didn't exist, we'd have to invent something very much like it anyway.

But most of the development to date has been done by universities, and there are things that are just plain hard to do on a model of grant-based funding of development by graduate students. To bring things to the next level, and ensure a stable base of shared infrastructure, we need industry involvement, and the goal of the SBOL Industrial Consortium is to organize and coordinate the interested industrial players in a free and open context, where all of us can benefit, whether in industry, academia, or government.

The SBOL Industrial Consortium first really nucleated in hallway discussions at the SEED Synthetic Biology conference last year. From then until now was a few months of discussions to figure out who was interested and the principles for organization, then about six months getting the legal framework set up, and then a couple of months getting the founding members sorted out, arranging finances, and bootstrapping up our organization.

SBOL Industrial Consortium logo and founding members

We've got a nice strong founding team, and we've got some nice press in SynBioBeta and  PLOS to help announce our launch. Next step: making sure that we're able to really work together and help each other out in a way that makes it worth it for us all. If it works, I expect that the consortium will grow naturally and organically from here.

One way or another, though: as all of us companies working in synthetic biology grow and need to exchange more information about what we're doing in our business transactions, something like the SBOL Industrial Consortium needs to exist. I'm happy to be helping try to keep that in the open and non-proprietary space, and I have confidence that we've got a good shot at being able to make something useful work.

Thursday, June 20, 2019

Engines, Specialists, and Ambassadors

Managing projects with volunteers is a very different challenge than managing projects that people are being paid to work on.  I've ended up doing a lot more of this professionally than I might have expected in my life, through my involvement with things like the SBOL standards, the iGEM Measurement Committee, GP-write, organizing reviewing for workshops and conferences, etc. 

When people are being paid to work on a project, then you can rightfully expect contributions at a certain level, based on what they have committed to. If they aren't contributing at that level, then that's a problem, and you can have a discussion about how to fix that problem, up to and including redirecting those resources to somebody who can contribute as expected.

With volunteers, on the other hand, every hour of effort is a gift to you and to the project. You have no right to expect any particular level of contribution from any person, and if you demand more than somebody feels like giving, it's entirely appropriate for them to simply walk away.

This can leave a volunteer-based project with a real dilemma. How do you actually get stuff done when nobody is required to do something? It also can feel quite unfair, since often a few people are giving lots of time and effort, while many others are doing barely anything.  Shouldn't those lazy people do more work?

I've come to realize that, for many volunteer projects, that's not the case. It's OK to have a lot of very different levels of contribution, and even the people who look like they are doing nothing can often make a very valuable contribution to the project.

How I've come to think of it, inspired by something I heard from somebody else at a meeting a few years back, is that you can think of the people involved in a volunteer project as falling into three rough clusters, Engines, Specialists, and Ambassadors.

Three clusters of volunteer project contributors: Engines, Specialists, and Ambassadors.
  • Engines: These people are the working core of the project, who can be counted on to step up if something needs to be done, just because it needs to be done. There are usually very few people who are Engines, but they get a lot done and deal with a lot of scut work and thankless tasks. As such, Engines are also in danger of developing feelings of elitism and entitlement towards the less committed members, which can quickly poison an organization. If you're running (or de facto running) a volunteer project, you are probably an Engine.
  • Specialists: These people tend to have particular aspects of the project that they are interested in, and contribute only to those. Specialists often also have particular narrow skills that may be in high demand both for this project and for other things that will take them away from the project. On the parts they want to do, they may put in lots of time and be fantastically productive. Other things, they either just won't volunteer, or else may offer but fail to deliver on anything but the parts that are their specialty.
  • Ambassadors: The rest of the people on the project, the majority group of "slackers" who never get anything much done, are your Ambassadors, and they're more important than you may think. Consider this: why do they stick around if they aren't actually getting anything done? After all, there's something missing that means they can't actually contribute effectively---most often either time or relevant skills. Yet they keep hanging around, which means that they must think that the project is important to pay attention to in some way! That's what makes them your Ambassadors, because they carry their knowledge of the project into all the other non-project things that they are involved with, and will spread that knowledge to other people in the community, making connections and effectively promoting your project.
Hopefully, you can see that all three groups, Engines, Specialists, and Ambassadors, have an important role to play in making a volunteer project successful. Moreover, I've personally found virtually no way of telling who is going to turn out to fall into which category. As such, when I'm running a project, what I tend to do is simply welcome all comers and let them sort themselves over time by interest and inclination. 

Similarly, over the lifetime of a project, people will tend to drift up and down the classes based on their interests and the other things that are going on in their lives. A healthy project will then adapt to people drifting in and out, and will adjust the scope of its ambition to match the contributions that its volunteers are capable of making, rather than trying to extract more labor out of people who do not need to even be contributing at all.

Engines, Specialists, and Ambassadors: understand and embrace the differences in skills and interest levels, rather than trying to make things fair, and I believe your volunteer-based projects are more likely to succeed.

Tuesday, June 04, 2019

Open Technologies are Not a Passive Choice

Open technologies make our society a better place. What I mean by "open" is things that are not encumbered by patents, costly licenses, proprietary "know-how", or other rent-extracting dependencies that make it difficult for new people or organizations to start using them. Open technologies abound in the world of computing: the internet, email, and web pages are prime examples, as are most of the key libraries and software tools that support them. Similarly, the current revolutions going on with big data and machine learning are driven in part by the enabling power of the plethora of available powerful free and open software tools.

Open technologies, however, are not the natural state of the world. Some of that, of course, is due to simple human greed and competition. If you can technologically lock people into your platform, then you can make money off of them because the cost of switching is just too high or because you've established a de facto monopoly. I am convinced, however, that closed technologies are much more often simply the default position, and that we degrade toward that position whenever there is insufficient investment in keeping technologies open.

Open technologies in synthetic biology, as everywhere, are constantly being nibbled away at by antithetical market forces.

Consider the fact that making something an open technology is hard and takes a continual investment of resources:

  • You have to document and explain things clearly, so that other people besides your team can use the technology.
  • You have to bring together and maintain a sufficiently amicable community of people who find enough value in the technology to want to use it.
  • People in the community will have different needs, so the technology is going to have to become more general and more complex, or else the community of users will fragment or shrink.
  • Different implementations will have different mistakes and ambiguities, and if you don't identify them the technology will start to develop "dialects" and incompatibilities.
  • As the technology evolves, or the world around it does, you have to adapt and update the whole mess, plus handle backward compatibility since older uses of the technology will still be around.

Notice that none of these steps are easy, and if anything goes wrong with any of them, the result is a less open technology. Keeping any technology open is thus a continual and ongoing struggle.

Now put that in a world of careers and money, and it all gets more complex. 

First, there's the straightforward problem of competition between open and proprietary technologies. There is always somebody who is interested in making money off of a proprietary alternative to an open technology, and if they've got more resources, they can often either "embrace, extend, and extinguish" or simply out-develop and out-market the open technology.

A more insidious problem, however, is passive choice. In our competitive global world, it doesn't matter whether you're in academia, in government, or in industry, in a big organization or a little one: most people who are doing something interesting are stretched for time and for resources. That means nobody is choosing whether to invest in an open technology or not. They're choosing whether to invest in an open technology or whether to invest in something else that's probably more urgent and more directly related to their career, their bottom line, etc. So for most people, it's always easy to say that the time is not right for them to invest their time, energy, money, credibility into an open technology, or even to just not pay attention at all.

Where does that leave us?

It's really easy to endorse open technologies and to say that you support them.

But the ongoing cost and challenges of maintaining open technologies also means this: if you aren't actively investing in open technologies, then you are actively choosing proprietary technologies over open technologies.

In the world of computing, it's been a long, hard fight, but open technologies are extremely firmly established in the general culture and there are many effective people and organizations that are actively investing to keep these technologies open.

In synthetic biology, the future is much less certain. On the one hand, there is a great and general enthusiasm for open communities, engineering ideas, and the vast possibilities of the field, which tends to support development of open technologies. On the other hand, there are a lot of broad intellectual property claims on fundamental technologies and a lot of money flowing into a lot of quickly growing companies, both of which tend to strongly promote the closure of technologies.

I would judge that within the next 5-10 years, we're going to be in a situation where either a) we are able to develop a strongly established foundation of open technologies and a supporting culture, as in the computing world, or b) the potential of the field becomes badly stunted by the difficulty of operating, where the cost of doing business is high and so are the barriers to entry for new players.

If you are in the field of synthetic biology, I believe that you need to think about where you stand on this, and make a decision about what you're going to do.  Are you going to actively invest in open technologies, or are you going to sit back and simply hope that the field does not get closed?

So if you are a synthetic biologist who agrees that open technologies are valuable, what should you do? Here are three simple ways to start investing:

  • Figure out which of your proprietary things aren't actually important to keep proprietary, and make them available on open terms.
  • Try out an open technology, and figure out how to make it work for your group. There will be bumps and problems, but when you face them go ask for help from the developers rather than dropping the project.
  • Help develop an open technology. Any healthy community will welcome you with open arms.

Doing any of these will cost you, whether in money or opportunities.  Any benefits you get are more likely to be long-term than short-term.

I think it's worth it, though.

I spend most of my working life in the synthetic biology community. When I invest in open technologies, I'm investing in helping keep that community the sort of community where I want to spend my time.

And so I contribute to SBOL and to iGEM, we release and maintain software like TASBE Flow Analytics and TASBE Image Analytics, and I choose to use my time to go to meetings like the one where I just spent my last two days---the BioRoboost "Workshop on Synthetic Biology Standards and Standardisation."  I'm imperfect and the things I do and build are imperfect, but so far as I can tell, overall I am helping to make a useful contribution to our community.

What has your organization done for open technologies in synthetic biology lately?

Saturday, May 04, 2019

Down in the weeds with flow cytometry

As a computer scientist and an engineer, I love flow cytometry, and today I'm excited to tell you about the new paper that we've just published about the subject.

I love flow cytometry because it's the closest that I currently get to being able to stick logic probes into cells (though we're trying to do better). I also get measurements from hundreds of thousands of individual cells, so there's more than enough to get really deeply into their statistics and learn a lot. Plus, it uses frickin' laser beams: the cells' fluorescence actually gets interrogated by sputtering them in a stream past several lasers of different colors, which blast the cells so that we can see the light that gets thrown off in response.

All well and good, but actually using the information is not all joy and laser beams. There's a whole bunch of complexities to deal with in order to turn the raw numbers into reliable measurements of the biology I'm interested in, rather than just the physics of blasting cells with lasers. And not just cells either: the first thing we have to do is try to sort out the single cells from the bits of debris and from the pairs and clumps of cells. Then you have to trim off the background fluorescence of the cells, sort out spectral overlap between the different proteins and lasers, and then somehow relate all of those numbers to molecules and make them comparable even though the different colors come from different molecules.

A typical workflow for processing raw flow cytometry data into comparable biological units, implemented by our TASBE Flow Analytics tool.
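
To give a flavor of what those steps look like in code, here's a hypothetical sketch of the arithmetic in Python---not the actual TASBE Flow Analytics interface, just the shape of the pipeline:

```python
# Hypothetical sketch of the processing arithmetic; made-up numbers throughout.
import numpy as np

def gate_singlets(fsc_area, fsc_height, tol=0.15):
    """Keep events whose area/height ratio sits near the single-cell cluster."""
    ratio = fsc_area / fsc_height
    return np.abs(ratio - np.median(ratio)) < tol * np.median(ratio)

def to_calibrated_units(raw, autofluorescence, spillover, mefl_per_au):
    """raw: (events x channels) arbitrary units from the cytometer.
    autofluorescence: per-channel mean of a non-fluorescent control.
    spillover: (channels x channels) spectral overlap matrix.
    mefl_per_au: per-channel calibration factors from bead/dye standards."""
    corrected = raw - autofluorescence                  # strip cellular background
    compensated = corrected @ np.linalg.inv(spillover)  # undo spectral overlap
    return compensated * mefl_per_au                    # arbitrary units -> calibrated units

# Made-up two-channel example:
raw = np.array([[620.0, 180.0], [950.0, 240.0]])
spill = np.array([[1.00, 0.10],   # 10% of the first fluorophore bleeds into channel 2
                  [0.05, 1.00]])  # 5% of the second bleeds into channel 1
calibrated = to_calibrated_units(raw,
                                 autofluorescence=np.array([50.0, 80.0]),
                                 spillover=spill,
                                 mefl_per_au=np.array([200.0, 350.0]))
print(calibrated)
```
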
Getting all of this right is surprisingly difficult. It's not that any one thing is all that difficult, but there are a lot of them. All of these steps have failure modes and hidden gotchas to deal with too, and it's easy to stumble over one thing or another, especially when you're dealing with datasets with hundreds of samples. As a good (lazy) computer scientist, I respond to challenges like this by making the computer do the work for me, and so my colleagues and I have done just that.

In a way, it's surprising that we've needed to do this, given that flow cytometry has been established for decades and is widely used in both medicine and research. Most people, however, still aren't trying to use the data for precision quantitative modeling with big data sets in the way that a few folks (including me) have been doing in synthetic biology. As such, the prior tools that were out there weren't up to the job.  It's not that there is anything wrong with these tools, it's just that they are not designed to be good at the particular types of automation and assistance that I've found are needed for characterizing systems in synthetic biology. Thus, we've ended up needing to build our own tools, and have been incrementally developing and refining them for years on project after project after project that has made use of them.

Two years ago, we began to share with others by releasing our TASBE Flow Analytics package as free and open source software, and as of yesterday it's also been officially published in a scientific paper in ACS Synthetic Biology. I'm pretty happy with where our tools have gotten and the number of projects and collaborators that they are supporting, with different modes and interfaces for different folks:

  • The basic Matlab/Octave/Python interactive interface, which is what I most often use myself as a programmer-analyst.
  • An Excel-sheet interface, which our laboratory-based collaborators have found much more intuitive and user friendly, since it's a lot like the spreadsheets they use to design their experiments in the first place.
  • A scripting interface for use in high-throughput automated systems, which is how it's being used in the DARPA Synergistic Discovery and Design (SD2)  program.
Infrastructure projects like this aren't particularly flashy or cool, but it's the sort of thing that greatly changes what can be accomplished. We're back to third grade science once again, and those foundations of the scientific method: units and reproducibility. TASBE Flow Analytics is one more piece of that puzzle, and I hope that it will continue to expand the number of people and projects that are benefiting from high-quality measurements of cells.

Wednesday, April 24, 2019

Friend or Foe at BBN

My colleague Aaron Adler just got a nice press-release writeup on a new project he's been starting in the DARPA Friend or Foe program, in which he's leading a team developing methods for culturing and assessing bacteria for their potential danger as human pathogens. I think this is a pretty cool project and potentially quite significant if it can be made to work. Basically, it's all about building environments that can trick bacteria into behaving "normally" even when they're in the lab, first "culturing the unculturable" by turning the bacteria's home into a lab dish rather than the other way around, then shifting samples into miniature human-like tissue environments to trick the bacteria into revealing otherwise hidden behaviors.

This project is also a nice example of how our synthetic biology group has been growing at BBN, as the core technologies of synthetic biology prove relevant across all sorts of other biological application spaces. Aaron's also one of the folks who has been really excited about the application of artificial intelligence methods to synthetic biology problems, co-organizing workshops and symposia with another colleague, Fusun Yaman. I'm excited to see if this can achieve the potential of its vision, and also excited that our group has grown enough that cool projects like this can be lifting off with minimal involvement from myself.

Sunday, February 10, 2019

A Moral Basis for Ethical Genetic Engineering

I recently read a news report saying that two colleagues of mine have been trying to put together a business to create genetically modified human babies with added "good" traits. My immediate reaction to reading this news was deep upset and strong moral distaste.

But why? If I believe that some types of genetic engineering are wrong, while other types are permissible, what is the actual basis for my personal moral judgement? Relatedly, what does that say about the code of ethics that I would want practitioners of the field to follow? (For purposes of this discussion, I will use "morals" to refer to personal evaluations of right and wrong, and "ethics" to refer to the practices a community uses to try to avoid bad moral consequences).

I've been mulling these questions over personally for several years now, driven both by my own thoughts and my conversations with friends, family, and colleagues. More recently, I've been starting to have these conversations with other people at BBN as well, as our synthetic biology group grows and we consider an ever broader set of possible opportunities to pursue. Which calls for proposals should we embrace, and which should we pass over because we do not approve of their direction?

I think these are extremely important questions to think about carefully and have a clear understanding of where one's judgements are actually rooted. On the one hand, our minds easily conflate "unfamiliar" with "wrong", and history is full of lessons on how words like "unnatural", "improper", and "distasteful" have simply been codes for prejudice that has had to be unlearned one small step at a time. On the other hand, history is also full of lessons about how easy it is to step into morally abhorrent positions and actions one seemingly reasonable step at a time.  Having a clear understanding of the basis of one's judgements is an important defense against both of these failure modes.

A Few Assertions on Morality

For a starting point then, let me begin by leaving any potential deities out of the discussion. Instead, let me start with a few grounding assertions that I think most will find non-controversial:

  1. Conscious minds are precious. I know I treasure my existence, and expect that most others generally do as well. My circle of empathy extends at least as far as nearly all living humans and a lot of the more brainy animals.
  2. I should treat others as I would like to be treated.
  3. Deriving from the first two: a person's autonomy of choice should be respected, at least so far as it does not infringe on others.
  4. It is better to avoid unnecessary suffering. Sometimes suffering is necessary or unavoidable, but given a choice it is generally preferable to have less suffering in the world.
  5. We often make mistakes. This is especially true when dealing with new or poorly understood things and large-scale or long-term consequences.

These statements are by no means the whole of my moral system, and there are lots of grey areas to explore with regard to their definitions, boundaries, and conflicts. They are, however, some good basic guardrails for my thinking: anything that clearly starts to violate one of these assertions is a place where I don't want to go.

Moral Judgements on Genetic Engineering

So let's start looking at genetic engineering as a subject of these moral judgements.

First off, is there anything about the creation and editing of DNA (or similar) that is inherently morally problematic?

The necessary materials and equipment involved are relatively cheap and easy to obtain from normal sources. So with respect to the material resources involved in genetic engineering, we're talking about moral issues akin to those involved in eating cereal or buying clothing, not anything specific to genetic engineering.

Likewise, I see no inherent problem in modifying the DNA of living creatures. There are examples that I find clearly in support of my moral values, such as the development of gene therapy to correct otherwise fatal or debilitating genetic diseases.

It seems then that any moral judgements that I make are not grounded in the technology itself, but in the bad effects resulting from choices that we may make about how to use it. In short, genetic engineering poses moral and ethical challenges because it is a disruptive technology that gives us choices that we did not have before.

What sort of bad consequences am I concerned may result from poor choices in the use of genetic engineering?  Well, some are just the usual concerns related to the potential for any disruptive technology to reorganize money and power in societies, but those are not specific to genetic engineering. When thinking about the specific technologies, here are some key things that I would like to avoid that I also think should be relatively non-controversial:

  • Injuring or killing people: obviously counter to the moral values I've expressed above
  • Degrading people's autonomy: likewise, obviously counter to moral values.
  • Damage or destruction of infrastructure: creates disruptions that tend to involve suffering, injury, and death.
  • Disruption of ecosystems: another source of disruption, and often unpredictably so
  • Splitting humanity into more than one species: we have done badly enough morally with "other" groups when we are all at least members of the same rather homogeneous species.
  • Significant loss of human diversity: seems likely to involve degradation of autonomy and to lead to increased fragility.

Again, this is by no means attempting to be a complete list, but is at least a good set of guardrails to begin with. If a project has the potential to make one or more of these scenarios more likely, then that is clearly a moral hazard to concern ourselves with.

Ethical Genetic Engineering

Turning from morality to ethics, how do I think that concern about potential consequences should affect our actions? First and foremost, I am strongly aware of the fact that we humans frequently make mistakes and that some consequences of genetic engineering, once set in motion, might be quite large-scale in impact and also quite hard to reverse. This leads me to embrace a version of the precautionary principle: whenever considering a research or development choice that may have a major moral impact, I would hold that one should move slowly and incrementally, step by step building up knowledge, precise and predictive models, and increasing levels of consensus regarding the morality of particular choices and their consequences.

In my view, then, an ethical approach to decision-making in genetic engineering ultimately boils down to a relatively simple core:

  • For any potential project or technology, one must assess the degree to which it increases risk on "the checklist of bad consequences."
  • The closer one is to real-world applications, the more predictive certainty is needed in this risk assessment.

Great complexities, of course, may arise in actually making these evaluations, and that is the point: if you want to take risks with lives, species, or ecosystems, you'd better be able to establish with great depth and certainty that the risks you want to take are truly low.

Returning to the Matter at Hand

With this enunciation of my approach to the ethics of genetic engineering, I think the reasons for my reaction to the news I read become quite clear. First, the report spoke of plans that could take potentially grave risks with thousands of human lives within a mere few years (e.g., what if a "good" modification turns out to have nasty side-effects on the children bearing it a few years down the line?). Moreover, in the report there seemed to be no evidence that those proposing those plans were even thinking about the risks, let alone making reasonable plans to assess and mitigate these risks. Perhaps the report is wrong, and perhaps those colleagues will communicate facts to me that would cause me to change my judgement of their work. For now, however, the facts that I have been made aware of certainly seem to show a serious violation of the ethical principles that I espouse.

Making this assessment has been quite a challenge, and I expect that I will revisit it over time to see if there are things I want to add or to adjust. For now, however, I am satisfied to rest with this increase in understanding of where I stand and why on the moral and ethical questions that are involved in genetic engineering.

In short: "First, do no harm."

Thursday, January 17, 2019

The end of an era

One week from today will be the official end of an era for me. After nearly 23 years, more than half of my life, I will no longer have an account with MIT.

My MIT email address was my first "real" email address, at least in the sense that I can no longer remember definitely just what my high school email addresses actually were: you'd have to dig into the prehistory of AOL or old shareware repositories to find that information. I signed onto Project Athena in August of 1996, was indoctrinated into the joys of "mh", and began my tumultuous undergraduate career.

I stayed at MIT for 12 years as a full-time member of its community: four years as an undergraduate, one as a Masters student, one in a strange superposed state of both Masters and Ph.D. programs, which mightily confused the registrar's systems, five more years of purely Ph.D., and then a year of transitional postdoc while I figured out what to do, ultimately departing for my current employer of BBN in 2008.

Self-portrait as a postdoc, in my old office in Project MAC at MIT CSAIL
Even after leaving, however, I maintained my affiliation and strong collaborations. In the first couple of years, I was still running the same MIT projects that I had been running while I was actually employed there, and so of course I was always on campus to meet the students who were working for me. Other collaborations started up thereafter, and one way or another, it tended to be the case that I was working on campus at MIT at least half a day every week. Keeping research affiliate status made a lot of sense for all involved.

Back in those early post-departure days, I was also still much more involved as an alum in student group activities.  One of my long-running joys, which I still miss, was running live action role playing games with the MIT Assassin's Guild. More regularly, however, I also continued to be a volunteer librarian with the MIT Science Fiction Society, and every week would spend two hours as the on-duty librarian holding open the world's largest publicly browsable collection of science fiction. I needed my card and my affiliation to be effective at those duties as well, and I enjoyed them greatly: the Assassin's Guild as a heated activity of creative passion and adrenaline, MITSFS as a cool oasis of two calm hours of mostly just reading.

When I followed my wife to Iowa in 2013, however, the actual "showing up on campus" part stopped happening. I resigned as a librarian, and I'd already mostly stopped writing and playing games, as new parenthood and professional travel began to squeeze that time more and more.  My collaborations have continued, but with me no longer actively on campus or needing special access to resources, there's not as much point in having me still maintain an active affiliation.

Sometime last summer, my affiliation failed to renew, and I didn't notice. When I got my account deactivation warning, I pinged the collaborator who'd been sponsoring me, but neither of us got around to following up. And really, the fact that it just wasn't making my triage list as "important" any more was the sign that it was time to let go.  I'm no longer an active alum, and I don't need a research affiliate status to be an effective remote collaborator, after all.

And so, over the past two weeks I've been packing up to go electronically. I've redirected my non-BBN mirror of my professional webpage away from MIT and over to GitHub. I've copied over all of the material from my old Athena account (finding and revisiting some amazing old memorabilia in the process). I've even gone through every email received at the old account in the last year and switched over all the ones I cared about. I'm as ready as I can be to let go.

Goodbye old friend, old email address. I never like to truly let anything be gone, but I'm not there any more, and at least I've still got my alum account.

Saturday, January 12, 2019

Taming emergent engineering

Understanding and engineering emergent behaviors is one of the long-standing challenges of complex systems.  Over the past fifteen years, one step at a time, my collaborators and I have been pinning down the engineering of emergent behaviors.  Our most recent publication, however, represents quite a major step in the project.

"A Higher-Order Calculus of Computational Fields", out this week in ACM Transactions on Computational Logic, finally puts a solid mathematical link between collective phenomena and local actions.  In this paper, we present not one but two equivalent semantics for aggregate programs: one in terms of local actions of devices and the other in terms of collectives extending across space and time.  Every field calculus program expressed in one view can be automatically translated to the other, from global to local and from local to global.  We've been working with this result informally for many years, but now we have rock-solid mathematical proof.

Now combine that with "Space-Time Universality of Field Calculus", a paper we published last year demonstrating that every computable function over space and time can be implemented using field calculus.  That tells us that, no matter what emergent behavior you might be dealing with, if it is physically possible, there is guaranteed to be a program that can be expressed in our simple language that can both describe the collective behavior and be applied to produce it from local interactions.

This doesn't mean we can predict the behavior of any old system out there.  Just because you know there is a description doesn't mean it will be easy to find it, or that said description will be simple. Likewise, it might be difficult to understand the implications of a program. But having a simple language that is guaranteed to cover all of the relationships of interest can make a very big difference in just how hard that search space is to navigate.

Unfortunately, I don't really recommend that you read either paper unless you love wading through heavy mathematical symbology.  Ultimately, once you wrap your head around the mathematics, the core ideas of each paper are fairly simple and elegant, but there's a lot of supporting details that have to be dealt with, systematized, and pinned down with mathematical variable names.

Next step: making a more digestible summary of the key results available to the wider community who may be interested.

Example of resolving an aggregate function call over space and time in higher-order field calculus.