Sunday, April 08, 2018

Linking biological designs and experimental data

One of the biggest points of friction in my professional life is the disconnect between the design of an experiment and the data that comes out of it. Not in any deep or scientific sense, but in a boringly practical sense of "How do I know what's in file MyRun_F05_039_pXK405.fcs?"

When I'm working with experimentalists and analyzing the data that they've produced, the information I need to make this connection arrives as spreadsheets with colored cells and personal shorthands, or unintentionally cryptic emails, or scans of tables with hand-written notes. Then I make my best guess as to what's being encoded there and start organizing file names into scripts to run my analysis. The actual process of analysis is often very fast, only a few minutes, but for a good-sized experiment it can take hours to set it up to be able to run.
Example of fairly typical current integration of biological data with experimental design.
Even then, our pain isn't over, because there's a major challenge in comparing across data sets, especially when working with multiple people on a project or across a project spanning many months or even years. Is the control the same as it was two months ago? What does "same" even mean, exactly? I had a data set go completely wonky once because the experimentalist working with me had run out of one plasmid and substituted another that they thought should be equivalent, but that had an extra "unimportant" gene on it. The descriptions that I got used the same descriptor for the new plasmid as for the old one, because of course they were only describing the "important" parts of the construct. We lost at least a month of time on the project.

All of this can be simplified if we get automated software tooling involved, so that with minimal human involvement we can link data to laboratory samples, samples to the descriptions of what they are supposed to contain, and DNA designs to the biological functions and interactions that they are intended to produce. For that to work, we need to agree on how we are going to describe those relationships, and that agreement, I believe, is the most critical thing that our newest release of the Synthetic Biology Open Language (SBOL), version 2.2, gives us, along with some tools for describing combinatorial designs. Version 2.2 has just been officially published as a free journal article, and we're well into putting these new linkages to use in several programs, as well as organizing a workshop to teach people how to link these and other tools together.

Step by step, we are getting closer to removing this persistent source of friction and error in our biological studies.

Sunday, March 18, 2018

Diagrams showing structure and function in biological organism engineering

We've just had official publication of another major step forward in turning synthetic biology into a well-organized field of engineering: the SBOL Visual 2.0 standard. This is a big one, because it means we have a clear way not only of summarizing genetic structure (as we have had since SBOL Visual 1.0), but also of showing the interactions of genes with proteins and other molecules in order to actually affect cellular functions.
Example of an SBOL Visual 2.0 diagram, showing a system with two functional units: one producing the regulatory protein TetR, which in turn represses the other's production of green fluorescent protein (GFP).
Everybody's been drawing diagrams sort of like this already in the papers that they publish, but there hasn't been any agreement on how to do so, and so every diagram's a little (or a lot) different, with no good way to make sure that you really know what somebody's diagram means besides reading the whole text in detail---and sometimes not even then. Now, with this standard, we have such a system, and we just need to keep spreading the word, so that people are aware of the guidelines and can understand how following them will make it easier for others to read what they have written.

Friday, March 16, 2018

Good Measurement Practices

As we work to promote awareness and use of good scientific measurement practices in iGEM (the International Genetically Engineered Machines competition), we've just posted an educational video with me giving a (hopefully accessible) introduction to four simple principles of good measurement practices.

Tuesday, February 06, 2018

The LOLCAT Method

You probably think the title of this post is a joke. Well, it is, but probably not in the way that you think it is.
LOLCAT helping me with SCIENCE!
You see, back in the waning days of my grad student career, I started working with an ambitious and enthusiastic young undergrad named Sagar Indurkhya who wanted to work on better ways to design synthetic biology circuits. I was just getting into the area myself, and our efforts quickly wandered sideways, from work on circuit design to work on simulators. Sagar was using stochastic simulators and found (as many people do) that they were way too slow for his taste. So he went to town on the optimization problem, finding all sorts of crazy ways to improve the speed, from highly general (factoring reactions to improve scaling properties), to super-specialized (making his own specialized virtual machine). Happy with the remarkable improvements in speed that we'd gotten, we decided to write it up and, liking publications without paywalls and having no particular reason to send it anywhere else, we sent it to PLOS ONE.
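For readers who haven't run one, here is a minimal sketch of the baseline algorithm that work like Sagar's accelerates: the classic Gillespie "direct method," shown on a toy birth-death model of constitutive gene expression. This is purely illustrative---it is NOT the LOLCAT method, and the rate constants are arbitrary numbers chosen for the example.

```python
# Minimal Gillespie "direct method" on a birth-death process:
#   (make)  0 -> X   at rate k_make
#   (decay) X -> 0   at rate k_decay * n
# Illustrative only; not the optimized LOLCAT method, and rates are arbitrary.
import random

random.seed(0)

def gillespie_birth_death(k_make=10.0, k_decay=0.1, t_end=100.0):
    t, n = 0.0, 0  # current time and molecule count
    while True:
        a1, a2 = k_make, k_decay * n   # propensities of the two reactions
        a0 = a1 + a2                   # total propensity
        t += random.expovariate(a0)    # exponentially distributed waiting time
        if t >= t_end:
            return n
        if random.random() * a0 < a1:  # pick which reaction fired
            n += 1                     # birth
        else:
            n -= 1                     # death

final_count = gillespie_birth_death()
print(f"molecule count at t=100: {final_count}")
```

At steady state the count fluctuates around k_make / k_decay = 100; the cost of naively recomputing every propensity at every step is exactly what reaction factoring and clever update graphs attack in large systems.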

In the process of writing things up, however, we needed to give the algorithm a name, and one fateful day Sagar asked me: "Can I name it anything?" I said sure, and he continued, "Even something silly, like LOLCAT?" I hesitated, but couldn't really find any particularly good argument against it besides the fact that it was silly, which at the time didn't seem to me to be a sufficient argument against. And if it was a problem, the reviewers would ask us to change it, right?

Not a peep. I just looked back through and found that the reviewers were perfectly happy with our absurd title, engaged seriously with the paper to provide a sound and sober analysis of the LOLCAT method that resulted in significant improvement in manuscript presentation, and then the paper went through for publication. And then I mostly just forgot about it.  I don't use stochastic simulations very often, and when I have it's typically been on much smaller systems, so I just haven't ever had reason to use the work myself.

But others have. I was reminded of the paper this morning, in fact, by a citation alert. After a long period of dormancy, the LOLCAT method is gathering citations as reaction network simulations grow and people are apparently finding it to be of significance in their work. As of this writing, it has received 18 citations---not huge, but definitely showing a significant impact.  I am profoundly ambivalent about this fact: happy that it's a useful piece of work, cringingly embarrassed at my early career naiveté, yet also defiantly proud of our little joke. We didn't even have the good grace to try to make the name an acronym.

It's out there still, and will be in the scientific record forever after, for good or ill: "Reaction Factoring and Bipartite Update Graphs Accelerate the Gillespie Algorithm for Large-Scale Biochemical Systems."  The LOLCAT method.

Wednesday, January 31, 2018

The Mark of Dubstep

I did a very joyful and stupid thing today, but they can't say they weren't warned.

Some of my fellow committee members submitted their pictures and bio blurbs on time. Some did not. The ones who did not were asked again. It's just a sentence or two. The blurbs went unwritten. Last week they were jokingly warned: send in your bios, or else Jake will write something unusual and ridiculous for you. At today's meeting, I was given free rein.

They all got bios and DJ names. The first sentence was serious, the second exposed the free-associated and unusual fictional lives of my colleagues:
The one who hadn't submitted a headshot yet got Wikipedia's current illustration of a kitten. He responded with a correction very quickly indeed. The rest are still up there, as of this writing.

I am disproportionately tickled by my own jokes, and it has been making me smile all day.
I am clearly a bad, bad man and an unreliable and dangerous troublemaker.

I wonder how long the Mark of Dubstep will remain.

Tuesday, January 02, 2018

The Physics of Time Management

As my professional life grows increasingly complex, I have found a need to organize it with the aid of physics-style laws. The three basic principles that I use are:

  1. Conservation of Time: Time can neither be created nor destroyed (though it can be wasted).
  2. No Free Lunch: Accomplishing goals requires time.
  3. Burnout Limit: The (sustainable) amount of time available for work in each week is limited.
Considering these three principles forces me to make difficult decisions about triage. No Free Lunch means that I cannot hope for things to be accomplished that I do not make real time for in my schedule: at best, I can hope for my accomplishments to be proportional to the time that I invest. So my (average) week needs to have time set aside for all of the major ingredients that I need in order to be the scientist I want to be: delivering on my current projects, securing funding for new projects, nurturing my collaborations, pursuing strategic technical goals, and service to my professional community. Each of these requires a certain number of hours to reasonably make progress (and my billing and timesheet goals are subsumed within these too), and at this point in my career, I am not too bad at making estimates.

The burnout limit, on the other hand, is about the relationship of my professional life to my marriage, parenting, sleep, friendships, and self-care. Here, I estimate both the number of hours per week that are sustainable without pain, by looking at my "normal" work times, and also the "surge capacity" that can be obtained if necessary by neglecting the other aspects of my life and calling in favors from my wife. I know that I most certainly will face surges during the year (e.g., paper and proposal deadlines, technical review meetings, parts of travel that aren't dual-use) and this capacity is also where I can try to catch up following surges in other parts of my life (e.g., sick child, doctor's appointments, etc.). So I'd better make sure my "normal week" planning is restricted to the sustainable level, or else every surge will not just be a strain, but a serious crisis.

These two collide painfully in the principle of conservation of time. If I want more time to write papers, that means less time for something else. As my responsibilities for management and advising grow, my time for doing my own programming work decreases. I can allocate my time in many ways, but somewhere, somehow, I will have to say no to things, and conservation of time enforces that dismal fact upon me, forcing me to limit my wishful thinking to something that is more likely to be actually doable.
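The arithmetic behind this exercise is simple enough to sketch. In this toy version, the category names and all the hour figures are hypothetical stand-ins, not my actual budget; the point is just that conservation of time makes the check mechanical:

```python
# A toy version of the weekly time-budget check. All category names and
# hours below are hypothetical, not my actual numbers.

SUSTAINABLE_HOURS = 45  # burnout limit: assumed sustainable work hours/week

budget = {
    "current projects": 20,
    "securing funding": 8,
    "collaborations": 6,
    "strategic technical goals": 5,
    "professional service": 4,
}

total = sum(budget.values())
slack = SUSTAINABLE_HOURS - total

# Conservation of time: allocations must fit within the sustainable limit;
# anything over the limit is wishful thinking that a surge turns into crisis.
assert slack >= 0, "over-committed: something must be cut"
print(f"allocated {total}h, slack {slack}h of {SUSTAINABLE_HOURS}h sustainable")
```

The real spreadsheet is messier, of course, but every version of it reduces to this same inequality.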

I have just finished going through this exercise for planning 2018, and it took 2.5 hours (budgeted in my schema to "self-organization" and "group organization"). The spreadsheet is very complicated, but I have made the numbers balance in a way that I know will not entirely match reality---but at least it gets me started in a way that does not have predictable failures built in. I don't enjoy doing this, but doing it once a year has turned out to be important for me so far, and it's better than closing my eyes and wishing for something that I know deep down cannot be true.

Physics is painful, eppur si muove.

Thursday, November 16, 2017

Pre-Publication Review: Validity vs. Significance

A fellow researcher was recently telling me about their frustrating experience with a journal, in which their paper was rejected when reviewers said it wasn't "significant," but didn't actually bother to explain why they thought so.

This struck a chord with me, and made me think about the two fundamentally different ways that I see peer reviewers approaching scientific papers, which I think of as "validity" and "significance."
  • "Validity" reviewers focus primarily on the question of whether a paper's conclusions are justified by the evidence presented, and whether its citations relate it appropriately to prior work.
  • "Significance" reviewers, in addition to validity, also evaluate whether a paper's conclusions are important, interesting, and newsworthy.

I strongly favor the "validity" approach, for the simple reason that you really can't tell in advance which results are actually going to turn out to be scientifically important. You can only really know by looking back later and seeing what has been built on top of them and how they have moved out into the larger world.

Science is full of examples like this:
  • Abstract mathematical properties of prime numbers turned out to be the foundations of modern electronic commerce.
  • Samples contaminated by sloppy lab work led directly to penicillin and antibiotics.
  • Difficulties in dating ancient specimens exposed the massive public health crisis of airborne lead contamination.

The significance of these pieces of work is only obvious in retrospect, often many years or even decades later. Moreover, for every example like these, there are myriad things that people thought would be important and that didn't turn out that way after all. Validity is thus a much more objective and data-driven standard, while significance is much more relative and a matter of personal opinion.

There are, of course, some reasonable minimum thresholds, but to my mind that's all about the question of relating to prior work. Likewise, a handful of journals are, in fact, intended to be "magazines" where the editors' job includes picking and choosing a small selection of pieces to be featured. 

Every scientific community, however, needs its solid bread-and-butter journals (and conferences): the ones that don't try to do significance fortune telling to select a magic few, but focus on validity, expect their reviewers to do likewise, and are flexible in the amount of work they publish. Otherwise, the community is likely to be starving itself of the unexpected things that will become important in the future, five or ten years down the road, as well as becoming vulnerable to parochialism and cliquishness as researchers jockey and network for position in "significance" judgements.

Those bread-and-butter venues are the ones that I prefer to publish in, being fortunate enough that my career is not dependent on having to shoot for the "high-impact" magazines that try to guess at importance. I'm happy to take a swing at high-impact publications, and I'm happy to support the needs of my colleagues in more traditional academic positions, for whom those articles are more important.  My experience with these journals, however, has mostly just been about being judged as "not what we're looking for right now." So, for the most part, I am quite content to simply stay in the realm of validity and to publish in those solid venues that form the backbone of every field.

Wednesday, October 18, 2017

Professional Life Transition

I haven't posted anything for a while, and I'd like to talk about the reasons. I've been going through an interesting professional transition, and as I've been working on coping and adapting, one of the things that has fallen through the cracks is my online writing. As I start to stabilize again, however, I'm feeling inspired to write and would like to share some of my thoughts and experiences with you, dear readers.

I find that a useful way for understanding how my professional life has recently been evolving is Latour's cycle of scientific credibility. I explored this in more detail in a prior post, but it may be simplified to relations between three primary "currencies" of credibility: data can be invested to develop publications, publications invested to develop funding, and funding invested to develop data.

A researcher always needs to be tending to all parts of the cycle at least to some degree. At different points in a project or in one's professional life in general, however, the emphasis and available resources may shift around. For the past few years, I had been very heavily invested in the data and publications portions of the cycle, getting stuff done as part of a number of delightful collaborations and as a byproduct demonstrating that the ideas and approaches I've been advocating are capable of providing some real value.

Across the course of this year, that has resulted in several really fun new projects kicking off (which I intend to share with you as I come back to writing once again), and me needing to spend more time coordinating with the folks I'm working with. So these days, in addition to my existing external collaborations, I'm working in partnership with an amazing super-experienced program manager (one of the big benefits of my niche in the scientific world), growing my group, and ramping up a number of other folks on these projects.

This is all good, but it's a significant transition, and I've needed to shift around a bunch of my personal heuristics in how I organize my work life. For example, I have to be less of a perfectionist and control freak when I need to be delegating a larger fraction of the work on a project. I have also had to accept that I can't write most proposals in LaTeX any more.

Going through a transition is always intense for me, but I feel fortunate that this one has been good and joyful so far.

Sunday, July 30, 2017

Mantra: a trip down memory lane

I woke up this morning with bright plans to be productive and focused and accomplish various things. Instead, I have spent the morning on a delightful trip down memory lane.

Way back in high school, more than 20(!) years ago, my friends and I made a video game called Mantra. It was a short, fun freeware adventure with a Zelda-like feel and a bunch of obscure jokes (my favorite was a villager who said: "Godot is coming, please wait"---we got so much tech support mail asking us how long you needed to wait before Godot showed up). It was a lot of fun, actually got kinda popular, and probably helped to get me into my college of choice, and then I would forget all about it for years at a time.

This morning, I was reminded again when I found a link shared by my friend Ben to a person who'd done a wonderful play-through on YouTube with commentary. There's a whole six-episode series, all very well done, and I totally blew all my early-morning time-to-myself watching it and indulging in a couple of bucketloads of nostalgia.

Even more amazing to me, Mantra apparently got a page on TVTropes too! OMG, my fanboy self totally squees! There is something incredibly amazing to me about seeing the Internet dissect my work and identify the tropes, just as they do with my favorite pieces of media.

It's been a nice, if unproductive, morning.

Wednesday, July 12, 2017

Why gene expression has a log-normal distribution

In a new paper just out, Biochemical Complexity Drives Log-Normal Variation in Genetic Expression, I explain a biological mystery: why do log-normal distributions keep showing up in gene expression data?

Anybody who's spent much time looking at gene expression data has probably noticed this: lots of distributions tend to have nice bell-curve shapes when plotted on a log scale. Consider, for example, a few samples of a gene being repressed by various levels of LmrA:

Some typical distributions taken from the Cello LmrA repressor transfer curve, all approximately log-normal

In short, these distributions are approximately log-normal, though they might also be described by one of a number of similar heavy-tailed distributions like the Gamma or Weibull distributions. Indeed, the typical explanation for gene expression variation has been that it's a Gamma distribution, based on the underlying randomness of chemical reactions causing stochastic bursts of gene expression.

What kept bugging me about that explanation, though, is that it just doesn't fit what we know about how gene expression actually works. If it's basically about randomness in chemical reactions, then as expression gets stronger, the law of large numbers should take over and the distributions should get tighter. Think about it like flipping coins: when you flip a few coins there's a lot of variation in how many come up heads and how many come up tails, but when you flip lots of coins it always comes out pretty even. But in most cases we deal with in synthetic biology, that just doesn't happen. Consider, for example, the distributions of LmrA above: the high and low levels of expression are just about as wide, even though one's nearly 100 times higher than the other.

Instead, the answer turns out to be a beautifully simple emergent phenomenon. Gene expression is a really, really complicated chemical process. Most of the time, we don't pay attention to most of that complexity because we're not attempting to affect it, just use it as a given. But that complexity means we can describe gene expression as a catalytic chemical reaction whose rate is the product of a lot of different factors. And the same Central Limit Theorem that tells us that sums of coin flips should make a nice bell-shaped normal distribution also says that when we multiply a lot of independent random factors together, their product should tend toward a log-normal distribution.
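You can watch this emergence happen in a few lines of simulation. The sketch below is illustrative only (the number of factors and their uniform distribution are arbitrary choices, not taken from the paper): multiplying many independent positive factors makes the log of the product a sum of many terms, so the log ends up approximately normal, while on the linear scale the heavy tail drags the mean well above the median.

```python
# Illustrative simulation: a rate that is a product of many independent
# positive factors tends toward a log-normal distribution (CLT on the logs).
# The factor count and uniform(0.5, 1.5) choice are arbitrary assumptions.
import math
import random
import statistics

random.seed(1)

def expression_rate(n_factors=30):
    rate = 1.0
    for _ in range(n_factors):
        rate *= random.uniform(0.5, 1.5)  # each factor perturbs multiplicatively
    return rate

samples = [expression_rate() for _ in range(10_000)]
logs = [math.log(s) for s in samples]

# For a roughly normal distribution, mean and median nearly coincide; on the
# linear scale the heavy right tail pulls the mean well above the median.
log_mean, log_median = statistics.mean(logs), statistics.median(logs)
lin_mean, lin_median = statistics.mean(samples), statistics.median(samples)
print(f"log scale:    mean={log_mean:.2f}, median={log_median:.2f}")
print(f"linear scale: mean={lin_mean:.2f}, median={lin_median:.2f}")
```

Swap in any other positive factor distribution and the same thing happens, which is exactly the point: the log-normal shape comes from the multiplicative structure, not from the details of any one reaction.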

This has a few different implications, but the most important ones are these:

  • When you are analyzing gene expression data, you should use geometric mean and geometric standard deviation, not ordinary mean and standard deviation. 
  • When you plot gene expression data, you should use logarithmic axes, not linear axes.
Any analysis of gene expression data that does otherwise, without good reason, will end up with distorted statistics and misleading graphs. In short: welcome to a brave new world of geometric statistics!
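For the practically minded, geometric statistics are just ordinary statistics computed on the log scale and mapped back. A minimal sketch, assuming positive expression measurements (the example values are made up for illustration):

```python
# Geometric mean and geometric standard deviation: ordinary mean/stdev on the
# log scale, mapped back with exp(). Assumes all measurements are positive.
import math
import statistics

def geometric_mean(xs):
    return math.exp(statistics.mean(math.log(x) for x in xs))

def geometric_stdev(xs):
    # Multiplicative spread: for log-normal data, roughly 68% of values lie
    # within [gm / gsd, gm * gsd].
    return math.exp(statistics.stdev(math.log(x) for x in xs))

# Illustrative data: each sample is 2x the previous, as in a dilution series.
expression = [100, 200, 400, 800]
gm = geometric_mean(expression)
gsd = geometric_stdev(expression)
print(f"geometric mean = {gm:.1f}, geometric stdev = {gsd:.2f}")
```

Note that the geometric standard deviation is a unitless multiplicative factor (always at least 1), which is why "plus or minus" language doesn't apply: the spread is "times or divided by."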