Tuesday, June 30, 2015

What do we mean when we say a circuit "works"?

Today, I have a new paper out in Frontiers about biological circuits, addressing a fundamental question about the field.  This question is rather simple at its root, but surprisingly has not previously been answered (to the best of my knowledge and ability to literature search).  It is this:
What does it mean for a biological circuit to "work"?
In research on information-processing and control in synthetic biology, we often take this notion for granted, saying things like "we built a toggle switch, and it works great!" or "it took a long time to get this detector working, but now we've got it," or "these repressors work really well."  But how good is good enough?  And surely "good enough" will differ from application to application, won't it?

An influential way of thinking about this problem can be found in papers like Sussman and Knight's paper, "Cellular Gate Technology" or the Weiss, Homsy, and Knight paper "Toward in vivo Digital Circuits". These consider how biological systems might be used to implement digital logic, considering standard ideas from the electronic world like the availability of strong non-linear amplification and identification of "high" and "low" signal regions set to reject noise.  
From our MatchMaker paper: digital logic "transfer curves" identifying "high" and "low" signal regions, based on the location of regions of strong non-linear amplification.  Do they reject noise?  Who knows?
What is easy to forget, however, even for folks like myself who have been trained in electrical engineering and computing, is where these concepts come from.  Ultimately, all computation, digital or analog, "program" or "controller," is about processing information.  And every time we use a device to process a piece of information, the signals that come out of the device may be easier or harder to interpret than the signals that went in.

All of this stuff about strong non-linear amplification and high and low signal regions set to reject noise is a particular recipe that, in the world of digital electronics, is quite effective at producing devices that have outputs easier to interpret than inputs, and this is what lets us build very complicated digital computing systems, like you are using right now to read these words.

So how do we actually know whether we're getting signals out that are easier to understand than the signals that came in?  Information theory developed tools for this the better part of a century ago: one can directly determine how intelligible a signal is by computing its signal-to-noise ratio, which quantifies how clear your signal is in units of decibels---the same numbers that describe how loudly your music is playing.  That's a good way to think about it: high decibels = coming through loud and clear; low decibels = really hard to understand.
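For the concrete-minded, the standard decibel formula is simple. Here's a minimal sketch (the function name and the example power values are mine, purely for illustration):

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(signal_power / noise_power)

# A signal carrying 1000x the power of its noise comes through at 30 dB,
# while a signal only 2x its noise manages a murky ~3 dB.
print(snr_db(1000, 1))  # 30.0
```

The logarithm is what makes decibels convenient: every extra 10 dB means another factor of ten in how strongly the signal stands out from the noise.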

So if we want to measure how well a biological circuit is working, we can simply compute how many decibels its output signal is.  Some applications need only low signal-to-noise ratio: control of a simple chemical fermentation process might need only a couple of decibels if it just needs to nudge collective behavior a little bit.  Other applications need really, really high signal-to-noise ratio: treating cancer with modified immune cells, for example, probably wants at least 30 decibels (or more), because even a few cells making a bad decision can cause serious harm to the patient.

Using the same principles, we can ask how good a biological device is at processing information by comparing the signal-to-noise ratio of its input to the signal-to-noise ratio of its output.  This depends in part on how it's used and what it's connected up to, but the relationship is relatively straightforward, well-understood, and easy to analyze without the need for lots of additional laboratory experimentation.  Interestingly, this even lets you categorize any biological computing technology into one of three qualitative categories:
  1. "Difficult circuit" technologies, where it's really hard to get anything to work.
  2. "Shallow circuit" technologies, where devices generally degrade information as they process it, so it's easy to get simple circuits to work, but hard to get complex circuits to work.
  3. "Deep circuit" technologies, where clarity of information is not the limiting factor and there is the potential to build very complicated systems.
So far, so good: we've got a well-grounded measuring stick for biological computation and control that can actually be applied to any information processing system.  Electronic computing is so powerful because most devices fall into the "deep circuit" category.  That's also where "strong non-linear amplification" and "high and low regions of noise rejection" make sense to talk about.
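The three categories can be read straight off a comparison of input and output signal-to-noise ratios. Here's a hypothetical sketch of that categorization (the function name and the minimum-usable threshold are assumptions of mine, not anything from the paper):

```python
def circuit_depth_class(snr_in_db, snr_out_db, min_usable_db=3.0):
    """Hypothetical categorization of a device technology by comparing
    its output SNR to its input SNR (all values in decibels)."""
    if snr_out_db < min_usable_db:
        return "difficult"   # hard to get anything at all to work
    if snr_out_db < snr_in_db:
        return "shallow"     # information degrades, so circuit depth is limited
    return "deep"            # clarity is not the limiting factor

# A device that takes a 20 dB input and emits a 10 dB output is "shallow":
print(circuit_depth_class(20, 10))  # shallow
```

The key comparison is the middle one: a technology whose devices consistently emit lower SNR than they receive can only be composed so many layers deep before the signal disappears into the noise.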

How about biological computing?

That's the bad news.
  • For much of the published work on biological computation, we simply can't compute signal to noise ratio or its change across a computing device.  
    • Often, papers report only the variation in the mean output values, but not the cell-to-cell variation around those means (so we know the signal, but not the noise).
    • Papers about computing devices often do not quantify device inputs and instead report the induction used to stimulate the input, which means we cannot compute whether the device improves or degrades a signal.
    • Device inputs and outputs are often not measured at intermediate values, or are not measured in comparable and reproducible units, which means we cannot predict the signal-to-noise behavior when two devices are connected together.
  • The best performing and best quantified biological device technologies currently out there only provide "shallow circuit" performance, which means that at present it is simply impossible to build biological computation or control systems with more than a certain limited complexity.
I don't think this represents a real barrier, however.  A good measuring stick does more than tell you where you fall short: it also tells you what needs to improve and by how much in order to get what you want.  Signal-to-noise ratio analysis certainly does that for biological devices: in many cases, the information currently missing from publications can be readily acquired, and will hopefully shed more light on the true current information-processing capabilities of synthetic biology.  Likewise, signal-to-noise analysis shows that the various current technologies differ from one another in the issues that limit their signal-to-noise ratio. This analysis can hopefully be a useful guiding light directing improvement of those technologies---and some of them are pretty close to hitting the "deep circuits" level and making it much easier to engineer complex biological computation and control.

My vision is of a world where biological information processing becomes not a challenge, but a reliable tool supporting all sorts of useful applications.  Just like in the electronic world, I expect that reliable computation will play a foundational enabling role, letting people stretch for goals not currently considered possible, for better medicine and a more sustainable environment, for cleaner energy and a safer world, for art and beauty, and for all the things we haven't even thought of yet.

Saturday, June 20, 2015

Are publication delays aimed at manipulating impact factor?

Today, while dealing with citation queries from a journal's editing staff on a pre-publication proof, I was confronted once again with the recurrent annoyance of delayed "formal" publication.  Back in November, we published a nice paper on high-precision prediction of genetic circuits.  Well, I say we published it in November, but technically it was only just published today, seven months later.

This is due to the curious phenomenon where many journals will publish "online early" shortly after a paper is accepted (an excellent idea!), yet still wait, sometimes for many months, to bundle papers together into an "issue," as if the journal were still all about printing on dead trees and shipping to libraries, rather than having most people simply access it directly online.  This phenomenon has always struck me as odd, and it's a pain in the butt, because it means citations have to change over time and different citations to the same document end up with different years in them.

Confronted again with this today, I had an insight: I wonder whether this phenomenon is not a mistake, but in fact intentional on the part of some journals, as a way to manipulate their Impact Factors. The "Impact Factor" of a journal is a horrible, broken statistic that is used to make or break people's careers, particularly in the biomedical fields.  It is calculated as the average number of citations that papers in a journal receive during the two years following their publication.  For example, if Journal X published three papers in 2015, and two of them are never cited, but one gets cited 5 times in 2016 and 7 times in 2017, then Journal X would get a nice high Impact Factor of 4.0 (i.e., (5+7)/3).  Yes, it's kind of a dumb statistic, but it's heavily used and thus frequently gamed.
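The Journal X arithmetic above is just a ratio; as a one-liner sketch (the function name and numbers are simply restating my example):

```python
def impact_factor(citations_in_window, papers_published):
    """Simplified two-year impact factor: total citations received in the
    two-year window divided by the number of papers published."""
    return citations_in_window / papers_published

# Journal X: three papers in 2015; one collects 5 citations in 2016
# and 7 in 2017; the other two are never cited.
print(impact_factor(5 + 7, 3))  # 4.0
```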

Here's the thing, though: because it was "online early," my paper collected several citations before it was ever officially published.  So when it gets included in the computation of the journal's Impact Factor, it's effectively going to get 24+7 = 31 months of citations, rather than the usual 24 months.  That increases its expected number of citations, and thus the all-important Impact Factor of the journal.  This is further compounded by the fact that getting noticed takes time and publishing a citing paper takes time, so the more time that passes after a significant paper appears, the higher its citation rate is likely to be.
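The window arithmetic, as a quick sketch (the helper name and the 24-month default just restate the numbers above):

```python
def effective_window_months(online_early_months, official_window_months=24):
    """Months during which a paper can accrue citations that count toward
    the impact-factor window, given an 'online early' head start."""
    return official_window_months + online_early_months

# Seven months of "online early" stretches the window from 24 to 31 months,
# roughly 29% more time to collect citations.
print(effective_window_months(7))  # 31
```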

So from a journal's perspective, it seems like it would make sense to drag out the time between online publication (when a paper starts being noticed and collecting citations) and official publication for as long as possible.  It's also possible that some enterprising editors or publication houses have noticed this and may thus set their publication delays intentionally to manipulate this impact factor.  Even if the reasons are benign, however (e.g., smoothing out the publication pipeline), the distortion in statistics is still there.

Maybe the citation indices that compute the magic Impact Factor numbers have noticed this and accounted for it... and maybe they haven't.  I would not be surprised in either case, but I'd be very interested to know the answer.  The real answer, though, is not to be more precise about Impact Factor computations, but to discard the damned thing and obtain a more sane and reasonable metric for discussing the significance of papers and journals.