Wednesday, July 12, 2017

Why gene expression has a log-normal distribution

In a new paper just out, Biochemical Complexity Drives Log-Normal Variation in Genetic Expression, I explain a biological mystery: why do log-normal distributions keep showing up in gene expression data?

Anybody who's spent much time looking at gene expression data has probably noticed this: lots of distributions tend to have nice bell-curve shapes when plotted on a log scale. Consider, for example, a few samples of a gene being repressed by various levels of LmrA:

Some typical distributions taken from the Cello LmrA repressor transfer curve, all approximately log-normal

In short, these distributions are approximately log-normal, though they might also be described by one of a number of similar heavy-tailed distributions like the Gamma or Weibull distributions. Indeed, the typical explanation for gene expression variation has been that it's a Gamma distribution, based on the underlying randomness of chemical reactions causing stochastic bursts of gene expression.

What kept bugging me about that explanation, though, is that it just doesn't fit what we know about how gene expression actually works. If it's basically about randomness in chemical reactions, then as expression gets stronger, the law of large numbers should take over and the distributions should get tighter. Think about it like flipping coins: when you flip a few coins there's a lot of variation in the fraction that come up heads, but when you flip lots of coins the fraction almost always comes out very close to half. But in most cases we deal with in synthetic biology, that just doesn't happen. Consider, for example, the distributions of LmrA above: the high and low levels of expression are just about as wide on the log scale, even though one's nearly 100 times higher than the other.
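
To make the coin-flip intuition concrete, here is a minimal simulation sketch. The function name `relative_spread`, the trial count, and the seed are all my own illustrative choices, not anything from the paper. It measures the spread of the number of heads relative to its mean, which should shrink roughly as 1/sqrt(n) as the number of flips n grows.

```python
import random
import statistics

def relative_spread(n_flips, n_trials=1000, seed=0):
    """Coefficient of variation (std / mean) of the heads count in n_flips fair flips.

    Hypothetical helper for illustration; n_trials and seed are arbitrary choices.
    """
    rng = random.Random(seed)
    heads = [
        sum(rng.random() < 0.5 for _ in range(n_flips))
        for _ in range(n_trials)
    ]
    return statistics.stdev(heads) / statistics.mean(heads)

few = relative_spread(10)      # a few coins: lots of relative variation
many = relative_spread(1000)   # lots of coins: much tighter

print(f"relative spread with 10 flips:   {few:.3f}")
print(f"relative spread with 1000 flips: {many:.3f}")
```

If gene expression noise were only this kind of counting noise, a 100-fold increase in expression should tighten the distribution about 10-fold in relative terms, which is not what the LmrA samples show.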

Instead, the answer turns out to be a beautifully simple emergent phenomenon. Gene expression is a really, really complicated chemical process. Most of the time, we don't pay attention to most of that complexity because we're not attempting to affect it, just taking it as a given. But that complexity means we can describe gene expression as a catalytic chemical reaction whose rate is the product of a lot of different factors. And the same Central Limit Theorem that tells us that sums of coin flips should make a nice bell-shaped normal distribution also tells us what happens when we multiply a lot of independent random factors: the logarithm of a product is a sum of logarithms, so the log of the product tends toward a normal distribution, and the product itself tends toward a log-normal distribution.
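
Here is a rough sketch of that multiplicative effect. Everything in it is an assumption made for illustration: the number of factors, the number of simulated cells, and especially the choice of uniform factors between 0.5 and 1.5, none of which come from the paper. It multiplies many independent positive factors per "cell" and compares the skewness of the raw products with the skewness of their logarithms. A symmetric bell curve has skewness near zero.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: about 0 for a symmetric bell curve, positive for a long right tail."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(1)
N_FACTORS = 30   # hypothetical number of independent rate factors
N_CELLS = 5000   # hypothetical number of simulated cells

# Each simulated cell's expression rate is a product of many independent
# positive factors (uniform on [0.5, 1.5] is an arbitrary illustrative choice).
rates = [
    math.prod(rng.uniform(0.5, 1.5) for _ in range(N_FACTORS))
    for _ in range(N_CELLS)
]
log_rates = [math.log10(r) for r in rates]

skew_raw = skewness(rates)       # strongly right-skewed on a linear scale
skew_log = skewness(log_rates)   # close to symmetric on a log scale

print(f"skewness of rates:        {skew_raw:.2f}")
print(f"skewness of log10(rates): {skew_log:.2f}")
```

The individual factors here are not log-normal at all, but their product still ends up looking bell-shaped on a log scale, which is the point of the argument.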

This has a few different implications, but the most important ones are these:

  • When you are analyzing gene expression data, you should use geometric mean and geometric standard deviation, not ordinary mean and standard deviation. 
  • When you plot gene expression data, you should use logarithmic axes, not linear axes.

Any discussion of gene expression data that does otherwise, without good reason, will end up with distorted statistics and misleading graphs. In short: welcome to a brave new world of geometric statistics!
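
For anyone who hasn't computed geometric statistics before, here is a minimal sketch. The helper name `geometric_stats` and the three sample values are made up for illustration and are not data from the paper. The geometric mean is the exponential of the mean of the logs, and the geometric standard deviation is the exponential of the standard deviation of the logs, so it reads as a multiplicative "fold" spread rather than an additive one.

```python
import math
import statistics

def geometric_stats(values):
    """Return (geometric mean, geometric standard deviation) of positive values.

    Hypothetical helper for illustration; math.log raises ValueError on
    zero or negative inputs, so values must be strictly positive.
    """
    logs = [math.log(v) for v in values]
    geo_mean = math.exp(statistics.mean(logs))
    geo_std = math.exp(statistics.stdev(logs))  # multiplicative ("fold") spread
    return geo_mean, geo_std

# Made-up values spanning two orders of magnitude, for illustration only.
sample = [10.0, 100.0, 1000.0]

geo_mean, geo_std = geometric_stats(sample)
arith_mean = statistics.mean(sample)

print(f"geometric mean:  {geo_mean:.1f}")
print(f"geometric std:   {geo_std:.1f}-fold")
print(f"arithmetic mean: {arith_mean:.1f}")
```

For these three values the geometric mean is 100 with a 10-fold geometric standard deviation, while the ordinary mean is 370, dragged upward by the single largest value.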