I woke up this morning with bright plans to be productive and focused and accomplish various things. Instead, I have spent the morning on a delightful trip down memory lane.
Way back in high school, more than 20(!) years ago, my friends and I made a video game called Mantra. It was a short, fun freeware adventure with a Zelda-like feel and a bunch of obscure jokes (my favorite was a villager who said: "Godot is coming, please wait"---we got so much tech support mail asking us how long you needed to wait before Godot showed up). It was a lot of fun, actually got kinda popular, and probably helped to get me into my college of choice, and then I would forget all about it for years at a time.
This morning, I was reminded of it again when I found a link, shared by my friend Ben, to a wonderful play-through on YouTube with commentary. It's a whole six-episode series, very well done, and I totally blew all my early-morning time-to-myself watching it and indulging in a couple of bucketloads of nostalgia.
Even more amazing to me, Mantra apparently got a page on TVTropes too! OMG, my fanboy self totally squees! There is something incredibly amazing to me about seeing the Internet dissect my work and identify the tropes, just as they do with my favorite pieces of media.
It's been a nice, if unproductive, morning.
Sunday, July 30, 2017
Wednesday, July 12, 2017
Why gene expression has a log-normal distribution
In a new paper just out, Biochemical Complexity Drives Log-Normal Variation in Genetic Expression, I explain a biological mystery: why do log-normal distributions keep showing up in gene expression data?
Anybody who's spent much time looking at gene expression data has probably noticed this: lots of distributions tend to have nice bell-curve shapes when plotted on a log scale. Consider, for example, a few samples of a gene being repressed by various levels of LmrA:
Some typical distributions taken from the Cello LmrA repressor transfer curve, all approximately log-normal
In short, these distributions are approximately log-normal, though they might also be described by one of a number of similar heavy-tailed distributions like the Gamma or Weibull distributions. Indeed, the typical explanation for gene expression variation has been that it's a Gamma distribution, based on the underlying randomness of chemical reactions causing stochastic bursts of gene expression.
What kept bugging me about that explanation, though, is that it just doesn't fit what we know about how gene expression actually works. If it's basically about randomness in chemical reactions, then as expression gets stronger, the law of large numbers should take over and the distributions should get tighter relative to their mean. Think about it like flipping coins: when you flip a few coins there's a lot of variation in the fraction that come up heads, but when you flip lots of coins the fraction almost always comes out close to even. But in most cases we deal with in synthetic biology, that just doesn't happen. Consider, for example, the distributions for LmrA above: on a log scale, the high and low levels of expression are just about as wide, even though one's nearly 100 times higher than the other.
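The coin-flip intuition is easy to check numerically. Here's a minimal sketch using only Python's standard library; the function name `relative_spread` and the trial counts are my own choices for illustration, not anything from the paper. It measures the spread of the number of heads relative to its mean (the coefficient of variation) for a few coins versus a lot of coins.

```python
import random
import statistics


def relative_spread(n_coins, n_trials=1000, seed=0):
    """Coefficient of variation (std / mean) of the number of heads
    seen when flipping n_coins fair coins, estimated over n_trials."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    heads = [
        sum(rng.random() < 0.5 for _ in range(n_coins))
        for _ in range(n_trials)
    ]
    return statistics.pstdev(heads) / statistics.mean(heads)


few = relative_spread(10)     # a "weakly expressed" analogue
many = relative_spread(1000)  # a "strongly expressed" analogue
print(f"10 coins:   relative spread = {few:.3f}")
print(f"1000 coins: relative spread = {many:.3f}")
```

For fair coins the relative spread should shrink like one over the square root of the number of coins, so the 1000-coin case comes out roughly ten times tighter than the 10-coin case. That's the tightening we'd expect if simple reaction noise were the whole story.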
Instead, the answer turns out to be a beautifully simple emergent phenomenon. Gene expression is a really, really complicated chemical process. Most of the time, we don't pay attention to most of that complexity because we're not trying to affect it; we just take it as a given. But that complexity means we can describe gene expression as a catalytic chemical reaction whose rate is the product of a lot of different factors. And the same Central Limit Theorem that tells us that coin flips should make a nice bell-shaped normal distribution also says that when we multiply a lot of independent, positive random factors, the product should tend to a log-normal distribution: the logarithm of a product is the sum of the logarithms, so the theorem applies to the log of the rate.
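You can watch the multiplicative version of the Central Limit Theorem happen in a toy simulation. This is only a sketch under made-up assumptions: each simulated "cell" gets a rate that is the product of 30 independent factors drawn uniformly from 0.5 to 1.5. Those numbers, and the names `simulate_rates` and `skewness`, are purely illustrative and are not taken from the paper's model.

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness: about 0 for a symmetric bell curve,
    large and positive for a long right tail."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def simulate_rates(n_factors=30, n_cells=5000, seed=1):
    """Each rate is a product of n_factors independent positive factors.
    The uniform(0.5, 1.5) factor distribution is an arbitrary assumption."""
    rng = random.Random(seed)
    return [
        math.prod(rng.uniform(0.5, 1.5) for _ in range(n_factors))
        for _ in range(n_cells)
    ]


rates = simulate_rates()
raw_skew = skewness(rates)
log_skew = skewness([math.log(r) for r in rates])
print(f"skewness on a linear scale: {raw_skew:.2f}")
print(f"skewness on a log scale:    {log_skew:.2f}")
```

On a linear scale the simulated rates have a long right tail, while their logarithms come out close to symmetric, which is just what a log-normal distribution looks like. No single factor here is log-normal; the shape emerges from multiplying many of them.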
This has a few different implications, but the most important ones are these:
- When you are analyzing gene expression data, you should use geometric mean and geometric standard deviation, not ordinary mean and standard deviation.
- When you plot gene expression data, you should use logarithmic axes, not linear axes.
Any discussion of gene expression data that does otherwise, without good reason, will end up with distorted data and misleading graphs. In short: welcome to a brave new world of geometric statistics!
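For anyone who hasn't computed geometric statistics before, here's a small sketch with the standard library. The two data sets are invented for illustration (they are not measurements), chosen so that one is 100 times higher than the other, and `geometric_stats` is a hypothetical helper name.

```python
import math
import statistics


def geometric_stats(values):
    """Return (geometric mean, geometric standard deviation).
    Both are ordinary mean and std computed on the logs, then
    exponentiated back. All values must be positive."""
    logs = [math.log(v) for v in values]
    geo_mean = math.exp(statistics.mean(logs))
    geo_std = math.exp(statistics.stdev(logs))  # a fold-change, unitless
    return geo_mean, geo_std


# Invented example data: same shape on a log scale, 100x apart.
low = [1.0, 2.0, 4.0]
high = [100.0, 200.0, 400.0]

gm_low, gsd_low = geometric_stats(low)
gm_high, gsd_high = geometric_stats(high)
print(f"low:  geometric mean {gm_low:.1f}, geometric std {gsd_low:.2f}-fold")
print(f"high: geometric mean {gm_high:.1f}, geometric std {gsd_high:.2f}-fold")
print(f"ordinary std: low {statistics.stdev(low):.2f}, high {statistics.stdev(high):.2f}")
```

Both groups come out with a geometric standard deviation of about 2-fold, even though their ordinary standard deviations differ by a factor of 100. The geometric version reports "equally wide on a log scale" directly, which is the comparison that matters for log-normal data.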