Deep inside big public databases, you can find quite curious things, especially when biology is involved.
For example, I spent several hours today hunting down a mysterious bug in the DNA screening project that I've been leading. We're working on improving the ability to detect when somebody orders DNA that they shouldn't be ordering (e.g., smallpox, Ebola), and so it's really important not to let anything get past. So while most classification projects might be fine with getting nearly everything right, our system has to catch every single problematic sequence every time.
That means I get to drill down and try to classify every miss our system makes, and I learn some strange and interesting things while doing it. For instance, these pseudo-fascinating trivia are amongst the things that I have recently learned:
- The same DNA sequence from the same publication is often uploaded twice and categorized differently each time.
- Fish in fish farms get sick with a virus related to rabies. It doesn't hurt humans, though.
- Somebody is running automated systems to infer the organisms that DNA sequences are associated with, and that produces a lot of "unknown member of [family/order]" entries.
- Somebody published a paper where they claimed to discover a bunch of new virus species by just sort of sequencing samples from healthy people and not actually checking in any way whether actual viruses were involved.
- When NCBI updates its taxonomy of which organisms are related to which, the sequence records don't change to reflect their new taxonomy.
With these discoveries and a few other tweaks, I was able to categorize and plan mitigations covering all of the classes of failures that our system was encountering. Almost.
There was one miss that I just could not explain, a short little snippet from a virus coat. There were no related "safe" viruses that would cause us to overlook its sequence, nothing in the protein sequences, and nothing that could even be mis-translated from other DNA sequences. And I thought, "That's funny..."
I dug down and dug down and eventually found something both embarrassing and wonderful. You see, in DNA sequences, there are often parts that are unknown, and so instead of the standard "A", "C", "T", and "G" DNA bases, these bits of missing information get marked as "N" for an unknown "any" base. These get used in ordering DNA too, to indicate places where you don't care what the sequence is. We've long been excluding these from matches, since it makes no sense to say, "Aha! Somebody once didn't know part of a virus, and you don't care what you get!" So our detector throws out potential matches that include an unknown.
Only thing is, when you're working with proteins, the missing information letter isn't "N". There are a lot more amino acids than nucleotide bases, and so they use up more of the alphabet, including "N", which stands for the amino acid asparagine. With proteins, the missing information letter is "X" instead.
Most of our system knew that. Most of our system was doing the right thing. But one little part of one little script wasn't getting switched into protein mode at the right time.
We've been systematically excluding every protein pathogen signature with asparagine in it.
Our system: "Damn you, asparagine! Get out of my house!"
That's embarrassing. Easy to fix, but still embarrassing.
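For the curious, here is a minimal Python sketch of this kind of bug. All of the names in it (`has_wildcard`, `buggy_has_wildcard`, `keep_signatures`) are hypothetical and invented for illustration, not taken from the project's actual scripts, and the real filtering logic is surely more involved. It only shows how a check that is stuck in DNA mode ends up treating asparagine as a wildcard.

```python
# Illustrative sketch only: hypothetical names, not the project's real code.

DNA_WILDCARD = "N"      # "unknown / any base" in nucleotide sequences
PROTEIN_WILDCARD = "X"  # "unknown / any residue" in protein sequences


def has_wildcard(signature: str, is_protein: bool) -> bool:
    """Return True if the signature contains the wildcard for its alphabet."""
    wildcard = PROTEIN_WILDCARD if is_protein else DNA_WILDCARD
    return wildcard in signature.upper()


def buggy_has_wildcard(signature: str) -> bool:
    """The kind of mistake described above: the check never leaves DNA mode,
    so any protein signature containing asparagine ("N") looks like it
    contains an unknown."""
    return DNA_WILDCARD in signature.upper()


def keep_signatures(signatures, is_protein):
    """Drop signatures that contain an unknown, using the right alphabet."""
    return [s for s in signatures if not has_wildcard(s, is_protein)]


if __name__ == "__main__":
    proteins = ["MKNLV", "MKXLV", "MALVE"]
    # Correct mode-aware filter keeps the asparagine-containing signature.
    print(keep_signatures(proteins, is_protein=True))
    # The DNA-mode check wrongly flags it as containing an unknown.
    print([s for s in proteins if not buggy_has_wildcard(s)])
```

Passing the sequence type explicitly, rather than relying on a mode that some other part of the pipeline is supposed to have set, is one way to make this class of mistake harder to repeat.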
And yet...
Asparagine is a pretty common amino acid, so we've been accidentally throwing away around one third of the detection power of our system. And out of tens of thousands of tests, there was precisely one where this blatant and egregious error caused us to miss a detection.
The wonderful thing is that the system is still working almost perfectly, even while we've been unknowingly throwing away a vast amount of its ability to detect pathogens. That speaks to its resilience, and to how many alternative routes it explores to achieve its goal. I can live with that, with a nice natural experiment accidentally conducted by a misbehaving script. We'll fix it, and move forward.
But such remarkable things you may find when you follow just one little thread of something funny in your data...