Sunday, April 08, 2018

Linking biological designs and experimental data

One of the biggest points of friction in my professional life is the disconnect between the design of an experiment and the data that comes out of it. Not in any deep or scientific sense, but in a boringly practical sense of "How do I know what's in file MyRun_F05_039_pXK405.fcs?"

When I'm working with experimentalists and analyzing the data that they've produced, in order to make this connection, I get sent spreadsheets with colored cells and personal shorthands, or unintentionally cryptic emails, or scans of tables with hand-written notes. Then I make my best guess as to what's being encoded there and start organizing file names into scripts to run my analysis. The actual process of analysis is often very fast, only a few minutes, but for a good-sized experiment it can take hours of setup before the analysis is ready to run.
Example of fairly typical current integration of biological data with experimental design.
Even then, our pain isn't over, because there's a major challenge in comparing across data sets, especially when working with multiple people on a project or across a project spanning many months or even years. Is the control the same as it was two months ago? What does "same" even mean, exactly? I had a data set go completely wonky once because the experimentalist working with me had run out of one plasmid and substituted another that they thought should be equivalent but had an extra "unimportant" gene on it. The descriptions that I got used the same descriptor for the new plasmid as for the old one, because of course they were only describing the "important" parts of the construct. We lost at least a month of time on the project.

All of this can be simplified if we get automated software tooling involved, so that with minimal human involvement we can link data to laboratory samples, samples to the descriptions of what they are supposed to contain, and designs for DNA to the biological functions and interactions that they are intended to produce. For that to work, we need to agree on how we are going to describe those relationships, and I believe that this agreement is the most critical thing that our newest release of the Synthetic Biology Open Language (SBOL), version 2.2, gives us, along with some tools for describing combinatorial designs. Version 2.2 has just been officially published as a free journal article, and we're well into putting these new linkages to use in several programs, as well as organizing a workshop to teach people how to link these and other tools together.
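
To make the idea of these links a bit more concrete, here is a minimal sketch in plain Python. It does not use SBOL or any SBOL library; the class names, the example.org identifiers, the part names, and the file names are all made up for illustration. The point is only that when every data file points to a sample, and every sample points to a uniquely identified design, a question like "is the control the same as it was two months ago?" can be answered by software rather than by guessing from a shared label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    uri: str      # unique identifier for the design (hypothetical)
    label: str    # the human shorthand, which may be reused by accident
    parts: tuple  # everything on the construct, not just the "important" parts


@dataclass(frozen=True)
class Sample:
    uri: str        # unique identifier for the physical sample (hypothetical)
    design: Design  # what the sample is supposed to contain


@dataclass(frozen=True)
class DataFile:
    path: str       # the file that came off the instrument
    sample: Sample  # the sample that was measured


def describe(data_file):
    """Answer "what is in this file?" by following the links."""
    design = data_file.sample.design
    return f"{data_file.path}: {design.label} <{design.uri}> parts={', '.join(design.parts)}"


def same_design(a, b):
    """Two files measure the same design only if the design identifiers match."""
    return a.sample.design.uri == b.sample.design.uri


# Two "control" constructs that share a label but differ by one extra gene.
old_control = Design("https://example.org/design/control_v1", "control",
                     ("promoter", "GFP"))
new_control = Design("https://example.org/design/control_v2", "control",
                     ("promoter", "GFP", "extra_gene"))

march = DataFile("march_A01.fcs", Sample("https://example.org/sample/1", old_control))
may = DataFile("may_A01.fcs", Sample("https://example.org/sample/2", new_control))

print(describe(march))
print(describe(may))
print(same_design(march, may))  # False: same label, different design
```

In this toy version, the substituted plasmid from my story above would show up immediately as a different design, even though both were called "control" in conversation.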

Step by step, we are getting closer to removing this persistent source of friction and error in our biological studies.