Jake Beal's Next Step: An Accidental Investigation of Publication Metrics

Dear reader, welcome once again to one of my more philosophical posts. I've been working on reorganizing my webpage---something long in need of doing. It used to make sense, when I was a grad student or a young postdoc, to have a simple list of all my publications. Over the past few years, though, as both the number and variety of my publications have grown, I think this has become less sensible. Now the list is rather long, and all silted up with the detritus of scientific publication---dead ends, early work, incremental reports, and important-but-boring filling in of the gaps.

One of the things that makes my webpage such a mess is that my current list does not discriminate between types of publication: journals, book chapters, conferences, workshops, tech reports, and unpublished white-papers are all jumbled together in chronological order (possibly the worst reasonable ordering tiebreaker).

I used to solve the density problem by segregating the publications by subject area. Subdividing further would be unsatisfactory to me these days, however, since there are so many connections between different pieces of work---do I put the first "functional blueprints" paper into morphogenetic engineering or spatial computing, since it was much more focused on spatial/cellular approaches than what came after? How about my energy work, which started out as an application of Proto, but has evolved to shed both Proto and spatial computing in general? There are far too many such boundary cases, and I don't want a reader to miss a publication because they're looking in the wrong section.

I suppose I could resolve the density problem by segregating them by type: put the Respectable Journals up front, followed by the High-Impact Factor Conferences, and so on. Problem is, I've got tech reports and workshop papers that I think are more important than some of my journal papers.

Which leads to a general comment on scientific publication, I think. So far, in my career at least, I find only a minimal correlation between the importance of a publication and the "significance" of its venue. An idea first put forth in a workshop (the amorphous medium abstraction) has become the most central element of my whole line of research, and I still cite that workshop paper. Maybe someday it will be replaced with a Reputable Journal paper updating and expanding the results, but that hasn't happened yet, and isn't likely to happen soon, what with my jam-packed publication queue and parenthood.

So, let's see how my intuitions hold up against data (ah, the scientific lifestyle), and try plotting "venue" vs. "importance". First, I've gone through all the publications on my website and pulled out those that I think are "important," further coding some of them as "foundational"---meaning they are something whose importance I think is broad and durable, generally meaning it's at the root of a significant ongoing research program. Now let's group them into publication classes using my CV, which lists 91 non-thesis publications (Google Scholar finds more, but we'll ignore that whole can of worms for the moment). My CV breaks publications up into six classes, which we'll order by typical ferocity of peer review (a proxy for venue quality), in decreasing order: Journal, Conference, Book chapter, Workshop, Abstract, Informal (tech report, white-paper, etc.). Plotting the numbers of each type as a stacked graph, we have:

Huh... my publication profile actually looks a lot more conventional than I expected.
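For anyone who wants to try the same coding on their own publication list, here is a minimal sketch of the tally behind a stacked graph like this one. The six venue classes and the "important"/"foundational" codes come from the description above; everything else (the function names, the "other" label for uncoded publications, and the sample entries) is a hypothetical placeholder, not my actual data.

```python
from collections import Counter

# Venue classes in decreasing order of typical peer-review ferocity (from the post).
CLASSES = ["Journal", "Conference", "Book chapter", "Workshop", "Abstract", "Informal"]
# Importance codes, least to most significant; "other" is an assumed label
# for publications that were not coded as important.
LEVELS = ["other", "important", "foundational"]


def tally(publications):
    """Count publications per venue class and importance level."""
    counts = Counter((p["venue"], p["importance"]) for p in publications)
    return {c: {lvl: counts[(c, lvl)] for lvl in LEVELS} for c in CLASSES}


def text_stacked_bars(table):
    """Render one stacked bar per venue class as a line of text."""
    symbols = {"other": ".", "important": "+", "foundational": "#"}
    lines = []
    for c in CLASSES:
        bar = "".join(symbols[lvl] * table[c][lvl] for lvl in LEVELS)
        lines.append(f"{c:<13}{bar}")
    return lines


# Hypothetical placeholder entries, NOT the real coding of any publication list.
SAMPLE = [
    {"title": "paper A", "venue": "Journal", "importance": "important"},
    {"title": "paper B", "venue": "Journal", "importance": "other"},
    {"title": "paper C", "venue": "Conference", "importance": "other"},
    {"title": "paper D", "venue": "Workshop", "importance": "foundational"},
    {"title": "paper E", "venue": "Workshop", "importance": "other"},
    {"title": "paper F", "venue": "Abstract", "importance": "other"},
]

if __name__ == "__main__":
    for line in text_stacked_bars(tally(SAMPLE)):
        print(line)
```

The text bars are only a stand-in for a real plot; the same table could be handed to whatever plotting tool you prefer.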

It's completely unsurprising that the abstracts are barren of value, since they're typically just too short for anything significant---no more than two pages. The big surprise, looking at this, is how barren the conferences are. My guess is that a lot of those "unimportant" conference articles are steps on the way to a more complete result---and looking more deeply into them, it seems like about half of them are exactly that. That workshop articles are largely barren is less of a surprise, since so many of them are position papers, dead ends, or roads not taken---and a deeper inspection confirms that completely. Workshops are apparently where I toss ideas against the wall, and some of them stick (with massive importance), while most of them just fade away.

Digging into those journal articles further, I find that six of the eight journal articles started life as a "lesser" publication, and then were extended and upgraded into a full journal publication---which then supersedes the prior publication in importance, hogging all of the spotlight. That's appropriate, I suppose.

Does this mean that I should expect the foundational workshop and informal publications to migrate into journals as well over time? Perhaps they will---and in fact, I know that one of them is trying to already.

So what we have here in many ways is a "revisionist" picture of science, where the material that turns out to be important ends up migrating over time upwards in venue quality. If that's the case, then "journal papers are more important" is only true for people who aren't the author: it's a selection process that retroactively highlights the important work, rather than a leading indicator. Perhaps we should instead think of publications as some sort of an exploratory tree process. Here's a notional diagram of what that might look like:

Color to match bar graph above. Arrows indicate dependency, pointing from a dependent work to its source. Size: large=journal, medium=conference/chapter, small=workshop/abstract/informal. Concentric publications indicate "venue promotions" that supersede a prior citation.

Let's say a research program started at the large bottom node with a workshop publication. As it goes up and out, it grows and branches. Importance tends to relate to how much research is running back through a publication. Also, as publications become more important, they sometimes upgrade into more "quality" venues---which renders the prior version (shown as concentric) unimportant. Sometimes a big step can be taken directly; sometimes it needs to go through bridging stages on the way. And of course there are lots of things that end up staying unimportant, because the initial idea was wrong, hit a dead end, or just plain got triaged by the 24-hours-per-day limit.
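One way to make "how much research is running back through a publication" concrete is to count, for each publication, how many later works depend on it directly or indirectly. What follows is only a rough sketch under that assumption: the scoring rule is my simplification of the notional diagram, and the graph, node names, and function names are all hypothetical.

```python
def transitive_dependents(depends_on):
    """Map each publication to the set of works that build on it, directly or not.

    depends_on maps a publication to the list of publications it builds on,
    matching the diagram's arrows (dependent work -> source). Assumes no cycles.
    """
    # Invert the arrows: source -> works that directly depend on it.
    dependents = {}
    for work, sources in depends_on.items():
        dependents.setdefault(work, set())
        for source in sources:
            dependents.setdefault(source, set()).add(work)

    def reach(node):
        # Depth-first walk over everything downstream of this node.
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for d in dependents[current]:
                if d not in seen:
                    seen.add(d)
                    stack.append(d)
        return seen

    return {node: reach(node) for node in dependents}


def importance_scores(depends_on):
    """Score each publication by the number of works running back through it."""
    return {n: len(s) for n, s in transitive_dependents(depends_on).items()}


# Hypothetical research program: names are placeholders, not real papers.
GRAPH = {
    "root_workshop": [],
    "conf_a": ["root_workshop"],
    "journal_b": ["conf_a"],
    "tech_report": ["conf_a"],
    "dead_end": ["root_workshop"],
}

if __name__ == "__main__":
    for name, score in sorted(importance_scores(GRAPH).items(), key=lambda kv: -kv[1]):
        print(f"{name:<14}{score}")
```

In this toy graph the root workshop paper scores highest even though it sits in the "least" venue, which is the pattern described above. A fuller version would also need to model venue promotions, where the upgraded publication takes over the score of the one it supersedes.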

I suspect that I may have a somewhat higher than average branching factor, given the nature of my research and personality. I don't know, though---this may instead be an impression that I've gotten due to the operation of just such a process. After all, informal publications tend to fall away from visibility if they are not deliberately preserved and archived online by a researcher, and it's hard to see anything besides the mature work of another researcher. It would be fascinating to study this over a number of scientists, but really hard to do effective coding on publications.

Coming back to the root problem that started me down this intriguing rathole: when it comes to laying out my webpage, since I'm going to be showing people a snapshot in time, I think it's only right to classify things by current perceived importance, and not by category. And now, dear reader, an exercise for you: let's see just how long it takes between this post and an actual restructuring of my website. If it happens very quickly, it probably means I'm engaged in procrastination; if it takes more than six months, well, you have my permission to point and laugh. And if you're a scientist reading this, would you be willing to contribute a coding of your own publications?

Monday, November 05, 2012