Where’s the collaboration?

I had another topic queued up this morning, but an email from my sister-in-law reminded me of a more pressing beef: Lack of collaboration in the sciences. And, of course, I have no statistics to back this up, so I’m going to put this out there and see if anyone has anything to comment on the topic.

My contention is that the current method of funding scientists is the culprit driving less efficient science, mixed with a healthy dose of Zero Sum Game thinking.

First, my biggest pet peeve is that scientists – and bioinformaticians in particular – spend a lot of time reinventing the wheel.  How many SNP callers are currently available?  How many ChIP-Seq packages? How many aligners?  And, more importantly, how can you tell one from the other?  (How many of the hundreds of SNP callers have you actually used?)

It’s a pretty annoying aspect of bioinformatics that people seem to feel the need to start from scratch on a new project every time they say “I could tweak a parameter in this alignment algorithm…”  and then off they go, writing aligner #23,483,337 from scratch instead of modifying the existing aligner.  At some point, we’ll have more aligners than genomes!  (OK, that’s shameless hyperbole.)

But the point stands.  Bioinformaticians create a plethora of software that solves problems that are not entirely new.  While I’m not saying that bioinformaticians are working on solved problems, I am asserting that the creation of novel software packages is an inefficient way to tackle problems that someone else has already invested time/money into building software for. But I’ll come back to that in a minute.

But why is the default behavior to write your own package instead of building on top of an existing one?  Well, that’s clear: Publications.  In science, the method of determining your progress is how many journal publications you have, skewed by some “impact factor” for how impressive the name of the journal is.  The problem is that this is a terrible metric to judge progress and contribution.  Solving a difficult problem in an existing piece of software doesn’t merit a publication, but wasting 4 months to rewrite a piece of software DOES.

The science community, in general, and the funding community more specifically, will reward you for doing wasteful work instead of focusing your energies where they’re needed. This tends to squash software collaborations before they can take off simply by encouraging a proliferation of useless software that is rewarded because it’s novel.

There are examples of bioinformatics packages where collaboration is a bit more encouraged – and those provide models for more efficient ways of doing research.  For instance, in the molecular dynamics community, Charmm and Amber are the two software frameworks around which most people have gathered. Grad students don’t start their degree by being told to re-write one package or the other, but are instead told to learn one and then add modules to it.  Eventually the modules are released along with a publication describing the model.  (Or left to rot in a dingy hard drive somewhere if they’re not useful.)   Publications come from the work done and the algorithm modifications being explained.  That, to me, seems like a better model – and means everyone doesn’t start from scratch.

If you’re wondering where I’m going with this, it’s not towards the Microsoft model where everyone does bioinformatics in Excel, using Microsoft-generated code.

Instead, I’d like to propose a coordinated bioinformatics code-base.  Not a single package, but a unified set of hooks instead.  Imagine one code base, where you could write a module and add it to a big git hub of bioinformatics code – and re-use a common (well debugged) core set of functions that handle many of the common pieces.  You could swap out aligner implementations and have modular common output formats.  You could build a ChIP-Seq engine, and use modular functions for FDR calculations, replacing them as needed.  Imagine you could collaborate on code design with someone else – and when you’re done, you get a proper paper on the algorithm, not an application note announcing yet another package.
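To make the "hooks" idea a little more concrete, here is a minimal sketch in Python of what a swappable FDR hook might look like.  Everything in it is an assumption made for illustration: the registry, the hook name "fdr", and the function names (register, get, significant_peaks) are hypothetical and do not come from any existing package.  The two corrections are the standard Benjamini-Hochberg and Bonferroni procedures; Bonferroni controls family-wise error rather than FDR, and is only filed under the same hook here to show the swap.

```python
from typing import Callable, Dict, List

# Hypothetical registry (an assumption for this sketch):
# hook name -> implementation name -> function.
_REGISTRY: Dict[str, Dict[str, Callable]] = {}


def register(hook: str, name: str) -> Callable:
    """Decorator that files a function under a named hook."""
    def wrap(func: Callable) -> Callable:
        _REGISTRY.setdefault(hook, {})[name] = func
        return func
    return wrap


def get(hook: str, name: str) -> Callable:
    """Look up one implementation of a hook by name."""
    return _REGISTRY[hook][name]


@register("fdr", "benjamini_hochberg")
def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


@register("fdr", "bonferroni")
def bonferroni(pvalues: List[float]) -> List[float]:
    """Bonferroni correction: stricter, family-wise error control."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def significant_peaks(peaks: List[str], pvalues: List[float],
                      method: str = "benjamini_hochberg",
                      alpha: float = 0.05) -> List[str]:
    """Toy ChIP-Seq step that never hard-codes its own correction."""
    correct = get("fdr", method)
    adjusted = correct(pvalues)
    return [peak for peak, q in zip(peaks, adjusted) if q <= alpha]
```

The point of the sketch is that significant_peaks only knows the hook, so calling it with method="bonferroni" swaps the correction without touching the peak-filtering code.  A real shared core would need far more than this (versioning, common data formats, cross-language bindings), but the shape of the idea is the same.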

(We have been better in the past couple years with tool sets like SAMTools, but that deals with a single common file format.  Imagine if that also allowed for much bigger projects like providing core functions for RNA-Seq or CNV analysis…  but I digress.)

Even better, if we all developed around a single set of common hooks, you can imagine that, at the end of the day (once you’ve submitted your repository to the main trunk), someone like the Galaxy team would simply vacuum up your modules and instantly make your code available to every bioinformatician and biologist out there.  Instant usability!

While this model of bioinformatics development would take a small team of core maintainers for the common core and hooks, much the same way Linux has Linus Torvalds working on the Kernel, it would also cut down severely on code duplication, bugs in bioinformatics code and the plethora of software packages that never get used.

I don’t think this is an unachievable goal, either for the DIY bioinformatics community, the Open Source bioinformatics community or the academic bioinformatics community.  Indeed, if all three of those decided to work together, it could be a very powerful movement.  Moreover, corporate bioinformatics could be a strong player in it, providing support and development for users, much the way corporate Linux players have done for the past two decades.

What is needed, however, is buy-in from some influential people, and some influential labs.  Putting aside their own home grown software and investing in a common core is probably a challenging concept, but it could be done – and the rewards would be dramatic.

Finally, coming back to the funding issue.  Agencies funding bioinformatics work would also save a lot of money by investing in this type of framework.  It would ensure that more time is spent on useful coding, that publications do more to describe algorithms, and that higher quality code is being produced at the end of the day.  The big difference is that they’d have to start accepting that bioinformatics papers shouldn’t be about “new software” available, but “new statistics”, “new algorithms” and “new methods” – which may require a paradigm change in the way we evaluate bioinformatics funding.

Anyhow, I can always dream.

Notes: Yes, there are software frameworks out there that could be used to get the ball rolling.  I know Galaxy does have some fantastic tools, but (if I’m not mistaken), it doesn’t provide a common framework for coding – only for interacting with the software.  I’m also aware that Charmm and Amber have problems – mainly because they were developed by competing labs that failed to become entirely inclusive of the community, or to invest substantially in maintaining the infrastructure in a clean way.  Finally, yes, the licensing of this code would determine the extent of corporate participation, but the GPL provides at least one successful example of this working.

14 thoughts on “Where’s the collaboration?”

  1. Bioconductor seems to meet a lot of your criteria. Publications are indeed frequently of the app-note-announcing-a-new-package variety, but many of the statisticians publishing key methods (e.g. FDR as you mention) maintain Bioconductor packages. There is some redundancy, but there are also a lot of packages built on top of other packages, and attempts to enable workflows with several options at each step. For high-throughput sequencing, the hardworking core team has put a lot of effort into defining core data object types with reasonably rich functionality (including the ability to e.g. use samtools from R) that is shared amongst many packages.

    For your ChIP-Seq example, one can easily build workflows to check the QA of sequencing runs, align and filter reads, load datasets from GEO and SRA, call peaks, find differentially bound peaks, and perform all manner of downstream analysis: clustering, classification, annotation, motif finding, gene set enrichment, GO terms, etc., plus flexible plotting and report generation.

    • Hi Rory,

      Thanks for your feedback – you’re right, bioconductor does seem to have the right model. The only problem I have with it is that it’s designed around R – which does not make a lot of sense for scaling up bioinformatics work. One of the important criteria for such a common framework would be to accept code from a variety of languages/packages/scaffolds – OR to be designed in a general purpose language like C/Java/Python etc. Unfortunately, as much as I have heard great stuff about bioconductor, using R as the basis for all future bioinformatics work seems somewhat inappropriate to me. (I could likely be convinced otherwise, tho, if you have some good evidence to the contrary.)

  2. Hi Anthony-

    As a computer scientist and as a software engineer who spent over a decade building big systems mostly in C/C++, I sympathize with your wariness towards R!

    We use Java as our primary sequencing pipeline (basecalling, QC, alignments) and to maintain the core databases. From there we use R for analysis workflow. We never read the alignments into main R memory, and anything “serious” is done with R wrappers around integrated C++ modules (or callouts to external programs), but I’ve been pretty impressed how much we can do within R. For example, we’ve processed many hundreds of ChIP-Seq libraries, using R to do QC on the alignments and to schedule peak callers (e.g. FindPeaks and MACS) on our HPC, subsequently reading the peaks back into R for downstream processing. R can handle a binding matrix of 150,000 peaks by 100 sequencing libraries (read count data) cleanly, and common data formats mean we can easily try different normalization methods, or various ways of estimating parameters to fit different distributions, and pass off interesting peaksets to downstream annotation or motif finding packages painlessly.

    Sometimes R drives me nuts – they don’t seem to put much stock in backward compatibility and every release breaks something, whereas C code I wrote during my PhD in the early ’90s still runs. So while the “serious” work of many modules is done in embedded C/C++, or callouts to external Java and Python programs, R workflows let us develop a template script that is customizable for each project. Using Sweave, these scripts repeatably execute an analysis and produce LaTeX documents with all the plots, software and package versions, etc.

    The big win for us is Bioconductor itself, for basically all the reasons you describe in your article: the availability of an extensive set of well documented, easy to install modules within a common processing framework, along with an active and engaged community providing feedback and support.

    • Hi Rory,

      Again, thanks! Actually, I think you’ve summed up the strengths and weaknesses of R *very* well. I’m always impressed with what you CAN do with R, but it always strikes me as a bad way to do many of the things that need to get done for working with NGS data. Maybe what I should be advocating for is a bigger framework, then – one that could build from bioconductor, but is not limited to R.

      Still, I’ve played with R and found its interface to be painful for users (and even experienced programmers!) – and while it certainly has strengths when it comes to manipulating data, I don’t think those strengths are limited to R, and could be replicated in many other languages. Still you’re right that of all the tools out there now, the closest is bioconductor.

      And I’ve just been told that it can encapsulate C. Maybe it isn’t such a bad solution after all.

  3. Hi Anthony

    Totally agree with you… it irritates me to see these publications where you have a minor improvement on an existing algorithm. Scientists, including bioinformaticians, need to spend more time on good experimental design (that’s where you’ll get the biggest bang for your buck). You see this less in industry but surprisingly it does happen. Maybe these people are looking for job security… I do like Galaxy, but it still needs some work compared to commercial tools like Pipeline Pilot. I try to have in my toolkit a robust programming language like Java/Python, a workflow engine (Galaxy etc.) and a statistical engine like R. That should solve most of your problems.

    • Hi Steve,

      You make a good point that you can solve most of your problems with a workflow engine, a stats engine and a robust programming language – but that leads you to exactly where we are: avoiding collaboration. (I personally use Java, databases and a big code base I’ve developed, which is both a symptom and a cause of the problem.)

      I think, as a community, we all need to put our heads together and come up with something more than just tools that solve our own problems. We need tools that help the entire community solve its problems, because we’re all facing the same problems individually and we’d all do better if we faced them together.

  4. Pingback: Animal health, Eisai, institutes and infrastructure | Professor Douglas Kell's blog

  5. Pingback: Animal health, Eisai, institutes and infrastructure | BiotechLive.com

  6. I’m catching up on your posts via reading them backwards. Interesting that you are joining CLCbio. A firm which provides a framework but only if you march with them. Not only do we have competition in the academic realm but also in the commercial ventures.

    • I can see why that might seem somewhat askew – however, I haven’t discussed any of my blog ideas with them, nor have they discussed any related topics with me. I suppose that posts like this will be interesting grounds for future discussions internally at CLC, and hopefully at other companies as well.

      At any rate, if I may quote myself from a conversation I had earlier today: “If the academics don’t get their A$$es in gear, the corporate world will move in to fill the void.” I’m certainly not against corporations fulfilling needs in the real world where academics are unable to tread. How my future employers perceive the situation has yet to be communicated to me.

  7. Hi Rory,

    I think you should give R and Bioconductor a try. Not only can we actually do a lot of the things you are afraid we can’t do, but we actually solve a lot of the problems that you are talking about as well. We already do have a core team that maintains the project as a federation of reusable packages, and we carefully maintain this project and encourage input from all comers. As a result, there is a lot of reuse of code and a lot of collaboration. There is even a review process to encourage code reuse and collaboration in new packages. And not all the code that is reused by the project is R either. R is really just the glue that holds it all together inside of modular packages. Why R? Well, R is a statistical language, which makes it a great framework for doing data analysis. These other languages have other strengths, and different packages in the project frequently tap into other languages as appropriate to take advantage of those strengths. We know that R is not a perfect solution for everything and we would never pretend that it is. But it is pretty darned great. And I think you will find that it’s well worth your time to learn a bit more about it.

    Given that the alternative is to try and recreate a new version of what already exists, I hope you will give R and Bioconductor a try.

    Marc

  8. Pingback: Links 11/2/11 | Mike the Mad Biologist

    • One of these days, I’m just going to have to rant about R. I’d be happy with R if its syntax weren’t so damn annoying to learn (referring to the R shell, not the programming language, which I’ve only used once for a small project and know nothing about). That said, I truly see and respect what people have done with bioconductor and R.

      Ugh. Leaves me torn.

      Still, my post isn’t about languages, as I really don’t care what language this comes together in – and ideally it should be language independent. It’s not about getting everyone to pick one language, but to build one project together. I hope that wasn’t lost.
