>Field Programmable Gate Arrays

>Yes, I’m procrastinating again. I have two papers, two big chunks of code and a thesis proposal to write, a paper to review (it’s been done, but I have yet to type out my comments…), several major experiments to do and at least one poster looming on the horizon – not to mention squeezing in a couple of manuals for the Vancouver Package Software. And yet, I keep finding other stuff to work on, because it’s the weekend.

So, I figured this would be a good time to touch on the topic of Field Programmable Gate Arrays, or FPGAs. I’ve done very little research on this topic, since it’s so far removed from my own core expertise, but it’s a hot topic in bioinformatics, so I’d be doing a big disservice by not covering it at all. However, I hope people will correct me if they spot errors.

So what is an FPGA? I’d suggest you read the wikipedia article linked above, but I’d sum it up as a chip that can be added to a computer and reconfigured to optimize the way information is processed, so as to accelerate a given algorithm. It’s a pretty cool concept – move a particular part of an algorithm into the hardware itself to speed it up. Of course, there are disadvantages as well. Reprogramming is (was? – this may have changed) a few orders of magnitude slower than processing information, so you can’t change the programming on the fly while processing data and still hope to get a speed-up. Some chips can reprogram unused sub-sections while other algorithms are running… but now we’re getting really technical.

(For a very good technical discussion, I suggest this book, of which I’ve read a few useful paragraphs.)

Rather than discuss FPGAs themselves, which are a cool subject on their own, I’d rather discuss their applications in Bioinformatics. As far as I know, they’re not widely used for most applications at the moment. The most processor-intensive bioinformatics applications, molecular modeling and drug docking, are mainly vector-based calculations, so vector chips (e.g. Graphics Processing Units – GPUs) are more applicable to them. As for the rest, CPUs have traditionally been “good enough”. However, two recent developments seem to have accelerated this potential marriage of technologies:

  1. The makers of FPGAs have been looking for applications for their products for years and have targeted bioinformatics because of its intense computer use. Heavy computer use is usually taken as a sign that an industry needs more efficient processing – and FPGAs appear to meet that need, on the surface.
  2. Bioinformatics was doing well with the available computers, but suddenly found itself behind the processing curve with the advent of Second Generation Sequencing (SGS). The amount of information being processed spiked by an order of magnitude (or more), leaving bioinformaticians screaming for more processing power and resources.

So, it was inevitable that FPGA producers would hear about the demand for more power in the field, and believe it’s the ideal market into which they should plunge. To the casual observer, Bioinformatics needs more efficiency and power, and FPGA producers are looking for a market where efficiency and power are needed! Is this a match made in heaven or what?

Actually, I contend that FPGAs are the wrong solution for several reasons.

While Second Generation Sequencing produces tons more data, the algorithms being employed haven’t yet settled down. Every 4 months we pick a different aligner. Every 3 months we add a new database. Every month we produce a more efficient version of our algorithms for interpreting the data. The overhead of translating an algorithm into hardware to use an FPGA (which seems large to me, but may not be to people more fluent in HDL) means you’d spend a disproportionate amount of time getting the chips set up to process your data – data you’re only going to process that way for a short period before moving on. Any gain in efficiency would probably be wiped out by the effort involved.

Furthermore, even where we do know the algorithms are going to stay around, a lot of our processing isn’t necessarily CPU bound – it’s I/O or memory bound. When you’re trawling through 16 GB of memory, it’s not obvious that adding more speed to the CPU will help. Pre-fetching and pre-caching are probably doing more to help you out than anything else bound to your CPU.

In the age of multi-core CPUs, multi-threaded programs already relieve many of the pains that plague bioinformaticians. Most of my Java code is thrilled to pull 2, 3 or more processors in to work faster – without a lot of explicit multi-threading. (My record so far is 1496% CPU usage – nearly 15 processors.) I would expect that buying 16-way processors is more cost-efficient than buying 16 FPGAs for many of the algorithms currently in use.
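
To make that concrete, here’s a minimal sketch of the kind of coarse-grained threading I mean – this is not FindPeaks’ actual threading model, just the standard java.util.concurrent pattern, with a hypothetical per-chromosome task standing in for real work:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelReadProcessor {
    public static void main(String[] args) throws InterruptedException {
        // Size the thread pool to the machine: this is how one process ends
        // up pulling in every core the scheduler will give it.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Hypothetical work units - in a peak finder, each chromosome's
        // tags can usually be processed independently.
        List<String> chromosomes = Arrays.asList("chr1", "chr2", "chr3", "chrX");
        for (final String chr : chromosomes) {
            pool.submit(() -> processChromosome(chr));
        }

        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the work to drain
    }

    // Placeholder for the real per-chromosome processing.
    private static void processChromosome(String chr) {
        System.out.println(chr + " handled by " + Thread.currentThread().getName());
    }
}
```

As long as the work units are independent, the speed-up comes nearly for free – which is exactly why throwing more conventional cores at the problem is so attractive.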

Buying more conventional resources will probably alleviate the sudden bottleneck in compute power faster than innovating new solutions around the need. It’s likely that many groups getting into second generation genomics technologies failed to understand the processing demands of the data, and thus didn’t plan adequately for the resources. That means much of the demand for data processing is temporary, and may even be alleviated by more efficient algorithms in the future.

So where does the FPGA fit in?

I’d contend that there are very few products out there that would benefit from FPGAs in Bioinformatics… but there are a few. All bioinformaticians know that aligning short reads is one of those areas. Considering that a full Maq run for a flow cell from an Illumina GAII takes 14+ hours on a small cluster, that’s one area that would clearly benefit.

Of course, no bioinformatician wants to reprogram an FPGA on the fly just to use their own tools. Were I to pick a model, it would probably be to team up with an aligner group to produce a stand-alone, multi-FPGA/CPU hybrid box with 32 GB of RAM and a 3-4 year upgrade path. Every 4 months you produce a new aligner algorithm and HDL template; users pick up the aligner and HDL upgrade and “flash” their computer to use the new software/hardware. This would follow the Google Appliance model – an automated box that does one task, and does it well – with the exception that hardware “upgrades” come along with the software patches. That would certainly turn a few heads.

At any rate, only time will tell. If the algorithms settle down, FPGAs may become more useful. If FPGAs become easier for bioinformaticians to program, they may find a willing audience. If FPGA producers begin to understand the constraints of bioinformatics groups, they may find niche applications that will truly benefit from this technology. I look forward to seeing where it goes.

Ok… now that I’ve gone WAY out on a limb, I think it’s time to tackle a few of those tasks on my list.

>SISSR

>One more day, one more piece of ChIP-Seq software to cover. I’ve not talked much about FindPeaks, the software descended from Robertson et al., for obvious reasons: the paper was just an application note, and, well, I’m really familiar with how it works, so I’m not going to review it. I have talked about QuEST, however, which was presumably descended from Johnson et al. Those of you who have been following ChIP-Seq papers since the early days will realize that there’s still something missing: the peak finder descended from Barski et al., which is the subject of today’s blog: SISSR. Those were the first three published ChIP-Seq papers, so it’s no surprise that each of them was followed up with a paper (or application note!) on its software.

So, today, I’ll take a look at SISSR, to complete the series.

From the start, the Barski paper discussed both histone modifications and transcription factors, so the context of the peak finder is a little different. Where FindPeaks (and possibly QuEST as well) was originally conceived for identifying single peaks and expanded to handle multiple peaks, I would imagine that SISSR was conceived with the idea of working on “complex” areas of overlapping peaks. That’s only relevant in terms of their analysis, though – I’ll come back to it.

The most striking thing you’ll notice about this paper is that the datasets look familiar. They are, in fact, the sets from Robertson, Barski and Johnson: STAT1, CTCF and NRSF, respectively. This is the first of the ChIP-Seq application papers that actually performs a comparison between the available peak finders – and, of course, claims that theirs is the best. Again, I’ll come back to that.

The method used by SISSR is almost identical to the method used by FindPeaks, with the directional information built into the base algorithm, whereas FindPeaks provides it as an optional module (the -directional flag, which uses a slightly different method). They provide an excellent visual image on the 4th page of the article demonstrating the concept, which explains the method better than I can – but I’ll try anyhow.

In ChIP-Seq, a binding site is expected to have many real tags pointing at it: tags upstream of the site should be on the sense strand, and tags downstream should be on the anti-sense strand. Thus, a real binding site should sit at a transition point, where the majority of tags switch from the sense to the anti-sense strand. By identifying these transition points, you can identify the locations of real binding sites. More or less, that describes the algorithm employed, with two modifications: a window (20 bp by default) is used instead of working base by base, and parameter estimation is employed to guess the fragment length.
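
To illustrate the transition-point idea as I’ve just described it (my own minimal sketch, not SISSR’s code), suppose the tags have already been binned into fixed windows, each holding its net tag count – sense tags minus anti-sense tags. A candidate site is any boundary where that net count flips from positive to negative:

```java
import java.util.ArrayList;
import java.util.List;

public class TransitionScan {
    /**
     * netCounts[w] = (sense tags - anti-sense tags) in window w;
     * windowSize is the bin width in bp (20 by default, per the paper).
     * Returns approximate genomic coordinates of candidate binding sites.
     */
    public static List<Integer> transitionPoints(int[] netCounts, int windowSize) {
        List<Integer> sites = new ArrayList<>();
        for (int w = 1; w < netCounts.length; w++) {
            // A real site should sit where tag direction flips from
            // sense-dominated (positive) to anti-sense-dominated (negative).
            if (netCounts[w - 1] > 0 && netCounts[w] < 0) {
                sites.add(w * windowSize); // boundary between the two windows
            }
        }
        return sites;
    }
}
```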

In my review of QuEST, I complained that windows are a bad idea(tm) for ChIP-Seq, only to be corrected – QuEST wasn’t using a window. This time the window is explicitly described, and again I’m puzzled. FindPeaks performs an identical operation without windows, and it runs blazingly fast. Why throw away resolution when you don’t need to?

On the subject of length estimation, I’m again less than impressed. I realize this is probably an early attempt at it – and FindPeaks has gone through its fair share of bad length estimators, so it’s not a major criticism – but it is a weakness. To quote a couple of lines from the paper: “For every tag i in the sense strand, the nearest tag k in the anti-sense strand is identified. Let j be the tag in the sense strand immediately upstream of k.” A formula then follows, based upon the distances between (i,j) and (j,k). I completely fail to understand how this provides an accurate assessment of the real fragment length; I’m sure I’m missing something. As a function describing the width of peaks it may be a good method – and that’s really what the experiment is aiming for anyhow – so it’s possible this is just poorly named.
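
For what it’s worth, here’s how I read the pairing they describe – my own sketch of the quoted procedure only. The paper’s actual formula for combining the two distances isn’t reproduced here, so that step is deliberately left to the caller:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FragmentPairing {
    /** The two distances the quoted passage builds its formula from. */
    public static class Pairing {
        public final int dIJ; // distance between sense tags i and j
        public final int dJK; // distance between sense tag j and anti-sense tag k
        public Pairing(int dIJ, int dJK) { this.dIJ = dIJ; this.dJK = dJK; }
    }

    /**
     * For every sense tag i, find the nearest anti-sense tag k, then the
     * sense tag j immediately upstream of k, and record the two distances.
     * Both input arrays must be sorted genomic positions.
     */
    public static List<Pairing> pairings(int[] senseTags, int[] antiSenseTags) {
        List<Pairing> result = new ArrayList<>();
        for (int i : senseTags) {
            int k = nearest(antiSenseTags, i);
            Integer j = null;
            for (int s : senseTags) {          // sense tag immediately upstream of k
                if (s < k && (j == null || s > j)) j = s;
            }
            if (j != null) result.add(new Pairing(Math.abs(j - i), k - j));
        }
        return result;
    }

    // Nearest value to pos in a sorted array, via binary search.
    private static int nearest(int[] sorted, int pos) {
        int idx = Arrays.binarySearch(sorted, pos);
        if (idx >= 0) return sorted[idx];
        int ins = -idx - 1;
        if (ins == 0) return sorted[0];
        if (ins == sorted.length) return sorted[sorted.length - 1];
        int lo = sorted[ins - 1], hi = sorted[ins];
        return (pos - lo <= hi - pos) ? lo : hi;
    }
}
```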

In fairness, they also provide an option for manual length specification (or XSET, as it was referred to at the time), which overrides the length estimation. I didn’t see a comparison in the paper of which one provided the better answers, but having options is always a good thing.

Moving along, my real complaint about this article is the analysis of their results compared to past results, which comes in two parts. (I told you I’d come back to it.)

The first complaint is what they compared against. The article was submitted for publication in May 2008, but they compared their results to those published in the June 2007 Robertson article for STAT1. By August 2007, our count of peaks had changed. By January 2008, several upgraded versions of FindPeaks were available and many bugs had been ironed out. It’s hardly fair to compare the June 2007 FindPeaks results to the May 2008 version of SISSR and then declare SISSR the clear winner. Still, that’s not a huge problem – albeit somewhat misleading.

More vexing is their quality metric. In the motif analysis, they clearly state that because of the large amount of computing power required, only the top X% of reads were used in their analysis; for the comparison with FindPeaks, the top 5% of peaks were used – and the same motifs were observed. Meanwhile, their claim of finding 74% more peaks than FindPeaks is never really discussed in terms of the quality of those additional sites. (FindPeaks was also modified to identify sub-peaks after the original data set was published, so this is really comparing apples to oranges – a fact glossed over in the discussion.)

Anyhow, complaints aside, it’s good to see a paper finally compare the various peak finders out there. They provide some excellent graphics and a nice overview of how their ChIP-Seq application works, while contrasting it with the published data available. Again, I enjoyed the motif work, particularly figure 5, which correlates four motif variants to tag density – a fantastic bit of information, buried deeper in the paper than it should be.

So, in summary: this paper presents a rather unfair competition, using metrics guaranteed to make SISSR stand out, but it still provides a good read, with background on ChIP-Seq, excellent illustrations and the occasional moment of deep insight.