Aligning DNA – a quick intro to some of the tools

A long time ago, I mentioned that I’d use this blog for some science. I don’t really know who my audience is, but I suspect it’s 1 part people who google in for Linux-related keywords, 1 part people who are googling me because they’ve read my name somewhere, and 1 part random people… oh, and not to forget those people who I actually know in real life, but don’t get to see often. Since I’m unlikely to scare off the googlers or the random people with some science, and the people I know in person will just skip over this posting (that’s a hint, if you haven’t already stopped reading…), I’ll just go ahead with a quick lesson in genomics.

As I’ve mentioned to many people, my research is currently heavily embedded in writing code for the processing of Solexa/Illumina reads. For those who aren’t in the know, the Illumina 1G is a DNA sequencing machine that can produce ~40 million sequencing reads in about 36 hours, where each read is 36 bases long. This is drastically different from traditional sequencing, where you get far fewer sequences (hundreds), each of which can be up to 1000 bases long. With so few reads to align, and with the reads being long, traditional sequencing can be processed with a rather laissez-faire attitude. If it takes 10 seconds to align each read, then 100 reads take about 1000 seconds to align, or roughly 17 minutes.

In contrast, with 40 million reads, at ten seconds each, you’re talking about waiting more than 12 years for your data to be processed. Clearly this is not an option if you want to get into a cutting-edge journal.

(For those who are curious to know where the 10 second mark came from, my supervisor timed a few blast searches on a local server.)
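For anyone who wants to check my arithmetic, here’s a quick Python sketch of the back-of-the-envelope calculation. The 10 seconds per read is just the rough BLAST timing mentioned above, and the function and constant names are my own, made up for illustration.

```python
# Back-of-the-envelope alignment time, assuming a flat cost per read.
# SECONDS_PER_READ is the rough BLAST timing quoted above, not a benchmark.
SECONDS_PER_READ = 10
SECONDS_PER_MINUTE = 60
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def alignment_time_seconds(num_reads, seconds_per_read=SECONDS_PER_READ):
    """Total wall-clock time to align num_reads one after another."""
    return num_reads * seconds_per_read


# Traditional sequencing: on the order of 100 long reads.
traditional = alignment_time_seconds(100)

# Illumina 1G run: roughly 40 million short reads.
illumina = alignment_time_seconds(40_000_000)

print(f"traditional: {traditional / SECONDS_PER_MINUTE:.1f} minutes")
print(f"illumina:    {illumina / SECONDS_PER_YEAR:.1f} years")
```

This assumes the reads are aligned one at a time on a single machine; spreading the work over a cluster would cut the wall-clock time, but not the total compute.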

Consequently, computational scientists have risen to the task, and created several new algorithms for performing these alignments. The first one that I met is called “Eland”, which I believe stands for “Efficient Large-Scale Alignment of Nucleotide Databases”. To be honest, I’m not really sure how to get Eland, as there is very little information available on the web. I’ve gleaned a few scraps of information: it was written by Anthony Cox, and was distributed by Solexa/Illumina. As far as I can tell, it came with the Illumina 1G machines.

The second one I met was called “Mosaik”. This aligner comes from the lab of Gabor Marth, at Boston College. The third was “Exonerate”, followed by several others. Hands down, the best name for an aligner has got to be “MAPASS”… I’ll let you ponder that for a few minutes.

Anyhow, each aligner has its advantages and disadvantages. Eland, for instance, is crippled by an algorithmic limitation that means it can only ever align the first 32 bases of a sequence, and by its inability to map a read to more than one location in a target DNA source (i.e., a genome or transcriptome). On the other hand, it’s one of the fastest aligners out there.

Mosaik, in contrast, is a bit slower, but has several nice features, and is able to handle the cases that Eland can’t. Unfortunately, it dumps out its alignments to a file format that really isn’t convenient for doing any further processing, which is a product of its original use in sequence assembly.
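To make those two Eland limitations a bit more concrete, here’s a toy Python sketch. To be clear, this is not how Eland (or Mosaik) actually works: it does naive exact string matching, the names `find_all` and `toy_align` are made up, and treating a multi-location read as simply “unplaced” is my own simplification. All it shows is what it means to look only at the first N bases of a read, and to refuse reads that hit more than one location.

```python
def find_all(reference, pattern):
    """Return every 0-based start position where pattern occurs in reference."""
    positions = []
    start = reference.find(pattern)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping occurrences are found too.
        start = reference.find(pattern, start + 1)
    return positions


def toy_align(read, reference, max_len=32, unique_only=True):
    """Toy exact-match 'aligner' (hypothetical, for illustration only).

    Only the first max_len bases of the read are used; anything after
    that is ignored. With unique_only=True, a read that matches more
    than one location is treated as unplaced and an empty list is returned.
    """
    seed = read[:max_len]
    hits = find_all(reference, seed)
    if unique_only and len(hits) != 1:
        return []
    return hits


# A tiny reference with a repeated region; max_len is shrunk to 7 so the
# effect is visible on such a short example.
reference = "GATTACAGGGATTACA"

# The read occurs twice, so a unique-only aligner cannot place it.
print(toy_align("GATTACA", reference, max_len=7))

# Allowing multiple locations reports both positions.
print(toy_align("GATTACA", reference, max_len=7, unique_only=False))

# Bases past max_len are never compared, so the trailing "TT" is ignored.
print(toy_align("GATTACATT", reference, max_len=7, unique_only=False))
```

In a real genome, repeats are everywhere, so dropping every read that lands in more than one place throws away a fair amount of data.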

Just to throw in one more curve, it’s worth mentioning a competing proprietary piece of software. There’s a company in Malaysia that has what appears to be the fastest aligner out there, with none of the limitations of Eland, and a flexible file output. Like all commercial products for the science market, they set the base price of their product outside of the reach of most consumers: I can’t seem to find a good justification to get my supervisor to pay $200,000+ USD/CAD for a copy of their software. (You could hire four post-docs for a year for that… or 10 grad students.)

Anyhow, now you know who the players are. My next post on this topic will be to introduce some of them in detail, explain how they work, and then tell you how we use them.

Yes, I’m feeling ambitious: I got three pieces of software working today, two of which are likely to be in production in the next couple of weeks at the Genome Sciences Centre. One is for ChIP-Seq experiments (FindPeaks 3.0) and the other is for transcriptome processing for mammalian cells (yet to be named).
