>Aligning DNA – Eland

>Continuing in my series of blog articles on short-read sequencing, it’s time for a little bit of a discussion on the first short-sequence aligner that I’d ever met: Eland. The problem with writing anything about Eland, is that 90% of the information I have on it can’t be confirmed independently. In fact, the majority of what I know comes from the header of the ELAND.cpp file. I’ll admit, while I’ve gone through the source code a couple of times, I haven’t spent much time getting to know it’s well – I’ve processed billions of lines of Eland files, but have yet to run it, myself…

Anyhow, here’s what I can tell you: Eland stands for “Efficient Local Alignment of Nucleotide Data” and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it’s a direct departure from the dynamic programming based algorithms (like the venerable Smith-Waterman alignment), making it somewhat interesting as a computational program.

The basic algorithm works along these lines: Given a sequence of length N, it can be divided into four subsequences (A, B, C, and D), which are of equal (or nearly equal length). Assuming there are no more than 2 errors, at least two of these subsequences will be “error free”, so that the two error free sequences can then be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequence AB and CD. Of course, you don’t know which subsequences are the error free ones, so you need to try all possible combination.

What Eland does to speed things up is to combine these subsequences into sets of two. Thus, rather than searching for 4 independent entries in their database, they simply have to search for 2 sets of two, and as long as one set matches, you can use looser matching criteria on the other two, ie, allowing for mismatches. That is to say, if you make two sequences out of the 4 subsequences {AB and CD}, you can search for an exact match for AB, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space. (Only sequences containing the exact match to {AB}.)

Searching for the {AB and CD} subsequences would only work if the first half of your sequence has no errors. What if B and C had the errors? The simple answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and CD}, {BA and CD}, etc.) This still provides you with a relatively small search space for each sequence, as there are only 4! possible combinations (which is 4x3x2x1 = 24 possible sequences) to search for.

Fortunately, you can bound this even further. You always know that you want sequences in which the first pair and second pair are in the correct order, (ie {AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result) limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free.

That, in general is how it works.. with a few nifty shortcuts. ie, they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because they can search for subsequence combinations simultaneously and efficiently. In other words, you get a very quick alignment.

On the other hand, the way in which ELAND was implemented is very important. The Eland aligner has some severe restrictions, some of which are algorithm specific, some of which are implementation specific:

  1. The Eland aligner can only align sequences up to 32 base pairs long. (Implementation specific: they use four integers to represent the subsequences {A,B,C and D}, which reduces the memory requirements.)
  2. The Eland aligner can only align sequences with up to two mismatches. (Algorithm specific: at least two of the keys must match, although the algorithm could be expanded to use more subsequences, at the expense of expanding the sequence space.)
  3. The Eland aligner only returns sequences that match to one unique location in the genome (Implementation specific: Eland already tells you when a sequence matches to more than one location, it just doesn’t tell you what those locations are.)
  4. The Eland aligner can do mismatches in the sequence, but can not handle insertions or deletions (i.e, gapped alignments). (Algorithm specific: The algorithm for matching the keys could be expanded to include this, but it would be very costly in time/memory.)
  5. The code is not freely available. (Implementation specific: I don’t know of anywhere you can go to get the code… or documentation!)
  6. Changing the length of the alignment sequence (I called it N, above), requires a recompile. (Implementation, obviously. Why this isn’t a command line parameter, I don’t know.)

I’m sure I could go on and list other disadvantages, but I’d like to point out the main advantage: It’s FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, compared to Blast). (Comparisons are for a single processor, matching 1 million+ reads to a mamalian size genome.) The only other aligner that can come close to this speed is, to the best of my knowledge, SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000CAD to license.

Anyhow, to close this discussion, it’s worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of performing searches on sequences larger than 32 base pairs in length and able to perform multi-matches, were supposed to be released in September 2007. Unfortunately, the new betas appear to be vapourware, which has prompted us to look into several other aligners, which I’ll discuss in future posts.

(2008-01-28 Update: I stand corrected: I’ve heard that the new betas have finally arrived! I haven’t seen what they can do yet, but at least we know my prediction that they’re vapourware is wrong.)

Leave a Reply

Your email address will not be published. Required fields are marked *