>ChIP-Seq on a laptop.

>Every once in a while, I get an interesting email or comment from someone, which is worthy of a blog post. I woke up this morning to a comment left on my last post:

What kind of computing infrastructure do you need to be able to handle chipSEQ datasets? I’m guessing my standard IBM T60 laptop is not going to do the trick – but what does?

So yes, I do check my email when I get up… but that’s not the point. Moving along…

The simple answer is, your laptop probably can’t process a ChIP-Seq data set with the currently available programs… but it’s probably not far off. My laptop has 2Gb of RAM (after tossing in two new 1Gb sticks last week) and a dual-core AMD running at 2GHz. Ironically, that’s not far from what we do use to run FindPeaks: a quad-core 2.2GHz Opteron with 8Gb of RAM.

Of course, FindPeaks doesn’t take up the full resources of the machine. (Though I’ve seen it take up to 12 CPUs on one of our fancier servers.) Generally, I only allocate 4Gb of RAM to the process, which should be more than enough for several million or tens of millions of reads. Why does it take so much RAM? Because I don’t drop any of the information contained in the Eland file, and because I need to retain all of the reads in memory to do a sort.

What does it need? Generally, FindPeaks only needs the start, end and direction of each read to do the basics; however, I don’t throw away any of the other information collected, in case we want to use it later. If I did that, or if I pre-sorted the reads, I could probably drop the memory requirements by an order of magnitude or more. (No one’s asked me to do this yet, however.) Is there anyone out there who needs a laptop-runnable version of FindPeaks?
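Just to put numbers on that, here’s a back-of-the-envelope sketch (in Python, and emphatically not FindPeaks’ actual data structure) of what keeping only start, end and direction per read might cost:

```python
from struct import calcsize, pack

# Hypothetical minimal read record: start, end, and strand only.
READ_FORMAT = "=iib"   # two 32-bit ints plus a signed byte = 9 bytes

def pack_read(start, end, strand):
    """Pack one aligned read into a compact binary record."""
    return pack(READ_FORMAT, start, end, strand)

bytes_per_read = calcsize(READ_FORMAT)
print(bytes_per_read)                      # 9 bytes per read
print(10_000_000 * bytes_per_read / 1e6)   # ~90 MB for ten million reads
```

Even with sorting overhead on top, that order of magnitude is squarely in laptop territory.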

This is somewhat timely, though. For the last few days I’ve been playing with a companion piece of software for FindPeaks, which generates UCSC-compatible BED tracks from Eland data, where I’ve adopted a minimalist approach. It takes about the same amount of time, and runs in under 300Mb of RAM. That, clearly, should run on anyone’s laptop.
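As a flavour of what that minimalist approach looks like, here’s a hedged Python sketch. The field layout is a simplified, hypothetical stand-in for the real Eland output (read id, sequence, match code, chromosome, 1-based position, F/R strand), not the exact format:

```python
def eland_to_bed(lines, track_name="reads"):
    """Convert simplified Eland-style alignment lines to UCSC BED lines.

    Assumes tab-separated fields: read id, sequence, match code,
    chromosome, 1-based position, strand (F/R) -- a hypothetical,
    simplified layout, not the exact Eland output format.
    """
    yield f'track name="{track_name}"'
    for line in lines:
        read_id, seq, code, chrom, pos, strand = line.rstrip("\n").split("\t")
        if not code.startswith("U"):       # keep uniquely-matched reads only
            continue
        start = int(pos) - 1               # BED coordinates are 0-based
        end = start + len(seq)
        bed_strand = "+" if strand == "F" else "-"
        yield f"{chrom}\t{start}\t{end}\t{read_id}\t0\t{bed_strand}"

records = ["r1\tACGTACGTAC\tU0\tchr1\t101\tF",
           "r2\tACGTACGTAC\tNM\tchr1\t201\tR"]
for bed_line in eland_to_bed(records):
    print(bed_line)
```

Because it streams one line at a time and keeps nothing in memory, the RAM footprint stays tiny regardless of how many reads go through.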

>Catching up….

>I can’t believe it’s been nearly a month since my last post! I feel like I’ve been neglecting this a bit more than I should, but I’ll try to rectify that as best I can.

For an indication of how busy I’ve been: I sat down to update my resume yesterday, and ended up adding three papers (all in submission) and two posters. That just about doubles what was there previously in the papers section.

Anyhow, next-generation sequencing doesn’t stand still, so I thought I’d outline some of the things I want to talk about in my next posts, and set up a few pointers to other resources:

1. SeqAnswers. This aptly named forum has been around for a few months now, but has recently become more popular, and is a great place for discussing relevant next-gen technology and sequencing methods. I’m especially happy to see the automated posts triggered by new literature on the subject, which are a great resource for those of us who are busy and forget to check for new publications ourselves.

2. One thread in particular is of great interest: benchmarking different aligners. It appears to be a well-done comparison (if lightweight) that may be a good focus for other people who are interested in comparing aligners and discussing them in a wider forum.

3. For people interested in ChIP-Seq, or chromatin immunoprecipitation with massively parallel sequencing, I’ve finally gotten around to posting FindPeaks 3.1 on the web. I’d consider this release (3.1.3) an alpha release. I’d love to get more people contributing by using this application and telling me what could be improved, or what enhancements they’d like to see. I’m always happy to discuss new features, and can probably add most of them with a relatively quick turnaround time.

4. For people interested in assessing the quality of whole transcriptome shotgun sequencing (WTSS), I’m about to break out a tool that should fit that purpose. If anyone is interested in giving suggestions on ways they’d like to see quality tests performed, I’d be more than happy to code those into my applications. (And yes, if you contribute to the tool, I will provide you with a copy to use. Collaborations, etc., can be discussed as well.)

5. A quick question, of which I’ll post more in the future. Has anyone here managed to get Exonerate 2.0 to work in client/server mode on two separate machines?

6. I’ll also post a little more about this in the future: build environments, Ant and Java. Why are people still doing large projects in Perl?

7. One last thing I wanted to mention. I was going to write more on this topic, but eh… I’ll let Slashdot do it for me: the more you drink, the less you publish. Well, so much for keeping a bottle of tequila under the desk. Now I know what to get the competition for x-mas, though…

Cheers!

>Transcriptome sequencing.

>In an earlier comment, Jason Stajich asked:

What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really you want the per-transcript numbers). Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3′ biased tags?

Though I’ve been taking my time about answering, it’s a really good question. I’ve been working with transcriptomes for a while now, and have a fair amount of experience with them. I don’t want to give away all of my secrets, but I can give a few pointers. If anyone wants to collaborate on something, you know where to find me. (-;

So, first things first: with transcriptome sequencing using Illumina-based sequencing, each read you get is presumably from a single molecule of DNA, which presumably came from a single molecule of cDNA from your library. I can’t speak for all of the protocols used by the labs here at the Genome Science Centre, but the results I’ve seen show a very close correlation with expression levels measured by Nimblegen/Affymetrix DNA arrays, and so I tend to believe that the number of tags we’re observing per region (e.g. gene/transcript/exon) is a direct (or nearly direct) measurement of the RNA levels in the general cell population used to create the library.

I should also mention that this is very dependent upon the protocols being used. If your protocol involves amplifying the cDNA with the use of PCR, you’re really not going to maintain that relationship. Consult an expert on this subject, if you plan to try this at home. (-;

The other questions Jason asked are not quite as straightforward. We have a protocol here at the GSC that gives pretty darn even coverage across transcripts as a whole, which means that transcript end bias is pretty minimal. That largely negates the need to look at biases at all. Of course, this comes down to a lot of lab technique (which is totally outside the scope of my post), as it seems to be dependent on following the appropriate protocols. I’ve seen libraries that are completely skewed, libraries that perfectly follow the transcript outlines, and libraries somewhere in between. As it stands, I now run my tools over each data set as it arrives, to judge its quality before I ask for more lanes.

So, the short answer is: no, I don’t normalize or exclude any data when I deal with transcriptomes, but I’m in the fortunate position of being able to identify (and accept or reject!) which data sets meet a reasonable level of quality before I process them.
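For concreteness, here’s a minimal sketch of the length-normalized counting Jason asked about. The function names are hypothetical, and as I said above, with an evenly-covered library the normalization is an optional correction rather than a necessity:

```python
from collections import Counter

def expression_per_transcript(read_transcripts, transcript_lengths):
    """Tags per transcript, normalized to tags per kilobase.

    `read_transcripts` is one transcript ID per aligned read;
    `transcript_lengths` maps transcript IDs to lengths in bases.
    """
    counts = Counter(read_transcripts)
    return {t: counts[t] / (length / 1000.0)
            for t, length in transcript_lengths.items()}

levels = expression_per_transcript(
    ["geneA", "geneA", "geneB"], {"geneA": 2000, "geneB": 500})
print(levels)  # {'geneA': 1.0, 'geneB': 2.0}
```

Note how the shorter transcript wins after normalization despite having fewer raw tags; that’s exactly the per-transcript effect the length correction is meant to capture.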

>Aligning DNA – comments from above

>I’ve been pretty bad about continuing my posts on how the different aligners work. It’s a lot of work keeping up with them, since I seem to hear about a new one each week. However, a post-doc in my lab gave a presentation on contrasting the various aligners, to discuss each of their strengths and weaknesses for doing short (Illumina) read alignments.

Admittedly, I don’t know how accurate the presenter’s data was – most of the presentation was being used to set up his own in-house aligner development, and thus all of the aligners were painted in a poor light, except his, of course. That being said, there’s some truth to what he found: most of the aligners out there have some pretty serious issues.

Eland is still limited by its 32-base limit, which you’d think they’d have gotten past by now. For crying out loud, the company that produces it is trying to sell kits for doing 36-base reads. It’s in their best interest to have an aligner that does more than 32 bases. (Yes, they have a new work-around in their Gerald program, but it’s hardly ideal.)

MAQ, apparently, has a weird “feature” that if multiple alignments are found, it just picks one at random as the “best”. Hardly ideal for most experiments.

Mosaik provides output in .ace files – which are useless for any further work, unless you want to reverse engineer converters to other, more reasonable, formats.

SOAP only aligns against the forward strand! (How hard can it be to map the reverse complement???)

Exonerate is great when run in “slow mode”, at which point it’s hardly usable for 40M reads, and when it’s run in “fast mode”, its results are hardly usable at all.

SHRiMP, I just don’t know enough about to comment on.

And yes, even the post-doc’s in-house aligner (called Slider) has some serious issues: it’s going to miscall all SNPs, unless you’re aligning fragments from the reference sequence back to itself. (That’s not counting the 20 hours I’ve already put in to translate the thing to proper Java, patching memory leaks and the like…)

Seriously, what’s with all of these aligners? Why hasn’t anyone stepped up to the plate and come up with a decent open-source aligner? There have got to be hundreds of groups out there struggling to make these work, and not one of them is ideal for use with Illumina reads. Isn’t there one research group out there dog-fooding their own Illumina sequence aligner?

At this rate, I may have to build my own. I know what they say about software, though: You can have fast, efficient or cheap – pick any two. With aligners, it seems that’s exactly where we’re stuck.

>AGBT post #2.

>Good news… my bag arrived! I’m going to go pick it up after the current session, and finally get some clean clothes and a shave. Phew!

Anyhow, on the AGBT side of things, I just came back from the Pacific Biosciences panel discussion, which was pretty neat. The discussion was on “how many base pairs will it take to enable personalized medicine?” – a topic I’m really quite interested in.

The answers stretched from infinite, to 6 billion, to 100TB, to 100 people (if they can pick the right people), to 1 (if they find the right one). It was a pretty decent discussion, covering things from American politics, to SNP finding, to healthcare… you get the idea. The moderator was also good – the host of a show (Biotechworld?) on NPR.

My one problem is that in giving their answers, they brushed on several key points, but never really followed up on them.

1) Just having the genome isn’t enough. Stuff like transcription factor binding sites, methylation, regulation, and so forth are all important. If you don’t know how the genome works, personal medicine applications aren’t going to fall out of it. (Elaine Mardis did mention this, but there was little discussion of it.)

2) Financial aspects will drive this. That, in itself, was mentioned, but the real paradigm shifts will happen when you can convince the U.S. insurance companies that preventive medicine is cheaper than treating illness. That’s only a matter of time, but I think it will drive FAR more long-term effects than having people’s genomes. (If insurance companies gave obese people a personal trainer and cooking lessons, assuming their health issues are diet related, they’d save a bundle by not having to pay for diabetes medicine, heart surgery, and associated costs… but targeting people for preventive treatment requires much more personal medicine than we have now.)

Other points that were well covered include the effect of computational power as a limiting agent in processing information, the importance of sequencing the right people, and how it’s impossible to predict where the technology will take us, both morally and scientifically.

Anyhow, as I’m typing this while sitting in other talks:

Inanc Birol, also from the GSC, gave a talk on his work on a new de novo assembler:

80% reconstruction of the C. elegans genome from 30x coverage: data preparation took 6 hours (on 10 CPUs), and the assembly itself ran in less than 10 minutes on a single CPU, using under 4Gb of RAM.

There you go… The question for me (relevant to the last posting) is “how much of the remaining 20% has poor sequencability?” I’m willing to bet they’re one and the same.

And I just heard a talk on SSAHA_pileup, which seems to try to sort out SNPs. Unfortunately, every SNP-caller talk I see assumes 30X coverage… How realistic is that for human data? Anyhow, I’m sure I missed something. I’ll have to check out the slides on slideshare.net, once they’re posted.

And the talks continue….

btw, remind me to look into the fast Smith-Waterman in cross_match – it sounds like it could be useful.

>AGBT post #1.

>I’m here… and online. I almost didn’t make it, thanks to bad weather in Florida, but at least the car we rented didn’t break down on the road, the way the other group’s did. Apparently the police saved them from the alligators and wild pigs… No one can say AGBT hasn’t been exciting so far.

Anyhow, lots of good topics, and I’m meeting interesting people already. (I’m even sitting beside an exec from Illumina at the ABI-sponsored lunch… how’s that for irony?) I’m excited to start the poster sessions and get some good discussions going.

Unfortunately, I missed two of the talks this morning, while I negotiated with the good people at United airlines to have my bag delivered. The three others I’ve seen so far have been good. Some interesting points:

The best graphics are the ones with the two DNA strands shown separately. Too cool – must include that in my FindPeaksToR scripts.

Loss or gain of heterozygosity can screw up what you think you have, compared to what’s really there. Many models assume you have only one copy of a gene, or just don’t really do much to make sense of these events.

From David Cox, I learned that Barcoding isn’t new (it’s not), but that it usually doesn’t work well (I can’t prove that), but hey, they got it to work (and that’s good!).

And yes, my favorite line from David Cox’s presentation was something like: 900 PCR products, 90 people[‘s samples]… 1 tube. Make sure you don’t drop it!

Anyhow, I’m getting lots of ideas, and I’m thoroughly enjoying this conference. I’m saturated in next-gen sequencing work.

Anyhow, if anyone else is reading this, my poster is #38… feel free to come by and talk.

>AGBT and Sequencability

>First of all, I figured I’d try to do some blogging from AGBT while I’m there. I don’t know how effective it’ll be, or even how real-time, but we’ll give it a shot. (Wireless in Linux on the Vostro 1000 isn’t particularly reliable, and I don’t even know how accessible the internet will be.)

Second, what I wrote yesterday wasn’t very clear, so I thought I’d take one more stab at it.

Sequencability (or mappability) is a direct measure of how well you’ll be able to sequence a genome using short reads. Thus, by definition, the success of de novo sequencing of a genome is a direct result of the sequencability of that genome. Unfortunately, when people talk about sequencability, they talk about it in terms of “X% of the genome is sequencable”, which really means “sequencability is not zero for X% of the genome.”

Unfortunately, even if sequencability is not zero, it doesn’t mean you can generate all of the sequences (even if you could do 100% random fragments, which we can’t), indicating that much of the genome beyond that magical “X% sequencable” is still really not assemblable. (Wow, that’s such a bad word.)

Fortunately, sequencability is a function of the length of the reads used, and as the read length increases, so does sequencability.
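As an illustration, here’s a toy Python definition of mappability over a miniature “genome” – a hypothetical, single-strand simplification of the real metric, just to show the read-length effect:

```python
from collections import Counter

def mappability(genome, read_len):
    """Fraction of start positions whose read sequence occurs exactly once.

    A toy, single-strand definition of sequencability: a position is
    'mappable' if the read starting there has a unique sequence.
    """
    kmers = [genome[i:i + read_len]
             for i in range(len(genome) - read_len + 1)]
    counts = Counter(kmers)
    return sum(1 for k in kmers if counts[k] == 1) / len(kmers)

genome = "ACGTACGTTT"
print(mappability(genome, 4))  # the repeated ACGT makes two starts ambiguous
print(mappability(genome, 6))  # longer reads span and resolve the repeat
```

The repeat that defeats the 4-mers is fully resolved by the 6-mers, which is the whole argument in miniature: sequencability only goes up as read length grows.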

Thus, there’s hope that if we increase the read length of the Illumina machines, or someone else comes up with a way to do longer sequences with the same throughput (e.g. ABI Solid, or 454’s GS FLX), the assemblability of the genome will increase accordingly. All of this goes hand in hand: longer reads and better lab techniques always make a big contribution to the end results.

Personally, I think the real answer lies in using a variety of techniques: Paired-End Tags to span difficult-to-sequence areas (e.g. low or zero sequencability regions), and Single-End Tags to get high coverage… and hey, throw in a few BAC and EST reads for good luck. (=

>Comments on de novo assembly

>First off, I’d like to say thanks to Paul and stajich for their comments. Paul for raising several points that I’d like to discuss tonight, and stajich for bringing my attention to the SHRiMP aligner. Obviously, I haven’t had much chance to look at SHRiMP, yet, but I’ll definitely get around to it.

So, Paul mentioned several things that are worth discussing:

The Velvet roadmap for mRNA could be interesting. Sometimes the intron is snipped, sometimes it isn’t, get a nice little bubble. And the similar transcripts will do… interesting things.

Short reads would be fine for de-novo if DNA was completely random. Pity that it isn’t.

For the most part, I agree with Paul: using Velvet on a transcriptome would be pretty cool. Yes, you’d get bubbles, but you’d also get a lot of nifty information about new transcripts, or intermediate forms of transcripts. This neatly avoids one of the major hurdles I currently face when working with transcriptomes: I can either align against a genome or against a list of known genes, and neither one really gives me much opportunity to identify new genes. (Of course, I’m working on doing just that with a colleague of mine, but I’m saving that for another post.)

Where I disagree with Paul, however, is his final statement: that short reads would be fine for de novo assembly if they were truly random. I have two problems with that: the first is mappability (or sequencability), and the second is the short nature of the reads themselves.

Mappability has been defined in many different ways, but for general purposes, it’s whether a read sequenced from a given location can be unambiguously aligned back to that location on a chromosome. Fortunately, with 36-mer reads, as are produced by Illumina 1Gs, something like 70-80% of the genome is mappable. Not 100% mappable, but mappable to some degree. This may seem trivial, but it’s important.

Why it’s important: a genome with 95% mappability doesn’t mean you get a single contig covering 95% of the genome, plus a chunk of 5% of your reads that don’t map to it. It’s more like every 20-100kb you’ll find something that you just can’t assemble over. (I’m guesstimating that number, by the way.) This means you have lots of small contigs with no overlap. That is, of course, assuming you had enough highly mappable reads to do a relatively error-free assembly, and, also of course, assuming your sequencing errors haven’t completely interfered with your assembly. I don’t think either can really be taken for granted.

In reality, the mappability of a genome isn’t a property you can figure out until you’ve sequenced it, so I don’t want to discourage anyone from trying. I’m aware of people working on using Illumina reads to do this, though they’ve supplemented the reads with BAC sequencing, ESTs and various other options, which provide a nice scaffolding. This approach allows them to augment their sequencing power by obtaining a lot of coverage through the Illumina runs, rather than having to use them as the basis of a serious de novo assembly – which seems wise to me. (And even then, they’re only doing a fungus – not a higher vertebrate!)

And to return to my earlier point about the mappable genome not being 100% mappable: this is where I think Paul’s point oversimplifies. Although some bases may be only 50% mappable, and are thus considered “mappable” in common discussion, that means that 50% of the possible sequencing reads in which they’re likely to participate will not yield an unambiguous fragment! That means you can say goodbye to half of the likely fragments you would need to generate an overlap spanning two forming contigs. That, to me, indicates that any de novo assembly is unlikely to proceed correctly past that base – and 50% mappability is not unusual in the genome.

The other point of Paul’s that I wanted to discuss was the assertion that the non-random fragmenting of DNA is what’s stopping us from doing a serious de novo assembly. While it’s true that shearing DNA isn’t an entirely random process, it’s also true that it doesn’t need to be. People have been doing restriction enzyme digests for decades now, in order to map vectors and inserts. (I learned how to do it in grade 11 biology, and that was back in 1994.) So yes, while sonication or digests may not be random, what’s to stop someone from stripping DNA of its histones and then doing 25 different digests? The net effect is just about the same (assuming you pick 25 good restriction enzymes with different recognition sites), and will yield fragments that get around Paul’s issue. But does that mean we can now do a de novo assembly?

No… I don’t think so. I doubt the lack of a random fragmentation pattern is the limiting factor in de novo assemblies. Unfortunately, the only way to prove myself or Paul right or wrong is to simulate it: take the mouse genome, fragment it into random 36-mers (the same size you get from an Illumina sequencing run), then inject 1 random base for every 1000 bases read (I’ll be generous and assume a 0.1% error rate, though the real thing is closer to 3-5% from what I’ve seen), and then try running Velvet on it.

I bet you’ll observe that somewhere around 40x coverage you’ll start to saturate, and discover that your assembly has covered anywhere from 60-80% of the genome (with, say, 5% of that just plain wrong), and that it’ll have taken you longer to do that assembly than it would have taken to just sequence the damned thing in the first place with PCR.
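For anyone who wants to try it, here’s a quick sketch of the read simulator I’m proposing. The parameters are just the ones suggested above, and the toy genome is a random stand-in – swap in real mouse sequence and feed the output to Velvet to test the saturation claim:

```python
import random

def simulate_reads(genome, read_len=36, coverage=40,
                   error_rate=0.001, seed=42):
    """Generate uniformly random short reads with substitution errors.

    Produces enough reads for the requested fold-coverage, with roughly
    1 miscalled base per 1000 bases read (the generous 0.1% rate).
    """
    rng = random.Random(seed)
    n_reads = coverage * len(genome) // read_len
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        yield "".join(read)

toy_genome = "".join(random.Random(1).choice("ACGT") for _ in range(1000))
reads = list(simulate_reads(toy_genome))
print(len(reads))      # 40x over 1000 bases in 36-mers -> 1111 reads
print(len(reads[0]))   # 36
```

The simulation is deliberately uniform, i.e. perfectly random fragmentation; if assembly still saturates short of the full genome, the fragmentation pattern wasn’t the limiting factor.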

Anyhow, I take Paul’s point to heart: we’re still early on in this game, and there are a lot of neat things we can try. I’m looking forward to seeing the results of many of them. (Who has time to try them all?)

By the way, FindPeaks 3.0.1 is done, and I’ll be presenting it as a poster at AGBT 2008 this week. If you’re interested in ChIP-Seq/Chip-Solexa (or for the pedantic, ChIP-Illumina), come find me, and I’ll tell you some of its new features.

>Solexa Transcriptome Shotgun: Transcriptome Alignments vs. Genome Alignments

>First off, I was up late trying to finish off one of my many projects, so I didn’t get a lot of sleep. Thus, if my writing is more incoherent than usual, that’s probably why. And now on with the show.

I’m jumping gears a bit today. I haven’t finished my discussion of aligners – I still want to talk about Exonerate in detail, and then cover some of the other aligners in overview. (For instance, the one I found out about today, called GMAP: a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that the purpose of using an aligner is to align to something in particular, such as a genome or a contig, but selecting what you align your sequences back to is a big issue.

When you’re re-sequencing a genome, you map back to the genome template. Yes, you’re probably sequencing a different individual, so you’ll find occasional sections that don’t match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you’re seeing are already commonly known changes. Of course, if you’re re-sequencing Craig Venter, you don’t need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes.

When you’re sequencing a genome you’ve never sequenced before, you can’t do mapping at all; instead, you use one of the various assemblers, e.g., Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add…), SSAKE (written by Rene Warren, who incidentally also works at the GSC, although several floors above me), or Euler (which I’d never heard of till I googled the web page for Velvet…). The key point: you don’t need to worry about what you’re mapping back to when you do de novo assembly, since you’re creating your own map. I’ll digress further for one brief comment: assembly from Solexa/Illumina sequences alone is a bad idea, because they’re so short!

Moving right along, we come to the third thing people are sequencing these days: transcriptomes. (Yes, I’m ignoring cloned libraries… they’re so 90’s!) Transcriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I’m ignoring some complexity here, because it makes for a better story.) But I’m getting ahead of myself. Let’s talk about the mRNA, but be aware that the sequencing is actually done on cDNA.

mRNA is an interesting beast. Unlike genomic DNA, it’s a more refined creature. For eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (called introns) and leaves the others (called exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e. the exons), but if you take a small random snippet from the mRNA, there’s a chance that it will overlap the boundary between two exons, which means the bases you have won’t map back nicely to the genomic point of origin. That can be a serious problem.

Sure, you say, we can do a gapped alignment and try to find the two points where this sequence originated, with one big gap. If you’re sophisticated, you’ll even know that introns usually carry signals that indicate their presence. And yes, you’d be right: we can do that. Unfortunately, a typical Solexa run gives you 20,000,000+ sequences. At 10 seconds a sequence (which doesn’t seem like much, really), that’s 200 million seconds of compute – over six CPU-years.

Too long.

So most of the time, we don’t do gapped alignments. Instead, we have two choices:

  1. Align against the genome, and throw away reads that we can’t align (i.e. those that overlap intron/exon boundaries.)
  2. Align against a collection of known coding DNA sequences

Number two isn’t a bad option: it already has all the introns spliced out, so you don’t need to worry about missing those alignments. Unfortunately, there are several issues with this approach:

  • Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for animals like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.)
  • Many transcripts are very similar. This isn’t a problem with the data, but with your alignment program. If your aligner doesn’t handle multi-matches (like Eland), this will never work.
  • Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.
  • Aligning to a transcript is hard to visualize. This is going to be one of my next projects… but with all the fantastic genomes out there, I’m still not aware of a good transcriptome visualization tool.

And that brings us to the conclusion. Aligning a transcriptome run against a genome or against a transcriptome both have serious problems, and there really are no good solutions for this yet.

For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven’t released my code for either one, yet, but they both work, and if you contact my supervisor, he’d probably be interested in collaborating, or possibly sharing the code.

>Aligning DNA – Eland

>Continuing in my series of blog articles on short-read sequencing, it’s time for a little discussion of the first short-sequence aligner I’d ever met: Eland. The problem with writing anything about Eland is that 90% of the information I have on it can’t be confirmed independently. In fact, the majority of what I know comes from the header of the ELAND.cpp file. I’ll admit that while I’ve gone through the source code a couple of times, I haven’t spent much time getting to know it well – I’ve processed billions of lines of Eland files, but have yet to run it myself…

Anyhow, here’s what I can tell you: Eland stands for “Efficient Local Alignment of Nucleotide Data” and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it’s a direct departure from the dynamic programming based algorithms (like the venerable Smith-Waterman alignment), making it somewhat interesting as a computational program.

The basic algorithm works along these lines: given a sequence of length N, it can be divided into four subsequences (A, B, C, and D) of equal (or nearly equal) length. Assuming there are no more than 2 errors in the sequence, at least two of these subsequences must be error free, and those error-free subsequences can be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequences AB and CD. Of course, you don’t know in advance which subsequences are the error-free ones, so you need to try all possible combinations.

What Eland does to speed things up is to combine these subsequences into sets of two. Thus, rather than searching for 4 independent entries in the database, they only have to search for 2 sets of two, and as long as one set matches exactly, you can use looser matching criteria on the other set, i.e., allowing for mismatches. That is to say, if you make two sequences out of the 4 subsequences ({AB} and {CD}), you can search for an exact match for {AB}, and test all of the resulting candidate locations for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space (to only those sequences containing an exact match to {AB}).

Searching for the {AB and CD} combination would only work if the first half of your sequence has no errors. What if B and C had the errors? The simple answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and BC}, {BA and CD}, etc.) This still provides you with a relatively small search space for each sequence, as there are only 4! possible orderings (4x3x2x1 = 24 possible sequences) to search for.

Fortunately, you can bound this even further. You always know that you want combinations in which the subsequences appear in their original order ({AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result), limiting you to only six possible combinations – which still lets you find any correct match where at least two of the four subsequences are error free.

That, in general, is how it works… with a few nifty shortcuts: e.g., they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because they can search for several subsequence combinations simultaneously and efficiently. In other words, you get a very quick alignment.
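To make the pair trick concrete, here’s a toy Python sketch of the quarter-splitting scheme described above – my own illustration of the idea, not Eland’s actual implementation:

```python
from itertools import combinations

def quarter_pairs(read):
    """Split a read into four near-equal pieces and yield the six
    piece-pairs an Eland-style search would look up exactly."""
    n = len(read)
    cuts = [0, n // 4, n // 2, 3 * n // 4, n]
    pieces = [read[cuts[i]:cuts[i + 1]] for i in range(4)]
    for i, j in combinations(range(4), 2):   # AB, AC, AD, BC, BD, CD
        yield (i, j), pieces[i] + pieces[j]

# With at most 2 mismatches, at most 2 of the 4 pieces are hit, so at
# least one of the 6 pairs still matches the reference exactly.
reference = "ACGAACGCACGTACGG"
read      = "AGGAACGCACGTACGT"   # 2 substitutions: one in A, one in D
ref_keys = dict(quarter_pairs(reference))
exact = [pair for pair, key in quarter_pairs(read) if ref_keys[pair] == key]
print(exact)  # [(1, 2)] -- the BC pair is error-free
```

The surviving pair serves as the exact-match database key; the remaining two pieces are then checked with mismatches allowed, which is where the speedup comes from.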

On the other hand, the way in which ELAND was implemented is very important. The Eland aligner has some severe restrictions, some of which are algorithm specific, some of which are implementation specific:

  1. The Eland aligner can only align sequences up to 32 base pairs long. (Implementation specific: they use four integers to represent the subsequences {A,B,C and D}, which reduces the memory requirements.)
  2. The Eland aligner can only align sequences with up to two mismatches. (Algorithm specific: at least two of the keys must match, although the algorithm could be expanded to use more subsequences, at the expense of expanding the sequence space.)
  3. The Eland aligner only returns sequences that match to one unique location in the genome (Implementation specific: Eland already tells you when a sequence matches to more than one location, it just doesn’t tell you what those locations are.)
  4. The Eland aligner can do mismatches in the sequence, but can not handle insertions or deletions (i.e, gapped alignments). (Algorithm specific: The algorithm for matching the keys could be expanded to include this, but it would be very costly in time/memory.)
  5. The code is not freely available. (Implementation specific: I don’t know of anywhere you can go to get the code… or documentation!)
  6. Changing the length of the alignment sequence (I called it N, above), requires a recompile. (Implementation, obviously. Why this isn’t a command line parameter, I don’t know.)

I’m sure I could go on and list other disadvantages, but I’d like to point out the main advantage: it’s FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners, with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, compared to BLAST). (Comparisons are for a single processor, matching 1 million+ reads to a mammalian-size genome.) The only other aligner that comes close to this speed is, to the best of my knowledge, SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000CAD to license.

Anyhow, to close this discussion, it’s worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of performing searches on sequences larger than 32 base pairs in length and able to perform multi-matches, were supposed to be released in September 2007. Unfortunately, the new betas appear to be vapourware, which has prompted us to look into several other aligners, which I’ll discuss in future posts.

(2008-01-28 Update: I stand corrected: I’ve heard that the new betas have finally arrived! I haven’t seen what they can do yet, but at least we know my prediction that they’re vapourware is wrong.)