>Science Cartoons – 3

>I wasn’t going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that’s generally called “fair use” in the US, and copyright laws in most countries have similar exceptions), but you can’t publish it, take credit for it, or profit from it without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don’t steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

>Searching for SNPs… a disaster waiting to happen.

>Well, I’m postponing my planned article, because I just don’t feel in the mood to work on that tonight. Instead, I figured I’d touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they’re not SNPs, they’re variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They’re only polymorphisms if they’re common to a portion of the population.

In this case, they’re from cancer cell lines, so after I filter out all the real SNPs, what’s left are SNVs… and they’re bloody annoying. This is the second major project I’ve done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymers were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs – a simple trick, easily accomplished by visualizing the data.

We moved onto Illumina, after that. Actually, it was still Solexa at the time. Yes, this is older data – nearly a year old. It wasn’t particularly reliable, and I’ve now used several different aligners, references and otherwise, each time (I thought) improving the data. We came down to a couple very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked… and lo and behold. 0/2. Neither of them are real. So, now comes the post-mortem: Why did we get the false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious… do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest for answering the question “why?” Why do we get unexpected results?

Fortunately, I’m a scientist, so that question is really something I like. I don’t begrudge the last year’s worth of work – which apparently is now more or less down the toilet – but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I’m hooked, otherwise I’d have tossed in the towel long ago.

>Genomics Forum 2008

>You can probably guess what this post is about from the title – which means I still haven’t gotten around to writing an entry on thresholding for ChIP-Seq. Actually, it’s probably a good thing I haven’t, as we’ve been learning a lot about thresholding in the past week. It seems many things we took for granted aren’t really the case. Anyhow, I’m not going to say too much about that, as I plan to collect my thoughts and discuss it in a later entry.

Instead, I’d like to discuss the 2008 Genomics Forum, sponsored by Genome BC, which took place on Friday – though, in particular, I’m going to focus on one talk, near to my own research. Dr. Barbara Wold from Caltech gave the first of the science talks, and focussed heavily on ChIP-Seq and Whole Transcriptome Shotgun Sequencing (WTSS). But before I get to that, I wanted to mention a few other things.

The first is that Genome BC took a few minutes to announce a really neat funding competition, which really impressed me, the Genome BC Science Opportunities Fund. (There’s nothing up on the web page yet, but if you google for it, you’ll come across the agenda for Friday’s forum in which it’s mentioned – I’m sure more will appear soon.) Its whole premise revolves around the question: “Are there experiments that we need to be doing, that are of strategic importance to the BC life science community?” I take that to mean, are there projects that we can’t afford not to undertake, that we wouldn’t have the funding to do otherwise? I find that to be very flexible, and very non-academic in nature – but quite neat. I hope the funding competition goes well, and I’m looking forward to seeing what they think falls into the “must do” category.

The second was the surprising demand for Bioinformaticians. I’m aware of several jobs for bioinformaticians with experience in next-gen sequencing, but the surprise to me was the number of times (5) I heard people mention that they were actively recruiting. If anyone with next-gen experience is out there looking for a job (post-doc, full time or grad student), drop me a note, and I can probably point you in the right direction.

The third was one of the afternoon talks, on journalism in science, from the perspective of traditional news paper/tv journalists. It seems so foreign to me, yet the talk touched on several interesting points, including the fact that journalists are struggling to come to terms with “new media.” (… which doesn’t seem particularly new to those of us who have been using the net since the 90′s, but I digress.) That gave me several ideas about things I can do with my blog, to bring it out of the simple text format I use now. I guess even those of us who live/breath/sleep internet don’t do a great job of harnessing it’s power for communicating effectively. Food for though.

Ok… so on to the main topic of tonight’s blog: Dr. Wold’s talk.

Dr. Wold spoke at length on two topics, ChIP-Seq and Whole Transcriptome Shotgun Sequencing. Since these are the two subject I’m actively working on, I was obviously very interested in hearing what she has to say, though I’ll comment more on the ChIP-Seq side of things.

One of the great open questions at the Genome Sciences Centre has been how to do an effective control for a ChIP-Seq experiment. It’s not something we’ve done much of, in the past, but the Wold lab demonstrated why they’re necessary, and how to do them well. It seems that ChIP-Seq experiments tend to yield fragments in several genomic regions that have nothing to do with the antibody or experiment itself. The educated guess is that these are caused by hypersensitive sites in the genome that tend to fragment in repeatable patterns, giving rise to peaks that appear in all samples. Indeed, I spend a good portion of this past week talking about observations of peaks exactly like that, and how to “filter” them out of the ChIP-Seq results. I wasn’t able to get a good idea of how the Wold lab does this, other than by eye, (which isn’t very high throughput), but knowing what needs to be done now, it shouldn’t be particularly difficult to incorporate into our next release of the FindPeaks code.

Another smart thing that the Wold lab has done is to separate the interactions of ChIP-Seq into two different types: Type 1 and Type 2, where Type 1 refers to single molecule-DNA binding events, which give rise to sharp peaks, and very clean profiles. These tend be transcription factors like NRSF, or STAT1, upon which the first generation of ChIP-Seq papers were published. Type 2 interactomes tend to be less clear, as they are transcription factors that recruit other elements, or form complexes that bind to the DNA at specific sites, and require other proteins to bind to encourage transcription. My own interpretation is that the number of identifiable binding sites should indicate the type, and thus, if there were three identifiable transcription factor consensus sites lined up, it should be considered a Type 3 interactome, though, that may be simplifying the case tremendously, as there are, undoubtedly, many other proteins that must be recruited before any transcription will take place.

In terms of applications, the members of the wold lab have been using their identified peaks to locate novel binding site motifs. I think this is the first thing everyone thinks of when they hear of ChIP-Seq for the first time, but it’s pretty cool to see it in action. (We also do it at the GSC too, I might add.) The neatest thing, however, was that they were able to identify a rather strange binding site, with two halves of a motif, split by a variable distance. I haven’t quite figured out how that works, in terms of DNA/Protein structure, but it’s conceptually quite neat. They were able to show that the distance between the two halves of the structure vary by 10-20 bases, making it a challenge to identify, for most traditional motif scanners. Nifty.

Another neat thing, which I think everyone knows, but was cool to hear that it’s been shown is that the binding sites often line up on areas of high conservation across species. I use that as a test for my own work, but it was good to have it confirmed.

Finally, one of the things Dr. Wold mentioned was that they were interested in using the information in the directionality of reads in their analysis. Oddly enough, this was one of the first problems I worked on in ChIP-Seq, months ago, and discovered several ways to handle it. I enjoyed knowing that there’s at least one thing my own ChIP-Seq code does that is unique, and possibly better than the competition. (-;

As for transcriptome work, there were only a couple things that are worth mentioning. The Wold lab seems to be using MAQ and a list of splice junctions assembled from annotated exons to map the transcriptome sequences. I’ve heard that before, actually, from someone at the GSC who is doing exactly the same thing. It’s a small world. I’m not really a fan of the technique, however. Yes, you’ll get a lot of the exon junction reads, but you’ll only find the ones you’re looking for, which is exactly the criticism all the next-gen people throw at the use of micro-arrays. There has got to be a better solution… but I don’t yet know what it is. (We thought it was Exonerate, but we can’t seem to get it to work well, due to several bugs in the software. It’s clearly a work in progress.)

Anyhow, I think I’m going to stop here. I’ll just sum it all up by saying it was a pretty good talk, and it’s given me lots of things to think about. I’m looking forward to getting back to coding tomorrow.

>Catching up….

>I can’t believe it’s been nearly a month since my last post! I feel like I’ve been neglecting this a bit more than I should, but I’ll try to rectify that as best I can.

For an indication of how busy I’ve been, I sat down to update my resume yesterday, and ended up adding 3 papers (all in submission) and two posters. That just about doubles what was in there previously in the papers section.

Anyhow, Next-generation sequencing doesn’t stand still, so I thought I’d outline some of the things I want to talk about in my next posts, and set up a few pointers to other resources:

1. SeqAnswers. This aptly named forum has been around for a few months now, but has recently become more popular, and a great forum for discussing relevant Next-gen technology and sequencing methods. I’m especially happy to see the automated posts triggered by new literature on the subject, which are a great resource for those of us who are busy and forget to check for new publications ourselves.

2. There’s one forum in particular that’s of great interest: Benchmarking different aligners. This appears to be a well done comparison (if lightweight) that may be a good focus for other people who are interested in comparing aligners, and discussing it in a wider forum.

3. For people interested in ChIP-Seq, or Chromatin immunoprecipitation and massively parallel sequencing, I’ve finally gotten around to posting FindPeaks 3.1 on the web. I’d consider this release (3.1.3) an alpha release. I’d love to get more people contributing by using this application and telling me what could be improved on it, or what enhancements they’d like to see. I’m always happy to discuss new features, and can probably add most of them in with a relatively quick turn around time.

4. For people interested in assessing the quality of the whole transcriptome shotgun sequencing (WTSS), I’m about to break out a tool that should fit that purpose. If anyone is interested in giving suggestions on ways they’d like to see quality tests performed, I’d be more than happy to code those into my applications. (And yes, if you contribute to the tool, I will provide you a copy of the tool to use. Collaborations, etc, can be discussed, as well.)

5. A quick question, of which I’ll post more in the future. Has anyone here managed to get Exonerate 2.0 to work in client/server mode on two separate machines?

6. I’ll also post a little more about this in the future: building environments, ant and java. Why are people still doing large projects in perl?

7. One last thing I wanted to mention. I was going to write more on this topic, but eh… I’ll let slashdot do it for me: The more you drink, the less you publish. Well, So much for keeping a bottle of tequila under the desk. Now I know what to get the competition for x-mas, though…

Cheers!

>Transcriptome sequencing.

>In an earlier comment, Jason Stajich asked:

What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really you want the per-transcript numbers). Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3′ biased tags?

Though I’ve been taking my time about answering, its a really good question. I’ve been working with transcriptomes for a while, now, and have a fair amount of experience with it. I don’t want to give away all of my secrets, but I can give a few pointers. If anyone wants to collaborate on something, you know where to find me. (-;

So, first things first, with transcriptome sequencing using Illumina based sequencing, each read you get is presumably from a single molecule of DNA, which presumably came from a single molecule of cDNA, from your library. I can’t speak for all of the protocols used by the labs here at the Genome Science Centre, but the results I’ve seen have shown a very close correlation with expression levels measured by Nimblegen/Affymetrix DNA arrays, and so I tend to believe that the number of tags we’re observing per region(eg gene/transcript/exon) are a direct (or nearly direct) measurement of the RNA levels in the general cell population used to create the library.

I should also mention that this is very dependent upon the protocols being used. If your protocol involves amplifying the cDNA with the use of PCR, you’re really not going to maintain that relationship. Consult an expert on this subject, if you plan to try this at home. (-;

The other questions Jason asked are not quite as straight forward. We have a protocol here at the GSC that gives pretty darn even coverage across transcripts as a whole, which means that transcript end bias is pretty minimal. That totally negates the need to look at biases, or otherwise. Of course, this comes down to a lot of lab technique (which is totally outside the scope of my post), as it seems to be dependent on following the appropriate protocols. I’ve seen libraries which are completely skewed, libraries that perfectly follow the transcript outlines, and libraries somewhere in between. As it stands, I now run my tools over each data set as it appears to judge the quality before I ask for more lanes.

So, the short answer is: no, I don’t normalize or exclude any data when I deal with transcriptomes, but I’m in the fortunately position of being able to identify (and accept or reject!) which data sets meet a reasonable level of quality before I process them.

>Jason’s comment

>I received a comment on my posting last night from Jason Stajich, which (as always) is right on the money. To quote the main theme I want to reply to:

While I am all for people writing their own software, writing your own aligner seems more difficult than figuring out how to get what you want from a published (and used format) like ACE assembly files.

That’s absolutely correct. I think writing my own aligner is probably the last thing I want to do. While I have the source for at least our own internal aligner on my desktop this week (and have been doing some serious hacking on it), I really don’t want to have to build my own. It’s a distraction from my real research, it’s a pain to support, it’s going to have to compete with much larger groups doing similar things, and it’s probably not a long-term viable option. That said, I still think the current state of affairs in the realm of aligners is pretty poor – thought that’s not to say I think I could do it better. I’ll leave writing assemblers to the professionals – or at least those who have more time to do this stuff.

Still, maybe if I can offer one helpful suggestion to people writing aligners. Get together and use a standard format! I’m less inclined to test Yet Another Aligner when I have to write another filter to interpret the results. There are even currently efforts under way to create a unified format for aligned reads and short-read data. (Disclaimer: I’ve only read a couple handfuls of posts, and not all of it appears to be on topic.)

The formats we’re familiar with (such as fasta) aren’t really designed for this type of work. (It hasn’t stopped several aligners from using them however – which is still better than creating their own file formats, I might add.) What’s actually needed is a purpose built format.

I’m sure you can guess from this rant that I’m a big fan of stuff like the OpenDocument Format (ODF), which are levelling the playing field for word processing documents (despite MicroSoft’s best efforts to remain the dominant force at all costs), but even a limited (i.e. non-ISO standardized) approach could make a huge impact in this area.

Why not have a two day get together for all the people building aligners, and decide on a subset of formats? Make it versioned, so that it can change over time, and elect a maintainer of the official standard. While you’re at it decide on a few conventions – what’s a U0 hit? (Slider can have up to 9 SNPs and still call it a U0…) What is a read start? How do you represent stranded genomic information – which end is really the “start”?

Anyhow, I know it’s wishful thinking, but hey, maybe this rant will encourage a few people to band together and come up with something to make all of our lives easer. Even something as simple as Fasta freed up thousands (millions?) of man-hours (or woman-hours) for bioinformaticians, since they no longer need a custom format library for every new program they use. Maybe it’s time to do the same for aligned data.

P.S. I’ll answer Jason’s questions on EST/transcriptome approaches in my next post.

>Pacific Biotech new sequencing technology

>I have some breaking news. I doubt I’m the first to blog this, but I find it absolutely amazing, so I had to share.

Steve Turner from Pacific Biosciences (PacBio), just gave the final talk of the AGBT session, and it was damn impressive. They have a completely new method of doing sequencing that uses DNA polymerase as a sequencing engine. Most impressively, they’ve completed their proof of concept, and they presented data from it in the session.

The method is called Single Molecule Real Time (SMRT) sequencing. It’s capable of producing 5000-25,000 base pair reads, at a rate of 10 bases/second. (They apparently have 25bps techniques in development, and expect to release when they have 50bps working!)

The machinery has zero moving part, and once everything is in place, they anticipate that they’ll have a sequencing rate of greater than 100 Gb per hour! As they are proud to mention, that’s about a full draft genome for a human being in 15 minutes, and at a cost of about $100. Holy crap!

>Comments on de novo assembly

>First off, I’d like to say thanks to Paul and stajich for their comments. Paul for raising several points that I’d like to discuss tonight, and stajich for bringing my attention to the SHRiMP aligner. Obviously, I haven’t had much chance to look at SHRiMP, yet, but I’ll definitely get around to it.

So, paul mentioned several things that are worth discussing:

The Velvet roadmap for mRNA could be interesting. Sometimes the intron is snipped, sometimes it isn’t, get a nice little bubble. And the similar transcripts will do… interesting things.

Short reads would be fine for de-novo if DNA was completely random. Pity that it isn’t.

For the most part, I agree with Paul, Using Velvet on a transcriptome would be pretty cool. Yes, you’d get bubbles, but you’d get a lot of nifty information about new transcripts, or intermediate forms of transcripts. This neatly avoids one of the major hurdles I currently face when working with transcriptomes: I can either align against a genome, or a list of known genes, and neither one really gives me much opportunity to identify new genes. (Of course, I’m working on doing just that, with a colleague of mine, but I’m saving that for another post.)

Where I disagree with paul, however, is his final statement, that short reads would be fine for de novo assembly if they were truly random. I have two problems with that: The first is mapability (or sequencability), and the second is the short nature of the reads themselves.

Mappability has been defined in many different ways, but for general purposes, it’s the ability to identify a sequenced read that can be unambiguously aligned to that location on a chromosome. Fortunately, with 36-mer reads, as are produced by Illumina 1G’s, something like 70-80% of the genome is mappable. Not 100% mappable, but mappable to some degree. This may seem trivial, but it’s important.

Why it’s important is that a genome with 95% mapability doesn’t mean you get a single contig covering 95% of the genome, and a chunk of 5% of your reads that don’t cover your genome. It’s more like every 20-100kb you’ll find something that you just can’t assemble over. (I’m guestimating that number, by the way.) This means you have lots of small contigs that have no overlap. That is, of course, assuming you had enough highly mappable reads to do a relatively error free assembly, and, also of course, assuming your sequencing errors haven’t completely interfered with your assembly. I don’t think either can really be taken for granted.

In reality, the mappability of a genome isn’t a property you can figure out until you’ve sequenced it, so I don’t want to discourage anyone from trying. I’m aware of people working on using Illumina reads to do this, albeit they’ve supplemented the reads with BAC sequencing, ESTs and various other options, which will provide a nice scaffolding. This approach allows them to augment their sequencing power by obtaining a lot of coverage through the Illumina runs, rather than having to use them as the basis of a serious de novo assembly – which seems wise to me. (And even then, they’re only doing a fungus – not a higher vertebrate!)

And to return to my earlier point about the mappable genome not being 100% mappable, this is where I think paul’s point over simplifies. Although some bases may be 50% mappable, and are thus considered to be “mappable” in common discussion, that means that 50% of the possible sequencing reads in which they’re likely to participate will not yield an un-ambiguous fragment! That means you can say goodbye to 50% of the likely fragments which you would need to generate an overlap to span two forming contigs. That, to me, indicates that any de novo assembly is unlikely to correctly proceed past that base, and 50% mappability is not unusual in the genome.

The other point of paul’s that I wanted to discuss, was the assertion that the non-random fragmenting of DNA is what’s stopping us from doing a serious de novo assembly. While it’s true that shearing DNA isn’t an entirely random process, it’s also true that it doesn’t need to be. People have been doing restriction enzyme digests for decades now, in order to map vectors and inserts. (I learned how to do it in grade 11 biology, and that was back in 1994). So yes, while sonication or digests may not be random, what’s to stop someone from stripping DNA of it’s histones, and then doing 25 different digests? The net effect is just about the same (assuming you pick 25 good restriction enzymes with different base recognitions), and will yield fragments that will get around paul’s issue. But does that mean we can now do a de novo assembly?

No… I don’t think so. I doubt the lack of a random fragmentation pattern is the limiting factor in de novo assemblies. Unfortunately, the only way to prove myself or paul right or wrong is to simulate this: Take the mouse genome, fragment it into random 36-mers (the same size you get from an Illumina sequencing run), then inject 1 random base for every 1000 bases read (I’ll be generous and assume a .1% error rate, though the real thing is closer to 3-5% from what I’ve seen), and then try running velvet on it.

I bet you’ll observe that somewhere around 40x coverage you’ll start to saturate and discover that your assembly has covered anywhere from 60-80% of the genome (say 5% of that is just way wrong), and that it’ll have taken you longer to do that assembly than it would have to just sequenced the damned thing in the first place with PCR.

Anyhow, I take paul’s point to heart: We’re still early on in this game, and there are a lot of neat things we can try. I’m looking forward to seeing the results of many of them. (Who has time to try them all?)

By the way, FindPeaks 3.0.1 is done, and I’ll be presenting it as a poster at AGBT 2008 this week. If you’re interested in ChIP-Seq/Chip-Solexa (or for the pedantic, ChIP-Illumina), come find me, and I’ll tell you some of its new features.

>Solexa Transcriptom Shotgun:Transcriptome alignments vs. Genome Alignments

>First off, I was up late trying to finish off one of my many projects, so I didn’t get a lot of sleep. Thus, if my writing is more incoherent than usual, that’s probably why. And now on with the show.

I’m jumping gears a bit, today. I haven’t finished my discussion about Aligners, of which I still want to talk about Exonerate in detail, and then discuss some of the other aligners in overview. (For instance, the one that I found out about today, called GMAP, a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that part of the purpose of using an aligner is to align to something in particular, such as a genome or a contig, but selecting what you align your sequences back to is a big issues.

When you’re re-sequencing a genome, you map back to the genome template. Yes, you’re probably sequencing a different individual, so you’ll find occasional sections that don’t match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you’re seeing are already commonly known changes. Of course, if you’re re-sequencing Craig Venter, you don’t need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes and so forth.

When you’re sequencing a genome you’ve never sequenced before, you can’t do mapping at all. There are various assemblers (i.e., Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add… ), SSake (written by Rene Warren, who incidentally also works at the GSC, although several floors above me.), and Euler (which I’d never heard of till I googled the web page for velvet…). The key point: you don’t need to worry about what you’re mapping back to when you do de novo assembly, since you’re creating your own map. I’ll digress further for one brief comment: assembly from Solexa/Illumina sequences is a bad idea, because they’re so short!

Moving right along, we come to the third thing people are sequencing these days: Transcriptomes. (Yes, I’m ignoring cloned libraries… they’re so 90′s!) Transriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I’m ignoring some complexity here, because it makes for a better story.) But I’m getting ahead of myself. Lets talk about the mRNA, but be aware that the sequencing is actually done on cDNA.

mRNA is an interesting beast. Unlike Genomic DNA, it’s a more refined creature. For Eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (labeled introns), and leaves other segments (labeled exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e. the exons), but if you take a small random snippet from the mRNA, there’s a small chance that it might overlap the boundaries between two exons, which means the bases you have won’t map back nicely to the genomic point of origin. That can be a serious problem.

Sure, you say, we can do a gapped alignment, and try to find two points where this sequence originated, with one big gap. If you’re sophisticated, you’ll even know that introns often have signals that indicate their presence most of the time. And yes, you’d be right, we can do that. Unfortunately, for most solexa runs, you get 20,000,000+ sequences. At 10 seconds a sequence (which doesn’t seem like much, really), how long would it take to do that alignment?

Too long.

So most of the time, we don’t do gapped alignments. Instead, we have two choices:

  1. Align against the genome, and throw away reads that we can’t align (i.e. those that over lap intron/exon boundaries.)
  2. Align against a collection of known coding DNA sequences

Number two isn’t a bad option: it already has all the introns spliced out, so you don’t need to worry about missing those alignments. Unfortunately, there are several issues with this approach:

  • Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for animals like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.
  • Many transcripts are very similar. This isn’t a problem with the data, but with your alignment program. If your aligner doesn’t handle multi-matches (like Eland), this will never work.
  • Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.
  • Aligning to a transcript is hard to visualize. This is going to be one of my next projects… but with all the fantastic genomes out there, I’m still not aware of a good transcriptome visualization tool.

And that brings us to the conclusion. Aligning a transcriptome run against a genome or against a transcriptome both have serious problems, and there really are no good solutions for this yet.

For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven’t released my code for either one, yet, but they both work, and if you contact my supervisor, he’d probably be interested in collaborating, or possibly sharing the code.

>Aligning DNA – Eland

>Continuing in my series of blog articles on short-read sequencing, it’s time for a little bit of a discussion on the first short-sequence aligner that I’d ever met: Eland. The problem with writing anything about Eland, is that 90% of the information I have on it can’t be confirmed independently. In fact, the majority of what I know comes from the header of the ELAND.cpp file. I’ll admit, while I’ve gone through the source code a couple of times, I haven’t spent much time getting to know it’s well – I’ve processed billions of lines of Eland files, but have yet to run it, myself…

Anyhow, here’s what I can tell you: Eland stands for “Efficient Local Alignment of Nucleotide Data” and was written by Anthony J. Cox for Solexa Ltd. Of course, Solexa has since been bought out by Illumina, so I assume the copyright has been transferred along with the technology during that transaction.

What makes Eland so fascinating to me is that it’s a direct departure from the dynamic programming based algorithms (like the venerable Smith-Waterman alignment), making it somewhat interesting as a computational program.

The basic algorithm works along these lines: Given a sequence of length N, it can be divided into four subsequences (A, B, C, and D), which are of equal (or nearly equal length). Assuming there are no more than 2 errors, at least two of these subsequences will be “error free”, so that the two error free sequences can then be searched for in a database containing all possible subsequences in the genome of interest. Thus, you can search your database for the subsequence AB and CD. Of course, you don’t know which subsequences are the error free ones, so you need to try all possible combination.

What Eland does to speed things up is to combine these subsequences into sets of two. Thus, rather than searching for 4 independent entries in their database, they simply have to search for 2 sets of two, and as long as one set matches, you can use looser matching criteria on the other two, ie, allowing for mismatches. That is to say, if you make two sequences out of the 4 subsequences {AB and CD}, you can search for an exact match for AB, and test all the possible results for mismatches in the {CD} portion of the sequence. This has the effect of significantly reducing the search space. (Only sequences containing the exact match to {AB}.)

Searching for the {AB and CD} subsequences would only work if the first half of your sequence has no errors. What if B and C had the errors? The simple answer is to shuffle your subsequences to make other combinations: ({AB and CD}, {AC and BD}, {AD and CD}, {BA and CD}, etc.) This still provides you with a relatively small search space for each sequence, as there are only 4! possible combinations (which is 4x3x2x1 = 24 possible sequences) to search for.

Fortunately, you can bound this even further. You always know that you want sequences in which the first pair and second pair are in the correct order, (ie {AB and CD} and {AC and BD} are ok, but {CA and DB} and {BA and DC} would give you an incorrect result) limiting you to only six possible combinations, which still allows you to find any correct match where at least two of the four subsequences are error free.

That, in general is how it works.. with a few nifty shortcuts. ie, they only need to pass over the genome search space 3 times to find all possible matches with up to two errors, because they can search for subsequence combinations simultaneously and efficiently. In other words, you get a very quick alignment.

On the other hand, the way in which ELAND was implemented is very important. The Eland aligner has some severe restrictions, some of which are algorithm specific, some of which are implementation specific:

  1. The Eland aligner can only align sequences up to 32 base pairs long. (Implementation specific: they use four integers to represent the subsequences {A,B,C and D}, which reduces the memory requirements.)
  2. The Eland aligner can only align sequences with up to two mismatches. (Algorithm specific: at least two of the keys must match, although the algorithm could be expanded to use more subsequences, at the expense of expanding the sequence space.)
  3. The Eland aligner only returns sequences that match to one unique location in the genome (Implementation specific: Eland already tells you when a sequence matches to more than one location, it just doesn’t tell you what those locations are.)
  4. The Eland aligner can do mismatches in the sequence, but can not handle insertions or deletions (i.e, gapped alignments). (Algorithm specific: The algorithm for matching the keys could be expanded to include this, but it would be very costly in time/memory.)
  5. The code is not freely available. (Implementation specific: I don’t know of anywhere you can go to get the code… or documentation!)
  6. Changing the length of the alignment sequence (I called it N, above), requires a recompile. (Implementation, obviously. Why this isn’t a command line parameter, I don’t know.)

I’m sure I could go on and list other disadvantages, but I’d like to point out the main advantage: It’s FAST. Blazing fast, compared to the other aligners. From my own personal experience, Eland may only be able to map 60-70% of your sequences to the genome (compared to 70-80% for other aligners with which I have less experience), but it does so in a few hours, compared to several days or weeks for most other aligners (or years, compared to Blast). (Comparisons are for a single processor, matching 1 million+ reads to a mamalian size genome.) The only other aligner that can come close to this speed is, to the best of my knowledge, SXOligoSearch, produced by Synamatix, though that beast will cost upwards of $225,000CAD to license.

Anyhow, to close this discussion, it’s worth mentioning that rumours of a new version of Eland have been circulating since March 2007. Beta versions, capable of performing searches on sequences larger than 32 base pairs in length and able to perform multi-matches, were supposed to be released in September 2007. Unfortunately, the new betas appear to be vapourware, which has prompted us to look into several other aligners, which I’ll discuss in future posts.

(2008-01-28 Update: I stand corrected: I’ve heard that the new betas have finally arrived! I haven’t seen what they can do yet, but at least we know my prediction that they’re vapourware is wrong.)