>I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum, however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy to use form. I highly recommend you check it out, and possibly even add your own package.
>You know what I’d kill for? A simple set of tests for each aligner available. I have no idea why we didn’t do this ages ago. I’m sick of off-by-one errors caused by all sorts of slightly different formats available – and I can’t do unit tests without a good simple demonstration file for each aligner type.
I know Sam format should help with this – assuming everyone adopts it – but even for SAM I don’t have a good control file.
I’ve asked someone here to set up this test using a known sequence- and if it works, I’ll bundle the results into the Vancouver Package so everyone can use it.
Here’s the 50-mer I picked to do the test. For those of you with some knowledge of cancer, it comes from tp53. It appears to blast uniquely to this location only.
>forward - chr17:7,519,148-7,519,197
>reverse - chr17:7,519,148-7,519,197
>Well, I’m postponing my planned article, because I just don’t feel in the mood to work on that tonight. Instead, I figured I’d touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they’re not SNPs, they’re variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They’re only polymorphisms if they’re common to a portion of the population.
In this case, they’re from cancer cell lines, so after I filter out all the real SNPs, what’s left are SNVs… and they’re bloody annoying. This is the second major project I’ve done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymers were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs – a simple trick, easily accomplished by visualizing the data.
We moved onto Illumina, after that. Actually, it was still Solexa at the time. Yes, this is older data – nearly a year old. It wasn’t particularly reliable, and I’ve now used several different aligners, references and otherwise, each time (I thought) improving the data. We came down to a couple very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked… and lo and behold. 0/2. Neither of them are real. So, now comes the post-mortem: Why did we get the false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious… do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest for answering the question “why?” Why do we get unexpected results?
Fortunately, I’m a scientist, so that question is really something I like. I don’t begrudge the last year’s worth of work – which apparently is now more or less down the toilet – but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)
Ah, science. Good thing I’m hooked, otherwise I’d have tossed in the towel long ago.
>William left an interesting comment yesterday, which I figured was worthy of it’s own post, albeit a short one. (Are any of my posts really short, tho?) His point was that everyone in the genomics field is currently paying a lot of attention to SNPs, and very little attention to INDELs (Insertions and Deletions). And, I have to admit – this is true. I’m not aware of anyone really trying to do indels with Solexa reads, yet. But there are very good reasons for this.
The first is a lack of tools – most aligners aren’t doing gapped alignments. Well, there’s Exonerate and ZOOM, and possibly blast, but all of those have their problems. (Exonerate gapped mode is very slow, poor support, and we had a very difficult time making some versions work at all, while Blast is completely infeasible for short reads in a reasonable time scale and not guaranteed to give the right answer, while Zoom just hasn’t been released yet. If there are others, feel free to let me know.) Nothing currently available will really do a good gapped short read alignment. And, if you’ve noticed they key words: “short read”, you’re on to the real reason why no one is currently working with indels.
Yep – the reads are just too short. If you have a 36bp read, or even, say a 42bp read, you’re looking at (best case) having your indel right in the middle of the sequence, giving you two 21-base sequences, one on each side of the gap. Think about that for a moment and let that settle in. How much of the genome is unique for 21bp reads, which may have 2 or more SNPs or sequencing errors? I’d venture to say it’s 60% or so. With the 36 base pair read, you’re looking at two 18-bp reads, which is more like 40-50% of the genome. (Please don’t quote me on those numbers – they’re just estimates.) And that’s best case.
If your gap is closer to the end, you’ll get something more like a 32bp read and a 10bp read…. and I wouldn’t trust a 10bp seed to give the correct match against the genome no matter what aligner you’ve got – especially if it comes from the “poor” end of an Illumina sequence.
So that leaves you with two options: use a paired end tag, or use a longer read.
Paired end tags (PET) have been around for a couple months, now. We’re still trying to figure out the best way of using the technology, but it’s coming. People are mostly interested in using them for other applications – gross structural abnormalities, inversions, duplications, etc, but indels will be in there. It should be a few more months before we really see some good work done with PETs in the literature. I know of a couple of neat applications already, but a lot of the difficulty was just getting a good PET aligner going. Maq is there now, and it does an excellent job, albeit post processing the .map files for PET is not a lot of fun. (I do have software for it, tho, so it’s definitely a tractable problem.)
Longer reads are also good. Once you get gaps with enough bases on either side to do double-seed searches, we’ll get fast Indel capable aligners – and I’m sure it’s coming. There are long reads being attempted this week at the GSC. I don’t know anything about them, or the quality, but if they work, I’d expect to see a LOT more sequences being generated like this in the future, and a lot more attention being paid to indels.
So, I can only agree with William: we need to pay more attention to indels, but we need the technology to catch up first.
P.S. For 454 fans out there, yes, you do get longer reads, but I think you also need a lot of redundancy to show that the reads aren’t artifacts. As 454 ramps up its throughput, we’ll see both the Solexa and 454 platforms converge towards better data for indel studies.
>Every once in a while, I get an interesting email or comment from someone, which is worthy of a blog post. I woke up this morning to a comment left on my last post:
What kind of computing infrastructure do you need to be able to handle chipSEQ datasets? I’m guessing my standard IBM T60 laptop is not going to do the trick – but what does?
So yes, I do check my email when I get up… but that’s not the point. Moving along…
The simple answer is, your laptop probably can’t run a chip-Seq data set with the currently available programs… but it’s probably not far off. My laptop has 2Gb of RAM (after tossing two new 1Gb sticks in last week), and a dual core AMD, running at 2GHz. Ironically, it’s not far from what we do use to run FindPeaks: A quadcore 2.2GHz opteron with 8Gb of RAM.
Of course, FindPeaks doesn’t take up the full resources of the machine. (Though I’ve seen it take up to 12CPUs on one of our fancier servers.) Generally, I only allocate 4Gb of RAM to the process, which should be more than enough for several millions or tens of millions of reads. The reason it takes so much ram? Because I don’t drop any of the information being held that’s contained in the Eland file, and because I need to retain all of the reads in memory to do a sort.
What does it need? Generally, FindPeaks only needs the start, end and direction of each read to do the basics, however, I don’t throw away any of the other information collected, in case we want to use it later. If I did that, or if I pre-sorted the reads, I could probably drop the memory requirements down by an order of magnitude or more. (No one’s asked me to do this, yet, however.) Is there anyone out there who needs a laptop-runnable version of FindPeaks?
This is somewhat timely, though. For the last few days I’ve been playing with a companion piece of software for FindPeaks which generates UCSC-compatible BED tracks from Eland data where I’ve adopted a minimalist approach. It takes about the same amount of time, and runs in under 300m of RAM. That, clearly, should run on anyone’s laptop.
While I am all for people writing their own software, writing your own aligner seems more difficult than figuring out how to get what you want from a published (and used format) like ACE assembly files.
That’s absolutely correct. I think writing my own aligner is probably the last thing I want to do. While I have the source for at least our own internal aligner on my desktop this week (and have been doing some serious hacking on it), I really don’t want to have to build my own. It’s a distraction from my real research, it’s a pain to support, it’s going to have to compete with much larger groups doing similar things, and it’s probably not a long-term viable option. That said, I still think the current state of affairs in the realm of aligners is pretty poor – thought that’s not to say I think I could do it better. I’ll leave writing assemblers to the professionals – or at least those who have more time to do this stuff.
Still, maybe if I can offer one helpful suggestion to people writing aligners. Get together and use a standard format! I’m less inclined to test Yet Another Aligner when I have to write another filter to interpret the results. There are even currently efforts under way to create a unified format for aligned reads and short-read data. (Disclaimer: I’ve only read a couple handfuls of posts, and not all of it appears to be on topic.)
The formats we’re familiar with (such as fasta) aren’t really designed for this type of work. (It hasn’t stopped several aligners from using them however – which is still better than creating their own file formats, I might add.) What’s actually needed is a purpose built format.
I’m sure you can guess from this rant that I’m a big fan of stuff like the OpenDocument Format (ODF), which are levelling the playing field for word processing documents (despite MicroSoft’s best efforts to remain the dominant force at all costs), but even a limited (i.e. non-ISO standardized) approach could make a huge impact in this area.
Why not have a two day get together for all the people building aligners, and decide on a subset of formats? Make it versioned, so that it can change over time, and elect a maintainer of the official standard. While you’re at it decide on a few conventions – what’s a U0 hit? (Slider can have up to 9 SNPs and still call it a U0…) What is a read start? How do you represent stranded genomic information – which end is really the “start”?
Anyhow, I know it’s wishful thinking, but hey, maybe this rant will encourage a few people to band together and come up with something to make all of our lives easer. Even something as simple as Fasta freed up thousands (millions?) of man-hours (or woman-hours) for bioinformaticians, since they no longer need a custom format library for every new program they use. Maybe it’s time to do the same for aligned data.
P.S. I’ll answer Jason’s questions on EST/transcriptome approaches in my next post.