While I am all for people writing their own software, writing your own aligner seems more difficult than figuring out how to get what you want from a published (and used format) like ACE assembly files.
That’s absolutely correct. I think writing my own aligner is probably the last thing I want to do. While I have the source for at least our own internal aligner on my desktop this week (and have been doing some serious hacking on it), I really don’t want to have to build my own. It’s a distraction from my real research, it’s a pain to support, it’s going to have to compete with much larger groups doing similar things, and it’s probably not a long-term viable option. That said, I still think the current state of affairs in the realm of aligners is pretty poor – though that’s not to say I think I could do it better. I’ll leave writing assemblers to the professionals – or at least those who have more time to do this stuff.
Still, maybe I can offer one helpful suggestion to people writing aligners: get together and use a standard format! I’m less inclined to test Yet Another Aligner when I have to write another filter to interpret the results. There are even efforts currently under way to create a unified format for aligned reads and short-read data. (Disclaimer: I’ve only read a couple of handfuls of posts, and not all of them appear to be on topic.)
The formats we’re familiar with (such as fasta) aren’t really designed for this type of work. (That hasn’t stopped several aligners from using them, however – which is still better than creating their own file formats, I might add.) What’s actually needed is a purpose-built format.
I’m sure you can guess from this rant that I’m a big fan of things like the OpenDocument Format (ODF), which is levelling the playing field for word processing documents (despite Microsoft’s best efforts to remain the dominant force at all costs), but even a limited (i.e. non-ISO standardized) approach could make a huge impact in this area.
Why not have a two-day get-together for all the people building aligners, and decide on a subset of formats? Make it versioned, so that it can change over time, and elect a maintainer of the official standard. While you’re at it, decide on a few conventions – what’s a U0 hit? (Slider can have up to 9 SNPs and still call it a U0…) What is a read start? How do you represent stranded genomic information – which end is really the “start”?
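To make those questions a bit more concrete, here is a rough sketch of what a versioned record with its conventions written down might look like. Everything in it is made up for illustration: the field names, the `#format-version=` header, the choice of 1-based leftmost coordinates, and the reading of “U0” as “exactly one hit with zero mismatches” are my assumptions, not anything an existing aligner or standard actually specifies.

```python
from dataclasses import dataclass

# Hypothetical version tag; a real standard would define its own scheme.
FORMAT_VERSION = "0.1"
HEADER_PREFIX = "#format-version="


@dataclass(frozen=True)
class AlignedRead:
    read_id: str
    chrom: str
    start: int        # assumed convention: 1-based, leftmost base on the forward strand
    length: int
    strand: str       # "+" or "-"
    mismatches: int
    hit_count: int    # number of genomic locations found at this mismatch level

    def five_prime(self) -> int:
        """Position of the read's own 5' end, which depends on strand."""
        if self.strand == "+":
            return self.start
        return self.start + self.length - 1

    def is_u0(self) -> bool:
        """Assumed definition: a unique hit with no mismatches at all."""
        return self.hit_count == 1 and self.mismatches == 0


def write_records(reads):
    """Serialize reads as tab-delimited lines under a version header."""
    lines = [HEADER_PREFIX + FORMAT_VERSION]
    for r in reads:
        fields = (r.read_id, r.chrom, r.start, r.length,
                  r.strand, r.mismatches, r.hit_count)
        lines.append("\t".join(str(f) for f in fields))
    return "\n".join(lines) + "\n"


def parse_records(text):
    """Parse text produced by write_records, refusing unknown versions."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(HEADER_PREFIX):
        raise ValueError("missing format-version header")
    version = lines[0][len(HEADER_PREFIX):]
    if version != FORMAT_VERSION:
        raise ValueError("unsupported format version: " + version)
    reads = []
    for line in lines[1:]:
        if not line:
            continue
        read_id, chrom, start, length, strand, mm, hits = line.split("\t")
        if strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-': " + strand)
        reads.append(AlignedRead(read_id, chrom, int(start), int(length),
                                 strand, int(mm), int(hits)))
    return reads
```

The point isn’t this particular layout. It’s that once “start” and “U0” are pinned down in one place, a reverse-strand read and a forward-strand read can be handled by the same filter, and a parser can refuse a file whose version it doesn’t understand instead of silently misreading it.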
Anyhow, I know it’s wishful thinking, but hey, maybe this rant will encourage a few people to band together and come up with something to make all of our lives easier. Even something as simple as Fasta freed up thousands (millions?) of man-hours (or woman-hours) for bioinformaticians, since they no longer need a custom format library for every new program they use. Maybe it’s time to do the same for aligned data.
P.S. I’ll answer Jason’s questions on EST/transcriptome approaches in my next post.