>Universal format converter for aligned reads

>Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that do useful functions. So, that raise the distinctly interesting point that all of them should be applied together in one shot… and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie…. and several other more obscure formats.

For the moment, the “conversion utility” forces the output to bed file format (since that’s the file type with the least information, and I don’t have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?

Why not add a very simple functionality that lets one format be converted to the other? Actually, there’s no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. since these aren’t being scaled or altered, you could end up with some rather bizzare conversions unless they’re handled cleanly. Unfortunately, doing this scaling is such a moving target that it’s just not possible to keep up with that and do all the other devlopment work I have on my plate. (I think I’ll be asking for a co-op student for the summer to help out.)

Anyhow, I’ll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

>Bowtie and Single End Mapped Paired End Data

>Strange title for a posting, but it actually makes sense, if you think about it.

One of the researchers here has been working on an assembly of a reasonably small fungal genome, using short read technology with which I’ve been somewhat involved. It’s been an educational project for me, so I’m glad I had the opportunity to contribute. One of the key elements of the paper, however, was the use of Paired End Tags (PET) from an early Illumina machine to assess the quality of the assembly.

Unfortunately, an early run of the software I’d written to analyze the Maq alignments of the paired ends to the assemblies had a bug, which made it look like the data supported the theory quite nicely – but alas, it was just a bug. The bug was fixed a few weeks ago, and a re-run of the data turned up something interesting:

If you use Maq alignments, you’re biasing your alignments towards the smith-waterman aligned sequences, which are not an independent measure of the quality of the assembly. Not shocking? I know – it’s pretty obvious.

Unfortunately, we hadn’t put it in context before hand, so we had to find another way of doing this. We wanted to get a fairly exhaustive set of alignments for each short read – and we wanted to get it quickly, so we turned to Bowtie. While I wasn’t the one running the alignments, I have to admit, I’m impressed with the quick performance of the aligner. Multi-matches work well, the file format is intuitive – similar to an eland file, and the quality of the information seems to be good. (The lack of a good scoring mechanism is a problem, but wasn’t vital for this project.)

Anyhow, by performing a Single End Tag style alignment on PET data, while retaining multimatches, we were able to remove the biases of the aligners, and perform the pairings ourselves – learning more about the underlying assembly to which we were aligning. This isn’t something you’d want to do often, unless you’re assembling genomes fairly regularly, but it’s quite cool.

Alas, for the paper, I don’t think this data was good enough quality – there may not be a good story in the results that will replace the one we thought we had when the data was first run… but there were other good results in the data, and the journey was worth the trip, even if the destination wasn’t all we’d hoped it would be.

As a sidenote, my code was run on an otherwise unused 16-way box, and managed to get 1592% CPU usage, at one point, while taking up 46Gb of RAM. (I should have processed each lane of reads separately, but didn’t.) That’s the first time I’ve seen CPU usage that high on my code. My previous record was 1499%. I should note that I only observe these scalings during the input processing – so the overal speed up was only 4x.

The code is part of the Vancouver Short Read Analysis Package, called BowtieToBedFormat.java, if anyone needs a process similar to this for Bowtie Reads. (There’s no manual for it at the moment, but I can put one togther, upon request.)