>Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that do useful functions. So, that raise the distinctly interesting point that all of them should be applied together in one shot… and so I did exactly that.
I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie…. and several other more obscure formats.
For the moment, the “conversion utility” forces the output to bed file format (since that’s the file type with the least information, and I don’t have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?
Why not add a very simple functionality that lets one format be converted to the other? Actually, there’s no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. since these aren’t being scaled or altered, you could end up with some rather bizzare conversions unless they’re handled cleanly. Unfortunately, doing this scaling is such a moving target that it’s just not possible to keep up with that and do all the other devlopment work I have on my plate. (I think I’ll be asking for a co-op student for the summer to help out.)
Anyhow, I’ll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=
>Well, I promised a quick update on the MAQ mapview format, after I wrote the interpreter for it, but there isn’t much to say.
Key bits of information:
The file is simply a zipped text file. (Update: this file is not normally gzipped. While the .map file is zipped, the .mapview file is not normally found in the gzipped state.) You can unzip it with ‘gunzip’ on a linux system.
Reads are pre-sorted by chromosome, then position.
The format is 1 based, so if you’re using a zero based format, you’ll need to convert.
Starting points are for the “left end”, ie, regardless of which strand the sequence aligned to, the matching position with the lowest position is reported.
Sequences are not contained in this file, but if you go back to the original fastq file you can retrieve the sequence. If you do so, you will need to obtain the reverse compliment of any read that maps to the reverse strand to map to your fasta file sequence. Forward strand sequences will map correctly to the fasta sequence.
Most of the fields are not useful for any form of analysis, and what’s given is mostly incomprehensible.
The best information I had was from the MAQ manpage. In a slightly more readable format:
insert size from the outer coorniates of a pair
single-end mapping quality
alternative mapping quality
number of mismatches of the best hit
sum of qualities of mismatched bases of the best hit
number of 0-mismatch hits of the first 24bp
number of 1-mismatch hits of the first 24bp on the reference
length of the read
Most strikingly, you’ll notice 16 fields are listed above, while the file appears to have 14 fields. It seems not all files have the last two fields. I don’t know if it’s just the file I have, or if it’s usually that way. (Update: Actually, there are normally 16 fields. The files I was given were generated using the mapview -B flag, which strips out some of the information, which I believe are the final two fields. Thus, the comments above reflect the -B flag output only! Thanks to Ryan for catching that!)
If I come across anything else that needs to be added, I’ll update this entry.