>MAQ map format

>I’m currently working on a couple of coding projects, though the one that’s urgent priority is to analyze my samples of interest using the MAQ aligner, which shouldn’t be too bad. Unfortunately, there are no real standards when it comes to output from alignment programs, which makes it impossible to do a simple replacement of the previous aligner with the current aligner. Thus, Eland and MAQ are not interchangeable modular components. Switching between them takes a large investment in time and effort. Blah.

MAQ, however, is also more tricky because of it’s poor documentation. I can understand why that is – if they stopped to document, they might not have time to write the code, which is fair enough since all of us in this field are busy. Compounding the problem, however, is MAQ’s use of really cryptic file formats. Unfortunately, I don’t know enough to really discuss that issue for the majority of the MAQ formats, but I did want to make two points in this morning’s post.

The first is simply that MAQ’s map format is a little sketchy – it’s a binary format that depends heavily on c code calls to “sizeof()”, which means that compiling code with different compilers may give very different results. That’s always a “BAD THING”(tm), in my (possibly not so) humble opinion. It’s also brutal for interoperability. I’m now working on coding up a java interface that reads the map file, allowing users to skip the mapview file – a gzipped human readable file – that takes up a lot of space. My issue is that I have to worry about many more things than I should need to: is the size of the variable the same between c and java? is the compiler for c big endian or little endian, and does java use the same conventions? how do I use masks to de-convolute the compressed flags? Admittedly, I love the idea of the binary file to save space for these large documents, but I’m not really a fan of the method.

The second is an appeal to people to stop coding in perl! Come on people, can’t we pick a language that is maintainable? I’d like to show you a piece of code from the MAQ package, which I thought was responsible for converting the .map file to a .mapview file. (Fortunately, I was wrong, but it took me a day of staring at it before I realized I was looking at the wrong piece of code.)


while (<>) {
my @t = split;
next if ($t[8] <= $opts{q});
next if ($t[5] == 18 || $t[5] == 64 || $t[5] == 130);
my $do_break = ($t[1] ne $lasts || $t[2] - $lastc > $d)? 1 : 0;
if ($do_break) {
if ($lastc - $beginc > $opts{s}) { # skip short/unreliable regions
my $k = @regs;
my $flag = ($lastc - $beginc < $opts{i})? '*' : '.';
push(@reg_name, "$beginst$beginct$lastct$flag");
my $p = @{$regs[$k]};
foreach (@reg_seq) {
push(@$p, $_);
my @s = split;
push(@{$read{$s[0]}}, $k);
}
}
($begins, $beginc) = @t[1..2];
@reg_seq = ();
}
push(@reg_seq, join(" ", @t[0..3,5,8,13]));
($lasts, $lastc) = @t[1..2];
}

Take a look at that code. I have no idea what 18, 64 and 130 are supposed to be – and I have no documentation to tell me. It took me 2 days to figure out that $p is actually being pulled from the data file as a string, and then the content of $p becomes the name of the array further on, which is populated from the gzipped binary file. Even now, I’m still unsure about how the $regs variable gets populated.

If this were written in python, java, c…. well, anything but perl, it would probably be interpretable and maintainable. I’ve heard it said that if you want to fix a bug in perl, it’s easier to rewrite the program than to look for the error – and after trying to read this script, I believe it.

Anyhow, while slagging perl, I should also mention that perl can be done in a maintainable way, and well documented. Unfortunately, the perl programming community seems to emphasize witty, compact and undecipherable code styles, so I’ve just rarely seen it done. If you’re going to open source your code, why not do it in a way that makes your code re-usable?

Leave a Reply

Your email address will not be published. Required fields are marked *