>Rumours..

>Apparently there’s a rumour going around that I think FindPeaks isn’t a good software program, and that I’ve said so on my blog. I don’t know who started this, or where it came from, but I would like to set the record straight.

I think FindPeaks is one of the best ChIP-Seq packages currently available. Yes, I would like to have better documentation, yes there are occasionally bugs in it, and yes, the fact that I’m the only full-time developer on the project are detriments…. But those are only complaints, and they don’t diminish the fact that I think FindPeaks can out peak-call, out control, out compare…. and will outlast, outwit and outplay all of it’s opponents. (=

FindPeaks is actually a VERY well written piece of code, and has the flexibility to do SO much more than just ChIP-Seq. If I had my way, I’d have another developer or two working with the code to expand many of the really cool features that it already has.

If people are considering using it, I’m always happy to help get them started, and I’m thrilled to add more people to the project if they would like to contribute.

So, just to make things clear – I think FindPeaks (and the rest of the Vancouver Short Read Analysis Package) are great pieces of software, and are truly worth using. If you don’t believe me, send me an email, and I’ll help make you a believer too. (=

>Has Apple gone too far?

>As a bioinformatician, I enjoy a good looking piece of computer hardware and, for the last few years, the best looking hardware around has been the Apple Macs. I’ve even thought about buying their new macbooks, although for the same specs, you can pick up a dell on sale at 1/3rd of the price, so it’s hardly a good deal. I really can’t see myself running anything other than Linux on it, though, so despite the beautiful engineering, I can’t see myself paying ~$300 for an OS I’d just remove. (I was even upset at paying ~$50 for a copy of Windows XP with my current laptop. Drop me a line if you want to buy the license – It doesn’t even have a Valid EULA… but that’s another story.)

Anyhow, I’ve got to admit, Apple has finally managed to turn me off completely. Check out this article. To paraphrase, Apple has decided to follow suit with Microsoft and Intel in order to prevent you from enjoying the content you own in the way you’d like to use it. In other words, Mac OSX is now claiming control over your media files. (And, I might add, this is not about copyright, because the article shows uses that are clearly restricting “fair use” as well.) DRM is now built right into your hardware, and if your hardware isn’t DRM enabled, you can’t use it. Ouch.

I feel sorry for those people who have jumped the Microsoft ship just to end up in the Apple camp and are about to discover that Apple doesn’t have their best interests at heart either. Why shouldn’t you be able show a movie on an external monitor or projector?

In the long run, this is probaby good advertising for GNU/Linux, which doesn’t enforce media company greed on it’s users. So, if anyone wants a free Ubuntu disk to make their Apple harware work for them instead of against them, here you go.

>Maq Bug

>I came across an interesting bug today, when trying to work with reads aligned by Smith-Waterman (flag = 130), in PET alignments. Indeed, I even filed a Bug Report on it.

The low down on the “bug” or “feature” (I’m not sure which it is yet), is that sequences containing deletions don’t actually show the deletions – they show the straight genomic sequence at that location. The reason that I figure this is a bug instead of by design is because sequences with insertions show the insertion. So why the discrepancy?

Anyhow, the upshot of it is that I’m only able to use 1/2 of the Smith-Waterman alignments maq produces when doing SNP calls in my software. (I can’t trust that the per-base quality scores align to the correct bases with this bug) Since other people are doing SNP calls using MAQ alignments… what are they doing?

>SNP calling from MAQ

>With that title, you’re probably expecting a discussion on how MAQ calls snps, but you’re not going to get it. Instead, I’m going to rant a bit, but bear with me.

Rather than just use the MAQ snp caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my snp caller has several added functionalities that I wanted to use, and *of course*, I thought it would be easy. Was it, you might also ask? No – but not for the reasons you might expect.

I spent the last 4 days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don’t get me started on platform dependent binary files!), I have a working snp caller, I even have all the code to link them together. What I was missing was all of the little tricks, particularly the ones for intron-spanning reads in transcriptome data sets, and the code that links together the “kludges” with the method I didn’t know about when I started. After hacking away at it, bit by bit things began to work. Somewhere north of 150 code commits later, it all came together.

If you’re wondering why it took so long, it’s three fold:

1. I started off depending on someone else’s method, since they came up with it. As is often the case, that person was working quickly to get results, and I don’t think they had the goal of writing production quality code. Since I didn’t have their code (though, honestly, I didn’t ask for it either since it was in perl, which is another rant for another day) it took a long time to settle all of the 1-off, 2-off and otherwise unexpected bugs. They had given me all of the clues, but there’s a world of difference between being pointed in the general direction towards your goal and having a GPS to navigate you there.

2. I was trying to write code that would be re-usable. That’s something I’m very proud of, as most of my code is modular and easy to re-purpose in my next project. Half way through this, I gave up: the code for this snp calling is not going to be re-usable. Though, truth be told, I think I’ll have to redo the whole experiment from the start at some point because I’m not fully satisfied with the method, and we won’t be doing it exactly this way in the future. I just hope the change doesn’t happen in the next 3 weeks.

3. Name space errors. For some reason, every single person has a different way of addressing the 24-ish chromosomes in the human genome. (Should we include the mitochondrial genome in our own?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the Mitochondrial genome a “MT” or just “M”? What case do we use for “X” and “Y” (or is it “x” and “y”?) in our files? Should we pre-pend “chr” to our chromsome names? And what on earth is “chr5_random” doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read… bleh.

Anyhow, I fully admit that SNP calling isn’t hard to do. Once you’ve read all of your sequences in, determined which bases are worth keeping (prb scores), determined the minimum level of coverage, minimum number of bases that are needed to call a snp, there’s not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then: tada, you have all your snps.

However, once you’re done all of this, you realize that the big issue is that there are now too many snp callers, and everyone and their pet dog is working on one. There are several now in use at the GSC: Mine, at least one custom one that I’m aware of, one built into an aligner (Bad Idea(tm)) under development here and the one tacked on to the swiss army knife of aligners and tools: MAQ. Do they all give different results, or is one better than another? who knows. I look forward to finding someone who has the time to compare, but I really doubt there’s much difference beyond the alignment quality.

Unfortunately, because the aligner field is still immature, there is no single file output format that’s common to all aligners, so the comparison is a pain to do – which means it’s probably a long way off. That, in itself, might be a good topic for an article, one day.

>New ChIP-Seq tool from Illumina

>Ok, I had to blog this. Someone on the SeqAnswers forum brought it to my attention that Illumina has a new tool for ChIP-Seq experiments. That in itself doesn’t bother me – the more people in this space, the faster we learn about what makes us tick.

What surprises me, though, is the tool itself (beadstudio data analysis software – chip sequencing module). It’s implemented only for Windows, for one. (Don’t most self-respecting scientists use Macs or Linux these days? Or at least use and develop tools that can be used cross-platform?) Second, the feature set appears to be a re-implementation of the UCSC Genome Browser. Given the choice between the two, I don’t see any reason to buy the Illumina version. (Yes, you have to pay for it, whereas UCSC is free and flexible.) I can’t tell if it loads bed files or wig files, but the screen shots show a rather unflexible tool that looks like a graphical version of Gap4 or Consed. I’m not particularly impressed.

Worse still, I can’t see this being implemented in a pipeline. If you’re processing 100’s of ChIP-Seq experiments in a year, or 1000’s once this technique really starts to hit it’s stride, why would you want to force it all through a GUI? I just don’t get it.

Well, what do I know? Maybe there’s a big market for people out there who don’t want free cross-platform tools, and would rather pay for a brand name science application than use something that works. Come to think of it, I’m willing to bet there are a few pharma companies out there who do think like that, and Illumina is likely to conquer that market with their tool. Happy clicking, Vista users.