Faster is better…

I was going to write about the crazy street preacher I met on the bus this morning, but I have some far more topical stuff to mention.

I’ve been working on some code in python, which, much to my dismay, was taking a LONG time to run.  Around 11 hours of processing time, using 100% of all 8 CPUs (summed up user time of 6110 minutes – about 100 hours of total CPU time), for something I figured should take about about an hour.

I’d already used profiling to identify that the biggest bottlenecks were in two places – putting and retrieving information from a single queue shared by all 8 worker processes, as well as a single function of code that is central to the calculations being done by the workers.  Not being able to figure out what was happening in the worker’s code, I spent some time optimizing the queue management, with some fancy queue management (which turned out later to be outperformed by simply picking a random number for a queue to push the data into) and a separate queue for each process (which did, in fact cut wasted time significantly).  Before that, it had been hovering around the 20 hours, with 8 processes. So, I’d already been doing well, but it was well above where I thought it should be.

So, following up on John’s comment the other day, I gave Pypy a shot.  Unfortunately, the first thing I discovered was that pypy doesn’t work with numpy, the numerical computing library for python.  No problem – I was only using it in one place.  It only took a few seconds to rewrite that code so that it used a regular 2D array.

Much to my surprise, I started getting an error elsewhere in the code, indicating that a float was being used as an index to a matrix!

Indeed, it only took a few seconds to discover that the code was calling the absolute value  of an int, and the returned value was a float – not an integer…

Which means that numpy was hiding that mistake all along, without warning or error!  Simply putting a cast on the float (eg, int(math.fabs(x))) was sufficient to drive the total running time of the code to 111m of user time on the next data set of comparable size. (about 2 hours, with a real time of 48 minutes because the fancy queue manager mentioned above wasn’t working well).

Yes, I’m comparing apples with oranges by not rerunning on the data set, but I am trying to process a large volume of data, and I wasn’t expecting such a massive performance boost.- I’ll get around to proper bench marking when that’s done.

Unfortunately, in the end, I never could get pypy running.  It turns out that it’s incompatible with pysam, a library I’m using to process bam (next-generation sequencing) files.  I don’t have any alternatives for that library and I can’t drop it, so pypy is out.  However, it did help me identify that numpy is far too accepting of bad data for array indexes, and while it is able to handle it correctly, numpy does so with a huge time penalty on your code.

So, lessons learned: profiling is good, pypy isn’t ready for heavy computing use, and numpy should be treated with caution!  And, of course, yes, you can write fast code with python – you just have to know what you’re doing!

I’ve landed.

So, I think it’s time to return to blogging.  I’ve started in a new group and have begun feeling my way around in a new area – so, for those who followed me for Next Gen Sequencing in the past, you may be surprised that it’s likely to play a diminished role in my new position.  I don’t think I’m done in NGS, but it looks like I’ll have a little break from it, until it my new group completes a few upgrades, at least.

Are you curious?  I’m working at the CMMT in Vancouver, in the Kobor Lab.  I can’t say enough how awesome this group is, and how welcoming they’ve been (and I don’t even think they read my blog…).  I’ll also likely be collaborating with a few other groups here – but the extent of that is yet to be determined.

So what will feature prominently in my blog?  Well, that’s a good question.

It seems like Chip-Seq will come back.  I don’t think I’ll be returning to FindPeaks – I’ve got better ideas and more interesting plans that I hope to move forward on. It seems likely that I have more to contribute in this particular area, so I expect I’ll be starting a new code base that deals a bit more with the statistics of Chip-Seq.  The findpeaks code base has become a bit too big for rapid prototyping, so it’s time to step out of it to move forward.

I’m sure that epigenetics will take a front row seat in my work. That’s a major focus in this group, both for histones and DNA methylation, so I can’t see it not playing a significant part.  (I’m looking forward to working with methylation, which I’ve never done before…)

I’ll probably be working with Python – I’ve been thinking that it’s time to move away from Java.  Not that there’s anything wrong with Java, but I’ve heard really good things about Python, and I’m excited to start a language that seems to fit a little more naturally with the way I’d like to approach the problem.

I’m hoping to work with Open Source..  well, that hasn’t been discussed much yet, but I still believe strongly in the open source philosophy – particularly in the academic world.  I’d rather not work on closed source code in this environment.

I’ll also likely be working with a bit of Yeast genomics – it’s a great model system, and there’s still a lot to learn about regulation and epigenetics in that particular organism.  And there’s always the tie in to beer.  That doesn’t hurt either.

At any rate, things are still evolving, and I have a 500Mb stack of papers to read (yes, I’m saving paper), but I think that I’m back.  I may do a few reviews of the subjects I’ll have to read up on, which include the epigenetics of healthy ageing and childhood development.  Oddly enough, I think we can learn things at both ends of the human age spectrum, so why not?

And yes, I’ll try to keep the disparaging comments about Denmark to a minimum from now on, but I can’t promise there won’t be any.  Not, at least, till the lawyers finish working out who’s owed what.

>ChIP-Seq normalization.

>I’ve spent a lot of time working on ChIP-Seq controls recently, and wanted to raise an interesting point that I haven’t seen addressed much: How to normalize well. (I don’t claim to have read ALL of the chip-seq literature, and someone may have already beaten me to the punch… but I’m not aware of anything published on this yet.)

The question of normalization occurs as soon as you raise the issue of controls or comparing any two samples. You have to take it in to account when doing any type of comparision, really, so it’s somewhat important as the backbone to any good second-gen work.

The most common thing I’ve heard to date is to simply normalize by the number of tags in each data set. As far as I’m concerned, that really will only work when your data sets come from the same library, or two highly correlated samples – when nearly all of your tags come from the same locations.

However, this method fails as soon as you move into doing a null control.

Imagine you have two samples, one is your null control, with the “background” sequences in it. When you seqeunce, you get ~6M tags, all of which represent noise. The other is ChIP-Seq, so some background plus an enriched signal. When you sequence, hopefully you sequence 90% of your signal, and 10% of the background to get ~8M tags – of which ~.8M are noise. When you do a compare, the number of tags isn’t quite doing justice to the relationship between the two samples.

So what’s the real answer? Actually, I’m not sure – but I’ve come up with two different methods of doing controls in FindPeaks: One where you normalize by identifying a (symmetrical) linear regression through points that are found in both samples, the other by identifying the points that appear in both samples and summing up their peak heights. Oddly enough, they both work well, but in different scenarios. And clearly, both appear (so far) to work better than just assuming the number of tags is a good normalization ratio.

More interesting, yet, is that the normalization seems to change dramatically between chromosomes (as does the number of mapping reads), which leads you to ask why that might be. Unfortunately, I’m really not sure why it is. Why should one chromosome be over-represented in an “input dna” control?

Either way, I don’t think any of us are getting to the bottom of the rabbit hole of doing comparisons or good controls yet. On the bright side, however, we’ve come a LONG way from just assuming peak heights should fall into a nice Poisson distribution!

>New ChIP-seq control

>Ok, so I’ve finally implemented and debugged a second type of control in FindPeaks… It’s different, and it seems to be more sensitive, requiring less assumptions to be made about the data set itself.

What it needs, now, is some testing. Is anyone out there willing to try a novel form of control on a dataset that they have? (I won’t promise it’s flawless, but hey, it’s open source, and I’m willing to bug fix anything people find.)

If you do, let me know, and I’ll tell you how to activate it. Let the testing begin!

>Why peak calling is painful.

>In discussing my work, I’m often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It’s trivial. Peak calling itself isn’t hard. However, there are plenty of pitfalls that can surprise the unwary. (I’ve found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful – masking the triviality of the situation.

In reality, the three most frustrating things that occur in peak calling:

  1. Maintaining the software
  2. Peak calling without unlimited resources eg, 64Gb RAM
  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it’s really easy to fire up a “one-off” script. Anyone can write something that can be tossed aside when they’re done with it – but code re-use and recycling is a skill. (And an important one.) Writing your peak finder to be modular is a lot of work, and a huge amount of time investment is required to keep the modules in good shape as the code grows. A good example of why this is important can be illustrated with file formats. Since the first version of FindPeaks, we’ve transitioned through two versions of Eland output, Maq’s .map format and now on to SAM and BAM (but not excluding BED, GFF, and several other more or less obscure formats). In each case, we’ve been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added in quite rapidly by Tim with only a few hours of investment. That wouldn’t have been possible without the massive upfront investment in good modularity.

The second pitfall is memory consumption – and this is somewhat more technical. When dealing with sequencing reads, you’re faced with a couple of choices: you either sort the reads and then move along the reads one at a time, determining where they land – OR – you can pre-load all the reads, then move along the chromosome. The first model takes very little memory, but requires a significant amount of pre-processing, which I’ll come back to in a moment. The second requires much less cpu time – but is intensely memory thirsty.

If you want to visualize this, the first method is to organize all of your reads by position, then to walk down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second is to build a model of the chromosome, much like a “pileup” file, which then can be processed however you like. (This is how I do SNP calling.) In theory, it shouldn’t matter which one you do, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory strucutures you use, whereas the second pretty much is determined by the size of the chromosomes you’re using (multiplied by a constant that also depends on the structures you use.)

Unfortunately, using the first method isn’t always as easy as you might expect. For instance, when doing alignments with transcriptomes (or indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads, and process them individually – which works well when correctly sorted. Unfortunately, new formats no longer allow that – using a “pre-sorted” bam/sam file, you can now find multi-part reads, but there’s no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer sam reads in order to make sure that the next one returned is the correct order. (You can get odd bugs, otherwise, and yes, there are many other potential solutions.)

Moving along to the last pitfall, the one thing that people want out of a peak finder is that it is able to do the latest and greatest methods – and do it ahead of everyone else. That on it’s own is a near impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they’re not. For a group of 30 people, that’s probably not too hard, but for academic peak callers, that can be a challenge – particularly since every use wants something subtly different than the next.

So, when people ask how hard it is to write their own peak caller, that’s the answer I give: It’s trivial – but a lot of hard work. It’s rewarding, educational and cool, but it’s a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;

>new repository of second generation software

>I finally have a good resource for locating second gen (next gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum, however, the brilliant people at SeqAnswers have spawned off a wiki for it, with an easy to use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers

>Science Cartoons – 3

>I wasn’t going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that’s generally called “fair use” in the US, and copyright laws in most countries have similar exceptions), but you can’t publish it, take credit for it, or profit from it without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don’t steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

>Can’t we use ChIP-chip controls on *-Seq?

>Thanks to Nicholas, who left this comment on my web page this morning, in reference to my post on controls in Second-Gen Seqencing:

Hi Anthony,

Don't you think that controls used for microarray (expression
and ChIP-chip) are well established and that we could use
these controls with NGS?

Cheers!

I think this is a valid question, and one that should be addressed. My committee asked me the same thing during my comprehensive exam, so I’ve had a chance to think about it. Unfortunately, I’m not a statistics expert, or a ChIP-chip expert, so I would really value other people’s opinion on the matter.

Anyhow, I think the answer has to be put in perspective: Yes, we can learn from ChIP-chip and Arrays for the statistics that are being used, but no, they’re not directly applicable.

Both ChIP-chip and array experiments are based on hybridization to a probe – which makes them cheap and reasonably reliable. Unfortunately, it also leads to a much lower dynamic range, since they saturate out at the high end, and can be undetectable at the low end of the spectrum. This alone should be a key difference. What signal would be detected from a single hybridization event on a micro-array?

Additionally, the resolution of a chip-chip probe is vastly different from that of a sequencing reaction. In ChIP-Seq or RNA-Seq, we can get unique signals for sequences with a differing start location only one base apart, which should then be interpreted differently. With ChIP-chip, the resolution is closer to 400bp windows, and thus the statistics take that into account.

Another reason why I think the statistics are vastly different is because of the way we handle the data itself, when setting up an experiment. With arrays, you repeat the same experiment several times, and then use that data as several repeats of the same experiment, in order to quantify the variability (deviation and error) between the repeats. With second-generation sequencing, we pool the results from several different lanes, meaning we always have N=1 in our statistical analysis.

So, yes, I think we can learn from other methods of statistical analysis, but we can’t blindly apply the statistics from ChIP-chip and assume they’ll correctly interpret our results. The more statistics I learn, the more I realize how many assumptions go into each method – and how much more work it is to get the statistics right for each type of experiment.

At any rate, these are the three most compelling reasons that I have, but certainly aren’t the only ones. If anyone would like to add more reasons, or tell me why I’m wrong, please feel free to add a comment!

>Rumours..

>Apparently there’s a rumour going around that I think FindPeaks isn’t a good software program, and that I’ve said so on my blog. I don’t know who started this, or where it came from, but I would like to set the record straight.

I think FindPeaks is one of the best ChIP-Seq packages currently available. Yes, I would like to have better documentation, yes there are occasionally bugs in it, and yes, the fact that I’m the only full-time developer on the project are detriments…. But those are only complaints, and they don’t diminish the fact that I think FindPeaks can out peak-call, out control, out compare…. and will outlast, outwit and outplay all of it’s opponents. (=

FindPeaks is actually a VERY well written piece of code, and has the flexibility to do SO much more than just ChIP-Seq. If I had my way, I’d have another developer or two working with the code to expand many of the really cool features that it already has.

If people are considering using it, I’m always happy to help get them started, and I’m thrilled to add more people to the project if they would like to contribute.

So, just to make things clear – I think FindPeaks (and the rest of the Vancouver Short Read Analysis Package) are great pieces of software, and are truly worth using. If you don’t believe me, send me an email, and I’ll help make you a believer too. (=