I was going to write about the crazy street preacher I met on the bus this morning, but I have some far more topical stuff to mention.
I’ve been working on some code in Python, which, much to my dismay, was taking a LONG time to run. Around 11 hours of processing time, using 100% of all 8 CPUs (a summed user time of 6,110 minutes – about 100 hours of total CPU time), for something I figured should take about an hour.
I’d already used profiling to identify the two biggest bottlenecks: putting information into (and retrieving it from) a single queue shared by all 8 worker processes, and a single function that is central to the calculations the workers perform. Not being able to figure out what was happening in the workers’ code, I spent some time optimizing the queue handling instead, with some fancy queue management (which, it later turned out, was outperformed by simply picking a queue at random to push the data into) and a separate queue for each process (which did, in fact, cut wasted time significantly). Before that, the run time had been hovering around 20 hours with 8 processes. So I’d already made good progress, but it was still well above where I thought it should be.
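For readers unfamiliar with the pattern, here’s a minimal sketch of the one-queue-per-worker idea described above. This isn’t my actual code – the worker function, task list, and queue count are all stand-ins – but it shows the structure: each worker drains its own dedicated queue, and the dispatcher picks a random queue for each task rather than funnelling everything through one shared queue.

```python
import multiprocessing as mp
import random

def worker(in_queue, out_queue):
    """Consume tasks from this worker's dedicated queue until a None sentinel arrives."""
    for item in iter(in_queue.get, None):
        out_queue.put(item * item)  # placeholder for the real calculation

def run(n_workers=8, n_tasks=100):
    # One input queue per worker, plus a shared results queue.
    queues = [mp.Queue() for _ in range(n_workers)]
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(q, results)) for q in queues]
    for p in procs:
        p.start()
    # Dispatch: push each task onto a randomly chosen worker's queue.
    for task in range(n_tasks):
        random.choice(queues).put(task)
    for q in queues:
        q.put(None)  # shutdown sentinel, one per worker
    collected = [results.get() for _ in range(n_tasks)]
    for p in procs:
        p.join()
    return collected
```

Results come back in whatever order the workers finish, so anything order-sensitive needs to be tagged and re-sorted afterwards. (On platforms that spawn rather than fork, this should be launched from under an `if __name__ == "__main__":` guard.)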
So, following up on John’s comment the other day, I gave PyPy a shot. Unfortunately, the first thing I discovered was that PyPy doesn’t work with numpy, the numerical computing library for Python. No problem – I was only using it in one place, and it only took a few seconds to rewrite that code to use a regular 2D array.
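The replacement is straightforward – a list of lists stands in for the numpy array (the dimensions here are just illustrative):

```python
rows, cols = 4, 5

# numpy version: grid = numpy.zeros((rows, cols))
# Plain-Python version: one inner list per row.
# Note: [[0.0] * cols] * rows would alias the SAME row object rows times,
# so a comprehension is used to get independent rows.
grid = [[0.0] * cols for _ in range(rows)]

# Indexing changes from grid[r, c] to grid[r][c].
grid[2][3] = 7.5
```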
Much to my surprise, I started getting an error elsewhere in the code, indicating that a float was being used as an index to a matrix!
Indeed, it only took a few seconds to discover that the code was taking the absolute value of an int with math.fabs, and the returned value was a float – not an integer…
Which means that numpy was hiding that mistake all along, without a warning or an error! Simply casting the float back to an int (e.g., int(math.fabs(x))) was enough to bring the total running time down to 111 minutes of user time on the next data set of comparable size (about 2 hours of CPU time, with a real time of 48 minutes, because the fancy queue manager mentioned above wasn’t working well).
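To make the bug concrete, here’s the behaviour in isolation (x is just a stand-in value). math.fabs always returns a float, even when handed an int – unlike the built-in abs, which returns an int for int input:

```python
import math

x = -3
f = math.fabs(x)       # math.fabs converts to float: 3.0, not 3
i = int(f)             # the fix: cast back to int before indexing

row = [10, 20, 30, 40]
print(row[i])          # works; row[f] would raise TypeError on a plain list

# The built-in abs preserves the type, so abs(x) would have
# avoided the problem entirely here.
```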
Yes, I’m comparing apples with oranges by not rerunning on the same data set, but I’m trying to process a large volume of data, and I wasn’t expecting such a massive performance boost. I’ll get around to proper benchmarking when that’s done.
Unfortunately, in the end, I never could get PyPy running. It turns out that it’s incompatible with pysam, a library I’m using to process BAM (next-generation sequencing) files. I don’t have any alternative to that library and I can’t drop it, so PyPy is out. However, it did help me discover that numpy is far too accepting of floats as array indexes – and while it handles them correctly, it does so at a huge time penalty to your code.
So, lessons learned: profiling is good, pypy isn’t ready for heavy computing use, and numpy should be treated with caution! And, of course, yes, you can write fast code with python – you just have to know what you’re doing!