Faster is better…

I was going to write about the crazy street preacher I met on the bus this morning, but I have some far more topical stuff to mention.

I’ve been working on some Python code which, much to my dismay, was taking a LONG time to run: around 11 hours of processing time, using 100% of all 8 CPUs (a summed user time of 6110 minutes, or about 100 hours of total CPU time), for something I figured should take about an hour.

I’d already used profiling to identify the two biggest bottlenecks: putting information into (and retrieving it from) a single queue shared by all 8 worker processes, and a single function that is central to the calculations being done by the workers.  Not being able to figure out what was happening in the workers’ code, I spent some time optimizing the queue handling instead. I tried a fancy queue manager (which later turned out to be outperformed by simply picking a queue at random to push the data into) and a separate queue for each process (which did, in fact, cut wasted time significantly).  Before that, the run time had been hovering around 20 hours with 8 processes. So I’d already made good progress, but the run time was still well above where I thought it should be.
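For anyone curious, here is a minimal sketch of the "separate queue per process, pick one at random" idea using the standard multiprocessing module. It is not the original code: the names `worker`, `dispatch` and `run_with_processes`, and the squaring workload, are hypothetical placeholders, and it assumes that `None` never appears as a real work item (it is used as the stop signal).

```python
import multiprocessing as mp
import random


def worker(inbox, results):
    """Consume items from this worker's private queue until a None sentinel arrives."""
    for item in iter(inbox.get, None):
        results.put(item * item)  # placeholder for the real calculation


def dispatch(items, inboxes, rng=random):
    """Push each item onto a randomly chosen queue; return how many were sent."""
    sent = 0
    for item in items:
        rng.choice(inboxes).put(item)
        sent += 1
    return sent


def run_with_processes(items, n_workers=8):
    """Start one process per queue, feed them at random, and collect the results."""
    inboxes = [mp.Queue() for _ in range(n_workers)]
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(q, results)) for q in inboxes]
    for proc in procs:
        proc.start()
    sent = dispatch(items, inboxes)
    for q in inboxes:
        q.put(None)  # one sentinel per worker so every loop terminates
    # Drain the results before joining so no worker blocks on a full pipe.
    collected = [results.get() for _ in range(sent)]
    for proc in procs:
        proc.join()
    return sorted(collected)
```

To try it, call `run_with_processes(range(10))` from inside an `if __name__ == "__main__":` guard. Random assignment won't balance the load perfectly, but it needs no coordination between producers and workers, which is presumably why it can beat a cleverer manager.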

So, following up on John’s comment the other day, I gave Pypy a shot.  Unfortunately, the first thing I discovered was that pypy doesn’t work with numpy, the numerical computing library for python.  No problem – I was only using it in one place.  It only took a few seconds to rewrite that code so that it used a regular 2D array.

Much to my surprise, I started getting an error elsewhere in the code, indicating that a float was being used as an index to a matrix!

Indeed, it only took a few seconds to discover that the code was taking the absolute value of an int with math.fabs, and the returned value was a float, not an integer…

Which means that numpy was hiding that mistake all along, without warning or error!  Simply putting a cast on the float (e.g., int(math.fabs(x))) was sufficient to drive the total running time of the code down to 111 minutes of user time on the next data set of comparable size (about 2 hours, with a real time of 48 minutes because the fancy queue manager mentioned above wasn’t working well).
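The mismatch is easy to reproduce in plain Python. In this small sketch, `values`, `offset` and `lookup` are made-up stand-ins rather than anything from the real code; the point is only that math.fabs always returns a float, and a regular list refuses to be indexed by one.

```python
import math

values = [10, 20, 30, 40]  # hypothetical stand-in for one row of the matrix
offset = -3                # hypothetical integer that needs an absolute value


def lookup(seq, index):
    """Return seq[index], or the TypeError message if the index is rejected."""
    try:
        return seq[index]
    except TypeError as err:
        return "TypeError: %s" % err


print(type(math.fabs(offset)).__name__)        # float
print(type(abs(offset)).__name__)              # int
print(lookup(values, math.fabs(offset)))       # a plain list rejects the float index
print(lookup(values, int(math.fabs(offset))))  # 40, using the cast from the post
print(lookup(values, abs(offset)))             # 40, no cast needed
```

The built-in abs() keeps an int as an int, so it would be an alternative to the cast when the input is already an integer.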

Yes, I’m comparing apples with oranges by not rerunning on the same data set, but I am trying to process a large volume of data, and I wasn’t expecting such a massive performance boost. I’ll get around to proper benchmarking when that’s done.

Unfortunately, in the end, I never could get pypy running.  It turns out that it’s incompatible with pysam, a library I’m using to process bam (next-generation sequencing) files.  I don’t have any alternatives for that library and I can’t drop it, so pypy is out.  However, it did help me identify that numpy is far too accepting of bad data for array indexes, and while it is able to handle it correctly, numpy does so with a huge time penalty on your code.

So, lessons learned: profiling is good, pypy isn’t ready for heavy computing use, and numpy should be treated with caution!  And, of course, yes, you can write fast code with python – you just have to know what you’re doing!

6 thoughts on “Faster is better…”

    • I’ll admit, I hadn’t considered switching bam file libraries, although you’re absolutely right that a very simple implementation would be sufficient for what I’m doing. For today’s purposes, though, I’m already pretty happy with the speed of the code. (45 minutes per ChIP-seq run isn’t bad at all, considering the functionality of the code…)

      I’ll definitely revisit this in a week or two, when I’m done with the next part of my project – I can see phot’s potential, for sure.

      There are some great signs that numpypy is coming along well, but I suspect that my use of it was pretty trivial anyhow. I can see where I’ll want to use it in other projects, though. It does look like it’ll be in great shape soon!

  1. This is also the inherent problem with scripting languages. Implicit typecast assumptions are very dangerous. For prototyping applications, scripting languages are fine, but for production/analysis applications, I still go with more structured languages. Perl has the ‘use strict’ option. I’m sure Python has something similar.

    • Oddly enough, it wasn’t a problem with implicit typecast assumptions – a python array will fail if you pass a float as an index, which is the behaviour I was expecting. After reading up on it some more, it turns out that there are actual use cases where NumPy uses float indices, which just blew my mind.

      Either way, I should probably not have been using NumPy to begin with, as my case didn’t need any of the functions that are specific to NumPy Arrays, and as soon as I removed NumPy, the problem became apparent and was instantly identified.

      Thus, I don’t think python needs “use strict” – it generally already incorporates that behaviour by default, unless you use a more promiscuous library like NumPy.

    • Thanks for the tips, Anthony. I’ve been using profile (cProfile) for this work, but I hadn’t yet seen pycallgraph.

      Interestingly enough, getting profile to work on the individual processes took a lot of work – though the solution is not difficult. I was debating writing a post on that, since profiling multiprocess and multi-threaded code is quite a change from single threaded code.
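One common way to profile each process separately, sketched here under assumptions and not necessarily the solution described above, is to wrap each worker's entry point so that it runs under its own cProfile.Profile and writes a stats file named after its PID. The helper names `run_profiled` and `print_top` are hypothetical.

```python
import cProfile
import os
import pstats


def run_profiled(target, out_dir, *args, **kwargs):
    """Run target under its own profiler and dump the stats to a per-process file."""
    profiler = cProfile.Profile()
    try:
        return profiler.runcall(target, *args, **kwargs)
    finally:
        # The PID in the file name keeps workers from overwriting each other.
        profiler.dump_stats(os.path.join(out_dir, "worker-%d.prof" % os.getpid()))


def print_top(stats_path, limit=10):
    """Print the most expensive calls recorded in one worker's stats file."""
    pstats.Stats(stats_path).sort_stats("cumulative").print_stats(limit)
```

With multiprocessing, the wrapper becomes the process target, for example `Process(target=run_profiled, args=(worker, "profiles", inbox, results))`, where `worker`, `inbox`, `results` and the existing `profiles` directory are placeholders. Because `run_profiled` is a module-level function, it can be pickled when the start method requires it.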
