Handy little command for upgrading python libraries…

About three weeks ago I googled for a quick tutorial on how to upgrade all of the libraries being used by python – and came up completely empty-handed. Nothing useful turned up, which I found rather frustrating. The Python installer (pip) should certainly have an “upgrade all” function – but if it does, I couldn’t find it. If anyone comes across such a thing, I’d love to hear about it.

This morning, on my bike in to work, I realized I could hack a very quick command line together to make it work:

pip freeze | awk -F '==' '{print $1}' | xargs -I {} sudo pip install --upgrade {}

Nothing to it! It goes through the installed packages one by one and tries to upgrade each of them. When a package is up to date, it’s clearly indicated, and when it’s not, pip tries to upgrade it, rolling back if it’s unsuccessful. I’ve noticed that many of the upgrades failed because of an out-of-date numpy package, so you may want to upgrade that first. Also, Eclipse isn’t too happy with the process, as it will detect the changes and freak out a bit – you might want to exit anything using or depending on the python libraries (such as the Django web server) first.
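If you’d rather stay inside python than string shell tools together, the same idea can be sketched roughly as below. This is only a sketch, not a tested replacement for the one-liner: the function names package_names and upgrade_all are my own inventions, and it assumes pip is available as a module for the running interpreter (and that you have permission to install, so no sudo handling here).

```python
import subprocess
import sys


def package_names(freeze_output):
    """Return the package names found in `pip freeze`-style text."""
    names = []
    for line in freeze_output.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not "name==version"
        # (editable or URL installs, for example).
        if not line or line.startswith("#") or "==" not in line:
            continue
        names.append(line.split("==", 1)[0])
    return names


def upgrade_all():
    """Upgrade every installed package, one at a time (hypothetical helper)."""
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout
    for name in package_names(frozen):
        # check=False: one failed upgrade should not stop the rest.
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "--upgrade", name],
            check=False,
        )
```

Nothing runs until you call upgrade_all() yourself, so it is safe to paste into an interpreter and try package_names on some sample text first.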

Of course, beware that this may involve re-compiling a fair amount of code, which means it’s not necessarily going to be fast. (It took about 15 minutes on my computer, with quite a few out-of-date libraries.)

Faster is better…

I was going to write about the crazy street preacher I met on the bus this morning, but I have some far more topical stuff to mention.

I’ve been working on some code in python, which, much to my dismay, was taking a LONG time to run.  Around 11 hours of processing time, using 100% of all 8 CPUs (summed up user time of 6110 minutes – about 100 hours of total CPU time), for something I figured should take about an hour.

I’d already used profiling to identify that the biggest bottlenecks were in two places – putting and retrieving information from a single queue shared by all 8 worker processes, and a single function that is central to the calculations being done by the workers.  Not being able to figure out what was happening in the workers’ code, I spent some time optimizing the queue handling instead, using a fancy queue manager (which later turned out to be outperformed by simply picking a queue at random to push the data into) and a separate queue for each process (which did, in fact, cut wasted time significantly).  Before that, the run time had been hovering around 20 hours with 8 processes. So I’d already been doing well, but it was still well above where I thought it should be.
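For the curious, the “pick a random queue” trick looks roughly like the sketch below. It is not my actual code: dispatch is a made-up name, and plain queue.Queue objects stand in for the per-worker queues so the example runs on its own (with real worker processes you would hand it one multiprocessing.Queue per worker instead).

```python
import queue
import random


def dispatch(items, queues, rng=random):
    """Put each item on one randomly chosen queue."""
    for item in items:
        queues[rng.randrange(len(queues))].put(item)


# Stand-ins for the per-worker queues; one queue per worker process.
worker_queues = [queue.Queue() for _ in range(8)]
dispatch(range(1000), worker_queues, random.Random(0))

# How many items each worker would have to process.
sizes = [q.qsize() for q in worker_queues]
```

The appeal is that the producer never has to inspect or lock anything to decide where an item goes, and over a large number of items the load evens out on its own.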

So, following up on John’s comment the other day, I gave Pypy a shot.  Unfortunately, the first thing I discovered was that pypy doesn’t work with numpy, the numerical computing library for python.  No problem – I was only using it in one place.  It only took a few seconds to rewrite that code so that it used a regular 2D array.

Much to my surprise, I started getting an error elsewhere in the code, indicating that a float was being used as an index to a matrix!

Indeed, it only took a few seconds to discover that the code was taking the absolute value of an int with math.fabs, and the returned value was a float – not an integer…

Which means that numpy was hiding that mistake all along, without warning or error!  Simply putting a cast on the float (e.g., int(math.fabs(x))) was sufficient to drive the total running time of the code down to 111 minutes of user time on the next data set of comparable size (about 2 hours, with a real time of 48 minutes because the fancy queue manager mentioned above wasn’t working well).
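Here is a tiny stand-alone illustration of the mistake, using a plain list rather than my real matrix. The variable names are made up for the example; the only claims it relies on are that math.fabs always returns a float and that the built-in abs keeps an int an int.

```python
import math

row = [10, 20, 30, 40]
offset = -2

as_float = math.fabs(offset)   # math.fabs always returns a float
as_int = abs(offset)           # built-in abs keeps an int an int

try:
    row[as_float]
    float_index_ok = True
except TypeError:
    # Plain lists refuse float indices outright.
    float_index_ok = False

value = row[int(as_float)]     # the cast used in the post
same_value = row[as_int]       # alternative that avoids the float entirely
```

In hindsight, using abs() in the first place would have avoided the problem without needing the cast at all.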

Yes, I’m comparing apples with oranges by not rerunning on the same data set, but I am trying to process a large volume of data, and I wasn’t expecting such a massive performance boost. I’ll get around to proper benchmarking when that’s done.

Unfortunately, in the end, I never could get pypy running.  It turns out that it’s incompatible with pysam, a library I’m using to process bam (next-generation sequencing) files.  I don’t have any alternatives for that library and I can’t drop it, so pypy is out.  However, it did help me identify that numpy is far too accepting of bad data for array indexes, and while it is able to handle it correctly, numpy does so with a huge time penalty on your code.

So, lessons learned: profiling is good, pypy isn’t ready for heavy computing use, and numpy should be treated with caution!  And, of course, yes, you can write fast code with python – you just have to know what you’re doing!

A few things on python

A few quick notes.  Python, so far, is really a neat language.  It’s fast to write in, it’s easy to do unit tests and using pydev made the transition to python a lot easier from Java.  I’m not nearly a professional python developer by any stretch, but I can bang out python code pretty quickly now, and I’m pretty happy.

Duck-typing does annoy the heck out of me still, because I know things will crash at run time instead of easily caught errors being flagged while writing the code or while compiling, but I’m sure I’ll get over that, and unit tests do compensate for it.

I’ve also picked up Egit for version control, and that has been a bit confusing – not because it’s git, but because the git model of software development doesn’t provide an external backup for your repository unless you push it to the server.  Somehow, I hadn’t actually realized that until I started playing with it.  It simply means pushing to a branch at the end of the day for a backup, or at stable points, which isn’t a bad idea, really.  Once I get more comfortable with the system, I’m sure it’ll work far better than SVN ever did for me.

Beyond that, I’m also pretty impressed with the libraries available for python.  I’ve played a little with Pysam, and have been reasonably impressed with the results.  I haven’t done any benchmarks on it yet, but I’m ok with it so far.  It did take me a while to realize that an aligned read’s “.aend” property is the 3′ end and the “.pos” is the 5′ end on the positive strand… and I’m not convinced that I haven’t somehow introduced an off-by-one error in the code, but these things will be sorted out in time.
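To keep the off-by-one worry in check, it helps to write the coordinate convention down explicitly. The sketch below assumes, as I understand the pysam documentation, that “.pos” is the 0-based leftmost aligned position and “.aend” points one past the last aligned base; if that assumption is wrong, so is the arithmetic. The helper aligned_span is hypothetical and takes plain integers, so it doesn’t need pysam installed.

```python
def aligned_span(pos, aend):
    """Describe a read from a 0-based start and an exclusive end.

    Assumption: pos is 0-based and inclusive, aend is 0-based and
    exclusive (one past the last aligned base).
    """
    return {
        "first_base_0based": pos,
        "last_base_0based": aend - 1,   # aend itself is not covered
        "first_base_1based": pos + 1,   # 1-based, inclusive coordinates
        "last_base_1based": aend,
        "reference_length": aend - pos,
    }


# A made-up read covering 36 reference bases.
span = aligned_span(pos=100, aend=136)
```

Under that convention the length is simply aend minus pos, with no plus-one correction, which is exactly the sort of detail that is easy to get wrong.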

Otherwise, I think I can say I’m reasonably happy with the choice of python.  I’m looking forward to playing with a few other libraries, like matplotlib, and scientific python, as well as Tkinter, and – although I know there will be a learning curve on it – I think there are more than sufficient python tutorials and forums to help make it reasonably easy to get through.

Ok, time for some more coding. (-: