Those who’ve worked with me over the past couple of years will know I’m a big fan of multiprocessing, a package in Python’s standard library that spawns new processes, which you then use much the same way you’d use threads in other programming languages. Mainly, that’s because Python’s GIL (global interpreter lock) lets only one thread run Python bytecode at a time, which more or less throttles any serious attempt to get CPU-bound threads to run in parallel. However, multiprocessing is a nice replacement that sidesteps the issue, since each process gets its own interpreter and its own GIL, allowing you to use as many of your computer’s cores as are available to you.
Consequently, I’ve spent part of the last couple of days building up a new set of generic processes that will let me parallelize pretty much any piece of code that can work with a queue. That is to say, if I can toss a bunch of things into a pile, and have each piece processed by a separate running instance of code, I can use this library. It’ll be very handy for processing individual lines in a file (e.g., VCF or fastq, or anything where the lines are independent).
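To make the "pile of things" idea concrete, here is a minimal sketch of a queue-fed worker pool. It is not the library described above; the names `worker`, `process_in_parallel`, and `line_length` are made up for illustration, and it assumes the per-item function is a picklable module-level function that does not raise.

```python
import multiprocessing as mp

SENTINEL = None  # placed on the queue once per worker to signal "no more work"


def worker(task_queue, result_queue, func):
    """Pull items off task_queue until the sentinel arrives; emit func(item)."""
    while True:
        item = task_queue.get()
        if item is SENTINEL:
            break
        result_queue.put(func(item))


def process_in_parallel(items, func, n_workers=2):
    """Apply func to every item using n_workers processes.

    Results come back in completion order, not input order.
    """
    items = list(items)
    task_queue = mp.Queue()
    result_queue = mp.Queue()
    workers = [
        mp.Process(target=worker, args=(task_queue, result_queue, func))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for item in items:
        task_queue.put(item)
    for _ in workers:
        task_queue.put(SENTINEL)
    # Collect results before joining so no worker is left blocked on a full pipe.
    results = [result_queue.get() for _ in items]
    for w in workers:
        w.join()
    return results


def line_length(line):
    """Hypothetical per-line job; a real one might parse a VCF record."""
    return len(line.rstrip("\n"))


if __name__ == "__main__":
    lines = ["chr1\t100\tA\tT\n", "chr2\t2500\tG\tC\n", "chr3\t7\tT\tA\n"]
    print(sorted(process_in_parallel(lines, line_length)))
```

One design caveat with a sketch this small: if `func` raises inside a worker, that worker exits and the parent waits forever for the missing result, so real code would want to catch exceptions in the worker and report them back through the result queue.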
Of course, this post only has any relevance because I’ve also decided to move from Python 2.7 to 3.6, and to no one’s surprise, things have changed. In 2.7, I spent time creating objects that had built-in locks, and shared ctypes variables that could be passed around. In 3.6, all of that becomes irrelevant for my purposes. Instead, you create a new object, a Manager().
The Manager is a relatively complex object, in that it handles its own locking internally (I haven’t figured out how efficient that is yet, which is probably a bit further down the road), and that makes most of the Lock wrapping I’d done in 2.7 obsolete. My first attempt at making it work was a failure, as it constantly threw errors saying that you can’t put Locks into the Manager. In fact, you also can’t put objects containing locks (such as a multiprocessing Value) into the Manager. You can, however, replace them with Value objects created by the Manager itself.
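As a minimal sketch of that replacement (not my actual class; `add_many` and `count_in_parallel` are names invented for this example), a shared counter can be built from `manager.Value` and `manager.Lock`, both of which are proxies that can be handed to other processes. One assumption worth stating loudly: `counter.value += 1` is a separate read and write on the proxy, so this sketch still takes an explicit manager lock around the update.

```python
import multiprocessing as mp


def add_many(counter, lock, n):
    """Increment the shared counter n times."""
    for _ in range(n):
        # The += is a read followed by a write, so guard it with the proxy lock.
        with lock:
            counter.value += 1


def count_in_parallel(n_workers=4, n_increments=100):
    """Return the final counter value after all workers have finished."""
    with mp.Manager() as manager:
        counter = manager.Value("i", 0)  # proxy to an int held by the manager process
        lock = manager.Lock()            # proxy lock, safe to pass to child processes
        workers = [
            mp.Process(target=add_many, args=(counter, lock, n_increments))
            for _ in range(n_workers)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return counter.value


if __name__ == "__main__":
    print(count_in_parallel())
```

Every access to a proxy is a round trip to the manager process, so this is convenient rather than fast; for a hot counter, batching updates per worker is likely to be cheaper.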
The part of the Manager that I haven’t played with yet is its apparent ability to share information across computers, if you launch it as a server process. Although that’s likely overkill (and network latency makes me really shy away from it), it seems like it could be useful for building big cluster jobs. Again, that’s something much further down the road for me.
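For the curious, a rough sketch of the server idea, following the general pattern of the remote manager example in the standard-library documentation, might look like the following. Everything named here (`JobServer`, `JobClient`, `get_jobs`, the placeholder authkey) is an assumption for illustration, and it only talks to itself over localhost rather than to a second machine.

```python
import queue
from multiprocessing.managers import BaseManager

_jobs = queue.Queue()  # the real queue; it ends up living in the server process


def get_jobs():
    """Return the queue that the server exposes to clients."""
    return _jobs


class JobServer(BaseManager):
    """Server side: knows how to produce the shared queue."""


class JobClient(BaseManager):
    """Client side: only knows the name of the shared object."""


JobServer.register("get_jobs", callable=get_jobs)
JobClient.register("get_jobs")


def start_server(host="127.0.0.1", port=0, authkey=b"not-a-real-secret"):
    """Start the manager's server process; port 0 lets the OS pick a free port."""
    manager = JobServer(address=(host, port), authkey=authkey)
    manager.start()
    return manager


def connect(address, authkey=b"not-a-real-secret"):
    """Connect to a running server; on a cluster, address would be a remote host."""
    client = JobClient(address=address, authkey=authkey)
    client.connect()
    return client


def round_trip(message):
    """Put a message through a client proxy and read it back on the server side."""
    server = start_server()
    try:
        client = connect(server.address)
        client.get_jobs().put(message)
        return server.get_jobs().get(timeout=5)
    finally:
        server.shutdown()


if __name__ == "__main__":
    print(round_trip("hello from the client"))
```

The authkey is what stops arbitrary hosts from connecting, so anything beyond a toy would need a real secret and a deliberate choice of which interface to bind to.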
Although not a huge milestone, it’s good to have at least one essential component back in my toolkit: My unit test suite passes, doing some simple processing using the generic processing class. And yes, good code requires good unit tests, so I’ve also been writing those.
Lessons learned the hard way are often remembered the best. Writing multiprocessing code from scratch was a great exercise, and learning some of the changes between 2.7 and 3.6 was definitely worthwhile.