Issues with Python 3.6.3 Multiprocessing. Anyone else see the same?

On the ferry once more, and wanted to share a hard-fought lesson that I learned today, which somewhat updates my post from the other day on multiprocessing and python 3.6.3.

Unfortunately, the lessons weren’t nice at all.

First, I discovered that using the new Manager object is a terrible idea, even for incredibly simple objects (e.g. an incrementing value, incremented every couple of seconds). The implementation is significantly slower than building your own object out of a lock and a shared value, just to have two threads take turns incrementing the value. Ouch. (I don’t have my benchmarks handy, unfortunately, but IIRC the hand-rolled version ran in about 10% of the Manager version’s time.)

Worse still, using a manager.Queue object is horrifically bad. I created an app where one process reads from a file and puts things into a queue, and a second process reads from that queue and does some operations on the objects. Now, my objects are just small lists with one integer in them, so they’re pretty small. Switching from a multiprocessing Queue to a Manager Queue caused a 3-fold increase in the time to execute. (5 seconds to 15 seconds.) Given that the whole reason for writing multiprocessing code is to speed up the processing of my data, the Manager is effectively a non-starter for me.

I understand, of course, that the overhead might be worth it if your Manager runs on a separate server and can make use of multiple machines, but I’m working on the opposite problem: one machine with several cores.

The second big discovery, of course, was that multiprocessing Queues really don’t work well in python 3.6.3. I don’t know when this happened, but somewhere along the line, someone changed their behaviour.

In 2.7, I could create one process that fills the Queue, and then create a second type of process that reads from the queue. As long as process 1 is much faster than process 2, the rate-limiting step is process 2. Thus, doubling the number of process 2’s should roughly double the throughput of the job.

Unfortunately, in 3.6.3, this is no longer the case – the speed with which the processes obtain data from the queue is now the rate-limiting step. Process 2 can call Queue.get(), but get() serves the data at a constant overall rate, no matter how many process 2’s are calling it.

That means that you can’t get any speed-up from multiprocessing Queues…. unless you have a separate queue for every process 2. Yep… that’s what I did this afternoon: replaced the single queue with a list of queues, so that each worker process gets its own Queue.

Bad design, you say? Yes! I agree. In fact, since I now have a set of queues in which there’s only one writer and one reader, I shouldn’t be using queues at all. I should be using Pipes!

So, tomorrow, I’ll rip out all of my queues and start putting in pipes. (Except where I have multiple processes writing to a single queue, of course – those will stay as queues.)

I don’t know where multiprocessing in python went wrong, but that was a severely disappointing moment this morning when I discovered this issue. For now, I’ll resist the urge to return to python 2.7.

(If anyone knows where I went wrong, please let me know – we all make mistakes, and I’m really hoping I’m wrong on this one.)

Handy little command for upgrading python libraries…

About three weeks ago I googled for a quick tutorial on how to upgrade all of the libraries being used by python – and came up completely empty handed. Absolutely nothing useful turned up, which I found rather frustrating. The Python installer (pip) should certainly have an “upgrade all” function – but if it does, I couldn’t find it. If anyone comes across such a thing, I’d love to hear about it.

This morning, on my bike in to work, I realized I could hack a very quick command line together to make it work:

sudo pip freeze | awk -F'==' '{print $1}' | xargs -I {} sudo pip install {} --upgrade

Nothing to it! It iterates one by one and upgrades all of the installed software. When a package is up to date, it’s clearly indicated, and when it’s not, it tries to upgrade, rolling back if it’s unsuccessful. I’ve noticed that many of the upgrades failed because of an out of date numpy package, so you may want to upgrade that first. Also, Eclipse isn’t too happy with the process, as it will detect the changes and freak out a bit – you might want to exit anything using or depending on the python libraries (such as django web server) first.
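One awk footnote on the command above: assigning FS inside the action block only takes effect from the next input line onward, so the very first package would be passed along with its version still attached; the -F flag applies the separator from the first line:

```shell
# FS assigned inside the action only applies from the *second* line on,
# so the first package keeps its pinned version:
printf 'numpy==1.13.3\nscipy==1.0.0\n' | awk '{FS = "=="; print $1}'
# -> numpy==1.13.3
#    scipy

# -F sets the separator before any line is read:
printf 'numpy==1.13.3\nscipy==1.0.0\n' | awk -F'==' '{print $1}'
# -> numpy
#    scipy
```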

Of course, beware that this may involve re-compiling a fair amount of code, which means it’s not necessarily going to be fast. (Took about 15 minutes on my computer, with quite a few out of date libraries)

Faster is better…

I was going to write about the crazy street preacher I met on the bus this morning, but I have some far more topical stuff to mention.

I’ve been working on some code in python, which, much to my dismay, was taking a LONG time to run.  Around 11 hours of processing time, using 100% of all 8 CPUs (a summed user time of 6110 minutes – about 100 hours of total CPU time), for something I figured should take about an hour.

I’d already used profiling to identify the biggest bottlenecks, which were in two places: putting information into and retrieving it from a single queue shared by all 8 worker processes, and a single function that is central to the calculations being done by the workers.  Not being able to figure out what was happening in the workers’ code, I spent some time optimizing the queue management – some fancy queue balancing (which, it later turned out, was outperformed by simply picking a queue at random to push the data into) and a separate queue for each process (which did, in fact, cut wasted time significantly).  Before that, it had been hovering around 20 hours with 8 processes.  So I’d already made progress, but the run time was still well above where I thought it should be.

So, following up on John’s comment the other day, I gave Pypy a shot.  Unfortunately, the first thing I discovered was that pypy doesn’t work with numpy, the numerical computing library for python.  No problem – I was only using it in one place.  It only took a few seconds to rewrite that code so that it used a regular 2D array.

Much to my surprise, I started getting an error elsewhere in the code, indicating that a float was being used as an index to a matrix!

Indeed, it only took a few seconds to discover that the code was taking the absolute value of an int with math.fabs(), which returns a float – not an integer…

Which means that numpy had been hiding that mistake all along, without warning or error!  Simply putting a cast on the float (e.g. int(math.fabs(x))) was sufficient to drive the total running time of the code down to 111m of user time on the next data set of comparable size (about 2 hours of CPU time, with a real time of 48 minutes, because the fancy queue manager mentioned above wasn’t working well).
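The gotcha is easy to demonstrate with plain Python (note: recent numpy versions now reject float indices with an error, but at the time numpy silently accepted them):

```python
import math

# abs() preserves the int type; math.fabs() always returns a float.
print(type(abs(-3)).__name__)        # int
print(type(math.fabs(-3)).__name__)  # float

# A plain Python list (like a recent numpy array) rejects a float index:
row = [10, 20, 30]
try:
    row[math.fabs(2)]
except TypeError:
    print('float index rejected')

# The explicit cast described above fixes it:
x = -2
print(row[int(math.fabs(x))])  # 30
```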

Yes, I’m comparing apples with oranges by not rerunning on the same data set, but I am trying to process a large volume of data, and I wasn’t expecting such a massive performance boost.  I’ll get around to proper benchmarking when that’s done.

Unfortunately, in the end, I never could get pypy running.  It turns out that it’s incompatible with pysam, a library I’m using to process bam (next-generation sequencing) files.  I don’t have any alternatives for that library and I can’t drop it, so pypy is out.  However, it did help me identify that numpy is far too accepting of bad data for array indexes, and while it is able to handle it correctly, numpy does so with a huge time penalty on your code.

So, lessons learned: profiling is good, pypy isn’t ready for heavy computing use, and numpy should be treated with caution!  And, of course, yes, you can write fast code with python – you just have to know what you’re doing!

A few things on python

A few quick notes.  Python, so far, is really a neat language.  It’s fast to write in, it’s easy to do unit tests, and using pydev made the transition from Java to python a lot easier.  I’m not a professional python developer by any stretch, but I can bang out python code pretty quickly now, and I’m pretty happy.

Duck-typing does annoy the heck out of me still, because I know things will crash at run time instead of easily caught errors being flagged while writing the code or while compiling, but I’m sure I’ll get over that, and unit tests do compensate for it.

I’ve also picked up Egit for version control, and that has been a bit confusing – not because it’s git, but because the git model of software development doesn’t provide an external backup for your repository unless you push it to the server.  Somehow, I hadn’t actually realized that until I started playing with it.  It simply means pushing to a branch at the end of the day for a backup, or at stable points, which isn’t a bad idea, really.  Once I get more comfortable with the system, I’m sure it’ll work far better than SVN ever did for me.

Beyond that, I’m also pretty impressed with the libraries available for python.  I’ve played a little with Pysam, and have been reasonably impressed with the results.  I haven’t done any benchmarks on it yet, but I’m ok with it so far.  It did take me a while to realize that an aligned read’s “.aend” property is the 3′ end and the “.pos” is the 5′ end on the positive strand… and I’m not convinced that I haven’t somehow introduced an off-by-one error in the code, but these things will be sorted out in time.
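If I’ve read the docs right, pysam uses 0-based, half-open coordinates, which is where the off-by-one worry comes from.  A toy illustration (all numbers invented):

```python
# For a read aligned to the positive strand, in pysam's convention
# (as I understand it): .pos is the 0-based leftmost (5') position,
# and .aend points one *past* the last aligned base (the 3' end).
pos = 100            # hypothetical 0-based start (5' end on + strand)
aligned_length = 50  # bases the alignment consumes on the reference
aend = pos + aligned_length   # 150: one past the last aligned base
last_base = aend - 1          # 149: the actual 3'-most aligned position
print(pos, last_base, aend)
```

In other words, the half-open convention means aend - pos gives the aligned length directly, but the 3′-most base itself is at aend - 1.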

Otherwise, I think I can say I’m reasonably happy with the choice of python.  I’m looking forward to playing with a few other libraries, like matplotlib, and scientific python, as well as Tkinter, and – although I know there will be a learning curve on it – I think there are more than sufficient python tutorials and forums to help make it reasonably easy to get through.

Ok, time for some more coding. (-:

Starting over

I’m back.  Physically, I’m back in Canada, although not yet home.  I’m visiting family while all our possessions make their way back to Vancouver.  In the meantime, I wanted to get back to blogging.  To re-engage in the community and return my life to some sense of normalcy.

On Denmark, I don’t plan to say much.  It was a terrible experience from start to finish, and I’m leaving with less money, less stability – and none of the bioinformatics experience I had wanted.  All in all, it was a disaster.  If people want specific details or advice about moving to Denmark, of course I’ll share what I know, but this isn’t the right forum for it.  For the moment, I won’t comment on how things went down at the end, although I’ve heard less-than-accurate versions in circulation.

On the subject of bioinformatics, I feel a bit out of touch.  I’ll be starting to get back into it shortly.  Obviously, it’ll take some time to ramp up and get back into the swing of things.  However, I can say that last night was the first time in a year that I had actual free time. So what did I do?  I started to learn Python.  Honestly, I don’t think Java is the right tool for all occasions, and with about a month of downtime, python just feels like it might be the best fit for some of the stuff I’ll be working on in the future.

Anyhow, with any luck, things will start to work their way out.  At least, being back in Canada, I can see the light at the end of the tunnel.

Where’s the collaboration?

I had another topic queued up this morning, but an email from my sister-in-law reminded me of a more pressing beef: Lack of collaboration in the sciences. And, of course, I have no statistics to back this up, so I’m going to put this out there and see if anyone has anything to comment on the topic.

My contention is that the current method of funding scientists is the culprit driving less efficient science, mixed with a healthy dose of Zero Sum Game thinking.

First, my biggest pet peeve is that scientists – and bioinformaticians in particular – spend a lot of time reinventing the wheel.  How many SNP callers are currently available?  How many ChIP-Seq packages? How many aligners?  And, more importantly, how can you tell one from the other?  (How many of the hundreds of SNP callers have you actually used?)

It’s a pretty annoying aspect of bioinformatics that people seem to feel the need to start from scratch on a new project every time they say “I could tweak a parameter in this alignment algorithm…”  and then off they go, writing aligner #23,483,337 from scratch instead of modifying the existing aligner.  At some point, we’ll have more aligners than genomes!  (Ok, that’s a shameless hyperbole.)

But, the point stands.  Bioinformaticians create a plethora of software packages that solve problems that are not entirely new.  While I’m not saying that bioinformaticians are working on solved problems, I am asserting that the creation of novel software packages is an inefficient way to tackle problems that someone else has already invested time and money into building software for. But I’ll come back to that in a minute.

But why is the default behavior to write your own package instead of building on top of an existing one?  Well, that’s clear: Publications.  In science, the method of determining your progress is how many journal publications you have, skewed by some “impact factor” for how impressive the name of the journal is.  The problem is that this is a terrible metric to judge progress and contribution.  Solving a difficult problem in an existing piece of software doesn’t merit a publication, but wasting 4 months to rewrite a piece of software DOES.

The science community, in general, and the funding community more specifically, will reward you for doing wasteful work instead of focusing your energies where it’s needed. This tends to squash software collaborations before they can take off simply by encouraging a proliferation of useless software that is rewarded because it’s novel.

There are examples of bioinformatics packages where collaboration is a bit more encouraged – and those provide models for more efficient ways of doing research.  For instance, in the molecular dynamics community, Charmm and Amber are the two software frameworks around which most people have gathered. Grad students don’t start their degree by being told to re-write one package or the other, but are instead told to learn one and then add modules to it.  Eventually the modules are released along with a publication describing the model.  (Or left to rot on a dingy hard drive somewhere if they’re not useful.)   Publications come from the work done and the algorithm modifications being explained.  That, to me, seems like a better model – and it means everyone doesn’t start from scratch.

If you’re wondering where I’m going with this, it’s not towards the Microsoft model where everyone does bioinformatics in Excel, using Microsoft generated code.

Instead, I’d like to propose a coordinated bioinformatics code-base.  Not a single package, but a unified set of hooks instead.  Imagine one code base, where you could write a module and add it to a big git hub of bioinformatics code – and re-use a common (well-debugged) core set of functions that handle many of the common pieces.  You could swap out aligner implementations and have modular common output formats.  You could build a ChIP-Seq engine, and use modular functions for FDR calculations, replacing them as needed.  Imagine you could collaborate on code design with someone else – and when you’re done, you get a proper paper on the algorithm, not an application note announcing yet another package.
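To make the “unified set of hooks” idea a little more concrete, here’s a toy sketch of what module registration might look like – every name here is hypothetical, since no such framework exists:

```python
# Hypothetical "common hooks" sketch: modules register implementations
# against a named hook, and a pipeline picks one at run time.
registry = {}

def register(hook, name):
    """Decorator: file an implementation under registry[hook][name]."""
    def wrap(fn):
        registry.setdefault(hook, {})[name] = fn
        return fn
    return wrap

@register('aligner', 'toy_exact')
def toy_exact_aligner(read, reference):
    # Trivial placeholder aligner: first exact-match position, or -1.
    return reference.find(read)

def run(hook, name, *args):
    """Dispatch to whichever implementation the pipeline selected."""
    return registry[hook][name](*args)

print(run('aligner', 'toy_exact', 'GAT', 'AAGATCC'))  # 2
```

Swapping in a different aligner would then be a one-line change to the pipeline, while the surrounding machinery (I/O, formats, statistics) stays shared and well debugged.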

(We have been better in the past couple years with tool sets like SAMTools, but that deals with a single common file format.  Imagine if that also allowed for much bigger projects like providing core functions for RNA-Seq or CNV analysis…  but I digress.)

Even better, if we all developed around a single set of common hooks, you can imagine that, at the end of the day (once you’ve submitted your repository to the main trunk), someone like the Galaxy team would simply vacuum up your modules and instantly make your code available to every bioinformatician and biologist out there.  Instant usability!

While this model of bioinformatics development would take a small team of core maintainers for the common core and hooks, much the same way Linux has Linus Torvalds working on the Kernel, it would also cut down severely on code duplication, bugs in bioinformatics code and the plethora of software packages that never get used.

I don’t think this is an unachievable goal, either for the DIY bioinformatics community, the Open Source bioinformatics community or the academic bioinformatics community.  Indeed, if all three of those decided to work together, it could be a very powerful movement.  Moreover, corporate bioinformatics could be a strong player in it, providing support and development for users, much the way corporate Linux players have done for the past two decades.

What is needed, however, is buy-in from some influential people, and some influential labs.  Putting aside their own home grown software and investing in a common core is probably a challenging concept, but it could be done – and the rewards would be dramatic.

Finally, coming back to the funding issue.  Agencies funding bioinformatics work would also save a lot of money by investing in this type of framework.  It would ensure that more time is spent on useful coding and on publications that actually describe algorithms, and that higher-quality code is produced at the end of the day.  The big difference is that they’d have to start accepting that bioinformatics papers shouldn’t be about “new software”, but about “new statistics”, “new algorithms” and “new methods” – which may require a paradigm change in the way we evaluate bioinformatics funding.

Anyhow, I can always dream.

Notes: Yes, there are software frameworks out there that could be used to get the ball rolling.  I know Galaxy has some fantastic tools, but (if I’m not mistaken) it doesn’t provide a common framework for coding – only for interacting with the software.  I’m also aware that Charmm and Amber have problems – mainly because they were developed by competing labs that failed to be entirely inclusive of the community, or to invest substantially in maintaining the infrastructure in a clean way.  Finally, yes, the licensing of this code would determine the extent of corporate participation, but the GPL provides at least one successful example of this working.

How to be a better Programmer: Tactics.

I’m a bit too busy for a long post, but a link was circulating around the office that I thought was worth passing on to any bioinformaticians out there.

http://dlowe-wfh.blogspot.com/2007/06/tactics-tactics-tactics.html

The article above is on how to be a better programmer – and I wholeheartedly agree with what the author proposed, with one caveat that I’ll get to in a minute. The point of the article is that learning to see the big picture (not specific skills) will make you a better programmer. In fact, this is the same advice Sun Tzu gives in “The Art of War”, where understanding the terrain, the enemy, etc. are the tools you need to be a better general. [This would be in contrast to learning how to wield each weapon, which would only make you a better warrior.] Frankly, it’s good advice, and it leads you down the path towards good planning and clear thinking – the keys to success in most fields.

The caveat, however, is that there are times in your life when this is the wrong approach: i.e. grad school. As a grad student, your goal isn’t to be great at everything you touch – it’s to specialize in some small corner of one field, and tactics are no help here. If grad school existed for Ninjas, the average student would walk out being the best (pick one of: poisoner/dart thrower/wall climber/etc.) in the world – and likely knowing little or nothing about how to be a real ninja beyond what they learned in their Ninja undergrad. Tactics are never a bad investment, but they aren’t always what is being asked of you.

Anyhow, I plan to take the advice in the article and to keep studying the tactics of bioinformatics in my spare time, even though my daily work is more on the details and implementation side of it. There are a few links in the comments of the original article to sites the author believes are good comp-sci tactics… I’ll definitely be looking into those tonight. Besides, when it comes down to it, the tactics are really the fun parts of the problems, although there is also something to be said for getting your code working correctly and efficiently…. which I’d better get back to. (=

Happy coding!

Link Roundup Returns – Dec 16-22

I’ve been busy with my thesis project for the past couple of weeks, which I think is understandable, but all work and no play doesn’t sit well with me. So, over the weekend, I learned Go, Google’s new programming language, and wrote myself a simple application for keeping track of links – and dumping them out in a pretty html format that I can just cut and paste into my blog.

While I’m not quite ready to release the code for my little go application, I am ready to test it out. I went back through the last 200 twitter posts I have (about 8 days worth), and grabbed the ones that looked interesting to me. I may have missed a few, or grabbed a few less than thrilling ones. It’s simply a consequence of me skimming some of the articles less well than others. I promise the quality of my links will be better in the future.

Anyhow, this experiment gave me a few insights into the process of “reprocessing” tweets. The first is that my app only records the person from whom I got the tweet – not the people from whom they got it. I’ll try to address that in the future. The second is that it’s a very simple interface – and a lot of things I wanted to say just didn’t fit. (Maybe that’s for the better.. who knows.)

Regardless (or irregardless, for those of you in the U.S.) here are my picks for the week.

Bioinformatics:

  • Bringing back Blast (Blast+) (PDF) – Link (via @BioInfo)
  • Incredibly vague advice on how to become a bioinformatician – Link (via @KatherineMejia)
  • Cleaning up the Human Genome – Link (via @dgmacarthur)
  • Neat article on “4th paradigm of computing: exaflood of observational data” – Link (via @genomicslawyer)

Biology:

  • Gene/Protein Annotation is worse than you thought – Link (via @BioInfo)
  • Why are europeans white? – Link (via @lukejostins)

Future Technology:

  • D-Wave Surfaces again in discussions about bioinformatics – Link (via @biotechbase)
  • Changing the way we give credit in science – Link (via @genomicslawyer)

Off topic:

  • On scientists getting quote-mined by the press – Link (via @Etche_homo)
  • Give away of the best science cookie cutters ever – Link (via @apfejes)
  • Neat early history of the electric car – Link (via @biotechbase)
  • Wild (inaccurate and funny) conspiracy theories about the Wellcome Trust Sanger Institute – Link (via @dgmacarthur)
  • The Eureka Moment: An Interview with Sir Alec Jeffreys (Inventor of the DNA Fingerprint) – Link (via @dgmacarthur)
  • Six types of twitter user (based on The Tipping Point) – Link (via @ritajlg)

Personal Medicine:

  • Discussion on mutations in cancer (in the press) – Link (via @CompleteGenomic)
  • Upcoming Conference: Personalized Medicine World Conference (Jan 19-20, 2010) – Link (via @CompleteGenomic)
  • deCODEme offers free analysis for 23andMe customers – Link (via @dgmacarthur)
  • UK government waking up to the impact of personalized medicine – Link (via @dgmacarthur)
  • Doctors not adopting genomic based tests for drug suitability – Link (via @dgmacarthur)
  • Quick and dirty biomarker detection – Link (via @genomicslawyer)
  • Personal Genomics article for the masses – Link (via @genomicslawyer)

Sequencing:

  • Paper doing the rounds: Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data – Link (via @BioInfo)
  • Archiving Next Generation Sequencing Data – Link (via @BioInfo)
  • Epigenetics takes aim at cancer and other illnesses – Link (via @BioInfo)
  • (Haven’t yet read) Changing economics of DNA Synthesis – Link (via @biotechbase)
  • Genomic players for investors. (Very light overview) – Link (via @genomicslawyer)
  • Haven’t read yet: Recommended review of 2nd and 3rd generation seq. technologies – Link (via @nanopore)
  • De novo assembly of Giant Panda Genome – Link (via @nanopore)
  • Wellcome Trust summary of 2nd Gen sequencing technologies – Link (via @ritajlg)

Go from Google…

Just a short post, since I’m actually (although you probably can’t tell) rather busy today. However, I’m absolutely fascinated by Google’s new language, Go. It’s taken the best from just about every existing language out there, and appears so clean!

I’m currently watching Google’s talk on it, while I write… I’m only a few minutes in, but it seems pretty good. Watching this seriously makes me want to start a new bio-go project… so nifty!

New ChIP-seq control

Ok, so I’ve finally implemented and debugged a second type of control in FindPeaks… It’s different, and it seems to be more sensitive, requiring fewer assumptions to be made about the data set itself.

What it needs, now, is some testing. Is anyone out there willing to try a novel form of control on a dataset that they have? (I won’t promise it’s flawless, but hey, it’s open source, and I’m willing to bug fix anything people find.)

If you do, let me know, and I’ll tell you how to activate it. Let the testing begin!