On the ferry once more, and wanted to share a hard-fought lesson that I learned today, which somewhat updates my post from the other day on multiprocessing and python 3.6.3.
Unfortunately, the lessons weren’t nice at all.
First, I discovered that using the new Manager object is a terrible idea, even for incredibly simple objects (e.g. an incrementing value, incremented every couple of seconds). The implementation is significantly slower than creating your own object out of a lock and a shared value, just to have two threads take turns incrementing the value. Ouch. (I don’t have my benchmarks, unfortunately, but it was about 10% of the run time, IIRC.)
Worse still, using a manager.Queue object is horrifically bad. I created an app where one process reads from a file and puts things into a queue, and a second process reads from that queue and does some operations on each object. Now, my objects are just lists with one integer in each, so they’re pretty small. Switching from a multiprocessing Queue to a Manager Queue caused a 3-fold increase in the time to execute. (5 seconds to 15 seconds.) Given that the whole reason for writing multiprocessing code is to speed up the processing of my data, the Manager is effectively a non-starter for me.
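For anyone wondering what "a lock and a shared value" might look like in practice, here is a minimal sketch, assuming multiprocessing.Value (which comes with its own lock) is an acceptable stand-in. The names increment and run_demo are made up for illustration and are not from the original app.

```python
import multiprocessing as mp


def increment(counter, times):
    """Add 1 to the shared counter `times` times, taking its lock each time."""
    for _ in range(times):
        # `+=` on a shared value is a read followed by a write,
        # so the lock is held around the whole update.
        with counter.get_lock():
            counter.value += 1


def run_demo(workers=2, times=1000):
    """Hypothetical driver: several processes bump one shared counter."""
    # Value("i", 0) is a C int in shared memory, created with a lock by default.
    counter = mp.Value("i", 0)
    procs = [
        mp.Process(target=increment, args=(counter, times))
        for _ in range(workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is only that the counter lives in shared memory, so each increment is a lock acquire plus a memory write rather than a round trip to a separate manager process.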
I understand, of course, that that overhead might be worth it if your Manager runs on a separate server, and can make use of multiple machines, but I’m working on the opposite problem, with one machine and several cores.
The second big discovery, of course, was that multiprocessing Queues really don’t work well in python 3.6.3. I don’t know when this happened, but somewhere along the line, someone has changed their behaviour.
In 2.7, I could create one process that fills the Queue, and then create a second type of process that reads from the queue. As long as process 1 is much faster than process 2, the rate limiting step would be process 2. Thus, doubling the number of process 2’s should double the speed of the job.
Unfortunately, in 3.6.3, this is no longer the case – the speed with which the processes obtain data from the queue is now the rate limiting step. Process 2 can call Queue.get(), but get is only serving the data at a constant speed, no matter how many process 2’s are calling the Queue.get() function.
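The one-writer, many-readers layout described above can be sketched roughly as follows. This is an illustration under assumptions, not the original code: the function names, the None end marker, and the "double the integer" stand-in for real work are all hypothetical.

```python
import multiprocessing as mp


def producer(work, items, n_consumers):
    """Process 1: put every item on the shared queue, then one end marker per reader."""
    for item in items:
        work.put(item)
    for _ in range(n_consumers):
        work.put(None)  # None is an assumed "no more work" marker


def consumer(work, results):
    """Process 2: pull items until the end marker arrives."""
    while True:
        item = work.get()
        if item is None:
            break
        results.put(item[0] * 2)  # stand-in for the real operation


def run_demo(n_items=100, n_consumers=2):
    """Hypothetical driver: one producer, several consumers, one shared queue."""
    work, results = mp.Queue(), mp.Queue()
    items = [[i] for i in range(n_items)]  # small one-integer lists
    procs = [mp.Process(target=producer, args=(work, items, n_consumers))]
    procs += [
        mp.Process(target=consumer, args=(work, results))
        for _ in range(n_consumers)
    ]
    for p in procs:
        p.start()
    # Drain the results before joining, so no child is left blocked
    # trying to flush its queue buffer.
    out = [results.get() for _ in range(n_items)]
    for p in procs:
        p.join()
    return sorted(out)


if __name__ == "__main__":
    print(len(run_demo()))
```

Timing run_demo with different values of n_consumers would be one way to check whether adding readers actually helps on a given Python version.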
That means that you can’t get any speed up from multiprocessing Queues… unless you have a single queue for every process 2. Yep… that’s what I did this afternoon. I replaced the single queue with a list of queues, so that every process 2 has a Queue of its own.
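A rough sketch of the "list of queues" workaround, assuming the writer simply deals items out round-robin. How the original app chose a queue for each item isn't stated, so the round-robin policy, the names, and the None end marker are all assumptions.

```python
import itertools
import multiprocessing as mp


def distribute(items, queues):
    """Deal items round-robin across the queues, then end-mark each queue."""
    for item, q in zip(items, itertools.cycle(queues)):
        q.put(item)
    for q in queues:
        q.put(None)  # assumed "no more work" marker


def worker(q, results):
    """Read from this worker's private queue until the end marker arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        results.put(item[0] * 2)  # stand-in for the real operation


def run_demo(n_items=100, n_workers=2):
    """Hypothetical driver: one queue per worker, filled by the parent."""
    queues = [mp.Queue() for _ in range(n_workers)]
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(q, results)) for q in queues]
    for p in procs:
        p.start()
    distribute([[i] for i in range(n_items)], queues)
    # Collect results before joining so no worker blocks on a full buffer.
    out = [results.get() for _ in range(n_items)]
    for p in procs:
        p.join()
    return sorted(out)


if __name__ == "__main__":
    print(len(run_demo()))
```

Each queue here has exactly one writer and one reader, which is the observation that leads to pipes.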
Bad design, you say? Yes! I agree. In fact, since I now have a set of queues in which there’s only one writer and one reader, I shouldn’t be using queues at all. I should be using Pipes!
So, tomorrow, I’ll rip out all of my queues, and start putting in pipes. (Except where I have multiple processes writing to a single pipe, of course.)
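As a hedged sketch of what the pipe version might look like for a single writer and a single reader: multiprocessing.Pipe(duplex=False) returns a receive end and a send end. The helper names and the None end marker below are assumptions for illustration only.

```python
import multiprocessing as mp


def send_all(conn, items):
    """Writer side: send each item, then an end marker, then close the connection."""
    for item in items:
        conn.send(item)
    conn.send(None)  # assumed "no more work" marker
    conn.close()


def receive_all(conn):
    """Reader side: process items until the end marker arrives."""
    results = []
    while True:
        item = conn.recv()
        if item is None:
            break
        results.append(item[0] * 2)  # stand-in for the real operation
    return results


def run_demo(n_items=100):
    """Hypothetical driver: a child process writes, the parent reads."""
    # With duplex=False the first connection can only receive
    # and the second can only send.
    recv_end, send_end = mp.Pipe(duplex=False)
    items = [[i] for i in range(n_items)]
    writer = mp.Process(target=send_all, args=(send_end, items))
    writer.start()
    send_end.close()  # the parent does not write, so drop its copy
    results = receive_all(recv_end)
    writer.join()
    return results


if __name__ == "__main__":
    print(len(run_demo()))
```

A Connection is not safe for several processes to write to at once without extra locking, which is presumably why the multi-writer spots are the exception.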
I don’t know where multiprocessing in python went wrong, but that was a severely disappointing moment this morning when I discovered this issue. For now, I’ll resist the urge to return to python 2.7.
(If anyone knows where I went wrong, please let me know – we all make mistakes, and I’m really hoping I’m wrong on this one.)