Check your assumptions

I just went through one of those random Linux trial-by-fire exercises.  I had two web servers, one cloned from the other, behaving differently: One would send emails, the other wouldn’t.

I walked through the tree of all possible culprits: user-installed software, system software, system configuration, right down to logs and individual files…  and then realized the most obvious source of the difference: the processes that handle mail had died on one server, but not on the other.

Yes, two hours of debugging mail handling on a Linux machine, only to discover that I could have figured it out with “ps aux | grep mail” in about 10 seconds, had I known what to look for.

Well, that’s the nature of troubleshooting – the answer is always obvious after you find it.

On the bright side, it means my lab has a shiny new blog to play with – and it seems like everything is working now.


Python “threading”

Just a quick rant about threading.  After working with Java for so long, I’d gotten used to the idea that a thread is an independent entity, which can work for you without slowing down the main body of your program.  In Python, that’s really not the case.

In Python, a thread shares memory with the main “thread” of your code, and the interpreter’s Global Interpreter Lock (GIL) prevents more than one thread from executing Python code at a time – so a thread can’t run independently on another CPU.  In fact, with Python threads, you’re stuck with every line of code running on the same core, with the limitation that only one line can be processed at a time – meaning that all your threads take turns passing lines of code to the CPU to be run. (Or, as a mental image, that’s how it’s working; the reality is a bit more subtle.)
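As a minimal sketch of what “sharing memory” means here (names are mine, just for illustration): a worker thread can write straight into a list the main thread owns, with no copying, because everything lives in one interpreter under the GIL.

```python
import threading

shared = []  # the worker thread and the main thread see the same list object

def producer():
    for i in range(5):
        shared.append(i)  # still one bytecode at a time, under the GIL

t = threading.Thread(target=producer)
t.start()
t.join()  # wait for the worker before reading

print(shared)  # -> [0, 1, 2, 3, 4]
```

Convenient for sharing state, but it’s exactly this single-interpreter design that keeps the threads from running on separate cores.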

Unfortunately, that means that Python threads don’t really speed up your code, and if there are enough of them, they can slow it down significantly.

The solution turned out to be Python’s “multiprocessing” module, which lets you spawn processes instead of threads.  Each process can run on a different CPU (if you have enough cores…), but doesn’t share any memory with the main process of your code.  Thus, you have to work out a system of thread-safe (process-safe?) queues, where each process can dump information into a buffer, and other processes can pick it up and work on it independently.  The workers consume the information in parallel, giving you a speedup in the running time (wall time) of your code.

All in all, it’s actually a relatively elegant system, not much different from many other languages, with the exception of the terminology – processes vs threads.  Python got it right, but it took me a while to figure out that I was using the terms incorrectly.

At any rate, without any optimization, multiprocessing with 30 processes brings the wall time of my code down from about 3 hours to about 15 minutes.  It’s almost time to start looking into C code optimization…  Almost. (-;

Link to a Ted talk – reversing desertification

I enjoy listening to TED talks while I work – although I usually end up tuning out for at least half of them while I focus on coding – but this one kept my attention throughout.  I think I got 10 lines of code written in 20 minutes, though they were 10 good lines…

Anyhow, reversing desertification is always an interesting topic, and it’s a great counterpoint to the usual doom and gloom of climate change articles you see in the press.  Either way, if you can find 22 minutes, I highly suggest watching this talk.

Allan Savory – How to green the world and reverse climate change


Illumina 450k methylation array re-annotation

The Kobor lab, where I’m currently working, does a lot of epigenetics and has been working with the Illumina 450K Human Methylation array.  In some ways, it’s a stepping stone towards a next-gen platform, and in other ways, it’s still a perfectly valid platform all on its own.  I’m not really an array person (yet?), but I can see the advantages when it comes to bisulfite sequencing – mapping with bisulfite-treated DNA is a bit of a dark art. (I know it can be done, but it’s not ideal.)

Anyhow, they’ve undertaken an interesting project, with a few collaborators, to re-annotate the 450K methylation array, identifying probes that give erroneous results due to the presence of SNPs or cross-hybridization.  They show it in action with a couple of data sets as well, so it’s not just a theory paper.  All in all, it’s a quick but interesting must-read for those interested in using the 450K human methylation array.

You can find it at