Pac Bio Sequel

This isn’t anything others haven’t heard about, I’m sure, but I just saw the announcement for the Pac Bio Sequel.

It’s a pretty looking machine, and its promise (according to the press release) is pretty awesome. Actually, I’ve always had a soft spot for Pac Bio, despite never having worked with Pac Bio data. It’s just that I so much want it to work. There’s just something appealing to me about tethered enzymes and single molecule sequencing.

Anyhow, I don’t have much commentary, though I’d love to hear if others do, about the Sequel.

Something they don’t tell you about PyMongo 3.0 and Multiprocessing.

EDIT: This post turned into a bug report over at the mongo python driver wiki, where it was confirmed to be a bug, and not a feature. Ultimately, the issue hasn’t been resolved yet, but version 3.0.4 will now throw a warning, preventing this issue from failing silently. Thanks to A. Jesse Jiryu Davis for suggesting I file it as a bug, and Anna Herlihy for the patch!

I had an interesting bug in a piece of software that I’ve been working on, that involves some heavy multiprocessing.  Running 18 processes simultaneously, at least 9 of which require some form of database interaction with MongoDB, is really not all that complicated… but I hit something that tossed in a wrench and confused me for 2 days.  What was it, you might ask?

Well, it looked like this:

 File "", line 177, in flush
File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/", line 582, in execute
  return self.__bulk.execute(write_concern)
File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/", line 430, in execute
  with client._socket_for_writes() as sock_info:
File "/usr/local/Cellar/python/2.7.10/Frameworks/Python.framework/Versions/2.7/lib/python2.7/", line 17, in __enter__
File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/", line 663, in _get_socket
  server = self._get_topology().select_server(selector)
File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/", line 121, in select_server
File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/", line 97, in select_servers
ServerSelectionTimeoutError: No servers found yet

Basically, the new pymongo drivers (3.0.x) have changed their initialization, so that they no longer actually create the connection pool when you initialize them.  You say:

mongo = MongoClient()

and they go off and do a non-blocking initialization of everything pymongo needs to connect to the server. All is good.

However, if you’re doing multiprocessing, the temptation is to allow each of your processes to launch a new instance of the MongoClient. Indeed, I’ve done that before with the 2.8.x series of pymongo, and it worked well. However, in this case, pymongo 3.0.2 REALLY doesn’t like it, and you’ll get the “No servers found yet” error when you try to retrieve results from your database. Oddly enough, it’s especially hard to figure out because pymongo has one more hidden surprise for you: serverSelectionTimeoutMS.

You probably have never heard of this parameter, but it’s kinda important, now. It goes in your initialization of the MongoClient:

self.mongo = MongoClient(mongo_url, mongo_port, serverSelectionTimeoutMS=500) 

If you don’t put it there, the default value is 30 seconds… Which means your application sits there, waiting to see if the mongo database will connect for 30 seconds, once it realizes that the database is missing. When it finally does fail, you’ll get the error above… 30 seconds after your database went down. That’s cool… except when the issue is actually not related to the database going down.

In my case, the issue was not that the database went down, but that each process should not be initializing a new instance of MongoClient! The only solution: have the parent process create one instance of MongoClient, and then pass that as a parameter to the processes. Tada! – the error disappears, and your program starts to run, instead of failing and waiting 30 seconds to tell you.

On the subject of indels..

Ah, a blog post. It’s been a while, as life has been busy lately. My daughter turned 3 last week, and I’ve moved halfway across the world and back, but I have slowly found myself with things to say again.

And, the one that needs saying first is that, as a community, NGS people have done a terrible job of standardizing how we deal with indels. SNVs aren’t bad – we only have half a dozen ways to mess them up – but indels are just something else.

After a year of working hard on SNVs, indels have fallen back on the menu, and I’ve been beating my head on the wall trying to solve it all in one shot. Needless to say, it’s not going to be that easy, but there are a few things that are really worth pointing out:

If you can represent something in the genome two different ways, you should pick the easiest, right? Wrong, there are people who don’t agree with this, and I can give you an example. Let’s say you have a reference sequence GAAAC, and you delete two As. Personally, I’d pick the left justified version and say GAA -> G. That’s pretty clear: you’ve removed two As after the G. Using the single redundant G makes it left justified, and anchored or rooted, and intuitively obvious. However, other people might disagree.

For instance, if you use a more old school style, that pre-dates Next-gen sequencing, you’d probably right justify it: AAC -> C… or take it one step further and drop the C, giving you AA -> -. Yes, that’s a dash. Between the left and right justification, there’s not much to say: it’s either one standard or the other. Right justification is used by a lot of databases, such as ClinVar, where many (most? all?) of the known deletions are pulled from clinical papers, which adopted that as the standard.
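To make the two conventions concrete, here is a minimal sketch that slides a deletion as far left and as far right as it will go within a repeat, then writes it out with a left or a right anchor base. It isn’t taken from any particular tool: the function names and the 0-based coordinates are just assumptions for illustration.

```python
def shift_deletion(ref, start, length):
    """Return the leftmost and rightmost 0-based starts of an equivalent deletion."""
    left = start
    # Sliding one base left is equivalent when the base entering the deleted
    # window matches the base leaving it.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # Same idea, sliding to the right.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right


def left_anchored(ref, start, length):
    """Left-justified form, anchored on the base before the deletion."""
    left, _ = shift_deletion(ref, start, length)
    if left == 0:
        raise ValueError("no base to the left to anchor on")
    return ref[left - 1:left + length], ref[left - 1]


def right_anchored(ref, start, length):
    """Right-justified form, anchored on the base after the deletion."""
    _, right = shift_deletion(ref, start, length)
    if right + length >= len(ref):
        raise ValueError("no base to the right to anchor on")
    return ref[right:right + length + 1], ref[right + length]


print(left_anchored("GAAAC", 1, 2))   # ('GAA', 'G')
print(right_anchored("GAAAC", 1, 2))  # ('AAC', 'C')
```

Both calls describe the same deletion of two As; only the justification and the anchor base differ.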

However, that’s far from the worst you can do.  You can also add one step to the confusion and pad your variant.  For instance, you could also represent the deletion of the two As with GAAAC->GAC.  Now, you’ll see it’s anchored on the left and the right, which is not necessarily a bad thing, but it is redundant.  You don’t need both for an unambiguous representation of the indel.  This is a non-reduced representation of the variant.  You can make them more confusing, if you try, though.  There are no bounds to the padding you can add.  Want a simple SNV to look more complicated?  How about: ACGTACTCGGCTAG->AGGTACTCGGCTAG. I would probably just shift the position over by one to the right and call it a C->G variant, and drop the padding.
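Stripping the padding back off is mechanical. The sketch below is one way to do it, assuming a variant is just a (position, ref, alt) triple with a 1-based position; the function name is made up for the example. It only trims shared bases, so it is not a full normalizer (that would also need to left-align against the reference sequence).

```python
def reduce_variant(pos, ref, alt):
    """Trim shared padding from a (pos, ref, alt) variant.

    Shared trailing bases are trimmed first, then shared leading bases, always
    leaving at least one base in each allele so indels keep a left anchor.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # each leading base removed shifts the variant right by one
    return pos, ref, alt


print(reduce_variant(1, "GAAAC", "GAC"))  # (1, 'GAA', 'G')
print(reduce_variant(1, "ACGTACTCGGCTAG", "AGGTACTCGGCTAG"))  # (2, 'C', 'G')
```

Trimming the right side first is what leaves the deletion anchored on the left G, and the padded SNV comes out as the C->G change one position to the right.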

Why do people not use the reduced representation, though?  Because it’s more convenient for them.   Here’s an example I got from ExAC:  GAAA -> G,GA,GAA.  See what they’ve done there?  It’s actually three variants at the same position that I would represent with three different reference sequences, but by padding the variants, they can place them all on one line.  GA->G, GAA->G and GAAA->G.  If you don’t know that they’ve done this, it’s a bit surprising.  Indeed, I had to write to them to ask about it, because it wasn’t intuitively obvious to me why they show reduced variants on their web page, but distribute a VCF file with non-reduced variants.  There is a blog post about how to reduce variants, but as of last week, it wasn’t referenced in the readme files of their FTP site.
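Going from the one-line form back to the three reduced variants might look something like the sketch below. This is my own illustration, not ExAC’s code or the method from their blog post; the function name and the position of 100 are invented for the example.

```python
def split_and_reduce(pos, ref, alts):
    """Split a multi-allelic record into one reduced variant per alternate allele."""
    reduced = []
    for alt in alts.split(","):
        r, a, p = ref, alt, pos
        # Drop shared trailing bases, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Drop shared leading bases, moving the position along with them.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        reduced.append((p, r, a))
    return reduced


for variant in split_and_reduce(100, "GAAA", "G,GA,GAA"):
    print(variant)
# (100, 'GAAA', 'G')
# (100, 'GAA', 'G')
# (100, 'GA', 'G')
```

Each alternate allele ends up with its own reference sequence, which is exactly why they can’t share a line once they’re reduced.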

Regardless, ExAC isn’t the only one to use non-reduced representations – dbSNP does it as well, and I haven’t even begun to look at the myriad of other data sources we depend on for indel interpretation. It was rightly pointed out to me that non-reduced representations are not forbidden in the VCF 4.2 standard.  It’s definitely not forbidden, but then again, as a community, taking the position that anything not forbidden is allowed is a dangerous path for those who would like to see a unified standard.  We’re just not going to converge on the same page, if we keep stuff like this going.

Alas, Indels are a difficult minefield.  They are hard to call, hard to represent and hard to interpret.  We have a long path ahead of us to straighten it all out, but I don’t doubt we’ll get there.  This is just one more step we’ll have to take, in order to make sure we start getting these things right.


AMA for fun.

I’ve been asked by a few people to do an AMA, since I seem to be one of the few PhD-level Bioinformaticians working in industry who are active on the Reddit bioinformatics forum.  There are probably a lot of others, but I suspect that the bulk of people there are mostly graduate students or academics.

Anyhow, if anyone is interested in such silliness, here’s the link.

Of course, I’m going to feel pretty silly about the whole thing if no one asks any questions…

The glamour of Pipeline bioinformatics

I’m going to have to eat a bit of humble pie.  When I was a grad student, I may have just slightly looked down on “pipeline bioinformatics”, thinking it was a subject that was boring.  It clearly wasn’t as glamorous as designing new algorithms or plucking hidden bits of information out of giant data sets… I may have even thought it was something you just did as an afterthought.

I was wrong.

I have to admit, now that I’ve had a taste of it, I’m enjoying it for exactly the opposite reasons:  It’s a fascinating game of balancing everything you know about computers and biology all at the same time, while making sure you get the right answer consistently.  It’s a cross between doing jigsaw puzzles and playing Jeopardy…  and I’m kinda liking it.

In order to build a good pipeline, you need infrastructure that glues all the parts together, you need planning to make sure that it has room for growth, and you need to know what constraints the pipeline will face…  And, you need to be able to understand how everything, from the bits of data you’re pushing through it to the hardware on all of the machines and wires it’s going to run on, will interact.  That’s no small feat – but it’s an exhilarating challenge.

While I may have thought algorithm design was the cat’s pyjamas, building a pipeline is the same resource management challenge scaled up to include a whole lot more moving parts.  And, to those who manage all of those working parts, I finally grok what it is that drives you – and I am only working on a pipeline that was assembled by others, not even one of my own creation – which just increases my respect for those who have built pipelines out of nothing:

The thrill of watching data cascade through the waterfall that is the pipeline.

The excitement of having each individual piece operating in harmony, squeezing out that last bit of performance.

The fun of adding in three more pieces you thought would never fit, but making it work.

The satisfaction of knowing you managed to tame the mangy electrons that seemed so unruly before they entered your pipeline.

The reward of having someone look at the data afterwards, and learning something new from it.

Yes, pipeline bioinformaticians, I owe you an apology: your product is a magnificent work of art in its own right – and it is only truly completed when people are able to forget that it’s there. Cheers to you!


How do you become a bioinformatician?

I’ve been following the bioinformatics sub-reddit for the past couple of months, ever since I stumbled upon it when a colleague asked me about bioinformatics resources on the web.  It’s a fascinating place to visit, but it’s incredibly repetitive in that people keep asking “How do I become a bioinformatician?”

Unfortunately there is not a single answer, because bioinformatics isn’t a single job – it’s a collection of people who have found a way to live with one foot in each of two worlds: computer programming and biology.  Getting a firm footing in each can be a serious challenge, as people spend years studying just one of those to become proficient at it.

However, I think there are some common threads that tie the field together.  You need to invest the time in at least a handful of basic fields: some basic programming, some elementary cell biology and at least a simple understanding of math or statistics.  What you can accomplish with just that little can be incredibly productive, mostly in terms of automation of data processing or modelling of your results.

On the other hand, bioinformatics also includes a lot of sub-disciplines.  Great programmers can build incredible pipelines.  Great mathematicians can invent or apply algorithms to create new ways of interpreting data, and great biologists can develop heuristics and re-interpret data in new ways to generate insights that others have overlooked.  There’s even room for “neat freaks” in organizing and imposing order on unruly data.

The challenge of becoming a bioinformatician is learning where your strengths and weaknesses lie, and using them to your advantage.  Finding a research group that shores up your weaknesses – or helps you fill them in – can be a great boost to your career.  After my master’s degree, I felt I had two big gaping holes in my resume: big data and databases, which I made the focus of my PhD research. Coming out of my defence, I felt I was able to bring a more balanced approach to the table – and had simultaneously purged any instinct I might have ever had to reach for a spreadsheet to interpret information. (Spreadsheets and big data don’t mix.)

So, where does that lead an aspiring bioinformatician?  Unless you take the time to do both a computer science degree and a biology degree, you probably won’t be able to shoehorn everything in to become an expert in both, and not everyone wants to get their PhD to fill in the gaps left in an undergrad education.

With that said, let me lay down a few useful points:

  1. Pick and choose to study subjects that interest you because you’ll at least end up with strengths in things you enjoy, which leads to jobs doing things you enjoy.
  2. You can always learn something new later… but take opportunities to try new things when they come.
  3. Remember that you’re not going to be the expert in every field you put your foot into – so look for opportunities to collaborate with the people who are.  (If you’re going into bioinformatics and expect to do everything yourself, you’re probably doing it wrong.)
  4. Don’t be afraid of the fact that you don’t know stuff.  Your job isn’t to be the best biologist and best computer scientist at the same time – it’s to be the bridge between.  The stronger your foundations, the better a bridge you can be, but unlike a concrete bridge, you can always invest in learning more.
  5. Yes, higher education does help in this field.  Bioinformatics is still dominated by research based organizations, and the academic hierarchy saturates the mindset of bioinformaticians everywhere.  (Or, almost everywhere.)
  6. Bioinformatics is also about the “soft” skills.  Don’t forget that bioinformaticians are also in a good place to be good leaders – since you’ll be one of the few people who can speak both languages, and tie together groups that would otherwise lack a common language.
  7. Don’t believe the hype about what you should learn:  R isn’t really the only language for doing bioinformatics.  Perl isn’t always evil (just most of the time, though it did save the human genome…), Java isn’t the slowest language out there, and C isn’t only for hardcore programmers. (Python, though, is a pretty good all-around language.)  Everyone has an opinion on where bioinformatics is going – but it’s just an opinion, so make your own choices.

At the end of the day, I always give students the same piece of advice:  As you go through life, you will learn new skills that you can apply as you see fit.  Each of these skills will be a tool in your toolbox that you can turn to when you hit a problem.  If you only have a hammer in your toolbox, your repertoire is pretty limited.  On the other hand, if you collect a fantastic assembly of tools, you’ll be equipped to handle just about anything that comes your way.  Your job is to invest your time into building the best toolkit you can, so that when you get out of school, you’ll be ready to solve as many problems as you can.

Bioinformatics is just a special case of toolbox building, in that you need the tools of at least two disciplines in your toolbox.  What you choose to put into your toolbox is entirely up to you, but (to stretch the toolbox analogy just a little too far), take a few minutes to ask if you’d like to be a plumber or a carpenter before you start collecting your tools. Or, without the metaphoric toolkit, ask yourself what kind of bioinformatician you want to be.

Once you know the answer to that question, you’ll figure out pretty quickly which tools you want to start collecting.  And the path towards becoming a bioinformatician will start to become clear.  It may not take you where you expect, but I can guarantee that you’ll be walking down an interesting road.

American Hospitals

This is probably not an informative post for most people who’ve visited my blog, but I thought I’d share a perspective.

Last week, I signed up for a health care plan, and discovered that the plan to which I’d signed up was offering free flu shots.  Not being one to pass up on an offer like that, I traipsed down to the local hospital’s paediatric division, to get my daughter ready for the flu season, with a scheduled stop at the adult clinic just down the street on the way home.

Upon arrival, it turned out that the whole family could get our shots at once, saving us a trip across the park to the adult shot clinic – a nice bonus for us.  Anyhow, once the forms were filled out, and the (now expected) confusion about the existence of people without social security numbers was sorted out, the deed was done. (And, I might add that the woman who did it was exceptional – I barely noticed the shot, and my 2 year old daughter looked at the woman and said “Ow…” before promptly forgetting all about it and enjoying the quickly offered princess sticker.  “Princess Sticker!!!”)

In any case, the real story is what happened after – although it was as much a non-event as the actual shot.  We walked back home, taking a short cut through one of the hospital’s other buildings.  It was new, it was shiny and it was pimped out.  It looked like the set of Grey’s Anatomy or the set of a Hollywood-sponsored action movie that will shortly be blown into a million pieces by several action heroes.  I half expected the counters to glint and glitter like a cleaning product commercial.

But, it was also, in a way, surreal.  That hospital doesn’t exist to cure people, or to serve as a place of healing – or even to do research.  Unlike a Canadian hospital, which is the bulk of my experience with hospitals (although I did visit Danish hospitals disproportionately more than you might think for the length of time I was there), the whole building, its contents and its staff are all there to turn a profit.

It’s not a tangible difference, but it makes you think about the built in drug stores and cafeterias and posters advertising drugs in a slightly different light.

Why are they promoting that drug?  Would that security guard kick me out if he knew I didn’t have my ID card yet?  Is that doctor running down the hall just trying to cram in as many patients as possible?

It’s strange, because superficially, the hospital isn’t any different than a Canadian hospital (other than being newer than any I’ve ever visited, and the ever present posters advertising drugs, of course), and yet its function is different.  It’s roughly the difference between visiting a community centre and a country club.  In any other country in the western world, a hospital is open to all members of the community, whereas the hospitals here require a membership.  It’s just hard not to see it through the Canadian lens, which tells us it’s one of those things Americans “just can’t seem to get right.” Well, that’s the Canadian narrative – whether it’s right or wrong.

Anyhow, a hospital is a hospital: the net product of the hospital is keeping people healthy.  Whether it’s for profit or government run, it does the same things and works the same way.

At the end of the day, I can’t say anything other than that the experience was pleasant, and this is the first year that I’ve gotten a flu shot and didn’t get sick immediately afterwards.  So really, all in all, I guess you get what you pay for…  It’s just a new experience to see such a direct connection between the money and the services.

I just have to wonder how Americans see Canadian hospitals. (-:

Ikea furniture and bioinformatics.

I’ll just come out and say it:  I love building Ikea furniture.  I know that sounds strange, but it truly amuses me and makes me happy.  I could probably do it every day for a year and be content.

I realized, while putting together a beautiful wooden FÖRHÖJA kitchen cart, that there is a good reason for it: because it’s the exact opposite of everything I do in my work.  Don’t get me wrong – I love my work, but sometimes you just need to step away from what you do and switch things up.

When you build Ikea furniture, you know exactly what the end result will be.  You know what it will look like, you’ve seen an example in the showroom and you know all of the pieces that will go into putting it together.  Beyond that, you know that all the pieces you need will be in the box, and you know that someone, probably in Sweden, has taken the time to make sure that all of the pieces fit together and that it is not only possible to build whatever it is you’re assembling, but that you probably won’t damage your knuckles putting it together because something just isn’t quite aligned correctly.

Bioinformatics is nearly always the opposite.  You don’t know what the end result will be, you probably will hit at least three things no one else has ever tried, and you may or may not achieve a result that resembles what you expected.  Research and development are often fraught with traps that can snare even the best scientists.

But getting back to my epiphany, I realized that now and then, it’s really nice to know what the outcome of a project should be, and that you will be successful at it, before you start it.  Sometimes it’s just comforting to know that everything will fit together, right out of the box.

I’m looking forward to putting together a dresser tomorrow.

Biking in Oakland

I am slowly feeling like I have more to write, but today, I have a rant.  What the heck is up with Oakland cyclists?

Perhaps I’m just used to Vancouver cyclists, but I’ve never seen a group of people with less regard for the rules of the road.  I always thought drivers who complain about cyclists were just being whiny jerks…  but after 1 week of biking in Oakland, I’m starting to complain about cyclists.

Oddly enough, I think I’m the only cyclist who actually waits for lights to turn green before crossing intersections.  (Though I did go through a red light the other day, when I misunderstood a signal… mea culpa.)  I’m definitely the only cyclist in the city that signals before turning or changing lanes – and probably the only cyclist that isn’t aggressively swerving in and out of traffic!  (I think I’ve seen one other person, but she was going really slow and holding up a line of cars instead.)

Clearly, the insanity is not constrained to cyclists, however.  My wife was yelled at, inexplicably, for coming to a complete stop at a 4-way stop, because apparently that meant she was giving up her right of way.  It was hard to tell, really, as the woman who was yelling tried to cut us off, scream out her window at us, and ignore her own stop sign at the same time.  Oakland is clearly a culture in flux, and I have much to learn, yet, about how to stay safe on the road!

A bit of blogging

I’m more or less sure everyone has forgotten this blog by now… but that’s not a bad thing, really.   I don’t think I had much to say, and life has had a way of keeping me busy. Papers, work, changing work, changing diapers, all of it somehow keeps you from getting a lot of sleep, and that keeps me from having the motivation to write much.

However, I thought I’d start jotting down a few things that are interesting, as I come across them.  One that I’ve recently discovered is that reddit has a bioinformatics subreddit, which has been inspiring me to start writing again.

The other, is that I’ve learned a LOT about mongodb recently, which I would like to start writing about.  Mostly under the “lessons learned” category, because scale up on software is just like scale up in the lab – it doesn’t just work.  Scaling things is tough.

Otherwise, I have a move to Oakland coming up, and there will probably be a few Goodbye Vancouver/Hello Oakland posts as well.  Somehow, I think the urge to write is coming back, and I haven’t had that spark since Denmark ripped it out of me.  Perhaps that’s just a bit of optimism coming back.  I wouldn’t object to that.