Pac Bio Sequel

This isn’t anything others haven’t heard about, I’m sure, but I just saw the announcement for the Pac Bio Sequel.

It’s a pretty-looking machine, and its promise (according to the press release) is pretty awesome. Actually, I’ve always had a soft spot for Pac Bio, despite never having worked with Pac Bio data. It’s just that I so much want it to work. There’s just something appealing to me about tethered enzymes and single molecule sequencing.

Anyhow, I don’t have much commentary, though I’d love to hear if others do, about the Sequel.

http://blog.pacificbiosciences.com/2015/09/introducing-sequel-system-scalable.html

Something they don’t tell you about PyMongo 3.0 and Multiprocessing.

EDIT: This post turned into a bug report over at the mongo python driver wiki, where it was confirmed to be a bug, and not a feature. Ultimately, the issue hasn’t been resolved yet, but version 3.0.4 will now throw a warning, preventing this issue from failing silently. Thanks to A. Jesse Jiryu Davis for suggesting I file it as a bug, and Anna Herlihy for the patch!

I had an interesting bug in a piece of software that I’ve been working on, which involves some heavy multiprocessing. Running 18 processes simultaneously, at least 9 of which require some form of database interaction with MongoDB, is really not all that complicated… but I hit something that threw a wrench in the works and confused me for 2 days. What was it, you might ask?

Well, it looked like this:

  File "something.py", line 177, in flush
    b.execute()
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/bulk.py", line 582, in execute
    return self.__bulk.execute(write_concern)
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/bulk.py", line 430, in execute
    with client._socket_for_writes() as sock_info:
  File "/usr/local/Cellar/python/2.7.10/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/mongo_client.py", line 663, in _get_socket
    server = self._get_topology().select_server(selector)
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/topology.py", line 121, in select_server
    address))
  File "/Users/afejes/sandboxes/pipeline4/lib/python2.7/site-packages/pymongo/topology.py", line 97, in select_servers
    self._error_message(selector))
ServerSelectionTimeoutError: No servers found yet

Basically, the new pymongo drivers (3.0.x) have changed their initialization, so that they no longer actually create the connection pool when you initialize them.  You say:

mongo = MongoClient()

and it goes off and does a non-blocking initialization of everything pymongo needs to connect to the server. All is good.

However, if you’re doing multiprocessing, the temptation is to allow each of your processes to launch a new instance of the MongoClient. Indeed, I’ve done that before with the 2.8.x series of pymongo, and it worked well. However, in this case, pymongo 3.0.2 REALLY doesn’t like it, and you’ll get the “No servers found yet” error when you try to retrieve results from your database. Oddly enough, it’s especially hard to figure out because pymongo has one more hidden surprise for you: serverSelectionTimeoutMS.

You have probably never heard of this parameter, but it’s kinda important now. It goes in your call to the MongoClient constructor:

self.mongo = MongoClient(mongo_url, mongo_port, serverSelectionTimeoutMS=500) 

If you don’t put it there, the default value is 30 seconds… Which means your application sits there, waiting to see if the mongo database will connect for 30 seconds, once it realizes that the database is missing. When it finally does fail, you’ll get the error above… 30 seconds after your database went down. That’s cool… except when the issue is actually not related to the database going down.
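For what it’s worth, here is a minimal sketch of how that timeout plays out in practice. The helper name check_mongo, the localhost/27017 defaults and the 500 ms value are my own assumptions for illustration, and it assumes pymongo 3.0 or later is installed. Because the client connects lazily, the sketch forces a round trip with a ping so that a missing server is reported after half a second rather than thirty.

```python
def check_mongo(host="localhost", port=27017, timeout_ms=500):
    """Return True if a MongoDB server answers within timeout_ms, else False.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Imported inside the function so that merely loading this file
    # does not require pymongo to be installed.
    from pymongo import MongoClient
    from pymongo.errors import ServerSelectionTimeoutError

    # Constructing the client does not block; nothing is verified yet.
    client = MongoClient(host, port, serverSelectionTimeoutMS=timeout_ms)
    try:
        # The first real operation triggers server selection, and this is
        # where the timeout applies.
        client.admin.command("ping")
        return True
    except ServerSelectionTimeoutError:
        return False
    finally:
        client.close()
```

Calling check_mongo() once at startup, before any worker processes are launched, at least separates “the database is down” from the multiprocessing problem described here.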

In my case, the issue was not that the database went down, but that each process should not be initializing a new instance of MongoClient! The only solution: have the parent process create one instance of MongoClient, and then pass that as a parameter to the child processes. Tada! – the error disappears, and your program starts to run, instead of failing and waiting 30 seconds to tell you.

On the subject of indels..

Ah, a blog post. It’s been a while, as life has been busy lately. My daughter turned 3 last week, and I’ve moved half way across the world and back, but I have slowly found myself with things to say again.

And, the one that needs saying first is that, as a community, NGS people have done a terrible job of standardizing how we deal with indels. SNVs aren’t bad – we only have half a dozen ways to mess them up – but indels are just something else.

After a year of working hard on SNVs, indels have fallen back on the menu, and I’ve been beating my head on the wall trying to solve it all in one shot. Needless to say, it’s not going to be that easy, but there are a few things that are really worth pointing out:

If you can represent something in the genome two different ways, you should pick the easiest, right? Wrong: there are people who don’t agree with this, and I can give you an example. Let’s say you have a reference sequence GAAAC, and you delete two As. Personally, I’d pick the left-justified version and say GAA -> G. That’s pretty clear: you’ve removed two As after the G. Using the single redundant G makes it left-justified, anchored (or rooted), and intuitively obvious. However, other people might disagree.

For instance, if you use a more old-school style that pre-dates next-gen sequencing, you’d probably right-justify it: AAC -> C… or take it one step further and drop the C, giving you AA -> -. Yes, that’s a dash. Between the left and right justification, there’s not much to say: it’s either one standard or the other. Right justification is used by a lot of databases, such as ClinVar, where many (most? all?) of the known deletions are pulled from clinical papers, which adopted that as the standard.
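To make the equivalence concrete, here is a small sketch that applies each of these representations to the reference GAAAC and checks that they all describe the same deletion. The function apply_variant and the 1-based positions are my own assumptions for illustration, and the dash is modelled as an empty string.

```python
def apply_variant(reference, pos, ref_allele, alt_allele):
    """Replace ref_allele with alt_allele at a 1-based position."""
    start = pos - 1
    end = start + len(ref_allele)
    # Refuse to apply a variant whose reference allele does not match.
    if reference[start:end] != ref_allele:
        raise ValueError("reference allele does not match at position %d" % pos)
    return reference[:start] + alt_allele + reference[end:]


reference = "GAAAC"

# Left-justified and anchored on the leading G.
left_justified = apply_variant(reference, 1, "GAA", "G")
# Right-justified and anchored on the trailing C.
right_justified = apply_variant(reference, 3, "AAC", "C")
# No anchor at all: AA -> - (the dash is an empty string here).
unanchored = apply_variant(reference, 2, "AA", "")

print(left_justified, right_justified, unanchored)  # GAC GAC GAC
```

Three different records, one resulting sequence, which is exactly why comparing indels across data sources by simple string matching goes wrong.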

However, that’s far from the worst you can do.  You can also add one step to the confusion and pad your variant.  For instance, you could also represent the deletion of the two As with GAAAC->GAC.  Now, you’ll see it’s anchored on the left and the right, which is not necessarily a bad thing, but it is redundant.  You don’t need both for an unambiguous representation of the indel.  This is a non-reduced representation of the variant.  You can make them more confusing, if you try, though.  There are no bounds to the padding you can add.  Want a simple SNV to look more complicated?  How about: ACGTACTCGGCTAG->AGGTACTCGGCTAG. I would probably just shift the position over by one to the right and call it a C->G variant, and drop the padding.
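As a rough sketch of what “reducing” a padded variant could look like, the function below strips bases shared at the right end and then at the left end, always leaving at least one base in each allele. The name trim_padding and the 1-based position of 1 are assumptions for illustration; this only trims padding and does not attempt to shift an indel within a repeat.

```python
def trim_padding(pos, ref, alt):
    """Remove redundant padding from a variant, keeping one base per allele.

    Returns the (possibly shifted) 1-based position and the trimmed alleles.
    """
    # Drop bases shared at the right-hand end first, so that an indel
    # keeps its left anchor base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop bases shared at the left-hand end, moving the position along.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(trim_padding(1, "GAAAC", "GAC"))
# (1, 'GAA', 'G')
print(trim_padding(1, "ACGTACTCGGCTAG", "AGGTACTCGGCTAG"))
# (2, 'C', 'G')
```

The padded deletion comes back as the left-anchored GAA -> G, and the dressed-up SNV comes back as C -> G, one position to the right.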

Why do people not use the reduced representation, though? Because padding is more convenient for them. Here’s an example I got from ExAC: GAAA -> G,GA,GAA. See what they’ve done there? It’s actually three variants at the same position that I would represent with three different reference sequences, but by padding the variants, they can place them all on one line. I’d write them as GA->G, GAA->G and GAAA->G. If you don’t know that they’ve done this, it’s a bit surprising. Indeed, I had to write to them to ask about it, because it wasn’t intuitively obvious to me why they show reduced variants on their web page, but distribute a VCF file with non-reduced variants. There is a blog post about how to reduce variants, but as of last week, it wasn’t referenced in the readme files of their FTP site.
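Here is a minimal sketch of undoing that kind of one-line record. It is not ExAC’s own procedure. The function name split_multiallelic and the position 100 are hypothetical, and the sketch only trims shared bases from the right-hand end, which is enough when all the alleles share a left anchor, as they do here.

```python
def split_multiallelic(pos, ref, alts):
    """Split one multi-allelic record into one reduced record per alt allele."""
    records = []
    for alt in alts:
        r, a = ref, alt
        # Strip bases shared at the right-hand end, keeping the anchor base.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        records.append((pos, r, a))
    return records


# Position 100 is made up; only the alleles come from the example above.
for record in split_multiallelic(100, "GAAA", ["G", "GA", "GAA"]):
    print(record)
# (100, 'GAAA', 'G')
# (100, 'GAA', 'G')
# (100, 'GA', 'G')
```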

Regardless, ExAC isn’t the only one to use non-reduced representations – dbSNP does it as well, and I haven’t even begun to look at the myriad of other data sources we depend on for indel interpretation. It was rightly pointed out to me that non-reduced representations are not forbidden in the VCF 4.2 standard. They’re definitely not forbidden, but then again, as a community, taking the position that anything not forbidden is allowed is a dangerous path for those who would like to see a unified standard. We’re just not going to end up on the same page if we keep stuff like this going.

Alas, Indels are a difficult minefield.  They are hard to call, hard to represent and hard to interpret.  We have a long path ahead of us to straighten it all out, but I don’t doubt we’ll get there.  This is just one more step we’ll have to take, in order to make sure we start getting these things right.

 

AMA for fun.

I’ve been asked by a few people to do an AMA, since I seem to be one of the few PhD-level Bioinformaticians working in industry who are active on the Reddit bioinformatics forum.  There are probably a lot of others, but I suspect that the bulk of people there are mostly graduate students or academics.

Anyhow, if anyone is interested in such silliness, here’s the link.

Of course, I’m going to feel pretty silly about the whole thing if no one asks any questions…

Frontiers in Science Latex missing packages.

I’m working on a manuscript to be sent in to a Frontiers journal, and discovered a few missing dependencies for LaTeX, so I figured I’d share them here.

If you find you’re missing chngpage.sty, install texlive-latex-extra

If you find you’re missing lineno.sty, install texlive-humanities

On a Mac, that’s:

sudo port -v install texlive-humanities texlive-latex-extra

Happy compiling.

What’s the point of doing a PhD? (reply to Kathy Weston)

I wanted to comment on a blog post called “What’s the point of doing a PhD” on Blue Skies and Bench Space, by Kathy Weston.  Right off the top, I want to admit that I’ve not followed the blog in the past, and my comments aren’t to say that Kathy doesn’t have a point, but that what she’s proposing is only a partial solution – or rather, it feels to me like it’s only half the picture.

Warning, I’ve not edited this yet – it’s probably pretty edgy.

I believe Kathy is responding to a report (Alberts et al) that proposes such things as cutting the numbers of postdocs, creating more staff scientist positions and making sure that non-academic PhD options are seen as successful careers. Those are the usual talking points of academics on this subject, so I don’t see much new there. Personally, I suspect that the systemic failures of the research community are a small part of a far broader culture war in which research is seen as an “optional” part of the economy rather than a driver, which results in endless budget cuts, leading to our current system, rather than an issue on its own. However, that’s another post for another time.

Originally, I’d written out a point-by-point rebuttal of the whole thing, but I realized I can sum it up in one nice little package:  Please read all of Kathy’s criteria for who should get a PhD.  Maybe read it twice, then think about what she’s selecting for… could it possibly be academics?

Advice to undergrads could be summarized as: prepare for a career of 80 hour workweeks (aka, the academic lifestyle) and if you don’t know for sure why you’re getting your PhD (aka, to become an academic), don’t do it!   Frankly,  there are lots of reasons to get a PhD that don’t involve becoming an academic.  There’s nothing wrong with that path, but a PhD leads to many MANY 9-5 jobs, if that’s what you want, or sales jobs, or research jobs, or entrepreneurial jobs.  Heck, my entire life story could be summed up as “I don’t know why I need to know that, but it’s cool, so I’ll go learn it!”, which is probably why I was so upset with Kathy’s article in the first place.

Let’s summarize the advice to PhD students: If you don’t know why you need to learn a specific skill, don’t do a postdoc! I’m going to gloss over the rest of that section – I really don’t think I need a committee of external adjudicators to tell me if they think my dreams are firmly grounded, and if your dream isn’t to be an academic, why should you walk away from a postdoc?

Advice to postdocs: “This is your last chance to become an academic, so think hard about it!” Meh – Academics R Us. (Repeat ad nauseam for N-plex postdoc positions.)

Everything else is just a rehash of the tenure track, with some insults thrown in:  “Don’t hire mediocre people” is just the salt in the wound.  No one wants to hire mediocre people, but people who are brilliant at one thing are often horribly bad at another.  Maybe the job is a bad fit.  Maybe the environment is a bad fit.  Is there a ruler by which we can judge another person’s mediocrity?  Perhaps Kathy’s post is mediocre, in my opinion, but there are likely thousands of people who think it’s great – should I tell others not to hire Kathy?  NO!

I think this whole discussion needs to be re-framed into something more constructive. We can’t keep the mediocre people out of science – and we shouldn’t even try. We shouldn’t tell people they can’t get a PhD, or discourage them. What we should be doing is three-fold:

First, we should take a long hard look at the academic system and ask ourselves why we allow Investigators to exploit young students in the name of research.  Budget cuts aren’t going to come to a sudden halt, and exploitation is only going to get worse as long as we continue to have a reward-based system that requires more papers with less money.  It can’t be done indefinitely.

Second, we should start giving students the tools to ask the right questions from early on in their careers. I’d highlight organizations like UBC’s Student Biotechnology Network, which exist with this goal as their main function. Educate students to be aware of the fact that >90% of the jobs that will exist, once they’re done their degrees, will be non-academic. A dose of accurate statistics never hurts your odds in preparing for the future.

Finally, we can also stop this whole nonsense that academia is the goal of the academic process! Seriously, people. Not everyone wants to be a prof, so we should stop up-selling it. Tenure is not a golden apple, or the pot at the end of the rainbow. It’s a career, and we don’t all need to idolize it. Just like we’re not all going to be CEOs (and wouldn’t all want to be), we’re not all going to be professors emeritus!

If you’re an Investigator, and you want to do your students a favour, help organize events where they get to see the other (fantastic!) career options out there.  Help make the contacts that will help them find jobs.  Help your students ask the right questions… and then ask them yourself, why did you hire 8 post-docs?  Is it because they are cheap trained labour, or are you actually invested in their careers too?

Let’s not kid ourselves – part of it is the system. The other part of it is people who are exploiting the system.

A few open bioinformatics positions.

Occasionally, I get emailed information about open positions at bioinformatics companies, so I thought I’d pass along a couple today.

First and foremost, if anyone is interested, the company I’m looking forward to starting with next week is hiring, so I’ll pass along that link:  http://www.omicia.com/jobs/ There are software engineer, bioinformatics and data scientist positions available, so I suggest checking them out.

Second, for those who are a little further in their career, I understand that Caprion is looking for a director of bioinformatics as well as a biostatistician (http://www.caprion.com/en/caprion/career.php).  It’s a little far outside my field, given that it’s mostly proteomics work, but I’ve heard good things about Caprion, and they’re in Montreal, which is a pretty awesome place with excellent poutine.  (I’ve only spent two days there, so yes, the poutine does stand out, along with the excellent smoked meat sandwiches and a very crowded hostel… maybe it’s best if you don’t ask my advice on Montreal.)

Otherwise, I’ve also been passed a description from what appears to be a startup company looking for an “important position” with the following description:

Experience in genomic research relating to the development of novel computational approaches and tools. Preference may be given to candidates with expertise in one or more of the following areas: Modeling and network analysis; Molecular pathways; Systems biology; Comparative genomics; Quantitative genetics / Genomics (QTL / eQTL). Knowledge and ability of applying bioinformatics programming languages to develop and/or improve computational analysis tools (i.e. algorithms, statistical analysis).

If you’re interested, I can cheerfully pass contact information along to the right people.

An Open Post-Doc Position

From time to time, I hear of an open position, which I’m happy to post on my blog. If I were hunting for a post-doc position, I’d be tempted to check out this one in the Ramsey Lab at Oregon State University in Corvallis, Oregon. A quick excerpt:

You will have a key role in the lab’s research in gene regulatory networks in innate immune cells, developing integrative algorithms and applying them to analyze genomic, epigenomic, and transcriptomic data. The job is an exciting opportunity to combine state-of-the-art methods in machine learning and statistical network inference to improve our molecular network understanding of the innate immune system and its roles in diseases. More broadly, our research program aims to develop new methods for integrating “omics” datasets with an emphasis on high-impact applications in biomedicine.

If you are interested, you can find out more on the lab’s web page: http://lab.saramsey.org/#Join

 

New Years Resolutions 2014

This used to be a yearly tradition for me – setting goals or resolutions for myself. It’s mostly a way for me to give myself something to aim for, as well as a time limit in which to accomplish it. Unlike my daily task list, it’s for things that aren’t simple to resolve quickly – things that do take a year to complete. Last year, I had one task: recover from the insanity that was Denmark – and I think that’s been done. I still haven’t written up the lies and financial hell that CLC put me through on my way out of the country, but that isn’t so emotionally charged anymore that it hurts to write.  (I still don’t have a sense of humour about all of it, but that’s a different story entirely.)

In any case, my resolutions for 2014 are a little more career and family focused, aiming to bring a bit more balance back to my life.  Starting with the family, here they are:

  1. Teach my daughter to ask “Why?” instead of “What’s that?” (or “Dassit?”, as she pronounces it) – and then take the time to answer in as much detail as she can handle.
  2. Get back into photography, and take more pictures of my wife and my daughter.  A picture without a person in it is never as good as one that has someone in it – and it’s never as good as one that has someone you care about in it.
  3. Do more for my wife – after a year and a half at home with our daughter, I can’t express my gratitude for her patience enough, but I can do a better job of showing it.
  4. Get back into fencing. My daughter is sleeping through the evening, if not the whole night, most nights. It’s time for my wife and me to get out a bit more and get some physical activity, and my activity of choice involves pointy sticks.
  5. Finish off each and every one of my projects at work, and then publish it!  I have a nearly complete chip-seq project, chip-chip project, human methylation visualization project and several others.  It’s time they all got out into the world and into the hands of those who can use them.
  6. Social networking update.  I’ve been neglecting twitter, blogging and my feeds for too long.  It’s time for a fresh start, and a return to engaging with the world.
  7. Be a leader, not a follower.  I feel like I’ve been a bit on auto-pilot this year, in that I haven’t really done a lot of cutting edge work, and haven’t pushed the envelope as much as I’d like.  After a year in Denmark, where I spent all my time just trying to keep afloat over the culture shock and language barrier, I’ve lost a bit of my edge. It’s past time to get it back.

None of my resolutions this year are all that challenging, but they all have a place in helping me get back to being the person I would like to be. Isn’t that, after all, what New Year’s resolutions are about?

2-year computational biology position open in Grenoble, France

A quick announcement for a position available in France, with an outstanding researcher. (I’ve personally had the opportunity to work with François, and he is also a great guy, so this would be a pretty rocking position…)

A 2-year position in Computational Biology is available immediately in François PARCY group in Grenoble (France). The project aims at deciphering the rules governing transcriptional regulation in plants. We take flower development as a model system to study the interplay between transcription factors (TFs), genomic DNA features (accessibility, chromatin marks, methylation), and gene expression. We use genome-wide data (ChIP-Seq, expression data (RNA-Seq or microarray), DNAse-seq, plant genomes) to better understand the binding of TFs to the DNA and its impact on gene regulation. The applicant will be in charge of developing new methods and models to analyze the large-scale in-house and public data available and will interact with experimentalists to ground the model to biology. The ideal candidate will have already shown success in developing new tools/software analyzing large-scale (e.g. NGS) biological data.
 
We prefer applicant at the post-doctoral level but candidates with a master will also be considered. Grenoble is a great place for Science and also outdoors activities!
 
If you’re interested, please contact Francois PARCY (francois.parcy \at\ cea.fr).