A bit of blogging

I’m more or less sure everyone has forgotten this blog by now… but that’s not a bad thing, really.   I don’t think I had much to say, and life has had a way of keeping me busy. Papers, work, changing work, changing diapers, all of it somehow keeps you from getting a lot of sleep, and that keeps me from having the motivation to write much.

However, I thought I’d start jotting down a few things that are interesting, as I come across them.  One thing I’ve recently discovered is that reddit has a bioinformatics subreddit (www.reddit.com/r/bioinformatics), which has been inspiring me to start writing again.

The other is that I’ve learned a LOT about MongoDB recently, which I’d like to start writing about – mostly under the “lessons learned” category, because scaling up software is just like scaling up in the lab: it doesn’t just work.  Scaling things is tough.

Otherwise, I have a move to Oakland coming up, and there will probably be a few Goodbye Vancouver/Hello Oakland posts as well.  Somehow, I think the urge to write is coming back, and I haven’t had that spark since Denmark ripped it out of me.  Perhaps that’s just a bit of optimism coming back.  I wouldn’t object to that.

On to Omicia!

I’ve dropped a few hints recently about where I’m headed.  I left a pretty awesome lab at the CMMT last week to join a small company that most people have probably never heard of.  Yes, I still owe the Kobor Lab a final blog post, and I have 2 publications in preparation at the moment, so I guess I haven’t completely left yet, but as of tomorrow morning, I’ll be starting my new role with Omicia.  They’re a small company in the Bay Area, but they have a disproportionate passion for their work and some pretty cool ideas and connections.  (Shoutout to the Yandell lab.  Yes, I’d like a few VAAST stickers, but I’m hoping to meet @zevkronenberg in person to ask for them one day…)

For the moment, I’m still in Vancouver, where I’ll stay until we find a place to live in the Oakland Area, but things are officially in motion.

Ironically, it’s been a long road to get here, since I first met up with Omicia about 3 years ago.  Somehow, it just took a long time for the stars to align… but here we are, and the real journey is about to begin.

Frontiers in Science: missing LaTeX packages.

I’m working on a manuscript to be submitted to a Frontiers journal, and discovered a few missing LaTeX dependencies, so I figured I’d share them here.

If you find you’re missing chngpage.sty, install texlive-latex-extra.

If you find you’re missing lineno.sty, install texlive-humanities.

On a Mac, that’s:

sudo port -v install texlive-humanities texlive-latex-extra

Happy compiling.

What’s the point of doing a PhD? (reply to Kathy Weston)

I wanted to comment on a blog post called “What’s the point of doing a PhD” on Blue Skies and Bench Space, by Kathy Weston.  Right off the top, I want to admit that I’ve not followed the blog in the past, and my comments aren’t to say that Kathy doesn’t have a point, but that what she’s proposing is only a partial solution – or rather, it feels to me like it’s only half the picture.

Warning, I’ve not edited this yet – it’s probably pretty edgy.

I believe Kathy is responding to a report (Alberts et al) that proposes such things as cutting the number of postdocs, creating more staff scientist positions and making sure that non-academic PhD options are seen as successful careers.  Those are the usual talking points of academics on this subject, so I don’t see much new there.  Personally, I suspect that the systemic failures of the research community are a small part of a far broader culture war – one in which research is seen as an “optional” part of the economy rather than a driver, which results in endless budget cuts and leads to our current system – rather than an issue on its own.  However, that’s another post for another time.

Originally, I’d written out a point-by-point rebuttal of the whole thing, but I realized I can sum it up in one nice little package:  Please read all of Kathy’s criteria for who should get a PhD.  Maybe read it twice, then think about what she’s selecting for… could it possibly be academics?

Advice to undergrads could be summarized as: prepare for a career of 80 hour workweeks (aka, the academic lifestyle) and if you don’t know for sure why you’re getting your PhD (aka, to become an academic), don’t do it!   Frankly,  there are lots of reasons to get a PhD that don’t involve becoming an academic.  There’s nothing wrong with that path, but a PhD leads to many MANY 9-5 jobs, if that’s what you want, or sales jobs, or research jobs, or entrepreneurial jobs.  Heck, my entire life story could be summed up as “I don’t know why I need to know that, but it’s cool, so I’ll go learn it!”, which is probably why I was so upset with Kathy’s article in the first place.

Let’s summarize the advice to PhD students: if you don’t know why you need to learn a specific skill, don’t do a postdoc! I’m going to gloss over the rest of that section – I really don’t think I need a committee of external adjudicators to tell me if they think my dreams are firmly grounded, and if your dream isn’t to be an academic, why should you walk away from a postdoc?

Advice to postdocs: “This is your last chance to become an academic, so think hard about it!”  Meh – Academics R Us.  (Repeat ad nauseam for N-plex postdoc positions.)

Everything else is just a rehash of the tenure track, with some insults thrown in:  “Don’t hire mediocre people” is just the salt in the wound.  No one wants to hire mediocre people, but people who are brilliant at one thing are often horribly bad at another.  Maybe the job is a bad fit.  Maybe the environment is a bad fit.  Is there a ruler by which we can judge another person’s mediocrity?  Perhaps Kathy’s post is mediocre, in my opinion, but there are likely thousands of people who think it’s great – should I tell others not to hire Kathy?  NO!

I think this whole discussion needs to be re-framed into something more constructive.  We can’t keep the mediocre people out of science – and we shouldn’t even try.  We shouldn’t tell people they can’t get a PhD, or discourage them.  What we should be doing is three-fold:

First, we should take a long hard look at the academic system and ask ourselves why we allow Investigators to exploit young students in the name of research.  Budget cuts aren’t going to come to a sudden halt, and exploitation is only going to get worse as long as we continue to have a reward-based system that requires more papers with less money.  It can’t be done indefinitely.

Second, we should start giving students the tools to ask the right questions early in their careers.  I’d highlight organizations like UBC’s Student Biotechnology Network, which exist with this goal as their main function.  Educate students to be aware of the fact that >90% of the jobs that will exist once they’ve finished their degrees will be non-academic.  A dose of accurate statistics never hurts your odds in preparing for the future.

Finally, we can also stop this whole nonsense that academia is the goal of the academic process!  Seriously, people.  Not everyone wants to be a prof, so we should stop up-selling it.  Tenure is not a golden apple, or the pot of gold at the end of the rainbow.  It’s a career, and we don’t all need to idolize it.  Just like we’re not all going to be CEOs (and wouldn’t all want to be), we’re not all going to be professors emeritus!

If you’re an Investigator, and you want to do your students a favour, help organize events where they get to see the other (fantastic!) career options out there.  Help make the contacts that will help them find jobs.  Help your students ask the right questions… and then ask them yourself, why did you hire 8 post-docs?  Is it because they are cheap trained labour, or are you actually invested in their careers too?

Let’s not kid ourselves – part of it is the system.  The other part of it is people who are exploiting the system.

A few open bioinformatics positions.

Occasionally, I get emailed information about open positions at bioinformatics companies, so I thought I’d pass along a couple today.

First and foremost, if anyone is interested, the company I’m looking forward to starting with next week is hiring, so I’ll pass along that link: http://www.omicia.com/jobs/ – there are software engineer, bioinformatics and data scientist positions available, so I suggest checking them out.

Second, for those who are a little further in their career, I understand that Caprion is looking for a director of bioinformatics as well as a biostatistician (http://www.caprion.com/en/caprion/career.php).  It’s a little far outside my field, given that it’s mostly proteomics work, but I’ve heard good things about Caprion, and they’re in Montreal, which is a pretty awesome place with excellent poutine.  (I’ve only spent two days there, so yes, the poutine does stand out, along with the excellent smoked meat sandwiches and a very crowded hostel… maybe it’s best if you don’t ask my advice on Montreal.)

Otherwise, I’ve also been passed a posting from what appears to be a startup company looking to fill an “important position”, described as follows:

Experience in genomic research relating to the development of novel computational approaches and tools. Preference may be given to candidates with expertise in one or more of the following areas: Modeling and network analysis; Molecular pathways; Systems biology; Comparative genomics; Quantitative genetics / Genomics (QTL / eQTL). Knowledge and ability of applying bioinformatics programming languages to develop and/or improve computational analysis tools (i.e. algorithms, statistical analysis).

If you’re interested, I can cheerfully pass contact information along to the right people.

Mongo Database tricks

I’ve been using MongoDB for just over a year now, plus or minus a few small tests I did with it before that, but only in the past year have I really played with it at a level that required more than just a basic knowledge of how to use it.  Of course, I’m not using shards yet, so what I’ve learned applies just to a single DB instance – not altogether different from what you might have with a reasonably sized Postgres or MySQL database.

Regardless, a few things stand out for me, and I thought they were worth sharing, because getting your head wrapped around Mongo isn’t an obvious process, and while there is a lot of information on how to get started, there’s not a lot about how to “grok” your data in the context of building or designing a database.  So here are a few tips.

1. Everyone’s data is different.  

There isn’t going to be one recipe for success, because no two people have the same data, and even if they did, they probably aren’t going to store it the same way anyhow.  Thus, the best thing to do is look at your data critically, and expect some growing pains.  Experiment with the data and see how it groups naturally.

2.  Indexes are expensive.  

I was very surprised to discover that indexes carry a pretty big penalty, based on the number of documents in a collection.  I made the mistake in my original schema of making a new document for every data point in every sample in my data set.  With 480k data points times 1000 samples, I quickly ended up with half a billion documents (or rows, for SQL people), each holding only one piece of data.  On its own, that wasn’t too efficient, but the real killer was that the two keys required to access the data took up more space than the data itself, inflating the size of the database by an order of magnitude more than it should have.

3. Grouping data can be very useful.

The solution to the large index problem turned out to be that it’s much more efficient to group data into “blobs” of whatever metric is useful to you.  In my case, samples come in batches for a “project”, so rebuilding the core table to store data points by project instead of sample turned out to be a pretty awesome way to go – not only did the number of documents drop by more than two orders of magnitude, the grouping worked nicely for the interface as well.

This simple reordering of data dropped the size of my database from 160 GB down to 14 GB (because of the reduced size of the indexes required), and gave the web front end a roughly 10x speedup as well, partly because retrieving the records from disk is much faster.  (It searches a smaller space on the disk, and reads more contiguous areas.)
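For what it’s worth, here’s a minimal sketch of the before-and-after document shapes in Python (pymongo), with invented database, collection and field names – the real schema had more going on, but the change in shape is the point:

from pymongo import MongoClient

db = MongoClient()["mydb"]  # hypothetical database name

# Before: one tiny document per data point per sample.  Half a billion of
# these, each carrying indexed keys that outweighed the value itself.
db.datapoints.insert_one({"sample_id": "S0001", "probe_id": "p000001", "value": 0.42})

# After: one document per data point per *project*, with the per-sample
# values embedded.  Far fewer documents, and far smaller indexes.
db.project_data.insert_one({
    "project_id": "PROJ01",
    "probe_id": "p000001",
    "values": {"S0001": 0.42, "S0002": 0.38, "S0003": 0.51},
})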

4. Indexes are everything.

Mongo has a nice feature whereby a query can use any prefix of an existing compound index as if that prefix were an index of its own.  If you have an existing index on the fields “Country – Occupation – Name”, you also get ‘free’ virtual indexes on “Country – Occupation” and “Country”, because they are prefixes of the first index.  That’s kinda neat, but it also means that you don’t get a free index on “Occupation”, “Name” or “Occupation – Name”.  Thus, you have to create those indexes separately if you want them.
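A quick pymongo illustration of that prefix behaviour (the collection and field names are just for the example):

from pymongo import MongoClient, ASCENDING

people = MongoClient()["mydb"]["people"]  # hypothetical collection

# One compound index on Country - Occupation - Name...
people.create_index([("country", ASCENDING), ("occupation", ASCENDING), ("name", ASCENDING)])

# ...also serves these queries, because they match a prefix of that index:
people.find({"country": "Canada"})
people.find({"country": "Canada", "occupation": "salesperson"})

# ...but not these, which need their own indexes to avoid a full collection scan:
people.find({"occupation": "salesperson"})
people.find({"name": "Alice"})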

That means that accessing data from your database really needs to be carefully thought through – not unlike SQL, really.  It’s always a trade-off between how your data is organized and what queries you want to run.

So, unlike SQL, it’s almost easier to write your application and then design the database that lives underneath it.  In fact, that almost describes the process I followed: I made a database, learned what queries were useful, then redesigned the database to better suit the queries.  It is definitely a daunting process, but far more successful than trying to follow the SQL process, where the data is normalized first and then an app is written to take advantage of it.  Normalization is not entirely necessary with Mongo.

5.  Normalization isn’t necessary, but relationships must be unique.

While you don’t have to normalize the way you would for SQL (relationships can be turned on their head quite nicely in Mongo, if you’re into that sort of thing), duplicating the relationship between two data points is bad.  If the same data exists in two places, you have to remember to modify both places simultaneously (e.g., fixing a typo in a province name), which isn’t too bad; but if a relationship between two items exists in two places, it gets overly complicated for everyday use.

My general rule of thumb has been to allow each relationship to appear in the database only once.  A piece of data can appear in as many places as it needs to, where it becomes a key to find something new.  However, the relationship between any two fields is important and must be kept updated – and exist in only one place.

As a quick example, I may have a field for “province” in the database.  If I have a collection of salespeople, I can write down which provinces each one lives/works in – and there may be 6 provinces for each salesperson.  That list would be the only place that relationship is stored.  For each salesperson, the list must be kept updated – and not duplicated.  If I want to know which salespeople are in a particular province, I would absolutely not store a second collection of salespeople by province, but would instead write queries that check each salesperson to see if they work in that province.  (It’s not as inefficient as you think, if you’re doing it right.)
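In pymongo terms, that lookup is just a match against the array field – a sketch with invented names, but this is the pattern I mean:

from pymongo import MongoClient

db = MongoClient()["mydb"]

# Each salesperson document carries its own list of provinces -- the only
# place that relationship lives.
db.salespeople.insert_one({"name": "Alice", "provinces": ["BC", "AB", "SK"]})

# "Who works in BC?" is a simple match against the array; a (multikey)
# index on "provinces" keeps it fast.
bc_people = list(db.salespeople.find({"provinces": "BC"}))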

On the other hand,  I may have information about provinces, and I would create a list of provinces, each with their own details, but then those relationships would also be unique, and I wouldn’t duplicate that relationship elsewhere.  Salespeople wouldn’t appear in that list.

In Mongo, duplication of information isn’t bad – but duplication of relationships is!

6. There are no joins – and you shouldn’t try.

Map-reduce queries aside, you won’t be joining your collections.  You have to think about each collection as an answer to a set of queries.  If I want to know about a specific sample, I turn to the sample collection to get information.  If I want to know details about the relationship of a set of samples to a location in the data space, I first ask about the set of samples, then turn to my other collection, armed with the knowledge of which samples I’m interested in, and ask a completely separate question.

This makes things simultaneously simple and complex.  Complex if you’re used to SQL and just want to know about some intersecting data points; simple if you can break free of that mindset and ask it as two separate questions.  When your Mongo-fu is in good working order, you’ll understand how to break any complex question into a series of smaller ones.
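As a rough sketch of that two-question pattern (again with invented collection and field names):

from pymongo import MongoClient

db = MongoClient()["mydb"]

# Question 1: which samples am I interested in?
sample_ids = [s["sample_id"]
              for s in db.samples.find({"project_id": "PROJ01"}, {"sample_id": 1})]

# Question 2: ask a completely separate question of the other collection,
# armed with the answer to the first -- no join required.
hits = db.measurements.find({"sample_id": {"$in": sample_ids}})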

Of course, that really takes you back to what I said in point 4 – your database should be written once you understand the questions you’re going to ask of it. Unlike SQL, knowing the questions you’re asking is as important as knowing your data.

7. Cursors are annoying, but useful.

Actually, this is something you’ll only discover if you’re doing things wrong.  Cursors, in the Mongo APIs, are pretty useful – they are effectively iterators over your result set.  If you care about how things work under the hood, you’ll discover that they buffer result sets, sending them back in chunks.  That isn’t really the key part unless your queries return large numbers of documents.  If you go back to point #3, you’ll recall that I strongly suggested grouping your data to reduce the number of records returned – and this is one more reason why.

No matter how fast your application is, running out of documents in a buffer and having to refill it over the network is always going to be slower than not running out of documents.  Try to keep your queries from returning large volumes of documents at a time.  If you can keep the result under the buffer size, you’ll always do better. (You can also change the buffer size manually, but I didn’t have great luck improving performance that way.)
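If you do want to play with the buffer, pymongo exposes it per cursor; this is the knob I tried, without much benefit in my case (the names below are invented for the example):

from pymongo import MongoClient

db = MongoClient()["mydb"]

# batch_size() controls how many documents come back per network round trip.
cursor = db.measurements.find({"project_id": "PROJ01"}).batch_size(1000)

count = 0
for doc in cursor:
    count += 1  # stand-in for the real per-document work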

As a caveat, I used to also try reading all the records from a cursor into a table and tossing away the cursor.  Meh – you don’t gain a lot that way.  It’s only worth doing if you need to put the data in a specific format, or do transformations over the whole data set.   Otherwise, don’t go there.

Anything else I’ve missed?

Replacing science publications in the 21st century

Yasset Perez-Riverol asked me to take a look at a post he wrote: a commentary on an article titled Beyond the Paper.  In fact, I suggest reading the original paper, as well as taking a look at Yasset’s wonderful summary image that’s being passed around.  There’s some merit to both of them in elucidating where the field is going, as well as how to capture the different forms of communication and the tools available to do so.

My first thought after reading both articles was “Wow… I’m not doing enough to engage in social media.”  And while that may be true, I’m not sure how many people have the time to do all of those things and still accomplish any real research.

Fortunately, as a bioinformatician, there are moments when you’ve sent all your jobs off and can take a blogging break.  (Come on statistics… find something good in this data set for me!)  And it doesn’t hurt when Lex Nederbragt asks your opinion, either.

However, I think there’s more to my initial reaction than just a glib feeling of under-accomplishment.  We really do need to consider streamlining the publication process, particularly for fast moving fields.  Whereas the blog and the paper above show how the current process can make use of social media, I’d rather take the opposite tack: how can social media replace the current process?  Instead of a slow, grinding peer-review process, a more technologically oriented one might replace a lot of the tools we currently have built ourselves around.  Let me take you through a little thought experiment; please consider that I’m going to use my own field as an example, but I can see how it would apply to others as well. Imagine a multi-layered peer review process that goes like this:

  1. Alice has been working with a large data set that needs analysis.  Her first step is to put the raw data into an embargoed data repository.  She will have access to the data, perhaps even through the cloud, but now she has a backup copy, and one that can be released when she’s ready to share her data.  (A smart repository would release the data after 10 years, published or not, so that it can be used by others.)
  2. After a few months, she has a bunch of scripts that have cleaned up the data (normalization, trimming, whatever), yielding a nice clean data set.  These scripts end up in a source code repository, for instance github.
  3. Alice then creates a tool that allows her to find the best “hits” in her data set.  Not surprisingly, this goes to github as well.
  4. However, there’s also a meta data set – all of the commands she ran in steps 2 and 3.  This could become her electronic notebook, and if Alice is diligent, she could use it as her methods section: a clear, concise list of the commands needed to take her raw data to her best hits.
  5. Alice takes her best hits to her supervisor Bob to check over them.  Bob thinks this is worthy of dissemination – and decides they should draft a blog post, with links to the data (as an attached file, along with the file’s hash), the github code and the electronic notebook.
  6. When Bob and Alice are happy with their draft, they publish it – and announce their blog post to a “publisher”, who lists their post as an “unreviewed” publication on their web page.  The data in the embargoed repository is now released to the public so that they can see and process it as well.
  7. Chris, Diane and Elaine notice the post on the “unreviewed” list, probably via an RSS feed or by visiting the “publisher’s” page and see that it is of interest to them.  They take the time to read and comment on the post, making a few suggestions to the authors.
  8. The authors make note of the comments and take the time to refine their scripts, which shows up on github, and add a few paragraphs to their blog post – perhaps citing a few missed blogs elsewhere.
  9. Alice and Bob think that the feedback they’ve gotten back has been helpful, and they inform the publisher, who takes a few minutes to check that they have had comments and have addressed the comments, and consequently they move the post from the “unreviewed” list to the “reviewed” list.  Of course, checks such as ensuring that no data is supplied in the dreaded PDF format are performed!
  10. The publisher also keeps a copy of the text/links/figures of the blog post, so that a snapshot of the post exists. If future disputes over the reviewed status of the paper occur, or if the authors’ blog disappears, the publisher can repost it. (If the publisher were smart, they’d have hosted the blog post right from the start, instead of having to duplicate someone else’s blog after the fact.)
  11. The publisher then sends out tweets with hashtags appropriate to the subject matter (perhaps even the key words attached to the article), and Alice’s and Bob’s peers are notified of the “reviewed” status of their blog post.  Chris, Diane and Elaine are given credit for having made contributions towards the review of the paper.
  12. Alice and Bob interact with the other reviewers via comments and tweets, links to which are kept from the article (trackbacks and pings). Authors from other fields can point out errors or other papers of interest in the comments below.
  13. Google notes all of this interaction, and updates the scholar page for Alice and Bob, noting the interactions, and number of tweets in which the blog post is mentioned.   This is held up next to some nice stats about the number of posts that Alice and Bob have authored, and the impact of their blogging – and of course – the number of posts that achieve the “peer reviewed” status.
  14. Reviews or longer comments can be done on other blog pages, which are then collected by the publisher and indexed on the “reviews” list, cross-linked from the original post.

Look – science just left the hands of the vested interests, and jumped back into the hands of the scientists!

Frankly, I don’t see it as being entirely far-fetched.  The biggest issue is going to be harmonizing a publisher’s blog with a personal blog – which means that personal blogs will probably shrink pretty rapidly, or move towards consortia of “publishing” groups.

To be clear, the publisher in this case doesn’t have to be related whatsoever to the current publishers – they’ll make their money off of targeted ads, subscriptions to premium services (advance notice of papers? better searches for relevant posts?), and their reputation will encourage others to join.  Better blogging tools and integration will be the grounds on which the services compete, and more engagement in social media will benefit everyone.  Finally, because the bar for new publishers to enter the field will be relatively low, new players simply have to out-compete the old publishers to establish a good, profitable foothold.

In any case – this appears to be just a fantasy, but I can see it play out successfully for those who have the time/vision/skills to grow a blogging network into something much more professional.  Anyone feel like doing this?

Feel free to comment below – although, alas, I don’t think your comments will ever make this publication count as “peer reviewed”, no matter how many of my peers review it. :(

What is a bioinformatician?

I’ve been participating in an interesting conversation on LinkedIn, which has re-opened the age-old question of what a bioinformatician is.  The LinkedIn thread was inspired by a conversation on twitter, which was later blogged.  Hopefully I’ve gotten that chain down correctly.

In any case, it appears that there are two competing schools of thought.  One is that “bioinformatician” describes a distinct role, and the other is that it’s a vague term that embraces anyone and anything that has to do with either biology or computer science.  Frankly, I feel the second definition is a waste of a perfectly good word, despite being the more commonly accepted usage.

That leads me to the following two illustrations.

How “bioinformatics” is often used – incorrectly, I would argue:

[Image: bioinformatics_chart2]

And how it should be used, according to me:

[Image: bioinformatics_chart1]

I think the second clearly describes something that just isn’t captured otherwise: a specific skill set that no other term covers.

In fact, I have often argued that “bioinformatician” really describes a position along a gradient from computer science to biology, where your skills in computer science determine whether you’re a computational biologist (someone who applies computer programs to solve biology problems) or a bioinformatician (someone who designs computer programs to solve biology problems). Those, to me, are entirely different skill sets – and although bioinformaticians are often the ones who end up implementing the computer programs, that’s yet another skill, one that can be handled by a programmer who doesn’t understand the biology.

[Image: bioinformatics_chart3]

That, effectively, makes “bioinformatician” an accurate description of a useful skill set – and further divides the murky field of “people who understand biology and use computers”, which is vague enough to include people who use an Excel spreadsheet to curate bacterial strain collections.

I suppose the next step is to get those who do taxonomy into the computational side of things and have them sort us all out.

Handy little command for upgrading python libraries…

About three weeks ago I googled for a quick tutorial on how to upgrade all of the Python libraries installed on my system – and came up completely empty handed. Absolutely nothing useful turned up, which I found rather frustrating. The Python package installer (pip) should certainly have an “upgrade all” function – but if it does, I couldn’t find it. If anyone comes across such a thing, I’d love to hear about it.

This morning, on my bike in to work, I realized I could hack a very quick command line together to make it work:

sudo pip freeze | awk -F'==' '{print $1}' | xargs -I {} sudo pip install {} --upgrade

Nothing to it! It iterates over the installed packages one by one and upgrades each of them. When a package is up to date, that’s clearly indicated, and when it’s not, pip tries to upgrade it, rolling back if it’s unsuccessful. I’ve noticed that many of the upgrades failed because of an out-of-date numpy package, so you may want to upgrade that one first. Also, Eclipse isn’t too happy with the process, as it will detect the changes and freak out a bit – you might want to exit anything using or depending on the Python libraries (such as a Django web server) first.
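If you’re somewhere without awk or xargs handy (Windows, say), the same logic is easy enough in plain Python – this is just a quick sketch of the idea, not something I’ve battle-tested:

import subprocess

# List what pip knows about, keep the package names, and upgrade each one.
# Run it with the same privileges you'd use for the one-liner above.
frozen = subprocess.check_output(["pip", "freeze"]).decode().splitlines()
packages = [line.split("==")[0] for line in frozen if "==" in line]
for pkg in packages:
    subprocess.call(["pip", "install", "--upgrade", pkg])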

Of course, beware that this may involve re-compiling a fair amount of code, which means it’s not necessarily going to be fast. (It took about 15 minutes on my computer, with quite a few out-of-date libraries.)

An Open Post-Doc Position

From time to time, I hear of an open position, which I’m happy to post on my blog.  If I were hunting for a post-doc position, I’d be tempted to check out this one in the Ramsey Lab at Oregon State University in Corvallis, Oregon. A quick excerpt:

You will have a key role in the lab’s research in gene regulatory networks in innate immune cells, developing integrative algorithms and applying them to analyze genomic, epigenomic, and transcriptomic data. The job is an exciting opportunity to combine state-of-the-art methods in machine learning and statistical network inference to improve our molecular network understanding of the innate immune system and its roles in diseases. More broadly, our research program aims to develop new methods for integrating “omics” datasets with an emphasis on high-impact applications in biomedicine.

If you are interested, you can find out more on the lab’s web page: http://lab.saramsey.org/#Join