Mongo Database tricks

I’ve been using MongoDB for just over a year now, plus or minus a few small tests I did with it before that, but only in the past year have I really played with it at a level that required more than a basic knowledge of how to use it.  Of course, I’m not using shards yet, so what I’ve learned applies just to a single DB instance – not altogether different from what you might have with a reasonably sized Postgres or MySQL database.

Regardless, a few things stand out for me, and I thought they were worth sharing, because getting your head wrapped around Mongo isn’t an obvious process, and while there is a lot of information on how to get started, there’s not a lot about how to “grok” your data in the context of building or designing a database.  So here are a few tips.

1. Everyone’s data is different.  

There isn’t going to be one recipe for success, because no two people have the same data, and even if they did, they probably aren’t going to store it the same way anyhow.  Thus, the best thing to do is look at your data critically, and expect some growing pains.  Experiment with the data and see how it groups naturally.

2.  Indexes are expensive.  

I was very surprised to discover that indexes carry a pretty big penalty, based on the number of documents in a collection.  I made the mistake, in my original schema, of creating a new document for every data point in every sample in my data set.  With 480k data points times 1000 samples, I quickly ended up with half a billion documents (or rows, for SQL people), each holding only one piece of data.  On its own, that wasn’t too efficient, but the real killer was that the two keys required to access the data took up more space than the data itself, inflating the size of the database by an order of magnitude more than it should have.
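To make that concrete, here’s a rough sketch in Python (pymongo) of the kind of schema I’m describing.  The collection and field names (datapoints, sample_id, probe_id, value) are hypothetical stand-ins for illustration, not my actual schema:

from pymongo import MongoClient, ASCENDING

db = MongoClient()["mydb"]  # assumes a local mongod on the default port

# One tiny document per data point -- half a billion of these...
db.datapoints.insert_one({"sample_id": "S0001", "probe_id": "P000123", "value": 0.42})

# ...plus a compound index on the two lookup keys, which ends up taking
# more space than the values themselves.
db.datapoints.create_index([("sample_id", ASCENDING), ("probe_id", ASCENDING)])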

3. Grouping data can be very useful.

The solution to the large index problem turned out to be that it’s much more efficient to group data into “blobs” of whatever metric is useful to you.  In my case, samples come in batches for a “project”, so rebuilding the core table to store data points by project instead of sample turned out to be a pretty awesome way to go – not only did the number of documents drop by more than two orders of magnitude, the grouping worked nicely for the interface as well.

This simple reordering of data dropped the size of my database from 160 GB down to 14 GB (because of the reduced size of the indexes required), and gave the web front end a roughly 10x speedup as well, partly because retrieving the records from disk is much faster.  (It searches a smaller space on the disk, and reads more contiguous areas.)
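For comparison, here’s an equally hypothetical sketch of the regrouped schema – one document per project, with the data points for each sample embedded inside it (again, the names are made up for illustration):

from pymongo import MongoClient

db = MongoClient()["mydb"]

# One document per project, holding the data points for all of its samples.
db.project_data.insert_one({
    "project_id": "PRJ001",
    "samples": {
        "S0001": {"P000123": 0.42, "P000124": 0.17},
        "S0002": {"P000123": 0.39, "P000124": 0.21},
    },
})

# A single small index replaces the enormous compound one, and a whole
# project comes back in one mostly contiguous read.
db.project_data.create_index("project_id")
project = db.project_data.find_one({"project_id": "PRJ001"})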

4. Indexes are everything.

Mongo has a nice feature: a query can use any prefix of a compound index as if it were a separate index.  If you have an existing index on fields “Country – Occupation – Name”, you also get ‘free’ virtual indexes on “Country – Occupation” and “Country”, because they are prefixes of the first index.  That’s kinda neat, but it also means that you don’t get a free index on “Occupation”, “Name” or “Occupation – Name”.  If you want those, you have to create them yourself.
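Here’s a minimal illustration of the prefix rule, using the Country/Occupation/Name example (the people collection and its field names are hypothetical):

from pymongo import MongoClient, ASCENDING

people = MongoClient()["mydb"].people

people.create_index([("country", ASCENDING),
                     ("occupation", ASCENDING),
                     ("name", ASCENDING)])

# These queries can use the index, because they filter on a prefix of it:
people.find({"country": "Canada"})
people.find({"country": "Canada", "occupation": "salesperson"})

# This one can't -- "occupation" on its own isn't a prefix -- so it needs
# its own index if you run it often:
people.find({"occupation": "salesperson"})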

That means that accessing data from your database really needs to be carefully thought through – not unlike SQL, really.  It’s always a trade-off between how your data is organized and what queries you want to run.

So, unlike SQL, it’s almost easier to write your application and then design the database that lives underneath it.  In fact, that pretty much describes the process I followed: I made a database, learned what queries were useful, then redesigned the database to better suit the queries.  It is definitely a daunting process, but far more successful than trying to follow the SQL process, where the data is normalized first and an app is then written to take advantage of it.  Normalization is not entirely necessary with Mongo.

5.  Normalization isn’t necessary, but relationships must be unique.

While you don’t have to normalize the way you would for SQL (relationships can be turned on their head quite nicely in Mongo, if you’re into that sort of thing), duplicating the relationship between two data points is bad.  If the same data exists in two places, you have to remember to modify both places simultaneously (e.g. fixing a typo in a province name), which isn’t too bad, but if the relationship between two items exists in two places, it gets overly complicated for everyday use.

My general rule of thumb has been to allow each relationship to appear in the database only once.  A piece of data can appear in as many places as it needs to, where it becomes a key to find something new.  However, the relationship between any two fields is important and must be kept updated – and must exist in only one place.

As a quick example, I may have a field for “province” in the database.  If I have a collection of salespeople in the database, I can write down which provinces each one lives or works in – and there may be six provinces for each salesperson.  That list would be the only place that relationship is stored.  For each salesperson, that list must be kept updated – and not duplicated.  If I want to know which salespeople are in a particular province, I would absolutely not store a second collection of salespeople by province, but would instead write queries that check each salesperson to see if they work in that province.  (It’s not as inefficient as you think, if you’re doing it right.)
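In pymongo terms, a sketch of that might look like the following (the salespeople collection and its fields are hypothetical; the point is that the list on each salesperson is the only place the relationship lives):

from pymongo import MongoClient

db = MongoClient()["mydb"]

db.salespeople.insert_one({
    "name": "Alice",
    "provinces": ["BC", "AB", "SK"],  # the one and only copy of this relationship
})

# "Which salespeople work in BC?" -- Mongo matches directly against array
# elements, and a (multikey) index on "provinces" keeps the query fast.
db.salespeople.create_index("provinces")
bc_reps = list(db.salespeople.find({"provinces": "BC"}))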

On the other hand,  I may have information about provinces, and I would create a list of provinces, each with their own details, but then those relationships would also be unique, and I wouldn’t duplicate that relationship elsewhere.  Salespeople wouldn’t appear in that list.

In Mongo, duplication of information isn’t bad – but duplication of relationships is!

6. There are no joins – and you shouldn’t try.

Map-reduce queries aside, you won’t be joining your collections.  You have to think about each collection as an answer to a set of queries.  If I want to know about a specific sample, I turn to the sample collection to get information.  If I want to know details about the relationship of a set of samples to a location in the data space, I first ask about the set of samples, then turn to my other collection, armed with the knowledge of which samples I’m interested in, and ask a completely separate question.
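As a rough sketch of that two-step pattern (the collections and fields below are hypothetical):

from pymongo import MongoClient

db = MongoClient()["mydb"]

# Question 1: which samples am I interested in?
sample_ids = [s["_id"] for s in db.samples.find({"project": "PRJ001"}, {"_id": 1})]

# Question 2: ask the second collection about just those samples --
# the "join" happens in the application, not the database.
details = db.locations.find({"sample_id": {"$in": sample_ids},
                             "region": "chr1:1000-2000"})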

This makes things simultaneously easy and complex.  Complex if you’re used to SQL and just want to know about some intersecting data points.  Simple if you can break free of that mindset and ask it as two separate questions.  When your Mongo-fu is in good working order, you’ll understand how to turn any complex question into a series of smaller questions.

Of course, that really takes you back to what I said in point 4 – your database should be written once you understand the questions you’re going to ask of it. Unlike SQL, knowing the questions you’re asking is as important as knowing your data.

7. Cursors are annoying, but useful.

Actually, this is something you’ll only discover if you’re doing things wrong.  Cursors, in the Mongo APIs, are pretty useful – they are effectively iterators over your result set.  If you care about how things work, you’ll discover that they buffer result sets, sending them back in chunks.  That isn’t really the key part unless you’re dealing with large result sets.  If you go back to point #3, you’ll recall that I strongly suggested grouping your data to reduce the number of documents returned – and this is one more reason why.

No matter how fast your application is, running out of documents in a buffer and having to refill it over the network is always going to be slower than not running out of documents.  Try to keep your queries from returning large volumes of documents at a time.  If you can keep the result set under the buffer size, you’ll always do better. (You can also change the buffer size manually, but I didn’t have great luck improving performance that way.)
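In pymongo, the batching is exposed on the cursor itself; here’s a small sketch (the collection name is hypothetical, and as noted above, tuning the batch size didn’t buy me much):

from pymongo import MongoClient

db = MongoClient()["mydb"]

cursor = db.project_data.find({"project_id": "PRJ001"}).batch_size(1000)
for doc in cursor:   # each time the buffer runs dry, a network round trip refills it
    pass             # ...do whatever you need with doc here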

As a caveat, I used to also try reading all the records from a cursor into a list and tossing away the cursor.  Meh – you don’t gain a lot that way.  It’s only worth doing if you need to put the data into a specific format, or do transformations over the whole data set.  Otherwise, don’t go there.

Anything else I’ve missed?

Replacing science publications in the 21st century

Yasset Perez-Riverol asked me to take a look at a post he wrote: a commentary on an article titled Beyond the Paper.  In fact, I suggest reading the original paper, as well as taking a look at Yasset’s wonderful summary image that’s being passed around.  There’s some merit to both of them in elucidating where the field is going, as well as how to capture the different forms of communication and the tools available to do so.

My first thought after reading both articles was “Wow… I’m not doing enough to engage in social media.”  And while that may be true, I’m not sure how many people have the time to do all of those things and still accomplish any real research.

Fortunately, as a bioinformatician, there are moments when you’ve sent all your jobs off and can take a blogging break.  (Come on statistics… find something good in this data set for me!)  And it doesn’t hurt when Lex Nederbragt asks your opinion, either.

However, I think there’s more to my initial reaction than just a glib feeling of under-accomplishment.  We really do need to consider streamlining the publication process, particularly for fast moving fields.  Whereas the blog and the paper above show how the current process can make use of social media, I’d rather take the opposite tack: how can social media replace the current process?  Instead of a slow, grinding peer-review process, a more technologically oriented one might replace a lot of the tools we currently have built ourselves around.  Let me take you on a little thought experiment, and please consider that I’m going to use my own field as an example, but I can see how it would apply to others as well. Imagine a multi-layered peer review process that goes like this:

  1. Alice has been working with a large data set that needs analysis.  Her first step is to put the raw data into an embargoed data repository.  She will have access to the data, perhaps even through the cloud, but now she has a backup copy, and one that can be released when she’s ready to share her data.  (A smart repository would release the data after 10 years, published or not, so that it can be used by others.)
  2. After a few months, she has a bunch of scripts that have cleaned up the data (normalization, trimming, whatever), yielding a nice clean data set.  These scripts end up in a source code repository, for instance github.
  3. Alice then creates a tool that allows her to find the best “hits” in her data set.  Not surprisingly, this goes to github as well.
  4. However, there’s also a metadata set – all of the commands she has run through parts two and three.  This could become her electronic notebook, and if Alice is diligent, she could use it as her methods section: it’s a clear, concise list of the commands needed to take her raw data to her best hits.
  5. Alice takes her best hits to her supervisor Bob to check over them.  Bob thinks this is worthy of dissemination – and decides they should draft a blog post, with links to the data (as an attached file, along with the file’s hash), the github code and the electronic notebook.
  6. When Bob and Alice are happy with their draft, they publish it – and announce their blog post to a “publisher”, who lists their post as an “unreviewed” publication on their web page.  The data in the embargoed repository is now released to the public so that they can see and process it as well.
  7. Chris, Diane and Elaine notice the post on the “unreviewed” list, probably via an RSS feed or by visiting the “publisher’s” page and see that it is of interest to them.  They take the time to read and comment on the post, making a few suggestions to the authors.
  8. The authors make note of the comments and take the time to refine their scripts, which shows up on github, and add a few paragraphs to their blog post – perhaps citing a few missed blogs elsewhere.
  9. Alice and Bob think that the feedback they’ve gotten back has been helpful, and they inform the publisher, who takes a few minutes to check that they have had comments and have addressed the comments, and consequently they move the post from the “unreviewed” list to the “reviewed” list.  Of course, checks such as ensuring that no data is supplied in the dreaded PDF format are performed!
  10. The publisher also keeps a copy of the text/links/figures of the blog post, so that a snapshot of the post exists. If future disputes arise over the reviewed status of the paper, or if the authors’ blog disappears, the publisher can repost the blog. (If the publisher were smart, they’d have hosted the blog post themselves right from the start, instead of having to duplicate someone else’s blog.)
  11. The publisher then sends out tweets with hashtags appropriate to the subject matter (perhaps even the key words attached to the article), and Alice’s and Bob’s peers are notified of the “reviewed” status of their blog post.  Chris, Diane and Elaine are given credit for having made contributions towards the review of the paper.
  12. Alice and Bob interact with the other reviewers via comments and tweets, links to which (trackbacks and pings) are kept from the article.  Authors from other fields can point out errors or other papers of interest in the comments below.
  13. Google notes all of this interaction, and updates the scholar page for Alice and Bob, noting the interactions, and number of tweets in which the blog post is mentioned.   This is held up next to some nice stats about the number of posts that Alice and Bob have authored, and the impact of their blogging – and of course – the number of posts that achieve the “peer reviewed” status.
  14. Reviews or longer comments can be done on other blog pages, which are then collected by the publisher and indexed on the “reviews” list, cross-linked from the original post.

Look – science just left the hands of the vested interests, and jumped back into the hands of the scientists!

Frankly, I don’t see it as being entirely far-fetched.  The biggest issue is going to be harmonizing a publisher’s blog with a personal blog – which means that personal blogs will most likely shrink pretty rapidly, or they’ll move towards consortia of “publishing” groups.

To be clear, the publisher in this case doesn’t have to be related whatsoever to the current publishers – they’ll make their money off of targeted ads, subscriptions to premium services (advanced notice of papers? better searches for relevant posts?), and their reputation will encourage others to join.  Better blogging tools and integration will be the grounds on which the services compete, and more engagement in social media will benefit everyone.  Finally, because the bar for new publishers to enter the field will be relatively low, new players simply have to out-compete the old publishers to establish a profitable foothold.

In any case – this is just a fantasy for now, but I can see it playing out successfully for those who have the time/vision/skills to grow a blogging network into something much more professional.  Anyone feel like doing this?

Feel free to comment below – although, alas, I don’t think your comments will ever make this publication count as “peer reviewed”, no matter how many of my peers review it. :(

What is a bioinformatician

I’ve been participating in an interesting conversation on LinkedIn, which has re-opened the age-old question of what a bioinformatician is – a question inspired by a conversation on Twitter that was later blogged.  Hopefully I’ve gotten that chain down correctly.

In any case, it appears that there are two competing schools of thought.  One is that “bioinformatician” describes a distinct role, and the other is that it’s a vague term that embraces anyone and anything that has to do with either biology or computer science.  Frankly, I feel the second definition is a waste of a perfectly good word, despite being the more commonly accepted usage.

That leads me to the following two illustrations.

How “bioinformatics” is often used – incorrectly, I would argue:

[Figure: bioinformatics_chart2]

And how it should be used, according to me:

[Figure: bioinformatics_chart1]

I think the second clearly describes something that just isn’t captured otherwise: a specific skill set that no other term covers.

In fact, I have often argued that “bioinformatician” really describes a position along a gradient from computer science to biology, where your skills in computer science determine whether you’re a computational biologist (someone who applies computer programs to solve biology problems) or a bioinformatician (someone who designs computer programs to solve biology problems).  Those, to me, are entirely different skill sets – and although bioinformaticians are often the ones who end up implementing the programs, that is yet another skill, and one that can be done by a programmer who doesn’t understand the biology.

[Figure: bioinformatics_chart3]

That, effectively, makes “bioinformatician” an accurate description of a useful skill set – and further subdivides the murky field of “people who understand biology and use computers”, which is vague enough to include people who use Excel spreadsheets to curate bacterial strain collections.

I suppose the next step is to get those who do taxonomy into the computational side of things and have them sort us all out.

Handy little command for upgrading python libraries…

About three weeks ago I googled for a quick tutorial on how to upgrade all of the libraries installed for Python – and came up completely empty handed. Absolutely nothing useful turned up, which I found rather frustrating. The Python package installer (pip) should certainly have an “upgrade all” function – but if it does, I couldn’t find it. If anyone comes across such a thing, I’d love to hear about it.

This morning, on my bike in to work, I realized I could hack a very quick command line together to make it work:

sudo pip freeze | awk -F'==' '{print $1}' | xargs -I {} sudo pip install {} --upgrade

Nothing to it! It iterates over the installed packages one by one and upgrades each of them. When a package is up to date, that’s clearly indicated, and when it isn’t, pip tries to upgrade it, rolling back if it’s unsuccessful. I’ve noticed that many of the upgrades failed because of an out-of-date numpy package, so you may want to upgrade that first. Also, Eclipse isn’t too happy with the process, as it will detect the changes and freak out a bit – you might want to exit anything using or depending on the Python libraries (such as the Django development server) first.

Of course, beware that this may involve re-compiling a fair amount of code, which means it’s not necessarily going to be fast. (Took about 15 minutes on my computer, with quite a few out of date libraries)

An Open Post-Doc Position

From time to time, I hear of an open position, which I’m happy to post on my blog.  If I were hunting for a post-doc position, I’d be tempted to check out this one in the Ramsey Lab at Oregon State University in Corvallis, Oregon. A quick excerpt:

You will have a key role in the lab’s research in gene regulatory networks in innate immune cells, developing integrative algorithms and applying them to analyze genomic, epigenomic, and transcriptomic data. The job is an exciting opportunity to combine state-of-the-art methods in machine learning and statistical network inference to improve our molecular network understanding of the innate immune system and its roles in diseases. More broadly, our research program aims to develop new methods for integrating “omics” datasets with an emphasis on high-impact applications in biomedicine.

If you are interested, you can find out more on the lab’s web page: http://lab.saramsey.org/#Join


Great primer on the why and how of genome sequencing

I’m often asked to explain the human genome project, or sequencing in general when discussing what I do with those outside of the field.  I’d like to think I’m not bad at explaining it in lay terms, either.

On the other hand, there’s now a video that does a VERY good job of this, written by Mark J. Kiel, from the University of Michigan.  The illustrations are a great mix of simplicity and detail that captures the essence of the process without omitting the actual science.  It’s pretty impressive and well worth the 5 minutes it takes to watch. You can also catch the full thing on YouTube:

New Years Resolutions 2014

This used to be a yearly tradition for me – setting goals or resolutions for myself. It’s mostly a way for me to give myself something to aim for, as well as a time limit in which to accomplish it. Unlike my daily task list, it’s for things that aren’t simple to resolve quickly – things that do take a year to complete. Last year, I had one task: recover from the insanity that was Denmark – and I think that’s been done. I still haven’t written up the lies and financial hell that CLC put me through on my way out of the country, but that’s no longer so emotionally charged that it hurts to write.  (I still don’t have a sense of humour about all of it, but that’s a different story entirely.)

In any case, my resolutions for 2014 are a little more career and family focused, aiming to bring a bit more balance back to my life.  Starting with the family, here they are:

  1. Teach my daughter to ask “Why?” instead of “What’s that?” (or “Dassit?”, as she pronounces it) – and then take the time to answer in as much detail as she can handle.
  2. Get back into photography, and take more pictures of my wife and my daughter.  A picture without a person in it is never as good as one that has someone in it – and it’s never as good as one that has someone you care about in it.
  3. Do more for my wife – after a year and a half at home with our daughter, I can’t express my gratitude for her patience enough, but I can do a better job of showing it.
  4. Get back into fencing.  My daughter is sleeping through the evening, if not the whole night, most nights.  It’s time for my wife and me to get out a bit more and get some physical activity, and my activity of choice involves pointy sticks.
  5. Finish off each and every one of my projects at work, and then publish them!  I have a nearly complete ChIP-seq project, a ChIP-chip project, a human methylation visualization project and several others.  It’s time they all got out into the world and into the hands of those who can use them.
  6. Social networking update.  I’ve been neglecting twitter, blogging and my feeds for too long.  It’s time for a fresh start, and a return to engaging with the world.
  7. Be a leader, not a follower.  I feel like I’ve been a bit on auto-pilot this year, in that I haven’t really done a lot of cutting edge work, and haven’t pushed the envelope as much as I’d like.  After a year in Denmark, where I spent all my time just trying to keep afloat over the culture shock and language barrier, I’ve lost a bit of my edge. It’s past time to get it back.

None of my resolutions this year are all that challenging, but they all have a place in helping me get back to being the person I would like to be.  Isn’t that, after all, what New Year Resolutions are about?

On 23andMe v. the FDA

Ok, it’s not really a court case… yet. However, from what I’ve read, it’s a pretty adversarial interaction. I’ve read a bunch of articles on the topic, already, and I have to say I’ve yet to see anyone state what I think is the obvious issue with the approach the FDA has taken.

They’re not regulating the equipment that does the testing.
They’re not regulating the interpretation of the information.

What’s left is that they appear to be regulating the business model. It’s ok to do exactly what 23andMe is doing, but it’s not ok to do it if the consumer is uneducated. Were they handing the tests to an MD (who may or may not know what to do with the information) or a researcher (who may or may not have the ability to tell the subject of the test what the results are), it would be fine. As soon as it’s being handed over to a general consumer, it’s now going to be regulated.

I find that pretty hard to swallow.

If the FDA wants to regulate it as a medical device, then fine – regulate access to the medical device itself, and don’t try to regulate the burgeoning field of information interpretation and dissemination.

(Sorry for the lack of links – it’s been a busy week.)

Dr. Dawn Bowdish on After Office Hours.

I’m going to blog more, damnit. I’ve been working on one big post to wrap up the Danish misadventure, and I haven’t wanted to post anything else until that’s done.

However, I came across a video that I wanted to share: Dr. Dawn Bowdish on After Office Hours. I’ve known Dawn since her grad school days, when she was in my wife’s lab. She’s given me great advice a few times in the past, and in the video she shares a bit about her personal life and career. It’s useful information for people who are thinking about an academic career.

edit: sorry for the typo in your name, Dawn! Fixed it as soon as it was brought to my attention.

2-year computational biology position open in Grenoble, France

A quick announcement for a position available in France, with an outstanding researcher. (I’ve personally had the opportunity to work with François, and he is also a great guy, so this would be a pretty rocking position…)

A 2-year position in Computational Biology is available immediately in François PARCY group in Grenoble (France). The project aims at deciphering the rules governing transcriptional regulation in plants. We take flower development as a model system to study the interplay between transcription factors (TFs), genomic DNA features (accessibility, chromatin marks, methylation), and gene expression. We use genome-wide data (ChIP-Seq, expression data (RNA-Seq or microarray), DNAse-seq, plant genomes) to better understand the binding of TFs to the DNA and its impact on gene regulation. The applicant will be in charge of developing new methods and models to analyze the large-scale in-house and public data available and will interact with experimentalists to ground the model to biology. The ideal candidate will have already shown success in developing new tools/software analyzing large-scale (e.g. NGS) biological data.
 
We prefer applicant at the post-doctoral level but candidates with a master will also be considered. Grenoble is a great place for Science and also outdoors activities!
 
If you’re interested, please contact Francois PARCY (francois.parcy \at\ cea.fr).