Mongo Database tricks

I’ve been using MongoDB for just over a year now, plus or minus a few small tests I did with it before that, but only in the past year have I really played with it at a level that required more than a basic knowledge of how to use it.  Of course, I’m not using shards yet, so what I’ve learned applies only to a single DB instance – not altogether different from what you might have with a reasonably sized Postgres or MySQL database.

Regardless, a few things stand out for me, and I thought they were worth sharing, because getting your head wrapped around Mongo isn’t an obvious process, and while there is a lot of information on how to get started, there’s not a lot about how to “grok” your data in the context of building or designing a database.  So here are a few tips.

1. Everyone’s data is different.  

There isn’t going to be one recipe for success, because no two people have the same data, and even if they did, they probably aren’t going to store it the same way anyhow.  Thus, the best thing to do is look at your data critically, and expect some growing pains.  Experiment with the data and see how it groups naturally.

2. Indexes are expensive.

I was very surprised to discover that indexes carry a pretty big penalty, based on the number of documents in a collection.  I made the mistake in my original schema of creating a new document for every data point in every sample in my data set.  With 480k data points times 1000 samples, I quickly ended up with half a billion documents (or rows, for SQL people), each holding only one piece of data.  On its own, that wasn’t too efficient, but the real killer was that the two keys required to access the data took up more space than the data itself, inflating the database to an order of magnitude larger than it should have been.
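To make that concrete, here’s a minimal sketch of the per-data-point schema in Python with pymongo – the collection and field names are hypothetical stand-ins, not my actual schema:

```python
from pymongo import MongoClient, ASCENDING

db = MongoClient()["mydata"]  # hypothetical database name

# Naive schema: one document per data point.  With ~480k points per
# sample and ~1000 samples, this collection balloons to half a
# billion tiny documents.
db.datapoints.insert_one({
    "sample_id": "S0001",  # key 1
    "point_id": 123456,    # key 2
    "value": 0.731,        # the single piece of actual data
})

# The compound index needed to find anything ends up taking more
# space than the values it points at.
db.datapoints.create_index([("sample_id", ASCENDING), ("point_id", ASCENDING)])
```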

3. Grouping data can be very useful.

The solution to the large index problem turned out to be that it’s much more efficient to group data into “blobs” of whatever metric is useful to you.  In my case, samples come in batches for a “project”, so rebuilding the core collection to store data points by project instead of by sample turned out to be a pretty awesome way to go – not only did the number of documents drop by more than two orders of magnitude, but the grouping worked nicely for the interface as well.

This simple reordering of data dropped the size of my database from 160GB down to 14GB (because of the reduced size of the indexes required), and gave the web front end a roughly 10x speedup as well, partly because retrieving the records from disk is much faster.  (It searches a smaller space on the disk, and reads more contiguous areas.)
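Here’s what the regrouped version of the sketch above might look like – again, the names and the project size are hypothetical:

```python
from pymongo import MongoClient, ASCENDING

db = MongoClient()["mydata"]  # hypothetical database name

# Grouped schema: one document per (project, data point), holding the
# values for every sample in that project together.  If a project
# holds a few hundred samples, the document count -- and the index
# that goes with it -- shrinks by the same factor.
db.datapoints.insert_one({
    "project_id": "P01",
    "point_id": 123456,
    "values": {            # one entry per sample in the project
        "S0001": 0.731,
        "S0002": 0.640,
    },
})

db.datapoints.create_index([("project_id", ASCENDING), ("point_id", ASCENDING)])
```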

4. Indexes are everything.

Mongo has a nice feature: a query can use any prefix of a compound index as if that prefix were an index of its own.  If you have an existing index on fields “Country – Occupation – Name”, you also get ‘free’ virtual indexes on “Country – Occupation” and “Country”, because they are prefixes of the first index.  That’s kinda neat, but it also means that you don’t get a free index on “Occupation”, “Name” or “Occupation – Name”.  Thus, you have to create those indexes separately if you want them.
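A quick pymongo sketch of the same idea (the collection and field names are just for illustration):

```python
from pymongo import MongoClient, ASCENDING

people = MongoClient()["demo"]["people"]  # hypothetical collection

# One compound index...
people.create_index([
    ("country", ASCENDING),
    ("occupation", ASCENDING),
    ("name", ASCENDING),
])

# ...serves queries on its prefixes for free:
people.find({"country": "CA"})                           # uses the index
people.find({"country": "CA", "occupation": "farmer"})   # uses the index

# ...but not queries that skip the leading field(s):
people.find({"occupation": "farmer"})                    # collection scan
people.find({"name": "Ann"})                             # collection scan
```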

That means that accessing data from your database really needs to be carefully thought through – not unlike SQL, really.  It’s always a trade-off between how your data is organized and what queries you want to run.

So, unlike SQL, it’s almost easier to write your application and then design the database that lives underneath it.  In fact, that almost describes the process I followed: I made a database, learned what queries were useful, then redesigned the database to better suit the queries.  It is definitely a daunting process, but far more successful than trying to follow the SQL process, where the data is normalized first and then an app is written to take advantage of it.  Normalization isn’t entirely necessary with Mongo.

5. Normalization isn’t necessary, but relationships must be unique.

While you don’t have to normalize the way you would for SQL (relationships can be turned on their head quite nicely in Mongo, if you’re into that sort of thing), duplicating the relationship between two data points is bad.  If the same data exists in two places, you have to remember to modify both places simultaneously (e.g., fixing a typo in a province name), which isn’t too bad, but if a relationship between two items exists in two places, it gets overly complicated for everyday use.

My general rule of thumb has been to allow each relationship to appear in the database only once.  A piece of data can appear in as many places as it needs to, becoming a key to find something new.  However, the relationship between any two fields is important and must be kept updated – and exist in only one place.

As a quick example, I may have a field for “province” in the database.  If I have a collection of salespeople, I can write down which provinces each one lives or works in – and there may be 6 provinces for each salesperson.  That list would be the only place that relationship is stored, and for each salesperson it must be kept updated – and never duplicated.  If I want to know which salespeople are in a particular province, I would absolutely not store a separate collection of salespeople by province, but would instead query each salesperson’s record to see if they work in that province.  (It’s not as inefficient as you think, if you’re doing it right – see the sketch below.)

On the other hand, I may have information about the provinces themselves, so I would create a collection of provinces, each with its own details – but those relationships would also be unique, and I wouldn’t duplicate them elsewhere.  Salespeople wouldn’t appear in that list.
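A sketch of both collections in pymongo – all of the names here are hypothetical:

```python
from pymongo import MongoClient, ASCENDING

db = MongoClient()["sales"]  # hypothetical database

# The salesperson document is the ONE place the
# salesperson->province relationship lives.
db.salespeople.insert_one({
    "name": "Alice",
    "provinces": ["ON", "QC", "NB"],
})

# Province details live in their own collection, with no list of
# salespeople -- that would duplicate the relationship above.
db.provinces.insert_one({"code": "ON", "name": "Ontario", "capital": "Toronto"})

# "Which salespeople work in Ontario?" is a query, not a stored list.
# A multikey index on the array keeps it cheap.
db.salespeople.create_index([("provinces", ASCENDING)])
for person in db.salespeople.find({"provinces": "ON"}):
    print(person["name"])
```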

In Mongo, duplication of information isn’t bad – but duplication of relationships is!

6. There are no joins – and you shouldn’t try.

Map-reduce queries aside, you won’t be joining your collections.  You have to think about each collection as an answer to a set of queries.  If I want to know about a specific sample, I turn to the samples collection to get information.  If I want to know details about the relationship of a set of samples to a location in the data space, I first ask about the set of samples, then turn to my other collection with the knowledge of which samples I’m interested in, and ask a completely separate question.
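In code, that two-step pattern looks something like this (pymongo again, with the same hypothetical schema as above):

```python
from pymongo import MongoClient

db = MongoClient()["mydata"]  # hypothetical database

# Question 1: which samples am I interested in?
sample_ids = [
    s["_id"] for s in db.samples.find({"batch": "2012-03"}, {"_id": 1})
]

# Question 2: a completely separate query against the other
# collection, using the answer to question 1 as the bridge.
for doc in db.datapoints.find({"sample_id": {"$in": sample_ids}}):
    print(doc["value"])
```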

This makes things simultaneously easy and complex.  Complex if you’re used to SQL and just want to know about some intersecting data points.  Simple if you can break free of that mindset and ask it as two separate questions.  When your Mongo-fu is in good working order, you’ll understand how to break any complex question into a series of smaller questions.

Of course, that really takes you back to what I said in point 4 – your database should be designed once you understand the questions you’re going to ask of it.  Unlike SQL, knowing the questions you’re asking is as important as knowing your data.

7. Cursors are annoying, but useful.

Actually, this is something you’ll only discover if you’re doing things wrong.  Cursors, in Mongo’s APIs, are pretty useful – they are effectively iterators over your result set.  If you care about how things work, you’ll discover that they buffer result sets, sending them over in batches.  This isn’t really a key point unless you’re dealing with large volumes of documents.  If you go back to point #3, you’ll recall that I highly suggested grouping your data to reduce the number of records returned – and this is one more reason why.

No matter how fast your application is, running out of documents in a buffer and having to refill it over the network is always going to be slower than not running out of documents.  Try to keep your queries from returning large volumes of documents at a time.  If you can keep the results under the buffer size, you’ll always do better.  (You can also change the buffer size manually, but I didn’t have great luck improving performance that way.)
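For what it’s worth, here’s what both halves of that look like in pymongo (`batch_size` is the real knob; the query itself is hypothetical):

```python
from pymongo import MongoClient

db = MongoClient()["mydata"]  # hypothetical database

# A cursor is an iterator: documents arrive from the server in
# batches, and a new network round trip happens whenever a batch
# runs dry.
for doc in db.datapoints.find({"project_id": "P01"}):
    print(doc["point_id"])

# The batch ("buffer") size can be tuned manually, though it didn't
# buy me much:
cursor = db.datapoints.find({"project_id": "P01"}).batch_size(1000)
```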

As a caveat, I used to also try reading all the records from a cursor into an in-memory list and tossing away the cursor.  Meh – you don’t gain a lot that way.  It’s only worth doing if you need to put the data in a specific format, or do transformations over the whole data set.  Otherwise, don’t go there.
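For completeness, that case is a one-liner (same hypothetical schema):

```python
from pymongo import MongoClient

db = MongoClient()["mydata"]  # hypothetical database

# Materialize the whole result set only when you genuinely need all
# of it at once, e.g. for a whole-data-set transformation:
values = [doc["value"] for doc in db.datapoints.find({"project_id": "P01"})]
```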

Anything else I’ve missed?