Mongo Database tricks

I’ve been using MongoDB for just over a year now, plus or minus a few small tests I did with it before that, but only in the past year have I really played with it at a level that required more than a basic knowledge of how to use it.  Of course, I’m not using shards yet, so what I’ve learned applies just to a single DB instance – not altogether different from what you might have with a reasonably sized Postgres or MySQL database.

Regardless, a few things stand out for me, and I thought they were worth sharing, because getting your head wrapped around Mongo isn’t an obvious process, and while there is a lot of information on how to get started, there’s not a lot about how to “grok” your data in the context of building or designing a database.  So here are a few tips.

1. Everyone’s data is different.  

There isn’t going to be one recipe for success, because no two people have the same data, and even if they did, they probably aren’t going to store it the same way anyhow.  Thus, the best thing to do is look at your data critically, and expect some growing pains.  Experiment with the data and see how it groups naturally.

2.  Indexes are expensive.  

I was very surprised to discover that indexes carry a pretty big penalty, based on the number of documents in a collection.  I made the mistake in my original schema of making a new document for every data point in every sample in my data set.  With 480k data points times 1000 samples, I quickly ended up with half a billion documents (or rows, for SQL people), each holding only one piece of data.  On its own, that wasn’t too efficient, but the real killer was that the two keys required to access the data took up more space than the data itself, inflating the size of the database by an order of magnitude more than it should have.
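To put rough numbers on it, here’s a back-of-envelope sketch of why that schema blew up.  The per-value and per-key byte counts below are my own assumptions for illustration, not measured figures:

```python
# Back-of-envelope model of the one-document-per-data-point schema.
# The byte counts are assumptions for illustration, not measured figures.

DATA_POINTS = 480_000
SAMPLES = 1_000

documents = DATA_POINTS * SAMPLES      # one tiny document per data point
print(documents)                       # 480,000,000 documents

BYTES_PER_VALUE = 8        # assumed: a single numeric data point
BYTES_PER_KEY_PAIR = 80    # assumed: the two indexed keys plus overhead

data_bytes = documents * BYTES_PER_VALUE
index_bytes = documents * BYTES_PER_KEY_PAIR

# The keys dwarf the data they point at - an order of magnitude here.
print(index_bytes / data_bytes)        # 10.0
```

Whatever the exact per-key overhead turns out to be on your data, it gets multiplied by the document count, which is why tiny documents hurt so much.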

3. Grouping data can be very useful.

The solution to the large index problem turned out to be that it’s much more efficient to group data into “blobs” of whatever metric is useful to you.  In my case, samples come in batches for a “project”, so rebuilding the core table to store data points by project instead of sample turned out to be a pretty awesome way to go – not only did the number of documents drop by more than two orders of magnitude, the grouping worked nicely for the interface as well.

This simple reordering of data dropped the size of my database from 160Gb down to 14Gb (because of the reduced size of the indexes required), and gave the web front end a roughly 10x speedup as well, partly because retrieving the records from disk is much faster.  (It searches a smaller space on the disk, and reads more contiguous areas.)
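As a sketch of the before-and-after, here are two hypothetical schemas.  The field names and the samples-per-project count are invented for illustration, not my actual schema:

```python
# Hypothetical before/after schemas; field names and the samples-per-
# project count are invented for illustration.

# Before: one tiny document per (sample, data point) pair.
point_doc = {"sample": "s0001", "point": 12345, "value": 0.42}

# After: one document per (project, data point), embedding every
# sample's value for that point in a single sub-document.
grouped_doc = {
    "project": "p01",
    "point": 12345,
    "values": {"s0001": 0.42, "s0002": 0.37},  # ...one entry per sample
}

# Grouping by project shrinks the document count (and the indexes that
# sit on top of it) by the number of samples per project.
samples_per_project = 500              # assumed batch size
docs_before = 480_000 * 1_000
docs_after = docs_before // samples_per_project
print(docs_before // docs_after)       # 500x fewer documents to index
```

The point isn’t the exact numbers – it’s that the index only has to cover one entry per blob instead of one per data point.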

4. Indexes are everything.

Mongo has a nice feature whereby an index can serve queries on any prefix of its fields as if each prefix were a separate index.  If you have an existing index on fields “Country – Occupation – Name”, you also get ‘free’ virtual indexes “Country – Occupation” and “Country” as well, because they are prefixes of the first index.  That’s kinda neat, but it also means that you don’t get a free index on “Occupation”, “Name” or “Occupation – Name”.  Thus, you have to create those indexes separately if you want them.
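The prefix rule can be modelled in a few lines.  This is a deliberate simplification – the real query planner is cleverer about which predicates it can use – but it captures the “free virtual index” behaviour described above:

```python
def covered_by_prefix(index_fields, query_fields):
    """True if the queried fields form a left prefix of the compound index.

    A simplification of MongoDB's planner, but it captures the
    'free virtual index' rule described above.
    """
    return list(query_fields) == list(index_fields[: len(query_fields)])

idx = ["Country", "Occupation", "Name"]

print(covered_by_prefix(idx, ["Country"]))                # True: free
print(covered_by_prefix(idx, ["Country", "Occupation"]))  # True: free
print(covered_by_prefix(idx, ["Occupation"]))             # False: build it
print(covered_by_prefix(idx, ["Occupation", "Name"]))     # False: build it
```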

That means that accessing data from your database really needs to be carefully thought through – not unlike SQL, really.  It’s always a trade-off between how your data is organized and what queries you want to run.

So, unlike SQL, it’s almost easier to write your application and then design the database that lives underneath it.  In fact, that almost describes the process I followed: I made a database, learned what queries were useful, then redesigned the database to better suit the queries.  It is definitely a daunting process, but far more successful than trying to follow the SQL process, where the data is normalized, then an app is written to take advantage of it.  Normalization is not entirely necessary with Mongo.

5.  Normalization isn’t necessary, but relationships must be unique.

While you don’t have to normalize the way you would for SQL (relationships can be turned on their head quite nicely in Mongo, if you’re into that sort of thing), duplicating the relationship between two data points is bad.  If the same data exists in two places, you have to remember to modify both places simultaneously (eg, fixing a typo in a province name), which isn’t too bad, but if a relationship between two items exists in two places, it gets overly complicated for everyday use.

My general rule of thumb has been to allow each relationship to appear in the database only once.  A piece of data can appear in as many places as it needs to, but there it becomes a key to find something new.  However, the relationship between any two fields is important and must be kept updated – and exist in only one place.

As a quick example, I may have a field for “province” in the database.  If I have a collection of salespeople in the database, I can write down which provinces each one lives/works in – and there may be 6 provinces for each salesperson.  That would be the only place to store that given relationship.  For each salesperson, that list must be kept updated – and not duplicated.  If I want to know which salespeople are in a particular province, I would absolutely not store a second collection of salespeople by province, but would instead write queries that check each salesperson to see if they work in that province.  (It’s not as inefficient as you think, if you’re doing it right.)
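Here’s that rule as a runnable sketch, with plain Python lists standing in for collections (the people and provinces are made up); the comment shows what the equivalent shell query would look like:

```python
# Plain Python lists standing in for collections; the people and
# provinces are made up.
salespeople = [
    {"name": "Anne",  "provinces": ["BC", "AB"]},
    {"name": "Raj",   "provinces": ["BC", "ON", "QC"]},
    {"name": "Marie", "provinces": ["ON"]},
]

# "Who works in BC?" is a query over the one stored relationship,
# not a second, duplicated list. In the mongo shell this would be
# db.salespeople.find({provinces: "BC"}), which a multikey index
# on "provinces" keeps cheap.
in_bc = [s["name"] for s in salespeople if "BC" in s["provinces"]]
print(in_bc)  # ['Anne', 'Raj']
```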

On the other hand, I may have information about provinces, and I would create a list of provinces, each with its own details, but then those relationships would also be unique, and I wouldn’t duplicate that relationship elsewhere.  Salespeople wouldn’t appear in that list.

In Mongo, duplication of information isn’t bad – but duplication of relationships is!

6. There are no joins – and you shouldn’t try.

Map reduce queries aside, you won’t be joining your collections.  You have to think about the collection as an answer to a set of queries.  If I want to know about a specific sample, I turn to the sample table to get information.  If I want to know details about the relationship of a set of samples to a location in the data space, I first ask about the set of samples, then turn to my other collection with the knowledge about which samples I’m interested in, and then ask a completely separate question.
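The two-question pattern looks like this as an in-memory sketch (the collections and field names are invented; in real MongoDB the second step would be a find() with an $in clause):

```python
# Two in-memory "collections"; the schema and field names are invented.
samples = [
    {"_id": 1, "project": "p01"},
    {"_id": 2, "project": "p01"},
    {"_id": 3, "project": "p02"},
]
annotations = [
    {"sample_id": 1, "region": "chr1:100-200"},
    {"sample_id": 3, "region": "chr2:300-400"},
]

# Question 1: which samples belong to project p01?
ids = {s["_id"] for s in samples if s["project"] == "p01"}

# Question 2: a completely separate query against the second
# collection, carrying over only the ids from question 1 (an $in
# clause in real MongoDB).
hits = [a for a in annotations if a["sample_id"] in ids]
print(hits)  # [{'sample_id': 1, 'region': 'chr1:100-200'}]
```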

This makes things simultaneously simple and complex.  Complex if you’re used to SQL and just want to know about some intersecting data points.  Simple if you can break free of that mindset and ask it as two separate questions.  When your Mongo-fu is in good working order, you’ll understand how to break any complex question into a series of smaller questions.

Of course, that really takes you back to what I said in point 4 – your database should be written once you understand the questions you’re going to ask of it. Unlike SQL, knowing the questions you’re asking is as important as knowing your data.

7. Cursors are annoying, but useful.

Actually, this is something you’ll only discover if you’re doing things wrong.  Cursors, in Mongo APIs, are pretty useful – they are effectively iterators over your result set.  If you care about how things work, you’ll discover that they buffer result sets, sending them in chunks.  This isn’t really the key part, unless you’re dealing with large result sets.  If you go back to point #3, you’ll recall that I highly suggested grouping your data to reduce the number of records returned – and this is one more reason why.

No matter how fast your application is, running out of documents in a buffer and having to refill it over a network is always going to be slower than not running out of documents.  Try to keep your queries from returning large volumes of documents at a time.  If you can reduce the result down to under the buffer size, you’ll always do better.  (You can also change the buffer size manually, but I didn’t have great luck improving performance that way.)
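A rough model of the cost: every time the client drains its buffer, it pays a network round trip to refill it.  I’ve used 101 documents below since that’s the usual default first-batch size, but treat the numbers as illustrative rather than exact:

```python
import math

# Rough cost model: every time the client drains its buffer, it pays a
# network round trip to refill it. 101 documents is the usual default
# first-batch size, but treat the numbers as illustrative.
def round_trips(total_docs, batch_size=101):
    return max(1, math.ceil(total_docs / batch_size))

print(round_trips(1_000_000))  # 9901 refills over the network
print(round_trips(500))        # 5
print(round_trips(50))         # 1: the whole result fits in one batch
```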

As a caveat, I used to also try reading all the records from a cursor into a table and tossing away the cursor.  Meh – you don’t gain a lot that way.  It’s only worth doing if you need to put the data in a specific format, or do transformations over the whole data set.   Otherwise, don’t go there.

Anything else I’ve missed?

A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data?

Before you read my post, you might want to zip over to Complete Genomics’ blog, where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?”  It’s an interesting perspective in which he suggests that, as the article’s title might indicate, we should keep a rich variant file as the only trace of a sequencing run.

I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template.  As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter.)

I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space and could be regenerated at will from frozen samples of DNA, if needed.  As a scientist, I’ve very rarely had to go back into the read-space files and look at the pre-variant-calling data – only a handful of times in the past 2 years.  As such, it really doesn’t make much sense to store alignments, raw reads or other data.  If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.

I’d like to bring out a real-world analogy to summarize this argument.  I had recently been considering a move, only to discover that storage in Vancouver is not cheap.  It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me.  If the total value of those possessions is only $5,000, storing them for more than 36 months means that (if I were to move) it would have been cheaper to sell it all and then buy a new set when I come back.

Where the analogy comes into play quite elegantly is if I have some interest in those particular items.  My wife, for instance, is quite attached to our dining room set.  If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price.  It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only thing that comes into play.

While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelihood that we will return to the data for verification purposes.  You can’t always recreate something of interest in a data set by resequencing.

That is a poor reason to keep the data around, most of the time.  We rarely find things of interest in a specific read set that couldn’t be recreated when we’re doing genome-wide analysis.  Since most of the analysis we do uses only the variants, and we frequently verify our findings with other means, the cost/value argument is probably strongly in favour of throwing away raw reads and only storing the variants.  Considering my recent project on storage of variants (all 3 billion of the variants I’ve collected), people have probably heard me make the same arguments before.

But let’s not stop here.  There is much more to this question than meets the eye.  If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds.  Although the question stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”

Here, I would actually contend the answer is no, depending on the platform.  Let me give you examples of the data I feel we currently do a poor job of extracting.

  • Structural variations:  My experience with structural variations is that no two SV callers give you the same information or make the same calls.  They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg.  (Same goes for structural rearrangements in cancers, etc.)
  • Phasing information:  Most SNP callers aren’t giving phasing information, and some aren’t capable of it.  However, those details could be teased from the raw data files (depending on the platform).  We’re just not capturing it efficiently.
  • Exon/Gene expression:  This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq.  Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information that is obviously not captured in variant files.  (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically stored in rich variant files.)
  • Variant Quality information: we may have solved the base quality problems that plagued early data sets, but let me say that variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between variant callers.  There’s really no substitute for comparing other than to re-run the data set with the same tools.
  • The variants themselves!  Have you ever compared the variants observed by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown.)

Even dealing only with the last item, it should be obvious:  If we can’t have two SNP callers produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data, because we should always be double-checking interesting findings with (at least) a second set of software.

For me, the answer is clear:  If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete.  I just don’t think we’ve hit those milestones yet.

Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads.  You’ll never return to them because you already have all the answers you want.  If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.

So, my answer to the question “Are we ready to stop storing raw reads?” is easy:  That depends on what you think you need from a data set and whether you think you’re already an expert at extracting it.  Personally, I think we’ve barely scratched the surface of what information we can get out of genomic and transcriptomic data; we just don’t know what it is we’re missing yet.

Completely off topic, but related in concept: I’m spending my morning looking for Scalable Vector Graphics (.svg) files for many jpg and png files I’d created over the course of my studies.  Unfortunately, jpg and png are raster formats and don’t reproduce as nicely in the Portable Document Format (PDF) export process.  Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png format, I’m now at the point where I might have to recreate them to properly export the files again in a scalable format for my thesis.  If I’d just stored them (as the cost is negligible) I wouldn’t be in this bind…  meh.


Nature Comment : The case for locus-specific databases

There’s an interesting comment available in Nature today (EDIT: it came out last month, though I only found it today.) Unfortunately, it’s by subscription only, but let me save you the hassle of downloading it, if you don’t already have a subscription.  It’s not what I thought it was.

The entire piece fails to make the case for locus-specific databases, but instead conflates locus-specific with “high-resolution”, and then proceeds to tell us why we need high resolution data.  The argument can roughly be summarized as:

  • OMIM and databases like it are great, but don’t list all known variations
  • Next-gen sequencing gives us the ability to see genome in high resolution
  • You can only get high-resolution data by managing data in a locus-specific manner
  • Therefore, we should support locus-specific databases

Unfortunately, point number three is actually wrong.  It’s just that our public databases haven’t yet transitioned to the high-resolution format.  (ie, we have an internal database that stores data in a genome-wide manner at high resolution…  the data is, alas, not public.)

Thus, on that premise, I don’t think we should be supporting locus-specific databases specifically – indeed, I would say that the support they need is to become amalgamated into a single genome-wide database at high resolution.

You wouldn’t expect major gains in understanding of car mechanics if you, by analogy, insisted that all parts should be studied independently at high resolution.  Sure you might improve your understanding of each part, and how it works alone, but the real gains come from understanding the whole system.  You might not actually need certain parts, and sometimes you need to understand how two parts work together.  It’s only by studying the whole system that you begin to see the big picture.

IMHO, locus-specific databases are blinders that we adopt in the name of needing higher resolution, which is more of a comment on the current state of biology.  In fact, the argument can really be made that we don’t need locus-specific databases, we need better bioinformatics!

Dueling Databases of Human Variation

When I got in to work this morning, I was greeted by an email from 23andMe’s PR company, saying they have “built one of the world’s largest databases of individual genetic information.”   Normally, I wouldn’t even bat an eye at a claim like that.  I’m pretty sure it is a big database of variation…  but I thought I should throw down the gauntlet and give 23andMe a run for their money.  (-:

The timing couldn’t be better for me.  My own database actually ran out of auto-increment IDs this week, as we surpassed 2^31 SNPs entered into the db and had to upgrade the key field from int to bigint.  (Some variant calls have been deleted and replaced as variant callers have improved, so we actually have only 1.2 billion variations recorded against the hg18 version of the human genome.  A few hundred million more than that for hg19.)  So, I thought I might have a bit of a claim to having one of the largest databases of human variation as well.  Of course, comparing databases really depends on the metric being used, but hey, there’s some academic value in trying anyhow.

In the first corner, my database stores information from 2200+ samples (cancer and non-cancer tissue), genome wide (or transcriptome wide, depending on the source of the information), giving us a wide sampling of data, including variations unique to individuals as well as common polymorphisms.  In the other corner, 23andMe has sampled a much greater number of individuals (100,000) using a SNP chip, meaning that they’re only able to sample a small amount of the variation in an individual – about a third of a single percent of the total amount of DNA in each individual.

(According to this page, they look at only 1 million possible SNPs, instead of the 3 Billion bases at which single nucleotide variations can be found – although arguments can be made about the importance of that specific fraction of a percent.)

The nature of the data being stored is pretty important, however.  For many studies, the number of people sampled has a greater impact on the statistics than the number of sites studied and, since those are mainly the ones 23andMe are doing, clearly their database is more useful in that regard.  In contrast, my database stores data from both cancer and non-cancer samples, which allows us to make sense of variations observed in specific types of cancers – and because cancer derived variations are less predictable (ie, not in the same 1M snps each time) than the run-of-the-mill-standard-human-variation-type snps, the same technology 23andMe used would have been entirely inappropriate for the cancer research we do.

Unfortunately, that means comparing the two databases is completely impossible – they have different purposes, different data and probably different designs.  They have a database of 100k individuals, covering 1 million sites, whereas my database has 2k individuals, covering closer to 3 billion base pairs.  So yeah, apples and oranges.

(In practice, however, we don’t see variations at all 3 Billion base pairs, so that metric is somewhat skewed itself.  The number is closer to 100 Million bp –  a fraction of the genome nearly 100 times larger than what 23andMe is actually sampling.)

But, I’d still be interested in knowing the absolute number of variations they’ve observed…  a great prize upon which we could hold this epic battle of “largest database of human variations.”  At best, 23andMe’s database holds 10^11 variations (1×10^6 SNPs × 1×10^5 people), if every single variant were found in every single person – a rather unlikely case.  With my database currently at 1.2×10^9 variations, I think we’ve got some pretty even odds here.

Really, despite the joking about comparing database sizes, the real deal would be the fantastic opportunity to learn something interesting by merging the two databases, which could teach us something both about cancer and about the frequencies of variations in the human population.

Alas, that is pretty much certain to never happen.  I doubt 23andMe will make their database public – and our organization never will either.  Beyond the ethical issues of making that type of information public, there are pretty good reasons why this data can only be shared with collaborators – and in measured doses at that.  That’s another topic for another day, which I won’t go into here.

For now, 23andMe and I will just have to settle for both having “one of the world’s largest databases of individual genetic information.”  The battle royale for the title will have to wait for another day… and who knows what other behemoths are lurking in other research labs around the world.

On the other hand, the irony of a graduate student challenging 23andMe for the title of largest database of human variation really does make my day. (=

[Note: I should mention that when I say that I have a database of human variation, the database was my creation but the data belongs to the Genome Sciences Centre – and credit should be given to all of those who did the biology and bench work, performed the sequencing, ran the bioinformatics pipelines and assisted in populating the database.]

CPHx: Peter Jabbour, Sponsored by BlueSEQ – An exchange for next-generation sequencing



A very new company, just went live last month.

What is an exchange?  A platform that brings together buyers and sellers within a market.  A web portal that helps match researchers, clinicians, individuals, etc. with providers of next-gen sequencing services.

[web portals?  This seems very 1990s… time warp!]

Why do users need an exchange?  Users have limited access, need better access to technology, platform, application, etc.

Why do providers need an exchange?  Providers may want to fill their queues.

[This is one stop shopping for next-gen sequencing providers?  How do you make money doing this?]

BlueSEQ platform: 3 parts.

  1. Knowledge Bank:  Comprehensive collection of continuously updated Next Generation Sequencing information, opinions, evaluations, tech benchmarks.
  2. Project Design: Standardized project parameters.  eg, de novo, etc. [How do you standardize the bioinformatics?  Seems… naive.]
  3. Sequencing exchange:  Providers get a list of projects that they can bid on.

[wow… not buying this. Keeps referring back to the model with airline tickets.]

Statistics will come out of the exchange – cost of sequencing, etc.

No cost to users.  Exchange fees for providers. [again, why would providers want to opt in to this?] 100 users have already signed up.

Future directions:  Specialized project design tools, quoting tools, project management tools, comparison tools, customer reviews.

There are extensive tools for giving feedback, and rating other users’ feedback.

[Sorry for my snarky comments throughout.  This just really doesn’t seem like a well-thought-out business plan.  I see TONS of reasons why this shouldn’t work… and really not seeing any why it should.  Why would any provider want customer reviews of NGS data?  The sample prep is a huge part of the quality, and if they don’t control it, it’s just going to be a disaster.  I also don’t really see the value-added component.  Good luck to the business, tho!]


Java 1.6 based fork of the Ensembl API.

Just in case anyone is still interested, I have started an ensj (Ensembl Java API) project at sourceforge, using the latest version of the ensj-core project as the root of the fork.  It fixes at least one bug with using the java API on hg19, and makes some improvements to the code for compatibility with java 1.6.

There are another few thousand changes I could make, but I’m just working on it slowly, if at all.

I’m not intending to support this full time, but interested parties are welcome to join the project and contribute, making this a truly open source version of the ensembl interface.  That is to say, community driven.

Ensembl isn’t interested in providing support (they no longer have people with the in-depth knowledge of the API to provide support), so please don’t use this project with the expectation of help from the Ensembl team.  Also note that significant enhancements or upgrades are unlikely unless you’re interested in contributing to them! (I have my own dissertation to write and am not looking to take this on as a full time job!)

If you’re interested in using it, however, you can find the project here:

and a few notes on getting started here and here.  I will get around to posting some more information on the project on the SourceForge web site when I get a chance.



Some days you celebrate the little victories, other days, you celebrate the big ones.

Today, I get to celebrate a pretty significant victory, in my humble opinion: I managed to get the ensembl java API to compile and generate a fully operational battle station jar file that works with my Java code.

I know, it doesn’t sound like such a big deal, but that means I worked out all of its dependencies, managed to get all of it to compile without errors and THEN managed to fix a bug.  Not bad for a project I thought would take months.  In fact, I’ve even made some significant upgrades; for instance, it now creates a Java 1.6 jar file, which should run a bit faster than the original Java 1.4.  I’ve also gone through and upgraded some of the code – making it a bit more readable and in the Java 1.6 style with “enhanced loops”.  All in all, I’m pretty pleased with this particular piece of work.  Considering I started on Friday, and I’ve managed to make headway on my thesis project in the meantime, I’d say I’m doing pretty well.

So, as I said, I get to celebrate a nice little victory…. and then I’ll have to immediately get back to some more thesis writing.


For posterity’s sake, here are the steps required to complete this project:

  1. Get the full package from the Ensembl people. (They have a version that includes the build file and the licence for the software.  The one I downloaded from the web was incomplete.)
  2. Get all of the dependencies.  They are available on the web, but most of them are out of date and new ones can be used.
  3. Figure out that java2html.jar needs to be in ~/.ant/lib/, not in the usual ./lib path
  4. Fix the problem of new data types in (It’s a 2 line fix, btw.)
  5. Modify the file to use the latest version of the mysql API, and then copy that to the appropriate ./lib path.
  6. Modify the file to reflect that you’re generating a custom jar file.
  7. Modify the build.xml to use java 1.6 instead of 1.4
  8. Figure out how to use the ant code.  Turns out “ant build” and “ant jar” both work.
  9. Note, the project uses a bootstrap manifest file which isn’t available in the source package on the web. If you use that code, you have to modify the build.xml file to generate a custom manifest file, which is actually pretty easy to do.  This isn’t required, however, if you have the full source code.

When you write it out that way, it doesn’t sound like such a big project does it?  I’m debating putting the modified version somewhere like sourceforge, if there’s any interest from the java/bioinformatics community.  Let me know if you think it might be useful.

Why I haven’t graduated yet and some corroborating evidence – 50 breast cancers sequenced.

Judging a cancer by its cover (that is, its tissue of origin) may be the wrong approach.  It’s not a publication yet, as far as I can tell, but summaries are flying around about a talk presented at AACR 2011 on Saturday, in which 50 breast cancer genomes were analyzed:

Ellis et al. Breast cancer genome. Presented Saturday, April 2, 2011, at the 102nd Annual Meeting of the American Association for Cancer Research in Orlando, Fla.

I’ll refer you to a summary here, in which some of the results are discussed.  [Note: I haven’t seen the talk myself, but have read several summaries of it.] Essentially, after sequencing 50 breast cancer genomes – and 50 matched normal genomes from the same individuals – they found nothing of consequence.  Everyone knows TP53 and signaling pathways are involved in cancer, and those were the most significant hits.

“To get through this experiment and find only three additional gene mutations at the 10 percent recurrence level was a bit of a shock,” Ellis says.

My own research project is similar in the sense that it’s a collection of breast cancer and matched normal samples, but using cell lines instead of primary tissues.  Unfortunately, I’ve also found a lot of nothing.  There are a couple of genes that no one has noticed before that might turn into something – or might not.  In essence, I’ve been scooped with negative results.

I’ve been working on similar data sets for the whole of my PhD, and it’s at least nice to know that my failures aren’t entirely my fault.  This is a particularly difficult set of genomes to work on, so my inability to find anything may not be because I’m a terrible researcher.  (It isn’t ruled out by this either, I might add.)  We originally started with a set of breast cancer cell lines spanning 3 different types of cancer.  The quality of the sequencing was poor (36bp reads, for those of you who are interested) and we found nothing of interest.  When we re-did the sequencing, we moved to a set of cell lines from a single type of breast cancer, with the expectation that it would lead us towards better targets.  My committee is adamant that I be able to show some results of this experiment before graduating, which should explain why I’m still here.

Every week, I poke through the data in a new way, looking for a new pattern or a new gene, and I’m struck by the absolute independence of each cancer cell line.  The fact that two cell lines originated in the same tissue and share some morphological characteristics says very little to me about how they work. After all, cancer is a disease in which cells forget their origins and become, well… cancerous.

Unfortunately, that doesn’t bode well for research projects in breast cancer.  No matter how many variants I can filter through, at the end of the day, someone is going to have to figure out how all of the proteins in the body interact in order for us to get a handle on how to interrupt cancer-specific processes.  The (highly overstated) announcement of p53’s tendency to mis-fold and aggregate is just one example of these mechanisms – but only the first step in getting to understand cancer. (I also have no doubts that you can make any protein mis-fold and aggregate if you make the right changes.)  The pathway-driven approach to understanding cancer is much more likely to yield tangible results than the genome-based approach.

I’m not going to say that GWAS is dead, because it really isn’t.  It’s just not the right model for every disease – but I would say that Ellis makes a good point:

“You may find the rare breast cancer patient whose tumor has a mutation that’s more commonly found in leukemia, for example. So you might give that breast cancer patient a leukemia drug,” Ellis says.

I’d love to get my hands on the data from the 50 breast cancers, merge it with my database, and see what features those cancers do share with leukemia.  Perhaps that would shed some light on the situation.  In the end, cancer is going to be more about identifying targets than understanding its (lack of) common genes.

Thoughts on Andrew G Clark’s Talk and Cancer Genomics

Last night, I hung around late into the evening to hear Dr. Andrew G Clark give a talk focusing on how most of the variations we see in the modern human genome are rare variants that haven’t had a chance to equilibrate into the larger population.  This enormous expansion of rare variants is courtesy of the population explosion of humans since the dawn of the agricultural age, specifically in the past 2000 years at the dawn of modern science and education.

I think the talk was very well done and managed to hit a lot of points that struck home for me.  In particular, my own collected database of human variations in cancers and normals has shown me much of the same information that Dr Clark illustrated using 1000 Genomes data, as well as information from his 2010 paper on deep re-sequencing.

However interesting the talk was, one particular piece just didn’t click until after the talk was over.  During a conversation prior to the talk, I described my work to Dr. Clark and received a reaction I wasn’t expecting.  Paraphrased, this is how the conversation went:

Me: “I’ve assembled a very large database, where all of the cancers and normals that we sequence here at the genome science centre are stored, so that we can investigate the frequency of variations in cancers to identify mutations of interest.”

Dr. Clark: “Oh, so it’s the same as a HapMap project?”

Me: “Yeah, I guess so…”

What I didn’t understand at the time was that what Dr. Clark was really asking was: “So, you’re just cataloging rare variations, which are more or less meaningless?”  That is exactly what HapMap projects are: nothing more than large surveys of human variation across genomes.  While they could be the basis of GWAS studies, the huge number of rare variants in the modern human population means that many of these GWAS studies are doomed to fail.  There will not be a large convergence of variations causing the disease, but rather an extreme number of rare variations with similar outcomes.
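As a toy illustration of what such a survey-style catalogue boils down to (all sample and variant names here are invented), counting how many samples share each variant is what separates the handful of recurrent variants from the long tail of singletons:

```python
from collections import Counter

# Hypothetical mini-catalogue: each sample's set of observed variants.
# A HapMap-style resource is, at its core, exactly this kind of
# per-sample survey, scaled up to whole genomes.
catalog = {
    "sampleA": {"var1", "var2"},
    "sampleB": {"var1", "var3"},
    "sampleC": {"var1", "var4"},
}

# Count how many samples each variant appears in.
counts = Counter(v for calls in catalog.values() for v in calls)

recurrent = {v for v, c in counts.items() if c > 1}  # seen in >1 sample
rare = {v for v, c in counts.items() if c == 1}      # singletons

print(recurrent)  # {'var1'}
```

In real data the striking part is how lopsided the split is: the rare set dwarfs the recurrent one.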

However, I think the problem was that I handled the question incorrectly.  My answer should have touched on the following point:

“In most diseases, we’re stuck using lineages to look for points of interest (variations) passed from parent to child, and the large number of rare variants in the human population makes this incredibly difficult, since each child carries a significant number of variations that neither parent passed on to them.  In cancer, however, we have the unique ability to compare diseased cancer cells with a matched normal from the same patient, which allows us to effectively mask all of the rare variants that are not contributing to cancer.  Thus, the database does act like a large HapMap database if you’re interested in studying non-cancer, but the matched-normal pairing available to cancer studies means we’re not confined to using it that way: it can yield incredibly detailed and coherent information about the drivers and passengers involved in oncogenesis, without the same flood of rare variants interfering in the interpretation of the genome.”
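The masking described in that answer amounts to a set subtraction. A minimal sketch, with entirely invented coordinates and calls:

```python
# Illustrative sketch only: the variant tuples below are hypothetical.
# Germline variants (including the rare ones) appear in both the tumour
# and its matched normal, so subtracting the normal's calls masks them,
# leaving candidate somatic mutations.

def somatic_candidates(tumor_calls, normal_calls):
    """Variants present in the tumour but absent from the matched normal."""
    return set(tumor_calls) - set(normal_calls)

tumor = {("chr17", 7578406, "C", "T"),    # germline SNP, shared with normal
         ("chr3", 178936091, "G", "A")}   # tumour-only: somatic candidate
normal = {("chr17", 7578406, "C", "T")}

print(somatic_candidates(tumor, normal))
# {('chr3', 178936091, 'G', 'A')}
```

Without the matched normal, every rare germline variant in the tumour sample would survive this filter and drown out the real signal.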

Alas, in the way of all things, that answer only came to me after I heard Dr. Clark’s talk and understood the subtext of his question.  However, that answer is very important on its own.

It means that while many diseases will be hard slogs through the deep rare variant populations (which SNP chips will never be detailed enough to elucidate, by the way, for those of you who think 23andMe will solve a large number of complicated diseases), cancer is bound to be a more tractable disease in comparison!  We will bypass the misery of studying every single rare variant, which is a sizeable fraction of each new genome sequenced!

Unfortunately, unlike many other human metabolic diseases that target a single gene or pathway, cancer is really a whole genome disease and is vastly more complex than any other disease.  Thus, even if our ability to zoom in on the “driver” mutations progresses rapidly as we sequence more cancer tissues (and their matched normal samples, of course!), it will undoubtedly be harder to interpret how all of these work and identify a cure.

So, as with everything, cancer’s somatic nature is a double-edged sword: it can be used to more efficiently sort the wheat from the chaff, but will also be a source of great consternation for finding cures.

Now, if only I could convince other people of the dire necessity of matched normals in cancer research…

Simple SNP Visualization…

Before I disappear for AGBT, I thought I’d finally get around to writing a quick post about some of the visualization work I’ve done.  In fact, this is my first shot at interactive visualization – and while I’m not necessarily thrilled with it, it is a neat first try.

The data comes from my Variation Database (which has been accepted for publication – I’ll be submitting the final revisions today), and was an attempt to make an interactive method of searching through files that can be hundreds of Mb in size, without going insane.  The db does produce smaller summary files – which can save your mind from the pit of despair of reading a 300Mb file – but I thought there had to be a better way.

And so there is!  In the live version (for which I have yet to make an example file, but I will do that after AGBT), you can scroll forward and backwards through the “tunnel” of variants, making it obvious which libraries each variant is found in – or not.  There are some neat examples where you’ll see two polymorphisms side by side, but NEVER in the same libraries.  Neat to be able to pick out stuff like that at a glance.
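Stripped of the interactivity, that view reduces to a presence/absence grid per library. A toy sketch (library and variant names invented) showing how mutually exclusive variants jump out:

```python
# Hypothetical data: which sequencing libraries each variant was seen in.
libraries = ["lib1", "lib2", "lib3", "lib4"]
variants = {
    "snvA": {"lib1", "lib3"},
    "snvB": {"lib2", "lib4"},  # never co-occurs with snvA
}

# Render one row per variant: 'X' where present, '.' where absent.
rows = {name: "".join("X" if lib in found else "." for lib in libraries)
        for name, found in variants.items()}

for name, row in rows.items():
    print(f"{name}  {row}")
# snvA  X.X.
# snvB  .X.X
```

Two variants whose rows are perfect complements, like these, are exactly the side-by-side-but-never-together pattern described above.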

If you go to the location on the web where this script resides (for now), you’ll see options for filtering on the side, but in the name of providing an explanation, I’ll just give you the static image:

Semi-circular visualization of SNVs.


Obviously, it’s just a simple first pass – but hey, I think there’s lots of room for improvement, and likely lots of room for innovation.  If only I could find the time to do this stuff more often!

If you’re interested, I started a wiki page for it as part of the VSRAP (here), and the code is also available from the VSRAP svn (here), and of course, all of my code is open source, so feel free to play with it, adapt it for your own use, or otherwise.

And now, I have some work to do before AGBT…  see you there!  (or, if you’re not there, you can pretend you’re there by reading my notes!)  Cheers!