A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data?

Before you read my post, you might want to zip over to Complete Genomic’s blog where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?”  It’s an interesting perspective where he suggests that, as the article’s title might indicate, we should only be keeping a rich variant file as the only trace of a sequencing run.

I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template.  As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter.)

I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space and could be regenerated at will from frozen samples of DNA, if needed.  As a scientist, I’ve very rarely have had to go back into the read-space files and look at the pre-variant calling data – and only a handfull of times in the past 2 years.  As such, it really doesn’t make much sense to store alignments, raw reads or other data. If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.

I’d like to bring out a real-world analogy, to summarize this argument. I had recently been considering a move, only to discover that storage in Vancouver is not cheap.  It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me.  If the total value of those possessions is only $5,ooo, storing them for more than 36 months means that (if I were to move) it would have been cheaper to sell it all and then buy a new set when I come back.

Where the analogy comes into play quite elegantly is if I have some interest in those particular items.  My wife, for instance, is quite attached to our dining room set.  If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price.  It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only thing that comes into play.

While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelyhood that we will return to the data for verification purposes.  You can’t always recreate something of interest in a data set by resequencing.

That is a poor reason to keep the data around, most of the time.  We rarely find things of interest that couldn’t be recreated in a specific read set when we’re doing genome-wide analysis. Since most of the analysis we do uses only the variants and we frequently verify our findings with other means, the cost/value argument is probably strongly in favour of throwing away raw reads and only storing the variants.  Considering my recent project on storage of variants (all 3 billion of the I’ve collected), people have probably heard me make the same arguments before.

But lets not stop here. There is much more to this question than meets the eye. If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds.  Although the question stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”

Here, I actually would contend the answer is no, depending on the platform.  Let me give you examples of what data I feel we do a poor example of extracting, right now.

  • Structural variations:  My experience with structural variations is that no two SV callers give you the same information or make the same calls.  They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg.  (Same goes for structural rearrangements in cancers, etc.)
  • Phasing information:  Most SNP callers aren’t giving phasing information, and some aren’t capable of it.  However, those details could be teased from the raw data files (depending on the platform).  We’re just not capturing it efficiently.
  • Exon/Gene expression:  This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq.  Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information available that is obviously not captured in variant files.  (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically sored in rich variant files.)
  • Variant Quality information: we may have solved the base quality problems that plagued early data sets, but let me say that variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between variant callers.  There’s really no substitute for comparing other than to re-run the data set with the same tools.
  • The variants themselves!  Have you ever compared the variants observed by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown.)

Even dealing only with the last item, it should be obvious:  If we can’t have two snp callers produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data because we should always be double checking interesting findings with (at least) a second set of software.

For me, the answer is clear:  If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete.  I just don’t think we’ve hit those milestones yet.

Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads.  You’ll never return to them because you already have all the answers you want.  If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.

So, my answer to the question “Are we ready to stop storing raw reads?” is easy:  That depends on what you think you need from a data set and if you think you’re already an expert at extracting it.  Personally, I think we’ve barely scratched the surface on what information we can get out of genomic and transcriptomic data, we just don’t know what it is we’re missing yet.

Completely off topic, but related in concept: I’m spending my morning looking for Scalable Vector Graphics (.svg) files for many jpg and png files I’d created along the course of my studies.  Unfortunately, jpg and png are lossy formats and don’t reproduce as nicely in the Portable Document Format (PDF) export process.  Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png format, I’m now at the point where I might have to recreate them to properly export the files again in a lossless format for my thesis.  If I’d just have stored them (as the cost is negligible) I wouldn’t be in this bind….   meh.


Nature Comment : The case for locus-specific databases

There’s an interesting comment available in Nature today (EDIT: it came out last month, though I only found it today.) Unfortunately, it’s by subscription only, but let me save you the hassle of downloading it, if you don’t already have a subscription.  It’s not what I thought it was.

The entire piece fails to make the case for locus-specific databases, but instead conflates locus-specific with “high-resolution”, and then proceeds to tell us why we need high resolution data.  The argument can roughly be summarized as:

  • Omim and databases like it are great, but don’t list all known variations
  • Next-gen sequencing gives us the ability to see genome in high resolution
  • You can only get high-resolution data by managing data in a locus-specific manner
  • Therefore, we should support locus-specific databases

Unfortunately, point number three is actually wrong.  It’s just that our public databases haven’t yet transitioned to the high resolution format.  (ie, we have an internal database that stores data in a genome-wide manner at high resolution…  the data is, alas, not public.)

Thus, on that premise, I don’t think we should be supporting locus specific databases specifically –  indeed, I would say that the support they need is to become amalgamated in to a single genome-wide database at high resolution.

You wouldn’t expect major gains in understanding of car mechanics if you, by analogy, insisted that all parts should be studied independently at high resolution.  Sure you might improve your understanding of each part, and how it works alone, but the real gains come from understanding the whole system.  You might not actually need certain parts, and sometimes you need to understand how two parts work together.  It’s only by studying the whole system that you begin to see the big picture.

IMHO, Locus-specific databases are blinders that we adopt in the name of needing higher resolution, which is more of a comment on the current state of biology.  In fact, the argument can really be made that we don’t need locus-specific databases, we need better bioinformatics!

Dueling Databases of Human Variation

When I got it to work this morning, I was greeted by an email from 23andMe’s PR company, saying they have “built one of the world’s largest databases of individual genetic information.”   Normally, I wouldn’t even bat an eye at a claim like that.  I’m pretty sure it is a big database of variation…  but I thought I should throw down the gauntlet and give 23andMe a run for their money.  (-:

The timing for it couldn’t be better for me.  My own database actually ran out of auto-increment IDs this week, as we surpassed 2^31 snps entered into the db and had to upgrade the key field to bigint from int. (Some variant calls have been deleted and replaced as variant callers have improved, so we actually have only 1.2 Billion variations recorded against the hg18 version of the human genome.  A few hundred million more than that for hg19.)  So, I thought I might have a bit of a claim to having one of the largest databases of human variation as well.  Of course, comparing databases really is dependent on the metric being used, but hey, there’s some academic value in trying anyhow.

In the first corner, my database stores information from 2200+ samples (cancer and non-cancer tissue), genome wide (or transcriptome wide, depending on the source of the information.), giving us a wide sampling of data, including variations unique to individuals, as well as common polymorphisms.  In the other corner, 23andMe has sampled a much greater number of individuals (100,000) using a SNP chip, meaning that they’re only able to sample a small amount of the variation in an individual – about 1/3rd of a single percent of the total amount of DNA in each individual.

(According to this page, they look at only 1 million possible SNPs, instead of the 3 Billion bases at which single nucleotide variations can be found – although arguments can be made about the importance of that specific fraction of a percent.)

The nature of the data being stored is pretty important, however.  For many studies, the number of people sampled has a greater impact on the statistics than the number of sites studied and, since those are mainly the ones 23andMe are doing, clearly their database is more useful in that regard.  In contrast, my database stores data from both cancer and non-cancer samples, which allows us to make sense of variations observed in specific types of cancers – and because cancer derived variations are less predictable (ie, not in the same 1M snps each time) than the run-of-the-mill-standard-human-variation-type snps, the same technology 23andMe used would have been entirely inappropriate for the cancer research we do.

Unfortunately, that means comparing the two databases is completely impossible – they have different purposes, different data and probably different designs.  They have a database of 100k individuals, covering 1 million sites, whereas my database has 2k individuals, covering closer to 3 billion base pairs.  So yeah, apples and oranges.

(In practice, however, we don’t see variations at all 3 Billion base pairs, so that metric is somewhat skewed itself.  The number is closer to 100 Million bp –  a fraction of the genome nearly 100 times larger than what 23andMe is actually sampling.)

But, I’d still be interested in knowing the absolute number of variations they’ve observed…  a great prize upon which we could hold this epic battle of “largest database of human variations.”  At best, 23andMe’s database holds 10^11 variations, (1×10^6 SNPs x 1×10^5 people), if every single variant was found in every single person – a rather unlikely case.  With my database currently  at 1.2×10^9 variations, I think we’ve got some pretty even odds here.

Really, despite the joking about comparing database sizes, the real deal would be the fantastic opportunity to learn something interesting by merging the two databases, which could teach use something both about cancer and about the frequencies of variations in the human population.

Alas, that is pretty much certain to never happen.  I doubt 23andMe will make their database public – and our organization never will either.  Beyond the ethical issues of making that type of information public, there are pretty good reasons why this data can only be shared with collaborators – and in measured doses at that.  That’s another topic for another day, which I won’t go into here.

For now, 23andMe and I will just have to settle for both having “one of the world’s largest databases of individual genetic information.”  The battle royale for the title will have to wait for another day… and who knows what other behemoths are lurking in other research labs around the world.

On the other hand, the irony of a graduate student challenging 23andMe for the title of largest database of human variation really does make my day. (=

[Note: I should mention that when I say that I have a database of human variation, the database was my creation but the data belongs to the Genome Sciences Centre – and credit should be given to all of those who did the biology and bench work, performed the sequencing, ran the bioinformatics pipelines and assisted in populating the database.]

Thought’s on Andrew G Clark’s Talk and Cancer Genomics

Last night, I hung around late into the evening to hear Dr. Andrew G Clark give a talk focusing on how most of the variations we see in the modern human genome are rare variants that haven’t had a chance to equilibrate into the larger population.  This enormous expansion of rare variants is courtesy of the population explosion of humans since the dawn of the agricultural age, specifically in the past 2000 years at the dawn of modern science and education.

I think the talk was a very well done and managed to hit a lot of points that struck home for me.  In particular, my own collected database of human variations in cancers and normals has shown me much of the same information that Dr Clark illustrated using 1000 genome data, as well as information from his 2010 paper on deep re-sequencing.

However interesting the talk was, one particular piece just didn’t click in until after the talk was over.  During a conversation prior to the talk, I described my work to Dr. Clark and received a reaction I wasn’t expecting.  Paraphrased, this is how the conversation went:

Me: “I’ve assembled a very large database, where all of the cancers and normals that we sequence here at the genome science centre are stored, so that we can investigate the frequency of variations in cancers to identify mutations of interest.”

Dr. Clark: “Oh, so it’s the same as a HapMap project?”

Me: “Yeah, I guess so…”

What I didn’t understand at the time was that Dr. Clark was asking was: “So, you’re just cataloging rare variations, which are more or less meaningless?”  Which is exactly what HapMap projects are: Nothing more than large surveys of human variation across genomes.  While they could be the basis of GWAS studies, the huge amount of rare variants in the modern human population means that many of these GWAS studies are doomed to fail.  There will not be a large convergence of variations causing the disease, but rather an extreme number of rare variations with similar outcomes.

However, I think the problem was that I handled the question incorrectly.  My answer should have touched on the following point:

“In most diseases, we’re stuck using lineages to look for points of interest (variations) passed on from parent to child and the large number of rare variants in the human population makes this incredibly difficult to do as each child will have a significant number of variation that neither parent passed on to them.  However, in cancer, we have the unique ability to compare diseased cancer cells with a matched normal from the same patient, which allows us to effectively mask all of the rare variants that are not contributing to cancer.  Thus, the database does act like a large HapMap database, if you’re interested in studying non-cancer, but the matched-normal sample pairing available to cancer studies means we’re not confined to using it as a HapMap-style database, enabling incredibly detailed and coherent information about the drivers and passengers involved in oncogenesis, without the same level of rare variants interfering in the interpretation of the genome.”

Alas, in the way of all things, that answer only came to me after I heard Dr. Clark’s talk and understood the subtext of his question.  However, that answer is very important on its own.

It means that while many diseases will be hard slogs through the deep rare variant populations (which SNP chips will never be detailed enough to elucidate, by the way, for those of you who think 23andMe will solve a large number of complicated diseases), cancer is bound to be a more tractable disease in comparison!  We will by-pass the misery of studying every single rare variant, which is a sizeable fraction of each new genome sequenced!

Unfortunately, unlike many other human metabolic diseases that target a single gene or pathway, cancer is really a whole genome disease and is vastly more complex than any other disease.  Thus, even if our ability to zoom in on the “driver” mutations progresses rapidly as we sequence more cancer tissues (and their matched normal samples, of course!), it will undoubtedly be harder to interpret how all of these work and identify a cure.

So, as with everything, cancer’s somatic nature is a double edged sword: it can be used to more efficiently sort the wheat from the chaff, but will also be a source of great consternation for finding cures.

Now, if only I could convince other people of the dire necessity of matched normals in cancer research…