Before you read my post, you might want to zip over to Complete Genomics’ blog, where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?” It’s an interesting perspective in which he suggests, as the title indicates, that we should keep a rich variant file as the only trace of a sequencing run.
I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template. As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter).
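This round trip is exactly what tools like samtools fastq do. Here’s a minimal sketch of the idea in Python with pysam – the file names are hypothetical, and I’m ignoring paired-end bookkeeping:

```python
import pysam

# DNA complement table for flipping reverse-strand reads back
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def bam_to_fastq(bam_path, fastq_path):
    """Recover the original reads by stripping the alignment information."""
    with pysam.AlignmentFile(bam_path, "rb") as bam, open(fastq_path, "w") as out:
        for read in bam.fetch(until_eof=True):
            # Write each read exactly once: skip secondary/supplementary records
            if read.is_secondary or read.is_supplementary:
                continue
            seq = read.query_sequence
            qual = pysam.qualities_to_qualitystring(read.query_qualities)
            if read.is_reverse:
                # BAM stores reverse-strand reads as their reverse complement;
                # flip back to the orientation the sequencer reported
                seq = seq.translate(COMPLEMENT)[::-1]
                qual = qual[::-1]
            out.write(f"@{read.query_name}\n{seq}\n+\n{qual}\n")

bam_to_fastq("sample.bam", "sample.fastq")  # hypothetical file names
```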
I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space, and could be regenerated at will from frozen samples of DNA if needed. As a scientist, I’ve very rarely had to go back into the read-space files and look at the pre-variant-calling data – only a handful of times in the past two years. As such, it really doesn’t make much sense to store alignments, raw reads or other data. If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.
I’d like to bring out a real-world analogy to summarize this argument. I had recently been considering a move, only to discover that storage in Vancouver is not cheap. It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me. If the total value of those possessions is only $5,000, storing them for more than 36 months means that (if I were to move) it would have been cheaper to sell it all and buy a new set when I come back.
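The arithmetic is trivial, but it generalizes directly to reads on disk. A quick sketch – the furniture numbers are from the story above; for a genome, the “replacement cost” would be the price of resequencing:

```python
def break_even_months(replacement_cost, monthly_storage_cost):
    """Months after which storage costs exceed the cost of replacing the item."""
    return replacement_cost / monthly_storage_cost

print(break_even_months(5_000, 140))  # ~35.7 months, i.e. about three years
```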
Where the analogy comes into play quite elegantly is if I have some interest in those particular items. My wife, for instance, is quite attached to our dining room set. If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price. It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only consideration.
While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelihood that we will return to the data for verification purposes. You can’t always recreate something of interest in a data set by resequencing.
That is a poor reason to keep the data around, most of the time. When we’re doing genome-wide analysis, we rarely find things of interest that couldn’t be recreated in a fresh read set. Since most of the analysis we do uses only the variants, and we frequently verify our findings by other means, the cost/value argument is probably strongly in favour of throwing away raw reads and storing only the variants. Considering my recent project on the storage of variants (all 3 billion of the variants I’ve collected), people have probably heard me make the same arguments before.
But let’s not stop here. There is much more to this question than meets the eye. If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds. Although the question as stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”
Here, I would actually contend that the answer is no, depending on the platform. Let me give you examples of the data I feel we do a poor job of extracting right now.
- Structural variations: My experience with structural variations is that no two SV callers give you the same information or make the same calls. They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg. (Same goes for structural rearrangements in cancers, etc.)
- Phasing information: Most SNP callers don’t give phasing information, and some aren’t capable of it. However, those details could be teased out of the raw data files, depending on the platform – see the phasing sketch after this list. We’re just not capturing it efficiently.
- Exon/Gene expression: This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq (see the counting sketch after this list). Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information that is obviously not captured in variant files. (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically stored in rich variant files.)
- Variant quality information: we may have solved the base quality problems that plagued early data sets, but variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between callers. If you want comparable qualities, there’s really no substitute for re-running the data set with the same tools.
- The variants themselves! Have you ever compared the variants called by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown.) A quick way to check this yourself is in the concordance sketch below.
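To make the phasing point concrete, here’s a rough sketch of how two nearby heterozygous SNVs can be phased directly from an indexed BAM with pysam – the coordinates and file name are made up for illustration:

```python
from collections import Counter
import pysam

def bases_by_read(bam, chrom, pos0):
    """Map read name -> base observed at a 0-based reference position."""
    bases = {}
    for column in bam.pileup(chrom, pos0, pos0 + 1, truncate=True):
        for pr in column.pileups:
            if pr.query_position is not None:  # skip deletions / reference skips
                bases[pr.alignment.query_name] = \
                    pr.alignment.query_sequence[pr.query_position]
    return bases

with pysam.AlignmentFile("sample.bam", "rb") as bam:
    site_a = bases_by_read(bam, "chr1", 1_000_000)  # hypothetical het SNV
    site_b = bases_by_read(bam, "chr1", 1_000_120)  # hypothetical het SNV
    # Reads spanning both sites vote directly on which alleles travel together.
    print(Counter((site_a[n], site_b[n]) for n in site_a.keys() & site_b.keys()))
    # e.g. Counter({('A', 'T'): 41, ('G', 'C'): 38}) => A-T and G-C in phase
```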
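The exon-counting point is equally concrete: naive per-exon counts fall straight out of an indexed RNA-Seq BAM. The gene, coordinates, and file name below are hypothetical; real exon boundaries would come from an annotation:

```python
import pysam

# Hypothetical exon coordinates (0-based, half-open); in practice, from a GTF
EXONS = {
    "MYGENE": [("chr2", 500_000, 500_200), ("chr2", 502_000, 502_150)],
}

with pysam.AlignmentFile("rnaseq.bam", "rb") as bam:  # indexed RNA-Seq BAM
    for gene, exons in EXONS.items():
        counts = [bam.count(chrom, start, end) for chrom, start, end in exons]
        print(gene, counts)  # raw per-exon counts; normalize for length/depth
```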
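And for the last point, measuring the disagreement between two callers on the same data set takes only a few lines. A sketch assuming two plain-text VCFs (multi-allelic records and normalization subtleties deliberately ignored):

```python
def variant_keys(vcf_path):
    """Set of (chrom, pos, ref, alt) tuples from a VCF, headers ignored."""
    keys = set()
    with open(vcf_path) as vcf:
        for line in vcf:
            if not line.startswith("#"):
                chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
                keys.add((chrom, pos, ref, alt))
    return keys

a = variant_keys("caller_a.vcf")  # hypothetical outputs from two SNP callers
b = variant_keys("caller_b.vcf")
print(f"A only: {len(a - b)}, B only: {len(b - a)}, shared: {len(a & b)}")
print(f"concordance (Jaccard): {len(a & b) / len(a | b):.1%}")
```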
Even dealing only with the last item, it should be obvious: if we can’t get two SNP callers to produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data, because we should always be double-checking interesting findings with (at least) a second set of software.
For me, the answer is clear: If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete. I just don’t think we’ve hit those milestones yet.
Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads. You’ll never return to them because you already have all the answers you want. If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.
So, my answer to the question “Are we ready to stop storing raw reads?” is easy: that depends on what you think you need from a data set and whether you think you’re already an expert at extracting it. Personally, I think we’ve barely scratched the surface of what information we can get out of genomic and transcriptomic data; we just don’t know what it is we’re missing yet.
Completely off topic, but related in concept: I’m spending my morning looking for the Scalable Vector Graphics (.svg) source files behind many of the jpg and png images I’d created over the course of my studies. Unfortunately, jpg is a lossy format and both are raster formats, so they don’t reproduce as nicely as vector graphics in the Portable Document Format (PDF) export process. Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png, I’m now at the point where I might have to recreate them to properly export the figures again in a scalable format for my thesis. If I’d just stored them (the cost is negligible), I wouldn’t be in this bind… meh.