A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data?

Before you read my post, you might want to zip over to Complete Genomics’ blog, where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?”  It’s an interesting perspective in which he suggests, as the title indicates, that we should keep a rich variant file as the only trace of a sequencing run.

I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template.  As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter.)

I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space and could be regenerated at will from frozen samples of DNA, if needed.  As a scientist, I’ve very rarely had to go back into the read-space files and look at the pre-variant-calling data – only a handful of times in the past 2 years.  As such, it really doesn’t make much sense to store alignments, raw reads or other data. If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.

I’d like to bring out a real-world analogy to summarize this argument. I had recently been considering a move, only to discover that storage in Vancouver is not cheap.  It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me.  If the total value of those possessions is only $5,000, storing them for more than about 36 months means that (if I were to move) it would have been cheaper to sell it all and buy a new set when I come back.
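
Just to make the arithmetic explicit, here is the trivial break-even calculation behind that 36-month figure, as a quick Python sketch. The dollar amounts are simply the ones quoted above, so this is purely illustrative:

# Break-even point for the storage-vs-replacement analogy above.
# Figures are the ones quoted in the post, purely illustrative.
monthly_storage = 140.0      # roughly the midpoint of $120-150/month
replacement_value = 5000.0   # what it would cost to re-buy everything
break_even_months = replacement_value / monthly_storage
print("Storage only wins for the first %.0f months" % break_even_months)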

Where the analogy comes into play quite elegantly is if I have some interest in those particular items.  My wife, for instance, is quite attached to our dining room set.  If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price.  It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only thing that comes into play.

While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelihood that we will return to the data for verification purposes.  You can’t always recreate something of interest in a data set by resequencing.

That is a poor reason to keep the data around, most of the time.  We rarely find things of interest that couldn’t be recreated in a specific read set when we’re doing genome-wide analysis. Since most of the analysis we do uses only the variants, and we frequently verify our findings with other means, the cost/value argument is probably strongly in favour of throwing away raw reads and storing only the variants.  Considering my recent project on storage of variants (all 3 billion of the variants I’ve collected), people have probably heard me make the same arguments before.

But let’s not stop here. There is much more to this question than meets the eye. If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds.  Although the question as stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”

Here, I would actually contend the answer is no, depending on the platform.  Let me give you examples of the data I feel we currently do a poor job of extracting.

  • Structural variations:  My experience with structural variations is that no two SV callers give you the same information or make the same calls.  They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg.  (Same goes for structural rearrangements in cancers, etc.)
  • Phasing information:  Most SNP callers aren’t giving phasing information, and some aren’t capable of it.  However, those details could be teased from the raw data files (depending on the platform).  We’re just not capturing it efficiently.
  • Exon/Gene expression:  This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq.  Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information that is obviously not captured in variant files (there’s a small sketch of what I mean just after this list).  (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically stored in rich variant files.)
  • Variant quality information: we may have solved the base quality problems that plagued early data sets, but variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between variant callers.  There’s really no way to compare them other than to re-run the data set with the same tools.
  • The variants themselves!  Have you ever compared the variants observed by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown.)
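
Following on the expression point above, here is a minimal sketch of pulling per-exon read counts straight out of an aligned BAM file.  This isn’t the code from my pipeline: it assumes the pysam library and a coordinate-sorted, indexed BAM, and the file name and exon coordinates are made up for illustration.

import pysam

# Hypothetical exon coordinates (chrom, start, end; 0-based, half-open)
exons = [("chr17", 7565097, 7565332),
         ("chr17", 7569523, 7569562)]

# Requires a coordinate-sorted BAM with an index alongside it.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for chrom, start, end in exons:
        # count() tallies the reads overlapping each exon interval
        print(chrom, start, end, bam.count(chrom, start, end))

Swap in a real annotation and you have crude exon expression values in a few lines – which is exactly the kind of information a variant file throws away.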

Even dealing only with the last item, it should be obvious:  if we can’t get two SNP callers to produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data, because we should always be double-checking interesting findings with (at least) a second set of software.
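
If you want to see this for yourself, the comparison is simple enough to script.  A minimal sketch, assuming standard VCF columns (CHROM, POS, ID, REF, ALT) and hypothetical file names:

def load_variants(path):
    """Collect (chrom, pos, ref, alt) tuples from a VCF file."""
    variants = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):             # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            variants.add((fields[0], fields[1], fields[3], fields[4]))
    return variants

calls_a = load_variants("caller_A.vcf")          # hypothetical file names
calls_b = load_variants("caller_B.vcf")
shared = calls_a & calls_b
union = calls_a | calls_b
print("A only:", len(calls_a - calls_b))
print("B only:", len(calls_b - calls_a))
print("shared: %d (%.1f%% of the union)" % (len(shared), 100.0 * len(shared) / len(union)))

Run that on the same data set processed by two different callers and the “shared” fraction is rarely as high as you’d hope.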

For me, the answer is clear:  If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete.  I just don’t think we’ve hit those milestones yet.

Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads.  You’ll never return to them because you already have all the answers you want.  If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.

So, my answer to the question “Are we ready to stop storing raw reads?” is easy:  that depends on what you think you need from a data set and whether you think you’re already an expert at extracting it.  Personally, I think we’ve barely scratched the surface of what information we can get out of genomic and transcriptomic data; we just don’t know what it is we’re missing yet.

Completely off topic, but related in concept: I’m spending my morning looking for the Scalable Vector Graphics (.svg) files behind many of the jpg and png files I’d created over the course of my studies.  Unfortunately, jpg and png are raster formats (and jpg is lossy to boot), so they don’t reproduce as nicely in the Portable Document Format (PDF) export process.  Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png format, I’m now at the point where I might have to recreate them to properly export the files again in a scalable format for my thesis.  If I’d just stored them (the cost is negligible), I wouldn’t be in this bind….   meh.


BlueSeq Knowledgebase

Remember BlueSeq?  The company I gave a hard time after their presentation at Copenhagenomics?  Turns out they have some cool stuff up on the web.  Here’s a comparison of sequencing technologies that they’ve posted.  Looks like they’ve put together quite a decent set of resources.  I haven’t finished exploring it yet, but it looks quite useful.

Via CLC bio blog – Post: Goldmine of unbiased expert knowledge on next generation sequencing.

One lane is (still) not enough…

After my quick post yesterday where I said one lane isn’t enough, I was asked to elaborate a bit more, if I could. Well, I don’t want to get into the details of the experiment itself, but I’m happy to jump into the “controls” a bit more in depth.

What I can tell you is that with one lane of RNA-Seq (50bp Illumina data), all of the variations I find show up either in known polymorphism databases or as somatic SNPs, with a few exceptions. The few exceptions turn out to be due to lack of coverage.

For a “control”, I took two data sets (from two separate patients) – each with 6 individual lanes of sequencing data. (I realize this isn’t the most robust experiment, but it shows a point.) In the perfect world, each of the 6 lanes per person would have sampled the original library equally well.

So, I matched up one lane from each patient into 6 sets and asked the question: how many transcripts are void (fewer than 5 tags) in one sample and at least 5x greater in the other sample? (I did this in both directions.)
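
In code, the comparison is roughly the following sketch. The counts themselves come out of my own pipeline, and the handling of zero-count transcripts is my own reading of the 5x rule, so treat this as illustrative only:

def discordant(counts_a, counts_b, void_cutoff=5, fold=5):
    """Transcripts with fewer than void_cutoff tags in sample A but at
    least fold-times more tags in sample B.  Zero counts in A are treated
    as 1 so the fold test still applies (an assumption on my part)."""
    hits = set()
    for transcript, a in counts_a.items():
        b = counts_b.get(transcript, 0)
        if a < void_cutoff and b >= fold * max(a, 1):
            hits.add(transcript)
    return hits

# counts_* are dicts of transcript ID -> tag count, one lane per patient.
# hits_pair1 = discordant(lane1_patient1, lane1_patient2)
# hits_pair2 = discordant(lane2_patient1, lane2_patient2)
# print(len(hits_pair1 & hits_pair2))   # overlap between two lane pairings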

The results aren’t great. In one direction, I see an average of 1245 transcripts (about 680 genes, so there’s some overlap amongst the transcript set) with a std. dev. of 38 transcripts. That sounds pretty consistent, till you look for the overlap in actual transcripts: an average of 27.3 with a std. dev. of 17.4 (range 0-60). And when we do the calculations, the most closely matched data sets have only a 5% overlap.

The results for the opposite direction were similar: an average of 277 transcripts met the criteria (std. dev. of 33.61), with an average overlap between data sets of 4.8, std. dev. 4.48 (range of 0-11 transcripts in common). The best overlap in “upregulated” genes for this data set was just over 4% concordance with a second pair of lanes.

So, what this tells me (for a VERY dirty experiment) is that, for genes expressed at the low end, the expression measured in one lane is highly variable from lane to lane. (Sampling at the high end is usually pretty good, so I’m not too concerned about that.)

What I haven’t answered yet is how many lanes are enough. Alas, I have to go do some volunteering, so that experiment will have to wait for another day. And, of course, the images I created along the way will have to follow later as well.

DNA Sequencing Videos

With IBM tossing its hat into the ring of “next-next-generation” sequencing, I’m starting to get lost as to which generation is which. For the moment, I’m sort of lumping things together while I wait to see how the field plays out. In my mind, first generation is anything that requires chain termination, second generation is chemistry-based pyrosequencing, and third generation is single-molecule sequencing based on a nano-scale mechanical process. It’s a crude divide, but it seems to have some consistency.

At any rate, I decided I’d collect a few videos to illustrate each one. For Sanger, there are a LOT of videos, many of which are quite excellent, but I only wanted one. (Sorry if I didn’t pick yours.) For second- and third-generation DNA sequencing videos, the selection kind of flattens out, and two of them come from corporate sites rather than YouTube – which seems to be the general consensus repository for technology videos.

Personally, I find it interesting to see how each group is selling themselves. You’ll notice some videos press heavily on the technology, while others focus on the workflow.

As an aside, I also find it interesting to look for places where the illustrations don’t make sense… there’s a lovely place in the 454 video where two strands of DNA split from each other on the bead, leaving the two full strands and a complete primer sequence… mysterious! (Yes, I do enjoy looking for inconsistencies when I go to the movies.)

Ok, get out your popcorn.

First Generation:
Sanger Entry: Link

Second Generation:
Pyrosequencing Entry: Link

Helicos Entry: Link

Illumina (Corporate site): Link

(Click to see the Flash animation)

454 Entry: Link

Third Generation:

Pacific Biosciences: Link

(Click to see the Flash Video)

Oxford Nanopore Entry: Link

IBM’s Entry: Link

Note: If I’ve missed something, please let me know. I’m happy to add to this post whenever something new comes up.

Base quality by position

A colleague of mine was working on a nifty tool to give graphs of the base quality at each position in a read using Eland export files, which could be incorporated into his pipeline. During a discussion about how long that analysis should take (his script was taking an hour, and I said it should take about a minute to analyze 8M Illumina reads), I ended up saying I’d write my own version, just to show how quickly it could be done.

Well, I was wrong about it taking about a minute. It turns out that the file has more than double the originally quoted 8 million reads (QC, no-match and multi-match reads were not previously filtered), and the whole file was bzipped, which adds to the processing time.

Fortunately, I didn’t have to add bzip support into the reader, as tcezard (Tim) had already added a cool “PIPE” option for piping whatever data format I want into applications of the Vancouver Short Read Analysis Package. Thus, I was able to do the following:

time bzcat /archive/solexa1_4/analysis2/HS1406/42E6FAAXX_7/42E6FAAXX_7_2_export.txt.bz2 | java6 src/projects/maq_utilities/QualityReport -input PIPE -output /projects/afejes/temp -aligner elandext

Quite a neat use of piping, really.
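
For anyone curious, the core of that QualityReport step boils down to something like the sketch below (this is not the actual Java code from the package). I’m assuming the quality string sits in the 10th tab-delimited column of the export file and that the scores are Illumina-style Phred+64; adjust both if your files differ.

import sys

totals = []   # summed quality score at each read position
counts = []   # number of bases observed at each position

for line in sys.stdin:
    qual = line.rstrip("\n").split("\t")[9]   # assumed: quality string in column 10
    for i, ch in enumerate(qual):
        if i >= len(totals):
            totals.append(0)
            counts.append(0)
        totals[i] += ord(ch) - 64             # assumed: Phred+64 encoding
        counts[i] += 1

for pos, (total, n) in enumerate(zip(totals, counts), start=1):
    print(pos, float(total) / n)

It would be run the same way as the command above, e.g. bzcat export.txt.bz2 | python quality_by_position.py.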

Anyhow, the fun part is that the library was a 100-mer Illumina run, and it makes a pretty picture. Slapping the output into OpenOffice yields the following graph:

I didn’t realize quality dropped so dramatically at 100bp – although I remember when qualities looked like that for 32bp reads…

Anyhow, I’ll include this tool in FindPeaks 4.0.8 in case anyone is interested in it. And for the record, this run took 10 minutes, of which about 4 were taken up by bzcat. Of the 16.7M reads in the file, only 1.5M were aligned, probably due to the poor quality out beyond 60-70bp.

Second Gen Sequencer Map and naming our generations properly

Yes, this is old news, but I’ve found myself searching for the second generation sequencing map that was started at SeqAnswers several times in the last few days. Just to make it even easier to find – and of course to give some positive publicity to a really cool project – here’s the Google Maps based list of facilities that run second generation (next-generation) sequencing machines:


For the record, I understand a few places are still missing. I hear there’s some second-generation sequencing going on in Alberta, which clearly hasn’t appeared on the map yet.

And, as a footnote, now that I see other people have picked up on it and started calling “next-generation sequencing” by the more appropriate label of “second generation sequencing” (e.g. here at Nature), I’m going to drop next-gen as a label entirely. First-gen is Sanger/dideoxy/capillary/etc., second-gen is pyrosequencing and third-gen is biotech-based (using cellular components such as DNA polymerases and the like). Let’s end the confusion and name our generations accordingly, shall we?

new repository of second generation software

I finally have a good resource for locating second-gen (next-gen) sequencing analysis software. For a long time, people have just been collecting it in a single thread in the bioinformatics section of the SeqAnswers.com forum; however, the brilliant people at SeqAnswers have now spawned off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.


What would you do with 10kbp reads?

I just caught a tweet about an article on the Pathogens blog (What can you do with 1000 base pair reads?), which is specifically about 454 reads. Personally, I’m not so interested in 454 reads – the technology is good, but I don’t have access to 454 data, so it’s somewhat irrelevant to me. (Not to say 1kbp reads aren’t neat, but no one has volunteered to pass me 454 data in a long time…)

So, anyhow, I’m trying to think two steps ahead. 2010 is supposed to be the year that Pacific Biosciences (and other companies) release the next generation of sequencing technologies – which will undoubtedly produce reads longer than 1kbp. (I seem to recall hearing that PacBio has 10k+ reads. UPDATE: I found a reference.) So, to heck with 1kbp reads – the real question is: what would you do with a 10,000bp read? And, equally important, how do you work with a 10kbp read?

  • What software do you have now that can deal with 10k reads?
  • Will you align or assemble with a 10k read?
  • What experiments will you be able to do with a 10k read?

Frankly, I suspect that nothing we’re currently using will work well with them – we’ll all have to go back to the drawing board and rework the algorithms we use.

So, what do you think?

Aligner tests

You know what I’d kill for? A simple set of tests for each aligner available. I have no idea why we didn’t do this ages ago. I’m sick of off-by-one errors caused by all the slightly different formats out there – and I can’t do unit tests without a good, simple demonstration file for each aligner type.

I know the SAM format should help with this – assuming everyone adopts it – but even for SAM I don’t have a good control file.

I’ve asked someone here to set up this test using a known sequence – and if it works, I’ll bundle the results into the Vancouver Package so everyone can use it.

Here’s the 50-mer I picked to do the test. For those of you with some knowledge of cancer, it comes from TP53. It appears to BLAST uniquely to this location only.

>forward - chr17:7,519,148-7,519,197

>reverse - chr17:7,519,148-7,519,197
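
The test itself doesn’t need to be fancy. Here’s a sketch of the kind of check I have in mind; the per-aligner parsers are hypothetical, because that is exactly the part that differs between formats:

EXPECTED_CHROM = "chr17"
EXPECTED_START = 7519148   # 1-based start of the forward 50-mer above

def check(aligner, chrom, start, zero_based=False):
    """Compare a reported alignment start to the known coordinate,
    allowing for aligners that report 0-based positions."""
    expected = EXPECTED_START - (1 if zero_based else 0)
    if chrom == EXPECTED_CHROM and start == expected:
        print(aligner, "OK")
    else:
        print(aligner, "off by", start - expected)

# parse_format_x() would be a small per-format parser (hypothetical):
# check("aligner_x", *parse_format_x("test_x_output.txt"))
# check("aligner_y", *parse_format_y("test_y_output.map"), zero_based=True)

Once each aligner’s output parses to the same expected coordinate, the off-by-one problems become obvious immediately instead of surfacing downstream.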

how recently was your sample sequenced?

One more blog for the day. I was postponing writing this one because it’s been driving me nuts, and I thought I might be able to work around it… but clearly I can’t.

With all the work I’ve put into the controls and compares in FindPeaks, I thought I was finally clear of the bugs and pains of working on the software itself – and I think I am. Unfortunately, what I didn’t count on was that the data sets themselves may not be amenable to this analysis.

My control finally came off the sequencer a couple of weeks ago, and I’ve been working with it for various analyses (SNPs and the like – it’s a WTSS data set)… and I finally plugged it into my FindPeaks/FindFeatures pipeline. Unfortunately, while the analysis is good, the sample itself is looking pretty bad. Looking at the data sets, the only thing I can figure is that a year and a half of sequencing chemistry changes has made a big impact on the number of aligning reads and the quality of the reads obtained. I no longer get a linear correlation between the two libraries – it looks partly sigmoidal.

Unfortunately, there’s nothing to do except re-sequence the sample. But really, I guess that makes sense. If you’re doing a comparison between two data sets, you need them to have as few differences as possible.

I just never realized that the time between samples also needed to be controlled. Now I have a new question when I review papers: how much time elapsed between the sequencing of your sample and its control?