How deep is deep enough in RNA-Seq?

An interesting paper showed up in my Twitter feed yesterday, courtesy of @fiamh.  The paper is called "RNA-Sequence analysis of human B-cells".  It sounds simple enough, but @fiamh's tweet contains the take-home message from the paper:

@fiamh: Ouch…: “about 500 million reads are needed to measure accurately [transcript] expression levels.”

It’s not particularly hidden – you’ll find that claim in the abstract.  On its surface, that sounds pretty desperate – is RNA-seq unreliable for expression-level analysis with fewer than 500M reads?  Well, no, it doesn’t seem to be.

This is where the paper gets interesting, though.  This isn’t 500 million reads from one sample – it’s the pooled reads from 20 different samples, each with about 40M reads, for roughly 800M reads in total.

Skipping that for the moment, this analysis seems interesting enough – I’m working with a few immortalized B-cell-derived cell lines myself, and it is fascinating to look at the gene expression.  Frankly, 40M reads isn’t bad at all for a single sample, and it’s certainly enough to get excellent coverage of pretty much all of the expressed genes.  Since that’s what I spent the last couple of weeks doing, I feel like I can make that statement with some confidence – most of my samples have 30–40M aligned reads.

Moving along, most of the paper tells us things we already know from decades of microarray data: a few genes are highly expressed, many show alternative expression, and there’s a “long tail” of weird, rare or otherwise obscure genes being expressed. 88% of genes are common to all 20 B-cell samples – that is to say, B-cells are like other B-cells, regardless of the individual from whom they come. (Although they do specify in the methods that all of the samples come from a European population and, as far as I know, Europeans are one of the populations with the least genetic diversity.)

I did find it interesting that 97% of multi-exon genes had two or more alternative splicing isoforms.   I wasn’t aware that it was quite that high – although, again, that’s across 20 distinct individuals, and unfortunately there is no indication of whether the isoforms overlap between individuals or whether there are 20 distinct patterns of alternative splicing.

In any case, the results are also compared to microarray data to show that RNA-Seq has better resolution than microarray. (I thought this was an odd step to undertake for a half paragraph of text, and something that was done to death 3 years ago.  Perhaps they were simply comparing with already published data?)  No surprise there.

And then comes the interesting part of the experiment:

Making the assumption that the dataset combining all 20 samples (a full 879M reads) represents all of the expressed genes, you can then ask how many reads you have to sample to find all or some fraction of your genes.

It’s a relatively simple method – and it’s effective for quantifying what was found – but is that assumption valid?  I would argue that it’s not.  At best, it’s an approximation that makes for a reasonable estimate, but it’s not complete.

As an aside, I really like the ROC-style graphs in the paper – they’re an exceptionally clear way to display the sampling issue.  I’ll refer you to the paper, because it is well worth reading, and the illustrations are fantastically informative.  They do a great job of letting you understand the pooled data in a very simple manner: if you want X% of the genes covered at Y% of the “final” depth, then you need to sequence Z reads.  That’s very useful information.
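
Just to make that concrete, here’s a rough sketch of what that sort of saturation analysis looks like.  This is not the paper’s pipeline – the lognormal abundance distribution and every parameter below are invented purely for illustration – but it shows the general idea: treat one deeply sequenced library as the “complete” gene set, subsample it at increasing depths, and count how many genes pick up at least one read.

    # Toy saturation analysis - all parameters are invented for illustration.
    # Treat one big library as the "complete" set of expressed genes, then
    # subsample reads at increasing depths and count genes seen at least once.
    import numpy as np

    rng = np.random.default_rng(0)

    n_genes = 20766                        # total gene count reported in the paper
    abundance = rng.lognormal(mean=0.0, sigma=2.0, size=n_genes)
    p = abundance / abundance.sum()        # per-read probability of hitting each gene

    def genes_detected(depth):
        """Count genes that receive at least one read at this sequencing depth."""
        counts = rng.multinomial(depth, p)
        return int((counts > 0).sum())

    for depth in (10_000_000, 40_000_000, 100_000_000, 500_000_000):
        seen = genes_detected(depth)
        print(f"{depth/1e6:>5.0f}M reads: {seen:5d} genes detected ({seen/n_genes:.0%})")

Run with a more realistic abundance distribution (or the real pooled counts), this is exactly the kind of curve the figures summarize: reads on one axis, fraction of the “final” gene set on the other.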

For a thought experiment, we can look at what they’ve done.  For each of the 20 samples, they’ve sequenced about 40M reads – and we know 40M reads is good enough to cover the most highly expressed genes, but not those expressed at low levels.  In fact, we can even use the number reported above – 88% of observed genes were found in all 20 samples.   Since they found a total of 20,766 genes, that leaves around 2,492 genes (the remaining 12%) found in fewer than all 20 samples.  How many of those were unique to a single sample is impossible to guess, but it’s not an insignificant number – and knowing that 97% of multi-exonic genes display alternative splicing patterns, we probably have a large number of unique sequences that are likely to be rare.

So, by pooling 20 “under-sampled” libraries, do we come up with a complete set of transcripts?  Probably not – but let’s say it’s close enough, which is the assumption the authors make.

The issue I have, however, is that the pooled data is not necessarily representative of the actual “final” depth.  If the data is not homogeneous across the 20 samples, then the required sampling depth is overstated:  a rare transcript of interest that is represented 100 times in a single 40M-read sample simply becomes diluted 20-fold in this experiment, so the reported requirement of 500M reads is a 20-fold overstatement.  You only needed about 25M reads (500M ÷ 20) to have found it in the original sample, and no amount of sequencing of the other 19 data sets would have found it.
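
To put numbers on that dilution – using only the figures from the paragraph above (100 copies of a transcript in one 40M-read sample, 20 samples in the pool), so this is pure arithmetic rather than anything from the paper:

    # Back-of-the-envelope for the dilution argument, using the numbers above:
    # a transcript seen ~100 times in one 40M-read sample, versus the same
    # transcript diluted into a 20-sample, ~800M-read pool.
    copies = 100
    sample_depth = 40e6
    pool_depth = 20 * sample_depth              # ~800M reads

    freq_single = copies / sample_depth         # frequency in the one sample expressing it
    freq_pooled = copies / pool_depth           # 20-fold lower once pooled

    for depth in (25e6, 40e6, 100e6, 500e6):
        print(f"{depth/1e6:>4.0f}M reads: ~{freq_single * depth:6.1f} expected reads "
              f"from the single sample, ~{freq_pooled * depth:5.1f} from the pool")

Sequencing 25M reads of the right sample buys you the same coverage of that transcript as 500M reads of the pool – and sequencing the other 19 samples more deeply buys you nothing at all.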

Even worse, if the expression level of genes isn’t homogeneous across samples, then the pooled data makes for a poor data set for this type of experiment.  If gene A is found at high levels in some samples and at low or undetectable levels in others, then the sampling of that gene is going to be skewed.  In my own data, I’m aware of genes with 20,000 reads aligning to their exons in some samples and only 15–20 in others.  Granted, these aren’t strictly normal samples, as they’re infected with viruses, so they may not be representative of all genes, but it is certainly possible.  In fact, since the samples in question appear to be a set of 20 immortalized B-cell-derived cell lines, it’s reasonable to assume that the data used in this experiment contain exactly this type of skew – which is not ideal.

Thus, we can be fairly certain that the homogeneity of the pooled samples is poor – which makes the pooling of the samples an interesting experiment, but not an ideal one in which to ask the question of sampling depth.  If the authors really wanted to ask this question, they’d have been much better served by taking one sample and sequencing it to great depth.  That is to say, if you want to study sampling, you should study a sample in which your sampling is likely to be representative of the “final” set.

Unfortunately, the paper itself doesn’t discuss the issue of sample heterogeneity in any depth – and that’s my real complaint.  By pooling, you guarantee that your ROC curves reflect transcript abundances averaged across the pooled population, rather than capturing the rare and extremely rare transcripts that one would expect to find in any single cell line.

So, while I think this paper is well written and provides an incredibly useful “rule of thumb” for the depth required (again, see the beautiful ROC figures), I don’t think they’ve actually addressed the one particular question they claim to have.

I think the question is still open as to how deep you really do need to go.  Fortunately, thanks to this paper, we now know the upper bound: 500 million reads.

7 thoughts on “How deep is deep enough in RNA-Seq?”

  1. Hi Anthony,

    I believe those charts are rarefaction curves. They use them in ecology to measure species diversity within a defined area. You would, e.g., rope off an area and count the different species of bugs that you identify. The individuals per species are usually somewhat lognormally (or Zipf) distributed, e.g. lots of ants, few crickets, and just an occasional butterfly. It’s more or less the same problem as in RNA-Seq, but the ecologists get to go outside.

    You may enjoy this. It is a classic of both ecology and snarky 1970’s intellectualism: “…A Critique and Alternative Parameters” (PDF).

    • Interesting, I’d never heard of a rarefaction curve – I’ll have to look into it. However, I believe the ones I liked the most are indeed ROC curves.

      And thanks, I’ll take a look at the paper you’ve suggested. Just for the record, I’m not actually aiming for snarky, myself – and I’m definitely trying to avoid the 1970’s intellectualism!

    • Wow… I can see why the subject of my post reminded you of that paper. It’s an interesting monologue, and it really is quite snarky! Just… wow!

  2. Yeah,

    That’s what I was hoping to find in the paper… how much RNA-seq coverage is good enough for which application! With more than 100 million clusters from a HiSeq, the obvious question is whether one can pool multiple RNA-seq samples without losing much information… and how many samples one can pool!!

    Hopefully your data will answer that.

    • Absolutely – and I don’t think they did a great job of answering those questions. Unfortunately, my own data set is too small to answer them either, and it predates the HiSeq, so I’m the wrong person to try.

      Regardless, there’s still room, IMHO, for someone to get a great paper out of that particular experiment!

  3. Well, there is a small problem as I see it: noise. Notice what they did: they obtained 879 million reads, then they *assumed* that those 879 million reads represent the entire collection of expressed genes. Then they sampled 100 million reads and noticed that they could still see only 80% of the genes. It is not clear what the initial assumption is based on. Or rather, it is clear – the number is big, so it must be comprehensive. I would argue, however, that if you produce 100 billion reads you will cover each and every existing gene. Does that mean that each and every gene is expressed? Hardly. After all, from a phenotype point of view (for protein-coding genes) the only criterion of expression is whether you have a protein product in the end. All it means is that, with a certain very low probability, each gene will be transcribed in a very small number of cells, and given a sufficient number of reads you will be able to detect it. It is actually easy to calculate for a given probability of expression (see the sketch at the end of this comment).

    What makes more sense to me is to select a group of low-expressing genes (we have to figure out what “low” means) and see how many reads you need to produce before the expression value for a given gene or transcript stabilizes.

    That has actually been done (several times) – for example: Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008). “Mapping and quantifying mammalian transcriptomes by RNA-Seq.” Nature Methods 5(7): 621–628.
    See figure 2.

    I can give you an interesting example from my own experience: interferon beta and the viral response. IFNb is of peripheral interest to me, but it is an amazing marker, as it is not expressed (by any method available) in unstimulated cells. Indeed, at 40 million reads it is not expressed – 0 reads, not a peep. However, at 160 million reads I got 1 read (in one out of 4 samples). Does that mean it is expressed? Not really; I also got 2 reads that corresponded to a pre-centromeric sequence.
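
    A rough sketch of that calculation – the per-read frequency below is an assumed value, picked only so that roughly one read is expected at around 160 million reads, as in the IFNb example:

        # Poisson back-of-the-envelope: how many reads before a barely-transcribed
        # gene shows up at all? The per-read frequency is an assumption chosen so
        # that ~1 read is expected at roughly 160M reads.
        from math import exp

        freq = 1 / 160e6                      # assumed per-read frequency of the transcript

        for depth in (40e6, 160e6, 500e6, 2e9):
            expected = freq * depth           # expected read count at this depth
            p_seen = 1 - exp(-expected)       # Poisson P(at least one read)
            print(f"{depth / 1e6:>6.0f}M reads: expect {expected:4.1f} reads, "
                  f"P(detected) = {p_seen:.2f}")

    Given enough reads, essentially everything gets “detected” eventually – which says very little about meaningful expression.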

  4. Pingback: Leung Lab Blog » Blog Archive » 2011-09-17 new articles we read this week
