An interesting paper showed up in my twitter feed yesterday, courtesy of @fiamh. The paper is called: RNA-Sequence analysis of human B-cells. It sounds simple enough, but @fiamh’s twitter message contains the take-home message from the paper:
@fiamh: Ouch…: “about 500 million reads are needed to measure accurately [transcript] expression levels.”
It’s not particularly hidden – you’ll find that claim in the abstract. On its surface, that sounds pretty desperate – is RNA-seq unreliable for expression level analysis with under 500M reads? Well, no, it doesn’t seem to be.
This is where the paper gets interesting, though. This isn’t 500 million reads from one sample – it’s actually the pooled reads from 20 different samples, each with about 40M reads, giving a total of roughly 879M reads when pooled.
Skipping that for the moment, this analysis seems interesting enough – I’m working with a few immortalized B-cell-derived cell lines and it is fascinating to look at the gene expression. Frankly, 40M reads isn’t that bad for a single sample, and is certainly enough to get excellent coverage of pretty much all of the expressed genes. Since that’s what I spent the last couple of weeks doing, I feel like I can make that statement with some prior knowledge – most of my samples have 30-40M aligned reads.
Moving along, most of the paper tells us stuff we know already from decades of microarray data: a few genes are expressed highly, many of them have alternate expression, and there’s a “long tail” of weird, rare or otherwise obscure genes being expressed. 88% of genes are common to all 20 B-cell samples – that is to say, B-cells are like other B-cells, regardless of the individual from whom they come. (Although they do specify in the methods that all of these are from a European population and, as far as I know, Europeans are one of the populations that have the least amount of diversity.)
I did find it interesting that 97% of multi-exon genes had two or more alternative splicing isoforms. I wasn’t aware that it was quite that high – although, again, that’s across 20 distinct individuals, and unfortunately, no indication is given whether the isoforms overlap between individuals or whether there are 20 distinct patterns of alternate splicing.
In any case, the results are also compared to microarray data to show that RNA-Seq has better resolution than microarray. (I thought this was an odd step to undertake for a half paragraph of text, and something that was done to death 3 years ago. Perhaps they were simply comparing with already published data?) No surprise there.
And then comes the interesting part of the experiment:
Making the assumption that the dataset combining all 20 samples (a full 879M reads) represents all of the expressed genes, you can then ask how many reads you have to sample to find all or some fraction of your genes.
It’s a relatively simple method – and it’s effective for quantifying what you found, but is that assumption valid? I would argue that it’s not actually that good. At least, it’s an approximation that makes for a good estimate, but it’s not complete.
As an aside, I really like the ROC style graphs done in the paper – they’re an exceptionally clear way to display the sampling issue. I’ll refer you to the paper, because it is really worth reading, and the illustrations are fantastically informative. They do a great job of letting you understand the pooled data in a very simple manner. If you want X% of the genes covered at Y% of the “final” depth, then you need to sequence Z reads. That’s very useful information.
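To get a feel for how those sampling curves arise, here’s a toy simulation – not the authors’ method, and the gene count, power-law expression model and all numbers are invented for illustration: draw reads from a skewed expression distribution and count how many distinct genes you’ve seen at each depth.

```python
import random

# Toy sketch of a gene-discovery saturation curve. The power-law
# expression model and all numbers here are invented for illustration;
# this is not the paper's actual method or data.
random.seed(42)

n_genes = 20_000
# Skewed expression: a handful of genes soak up most of the reads.
weights = [1.0 / (rank + 1) for rank in range(n_genes)]

def genes_detected(n_reads):
    """Sample n_reads from the expression model; count distinct genes seen."""
    seen = set(random.choices(range(n_genes), weights=weights, k=n_reads))
    return len(seen)

for depth in (10_000, 100_000, 1_000_000):
    print(f"{depth:>9} reads -> {genes_detected(depth)} genes detected")
```

The curve climbs fast through the highly expressed genes and then crawls through the long tail – which is exactly the shape the paper’s figures trace out.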
For a thought experiment, we can look at what they’ve done. For each of the 20 samples, they’ve sequenced 40M reads – and we know 40M reads is good enough to cover the most highly expressed genes, but not those expressed at low levels. In fact, we can even use the number reported above – 88% of genes observed were found in all 20 samples. Since they found a total of 20,766 genes, that’s around 2,492 genes that were found in fewer than 20 samples. How many of those were unique to a single sample is impossible to guess, but it’s not an insignificant number – and knowing that 97% of multi-exonic genes display alternate splicing patterns, we probably have a large number of unique sequences that are likely to be rare.
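The arithmetic behind that estimate is just the two figures quoted from the paper:

```python
# Back-of-the-envelope check of the figures quoted above: 20,766 genes
# total, 88% of them common to all 20 B-cell samples.
total_genes = 20_766
shared_fraction = 0.88

not_in_all = round(total_genes * (1 - shared_fraction))
print(not_in_all)  # 2492 genes absent from at least one sample
```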
So, by pooling 20 “under-sampled” libraries, do we come up with a complete set of transcripts? Probably not – but let’s say it’s close enough, which is the assumption that the authors make.
The issue I have, however, is that the pooled data is not necessarily representative of the actual “final” depth. If the data is not homogeneous across the 20 samples, then the sampling depth is an overstatement: one rare transcript of interest that is represented 100 times in a single sample of 40M reads just becomes diluted 20-fold in this experiment – and the required depth of sampling (500M reads) is a 20-fold overstatement. You only needed 25M reads to have found it in the original sample, and no amount of sampling in the other 19 data sets would have found it.
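That dilution argument can be made concrete with a quick calculation – the transcript counts here are the hypothetical numbers from the paragraph above, not data from the paper:

```python
# Dilution sketch: a transcript backed by 100 reads in one 40M-read
# sample is 20x rarer once that sample is pooled with 19 others.
# All numbers are the hypothetical ones from the text, for illustration.
reads_in_sample = 100
sample_depth = 40_000_000
pool_depth = 20 * sample_depth  # 800M reads; the transcript count is unchanged

freq_single = reads_in_sample / sample_depth  # frequency in the one sample
freq_pooled = reads_in_sample / pool_depth    # 20x lower in the pool

# Expected copies seen when resampling N reads from each source:
expected_single = 25_000_000 * freq_single    # 25M reads from the one sample
expected_pooled = 500_000_000 * freq_pooled   # 500M reads from the pool
print(expected_single, expected_pooled)       # same expected count either way
```

In other words, 25M reads from the right sample buys you the same expected number of copies of that transcript as 500M reads from the pool – the 20-fold overstatement in a nutshell.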
Even worse, if the expression level of genes isn’t homogeneous across samples, then the pool is a poor data set for this type of experiment. If gene A is found at high levels in some samples, and low or undetectable levels in others, then the sampling of the gene is going to be skewed. In my own data, I’m aware of genes with 20,000 reads aligning to their exons in some samples, and 15-20 reads in others. Granted, these aren’t strictly normal samples, as they’re infected with viruses, so they may not be representative of all genes, but it is certainly possible. In fact, since the sample in question seems to be a set of 20 immortalized B-cell-derived cell lines, it’s a reasonable assumption that the data used in this experiment will contain exactly this type of variability – which is not ideal.
Thus, we can be pretty certain that the homogeneity of the pooled samples is probably poor – which makes the pooling of the samples an interesting experiment, but not an ideal one in which to ask the question of sampling depth. If the authors really wanted to ask this question, they’d have been much better served by taking one sample and sequencing it to great depth. That is to say, if you want to study sampling, you should study a sample in which your sampling is likely to be representative of the “final” set.
Unfortunately, the paper itself doesn’t discuss the issue of sample heterogeneity in any depth – and that’s my real complaint. By pooling, you guarantee that your ROC curves will reflect the average of populations of transcript depth across a large sample – rather than actually looking for the rare and extremely rare transcripts that one would expect in a single cell line.
So, while I think this paper is well written and provides an incredibly useful “rule of thumb” for the depth required (again, see the beautiful ROC figures), I don’t think they’ve actually addressed the one question they claim to have answered.
I think the question is still open as to how deep you really do need to go. Fortunately, thanks to this paper, we now know the upper bound: 500 million reads.