This isn’t a particularly in-depth blog post. Rather, I just want to touch on a few points in reply to a Twitter question about how to normalize reads.
Actually, normalization is something I haven’t studied in great depth beyond its applications for ChIP-Seq, where – of course – it’s still an open question. So, if my answer is incomplete, please feel free to point out other resources in the comments. Any contributions would be welcome.
First, one can do all sorts of fancy things to perform a normalization. Frankly, I think the term is pretty abused, so a lot of what passes for normalization is a bit sketchy. Anything that makes two samples “equivalent” in any way is often referred to as normalization, so yeah, your mileage may vary when it comes to applying any approach.
For RNA-Seq, I’ve heard of all sorts of techniques being applied. The most common is to simply count the total number of reads in each of the two samples, and then normalize by dividing one sample by the ratio of the two totals. Very crude, and often very wrong, depending on what the two samples actually are. I don’t use this approach unless I’m feeling lazy, someone has asked me to normalize a sample, and it’s 5:40 on a Thursday evening. (I have better things to do on Thursday evenings.)
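As a minimal sketch of that crude approach, assuming each sample has already been reduced to a dictionary of per-gene read counts: the gene names, the counts, and the function names `total_count_ratio` and `scale_counts` are all invented for illustration.

```python
def total_count_ratio(counts_a, counts_b):
    """Ratio of total reads in sample A to total reads in sample B."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_b == 0:
        raise ValueError("sample B has no reads")
    return total_a / total_b


def scale_counts(counts, ratio):
    """Multiply every count in a sample by the normalization ratio."""
    return {gene: n * ratio for gene, n in counts.items()}


# Toy, made-up counts: sample A has 1000 reads, sample B has 500.
sample_a = {"geneX": 100, "geneY": 300, "geneZ": 600}
sample_b = {"geneX": 60, "geneY": 140, "geneZ": 300}

ratio = total_count_ratio(sample_a, sample_b)  # 2.0
scaled_b = scale_counts(sample_b, ratio)       # B now totals 1000 reads
```

Scaling sample B up by the ratio is equivalent to dividing sample A by it. Either way, every gene gets the same factor, which is exactly why this goes wrong when the two samples differ in composition.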
The second method is to bin the reads, then apply the same normalization as above. The bins can be as large as a whole chromosome or as small as a few hundred base pairs, but the general method is the same: use some average over a collection of bins to work out the ratio, then use that ratio to force the two samples to have approximately the same total number of reads. It’s a bit better than what’s above, but not a whole lot better.
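Here is a rough sketch of the binned version, under some assumptions of my own: reads are represented only by their start positions on a single chromosome, bins are fixed-width, only bins with reads in both samples are used, and "some average" is taken to be the median of the per-bin ratios. The function names and positions are made up.

```python
from collections import Counter
from statistics import median


def bin_counts(positions, bin_size):
    """Count reads per fixed-width bin, keyed by bin index."""
    return Counter(pos // bin_size for pos in positions)


def binned_ratio(positions_a, positions_b, bin_size=200):
    """Median of per-bin read ratios (A / B) over bins present in both."""
    bins_a = bin_counts(positions_a, bin_size)
    bins_b = bin_counts(positions_b, bin_size)
    shared = bins_a.keys() & bins_b.keys()
    if not shared:
        raise ValueError("no bins contain reads from both samples")
    return median(bins_a[b] / bins_b[b] for b in shared)


# Toy read start positions (made up).
reads_a = [10, 50, 150, 250, 260, 270, 280, 450, 460]
reads_b = [20, 120, 210, 300, 900]

# Shared bins are 0 and 1, with ratios 3/2 and 4/2.
ratio = binned_ratio(reads_a, reads_b, bin_size=200)  # 1.75
```

The median is just one choice. A mean, or a trimmed mean, would fit the description equally well.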
The third method that I’m aware of is to use a subset of reads to determine the normalization ratio. This is a bit better – assuming you know enough to pick a good subset. For instance, if you know the housekeeping genes, you can use the total coverage over that set to approximate the relative abundance of reads in order to set the correct ratios. This method can be dramatically better, if you happen to know a good set of genes (or other subset) to use, as it prevents you from comparing non-equivalent sets.
Just to harp on that last point, if you’re comparing a B-cell-derived and a mammary-derived cell line, you might be tempted to normalize on the total number of reads. However, it would quickly become apparent (once you look at the expressed genes) that some B-cell genes are highly expressed and swamp your data set. By paring those out of the normalization subset, you’d find the core genes common to both samples to be more comparable – and thus less prone to bias introduced by genes only expressed in one sample.
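To make the swamping effect concrete, here is a small sketch assuming per-gene read counts and a known list of reference genes. The counts are invented toy numbers, the gene names are only illustrative labels, and `subset_ratio` is a hypothetical helper name.

```python
def subset_ratio(counts_a, counts_b, reference_genes):
    """Ratio of reads (A / B) summed over the reference genes only."""
    total_a = sum(counts_a.get(gene, 0) for gene in reference_genes)
    total_b = sum(counts_b.get(gene, 0) for gene in reference_genes)
    if total_b == 0:
        raise ValueError("sample B has no reads on the reference genes")
    return total_a / total_b


# Toy counts (made up): one B-cell-specific gene dominates the first sample.
b_cell = {"ACTB": 200, "GAPDH": 300, "IGHM": 4500}
mammary = {"ACTB": 100, "GAPDH": 150, "KRT18": 250}
housekeeping = ["ACTB", "GAPDH"]

# Total-read ratio, swamped by the B-cell-specific gene.
naive_ratio = sum(b_cell.values()) / sum(mammary.values())      # 10.0
# Ratio over the shared reference genes only.
reference_ratio = subset_ratio(b_cell, mammary, housekeeping)    # 2.0
```

In this toy case the total-read ratio is five times larger than the ratio over the shared genes, purely because of one gene that is expressed in only one sample.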
You’ll notice, however, that all of the methods above use a simple ratio, with increasingly better methods of approximation. That’s pretty much par for the course, as far as I’ve seen in RNA-Seq. It’s not ideal, but I haven’t seen much more elegance than that.
When it comes to ChIP-Seq, the same things apply – most software does some variation of the above, and many of them are still floundering around with the first two types, of which I’m not a big fan.
The version I implemented in FindPeaks 4.0 goes a little bit differently, but can be applied to RNA-Seq just as well as to ChIP-Seq. (Yes, I’ve tried.) The basic idea is that you don’t actually know the subset of housekeeping genes in common in ChIP-Seq because, well, you aren’t looking at gene expression. Instead, you’re looking at peaks – which can be broadly defined as any collection of reads above the background. Thus, the first step is to establish which subset of your data is best used for normalization – this can be done by peak calling your reads. (Hence, using a peak caller.)
Once you have peak calling done for both data sets, you can match the peaks up. (Note, this is not a trivial operation, as it must be symmetrical, and repeatable regardless of the order of samples presented.) Once you’ve done this, you’ll find you have three subsets: peaks in sample 1 but not in sample 2, peaks in sample 2 but not in sample 1, and peaks common to both. (Peaks missing in both are actually important for anchoring your data set, but I won’t get into that.) If you only use the peaks common to both data sets, rather than peaks unique to one sample, you have a natural data subset ideal for normalization.
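The three-way split can be sketched as follows. This is not the matching used in FindPeaks, just a deliberately simple stand-in: peaks are assumed to be half-open (start, end) intervals on one chromosome, and two peaks "match" if they overlap at all. Because overlap is a symmetric relation, swapping the samples only swaps the labels. The name `partition_peaks` and the coordinates are made up.

```python
def overlaps(p, q):
    """True if half-open intervals p and q share at least one base."""
    return p[0] < q[1] and q[0] < p[1]


def partition_peaks(peaks_1, peaks_2):
    """Split peaks into (only in 1, only in 2, overlapping pairs)."""
    common = [(p, q) for p in peaks_1 for q in peaks_2 if overlaps(p, q)]
    matched_1 = {p for p, _ in common}
    matched_2 = {q for _, q in common}
    only_1 = [p for p in peaks_1 if p not in matched_1]
    only_2 = [q for q in peaks_2 if q not in matched_2]
    return only_1, only_2, common


# Toy peak coordinates (made up).
peaks_1 = [(100, 200), (500, 600), (900, 1000)]
peaks_2 = [(150, 250), (700, 800), (950, 1050)]

only_1, only_2, common = partition_peaks(peaks_1, peaks_2)
# only_1 == [(500, 600)], only_2 == [(700, 800)]
# common pairs (100, 200) with (150, 250) and (900, 1000) with (950, 1050)
```

The all-against-all comparison is quadratic in the number of peaks, which is fine for a toy example but would need a sorted sweep or an interval index on real data. A real matcher also has to decide what to do when one peak overlaps several peaks in the other sample.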
Using this subset, you can then perform a linear regression (again, it’s not a standard linear regression, as it must be symmetrical about the regression line, and not Y-axis dependent) and identify the best fit line for the two samples. Crucially, this linear regression must pass through the origin; otherwise, the fit includes an offset as well as a slope, and you haven’t found a normalization ratio.
In any case, once all this is done, you can then use the slope of the regression line to determine the best normalization for your data sets.
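One regression with those two properties (symmetric in the two samples and forced through the origin) is orthogonal, or total least squares, regression, which minimizes perpendicular distances to the line rather than vertical ones. I am assuming it here as a stand-in and not claiming it is exactly what FindPeaks does. For the line y = m·x, setting the derivative of the summed squared perpendicular distances to zero gives Sxy·m² + (Sxx − Syy)·m − Sxy = 0, where Sxx, Syy and Sxy are the uncentered sums of x², y² and x·y. The function name and the peak heights are made up.

```python
import math


def orthogonal_slope_through_origin(xs, ys):
    """Slope of the line through (0, 0) minimizing perpendicular distances."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("slope is undefined when sum(x * y) is zero")
    diff = syy - sxx
    # Root of sxy*m^2 + (sxx - syy)*m - sxy = 0 with the same sign as sxy.
    return (diff + math.sqrt(diff * diff + 4 * sxy * sxy)) / (2 * sxy)


# Toy heights of matched peaks (made up): sample 2 is twice sample 1.
heights_1 = [10, 20, 30, 40]
heights_2 = [20, 40, 60, 80]

slope = orthogonal_slope_through_origin(heights_1, heights_2)    # 2.0
inverse = orthogonal_slope_through_origin(heights_2, heights_1)  # 0.5

# Use the slope as the normalization ratio for sample 2.
normalized_2 = [h / slope for h in heights_2]
```

Swapping the two samples gives the reciprocal slope, which is the order-independence that an ordinary least squares fit of y on x does not have.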
The beauty of it is that you also end up with a very nice graph, which makes it easy to understand the data sets you’ve compared, and you have your three subsets, each of which will be of some interest to the investigator.
(I should also note, however, that I have not expanded this method to more than two data sets, although I don’t see any reason why it could not be. The math becomes more challenging, but the concepts don’t change.)
Regardless, the main point is simply to provide a method by which two data sets become more comparable. The way you intend to compare them will dictate how you do the normalization, so what I’ve provided above is only a rough guide to some of the ways you can normalize on a single trait. If you’re asking more challenging questions, what I’ve presented above may not be sufficient for comparing your data.
[Edit: Twitter user @audyyy sent me this link, which describes an alternate normalization method. In fact, they have two steps – a pre-normalization log transform (which isn’t normalization, but is common; even FindPeaks 4.0 has it implemented), and then a literal “normalization” which makes the mean = 0 and the standard deviation = 1. However, this is only applicable for one trait across multiple data sets (e.g., the count of total reads across a large number of libraries). That said, it wouldn’t be my first choice of normalization techniques.]
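For reference, the two steps described in that edit can be sketched like this. The choices of log base 2 and of the population standard deviation are my assumptions, as are the function name `log_zscore` and the toy library sizes.

```python
import math
from statistics import mean, pstdev


def log_zscore(values):
    """Log2-transform positive values, then rescale to mean 0 and SD 1."""
    logged = [math.log2(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged)  # population SD; a sample SD is another option
    if sigma == 0:
        raise ValueError("all values are identical; SD is zero")
    return [(v - mu) / sigma for v in logged]


# Toy total read counts for five libraries (made up).
library_sizes = [1, 2, 4, 8, 16]
z = log_zscore(library_sizes)
```

After the transform, the values have mean 0 and standard deviation 1, so they describe where each library sits relative to the others rather than giving a ratio between two samples.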