Normalizing reads for Next-Gen Sequencing.

This isn’t a particularly in-depth post.  Rather, I just want to touch on a few points in reply to a Twitter question asking how to normalize reads.

Actually, normalization is something I haven’t studied in great depth beyond its applications for ChIP-Seq, where – of course – it’s still an open question.  So, if my answer is incomplete, please feel free to point out other resources in the comments.  Any contributions would be welcome.

First, one can do all sorts of fancy things to perform a normalization.  Frankly, I think the term is pretty abused, so a lot of what passes for normalization is a bit sketchy.  Anything that makes two samples “equivalent” in any way is often referred to as normalization, so yeah, your mileage may vary when it comes to applying any approach.

For RNA-Seq, I’ve heard all sorts of techniques being applied.  The most common is to simply count the number of reads in the two samples, and then normalize by the ratio of reads between them.  Very crude, and often very wrong, depending on what the two samples actually are.  I don’t use this approach unless I’m feeling lazy, someone has asked me to normalize a sample, and it’s 5:40 on a Thursday evening. (I have better things to do on Thursday evenings.)
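For the lazy-Thursday version, the whole method fits in a few lines.  Gene names and counts here are made up purely for illustration:

```python
# Crude total-read-count normalization: scale sample B so its total
# matches sample A.  Per-gene read counts are hypothetical.
sample_a = {"geneX": 120, "geneY": 300, "geneZ": 80}
sample_b = {"geneX": 300, "geneY": 620, "geneZ": 280}

# Single scaling factor from the ratio of total read counts.
ratio = sum(sample_a.values()) / sum(sample_b.values())
sample_b_norm = {gene: count * ratio for gene, count in sample_b.items()}
```

The totals now match, but note the weakness: a gene highly expressed in only one sample still drags the ratio around and distorts every other gene’s scaled value.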

The second method is to bin the reads, then apply the same normalization as above.  The bins can be as large as a whole chromosome or as small as a few hundred base pairs, but the general method is the same: Use some average over a collection of bins to work out the ratio, then use that ratio to force the two samples to have the same approximate total read numbers.  It’s a bit better than what’s above, but not a whole lot better.
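The binned variant might be sketched like this – bin width and the (hypothetical) read start positions are both assumptions for the example:

```python
# Bin read start positions into fixed-size windows, then derive the
# scaling ratio from the mean per-bin counts rather than raw totals.
from collections import Counter

BIN_SIZE = 500  # base pairs; a tunable assumption

def bin_counts(read_starts, bin_size=BIN_SIZE):
    return Counter(pos // bin_size for pos in read_starts)

reads_a = [120, 480, 510, 900, 1400, 1450]
reads_b = [100, 130, 520, 560, 950, 1300, 1420, 1480]

bins_a = bin_counts(reads_a)
bins_b = bin_counts(reads_b)

# Average count over the bins observed in each sample.
mean_a = sum(bins_a.values()) / len(bins_a)
mean_b = sum(bins_b.values()) / len(bins_b)
ratio = mean_a / mean_b
```

Averaging over bins softens the effect of a single extreme region, but the ratio is still driven by whatever happens to dominate the bins.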

The third method that I’m aware of is to use a subset of reads to determine the normalization ratio.  This is a bit better – assuming you know enough to pick a good subset.   For instance, if you know housekeeping genes, you can use the total coverage over that set to approximate the relative abundance of reads in order to set the correct ratios.  This method can be dramatically better, if you happen to know a good set of genes (or other subset) to use, as it prevents you from comparing non-equivalent sets.
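A sketch of the subset approach – the gene names, counts, and the choice of “housekeeping” set are all hypothetical:

```python
# Derive the ratio only from a trusted subset (e.g. housekeeping
# genes), so genes unique to one sample can't skew the scaling factor.
counts_a = {"ACTB": 900, "GAPDH": 1100, "CD19": 5000, "TUBB": 700}
counts_b = {"ACTB": 450, "GAPDH": 560, "CD19": 10, "TUBB": 340}

housekeeping = {"ACTB", "GAPDH", "TUBB"}  # assumed stable across samples

subset_a = sum(counts_a[g] for g in housekeeping)
subset_b = sum(counts_b[g] for g in housekeeping)
ratio = subset_a / subset_b  # unaffected by the CD19 outlier
```

Here the B-Cell marker CD19 swamps sample A’s total, but because it is excluded from the subset, the ratio reflects only the genes the two samples genuinely share.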

Just to harp on that last point, if you’re comparing a B-Cell- and a Mammary-derived cell line, you might be tempted to normalize on the total number of reads; however, it would quickly become apparent (once you look at the expressed genes) that some B-Cell genes are highly expressed and swamp your data set.  By paring those out of the normalization subset, you’d find the core genes in common to be more comparable – and thus less prone to bias introduced by genes only expressed in one sample.

You’ll notice, however, that all of the methods above use a simple ratio, with increasingly better methods of approximation.  That’s pretty much par for the course, as far as I’ve seen in RNA-Seq.  It’s not ideal, but I haven’t seen much more elegance than that.

When it comes to ChIP-Seq, the same things apply – most software does some variation of the above, and many of them are still floundering around with the first two types, of which I’m not a big fan.

The version I implemented in FindPeaks 4.0 goes a little differently, but can be applied to RNA-Seq just as well as to ChIP-Seq. (Yes, I’ve tried.)  The basic idea is that you don’t actually know the subset of housekeeping genes in common in ChIP-Seq because, well, you aren’t looking at gene expression.  Instead, you’re looking at peaks – which can be broadly defined as any collection of reads above the background.  Thus, the first step is to establish the best subset of your data to use for normalization – which can be done by peak calling your reads (hence the peak caller).

Once you have peak-calling done for both datasets, you can match the peaks up.  (Note, this is not a trivial operation, as it must be symmetrical, and repeatable regardless of the order of samples presented.)  Once you’ve done this, you’ll find you have three subsets: peaks in sample 1 but not in sample 2, peaks in sample 2 but not in sample 1, and peaks common to both. (Peaks missing in both are actually important for anchoring your data set, but I won’t get into that.) If you use only the peaks common to both data sets, rather than peaks unique to one sample, you have a natural data subset ideal for normalization.
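To make the symmetry point concrete, here is a toy matcher (not the FindPeaks implementation) where two peaks “match” if their intervals overlap.  Since overlap is a symmetric relation, swapping the sample order yields the same three-way partition.  Peak coordinates are hypothetical:

```python
# Partition two peak lists into common and sample-unique subsets.
# O(n*m) brute force -- fine for a sketch, not for real peak lists.
def partition_peaks(peaks_1, peaks_2):
    common, only_1 = [], []
    for p in peaks_1:
        # (start, end) intervals overlap iff each starts before the
        # other ends.
        if any(p[0] < q[1] and q[0] < p[1] for q in peaks_2):
            common.append(p)
        else:
            only_1.append(p)
    only_2 = [q for q in peaks_2
              if not any(q[0] < p[1] and p[0] < q[1] for p in peaks_1)]
    return common, only_1, only_2

s1 = [(100, 200), (500, 600), (900, 950)]
s2 = [(150, 250), (910, 940), (2000, 2100)]
common, only_1, only_2 = partition_peaks(s1, s2)
```

A real implementation would sort the intervals and sweep, and would have to decide what to do about one peak overlapping several – which is where the “not a trivial operation” caveat comes in.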

Using this subset, you can then perform a linear regression (again, it’s not a standard linear regression, as it must be symmetrical about the regression line and not Y-axis dependent) and identify the best-fit line for the two samples.  Crucially, this regression must pass through the origin; otherwise, you haven’t found a normalization ratio.
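One symmetric fit through the origin is total least squares (perpendicular distances), which treats the two samples interchangeably – swapping the axes simply inverts the slope.  This is an illustrative sketch, not necessarily the exact regression FindPeaks uses, and the peak heights are hypothetical:

```python
import math

def tls_slope_through_origin(xs, ys):
    """Total-least-squares slope of a line through the origin.

    Minimizes perpendicular distance, so the fit is symmetric:
    swapping xs and ys returns the reciprocal slope.
    """
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

# Hypothetical heights of the peaks common to both samples.
heights_1 = [10.0, 20.0, 35.0, 50.0]
heights_2 = [21.0, 39.0, 72.0, 98.0]

slope = tls_slope_through_origin(heights_1, heights_2)      # ~2: sample 2
inv = tls_slope_through_origin(heights_2, heights_1)        # reciprocal
```

An ordinary least-squares fit would fail the symmetry requirement: regressing sample 2 on sample 1 gives a different line than the reverse, so your normalization ratio would depend on which sample you happened to call “X”.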

In any case, once all this is done, you can then use the slope of the regression line to determine the best normalization for your data sets.

The beauty of it is that you also end up with a very nice graph, which makes it easy to understand the data set you’ve compared, and you have your three subsets, each of which will be of some interest to the investigator.

(I should also note, however, that I have not expanded this method to more than two data sets, although I don’t see any reason why it could not be.  The math becomes more challenging, but the concepts don’t change.)

Regardless, the main point is simply to provide a method by which two data sets become more comparable – the method by which you compare them will dictate how you do the normalization, so what I’ve provided above is only a vague outline that should provide you with a rough guide to some of the ways you can normalize on a single trait.  If you’re asking more challenging questions, what I’ve presented above may not be sufficient for comparing your data.

Good luck!

[Edit:  Twitter user @audyyy sent me this link, which describes an alternate normalization method.  In fact, it has two steps – a pre-normalization log transform (which isn’t normalization, but it’s common; even FindPeaks 4.0 has it implemented), and then a literal “normalization” which makes the mean = 0 and the standard deviation = 1.   However, this is only applicable for one trait across multiple data sets (e.g., count of total reads for a large number of libraries).  That said, it wouldn’t be my first choice of normalization techniques.]
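For completeness, that two-step scheme (log transform, then rescale to mean 0 and standard deviation 1) looks like this; the read totals are illustrative:

```python
import math
import statistics

# One trait measured across several libraries: hypothetical total read
# counts.
total_reads = [1.2e6, 3.4e6, 2.2e6, 5.1e6, 0.9e6]

# Step 1: pre-normalization log transform.
logged = [math.log(x) for x in total_reads]

# Step 2: z-score so the trait has mean 0 and (population) sd 1.
mu = statistics.mean(logged)
sigma = statistics.pstdev(logged)
z = [(v - mu) / sigma for v in logged]
```

Note this standardizes one measurement across many libraries – it tells you nothing about how reads within two samples line up, which is why it isn’t a substitute for the ratio-based methods above.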


VanBUG: Andrew G Clark, Professor of Population Genetics, Cornell University

[My preamble to this talk is that I was fortunate enough to have had the opportunity to speak with Dr. Clark before the talk along with a group of students from the Bioinformatics Training Program.  Although asked to speak today on the subject of the 1000 Genomes work that he’s done, I was able to pose several questions to him, including “If you weren’t talking about 1000 Genomes, what would you have been speaking about instead?”  I have to admit, I had a very interesting tour of the chemistry of Drosophila mating, parental-specific gene expression in progeny and even some chicken expression.  Rarely has 45 minutes of science gone by so quickly.  Without further ado (and with great respect to Rodrigo Goya, who is speaking far too briefly – and at a ridiculous speed – on RNA-seq and alternative splicing in cancer before Dr. Clark takes the stage), here are my notes.]

Human population genomics with large sample size and full genome sequences

Talking about two projects – one sequencing a large number of genomes (1000 Genomes project), the other sequencing a very large number of samples in only 2 genes (Rare Variant studies).

The ability to predict phenotype from genotype is still limited – where is the heritability?  Simple SNPs are insufficient to figure out disease and heritability.  Perhaps rare variation is responsible.  That launched the 1000 Genomes project.

The 1000 Genomes project was looking to find variants down to 1% frequency in the population (in accessible regions).

See Nature for the pilot project publication of the 1000 Genomes project.  This included several trios (parents and child).  Found more than 15M SNPs across the human genome.  Biggest impact, however, has been on informatics – how do you deal with that large a volume of SNPs?  SNP calling, alignment, codification, etc.

Many of the standard file formats, etc., came from the 1000 Genomes groups working on that data.  Biggest issue is (of course) avoiding mapping to the wrong reference!  “High-quality mismatches” -> many false positives that failed to validate: misalignments of reads.  Read-length improvements helped keep this down, as did using the insertions found in other 1000 Genomes project subjects.

Tuning of SNP calling made a big difference.  The process with validations made a significant impact.  However, rare SNPs are still hard to call.

Novel SNPs tend to be population specific.  E.g., Yoruban vs. European samples have different patterns of SNPs.  There is a core of common SNPs, but each population has its own distribution of the rare or population-specific SNPs.

“Imputation” using haplotype information (phasing) was a key item for making sense of the different sources of the data.

Great graph of the frequency spectrum (number of variants, log scale, vs. allele frequency from 0.01 to 1) – it looks like a hockey stick lying flat.  Lots of very low-frequency SNPs, decreasing toward 1, but with a spike at 1.

>100kb from each gene there is reduced variation (e.g., near transcription start sites).

Some discussion of recombination hotspots, which were much better mapped by using the 1000 genome project data.

Another application: de novo mutation.  Identify variations in the offspring that are not found in either parent.   Roughly 1000 mutations per gamete.  ~3×10^-8 substitutions per generation.

1000 Genomes project is now expanding to 2500 samples.  Trying to distribute across 25 population groups, with 100 individuals per group.

Well, what do we expect to discover from ultra-deep sampling?

There are >3000 mutations in dystrophin (ascertained cases of muscular dystrophy – Flanagan et al., 2009, Human Mutation).

If you think of any gene, you can expect to find it mutated at every point across every population… eventually.  [Actually, I do see this in most genes, but not all… some are hyper-conserved, if I’ve interpreted it correctly.]

Major problem, though: sequencing error.  If you’re sampling billions of base pairs with a 1/100,000 error rate, you’ll still find bad base calls!

Alex Coventry: there are only 6 types of heterozygotes (CG, CT, GT, AC, AG, AT)… ancient technology, not getting into it – it was developed for Sanger sequencing.

Studied the HHEX and KCNJ11 genes, sequenced in 13,715 people.  Validated by barcoding and 454 sequencing.

Using the model from Alex’s work, you could compute a posterior probability for each SNP.  Helped in validating.  When dealing with rare variants, there isn’t a lot of information.

The punchline: “There are a lot of rare SNPs out there!”

Some data shown (site frequency) as sample data increases.  The vast majority of what you get in the long run is the rare SNPs.

Human rare variation is “in excess” of what you’d expect from classical theory.  So why are there so many variants?

The historical population was small, but underwent a recent population explosion in the last 2000 years.  This allows rapid diversity to be generated, as each new generation contributes new variants and no dramatic culls force this rare variation to consolidate.

How many excess rare variants would you expect from the population explosion?  (Gutenkunst et al., 2009, PLoS Genetics)  The population has expanded 100x in about 100 generations.  Thus, we see the core set, which was present in the population before the explosion, followed by the rapid diversification explosion of rare SNPs.

You can do age inference, then, with the frequency of SNPs.  Older SNPs must be present across more of the population.  Very few SNVs are older than 100 generations.  If you fit the population model back to the expected SNV frequency 100 generations ago, the current data fits very well.

When fitting to the effective population size of humans, you can see that we’re WAY out of equilibrium from what the common SNPs would suggest.  [I’m somewhat lost on this, actually.  Ne (parent) vs n (offspring).  I think the point is that we’ve not yet seen consolidation (coalescence?) of SNPs.]

“Theory of Multiple Mergers”  Essentially, we have a lot of branches that haven’t had the chance to blend – each node on the variation tree has a lot of unique traits (SNPs) independent of the ancestors.  (The bulk of the weight of the branch lengths is in the many many leaves at the tips of the trees.)

[If that didn’t make sense, it’s my fault – the talk is very clear, but I don’t have the population genetics vocabulary to explain this on the fly.]

What proportion of SNPs found in each newly sequenced full genome do we expect to be novel (for each human)?  “It’s a fairly large number.”  It’s about 5-7%, with outliers from 3-17%.  [I see about the same for my database, which is neat to confirm.]  Can fit this to models: a constant population size would give a low fraction (0.1%); the explosive model gives 1.4% over very large sample sizes.

Rare variants are enriched for non-synonymous and premature terminations (Marth et al , submitted) [Cool – not surprising, and very confounding if you don’t take population frequency into account in your variant discovery.]

What does this mean in complex diseases?  Many of our diseases are going to be caused by rare variants, rather than common variants.  Analogy of jets that have 4x redundancy, versus humans with 2x redundancy at the genome level.


  • The human population has exploded, and this has a huge effect on rare variation.
  • Huge samples must be sequenced to detect and test effects.
  • This will impact our studies of diseases, as we have to come to terms with the effects of rare variation.

[Great talk!  I’ve enjoyed this tremendously!]