Normalizing reads for Next-Gen Sequencing.

This isn’t a particularly in-depth post.  Rather, I just want to touch on a few points in reply to a Twitter question asking how to normalize reads.

Actually, normalization is something I haven’t studied in great depth beyond its applications for ChIP-Seq, where – of course – it’s still an open question.  So, if my answer is incomplete, please feel free to point out other resources in the comments.  Any contributions would be welcome.

First, one can do all sorts of fancy things to perform a normalization.  Frankly, I think the term is pretty abused, so a lot of what passes for normalization is a bit sketchy.  Anything that makes two samples “equivalent” in any way is often referred to as normalization, so yeah, your mileage may vary when it comes to applying any given approach.

For RNA-Seq, I’ve heard all sorts of techniques being applied.  The most common is to simply count the number of reads in the two samples, then normalize by the ratio of reads between them.  Very crude, and often very wrong, depending on what the two samples actually are.  I don’t use this approach unless I’m feeling lazy, someone has asked me to normalize a sample, and it’s 5:40 on a Thursday evening. (I have better things to do on Thursday evenings.)
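As a concrete illustration, that lazy-Thursday version boils down to something like this (a minimal sketch; the counts and variable names are hypothetical):

```python
# Minimal sketch of total-read-count normalization (illustrative only).
reads_sample_a = 6_000_000   # total mapped reads in sample A (hypothetical)
reads_sample_b = 8_000_000   # total mapped reads in sample B (hypothetical)

# Scale sample B so its total matches sample A.
ratio = reads_sample_a / reads_sample_b

def normalize_counts(per_gene_counts_b):
    """Apply the global ratio to per-gene read counts from sample B."""
    return {gene: count * ratio for gene, count in per_gene_counts_b.items()}
```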

The second method is to bin the reads, then apply the same normalization as above.  The bins can be as large as a whole chromosome or as small as a few hundred base pairs, but the general method is the same: use some average over a collection of bins to work out the ratio, then use that ratio to force the two samples to have approximately the same total read numbers.  It’s a bit better than the above, but not a whole lot better.
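A sketch of the binned variant might look like the following; the median of per-bin ratios is just one possible choice of “average”, and the bin size is entirely up to you:

```python
from statistics import median

def binned_ratio(bin_counts_a, bin_counts_b, min_count=10):
    """Estimate a normalization ratio from per-bin read counts.

    bin_counts_a / bin_counts_b: read counts over the same bins
    (whole chromosomes, or windows of a few hundred bp) in each sample.
    Bins with too few reads in either sample are skipped.
    """
    ratios = [a / b for a, b in zip(bin_counts_a, bin_counts_b)
              if a >= min_count and b >= min_count]
    return median(ratios)
```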

The third method that I’m aware of is to use a subset of reads to determine the normalization ratio.  This is a bit better – assuming you know enough to pick a good subset.   For instance, if you know a set of housekeeping genes, you can use the total coverage over that set to approximate the relative abundance of reads and set the correct ratio.  This method can be dramatically better, if you happen to know a good set of genes (or other subset) to use, as it prevents you from comparing non-equivalent sets.
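In code, the subset idea is just the same ratio restricted to a trusted gene list (a sketch; the housekeeping list is the experiment-specific part you have to supply):

```python
def subset_ratio(coverage_a, coverage_b, housekeeping_genes):
    """Normalization ratio from total coverage over a trusted gene subset.

    coverage_a / coverage_b: dicts mapping gene name -> read coverage.
    housekeeping_genes: genes believed to be comparably expressed in both
    samples -- picking this set well is what makes the method work.
    """
    total_a = sum(coverage_a[gene] for gene in housekeeping_genes)
    total_b = sum(coverage_b[gene] for gene in housekeeping_genes)
    return total_a / total_b
```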

Just to harp on that last point: if you’re comparing a B-cell-derived and a mammary-derived cell line, you might be tempted to normalize on the total number of reads; however, it would quickly become apparent (once you look at the expressed genes) that some B-cell genes are highly expressed and swamp your data set.  By paring those out of the normalization subset, you’d find the core genes in common to be more comparable – and thus less prone to bias introduced by genes expressed in only one sample.

You’ll notice, however, that all of the methods above use a simple ratio, with increasingly better methods of approximation.  That’s pretty much par for the course, as far as I’ve seen in RNA-Seq.  It’s not ideal, but I haven’t seen much more elegance than that.

When it comes to ChIP-Seq, the same things apply – most software does some variation of the above, and many of them are still floundering around with the first two types, of which I’m not a big fan.

The version I implemented in FindPeaks 4.0 works a little differently, but can be applied to RNA-Seq just as well as to ChIP-Seq. (Yes, I’ve tried.)  The basic idea is that in ChIP-Seq you don’t actually know a common subset of house-keeping genes because, well, you aren’t looking at gene expression.  Instead, you’re looking at peaks – which can be broadly defined as any collection of reads enriched above the background.  Thus, the first step is to establish the best subset of your data to use for normalization – which you can do by peak calling your reads (hence using a peak caller).

Once you have peak-calling done for both datasets, you can match the peaks up.  (Note, this is not a trivial operation, as it must be symmetrical, and repeatable regardless of the order in which the samples are presented.)  Once you’ve done this, you’ll find you have three subsets: peaks in sample 1 but not in sample 2, peaks in sample 2 but not in sample 1, and peaks common to both. (Peaks missing in both are actually important for anchoring your data set, but I won’t get into that.)  If you only use the peaks common to both data sets, rather than peaks unique to one sample, you have a natural data subset ideal for normalization.
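One way to picture the matching step (not the actual FindPeaks code, just an overlap-based sketch that gives you the three subsets):

```python
def match_peaks(peaks_a, peaks_b):
    """Split two peak lists into (common, only_a, only_b) by overlap.

    peaks_a / peaks_b: lists of (start, end) intervals on one chromosome.
    A real implementation also has to resolve one-to-many overlaps so the
    result is identical regardless of which sample is given first.
    """
    def overlaps(p, q):
        return p[0] < q[1] and q[0] < p[1]

    common, only_a = [], []
    for p in peaks_a:
        partner = next((q for q in peaks_b if overlaps(p, q)), None)
        if partner is not None:
            common.append((p, partner))
        else:
            only_a.append(p)

    matched_b = {q for _, q in common}
    only_b = [q for q in peaks_b if q not in matched_b]
    return common, only_a, only_b
```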

Using this subset, you can then perform a linear regression (again, not a standard linear regression, as it must be symmetrical about the regression line and not Y-axis dependent) and identify the best-fit line for the two samples.  Crucially, this regression must pass through the origin; otherwise you haven’t found a normalization ratio.
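One symmetric option that satisfies both constraints is orthogonal (total least squares) regression forced through the origin. I’m not claiming this is line-for-line what FindPeaks does, but it has the required properties: swapping the two samples gives exactly the reciprocal slope, and the line passes through (0, 0).

```python
import math

def symmetric_slope_through_origin(x, y):
    """Slope b of the line y = b*x that minimizes perpendicular distances.

    Because perpendicular (not vertical) distances are minimized, the fit
    treats both axes equally: calling this with (y, x) returns exactly 1/b.
    """
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

# x and y would be the heights of the peaks common to both samples.
```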

In any case, once all this is done, you can then use the slope of the regression line to determine the best normalization for your data sets.

The beauty of it is that you also end up with a very nice graph, which makes it easy to understand the data sets you’ve compared, and you get your three subsets, each of which will be of some interest to the investigator.

(I should also note, however, that I have not expanded this method to more than two data sets, although I don’t see any reason why it could not be.  The math becomes more challenging, but the concepts don’t change.)

Regardless, the main point is simply to provide a method by which two data sets become more comparable.  The way you intend to compare them will dictate how you do the normalization, so what I’ve provided above is only a rough guide to some of the ways you can normalize on a single trait.  If you’re asking more challenging questions, what I’ve presented above may not be sufficient for comparing your data.

Good luck!

[Edit:  Twitter user @audyyy sent me this link, which describes an alternate normalization method.  In fact, it has two steps – a pre-normalization log transform (which isn’t normalization, but it’s common; even FindPeaks 4.0 has it implemented), and then a literal “normalization” which sets the mean to 0 and the standard deviation to 1.   However, this is only applicable for one trait across multiple data sets (e.g., the total read count for a large number of libraries).  That said, it wouldn’t be my first choice of normalization techniques.]
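For what it’s worth, those two steps amount to something like the following sketch (a log transform, then standardizing to mean 0 and standard deviation 1; not their exact code):

```python
import math
from statistics import mean, stdev

def log_then_standardize(values, pseudocount=1.0):
    """Log-transform one trait measured across many libraries,
    then rescale so the values have mean 0 and standard deviation 1."""
    logged = [math.log(v + pseudocount) for v in values]
    mu, sd = mean(logged), stdev(logged)
    return [(v - mu) / sd for v in logged]

# e.g., values = total read counts, one per library.
```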


Julie Chen – University of Toronto

It seems to be University of Toronto week on my blog.  Today, Julie Chih-yu Chen is visiting to give a talk titled:

Identifying tissue specific distal regulatory sequences in the mouse genome.

Enhancer Identification in Mouse Embryonic Stem Cells

(A last-second change)

which, from all indications, is ChIP-Seq related.  Julie is currently wrapping up a master’s degree in the Mitchell lab at U of T; she has done a lot of coding work in the past, but has been working on more biological questions recently.

I was also fortunate enough to be invited to lunch with Julie before the talk, where I could ask a few questions and confirm that she would be happy to have her talk blogged.

And now, on with the talk.

—–

Distal regulatory elements – non-coding elements.  Histone modifications are found to be tissue specific at enhancers, rather than at promoters and insulators.  Over 40% of peaks for several transcription factors are in intergenic regions (more than 10 kb from the TSS).

Due to the folding of DNA, enhancers that are not sequentially adjacent can drive transcription by folding the DNA to become proximal, affecting expression in ways that would be expected from closer elements.  (Carter 2002)

Examples: thalassaemias result from deletions or rearrangements of beta-globin gene (HBB) enhancers 50 kb upstream.  SHH enhancer mutations in mice, 1 Mb upstream, can cause severely shortened limbs.

How do we find these?  ChIP-Seq or ChIP-chip technology can be used to identify binding sites, and that information can be used to identify binding motifs.

We can also use other methods to enhance this analysis:  high-throughput sequencing data for TFs, p300, histone methylation,  or you can use annotations from comparative genomics, such as highly conserved regions.

Motivation:  identify significant markers at known enhancers, predict enhancer regions, identify TFs potentially regulating the cell type from motif analysis.

Training data and features -extended sets of known enhancer positives and negatives.

Illustration of an example at a TSS, showing that there is frequent activity of different sorts all at the same location.

Method: binning in 1 kb increments. [Gah! Another binning method!]  Something about input reads +3 for the control…  missed the detail, though.

Feature extraction improves enhancer prediction….  A classifier is used for cross-validation assessment.  [Not sure how it works.]

Use a maximum penalized likelihood, with three classes: positive, negative and unknown.  As lambda decreases, you can see some classifiers become more important than others.  This gives you a signature that can be exploited to identify enhancers.

Enhancer candidates are located further from the TSS compared to promoter-like regions.  There is a distinct distribution for positive, negative and unknown types: negative is closer, positive is a bit further (about 10 kb), and unknown is even further away.

When working with enhancers, each is assigned to the closest downstream gene.

Trend: genes with (predicted) enhancers have higher expression compared to genes without enhancers.

20% of the top enhancer regions are located near genes encoding transcription factors. [Ok, that’s neat.]  The top 2000 highest-ranked are enriched in a small number of functions [by GO terms?]

Previously identified and validated enhancers from Jothi 2008 (SISSRs) were used for comparison.   The method compares well, and a few new ones were identified.

Can identify important functional regions… but is it cell type specific?  [A bit lost for a minute – the graphs aren’t well labeled, so I’m somewhat puzzled as to what’s coming out of the Venn diagrams.]

When comparing enhancers from embryonic stem cells, you find more overlap with other data sets also done in embryonic stem cells, as opposed to other cell types, which suggests that the TF networks are cell-type specific.

Various other TF enhancers are identified from this data set, which can be compared with known TF expression to identify which ones are already known to be utilized by mouse ESCs.  Good concordance observed.

Summary:

Ranked enhancer signatures.

Enhancer candidates: coupled with promoter-like regions, they increase expression of nearby genes.   They overlap significantly with multiple transcribed loci, potentially regulate genes encoding TFs, and are tissue specific, overlapping with active histone marks.

Identified known and novel TFs in mESC with motif enrichment analysis of enhancers.

Future work:

It is worth noting that some enhancers can interact with insulators, or can interact with different genes other than the closest.  Other mechanisms may be possible.

[Overall, not a bad talk – and very bioinformatics-ish.  That is to say, it could have been a little heavier on either the algorithm or the biology; it seemed aimed to appeal to both audiences, but may not have been detailed enough for either.  However, that’s pretty typical for the field, and isn’t a criticism.

I think it’s also clear that the biology, in this case,  is now being driven by the development of novel algorithms – and that this is a valid approach to gain insight into the discovery of enhancer biology.  It’s a great initial foray into the topic, but the data itself shows that there is still lots more to learn about how everything integrates at the molecular level.]

ChIP-Seq normalization.

I’ve spent a lot of time working on ChIP-Seq controls recently, and wanted to raise an interesting point that I haven’t seen addressed much: how to normalize well. (I don’t claim to have read ALL of the ChIP-Seq literature, and someone may have already beaten me to the punch… but I’m not aware of anything published on this yet.)

The question of normalization occurs as soon as you raise the issue of controls or of comparing any two samples. You have to take it into account when doing any type of comparison, really, so it’s fairly important as the backbone of any good second-gen work.

The most common thing I’ve heard to date is to simply normalize by the number of tags in each data set. As far as I’m concerned, that really will only work when your data sets come from the same library, or two highly correlated samples – when nearly all of your tags come from the same locations.

However, this method fails as soon as you move into doing a null control.

Imagine you have two samples: one is your null control, containing the “background” sequences. When you sequence it, you get ~6M tags, all of which represent noise. The other is ChIP-Seq, so some background plus an enriched signal. When you sequence it, hopefully you sequence 90% of your signal and 10% of the background to get ~8M tags – of which ~0.8M are noise. When you do a comparison, the raw number of tags doesn’t do justice to the relationship between the two samples.

So what’s the real answer? Actually, I’m not sure – but I’ve come up with two different methods of doing controls in FindPeaks: one normalizes by fitting a (symmetrical) linear regression through points that are found in both samples; the other identifies the points that appear in both samples and sums up their peak heights. Oddly enough, they both work well, but in different scenarios. And clearly, both appear (so far) to work better than just assuming the number of tags is a good normalization ratio.
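The second of those two reduces to something very simple once the common points have been identified (a sketch; the actual FindPeaks code is more involved):

```python
def height_sum_ratio(common_peaks):
    """Normalization ratio from peaks found in both samples.

    common_peaks: list of (height_in_sample_1, height_in_sample_2) pairs,
    one per peak that appears in both data sets.
    """
    total_1 = sum(h1 for h1, _ in common_peaks)
    total_2 = sum(h2 for _, h2 in common_peaks)
    return total_1 / total_2
```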

More interesting yet is that the normalization seems to change dramatically between chromosomes (as does the number of mapped reads), which leads you to ask why that might be. Unfortunately, I’m really not sure. Why should one chromosome be over-represented in an “input DNA” control?

Either way, I don’t think any of us are getting to the bottom of the rabbit hole of doing comparisons or good controls yet. On the bright side, however, we’ve come a LONG way from just assuming peak heights should fall into a nice Poisson distribution!

New ChIP-Seq control

Ok, so I’ve finally implemented and debugged a second type of control in FindPeaks… It’s different, and it seems to be more sensitive, requiring fewer assumptions to be made about the data set itself.

What it needs, now, is some testing. Is anyone out there willing to try a novel form of control on a dataset that they have? (I won’t promise it’s flawless, but hey, it’s open source, and I’m willing to bug fix anything people find.)

If you do, let me know, and I’ll tell you how to activate it. Let the testing begin!

Why peak calling is painful.

In discussing my work, I’m often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: it’s trivial. Peak calling itself isn’t hard. However, there are plenty of pitfalls that can surprise the unwary. (I’ve found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful – masking the triviality of the situation.

In reality, the three most frustrating things about peak calling are:

  1. Maintaining the software
  2. Peak calling without unlimited resources (e.g., 64 GB of RAM)
  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it’s really easy to fire up a “one-off” script. Anyone can write something that can be tossed aside when they’re done with it – but code re-use and recycling is a skill. (And an important one.) Writing your peak finder to be modular is a lot of work, and a huge investment of time is required to keep the modules in good shape as the code grows. A good example of why this is important can be illustrated with file formats. Since the first version of FindPeaks, we’ve transitioned through two versions of Eland output, Maq’s .map format and now on to SAM and BAM (not excluding BED, GFF, and several other more or less obscure formats). In each case, we’ve been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added quite rapidly by Tim with only a few hours of investment. That wouldn’t have been possible without the massive upfront investment in good modularity.
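To make the modularity point concrete, the idea is roughly the following (a Python-flavoured sketch; FindPeaks itself is written in Java, and these class names are invented for illustration):

```python
from abc import ABC, abstractmethod

class AlignedRead:
    """Minimal read record that every format module produces."""
    def __init__(self, chrom, start, end, strand):
        self.chrom, self.start, self.end, self.strand = chrom, start, end, strand

class ReadIterator(ABC):
    """Each file format gets its own iterator; the peak calling, comparison
    and normalization code only ever see AlignedRead objects."""
    @abstractmethod
    def __iter__(self):
        """Yield AlignedRead objects in sorted order."""

class SamIterator(ReadIterator):
    """Very simplified SAM parser (ignores the CIGAR when computing the end)."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:
                if line.startswith("@"):
                    continue  # skip header lines
                fields = line.rstrip("\n").split("\t")
                chrom, pos, seq = fields[2], int(fields[3]), fields[9]
                strand = "-" if int(fields[1]) & 16 else "+"
                yield AlignedRead(chrom, pos, pos + len(seq), strand)

# Supporting Eland, Maq, BED or BAM then means writing another ReadIterator,
# without touching anything downstream.
```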

The second pitfall is memory consumption – and this is somewhat more technical. When dealing with sequencing reads, you’re faced with a choice: either you sort the reads and then move along them one at a time, determining where they land – OR – you pre-load all the reads, then move along the chromosome. The first model takes very little memory, but requires a significant amount of pre-processing, which I’ll come back to in a moment. The second requires much less CPU time – but is intensely memory thirsty.

If you want to visualize this, the first method is to organize all of your reads by position, then walk down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second is to build a model of the chromosome, much like a “pileup” file, which can then be processed however you like. (This is how I do SNP calling.) In theory, it shouldn’t matter which one you use, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory structures you use, whereas the second is pretty much determined by the size of the chromosomes you’re using (multiplied by a constant that also depends on the structures you use).
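A bare-bones version of the first (sorted, moving-window) model looks something like this; it’s a sketch of the general pattern, not the actual FindPeaks internals:

```python
from collections import deque

def window_depths(sorted_reads, window=200):
    """Walk reads sorted by start position and report, at each read start,
    how many reads are still 'active' within the trailing window.

    Memory use is bounded by the number of reads inside one window,
    not by the length of the chromosome.
    """
    active = deque()
    for start, end in sorted_reads:
        active.append((start, end))
        # Discard reads that ended before the current window begins.
        while active and active[0][1] < start - window:
            active.popleft()
        yield start, len(active)
```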

Unfortunately, using the first method isn’t always as easy as you might expect. For instance, when doing alignments against transcriptomes (or with indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads and process them individually – which works well when correctly sorted. Unfortunately, new formats no longer allow that: using a “pre-sorted” BAM/SAM file, you can now find multi-part reads, but there’s no real option of pre-fragmenting those reads and re-sorting them. Thus, FindPeaks now has an additional layer that must read ahead and buffer SAM reads in order to make sure that the next one returned is in the correct order. (You can get odd bugs otherwise, and yes, there are many other potential solutions.)
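The gapped-read handling amounts to something like the sketch below: split each alignment into its blocks, and hold fragments in a small buffer until no earlier-starting read can still arrive. The CIGAR-style representation and helper names here are invented for illustration.

```python
import heapq

def fragment(read):
    """Split a gapped alignment into its aligned blocks.

    read: (chrom, start, cigar), where cigar is a list such as
    [(50, 'M'), (300, 'N'), (25, 'M')] -- two matches separated by a gap.
    """
    chrom, pos, cigar = read
    for length, op in cigar:
        if op == 'M':
            yield (chrom, pos, pos + length)
        if op in ('M', 'N', 'D'):
            pos += length

def fragments_in_order(reads_sorted_by_start):
    """Re-emit fragments in sorted order using a read-ahead buffer (one chromosome).

    A fragment can start far downstream of its parent read, so fragments are
    held in a heap and released only once no unseen read can start before them.
    """
    pending = []
    for read in reads_sorted_by_start:
        for frag in fragment(read):
            heapq.heappush(pending, (frag[1], frag))
        while pending and pending[0][0] < read[1]:
            yield heapq.heappop(pending)[1]
    while pending:
        yield heapq.heappop(pending)[1]
```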

Moving along to the last pitfall: the one thing people want out of a peak finder is that it does the latest and greatest methods – and does them ahead of everyone else. That on its own is a near-impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they’re not. For a group of 30 people, that’s probably not too hard, but for academic peak callers, that can be a challenge – particularly since every user wants something subtly different from the next.

So, when people ask how hard it is to write their own peak caller, that’s the answer I give: It’s trivial – but a lot of hard work. It’s rewarding, educational and cool, but it’s a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;

New repository of second generation software

I finally have a good resource for locating second-gen (next-gen) sequencing analysis software. For a long time, people have just been collecting it on a single thread in the bioinformatics section of the SeqAnswers.com forum; however, the brilliant people at SeqAnswers have spun off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers

Science Cartoons – 3

I wasn’t going to do more than one comic a day, but since I just published this one in the FindPeaks 4.0 manual today, I may as well put it here too and kill two birds with one stone.

Just to clarify, under copyright law you can certainly re-use my images for teaching purposes or your own private use (that’s generally called “fair use” in the US, and copyright laws in most countries have similar exceptions), but you can’t publish them, take credit for them, or profit from them without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don’t steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

Can’t we use ChIP-chip controls on *-Seq?

Thanks to Nicholas, who left this comment on my web page this morning, in reference to my post on controls in second-gen sequencing:

Hi Anthony,

Don't you think that controls used for microarrays (expression and ChIP-chip) are well established and that we could use these controls with NGS?

Cheers!

I think this is a valid question, and one that should be addressed. My committee asked me the same thing during my comprehensive exam, so I’ve had a chance to think about it. Unfortunately, I’m not a statistics expert, or a ChIP-chip expert, so I would really value other people’s opinion on the matter.

Anyhow, I think the answer has to be put in perspective: Yes, we can learn from ChIP-chip and Arrays for the statistics that are being used, but no, they’re not directly applicable.

Both ChIP-chip and array experiments are based on hybridization to a probe – which makes them cheap and reasonably reliable. Unfortunately, it also leads to a much lower dynamic range, since they saturate at the high end and can be undetectable at the low end of the spectrum. This alone should be a key difference. What signal would be detected from a single hybridization event on a microarray?

Additionally, the resolution of a ChIP-chip probe is vastly different from that of a sequencing read. In ChIP-Seq or RNA-Seq, we can get unique signals for sequences with start locations only one base apart, which should then be interpreted differently. With ChIP-chip, the resolution is closer to 400 bp windows, and thus the statistics take that into account.

Another reason why I think the statistics are vastly different is the way we handle the data when setting up an experiment. With arrays, you repeat the same experiment several times and then use those data as several repeats of the same experiment, in order to quantify the variability (deviation and error) between the repeats. With second-generation sequencing, we pool the results from several different lanes, meaning we effectively always have N=1 in our statistical analysis.

So, yes, I think we can learn from other methods of statistical analysis, but we can’t blindly apply the statistics from ChIP-chip and assume they’ll correctly interpret our results. The more statistics I learn, the more I realize how many assumptions go into each method – and how much more work it is to get the statistics right for each type of experiment.

At any rate, these are the three most compelling reasons that I have, but they certainly aren’t the only ones. If anyone would like to add more reasons, or tell me why I’m wrong, please feel free to add a comment!

Rumours…

Apparently there’s a rumour going around that I think FindPeaks isn’t a good software package, and that I’ve said so on my blog. I don’t know who started this, or where it came from, but I would like to set the record straight.

I think FindPeaks is one of the best ChIP-Seq packages currently available. Yes, I would like to have better documentation; yes, there are occasionally bugs in it; and yes, the fact that I’m the only full-time developer on the project is a detriment…. But those are only complaints, and they don’t diminish the fact that I think FindPeaks can out-peak-call, out-control, out-compare…. and will outlast, outwit and outplay all of its opponents. (=

FindPeaks is actually a VERY well written piece of code, and has the flexibility to do SO much more than just ChIP-Seq. If I had my way, I’d have another developer or two working with the code to expand on the many really cool features it already has.

If people are considering using it, I’m always happy to help get them started, and I’m thrilled to add more people to the project if they would like to contribute.

So, just to make things clear – I think FindPeaks (and the rest of the Vancouver Short Read Analysis Package) are great pieces of software, and are truly worth using. If you don’t believe me, send me an email, and I’ll help make you a believer too. (=