>One lane is (still) not enough…

>After my quick post yesterday where I said one lane isn’t enough, I was asked to elaborate a bit more, if I could. Well, I don’t want to get into the details of the experiment itself, but I’m happy to jump into the “controls” a bit more in depth.

What I can tell you is that with one lane of RNA-Seq (Illumina data, 50bp), all of the variations I find show up either in a known polymorphism database or as somatic SNPs, with a few exceptions. The few exceptions just turn out to be exceptions for lack of coverage.

For a “control”, I took two data sets (from two separate patients) – each with 6 individual lanes of sequencing data. (I realize this isn’t the most robust experiment, but it shows a point.) In the perfect world, each of the 6 lanes per person would have sampled the original library equally well.

So, I matched up one lane from each patient into 6 sets and asked the question: How many transcripts are void (less than 5 tags) in one sample and at least 5x greater in the other sample? (I did this in both directions.)

The results aren’t great. In one direction, I see an average of 1245 transcripts (about 680 genes, so there’s some overlap amongst the transcript set) with a std. dev. of 38 transcripts. That sounds pretty consistent, till you look for the overlap in actual transcripts: avg. 27.3 with a std. dev. of 17.4 (range 0-60). And, when we do the calculations, the most closely matched data sets only have a 5% overlap.

The results for the opposite direction were similar: Average of 277 transcripts found that met the criteria (std.dev of 33.61), with an average overlap between data sets being 4.8, std. dev 4.48. (range of 0-11 transcripts in common.) The best overlap in “upregulated” genes for this dataset was just over 4% concordance with a second pair of lanes.
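For anyone who wants to try the same kind of comparison, here is a minimal Python sketch of the test described above. It is not the code used for these numbers. The function names, the dictionary layout (transcript ID to tag count), the choice to treat a count of 0 as 1 for the fold test, and the choice to express overlap as a fraction of the smaller list are all assumptions made for illustration.

```python
def void_vs_enriched(void_counts, other_counts, void_below=5, fold=5):
    """Return transcripts with fewer than `void_below` tags in one lane
    and at least `fold` times as many tags in the other lane."""
    hits = set()
    for transcript in set(void_counts) | set(other_counts):
        low = void_counts.get(transcript, 0)
        high = other_counts.get(transcript, 0)
        # Assumption: treat 0 tags as 1 so the fold test stays meaningful.
        if low < void_below and high >= fold * max(low, 1):
            hits.add(transcript)
    return hits


def overlap_fraction(hits_one, hits_two):
    """Shared transcripts as a fraction of the smaller hit list
    (one possible definition of overlap, assumed here)."""
    smaller = min(len(hits_one), len(hits_two))
    if smaller == 0:
        return 0.0
    return len(hits_one & hits_two) / smaller


if __name__ == "__main__":
    # Toy tag counts for one lane from each patient (made-up values).
    lane_a = {"t1": 0, "t2": 3, "t3": 50, "t4": 4}
    lane_b = {"t1": 10, "t2": 14, "t3": 300, "t4": 25, "t5": 7}
    print(sorted(void_vs_enriched(lane_a, lane_b)))
```

Running `void_vs_enriched` once per lane pair, in each direction, and then calling `overlap_fraction` on the hit lists from two different pairs gives the kind of concordance figure quoted above.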

So, what this tells me (for a VERY dirty experiment) is that expression of genes in one lane is highly variable depending on the lane for genes expressed at the low end. (Sampling at the high end is usually pretty good, so I’m not too concerned about that.)

What I haven’t answered yet is how many lanes is enough. Alas, I have to go do some volunteering, so that experiment will have to wait for another day. And, of course, the images I created along the way will have to follow later as well.

>ChIP-Seq normalization.

>I’ve spent a lot of time working on ChIP-Seq controls recently, and wanted to raise an interesting point that I haven’t seen addressed much: How to normalize well. (I don’t claim to have read ALL of the chip-seq literature, and someone may have already beaten me to the punch… but I’m not aware of anything published on this yet.)

The question of normalization occurs as soon as you raise the issue of controls or comparing any two samples. You have to take it into account when doing any type of comparison, really, so it’s somewhat important as the backbone to any good second-gen work.

The most common thing I’ve heard to date is to simply normalize by the number of tags in each data set. As far as I’m concerned, that really will only work when your data sets come from the same library, or two highly correlated samples – when nearly all of your tags come from the same locations.

However, this method fails as soon as you move into doing a null control.

Imagine you have two samples. One is your null control, with the “background” sequences in it. When you sequence, you get ~6M tags, all of which represent noise. The other is ChIP-Seq, so some background plus an enriched signal. When you sequence, hopefully you sequence 90% of your signal, and 10% of the background to get ~8M tags – of which ~0.8M are noise. When you do a comparison, the number of tags isn’t quite doing justice to the relationship between the two samples.
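To put rough numbers on that example, here is a small Python sketch of the arithmetic. The function name and the framing of both results as "factors to scale the control by" are assumptions made for illustration, and the 10% noise fraction is the hypothetical figure from the example, not something you would know in a real experiment.

```python
def control_scale_factors(control_tags, chip_tags, chip_noise_fraction):
    """Compare two ways of scaling a null control to a ChIP-Seq sample.

    Returns (tag_count_scale, background_scale):
      tag_count_scale  - scale the control so total tag counts match
      background_scale - scale the control so it matches only the
                         noise portion of the ChIP-Seq sample
    """
    tag_count_scale = chip_tags / control_tags
    chip_noise_tags = chip_tags * chip_noise_fraction
    background_scale = chip_noise_tags / control_tags
    return tag_count_scale, background_scale


if __name__ == "__main__":
    by_tags, by_background = control_scale_factors(
        control_tags=6_000_000, chip_tags=8_000_000, chip_noise_fraction=0.10
    )
    print(f"scale by tag count:  {by_tags:.3f}")
    print(f"scale by background: {by_background:.3f}")
    print(f"overestimate factor: {by_tags / by_background:.1f}")
```

With these numbers, normalizing by tag count scales the control up by about 1.33, while the background actually present in the ChIP-Seq sample is only about 0.13 of the control, so the noise is overestimated roughly tenfold.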

So what’s the real answer? Actually, I’m not sure – but I’ve come up with two different methods of doing controls in FindPeaks: One where you normalize by identifying a (symmetrical) linear regression through points that are found in both samples, the other by identifying the points that appear in both samples and summing up their peak heights. Oddly enough, they both work well, but in different scenarios. And clearly, both appear (so far) to work better than just assuming the number of tags is a good normalization ratio.
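The two ideas can be sketched in a few lines of Python. This is a toy illustration, not the FindPeaks code. Several things here are assumptions: peaks are matched by an identical position key (real peaks would need overlap matching), "symmetrical" is approximated by orthogonal regression through the origin, and all function names are made up for the example.

```python
import math


def shared_heights(sample, control):
    """Pair up peak heights for positions found in both samples.
    Inputs map a peak position to a peak height."""
    shared = sorted(set(sample) & set(control))
    return [control[p] for p in shared], [sample[p] for p in shared]


def symmetric_slope(xs, ys):
    """Slope of the orthogonal-regression line through the origin.
    Swapping xs and ys gives exactly the reciprocal slope, which is
    the sense in which the fit is symmetrical."""
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("no usable shared peaks")
    diff = syy - sxx
    return (diff + math.sqrt(diff * diff + 4 * sxy * sxy)) / (2 * sxy)


def height_sum_ratio(xs, ys):
    """Ratio of summed peak heights over the shared peaks."""
    total = sum(xs)
    if total == 0:
        raise ValueError("no usable shared peaks")
    return sum(ys) / total


if __name__ == "__main__":
    # Made-up peak heights keyed by position.
    control = {100: 1, 200: 2, 300: 3, 999: 50}
    sample = {100: 2, 200: 4, 300: 6, 555: 40}
    xs, ys = shared_heights(sample, control)
    print(symmetric_slope(xs, ys), height_sum_ratio(xs, ys))
```

Either value can then be used as the factor for scaling control peak heights before comparing them with the sample. Note that peaks found in only one of the two samples (positions 999 and 555 in the toy data) do not influence either estimate, which is the point of restricting the fit to shared locations.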

More interesting, yet, is that the normalization seems to change dramatically between chromosomes (as does the number of mapping reads), which leads you to ask why that might be. Unfortunately, I’m really not sure why it is. Why should one chromosome be over-represented in an “input dna” control?

Either way, I don’t think any of us are getting to the bottom of the rabbit hole of doing comparisons or good controls yet. On the bright side, however, we’ve come a LONG way from just assuming peak heights should fall into a nice Poisson distribution!

>On the necessity of controls

>I guess I’ve had this rant building up for a while, and it’s finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event, and determine its effects on a particular closed system. The scientific method actually demands that we do it – hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics has kind of dropped that idea entirely. At the GSC, we just didn’t bother with controls for ChIP-Seq for a long time. I can’t say I’ve even seen too many matched WTSS (RNA-SEQ) experiments for cancer/normals. And that scares me, to some extent.

With all the statistics work I’ve put into the latest version of FindPeaks, I’m finally getting a good grasp of the importance of using controls well. With the other software I’ve seen, they do a scaled comparison to calculate a P-value. That is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets… and to determining which peaks are truly valid. Without that, you may as well not be using a control.

Anyhow, that’s what prompted me to write this. As I look over the results from the new FindPeaks (3.3.3.1), both for ChIP-Seq and WTSS, I’m amazed at how much clearer my answers are, and how much better they validate compared to the non-control based runs. Of course, the tests are still not all in – but what a huge difference it makes. Real control handling (not just normalization or whatever everyone else is doing) vs. Monte Carlo show results that aren’t in the same league. The cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls that exist for my own data sets. I’ve been pushing for a decent control for my WTSS lanes – and there is matched normal for one cell line – but it’s still two months away from having the reads land on my desk… and I’m getting impatient.

Now that I’m able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it’s so much more of a challenge without an appropriate control. Besides, who’d believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I’m going to make a suggestion: if you’re still doing experiments without a control, and you want to get them published, it’s going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.

>ChIP-Seq in silico

>Yesterday I got to dish out some criticism, so it’s only fair that I take some myself, today. It came in the form of an article called “Modeling ChIP Sequencing In Silico with Applications”, by Zhengdong D. Zhang et al., PLoS Computational Biology, August 2008: 4(8).

This article is actually very cool. They’ve settled several points that have been hotly debated here at the Genome Sciences Centre, and made the case for some of the stuff I’ve been working on – and then show me a few places where I was dead wrong.

The article takes direct aim at the work done in Robertson et al., using the STAT1 transcription factor data produced in that study. Their key point is that the “FDR” used in that study was far from ideal, and that it could be significantly improved. (Something I strongly believe as well.)

For those that aren’t aware, Robertson et al. is sort of the ancestral origin of the FindPeaks software, so this particular paper is more or less aiming at the FindPeaks thresholding method. (Though I should mention that they’re comparing their results to the peaks in the publication, which used the unreleased FindPeaks 1.0 software – not the FindPeaks 2+ versions, of which I’m the author.) Despite the comparison to the not-quite current version of the software, their points are still valid, and need to be taken seriously.

Mainly, I think there are two points that stand out:

1. The null model isn’t really appropriate
2. The even distribution isn’t really appropriate.

The first, the null model, is relatively obvious – everyone has been pretty clear from the start that the null model doesn’t really work well. This model, pretty much consistent across ChIP-Seq platforms, can be paraphrased as “if my reads were all noise, what would the data look like?” This assumption is destined to fail every time – the reads we obtain aren’t all noise, and thus assuming they are as a control is really a “bad thing”(tm).

The second, the even distribution model, is equally disastrous. This can be paraphrased as “if all of my noise were evenly distributed across some portion of the chromosome, what would the data look like?” Alas, noise doesn’t distribute evenly for these experiments, so it should be fairly obvious why this is also a “bad thing”(tm).
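To make those two assumptions concrete, here is a minimal Python sketch of what an "all noise, evenly distributed" simulation looks like. It is a toy, not the method from the paper or from FindPeaks. The function names, the region size, and the read length are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_even_noise(n_reads, region_length, read_length, seed=0):
    """Scatter reads uniformly at random over a region and return the
    per-base coverage. This bakes in both assumptions: every read is
    noise, and noise lands anywhere with equal probability."""
    rng = random.Random(seed)
    coverage = [0] * region_length
    for _ in range(n_reads):
        start = rng.randrange(region_length - read_length + 1)
        for pos in range(start, start + read_length):
            coverage[pos] += 1
    return coverage


def height_histogram(coverage):
    """Count how many bases reach each coverage height."""
    return Counter(coverage)


if __name__ == "__main__":
    coverage = simulate_even_noise(n_reads=2000, region_length=100_000,
                                   read_length=36, seed=1)
    for height, bases in sorted(height_histogram(coverage).items()):
        print(height, bases)
```

A threshold derived from this kind of simulation is only as good as the two assumptions behind it. If real noise clusters in particular regions, the simulated heights will be too low and too many false peaks will pass.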

The solution presented in the paper is fairly obvious; create a full simulation for your ChIP-Seq data. Their version requires a much more rigorous process, however. They simulate a genome-space, remove areas that would be gaps or repeats in the real chromosome, then begin tweaking the genome simulation to replicate their experiment using weighted statistics collected in the ChIP-Seq experiment.

On the one hand, I really like this method, as it should give a good version of a control, whereas on the other hand, I don’t like that you need to know a lot about the genome of interest before you can analyze your ChIP-Seq data (i.e., mappability, repeat-masking, etc.). Of course, if you’re going to simulate your genome, simulate it well – I agree with that.

I don’t want to belabor the point, but this paper provides a very nice method for simulating ChIP-Seq noise in the absence of a control, as in Robertson et al. However, I think there are two things that have changed since this paper was submitted (January 2008) that should be mentioned:

1. FDR calculations haven’t stood still. Even at the GSC, we’ve been working on two separate FDR models that no longer use the null model. However, both still make even distribution assumptions, which is also not ideal.

2. I believe everyone has now acknowledged that there are several biases that can’t be accounted for in any simulation technique, and that controls are the way forward. (They’re used very successfully in QuEST, which I discussed yesterday.)

Anyhow, to summarize this paper: Zhang et al. provide a fantastic critique of the thresholding and FDR used in early ChIP-Seq papers (which is still in use today, in one form or another), and demonstrate a viable and clearly superior method for refining ChIP-Seq results without a matched control. This paper should be read by anyone working on FDRs for next-gen sequencing and ChIP-Seq software.

(Post-script: In preparation for my comprehensive exam, I’m trying to prepare critical evaluations of papers in the area of my research. I’ll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and – coming from a graduate student – shouldn’t be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism – I’m simply trying to evaluate the science in the context of the field as it stands today.)