>Questions about controls.

>After my post on controls the other day, I discovered I'd left a few details out of my blog entry. It was especially easy to come to that conclusion, since Aaron pointed it out in a comment below… Anyhow, since his questions were right on, I figured they deserved a post of their own, not just a quick reply.

There were two parts to the questions, so I’ll break up the reply into two parts. First:

One question – what exactly are you referring to as ‘control’ data? For chip-seq, sequencing the “input” ie non-enriched chromatin of your IPs makes sense, but does doubling the amount of sequencing you perform (therefore cost & analysis time) give enough of a return? eg in my first analysis we are using a normal and cancer cell line, and looking for differences between the two.

Control data, in the context I was using it, should be experiment-specific. Non-enriched chromatin is probably going to be the most common control for experiments with transcription factors, or possibly even histone modifications. Since you're most interested in the enrichment, a non-enriched control makes sense. At the GSC, we've had many conversations about what the appropriate controls are for any given experiment, and our answers have ranged from genomic DNA or non-enriched DNA all the way to stimulated/unstimulated sample pairs. The underlying premise is that the control is always the one that makes sense for the given experiment.

On that note, I suspect it’s probably worth rehashing the Robertson 2007 paper on stimulated vs. unstimulated IFN-gamma STAT1 transcription markers… hrm..

Anyhow, the second half of that question boils down to "is it really worth it to spend the money on controls?"

YES! Unequivocally YES!

Without a control, you'll never be able to get a good handle on the actual statistics behind your experiment, and you'll never be able to remove the biases in your samples (biases from the sequencing itself – and yes, they are very much present). In the tests I've been running on the latest FP code (3.3.3.1), I've been seeing what I think are incredibly promising results. If it cost 5 times as much to do controls, I'd still be pushing for them. (Of course, it's not actually my money, so take that for what it's worth.)

In any case, there's light at the end of the cost tunnel on this. Yes, it does appear that the better the control, the better the results will be – but we've also been able to "re-use" controls. In fact, having one good set of controls per species already seems to make a big contribution, so controls should be amortizable over the life of the sequencing data, and the cost should be nowhere near double for large sequencing centres. (I don't have hard data to back this up yet, but it is generally the experience I've had so far.)

To get to the second half of the question:

Also, what controls can you use for WTSS? Or do you just mean comparing two groups eg cancer vs normal?

That question is similar to the one above, and so is my answer: it depends on the question you're asking. If you want to compare against another sample (e.g., using FP 3.3.3.1's compare function), then that other sample will be your "control". I've been working with this pipeline for a few weeks now and have been incredibly impressed with how well it works over my own (VERY old 32bp) data sets.

On the other hand, if you’re actually interested in discovering new exons and new genes using this approach, you’d want to use a whole genome shotgun as your control.

As with all science, there’s no single right answer for which protocol you need to use until you’ve decided what question you want to ask.

And finally, there's one last important part of this equation: have you sequenced your control and your sample to enough depth? Even if you have controls working well, it's absolutely key to make sure your data can actually answer your question. So many questions, so little time!

>On the necessity of controls

>I guess I’ve had this rant building up for a while, and it’s finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event, and determine its effects on a particular closed system. The scientific method actually demands that we do it – hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics seems to have largely dropped that idea. At the GSC, we just didn't bother with controls for ChIP-Seq for a long time, and I can't say I've seen too many matched WTSS (RNA-Seq) experiments for cancer/normal pairs either. That scares me, to some extent.

With all the statistics work I've put into the latest version of FindPeaks, I'm finally getting a good grasp of the importance of using controls well. The other software I've seen does a scaled comparison to calculate a P-value, and that is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets… and to determining which peaks are truly valid. Without that, you may as well not be using a control.
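To make that concrete, here's a minimal sketch of the "scaled comparison" half of the story: depth-normalize the control to the sample, then test each peak's enrichment against it. To be clear, this is not the FindPeaks implementation – all the names here are invented for illustration – and real control handling also has to deal with the normalization and both-sets comparison issues above.

```java
// Hypothetical sketch of naive control handling - not FindPeaks source code.
public class ControlFilterSketch {

    /** Scale factor to make the control's depth comparable to the sample's. */
    static double scaleFactor(long sampleReads, long controlReads) {
        return (double) sampleReads / (double) controlReads;
    }

    /**
     * Keep a peak only if its height is at least minEnrichment times the
     * control coverage at the same position, after depth normalization.
     */
    static boolean passesControl(double peakHeight, double controlHeight,
                                 double scale, double minEnrichment) {
        double expected = controlHeight * scale;
        if (expected == 0) {
            return true; // no control signal at this position: the peak survives
        }
        return peakHeight / expected >= minEnrichment;
    }
}
```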

Anyhow, that's what prompted me to write this. As I look over the results from the new FindPeaks (3.3.3.1), both for ChIP-Seq and WTSS, I'm amazed at how much clearer my answers are, and how much better they validate compared to the non-control based runs. Of course, the tests are still not all in – but what a huge difference it makes. Real control handling (not just normalization, or whatever everyone else is doing) versus Monte Carlo shows results that aren't in the same league. The cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls for my own data sets. I've been pushing for a decent control for my WTSS lanes – and there is a matched normal for one cell line – but it's still two months away from having the reads land on my desk… and I'm getting impatient.

Now that I’m able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it’s so much more of a challenge without an appropriate control. Besides, who’d believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I’m going to make a suggestion: if you’re still doing experiments without a control, and you want to get them published, it’s going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.

>FindPeaks 4.0

>After much delay, and more statistics than I’d ever hoped to see in my entire life, I’m just about ready to tag FindPeaks 4.0.

There are some really cool things going on – better handling of control data, better compares, and even saturation analysis made it in. Overall, I’m really impressed with how well things are working on that front.

The one thing that I’m disappointed to have not been able to include was support for SAM/BAM files. No one is really pushing for it yet, here at the GSC, but it’s only a matter of time. Unfortunately, the integration is not trivial, and I made the executive decision to work on other things first.

Still, I can't say I'm unhappy at all – there's room for a fresh publication, if we decide to go down that route. If not, I have at least two possible publications in novel applications of the code to the biology. That can't be a bad thing.

Anyhow, I’m off to put a few final updates on the code before tagging the final 3.3.x release.

>2 weeks of neglect on my blog = great thesis progress.

>I wonder if my blogging output is inversely proportional to my progress on my thesis. I stopped writing two weeks ago for a little break, and ended up making big steps forward. The vast majority of my work went into FindPeaks, which included the following:

  • A complete, threaded saturation analysis for next-gen libraries. (See the sketch after this list.)
  • A method of comparing next-gen libraries to identify peaks that are statistically significant outliers. (It's also symmetric, unlike linear-regression-based methods.)
  • A better control method
  • A whole new way of analysing WTSS data, which gives statistically valid expression differences
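On the saturation analysis, here's the basic idea as a minimal sketch – this is illustrative only, with invented names, and the actual FindPeaks implementation (including how it threads the work) surely differs: subsample the library at increasing depths, count the distinct peaks found at each depth, and see whether the curve flattens out.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of saturation analysis - not the FindPeaks code.
// If the number of peaks stops growing as depth increases, the library
// is approaching saturation and more sequencing adds little.
public class SaturationSketch {

    interface PeakCaller {
        int countPeaks(List<Read> reads); // peaks found at this depth
    }

    static class Read { /* placeholder for an aligned read */ }

    static int[] saturationCurve(List<Read> allReads, int steps, PeakCaller caller) {
        Collections.shuffle(allReads); // random subsamples, not just the first N reads
        int[] peaksAtDepth = new int[steps];
        for (int i = 1; i <= steps; i++) {
            int depth = allReads.size() * i / steps;
            peaksAtDepth[i - 1] = caller.countPeaks(allReads.subList(0, depth));
        }
        return peaksAtDepth;
    }
}
```

In the real thing, each subsample could be processed in its own thread, since the depths are independent of each other.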

And, of course, many many other changes. Not everything is bug-free yet, but it's getting there. All that's left on my task list is debugging a couple of things in the compare mode, relating to peaks present in only one of the two libraries, and an upgrade to my FDR cutoff prediction methods. Once those are done, I think I'll be ready to push out FindPeaks 4.0. YAY!

Actually, what was surprising to me was the sheer amount of work that I’ve done on this since January. I compiled the change list since my last “quarterly report” for a project that used FindPeaks (but doesn’t support it, ironically…. why am I doing reports for them again?) and came up with 24 pages of commit messages – over 575 commits. Considering the amount of work I’ve done on my actual projects, away from FindPeaks, I think I’ve been pretty darn productive.

Yes, I’m going to take this opportunity to pat myself on the back in public for a whole 2 seconds… ok, done.

So, overall, blogging may not have been distracting me from my work – even at the height of my blogging (around AGBT), I was still getting lots done – but the past two weeks have really been a help. I'll be back to blogging all the good stuff on Monday, and I'm looking forward to doing some writing now on some of the cool things in FP4.0 that haven't made it into the manual… yet.

Anyone want some fresh ChIP-Seq results? (-;

>Multi-match reads in ChIP-Seq

>I had an interesting comment left on my blog today, which is worth taking a few minutes to write a response to:

"Hi Anthony, I just discovered your blog and it looks very interesting to me!
Since this article on Eland is now more than one year old, I was wondering
if the description at point 3 about multi matching locations is still
applicable to the Eland program in the Illumina pipeline 1.3. More in general,
would you trust the multi matching locations extracted from the multi_eland
output files to perform a repeat enrichment analysis over an experiment of
ChIP-seq? If no, why? Thank you in advance for your attention."

The first question asks about multi-matching locations – and whether the point in question (point 3) applies to the Illumina Pipeline 1.3. Since point 3 was just that the older pipeline didn't provide the locations of the multi-matched reads, I suppose it no longer really applies: I understand the new version of Eland does provide multi-match alignment information, as do other aligners such as Bowtie. However, I should also mention that since I adopted Maq as my preferred aligner, I haven't used Eland much – so it's hard for me to give an informed opinion on the quality of the matches. I simply don't know if they're any good, and I won't belabour that point. I have used Bowtie, specifically because it was able to do multi-matches, but we didn't use it for ChIP-Seq, and the multi-matches had other uses in that experiment.

So, the more interesting question is whether I’d use multi-match reads in a ChIP-Seq analysis. And, off hand, my answer has to be no. But let me explain my reasoning, and the conditions in which I would change that answer.

First, let's assume that we have single end tags, so the multi-match information is not resolvable. That means any time we have a read that maps to more than one location, we either map it to its source or we map it incorrectly. With two candidate locations, that's a 50% chance of "getting it right," and the odds only get worse as the number of locations grows: a read matching k locations has just a 1/k chance of being assigned to its correct origin. So, at best, we've got a 50-50 chance that we're not adversely affecting the outcome of the experiment. That's not great.

That said, there are things we could do to make them usable. The most widely used method in FindPeaks is the weighted fragment distribution, and we could extend that principle to weight each fragment according to the number of sites it matches, as in the sketch below. That would be… bearable. But would it significantly add to the quality of the analysis?
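Here's roughly what I mean, as a minimal sketch (illustrative only – this is not the FindPeaks implementation, and the names are invented): a read aligning to k locations contributes 1/k of a fragment at each one, so every read still adds exactly one fragment's worth of coverage in total.

```java
import java.util.List;

// Hypothetical sketch of weighting multi-match reads - not FindPeaks code.
public class MultiMatchWeighting {

    /** Adds one read's (possibly multi-match) alignments into a coverage array. */
    static void addRead(double[] coverage, List<Integer> alignmentStarts,
                        int fragmentLength) {
        double weight = 1.0 / alignmentStarts.size(); // 1/k per matched location
        for (int start : alignmentStarts) {
            int end = Math.min(start + fragmentLength, coverage.length);
            for (int pos = start; pos < end; pos++) {
                coverage[pos] += weight; // the read contributes one fragment in total
            }
        }
    }
}
```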

I'm still going to say no. Fragments we see in ChIP-Seq experiments tend to fall within 200-300bp of the regions in which the transcription factor (or other site) binds. Thus, even if we were concerned that a particular transcription factor binds primarily to similar motif regions at two sites, there should be more than enough unique sequence around each site (the motif itself is usually <30-40bp in length) to which you'll still see fragments aligning. That should compensate for the loss of the multi-match fragments.

Even more importantly, as read lengths increase, the amount of non-unique sequence decreases rapidly, making the shrinking number of multi-match reads less important. The same argument extends to paired end tags: just as longer reads reduce the number of multi-match sites, more of the multi-match reads will be resolved by pairing them with a second read, which is unlikely to fall within the same repeat region. Proportionally, one would expect the left-out reads to become a smaller and smaller segment of the population, and their contribution to matter less and less.

So, then, when would I want them? On the odd chance that you're working with very short reads, you can pull off the weighting properly, you have single end tags, and the multi-match reads make up a significant proportion of your reads, then it's worth exploring. You'd need to start asking the tough questions:

  • Did the aligner simply find that a small k-mer of the read aligned to multiple locations (and was then unable to resolve the tie by extension, the way some Eland aligners work)?
  • Does the aligner use quality scores to identify mis-alignments?
  • How reliable are the alignments (what's their error rate)?
  • What was your sample, and how divergent is it from the reference? (E.g., cancer samples have a high variation rate, which encourages many false alignments, making the alignments less reliable.)

Overall, I really don't see too many cases where you're going to gain a lot by digging into the multi-match files. That's not to say you won't find anything good in there – you probably would, if you knew where to look – but the signal-to-noise ratio is going to be pretty poor, just by definition of the fact that they're multi-match reads. You'll just have to ask if it's worth your time. For the moment, I don't think my time (even at grad student wages) is worth it. It's just not low-hanging fruit, when it comes to ChIP-Seq.

>Universal format converter for aligned reads

>Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that do useful functions. That raised the distinctly interesting possibility that they could all be tied together in one shot… and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie… and several other more obscure formats.

For the moment, the "conversion utility" forces the output to BED format, since that's the format with the least information, and I don't have to worry about unexpected information loss. The output can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned read information.)
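For illustration, this is roughly what that lowest common denominator looks like – a minimal sketch, with invented class names, of the handful of fields that survive conversion from any aligned read format:

```java
// Hypothetical minimal BED record - the real utility's classes surely differ.
public class BedRecord {
    final String chrom;  // chromosome name, e.g. "chr1"
    final int start;     // 0-based, inclusive (BED convention)
    final int end;       // 0-based, exclusive
    final String name;   // read identifier
    final char strand;   // '+' or '-'

    BedRecord(String chrom, int start, int end, String name, char strand) {
        this.chrom = chrom;
        this.start = start;
        this.end = end;
        this.name = name;
        this.strand = strand;
    }

    /** BED is tab-delimited; the score column (column 5) is just filled with 0 here. */
    String toBedLine() {
        return chrom + "\t" + start + "\t" + end + "\t" + name + "\t0\t" + strand;
    }
}
```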

But why stop there? Why not add a simple piece of functionality that lets any one format be converted to any other? Actually, there's no good reason not to, but it does come with some heavy caveats. Conversion from one format to another is relatively trivial until you hit the quality strings: since these aren't being scaled or altered, you could end up with some rather bizarre conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with it and do all the other development work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)
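To give one concrete example of the problem (a sketch under my own assumptions, not part of the utility): FASTQ-style quality strings are single characters, but the encoding offset differs between variants. Sanger-style files use Phred+33, while the recent Illumina pipelines use Phred+64 (and the older Solexa scores aren't even on the Phred scale, so they need a different formula entirely). A converter that copies the string verbatim silently corrupts the qualities:

```java
// Hypothetical example of quality string rescaling - not part of the utility.
public class QualityRescale {

    /** Converts an Illumina 1.3+ (Phred+64) quality string to Sanger (Phred+33). */
    static String illuminaToSanger(String qual) {
        StringBuilder out = new StringBuilder(qual.length());
        for (char c : qual.toCharArray()) {
            int phred = c - 64;              // recover the Phred score
            out.append((char) (phred + 33)); // re-encode with the Sanger offset
        }
        return out.toString();
    }
}
```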

Anyhow, I’ll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

>FindPeaks 3.3

>I have to admit, I'm feeling a little shy about writing on my blog, since the readership jumped in the wake of AGBT. It's one thing to write when you've got 30 people reading your blog… and quite another when there are 300. I suppose if I can't keep up the high standards I've been trying to maintain, people will stop reading it, and then I won't have anything to feel shy about… Either way, I'll keep doing the blog, because I'm enjoying the opportunity to write in a less than formal format. I do have a few other projects on the go as well, which include a few more essays on personal health and next-gen sequencing… I think I'll aim for one "well thought through essay" a week, possibly on Fridays. We'll see if I can manage to squeeze that in as a regular feature from now on.

In addition to blogging, the other thing I'm enjoying these days is the programming I'm doing in Java for FindPeaks 3.3 (which is the unstable version of FindPeaks 4.0). It's taking a lot longer to get going than I thought it would, but the efforts are starting to pay off. At this point, a full ChIP-Seq experiment (4 lanes of Illumina data + 3 lanes of control data) can be fully processed in about 4-5 minutes. That's a huge difference from the 40 minutes it would have taken with previous versions – which would have been for the sample alone.

Of course, the ChIP-seq field hasn’t stood still, so a lot of this is “catch-up” to the other applications in the field, but I think I’ve finally gotten it right with the stats. With some luck, this will be much more than just a catch-up release, though. It will probably be a few more days before I produce a 4.0 alpha, but it shouldn’t be long, now. Just a couple more bugs to squash. (-;

At any rate, in addition to the above subjects, there are certainly some interesting things going on in the lab, so I’ll need to put more time into those projects as well. As a colleague of mine said to me recently, you know you’re doing good work when you feel like you’re always in panic mode. I guess this is infinitely better than being underworked! In case anyone is looking for me, I’m the guy with his nose pressed to the monitor, fingers flying on the keyboard and the hunched shoulders. (That might not narrow it down that much, I suppose, but it’s a start…)

>EMBL Advanced Course in Analysis of Short Read Sequencing Data

>This email showed up in my mailbox today, and I figured I could pass it along. I don’t know anything about it other than what was shown below, but I thought people who read my blog for ChIP-Seq information might find it… well, informative.

I’m not sure where they got my name from. Hopefully it wasn’t someone who reads my blog and thought I needed a 1.5 day long course in ChIP-Seq! (-;

At any rate, even if I were interested, putting the workshop in Heidelberg is definitely enough to keep me from going. The flight alone would be about as long as the workshop. Anyhow, here’s the email:


Dear colleagues,

We would like to announce a new course that we will be having in June 2009 addressed to bioinformaticians with basic understanding of next generation sequencing data.

The course will cover all processing steps in the analysis of ChIP-Seq and RNA-Seq experiments: base calling, alignment, region identification, integrative bioinformatic and statistical analysis.

It will be a mix of lectures and practical computer exercises (ca. 1.5 days and 1 day, respectively).

Course name: EMBL Advanced Course in Analysis of Short Read Sequencing Data
Location: Heidelberg, Germany
Period: 08 – 10 June 2009
Website: link
Registration link: link
Registration deadline: 31 March 2009

Best wishes.

Looking forward for your participation to the course,
Adela Valceanu
Conference Officer
European Molecular Biology Laboratory
Meyerhofstr. 1
D-69117 Heidelberg

>The Future of FindPeaks

>At the end of my committee meeting last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I'm working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I've been thinking about how to proceed – and I think the answer is to work smarter on FindPeaks. (No, I'm not dropping FindPeaks development. It's just too much fun.)

For me, the amusing part of it is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first; I duplicated it by rewriting its code for the second version; then came the first round of major upgrades in version 3.1; and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I've realized that there are several major engineering changes that would make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners have changed. When I first got involved we were using Eland 0.3 (?), which was simple compared to the tools we now have available: it just aligned each fragment individually and spat out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues have mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.

At this point, I think we're coming to the next generation of biology problems – integrating FindPeaks into the wider toolset, and generating real knowledge about what's going on in the genome – and I think it's time for FindPeaks to evolve to fill that role, growing to make better use of the information available in the sorted aligner results.

Ever since the end of my exam, I haven’t been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit – so, even if I end up focussing on the cancer biology that I’ve got in front of me, I’m still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn’t currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects…. I think I see a few more late nights in the near future. (-;

>Paired End ChIP-Seq

>Ok, time for another editorial on ChIP-Seq and a very related topic. Paired-End Tags.

When I first heard about Paired End Tags (PETs), I thought they’d be the solution to all of our worries for ChIP-Seq. You’d be able to more accurately pin-point the ends of each fragment, and thus bracket binding sites more easily.

Having played with them for a bit, I'm not really sure that's the case. I don't think they do any better than the adaptive or triangle distributions that I've been using for a while, and I don't think it really matters where that other end is, in the grand scheme of things. The peaks don't seem to dramatically shift anywhere they wouldn't be otherwise, and the resolution doesn't seem to change dramatically.

I guess I just don’t see much use in using them.

Of course, what I haven’t had the chance to do is run a direct PET against a SET library to see how they compete head to head. (I’ve only played with the stats for each individually, so take my comments with a grain of salt.)

That said, people will start using PETs for ChIP-Seq, despite the increased cost and the slightly smaller number of tags. The theory goes that the gain in mappable tags will compensate for the smaller total number of tags.

That increase in mappable tags may, in fact, be the one redeeming factor here. If the background noise turns out to be full of tags that are better mapped with their paired end mate, then noise decreases, and signal increases – and that WOULD be an advantage.

Anyhow, if anyone has the data (SET + PET for one ChIP-Seq sample) to do this study, FindPeaks 3.2 now has PET support, and all the tools for doing PET work. All that's left is to get the manual together – a work in progress, which I'd be willing to accelerate if I knew someone was using it.