>Can’t we use ChIP-chip controls on *-Seq?

>Thanks to Nicholas, who left this comment on my web page this morning, in reference to my post on controls in Second-Gen Sequencing:

Hi Anthony,

Don't you think that controls used for microarray (expression
and ChIP-chip) are well established and that we could use
these controls with NGS?


I think this is a valid question, and one that should be addressed. My committee asked me the same thing during my comprehensive exam, so I’ve had a chance to think about it. Unfortunately, I’m not a statistics expert, or a ChIP-chip expert, so I would really value other people’s opinion on the matter.

Anyhow, I think the answer has to be put in perspective: Yes, we can learn from ChIP-chip and Arrays for the statistics that are being used, but no, they’re not directly applicable.

Both ChIP-chip and array experiments are based on hybridization to a probe – which makes them cheap and reasonably reliable. Unfortunately, it also leads to a much lower dynamic range, since they saturate out at the high end, and can be undetectable at the low end of the spectrum. This alone should be a key difference. What signal would be detected from a single hybridization event on a micro-array?

Additionally, the resolution of a ChIP-chip probe is vastly different from that of a sequencing reaction. In ChIP-Seq or RNA-Seq, we can get unique signals for sequences whose start locations differ by only one base, which should then be interpreted differently. With ChIP-chip, the resolution is closer to 400bp windows, and thus the statistics take that into account.
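To make the resolution point concrete, here is a minimal Python sketch. It isn't taken from any ChIP-chip or ChIP-Seq package; the read positions are made up, and the 400bp window is just the rough figure quoted above. It counts the same read starts at single-base resolution and again after binning into array-like windows.

```python
from collections import Counter

def per_base_counts(read_starts):
    """Count reads by exact start position (single-base resolution)."""
    return Counter(read_starts)

def windowed_counts(read_starts, window=400):
    """Count reads per fixed-size window, roughly mimicking probe-level resolution."""
    return Counter(start // window for start in read_starts)

# Hypothetical read start positions on one chromosome.
starts = [1000, 1001, 1001, 1150, 1399, 1600]
base = per_base_counts(starts)
win = windowed_counts(starts)

print(base[1000], base[1001])  # separate signals for starts one base apart
print(dict(win))               # the same reads collapsed into a few windows
```

Positions 1000 and 1001 stay distinct in the first view, while in the second they are lumped together with anything else that lands in the same window.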

Another reason why I think the statistics are vastly different is because of the way we handle the data itself, when setting up an experiment. With arrays, you repeat the same experiment several times, and then use that data as several repeats of the same experiment, in order to quantify the variability (deviation and error) between the repeats. With second-generation sequencing, we pool the results from several different lanes, meaning we always have N=1 in our statistical analysis.
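A small sketch of the N=1 problem, using invented numbers: with replicate arrays you can estimate a standard deviation for a region, but once the lanes are pooled into a single count there is nothing left to estimate it from. The helper name `spread` is my own, purely for illustration.

```python
import statistics

def spread(values):
    """Return the sample standard deviation, or None when there are fewer than 2 values."""
    return statistics.stdev(values) if len(values) >= 2 else None

# Hypothetical signal for one region across three replicate arrays.
array_replicates = [10.2, 11.1, 9.7]

# Hypothetical read counts for the same region in three sequencing lanes.
lane_counts = [34, 41, 29]
pooled = sum(lane_counts)  # pooling leaves a single observation: N = 1

print(spread(array_replicates))  # variability can be estimated
print(spread([pooled]))          # None: no variability estimate from one pooled value
```

Keeping the per-lane counts separate would, of course, give some handle on lane-to-lane variation, but that is not the same thing as repeating the experiment.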

So, yes, I think we can learn from other methods of statistical analysis, but we can’t blindly apply the statistics from ChIP-chip and assume they’ll correctly interpret our results. The more statistics I learn, the more I realize how many assumptions go into each method – and how much more work it is to get the statistics right for each type of experiment.

At any rate, these are the three most compelling reasons that I have, but certainly aren’t the only ones. If anyone would like to add more reasons, or tell me why I’m wrong, please feel free to add a comment!

>On the necessity of controls

>I guess I’ve had this rant building up for a while, and it’s finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event, and determine its effects on a particular closed system. The scientific method actually demands that we do it – hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics has kind of dropped that idea entirely. At the GSC, we just didn’t bother with controls for ChIP-Seq for a long time. I can’t say I’ve even seen too many matched WTSS (RNA-SEQ) experiments for cancer/normals. And that scares me, to some extent.

With all the statistics work I’ve put into the latest version of FindPeaks, I’m finally getting a good grasp of the importance of using controls well. The other software I’ve seen does a scaled comparison to calculate a P-value. That is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets… and to determining which peaks are truly valid. Without that, you may as well not be using a control.
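For readers who haven’t seen one, a “scaled comparison” can be sketched roughly as follows. This is a generic illustration under my own assumptions (a Poisson model, and scaling by total read count); it is not taken from FindPeaks or from any other package, and all of the function and parameter names are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail.

    Adequate for the small expected counts used here; exp(-lam) underflows
    for very large lam, so a statistics library is a better choice for real work.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def scaled_control_pvalue(sample_count, control_count, sample_total, control_total):
    """P-value for a region's sample count, given the control count scaled to sample depth."""
    scale = sample_total / control_total
    expected = max(control_count * scale, 1e-9)  # avoid a zero expectation
    return poisson_sf(sample_count, expected)

# Hypothetical region: 30 sample reads vs 5 control reads, sample sequenced twice as deep.
print(scaled_control_pvalue(30, 5, 2_000_000, 1_000_000))
```

Note what this leaves out: it says nothing about how the two sets should be normalized beyond read depth, or about peaks that show up in both, which is exactly the other half of the story.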

Anyhow, that’s what prompted me to write this. As I look over the results from the new FindPeaks, both for ChIP-Seq and WTSS, I’m amazed at how much clearer my answers are, and how much better they validate compared to the non-control based runs. Of course, the tests are still not all in – but what a huge difference it makes. Real control handling (not just normalization or whatever everyone else is doing) and Monte Carlo give results that aren’t in the same league. The cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.
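For comparison, here is roughly what a Monte Carlo cutoff looks like when there is no control. This is a generic sketch under simplifying assumptions (reads dropped uniformly at random on a toy genome, a fixed read length), not the FindPeaks implementation; the function names and defaults are hypothetical.

```python
import random

def max_pileup(starts, read_len, genome_len):
    """Highest coverage reached by reads of length read_len placed at the given starts."""
    diff = [0] * (genome_len + read_len + 1)
    for s in starts:
        diff[s] += 1             # coverage rises where a read begins
        diff[s + read_len] -= 1  # and falls just past where it ends
    best = depth = 0
    for d in diff:
        depth += d
        best = max(best, depth)
    return best

def monte_carlo_height_cutoff(n_reads, read_len, genome_len,
                              trials=200, quantile=0.95, seed=0):
    """Peak height that random placement stays at or below in about `quantile` of trials."""
    rng = random.Random(seed)
    heights = sorted(
        max_pileup([rng.randrange(genome_len) for _ in range(n_reads)],
                   read_len, genome_len)
        for _ in range(trials)
    )
    return heights[min(trials - 1, int(quantile * trials))]

# Toy example: 500 reads of 30 bases on a 10 kb "genome".
print(monte_carlo_height_cutoff(500, 30, 10_000, trials=50))
```

The weakness is baked into the assumption: a uniform random background knows nothing about regions that pile up reads in every sample, which is something only a real control can tell you.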

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls that exist for my own data sets. I’ve been pushing for a decent control for my WTSS lanes – and there is matched normal for one cell line – but it’s still two months away from having the reads land on my desk… and I’m getting impatient.

Now that I’m able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it’s so much more of a challenge without an appropriate control. Besides, who’d believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I’m going to make a suggestion: if you’re still doing experiments without a control, and you want to get them published, it’s going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.

>I went to see a talk by Dr. Irmtraud Meyer yesterday afternoon, over on the UBC campus. I haven’t been down that way for a long time – and it was just as gloomy in the rain as it was when I did my master’s there. (The bright red trees do make a nice contrast in the fall, however.)

The title of the talk was “Investigating the Transcriptome of Higher Eukaryotes”, which had me fooled into thinking it would be directly relevant to the work I’m doing on human transcriptomes. Alas, I was wrong. However, it was related to some work in which I was involved during my master’s degree. Oddly enough, that was a course project that just turned out well, and provided the group with a publication on inverse RNA folding, and sent one of the brighter grad students down the path of RNA work.

As I said, Dr. Meyer’s work was quite interesting, and – in a strange way – turned out to be relevant after all. As a person working on transcriptomes, I tend to have the view that RNA are linear bits of sequence, which cells produce as part of the pathway of producing proteins.

[Figure: central paradigm of molecular biology - transcription and translation]

That is the classical view of mRNA – and we tend not to stop and re-think it. However, that’s exactly what I ended up doing yesterday.

Two interesting bits of information came up that I knew in general, but hadn’t really processed:

  • 80% of transcribed sequence corresponds to unannotated regions (Science, 2005, 308:1149-1154)
  • 40-65% of known mammalian genes are alternatively spliced (Science, 2005, 309:1559-1563)

And then, there’s ample evidence that RNA folding is involved in alternative splicing… well. Suddenly it’s hard to think of those little RNA sequences as linear strings – it’s hard even to think of them outside of the normal context of transcription and translation. Yet, we have tRNA, mRNA and even miRNA! Clearly transcriptomes aren’t the simple model that we perceive them to be in genomics.

While it’s nice to have a clearer picture on what’s going on at the molecular level, I don’t really know how to apply this information. I can’t use it to analyze the transcriptomes I work with, and I can’t use it to deal with alternative splicing that I see. I can’t even figure out what all those splice sites are, yet, but eventually, this information will have to be integrated into our annotations. miRNA and “junk” RNA all probably have meaning, which we just don’t understand at that level, yet.

Just a few more things to work on in the future, I suppose.

>I’ve spent the last week madly putting together a poster for the “Reasons for Hope 2008” conference this past weekend, which focuses on breast cancer science, treatment and quality of life research. So, you’ll notice (shortly), a new poster in my poster section. It was an educational experience, and I must admit I learned a lot. Not so much in the areas that I need to learn for my own research, but about physiology, psychology and general health research. And that’s even considering how few talks I went to!

Still, I highly recommend dropping into talks that aren’t in your field, on occasion. I try to make a habit of it, which included a pathology lecture just before xmas, last year, and this time, I learned a lot about mammography, and new techniques for mammography that are up and coming. Neither is really a practical skill for a bioinformatician, but it gives me a good idea of where the samples I’ll be dealing with come from. Nifty.

Anyhow, I had a few minutes to revisit my ChIP-Seq code, FindPeaks, and do a few things I’d been hoping to do for a while. I got around to reducing the memory requirement – going from about 4GB of RAM for a 12M+ read run down to under 1GB. (I’d discussed this before in another posting.) The other thing I did was to re-write the core peak-finding algorithm. It was something I’d known was not optimal for a while, but re-implementing a core routine isn’t something you do without a lot of thought. The good news: it runs about 2x as fast, scales better on multiple cores and is guaranteed not to produce the type of bugs that have been relatively common in early versions of FindPeaks.

Having invested the 2 hours to do it, I’m very glad to see it provide some return. Since my next project is to clean up the Transcripter code (for whole transcriptome shotgun sequencing), this was a nice lesson in coding: if you find a problem, don’t patch the problem: solve it. I think I have a lot of “solving” to do. (-;

For those of you who are interested, the next version of FindPeaks will be released once I can include support for the SRF files – hopefully the end of the week.

>Genomics Forum 2008

>You can probably guess what this post is about from the title – which means I still haven’t gotten around to writing an entry on thresholding for ChIP-Seq. Actually, it’s probably a good thing I haven’t, as we’ve been learning a lot about thresholding in the past week. It seems many things we took for granted aren’t really the case. Anyhow, I’m not going to say too much about that, as I plan to collect my thoughts and discuss it in a later entry.

Instead, I’d like to discuss the 2008 Genomics Forum, sponsored by Genome BC, which took place on Friday – though, in particular, I’m going to focus on one talk, near to my own research. Dr. Barbara Wold from Caltech gave the first of the science talks, and focussed heavily on ChIP-Seq and Whole Transcriptome Shotgun Sequencing (WTSS). But before I get to that, I wanted to mention a few other things.

The first is that Genome BC took a few minutes to announce a really neat funding competition, which really impressed me, the Genome BC Science Opportunities Fund. (There’s nothing up on the web page yet, but if you google for it, you’ll come across the agenda for Friday’s forum in which it’s mentioned – I’m sure more will appear soon.) Its whole premise revolves around the question: “Are there experiments that we need to be doing, that are of strategic importance to the BC life science community?” I take that to mean, are there projects that we can’t afford not to undertake, that we wouldn’t have the funding to do otherwise? I find that to be very flexible, and very non-academic in nature – but quite neat. I hope the funding competition goes well, and I’m looking forward to seeing what they think falls into the “must do” category.

The second was the surprising demand for Bioinformaticians. I’m aware of several jobs for bioinformaticians with experience in next-gen sequencing, but the surprise to me was the number of times (5) I heard people mention that they were actively recruiting. If anyone with next-gen experience is out there looking for a job (post-doc, full time or grad student), drop me a note, and I can probably point you in the right direction.

The third was one of the afternoon talks, on journalism in science, from the perspective of traditional newspaper/TV journalists. It seems so foreign to me, yet the talk touched on several interesting points, including the fact that journalists are struggling to come to terms with “new media.” (… which doesn’t seem particularly new to those of us who have been using the net since the 90’s, but I digress.) That gave me several ideas about things I can do with my blog, to bring it out of the simple text format I use now. I guess even those of us who live/breathe/sleep internet don’t do a great job of harnessing its power for communicating effectively. Food for thought.

Ok… so on to the main topic of tonight’s blog: Dr. Wold’s talk.

Dr. Wold spoke at length on two topics, ChIP-Seq and Whole Transcriptome Shotgun Sequencing. Since these are the two subjects I’m actively working on, I was obviously very interested in hearing what she had to say, though I’ll comment more on the ChIP-Seq side of things.

One of the great open questions at the Genome Sciences Centre has been how to do an effective control for a ChIP-Seq experiment. It’s not something we’ve done much of, in the past, but the Wold lab demonstrated why they’re necessary, and how to do them well. It seems that ChIP-Seq experiments tend to yield fragments in several genomic regions that have nothing to do with the antibody or experiment itself. The educated guess is that these are caused by hypersensitive sites in the genome that tend to fragment in repeatable patterns, giving rise to peaks that appear in all samples. Indeed, I spent a good portion of this past week talking about observations of peaks exactly like that, and how to “filter” them out of the ChIP-Seq results. I wasn’t able to get a good idea of how the Wold lab does this, other than by eye (which isn’t very high throughput), but knowing what needs to be done now, it shouldn’t be particularly difficult to incorporate into our next release of the FindPeaks code.
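The simplest automated version of that filter I can think of is an overlap test: throw away any sample peak that overlaps a peak called in the control. The sketch below is only that, a naive illustration with made-up coordinates and hypothetical function names; it is not how the Wold lab or FindPeaks handles it.

```python
def overlaps(a, b):
    """True when two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def drop_control_peaks(sample_peaks, control_peaks):
    """Keep only the sample peaks that do not overlap any control peak."""
    return [p for p in sample_peaks
            if not any(overlaps(p, c) for c in control_peaks)]

# Hypothetical peak calls as (chromosome, start, end).
sample = [("chr1", 100, 300), ("chr1", 5000, 5200), ("chr2", 100, 300)]
control = [("chr1", 250, 400)]

print(drop_control_peaks(sample, control))
```

An all-or-nothing overlap test like this would also discard a genuine binding site that happens to sit on a background peak, so comparing peak heights between the two sets is probably the more sensible next step.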

Another smart thing that the Wold lab has done is to separate the interactions of ChIP-Seq into two different types: Type 1 and Type 2, where Type 1 refers to single molecule-DNA binding events, which give rise to sharp peaks, and very clean profiles. These tend to be transcription factors like NRSF, or STAT1, upon which the first generation of ChIP-Seq papers were published. Type 2 interactomes tend to be less clear, as they are transcription factors that recruit other elements, or form complexes that bind to the DNA at specific sites, and require other proteins to bind to encourage transcription. My own interpretation is that the number of identifiable binding sites should indicate the type, and thus, if there were three identifiable transcription factor consensus sites lined up, it should be considered a Type 3 interactome, though that may be simplifying the case tremendously, as there are, undoubtedly, many other proteins that must be recruited before any transcription will take place.

In terms of applications, the members of the Wold lab have been using their identified peaks to locate novel binding site motifs. I think this is the first thing everyone thinks of when they hear of ChIP-Seq for the first time, but it’s pretty cool to see it in action. (We do it at the GSC too, I might add.) The neatest thing, however, was that they were able to identify a rather strange binding site, with two halves of a motif, split by a variable distance. I haven’t quite figured out how that works, in terms of DNA/Protein structure, but it’s conceptually quite neat. They were able to show that the distance between the two halves of the structure varies by 10-20 bases, making it a challenge to identify, for most traditional motif scanners. Nifty.
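To show why a fixed-width scanner struggles with this, here is a small sketch of a gapped search using a regular expression. The half-site sequences are invented for the example (they are not the motif from the talk), and treating the spacer as anywhere from 10 to 20 bases is my own reading of the numbers above.

```python
import re

def find_split_motif(seq, left_half, right_half, min_gap=10, max_gap=20):
    """Return (start, gap_length) for each left_half ... right_half pair in seq.

    A lookahead lets matches starting at different positions overlap; the lazy
    quantifier reports the shortest allowed gap for each start position.
    """
    pattern = re.compile(
        "(?=(%s)([ACGTN]{%d,%d}?)(%s))"
        % (re.escape(left_half), min_gap, max_gap, re.escape(right_half))
    )
    return [(m.start(), len(m.group(2))) for m in pattern.finditer(seq)]

# Hypothetical half-sites separated by a 12-base spacer.
seq = "TT" + "GGACC" + "A" * 12 + "TTCAG" + "CC"
print(find_split_motif(seq, "GGACC", "TTCAG"))
```

A position weight matrix of fixed width has no natural way to express “somewhere between 10 and 20 bases of anything”, which is presumably why the split site is easy to miss.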

Another neat thing, which I think everyone knows but which was cool to hear has actually been shown, is that the binding sites often line up on areas of high conservation across species. I use that as a test for my own work, but it was good to have it confirmed.

Finally, one of the things Dr. Wold mentioned was that they were interested in using the information in the directionality of reads in their analysis. Oddly enough, this was one of the first problems I worked on in ChIP-Seq, months ago, and discovered several ways to handle it. I enjoyed knowing that there’s at least one thing my own ChIP-Seq code does that is unique, and possibly better than the competition. (-;
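I won’t describe what my own code does here, but one common, generic way to use read direction is to note that forward-strand reads pile up on one side of a binding site and reverse-strand reads on the other, and then look for the offset that best lines the two up. The sketch below does exactly that with toy data; the function name and the 300-base search limit are assumptions of mine.

```python
from collections import Counter

def best_strand_shift(forward_starts, reverse_ends, max_shift=300):
    """Offset (in bases) that best aligns forward-strand starts with reverse-strand ends."""
    fwd = Counter(forward_starts)
    rev = Counter(reverse_ends)

    def score(shift):
        # Number of forward/reverse read pairs that coincide after shifting.
        return sum(n * rev.get(pos + shift, 0) for pos, n in fwd.items())

    return max(range(max_shift + 1), key=score)

# Toy data: two binding sites, reverse-strand reads sitting 150 bases downstream.
forward = [100, 101, 102, 500, 501]
reverse = [250, 251, 252, 650, 651]

print(best_strand_shift(forward, reverse))
```

The estimated offset approximates the fragment length, and shifting each strand by half of it toward the middle sharpens the peak over the actual binding site.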

As for transcriptome work, there were only a couple things that are worth mentioning. The Wold lab seems to be using MAQ and a list of splice junctions assembled from annotated exons to map the transcriptome sequences. I’ve heard that before, actually, from someone at the GSC who is doing exactly the same thing. It’s a small world. I’m not really a fan of the technique, however. Yes, you’ll get a lot of the exon junction reads, but you’ll only find the ones you’re looking for, which is exactly the criticism all the next-gen people throw at the use of micro-arrays. There has got to be a better solution… but I don’t yet know what it is. (We thought it was Exonerate, but we can’t seem to get it to work well, due to several bugs in the software. It’s clearly a work in progress.)
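For anyone who hasn’t seen the junction-library trick, it goes roughly like this: for each pair of adjacent annotated exons, glue the end of one onto the start of the next, leaving just enough flanking sequence that a read must cross the junction to align. The sketch below is my own simplified illustration (one transcript, half-open coordinates, hypothetical names), not the Wold lab’s pipeline.

```python
def junction_sequences(genome_seq, exons, read_len):
    """Build junction-spanning sequences for adjacent exons of one transcript.

    exons is a sorted list of (start, end) half-open coordinates. Each side
    contributes at most read_len - 1 bases, so a read of length read_len
    cannot align to the junction sequence without crossing the junction.
    """
    flank = read_len - 1
    junctions = {}
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        left = genome_seq[max(s1, e1 - flank):e1]    # tail of the upstream exon
        right = genome_seq[s2:min(e2, s2 + flank)]   # head of the downstream exon
        junctions[(e1, s2)] = left + right
    return junctions

# Toy genome with two exons (A's and G's) separated by an intron (C's).
genome = "AAAAACCCCCGGGGGTTTTT"
print(junction_sequences(genome, [(0, 5), (10, 15)], read_len=4))
```

The limitation is visible right in the loop: only junctions between annotated, adjacent exons ever get built, so an unannotated exon or a skipped exon simply has nothing to map to.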

Anyhow, I think I’m going to stop here. I’ll just sum it all up by saying it was a pretty good talk, and it’s given me lots of things to think about. I’m looking forward to getting back to coding tomorrow.

>Transcriptome sequencing.

>In an earlier comment, Jason Stajich asked:

What I am most curious about is how people are planning to do the statistics of gene expression comparison from the EST sequencing library approach. It made sense to me for the SAGE approach, but how do you get the overall expression for the gene (really you want the per-transcript numbers). Do you assemble and count the union of all tags across a transcript? Do you normalize that by length of the transcript? Do you only count 3′ biased tags?

Though I’ve been taking my time about answering, it’s a really good question. I’ve been working with transcriptomes for a while, now, and have a fair amount of experience with them. I don’t want to give away all of my secrets, but I can give a few pointers. If anyone wants to collaborate on something, you know where to find me. (-;

So, first things first, with transcriptome sequencing using Illumina based sequencing, each read you get is presumably from a single molecule of DNA, which presumably came from a single molecule of cDNA, from your library. I can’t speak for all of the protocols used by the labs here at the Genome Sciences Centre, but the results I’ve seen have shown a very close correlation with expression levels measured by Nimblegen/Affymetrix DNA arrays, and so I tend to believe that the number of tags we’re observing per region (e.g. gene/transcript/exon) is a direct (or nearly direct) measurement of the RNA levels in the general cell population used to create the library.
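In code, “tags per region” is about as simple as it sounds. Here is a toy sketch with invented coordinates and hypothetical function names; the per-kilobase helper is there only to show the length normalization Jason asks about, for comparison with the raw counts.

```python
def count_tags(read_starts, regions):
    """Count reads whose start falls inside each region.

    regions maps a name to (start, end) half-open coordinates.
    """
    counts = {name: 0 for name in regions}
    for pos in read_starts:
        for name, (start, end) in regions.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

def per_kb(counts, regions):
    """Tags per kilobase of region length (the length normalization in question)."""
    return {name: counts[name] * 1000.0 / (regions[name][1] - regions[name][0])
            for name in counts}

# Hypothetical genes and read starts on one chromosome.
regions = {"geneA": (0, 1000), "geneB": (1000, 3000)}
reads = [10, 500, 999, 1000, 2500]

counts = count_tags(reads, regions)
print(counts)
print(per_kb(counts, regions))
```

The nested loop is fine for a toy example; for millions of reads you would want sorted regions and a binary search, or an interval tree.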

I should also mention that this is very dependent upon the protocols being used. If your protocol involves amplifying the cDNA with the use of PCR, you’re really not going to maintain that relationship. Consult an expert on this subject, if you plan to try this at home. (-;

The other questions Jason asked are not quite as straightforward. We have a protocol here at the GSC that gives pretty darn even coverage across transcripts as a whole, which means that transcript end bias is pretty minimal. That largely removes the need to correct for end bias. Of course, this comes down to a lot of lab technique (which is totally outside the scope of my post), as it seems to be dependent on following the appropriate protocols. I’ve seen libraries which are completely skewed, libraries that perfectly follow the transcript outlines, and libraries somewhere in between. As it stands, I now run my tools over each data set as it appears to judge the quality before I ask for more lanes.
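As an example of the kind of quality check I mean (a simplified stand-in, not my actual tools), you can bin the reads along each transcript and see whether they pile up at one end. The function names, the ten bins and the idea of summarizing with a single 3′ fraction are all illustrative assumptions.

```python
def positional_profile(read_offsets, transcript_len, bins=10):
    """Histogram of read offsets (0-based, from the 5' end) along one transcript."""
    profile = [0] * bins
    for off in read_offsets:
        profile[min(bins - 1, off * bins // transcript_len)] += 1
    return profile

def three_prime_fraction(profile):
    """Fraction of reads in the 3' half; about 0.5 suggests even coverage."""
    total = sum(profile)
    half = len(profile) // 2
    return sum(profile[half:]) / total if total else 0.0

# Toy libraries on a 1000-base transcript: one even, one skewed to the 3' end.
even = list(range(0, 1000, 10))
skewed = list(range(900, 1000))

print(three_prime_fraction(positional_profile(even, 1000)))
print(three_prime_fraction(positional_profile(skewed, 1000)))
```

Averaged over many transcripts, a number near 0.5 is what you would hope for; a library that sits far from that is one I would want to look at before asking for more lanes.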

So, the short answer is: no, I don’t normalize or exclude any data when I deal with transcriptomes, but I’m in the fortunate position of being able to identify (and accept or reject!) which data sets meet a reasonable level of quality before I process them.