>SISSR

>One more day, one more piece of ChIP-Seq software to cover. I haven’t talked much about FindPeaks, the software descended from Robertson et al., for obvious reasons: the paper was just an application note, and I’m intimately familiar with how it works, so I’m not going to review it. I have talked about QuEST, however, which presumably descended from Johnson et al. Those of you who have been following ChIP-Seq papers since the early days will realize that there’s still something missing: the peak finder descended from Barski et al., which is the subject of today’s blog: SISSR. Those were the first three published ChIP-Seq papers, so it’s no surprise that each of them was followed up with a paper (or application note!) on its software.

So, today, I’ll take a look at SISSR, to complete the series.

From the start, the Barski paper discussed both histone modifications and transcription factors, so the context of the peak finder is a little different. Where FindPeaks (and possibly QuEST as well) was originally conceived for identifying single peaks and later expanded to handle multiple peaks, I would imagine SISSR was conceived from the outset with “complex” areas of overlapping peaks in mind. That’s only relevant in terms of their analysis, though, and I’ll come back to it.

The most striking thing you’ll notice about this paper is that the datasets look familiar. They are, in fact, the sets from Robertson, Barski and Johnson: STAT1, CTCF and NRSF, respectively. This is the first of the ChIP-Seq application papers that actually performs a comparison between the available peak finders and, of course, claims that theirs is the best. Again, I’ll come back to that.

The method used by SISSR is almost identical to the method used by FindPeaks, with the use of directional information built into the base algorithm, whereas FindPeaks provides it as an optional module (-directional flag, which uses a slightly different method). They provide an excellent visual image on the 4th page of the article, demonstrating their concept, which will explain the method better than I can, but I’ll try anyhow.

In ChIP-Seq, a binding site is expected to have many real tags pointing at it: tags upstream should be on the sense strand, and tags downstream should be on the anti-sense strand. Thus, a real binding site should sit at a transition point, where the majority of tags switches from sense to anti-sense. By identifying these transition points, they can identify the locations of real binding sites. That more or less describes the algorithm employed, with the following modifications: a window (20 bp by default) is used instead of working base-by-base, and parameter estimation is employed to guess the length of the fragments.
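
Since this is easier to see in code than in prose, here’s a quick sketch of how I read the transition-point idea – my own toy paraphrase in Java, not SISSR’s actual implementation; the window logic, counts and toy data below are entirely made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch of the transition-point idea as I understand it -- not the
 * SISSR implementation itself.  Tags are read start positions; for each window
 * we compute (sense - antisense) counts, and flag a candidate site wherever
 * that balance flips from positive (mostly sense tags, i.e. upstream reads)
 * to negative (mostly antisense tags, i.e. downstream reads).
 */
public class TransitionPointSketch {

    static class Tag {
        final int position;
        final boolean senseStrand;
        Tag(int position, boolean senseStrand) {
            this.position = position;
            this.senseStrand = senseStrand;
        }
    }

    /** Net strand balance (sense minus antisense) for tags in [start, start + window). */
    static int strandBalance(List<Tag> tags, int start, int window) {
        int balance = 0;
        for (Tag t : tags) {
            if (t.position >= start && t.position < start + window) {
                balance += t.senseStrand ? 1 : -1;
            }
        }
        return balance;
    }

    /** Scan the region in window-sized steps; report positions where the balance flips + -> -. */
    static List<Integer> findTransitions(List<Tag> tags, int regionStart, int regionEnd, int window) {
        List<Integer> sites = new ArrayList<>();
        int previous = strandBalance(tags, regionStart, window);
        for (int pos = regionStart + window; pos < regionEnd; pos += window) {
            int current = strandBalance(tags, pos, window);
            if (previous > 0 && current < 0) {
                sites.add(pos);  // boundary between the two windows is the candidate binding site
            }
            previous = current;
        }
        return sites;
    }

    public static void main(String[] args) {
        List<Tag> tags = new ArrayList<>();
        // Toy data: sense tags piling up just upstream of position 1000,
        // antisense tags just downstream -- the signature of a binding site.
        for (int i = 0; i < 20; i++) tags.add(new Tag(970 + i, true));
        for (int i = 0; i < 20; i++) tags.add(new Tag(995 + i, false));
        System.out.println("Candidate sites: " + findTransitions(tags, 900, 1100, 20));
    }
}
```

On the toy data it reports a single candidate site right where the strand balance flips, which is the whole point of the approach.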

In my review of QuEST, I complained that windows are a bad idea(tm) for ChIP-Seq, only to be corrected that QuEST wasn’t using a window. This time, the window is explicitly described – and again, I’m puzzled. FindPeaks uses an identical operation without windows, and it runs blazingly fast. Why throw away resolution when you don’t need to?

On the subject of length estimation, I’m again less than impressed. I realize this is probably an early attempt at it – and FindPeaks has gone through its fair share of bad length estimators, so this isn’t a major criticism, but it is a weakness. To quote a couple of lines from the paper: “For every tag i in the sense strand, the nearest tag k in the anti-sense strand is identified. Let j be the tag in the sense strand immediately upstream of k.” A formula based on the distances between (i,j) and (j,k) then follows. I completely fail to understand how this provides an accurate assessment of the real fragment length; I’m sure I’m missing something. As a function describing the width of peaks – which is really what the experiment is aiming for, anyhow – it may be a fine method, so it’s possible this is just poorly named.

In fairness, they also provide an option to set the fragment length manually (or XSET, as the approach was referred to at the time), which overrides the estimation. I didn’t see a comparison in the paper of which one provides the better answers, but having options is always a good thing.

Moving along, my real complaint about this article is the analysis of their results compared to past results, which comes in two parts. (I told you I’d come back to it.)

The first complaint is what they were comparing against. The article was submitted for publication in May 2008, but they compared results to those published in the June 2007 Robertson article for STAT1. By August, our count of peaks had changed. By January 2008, several upgraded versions of FindPeaks were available, and many bugs had been ironed out. It’s hardly fair to compare the June 2007 FindPeaks results to the May 2008 version of SISSR, and then declare SISSR the clear winner. Still, that’s not a great problem – albeit somewhat misleading.

More vexing is their quality metric. In the motif analysis, they clearly state that because of the large amount of computing power required, only the top X% of reads were used in their analysis. For the comparison with FindPeaks, the top 5% of peaks were used – and they were able to observe the same motifs. Meanwhile, their claim to find 74% more peaks than FindPeaks is never really discussed in terms of the quality of those additional sites. (FindPeaks was also modified to identify sub-peaks after the original dataset was published, so this is really comparing apples to oranges – a fact glossed over in the discussion.)

Anyhow, complaints aside, it’s good to see a paper finally compare the various peak finders out there. They provide some excellent graphics and a nice overview of how their ChIP-Seq application works, while contrasting it with the published data available. I particularly enjoyed the motif work, especially figure 5, which correlates four motif variants with tag density – a fantastic bit of information, buried deeper in the paper than it should be.

So, in summary, this paper presents a rather unfair competition by using metrics guaranteed to make SISSR stand out, but still provides a good read with background on ChIP-Seq, excellent illustrations and the occasional moment of deep insight.

>ChIP-Seq in silico

>Yesterday I got to dish out some criticism, so it’s only fair that I take some myself, today. It came in the form of an article called “Modeling ChIP Sequencing In Silico with Applications”, by Zhengdong D. Zhang et al., PLoS Computational Biology, August 2008: 4(8).

This article is actually very cool. They’ve settled several points that have been hotly debated here at the Genome Sciences Centre, and made the case for some of the stuff I’ve been working on – and then show me a few places where I was dead wrong.

The article takes direct aim at the work done in Robertson et al., using the STAT1 transcription factor data produced in that study. Their key point is that the “FDR” used in that study was far from ideal, and that it could be significantly improved. (Something I strongly believe as well.)

For those that aren’t aware, Robertson et al. is sort of the ancestral origin of the FindPeaks software, so this particular paper is more or less aiming at the FindPeaks thresholding method. (Though I should mention that they’re comparing their results to the peaks in the publication, which used the unreleased FindPeaks 1.0 software – not the FindPeaks 2+ versions, of which I’m the author.) Despite the comparison to the not-quite current version of the software, their points are still valid, and need to be taken seriously.

Mainly, I think there are two points that stand out:

1. The null model isn’t really appropriate.
2. The even distribution isn’t really appropriate.

The first, the null model, is relatively obvious – everyone has been pretty clear from the start that the null model doesn’t really work well. This model, pretty much consistent across ChIP-Seq platforms, can be paraphrased as “if my reads were all noise, what would the data look like?” That assumption is destined to fail every time – the reads we obtain aren’t all noise, so assuming they are as a control is really a “bad thing”(tm).

The second, the even distribution model, is equally disastrous. It can be paraphrased as “if all of my noise were evenly distributed across some portion of the chromosome, what would the data look like?” Alas, noise doesn’t distribute evenly in these experiments, so it should be fairly obvious why this is also a “bad thing”(tm).
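
To make concrete what that second assumption actually buys you, here’s a toy version of the calculation such a model performs – a generic Poisson illustration of my own, not the thresholding from either FindPeaks or the paper, and the read counts and window size below are purely hypothetical:

```java
/**
 * A toy illustration of an "even distribution" null model -- not the
 * implementation from either paper.  If all N reads were noise spread
 * uniformly over a mappable region of length L, the count in any window of
 * width w would be roughly Poisson with mean N*w/L, and a minimum peak
 * height can be read off from the Poisson tail.
 */
public class UniformNoiseModel {

    /** P(X >= k) for X ~ Poisson(lambda), via the complement of the CDF. */
    static double poissonTail(double lambda, int k) {
        double term = Math.exp(-lambda);  // P(X = 0)
        double cdf = 0.0;
        for (int i = 0; i < k; i++) {
            cdf += term;
            term *= lambda / (i + 1);    // advance to P(X = i + 1)
        }
        return Math.max(0.0, 1.0 - cdf);
    }

    /** Smallest window count whose tail probability drops below alpha. */
    static int heightThreshold(long totalReads, long regionLength, int window, double alpha) {
        double lambda = (double) totalReads * window / regionLength;
        int k = 0;
        while (poissonTail(lambda, k) > alpha) {
            k++;
        }
        return k;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 10 million reads over 2 Gb of mappable sequence, 200 bp windows.
        System.out.println("Threshold: " + heightThreshold(10_000_000L, 2_000_000_000L, 200, 1e-5));
    }
}
```

That threshold is only as good as the uniformity assumption behind it – which is exactly the number the simulation approach below tries to replace.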

The solution presented in the paper is fairly obvious: create a full simulation of your ChIP-Seq data. Their version is a much more rigorous process than that sounds, however. They simulate a genome-space, remove the areas that would be gaps or repeats in the real chromosome, and then tweak the simulation to replicate their experiment using weighted statistics collected from the ChIP-Seq experiment itself.
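
The core of that idea can be sketched in a few lines – and I stress this is my own trivially simplified version, not the Zhang et al. pipeline, with a made-up chromosome and mask: simulated read starts are simply rejected if they land in a gap or repeat interval.

```java
import java.util.Random;

/**
 * A minimal sketch of the masked-background idea, under my own simplifying
 * assumptions -- not the pipeline from Zhang et al.  Read starts are dropped
 * uniformly onto the chromosome, but any start falling inside a masked
 * interval (gap or repeat) is rejected and redrawn, so the simulated
 * background only covers sequence a real read could have mapped to.
 */
public class MaskedBackgroundSimulator {

    final int chromosomeLength;
    final int[][] maskedIntervals;   // {start, end} pairs, half-open
    final Random random = new Random(42);

    MaskedBackgroundSimulator(int chromosomeLength, int[][] maskedIntervals) {
        this.chromosomeLength = chromosomeLength;
        this.maskedIntervals = maskedIntervals;
    }

    boolean isMasked(int position) {
        for (int[] interval : maskedIntervals) {
            if (position >= interval[0] && position < interval[1]) return true;
        }
        return false;
    }

    /** Draw one simulated read start from the unmasked portion of the chromosome. */
    int simulateReadStart() {
        int position;
        do {
            position = random.nextInt(chromosomeLength);
        } while (isMasked(position));
        return position;
    }

    public static void main(String[] args) {
        // Hypothetical 1 Mb chromosome with a gap at 200-300 kb and a repeat at 600-650 kb.
        MaskedBackgroundSimulator sim = new MaskedBackgroundSimulator(
                1_000_000, new int[][]{{200_000, 300_000}, {600_000, 650_000}});
        for (int i = 0; i < 5; i++) {
            System.out.println("Simulated read start: " + sim.simulateReadStart());
        }
    }
}
```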

On the one hand, I really like this method, as it should give a good stand-in for a control; on the other hand, I don’t like that you need to know a lot about the genome of interest (i.e., mappability, repeat-masking, etc.) before you can analyze your ChIP-Seq data. Of course, if you’re going to simulate your genome, you should simulate it well – I agree with that.

I don’t want to belabor the point, but this paper provides a very nice method for simulating ChIP-Seq noise in the absence of a control, as in Robertson et al. However, I think there are two things that have changed since this paper was submitted (January 2008) that should be mentioned:

1. FDR calculations haven’t stood still. Even at the GSC, we’ve been working on two separate FDR models that no longer use the null model; however, both still make even-distribution assumptions, which is also not ideal.

2. I believe everyone has now acknowledged that there are several biases that can’t be accounted for in any simulation technique, and that controls are the way forward. (They’re used very successfully in QuEST, which I discussed yesterday.)

Anyhow, to summarize this paper: Zhang et al. provide a fantastic critique of the thresholding and FDR used in early ChIP-Seq papers (which are still in use today, in one form or another), and demonstrate a viable and clearly superior method for refining ChIP-Seq results without a matched control. This paper should be read by anyone working on FDRs for next-gen sequencing and ChIP-Seq software.

(Post-script: In preparation for my comprehensive exam, I’m trying to prepare critical evaluations of papers in the area of my research. I’ll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and – coming from a graduate student – shouldn’t be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism – I’m simply trying to evaluate the science in the context of the field as it stands today.)

>QuEST

>(Pre-script: In preparation for my comprehensive exam, I’m trying to prepare critical evaluations of papers in the area of my research. I’ll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and – coming from a graduate student – shouldn’t be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism – I’m simply trying to evaluate the science in the context of the field as it stands today.)

(UPDATE: A response to this article was kindly provided by Anton Valouev, and can be read here.)

I once wrote a piece of software called WINQ, which was the predecessor of a piece of software called Quest. Not that I’m going to talk about that particular Quest for long, but bear with me a moment – it makes a nice lead-in.

The software I wrote wasn’t started before the University of Waterloo’s version of Quest, but it was released first. Waterloo was implementing a multi-million dollar system for managing student records, built on Oracle databases, PeopleSoft software, and tons of custom extensions to web interfaces and reporting. Unfortunately, the project was months behind, and the Quest system was nowhere near being deployed. (Vendor problems and the like.) That’s when I became involved: in two months of long days, I used Cognos tools (several of them, involving five separate scripting and markup languages) to build the WINQ system, which gave the faculty a way to query the Oracle database through a secure web frontend and get all of the information they needed. It was supposed to be in use for about 4-6 months, until Quest took over… but I heard it was used for more than two years. (There are many good stories there, but I’ll save them for another day.)

Back to ChIP-Seq’s QuEST, this application was the subject of a recently published article. In a parallel timeline to the Waterloo story, QuEST was probably started before I got involved in ChIP-Seq, and was definitely released after I released my software – but this time I don’t think it will replace my software.

The paper in question (Valouev et al., Nature Methods, Advance Online Publication) is called “Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data.” I suspect it was published with the intent of being the first article on ChIP-Seq software, which, unfortunately, it wasn’t. What’s most strange to me is that it seems to be largely a reiteration of the methods used by Johnson et al. in their earlier ChIP-Seq paper. I don’t see anything novel in this paper, though maybe someone else has spotted something I’ve missed.

The one thing that surprises me about this paper, however, is their use of a “kernel density bandwidth”, which appears to be a sliding window of pre-set length. This flies in the face of the major advantage of ChIP-Seq, which is the ability to get very strong signals at high resolution. By forcing a “window” over their data, they are likely losing a lot of the resolution they could have found by investigating the reads directly. (Admittedly, with a window of 21bp, as used in the article, they’re not losing much, so it’s not a very heavy criticism.) I suppose it could be used to provide a quick way of doing subpeaks (finding individual peaks in areas of contiguous read coverage) at a cost of losing some resolving power, but I don’t see that discussed as an advantage.
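
For readers who haven’t run into the term, here’s roughly what a fixed-bandwidth kernel density profile looks like in practice – a toy illustration of the general concept, using a triangular kernel and invented read positions, and certainly not QuEST’s actual code or kernel:

```java
/**
 * A rough sketch of a fixed-bandwidth kernel density profile over read start
 * positions -- a generic illustration of the concept, not code from QuEST.
 * Each read contributes a triangular kernel of half-width `bandwidth`, the
 * smoothed profile is the sum of those contributions, and peaks would then be
 * called on the smoothed profile rather than on raw coverage.
 */
public class KernelDensityProfile {

    /** Smoothed profile over [0, regionLength) from a list of read start positions. */
    static double[] smooth(int[] readStarts, int regionLength, int bandwidth) {
        double[] profile = new double[regionLength];
        for (int start : readStarts) {
            for (int offset = -bandwidth; offset <= bandwidth; offset++) {
                int pos = start + offset;
                if (pos < 0 || pos >= regionLength) continue;
                // Triangular kernel: weight falls off linearly with distance from the read start.
                profile[pos] += 1.0 - (double) Math.abs(offset) / (bandwidth + 1);
            }
        }
        return profile;
    }

    public static void main(String[] args) {
        int[] reads = {95, 100, 102, 110, 300};
        double[] profile = smooth(reads, 400, 21);   // 21 bp half-width, echoing the bandwidth in the paper
        int argmax = 0;
        for (int i = 1; i < profile.length; i++) {
            if (profile[i] > profile[argmax]) argmax = i;
        }
        System.out.printf("Density maximum at %d (height %.2f)%n", argmax, profile[argmax]);
    }
}
```

The trade-off is exactly the one I complain about above: the wider the kernel, the smoother the profile and the blurrier the position of the maximum.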

The second thing they’ve done is provide a directional component to peak finding. Admittedly, I tried to do the same thing, but found it didn’t really add much value. Both the QuEST publication and my application note on FindPeaks 3.1 mention the ability to do this – and then fail to show any data that demonstrates the value of using this mechanism versus identifying peak maxima. (In my case, I wasn’t expected to provide data in the application note.)

Anyhow, that was the down side. There are two very good aspects to this paper. The first is that they do use controls. Even now, the Genome Sciences Centre is struggling with ChIP-Seq controls, while it seems everyone else is using them to great effect. I really enjoyed this aspect of it. In fact, I was rather curious how they’d done it, so I took a look through the source code of the application. I found the code somewhat difficult to wade through, as the coding style was very different from my own, but well organized. Unfortunately, I couldn’t find any code for dealing with controls, which leads me to think this is an unreleased feature, and was handled by post-processing the results of their application. Too bad.

The second thing I really appreciated was the motif finding work, which isn’t strictly ChIP-Seq, but is one of the uses to which the data can be applied. Unfortunately, this is also not new, as I’m aware of many earlier experiments (published and unpublished) that did this as well, but it does make a nice little story. There’s good science behind this paper – and the data collected on the chosen transcription factors will undoubtedly be exploited by other researchers in the future.

So, here’s my summary of this paper: as a presentation of a new algorithm, it offers nothing novel, and no experiments are provided to demonstrate the value of its algorithms over any others. On the other hand, as a paper on the growth-associated binding protein and serum response factor proteins (GABP and SRF, respectively), it presents a nice, compact story.

>ChIP-Seq revisited

>In the world of ChIP-Seq, things don’t seem to slow down. A collaborator of mine pointed out the new application called MACS, which is yet another peak finder, written in Python as an open-source project. That makes two open-source peak finders that I’m aware of: Useq and now MACS.

The interesting thing, to me, is the maturity of the code in terms of features implemented. In neither case is it all that great: both are missing features I consider relatively basic, and both are relatively naive in terms of the algorithms used for peak detection. Though, I suppose I’ve been working with FindPeaks long enough that nearly everything else will seem relatively basic in comparison.

However, I’ll come back to a few more FindPeaks-related things in a moment. I wanted to jump to another ChIP-Seq item I noticed this week: the Wold lab has merged their peak finder into a larger development package for genomic and transcriptome work, which I think they’re calling ERANGE. I’ve long argued that peak-finding tools are really just a subset of the whole Illumina tool-set required, so it’s nice to see other people doing this.

This is the development model I’ve been using, though I don’t know if the Wold lab does exactly the same thing. The high-level organization uses a core library set and a core object set, and then FindPeaks and other projects just sit on top, using those shared layers. It’s a reasonably efficient model. In a blog post a while ago, I mentioned that I’d made a huge number of changes to my code after coming across the tool called “Enerjy”. I sat down to figure out how many lines were changed in the last two weeks: 26,000+ lines of code, comments and javadoc. That’s a startling figure, since my entire code base (grep -r " " * | wc -l) is only 22,884 lines, of which 15,022 contain semicolons.

Anyhow, I have several plans for the next couple of days:

  1. Try to get my SVN repository somewhere other people can work on it as well, not just GSC developers.
  2. Improve the threading I’ve got going.
  3. Clean up the documentation, where possible.
  4. Work on the Adaptive mode code.

Hopefully, that’ll clean things up a bit.

Back to FindPeaks itself: the latest news is that my application note in Bioinformatics has been accepted. Actually, it was accepted about a week ago, but I’m still waiting to see it in the advance access section – hopefully it won’t be much longer. I also have a textbook chapter on ChIP-Seq coming out relatively soon (I’m absolutely honoured to have been given that opportunity!), assuming I can get my changes done by Monday.

I don’t think that’ll be a problem.

>Downloads and aligners

>At the Genome Sciences Centre, we have a platform Scientific Advisory Board meeting coming up in the near future, so a lot of people are taking the time to put together information on what’s been going on since the last meeting. As part of that, I was asked to provide some information on the FindPeaks application, since it’s been relatively successful. One of those things was how many times it’s been downloaded.

Surprisingly, it’s not an insignificant number: FindPeaks 2.x has been downloaded 280 times, and FindPeaks 3.1.x has been downloaded 325 times. I was amazed. Admittedly, I filtered out Google and anything calling itself a “spider”, and didn’t filter on unique IPs, but it’s still more than I expected.

The other surprise to me was the institutions from which it was downloaded, which are scattered across the world. Very cool to see it’s not just Vancouver people who took an interest in FindPeaks, though I somewhat suspected as much already.

Thinking of FindPeaks, I put in one last marathon cleaning session for all my software. I’m sure I’m somewhere north of 15,000 lines of code modified in the past week. Even when you consider that some of those changes were massive search-and-replace jobs (very few, really) or refactoring and renaming of variables (many, many more), it’s still an impressive number for two weeks. With that done, I’m looking to see some significant gains in development speed and developer productivity. (I learned a LOT doing this.)

The last 122 warnings will just have to get cleaned up when I get around to it – although they’re really only 4 warning types repeated many times over. (The majority are just that the branched logic goes deeper than 5 levels, or that my objects have public variables – you can only write so many accessor functions in a week.)

Anyhow, I’m looking forward to testing out the FindPeaks 4.0 alphas starting tomorrow, and to putting some documentation together on it. (And to catching up on the flood of emails I received in the last 2 days.)

Finally, I’m writing a MAQ file interpreter for another project of mine. If I ever manage to figure out how to interpret the (nearly undocumented) files, I’ll post it here. If anyone’s done this already (though I didn’t see anything publicly available on the web), I’d love to hear from them.

Cheers!

>Random Update on FP/Coding/etc.

>I had promised to update my blog more often, but then failed miserably to follow through last week. I guess I have to chalk it up to unforeseen circumstances. On the bright side, it gave me the opportunity to come up with several things to discuss here.

1. Enerjy: I learned about this tool on Slashdot last week, while doing my usual lunch-hour “open source news” perusal. I can unequivocally say that installing the Enerjy tool in Eclipse has improved my coding by leaps and bounds. I tested it on the Java codebase that holds my FindPeaks application and the transcriptome/genome analysis tools, and was appalled by the number of suggestions it gave. Admittedly, I’m self-taught in Java, and while I thought I had grasped the “Zen” of Java by now, the 2000+ warnings it produced disagreed. I’ve since been cleaning up the code like there’s no tomorrow, and have brought it down to 533 warnings. The best part is that it pointed out several places where bugs were likely to occur, all of which have now been cleaned up.

2. Threading has also come up this past week. Although I didn’t “need” it, there was no way to get around it – learning threads was the appropriate solution to one problem that came up, so my development version is now beginning to include some thread management, which is likely to spread into the core algorithms. Who knew?

3. Random politics: If you’re a grad student in a mixed academic/commercial environment, I have a word of warning for you: Not everyone there is looking out for your best interests. In fact, some people are probably looking out for their own interests, and they’re definitely not the same as yours.

4. I read Michael Smith’s biography this week. I was given a free copy by the Michael Smith Foundation for Health Research, who were kind enough to provide my funding for the next few years. It’s fantastic to understand the history behind the British Columbia biotechnology scene. I wish I’d read it before working at Zymeworks – it would have given me a lot more insight into the organizations and people I met along the way. Hindsight is 20/20.

5. FindPeaks 4.0: Yes, I’m skipping plans for a FindPeaks 3.3. I’ve changed well over 12,000 lines of code, according to the automated scripts that report such things, including a major refactoring and the start I’ve made on threading. If that doesn’t warrant a major version number change, I don’t know what does.

Well, on that note, back to coding… I’m going to be competing with people here, in the near future, so I had best be productive!

>I’ve spent the last week madly putting together a poster for the “Reasons for Hope 2008” conference this past weekend, which focuses on breast cancer science, treatment and quality-of-life research. So, you’ll notice (shortly) a new poster in my poster section. It was an educational experience, and I must admit I learned a lot – not so much in the areas I need for my own research, but about physiology, psychology and general health research. And that’s even considering how few talks I went to!

Still, I highly recommend dropping into talks outside your field on occasion. I try to make a habit of it – that included a pathology lecture just before Christmas last year, and this time I learned a lot about mammography and the new techniques for it that are up and coming. Neither is really a practical skill for a bioinformatician, but it gives me a good idea of where the samples I’ll be dealing with come from. Nifty.

Anyhow, I had a few minutes to revisit my ChIP-Seq code, FindPeaks, and do a few things I’d been hoping to do for a while. I got around to reducing the memory requirement, going from about 4 Gb of RAM for a 12M+ read run down to under 1 Gb. (I’d discussed this before in another posting.) The other thing I did was to re-write the core peak-finding algorithm. It was something I’d known was sub-optimal for a while, but re-implementing a core routine isn’t something you do without a lot of thought. The good news: it runs about 2x as fast, scales better on multiple cores, and is guaranteed not to produce the type of bugs that were relatively common in early versions of FindPeaks.
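
I won’t paste the FindPeaks source here, but to give a flavour of why a single pass over sorted reads keeps memory flat, here’s a toy sketch of that general style of peak calling – not the actual FindPeaks code, and the fragment length and threshold below are made up:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Not the FindPeaks source, just a sketch of a single-pass approach: walk the
 * sorted read starts once, keep only the reads still overlapping the current
 * position, and emit a peak each time coverage falls back to zero.  Nothing
 * beyond the currently-open reads ever needs to be held in memory.
 */
public class StreamingPeakFinder {

    static class Peak {
        final int start, end, maxHeight;
        Peak(int start, int end, int maxHeight) { this.start = start; this.end = end; this.maxHeight = maxHeight; }
        @Override public String toString() { return start + "-" + end + " (max " + maxHeight + ")"; }
    }

    /** readStarts must be sorted; every read is assumed to span fragmentLength bases. */
    static List<Peak> findPeaks(int[] readStarts, int fragmentLength, int minHeight) {
        List<Peak> peaks = new ArrayList<>();
        int regionStart = -1, maxHeight = 0, currentEnd = -1;
        List<Integer> openEnds = new ArrayList<>();   // ends of reads still overlapping, kept sorted
        for (int start : readStarts) {
            // close reads that end before this one begins
            while (!openEnds.isEmpty() && openEnds.get(0) <= start) {
                openEnds.remove(0);
            }
            if (openEnds.isEmpty() && regionStart >= 0) {
                if (maxHeight >= minHeight) peaks.add(new Peak(regionStart, currentEnd, maxHeight));
                regionStart = -1;
                maxHeight = 0;
            }
            if (regionStart < 0) regionStart = start;
            int end = start + fragmentLength;
            int insertAt = 0;
            while (insertAt < openEnds.size() && openEnds.get(insertAt) < end) insertAt++;
            openEnds.add(insertAt, end);
            maxHeight = Math.max(maxHeight, openEnds.size());
            currentEnd = Math.max(currentEnd, end);
        }
        if (regionStart >= 0 && maxHeight >= minHeight) peaks.add(new Peak(regionStart, currentEnd, maxHeight));
        return peaks;
    }

    public static void main(String[] args) {
        int[] starts = {100, 120, 130, 135, 150, 600, 610};
        System.out.println(findPeaks(starts, 174, 3));   // 174 bp is just a hypothetical fragment length
    }
}
```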

Having invested the two hours to do it, I’m very glad to see it provide some return. Since my next project is to clean up the Transcripter code (for whole transcriptome shotgun sequencing), this was a nice lesson in coding: if you find a problem, don’t patch it – solve it. I think I have a lot of “solving” to do. (-;

For those of you who are interested, the next version of FindPeaks will be released once I can include support for the SRF files – hopefully the end of the week.

>Genomics Forum 2008

>You can probably guess what this post is about from the title – which means I still haven’t gotten around to writing an entry on thresholding for ChIP-Seq. Actually, it’s probably a good thing I haven’t, as we’ve been learning a lot about thresholding in the past week. It seems many things we took for granted aren’t really the case. Anyhow, I’m not going to say too much about that, as I plan to collect my thoughts and discuss it in a later entry.

Instead, I’d like to discuss the 2008 Genomics Forum, sponsored by Genome BC, which took place on Friday – though, in particular, I’m going to focus on one talk, near to my own research. Dr. Barbara Wold from Caltech gave the first of the science talks, and focussed heavily on ChIP-Seq and Whole Transcriptome Shotgun Sequencing (WTSS). But before I get to that, I wanted to mention a few other things.

The first is that Genome BC took a few minutes to announce a really neat funding competition, which really impressed me, the Genome BC Science Opportunities Fund. (There’s nothing up on the web page yet, but if you google for it, you’ll come across the agenda for Friday’s forum in which it’s mentioned – I’m sure more will appear soon.) Its whole premise revolves around the question: “Are there experiments that we need to be doing, that are of strategic importance to the BC life science community?” I take that to mean, are there projects that we can’t afford not to undertake, that we wouldn’t have the funding to do otherwise? I find that to be very flexible, and very non-academic in nature – but quite neat. I hope the funding competition goes well, and I’m looking forward to seeing what they think falls into the “must do” category.

The second was the surprising demand for Bioinformaticians. I’m aware of several jobs for bioinformaticians with experience in next-gen sequencing, but the surprise to me was the number of times (5) I heard people mention that they were actively recruiting. If anyone with next-gen experience is out there looking for a job (post-doc, full time or grad student), drop me a note, and I can probably point you in the right direction.

The third was one of the afternoon talks, on journalism in science from the perspective of traditional newspaper/TV journalists. It seems so foreign to me, yet the talk touched on several interesting points, including the fact that journalists are struggling to come to terms with “new media.” (…which doesn’t seem particularly new to those of us who have been using the net since the 90’s, but I digress.) It gave me several ideas about things I can do with my blog to bring it out of the simple text format I use now. I guess even those of us who live/breathe/sleep internet don’t do a great job of harnessing its power for communicating effectively. Food for thought.

Ok… so on to the main topic of tonight’s blog: Dr. Wold’s talk.

Dr. Wold spoke at length on two topics, ChIP-Seq and Whole Transcriptome Shotgun Sequencing. Since these are the two subjects I’m actively working on, I was obviously very interested in hearing what she had to say, though I’ll comment more on the ChIP-Seq side of things.

One of the great open questions at the Genome Sciences Centre has been how to do an effective control for a ChIP-Seq experiment. It’s not something we’ve done much of in the past, but the Wold lab demonstrated why controls are necessary, and how to do them well. It seems that ChIP-Seq experiments tend to yield fragments in several genomic regions that have nothing to do with the antibody or the experiment itself. The educated guess is that these are caused by hypersensitive sites in the genome that tend to fragment in repeatable patterns, giving rise to peaks that appear in all samples. Indeed, I spent a good portion of this past week talking about observations of exactly that kind of peak, and how to “filter” them out of the ChIP-Seq results. I wasn’t able to get a good idea of how the Wold lab does this, other than by eye (which isn’t very high-throughput), but knowing what needs to be done, it shouldn’t be particularly difficult to incorporate into our next release of the FindPeaks code.
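
For what it’s worth, the filtering step itself could be as simple as the sketch below – this is my own guess at an implementation, not the Wold lab’s procedure or anything currently in FindPeaks, and the intervals are invented:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A simple sketch of control-based filtering, under my own assumptions about
 * how it could work -- not the Wold lab's procedure.  Any sample peak whose
 * interval overlaps a peak called in the control library is discarded as a
 * likely hypersensitive-site artifact.
 */
public class ControlPeakFilter {

    static class Interval {
        final int start, end;
        Interval(int start, int end) { this.start = start; this.end = end; }
        boolean overlaps(Interval other) { return start < other.end && other.start < end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    /** Keep only sample peaks that do not overlap any control peak. */
    static List<Interval> filter(List<Interval> samplePeaks, List<Interval> controlPeaks) {
        List<Interval> kept = new ArrayList<>();
        for (Interval peak : samplePeaks) {
            boolean inControl = false;
            for (Interval control : controlPeaks) {
                if (peak.overlaps(control)) { inControl = true; break; }
            }
            if (!inControl) kept.add(peak);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Interval> sample = List.of(new Interval(1000, 1200), new Interval(5000, 5300), new Interval(9000, 9100));
        List<Interval> control = List.of(new Interval(5100, 5250));   // hypothetical hypersensitive site
        System.out.println("Kept: " + filter(sample, control));
    }
}
```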

Another smart thing the Wold lab has done is to separate ChIP-Seq interactions into two different types: Type 1 and Type 2, where Type 1 refers to single molecule-DNA binding events, which give rise to sharp peaks and very clean profiles. These tend to be transcription factors like NRSF or STAT1, upon which the first generation of ChIP-Seq papers were published. Type 2 interactomes tend to be less clear, as they involve transcription factors that recruit other elements, or that form complexes binding to the DNA at specific sites and requiring other proteins to encourage transcription. My own interpretation is that the number of identifiable binding sites should indicate the type – thus, if there were three identifiable transcription factor consensus sites lined up, it should be considered a Type 3 interactome – though that may be simplifying the case tremendously, as there are undoubtedly many other proteins that must be recruited before any transcription will take place.

In terms of applications, the members of the Wold lab have been using their identified peaks to locate novel binding site motifs. I think this is the first thing everyone thinks of when they hear about ChIP-Seq for the first time, but it’s pretty cool to see it in action. (We do it at the GSC too, I might add.) The neatest thing, however, was that they were able to identify a rather strange binding site, with two halves of a motif split by a variable distance. I haven’t quite figured out how that works in terms of DNA/protein structure, but it’s conceptually quite neat. They were able to show that the distance between the two halves of the structure varies by 10-20 bases, making it a challenge for most traditional motif scanners to identify. Nifty.
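
Just to illustrate why a variable spacer trips up traditional scanners, here’s a naive toy scan for a split motif – the half-site sequences and everything else here are invented for the example, and this is nothing like the Wold lab’s actual motif-discovery method:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A naive sketch of scanning for a split motif with a variable spacer -- a toy
 * illustration of the idea only.  The two half-sites are treated as exact
 * strings (hypothetical sequences below), and every placement with a gap of
 * 10-20 bases between them is reported.
 */
public class SplitMotifScanner {

    static List<Integer> scan(String sequence, String leftHalf, String rightHalf, int minGap, int maxGap) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + leftHalf.length() <= sequence.length(); i++) {
            if (!sequence.startsWith(leftHalf, i)) continue;
            int afterLeft = i + leftHalf.length();
            for (int gap = minGap; gap <= maxGap; gap++) {
                int rightStart = afterLeft + gap;
                if (rightStart + rightHalf.length() > sequence.length()) break;
                if (sequence.startsWith(rightHalf, rightStart)) {
                    hits.add(i);
                    break;   // one hit per left half-site is enough for this sketch
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical half-sites separated by a 14-base spacer.
        String genome = "AAAA" + "TGACGT" + "A".repeat(14) + "TTGCGC" + "AAAA";
        System.out.println("Hits at: " + scan(genome, "TGACGT", "TTGCGC", 10, 20));
    }
}
```

A real motif scanner works with weight matrices rather than exact strings, of course, but the need to enumerate every allowed spacer length is the part that makes this kind of site awkward.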

Another neat thing – which I think everyone suspects, but which was cool to hear has actually been shown – is that binding sites often line up with areas of high conservation across species. I use that as a test in my own work, but it was good to have it confirmed.

Finally, one of the things Dr. Wold mentioned was that they were interested in using the information in the directionality of reads in their analysis. Oddly enough, this was one of the first problems I worked on in ChIP-Seq, months ago, and I discovered several ways to handle it. I enjoyed knowing that there’s at least one thing my own ChIP-Seq code does that is unique, and possibly better than the competition. (-;

As for transcriptome work, there were only a couple things that are worth mentioning. The Wold lab seems to be using MAQ and a list of splice junctions assembled from annotated exons to map the transcriptome sequences. I’ve heard that before, actually, from someone at the GSC who is doing exactly the same thing. It’s a small world. I’m not really a fan of the technique, however. Yes, you’ll get a lot of the exon junction reads, but you’ll only find the ones you’re looking for, which is exactly the criticism all the next-gen people throw at the use of micro-arrays. There has got to be a better solution… but I don’t yet know what it is. (We thought it was Exonerate, but we can’t seem to get it to work well, due to several bugs in the software. It’s clearly a work in progress.)

Anyhow, I think I’m going to stop here. I’ll just sum it all up by saying it was a pretty good talk, and it’s given me lots of things to think about. I’m looking forward to getting back to coding tomorrow.

>Dr. Henk Stunnenberg’s lecture

>I saw an interesting seminar today, which I thought I’d comment on. Unfortunately, I didn’t bring my notes home with me, so I can only report the details I recall – my apologies in advance if I make any errors; as always, any mistakes are in my recollection, and not the fault of the presenter.

Ironically, I almost skipped the talk – it was billed as discussing Epigenetics using “ChIP-on-Chip”, which I wrote off several months ago as being a “poor man’s ChIP-Seq.” I try not to say that too loud, usually, since there are still people out there who put a lot of faith in it, and I have no evidence to say it’s bad. Or, at least, I didn’t until today.

The presenter was Dr. Stunnenberg, from the Nijmegen Center for Molecular Sciences, whose web page doesn’t do him justice in any respect. To begin with, Dr. Stunnenberg gave a big apology for the change of date for his talk – I gather the originally scheduled talk had to be postponed because someone had stolen his bags while he was on the way to the airport. That has got to suck, but I digress…

Right away, we were told that the talk would focus not on “ChIP-on-Chip”, but on ChIP-Seq, instead, which cheered me up tremendously. We were also told that the poor graduate student (Mark?) who had spent a full year generating the first data set based on the ChIP-on-Chip method had had to throw away all of his data and start over again once the ChIP-Seq data had become available. Yes, it’s THAT much better. To paraphrase Dr. Stunnenberg, it wasn’t worth anyone’s time to work with the ChIP-on-Chip data set when compared to the accuracy, speed and precision of the ChIP-Seq technology. Ah, music to my ears.

I’m not going to go over the data presented, as it would mostly be of interest only to cancer researchers, other than to mention it was based on estrogen receptor-mediated binding. However, I do want to raise two interesting points that Dr. Stunnenberg touched upon: the minimum height threshold they applied to their data, and the use of polymerase occupancy.

With respect to their experiment, they performed several lanes of sequencing on their ChIP-Seq sample and used standard peak finding to identify areas of enrichment. This yielded a large number of sites – I seem to recall something in the range of 60-100k peaks, with a “statistically derived” cutoff around 8-10. No surprise; this is a typical result for a complex interaction with a relatively promiscuous transcription factor: a lot of peaks! The surprise to me was that they decided this was too many, and so applied an arbitrary minimum peak height of 30, which reduced the count to roughly 6,400 peaks. Unfortunately, I can’t come up with a single justification for a threshold of 30. In fact, I don’t know that anyone could, including Dr. Stunnenberg, who admitted it was rather arbitrary – they simply felt that tens of thousands of peaks was too many.

I’ll be puzzling over this for a while, but it seems like a lot of good data was rejected for no particularly good reason. Yes, it made the data set more tractable, but considering the number of peaks we work with regularly at the GSC, I’m not really sure this is a defensible reason. I’m personally convinced there is a lot of biological relevance in the peaks with low heights, even if we aren’t yet aware of what that is, and arbitrarily raising the minimum height threshold three-fold over the statistically justifiable cutoff is a difficult pill to swallow.

Moving along, the part that really impressed me (one of many impressive parts, really) was the use of polymerase occupancy ChIP-Seq tracks. Whereas the GSC tends to do a lot of transcriptome work to identify the expression of genes, Dr. Stunnenberg demonstrated that polymerase ChIP can be used to gain the same information with much less sequencing. (I believe he said 2-3 lanes of Solexa data were all that were needed, whereas our transcriptomes have been done with up to a full 8 lanes.) Admittedly, I’d rather have both the transcriptome and polymerase occupancy, since it’s not clear where each one has weaknesses, but I can see obvious advantages to both methods, particularly the benefit of having direct DNA evidence rather than mapping cDNA back to genomic locations for the same information. I think this is something I’ll definitely be following up on.

In summary, this was clearly a well-thought-through talk, delivered by a very animated and entertaining speaker. (I don’t think Greg even thought about napping through this one.) There’s clearly some good work being done at the Nijmegen Center for Molecular Sciences, and I’ll start following their papers more closely. In the meantime, I’m kicking myself for not going to the lunch to talk with Dr. Stunnenberg afterwards – alas, the ChIP-on-Chip poster sent out in advance had me fooled, and I had booked myself into a conflicting meeting earlier this week. Hopefully I’ll have another opportunity in the future.

By the way, Dr. Stunnenberg made a point of mentioning they’re hiring bioinformaticians, so interested parties may want to check out his web page.

>New ChIP-Seq tool from Illumina

>Ok, I had to blog this. Someone on the SeqAnswers forum brought it to my attention that Illumina has a new tool for ChIP-Seq experiments. That in itself doesn’t bother me – the more people in this space, the faster we learn about what makes us tick.

What surprises me, though, is the tool itself (the BeadStudio data analysis software – ChIP sequencing module). It’s implemented only for Windows, for one. (Don’t most self-respecting scientists use Macs or Linux these days? Or at least use and develop tools that can be used cross-platform?) Second, the feature set appears to be a re-implementation of the UCSC Genome Browser. Given the choice between the two, I don’t see any reason to buy the Illumina version. (Yes, you have to pay for it, whereas UCSC is free and flexible.) I can’t tell if it loads bed files or wig files, but the screenshots show a rather inflexible tool that looks like a graphical version of Gap4 or Consed. I’m not particularly impressed.

Worse still, I can’t see this being implemented in a pipeline. If you’re processing hundreds of ChIP-Seq experiments in a year – or thousands, once this technique really hits its stride – why would you want to force it all through a GUI? I just don’t get it.

Well, what do I know? Maybe there’s a big market for people out there who don’t want free cross-platform tools, and would rather pay for a brand name science application than use something that works. Come to think of it, I’m willing to bet there are a few pharma companies out there who do think like that, and Illumina is likely to conquer that market with their tool. Happy clicking, Vista users.