>Response to QuEST posting

>One thing I really enjoy about this blog is that it seems to encourage communication between people in the 2nd Generation Genomics field. In particular, I get to hear from a lot of people that I otherwise would never have the privilege of meeting. After my posting on the QuEST software for ChIP-Seq the other day, I got a reply from Anton Valouev, the first author on the QuEST paper. (Whose name I accidentally misspelled in the original post. Sorry – it’s now been fixed!)

He was kind enough to point out some of my mistakes, and to point me to the wikipedia entry on Kernel Density Estimation (KDE), which I had misunderstood as a windowing function. Anyhow, not only was he willing to teach me some science, he was also kind enough to let me print his reply here, which rebuts several of my points very well, and clearly points out the significance of the paper, which I had understated.

Here’s Anton Valouev’s reply:

Hi Anthony,

my name is Anton Valouev and I really enjoyed your blog post featuring QuEST. I wanted to clarify a couple of things about your post.

Although kernel density estimation uses bandwidth, it is not strictly a windowed approach since the points beyond the bandwidth window contribute to the estimation (see the formula in the methods section of our paper). KDE is an extremely powerful approach and surpasses windowed counts in every single respect except perhaps in simplicity of calculations. Wikipedia provides good reading on KDE approaches.

The novelty of the paper is in the software itself, which appears to outperform the Wold lab peak caller (featured in Johnson et al., although we do not demonstrate this in the paper, as you have noted).

For these reasons, QuEST is currently adopted as a default peak caller by the ENCODE consortium. Yes, the analysis of the data is somewhat similar to Johnson et al Science paper, but the software is completely new which is the main point of the paper. Another new result is in direct detection of the ELK4 motif in the SRF data which hasn’t been shown before. Obviously motif finding is not new, but this particular result is.

The code dealing with controls is in the peak_caller.cpp lines 248-256, so this is not an unreleased feature. Also, I’d really appreciate if you can change my last name to correct spelling.

Good luck with your FindPeaks software, ChIP-Seq is seeing its early days, so much work is still to be done in this area.

Very best, Anton.
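Anton’s distinction between KDE and windowed counts is worth making concrete. Here’s a toy sketch (my own illustration with made-up read positions – not code from QuEST): a windowed count weights every read inside the window equally and ignores everything outside it, while a Gaussian kernel lets every read contribute, downweighted by its distance.

```python
# Toy comparison of a windowed count vs. a Gaussian kernel density
# estimate (KDE) of read-start density. Positions and bandwidth are
# invented for illustration; this is not code from QuEST.
import math

reads = [100, 103, 104, 110, 125]   # hypothetical read-start coordinates
bandwidth = 10.0                    # KDE bandwidth / window half-width

def windowed_count(x):
    # Hard cutoff: reads beyond the window contribute nothing.
    return sum(1 for r in reads if abs(r - x) <= bandwidth)

def kde(x):
    # Gaussian kernel: every read contributes, weighted by distance.
    n, h = len(reads), bandwidth
    return sum(math.exp(-((x - r) / h) ** 2 / 2.0)
               for r in reads) / (n * h * math.sqrt(2 * math.pi))

print(windowed_count(105), kde(105))
```

The read at position 125 falls outside a ±10bp window around 105, so it’s invisible to the windowed count, but it still makes a (tiny) contribution to the KDE – which is the sense in which KDE isn’t strictly a windowed approach.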


>I had an article all ready to write about today, but then decided to save it for tomorrow, and instead write about something completely unrelated: the CIRA AGM. Yes, I went downtown for part of the afternoon to the annual general meeting of the organization that controls the top level domain for Canada. (The .ca domain.)

Ok, so I’ll admit, they had good door prizes for people who sat through the whole thing (which I did), and with five chances to win good stuff (which I didn’t), I was tempted to go. The free lunch and 1Gb USB key really did make it worth my time. Besides, it seemed like something that could be related to my work with computers. (It wasn’t – but I really didn’t know that when I decided to go.)

Lunch was somewhat disappointing – they ran out of a lot of food before I got there. That’s my biggest complaint. Anyhow….

The AGM itself wasn’t really enlightening, though the question period was entertaining. The guy who stood up and called CIRA “a bloated bureaucracy” was amusing, and the other guy who insisted the President defend their choice of speaker (who was in the room at the time) really made the question period a hoot. I was laughing pretty hard at the guy who demanded the CIO help him troubleshoot the fact that his domain wasn’t appearing in the listings of several search engines.

The more relevant thing was the Keynote speaker – Amber McArthur. For a crowd that held a lot of over-50’s type people, an introduction to web 2.0 social networking was probably an eye opener. For anyone who’s ever blogged, it was underwhelming – but an interesting social commentary on the management of CIRA.

Putting this in the context of some of the other statements made during the president’s address, and by other members of the board, a clear picture starts to emerge (my paraphrasing):

  • They’re a Windows-only shop internally, but don’t mind using Unix for their DNS servers.
  • They’ve selected Microsoft Exchange as the premier email management and calendaring tool (and blew a lot of money on it this year).
  • They believe they’re on the cutting edge of domain registry authorities in the world in terms of technology and transparency.
  • They’re extremely proud of their use of Oracle10i (migrated this year).
  • They’re proud of their recent decision to use Cognos Business tools. (I’ve developed for them before… Cognos tools aren’t all they’re cracked up to be.)
  • One of their biggest achievements this year was to automate password recovery for members! (it was previously a manual process.)

I could go on, but I won’t.

Essentially, I don’t think they’re a bloated bureaucracy, but they’re clearly wedded to the proprietary model for the software they acquire (which is probably not a bad thing, considering their need for 100% uptime), yet they aren’t spending much time figuring out how to evolve with the internet. They’re a perfect model of what a software/internet shop should have been in the late 1990s. Do you think they use wikis in their office environment? How about instant messaging between developers? What about social networking to educate the public on their mandate? Have they ever contributed information to the Wikipedia page on CIRA?

Given that they needed to invite a speaker to introduce them to web 2.0, I highly doubt it. They don’t “grok” where the web is going.

That’s the irony: the organization in charge of keeping Canada’s internet presence in order is clueless as to where the internet is going, and how it’s transforming itself through new technology. Come on, people – if most of your board members have never heard of Facebook, how are they going to guide you through the next 10 years of the web’s evolution?

>ChIP-Seq in silico

>Yesterday I got to dish out some criticism, so it’s only fair that I take some myself, today. It came in the form of an article called “Modeling ChIP Sequencing In Silico with Applications”, by Zhengdong D. Zhang et al., PLoS Computational Biology, August 2008: 4(8).

This article is actually very cool. They’ve settled several points that have been hotly debated here at the Genome Sciences Centre, made the case for some of the stuff I’ve been working on – and then showed me a few places where I was dead wrong.

The article takes direct aim at the work done in Robertson et al., using the STAT1 transcription factor data produced in that study. Their key point is that the “FDR” used in that study was far from ideal, and that it could be significantly improved. (Something I strongly believe as well.)

For those that aren’t aware, Robertson et al. is sort of the ancestral origin of the FindPeaks software, so this particular paper is more or less aiming at the FindPeaks thresholding method. (Though I should mention that they’re comparing their results to the peaks in the publication, which used the unreleased FindPeaks 1.0 software – not the FindPeaks 2+ versions, of which I’m the author.) Despite the comparison to the not-quite current version of the software, their points are still valid, and need to be taken seriously.

Mainly, I think there are two points that stand out:

1. The null model isn’t really appropriate.
2. The even distribution isn’t really appropriate.

The first, the null model, is relatively obvious – everyone has been pretty clear from the start that the null model doesn’t really work well. This model, pretty much consistent across ChIP-Seq platforms, can be paraphrased as “if my reads were all noise, what would the data look like?” This assumption is destined to fail every time – the reads we obtain aren’t all noise, and thus assuming they are as a control is really a “bad thing”(tm).

The second, the even distribution model, is equally disastrous. This can be paraphrased as “if all of my noise were evenly distributed across some portion of the chromosome, what would the data look like?” Alas, noise doesn’t distribute evenly for these experiments, so it should be fairly obvious why this is also a “bad thing”(tm).
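To see what the even-distribution model actually predicts, here’s a toy version in Python (all the numbers are invented for illustration – this is not code from Zhang et al. or FindPeaks): scatter reads uniformly over a region and ask how tall the tallest pileup of overlapping reads gets by chance.

```python
# Sketch of the "even distribution" null model: place reads uniformly
# over a region and measure the maximum coverage depth that occurs
# purely by chance. Region size, read count, and read length are made up.
import random

random.seed(42)
REGION = 1_000_000   # hypothetical region length (bp)
N_READS = 10_000     # hypothetical number of reads
READ_LEN = 30        # hypothetical read length (bp)

def max_pileup():
    # Coverage deltas: +1 at each read start, -1 just past each read end.
    deltas = {}
    for _ in range(N_READS):
        start = random.randrange(REGION - READ_LEN)
        deltas[start] = deltas.get(start, 0) + 1
        deltas[start + READ_LEN] = deltas.get(start + READ_LEN, 0) - 1
    height = tallest = 0
    for pos in sorted(deltas):   # sweep left to right, tracking depth
        height += deltas[pos]
        tallest = max(tallest, height)
    return tallest

print(max_pileup())
```

Any observed peak taller than this background maximum gets called “signal” under the null – which fails precisely because real noise is clumpier than a uniform scatter.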

The solution presented in the paper is fairly obvious: create a full simulation of your ChIP-Seq data. Their version, however, requires a much more rigorous process. They simulate a genome-space, remove areas that would be gaps or repeats in the real chromosome, then tweak the genome simulation to replicate their experiment using weighted statistics collected from the ChIP-Seq experiment.

On the one hand, I really like this method, as it should give a good version of a control; on the other hand, I don’t like that you need to know a lot about the genome of interest (i.e., mappability, repeat-masking, etc.) before you can analyze your ChIP-Seq data. Of course, if you’re going to simulate your genome, simulate it well – I agree with that.
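The mappability/repeat-masking requirement can be sketched as a simple rejection-sampling step (toy coordinates of my own invention, not the authors’ pipeline): simulated reads that land in gap or repeat regions are discarded, so the simulated background only covers the mappable portion of the genome.

```python
# Sketch of the masking step in a ChIP-Seq simulation: drop simulated
# reads that land in gap/repeat regions so the simulated background
# matches what is actually mappable. All coordinates are invented.
import random

random.seed(1)
REGION = 10_000
# Hypothetical gap/repeat intervals to exclude, as (start, end) pairs:
MASKED = [(2_000, 2_500), (7_000, 7_800)]

def is_masked(pos):
    return any(s <= pos < e for s, e in MASKED)

def simulate_reads(n):
    # Rejection sampling: keep drawing until n reads land in mappable space.
    reads = []
    while len(reads) < n:
        pos = random.randrange(REGION)
        if not is_masked(pos):
            reads.append(pos)
    return reads

reads = simulate_reads(1_000)
assert not any(is_masked(r) for r in reads)
```

This is the part I was grumbling about above: you can’t run the rejection step without already knowing the masked intervals for your genome of interest.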

I don’t want to belabor the point, but this paper provides a very nice method for simulating ChIP-Seq noise in the absence of a control, as in Robertson et al. However, I think there are two things that have changed since this paper was submitted (January 2008) that should be mentioned:

1. FDR calculations haven’t stood still. Even at the GSC, we’ve been working on two separate FDR models that no longer use the null model; however, both still make even-distribution assumptions, which is also not ideal.

2. I believe everyone has now acknowledged that there are several biases that can’t be accounted for in any simulation technique, and that controls are the way forward. (They’re used very successfully in QuEST, which I discussed yesterday.)
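To make the even-distribution assumption in point 1 concrete: if noise really were uniform, per-base pileup would be roughly Poisson, and a threshold could be picked where the expected number of background bases at that height drops below one. Here’s a miniature version with invented parameters (this is not the GSC’s actual FDR model):

```python
# Sketch of an even-distribution FDR threshold: under uniform coverage,
# per-base pileup is approximately Poisson, so we can compute the
# expected number of bases reaching any given height by chance alone.
# All parameters are invented for illustration.
import math

GENOME = 1_000_000   # hypothetical mappable bases
N_READS = 10_000
READ_LEN = 30
lam = N_READS * READ_LEN / GENOME   # mean per-base coverage (0.3x here)

def expected_bases_at_least(h):
    # GENOME * P(Poisson(lam) >= h)
    p_below = sum(math.exp(-lam) * lam**k / math.factorial(k)
                  for k in range(h))
    return GENOME * (1.0 - p_below)

# Smallest peak height whose expected background count is below one.
threshold = next(h for h in range(1, 50)
                 if expected_bases_at_least(h) < 1)
print(threshold)
```

With these numbers the mean coverage is 0.3×, and the first height whose expected background count falls below one is 6. Any bias that makes real noise clumpier than Poisson makes a threshold like this too permissive – which is exactly the paper’s point.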

Anyhow, to summarize this paper: Zhang et al. provide a fantastic critique of the thresholding and FDR used in early ChIP-Seq papers (which is still in use today, in one form or another), and demonstrate a viable and clearly superior method for refining ChIP-Seq results without a matched control. This paper should be read by anyone working on FDRs for next-gen sequencing and ChIP-Seq software.

(Post-script: In preparation for my comprehensive exam, I’m trying to prepare critical evaluations of papers in the area of my research. I’ll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and – coming from a graduate student – shouldn’t be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism – I’m simply trying to evaluate the science in the context of the field as it stands today.)


>(Pre-script: In preparation for my comprehensive exam, I’m trying to prepare critical evaluations of papers in the area of my research. I’ll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and – coming from a graduate student – shouldn’t be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism – I’m simply trying to evaluate the science in the context of the field as it stands today.)

(UPDATE: A response to this article was kindly provided by Anton Valouev, and can be read here.)

I once wrote a piece of software called WINQ, which was the predecessor of a piece of software called Quest. Not that I’m going to talk about that particular piece of Quest software for long, but bear with me a moment – it makes a nice lead in.

The software I wrote wasn’t started before the University of Waterloo’s version of Quest, but it was released first. Waterloo was implementing a multi-million dollar system for managing student records built on Oracle databases, PeopleSoft software, and tons of custom extensions to web interfaces and reporting. Unfortunately, the project was months behind, and the Quest system was nowhere near being deployed. (Vendor problems and the like.) That’s when I became involved – in two months of long days, I used Cognos tools (several of them, involving 5 separate scripting and markup languages) to build the WINQ system, which gave the faculty a secure web frontend for querying the Oracle database and getting all of the information they needed. It was supposed to be in use for about 4-6 months, until Quest took over… but I heard it was used for more than two years. (There are many good stories there, but I’ll save them for another day.)

Back to ChIP-Seq’s QuEST: this application was the subject of a recently published article. In a parallel timeline to the Waterloo story, QuEST was probably started before I got involved in ChIP-Seq, and was definitely released after I released my software – but this time I don’t think it will replace my software.

The paper in question (Valouev et al., Nature Methods, Advance Online Publication) is called “Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data.” I suspect it was published with the intent of being the first article on ChIP-Seq software, which, unfortunately, it wasn’t. What’s most strange to me is that it seems to be simply a reiteration of the methods used by Johnson et al. in their earlier ChIP-Seq paper. I don’t see anything novel in this paper, though maybe someone else has seen something I’ve missed.

The one thing that surprises me about this paper, however, is their use of a “kernel density bandwidth”, which appears to be a sliding window of pre-set length. This flies in the face of the major advantage of ChIP-Seq, which is the ability to get very strong signals at high resolution. By forcing a “window” over their data, they are likely losing a lot of the resolution they could have found by investigating the reads directly. (Admittedly, with a window of 21bp, as used in the article, they’re not losing much, so it’s not a very heavy criticism.) I suppose it could be used to provide a quick way of doing subpeaks (finding individual peaks in areas of contiguous read coverage) at a cost of losing some resolving power, but I don’t see that discussed as an advantage.

The second thing they’ve done is provide a directional component to peak finding. Admittedly, I tried to do the same thing, but found it didn’t really add much value. Both the QuEST publication and my application note on FindPeaks 3.1 mention the ability to do this – and then fail to show any data that demonstrates the value of using this mechanism versus identifying peak maxima. (In my case, I wasn’t expected to provide data in the application note.)

Anyhow, that was the down side. There are two very good aspects to this paper. The first is that they do use controls. Even now, the Genome Sciences Centre is struggling with ChIP-Seq controls, while it seems everyone else is using them to great effect. I really enjoyed this aspect of it. In fact, I was rather curious how they’d done it, so I took a look through the source code of the application. I found the code somewhat difficult to wade through, as the coding style was very different from my own, but well organized. Unfortunately, I couldn’t find any code for dealing with controls, which leads me to think this is an unreleased feature, and was handled by post-processing the results of their application. Too bad.

The second thing I really appreciated was the motif finding work, which isn’t strictly ChIP-Seq, but is one of the uses to which the data can be applied. Unfortunately, this is also not new, as I’m aware of many earlier experiments (published and unpublished) that did this as well, but it does make a nice little story. There’s good science behind this paper – and the data collected on the chosen transcription factors will undoubtedly be exploited by other researchers in the future.

So, here’s my summary of this paper: as a presentation of a new algorithm, they failed to produce anything novel, and no experiments were provided to establish the value of their algorithms over any others. On the other hand, as a paper on the growth-associated binding protein and serum response factor (GABP and SRF, respectively), it presents a nice compact story.

>Data Storage Problems?

>Here I was, thinking that we next-gen genomics researchers (genomicists?) had a tough time with data management, simply because of the phenomenal number of gigabytes and terabytes of data we’re generating. Then I read this article, titled How the Large Hadron Collider Might Change the Web [Scientific American].

When it’s up and running, the LHC will be generating 15 petabytes (15 million gigabytes) of data per year, compared to the several terabytes per year we work with. Ouch.

Anyhow, as a model of distributed data analysis, it’s definitely of interest. Maybe we’ll learn something from the physicists. Or maybe they can lend me a few Terabytes of space for my projects…

>Back to work…

>I’m back from my vacation – and just about ready to get back to work. Almost. Before I do, I figured I should share my pictures, and tie up a few loose ends.

Pictures first! Here’s the full album.

And here’s one of my favorites:

Ok, now that I’m past showing off pictures, on to more sciency stuff:

The commenting system seems to have worked very well in my absence, so I’ll leave it as is – I won’t be moderating posts as they are added to the blog, though I reserve the right to delete rude or otherwise inappropriate posts – it hasn’t been a problem yet, though, and I hope it doesn’t become one.

I had planned to write more on indels, but I think everyone’s on the same page with it now. Longer reads will give more information, and we’re better off waiting a few months for those, rather than trying to work our way through it now.

I’ve been swamped with emails about FindPeaks while I was away, so anyone who sent me an email that I haven’t yet replied to – not to worry, I’m almost caught up. A few more hours, and I’ll be back to working, instead of emailing. I’ve also decided that FindPeaks needs to be open sourced ASAP – there are too many things that could be fixed by people who are using the program, rather than waiting for me while I play customer support. It’s wasting my time, which could be better spent on developing new features.

FindPeaks has also had its first commit from someone other than me. Yay! It happens to be another developer at the GSC, but it’s a good first step.

And yes, I think I finally have a date for my committee meeting, and I’m also narrowing down a date for my comprehensive exam. I suspect that means a lot of my posts in the future will be literature reviews, roundups and discussions of a wide range of genomic topics, starting in October.

Ok, that about does it for now – Only about 45 more emails to process, and I’ll be back on top of things.