>More ups, less downs

>This has been an interesting week, and it’s only Tuesday. While last week dragged on – mainly because I was stuck at home sick – this week is flying by in a hurry. I have several topics to blog about (ABI SOLiD, arrays vs. ChIP-Seq, excellent organizational tools, etc.) but now find myself with enough work that I don’t really have much time to blog on all of them today. Hopefully I can get around to it in the next couple of days.

Last week, I blogged about my frustration with the GSC environment, but then found a few things to act as a counter-point. Whereas last week I was stuck at home and could only watch things unfold from a distance, this week I’m back in the office and able to talk with my co-workers, and I’ve realized how valuable a tool that can be. Even if I’m frustrated with the jostling for position that goes on, there are fantastic people at the GSC who work hard and are happy to lend a hand. The GSC isn’t perfect, but the people here are definitely one of its strengths.

That said, on a more personal note, I’ve had quite a few “perks” appear this week:

  • For the first time in my life, I’ve been asked to review a paper for a journal,
  • I’ve also become involved in several really cool projects,
  • I’ve had the opportunity to review some ABI SOLiD files for inclusion in FindPeaks,
  • I was even offered a chance to apply for a post-doc with a lab that I’ve been following closely. Talk about a flattering offer! (Unfortunately, I’m not close to graduating yet, so it’s a bit premature. DRATS!)

Anyhow, it’s stuff like that that keeps me writing the software and working hard. I guess that answers the question of why I try to do good work all of the time – a little bit of recognition goes a long way toward keeping me motivated. (=

Happy Tuesday, everyone.

>James Bond and evil hackers…

>Totally off-topic blog today, but it’s the weekend, so I feel like that’s not a bad thing.

I watched “Tomorrow Never Dies” this evening, a 1997 James Bond movie starring Pierce Brosnan. I swear I thought I’d seen most, if not all, of the James Bond flicks, though I guess this one had slipped right by me. It wasn’t the best Bond movie I’d seen, but it was amusing enough for a Saturday night without a televised hockey game.

On my evening walk with the puppy, I got to thinking about who I’d be if I were in a Bond Movie.

Clearly, I wouldn’t be James Bond himself – I leave that to my older brother, who probably is the Canadian James Bond. At least, I always picture him that way. (Though I don’t usually picture him in a tuxedo, and I seriously doubt martinis are his drink. I’ll have to ask him next time I see him.)

I figured, based on my physical prowess, I probably wouldn’t be any of the action heroes. I’m a little too clumsy for that sort of stuff. (You can ask any of my lab partners from undergrad about how well glassware and I get along.)

Which leaves me with two options: the guy who gets run over during the car chase (which isn’t too flattering) or one of the generic assistant genius characters that hacks computers or develops bio-weapons in the pay of the truly insane evil world domination plotter. Well, since the glassware and I probably wouldn’t get along on the set any better than in real life, that probably leaves the assistant hacker role.

And that brings me to my next observation. In most films these days, the hackers are either fat slobs (in the style of Dennis Nedry in Jurassic Park) or twig-thin, socially inept loners (Neo from The Matrix). Instead, the evil American techno-genius computer hacker in this particular flick was made up to look like none other than Richard Stallman. (Considering how they lampoon Microsoft in this film, I wouldn’t be surprised if that were deliberate.) Both the shot at Microsoft and this are geeky jokes, but an interesting comment on society from Hollywood’s perspective. Anyhow, judge for yourself.

On the left, evil American techno-genius Henry Gupta (a very prescient name, considering the techno-expert Gupta who works at evil American corporation SCO…), played by magician Ricky Jay, and benevolent genius Richard Stallman (as himself!) on the right. What do you think?

[Images: Evil Genius Gupta (left); Benevolent Genius Richard Stallman (right)]

>grad school – phase II?

>I’m trying to keep up with my one-posting per-day plan, but I considered abandoning it today. It’s just hard to keep focussed long enough to write a post when you’ve been coughing all day. Fortunately, the coughing let up, the fever went away, and – even if my brain isn’t back to normal – I’m feeling much better this afternoon than I was this morning.

Anyhow, spending the day on the couch gave me lots of time for reflection, even if my brain was at 1/4 capacity. I’ve been able to follow several email threads, and contribute to a couple along the way. And, of course, all this sitting around and watching email go by made me re-evaluate my attitude toward grad school. Of course, this posting could just be because I’m sick and grumpy. Still, there’s a nugget of truth in here somewhere…

Until today, I’ve been trying to write good quality code that can be used in our production pipelines – and some of it has been. FindPeaks has been a great success in that respect. [EDITED]

However, several other pieces of my work were slated to be put into production for several months, until a few days ago. Now, they’ve all been discarded by the GSC in favour of other people’s work. Of course, I can’t complain about that – they’re free to use whatever they want, and – despite doing my research at the GSC – I have to remind myself that I’m not a GSC employee, and they have no obligation to use my code.

To be fair, the production requirements for one element of the software changed this morning, and my code that was ready to be put in production a year ago is no longer cutting edge.

However, in hindsight, I guess I have to ask some questions: why have I been writing good code, when I could have been writing code that does only what I want, and doesn’t handle the hundreds of other cases that I don’t come across in my research? What advantages were there for me in doing a good job?

  • If I want to sell my code, it belongs to UBC. Clearly this is a non-starter.
  • If I want the GSC to use my code, I won’t get any recognition for it. (That’s the problem with writing a tool: Many people will get the results, but no one will ever know that it was my code in use, and unlike lab work, pipeline software doesn’t seem to get included on publications.)
  • If I’m doing it just for myself, then there’s the satisfaction of a job well done, but it distracts me from other projects I could be working on.
  • If I want to publish using my code, it doesn’t have to be production ready – it only has to do the job I want it to, so there’s clearly no advantage here.

Which leaves me with only one decent answer:

  • develop all my code freely on the web, and let other people participate. The more production-ready it is, the more people will use it. The more people who use it, the more my efforts are recognized – which, in a way, is what grad school is all about: publications and recognition.

I seem to have more or less already adopted that model [EDITED]

Some time last year, I came to the realization that I should stop helping people who just take advantage of my willingness to work, but I think this is a new, more jaded version of it.

Viva la grad school.

[EDITED] I guess now I understand why people leave Academia.

>The best laid schemes…

>”The best laid schemes o’ mice an’ men / Gang aft a-gley.” – Robert Burns

Well, ok, I don’t speak 18th-century Scots, but I fully understand the sentiment. I have several major things in the works, and yet I’m suffering from a bad cold and sore throat. So the FindPeaks manual is going to be delayed a few days. I guess I’m not the only one working on it, but I definitely have plans for it.

Anyhow, between the FindPeaks manual, some new FindPeaks features, a poster abstract and two papers I’m working on, I have enough on my plate – but just don’t have the energy to get it all done. Damn colds! They always get you just as the stress level starts to rise.

Anyhow, thanks to tcezard (Tim), who’s taking on an ever-increasing role in the FindPeaks work, FindPeaks 3.2.1 was tagged today. Development is so much faster when people work together!

I thought I’d also go one further today, and add a plug for a very cool technology that hasn’t (to my knowledge) gotten much coverage – ODF@WWW. As a scientist who uses wikis and blogs, this has to be one of the best applications I’ve seen of extending the wiki concept. It has the ability to change the way scientists communicate data, and to allow wikis to be incorporated into environments where they previously were “too complex to be useful.” Of course, more people have to adopt ODF before this will really take off, but with applications like this, it’s only a matter of time. (Yay open formats for making this possible!)

>Two milestones reached

>Two of my three projects have met significant milestones this weekend, which I thought were worth sharing.

The first one is that FindPeaks had its first official patch submitted to me from developers outside the Genome Sciences Centre, on Sunday night. Thank you very much to the submitters! I’m very glad to see someone using (and playing with!) the code. While the patch was minor, it tells me that there is definitely value in having released it to the wild.

The second milestone is for my blog, which, while not officially a “project” in the academic sense of the word, has consumed enough of my time that it starts to feel like one. The milestone? There are now as many people coming here directly as there are finding it through Google. It’s not a huge number of hits, but it’s nice to know there’s an audience enjoying my random rants. (-:

>MACS and end of the first generation of peak finders.

>First of all, I’d like to thank the anonymous poster who brought this paper (Zhang et al, Model-based Analysis of ChIP-Seq (MACS), Genome Biology) to my attention, and Xue, who passed me the link to it. It didn’t appear in Google until this morning, for me, so knowing where it was beforehand was very helpful.

Unlike the other ChIP-Seq software papers that I reviewed, I think this one is the first that I feel was really well done – and it’s a pretty decent read too. My biggest complaint is just that we had printer issues this morning, so it took me nearly 2 hours to get it printed out. (-;

Ok. Before I jump into technical details, I just have to ask one (rhetorical) question to Dr. Johnson: is that really your work email address? (-:

MACS appears to be an interesting application. Feature for feature, it competes well with the other applications in its group. In fact, from the descriptions given, I think it does almost identical calculations to those implemented in FindPeaks 3.1. (And yes, they seem to have used 3.1+ in their comparisons, though it’s not explicitly stated which version was used.)

There are two ways in which MACS differs from the other applications:

  1. the use of a “shift” (referred to as an “adaptive fragment length” in FindPeaks), and
  2. the use of controls in a more coherent manner.

The first is interesting because it’s one of the last open questions in ChIP-Seq data: how do you determine the actual length of a single end tag?

Alas, while FindPeaks had a very similar module for calculating fragment lengths in version 3.1.0, it was never released to the public. Here, MACS beats FindPeaks to the punch, and has demonstrated that an adaptive fragment size can work. In fact, their solution is quite elegant: identify the top X peaks, then calculate the distance from the peak at which the tag ends normally occur. I think there’s still room for improvement, but this has now opened the Pandora’s box of fragment length estimation for everyone to get their hands dirty and play with. And it’s about time.
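Based only on the paper’s description, the idea can be sketched roughly like this (the function, its names, and the 200bp search window are my own inventions for illustration – this is not MACS code):

```python
def estimate_fragment_length(top_summits, sense_starts, antisense_starts, search=200):
    """Rough sketch of shift estimation from top-ranked peaks:
    around each summit, compare the mean 5' position of nearby
    sense tags (upstream) against nearby anti-sense tags
    (downstream); the separation approximates the fragment length."""
    gaps = []
    for summit in top_summits:
        up = [s for s in sense_starts if summit - search <= s <= summit]
        down = [s for s in antisense_starts if summit <= s <= summit + search]
        if up and down:
            gaps.append(sum(down) / len(down) - sum(up) / len(up))
    # average the per-peak separations to get one global estimate
    return sum(gaps) / len(gaps) if gaps else None
```

The shift applied to each tag would then be half this distance, moving both strands’ tags toward the presumed binding site.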

The second improvement is their handling of controls. Several applications have moved “control” handling into the core algorithms, while FindPeaks has chosen to keep it part of the post-processing. Again, this is a good step in the right direction, and I’m glad to see them make controls an integral (though optional) part of the ChIP-Seq analysis.

To get into the specifics of the paper, there are several points that are worth discussing.

As a minor point, the terminology in the paper is mildly different, and doesn’t seem to reflect the jargon used in the field. They describe fragment lengths in terms of shifting tag positions, rather than extending tag lengths, and terms like “Watson strand” and “Crick strand” are used to refer to the forward and reverse strands – I haven’t heard those terms used since high school, but it’s cool. (=
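As I read it, “shifting” and “extending” are two views of the same correction. A toy sketch (coordinates here are 5’ tag positions; the functions and geometry are purely illustrative, not either tool’s actual code):

```python
def shift_tag(pos, strand, d):
    """Shift view: move the tag's 5' position d/2 toward its 3' end."""
    return pos + d // 2 if strand == "+" else pos - d // 2

def extend_tag(pos, strand, d):
    """Extend view: grow the tag to the full fragment length d,
    returning the genomic interval it then covers."""
    return (pos, pos + d) if strand == "+" else (pos - d, pos)
```

Note that the shifted position is exactly the midpoint of the extended interval, which is why the two descriptions place the coverage in the same spot.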

Feature wise, MACS implements familiar concepts such as

  • duplicate filtering
  • Thresholding misnamed as FDR (just like FindPeaks!)
  • Peak finding that reports the peak max location
  • requirement of an effective genome size

and then offers some very nice touches

  • p-value cutoffs for peak reporting
  • built in analysis of controls
  • built in saturation calculations

All together, this makes for a great round up for a head to head competition. So guess what they do next? That’s right: it’s a peak finding bake-off!

For each test, MACS is compared against ChIP-Seq Peak Finder (from the Wold lab), FindPeaks (GSC) and QuEST (Valouev). And the competition is on (See Figure 2 in the paper).

Test number one (a) shows that MACS is able to keep the FDR very low for the “known set of FOXA1 binding sites” compared to the other three peak finders. This graph could either be very informative or very confusing, as there are no details on how the number of sites for each application was chosen, whose FDR was used, or otherwise. It could be telling us that MACS underestimates the real FDR for its peaks, that the other peak finders overstate the FDR for their peaks, or that only MACS is able to identify peaks that are true positives. MACS wins this round – though it may be by a technicality.

(Assuming that the peaks are all the same across peak callers, then all this tells us is that the peaks are assigned different FDRs by each peak caller, so I’m not even sure it’s a test of any sort.)

Test number two (b) is the same thing as number one, repeated for a different binding site, so the same caveats apply. I still can’t tell whether the peaks are the same between peak callers or different, making this comparison rather meaningless, except for understanding the FDR biases produced by each peak caller.

Test number three (c) starts to be a little more informative: using a subset of peaks, how many of them contain the motif of interest? With the exception of the Wold lab’s Peak Finder, they all start out around 65% for the top(?) 1000 peaks, and then slowly wind their way down. By 7000 peaks, MACS stays up somewhere near 60%, FindPeaks is about 55%, and QuEST never actually makes it that far, ending up at about 58% by 6000 peaks (the Wold lab Peak Finder starts at 58% and ends at about 54% by 6000 peaks as well). So MACS seems to win this one – its enrichment stays the best over the greatest number of peaks. This is presumably due to its integration of controls into the application itself. Not surprisingly, QuEST, which also integrates controls, ends up a close second. FindPeaks and the Wold lab Peak Finder lose here. (As a small note, one assumes that many of these peaks would be filtered out by the post-processing used by the Wold lab’s software and FindPeaks, but that brings in a whole separate pipeline of tools.)

The fourth test is the same as the third, but with the NRSF data set. The results are tighter this time, and surprisingly, FindPeaks ends up doing just a little better than QuEST, despite the lack of built-in control checking.

The fifth and sixth tests roughly parallel the last two, but this time give the spatial resolution of the motif. Like the fourth test, MACS wins again, with FindPeaks a close second for most tests, and QuEST a hot contender. But it’s really hard to get excited about this one: the gap between best and worst here rarely exceeds 3 bases for FOXA1, and MACS appears to stay within 1-1.5 bases of the motifs identified by FindPeaks for NRSF.

Take-away message: MACS appears to be the best of the applications tested, while the Wold lab Peak Finder seems to lose in all cases. Ouch. QuEST and FindPeaks duke it out for second place, trading punches.

Like all battles, however, this is far from the last word on the subject. Both MACS and FindPeaks are now being developed as open source applications, which means that I expect to see future rounds of this battle being played out – and a little competition is a good thing. I know FindPeaks is pushing ahead in terms of improving FDR/thresholding, and I’m sure all of the other applications are as well.

One problem with all of these applications is that they still assume a Poisson (or similar) model for the noise/peaks, which, we’re all slowly learning, isn’t all it’s cracked up to be. (See ChIP-Seq in Silico.)

But, in contrast, I think we can now safely say that the low-hanging fruit of ChIP-Seq has all been picked. Four independently developed software applications have all been compared – and they all have roughly the same peaks, within a very narrow margin of error on the peak maxima.

To wrap this up, I think it’s pretty clear that the next round of ChIP-Seq software will be battled out by those projects that gain some momentum and community. With the peak finders giving more or less equivalent performance (with MACS just a hair above the others for the moment) and more peak finders becoming open source, the question of which one will continue moving forwards can be viewed as a straight software problem. Which peak finder will gain the biggest and most vibrant community?

That, it seems, will probably come down to a question of programming languages.

>FindPeaks 3.2.0

>Obviously, I didn’t write an article review this morning. Instead, I finished work on an “unstable” FindPeaks 3.2.0 release. It’s now been tagged, for anyone who’d like to download the source code and build it for themselves. The documentation has some catching up to do (which I plan to work on toward the end of the week), but the majority of the features are in place and appear to be functional.

I was debating posting the changelog to SEQanswers, but I think I’ll hold off until I have a more stable version. In the meantime, I’ll just post a summary from the tag commit here. Welcome to the FindPeaks 3.2 series!

(Update: I should also mention two important details. This release includes code contributions and fixes from Timothee Cezard, another developer at the GSC, as well as a code donation (Lander-Waterman FDR algorithm) from Mikhail Bilenky, also at the GSC.)

Changes since FindPeaks

  • New FindPeaks core module
  • Significant reorganization/modularity/OOP of code
  • Significant cleanup of code
  • Ability to read MAQ .map files
  • Ability to read BED files (both reduced and extended)
  • Log_Buffer to provide buffered and logged output to file
  • Several bug fixes to reported issues
  • Better error handling
  • Better error messages and verbose output
  • New coverage-based FDR (thresholding) methods incorporated
  • Native mode (mode 3) for MAQ SET/PET or BED files, or generating wigs
  • Conforms to most Enerjy best-practice recommendations (over 2000 code changes)
  • More controlled exits to handle system crashes gracefully
  • Better handling of constants
  • Smoothing can be turned on or off for adaptive fragment size estimation
  • Utilities for working with PETs
  • FindPeaks can now be used as a simple wig generator
  • Hundreds more fixes and updates!

>Math as art

>I came across an article on the BBC website about the art of maths, which is well worth the two minutes or so it takes to play. The images are stunning (I particularly like the four dimensional picture in two dimensions), and the narration is quite interesting. As a photographer, I especially enjoyed the comparison of math as art to photography as an art. My own take is that both math and photography share the common artistic element of forcing the artist to determine how to best express what it is they’re seeing. Two people can see the same thing, and still get very different photos, which is also a component of expressing math in an artistic manner.

That got me thinking about genomics as art as well. I’m aware of people who’ve made music out of DNA sequences, but not visual art. Oh, sure, you can mount a karyotype or electrophoresis image on your wall, and it’s pretty, but I don’t think genomics has realized the potential for expressing DNA form and function in a non-linear manner yet.

Still, it’s obviously on its way. Every time I go to a presentation, I see a few more “pretty” graphs. Or maybe I’ve just gone to too many seminars, when a graph of the clustering of gene arrays starts to look like a Mondrian painting. Who knows… maybe ChIP-Seq will start looking like that too? (=

>Processing Paired End Reads

>Ok, I’m taking a break from journals. I didn’t like the overly negative tone I was taking in those reviews, so I’m rethinking how I write about articles. Admittedly, I think criticism is in order for some papers, but I can definitely focus on papers that are worth reviewing. Unfortunately, I’m rarely a fan of papers discussing bioinformatics applications, as I always feel like there’s a hidden agenda behind them. Whether it’s simply proving their application is the best, or just getting it out first, computer application papers are rarely impartial.

Anyhow, I have three issues to cover today:

  • FindPeaks is now on SourceForge
  • Putting the “number of reads under a peak” metric to rest – permanently, I hope
  • BED files for different data sources

The first one is pretty obvious. FindPeaks is now available under the GPL on SourceForge, and I hope people will participate in using and improving the software. We’re aiming for our first tagged release on Friday, with frequent tags thereafter. Since I’m no longer the only developer on this project, it should continue moving forward quickly, even while I’m busy studying for my comps.

The second point is this silly notion that keeps coming up: “How many reads were found under each peak?” I’m quite sick of that question, because it really makes no sense. Unfortunately, this was a metric produced for Robertson et al.’s STAT1 data set, and I think other people have included or copied it since. Unfortunately, it’s rubbish.

The reason it worked in STAT1 was that they used a fixed length (or XSET) value on their data set. This set the exact length of each read, which allowed them to figure out how many reads were contiguously linked in each peak. Readers who are paying attention will also realize what the second problem is: they didn’t use subpeaks either. Once you start identifying subpeaks, you can no longer decide to which peak a read spanning two of them belongs. Beyond that, what do you do with reads in a long tail? Are they part of the peak or not?

Anyhow, the best measure for a peak, at the moment at least, is its height. This can also include weighted reads, so that reads which are unlikely to contribute to a peak actually contribute less, bringing in a scaled value. After all, unless you have paired-end tags, you really don’t know how long the original DNA fragment was, which means you don’t know where it ended.
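For what it’s worth, a weighted peak height is simple to compute: it’s just the maximum weighted coverage over any base. A minimal sketch (not FindPeaks code; the per-read weights are whatever scaling scheme you choose, e.g. down-weighting ambiguous alignments):

```python
from collections import defaultdict

def weighted_peak_height(reads, weights=None):
    """Maximum weighted coverage over any base.
    reads: list of (start, end) half-open intervals;
    weights: optional per-read weight, defaulting to 1.0 each."""
    weights = weights or [1.0] * len(reads)
    delta = defaultdict(float)
    for (start, end), w in zip(reads, weights):
        delta[start] += w   # coverage rises at the read start
        delta[end] -= w     # and falls just past the read end
    height, best = 0.0, 0.0
    for pos in sorted(delta):   # sweep left to right, tracking the running sum
        height += delta[pos]
        best = max(best, height)
    return best
```

With unit weights this reduces to the ordinary peak height; scaled weights simply shrink the contribution of doubtful reads.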

That also makes a nice segue to my last point. There are several ways of processing paired-end tags. When it comes to peak calling, it’s pretty simple: you use the default method – you know where the two ends are, and they span the full fragment. For other applications, however, there are complexities.

If the data source is a transcriptome, your read didn’t cover the introns, so you need to process the transcript to include breaks when mapping it back to the genome. That’s really a pain, but it is clearly the best way to visualize transcriptome PETs.

If the data source is unclear, or you don’t know where the introns are (which is a distinct possibility), then you have to be more conservative and not assume the extension of each tag. Thus, you end up with a “tag and bridge” format. I’ve included an illustration to make it clear.
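In case the illustration doesn’t come through, here’s a hypothetical sketch of what a tag-and-bridge record could look like as a BED12 line: the two sequenced tags become solid blocks, and the unsequenced span between them renders as a thin bridge in a genome browser. (The function and its field choices are mine for illustration, not the actual FindPeaks output format.)

```python
def tag_and_bridge_bed(chrom, start1, end1, start2, end2, name="pet"):
    """Emit one BED12 line for a paired-end tag: two solid blocks
    (the sequenced tags) joined by a thin bridge (the unknown middle)."""
    chrom_start, chrom_end = start1, end2
    block_sizes = [end1 - start1, end2 - start2]
    block_starts = [0, start2 - chrom_start]   # offsets relative to chrom_start
    return "\t".join(map(str, [
        chrom, chrom_start, chrom_end, name, 0, "+",
        chrom_start, chrom_end, "0,0,0",       # thickStart/thickEnd/itemRgb
        2,                                     # two blocks, one per tag
        ",".join(map(str, block_sizes)),
        ",".join(map(str, block_starts)),
    ]))
```

A browser draws the two blocks solid and connects them with a thin line – exactly the conservative “don’t assume the extension” rendering.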

So why bring it up? Because I’ve had several requests for the tag-and-bridge format, and my code only works on the default mode. Time to make a few adjustments.


>One more day, one more piece of ChIP-Seq software to cover. I’ve not talked about FindPeaks much, which is the software descended from Robertson et al., for obvious reasons: the paper was just an application note, and, well, I’m really familiar with how it works, so I’m not going to review it. I have talked about QuEST, however, which was presumably descended from Johnson et al. And those of you who have been following ChIP-Seq papers since the early days will realize that there’s still something missing: the peak finder descended from Barski et al., which is the subject of today’s blog: SISSR. Those were the first three published ChIP-Seq papers, and so it’s no surprise that each of them was followed up with a paper (or application note!) on its software.

So, today, I’ll take a look at SISSR, to complete the series.

From the start, the Barski paper was discussing both histone modifications and transcription factors, so the context of the peak finder is a little different. Where FindPeaks (and possibly QuEST as well) was originally conceived for identifying single peaks, and expanded to do multiple peaks, I would imagine that SISSR was conceived with the idea of working on “complex” areas of overlapping peaks. That’s only relevant in terms of their analysis, though – I’ll come back to it.

The most striking thing you’ll notice about this paper is that the data sets look familiar. They are, in fact, the sets from Robertson, Barski and Johnson: STAT1, CTCF and NRSF, respectively. This is the first of the ChIP-Seq application papers that actually performs a comparison between the available peak finders and, of course, claims that theirs is the best. Again, I’ll come back to that.

The method used by SISSR is almost identical to the method used by FindPeaks, with the use of directional information built into the base algorithm, whereas FindPeaks provides it as an optional module (the -directional flag, which uses a slightly different method). They provide an excellent visual on the fourth page of the article demonstrating their concept, which will explain the method better than I can, but I’ll try anyhow.

In ChIP-Seq, a binding site is expected to have many real tags pointing at it: tags upstream should be on the sense strand, and tags downstream should be on the anti-sense strand. Thus, a real binding site should exist at transition points, where the majority of tags switch from sense to anti-sense. By identifying these transition points, they are able to identify the locations of real binding sites. More or less, that describes the algorithm employed, with the following modifications: a window is used (20bp by default) instead of working on a base-by-base basis, and parameter estimation is employed to guess the length of the fragments.
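The transition-point idea can be sketched in a few lines. (This is my reading of the description, not SISSR’s actual implementation; the windowed counts and the 20bp size are illustrative.)

```python
def find_transitions(windows, window_size=20):
    """Scan per-window tag counts for sense -> anti-sense flips.
    windows: list of (sense_count, antisense_count), one per window.
    A candidate binding site sits at the boundary where the net
    strand balance flips from sense-dominant to antisense-dominant."""
    sites = []
    for i in range(1, len(windows)):
        prev_net = windows[i - 1][0] - windows[i - 1][1]
        cur_net = windows[i][0] - windows[i][1]
        if prev_net > 0 and cur_net < 0:       # the strand balance flips here
            sites.append(i * window_size)      # boundary coordinate of the flip
    return sites
```

Working per-window rather than per-base is exactly the resolution trade-off I complain about below.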

In my review of QuEST, I complained that windows are a bad idea(tm) for ChIP-Seq, only to be corrected: QuEST wasn’t using a window. This time, the window is explicitly described – and again, I’m puzzled. FindPeaks uses an identical operation without windows, and it runs blazingly fast. Why throw away resolution when you don’t need to?

On the subject of length estimation, I’m again less than impressed. I realize this is probably an early attempt at it – and FindPeaks has gone through its fair share of bad length estimators, so it’s not a major criticism, but it is a weakness. To quote a couple of lines from the paper: “For every tag i in the sense strand, the nearest tag k in the anti-sense strand is identified. Let J be the tag in the sense strand immediately upstream of k.” Then follows a formula based upon the distances between (i,j) and (j,k). I completely fail to understand how this provides an accurate assessment of the real fragment length – I’m sure I’m missing something. As a function that describes the width of peaks, it may be a good method, which is really what the experiment is aiming for anyhow – so it’s possible that this is just poorly named.

In fairness, they also provide options for a manual length estimation (or XSET, as it was referred to at the time), which overrides the length estimation. I didn’t see a comparison in the paper about which one provided the better answers, but having lots of options is always a good thing.

Moving along, my real complaint about this article is the analysis of their results compared to past results, which comes in two parts. (I told you I’d come back to it.)

The first complaint is about what they compare against. The article was submitted for publication in May 2008, but they compared results to those published in the June 2007 Robertson article for STAT1. By August, our count of peaks had changed. By January 2008, several upgraded versions of FindPeaks were available, and many bugs had been ironed out. It’s hardly fair to compare the June 2007 FindPeaks results to the May 2008 version of SISSR, and then declare SISSR the clear winner. Still, that’s not a huge problem – albeit somewhat misleading.

More vexing is their quality metric. In the motif analysis, they clearly state that, because of the large amount of computing power required, only the top X% of reads were used in their analysis. For the comparison with FindPeaks, the top 5% of peaks were used – and they were able to observe the same motifs. Meanwhile, their claim to find 74% more peaks than FindPeaks is not really discussed in terms of the quality of the additional sites. (FindPeaks was also modified to identify subpeaks after the original data set was published, so this is really comparing apples to oranges – a fact glossed over in the discussion.)

Anyhow, complaints aside, it’s good to see a paper finally compare the various peak finders out there. They provide some excellent graphics and a nice overview of how their ChIP-Seq application works, while contrasting it with the published data available. Again, I enjoyed the motif work, particularly that of Figure 5, which correlates four motif variants to tag density – a fantastic bit of information, buried deeper in the paper than it should be.

So, in summary, this paper presents a rather unfair competition by using metrics guaranteed to make SISSR stand out, but still provides a good read with background on ChIP-Seq, excellent illustrations and the occasional moment of deep insight.