>Bowtie and Single End Mapped Paired End Data

>Strange title for a posting, but it actually makes sense if you think about it.

One of the researchers here has been working on an assembly of a reasonably small fungal genome, using short read technology with which I’ve been somewhat involved. It’s been an educational project for me, so I’m glad I had the opportunity to contribute. One of the key elements of the paper, however, was the use of Paired End Tags (PET) from an early Illumina machine to assess the quality of the assembly.

Unfortunately, an early run of the software I’d written to analyze the Maq alignments of the paired ends to the assemblies had a bug, which made it look like the data supported the theory quite nicely – but alas, it was just a bug. The bug was fixed a few weeks ago, and a re-run of the data turned up something interesting:

If you use Maq alignments, you’re biasing your alignments towards the Smith-Waterman-aligned sequences, which are not an independent measure of the quality of the assembly. Not shocking? I know – it’s pretty obvious.

Unfortunately, we hadn’t put it in context beforehand, so we had to find another way of doing this. We wanted to get a fairly exhaustive set of alignments for each short read – and we wanted to get it quickly – so we turned to Bowtie. While I wasn’t the one running the alignments, I have to admit I’m impressed with the quick performance of the aligner. Multi-matches work well, the file format is intuitive – similar to an Eland file – and the quality of the information seems to be good. (The lack of a good scoring mechanism is a problem, but wasn’t vital for this project.)

Anyhow, by performing a Single End Tag style alignment on PET data, while retaining multimatches, we were able to remove the biases of the aligners, and perform the pairings ourselves – learning more about the underlying assembly to which we were aligning. This isn’t something you’d want to do often, unless you’re assembling genomes fairly regularly, but it’s quite cool.
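The pairing step itself is simple to describe: for every combination of single-end placements of the two ends, keep the ones that land on the same contig, on opposite strands, and within the expected insert size. A minimal sketch of that logic, with illustrative class and field names (this is not the actual package code):

```java
import java.util.ArrayList;
import java.util.List;

public class PairUp {
    /** One single-end placement of a read (names are illustrative only). */
    public static class Hit {
        final String contig;
        final int pos;          // leftmost coordinate on the contig
        final boolean forward;  // strand
        public Hit(String contig, int pos, boolean forward) {
            this.contig = contig; this.pos = pos; this.forward = forward;
        }
    }

    /**
     * Keep every combination of placements for the two ends that lands on
     * the same contig, on opposite strands, within the insert-size bounds.
     */
    public static List<Hit[]> pair(List<Hit> end1, List<Hit> end2,
                                   int minInsert, int maxInsert) {
        List<Hit[]> pairs = new ArrayList<>();
        for (Hit a : end1) {
            for (Hit b : end2) {
                if (!a.contig.equals(b.contig)) continue;  // same contig only
                if (a.forward == b.forward) continue;      // opposite strands
                int insert = Math.abs(b.pos - a.pos);
                if (insert >= minInsert && insert <= maxInsert)
                    pairs.add(new Hit[] { a, b });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Hit> e1 = List.of(new Hit("contig1", 100, true),
                               new Hit("contig2", 500, true));
        List<Hit> e2 = List.of(new Hit("contig1", 300, false));
        // Only the contig1 forward/reverse combination is within bounds.
        System.out.println(pair(e1, e2, 100, 400).size()); // prints 1
    }
}
```

With multimatches retained, the pair count per read grows multiplicatively, which is exactly why the memory and CPU figures below got so large.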

Alas, for the paper, I don’t think this data was of good enough quality – there may not be a good story in the results to replace the one we thought we had when the data was first run… but there were other good results in the data, and the journey was worth the trip, even if the destination wasn’t all we’d hoped it would be.

As a sidenote, my code was run on an otherwise unused 16-way box, and managed to hit 1592% CPU usage at one point, while taking up 46 GB of RAM. (I should have processed each lane of reads separately, but didn’t.) That’s the first time I’ve seen CPU usage that high on my code; my previous record was 1499%. I should note that I only observe these scalings during the input processing – so the overall speed-up was only 4x.
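The gap between those two numbers is just Amdahl’s law at work: a roughly 16x speed-up on the parallel part with only a 4x speed-up overall implies the parallel part was about 80% of the runtime. A quick check of that arithmetic (the formula is the standard one; the 16 and 4 are the figures above):

```java
public class Amdahl {
    /**
     * Amdahl's law: overall = 1 / ((1 - p) + p / n), where p is the
     * parallel fraction of the runtime and n is the speed-up of the
     * parallel part. Solving for p given the observed overall speed-up:
     * p = (1 - 1/overall) / (1 - 1/n)
     */
    public static double parallelFraction(double n, double overall) {
        return (1.0 - 1.0 / overall) / (1.0 - 1.0 / n);
    }

    public static void main(String[] args) {
        // 16x on input processing, 4x overall -> ~80% of runtime was parallel.
        System.out.println(parallelFraction(16.0, 4.0));
    }
}
```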

The code is part of the Vancouver Short Read Analysis Package, called BowtieToBedFormat.java, if anyone needs a similar process for Bowtie reads. (There’s no manual for it at the moment, but I can put one together upon request.)
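For anyone curious what the conversion involves, the core of it is just a field shuffle. A sketch, assuming Bowtie’s default tab-separated output (read name, strand, reference, 0-based offset, sequence, qualities, …) and six-column BED output – this illustrates the mapping, and is not the actual BowtieToBedFormat.java:

```java
public class BowtieToBed {
    /**
     * Convert one line of Bowtie's default output (tab-separated:
     * read name, strand, reference, 0-based offset, sequence, qualities, ...)
     * into a six-column BED line (chrom, start, end, name, score, strand).
     */
    public static String toBed(String bowtieLine) {
        String[] f = bowtieLine.split("\t");
        String name = f[0], strand = f[1], chrom = f[2], seq = f[4];
        int start = Integer.parseInt(f[3]);
        int end = start + seq.length();  // BED end coordinate is exclusive
        return String.join("\t", chrom, String.valueOf(start),
                String.valueOf(end), name, "0", strand);
    }

    public static void main(String[] args) {
        String line = "read_1\t+\tcontig7\t1024\tACGTACGT\tIIIIIIII\t0";
        // contig7  1024  1032  read_1  0  +
        System.out.println(toBed(line));
    }
}
```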

>Chip-Seq revisited

>In the world of ChIP-Seq, things don’t seem to slow down. A collaborator of mine pointed out a new application called MACS, which is yet another peak finder, written in Python as an open source project. That makes two open source peak finders that I’m aware of: USeq and now MACS.

The interesting thing, to me, is the maturity of the code (in terms of features implemented). In neither case is it all that great: both are missing features I consider relatively basic, and both use relatively naive algorithms for peak detection. Though, I suppose I’ve been working with FindPeaks long enough that nearly everything else will seem relatively basic in comparison.

However, I’ll come back to a few more FP related things in a moment. I wanted to jump to another ChIP-Seq related item that I’d noticed this week. The Wold lab merged their Peak Finder software into a larger development package for Genomic and Transcriptome work, which I think they’re calling ERANGE. I’ve long argued that the Peak Finding tools are really just a subset of the whole Illumina tool-set required, and it’s nice to see other people doing this.

This is the development model I’ve been using, though I don’t know if the Wold lab does exactly the same thing. The high-level organization uses a core library set and a core object set, and then FindPeaks and other projects just sit on top, using those shared layers. It’s a reasonably efficient model. And, in a blog post a while ago, I mentioned that I’d made a huge number of changes to my code after coming across the tool called “Enerjy”. I sat down to figure out how many lines were changed in the last two weeks: 26,000+ lines of code, comments and javadoc. That’s a startling figure, since my entire code base (grep -r " " * | wc -l) is only 22,884 lines, of which 15,022 contain semicolons.

Anyhow, I have several plans for the next couple of days:

  1. Try to get my SVN repository somewhere other people can work on it as well, not just GSC developers.
  2. Improve the threading I’ve got going.
  3. Clean up the documentation, where possible.
  4. Work on the Adaptive mode code.

Hopefully, that’ll clean things up a bit.

Back to FindPeaks itself, the latest news is that my application note in Bioinformatics has been accepted. Actually, it was accepted about a week ago, but I’m still waiting to see it in the Advance Access section – hopefully it won’t be much longer. I also have a textbook chapter on ChIP-Seq coming out relatively soon (I’m absolutely honoured to have been given that opportunity!), assuming I can get my changes done by Monday.

I don’t think that’ll be a problem.

>Downloads and aligners

>At the Genome Science Centre, we have a platform Scientific Advisory Board meeting coming up in the near future, so a lot of people are taking the time to put together some information on what’s been going on since the last meeting. As part of that, I was asked to provide some information on the FindPeaks application, since it’s been relatively successful. One of those things was how many times it’s been downloaded.

Surprisingly, it’s not an insignificant number: FindPeaks 2.x has been downloaded 280 times, while FindPeaks 3.1.x has been downloaded 325 times. I was amazed. Admittedly, I filtered out Google and anything calling itself a “spider”, and didn’t filter on unique IPs, but it’s still more than I expected.

The other surprise to me was the institutions from which it was downloaded, which are scattered across the world. Very cool to see it’s not just Vancouver people who took an interest in FindPeaks, though I somewhat suspected as much, already.

Thinking of FindPeaks, I put in one last marathon cleaning session for all my software. I’m sure I’m somewhere north of 15,000 lines of code modified in the past week. Even when you consider that some of those changes were massive search-and-replace jobs (very few, really), or refactoring and renaming variables (many, many more), it’s still an impressive number. With that done, I’m looking to see some significant gains in development speed and in developer productivity. (I learned a LOT doing this.)

The last 122 warnings will just have to get cleaned up when I get around to it – although they’re really only 4 warning types, repeated many times. The majority of them are just that the branch logic goes deeper than 5 levels, or that my objects have public variables. (You can only write so many accessor functions in a week.)
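The public-variable warning is at least mechanical to fix: make the field private and put accessors in front of it. A before-and-after sketch, with illustrative names (not actual FindPeaks classes):

```java
public class AccessorExample {
    /** Before: the pattern the tool flags - a bare public field. */
    static class PeakBefore {
        public int height;
    }

    /** After: a private field behind a getter/setter pair. */
    static class Peak {
        private int height;
        public int getHeight() { return height; }
        public void setHeight(int height) { this.height = height; }
    }

    public static void main(String[] args) {
        Peak p = new Peak();
        p.setHeight(42);
        System.out.println(p.getHeight()); // prints 42
    }
}
```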

Anyhow, I’m looking forward to testing out the FindPeaks 4.0 alphas starting tomorrow, and to putting some documents together on it. (And to catching up on the flood of emails I received in the last 2 days.)

Finally, I’m working on a MAQ file interpreter for another project I’m working on. If I ever manage to figure out how to interpret the (nearly undocumented) files, I’ll post that here. If anyone’s done this already (though I didn’t see anything publicly available on the web), I’d love to hear from them.
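I won’t guess at the record layout here, but one piece of plumbing that any Java reader for such files needs is worth noting: Maq is written in C, so its binary output on x86 hardware will presumably be little-endian, while Java’s DataInputStream assumes big-endian. A sketch of just that piece (the little-endian byte order is an assumption about the files, not something documented):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndian {
    /**
     * Read a 4-byte little-endian integer from a stream. Java's
     * DataInputStream reads big-endian, so bytes from binary files
     * written by C programs on x86 must be reordered; ByteBuffer
     * handles this cleanly.
     */
    public static int readInt(InputStream in) throws IOException {
        byte[] buf = new byte[4];
        if (in.read(buf) != 4) throw new IOException("truncated record");
        return ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) throws IOException {
        // 0x01 0x00 0x00 0x00 in little-endian order is the integer 1.
        InputStream in = new ByteArrayInputStream(new byte[] {1, 0, 0, 0});
        System.out.println(readInt(in)); // prints 1
    }
}
```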


>Random Update on FP/Coding/etc.

>I had promised to update my blog more often, but then failed miserably to follow through last week. I guess I have to chalk it up to unforeseen circumstances. On the bright side, it gave me the opportunity to come up with several things to discuss here.

1. Enerjy: I learned about this tool on Slashdot last week, during my usual lunch-hour “open source news” perusal. I can unequivocally say that installing the Enerjy tool in Eclipse has improved my coding by unimaginable leaps and bounds. I tested it out on the Java codebase that holds my FindPeaks application and the Transcriptome/Genome analysis tools, and was appalled by the number of suggestions it gave. Admittedly, I’m self-taught in Java, but I thought I had grasped the “Zen” of Java by now – the 2000+ warnings it gave disagreed. I’ve since been cleaning up the code like there’s no tomorrow, and have brought it down to 533 warnings. The best part is that it pointed out several places where bugs were likely to have occurred, all of which have now been cleaned up.

2. Threading has also come up this past week. Although I didn’t “need” it, there was no way to get around it – learning threads was the appropriate solution to one problem that came up, so my development version now includes some thread management, which is likely to spread into the core algorithms. Who knew?
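The thread management follows the stock Java pattern: a fixed pool of workers, each handed an independent chunk of input, with the results collected afterwards. A stripped-down sketch (the summing is a stand-in for real work, not FindPeaks internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadedInput {
    /**
     * Hand each independent chunk of input to a fixed pool of worker
     * threads, then collect the per-chunk results on the main thread.
     */
    public static long processChunks(List<int[]> chunks, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (final int[] chunk : chunks) {
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long sum = 0;            // stand-in "work"
                        for (int v : chunk) sum += v;
                        return sum;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) total += f.get(); // waits for workers
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<int[]> chunks = List.of(new int[] {1, 2, 3}, new int[] {4, 5});
        System.out.println(processChunks(chunks, 2)); // prints 15
    }
}
```

The appeal of this pattern is that the chunks share no state, so there is nothing to lock; the only synchronization is the final collection of results.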

3. Random politics: If you’re a grad student in a mixed academic/commercial environment, I have a word of warning for you: Not everyone there is looking out for your best interests. In fact, some people are probably looking out for their own interests, and they’re definitely not the same as yours.

4. I read Michael Smith’s biography this week. I was given a free copy by the Michael Smith Foundation for Health Research, who were kind enough to provide my funding for the next few years. It’s fantastic to understand a lot of the history behind the British Columbia Biotechnology scene. I wish I’d read that before having worked at Zymeworks. That would have provided me with a lot more insight into the organizations and people I met along the way. Hindsight is 20/20.

5. FindPeaks 4.0: Yes, I’m skipping plans for a FindPeaks 3.3. I’ve changed well over 12,000 lines of code, according to the automated scripts that report such things, including a major refactoring and the start I made at threading. If that doesn’t warrant a major version-number change, I don’t know what does.

Well, on that note, back to coding… I’m going to be competing with people here, in the near future, so I had best be productive!