>Community

>This week has been a tremendous confluence of concepts and ideas around community. Not that I’d expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate’s successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (undrinkable dirty martinis), several of us had a half-hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can’t imagine having with too many people. (By the way, I think Obi’s suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community-related event was an invitation to move my blog over to a larger community of bloggers. While I’ve temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a “distinct” source of information, easily distinguishable from other people’s blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks – I don’t want to sink into a “borg” experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation – and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of “successes and failures in developing an open source community” as applied to the Vancouver Short Read Analysis Package. Yes, I’m happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us – not to mention how we integrate into them, both in our work and outside. If you can’t maximize your ability to motivate them (or their ability to motivate you), then you’re at a serious disadvantage. How we balance all of that is an open question, and one I’m still working hard at answering.

I’ve attached my presentation from this morning, just in case anyone is interested. (I’ve decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it’s about 7 MB).

>Science Cartoons – 3

>I wasn’t going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright laws, you can certainly re-use my images for teaching purposes or your own private use (that’s generally called “fair use” in the US, and copyright laws in most countries have similar exceptions), but you can’t publish them, take credit for them, or profit from them without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don’t steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

>Universal format converter for aligned reads

>Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was really sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that do useful functions. So, that raises the distinctly interesting point that all of them should be applied together in one shot… and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie…. and several other more obscure formats.

For the moment, the “conversion utility” forces the output to bed file format (since that’s the file type with the least information, and I don’t have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?
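To make the "lowest common denominator" idea concrete, here is a minimal Python sketch of flattening one aligned read into a six-column BED line. It is an illustration only, not the code in the conversion utility: the `AlignedRead` record, its field names, and the assumption that the incoming start coordinate is 1-based are all hypothetical. The only part taken from the BED format itself is the column order and the 0-based, end-exclusive coordinates.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    """Hypothetical minimal record; real aligner formats carry more fields."""
    name: str
    chrom: str
    start: int   # assumed 1-based leftmost aligned position
    length: int  # number of aligned bases
    strand: str  # "+" or "-"


def to_bed_line(read: AlignedRead, score: int = 0) -> str:
    """Return a tab-separated BED6 line for one read.

    BED uses 0-based starts and end-exclusive ends, so a 1-based start
    is shifted down by one and the end is start + length.
    """
    chrom_start = read.start - 1
    chrom_end = chrom_start + read.length
    fields = [read.chrom, str(chrom_start), str(chrom_end),
              read.name, str(score), read.strand]
    return "\t".join(fields)


if __name__ == "__main__":
    print(to_bed_line(AlignedRead("read_1", "chr1", 1001, 36, "+")))
```

Everything else the source format knows about the read (mismatches, quality strings, the sequence itself) is simply dropped, which is why forcing BED output never raises questions about unexpected information loss.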

Why not add a very simple functionality that lets one format be converted to the other? Actually, there’s no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. Since these aren’t being scaled or altered, you could end up with some rather bizarre conversions unless they’re handled cleanly. Unfortunately, doing this scaling is such a moving target that it’s just not possible to keep up with that and do all the other development work I have on my plate. (I think I’ll be asking for a co-op student for the summer to help out.)
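For a sense of what "scaling" the quality strings would involve, here is a rough Python sketch of two commonly needed steps: re-basing a Phred+64 string to Phred+33, and mapping a Solexa-scaled score onto the Phred scale. Neither is something the conversion utility does (as noted above, the strings pass through untouched), and the function names are made up for this example. Which encoding a given file actually uses is something you would have to know up front.

```python
import math


def phred64_to_phred33(quality: str) -> str:
    """Re-base a Phred+64 quality string to Phred+33 (same scores, new offset)."""
    converted = []
    for char in quality:
        score = ord(char) - 64
        if score < 0:
            # A character below '@' cannot be a Phred+64 score.
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))
    return "".join(converted)


def solexa_to_phred(q_solexa: int) -> int:
    """Map a Solexa-scaled score to the nearest integer Phred score.

    Uses Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


if __name__ == "__main__":
    print(phred64_to_phred33("hhh@"))
    print([solexa_to_phred(q) for q in (-5, 0, 10, 40)])
```

The two scales agree closely at high quality and diverge at the low end, which is exactly where a conversion that ignores the difference starts producing odd-looking results.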

Anyhow, I’ll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=

>The Future of FindPeaks

>At the end of my committee meeting, last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I’m working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I’ve been spending some time thinking about how I want to proceed forward – and I think the answer is to work smarter on FindPeaks. (No, I’m not dropping FindPeaks development. It’s just too much fun.)

For me, the amusing part of it is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first, I duplicated it by re-writing its code for the second version, then came the first round of major upgrades in version 3.1, and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I’ve realized that there are several major engineering things that could be done that would make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners that are being used have changed. When I first got involved we were using Eland 0.3 (?), which was simple compared to the tools we now have available. It just aligned each fragment individually and spit out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues have mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.

At this point, I think we’re coming to the next generation of biology problems – integrating FindPeaks into the wider toolset – and generating real knowledge about what’s going on in the genome, and I think it’s time for FindPeaks to evolve to fill that role, growing out to better use the information available in the sorted aligner results.

Ever since the end of my exam, I haven’t been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit – so, even if I end up focussing on the cancer biology that I’ve got in front of me, I’m still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn’t currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects…. I think I see a few more late nights in the near future. (-;

>Catching up….

>I can’t believe it’s been nearly a month since my last post! I feel like I’ve been neglecting this a bit more than I should, but I’ll try to rectify that as best I can.

For an indication of how busy I’ve been, I sat down to update my resume yesterday, and ended up adding 3 papers (all in submission) and two posters. That just about doubles what was in there previously in the papers section.

Anyhow, Next-generation sequencing doesn’t stand still, so I thought I’d outline some of the things I want to talk about in my next posts, and set up a few pointers to other resources:

1. SeqAnswers. This aptly named forum has been around for a few months now, but has recently become more popular, and a great forum for discussing relevant Next-gen technology and sequencing methods. I’m especially happy to see the automated posts triggered by new literature on the subject, which are a great resource for those of us who are busy and forget to check for new publications ourselves.

2. There’s one forum in particular that’s of great interest: Benchmarking different aligners. This appears to be a well done comparison (if lightweight) that may be a good focus for other people who are interested in comparing aligners, and discussing it in a wider forum.

3. For people interested in ChIP-Seq, or Chromatin immunoprecipitation and massively parallel sequencing, I’ve finally gotten around to posting FindPeaks 3.1 on the web. I’d consider this release (3.1.3) an alpha release. I’d love to get more people contributing by using this application and telling me what could be improved on it, or what enhancements they’d like to see. I’m always happy to discuss new features, and can probably add most of them in with a relatively quick turnaround time.

4. For people interested in assessing the quality of the whole transcriptome shotgun sequencing (WTSS), I’m about to break out a tool that should fit that purpose. If anyone is interested in giving suggestions on ways they’d like to see quality tests performed, I’d be more than happy to code those into my applications. (And yes, if you contribute to the tool, I will provide you a copy of the tool to use. Collaborations, etc, can be discussed, as well.)

5. A quick question, of which I’ll post more in the future. Has anyone here managed to get Exonerate 2.0 to work in client/server mode on two separate machines?

6. I’ll also post a little more about this in the future: building environments, ant and java. Why are people still doing large projects in perl?

7. One last thing I wanted to mention. I was going to write more on this topic, but eh… I’ll let slashdot do it for me: The more you drink, the less you publish. Well, so much for keeping a bottle of tequila under the desk. Now I know what to get the competition for x-mas, though…

Cheers!

>No more support

>For once, I think I’ll toss out a quick rant that I haven’t really thought through, so don’t mind if it’s a little rough.

I’ve spent some time thinking about the various projects I’ve been working on, and that I’d like to be working on in my grad school future. Surprisingly, I’m really happy with them, and I’m eager to delve into all of them, with one major exception: I think I’m sitting on a potential train wreck.

From a project planning perspective, the one major issue that I foresee is that I’m inheriting a legacy of a few tens of kloc (thousand lines of code) done in Java. I’m not really proficient in Java, but it’s not hard, compared to some of the other languages I’ve used – that’s not the issue. The big problem is that almost all of it takes advantage of something called the Ensembl API, which is a quick way for programmers to access all sorts of fantastic functions and data related to various genomic information. It’s a fantastic resource, but Ensembl (who made the API) has decided to stop supporting the Java version in favour of the Perl version.

Even now, I’m stuck using the annotations from version 41 of the Ensembl Human Genome, whereas v.43 is the most current. How much difference will this make? Probably not much, at the moment. However, in the long term, I think that could become a major issue.

Now, I have worked in Perl before, years ago, so that’s not a problem. But what do I do about the 10kloc? Recreating it will take the better part of a year, at least. For now, the solution is to postpone the decision, but I think that’ll only work for another month or two. Eventually something is going to give, and I’m just going to have to suck it up and redo all of the code we’ve got in house. Yuck.