>Regenerate your happiness.

>Ever since I went to a conference in Korea, where I was treated to presentations of some of the local research, I have a tendency to view Asian research with a slightly more critical eye. That’s not to say Asian research isn’t good, but that it appears to cover a wide range of qualities – from absolutely top quality research right down to the “I can’t believe they tried that” type.

Unfortunately, I think the propensity for “cuteness” that seems to be a part of some Asian cultures also doesn’t translate into English well. Rather than tickling my sense of cuteness, it just leaves me wondering if they take their work seriously.

What brought me to this topic today was an email I received announcing a stem-cell and regenerative medicine conference in China, which I assume is a serious affair. However, whatever seriousness it has is in stark contrast to the theme it was given: “Re-creation of Life, Reconstruction of Happiness.” I mean, how do you really take a conference seriously with a title like that? It sounds, to me, like a gathering of people who are interested in cloning your pet cat and charging you $15,000 for a kitten. (Guarantees of happiness not included.)

If you’re interested in reconstructing your happiness, the conference is the BIT Life Sciences’ 1st Annual World Congress of Regenerative Medicine & Stem Cell (RMSC-2008). The link provided doesn’t seem to be working for me.

(Update: I took a look at the Google cache version of the BIT Life Sciences web page, and I actually don’t think they’re a real organization. For instance, I think the Recommendations on their web page are fake – I doubt the people at Novartis, UC Irvine and Pfizer all write letters with typical Chinese grammatical mistakes.)


>A strange title, no?

I just discovered Google’s Knol project. Imagine an author-accountable version of Wikipedia. That’s quite the concept. It’s like a free encyclopedia, written by experts, where the experts make money by putting Google ads on their pages (optional), and the encyclopedia itself is free. I can’t help liking this concept.

This, to me, is about the influence of Open Source on business models other than software.

People used to claim, back in the 90’s, that the internet would eventually become nothing but ads, because no one in their right mind would contribute content for free, and content generation would become the exclusive domain of major companies. That was the old thinking, which led to the “subscription models” favoured by online dictionaries and expert-advice services, both of which I find lacking in so many respects.

Subsequently, people began to shift in the other direction, where it was assumed that services could harness the vast power of the millions of online people. If each one contributed something to wikipedia, we’d have a mighty resource. Of course, they forgot the chaotic nature of society. There are always a bunch of idiots to ruin every party.

So where does this leave us? With Knol! This model is much closer to the way software is created in the Open Source world. The Linux kernel is edited by thousands of people, creating an excellent software platform – and not by letting just anyone edit the software. Many people create suggestions for new patches, and the best (or most useful, or most easily maintained…) are accepted. Everyone is accountable along the way, and the source of every patch is recorded. If you want to add something to the Linux kernel, you’d better know your stuff, and be able to demonstrate you do. I think the same thing goes for Knol. If you want to create a page, fine, but you’ll be accountable for it, and your identity will be used to judge the validity of the page. If an anonymous person wants to edit it, great, that’s not a problem, but the page maintainer will have to agree to those changes. This is a decentralized, expert-based system, fueled by volunteers and self-sponsored (via the Google ads) content providers. It’s a fantastic starting point for a new type of business model.

Anyhow, I have concerns about this model, as I would about any new product. What if someone hijacks a page or “squats” on it? I could register the page for “coca cola”, write an article on it, and become the de-facto expert on something that has commercial value. Ouch.

That said, I started my first knol article on ChIP-Seq. If anyone is interested in joining in, let me know. There’s always room for collaboration on this project.


>How many biologists does it take to fix a radio?

>I love Google Analytics. You can get all sorts of information about traffic to your web page, including the Google searches people use to get there. Admittedly, I really enjoy seeing when people link to my web page, but the Google searches are a close second.

This morning, though, I looked through the search terms, and discovered that someone had found my page by googling for “How many biologists does it take to fix a radio?” And that had me hooked. I’ve been toying with the idea all morning, and figured I had to try to blog an answer to that. (I’ve already touched on the subject once, with less humour, but it’s worth revisiting.)

Now, bear in mind that I’m actually a biochemist and possibly a bioinformatician – and by some stretch of imagination, a microbiologist – so I enjoy poking fun at biologists, but it’s all in good humour. Biology is infinitely more complicated than radios, but it makes for a fun analogy.


This is how I see it going.

  • A Nobel Prize winner makes a keynote speech, expounding on the subject that biologists have completely ignored the topic of radios. Radios deserve to be studied and are a long-neglected topic that is key to understanding the universe. The Nobel Prize winner further suggests the broken radio he or she has been tinkering with in the garage for several months as the model organism.
  • After the speech, several prominent biologists go to the bar, drink a lot, and then decide that the general consensus is that they should look at fixing broken radios.
  • Several opinion papers and notes appear on the subject, and a couple grad student written reviews pop up in the literature.
  • A legion of taxonomists appear, naming broken radios according to some principle that makes perfect sense to them. (e.g. Monoantenna smithii, Nullamperage robertsoniaii). High school students are forced to learn the correct spellings of several prominent strains.
  • A Nature paper appears, describing the glossy casing of the Radio, the interaction of the broken radio with an electrical socket and the failed attempt to sequence the genome. Researchers around the world have been scooped by this first publication, and all subsequent attempts to publish descriptions of broken radios are not sufficiently novel to warrant publication in a big name journal.
  • Biologists begin to specialize in radio parts. Journal articles appear on components such as “purple red purple gold” (PrpG), which is shown to differ dramatically from a similar-appearing component, “blue green purple gold” (BgpG), and both are promptly given new names by ex-Drosophila researchers: “Brothers for the Preservation of Tequila Based Drinks 12” and “Trombone.”
  • Someone tries to patent a capacitor, just in case it’s ever useful. This spawns three biotech companies, two of which spend $120 million in less than 3 years and fold.
  • Someone does a knock out on a working radio and promptly discovers and names the component “Signal Silencing Subcomponent 1” or “Sss1”. 25 more are discovered in a high-throughput screen.
  • X-ray studies are done on Sss22, resulting in a widely acclaimed paper which will later result in a Nobel prize nomination. No one has the faintest idea how Sss22 works or what it does.
  • Science fiction writers publish several fantastic novels speculating that one day we might be able to fix radios by replacing individual parts.
  • The religious right declares biologists are playing god, and that fixing radios is beyond the capacity of humans. The moral dilemmas are too complex. Ethicists get involved. The US president tries to cut funding for biologists doing research on broken radios.
  • A researcher invents a method of doing in-situ component complementation, which allows a single element to be bypassed and replaced with a new one. All new components have green flags attached to them to make studying them easier.
  • Someone else invents a method of replacing a frayed power cord, producing a working phenotype from a broken radio. The resulting media storm declares the discovery of the cure for broken radios.
  • The technique for fixing power cords begins the long process of getting FDA approval. 10 years later (and with a $1bn investment showing that the technique also works on lamps and doesn’t cause side effects in electric toothbrushes) the fix is allowed to go to market.
  • Marketing is conducted, telling people (with working and broken radios alike) that maybe they should try the cure, just in case they might have a frayed power cord some day too. They should talk to their doctor about whether it’s right for them.
  • Advertisements appear on tv showing silent smiling people holding on to power cords.
  • Long term studies after the fact show that the new part wasn’t as good as it could have been. Sucking on it may cause liver damage.
  • The religious right takes the recall as a sign that science has failed again. Holistic fixes for frayed power cords appear, as well as organic electricity and antenna adjustment therapies, which work for some people. Products appear on the shopping channel.
  • Technology moves on, the radio becomes obsolete. Several biotech companies acquire each other in blockbuster mergers and begin working on new target components for computer sound cards.

Have a good weekend, everyone. (=

>ChIP-Seq Control Data

>I promised a quick discussion on ChIP-Seq and controls, as well as a comparison of FindPeaks with USeq, though I may have to defer on the second topic. I haven’t played much with USeq, though I did look through its code several months ago. I think I’d best keep my comments to a minimum and say only that ChIP-Seq software evolves quickly, and that it’s a moving target. Comparing FindPeaks would be doing a disservice to my code, which was tagged several months back, while USeq is being developed out in the open, and you can get a copy of what the developer was working on last night. Truly an unfair comparison. I think I’ll revisit this topic when I release FP 4.0.

The other topic is the use of controls. It’s an area that everyone at the GSC acknowledges we could do better in. It’s just common sense that you’d want controls – this is science, after all.

However, there are different types of controls. Do you want a control that simulates a null-antibody? Do you want a genomic control that leaves out the ChIP entirely? Possibly there are other kinds I haven’t even thought of, and they are all “valid” controls for different things in a ChIP-Seq experiment.

Which brings up the really key point: there is more going on in ChIP-Seq than just peaks, and you can’t completely ignore all of it just to filter out peaks that are found in your control. That’s a relatively naive method that other groups used several months ago. (I don’t know if they’re still using it.) I believe “Singapore” was using controls simply by checking whether each peak’s maximum had a greater than 4:1 ratio over the same base location in their control, and throwing away anything that didn’t pass that criterion. I believe other labs are doing similar things.

Anyhow, I think it’s safe to say that such a simple method is a great first pass (and yes, ALL of us are on our first pass, at the moment, so it’s definitely not an insult), but I don’t think this will stand the test of time.
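For what it’s worth, that naive first-pass approach is trivial to code up. Here’s a rough Java sketch of a 4:1-style control filter – the Peak class, array layout and thresholds are all hypothetical, just to illustrate the idea, and this is not FindPeaks (or anyone else’s) actual code:

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of the naive filtering described above: keep a peak only if its
// maximum height is at least `ratio` times the control coverage at the same
// base. All names and structures here are made up for the example.
public class NaiveControlFilter {

    public static class Peak {
        public final int maxPosition; // base holding the peak's maximum
        public final int maxHeight;

        public Peak(int maxPosition, int maxHeight) {
            this.maxPosition = maxPosition;
            this.maxHeight = maxHeight;
        }
    }

    /** controlDepth holds the control sample's coverage at each base. */
    public static List<Peak> filter(List<Peak> peaks, int[] controlDepth,
                                    double ratio) {
        List<Peak> kept = new ArrayList<Peak>();
        for (Peak p : peaks) {
            int control = p.maxPosition < controlDepth.length
                    ? controlDepth[p.maxPosition] : 0;
            // Zero control coverage passes trivially; otherwise require the ratio.
            if (control == 0 || p.maxHeight >= ratio * control) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

The simplicity is exactly the problem: a single-base ratio test throws away everything else the control could tell you about background structure.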

For now, when I get a few minutes, I’m going to start working on integrating controls into FindPeaks 4.0. As of this morning, I ran the first few successful PET runs with the software, so work on controls won’t be that far off.

>100 blog posts – FindPeaks 4

>Wow.. quite the milestone! 100 blog posts. I never thought I could ramble on this long… well, on second thought, I guess it’s not THAT much of a surprise if you know me. (=

Anyhow, I suppose I have a second milestone to talk about as well. I just performed the first successful FindPeaks 4.0 (pre-alpha??) run using a Paired End MAQ .map file. Why is that important? First, I’m using a binary .map file, which means we don’t need to store .mapview files – though this is a bit of trickery: I now store .bed files instead, which have the added advantage of being viewable through the UCSC browser, while .mapview files aren’t. Second, it’s a Paired End ChIP-Seq run. We’ve been waiting a long time to have paired end ChIP-Seq, so that we’ll know where the second end of the tag is. We can now start pinpointing the location of the protein-DNA interaction with much greater confidence.

The important thing is that I can now start doing away with the crazy distributions that single end tags forced upon us. No more X-sets, adaptive distributions or triangle distributions! Yay! (It might still be publishable, but I think the window for that is closing…. another story for another day.)

>SNP calling from MAQ

>With that title, you’re probably expecting a discussion on how MAQ calls snps, but you’re not going to get it. Instead, I’m going to rant a bit, but bear with me.

Rather than just use the MAQ SNP caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my SNP caller has several added functionalities that I wanted to use, and *of course*, I thought it would be easy. Was it, you might also ask? No – but not for the reasons you might expect.

I spent the last 4 days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don’t get me started on platform dependent binary files!), I have a working SNP caller, I even have all the code to link them together. What I was missing was all of the little tricks, particularly the ones for intron-spanning reads in transcriptome data sets, and the code that links together the “kludges” with the method I didn’t know about when I started. After hacking away at it, bit by bit, things began to work. Somewhere north of 150 code commits later, it all came together.

If you’re wondering why it took so long, the reasons are threefold:

1. I started off depending on someone else’s method, since they came up with it. As is often the case, that person was working quickly to get results, and I don’t think they had the goal of writing production quality code. Since I didn’t have their code (though, honestly, I didn’t ask for it either, since it was in Perl, which is another rant for another day) it took a long time to settle all of the off-by-one, off-by-two and otherwise unexpected bugs. They had given me all of the clues, but there’s a world of difference between being pointed in the general direction of your goal and having a GPS to navigate you there.

2. I was trying to write code that would be re-usable. That’s something I’m very proud of, as most of my code is modular and easy to re-purpose in my next project. Halfway through this, I gave up: the code for this SNP calling is not going to be re-usable. Though, truth be told, I think I’ll have to redo the whole experiment from the start at some point, because I’m not fully satisfied with the method, and we won’t be doing it exactly this way in the future. I just hope the change doesn’t happen in the next 3 weeks.

3. Name space errors. For some reason, every single person has a different way of addressing the 24-ish chromosomes in the human genome. (Should we include the mitochondrial genome in the count?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the mitochondrial genome an “MT” or just “M”? What case do we use for “X” and “Y” (or is it “x” and “y”?) in our files? Should we prepend “chr” to our chromosome names? And what on earth is “chr5_random” doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read… bleh.
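Since I keep rewriting these strip-and-rename functions, here’s roughly what they boil down to. The target convention chosen here (“chr1”..“chr22”, “chrX”, “chrY”, “chrM”) is an arbitrary pick for the example – the real point is to choose one convention and push every input through it before comparing anything:

```java
// A sketch of the kind of chromosome-renaming function described above.
// The output convention is an arbitrary choice for illustration.
public class ChromosomeNames {

    public static String normalize(String raw) {
        String name = raw.trim();
        // Strip a leading "chr" (any case) so "Chr5", "chr5" and "5" agree.
        if (name.regionMatches(true, 0, "chr", 0, 3)) {
            name = name.substring(3);
        }
        // Unify the sex chromosomes and the mitochondrial genome.
        if (name.equalsIgnoreCase("X")) {
            name = "X";
        } else if (name.equalsIgnoreCase("Y")) {
            name = "Y";
        } else if (name.equalsIgnoreCase("M") || name.equalsIgnoreCase("MT")) {
            name = "M";
        }
        return "chr" + name;
    }
}
```

It’s not clever code, but running both indexes and every read name through one function like this beats sprinkling ad-hoc string fixes everywhere.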

Anyhow, I fully admit that SNP calling isn’t hard to do. Once you’ve read all of your sequences in, determined which bases are worth keeping (prb scores), and set the minimum level of coverage and the minimum number of bases needed to call a SNP, there’s not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then: tada, you have all your SNPs.
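To show just how little is left to do once the reads are in, here’s a toy version of those calling criteria. The thresholds are illustrative, there’s no quality weighting, and this is not my actual caller – just the bare skeleton of the per-position decision:

```java
// A toy version of the calling criteria described above: at each position,
// require a minimum total coverage, and a minimum number of reads supporting
// a non-reference base, before calling a SNP. Thresholds are illustrative.
public class SimpleSnpCaller {

    public static boolean callSnp(char reference, char[] observedBases,
                                  int minCoverage, int minVariantReads) {
        if (observedBases.length < minCoverage) {
            return false; // not enough coverage to trust anything here
        }
        int variant = 0;
        for (char base : observedBases) {
            if (Character.toUpperCase(base) != Character.toUpperCase(reference)) {
                variant++;
            }
        }
        return variant >= minVariantReads;
    }
}
```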

However, once you’re done all of this, you realize that the big issue is that there are now too many SNP callers, and everyone and their pet dog is working on one. There are several now in use at the GSC: mine, at least one custom one that I’m aware of, one built into an aligner (Bad Idea(tm)) under development here, and the one tacked on to the Swiss Army knife of aligners and tools: MAQ. Do they all give different results, or is one better than another? Who knows. I look forward to finding someone who has the time to compare, but I really doubt there’s much difference beyond the alignment quality.

Unfortunately, because the aligner field is still immature, there is no single file output format that’s common to all aligners, so the comparison is a pain to do – which means it’s probably a long way off. That, in itself, might be a good topic for an article, one day.

>Yet another Chip-Seq tool

>One more ChIP-Seq tool for the road? How about NPS? It claims to do the same stuff as FindPeaks 1.0, but now with a Laplacian of Gaussian edge detection algorithm bolted on top. (Why? I’m not sure.)

Finding histones isn’t exactly rocket science, people. They’re pretty easy, and (unlike transcription factors) they saturate very quickly. The trick is to use good alignments, and to get the right fragment length for your reads. If you do that, noise will be limited, and locations are pretty obvious. Paired End Tags completely solve the second problem, and using a good aligner solves the first. If you want a good ChIP-Seq tool, just write one that uses PETs. (Mine’s almost done, by the way – just debugging the first test set today.)
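For the curious: the usual single-end workaround for fragment length is to shift each tag toward the middle of its fragment by half the estimated fragment length, so tags from both strands pile up on the binding site. A trivial sketch (estimating the fragment length is the hard part, and isn’t shown here):

```java
// One standard single-end trick: approximate the protein-DNA contact point
// by shifting each read's 5' coordinate toward the fragment centre by half
// the estimated fragment length. The 200 bp figure below is illustrative.
public class ReadShifter {

    /**
     * @param fivePrimeStart 5' coordinate of the read
     * @param forwardStrand  true for the + strand, false for the - strand
     * @param fragmentLength estimated fragment length (e.g. ~200 bp)
     * @return estimated fragment midpoint
     */
    public static int shiftedPosition(int fivePrimeStart, boolean forwardStrand,
                                      int fragmentLength) {
        int shift = fragmentLength / 2;
        return forwardStrand ? fivePrimeStart + shift : fivePrimeStart - shift;
    }
}
```

With paired ends, of course, you get the true fragment boundaries and none of this guessing is necessary.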

Anyhow, it’s fun to watch people re-invent the wheel… again and again. Hey, anyone want to write another ChIP-Seq tool? I think we’re up to 2 in Java and 3 in Python, with a few in Perl? For something really different, how about Brainfuck?

>Eureka! A working .map reader

>Ok.. overly technical post, but it may be useful to anyone who’s writing java code, and trying to read binary files.

  1. Don’t wrap your FileInputStream in a BufferedReader: Readers decode bytes as characters, which modifies the values of the binary data you get!
  2. x86 processors are little endian: when you read in four bytes, you have to add them together in the reverse order to create an int.
  3. Java doesn’t use unsigned variables, so you’ll have to AND all your bytes with 0xff and cast to int to get the correct value.
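Put together, those three lessons boil down to a few lines. A minimal sketch:

```java
// The lessons above in code: read raw bytes (no Reader wrappers), mask each
// byte with 0xff to undo Java's sign extension, and assemble the int in
// little-endian order (least significant byte first).
public class LittleEndian {

    public static int readInt(byte[] buffer, int offset) {
        return (buffer[offset] & 0xff)
             | (buffer[offset + 1] & 0xff) << 8
             | (buffer[offset + 2] & 0xff) << 16
             | (buffer[offset + 3] & 0xff) << 24;
    }
}
```

(For what it’s worth, java.nio.ByteBuffer with ByteOrder.LITTLE_ENDIAN will do the same job for you, but it’s worth understanding what’s going on underneath.)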

I’ll just chalk this up to lessons learned.

>RFC 1925

>I just came across a wonderful RFC, dating back to 1996, that is well worth the read. In light of the problems I’m facing getting my MAQ .map file reader to work, I figured this is a great comment on my frustration with undocumented file formats.

Although written as The Twelve Networking Truths, I think it applies to writing code in general, and likely far more broadly as well.

>MAQ map format

>I’m currently working on a couple of coding projects, though the one with urgent priority is to analyze my samples of interest using the MAQ aligner, which shouldn’t be too bad. Unfortunately, there are no real standards when it comes to output from alignment programs, which makes it impossible to do a simple replacement of the previous aligner with the current one. Thus, Eland and MAQ are not interchangeable modular components. Switching between them takes a large investment in time and effort. Blah.

MAQ, however, is also more tricky because of its poor documentation. I can understand why that is – if they stopped to document, they might not have time to write the code, which is fair enough, since all of us in this field are busy. Compounding the problem, however, is MAQ’s use of really cryptic file formats. Unfortunately, I don’t know enough to really discuss that issue for the majority of the MAQ formats, but I did want to make two points in this morning’s post.

The first is simply that MAQ’s map format is a little sketchy – it’s a binary format that depends heavily on C calls to “sizeof()”, which means that compiling the code with different compilers may give very different results. That’s always a “BAD THING”(tm), in my (possibly not so) humble opinion. It’s also brutal for interoperability. I’m now working on coding up a Java interface that reads the map file, allowing users to skip the mapview file – a gzipped human readable file – that takes up a lot of space. My issue is that I have to worry about many more things than I should need to: Is the size of the variable the same between C and Java? Is the compiler for C big endian or little endian, and does Java use the same conventions? How do I use masks to de-convolute the compressed flags? Admittedly, I love the idea of the binary file to save space for these large documents, but I’m not really a fan of the method.
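As an aside, the mask trick itself is simple enough once you know the layout – which is exactly the documentation MAQ doesn’t give you. Here’s a made-up example of the technique (this is NOT MAQ’s actual flag layout, just an illustration of pulling packed fields apart with masks and shifts):

```java
// Illustration of de-convoluting packed flags with masks and shifts.
// The layout here (strand in bit 0, a 7-bit quality in bits 1-7) is
// entirely invented for the example; it is not MAQ's real format.
public class PackedFlags {

    /** Bit 0: true for the reverse strand (hypothetical layout). */
    public static boolean strand(int flags) {
        return (flags & 0x01) != 0;
    }

    /** Bits 1-7: a 7-bit quality value (hypothetical layout). */
    public static int quality(int flags) {
        return (flags >> 1) & 0x7f;
    }
}
```

The code is the easy part; reverse-engineering which bits mean what is where the days disappear.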

The second is an appeal to people to stop coding in Perl! Come on people, can’t we pick a language that is maintainable? I’d like to show you a piece of code from the MAQ package, which I thought was responsible for converting the .map file to a .mapview file. (Fortunately, I was wrong, but it took me a day of staring at it before I realized I was looking at the wrong piece of code.)

while (<>) {
    my @t = split;
    next if ($t[8] <= $opts{q});
    next if ($t[5] == 18 || $t[5] == 64 || $t[5] == 130);
    my $do_break = ($t[1] ne $lasts || $t[2] - $lastc > $d)? 1 : 0;
    if ($do_break) {
        if ($lastc - $beginc > $opts{s}) { # skip short/unreliable regions
            my $k = @regs;
            my $flag = ($lastc - $beginc < $opts{i})? '*' : '.';
            push(@reg_name, "$begins\t$beginc\t$lastc\t$flag");
            my $p = @{$regs[$k]};
            foreach (@reg_seq) {
                push(@$p, $_);
                my @s = split;
                push(@{$read{$s[0]}}, $k);
            }
        }
        ($begins, $beginc) = @t[1..2];
        @reg_seq = ();
    }
    push(@reg_seq, join(" ", @t[0..3,5,8,13]));
    ($lasts, $lastc) = @t[1..2];
}

Take a look at that code. I have no idea what 18, 64 and 130 are supposed to be – and I have no documentation to tell me. It took me 2 days to figure out that $p is actually being pulled from the data file as a string, and then the content of $p becomes the name of the array further on, which is populated from the gzipped binary file. Even now, I’m still unsure about how the $regs variable gets populated.

If this were written in Python, Java, C…. well, anything but Perl, it would probably be interpretable and maintainable. I’ve heard it said that if you want to fix a bug in Perl, it’s easier to rewrite the program than to look for the error – and after trying to read this script, I believe it.

Anyhow, while slagging Perl, I should also mention that Perl can be written in a maintainable, well-documented way. Unfortunately, the Perl programming community seems to emphasize witty, compact and undecipherable coding styles, so I’ve rarely seen it done. If you’re going to open source your code, why not do it in a way that makes your code re-usable?