Normalizing reads for Next-Gen Sequencing.

This isn’t a particularly in depth blog.  Rather, I just want to touch on a few points in reply to a twitter question asking about how to normalize reads.

Actually, normalization is something I haven’t studied in great depth beyond the applications for Chip-Seq, where – of course – it’s still an open question.  So, if my answer is incomplete, please feel free to point out other resources in the comments.  Any contributions would be welcome.

First, one can do all sorts of fancy things to perform a normalization.  Frankly, I think the term is pretty abused, so a lot of what passes for normalization is a bit sketchy.  Anything that makes two samples “equivalent” in any way is often referred to as normalization, so yeah, your millage may vary when it comes to applying any approach.

For RNA-Seq, I’ve heard all sorts of techniques being applied.  The most common is to simply count the number of reads in the two samples, and then normalize by dividing by the ratio of reads between the two samples.  Very crude, and often very wrong, depending on what the two samples actually are.  I don’t use this approach unless I’m feeling lazy, someone has asked me to normalize a sample and it’s 5:40 on a Thursday evening. (I have better things to do on Thursday evenings.)

The second method is to bin the reads, then apply the same normalization as above.  The bins can be as large as a whole chromosome or as small as a few hundred base pairs, but the general method is the same: Use some average over a collection of bins to work out the ratio, then use that ratio to force the two samples to have the same approximate total read numbers.  It’s a bit better than what’s above, but not a whole lot better.

The third method that I’m aware of it to use a subset of reads to determine the normalization ratio.  This is a bit better – assuming you know enough to pick a good subset.   For instance, if you know housekeeping genes, you can use the total coverage over that set to approximate the relative abundance of reads in order to set the correct ratios.  This method can be dramatically better, if you happen to know a good set of genes (or other subset) to use, as it prevents you from comparing non-equivalent sets.

Just to harp on that last point, if you’re comparing a B-Cell- and a Mammary-derived cell line, you might be tempted to normalized on the total number of reads, however, it would quickly become apparent (once you look at the expressed genes) that some B-Cell genes are highly expressed and swamp your data set.  By paring those out of the normalization subset, you’d find your core genes in common to be more comparable – and thus less prone to bias introduced by genes only expressed in one sample.

You’ll notice, however, that all of the methods above use a simple ratio, with increasingly better methods of approximation.  That’s pretty much par for the course, as far as I’ve seen in RNA-Seq.  It’s not ideal, but I haven’t seen much more elegance than that.

When it comes to ChIP-Seq, the same things apply – most software does some variation of the above, and many of them are still floundering around with the first two types, of which I’m not a big fan.

The version I implemented in FindPeaks 4.0 goes a little bit differently, but can be applied to RNA-Seq just as well as for ChIP-Seq. (yes, I’ve tried) The basic idea is that you don’t actually know the subset of house-keeping genes in common in ChIP-Seq because, well, you aren’t looking at gene expression.  Instead, you’re looking at peaks – which can be broadly defined as any collection of peaks above the background.  Thus, the first step is to establish what the best subset of your data should be used for normalization – this can be done by peak calling your reads.  (Hence, using a peak caller.)

Once you have peak-calling done for both datasets, you can match the peaks up.  (Note, this is not a trivial operation, as it must be symmetrical, and repeatable regardless of the order of samples presented.)  Once you’ve done this, you’ll find you have three subsets:  peaks in sample 1, but not in sample 2.  Peaks in sample 2 but not in sample 1, and peaks common to both. (Peaks missing in both are actually important for anchoring your data set, but I won’t get into that.) If you only use the peaks common to both data sets, rather than peaks unique to one sample, you have a natural data subset ideal for normalization.

Using this subset, you can then perform a linear regression (again, it’s not a standard linear regression, as it must be symmetrical about the regression line, and not Y-axis dependent) and identify the best fit line for the two samples.  Crucially, this linear regression must pass through your point of origin, otherwise, you haven’t found a normalization ratio.

In any case, once all this is done, you can then use the slope of the regression line to determine the best normalization for your data sets.

The beauty of it is that you also end up with a very nice graph, which makes it easy to understand the data set you’ve compared, and you have your three subsets, each of which will be of some interest to the investigator

(I should also note, however, that I have not expanded this method to more than two data sets, although I don’t see any reason why it could not be.  The math becomes more challenging, but the concepts don’t change.)

Regardless, the main point is simply to provide a method by which two data sets become more comparable – the method by which you compare them will dictate how you do the normalization, so what I’ve provided above is only a vague outline that should provide you with a rough guide to some of the ways you can normalize on a single trait.  If you’re asking more challenging questions, what I’ve presented above may not be sufficient for comparing your data.

Good luck!

[Edit:  Twitter user @audyyy sent me this link, which describes an alternate normalization method.  In fact, they have two steps – a pre-normalization log transform (which isn’t normalization, but it’s common.  Even FindPeaks 4.0 has it implemented), and then a literal “normalization” which makes the mean = 0, and the standard deviation =1.   However, this is only applicable for one trait across multiple data sets (eg, count of total reads for a large number of total libraries.)  That said, it wouldn’t be my first choice of normalization techniques.]


ArsenicLife – a deeper shift than meets the eye.

Unfortunately, I don’t really have the time to write this post out in full, so please don’t mind the rough format in which I’ve assembled it.  I’ve been trying to twitter more than blog recently so that I wouldn’t distract myself by spending hours researching and writing blog posts.  Go figure.

However, I don’t think I can keep myself from commenting this time, even if it’s just in passing.  The whole arsenic-based-life (or #arseniclife) affair is really a much deeper  story than it appears.  It’s not just a story about some poor science, but rather a clash of cultures.

First, I read through the list of published critiques of the arsenic paper, as well as the response on the same page.  They critiques are pretty thoughtful and clear, giving me the overall impression that the authors of the original paper just didn’t bother to talk to specialists outside of their own narrow field.   That’s the first clash of cultures:  Specialists vs. interdisciplinary researchers.  If you neglect to consult people who can shed light on your results, you’re effectively ignoring alternative hypotheses.   Biologists have been guilty of this in the past, however, failing to consult statisticians before constructing a story their data doesn’t support.  In this case, however, it’s more blatant because the authors should have consulted with other biologists, the least of them being specialists in microbial biology. (Rosie Redfield’s blog comes to mind for some of the critiques that should have been solicited before the paper was sent to journals.)

If that wasn’t enough, this clash is also underpinned by “oldschool” meets “newschool” – aka, technology.  This isn’t an original idea of mine, as I’ve seen it hinted at elsewhere, but it’s an important idea.  Underneath all of the research, we have a science battle that’s being fought out in the journals, while new media runs circles around it.  It took almost 6 months for Science to print 6 critiques that stretch from a half page to just over a page.  In the world of blogs, that is about 2 hours worth of work.

I really don’t know what’s involved in having a small half-page article go to press, but I’m quite surprised if it would take 6 months to do that amount of work.  In contrast, a great many blogs popped up with serious scientific criticisms in hours, if not days, of the original embargo on the paper being lifted. (The embargo itself was a totally ridiculous, but that’s another tangent I’m not going to follow.)  The science discussion in the blogs was every bit as valid as the critiques Science published.

Frankly, the arsenic life paper leaves me stunned on many levels:  How long will continue to believe they can work in independent fields, publishing results without considering the implications of their work on other fields?  How long will journals be the common currency of science given their sluggish pace to keep up with the discussions?  How long will blogs  (and not the anonymous kind) be relegated to step-child status in science communication?

Given the rapid pace with which science progresses as a whole, it’s only a matter of time before something needs to be done to change the way we chose to publish and collaborate in the open.

For more reading on this, I suggest the Nature article here.


Illumina’s MiSeq.

Really, I have nothing to add here.  Keith Robinson on the Omics! Omics! blog has already said it all – right down to the email from Illumina’s PR person.

Admittedly, the poster they sent is quite pretty, but as always, I’m waiting to see how the instrument performs in other people’s hands.  (Though, that’s not to say I doubt the results, but I have been bitten by Illumina’s optimistic reports in the past – with specific emphasis on shipping dates for software.)

At this point in the game, we’ve now entered into a long protracted arms race, with each company trying to out-perform the others, but with very few new features.  Improving chemistry, longer reads,  cheaper per-base costs, faster sequencing time and better base qualities will continue to ratchet up – so the MiSeq is Illumina raising the bar again.  Undoubtedly we’ll continue to see other competitors showing off their products for the next few years, trying to push into the market.  (A market which grows to include smaller labs every time they can reduce the cost of the machine, the sequencing, and the bioinformatics overhead.)

However, let me say that we’ve certainly come a long way from the single end 22bp reads that I first saw from the Solexa machines in 2008.   mmmmm… PET 151bp reads in 27 hours. *drool*.

Edit:  Oops.  I missed the link to Illumina’s page for the MiSeq.  Yes, it’s exactly what you’d expect to see on a vendor page, but they do have a link to the poster on the left hand side so that you can check it out for yourself.


Random OpenOffice 3.2 tip

Just in case I ever come across this problem again.

If OpenOffice 3.2 repeatedly crashes on dialog boxes for open/save/insert image/etc, the fix is to install a single package that seems to import the proper dialog box tool kit:

This is probably something that only happens to Kubuntu users, as doesn’t sound like a package you’d need on kde.

Anyhow, if you’re using kubuntu, the fix is as simple as:

sudo apt-get install

And instantaneously, everything works again.

Dog Bath

It’s the weekend, and I’m trying not to do too much science today.  That shouldn’t be too much of a problem, really – the canucks are playing at 5pm, I’m going to be making BBQ personal pizzas, and I got all of my chores done yesterday.  So, today was the day to bathe the dog.  I’d hate to have the doggy pet-sitter get a stinky dog, when my wife and I go to Copenhagen.

If you’ve never seen a puli get a bath, though, it’s something pretty spectacular.

Working with Jackie Chan.

Since I’ve been posting jobs, I figured I may as well point people to another set of open positions.  Of course, again, I have no relationship with the people posting it… however, I just couldn’t not say anything about this set.

Apparently, if you work in the Pallen Group, you get to work on next gen sequencing pipelines in the lab with Jackie Chan. (Research Fellow in Microbial Bioinformatics)

How cool is that?Jackie Chan

Anyhow, The other position (Research Technician in Bioinformatics), doesn’t (apparently) involve martial arts.

Some open positions at Complete Genomics

In case anyone is looking for a bioinformatics position, Complete Genomics is currently looking to fill at least three positions:

Senior Computational Biologist, Assembly

Complete Genomics is seeking a senior Computational Biologist to develop methods and tools for its sequence assembly pipeline and genomics analysis toolset. The role will require conceiving of analysis approaches to solve key problems in a next-generation DNA sequencing pipeline, then designing, developing, testing, and deploying pragmatic solutions. The new hire will work on a pipeline that each day transforms trillions of sequenced bases into high quality human genome analyses.

Senior Software Engineer, Assembly

Senior Bioinformatician, Pipeline Validation

The assembly pipeline team is an amalgam of exceptional bioinformaticians and software engineers of diverse backgrounds, with an active and lively cross-pollination of ideas and approaches. The team is actively investigating improvements to aid in a wide variety of genomics research applications including cancer, Mendelian diseases, and large scale association studies.

They have also provided the following note:

We are especially interested in seeking out individuals who have had experience in one or more of the following:

  • Developing or modifying algorithms for the analysis of next-generation sequence read data (on the lines of GATK, Firehose, or similar)
  • Developing sets of relevant functional annotations to associate with DNA sequence variations, and analyses of annotated data for the detection of disease-associated variants in pedigrees, tumor-normal pairs, or unrelated genome sets.
  • Analyses of sequenced genomes against standard data (internal and external) to estimate errors, calibrate scores, identify signals of interest, etc…
  • We’re open to people who have spent time in academic or research settings.

If you’d like more information, the have more details at the complete genomics careers page.

Disclosure: I am not in any way affiliated with Complete Genomics – I just happen to think they’re a really interesting company and am always impressed with how accepting they are of social media.

Java 1.6 based fork of the Ensembl API.

Just in case anyone is still interested, I have started an ensj (Ensembl Java API) project at sourceforge, using the latest version of the ensj-core project as the root of the fork.  It fixes at least one bug with using the java API on hg19, and makes some improvements to the code for compatibility with java 1.6.

There are another few thousand changes I could make, but I’m just working on it slowly, if at all.

I’m not intending to support this full time, but interested parties are welcome to join the project and contribute, making this a truly open source version of the ensembl interface.  That is to say, community driven.

Ensembl isn’t interested in providing support (they no longer have people with the in-depth knowledge of the API to provide support), so please don’t use this project with the expectation of help from the Ensembl team.  Also note that significant enhancements or upgrades are unlikely unless you’re interested in contributing to them! (I have my own dissertation to write and am not looking to take this on as a full time job!)

If you’re interested in using it, however, you can find the project here:

and a few notes on getting started here and here.  I will get around to posting some more information on the project on sourceforge web site when I get a chance.


A few notes on Ubuntu 11.04

Right off the bat, I have to say that I’ve been using 11.04 on one of my computers since one of the earliest alpha releases, so I’ve had the opportunity to watch it mature.  It’s something I often do – pick one computer, and use it to test out the new versions of ubuntu, upgrading the packages daily to follow along with the progress of the development. It’s usually a rewarding process, and I enjoy fixing bugs and learning how the operating system components fit together.

This was one of the few times that it was a disappointment for me.

Normally, I’ll start the process a month after the newest release, spending about two months using highly unstable versions, which then improve over time so that nearly all of the bugs are gone about a month before the official release.  This time, the release has come and gone, and my computer still doesn’t feel particularly useful.

Unlike the majority of Ubuntu users, I’m not using the default Ubuntu with Gnome window manager (or is it unity, now?), but rather KDE + Compiz.  Unfortunately, while the KDE version of Ubuntu (Kubuntu) doesn’t appear to get nearly as much attentention from the devs, it usually does march in lockstep, giving reasonable releases that come together at the last minute.  Unusually, compiz just never did come back together for me.  Perhaps it has to do with the upgrades to the xorg packages, but something is just not right when using compiz.

The symptoms are frequent restarts and crashes of the windowing system, in which the CPU usage of compiz soars to 100% and fails to respond to anything short of a the “killl -9” command.  Unfortunately, that makes compiz totally unuseable.  Ironically, it seems to happen only when closing windows, which seems like a strange place for a bug.

The workaround, if it’s actually a work around, is to completely disable compiz, which means switching back to the default KDE window manager.  Fortunately, it’s stable, so at least I can use the computer.  Unfortunately, I find Compiz to be a significant boost to my productivity, with the rotating cube desktop, the scale plugin, etc.  All of those things really make the experience on the computer, so turning all of them off it just unapealing.

In any case, for now, I’ll be staying with 10.10 on my laptops and production computers until compiz is fixed, with the possibility of jumpping straight to 11.10 as soon as that begins if I see compiz gets a bit more of the attention it deserves.

Your milage may vary, but 11.04 isn’t likely to make an appearance on my main computers anytime soon.