Starting over

I’m back.  Physically, I’m back in Canada, although not yet home.  I’m visiting family while all our possessions make their way back to Vancouver.  In the meantime, I wanted to get back to blogging.  To re-engage in the community and return my life to some sense of normalcy.

On Denmark, I don’t plan to say much.  It was a terrible experience from start to finish, and I’m leaving with less money, stability – and none of the bioinformatics experience I had wanted.  All in all, it was a disaster.  If people want specific details or advice about moving to Denmark, of course I’ll share what I know, but this isn’t the right forum for it.  For the moment, I won’t comment on how things went down at the end, although I’ve heard less than accurate versions in circulation.

On the subject of bioinformatics, I feel a bit out of touch.  I’ll be starting to get back into it shortly.  Obviously, it’ll take some time to ramp up and get back in to the swing of things.  However, I can say that last night was the first time in a year that I had actual free time. So what did I do?  I started to learn Python.  Honestly, I don’t think Java is the right tool for all occasions, and with about a month of downtime, python just feels like it might be the best fit for some of the stuff I’ll be working on in the future.

Anyhow, with any luck, things will start to work their way out.  At least, being back in Canada, I can see the light at the end of the tunnel.

Where’s the collaboration?

I had another topic queued up this morning, but an email from my sister-in-law reminded me of a more pressing beef: Lack of collaboration in the sciences. And, of course, I have no statistics to back this up, so I’m going to put this out there and see if anyone has anything to comment on the topic.

My contention is that the current methods for funding scientists is the culprit for driving less efficient science, mixed with a healthy dose of Zero Sum Game thinking.

First, my biggest pet peeve is that scientists – and bioinformaticians in particular – spend a lot of time reinventing the wheel.  How many SNP callers are currently available?  How many ChiP-Seq packages? How many aligners?  And, more importantly, how can you tell one from the other?  (How many of the hundreds of snp callers have you actually used?)

It’s a pretty annoying aspect of bioinformatics that people seem to feel the need to start from scratch on a new project every time they say “I could tweak a parameter in this alignment algorithm…”  and then off they go, writing aligner #23,483,337 from scratch instead of modifying the existing aligner.  At some point, we’ll have more aligners than genomes!  (Ok, that’s a shameless hyperbole.)

But, the point stands.  Bioinformaticians create a plethora of software that solve problems that are not entirely new.  While I’m not saying that bioinformaticians are working on solved problems, I am asserting that the creation of novel software packages is an inefficient way to tackle problems that someone else has already invested time/money into building software for. But I’ll come back to that in a minute.

But why is the default behavior to write your own package instead of building on top of an existing one?  Well, that’s clear: Publications.  In science, the method of determining your progress is how many journal publications you have, skewed by some “impact factor” for how impressive the name of the journal is.  The problem is that this is a terrible metric to judge progress and contribution.  Solving a difficult problem in an existing piece of software doesn’t merit a publication, but wasting 4 months to rewrite a piece of software DOES.

The science community, in general, and the funding community more specifically, will reward you for doing wasteful work instead of focusing your energies where it’s needed. This tends to squash software collaborations before they can take off simply by encouraging a proliferation of useless software that is rewarded because it’s novel.

There are examples of bioinformatics packages where collaboration is a bit more encouraged – and those provide models for more efficient ways of doing research.  For instance, in the molecular dynamics community, Charmm and Amber are the two software frameworks around which most people have gathered. Grad students don’t start their degree by being told to re-write one or the other packages, but are instead told to learn one and then add modules to it.  Eventually the modules are released along with a publication describing the model.  (Or left to rot in a dingy hard drive somewhere if they’re not useful.)   Publications come from the work done and the algorithm modifications being explained.  That, to me, seems like a better model – and means everyone doesn’t start from scratch

If you’re wondering where I’m going with this, it’s not towards the Microsoft model where everyone does bioinformatics in Excel, using Microsoft generated code.

Instead, I’d like to propose a coordinated bioinformatics code-base.  Not a single package, but a unified set of hooks instead.  Imagine one code base, where you could write a module and add it to a big git hub of bioinformatics code – and re-use a common (well debugged) core set of functions that handle many of the common pieces.  You could swap out aligner implementations and have modular common output formats.  You could build a chip-seq engine, and use modular functions for FDR calculations, replacing them as needed.  Imagine you could collaborate on code design with someone else – and when you’re done, you get a proper paper on the algorithm, not an application note announcing yet another package.

(We have been better in the past couple years with tool sets like SAMTools, but that deals with a single common file format.  Imagine if that also allowed for much bigger projects like providing core functions for RNA-Seq or CNV analysis…  but I digress.)

Even better, if we all developed around a single set of common hooks, you can imagine that, at the end of the day (once you’ve submitted your repository to the main trunk), someone like the Galaxy team would simply vacuum up your modules and instantly make your code available to every bioinformatician and biologist out there.  Instant usability!

While this model of bioinformatics development would take a small team of core maintainers for the common core and hooks, much the same way Linux has Linus Torvalds working on the Kernel, it would also cut down severely on code duplication, bugs in bioinformatics code and the plethora of software packages that never get used.

I don’t think this is an unachievable goal, either for the DIY bioinformatics community, the Open Source bioinformatics community or the academic bioinformatics community.  Indeed, if all three of those decided to work together, it could be a very powerful movement.  Moreso, corporate bioinformatics could be a strong player in it, providing support and development for users, much the way corporate Linux players have done for the past two decades.

What is needed, however, is buy-in from some influential people, and some influential labs.  Putting aside their own home grown software and investing in a common core is probably a challenging concept, but it could be done – and the rewards would be dramatic.

Finally, coming back to the funding issue.  Agencies funding bioinformatics work would also save a lot of money by investing in this type of framework.  It would ensure more time is spent on more useful coding, more time is spent on publications that do more to describe algorithms and to ensure higher quality code is being produced at the end of the day.  The big difference is that they’d have to start accepting that bioinformatics papers shouldn’t be about “new software” available, but “new statistics”, “new algorithms” and “new methods” – which may require a paradigm change in the way we evaluate bioinformatics funding.

Anyhow, I can always dream.

Notes: Yes, there are software frameworks out there that could be used to get the ball rolling.  I know Galaxy does have some fantastic tools, but (if I’m not mistaken), it doesn’t provide a common framework for coding – only for interacting with the software.  I’m also aware that Charmm and Amber have problems – mainly because they were developed by competing labs that failed to become entirely enclusive of the community, or to invest substantially in maintaining the infrastructure in a clean way.Finally, Yes, the licensing of this code would determine the extent of corporate participation, but the GPL provides at least one successful example of this working.

“bioinformatics is in decline”

Actually, the original post said “Is bioinformatics on decline?”  That lack of consideration for English grammar might be a bit of a foreshadowing for the flavour of the arguments that I’ll relay below.  (Granted, I would never ignore a post solely because of bad grammar and the author is clearly not a native English speaker, so lets not dwell on that point.)

Let me summarize the original post:

  • A link to a post that indicates the author is upset that some people have to ask what fields are encompassed by bioinformatics.
  • A link to a blog post that says “Less people are searching for the word bioinformatics using google”.
  • Some ambiguous claim that this applies to India, which is later clarified to mean “northern India”, which is never explained.
  • The author, by the way, claims to be a Search Engine Optimization (SEO) “expert” on his blog.

The last point is a great indication to me about the quality of the information you’re about to hear from him, in my opinion.  Anyone who claims to be interested in search engine optimization is all about hits, not about quality of information disseminated.  That’s a simple guarantee –  and I’m certain that Mr Brij has gotten a lot of hits on his web page by making what is not a controversial statement, but an obtuse one.

In essence, I believe this whole claim is likely more about driving traffic to his blog than making a valid contribution to a discussion.  I regret having visited it already, and I certainly won’t be linking to it.

In any case, lets deal with his first two points.  The first is that “People are asking what bioinformatics is about.”  To me, that indicates that more people are interested in the field, rather than that fewer people are interested.  If he never had to answer the question about bioinformatics, then I might be concerned that he’d either saturated the world with people knowledgeable about the field, or that no one new was hearing about it.  Clearly the first is impossible.  (So long as I have family members, I can guarantee first hand knowledge that there are people who are clueless about what bioinformatics really is.)

In reality,  this is just anecdotal evidence that the author is relaying.  We’re better off ignoring it and assuming that the conclusions drawn by Mr. Brij are just inaccurate extrapolations from bad data.

Lets move on to the second claim that Mr. Brij makes about the decline of bioinformatics on google trends.  I’m going to ignore the  wonderful comment someone made about how all searches on google trends have to be considered in relation to the growing number of searches for “porn”.  I don’t know if it’s true or not, though it seems plausible, but it’s perhaps not the best refutation for Mr. Brij’s assertion that bioinformatics is declining.

Instead, lets look at a different set of words:  {bioinformatics, biotech, pharma and antibody}. This will give us a control experiment, as there are magazines and entire industries devoted to watching biotech and pharma for trends.  If either of those were in steep decline, it would undoubtedly be all over the front pages of every industry publication.

Guess what?  They all show exactly the same downward trend!  (Even down to the little dip over the North American winter holidays.  I’d definitely be interested in hearing more about that phenonmenon!)  Does that mean that the pharma and biotech industries are going to be extinct in the near future?  Not so much – it’s simply a function of the overall traffic on google.

I surmise that in order to predict the decline of something on the web, you’ll need much more useful information than the proportion of google searches that include a specific term.  People haven’t stopped using antibodies over the past decade and bioinformatics sure as heck hasn’t gotten any less popular.  Frankly, I doubt most bioinformaticians spend their time googling for “bioinformatics” anyhow, which makes this a pretty moot point.  We all have better things to do.

The third point from the original author is that this somehow relates specifically to bioinformatics in India.  I’m not even going to deal with this point other than to say that the data presented doesn’t support this claim.  No statistics are presented on Mr. Brij’s blog and the excerpt from an excel spreadsheet that is given only demonstrates that there is fluctuation in the frequency of the search terms by month.  It’s not at all enlightening.

So, if I may pass along one piece of advice to the “SEO expert” Mr. Brij: Instead of trying to get more traffic to your site by placing links to your blog next to poorly phrased questions on popular web sites, I suggest you try to produce quality analyses and insightful observations.  That will drive up the number of hits you get in the long term far more than trying to suck people in with ridiculous claims.

In the meantime, I won’t be visiting your page again.

A Reply to: Should a Rich Genome Variant File End the Storage of Raw Read Data?

Before you read my post, you might want to zip over to Complete Genomic’s blog where C.S.O. Dr. Rade Drmanac wrote an entry titled “Should a Rich Genome Variant File End the Storage of Raw Read Data?”  It’s an interesting perspective where he suggests that, as the article’s title might indicate, we should only be keeping a rich variant file as the only trace of a sequencing run.

I should mention that I’m really not going to distinguish between storing raw reads and storing aligned reads – you can go from one to the other by stripping out the alignment information, or by aligning to a reference template.  As far as I’m concerned, no information is lost when you align, unlike moving to a rich variant file (or any other non-rich variant file, for that matter.)

I can certainly see the attraction of deleting the raw or aligned data – much of the data we store is never used again, takes up space and could be regenerated at will from frozen samples of DNA, if needed.  As a scientist, I’ve very rarely have had to go back into the read-space files and look at the pre-variant calling data – and only a handfull of times in the past 2 years.  As such, it really doesn’t make much sense to store alignments, raw reads or other data. If I needed to go back to a data set, it would be (could be?) cheaper just to resequence the genome and regenerate the missing read data.

I’d like to bring out a real-world analogy, to summarize this argument. I had recently been considering a move, only to discover that storage in Vancouver is not cheap.  It’ll cost me $120-150/month for enough room to store some furniture and possessions I wouldn’t want to take with me.  If the total value of those possessions is only $5,ooo, storing them for more than 36 months means that (if I were to move) it would have been cheaper to sell it all and then buy a new set when I come back.

Where the analogy comes into play quite elegantly is if I have some interest in those particular items.  My wife, for instance, is quite attached to our dining room set.  If we were to sell it, we’d have to eventually replace it, and it might be impossible to find another one just like it at any price.  It might be cheaper to abandon the furniture than to store it in the long run, but if there’s something important in that data, the cost of storage isn’t the only thing that comes into play.

While I’m not suggesting that we should be emotionally attached to our raw data, there is merit in having a record that we can return to – if (and only if) there is a strong likelyhood that we will return to the data for verification purposes.  You can’t always recreate something of interest in a data set by resequencing.

That is a poor reason to keep the data around, most of the time.  We rarely find things of interest that couldn’t be recreated in a specific read set when we’re doing genome-wide analysis. Since most of the analysis we do uses only the variants and we frequently verify our findings with other means, the cost/value argument is probably strongly in favour of throwing away raw reads and only storing the variants.  Considering my recent project on storage of variants (all 3 billion of the I’ve collected), people have probably heard me make the same arguments before.

But lets not stop here. There is much more to this question than meets the eye. If we delve a little deeper into what Dr. Drmanac is really asking, we’ll find that this question isn’t quite as simple as it sounds.  Although the question stated basically boils down to “Can a rich genome variant file be stored instead of a raw read data file?”, the underlying question is really: “Are we capable of extracting all of the information from the raw reads and storing it for later use?”

Here, I actually would contend the answer is no, depending on the platform.  Let me give you examples of what data I feel we do a poor example of extracting, right now.

  • Structural variations:  My experience with structural variations is that no two SV callers give you the same information or make the same calls.  They are notoriously difficult to evaluate, so any information we are extracting is likely just the tip of the iceberg.  (Same goes for structural rearrangements in cancers, etc.)
  • Phasing information:  Most SNP callers aren’t giving phasing information, and some aren’t capable of it.  However, those details could be teased from the raw data files (depending on the platform).  We’re just not capturing it efficiently.
  • Exon/Gene expression:  This one is trivial, and I’ve had code that pulls this data from raw aligned read files since we started doing RNA-Seq.  Unfortunately, due to exon annotation issues, no one is doing this well yet, but it’s a huge amount of clear and concise information available that is obviously not captured in variant files.  (We do reasonably well with Copy Number Variations (CNVs), but again, those aren’t typically sored in rich variant files.)
  • Variant Quality information: we may have solved the base quality problems that plagued early data sets, but let me say that variant calling hasn’t exactly come up with a unified method of comparing SNV qualities between variant callers.  There’s really no substitute for comparing other than to re-run the data set with the same tools.
  • The variants themselves!  Have you ever compared the variants observed by two SNP callers run on the same data set? I’ll spoil the suspense for you: they never agree completely, and may in fact disagree on up to 25% of the variants called. (Personal observation – data not shown.)

Even dealing only with the last item, it should be obvious:  If we can’t have two snp callers produce the same set of variants, then no amount of richness in the variant file will replace the need to store the raw read data because we should always be double checking interesting findings with (at least) a second set of software.

For me, the answer is clear:  If you’re going to stop storing raw read files, you need to make sure that you’ve extracted all of the useful information – and that the information you’ve extracted is complete.  I just don’t think we’ve hit those milestones yet.

Of course, if you work in a shop where you only use one set of tools, then none of the above problems will be obvious to you and there really isn’t a point to storing the raw reads.  You’ll never return to them because you already have all the answers you want.  If, on the other hand, you get daily exposure to the uncertainty in your pipeline by comparing it to other pipelines, you might look at it with a different perspective.

So, my answer to the question “Are we ready to stop storing raw reads?” is easy:  That depends on what you think you need from a data set and if you think you’re already an expert at extracting it.  Personally, I think we’ve barely scratched the surface on what information we can get out of genomic and transcriptomic data, we just don’t know what it is we’re missing yet.

Completely off topic, but related in concept: I’m spending my morning looking for Scalable Vector Graphics (.svg) files for many jpg and png files I’d created along the course of my studies.  Unfortunately, jpg and png are lossy formats and don’t reproduce as nicely in the Portable Document Format (PDF) export process.  Having deleted some of those .svg files because I thought I had extracted all of the useful information from them in the export to png format, I’m now at the point where I might have to recreate them to properly export the files again in a lossless format for my thesis.  If I’d just have stored them (as the cost is negligible) I wouldn’t be in this bind….   meh.


Nature Comment : The case for locus-specific databases

There’s an interesting comment available in Nature today (EDIT: it came out last month, though I only found it today.) Unfortunately, it’s by subscription only, but let me save you the hassle of downloading it, if you don’t already have a subscription.  It’s not what I thought it was.

The entire piece fails to make the case for locus-specific databases, but instead conflates locus-specific with “high-resolution”, and then proceeds to tell us why we need high resolution data.  The argument can roughly be summarized as:

  • Omim and databases like it are great, but don’t list all known variations
  • Next-gen sequencing gives us the ability to see genome in high resolution
  • You can only get high-resolution data by managing data in a locus-specific manner
  • Therefore, we should support locus-specific databases

Unfortunately, point number three is actually wrong.  It’s just that our public databases haven’t yet transitioned to the high resolution format.  (ie, we have an internal database that stores data in a genome-wide manner at high resolution…  the data is, alas, not public.)

Thus, on that premise, I don’t think we should be supporting locus specific databases specifically –  indeed, I would say that the support they need is to become amalgamated in to a single genome-wide database at high resolution.

You wouldn’t expect major gains in understanding of car mechanics if you, by analogy, insisted that all parts should be studied independently at high resolution.  Sure you might improve your understanding of each part, and how it works alone, but the real gains come from understanding the whole system.  You might not actually need certain parts, and sometimes you need to understand how two parts work together.  It’s only by studying the whole system that you begin to see the big picture.

IMHO, Locus-specific databases are blinders that we adopt in the name of needing higher resolution, which is more of a comment on the current state of biology.  In fact, the argument can really be made that we don’t need locus-specific databases, we need better bioinformatics!

Dueling Databases of Human Variation

When I got it to work this morning, I was greeted by an email from 23andMe’s PR company, saying they have “built one of the world’s largest databases of individual genetic information.”   Normally, I wouldn’t even bat an eye at a claim like that.  I’m pretty sure it is a big database of variation…  but I thought I should throw down the gauntlet and give 23andMe a run for their money.  (-:

The timing for it couldn’t be better for me.  My own database actually ran out of auto-increment IDs this week, as we surpassed 2^31 snps entered into the db and had to upgrade the key field to bigint from int. (Some variant calls have been deleted and replaced as variant callers have improved, so we actually have only 1.2 Billion variations recorded against the hg18 version of the human genome.  A few hundred million more than that for hg19.)  So, I thought I might have a bit of a claim to having one of the largest databases of human variation as well.  Of course, comparing databases really is dependent on the metric being used, but hey, there’s some academic value in trying anyhow.

In the first corner, my database stores information from 2200+ samples (cancer and non-cancer tissue), genome wide (or transcriptome wide, depending on the source of the information.), giving us a wide sampling of data, including variations unique to individuals, as well as common polymorphisms.  In the other corner, 23andMe has sampled a much greater number of individuals (100,000) using a SNP chip, meaning that they’re only able to sample a small amount of the variation in an individual – about 1/3rd of a single percent of the total amount of DNA in each individual.

(According to this page, they look at only 1 million possible SNPs, instead of the 3 Billion bases at which single nucleotide variations can be found – although arguments can be made about the importance of that specific fraction of a percent.)

The nature of the data being stored is pretty important, however.  For many studies, the number of people sampled has a greater impact on the statistics than the number of sites studied and, since those are mainly the ones 23andMe are doing, clearly their database is more useful in that regard.  In contrast, my database stores data from both cancer and non-cancer samples, which allows us to make sense of variations observed in specific types of cancers – and because cancer derived variations are less predictable (ie, not in the same 1M snps each time) than the run-of-the-mill-standard-human-variation-type snps, the same technology 23andMe used would have been entirely inappropriate for the cancer research we do.

Unfortunately, that means comparing the two databases is completely impossible – they have different purposes, different data and probably different designs.  They have a database of 100k individuals, covering 1 million sites, whereas my database has 2k individuals, covering closer to 3 billion base pairs.  So yeah, apples and oranges.

(In practice, however, we don’t see variations at all 3 Billion base pairs, so that metric is somewhat skewed itself.  The number is closer to 100 Million bp –  a fraction of the genome nearly 100 times larger than what 23andMe is actually sampling.)

But, I’d still be interested in knowing the absolute number of variations they’ve observed…  a great prize upon which we could hold this epic battle of “largest database of human variations.”  At best, 23andMe’s database holds 10^11 variations, (1×10^6 SNPs x 1×10^5 people), if every single variant was found in every single person – a rather unlikely case.  With my database currently  at 1.2×10^9 variations, I think we’ve got some pretty even odds here.

Really, despite the joking about comparing database sizes, the real deal would be the fantastic opportunity to learn something interesting by merging the two databases, which could teach use something both about cancer and about the frequencies of variations in the human population.

Alas, that is pretty much certain to never happen.  I doubt 23andMe will make their database public – and our organization never will either.  Beyond the ethical issues of making that type of information public, there are pretty good reasons why this data can only be shared with collaborators – and in measured doses at that.  That’s another topic for another day, which I won’t go into here.

For now, 23andMe and I will just have to settle for both having “one of the world’s largest databases of individual genetic information.”  The battle royale for the title will have to wait for another day… and who knows what other behemoths are lurking in other research labs around the world.

On the other hand, the irony of a graduate student challenging 23andMe for the title of largest database of human variation really does make my day. (=

[Note: I should mention that when I say that I have a database of human variation, the database was my creation but the data belongs to the Genome Sciences Centre – and credit should be given to all of those who did the biology and bench work, performed the sequencing, ran the bioinformatics pipelines and assisted in populating the database.]

Normalizing reads for Next-Gen Sequencing.

This isn’t a particularly in depth blog.  Rather, I just want to touch on a few points in reply to a twitter question asking about how to normalize reads.

Actually, normalization is something I haven’t studied in great depth beyond the applications for Chip-Seq, where – of course – it’s still an open question.  So, if my answer is incomplete, please feel free to point out other resources in the comments.  Any contributions would be welcome.

First, one can do all sorts of fancy things to perform a normalization.  Frankly, I think the term is pretty abused, so a lot of what passes for normalization is a bit sketchy.  Anything that makes two samples “equivalent” in any way is often referred to as normalization, so yeah, your millage may vary when it comes to applying any approach.

For RNA-Seq, I’ve heard all sorts of techniques being applied.  The most common is to simply count the number of reads in the two samples, and then normalize by dividing by the ratio of reads between the two samples.  Very crude, and often very wrong, depending on what the two samples actually are.  I don’t use this approach unless I’m feeling lazy, someone has asked me to normalize a sample and it’s 5:40 on a Thursday evening. (I have better things to do on Thursday evenings.)

The second method is to bin the reads, then apply the same normalization as above.  The bins can be as large as a whole chromosome or as small as a few hundred base pairs, but the general method is the same: Use some average over a collection of bins to work out the ratio, then use that ratio to force the two samples to have the same approximate total read numbers.  It’s a bit better than what’s above, but not a whole lot better.

The third method that I’m aware of it to use a subset of reads to determine the normalization ratio.  This is a bit better – assuming you know enough to pick a good subset.   For instance, if you know housekeeping genes, you can use the total coverage over that set to approximate the relative abundance of reads in order to set the correct ratios.  This method can be dramatically better, if you happen to know a good set of genes (or other subset) to use, as it prevents you from comparing non-equivalent sets.

Just to harp on that last point, if you’re comparing a B-Cell- and a Mammary-derived cell line, you might be tempted to normalized on the total number of reads, however, it would quickly become apparent (once you look at the expressed genes) that some B-Cell genes are highly expressed and swamp your data set.  By paring those out of the normalization subset, you’d find your core genes in common to be more comparable – and thus less prone to bias introduced by genes only expressed in one sample.

You’ll notice, however, that all of the methods above use a simple ratio, with increasingly better methods of approximation.  That’s pretty much par for the course, as far as I’ve seen in RNA-Seq.  It’s not ideal, but I haven’t seen much more elegance than that.

When it comes to ChIP-Seq, the same things apply – most software does some variation of the above, and many of them are still floundering around with the first two types, of which I’m not a big fan.

The version I implemented in FindPeaks 4.0 goes a little bit differently, but can be applied to RNA-Seq just as well as for ChIP-Seq. (yes, I’ve tried) The basic idea is that you don’t actually know the subset of house-keeping genes in common in ChIP-Seq because, well, you aren’t looking at gene expression.  Instead, you’re looking at peaks – which can be broadly defined as any collection of peaks above the background.  Thus, the first step is to establish what the best subset of your data should be used for normalization – this can be done by peak calling your reads.  (Hence, using a peak caller.)

Once you have peak-calling done for both datasets, you can match the peaks up.  (Note, this is not a trivial operation, as it must be symmetrical, and repeatable regardless of the order of samples presented.)  Once you’ve done this, you’ll find you have three subsets:  peaks in sample 1, but not in sample 2.  Peaks in sample 2 but not in sample 1, and peaks common to both. (Peaks missing in both are actually important for anchoring your data set, but I won’t get into that.) If you only use the peaks common to both data sets, rather than peaks unique to one sample, you have a natural data subset ideal for normalization.

Using this subset, you can then perform a linear regression (again, it’s not a standard linear regression, as it must be symmetrical about the regression line, and not Y-axis dependent) and identify the best fit line for the two samples.  Crucially, this linear regression must pass through your point of origin, otherwise, you haven’t found a normalization ratio.

In any case, once all this is done, you can then use the slope of the regression line to determine the best normalization for your data sets.

The beauty of it is that you also end up with a very nice graph, which makes it easy to understand the data set you’ve compared, and you have your three subsets, each of which will be of some interest to the investigator

(I should also note, however, that I have not expanded this method to more than two data sets, although I don’t see any reason why it could not be.  The math becomes more challenging, but the concepts don’t change.)

Regardless, the main point is simply to provide a method by which two data sets become more comparable – the method by which you compare them will dictate how you do the normalization, so what I’ve provided above is only a vague outline that should provide you with a rough guide to some of the ways you can normalize on a single trait.  If you’re asking more challenging questions, what I’ve presented above may not be sufficient for comparing your data.

Good luck!

[Edit:  Twitter user @audyyy sent me this link, which describes an alternate normalization method.  In fact, they have two steps – a pre-normalization log transform (which isn’t normalization, but it’s common.  Even FindPeaks 4.0 has it implemented), and then a literal “normalization” which makes the mean = 0, and the standard deviation =1.   However, this is only applicable for one trait across multiple data sets (eg, count of total reads for a large number of total libraries.)  That said, it wouldn’t be my first choice of normalization techniques.]


Working with Jackie Chan.

Since I’ve been posting jobs, I figured I may as well point people to another set of open positions.  Of course, again, I have no relationship with the people posting it… however, I just couldn’t not say anything about this set.

Apparently, if you work in the Pallen Group, you get to work on next gen sequencing pipelines in the lab with Jackie Chan. (Research Fellow in Microbial Bioinformatics)

How cool is that?Jackie Chan

Anyhow, The other position (Research Technician in Bioinformatics), doesn’t (apparently) involve martial arts.