Ok, I sent a tweet about it, but it didn’t relieve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you’d think they grow on trees. (Actually, they grow on arrays…) I’ve written one myself, and I know there are at least 3 others written at the GSC.

Anyhow, at first sight, what pisses me off is that there’s no standard format. Frankly, though, that’s not even the big problem. What’s really underlying that problem is that there’s no standard “minimum information” content being produced by the SNP/SNV callers. Many of them give a bare minimum of information, but lack the details needed to really evaluate a call.

So, here’s what I propose. If you’re going to write a SNP or SNV caller, make sure your called variations contain the following fields:

  • chromosome: the chromosome the variation is on (obviously needed to find the location)
  • position: the base position on the chromosome
  • genome: the version of the genome against which the SNP was called (e.g. hg18 vs. hg19)
  • canonical: what you expect to see at that position. (Invaluable for error checking!)
  • observed: what you did see at that position
  • coverage: the depth at that position (filtered or otherwise)
  • canonical_obs: how many times you saw the canonical base (key to evaluating what’s at that position)
  • variation_obs: how many times you saw the variation
  • quality: give me something to work with here – a confidence value between 0 and 1 would be ideal… but let’s pick something we can compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!
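
To make the list concrete, here’s a rough sketch of what one called variation could look like as a record in Python. The field names come straight from the list above; everything else (the VariantCall class name, the as_tsv_row helper, the tab-separated layout and the sanity checks) is my own assumption for illustration, not an existing format or anyone’s actual caller.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class VariantCall:
    """One called SNP/SNV, carrying the proposed minimum information.

    Hypothetical sketch: only the field names come from the list above.
    """
    chromosome: str      # e.g. "chr1"
    position: int        # base position on the chromosome
    genome: str          # genome build the call was made against, e.g. "hg18"
    canonical: str       # base expected at that position
    observed: str        # base actually seen at that position
    coverage: int        # depth at that position (filtered or otherwise)
    canonical_obs: int   # times the canonical base was seen
    variation_obs: int   # times the variation was seen
    quality: float       # single confidence value between 0 and 1

    def __post_init__(self):
        # Assumption: quality is one comparable confidence value in [0, 1].
        if not 0.0 <= self.quality <= 1.0:
            raise ValueError("quality must be between 0 and 1")
        # Counts can never be negative.
        if min(self.coverage, self.canonical_obs, self.variation_obs) < 0:
            raise ValueError("counts must be non-negative")
        # Assumption: both counts are taken from the same reads as coverage,
        # so together they cannot exceed it.
        if self.canonical_obs + self.variation_obs > self.coverage:
            raise ValueError("observations exceed coverage")

    def as_tsv_row(self) -> str:
        """Render the fields, in declaration order, as one tab-separated line."""
        return "\t".join(str(value) for value in astuple(self))


call = VariantCall("chr1", 12345, "hg18", "A", "G", 30, 12, 18, 0.97)
print(call.as_tsv_row())
```

The checks in __post_init__ are the kind of error checking that having the canonical base and both counts makes possible; a caller that only reports the observed base can’t be checked this way.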

Really, most of the callers out there give you most, if not all, of this – but I have yet to see a final “quality” value being given. The MAQ SNP caller (which is pretty good) asks you to look at several different fields and make up your own mind. That’s fine for a first generation, but maybe I can convince people that we can do better in the second-generation SNP callers.

Ok, now I’ve got that off my chest! Phew.

