Variant Call Format Redundancy

I’ve been working with the Variant Call Format (VCF) files released for the latest version of dbSNP, which contain SNP calls from the 1000 Genomes Project. I have to say that it’s nice to have it all in one file; that really does make my life easier. (Links to them and some discussion on them can be found here)

However, I’m a little surprised at the execution. VCF is a text-based format, which is supposed to make it easy to add information to a record by standardising the first few columns (chromosome, position, etc.) and then providing one column for flags in text format. That seems just fine to me. However, the VCF files for dbSNP are somewhat… funky.

Rather than sticking with text flags alone, they’ve also incorporated a binary flag field. That somewhat defeats the point of using text flags in the first place: you’re either making something easy for your users to parse, or you’re not. Binary flag fields don’t really count in that category. (There are instructions for deciphering them, which is a big plus, but not something your average Perl hacker can deal with.) This binary field, of course, is mixed in with a large number of text flags… and that’s where the weirdness begins.

Example line:

1 10327 rs112750067 T C . . dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP

That is to say, on chromosome 1, at position 10327, there is a SNP which has been given the name “rs112750067”, which changes a reference T into a C. After the two dots, you’ll see the text-flag field I was talking about, and the VP={blah} entry is the bit flag field.
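To make the layout concrete, here is a minimal Python sketch that splits the example line into its columns and then separates the flag column into key=value entries and bare flags. The function name parse_info and the variable names are my own, not part of any VCF library, and the sketch assumes the eight columns shown above (real VCF files are tab-delimited; splitting on any whitespace handles both that and the line as printed here).

```python
def parse_info(info):
    """Split a VCF flag (INFO) string into key=value entries and bare flags.

    Hypothetical helper for illustration; not taken from any VCF library.
    """
    values = {}
    flags = []
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate a trailing semicolon
        key, sep, value = entry.partition("=")
        if sep:
            values[key] = value  # e.g. VC=SNP
        else:
            flags.append(key)    # e.g. R5, ASP
    return values, flags


line = ("1 10327 rs112750067 T C . . "
        "dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP")

# Assumes exactly the eight columns shown in the example line.
chrom, pos, rsid, ref, alt, qual, filt, info = line.split()
values, flags = parse_info(info)

print(chrom, pos, rsid, ref, alt)
print(values)
print(flags)
```

Everything the text flags carry ends up in values and flags; the VP entry is just another string at this point.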

Stranger than just forcing the binary information into a semicolon-delimited text field, the binary flags are actually redundant with the text flags. That is to say, they contain no information beyond what the text flags already hold! In the example above, the VP=blah entry actually contains the information given by the WGT, VC, R5 and ASP flags.
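For anyone who does want to look inside the VP value, the mechanical part is simple. The sketch below treats the value as a hex string, converts it to bytes, and reports which bits are set in each non-zero byte. It deliberately does not attach meanings to the bits: which bit corresponds to WGT, VC, R5 or ASP is defined by the dbSNP instructions mentioned earlier, and I’m not reproducing that table here. The function name set_bits and the choice to number bits from the least significant end are my own assumptions.

```python
def set_bits(hex_field):
    """Return {byte_index: [bit positions set]} for each non-zero byte.

    Bit 0 is the least significant bit of a byte (an assumption of this
    sketch; check the dbSNP documentation for the numbering it uses).
    """
    raw = bytes.fromhex(hex_field)
    return {
        index: [bit for bit in range(8) if (byte >> bit) & 1]
        for index, byte in enumerate(raw)
        if byte
    }


vp = "050000020005000000000100"
print(len(bytes.fromhex(vp)))  # number of bytes in the field
print(set_bits(vp))
```

For the example line this gives 12 bytes, with bits set only in bytes 0, 3, 5 and 10: {0: [0, 2], 3: [1], 5: [0, 2], 10: [0]}. Mapping those positions back to named flags is the step that needs the dbSNP documentation.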

I would be very interested to know what exactly is going on here. There are a couple of logical possibilities: maybe this is just an intermediate format before they break the VCF format and switch to an all-binary flag set? Or maybe they haven’t realised how redundant it is to duplicate all of the information in the flags area by embedding it in a binary format within said flags field? I’m not really sure… but I’d love to know.

For the moment, I’m going to ignore the binary data and deal only with the text flags… I can’t see any reason to do otherwise.

7 thoughts on “Variant Call Format Redundancy”

  1. I can’t comment on why the redundancy, but anyone who calls themselves a “Perl hacker” & cannot figure out how to decode a packed set of binary flags in less than an hour should stop calling themselves a “Perl hacker” — it’s an important trick to learn and conceptually quite simple.

    A number of the 1000 Genomes informatics folks regularly contribute to SEQanswers.com, so you might post your questions there.

    • I agree; binary flags aren’t hard to do, though I’ll say that it took me longer than an hour to figure out how to decode the maq binary files. (Endian-ness bit me in the end!) A single field in bit format shouldn’t stop anyone for long, if they know what they’re doing.

      Though, I don’t actually have any questions other than “why did they do it?”, SeqAnswers is an awesome resource and I also recommend it!

