Variant Call Format Redundancy

I’ve been working with the Variant Call Format (VCF) files released for the latest version of dbSNP, which contain SNP calls from the 1000 Genomes Project. I have to say that it’s nice to have it all in one file. That makes my life easier, really. (Links to them, and some discussion of them, can be found here.)

However, I’m a little surprised at the execution. VCF is a text-based format, which is supposed to make it easy to add information to a record: the first few columns are standardised (chromosome, position, etc.), and then one column is provided for flags in text format. That seems just fine to me. The VCF files for dbSNP, though, are somewhat… funky.

Instead of sticking with text flags, they’ve also incorporated a binary flag field. That somewhat defeats the point of using text flags in the first place – you’re either making something easy for your users to parse, or you’re not, and binary flag fields don’t fall into the easy category. (There are instructions for deciphering them, which is a big plus, but they’re not something your average Perl hacker can deal with.) This binary field, of course, is mixed in with a large number of text flags… and that’s where the weirdness begins.

Example line:

1 10327 rs112750067 T C . . dbSNPBuildID=132;VP=050000020005000000000100;WGT=1;VC=SNP;R5;ASP

That is to say, on chromosome 1, at position 10327, there is a SNP which has been given the name “rs112750067”, and which changes a reference T into a C. After the two dots, you’ll see the text-flag field I was talking about, and VP={blah} is the bit-flag field.

Stranger than just forcing the binary information into a semicolon-delimited text field, the binary-based flags are actually redundant with the text flags – that is to say, they contain no information beyond what the text flags already hold! In the example above, the VP field actually encodes the information given by the WGT, VC, R5 and ASP flags.

I would be very interested to know what exactly is going on here. There are a few logical possibilities: maybe this is just an intermediate format before they break with VCF and switch to an all-binary flag set? Or maybe they haven’t realised how redundant it is to duplicate all of the information in the flags field by re-embedding it in a binary format within that same field? I’m not really sure… but I’d love to know.

For the moment, I’m just going to ignore the binary data and deal with the text flags… I can’t see any reason to do otherwise.
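If you want to do the same, here’s a minimal sketch of that approach in Python. The file name (dbsnp.vcf) is a hypothetical stand-in, and it assumes the INFO column is the standard semicolon-delimited mix of key=value pairs and bare flags, which is what the example line above shows:

# Minimal sketch: pull the text flags out of each dbSNP VCF record, skipping VP.

def parse_info(info_field):
    """Split the INFO column into a dict of text flags, dropping the VP bitfield."""
    flags = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == "VP":                          # ignore the redundant binary-encoded field
            continue
        flags[key] = value if value else True    # bare flags like ASP just become True
    return flags

with open("dbsnp.vcf") as vcf:                   # hypothetical file name
    for line in vcf:
        if line.startswith("#"):                 # skip the header lines
            continue
        fields = line.split()                    # the spec says tab-delimited; split() tolerates either
        chrom, pos, rsid, ref, alt = fields[0:5]
        flags = parse_info(fields[7])            # column 8 is INFO
        print(chrom, pos, rsid, ref, alt, flags)

Run on the example line above, that prints the dbSNPBuildID, WGT and VC values along with R5 and ASP as simple presence flags, without ever touching the VP bitfield.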

Scalability

Yes, I am one of the luckiest grad students around. One of the things that very few bioinformaticians get to work with during their PhD is scalability… and I don’t just mean moving from their own project to a larger dataset. I mean industrial-scale scalability.

Interestingly enough, I’m now working on my second “scalable” project. My first one, FindPeaks, was reasonably scalable – I’ve yet to find a dataset that it couldn’t handle, and that includes some pretty monstrous Illumina data sets. We’ve even forced through individual peaks with several million reads, which is not a bad test of how scalable the app is. However, even that pales in comparison to my current project.

This second project is a database rather than an application, which adds yet another level of scalability. We’ve gone from single datasets, to tens of datasets, to hundreds of datasets, and now we’re storing the condensed information from about 1,700 libraries of next-generation sequencing. That’s no small feat!

Of course, that’s not where it ends. In the next year, I can see that easily doubling or more, and heading rapidly towards the 10,000-library level. What’s interesting about this is that you start to tax the hardware pretty hard, and not evenly. A database of this size is far beyond what most grad students are likely to encounter in any project, so it’s pure gain for me. However, even then, the database won’t stop growing. At the rate that sequencing has grown, there’s likely to be a lot more sequencing done next year than this year, and so forth. 100,000 genomes sequenced is not really out of the question for a large sequencing centre within the lifespan of this database.

Just imagining where it’s going to go, you can think about how many SNVs you’ll find in 100,000 genomes. (A rough estimate is somewhere around 1,000,000 per dataset x 100,000 genomes = 100 billion records.) Somehow, that number is pretty daunting for any database. There will undoubtedly have to be purges of low-quality information, or further division of the tables, at some point.

Regardless of where the end point is, when you contemplate data sets this large, you have to start questioning everything. Was the database designed well?  Indexes, clustering, triggers? Did you pick the right database? There’s no end of places you can look to improve performance.  However, it all comes down to two things: experience and money.

Experience is a wonderful thing: it gives you a framework to work from. The biggest database I’d worked with before this was in my days working for the University of Waterloo’s Student Information Systems Project, where I built a reporting database that let staff instantly pull up records for any of the thousands of students. No matter how many students the university can handle, though, that can’t compete with the number of variations across 100,000 genomes. Of course, the university was willing to throw some money (and a DB admin, and myself) at the problem, so we came up with solutions.

Experience is not cheap, though, particularly when you have to bring it in. Resourcefulness, however, is cheap, so I’ve recently been turning to the people who work on Postgres. They have a great channel on IRC (Freenode – #postgres), where you can talk with experts on the subject. They helped me debug a trigger, clued me in to clustering, and pointed me at a set of Linux tools I’d never seen before. They became my proxy for experience, and helped me bootstrap myself to where I’m comfortable with how the database works. That takes care of experience.
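To give a sense of what that clustering advice looks like in practice, here’s a minimal sketch in Python using psycopg2. The table name (snv), the index and column names, and the connection string are all hypothetical stand-ins, not our actual schema:

# Minimal sketch: index a (hypothetical) SNV table on the columns the queries
# hit hardest, then CLUSTER the table on that index so related rows end up
# physically adjacent on disk and range scans touch fewer pages.
import psycopg2

conn = psycopg2.connect(dbname="snv_db")    # hypothetical database name
conn.autocommit = True                      # run each statement in its own transaction
cur = conn.cursor()

# Assumed columns: one row per SNV call, keyed by library, chromosome and position.
cur.execute("""
    CREATE INDEX snv_lib_chrom_pos_idx
        ON snv (library_id, chromosome, position)
""")

# Rewrite the table in index order; Postgres holds an exclusive lock while this runs.
cur.execute("CLUSTER snv USING snv_lib_chrom_pos_idx")

cur.close()
conn.close()

One thing the Postgres folks will point out is that CLUSTER is a one-time reordering (it isn’t maintained as new rows arrive), so on a table growing this quickly you’d have to re-run it periodically.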

So, what’s left is money. And that’s where the commitment to the project comes in. So far, I’ve been fortunate to get some great hardware, most of which comes from cast-offs from other projects but is still pretty high quality. All I can say at this point is that I’m glad I work where I do, because I think I’m going to be able to rack up some pretty cool experiences with “large datasets” and big hardware, which was exactly what I put on my MSFHR scholarship application as my educational goal. The specific header was “Develop new competencies in mining large databases”.

So there you have it – I get to be an incredibly lucky grad student, with the resources of an entire genome science centre behind me and thousands of genomes to analyse… and I’ve even managed to fit in the goals that I proposed to my funding source.

The next couple of months will be a lot of fun, as I start to wrap this project up.

For those of you in Vancouver who’d like to know more, I’ll be presenting some of this work at VanBUG tomorrow as the student speaker. After that, I get to start making sense of the data… now THAT should be fun. If my luck holds, it should make for a few good papers and chapters for my thesis.