File Formats and Genomic Data

I’ve been thinking a lot about file formats lately. I know it doesn’t sound like a particularly interesting topic, but it really can be interesting. Well, at least useful. Interesting might be a bit overstating the case, but designing a new file format isn’t a simple job – and finding the right balance between bare bones and overly verbose can be a big challenge.

Over the past few years, I’ve collected a large number of interfaces in my codebase for reading all sorts of files, from plain text all the way up through alignment data (MAQ and SAM), spanning all sorts of information. It’s not like I set out to build a library of genomic file interpreters, but it happened. Consequently, I spend a lot of time looking at file formats, and recently I’ve found myself saying “I’ve been thinking about file formats a lot lately…”, which is exactly as geeky as it sounds.

To make matters worse, I’m finding myself reversing an opinion I had earlier. I have been on record cheering VCF, saying that I would love to have one file format that EVERYONE can dump their SNV information into. I still stand by this, but I’m somewhat disgruntled with the format. Well, as much as one can be disgruntled about a file format, anyways. (Or even gruntled, for that matter.)

When VCF came out, I thought it was brilliant – and it is – because it’s simple. A few mandatory fields are listed, followed by an info field into which everything and anything can be crammed. (See here for a few examples.) While I love the mandatory fields, I don’t think they’ve gone quite far enough. I’d have tossed in a few more fields in that category, such as individual read qualities, total coverage, number of times the variant was called with confidence… you get the idea. Anyhow, as far as formats go, it really isn’t that bad. The part that annoys me is the lack of enforced fields in the info column. By going this route, they’ve created a format with infinite flexibility and extensibility. For the most part, I support that.
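To make that concrete, here’s a rough sketch (in Python) of how a VCF data line breaks apart: a handful of fixed, tab-separated columns, and then the semicolon-delimited free-for-all that is the info field. The record and the DP/AF keys below are invented for illustration; they’re common keys, but nothing in the format forces anyone to include them.

# Rough sketch: splitting a VCF data line into its fixed columns plus the
# free-form info field. The record used below is invented for illustration.
def parse_vcf_line(line):
    chrom, pos, vid, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    # The info field is a semicolon-separated grab-bag of KEY=VALUE pairs
    # (or bare flags): this is where the infinite flexibility lives.
    info_dict = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "qual": float(qual), "info": info_dict}

# Hypothetical record: DP (depth) and AF (allele frequency) are common info
# keys, but they are entirely optional.
record = parse_vcf_line("1\t19154656\trs0\tT\tA\t22\tPASS\tDP=8;AF=0.5")
print(record["info"])    # {'DP': '8', 'AF': '0.5'}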

Infinite flexibility, however, can also be extended to infinite abuse. I’m waiting for the first time someone sends me a file with 2k characters in the info field on each line, and not a single useful piece of information in it. It hasn’t happened yet, but given the plethora of stuff people are now jamming into VCFs, it’s just a matter of time.

In fact, I’ve been working with the Complete Genomics data (second hand, since a co-worker has been processing it, and I’m not sure how much processing it has been through) for much of this week. I’ll be thrilled to see this incredible volume of genomic data, in both cancers and normals, appear in my database once the import is complete. With the first import running when I left for the day, there’s a good chance it’ll be in pretty soon. Unfortunately, not all of the information I’d like was in it. The format is similar to VCF in spirit, but a bit different in style. All in all, it’s not bad (albeit incredibly confusing, since my co-worker passed the documentation along somewhat second hand and left out all the gooey bits necessary to actually interpret the file), but even here there were issues with insufficient information, and the files had to be massaged a bit to get everything we needed. (Although, really, with an expected ~270 million variants called for these data sets, complaining about a lack of information really needs to be kept in perspective!)

I have to say, however, that of all the variant formats, my favorite this week is the varfilter output from Samtools. It’s a pain in the ass with its cryptic quality data at the end, but that’s exactly what makes it easily the most informative, and the most self-checking. Consider the following “invalid” line:

1       19154656        T       A       22      22      30      3     ,.^?n   "%@

In it, the T and A signify a T->A variant. The read-base markers (,.^?n) tell you what was actually found at that position: “.” and “,” are bases matching the reference (reported for the forward and reverse strands), and the “n” tells you there was one indeterminate, poor-quality base with no call. (I’ll admit, I don’t actually know what the “^?” part means.) That lets you figure out what evidence was used to make this call… which turns out to be none whatsoever. It’s a bad snp call. If that information hadn’t been there, I’d have had to blindly accept the above snp into my analysis, with no way to verify the integrity of the snp caller. (Which seems to be poor, since the example above came from an actual varfilter file, and lines like this turn up frequently in other datasets as well, regardless of the snp caller used, as far as I can tell.)

In a properly done snp call, you would expect to see a read-base string that looks more like “.,.,.aAAaa”, which would tell you that 5 reference bases (the dots and commas, describing the reference T’s as in the example above) and 5 A’s were found, confirming the call. Now THAT’s useful information: easily interpreted, and easily parsed. That information is completely absent from the Complete Genomics files, and merely optional in VCF, so it might or might not be included.
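For what it’s worth, that interpretation is simple enough to automate. Below is a small Python sketch that tallies a read-base string the same way I do by eye: dots and commas count as reference support, letters matching the alternate allele count as variant support, and “n” counts as a no-call. It only approximates the full pileup syntax (it skips the “^” marker along with the character right after it, and ignores indel notation entirely), but it’s enough to separate the two examples above.

# Simplified tally of a pileup read-base string: just enough to see whether
# the reference or the variant actually has read support. Indel notation
# and other corner cases of the format are deliberately ignored.
def tally_bases(base_string, alt):
    ref_support = alt_support = no_calls = 0
    i = 0
    while i < len(base_string):
        c = base_string[i]
        if c == "^":              # skip "^" and the character that follows it
            i += 2
            continue
        if c == "$":              # skip read-end markers
            i += 1
            continue
        if c in ".,":             # base matching the reference
            ref_support += 1
        elif c.upper() == "N":    # indeterminate base, no call
            no_calls += 1
        elif c.upper() == alt.upper():
            alt_support += 1      # a read actually showing the variant allele
        i += 1
    return ref_support, alt_support, no_calls

# The "invalid" line above: two reference bases, one no-call, and zero A's.
print(tally_bases(",.^?n", "A"))         # (2, 0, 1)
# The properly supported call: five reference bases and five A's.
print(tally_bases(".,.,.aAAaa", "A"))    # (5, 5, 0)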

So where am I going with this? Simple: if you’re designing a snp-calling format, make sure you include enough information that the calls can be verified easily. I’m willing to pay a small penalty in storage to be certain that the data is correct. This also carries a sub-point: if you’re using VCF, some of those optional fields shouldn’t be optional.
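In that spirit, here’s the kind of check I’d like to be able to run on every record, sketched in Python. The “required” keys below (DP for total depth, DP4 for per-strand reference/alternate counts, as samtools writes them) are my own wish list rather than anything the VCF spec enforces, and the length cut-off is arbitrary.

# Sketch of a stricter-than-the-spec sanity check on a VCF info field:
# treat a couple of evidence fields as mandatory and flag records that
# don't carry them, or that carry far too much of the wrong thing.
REQUIRED_KEYS = ("DP", "DP4")    # my hypothetical "must have" evidence fields
MAX_INFO_LENGTH = 2000           # guard against the 2k-character junk info field

def check_info(info_field):
    problems = []
    if len(info_field) > MAX_INFO_LENGTH:
        problems.append("info field is suspiciously long")
    keys = set(entry.partition("=")[0] for entry in info_field.split(";"))
    for key in REQUIRED_KEYS:
        if key not in keys:
            problems.append("missing evidence field: " + key)
    return problems

print(check_info("DP=8;DP4=3,2,2,1;AF=0.5"))   # prints [] because the evidence is there
print(check_info("SOMEANNOTATION=foo"))        # flags the missing DP and DP4

Nothing fancy, but if tools refused to emit or accept records without this sort of evidence, the verification step above would come for free.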