File Formats and Genomic Data

I’ve been thinking a lot about file formats lately. I know it doesn’t sound like a particularly interesting topic, but it really can be interesting. Well, at least useful. Interesting might be a bit overstating the case, but designing a new file format isn’t a simple job – and finding the right balance between bare bones and overly verbose can be a big challenge.

Over the past few years, I’ve collected a large number of modules in my codebase for interfacing with all sorts of files, from plain text all the way up through binary alignment data (MAQ and SAM), spanning all sorts of information. It’s not like I set out to build a library of genomic file interpreters, but it happened. Consequently, I spend a lot of time looking at file formats, and recently I’ve found myself saying “I’ve been thinking about file formats a lot lately…”, which is exactly as geeky as it sounds.

To make matters worse, I’m finding myself reversing an opinion I had earlier. I have been on record cheering VCF, saying that I would love to have one file format that EVERYONE can dump their SNV information into. I still stand by this, but I’m somewhat disgruntled with the format. Well, as much as one can be disgruntled about a file format, anyways. (Or even gruntled, for that matter.)

When VCF came out, I thought it was brilliant – and it is – because it’s simple. A few mandatory fields are listed, followed by an info field into which everything and anything can be crammed. (See here for a few examples.) While I love the mandatory fields, I don’t think they’ve gone quite far enough. I’d have tossed in a few more fields in that category, such as individual read qualities, total coverage, number of times the variant was called with confidence… you get the idea. Anyhow, as far as formats go, it really isn’t that bad. The part that annoys me is the lack of enforced fields in the info. By going this route, they’ve created a format with infinite flexibility and extensibility.  For the most part, I support that.
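
That free-form info column is exactly what every consumer of a VCF ends up parsing by hand. As a minimal sketch of what that looks like (the keys DP, AF, and DB here are just illustrative reserved-key examples, not fields any particular file is guaranteed to carry):

```python
def parse_vcf_info(info):
    """Split a VCF INFO string into a dict.

    Key=value entries become string values; presence-only flags
    (entries with no '=') map to True.
    """
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        if "=" in entry:
            key, value = entry.split("=", 1)
            fields[key] = value
        else:
            fields[entry] = True  # flag fields like DB carry no value
    return fields

# parse_vcf_info("DP=14;AF=0.5;DB")
# -> {'DP': '14', 'AF': '0.5', 'DB': True}
```

Note that nothing in the format stops a producer from omitting every one of those keys, which is precisely the problem.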

Infinite flexibility, however, can also be extended to infinite abuse. I am waiting for the first time someone sends me a file with 2k characters in the info field on each line and not a single useful piece of information in them. It hasn’t happened yet, but given the plethora of stuff people are now jamming into VCFs, it’s just a matter of time.

In fact, I’ve been working with the Complete Genomics data for much of this week (second hand, since a co-worker has been processing it, and I’m not sure how much it has been processed). I’ll be thrilled to see this incredible volume of genomic data, in both cancers and normals, appear in my database once the import is complete. With the first import running when I left for the day, there’s a good chance it’ll be in pretty soon. Unfortunately, not all of the information I’d like was in it. The format is similar to VCF in spirit, but a bit different in style. All in all, it’s not bad (albeit incredibly confusing when my co-worker provided the documentation somewhat second hand and left out all the gooey bits necessary to actually interpret the file), but even here there were issues with insufficient information. The files had to be massaged a bit to get everything we needed. (Although, really, with an expected ~270 million variants called for these data sets, complaints about a lack of information need to be kept in perspective!)

I have to say, however, that of all the variant formats I’ve worked with, my favorite this week is the varFilter output from SAMtools. It’s a pain in the ass with its cryptic quality data at the end, but that is exactly what makes it the most informative – and self-checking. Consider the following “invalid” line:

1       19154656        T       A       22      22      30      3     ,.^?n   "%@

In it, the T and A signify a T->A variant. The quality markers (,.^?n) tell you what was found at that position: “,” and “.” are reference bases (on the reverse and forward strand, respectively), and the “n” indicates one indeterminate, poor-quality base with no call. (I’ll admit, I don’t actually know what the “^?” part means.) That allows you to figure out what evidence was used to make this call… which turns out to be none whatsoever. It’s a bad snp call. If that information hadn’t been there, I’d have had to blindly accept the above snp into my analysis, and there would have been no way to verify the integrity of the snp caller. (Which seems to be poor, since the example above came from an actual varFilter file, and lines like this are frequently found in other datasets as well, regardless of the snp caller used, as far as I am aware.)

In a properly done snp call, you would expect to see a quality string that looks more like “.,.,.aAAaa”, telling you that 5 reference bases (the dots and commas, matching the reference T in the example above) and 5 A’s were found, confirming the base call. Now THAT’s useful information: easily interpreted and easily parsed. That information is optional in VCF files – it might or might not be included – and is completely absent from the Complete Genomics files.
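
Tabulating that evidence mechanically is straightforward. Here is a minimal sketch of a base-string counter; it handles the read-start (“^” plus a mapping-quality character) and read-end (“$”) markers, but deliberately ignores indel notation like +2AT, which a full pileup parser would also have to consume:

```python
def count_pileup_bases(bases):
    """Count evidence in a pileup base string.

    Returns (reference_count, variant_counts) where variant_counts
    maps an uppercased base to how many reads supported it.
    Simplified sketch: indel markers (+n/-n sequences) are not handled.
    """
    ref = 0
    alt = {}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # '^' marks a read start; the next char encodes mapping quality
            continue
        if c == "$":
            i += 1  # read end marker, no base attached
            continue
        if c in ".,":
            ref += 1  # match to the reference (forward or reverse strand)
        elif c in "*nN":
            pass  # gap or no-call: contributes no usable evidence
        elif c.isalpha():
            alt[c.upper()] = alt.get(c.upper(), 0) + 1
        i += 1
    return ref, alt

# count_pileup_bases(",.^?n")      -> (2, {})        the bad call above: no variant support
# count_pileup_bases(".,.,.aAAaa") -> (5, {'A': 5})  the well-supported call
```

Two reference bases and zero variant bases for a claimed T->A call is exactly the kind of thing you can only catch because the format carries the raw evidence.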

So where am I going with this?  Simple: If you’re designing a snp calling file, make sure you include sufficient information that the data can be verified easily. I’m willing to pay a slight penalty in storage such that I can be certain that the data is correct.  This should also include the sub-point: If you’re using VCF, some of those optional fields shouldn’t be optional.
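
Enforcing that sub-point could be as simple as a gatekeeper in the import pipeline. A hypothetical sketch, where the key names DP and AD are my own illustrative choices for “depth” and “allelic depths” (in real VCFs, AD more commonly lives in the per-sample FORMAT data, so this is a policy assumption, not a spec rule):

```python
# Hypothetical policy: a record without these evidence keys gets rejected.
REQUIRED = ("DP", "AD")

def has_required_evidence(info_field):
    """Return True if every required key appears in a raw VCF INFO string."""
    keys = {entry.split("=", 1)[0] for entry in info_field.split(";") if entry}
    return all(k in keys for k in REQUIRED)

# has_required_evidence("DP=12;AD=7,5;AF=0.42")  -> True
# has_required_evidence("SOMEANNOTATION=foo")    -> False
```

A few lines like this at import time buys back most of the self-checking that the pileup-style formats give you for free.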

5 thoughts on “File Formats and Genomic Data”

    • That’s enlightening… and confusing at the same time. The total coverage for that line is 3, which seems to indicate that the n would be the read starting at that base with mapping quality 30, but that hardly seems to make sense either. Clearly, I have some more reading to do on this subject.

      • I believe (but I’m not sure) that it refers to the preceding base. So the whole line would be, “reverse strand reference base, forward strand reference base (start of read with map qual 30), reverse strand indeterminate base”.

  1. I’ve also been playing with the Complete Genomics data lately. My needs are quite different, but I’m very happy with what I’ve found.

    They ship a compressed bundle with various slices inside. The outermost file is an uncompressed .tar; inside are bz2-compressed files. There is no manifest, but the internal file names follow a convention.

    Just as VCF seems to be a relative of GFF, the Complete Genomics files follow similar conventions, with ## indicating metadata and enough header fields to verify the columns are what you expect. My interest is in rs#s, the dbSNP identifiers. They provide a masterVarBeta file which reports all variants from the reference. For the SNPs/small indels where a dbSNP rs# can be provided, they do so, and they manage this assignment better than any pipeline I’ve yet seen. Most valuably, they’ve also run a pipeline to identify every possible dbSNP rs#. This file currently provides the most data of any of the platforms known to SNPedia. That exceeds any known microarray. Comparable structural variants are found in other files, and I hope they may someday include dbVar identifiers. More information at
    http://www.snpedia.com/index.php/Complete_Genomics
