On the subject of indels..

Ah, a blog post. It’s been a while, as life has been busy lately. My daughter turned 3 last week, and I’ve moved halfway across the world and back, but I have slowly found myself with things to say again.

And the one that needs saying first is that, as a community, NGS people have done a terrible job of standardizing how we deal with indels. SNVs aren’t bad – we only have half a dozen ways to mess them up – but indels are just something else.

After a year of working hard on SNVs, indels are back on the menu, and I’ve been beating my head on the wall trying to solve them all in one shot. Needless to say, it’s not going to be that easy, but there are a few things that are really worth pointing out:

If you can represent something in the genome two different ways, you should pick the easiest, right? Wrong: there are people who don’t agree with this, and I can give you an example. Let’s say you have a reference sequence GAAAC, and you delete two As. Personally, I’d pick the left-justified version and say GAA -> G. That’s pretty clear: you’ve removed two As after the G. Using the single redundant G makes it left justified, anchored (or rooted), and intuitively obvious. However, other people might disagree.

For instance, if you use a more old-school style that pre-dates next-gen sequencing, you’d probably right justify it: AAC -> C… or take it one step further and drop the C, giving you AA -> -. Yes, that’s a dash. Between the left and right justification, there’s not much to say: it’s either one standard or the other. Right justification is used by a lot of databases, such as ClinVar, where many (most? all?) of the known deletions are pulled from clinical papers, which adopted that as the standard.
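To make the two conventions concrete, here is a minimal Python sketch that slides the same two-base deletion to either end of the A run in GAAAC and anchors it. The function names, the 0-based coordinates and the (position, ref, alt) return shape are my own choices for illustration, not something taken from any particular tool.

```python
def shift_deletion(ref_seq, start, length, direction):
    """Slide a deletion through a repeat as far as it will go.

    `start` is the 0-based index of the first deleted base. Shifting by one
    base gives an equivalent deletion whenever the base that enters the
    window equals the base that leaves it.
    """
    if direction == "left":
        while start > 0 and ref_seq[start - 1] == ref_seq[start + length - 1]:
            start -= 1
    else:
        while (start + length < len(ref_seq)
               and ref_seq[start] == ref_seq[start + length]):
            start += 1
    return start


def left_anchored(ref_seq, start, length):
    """Left-justify, then anchor on the single base before the deletion."""
    start = shift_deletion(ref_seq, start, length, "left")
    if start == 0:
        raise ValueError("no base to the left to anchor on")
    return start - 1, ref_seq[start - 1:start + length], ref_seq[start - 1]


def right_anchored(ref_seq, start, length):
    """Right-justify, then anchor on the single base after the deletion."""
    start = shift_deletion(ref_seq, start, length, "right")
    end = start + length
    if end == len(ref_seq):
        raise ValueError("no base to the right to anchor on")
    return start, ref_seq[start:end + 1], ref_seq[end]


# The same deletion of two As, described two ways:
print(left_anchored("GAAAC", 2, 2))   # (0, 'GAA', 'G')
print(right_anchored("GAAAC", 1, 2))  # (2, 'AAC', 'C')
```

Both calls describe the same event; only the reported position and the anchoring base differ, which is exactly why two databases can disagree about the “same” deletion.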

However, that’s far from the worst you can do.  You can also add one step to the confusion and pad your variant.  For instance, you could also represent the deletion of the two As with GAAAC->GAC.  Now, you’ll see it’s anchored on the left and the right, which is not necessarily a bad thing, but it is redundant.  You don’t need both for an unambiguous representation of the indel.  This is a non-reduced representation of the variant.  You can make them more confusing, if you try, though.  There are no bounds to the padding you can add.  Want a simple SNV to look more complicated?  How about: ACGTACTCGGCTAG->AGGTACTCGGCTAG. I would probably just shift the position over by one to the right and call it a C->G variant, and drop the padding.
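Stripping the padding is mechanical. The sketch below is one common way to do it, assuming a 1-based position for the first base of the reference allele: trim the shared bases from the right first, then from the left, always leaving at least one base in each allele. The name reduce_variant and the position values are made up for the example.

```python
def reduce_variant(pos, ref, alt):
    """Trim redundant padding from a ref/alt pair.

    `pos` is the 1-based position of the first base of `ref`. Shared bases
    are trimmed from the right first and then from the left, and each
    allele always keeps at least one base.
    """
    # Drop shared trailing bases; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases; each one moves the position right by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The doubly anchored deletion collapses to the left-anchored form:
print(reduce_variant(1, "GAAAC", "GAC"))                    # (1, 'GAA', 'G')
# The padded SNV collapses to a single base, one position to the right:
print(reduce_variant(1, "ACGTACTCGGCTAG", "AGGTACTCGGCTAG"))  # (2, 'C', 'G')
```

Note that this only removes padding that is already in the record. Left-justifying an indel that was called further to the right in a repeat also needs the surrounding reference sequence.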

Why do people not use the reduced representation, though?  Because padding is more convenient for them.   Here’s an example I got from ExAC:  GAAA -> G,GA,GAA.  See what they’ve done there?  It’s actually three variants at the same position that I would represent with three different reference sequences (GA->G, GAA->G and GAAA->G), but by padding the variants, they can place them all on one line.  If you don’t know that they’ve done this, it’s a bit surprising.  Indeed, I had to write to them to ask about it, because it wasn’t intuitively obvious to me why they show reduced variants on their web page, but distribute a VCF file with non-reduced variants.  There is a blog post about how to reduce variants, but as of last week, it wasn’t referenced in the readme files of their FTP site.
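Going from the one-line padded record back to the three reduced variants can be sketched in a few lines of Python. This is only an illustration of the idea, not ExAC’s own code: the function names and the position 100 are hypothetical, and a real VCF line carries genotype and annotation fields that would also need splitting.

```python
def trim(pos, ref, alt):
    """Remove shared trailing, then leading, bases (keeping one base each)."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_multiallelic(pos, ref, alts):
    """Turn one padded multi-allelic record into one reduced record per ALT."""
    return [trim(pos, ref, alt) for alt in alts.split(",")]


# The padded record from the text, at a made-up position:
for record in split_multiallelic(100, "GAAA", "G,GA,GAA"):
    print(record)
# (100, 'GAAA', 'G')
# (100, 'GAA', 'G')
# (100, 'GA', 'G')
```

The three outputs share a position and an alternate allele but have three different reference alleles, which is why they cannot sit on a single VCF line without the padding.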

Regardless, ExAC isn’t the only one to use non-reduced representations – dbSNP does it as well, and I haven’t even begun to look at the myriad of other data sources we depend on for indel interpretation. It was rightly pointed out to me that non-reduced representations are not forbidden in the VCF 4.2 standard.  It’s definitely not forbidden, but then again, as a community, taking the position that anything not forbidden is allowed is a dangerous path for those who would like to see a unified standard.  We’re just not going to converge on the same page, if we keep stuff like this going.

Alas, indels are a difficult minefield.  They are hard to call, hard to represent and hard to interpret.  We have a long path ahead of us to straighten it all out, but I don’t doubt we’ll get there.  This is just one more step we’ll have to take, in order to make sure we start getting these things right.


2 thoughts on “On the subject of indels..”

  1. Great blog post! There are indeed many issues with representing indels which become an especially big problem in datasets like ExAC where we see a growing number of multi-allelic variants. I agree with many of your points here: left-aligning (or at least standardizing, but left-aligning just makes a lot of sense from a programming perspective) and anchoring are great standards we should adopt, and we as a community should continue this discussion to adopt the best possible standards.

    As for minimal representation, there’s a balance here between minimally represented VCFs and VCFs that facilitate individual-level queries. One could run “vt normalize” (http://genome.sph.umich.edu/wiki/Vt) on the current ExAC sites VCF to obtain a reduced representation VCF (here, we use scripts that convert into tabular formats for analysis in R: https://github.com/konradjk/loftee/blob/master/src/tableize_vcf.py). The problem, then, is that this representation seriously breaks genotypes, as a particular individual’s genotype cannot be gleaned from a single VCF line: it would be split into two lines, not to mention with a stranger “./1” representation. Additionally, the current format with padding is the default format for VCF (that is returned by GATK and many other tools). By no means is the ExAC VCF the best solution, but there are advantages and disadvantages to each. We indeed opted for minimal representation for the browser to facilitate comparisons.

    As for ExAC, we understand we could have explained this better and we’ve added an entry to the FAQ on the browser for this topic. We will add a note to the READMEs on the FTP site for this issue as well.

    • Hey Konrad, Thanks for the reply – and for adding this to the readme. I’m sure others will find that information to be useful!

      Anyhow, I absolutely agree that there are pros and cons to each method of representation, but my issue isn’t to pick specifically on anyone (least of all ExAC, who has released a ton of useful data!), but rather to emphasize that we all need to be better at this – and better at communicating information in a consistent way. When it comes to standardizing file formats, though, it can be like herding cats.

      At the end of the day, I understand why ExAC chose to do a non-reduced format, but your comment that a different format would have caused confusion about the genotype of an individual doesn’t make a lot of sense to me. If this were a genome, I would certainly have agreed, but in the context of a data dump, it’s not so clear what purpose it serves to make a decision that preserves genotype data.

      Fundamentally, however, the real issue is that we’re all trying to work around a shortcoming of the VCF standard. What should have been represented as three references going to a single alt (GA, GAA, GAAA -> G), had to become one reference going to three different alternate alleles to represent the same information (GAAA -> G, GA, GAA), which forced you to select the non-reduced representation. However, there’s no real provision in the VCF standard to include multiple-reference sequences, so ExAC had to make do with the next best options – and hence the file format that was selected. I certainly understand why it was done.

      Still, my post shouldn’t be considered a criticism, but more of a frustration that the whole field faces in disseminating information – It’s just one of the many challenges we all face together when trying to deal with creating a coherent perspective on all of the data we find ourselves wading through.

      Regardless, I really appreciate your comment, Konrad. Thanks for taking the time to respond.
