On the subject of indels..

Ah, a blog post. It’s been a while, as life has been busy lately. My daughter turned 3 last week, and I’ve moved half way across the world and back, but I have slowly found myself with things to say again.

And, the one that needs saying first is that, as a community, NGS people have done a terrible job on standardizing how we deal with Indels. SNVs aren’t bad – we only have half a dozen ways to mess them up – but indels are just something else.

After a year of working hard on SNVs, indels have fallen back on the menu, and I’ve been beating my head on the wall trying to solve it all in one shot. Needless to say, it’s not going to be that easy, but there are a few things that are really worth pointing out:

If you can represent something in the genome two different ways, you should pick the easiest, right? Wrong, there are people who don’t agree with this, and I can give you an example. Lets say you have a reference sequence GAAAC, and you delete two As. Personally, I’d pick the left justified version and say GAA -> G. That’s pretty clear: you’ve removed to A’s after the G. Using the single redundant G makes it left justified , and anchored or rooted, and intuitively obvious. However, other people might disagree.

For instance, if you use a more old school style, that pre-dates Next-gen sequencing, you’d probably right justify it: AAC ->C… or take it one step further and drop the C, giving you AA->-. Yes, that’s a dash. Between the left and right justification, there’s not much to say: it’s either one standard or the other. Right justification is used by a lot of databases, such as clinvar, where many (most? all?) of the known deletions are pulled from clinical papers, who adopted that as the standard.

However, that’s far from the worst you can do.  You can also add one step to the confusion and pad your variant.  For instance, you could also represent the deletion of the two As with GAAAC->GAC.  Now, you’ll see it’s anchored on the left and the right, which is not necessarily a bad thing, but it is redundant.  You don’t need both for an unambiguous representation of the indel.  This is a non-reduced representation of the variant.  You can make them more confusing, if you try, though.  There are no bounds to the padding you can add.  Want a simple SNV to look more complicated?  How about: ACGTACTCGGCTAG->AGGTACTCGGCTAG. I would probably just shift the position over by one to the right and call it a C->G variant, and drop the padding.

Why do people not use reduced representation padding, though?  Because it’s more convenient for them.   Here’s an example I got from ExAC:  GAAA -> G,GA,GAA.  See what they’ve done there?  It’s actually three variants at the same position that I would represent with three different reference sequences, but by padding the variants, they can place them all on one line.  GA->G, GAA->G and GAAA->G.  If you don’t know that they’ve done this, it’s a bit surprising.  Indeed, I had to write to them to ask about it, because it wasn’t intuitively obvious to me why they show reduced variants on their web page, but distribute a VCF file with non-reduced variants.  There is a blog post about how to reduce variants, but as of last week, it wasn’t referenced in the readme files of their FTP site.

Regardless, ExAC isn’t the only one to use non-reduced representations – dbSNP does it as well, and I haven’t even begun to look at the myriad of other data sources we depend on for indel interpretation. It was rightly pointed out to me that non-reduced representations are not forbidden in the VCF 4.2 standard.  It’s definitely not forbidden, but then again, as a community, taking the position that anything not forbidden is allowed is a dangerous path for those who would like to see a unified standard.  We’re just not going to converge on the same page, if we keep stuff like this going.

Alas, Indels are a difficult minefield.  They are hard to call, hard to represent and hard to interpret.  We have a long path ahead of us to straighten it all out, but I don’t doubt we’ll get there.  This is just one more step we’ll have to take, in order to make sure we start getting these things right.