Thoughts on Andrew G. Clark’s Talk and Cancer Genomics

Last night, I hung around late into the evening to hear Dr. Andrew G. Clark give a talk focusing on how most of the variation we see in the modern human genome consists of rare variants that haven’t had a chance to equilibrate into the larger population. This enormous expansion of rare variants is courtesy of the human population explosion since the dawn of the agricultural age, and specifically in the past 2,000 years, with the dawn of modern science and education.

I think the talk was very well done and managed to hit a lot of points that struck home for me. In particular, my own collected database of human variations in cancers and normals has shown me much of the same information that Dr. Clark illustrated using 1000 Genomes data, as well as information from his 2010 paper on deep re-sequencing.

However interesting the talk was, one particular piece didn’t click until after it was over. During a conversation before the talk, I described my work to Dr. Clark and received a reaction I wasn’t expecting. Paraphrased, this is how the conversation went:

Me: “I’ve assembled a very large database, where all of the cancers and normals that we sequence here at the Genome Sciences Centre are stored, so that we can investigate the frequency of variations in cancers to identify mutations of interest.”

Dr. Clark: “Oh, so it’s the same as a HapMap project?”

Me: “Yeah, I guess so…”

What I didn’t understand at the time was that what Dr. Clark was really asking was: “So, you’re just cataloging rare variations, which are more or less meaningless?” That is exactly what HapMap projects are: nothing more than large surveys of human variation across genomes. While they can form the basis of GWAS studies, the huge number of rare variants in the modern human population means that many of those GWAS studies are doomed to fail. There will not be a large convergence of variations causing a given disease, but rather an extreme number of rare variations with similar outcomes.

However, I think the problem was that I handled the question incorrectly.  My answer should have touched on the following point:

“In most diseases, we’re stuck using lineages to look for points of interest (variations) passed on from parent to child, and the large number of rare variants in the human population makes this incredibly difficult, since each child carries a significant number of variations that neither parent passed on to them. In cancer, however, we have the unique ability to compare diseased cancer cells with a matched normal from the same patient, which allows us to effectively mask all of the rare variants that are not contributing to the cancer. So yes, the database can act like a large HapMap database if you’re interested in studying something other than cancer, but the matched-normal pairing available to cancer studies means we’re not confined to using it that way: it gives us incredibly detailed and coherent information about the drivers and passengers involved in oncogenesis, without the same flood of rare variants interfering in the interpretation of the genome.”
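To make the masking idea concrete, here’s a minimal sketch of the kind of filtering a matched normal enables. The file names and the bare (chrom, pos, ref, alt) representation are hypothetical, and a real somatic caller would also weigh coverage, base quality and allele frequency, but at its heart the trick is just a set difference:

# Minimal sketch: masking germline/rare variants using a matched normal.
# The input layout (tab-delimited chrom, pos, ref, alt) and file names are
# hypothetical; real pipelines also consider coverage, base quality and
# allele frequency before declaring a variant somatic.

def load_variants(path):
    """Read variants as a set of (chrom, pos, ref, alt) tuples."""
    variants = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, ref, alt = line.rstrip("\n").split("\t")[:4]
            variants.add((chrom, int(pos), ref, alt))
    return variants

tumor = load_variants("tumor_calls.txt")        # hypothetical file
normal = load_variants("matched_normal.txt")    # hypothetical file

# Everything the tumor shares with its matched normal is germline --
# including the patient's private rare variants -- and can be masked out.
candidate_somatic = tumor - normal
print(f"{len(candidate_somatic)} candidate somatic variants remain")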

Alas, in the way of all things, that answer only came to me after I heard Dr. Clark’s talk and understood the subtext of his question.  However, that answer is very important on its own.

It means that while many diseases will be hard slogs through deep pools of rare variants (which SNP chips will never be detailed enough to elucidate, by the way, for those of you who think 23andMe will solve a large number of complicated diseases), cancer is bound to be a more tractable disease in comparison! We can bypass the misery of studying every single rare variant, which together make up a sizeable fraction of the variants in each new genome sequenced!

Unfortunately, unlike many other human metabolic diseases that target a single gene or pathway, cancer is really a whole-genome disease, and is vastly more complex than any other disease. Thus, even if our ability to zoom in on the “driver” mutations progresses rapidly as we sequence more cancer tissues (and their matched normal samples, of course!), it will undoubtedly be harder to interpret how all of those drivers work together and to identify a cure.

So, as with everything, cancer’s somatic nature is a double-edged sword: it can be used to sort the wheat from the chaff more efficiently, but it will also be a source of great consternation when it comes to finding cures.

Now, if only I could convince other people of the dire necessity of matched normals in cancer research…

File Formats and Genomic Data

I’ve been thinking a lot about file formats lately. I know it doesn’t sound like a particularly interesting topic, but it really can be interesting. Well, at least useful. Interesting might be a bit overstating the case, but designing a new file format isn’t a simple job – and finding the right balance between bare bones and overly verbose can be a big challenge.

Over the past few years, I’ve accumulated a large number of modules in my codebase for interfacing with all sorts of files, from plain text all the way up through binary alignment data (MAQ and SAM), spanning all sorts of information. It’s not like I set out to build a library of genomic file interpreters, but it happened. Consequently, I spend a lot of time looking at file formats, and recently I’ve found myself saying “I’ve been thinking about file formats a lot lately…”, which is exactly as geeky as it sounds.

To make matters worse, I’m finding myself reversing an earlier opinion. I have been on record cheering VCF, saying that I would love to have one file format that EVERYONE can dump their SNV information into. I still stand by that, but I’m somewhat disgruntled with the format. Well, as disgruntled as one can be about a file format, anyways. (Or even gruntled, for that matter.)

When VCF came out, I thought it was brilliant – and it is – because it’s simple. A few mandatory fields are listed, followed by an INFO field into which anything and everything can be crammed. (See here for a few examples.) While I love the mandatory fields, I don’t think they’ve gone quite far enough. I’d have tossed a few more fields into that category, such as individual read qualities, total coverage, the number of times the variant was called with confidence… you get the idea. Anyhow, as far as formats go, it really isn’t that bad. The part that annoys me is the lack of enforced fields in the INFO column. By going this route, they’ve created a format with infinite flexibility and extensibility. For the most part, I support that.
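To show just how loose that is, here’s a small sketch of pulling apart a VCF data line in Python. The example line is invented, but the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) follow the spec, and the INFO column is nothing more than a semicolon-delimited grab bag of whatever tags a given caller felt like emitting:

# A sketch of how loosely structured the VCF INFO column is. The data line
# below is invented; the eight fixed columns follow the VCF spec, but nothing
# in the format constrains which key=value tags (or bare flags) land in INFO.

vcf_line = "1\t1234567\t.\tT\tA\t22\tPASS\tDP=3;AF=0.33;ANYTHING=goes_here"

chrom, pos, vid, ref, alt, qual, vfilter, info = vcf_line.split("\t")[:8]

# INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags,
# so parsing it is trivial -- but so is stuffing it with 2k of junk.
info_dict = {}
for entry in info.split(";"):
    key, _, value = entry.partition("=")
    info_dict[key] = value if value else True

print(chrom, pos, ref, alt, info_dict)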

Infinite flexibility, however, can also be extended to infinite abuse. I’m waiting for the first time someone sends me a file with 2k characters in the INFO field on each line and not a single useful piece of information in it. It hasn’t happened yet, but given the plethora of stuff people are now jamming into VCFs, it’s just a matter of time.

In fact, I’ve been working with the Complete Genomics data (second hand, since a co-worker has been processing it, and I’m not sure how much it has been processed) for much of this week. I’ll be thrilled to see this incredible volume of genomic data, in both cancers and normals, appear in my database once the import is complete. With the first import running when I left for the day, there’s a good chance it’ll be in pretty soon. Unfortunately, not all of the information I’d like was in the files. The format is similar to VCF in spirit, but a bit different in style. All in all, it’s not bad (albeit incredibly confusing when my co-worker passed the documentation along somewhat second hand and left out all the gooey bits necessary to actually interpret the files), but even here there were issues with insufficient information. The files had to be massaged a bit to get all of the information we needed. (Although, really, with an expected ~270 million variants called for these data sets, complaining about a lack of information needs to be kept in perspective!)

I have to say, however, that of all the variant formats I work with, my favorite this week is the varfilter output from samtools. It’s a pain in the ass with its cryptic pileup data at the end, but that is exactly what makes it easily the most informative – and self-checking. Consider the following “invalid” line:

1       19154656        T       A       22      22      30      3     ,.^?n   "%@

In it, the T and A signify a T->A variant. The base string (,.^?n) tells you what was actually seen at that position: the “,” and “.” are bases matching the reference (on the reverse and forward strands, respectively), the “^?” marks the start of a new read with its mapping quality encoded in the “?”, and the “n” is that read’s contribution – a single indeterminate, poor-quality base with no call. That lets you figure out what evidence was used to make the call… which turns out to be none whatsoever. It’s a bad snp call. If that information hadn’t been there, I’d have had to blindly accept the above snp into my analysis, and there would have been no way to verify the integrity of the snp caller. (Which seems to be poor, since the example above came from an actual varfilter file, and lines like this turn up frequently in other datasets as well, regardless of the snp caller used, as far as I’m aware.)

In a properly done snp call, you would expect to see a base string that looks more like “.,.,.aAAaa”, which would tell you that 5 reference bases (the dots and commas, matching the reference T in the example above) and 5 A’s were found, confirming the variant call. Now THAT’s useful information: easily interpreted and easily parsed. That kind of evidence is completely absent from the Complete Genomics files and merely optional in VCF.
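Parsing it really is that easy. Here’s a rough sketch of the kind of tally I have in mind when sanity-checking the base column of a samtools pileup line; it handles the common markers (“.” and “,” for reference matches, ACGT for mismatches, “^” plus a mapping-quality character for read starts, “$” for read ends) but deliberately skips indel notation:

# A sketch of tallying the evidence encoded in a pileup base string.
# Handles reference matches, mismatches, read starts ("^" + quality char)
# and read ends ("$"); indel markers are deliberately ignored for brevity.

from collections import Counter

def tally_bases(base_string):
    counts = Counter()
    i = 0
    while i < len(base_string):
        char = base_string[i]
        if char == "^":        # read start: the next character is its mapping quality
            i += 2
            continue
        if char == "$":        # read end marker, carries no base call
            i += 1
            continue
        if char in ".,":
            counts["ref"] += 1
        elif char.upper() in "ACGT":
            counts[char.upper()] += 1
        elif char.upper() == "N":
            counts["no_call"] += 1
        i += 1
    return counts

print(tally_bases(",.^?n"))       # the bad call above: 2 ref, 1 no-call, no A's
print(tally_bases(".,.,.aAAaa"))  # the good call: 5 ref, 5 A's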

So where am I going with this? Simple: if you’re designing a snp-calling format, make sure you include enough information that the calls can be verified easily. I’m willing to pay a slight penalty in storage to be certain that the data are correct. This comes with a sub-point: if you’re using VCF, some of those optional fields shouldn’t be optional.