>GFF3 undocumented feature…

>Earlier today, I tweeted:

Does anyone know how to decypher a diBase GFF3 file? They don’t identify the “most abundant” nucleotide uniquely. seems useless to me.

Apparently, there is a solution, albeit undocumented:

The attribute “genotype” contains an IUB code that is limited to using either a single base or a double base annotation (eg, it should not contain, H, B, V, D or N – but may contain R, Y, W, S, M or K ), which then allows you to subtract the “reference” attribute (that must be canonical) from the “genotype” attribute IUB code to obtain the new SNP – but only when the “genotype” attribute is not a canonical base.

If only that were documented somewhere…

UPDATE: Actually, this turns out not to be the case at all — there are still positions for which the “genotype” attribute is an IUB code, and the reference is not one of the called bases. DOH!

>Useful error messages…. and another format rant.

>I’ll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It’s one of my “go-to” sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I’d get 2-3Mb into the file before the download would die. (It was a LONG list of snps, and the file size was supposed to be in the 10Mb ballpark.)

Anyhow, in frustration, I tried the “email results to you” option, whereupon I got the following email message:

Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress

That has to be the first time I’ve ever had a server shutdown cause a result failure. Ok, it’s not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable – I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I’m also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it’s not freely available, which is somewhat un-helpful. (I don’t know when or if it will be available, although I’ve heard rumours about it being put into our pipeline…)

That’s kind of a sad state of affairs – although I really shouldn’t complain. I have more than enough work on my plate, and I’m sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I’ll just have to sit here and be patient… and maybe write an 8th snp format iterator.

>Risk Factors #2 and thanks to Dr. Steve.

>Gotta love drive-by comments from Dr. Steve:

I don’t have time to go into all of the many errors in this post, but for a start, the odds ratio associated with the celiac snp is about 5-10X *per allele* (about 50X for a homozygote). This HLA allele accounts for about 90% of celiac disease and its mode of action is well understood.

I understand this is just a blog and you are not supposed to be an expert, but you should do some basic reading on genetics before posting misinformation. Or better yet, leave this stuff to Daniel MacArthur.

While even the 6th grade bullies I knew could give Dr Steve a lesson in making friends, I may as well at least clarify my point.

To start with, Dr. Steve made one good point. I didn’t know know what the risk factor for a single given Celiac SNP is – and thanks to Dr. Steve’s incredibly educational message – I still don’t. I simply made up a risk factor, which was probably the source of Dr. Steve’s confusion. (I did say I made it up in the post, but apparently hypothetical situations aren’t within Dr Steve’s repertoire.)

But lets revisit how Risk factors work, as I understand it. If someone would like to tell me how I’m wrong, I’ll accept that criticism. Telling me I’m wrong without saying why is useless, so don’t bother.

A risk factor is a multiplicative factor that indicates your risk of expressing a given phenotype versus the general population. If you have a risk factor of 5, and the general population has a 10% phenotype penetration, that means you have a 50% chance of expressing the phenotype. (.10 x 5 = 0.5, which is 50%).

In more complex cases, say with two independent SNPs, each with an independent risk factor, you multiply the set of risk factors by the probability of the phenotype appearing in the general population. (You don’t need me to do the math for you, do you?)

My rather long-winded point was that discussing risk factors without discussing the phenotypic background disease rate in the population is pointless, unless you know that the risk ratio leads you to a diagnostic test and predicts in a demonstrated statistically significant manner that the information is actionable.

Where I went out on a limb was in discussing other unknowns: error rates in the test, and possible other factors. Perhaps Dr. Steve knows he has a 0% error rate in his DTC SNP calls, or assumes as much – I am just not ready to make that assumption. Another place Dr. Steve may have objected to was my point about extraneous environmental factors, which may be included in the risk factor, although I just passed over it in my previous post without much discussion.

(I would love to hear how a SNP risk factor for something like Parkinson’s disease would be modulated by Vitamin D levels depending on your latitude. It can’t possibly be built into a DTC report. Oh, wait, this is hypothetical again – Sorry Dr. Steve!)

My main point from the previous post was that I have a difficult time accepting is that genomics consultants consider a “risk factor” as a useful piece of genomic information in the absence of an accompanying “expected background phenotypic risk.” A risk factor is simply a modulator of the risk, and if you talk about a risk factor you absolutely need to know what the background risk is.

Ok, I’m done rehashing my point from the previous point, and that takes me to my point for today:

Dr. Steve, telling people who have an interest in DTC genomics to stay out of the conversation in favor of the experts is shooting yourself in the foot. Whether it’s me or someone else, we’re going to ask the questions, and telling us to shut up isn’t going to get the questions answered. If I’m asking these questions, and contrary to your condescending comment I do have a genomics background, people without a genomics background will be asking them as well.

So I’d like to conclude with a piece of advice for you: Maybe you should leave the discussion to Daniel McArthur too – he’s doing a much better job of spreading information than you are, and he does it without gratuitously insulting people.

And I thought Doctors were taught to have a good bedside manner.

>DTC Snps… no more risk factors!

>I’ve been reading Daniel’s blog again. Whenever I end up commenting on things I don’t understand well, that’s usually why. Still, it’s always food for thought.

First of all, has anyone quantified the actual error rate on these tests? We know they have all sorts of mistakes going on. (This one was recently in the news, and yes, unlike Wikipedia, Daniel is a valid reference source for anything genomics related.) I’ll come back to this point in a minute.

As I understand it, the risk factor is an adjustment made to the likelihood of the general population in characterizing the risk of an individual suffering from a particular disease.

So, as I interpret it, you take whatever your likelihood of having that disease was multiplied by the risk factor. For instance with a disease like Jervell and Lange-Nielsen Syndrome, 6 of every 1 Million people suffer from it’s effects (although this is a bad example since you would have discovered it in childhood, but ignoring that for the moment we can assume another rare disease with a similar rate.) If our DTC test shows we have a 1.17 risk factor because we have a SNP, we would multiply that by 1.17.

6/1,000,000 x 1.17 = 7/1,000,000

if I’ve understood it all correctly, that means you’ve gone from knowing you have a 0.000,6% chance to being certain you have a 0.000,7% chance of suffering from your selected disease. (What a great way to spend your money!)

But lets not stop there. Lets ask about the the error rate on actually calling that snp is. From my own experience in SNP validation, I’d make a guess that the validation rate is close to 80-90%. Lets even be generous and take the high end. Thus:

You’ve gone from 100% knowing you’ve got a 0.000,6% chance of having a disease to being 90% sure you have a 0.000,7% chance of having a disease and a 10% sure you’ve still got a 0.000,6% of having the disease.

Wow, I’m feeling enlightened.

Lets do the same for something like Celiacs disease, which is estimated to strike 1/250 people, but is only diagnosed 1/4,700 people in the U.S.A. – and lets be generous and assume that the SNP in your DTC test has a 1.1 risk factor. (Celiacs is far from a rare disease, I might add.)

As a member of the average U.S. population, you had a 0.4% chance of having the disease, but a 0.02% chance of being diagnosed with it. That’s a pretty big disparity, so maybe there’s a good reason to have this test done. As a Canadian it’s somewhat different odds, but lets carry on with the calculations anyhow.

lets say you do the test and find out you have a 1.1 times risk factor of having the disease. omg scary!

Wait, lets not freak out yet. That sounds bad, but we haven’t finished the calculations.

Your test has the SNP…. 1.1 x 1/250 = 0.44% likelihood you have the disease. Because Celiacs disease requires a biopsy to definitively diagnose it (and treatment does not start till you’ve done the diagnosis), would you run out and submit yourself to a biopsy on a 0.44% chance you have a disease? Probably not unless you have some other knowledge that you’re likely to have this disease already.

Then, we factor in the 90% likelyhood of getting the SNP call correct: You have a 90% likelihood of having a 0.44% chance of having the disease, and a 10% likelihood of having a 0.4% chance of having the disease.

Ok, I’d be done panic-ing about now. And we’ve only considered two simple things here. Lets add one more just for fun.

lets pretend that an unknown environmental stressor is actually involved in triggering the condition, which would explain why the odds are somewhat different in Canada. Since we know nothing about that environmental trigger, we can’t even project odds of coming in contact with it. Who knows what effect that plays with the SNP you know about.

By now, I can’t help thinking that all of this is just a wild goose chase.

So, when people start talking about how you have to take your DTC results to a Genetic Counsellor or to your MD I really have to wonder. I can’t help but to think that unless you have a very good reason to suspect a disease or if you have some form of a priori knowledge, this whole thing is generally a waste. Your Genetic Counsellor will probably just laugh at you, and your MD will order a lot of unnecessary tests – which of those sounds productive?

Let me make a proposal (and I’m happy to hear dissent): Risk factors are great – but are absolutlely useless when it comes to discussing how genetic factors affect you. Lets leave the risk factors to the people writing the studies and ask the DTC companies to make a statement: what are your odds of being affected by a given condition? And, if you can’t make a helpful prediction (aka, a diagnostic test), maybe you shouldn’t be selling it as a test.

>SNP Database v0.2

>My SNP database is now up and running, with the first imports of data working well. That’s a huge improvement over the v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks, better indexes, and I’ve even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1Gb RAM.)

So, what’s most interesting to me? Some of the early stats:

11,545,499 snps in total, made from:

  • 870549 snp calls from the 1000 genome project
  • 11361676 snps from dbsnp

So, some quick math:
11,361,676 + 870,549 – 11,545,499 = 686,726 Snps that overlapped between the 1000 genome project (34 data sets) and the dbSNP calls.

That is a whopping 1.6% of the SNPs in my database were not previously annotated in dbSNP.

I suppose that’s not a bad thing, since those samples were all “normals”, and it’s good to get some sense as to how big dbSNP really is.

Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets….

This is starting to become fun.

>SNP Datatabase v0.1

>Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There’s a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc…).

At this point, there are no documents on how to use the software, since that’s the plan for this afternoon, and I’m assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I’m ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who’s interested in joining onto this project, I’m only a few hours away from having some neat toys to play with!

>SNP/SNV callers minimum useful information

>Ok, I sent a tweet about it, but it didn’t solve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you’d think they grow on trees. (Actually, they grow on arrays…) I’ve written one, myself, and I know there are at least 3 others written at the GSC.

Anyhow, At first sight, what pisses me off is that there’s no standard format. Frankly, that’s not even the big problem, however. What’s really underlying that problem is that there’s no standard “minimum information” content being produced by the SNP/SNV callers. Many of them give a bare minimum information, but lack the details needed to really evaluate the information.

So, here’s what I propose. If you’re going to write a SNP or SNV caller, make sure your called variations contain the following fields:

  • chromosome: obviously the coordinate to find the location
  • position: the base position on the chromo
  • genome: the version of the genome against which the snp was called (eg. hg18 vs. hg19)
  • canonical: what you expect to see at that position. (Invaluable for error checking!)
  • observed: what you did see at that position
  • coverage: the depth at that position (filtered or otherwise)
  • canonical_obs: how many times you saw the canonical base (key to evaluating what’s at that position
  • variation_obs: how many times you saw the variation
  • quality: give me something to work with here – a confidence value between 0 and 1 would be ideal… but lets pick something we compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!

Really, most of the callers out there give you most, if not all of it – but I have yet to see the final “quality” being given. The MAQ SNP caller (which is pretty good) asks you to look at several different fields and make up your own mind. That’s fine for a first generation, but maybe I can convince people that we can do better in the second gen snp callers.

Ok, now I’ve got that off my chest! Phew.

>New Project Time… variation database

>I don’t know if anyone out there is interested in joining in – I’m starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I’m currently on my first pass – consider it version 0.1 – but already have some basic functionality assembled. Currently, it uses a built in snp caller to identify locations with variations and to directly send them into a postgresql database, but I will shortly be building tools to allow SNPs from any snp caller to be migrated into the db.

Anyhow, just putting it out there – this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=

>Science Cartoons – 5 (RNA-Seq)

>This is the last science cartoon I did for my poster. I was pretty happy with the pictures, although if I were to do it over again, I’ve learned a few more tricks that I’d have used instead.

Anyhow, my favorite effect on this picture is the “text to path”, where you can make any string follow any line – who knew graphic design could be so much fun. It definitely makes for some interesting graphics. I’d definitely use this effect in an RNA folding paper, if I ever got the chance to do another one. (-;