Dueling Databases of Human Variation

When I got in to work this morning, I was greeted by an email from 23andMe’s PR company, saying they have “built one of the world’s largest databases of individual genetic information.” Normally, I wouldn’t even bat an eye at a claim like that. I’m pretty sure it is a big database of variation… but I thought I should throw down the gauntlet and give 23andMe a run for their money. (-:

The timing couldn’t be better for me. My own database actually ran out of auto-increment IDs this week, as we surpassed 2^31 SNPs entered into the db and had to upgrade the key field from int to bigint. (Some variant calls have been deleted and replaced as variant callers have improved, so we actually have only 1.2 billion variations recorded against the hg18 version of the human genome, and a few hundred million more than that for hg19.) So, I thought I might have a bit of a claim to having one of the largest databases of human variation as well. Of course, comparing databases really is dependent on the metric being used, but hey, there’s some academic value in trying anyhow.
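For anyone curious about the arithmetic behind that key upgrade, here is a minimal Python sketch. It assumes a signed 32-bit int key and a signed 64-bit bigint key, which is the usual definition in SQL databases; the constant and function names are purely illustrative and are not taken from the real schema.

```python
# Assumed key ranges: signed 32-bit INT and signed 64-bit BIGINT.
INT_MAX = 2**31 - 1      # largest id a signed 32-bit key can hold
BIGINT_MAX = 2**63 - 1   # largest id a signed 64-bit key can hold


def fits_in_key(next_id: int, key_max: int) -> bool:
    """Return True if next_id can still be stored in a key with this maximum."""
    return 0 < next_id <= key_max


# The id counter keeps climbing even when rows are deleted, so the key can
# overflow while the table itself holds far fewer rows than 2^31.
next_id = 2**31
print(fits_in_key(next_id, INT_MAX))     # the int key is exhausted
print(fits_in_key(next_id, BIGINT_MAX))  # the bigint key has room to spare
```

This is also why a table with "only" 1.2 billion rows can run out of ids: the counter tracks every row ever inserted, not the rows that remain.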

In the first corner, my database stores information from 2200+ samples (cancer and non-cancer tissue), genome wide (or transcriptome wide, depending on the source of the information), giving us a wide sampling of data, including variations unique to individuals, as well as common polymorphisms. In the other corner, 23andMe has sampled a much greater number of individuals (100,000) using a SNP chip, meaning that they’re only able to sample a small amount of the variation in an individual – about 1/30th of a single percent of the total amount of DNA in each individual.

(According to this page, they look at only 1 million possible SNPs, instead of the 3 Billion bases at which single nucleotide variations can be found – although arguments can be made about the importance of that specific fraction of a percent.)

The nature of the data being stored is pretty important, however. For many studies, the number of people sampled has a greater impact on the statistics than the number of sites studied and, since those are mainly the ones 23andMe are doing, clearly their database is more useful in that regard. In contrast, my database stores data from both cancer and non-cancer samples, which allows us to make sense of variations observed in specific types of cancers – and because cancer-derived variations are less predictable (i.e., not in the same 1M SNPs each time) than the run-of-the-mill-standard-human-variation-type SNPs, the same technology 23andMe used would have been entirely inappropriate for the cancer research we do.

Unfortunately, that means comparing the two databases is completely impossible – they have different purposes, different data and probably different designs.  They have a database of 100k individuals, covering 1 million sites, whereas my database has 2k individuals, covering closer to 3 billion base pairs.  So yeah, apples and oranges.

(In practice, however, we don’t see variations at all 3 Billion base pairs, so that metric is somewhat skewed itself.  The number is closer to 100 Million bp –  a fraction of the genome nearly 100 times larger than what 23andMe is actually sampling.)
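As a rough check on those fractions, the arithmetic can be sketched in a few lines of Python. It assumes the round figures quoted in this post (3 billion bases in the genome, 1 million sites on the SNP chip and 100 million positions where variation is actually seen); the variable names are just for illustration.

```python
# Round figures taken from the post; all three are approximations.
GENOME_BP = 3_000_000_000           # bases in the human genome
CHIP_SITES = 1_000_000              # sites a SNP chip looks at
OBSERVED_VARIABLE_BP = 100_000_000  # positions where variation is observed

# Share of the genome covered by the chip, as a percentage.
chip_percent = 100 * CHIP_SITES / GENOME_BP

# How many times more positions the sequencing data touches than the chip.
fold_more = OBSERVED_VARIABLE_BP / CHIP_SITES

print(f"chip coverage: {chip_percent:.3f}% of the genome")
print(f"observed variable positions: {fold_more:.0f}x the chip's sites")
```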

But, I’d still be interested in knowing the absolute number of variations they’ve observed… a great prize upon which we could hold this epic battle of “largest database of human variations.” At best, 23andMe’s database holds 10^11 variations (1×10^6 SNPs × 1×10^5 people), if every single variant was found in every single person – a rather unlikely case. With my database currently at 1.2×10^9 variations, I think we’ve got some pretty even odds here.
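The back-of-the-envelope ceiling above is easy to reproduce. This is only a sketch using the numbers from this post; the names are hypothetical, and the true count for 23andMe is unknown and must sit somewhere below the ceiling.

```python
# Figures quoted in the post.
SNPS_ON_CHIP = 10**6            # sites genotyped per person
PEOPLE_SAMPLED = 10**5          # individuals in the 23andMe database
MY_VARIATIONS = 1_200_000_000   # variations recorded against hg18

# Ceiling: every person carries a variant at every site on the chip.
upper_bound = SNPS_ON_CHIP * PEOPLE_SAMPLED

# Fraction of that ceiling they would need to reach to match my count.
break_even_fraction = MY_VARIATIONS / upper_bound

print(f"upper bound: {upper_bound:.1e} variations")
print(f"break-even: {break_even_fraction:.1%} of sites variant per person")
```

In other words, the two databases tie if the average person carries a variant at a little over one percent of the chip's sites.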

Really, despite the joking about comparing database sizes, the real deal would be the fantastic opportunity to learn something interesting by merging the two databases, which could teach us something both about cancer and about the frequencies of variations in the human population.

Alas, that is pretty much certain to never happen.  I doubt 23andMe will make their database public – and our organization never will either.  Beyond the ethical issues of making that type of information public, there are pretty good reasons why this data can only be shared with collaborators – and in measured doses at that.  That’s another topic for another day, which I won’t go into here.

For now, 23andMe and I will just have to settle for both having “one of the world’s largest databases of individual genetic information.”  The battle royale for the title will have to wait for another day… and who knows what other behemoths are lurking in other research labs around the world.

On the other hand, the irony of a graduate student challenging 23andMe for the title of largest database of human variation really does make my day. (=

[Note: I should mention that when I say that I have a database of human variation, the database was my creation but the data belongs to the Genome Sciences Centre – and credit should be given to all of those who did the biology and bench work, performed the sequencing, ran the bioinformatics pipelines and assisted in populating the database.]

Working with Jackie Chan.

Since I’ve been posting jobs, I figured I may as well point people to another set of open positions.  Of course, again, I have no relationship with the people posting it… however, I just couldn’t not say anything about this set.

Apparently, if you work in the Pallen Group, you get to work on next gen sequencing pipelines in the lab with Jackie Chan. (Research Fellow in Microbial Bioinformatics)

How cool is that?

Anyhow, the other position (Research Technician in Bioinformatics) doesn’t (apparently) involve martial arts.

ridiculous email.

Sometimes it’s fun to write ridiculous emails:

Good morning 1st floor!

You may notice that *all* items in both of the fridges and the freezer have been marked with a yellow sticky piece of paper. This yellow mark symbolizes your refrigerated item’s impending doom.

If there is anything in the freezer that still has this mark on it by Thursday afternoon, it will be sacrificed to the gods of bioinformatics in the hopes of better results and faster processing times for the GSC. (The sacrifice may or may not involve diabolical rituals and a RIP talk.)

Fortunately for your refrigerated items, they may be spared simply by removing the yellow tag.

Unlike last year, I will not be lenient in sparing “fresh looking” items with the yellow tag… as I found some yellow tags from last year in the freezer. (anyone want a frozen dinner?)

With humour,


I have to admit, I’ve never threatened doom on slightly chilled and expired food items before.  Thursday afternoon should be rather entertaining, I would think.

Job titles

I couldn’t resist telling my own story, after I came across this blog entry in which Karen Vancampenhout describes a typo that was made in the abstract of her paper as it was entered into the ISI Web of Science. Having the name of the tree you’ve studied changed from “Pinus sylvestris” to “Anal sylvestris” does seem like a bad April Fools’ Day joke… but, it reminds me of one of my own adventures.

About a decade ago, while I was doing my undergrad, I worked as a programmer at an insurance company in Toronto. In a moment of sheer generosity, the head of the department got every one of the IT people (including us lowly co-op students) a pass to spend an afternoon checking out one of the big computer shows that was visiting. For a co-op student, that meant an afternoon of not working, a pile of potential conference swag and access to some fantastic new tech demos. Sheer win!

Obviously, all of us IT geeks were excited when the passes came – considering that this was 1997, they seemed pretty spiffy.  You would wear them on a lanyard so people could see your name and when you walked into a booth, you could swipe it to give the vendor your name, contact information and job title.  At some booths, it would even appear on a screen so that the sales people didn’t have to squint to read who you were from the name tag.  In BIG glowing letters.

And, alas, that’s where the problem began.  Whoever it was that filled in the card for me had put in my full job title of “Programmer/Analyst”.  Not a bad title for a biochemist doing a comp sci job – and I was pretty proud of it for the most part.  Unfortunately, the conference badge didn’t have enough room for the whole thing.  You can guess what it was shortened to.  Yes… “Programmer Anal”.

I had to walk around all day, showing this badge to everyone, swiping it at all the booths I visited.  Of course, there was no hiding the unfortunate job title “modification” from my colleagues.   (None of whom were unfortunate enough to have the same job title as me.)

Let’s just say that I saw an awful lot of smirks from vendor sales people – and the people in line behind me. For the record, I “lost” that badge pretty quickly.

Unfortunately, that wasn’t the end of the whole affair. Having swiped my card at a lot of booths, there were a lot of companies who had it entered into their databases. When I went back to the insurance company for another co-op rotation, I got a nice big stack of glossy vendor magazines – all, naturally, addressed to me as Anthony Fejes, Programmer Anal.