Dueling Databases of Human Variation

When I got in to work this morning, I was greeted by an email from 23andMe’s PR company, saying they have “built one of the world’s largest databases of individual genetic information.” Normally, I wouldn’t even bat an eye at a claim like that. I’m pretty sure it is a big database of variation… but I thought I should throw down the gauntlet and give 23andMe a run for their money. (-:

The timing couldn’t be better for me. My own database actually ran out of auto-increment IDs this week, as we surpassed 2^31 SNPs entered into the db and had to upgrade the key field from int to bigint. (Some variant calls have been deleted and replaced as variant callers have improved, so we actually have only 1.2 billion variations recorded against the hg18 version of the human genome, and a few hundred million more than that for hg19.) So, I thought I might have a bit of a claim to having one of the largest databases of human variation as well. Of course, comparing databases really is dependent on the metric being used, but hey, there’s some academic value in trying anyhow.
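For anyone wondering where 2^31 comes from, here is a minimal sketch of the limit we ran into. It assumes the key was a signed 32-bit integer column and that bigint is a signed 64-bit one (typical for SQL databases, though the engine isn’t named here); the constant and function names are made up for illustration.

```python
# Assumed column widths: signed 32-bit "int" and signed 64-bit "bigint".
INT_MAX = 2**31 - 1
BIGINT_MAX = 2**63 - 1


def next_id_fits(next_id: int, column_max: int) -> bool:
    """Return True if an auto-increment counter can still hand out next_id."""
    return 1 <= next_id <= column_max


overflowing_id = 2**31  # first ID past the signed 32-bit range
print(next_id_fits(overflowing_id, INT_MAX))     # False
print(next_id_fits(overflowing_id, BIGINT_MAX))  # True
```

Deleted rows don’t give their IDs back, which is how the counter can pass 2^31 while only about 1.2 billion rows remain.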

In the first corner, my database stores information from 2200+ samples (cancer and non-cancer tissue), genome wide (or transcriptome wide, depending on the source of the information), giving us a wide sampling of data, including variations unique to individuals, as well as common polymorphisms. In the other corner, 23andMe has sampled a much greater number of individuals (100,000) using a SNP chip, meaning that they’re only able to sample a small amount of the variation in an individual – about 1/30th of a single percent of the total amount of DNA in each individual.

(According to this page, they look at only 1 million possible SNPs, instead of the 3 Billion bases at which single nucleotide variations can be found – although arguments can be made about the importance of that specific fraction of a percent.)
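As a quick sanity check on that fraction, here is the arithmetic as a small sketch. The two inputs are the round numbers quoted above (1 million chip sites, 3 billion bases); the variable names are just for illustration.

```python
# Round numbers from the text, not exact counts.
chip_sites = 1_000_000          # SNP positions on the chip
genome_bases = 3_000_000_000    # approximate size of the human genome

fraction_sampled = chip_sites / genome_bases
percent_sampled = fraction_sampled * 100

print(f"{percent_sampled:.3f}% of the genome")  # 0.033% of the genome
```

That works out to roughly one base in every 3,000, or about 1/30th of a percent.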

The nature of the data being stored is pretty important, however. For many studies, the number of people sampled has a greater impact on the statistics than the number of sites studied and, since those are mainly the ones 23andMe are doing, clearly their database is more useful in that regard. In contrast, my database stores data from both cancer and non-cancer samples, which allows us to make sense of variations observed in specific types of cancers – and because cancer-derived variations are less predictable (i.e., not in the same 1M SNPs each time) than the run-of-the-mill-standard-human-variation-type SNPs, the same technology 23andMe used would have been entirely inappropriate for the cancer research we do.

Unfortunately, that means comparing the two databases is completely impossible – they have different purposes, different data and probably different designs.  They have a database of 100k individuals, covering 1 million sites, whereas my database has 2k individuals, covering closer to 3 billion base pairs.  So yeah, apples and oranges.

(In practice, however, we don’t see variations at all 3 Billion base pairs, so that metric is somewhat skewed itself.  The number is closer to 100 Million bp –  a fraction of the genome nearly 100 times larger than what 23andMe is actually sampling.)

But, I’d still be interested in knowing the absolute number of variations they’ve observed… a great prize upon which we could hold this epic battle of “largest database of human variations.” At best, 23andMe’s database holds 10^11 variations (1×10^6 SNPs x 1×10^5 people), if every single variant was found in every single person – a rather unlikely case. With my database currently at 1.2×10^9 variations, I think we’ve got some pretty even odds here.
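To put rough numbers on that comparison, here is a small sketch of the upper bound and of the break-even point. It uses only the round figures quoted above; the break-even value is simple division under the assumption that every stored record counts as one variation, not anything known about 23andMe’s actual data.

```python
# Round numbers from the text.
chip_sites = 10**6               # SNPs on the chip
people = 10**5                   # individuals genotyped
my_variations = 1_200_000_000    # records in my database (hg18)

# Upper bound: every site is variant in every person.
upper_bound = chip_sites * people

# Average number of variant sites per person at which the two
# databases would hold the same number of variations.
break_even_sites_per_person = my_variations / people
break_even_fraction = break_even_sites_per_person / chip_sites

print(f"upper bound: {upper_bound:.1e}")
print(f"break-even: {break_even_sites_per_person:,.0f} sites per person "
      f"({break_even_fraction:.1%} of the chip)")
```

So the title comes down to whether the average 23andMe customer differs from the reference at more or fewer than about 12,000 of the chip’s sites.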

Really, despite the joking about comparing database sizes, the real deal would be the fantastic opportunity to learn something interesting by merging the two databases, which could teach us something both about cancer and about the frequencies of variations in the human population.

Alas, that is pretty much certain to never happen.  I doubt 23andMe will make their database public – and our organization never will either.  Beyond the ethical issues of making that type of information public, there are pretty good reasons why this data can only be shared with collaborators – and in measured doses at that.  That’s another topic for another day, which I won’t go into here.

For now, 23andMe and I will just have to settle for both having “one of the world’s largest databases of individual genetic information.”  The battle royale for the title will have to wait for another day… and who knows what other behemoths are lurking in other research labs around the world.

On the other hand, the irony of a graduate student challenging 23andMe for the title of largest database of human variation really does make my day. (=

[Note: I should mention that when I say that I have a database of human variation, the database was my creation but the data belongs to the Genome Sciences Centre – and credit should be given to all of those who did the biology and bench work, performed the sequencing, ran the bioinformatics pipelines and assisted in populating the database.]

Dog Bath

It’s the weekend, and I’m trying not to do too much science today. That shouldn’t be too much of a problem, really – the Canucks are playing at 5pm, I’m going to be making BBQ personal pizzas, and I got all of my chores done yesterday. So, today was the day to bathe the dog. I’d hate to have the doggy pet-sitter get a stinky dog when my wife and I go to Copenhagen.

If you’ve never seen a puli get a bath, though, it’s something pretty spectacular.

Working with Jackie Chan.

Since I’ve been posting jobs, I figured I may as well point people to another set of open positions.  Of course, again, I have no relationship with the people posting it… however, I just couldn’t not say anything about this set.

Apparently, if you work in the Pallen Group, you get to work on next gen sequencing pipelines in the lab with Jackie Chan. (Research Fellow in Microbial Bioinformatics)

How cool is that?

Anyhow, the other position (Research Technician in Bioinformatics) doesn’t (apparently) involve martial arts.

Job titles

I couldn’t resist telling my own story, after I came across this blog entry in which Karen Vancampenhout describes a typo that was made in the abstract of her paper as it was entered into the ISI Web of Science. Having the name of the tree you’ve studied changed from “Pinus sylvestris” to “Anal sylvestris” does seem like a bad April Fools’ Day joke… but, it reminds me of one of my own adventures.

About a decade ago, while I was doing my undergrad, I worked as a programmer at an insurance company in Toronto. In a moment of sheer generosity, the head of the department got every one of the IT people (including us lowly co-op students) a pass to spend an afternoon checking out one of the big computer shows that was visiting. For a co-op student, that meant an afternoon of not working, a pile of potential conference swag and access to some fantastic new tech demos. Sheer win!

Obviously, all of us IT geeks were excited when the passes came – considering that this was 1997, they seemed pretty spiffy.  You would wear them on a lanyard so people could see your name and when you walked into a booth, you could swipe it to give the vendor your name, contact information and job title.  At some booths, it would even appear on a screen so that the sales people didn’t have to squint to read who you were from the name tag.  In BIG glowing letters.

And, alas, that’s where the problem began.  Whoever it was that filled in the card for me had put in my full job title of “Programmer/Analyst”.  Not a bad title for a biochemist doing a comp sci job – and I was pretty proud of it for the most part.  Unfortunately, the conference badge didn’t have enough room for the whole thing.  You can guess what it was shortened to.  Yes… “Programmer Anal”.

I had to walk around all day, showing this badge to everyone, swiping it at all the booths I visited.  Of course, there was no hiding the unfortunate job title “modification” from my colleagues.   (None of whom were unfortunate enough to have the same job title as me.)

Let’s just say that I saw an awful lot of smirks from vendor sales people – and the people in line behind me. For the record, I “lost” that badge pretty quickly.

Unfortunately, that wasn’t the end of the whole affair. Having swiped my card at a lot of booths, there were a lot of companies who had it entered into their databases. When I went back to the insurance company for another co-op rotation, I got a nice big stack of glossy vendor magazines – all, naturally, addressed to me as Anthony Fejes, Programmer Anal.

The circle of life...

When I was getting close to the end of my master’s degree, a fellow graduate student pulled me aside and asked me if I could think of any algorithms for a quantum computer… That turned into a rather successful biotechnology company here in Vancouver. As far as I was concerned, the quantum computer never materialised – but I don’t think it was necessary for that company. It would have been a nice touch, but it was never a core part of the strategy.

Now, as I near the end of my PhD, my supervisor asked me the same question today. Unfortunately, I’m still not sure that there’s a good answer to it either. I can think of great things I’d like to do with a quantum computer, but I still face the same problems as the first time around:

A) Does it actually exist?
B) When will it be ready?
and C) what can it do?

Coming up with problems is easy – coming up with problems that take advantage of an imaginary computer with imaginary strengths (of which I know very little) is hard.

Somehow, I don’t see this leading to a chapter in my thesis. At least, this time, I don’t think it’ll lead to a start-up company.

Science Spam

Periodically I get spam that makes me laugh.  Yes, this is a real company, and yes they do publish anything and everything they can get their hands on – basically they are a custom printing shop, printing your content at great expense to the purchaser.  I have no idea how much they offer to the generator of the content.

Dear Mr. Anthony Peter Fejes,

I am writing on behalf of an international publishing house, LAP Lambert Academic Publishing.

In the course of a research on the University of Waterloo, I came across a reference to your thesis on “Computationally Modeled Properties of BetaTrefoil Proteins”.
We are an international publisher whose aim is to make academic research available to a wider audience.
LAP would be especially interested in publishing your dissertation in the form of a printed book.

Your reply including an e-mail address to which I can send an e-mail with further information in an attachment
will be greatly appreciated.

I am looking forward to hearing from you.
Kind regards,
<Name Redacted>
Acquisition Editor

Why does it make me laugh? The thesis they’re asking about is one I did for my undergraduate Biochemistry degree.

Yeah, I know, my first thesis was a brilliant work of art, but really, computational modeling done over a decade ago using Swiss-Model isn’t going to be earth shattering – and yes, if there is someone out there who wants a copy of my thesis, I can have it printed for you at Kinko’s… if I can find that file. (It did have awesome pictures, tho!)

And, as far as I’m concerned, one should never combine beta-trefoil proteins and my undergrad thesis with the term “wider audience” – it’s just not going to work out well in the end.

Best Software Licence Ever!

I was looking for some example code of a Mahalanobis distance calculator and came across what I happen to believe is the most entertaining license I have ever seen. I had to share:

The program is free to use for non-commercial academic purposes, but for course works, you must understand what is going inside to use. The program can be used, modified, or re-distributed for any purposes if you or one of your group understand codes (the one must come to court if court cases occur.) Please contact the authors if you are interested in using the program without meeting the above conditions.

The Source.

Ridiculous Bioinformatics

I think I’ve finally figured out why bioinformatics is so ridiculous. It took me a while to figure this one out, and I’m still not sure if I believe it, but let me explain to you and see what you think.

The major problem is that bioinformatics isn’t a single field; rather, it’s the combination of (on a good day) biology and computer science. Each field on its own is a complete subject that can take years to master. You have to respect the biologist who can rattle off the biochemical pathways chart and then extrapolate that to the annotations of a genome to find interesting features of a new organism. Likewise, there’s some serious respect due to the programmer who can optimize code down at the assembly level to give you incredible speed while still using half the amount of memory you initially expected to use. It’s pretty rare to find someone capable of both, although I know a few who can pull it off.

Of course, each field on its own has some “fudge factors” working against you in your quest for simplicity.

Biologists don’t actually know the mechanisms and chemistry of all the enzymes they deal with – they are usually putting forward their best guesses, which lead them to new discoveries. Biology can effectively be summed up as “reverse engineering the living part of the universe”, and we’re far from having all the details worked out.

Computer science, on the other hand, has an astounding amount of complexity layered over every task, with a plethora of languages and systems, each with their own “gotchas” (are your arrays zero based or 1 based? how does your operating system handle wild cards at the command line? what does your text editor do to gene names like “Sep9”?) leading to absolute confusion for the novice programmer.

In a similar manner, we can also think about the probabilities of encountering these pitfalls. If you have two independent events, each of which has a distinct probability attached, you can multiply the probabilities to determine the likelihood of both events occurring simultaneously.

So, after all that, I’d like to propose “Fejes’ law of interdisciplinary research”:

The likelihood of achieving flawless work in an interdisciplinary research project is the product of the likelihood of achieving flawless work in each independent area.

That is to say, if your biology experiments are (on average) free of mistakes 85% of the time, and your programming is free of bugs 90% of the time (i.e., you get the right answers), your likelihood of getting the right answer in a bioinformatics project is:

Fp = Flawless work in Programming
Fb = Flawless work in Biology
Fbp = Flawless work in Bioinformatics

Thus, according to Fejes’ law:

Fb x Fp = Fbp

and the example given:

0.85 x 0.90 = 0.765

Thus, even an outstanding programmer and biologist will struggle to get an extremely high rate of flawless results.
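For the programmers in the audience, the law can be sketched in a few lines. This assumes, as above, that mistakes in each field are independent; the function name is made up for illustration, and the third “statistics” field with its 90% rate is purely hypothetical, added only to show how quickly things degrade as more disciplines get involved.

```python
from math import prod


def flawless_rate(*field_rates: float) -> float:
    """Chance that the work in every field is flawless, assuming independence."""
    for rate in field_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate must be between 0 and 1, got {rate}")
    return prod(field_rates)


f_b = 0.85  # flawless work in biology (example value from the text)
f_p = 0.90  # flawless work in programming (example value from the text)

f_bp = flawless_rate(f_b, f_p)
print(round(f_bp, 3))  # 0.765

# Hypothetical third discipline, not part of the original example.
f_stats = 0.90
print(round(flawless_rate(f_b, f_p, f_stats), 4))  # 0.6885
```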

Fortunately, there’s one saving grace to all of this: The magnitude of the errors is not taken into account. If the bug in the code is tiny, and has no impact on the conclusion, then that’s hardly earth shattering, or if the biology measurements have just a small margin of error, it’s not going to change the interpretation.

So there you have it, bioinformaticians. If I haven’t just scared you off of ever publishing anything again, you now know what you need to do…

Unit tests, anyone?

ELISA for Obesity Proteins.

Ok, I can’t resist. I occasionally get emails from random biotech companies promoting products that are invariably useless to me. This one amused me enough that I thought I should share it.

The title of the email is “ELISA Strip for Profiling 8 Obesity Proteins.” While I’m sure there are people who have a good use for that, I have no clue why I’d want it. I’m not sure I’d want to go to a doctor who needs to use it to tell if their patients are overweight either.

Whatever happened to looking at yourself in the mirror or standing on the bathroom scale and saying, “Oh man, I need to lose some weight!?” Now you’re supposed to kit yourself out and do an ELISA to tell if you’ve got to diet?

Oh well, if you do find you have a use for it, Signosis will be more than happy to sell you one.