Dueling Databases of Human Variation

When I got in to work this morning, I was greeted by an email from 23andMe’s PR company, saying they have “built one of the world’s largest databases of individual genetic information.”  Normally, I wouldn’t even bat an eye at a claim like that.  I’m pretty sure it is a big database of variation…  but I thought I should throw down the gauntlet and give 23andMe a run for their money.  (-:

The timing for it couldn’t be better for me.  My own database actually ran out of auto-increment IDs this week, as we surpassed 2^31 snps entered into the db and had to upgrade the key field to bigint from int. (Some variant calls have been deleted and replaced as variant callers have improved, so we actually have only 1.2 Billion variations recorded against the hg18 version of the human genome.  A few hundred million more than that for hg19.)  So, I thought I might have a bit of a claim to having one of the largest databases of human variation as well.  Of course, comparing databases really is dependent on the metric being used, but hey, there’s some academic value in trying anyhow.
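For anyone who hasn’t hit this particular wall, here is a minimal sketch of the arithmetic.  It assumes the key was a signed 32-bit integer column and that the replacement is a signed 64-bit one; the helper name ids_remaining is purely illustrative and not part of any database API.

```python
# Largest value a signed 32-bit integer key can hold (assumed: signed INT column).
INT32_MAX = 2**31 - 1
# Largest value a signed 64-bit integer key can hold (assumed: signed BIGINT column).
INT64_MAX = 2**63 - 1


def ids_remaining(next_id: int, key_max: int) -> int:
    """Return how many more auto-increment IDs fit before the key overflows."""
    return max(key_max - next_id + 1, 0)


print(INT32_MAX)                        # 2147483647
print(ids_remaining(2**31, INT32_MAX))  # 0: the next insert has nowhere to go
print(ids_remaining(2**31, INT64_MAX))  # plenty of headroom after the upgrade
```

Note that the counter only ever goes up, so deleted rows don’t give their IDs back, which is how a table holding 1.2 billion rows can still exhaust a 2^31 key space.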

In the first corner, my database stores information from 2200+ samples (cancer and non-cancer tissue), genome wide (or transcriptome wide, depending on the source of the information), giving us a wide sampling of data, including variations unique to individuals, as well as common polymorphisms.  In the other corner, 23andMe has sampled a much greater number of individuals (100,000) using a SNP chip, meaning that they’re only able to sample a small amount of the variation in an individual – about 1/30th of a single percent of the total amount of DNA in each individual.

(According to this page, they look at only 1 million possible SNPs, instead of the 3 Billion bases at which single nucleotide variations can be found – although arguments can be made about the importance of that specific fraction of a percent.)

The nature of the data being stored is pretty important, however.  For many studies, the number of people sampled has a greater impact on the statistics than the number of sites studied and, since those are mainly the ones 23andMe are doing, clearly their database is more useful in that regard.  In contrast, my database stores data from both cancer and non-cancer samples, which allows us to make sense of variations observed in specific types of cancers – and because cancer derived variations are less predictable (ie, not in the same 1M snps each time) than the run-of-the-mill-standard-human-variation-type snps, the same technology 23andMe used would have been entirely inappropriate for the cancer research we do.

Unfortunately, that means comparing the two databases is completely impossible – they have different purposes, different data and probably different designs.  They have a database of 100k individuals, covering 1 million sites, whereas my database has 2k individuals, covering closer to 3 billion base pairs.  So yeah, apples and oranges.

(In practice, however, we don’t see variations at all 3 Billion base pairs, so that metric is somewhat skewed itself.  The number is closer to 100 Million bp –  a fraction of the genome nearly 100 times larger than what 23andMe is actually sampling.)

But, I’d still be interested in knowing the absolute number of variations they’ve observed…  a great prize upon which we could hold this epic battle of “largest database of human variations.”  At best, 23andMe’s database holds 10^11 variations, (1×10^6 SNPs x 1×10^5 people), if every single variant was found in every single person – a rather unlikely case.  With my database currently  at 1.2×10^9 variations, I think we’ve got some pretty even odds here.
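As a back-of-the-envelope check, the numbers quoted above can be put side by side.  This is only a sketch: the constants are the round figures from this post, and the “break-even” rate is a derived illustration, not anything 23andMe has published.

```python
SNPS_ON_CHIP = 1_000_000       # sites assayed per person (round figure from the post)
PEOPLE = 100_000               # individuals sampled (round figure from the post)
GENOME_BP = 3_000_000_000      # approximate number of bases in the human genome
MY_VARIATIONS = 1_200_000_000  # variations recorded against hg18

# Ceiling on their count: every chip site variant in every person.
upper_bound = SNPS_ON_CHIP * PEOPLE

# Share of the genome a 1M-site chip can look at.
fraction_sampled = SNPS_ON_CHIP / GENOME_BP

# Average share of chip sites that would have to be variant in each person
# for their total to match the 1.2 billion in my database.
break_even = MY_VARIATIONS / upper_bound

print(f"upper bound:      {upper_bound:.0e}")
print(f"fraction sampled: {fraction_sampled:.4%}")
print(f"break-even rate:  {break_even:.1%}")
```

In other words, the contest hinges on how often a typical person differs from the reference at the chip’s sites, which is exactly the number I don’t have.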

Really, despite the joking about comparing database sizes, the real deal would be the fantastic opportunity to learn something interesting by merging the two databases, which could teach us something both about cancer and about the frequencies of variations in the human population.

Alas, that is pretty much certain to never happen.  I doubt 23andMe will make their database public – and our organization never will either.  Beyond the ethical issues of making that type of information public, there are pretty good reasons why this data can only be shared with collaborators – and in measured doses at that.  That’s another topic for another day, which I won’t go into here.

For now, 23andMe and I will just have to settle for both having “one of the world’s largest databases of individual genetic information.”  The battle royale for the title will have to wait for another day… and who knows what other behemoths are lurking in other research labs around the world.

On the other hand, the irony of a graduate student challenging 23andMe for the title of largest database of human variation really does make my day. (=

[Note: I should mention that when I say that I have a database of human variation, the database was my creation but the data belongs to the Genome Sciences Centre – and credit should be given to all of those who did the biology and bench work, performed the sequencing, ran the bioinformatics pipelines and assisted in populating the database.]

Talk: Michael D. Taylor – University of Toronto on Medulloblastoma

I had heard about this talk earlier this week, but had decided to skip it, seeing as it sounded unrelated to my own work.  However, a last minute email from a colleague suggested I might want to go, so I stopped to read the abstract.  There was a great line in the abstract that grabbed my attention:

The widespread intra-tumoral heterogeneity across space and time demonstrated highly suggests that targeted therapies will need to be developed and used in specific clinical scenarios. Use of targeted therapies developed through study of primary untreated tumors with subsequent use at clinical trials on recurrent metastatic disease is likely doomed to failure.

Since that sums up my opinion on cancer, these days, I thought it might be worth attending, after all.

In any case, the start of the presentation is delayed, due to a “catastrophic” computer failure, presumably the one hosting the powerpoint, so we’ll get underway shortly.

In the meantime, a bit of background:

Title: “Clinical Implications of Medulloblastoma Inter and Intra-tumoral Heterogeneity Through Space and Time”
Presented by: Michael D. Taylor MD PhD,
Associate Professor, University of Toronto,
Scientist, Program in Developmental Biology,
Neurosurgeon, The Hospital for Sick Children

Ok, and now it seems we’re ready to go!


Intra and intertumoral Heterogeneity in Medulloblastoma.

Interest in pediatric tumours, specifically, small blue cell tumours, such as medulloblastoma.  Brain tumours are the most common cause of non-accidental death in Canadian children.

Targeted therapy would be a huge step forward in the treatment of medulloblastoma (MB).  There are many subtypes of MB, but so far no one has managed to separate these into coherent sub-groups, mainly because of small tumour size.  However, Sick Kids has managed to get up to 100 tumours, and were able to show that there are 4 different diseases.  Not actually sub-groups, but different diseases.  Driven by different genetic events.

There are different genes with different expression. (WNT- because of Wnt signalling, SHH- hedgehog signalling, Group C and Group D).  Groups C and D are the most similar of the four.

Tool: Mesirov: NMF: (Missed the name, but a different type of clustering from the unsupervised hierarchical clustering that was used to generate the 4 groups originally.)  Also showed the same 4 types, which are entirely different.

Also interesting is that the different types of tumours affect different age groups, and have different phenotypes and outcomes.  However, even more interesting, there’s a significant difference in the demographics by gender.  WNT is 3x more common in girls than boys, and WNT survival is practically 100%.

Metastasis occurrence is also dramatically different across the 4 subgroups.  The group C tumours have MUCH higher incidence. (33%)

There are also subgroup-specific mutations.  Great heat map classification of MB subtypes; it clearly shows all of the different subgroups.  However, clinicians DO NOT want Affymetrix technology.  Fortunately, 4 genes can be used to differentiate the same categories – and clinicians LOVE antibody based technology.  4 antibodies can then be used to subgroup tumours.

Went to the lab and tested this out on 300 tumours. 98% of the time only one antibody was observed to bind, and same results were found using microarray. Thus, this technology appears to be working as advertised.
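[A rough sketch of how I picture the four-antibody call working, assuming one antibody per subgroup and a simple “exactly one positive” rule.  The actual marker names weren’t something I caught, so the stains are keyed by subgroup, and the function name is my own invention.]

```python
from typing import Dict, Optional

SUBGROUPS = ("WNT", "SHH", "Group C", "Group D")


def call_subgroup(staining: Dict[str, bool]) -> Optional[str]:
    """Assign a subgroup when exactly one of the four antibodies binds.

    `staining` maps a subgroup name to whether its (hypothetical) marker
    antibody stained positive. Returns None when zero or several stain,
    i.e. the small fraction of tumours the panel cannot resolve.
    """
    positives = [group for group in SUBGROUPS if staining.get(group, False)]
    return positives[0] if len(positives) == 1 else None


print(call_subgroup({"WNT": True}))                # WNT
print(call_subgroup({"SHH": True, "WNT": True}))   # None (ambiguous)
print(call_subgroup({}))                           # None (nothing bound)
```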

Survival curve:  survival for WNT is nearly 100%, Group C is nearly 0% at 96 months. Group D is 50% by 144 months, SHH is about 80% at 216 months.  Once you account for subgroup, metastasis is not a useful indicator of your survival – thus, we should clearly use a molecular approach for selecting treatment options.

The Myth of Medulloblastoma – there is no single medulloblastoma – there are indeed 4 different cancers.

The next question is whether there are subtypes within the subgroups.  The answer is yes, there do appear to be subtypes within the subgroups – 2 in WNT, 4 in SHH, etc.

There are now drugs for SHH that bind Smoothened and block the hedgehog signalling.  Unfortunately, response is transient, and eventually the cancer recurs.  However, SHH tumours have Chr9q, causing a loss of another key gene.

Looked at the SHH group specifically, trying to subtype even further – and yes, again the subtypes have distinct phenotypes – and they classify according to age, as well.  SHH affects 4 month olds differently to adults.

So, where to next?

The antibodies work, but they’re polyclonal, so there is variability.  As well, there are also different ways to embed/prepare cancer samples.  Thus, a better technology is needed.

nanoString assay used instead, which uses a probe set instead of PCR.  The test is $50, with a 48 hour turnaround – and gives the same results as the array – and they are now using new tumour sets to test it on.  Can even be used on paraffin samples up to 5 years old.  (Quality degrades rapidly on older samples.)  Also tested at other locations around the world.

[Ok, Starting to feel like I’m in a commercial for a new medical product, despite the fact that I’m enjoying the data tremendously.]

Currently looking at CNV and SNVs for medulloblastoma, and have identified a lot of new oncogenes. [Probably putative oncogenes, really, but not specified.]

All the data above was for the primary medulloblastoma.  What about metastasis?  It’s not the primary tumour that kills kids, it’s the metastases – and they’re less well studied.

The assumption in the field is that the metastasis is identical to the primary, but this was never really tested.

Retroviruses as insertional mutagenesis.  Use as tool to insert a transposon into random genes.  “Sleeping Beauty” system, which only moves when the correct enzyme is present.  Restricting the transposase allows you to target the cell of origin.

Targeting the system to the putative cell of origin gave no results.

However, if you use Ptch mutant mice, you get a VERY high incidence of SHH medulloblastoma, and it’s metastatic.

Another experiment was to study Common insertion genes (overlap of genes).  In fact, this led to looking at the metabolism in primary and metastatic tumours.  Although they were the same tumour, the markers show that sometimes the primary tumours lose events, gain events or completely modify their characteristics.

Conclusions: Metastatic tumours are seeded from the primary tumour.  [Missed the second one… too fast.]

SHH alone: low metastasis event occurrence.  SHH + oncogene (eg, AKT), you get a significant increase in metastasis.

This model appears to apply to humans as well: Primary tumours frequently respond, but the metastatic does not – or vice versa.  Thus, it’s pretty likely we’re seeing the same thing happening.

Using a small number of paired primary/metastatic tumours – primary clusters with metastasis, but the metastases cluster more closely together than they do with the primary.

Take home message: don’t assume the metastatic tumours are identical to the primary tumour.  They may be – but may also NOT be.

Next question: Is recurrent cancer the same as the primary?  NO!  They are genetically distinct from their primary tumour.  Most of the tumour dies when therapy is given but clonal tumours resurge at some point.


Medulloblastoma has 4 subgroups, with each subgroup having subtypes. Each subtype can be further divided into a primary type and one or more metastatic types, and recurrent types are different again.

Future work: using Next Gen sequencing here at the GSC to study 1000 RNA-Seq samples!

[Overall, a neat talk and Dr. Taylor is an excellent speaker.  The story is great, and the case is presented very well.  I’m entirely convinced by the data here, and it fits nicely with my own work.  There are probably subtypes in each tumour type, depending on the molecular biology, and those should be more predictive than the classical methods – and the only way to approach proper treatment will be to understand the cancer types individually.  Glad I came to the talk!]

Cancer as a network disease

A lot of my work these days is in trying to make sense of a set of cancer cell lines I’m working on, and it’s a hard project.  Every time I think I make some headway, I find myself running up against a brick wall – mostly because I keep returning to the same old worn out linear cancer signaling pathway models that biochemists like to toss about.

If anyone remembers the biochemical pathway chart you used to be able to buy at the university chem stores (I had one as a wall hanging all through undergrad), you’ll know that we tend to perceive biochemistry in linear terms.  One substrate is acted upon by one enzyme, and the product is then picked up by another enzyme, which acts on it in turn, ad nauseam.  This is the model by which the electron transport chain works, as does the synthesis of most common metabolites.  It is the default model to which I find myself returning when I think about cellular functions.

Unfortunately, biology rarely picks a method because it’s convenient to the biologist.  Once you leave cellular respiration and metabolite synthesis and move on to signaling, nearly all of it, as far as I can tell, works along a network model.  Each signaling protein accepts multiple inputs and is likely able to signal to multiple other proteins, propagating signals in many directions.  My colleague referred to it as a “hairball diagram” this afternoon, which is pretty accurate.  It’s hard to know which connections do what and if you’ve even managed to include all of them into your diagram. (I won’t even delve into the question of how many of the ones in the literature are real.)
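To make the contrast concrete, here’s a toy sketch that counts inputs and outputs per node for a linear pathway versus a small network.  The node names (A to E) and the edges are entirely made up for illustration; they don’t correspond to any real proteins or interactions.

```python
from collections import defaultdict


def degree_table(edges):
    """Return {node: (inputs, outputs)} for a list of directed (source, target) edges."""
    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for source, target in edges:
        fan_out[source] += 1
        fan_in[target] += 1
    nodes = set(fan_in) | set(fan_out)
    return {node: (fan_in[node], fan_out[node]) for node in sorted(nodes)}


# Wall-chart view: each step has at most one input and one output.
linear = [("A", "B"), ("B", "C"), ("C", "D")]

# Hairball view: the same kind of nodes, but signals converge and diverge.
network = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("B", "E"), ("E", "A")]

print(degree_table(linear))
print(degree_table(network))
```

In the linear case no node has more than one input or output; in the network case node C takes two inputs and drives two outputs, and E feeds back to A, which is exactly the kind of wiring the wall chart never prepared me for.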

To me, it rather feels like we’re entering into an era in which systems biology will be the overwhelming force for driving the deep insight.  Unfortunately, our knowledge of systems biology in the human cell is pretty poor – we have pathway diagrams which detail sub-systems, but they are next to impossible to link together. (I’ve spent a few days trying, but there are likely people better at this than I am.)

Thus, every time I use a pathway diagram, I find myself looking at the “choke points” in the diagram – the proteins through which everything seems to converge.  A few classic examples in cancer are AKT, p53, myc and the Mapk’s.  However, the more closely I look into these systems, the more I realize that these choke points are not really the focal points in cancer.  After all, if they were, we’d simply have to come up with drugs that target these particular proteins and voila – cancer would be cured.

Instead, it appears that cancers use much more subtle methods to effect changes on the cell.  Modifying a signaling receptor, which turns on a set of transcription factors that up-regulate proto-oncogenes and down-regulate tumour suppressors, in turn shifting the reception of signals that reinforce this pathway…

I don’t know what the minimum number of changes required is, but if a virus can do it with only a few proteins (EBV uses no more than 3, for instance), then why should a cell require more than that to get started?

Of course, this is further complicated by the fact that in a network model there are even more ways to create that driving mutation.  Tweak a signaling protein here, a receptor there… in no time at all, you can drive the cell into an oncogenic pattern.

However, there’s one saving grace that I can see:  Each type of cell expresses a different set of proteins, which affects the processes available to activate cancers.  For instance inherited mutations to RB generally cause cancers of the eye, inherited BRCA mutations generally cause cancers of the breast and certain translocations are associated with blood cancers.  Presumably this is because the internal programs of these cells are pre-disposed to disruption by these particular pathways, whereas other cell types are generally not susceptible because of a lack of expression of particular genes.

Unfortunately, the only way we’re going to make sense of these patterns is to assemble the interaction networks of the human cells in a tissue specific manner.  It won’t be enough to know where the SNVs are in a cell type, or even which proteins are on or off (although it is always handy to know that).  Instead, we will have to eventually map out the complete pathway – and then be capable of simulating how all of these interactions disrupt cellular processes in a cell-type specific manner.  We have a long way to go, yet.

Fortunately, I think tools for this are becoming available rapidly.  Articles like this one give me hope for the development of methods of exposing all sorts of fundamental relationships in situ.

Anyhow, I know where this is taking us.  Sometime in the next decade, there will need to be a massive bioinformatics project that incorporates all of the information above: Sequencing for variations, indels and structural variations, copy number variations and loss of heterozygosity, epigenetics to discover the binding sites of every single transcription factor, and one hell of a network to tie it all together. Oh, and that project will have to take all sorts of random bits of information into account, such as the theory that cancer is just a p53 aggregation disease (which, by the way, I’m really not convinced of anyhow, since many cancers do not have p53 mutations).  The big question for me is if this will all happen as one project, or if science will struggle through a whole lot of smaller projects.  (AKA, the human genome project big-science model vs. the organized chaos of the academic model.)  Wouldn’t that be fun to organize?

In the meantime, getting a handle on the big picture will remain a vague dream at best, and I tend to think cancer will be a tough nut to crack.  Like my own work, progress will, for the time being, be limited to one pathway at a time.

That doesn’t mean there isn’t hope for a cure – I just mean that we’re at a pivotal time in cancer research.  We now know enough to know what we don’t know and we can start filling in the gaps. But, if we thought next gen sequencing was a deluge of data, the next round of cancer research is going to start to amaze even the physicists.

I think we’re finally ready to enter the realms of real big biology data, real systems biology and a sudden acceleration in our understanding of cancer.

As we say in Canada… “GAME ON!”

Teens and risk taking… a path to learning.

I read an article on the web the other day, in which it was described that teenagers have a different weighting of risk and reward than either young children or adults due to a chemical change that emphasizes the benefits of the rewards, without fully processing the risks.

The idea is that the changes in the adolescent brain emphasize the imagined reward for achieving goals, but fail to equally magnify the resulting negative impulse for the potential outcomes of failure. (I suggest reading the linked article for a better explanation.)

Having once been a teenager myself, this somewhat makes sense to me in terms of how I learned to use computers. A large part of the advantage of learning computers as a child is the lack of fear of “doing something wrong.” If I didn’t know what I was doing, I would just try a bunch of things till something worked, never worrying about the consequences of making a mess of the computer.  I have often taught people who came to computers late in their lives, and the one feature that comes to the forefront is always their (justified) fear of making a mess of their computer.

In fact, that was the greatest difference between my father and me, in terms of learning curve: when encountering an obstacle, my father would stop as though hitting a brick wall until he could find someone to guide him to a solution, while I’d throw myself at it till I found a hole through it, or a way around it. (Rewriting dos config files, editing registries and modifying IRQ settings on add-on boards were not for the faint of heart in the early 90’s.)

As someone now in my 30’s I can see the value of both approaches. My father never did mess up the computer, but managed to get the vast majority of things working. On the other hand, I learned dramatically faster, but did manage to make a few messes – all of which I eventually cleaned up (learning how to fix computers in the process). In fact, learning how to fix your mistakes is often more painful than causing the mistake in the first place, so my father’s method clearly was superior in sheer pain avoidance technique (eg, negative reinforcement).

However, in the long run, I think there’s something to be said for the teen’s approach: you can move much more agilely (is that a word?) if you throw yourself at problems with the full expectation that you’ll just learn how to solve them in the end.  One can’t be a successful researcher if fear of the unknown is what drives you.  And, if you never venture out into the fringes of the field, you won’t make the great discoveries.  Imagine if Columbus hadn’t been willing to test his theories (which were wrong, by the way) about the circumference of the earth – and no, even the ancient Greeks knew that the earth was round.

Incidentally, fear of making a mess of my computer was always the driving fear for me when I first started learning Linux.  Back in the days before good package management, I was always afraid of installing software because I never knew where to put it.  Even worse, however, was the possibility of doing something that would cause an unrecoverable partition or damage hardware – both of which were actual possibilities in those days if you used the wrong settings in your config files.  However, with a distinct risk/reward ratio towards the benefit of getting a working system, I managed to learn enough to dull that fear.  Good package management also meant that I didn’t have to worry about making messes of the software while installing things, but that’s another story.

Anyhow, I’m not sure what this says about communicating with teenagers, but it does reinforce the idea that older researchers (myself included) have to lose some of their fear of failure – or fear of insufficient reward – to keep themselves competitive.

Perhaps this explains why older labs depend upon younger post-docs and grad students to conduct research… and the academic cycle continues.

Thoughts on Andrew G Clark’s Talk and Cancer Genomics

Last night, I hung around late into the evening to hear Dr. Andrew G Clark give a talk focusing on how most of the variations we see in the modern human genome are rare variants that haven’t had a chance to equilibrate into the larger population.  This enormous expansion of rare variants is courtesy of the population explosion of humans since the dawn of the agricultural age, specifically in the past 2000 years at the dawn of modern science and education.

I think the talk was very well done and managed to hit a lot of points that struck home for me.  In particular, my own collected database of human variations in cancers and normals has shown me much of the same information that Dr Clark illustrated using 1000 genome data, as well as information from his 2010 paper on deep re-sequencing.

However interesting the talk was, one particular piece just didn’t click in until after the talk was over.  During a conversation prior to the talk, I described my work to Dr. Clark and received a reaction I wasn’t expecting.  Paraphrased, this is how the conversation went:

Me: “I’ve assembled a very large database, where all of the cancers and normals that we sequence here at the Genome Sciences Centre are stored, so that we can investigate the frequency of variations in cancers to identify mutations of interest.”

Dr. Clark: “Oh, so it’s the same as a HapMap project?”

Me: “Yeah, I guess so…”

What I didn’t understand at the time was that Dr. Clark was asking: “So, you’re just cataloging rare variations, which are more or less meaningless?”  Which is exactly what HapMap projects are: Nothing more than large surveys of human variation across genomes.  While they could be the basis of GWAS studies, the huge amount of rare variants in the modern human population means that many of these GWAS studies are doomed to fail.  There will not be a large convergence of variations causing the disease, but rather an extreme number of rare variations with similar outcomes.

However, I think the problem was that I handled the question incorrectly.  My answer should have touched on the following point:

“In most diseases, we’re stuck using lineages to look for points of interest (variations) passed on from parent to child and the large number of rare variants in the human population makes this incredibly difficult to do as each child will have a significant number of variations that neither parent passed on to them.  However, in cancer, we have the unique ability to compare diseased cancer cells with a matched normal from the same patient, which allows us to effectively mask all of the rare variants that are not contributing to cancer.  Thus, the database does act like a large HapMap database, if you’re interested in studying non-cancer, but the matched-normal sample pairing available to cancer studies means we’re not confined to using it as a HapMap-style database, enabling incredibly detailed and coherent information about the drivers and passengers involved in oncogenesis, without the same level of rare variants interfering in the interpretation of the genome.”

Alas, in the way of all things, that answer only came to me after I heard Dr. Clark’s talk and understood the subtext of his question.  However, that answer is very important on its own.

It means that while many diseases will be hard slogs through the deep rare variant populations (which SNP chips will never be detailed enough to elucidate, by the way, for those of you who think 23andMe will solve a large number of complicated diseases), cancer is bound to be a more tractable disease in comparison!  We will by-pass the misery of studying every single rare variant, which is a sizeable fraction of each new genome sequenced!
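The masking trick itself is simple enough to sketch in a few lines.  This is a minimal illustration only: it assumes variants can be compared as exact (chromosome, position, ref, alt) tuples, the example calls are invented, and a real pipeline would also have to deal with coverage, quality and zygosity, none of which appear here.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_candidates(tumour: Set[Variant], normal: Set[Variant]) -> Set[Variant]:
    """Variants seen in the tumour but not in the same patient's matched normal."""
    return tumour - normal


# Invented example calls, purely for illustration.
normal = {Variant("chr1", 1000, "A", "G"), Variant("chr2", 2000, "C", "T")}
tumour = normal | {Variant("chr3", 3000, "G", "A")}

# The patient's inherited (possibly rare) variants drop out; only the
# tumour-specific call is left to investigate.
print(somatic_candidates(tumour, normal))
```

Without the matched normal, all three calls would look equally interesting, and two of them would be nothing more than the patient’s own rare variants.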

Unfortunately, unlike many other human metabolic diseases that target a single gene or pathway, cancer is really a whole genome disease and is vastly more complex than any other disease.  Thus, even if our ability to zoom in on the “driver” mutations progresses rapidly as we sequence more cancer tissues (and their matched normal samples, of course!), it will undoubtedly be harder to interpret how all of these work and identify a cure.

So, as with everything, cancer’s somatic nature is a double edged sword: it can be used to more efficiently sort the wheat from the chaff, but will also be a source of great consternation for finding cures.

Now, if only I could convince other people of the dire necessity of matched normals in cancer research…