CPHx : Mads Sønderkær, Aalborg University

The Potato Genome Sequence

The potato genome.  Started in 2006, 12 chromosomes, 844 Mb, tetraploid and highly heterozygous.

Genotype 1: Heterozygous diploid.  144 Gb of Illumina, 454, and Sanger.  De novo assembly not yet complete.

Genotype 2: homozygous doubled monoploid (DM), 96.6 Gb of Illumina, 454, and Sanger; 727 Mb assembled, 86% of total estimated size. 133x coverage.

Annotated 40,000 genes, 60% have more than one transcript variant.  But how many of them are correct?  Manually curated a subset of genes.  Candidates were found by similarity search against the DM protein sequences and by keyword search against the functional annotation.  That gave a core set of 169 genes.

Validation.  When it goes well, you get wonderful exon coverage in RNA-seq, detection of polyA sites, and confirmation of exon-intron structure with overlaps.  When it goes wrong, many non-coding regions/exons are predicted, etc…

Of the 169 genes investigated, 76% had a correct transcript model, 9% were not expressed in the tissue of interest, 8% had more than one gene (fusion), and 2% were annotated as split genes.

So, why do we grow potatoes? Starch metabolism.  We grow them to eat them.

Looking at starch metabolism, can you learn something about the two varieties grown?  [brief review of starch metabolism]

Fructose-bisphosphate aldolase.  Expression was measured in both strains.  You can see that there are some tissue-specific differences.  The same happens for starch phosphorylase and alpha and beta amylase.  When these are different in the tubers, you can observe phenotypic changes as well. One strain gives big fat tubers, the other long narrow ones. Perhaps these strains were bred for differences in these genes.

Conclusions:  High degree of tissue specificity for starch genes.  Few isoforms dominate.  Opens possibility for manipulating potato genes for agricultural purposes.

Why I haven’t graduated yet and some corroborating evidence – 50 breast cancers sequenced.

Judging a cancer by its cover (its tissue of origin) may be the wrong approach.  It’s not a publication yet, as far as I can tell, but summaries are flying around about a talk presented at AACR 2011 on Saturday, in which 50 breast cancer genomes were analyzed:

Ellis et al. Breast cancer genome. Presented Saturday, April 2, 2011, at the 102nd Annual Meeting of the American Association for Cancer Research in Orlando, Fla.

I’ll refer you to a summary here, in which some of the results are discussed.  [Note: I haven’t seen the talk myself, but have read several summaries of it.] Essentially, after sequencing 50 breast cancer genomes – and 50 matched normal genomes from the same individuals – they found nothing of consequence.  Everyone knows TP53 and signaling pathways are involved in cancer, and those were the most significant hits.

“To get through this experiment and find only three additional gene mutations at the 10 percent recurrence level was a bit of a shock,” Ellis says.

My own research project is similar in the sense that it’s a collection of breast cancer and matched normal samples, but using cell lines instead of primary tissues.  Unfortunately, I’ve also found a lot of nothing.  There are a couple of genes that no one has noticed before that might turn into something – or might not.  In essence, I’ve been scooped with negative results.

I’ve been working on similar data sets for the whole of my PhD, and it’s at least nice to know that my failures aren’t entirely my fault. This is a particularly difficult set of genomes to work on and so my inability to find anything may not be because I’m a terrible researcher. (It isn’t ruled out by this either, I might add.)  We originally started with a set of breast cancer cell lines spanning across 3 different types of cancer.  The quality of the sequencing was poor (36bp reads for those of you who are interested) and we found nothing of interest.  When we re-did the sequencing, we moved to a set of cell lines from a single type of breast cancer, with the expectation that it would lead us towards better targets.  My committee is adamant  that I be able to show some results of this experiment before graduating, which should explain why I’m still here.

Every week, I poke through the data in a new way, looking for a new pattern or a new gene, and I’m struck by the absolute independence of each cancer cell line.  The fact that two cell lines originated in the same tissue and share some morphological characteristics says very little to me about how they work. After all, cancer is a disease in which cells forget their origins and become, well… cancerous.

Unfortunately, that doesn’t bode well for research projects in breast cancer.  No matter how many variants I can filter through, at the end of the day, someone is going to have to figure out how all of the proteins in the body interact in order for us to get a handle on how to interrupt cancer-specific processes.  The (highly overstated) announcement of p53’s tendency to mis-fold and aggregate is just one example of these mechanisms – but only the first step in getting to understand cancer. (I also have no doubts that you can make any protein mis-fold and aggregate if you make the right changes.)  The pathway-driven approach to understanding cancer is much more likely to yield tangible results than the genome-based approach.

I’m not going to say that GWAS is dead, because it really isn’t.  It’s just not the right model for every disease – but I would say that Ellis makes a good point:

“You may find the rare breast cancer patient whose tumor has a mutation that’s more commonly found in leukemia, for example. So you might give that breast cancer patient a leukemia drug,” Ellis says.

I’d love to get my hands on the data from the 50 breast cancers, merge it with my database, and see what features those cancers do share with leukemia.  Perhaps that would shed some light on the situation.  In the end, cancer is going to be more about identifying targets than understanding its (lack of) common genes.

Thoughts on Andrew G Clark’s Talk and Cancer Genomics

Last night, I hung around late into the evening to hear Dr. Andrew G Clark give a talk focusing on how most of the variations we see in the modern human genome are rare variants that haven’t had a chance to equilibrate into the larger population.  This enormous expansion of rare variants is courtesy of the population explosion of humans since the dawn of the agricultural age, specifically in the past 2000 years at the dawn of modern science and education.

I think the talk was very well done and managed to hit a lot of points that struck home for me.  In particular, my own collected database of human variations in cancers and normals has shown me much of the same information that Dr. Clark illustrated using 1000 Genomes data, as well as information from his 2010 paper on deep re-sequencing.

However interesting the talk was, one particular piece just didn’t click in until after the talk was over.  During a conversation prior to the talk, I described my work to Dr. Clark and received a reaction I wasn’t expecting.  Paraphrased, this is how the conversation went:

Me: “I’ve assembled a very large database, where all of the cancers and normals that we sequence here at the genome science centre are stored, so that we can investigate the frequency of variations in cancers to identify mutations of interest.”

Dr. Clark: “Oh, so it’s the same as a HapMap project?”

Me: “Yeah, I guess so…”

What I didn’t understand at the time was that Dr. Clark was really asking: “So, you’re just cataloging rare variations, which are more or less meaningless?”  Which is exactly what HapMap projects are: nothing more than large surveys of human variation across genomes.  While they could be the basis of GWAS studies, the huge number of rare variants in the modern human population means that many of these GWAS studies are doomed to fail.  There will not be a large convergence of variations causing the disease, but rather an extreme number of rare variations with similar outcomes.

However, I think the problem was that I handled the question incorrectly.  My answer should have touched on the following point:

“In most diseases, we’re stuck using lineages to look for points of interest (variations) passed on from parent to child, and the large number of rare variants in the human population makes this incredibly difficult to do, as each child will have a significant number of variations that neither parent passed on to them.  However, in cancer, we have the unique ability to compare diseased cancer cells with a matched normal from the same patient, which allows us to effectively mask all of the rare variants that are not contributing to cancer.  Thus, the database does act like a large HapMap database if you’re interested in studying non-cancer, but the matched-normal sample pairing available to cancer studies means we’re not confined to using it as a HapMap-style database.  That pairing enables incredibly detailed and coherent information about the drivers and passengers involved in oncogenesis, without the same level of rare variants interfering in the interpretation of the genome.”

Alas, in the way of all things, that answer only came to me after I heard Dr. Clark’s talk and understood the subtext of his question.  However, that answer is very important on its own.

It means that while many diseases will be hard slogs through the deep rare variant populations (which SNP chips will never be detailed enough to elucidate, by the way, for those of you who think 23andMe will solve a large number of complicated diseases), cancer is bound to be a more tractable disease in comparison!  We will by-pass the misery of studying every single rare variant, which is a sizeable fraction of each new genome sequenced!
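To make the masking idea concrete, here is a minimal sketch of what I mean by using the matched normal as a filter. The variant tuples and coordinates are made up for illustration, and the function name is hypothetical; a real pipeline also has to deal with coverage, purity and calling errors, none of which are handled here.

```python
# Variants are represented as (chromosome, position, ref, alt) tuples.
# All values below are invented for illustration only.

def somatic_candidates(tumour_variants, normal_variants):
    """Return tumour variants that are absent from the matched normal."""
    return set(tumour_variants) - set(normal_variants)

tumour = {
    ("chrA", 1000, "C", "T"),   # seen only in the tumour
    ("chrB", 2500, "A", "G"),   # rare germline variant, also in the normal
    ("chrC", 500, "G", "C"),    # rare germline variant, also in the normal
}
normal = {
    ("chrB", 2500, "A", "G"),
    ("chrC", 500, "G", "C"),
}

somatic = somatic_candidates(tumour, normal)
print(somatic)
```

Every rare inherited variant drops out in the subtraction, which is exactly the step that has no equivalent when studying a non-cancer disease.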

Unfortunately, unlike many other human metabolic diseases that target a single gene or pathway, cancer is really a whole genome disease and is vastly more complex than any other disease.  Thus, even if our ability to zoom in on the “driver” mutations progresses rapidly as we sequence more cancer tissues (and their matched normal samples, of course!), it will undoubtedly be harder to interpret how all of these work and identify a cure.

So, as with everything, cancer’s somatic nature is a double edged sword: it can be used to more efficiently sort the wheat from the chaff, but will also be a source of great consternation for finding cures.

Now, if only I could convince other people of the dire necessity of matched normals in cancer research…

VanBUG: Andrew G Clark, Professor of Population Genetics, Cornell University

[My preamble to this talk is that I was fortunate enough to have had the opportunity to speak with Dr. Clark before the talk along with a group of students from the Bioinformatics Training Program.  Although asked to speak today on the subject of the 1000 Genomes work that he’s done, I was able to pose several questions to him, including “If you weren’t talking about 1000 Genomes, what would you have been speaking about instead?”  I have to admit, I had a very interesting tour of the chemistry of Drosophila mating, parent-specific gene expression in progeny and even some chicken expression.  Rarely has 45 minutes of science gone by so quickly.  Without further ado (and with great respect to Rodrigo Goya, who is speaking far too briefly – and at a ridiculous speed – on RNA-seq and alternative splicing in cancer before Dr. Clark takes the stage), here are my notes.]

Human population genomics with large sample size and full genome sequences

Talking about two projects – one sequencing a large number of genomes (1000 Genomes project), the other sequencing a very large number of samples in only 2 genes (Rare Variant studies).

The ability to predict phenotype from genotype is still small – where is the heritability?  Using simple SNPs is insufficient to figure out disease and heritability.  Perhaps it’s rare variation that is responsible.  That launched the 1000 Genomes project.

1000 Genomes was looking to find variants down to 1% frequency in the population.   (In accessible regions)

See Nature for the pilot project publication of the 1000 Genomes project. This included several trios (parents and child).  Found more than 15M SNPs across the human genome.  Biggest impact, however, has been on informatics – how do you deal with that large a volume of SNPs?  SNP calling, alignment, codification, etc…

Many of the standard file formats, etc. came from the 1000 Genomes groups working on that data. Biggest issue is (of course) to avoid mapping to the wrong reference!  “High quality mismatches” ->  Many false positives that failed to validate: misalignments of reads.  Read length improvements helped keep this down, as did using the insertions found in other 1000 Genomes project subjects.

Tuning of SNP calling made a big difference.  Process with validations made a significant impact.  However, rare SNPs are still hard to call.

Novel SNPs tend to be population specific.  E.g. Yoruban vs. European have different patterns of SNPs.  There is a core of common SNPs, but each has its own distribution of the rare or population-specific SNPs.

“Imputation” using haplotype information (phasing) was a key item for making sense of the different sources of the data.

Great graph on frequency spectrum.  (Number of variants – log vs allele frequency (0.01 – 1)) Gives a hockey stick lying flat.  Lots of very rare SNPs, decreasing towards 1, but a spike at 1.
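[For reference, the textbook baseline for that shape: under the standard neutral model with a constant population size, the expected number of sites where the derived allele appears i times in a sample is proportional to 1/i. The sketch below just tabulates that expectation. It is not the analysis shown in the talk, it does not reproduce the spike at frequency 1, and the function name and `theta` scaling parameter are my own placeholders.]

```python
def neutral_sfs(n_chromosomes, theta=1.0):
    """Expected count of sites with derived allele count i (i = 1..n-1)
    under the standard neutral, constant-size model: theta / i."""
    return [theta / i for i in range(1, n_chromosomes)]

sfs = neutral_sfs(10)
for count, expected in enumerate(sfs, start=1):
    # Singletons (count == 1) are the largest class; it falls off as 1/i.
    print(count, round(expected, 3))
```

An excess of rare variants, as described later in the talk, would show up as more low-count sites than this baseline predicts.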

>100kb from each gene there is reduced variation (eg, Transcription start site.)

Some discussion of recombination hotspots, which were much better mapped by using the 1000 genome project data.

Another application: de novo mutation.  Identify variations in the offspring that are not found in either parent.   Roughly about 1000 mutations per gamete.  ~3×10^-8 substitution per generation.
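[A minimal sketch of the trio logic as I understand it: a site is a de novo candidate when the child carries an allele that neither parent has. The genotypes and site names below are invented, and the function is hypothetical. In real data most such candidates turn out to be calling errors, so this is only the first filter.]

```python
def is_de_novo_candidate(child_gt, mother_gt, father_gt):
    """True if the child carries an allele seen in neither parent."""
    parental_alleles = set(mother_gt) | set(father_gt)
    return any(allele not in parental_alleles for allele in child_gt)

# site name -> (child genotype, mother genotype, father genotype)
trio_sites = {
    "site_1": (("A", "G"), ("A", "A"), ("A", "G")),  # G inherited from father
    "site_2": (("C", "T"), ("C", "C"), ("C", "C")),  # T seen in neither parent
}

candidates = [
    name
    for name, (child, mother, father) in trio_sites.items()
    if is_de_novo_candidate(child, mother, father)
]
print(candidates)
```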

1000 Genomes project is now expanding to 2500 samples.  Trying to distribute across 25 population groups, with 100 individuals per group.

Well, what do we expect to discover from ultra-deep sampling?

There are >3000 mutations in dystrophin.  (Ascertained cases of muscular dystrophy. – Flanagan et al, 2009, Human Mutation)

If you think of any gene, you can expect to find every gene mutated at every point across every population… eventually.  [Actually, I do see this in most genes, but not all… some are hyper conserved, if I’ve interpreted it correctly.]

Major problem, tho: sequencing error.  If you’re sampling billions of base pairs, with 1/100,000 error rate, you’ll still find bad base calls!
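[Back-of-envelope version of that point. The 3 billion bases is my own stand-in for "billions of base pairs"; the 1/100,000 error rate is the figure from the talk.]

```python
def expected_error_calls(bases_sequenced, per_base_error_rate):
    """Expected number of erroneous base calls, assuming independent errors."""
    return bases_sequenced * per_base_error_rate

errors = expected_error_calls(3_000_000_000, 1 / 100_000)
print(f"{errors:,.0f} expected bad base calls")
```

That is tens of thousands of bad calls, which is a lot of noise when the signal you are after is a variant seen once or twice.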

Alex Coventry: There are only 6 types of heterozygotes (CG, CT, GT, AC, AG, AT)… ancient technology, not getting into it – was developed for Sanger.
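[The count of six is just the number of ways to pick two different bases out of four, ignoring order. A quick enumeration:]

```python
from itertools import combinations

# Unordered pairs of distinct bases, i.e. the possible heterozygous genotypes.
heterozygotes = ["".join(pair) for pair in combinations("ACGT", 2)]
print(heterozygotes)
```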

Studied HHEX and KCNJ11 genes, sequenced in 13,715 people. Validated by Barcoding and 454 sequencing.

Using the model from Alex’s work, you could compute a posterior probability for each SNP.  Helped in validating.  When dealing with rare variants, there isn’t a lot of information.
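[I don’t have the details of the actual model, so what follows is only a generic Bayes’ rule sketch of what "a posterior probability for a SNP" can mean, not the method from the talk. It assumes read counts at a single site, a binomial likelihood, and only two hypotheses (heterozygous vs. homozygous reference with sequencing error). The prior and error rate are placeholder values I picked.]

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def het_posterior(alt_reads, depth, prior_het, error_rate):
    """Posterior probability that a site is heterozygous, given alt read count.

    Heterozygous sites are assumed to show the alt allele in half the reads;
    homozygous reference sites show it only through sequencing error.
    """
    like_het = binom_pmf(alt_reads, depth, 0.5)
    like_ref = binom_pmf(alt_reads, depth, error_rate)
    numerator = prior_het * like_het
    return numerator / (numerator + (1 - prior_het) * like_ref)

strong = het_posterior(alt_reads=10, depth=20, prior_het=0.001, error_rate=0.01)
weak = het_posterior(alt_reads=1, depth=20, prior_het=0.001, error_rate=0.01)
print(strong, weak)
```

With a rare-variant prior this low, a single supporting read leaves the posterior tiny, which matches the comment above that there isn’t much information to work with.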

The punchline: “There are a lot of rare SNPs out there!”

Some data shown (site frequency) as sample data increases.  The vast majority of what you get in the long run is the rare SNPs.

Human rare variation is “in excess” of what you’d expect from classical theory.  So why are there so many variants?

Historical population was small, but underwent a recent population explosion in the last 2000 years. This allows for a rapid diversity to be generated as each new generation has new variants, and no dramatic culls to force this rare variation to consolidate.

How many excess rare variants would you expect from the population explosion?  (Gutenkunst et al, 2009, PLoS Genetics)  Population has expanded 100x in about 100 generations.  Thus, we see the core set, which were present in the population before the explosion, followed by the rapid diversification explosion of rare SNPs.

You can do age inference, then, with the frequency of SNPs.  Older SNPs must be present across more of the population.  Very few SNVs are older than 100 generations.  If you fit the population model back to the expected SNV frequency 100 generations ago, the current data fits very well.
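[Quick arithmetic on those numbers: if the growth were spread evenly (my simplifying assumption, not something stated in the talk), a 100x expansion over 100 generations works out to a bit under 5% growth per generation.]

```python
def per_generation_growth(fold_expansion, generations):
    """Constant per-generation growth factor giving the stated total expansion."""
    return fold_expansion ** (1 / generations)

growth = per_generation_growth(100, 100)
print(f"growth factor per generation: {growth:.4f}")
```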

When fitting to effective sample size of humans, you can see that we’re WAY out of equilibrium from what the common snps would suggest.  [I’m somewhat lost on this, actually.  Ne (parent) vs n (offspring).  I think the point is that we’ve not yet seen consolidation (coalescence?) of SNPs.]

“Theory of Multiple Mergers”  Essentially, we have a lot of branches that haven’t had the chance to blend – each node on the variation tree has a lot of unique traits (SNPs) independent of the ancestors.  (The bulk of the weight of the branch lengths is in the many many leaves at the tips of the trees.)

[If that didn’t make sense, it’s my fault – the talk is very clear, but I don’t have the population genetics vocabulary to explain this on the fly.]

What proportion of SNPs found in each new full genome sequence do we expect to be novel? (For each human.)  “It’s a fairly large number.”  It’s about 5-7%, with outliers from 3-17%.  [I see about the same for my database, which is neat to confirm.]  Can fit this to models: constant population size would give a low fraction (0.1%), with explosive model (1.4%) over very large sample sizes.

Rare variants are enriched for non-synonymous and premature terminations (Marth et al, submitted) [Cool – not surprising, and very confounding if you don’t take population frequency into account in your variant discovery.]

What does this mean in complex diseases?  Many of our diseases are going to be caused by rare variants, rather than common variants.  Analogy of jets that have 4x redundancy, versus humans with 2x redundancy at the genome level.


  • Human population has exploded, and this has had a huge effect on rare variations.
  • Huge samples must be sequenced to detect and test effects.
  • Will impact our studies of diseases, as we have to come to terms with the effects of the rare variations.

[Great talk!  I’ve enjoyed this tremendously!]