VanBUG: Andrew G Clark, Professor of Population Genetics, Cornell University

[My preamble to this talk is that I was fortunate enough to have had the opportunity to speak with Dr. Clark before the talk along with a group of students from the Bioinformatics Training Program.  Although asked to speak today on the subject of the 1000 Genomes work that he’s done, I was able to pose several questions to him, including “If you weren’t talking about 1000 Genomes, what would you have been speaking about instead?”  I have to admit, I had a very interesting tour of the chemistry of Drosophila mating, parent-specific gene expression in progeny and even some chicken expression.  Rarely has 45 minutes of science gone by so quickly.  Without further ado (and with great respect to Rodrigo Goya, who is speaking far too briefly – and at a ridiculous speed – on RNA-seq and alternative splicing in cancer before Dr. Clark takes the stage), here are my notes.]

Human population genomics with large sample size and full genome sequences

Talking about two projects – one sequencing a large number of genomes (1000 Genomes project), the other sequencing a very large number of samples in only 2 genes (Rare Variant studies).

The ability to predict phenotype from genotype is still small – where is the heritability?  Using simple SNPs is insufficient to figure out disease and heritability.  Perhaps it’s rare variation that is responsible.  That launched the 1000 Genomes project.

The 1000 Genomes project was looking to find variants down to a frequency of 1% in the population (in accessible regions).

See Nature for the pilot project publication of the 1000 Genomes project.  This included several trios (parents and child).  Found more than 15M SNPs across the human genome.  Biggest impact, however, has been on informatics – how do you deal with that large a volume of SNPs?  SNP calling, alignment, codification, etc.

Many of the standard file formats, etc. came from the 1000 Genomes groups working on that data. Biggest issue is (of course) to avoid mapping to the wrong reference!  “High quality mismatches” ->  Many false positives that failed to validate: misalignments of reads.  Read length improvements helped keep this down, as did using the insertions found in other 1000 Genomes project subjects.

Tuning of SNP calling made a big difference.  Process with validations made a significant impact.  However, rare SNPs are still hard to call.

Novel SNPs tend to be population specific.  E.g. Yoruban vs. European have different patterns of SNPs.  There is a core of common SNPs, but each has its own distribution of the rare or population specific SNPs.

“Imputation” using haplotype information (phasing) was a key item for making sense of the different sources of the data.

Great graph on frequency spectrum.  (Number of variants – log vs allele frequency (0.01 – 1))  Gives a hockey stick lying flat.  Lots of very rare SNPs, decreasing towards 1, but with a spike at 1.

>100kb from each gene there is reduced variation (eg, Transcription start site.)

Some discussion of recombination hotspots, which were much better mapped by using the 1000 Genomes project data.

Another application: de novo mutation.  Identify variants in the offspring that are not found in either parent.   Roughly about 1000 mutations per gamete.  ~3×10^-8 substitution per generation.

1000 Genomes project is now expanding to 2500 samples.  Trying to distribute across 25 population groups, with 100 individuals per group.

Well, what do we expect to discover from ultra-deep sampling?

There are >3000 mutations in dystrophin.  (Ascertained cases of muscular dystrophy. – Flanagan et al, 2009, Human Mutation)

If you think of any gene, you can expect to find it mutated at every point across every population… eventually.  [Actually, I do see this in most genes, but not all… some are hyper-conserved, if I’ve interpreted it correctly.]

Major problem, tho: sequencing error.  If you’re sampling billions of base pairs, with 1/100,000 error rate, you’ll still find bad base calls!
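As a back-of-the-envelope illustration of that point: the expected number of bad base calls is simply the number of bases sequenced times the per-base error rate. The 3 billion bases used below is an assumed round figure for illustration only, not a number from the talk.

```python
# Expected number of miscalled bases = bases sequenced x per-base error rate.
# bases_sequenced is an assumed, illustrative value (not from the talk).
bases_sequenced = 3_000_000_000
error_rate = 1 / 100_000  # the 1/100,000 rate quoted above

expected_errors = bases_sequenced * error_rate
print(f"Expected erroneous base calls: {expected_errors:,.0f}")
```

Even at that low error rate, the expected count runs to tens of thousands of wrong calls, which is on the same scale as the rare variants one is trying to find.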

Alex Coventry: There are only 6 types of heterozygotes (CG, CT, GT, AC, AG, AT)… ancient technology, not getting into it – was developed for Sanger.
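The count of six follows from choosing 2 distinct bases out of 4 without regard to order. A tiny sketch to enumerate them (plain combinatorics, not anything specific to the method from the talk):

```python
from itertools import combinations

# Unordered pairs of distinct bases: C(4, 2) = 6 possible heterozygotes.
bases = "ACGT"
heterozygotes = ["".join(pair) for pair in combinations(bases, 2)]
print(len(heterozygotes), heterozygotes)
```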

Studied HHEX and KCNJ11 genes, sequenced in 13,715 people. Validated by Barcoding and 454 sequencing.

Using the model from Alex’s work, you can compute a posterior probability for each SNP.  This helped in validating.  When dealing with rare variants, there isn’t a lot of information.
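To make the idea of a per-SNP posterior concrete, here is a minimal textbook-style sketch. It is NOT the model from Alex Coventry's work (the talk did not go into its details). It assumes a simple binomial read model, Hardy-Weinberg priors, and made-up defaults for the error rate and allele frequency; the function name and parameters are hypothetical.

```python
from math import comb

def genotype_posteriors(alt_reads, total_reads, error_rate=0.01, alt_freq=0.001):
    """Posterior over (hom_ref, het, hom_alt) given read counts at one site.

    Generic binomial likelihood with Hardy-Weinberg priors. All defaults
    are illustrative assumptions, not values from the talk.
    """
    # Chance that a single read shows the alternate allele, per genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    q = alt_freq
    prior = {"hom_ref": (1 - q) ** 2, "het": 2 * q * (1 - q), "hom_alt": q ** 2}

    unnormalised = {}
    for genotype, p in p_alt.items():
        likelihood = (comb(total_reads, alt_reads)
                      * p ** alt_reads
                      * (1 - p) ** (total_reads - alt_reads))
        unnormalised[genotype] = prior[genotype] * likelihood

    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}

balanced = genotype_posteriors(alt_reads=9, total_reads=20)
single_read = genotype_posteriors(alt_reads=1, total_reads=20)
print(balanced)
print(single_read)
```

Under these assumptions, a roughly balanced site (9 of 20 reads) comes out as a heterozygote, while a single discordant read is better explained as sequencing error, because the prior for a rare variant is so small.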

The punchline: “There are a lot of rare SNPs out there!”

Some data shown (site frequency) as sample data increases.  The vast majority of what you get in the long run is the rare SNPs.

Human rare variation is “in excess” of what you’d expect from classical theory.  So why are there so many variants?
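For reference, the usual "classical" baseline is the standard neutral model with constant population size, under which the expected number of sites where the derived allele appears i times in a sample of n chromosomes is theta / i. The sketch below assumes that model; the sample size and theta are arbitrary illustrative values, not data from the talk.

```python
def neutral_sfs(n_chromosomes, theta):
    """Expected site frequency spectrum under the standard neutral model.

    Returns the expected number of sites at derived-allele count i,
    for i = 1 .. n_chromosomes - 1 (each entry is theta / i).
    """
    return [theta / i for i in range(1, n_chromosomes)]

# Illustrative values only.
sfs = neutral_sfs(n_chromosomes=20, theta=10.0)
singleton_fraction = sfs[0] / sum(sfs)
print(f"Expected fraction of singletons: {singleton_fraction:.3f}")
```

An "excess" of rare variation means that the observed counts in the lowest frequency classes sit well above this 1/i curve.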

Historical population was small, but underwent a recent population explosion in the last 2000 years. This allows diversity to be generated rapidly, as each new generation has new variants and there are no dramatic culls to force this rare variation to consolidate.

How many excess rare variants would you expect from the population explosion?  (Gutenkunst et al, 2009, PLOS Genetics)  Population has expanded 100x in about 100 generations.  Thus, we see the core set, which were present in the population before the explosion, followed by the rapid diversification explosion of rare SNPs.
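To get a feel for what a 100x expansion in 100 generations implies, here is a quick calculation of the per-generation growth. It assumes smooth exponential growth, which is a simplification made here for illustration and not necessarily the demographic model used in the talk.

```python
import math

fold_expansion = 100   # 100x expansion, as quoted above
generations = 100      # over about 100 generations

# Assuming constant exponential growth: N_t = N_0 * factor ** t
per_generation_factor = fold_expansion ** (1 / generations)
growth_rate = math.log(fold_expansion) / generations  # continuous-time rate

print(f"Growth factor per generation: {per_generation_factor:.4f}")
print(f"Exponential rate per generation: {growth_rate:.4f}")
```

That works out to a little under 5% growth per generation, sustained for a hundred generations.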

You can do age inference, then, with the frequency of SNPs.  Older SNPs must be present across more of the population.  Very few SNVs are older than 100 generations.  If you fit the population model back to the expected SNV frequency 100 generations ago, the current data fits very well.

When fitting to effective sample size of humans, you can see that we’re WAY out of equilibrium from what the common SNPs would suggest.  [I’m somewhat lost on this, actually.  Ne (parent) vs n (offspring).  I think the point is that we’ve not yet seen consolidation (coalescence?) of SNPs.]

“Theory of Multiple Mergers”  Essentially, we have a lot of branches that haven’t had the chance to blend – each node on the variation tree has a lot of unique traits (SNPs) independent of the ancestors.  (The bulk of the weight of the branch lengths is in the many many leaves at the tips of the trees.)

[If that didn’t make sense, it’s my fault – the talk is very clear, but I don’t have the population genetics vocabulary to explain this on the fly.]

What proportion of SNPs found in each new full genome sequence do we expect to be novel? (For each human.)  “It’s a fairly large number.”  It’s about 5-7%, with outliers from 3-17%.  [I see about the same for my database, which is neat to confirm.]  Can fit this to models: constant population size would give a low fraction (0.1%), with explosive model (1.4%) over very large sample sizes.

Rare variants are enriched for non-synonymous and premature terminations (Marth et al., submitted) [Cool – not surprising, and very confounding if you don’t take population frequency into account in your variant discovery.]

What does this mean in complex diseases?  Many of our diseases are going to be caused by rare variants, rather than common variants.  Analogy of jets that have 4x redundancy, versus humans with 2x redundancy at the genome level.


  • Human population has exploded, and this has had a huge effect on rare variation.
  • Huge samples must be sequenced to detect and test effects.
  • Will impact our studies of diseases, as we have to come to terms with the effects of the rare variations.

[Great talk!  I’ve enjoyed this tremendously!]