AGBT talk: Zhong Wang, Joint Genome Institute

Title: Massive Metagenomic Discovery of Biomass-Degrading Genes and Genomes from Cow Rumen

First slide: “biofuels”, “cellulosic Ethanol”, and “genomics” [I think I see where this is going. This was the hot topic in 2003, but I haven’t heard much about it since.]

Overview of Lignocellulose structure and cellulase.  [My shorthand – lignocellulose is broken down by a whole series of enzymes, each of which breaks down a different bond in the chain depending on branch points, etc.  It is also in a semi-crystalline state, which is hard to break down.]

All cellulase we use industrially comes from one source: fungi.

[Oh my… fistulated cow.  I remember that from visiting universities when I was in high school. It’s a cow with a hole in its side so you can get into its stomachs any time.]

Using cow to digest switchgrass, and looking for microbes that do the breakdown.

[Odd, wouldn’t you want to do that with corn cellulose, which is plentiful and a waste product of animal feed/etc?]

Did 3 billion reads, 300,000Mb  (1/4 TB of sequence).  Hoping to find new enzymes in this.  [On the other hand, wtf do you do with that much sequence?]  This was like a monster!
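[A quick sanity check on those numbers. This is my own back-of-envelope arithmetic, not something from the slides, and it assumes one byte per base with no quality scores or headers.]

```python
reads = 3_000_000_000          # 3 billion reads, as quoted in the talk
bases = 300_000 * 1_000_000    # 300,000 Mb expressed in bases

mean_read_length = bases / reads   # bases per read
size_tib = bases / 2**40           # assumption: 1 byte per base, binary terabytes

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"raw sequence: {size_tib:.2f} TiB")
```

So the figures hang together: roughly 100 bp per read, and a bit over a quarter of a terabyte before any metadata.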

Taming the monster: prediction: needed huge hardware. [skipping this…] More cellulases found than other studies. A comparison of Carbohydrate Active Enzymes (CAZy) database.  More found in rumen than were collected in database between 1975-2009.

Diversity: very pretty picture of family tree of cellulases.  Found many new branches – and those found were highly diverged [ which makes sense to me, since the microbiome sequencing this morning said that gut bacteria were the only ones that were really most strongly diverged…. ]

Functional validation.  Panel of cellulase substrates, plus cow rumen enzymes.  Higher the activity, the more novel.

Did they get to the bottom of the metagenome? From saturation plot, it’s linear, never saturates out.
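[For anyone who hasn't met a saturation plot: you count how many distinct things you have found as you sample more reads. A minimal sketch of the idea follows. The function name and the toy labels are mine, not from the talk.]

```python
def saturation_curve(labels, step):
    """Return (reads_sampled, distinct_labels_seen) at every `step` reads."""
    seen = set()
    curve = []
    for i, label in enumerate(labels, start=1):
        seen.add(label)
        if i % step == 0 or i == len(labels):
            curve.append((i, len(seen)))
    return curve

# Toy data (made up): one sample keeps re-drawing the same 3 families,
# the other finds a new family with every read.
saturating = ["GH5", "GH9", "GH48"] * 4
non_saturating = [f"family_{i}" for i in range(12)]

print(saturation_curve(saturating, 4))
print(saturation_curve(non_saturating, 4))
```

The first curve goes flat at 3, the second keeps climbing in a straight line. The rumen data apparently looks like the second case, which is why more sequencing would still find new genes.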

Image “Look what I found in the cow!”

Summary: a large number of cellulases were predicted, found and tested and many have excellent potential for new industrial uses.

Community complexity: cow is intermediate between extreme environments (mine water run off) and soil communities.

Assembly: Used Velvet, 1.93Gb sequences assembled. 47 scaffolds match NCBI, which is only 0.03%.  We know very little about this community.

[on a side note, does “fistulating” a cow change the gut flora community???  that would add other odd questions about the diversity of the cow, and particularly oxygen sensitive members of the community, but I guess those enzymes are mostly useless to us.]

Were able to estimate completeness of some assemblies – one example shown at 89.8% with “genome binning”.  With random binning, you do worse.

From the cow rumen metagenome, were able to assemble 15 good draft genomes. (1.8-3.3 Mb)

Did “Single cell genome sequencing”..  Match reads to assembled scaffold : from single organism.  So, it works.

Conclusion: despite super deep sequencing, were only able to assemble 15 genomes.  PacBio may help.  Have already tested some PacBio long reads, which do help further assemble.  90% of PacBio reads to validate and resolve outstanding assembly problems.

[Neat and thought-provoking talk!]

Question: Have you sampled other cows? (nope this was all from one cow!)

Mark Akeson – Baskin School of Engineering, UCSC

Nanopore DNA sequencing: Precision and Control

[I don’t think this is one of my better note sets – the technology is neat, the results are fun to watch, but you can’t capture such a rich data stream in a blog… sorry.]

Two types of nanopore sequencing: Exonuclease Sequencing, Strand Sequencing.

Exonuclease cleaves bases, so you don’t move backwards, strand sequencing converts ssDNA to dsDNA so it doesn’t move backwards through the pore.

Not going to talk much about how channels work – basic idea, charge potential across membrane, resistance changes as things move through. Charge per unit area per unit time is VERY strong.

Good news: non-covalent chemistry determines currents for GATC. (non-covalent chemistry, not size, dictates current.)

History of nanopore seq.

  • (John Kasianowicz @NIST) alpha hemolysin pore to measure current. Worked molecule by molecule – not single base.
  • Simulation showed single bases are passing through in single file.
  • Wild Type alpha hemolysin pore.  Ten nucleotides contribute to pore resistance.
    • convoluted by 3 “reading heads”
  • Did protein engineering to modify by site directed mutagenesis till they got one that could distinguish all 4 bases. (Bayley group)
  • Jens Gundlach lab: Moved to MspA.  Analogy: alpha hemolysin is more like a champagne flute, whereas MspA has a 1nt-wide gap at the bottom of a shot-glass-like pore
    • Better separation of CTA, but AG still overlap; still far better than hemolysin.

Polymerase and nanopores:

DNA replication in a crystal (A family polymerase).  [Ok, that’s cool, the crystal is still active, so you can take images over time to observe the chemistry happen!]

Sub-millisecond active control of DNA template control.  Serendipitous discovery: At end of peak, there’s a voltage change IF the enzyme is departing, so it works as a good control.

Neat experiment where ssDNA is bounded by dsDNA on both sides of the pore, can watch the processes back and forth.

Tethering polymerase to pore isn’t bad – can kill pores.

Blocking Oligomers [Very graphical flow. I can’t describe this fast enough.]

Polymerase catalyzed synthesis to +12 Endpoint (Klenow fragment) and shown in a movie.  Pulsing 20s intervals.

is phi29DNAP better? 2ms vs 2 seconds.  Applied voltage causes phi29DNAP to tease apart dsDNA absent catalysis.

[This whole talk is images of results, by the way, nearly impossible to get much down that explains the content.]

Movie and results (Lieberman et al, JACS 2010)

A ‘Branton’ test: use mix of dntp w ddATP.

Basically, all of the requirements for what’s needed to do nanopore sequencing … but lots left to do.

  • Sequencing and re-sequencing individual native DNA
  • Read lengths longer than industry standard
  • Sequencing across ethernet.

AGBT talk: David Jaffe, Broad Institute of MIT and Harvard

Title: High-Quality Draft Assemblies of a Dozen Vertebrate Genomes from Massively Parallel Sequence Data

[EDIT (2011-02-10): Note the speaker has provided some clarifications to these notes in the comments below.  I have struck out comments that have been clarified, and made references to the author’s comments where warranted.]

[I walked in at 2:00 as the 454 seminar was wrapping up, instead of 2:30, so I’ve been waiting a while for this talk… the suspense has been building!]

In 2000, vertebrate sequencing cost ~$2,000,000 per genome.  29 were done.

By 2010, BGI used SOAP denovo to do more genomes using illumina alone.  Coverage of segmental duplications, etc were raised as issues.

Goal: “Convince you that it is possible”

evidence: 1.  control genomes (human + mouse) 2. new vertebrate genomes.

How do we get there?

  • ALLPATHS-LG (lg is for large genomes)
  • new algorithms need to evolve with the new lab techniques
  • approach: sequence and assemble blind, then compare to reference when available.
  • Goal: each new genome should not be a research project.

Lab recipe: PET 200bp, 2000bp, 6,000bp and 40kbp (which required development of new methods. – poster by Louise Williams)

Algorithm philosophy: Discussion of small/large k values for assembly. (use large k for specificity, small k for sensitivity.)  Goal with allpaths is to use all possible biological information – don’t lose reads that are relevant.
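[The k trade-off is easier to see on a toy example. This is only my illustration of the general k-mer idea, not how ALLPATHS-LG does it; the reads and function names are made up.]

```python
def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmers(read_x, read_y, k):
    """k-mers present in both reads; a non-empty result suggests an overlap."""
    return kmers(read_x, k) & kmers(read_y, k)

# Toy reads (made up): read_b truly overlaps the last 5 bases of read_a,
# read_c only shares a short chance match (ACGT) with read_a.
read_a = "ACGTACGTTAGC"
read_b = "TTAGCGGATCCA"
read_c = "GGGACGTCCCAA"

for k in (4, 5, 6):
    print(k, sorted(shared_kmers(read_a, read_b, k)), sorted(shared_kmers(read_a, read_c, k)))
```

At k=4 both the real overlap and the chance match show up (sensitive, not specific). At k=5 only the real overlap survives. At k=6 even the real 5-base overlap is missed, which is the price of a large k.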

Challenge 1: Reads are inaccurate.  Remove errors, but don’t remove SNPs.   (algorithm discussed. used k to identify read stacks, look outside the k to try to correct poor quality non-matching positions.)
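[A rough sketch of what "fix errors but keep SNPs" could look like on an already-aligned read stack. This is my guess at the flavour of the idea, not the ALLPATHS-LG algorithm; the `min_support` and `low_qual` cutoffs are hypothetical.]

```python
from collections import Counter

def correct_stack(reads, quals, min_support=2, low_qual=20):
    """Correct isolated low-quality mismatches in a stack of aligned reads.

    reads: equal-length strings, already aligned column by column.
    quals: per-base quality scores, one list per read.
    A minority base is kept if at least `min_support` reads carry it
    (possible SNP) or if its quality is not low.
    """
    corrected = [list(r) for r in reads]
    for col in range(len(reads[0])):
        counts = Counter(r[col] for r in reads)
        consensus = counts.most_common(1)[0][0]
        for i, read in enumerate(reads):
            base = read[col]
            if (base != consensus and counts[base] < min_support
                    and quals[i][col] < low_qual):
                corrected[i][col] = consensus
    return ["".join(r) for r in corrected]

# Toy stack (made up): read 2 has a lone low-quality T at column 2,
# reads 3 and 4 share a G at column 4 (looks like a real variant).
stack = ["ACGTAC", "ACGTAC", "ACTTAC", "ACGTGC", "ACGTGC"]
stack_quals = [[30] * 6 for _ in stack]
stack_quals[2][2] = 10

print(correct_stack(stack, stack_quals))
```

The lone low-quality T gets replaced by the consensus G, while the G supported by two reads is left alone.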

[I missed challenge 2… I thought that was all challenge 1.]

Challenge 3: coverage is uneven.

Solutions: Use High Coverage!  improve lab process, improve algorithms. (See Gnirke talk for better methods for GC rich areas. [Unfortunately, I didn’t go to that session.])

[Again, I missed challenge 4, which I couldn’t differentiate from part of challenge 3.]

Challenge 5: You don’t know the exact answer for what the genome says.

Challenge 6: Computations are large: billions of reads, need all of your reads in memory at once for assembly.

Solution: Buy more ram. [Ok, really?  that’s not a challenge, if throwing money at it solves the problem that easily, it’s just a funding issue – EDIT: See comments below.]

On to the experiment: (control)

Sequenced Ms.B6 mouse (finished genome available) and NA12878 cell line (sequenced by 1000 genomes).  Three way comparison: Allpaths-lg, SOAPdenovo and capillary. (No other one is available.) [really?  I thought trans-abyss and velvet were also available and able to do this…. maybe not, I’m not an expert on assembly.]

Criteria: Continuity, Completeness, Accuracy.

Comparison summary: Capillary and allpaths do ok (capillary is slightly better) and soapdenovo is always trailing the pack.

[A whole LOT of slides that look like identical bar graphs – you’ll have to make do with the summary above.]

Doing de-novo assemblies for 15 genomes (7 fish, 8 mammals).  [ooo coelacanth genome!]

Excluding genomes [did he mean scaffolds?] in which 40kb data sets are not good (good def’n by physical coverage 20x or greater).  This removes a lot of the variability in the quality.

“Fish are hard”

Assembly discovers sequences you can’t get any other way.

Even if you do alignment based stuff, there is great value in assembly as a complementary approach.

Allpaths-lg future:  lots of improvement, error can be driven towards zero.

Yes, you can do whole genome assembly of vertebrates.

[I don’t think I learned much new in this talk – but I don’t really see much difference between the assemblers – EDIT:See comments below.  My take home message is that ALLPATHS-LG can be used on larger data sets, but then again, they claimed that they need massive amounts of RAM…  I’ll just suggest people check out the software and decide for themselves.]

AGBT talk: Sophien Kamoun – The Sainsbury Laboratory

Title: Genome Evolution in the Irish Potato Famine Pathogen Lineage

We face a crisis in food production and food prices.  (New York Times article from Feb 4th) If you can’t eat, you don’t worry about cancer.

World population is expected to peak at 9billion, and we’re not producing enough food.  One aspect of that is crop diseases, which could allow millions more to be fed if we could overcome this problem.

One of the most important is the oomycete Phytophthora (Greek: plant destroyer), which kills dicots – tens of billions of dollars worth annually.  World’s biggest potato producer is China.

P. infestans suppresses or triggers plant immunity.  Able to invade cells, and forms structures between cells. (hyphae)

Some plants carry resistance and have an apoptosis-like response.  Resistant plants also suppress the immunity suppressors.  [strange phrasing is all mine.]

Genome sequence of Phytophthora infestans is complete.  Published last year. (Cover of Nature – a rotten potato!)

Compare P infestans genome to others of the family – very large expansion. Number of genes is about the same, but 240Mbp vs. 65-95Mbp.  Much of it is repeat driven.

Effectors (immunity suppressors) typically occur in expanded repeat-rich and gene-poor loci. (Examples :RXLR, AVR4)

Most of the genes in the genome are clustered within 1kb of each other, except for a smattering that occur in the repeat-rich regions.  This is an unusual distribution.

Core orthologs are all in the clustered regular regions, effector genes all seem to be in the repeat-rich regions; again, an unusual distribution.

Some discussion of how the parasite evolved along with “host jumping”.  Resequenced several isolates of 4 related strains. (all of which have the large genome expansion), and compared them.

4-fold number of genes missing in repeat region vs non-repeat regions. Repeat regions are more plastic. Look at dn/ds, and there is also different selection pressures between the two.

Repeat regions are also highly enriched in genes induced during colonization of tomato and potato (Raffaele et al, Science 2010)


  • Core genome- high density region/low repeat content
  • “plastic” region – low gene density/high repeat content
  • high rates of gene turnover and positive selection in the plastic genome
  • “niches” in the genome for rapid effector evolution.
  • Unexpected rapidly-evolving “plastic” genome families – cell wall hydrolases, histone and rRNA methyltransferases. [wasn’t discussed in the talk, as far as I can tell, but interesting nonetheless.]


Using Genomics to improve disease resistance.  Emergence of P. infestans “blue 13” clone which is dominating UK isolates, but was barely present 10 years ago.

Core effectors as targets for resistance.

Synthetic R genes with expanded effector recognition.  (modify potato genes to improve resistance.)  Expansion: An R3a mutant that recognizes both forms of AVR3a (KI and EM) is expected to be effective against all P. infestans isolates.  Did create this in the lab… with some success: some clones were able to trigger the cell death response.


Single residue mutations expand effector recognition.

Non-GM solution through genome editing?

The knowledge of pathogen effectors and comparative genomics is essential.

[Again, a neat talk on a topic I knew nothing about.  Well delivered and very clearly explained.]

AGBT talk: James Giovannoni, Cornell University

Title: Utilization of Next Generation Sequencing for Creation and Exploitation of the Tomato Genome.

[He’s creating tomato genomes?  Odd phrasing, but we’ll see how this works out.]

Think of this as a side dish to the genome technology you’ve been hearing. [nice…]

Tomato has a lot of uses: eating, ketchup, de-skunking.  Most important source of vitamin A and C, simply because of the amount of it that we eat.

Looking for a reference genome – a great biological system for studying fruit ripening.  Synteny of tomato and potato is high.  So, high quality reference for one of these will be a big help.  Related to pepper & eggplant as well.

Also a wide variety of tomato species, in a wide variety of environments.  (Picture shown of the wild progenitor of the tomato, but it’s really hard to see, unfortunately) Originated in South America, but domestication happened in Europe after Cortez brought it back. Modern breeding of tomatoes has all descended from European stock, so there is a LOT of diversity in the Americas that has not yet been tapped.

International consortium working on this. Sequencing efforts started in 2004.  Originally started with a BAC approach.  1500 BACs were sequenced by the consortium.  Things have changed fast, and there is now 454 (31x), Sanger (3.6x), Illumina (82x) and SOLiD (141x) reads.

Assembler strategy covered – (sequencing, filtering and assembly – using all available information).

Very pretty genetic map and FISH slides. Also a summary of metrics from version 1.0 to 2.3, which is currently frozen for publication. Validation summary as well. [I’m sure all of this will be in the paper, so I’m not copying out details.]

An automated annotation pipeline, run by collaborators in Belgium, also frozen for publication.

Sequence is available:

Still working on the sequence – setting a high target.  Approx. 1/3 of gaps can be closed by in silico means (using Celera CABOG assembly).  Using 100 BACs that spanned gaps, most of the sequences match the gap. [I may have missed something]

IMAGE2 used for closure and finishing – Iterative mapping and assembly for gap elimination.  Closed 11 of 12 gaps, and was able to reduce size of 12th gap.  Very resource intensive, tho.


  • duplications common to plant genomes are found here.
  • triplication event for dicot clade
  • etc.

New carotenoid genes with novel tissue-specificities.

Neat explanation/slide of the genetic regulation of the development and maturation of fleshy fruit. Chlorophyll degradation is inversely related to non-photosynthetic pigments.

Decoding the fruit transcriptome using large-scale strand-specific transcriptome sequencing.  Looking at both strands has caused them to revise how they believe expression occurs in the fruit. (skipped over examples in the interest of time.) About 5% of genes needed to be revised when strand-specific was taken into account.

Also doing ChIP-Seq, for tomato epigenome, both TF and histone data. (plant specific transcription factors)

Only fleshy fruits ripen, histone methylation seems to be correlated with regulation of genes involved – processes are tied together. [again, I missed part of the explanation] (Manning et al, 2006, Nat Gen 38)


  • high quality assembly of tomato genome
  • continuing to refine it
  • 97% of assembly is in 91 scaffolds, linked to 12 tomato chromosomes
  • annotation of 35,000 genes.
  • evidence of whole genome duplication events.
  • epigenetic and RNA-Seq providing novel insights into control of fruit ripening.

[Interesting talk… I knew nothing about the tomato, and little about fruit ripening beyond what you learn in undergrad… There seem to be good papers on this, which would make a nice blog entry one day.]

AGBT – page views

In case anyone was wondering what traffic to my blog looks like during AGBT season, I think this image is relatively informative.   I don’t have the ones from the last couple of years’ AGBT conferences, but it looks about the same.  Fortunately, my computer seems to be coping with the load – and my wife hasn’t emailed me to say that its fan is whining yet. (Yes, every time you view my blog, you’re actually visiting my living room.)

Graph shows Jan 19 – Feb 4, 2011.  Y-axis is page views.

AGBT Talk: Joseph Petrosino, Baylor College of Medicine

Title: Toward Improved Bacterial and Viral Metagenomic Sequencing and Analysis Strategies in Healthy and Diseased Individuals.

[EDIT: I found this talk really hard to take notes on – many of the slides did not have easily extractable messages, despite being interesting.  Errors are likely in the content below.]

Will focus more on viruses, as Dr. Knight focused more on bacteria.

NIH Human Microbiome Project (HMP).  Genomes from 900 microbiome bacteria – has grown to 3000.  Characterize microbiome from healthy people (baseline), doing transcriptome, viral and eukaryotic microbiome. 15 disease-oriented Demonstration Projects.

Sample sites are 15/18 locations on the body, depending on gender of subject.

[skipping poo joke….   right… carrying on.]

Sample -> enrich bacteria, viruses, fungi -> extract DNA -> sequence (whichever strategy) -> community structure and other value info (pathways/etc).

Description of sample sites and collection techniques – you need to do a lot of standardization.  [I’m not going into details, and the speaker is covering it very briefly.]

Moving on to the bacterial community dendrograms.  Samples cluster by location of collection, and are very specific to environment. (tongue different from saliva, etc.)

Many new disease relationship projects (long list…), including astronaut microbiomes.

Viral Metagenomics: detect encapsidated viruses in clinical samples to discover relationships to health and disease.  (Virus hunting.)

In healthy patients, you have small virus loads.

Upstream processing covered – much filtering done. [review of cDNA library construction]  Can require over 80 PCR cycles.

Do random primer designs sample viruses equally well?  How much depth is needed to capture viral diversity?  454 vs illumina?  (Huge Human contamination.)

Some slides comparing results, [couldn’t pull out take home message fast enough]. Plateau out around 30-40% of reads of a lane. [GAII?  not sure.]

Sampling is difficult, you don’t know if you’re capturing the whole population, but what you see caps out at 30-40%.

Random primer construction- does it work? Compared 6 different strategies.  [No take home message that I heard.]

Does more sample = more viruses, maybe.  You don’t need huge amounts of sample.

Virus families captured by random primers: many of them.  [I’m not listing, but there’s a difference by which primers are used.]

Data section:

Viral families detected in 4 subjects.  Patterns starting to emerge. [I can’t see them, though] Both DNA and RNA viruses detected.  Hits need to be verified. Are these colonizing, or are they just “passing through”?

Phage: 48 phages in 1st pass query against database.  Phage population can give you info about the microbiome.

Virus protocol differentiates stool and nasal wash viruses.   [yes, you can tell the difference, qualitatively.]

Some Diseases:

  • Kawasaki disease
    • children’s disease, usually found in children of Asian descent. Cause is unknown.
    • [unpublished data] – seems to be a few viruses associated – still needs to be validated.
  • Elephant Herpes virus
    • all 6 calves born at Houston Zoo in the last 2 decades have died from EEHV.
    • At zoo, they named the baby elephant “Baylor” to up the ante.
    • Did the usual process to try to pull out virus
    • Able to assemble EEHV1
    • research still underway

Many other projects ongoing.  Upward trend for viral metagenomic strategies.

Many areas to improve still, including improved curation of viral db. Better measures for colonizing vs passing-through viruses.

AGBT talk: Rob Knight, University of Colorado at Boulder

Title: Spatially and Temporally Explicit Studies of the Human Microbiome

Sequencing is getting dramatically cheaper, as we all know.  What we can do now is dramatically different than what we could do a decade ago.

We have known since the invention of the microscope (van Leeuwenhoek in 1683) that the human body is covered in bacteria.

Why should you care about your microbes? They can have interesting effects, eg, determine whether tylenol is toxic to your liver. (PNAS).  If you’re a fruit fly, it can determine your partner preferences (PNAS), steal genes from your food to help you digest it.

There are as many E. coli in your gut as there are people on earth.  It’s not the dominant member in the gut, though – it’s just best at growing on a petri dish.

Any two people you pick have 99.9% the same genome, but E. coli genomes can differ up to 40%.  Humans may not be unique like snow flakes, but our symbionts are!

Can start asking intelligent questions about our microbial selves.

How human are we?  in terms of cells, we’re 10% of the cells in our body.  only one percent of the DNA [if I got that right]

Most of the world is made of bacteria – animals and plants are a very small number of organisms.  and 99% aren’t culturable.

How do we look at them, then?  Get samples and extract DNA -> PCR amplify (usually SSU rRNA gene) -> sequence -> blast against GenBank (but this is less and less useful. You now get a lot of hits on uncultured stuff. so skip this.) -> align and build trees to figure out what you can.

Problem: big trees are hard to understand and analyze.

Issue: need to interpret vast amounts of sequence/tree data.  Interpretation isn’t trivial as trees become massive.

Experiment: microbial biogeography on the keyboard?  (Are keys deserts for bacteria, different from fingertips?)  Result: we have distinct cultures, each of us, but our keyboards mirror our fingertip bacteria. (PNAS) – it was on CSI: Miami, so you know it’s true.

Darwin’s “Origin” has the first phylogenetic tree. [I did not know that]

Calculating a community distance metric. If trees are identical, distance =0.  If complete separation right from root, then distance = 1. [Very visually informative slides – discussing how we perceive the data in the metric.]
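[The notes above don't name the metric. The description (0 for identical trees, 1 for a split at the root) matches the unweighted UniFrac idea, so that is what this little sketch assumes. The tree encoding and function name are mine; real tools work on proper tree files.]

```python
def unweighted_distance(branches, community_a, community_b):
    """Fraction of observed branch length that leads to only one community.

    branches: list of (length, set_of_tips_below_branch) tuples.
    community_a, community_b: sets of tip names present in each sample.
    """
    unique = 0.0
    total = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total if total else 0.0

# Toy tree ((t1:1,t2:1):1,(t3:1,t4:1):1), written out as branches.
toy_tree = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]

print(unweighted_distance(toy_tree, {"t1", "t3"}, {"t1", "t3"}))  # identical communities
print(unweighted_distance(toy_tree, {"t1", "t2"}, {"t3", "t4"}))  # split at the root
print(unweighted_distance(toy_tree, {"t1", "t2"}, {"t2", "t3"}))  # partial overlap
```

Identical communities give 0, communities that separate right at the root give 1, and the partial overlap lands in between (3 of 5 observed branch units are unique, so 0.6).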

(Lozupone & Knight 2007, PNAS) [hope I got the name right – it’s jammed into the corner of a slide.]  Experiment looking for related-ness among a large number of samples.  They did see a significant divide between saline/non-saline.

Interesting: Extreme environments are not outliers.  However, there are outliers: they’re in the vertebrate gut!

QIIME: integrating analysis of hundreds of samples using barcodes.  Use 454 mostly, but also illumina. Use sequences to build phylogenetic trees.

[Joke about why we still call it “454”…. because that’s the temperature your money burns at when you do these experiments…. ]

[Joke section on sequencing technologies to watch out for… I can’t do it justice.]

but i digress…..

Different body habitats are very different from each other.  (2009 Science)  [I recall seeing this last year at AGBT, I think.] When on antibiotics, your communities change dramatically, and getting a picture of overall human microbiome variability.

You don’t need a lot of sequences per samples to see the patterns.  same pattern in 10 seq/sample as 1500seq/sample.

Have done these studies over time – over 3 months , visualized in a live 3D graph. [worth seeing, actually, very cool.]

Picture of Rodrigo Salvadore Dali painting. (It’s a pretty picture, but doesn’t tell you the whole story)

Detailed biogeography of the human face.

[Nifty visualizations for the visualization of distribution of bacteria on the face.  Obviously can’t blog that.]

Where do the bacteria come from?  (Which raises privacy issues.)  Babies who come out vaginally  all have vaginal communities,  those that are born by c-section have a very different population.

Diversity of babies’ bacterial communities increases by day, and by the end of 3 years, they resemble their mother’s bacterial communities.

Do differences in the microbiome matter?  Fat mouse experiment says yes. (Two examples – Leptin and TLR5)  With TLR5 knockout mouse, the bacteria are different and seem to make the mouse hungrier – you can “rescue” the mouse by changing the bacteria. Same applies to Burmese pythons.

Fat vs thin are Bacteroidetes vs Firmicutes [missed which one is which, tho.]

Future directions: personalized medicine in developing nations. Pilot studies in “humanized mice” measuring input microbes, diet change and BMI.  Can you develop test from gut microbes to predict effects of diet/obesity/etc?

Much of the work is in developing systems for measuring and recording environmental conditions, etc.

Earth Microbiome project coming…

Conclusion: we all have a microbiome, and anyone can do this type of work now that sequencing is so cheap  – much of the cost of experiment is now in DNA extraction.

[A neat talk, summarizing a lot of published work.  Unfortunately, I couldn’t read most of the citations.  Talk was memorable for its good visualization tools and the excellent speaker.]

AGBT talk: Praveen Cherukuri, NHGRI

Title: Massively Parallel Sequencing of Exomes and Transcriptomes in ClinSeq Participants.

ClinSeq: large scale sequencing project of 1000 patients who have been phenotyped for clinical symptoms of coronary disease. Started in Jan 2007, participants between 45-65 years old.

Nice slide illustrating balance between: Clinical data, genome breadth and # subjects. Hard to get all 3.

Project started with Targeted Gene approach, switched to Whole Exome and Whole Transcriptome. (403 exomes and 14 transcriptomes already done.)

Data analysis and workflow slide – Very similar to everyone else – and have a nextgen variant database. [no description given here for the db, unfortunately.] Erange and cufflinks used for processing reads.

Many novel variants are singleton – most do not show up in multiple data sets. [expected, I suppose, given what we see elsewhere.  Only polymorphisms (not novel) saturate quickly, by definition.]

Focus on differential allele expression: when each copy of a chromosome carries different alleles, they may be expressed differently, and that may relate to disease.

Whole exome gives you the ability to count reads and count frequency. [as you’d expect, really.]  Distribution is generally similar (looks kinda like a normal distribution); stuff on the tails is allele specific expression.

High amount of correlation of allele frequency for both variants, but at greater than 100x, you see more variation.
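[To make the "count reads per allele, then look at the tails" idea concrete, here is a small sketch. It is my own illustration and not the ClinSeq pipeline: it assumes a plain exact binomial test against a 50/50 expectation, and the read counts are invented.]

```python
from math import comb

def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads at a heterozygous site that carry the reference allele."""
    return ref_reads / (ref_reads + alt_reads)

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no likelier than k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)   # small tolerance for float ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))

# Made-up counts: a balanced site and a strongly skewed site.
print(allelic_ratio(5, 5), binomial_two_sided_p(5, 10))
print(allelic_ratio(10, 0), binomial_two_sided_p(10, 10))
```

A balanced site gives a ratio of 0.5 and a p-value of 1, while 10 of 10 reads from one allele gives a p-value of about 0.002. Sites like the second one are the "tails" of the distribution.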

Example gene: ERAP2, which has previously been published and known to have differential ASE.


  • refining methodologies… [I think I missed something with this point.]
  • ASE is reproducible
  • implementing integrative computational approaches on participants with both exome and transcriptome data.

AGBT talk: Kateryna Makova, Penn State University

Title: Dynamics of Mitochondrial Heteroplasmy in three families

Brief overview of Mt DNA:  37 genes, 16.5kb long.  Maternally transmitted, sperm MT is destroyed upon fertilization, and also a high mutation rate (poor repair mechanism, environmental effects.)

Heteroplasmy: presence of more than one mtDNA variant in an individual.

Mitochondrial bottleneck during oogenesis.

Makes MT DNA interesting.

More than 200 diseases are caused by mutations in mtDNA.  Can be severe; the ratio of disease to normal mtDNA in one person can determine severity.  There is no cure – so focus is on prevention of transmission.  (Nuclear transfer of maternal DNA into an enucleated oocyte of a healthy female could be done… but hasn’t been proved.)

mtDNA mutations are predisposing to features of aging. (Alzheimer’s, Parkinson’s, diabetes, etc). Also possible link to autism.

mtDNA mutations are also markers for cancer – link not yet determined.

Recent studies with NGS get you further in detecting heteroplasmy, but there are outstanding disagreements. (He et al, 2010: heteroplasmy is common (from cell lines); Li et al, 2010: heteroplasmy is rare (from 131 individuals).)

question: how does heteroplasmy affect individuals, and how does it change during transmission to offspring?

Study design [not blogging this part – go read the research. (-: ]

Real challenge: how to distinguish low freq heteroplasmy from seq errors?  Tackled with lots of simulations and clonal samples and spike ins. Result: 2% or greater (conservatively, probably closer to 1%).
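[Why a frequency cutoff helps: if errors are independent, the chance of many reads showing the same wrong base at one site drops off fast. A toy sketch of that reasoning follows. The 2% cutoff is from the talk; the binomial error model, the 0.1% error rate and the function names are my assumptions, and the direct sum is only meant for small depths.]

```python
from math import comb

def error_tail_prob(depth, minor_reads, error_rate):
    """P(at least `minor_reads` erroneous reads) if errors are Binomial(depth, error_rate)."""
    below = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate)**(depth - i)
        for i in range(minor_reads)
    )
    return max(0.0, 1.0 - below)

def call_heteroplasmy(depth, minor_reads, min_freq=0.02):
    """Call a site heteroplasmic if the minor allele frequency reaches the cutoff."""
    return depth > 0 and minor_reads / depth >= min_freq

# Made-up site: 1000x depth, assumed per-base error rate of 0.1%.
print(call_heteroplasmy(1000, 25), error_tail_prob(1000, 25, 0.001))
print(call_heteroplasmy(1000, 10), error_tail_prob(1000, 10, 0.001))
```

At 1000x, 25 minor reads (2.5%) passes the cutoff and is essentially impossible to get from errors alone under this model. 10 reads (1%) fails the conservative 2% cutoff even though errors alone would rarely produce it, which fits the comment that the real limit is probably closer to 1%.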


6 heteroplasmic sites – [nice map of chrMT, but um… yeah, not bloggable without a camera.]

“Static heteroplasmy” never observed.  Frequency shift without mutation, somatic (shifts in minor allele frequency between tissues) and germline (passed on to child) mutations observed.

Found: One germline, 3 frequency shifts, 2 somatic mutations.

The one germline was different between two children from one mother, suggested that mutation happened early on.

Use Galaxy for this project.  History and log are available for this project – can be run on the cloud if you like. [cool.]


  • Heteroplasmic frequency shifts happen frequently,
  • analysis is reproducible,
  • objective determination of frequency thresholds,
  • and a framework exists for repeating this work.