AGBT Talk: Tim Yu, Harvard Medical School

Title: Genome wide searches for autism.

Disclosure: the work is his own, but he was here as a guest of Complete Genomics.

Background on neurodevelopmental brain disorders

  • Big question: How does brain form and work?
  • Recent work has been focused on autism.
    • Stats presented on autism (1:110 are autistic, using broad def of autism)
    • 8% recurrence risk for siblings.
  • Cognitive impairment in 50-60%
  • regression in 10-25% of cases
  • Challenges to gene discovery
    • difficult to make diagnoses (especially @ genetic level where 80-95% of cases)
  • Hypothesis: recessive burden
    • Nice slide: rate of cousin marriages corresponds to regions with higher birth defects and cognitive impairment.  (Mainly Middle East and North Africa.)
    • over 200 pedigrees collected.
    • Focus on the Middle East, where cousin marriages are common and culturally appropriate.
  • Use 500k SNP chips
    • Phase I: validation and CNV discovery
    • Phase II: targeted sequencing… which migrated to whole genome
  • Locus Heterogeneity
    • Many bands are implicated
    • Sequence region to identify candidate genes (mutations)
  • At this point, in comes Complete Genomics.
    • 40 individuals with autism sequenced
    • Analysis of the non-Complete Genomics data set involved wading through a forest of open source tools. With Complete Genomics, though, you just get hard drives and the data is ready to use.  (Big contrast.)
    • Coverage: 63x coverage, 95.6% bases called
  • Variant calling: used rare gene model
    • 3.2M variants
    • 1k pathogenic + novel
    • 100 pathogenic + novel + homozygous
    • 10 pathogenic + novel + homozygous + linked.
  • When compared to known autism genes, they were able to identify several individuals where it WAS the causative variant.
    • Most were still not obvious, however.
  • [Skipping clinical/patient data]
    • For this patient, a chr6 area identified.
      • From 3.2M SNPs, only 4 were consistent with the recessive model; one was in PEX7.
    • All affected family members had mutation in PEX7
    • Absent in 700 controls.
  • 2nd example:
    • “Mutation in known gene with atypical presentation can be autism.”
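
The filtering funnel from the talk (pathogenic, then novel, then homozygous, then linked) amounts to a chain of successive filters. A minimal Python sketch of that idea follows. The `Variant` fields and the toy records are hypothetical, invented for illustration, and this is not the pipeline used by the speaker.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # All fields are hypothetical annotations, for illustration only.
    name: str
    pathogenic: bool   # predicted damaging
    novel: bool        # absent from known-variant databases
    homozygous: bool   # homozygous in the affected individual
    linked: bool       # falls inside a linked interval for the family


def funnel(variants):
    """Apply the filters one after another; report how many survive each step."""
    steps = [
        ("pathogenic", lambda v: v.pathogenic),
        ("novel", lambda v: v.novel),
        ("homozygous", lambda v: v.homozygous),
        ("linked", lambda v: v.linked),
    ]
    counts = [("all", len(variants))]
    remaining = list(variants)
    for label, keep in steps:
        remaining = [v for v in remaining if keep(v)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Toy records: each one fails at a different stage of the funnel.
toy = [
    Variant("v1", True, True, True, True),
    Variant("v2", True, True, True, False),
    Variant("v3", True, True, False, False),
    Variant("v4", True, False, False, False),
    Variant("v5", False, False, False, False),
]
survivors, counts = funnel(toy)
for label, n in counts:
    print(f"{label:>10}: {n}")
```

Each step can only shrink the candidate list, which is how 3.2M variants can plausibly drop to around 10.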


  • There will be no “autism gene”
  • WGS will help us understand the disease
  • Not all genes identified will be new, and many will have known interventions.
  • There will be new ones too.
  • Autism is complex, but some variations will be “low hanging fruit”, such as the autosomal recessive cases presented here.

[All in all a neat talk – similar to cancer, wading into a complex genetic landscape.]

AGBT talk: David R. Bentley – Illumina

Title: Evolving technology for clinical sequencing

[no blogging of clinical data – the rest is fair game. good policy.]

Review of sequencing technology, and an interesting point that we’re now running towards smaller systems for faster runs with smaller numbers of reads, which will lead to clinical or diagnostic use.


  • increase cluster density
  • minimise gc bias
  • increase data yield
  • increase flowcell area
  • improve chemistry
  • [more]

Increased representation of GC-rich features – much improved representation with new chemistry.  Slide showing same region over time, goes from gaps to complete coverage.

unique paired read alignment also makes a big impact, particularly with paired ends.

Contrast of HiSeq to MiSeq.  Usual things – push button, all on board, etc etc…  [different clients for each tool set, I would think.]  Run time: 10 days for 600 Gb vs. less than one day for 1 Gb.

All methods are interchangeable between platforms.

Power of deep sequencing.  With sufficiently deep sequencing, it’s easy to see minor variants… (1.47% detectable in a 750k depth???  [That just sounds odd… maybe I got it wrong.])
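
A quick back-of-the-envelope check on that number, as a sketch only. It assumes "750k" means read depth at a single position and uses a per-base error rate of 0.1%, which is my assumption and not a figure from the talk.

```python
import math


def minor_variant_signal(depth, variant_fraction, error_rate):
    """Compare expected variant-supporting reads with expected error reads.

    Treats each read at the position as an independent trial, which is a
    simplification (real sequencing errors are not independent).
    """
    expected_variant = depth * variant_fraction
    expected_error = depth * error_rate
    # Standard deviation of the error-read count under a binomial model.
    error_sd = math.sqrt(depth * error_rate * (1 - error_rate))
    # Number of standard deviations the variant signal sits above the noise.
    z = expected_variant / error_sd
    return expected_variant, expected_error, z


# Depth and fraction are the figures noted from the slide; error_rate is assumed.
variant_reads, error_reads, z = minor_variant_signal(750_000, 0.0147, 0.001)
print(f"variant reads ~{variant_reads:.0f}, error reads ~{error_reads:.0f}, z ~{z:.0f}")
```

Under these assumptions a 1.47% variant sits far above the noise at that depth, so the more interesting limit is probably where the variant fraction approaches the error rate.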

Covering Chronic Lymphocytic Leukaemia (CLL) example [- will not blog this part.]

  • New Bayesian caller: HYRAX
  • indels/SV de novo assembled with GROUPER
  • CNVs use recursive partitioning… [using the software published by Sergii Ivakhno, which I totally did not understand last night.]


  • First description of progression of CLL
  • Limited number of NS mutations and CNV occur
  • Candidates involved in regulation of innate immune response and cancer progression
  • relapse drivers are actually present in pre-treatment samples
  • profiling identifies mutations eradicated by or resistant to treatment
  • Quantification of mutational burden by deep sequencing reveals clusters.

Spectrum of sequencing:  Targeted test <===> Whole Genomes.

AGBT Talk: Pippin Prep system – Sage Science.

[Not sure why I’m at this talk… it’s really an ad for the Pippin Prep.  Notes are accordingly splotchy.]

Size Fractionation in NGS applications – different cassettes for different sizes, for ChIP-Seq, miRNA, PE reads, 1st Mate Pair.

Different cassettes for different size selections – covers a variety of ranges.

[I’m sorry to say this is the driest talk today… it’s mainly a list of products and improvements made, and what appear to be Agilent traces showing the size selection.  Blogging this talk ends here.]

AGBT: Eric Boerwinkle, University of Texas School of Public Health

Title: Life after GWAS

By phone – and tweeting is acceptable… Isn’t technology wonderful?

Background: Standard GWAS algorithm.  Find Genes -> Characterize Genes -> Define Functional Mutations -> Experimental Systems -> (Predictions/diagnosis/pharmacogenetics/Gene Interaction/etc)

Collaborations are an integral part of the work.  Large cohorts are important.

Through July 2010, there are 904 published GWAS for 165 traits – this is not insignificant.  GWAS has played an important part in understanding disease.  Need to appreciate the successes, even if there’s a lot further to go.

Mechanisms: Fine Mapping, Epidemiology, Translation, Resequencing and Biology.  Focus on the last 3. (Fine mapping doesn’t make for a good talk…)

Eventually, GWAS will re-invigorate the old biochemistry fields (metabolism) by better understanding what’s going on. [paraphrased, but nice to hear it.]

First set of data: Atherosclerosis Risk in Communities, based on random sample of ~16k individuals, followed longitudinally.  Annual follow up.  The idea is to look at interaction between genes and environment.

Discussion of a few genes and their impact on the disease. [not going to directly copy the data, but a short discussion of risk factor.]  Slides on putting it in context in the genome, and then doing functional analysis using a mouse knock out model.

Translational Applications:

  • Novel drug targets
  • updated prevention strategies
  • new risk assessment algorithms
  • Tailored therapies based on genotypes. [not going to talk about it, but a very important part of the future of medicine.]

[Skipping examples, of how you might apply this type of information to influence patient treatment and how you might apply it to large groups to modify treatment guidelines.]

Next section on resequencing.

Example of Permanent Neonatal Diabetes – the dogma is that it is dominant, and a few mutations have been found, but the cause was unknown in the majority of patients. [Skipped background, which is probably available on the web.]

Two papers, but work in progress:  Voight et al. (2010) and Dupuis et al. (2010), both in Nat Genet 42.  31 loci mentioned in the first paper, 18 in the second.

Wanted to confirm known genes, and to identify novel genes.

Early results:

  1. spiked in internal control to confirm that they could find known mutations.  (worked.)
  2. Examined previously implicated genes (3 of them)
  3. Examined T2DM genes. (None found)

Novel mutations in novel genes. Cryptic “inbreeding”.

[Unfortunately, I had a meeting, and had to run, as the talk had already gone over by 10 minutes, so I missed the section on Gout.  Overall, a well delivered talk for someone who wasn’t able to attend in person.]

AGBT talk: Ellen Wright Clayton, Vanderbilt University

[Speaker Encourages tweeting and blogging!]

Title: Surfing the tsunami of whole genome sequencing.


  • Complete disclosure of the results of whole genome sequencing could lead to disaster.
  • Suggest strategies to take the flood of information.

Medicine: Based on genetic and environmental contributions. Prevention plays a smaller part in medical care, and is based entirely on phenotype + age.

Future: Personalized medicine [Francis Collins quote on sequencing newborns].


  1. Separating the wheat from the chaff: false positives increase as data increases.
  2. Incidental findings: Most people say they want incidental findings, even if they don’t know what that means.  When deciding what results to return, however, there are many categories (reproductive outcome, actionability, personal value, but what about standards in clinical practice?)  The debate about this is ongoing, but possibly paternalistic.
  3. What are the downstream costs?  Parallel debate in radiology where you have to factor in everything – and where the actual cost of following up incidental findings is not trivial.  Maybe it’s not worth following up on everything.
  4. Pleiotropy: ApoE4 story and PheWAS (no detail given, but much information available elsewhere). As we look at genomes, we’ll find a lot of pleiotropic effects, which means we’ll have a LOT of incidental effects.
  5. Bad Science: Discussion of “GATTACA”.
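
Point 1 is essentially arithmetic: with a fixed per-test false positive rate, the expected number of false calls grows linearly with the number of things tested. A small sketch, where the rate and the test counts are made-up illustrative values and not numbers from the talk:

```python
def expected_false_positives(n_tests, false_positive_rate):
    """Expected number of false positive calls among n_tests independent tests."""
    return n_tests * false_positive_rate


def prob_at_least_one(n_tests, false_positive_rate):
    """Probability that at least one of n_tests independent tests is a false positive."""
    return 1 - (1 - false_positive_rate) ** n_tests


# The rate is an illustrative assumption; the counts range from a single
# test up to roughly a genome's worth of variants.
rate = 1e-4
for n in (1, 100, 10_000, 3_000_000):
    print(n, expected_false_positives(n, rate), round(prob_at_least_one(n, rate), 4))
```

Even a test that is wrong only once in ten thousand calls is expected to produce hundreds of false positives when applied across millions of variants.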

[This discussion is subtly directed at an American audience….  finding it less convincing as a Canadian, where healthcare is free, and the cost savings of personal genomics will outweigh the cost of following up on accessory conditions.]

Thus: disclosure of all this information threatens to sweep away the health care system.  [Meh… doubtful.]


  1. Consider utility and actionability… don’t disclose things until [someone] decides it’s ready for “primetime”.  [who is this someone who decides this for me?]
  2. Age for testing and disclosure
  3. Impact of costs of follow up
  4. what about people who don’t want to know?

The real question:

  • We all assume we can control who gets access to this data. [No, not really – I assume it is mostly irrelevant to everyone but the person to whom it pertains, unless you’re an American with private healthcare.]

What do we do when the information is available?

  • Better information for electronic medical records
  • Develop better policies now.
  • Patient’s desires will probably play a minor role.  This will be REALLY controversial.  Limits will make people unhappy.

[I’m leaving out the discussion of “parents have a constitutional right to their child’s information”… it’s very much irrelevant and seems like a non-sequitur to me, and childhood stories don’t belong on a blog.  See, I know where to stop blogging.]

To clarify:

  1. Scientific analysis of variations and their impacts must proceed at full speed. [Yes, but why would you assume it isn’t????]  Public doesn’t know it. [Ok, we need to be better at communication.  How about more blogging and tweeting? (-:]
  2. Policies determining access and use.
  3. We need to engage the public and explain what it shows and doesn’t show.  [Communicate limits.  I agree with this, but media needs to be better informed on the point…  yada yada]

If we’re going to “surf the tsunami” of medical data, we have to do a better job of engaging, recognizing that it will be controversial, and knowing its limits.

[Interesting talk, but I fail to see most of her points. First question makes light of the American/non-American divide… (-:  ]

AGBT talk: Sergii Ivakhno, Cancer Research UK Cambridge

Title: Inferring Somatic Copy Number Aberrations in Heterogeneous Cancer Samples

Problem of copy number calling in cancer with NGS.  [Kinda missed it…]

Straight into CNAseg Workflow.  Use BAM files to get read depth, derive log-ratio in 50bp window.  HMM Segmentation, output distinct HMM segments.  Compare the two states.
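
The first steps of that workflow (read depth, then log-ratio in fixed windows) can be sketched as below. This is not CNAseg's code: the function name, the pseudocount and the toy depth vectors are my assumptions, the example uses 5 bp windows rather than 50 bp to stay small, and the HMM segmentation step is left out entirely.

```python
import math


def window_log_ratios(tumour_depth, normal_depth, window=50, pseudocount=1.0):
    """Per-window log2(tumour/normal) from per-base read depth lists.

    Depth is summed within non-overlapping windows; the pseudocount avoids
    taking the log of zero. A trailing partial window is dropped.
    """
    if len(tumour_depth) != len(normal_depth):
        raise ValueError("depth vectors must have the same length")
    ratios = []
    for start in range(0, len(tumour_depth) - window + 1, window):
        t = sum(tumour_depth[start:start + window]) + pseudocount
        n = sum(normal_depth[start:start + window]) + pseudocount
        ratios.append(math.log2(t / n))
    return ratios


# Toy data: the second window has doubled tumour depth (a gain),
# the third has no tumour reads at all (a loss).
normal = [10] * 15
tumour = [10] * 5 + [20] * 5 + [0] * 5
ratios = window_log_ratios(tumour, normal, window=5, pseudocount=0.0001)
print([round(r, 2) for r in ratios])
```

A real pipeline would also need to normalise for total library size and GC content before the log-ratios are handed to a segmentation step.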

Ref given: Ivakhno S, CNAseg (2010)

HMM segment merging in multi-sample data analysis.

[Ok, I’m not doing this talk justice.   I appreciate that we can blog this talk, but I don’t think I can actually do it. Please read the paper.]

AGBT talk: Maria Mendez-Lago, BC Cancer Agency

Title: Mutations in MLL2 and MEF2B Genes in Follicular Lymphoma and Diffuse Large B-Cell Lymphoma.

Background on Follicular Lymphoma (FL) and Diffuse Large B-Cell Lymphoma (DLBCL) – most common types of non-Hodgkin lymphoma

Goal: Detect driver mutations, characterize the pattern of mutations, and then understand the role of the mutations in the proteins in which they occur.

Method: 119 lymphoma samples were sequenced: genome/transcriptome. hg18  exon-exon junctions, SNVmix SNV calls.

Result: 137 genes with 1 confirmed somatic mutation and with mutations in at least 2 other samples.

Discovered an enrichment for histone modifying genes.  These turn transcription on and off in the normal cell.

To get a better picture, the MLL2 locus was sequenced in several tens of lymphomas, as well as 8 normal samples.  MLL2 is highly mutated in lymphoma: FL 89%, DLBCL 32%, DLBCL cell lines 59%, healthy B cells 0%.

RNA-seq missed 33 mutations –

  • 20 were indels. (Missed by SNV calling methods.)
  • 3 were in splice sites
  • 10 new non synonymous SNVs in regions previously low coverage. (eg, the transcriptome contained low amounts of this gene.)

Very pretty image of the distribution of mutations in MLL2.

Assembly was done with Trans-ABySS to confirm the effect of a mutation at a donor site, as well as RT-PCR + Sanger.

Some samples had 2 independent mutations in MLL2 (one in each allele).

For the other gene, MEF2B, targeted sequencing was used, as the mutations were mainly localized to a single region.  Some were outside, so a capture strategy was used (biotinylated RNA baits).  Captured RNA was sequenced on a GAII.

There were a few common mutations, and most mutations were found in exons 2 and 3.

[first time I’ve seen this today:] crystal structure used to show location of most common mutations, and why they interfere in binding.


There are several genes frequently mutated in FL and DLBCL.  MLL2 and MEF2B are mutated in lymphomas at reasonably high frequencies and are likely strong drivers in lymphoma.

[I believe this has all been published already [EDIT: NO, it hasn’t – my apologies! And I’m that much more impressed that we’ve been allowed to blog this presentation!], but a great talk, none-the-less.  A very concise and clear explanation of the mutations found and why they are important.]

AGBT talk: Lara Bull-Otterson, Baylor College of Medicine

Title: Viral Genomes in Whole Genome Shotgun Sequencing of Hepatocellular Carcinoma

Strong link between cancer and viruses. 12% of cancer cases are caused by 7 viruses (EBV, HPV, HBV, HCV, HTLV-1, HHV8, polyomavirus).

Hepatocellular carcinoma: 5th most common cancer worldwide and 3rd leading cause of cancer death worldwide.

Hepatitis B Virus: small DNA virus, ds and ss, circular, replicates by RNA intermediate.  4 overlapping open reading frames, 2 direct repeats of 11 bp.  (Brief review of life cycle.  Does not integrate into genome)

Talk today focuses on 3 patients – 1 HBV+, 2 HCV+

30% of DNA reads from a patient will be unmapped, and viral sequences will be in the unmapped section.

WGS 30x coverage for tumour/normal.  Aligned to hg19 (BFAST), then used a viral db (NCBI & JCVI viral genome database, soft-masked viral seq)

Can confirm that in DNA, you do get the HBV signal, but not HCV (since it’s an RNA virus), but you see both in RNA-seq data.

Discussion of signal vs noise – if your signal is all in one spot on the genome, it’s just noise, not a real hit. If it maps across the whole viral genome, then it’s probably a good hit. [I tried this 3 years ago for another type of cancer, and saw the same thing with the noise – but never got a signal…  nice to see what a real signal looks like.]
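
That signal vs noise heuristic boils down to breadth of coverage across the viral genome rather than raw read count. A minimal sketch, where the function names, the 50% breadth threshold and the toy alignments are my assumptions and not the speaker's method:

```python
def coverage_breadth(alignments, genome_length):
    """Fraction of a viral genome covered by at least one read.

    alignments: iterable of (start, end) half-open intervals, 0-based.
    """
    covered = [False] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            covered[pos] = True
    return sum(covered) / genome_length


def looks_like_real_hit(alignments, genome_length, min_breadth=0.5):
    """Heuristic: many reads piled on one spot are treated as noise."""
    return coverage_breadth(alignments, genome_length) >= min_breadth


genome_length = 3200  # roughly the size of the HBV genome
pileup = [(1000, 1100)] * 500                        # 500 reads, all on one spot
spread = [(s, s + 100) for s in range(0, 3200, 80)]  # 40 reads, tiled across

print(coverage_breadth(pileup, genome_length), looks_like_real_hit(pileup, genome_length))
print(coverage_breadth(spread, genome_length), looks_like_real_hit(spread, genome_length))
```

The pileup has more than ten times as many reads as the tiled set, yet covers only about 3% of the genome, which is the pattern described as noise.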

Discussion on how to find the integration site.  Also indication of host-junction site.

There are implications to integration – could cause structural modification of genes, modifying product, or modify regulation, modify promoters, etc.


  • Methods work – can find viruses in the virus positive individual.
  • mate paired data allows you to find putative site of viral integration (TERT promoter, in this case)
  • Method has applications for other pathogens
  • And there is value in unmapped reads!

[neat talk – interesting to find out that unmapped reads can have so much value.]

AGBT talk: Obi Griffith

[Yay, we can blog Obi’s talk]

“Transcriptome and Exome Sequencing of Breast Cancer Cell Lines to Identify Molecular Predictors of Response to Anti-Cancer Compounds.”

Using a panel of breast cancer cell lines.

Hypothesis: correlating drug response of the breast cancer cell lines with molecular characteristics will help identify drug treatments.

Combine drug response data with molecular data -> molecular signature of drug response.

Panel: 72 cell lines. [I see a few normals in the cell line list, tho, so not all breast cancer cell lines… odd.]

After filtering: 82 drugs, 63 are publicly available, 19 are through private companies.

Methods: U133A, exon array, methylation, RPPA, westerns, CNV.  Also: COSMIC, RNA-seq (Alexa-seq), Exome-seq.  All data types can be combined and filtered (unsupervised) for variance.

Ideally, in the perfect world, all cell lines would be profiled with all of the techniques – but “we don’t live in that world”. [Speaks to a lot of issues in working with cell lines, IMHO.]

RNA-Seq: 55 cell lines.  See Alexa-seq poster [74, I think.]

Description of pipeline.  Looks relatively standard.

For each drug, cell lines separated into responders/non-responders.  Random Forests used for classification. (Internal cross validation step)

Exome-seq identifies known and novel cancer variants.  Concordant calls identify mutations. – Almost all discrepancies are false negatives in exome-seq. [interesting!]

RNA-seq recapitulates known subtypes with high accuracy.  (unsupervised clustering map shown, looks very pretty, and cell lines are nearly all in the correct clusters.)

Drugs with best predictors listed – 58/82 are better than random, (AUC > 0.5), [gets better the higher you go, obviously.]
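
For reference, the AUC used to score each drug's predictor can be computed directly from the classifier scores. A minimal pure-Python sketch using the rank-based (Mann-Whitney) formulation; the scores and labels below are toy values, not data from the talk, and this is not the pipeline's own code.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney formulation.

    scores: predicted probability of response, one per cell line.
    labels: 1 for responder, 0 for non-responder.
    Equals the probability that a random responder scores higher than a
    random non-responder, counting ties as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
good_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # mostly ranks responders higher
flat_scores = [0.5] * 6                       # no information at all
print(auc(good_scores, labels), auc(flat_scores, labels))
```

An uninformative predictor lands at 0.5, which is why "better than random" is stated as AUC > 0.5.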

Example given with Lapatinib – predicted by the HER2 amplicon.  Overexpression of the HER2 amplicon is visible across all of the methods, and thus it’s possible to use that info to predict lapatinib response.

Example : BIBW2992 – drug response also associated with HER2, but also 3 other mutations.  classification requires use of multiple genes for response prediction.


  • Predictors can be found for many drugs
  • most important predictors come from a wide range of data types.

Future work:

  • Experiment with sensitivity parameters/thresholds
  • control for subtype
  • compare performance of individual data types.

[I really didn’t do justice with the notes – much of this talk was visual, and data was hard to summarize in text.  Good talk, however, and nice to see that molecular classification is becoming more feasible.]

Evening Festivities or being snarky about pac bio’s movie.

Possibly the most exciting thing that’s happened in the past hour is the fact I’ve won a million pounds in a lottery that I don’t remember entering…. although really, I don’t think I’m going to send my personal information to the lottery corporation – after all, the lottery was sponsored by the British tobacco promo.

More to the point, I was underwhelmed by “The New Biology” film shown by Pac Bio. That’s not to say it was bad, but that they’d picked the wrong audience. Really, it might have been good if you were, say, a complete newbie in the field of next-gen sequencing, or if you like snazzy graphics that don’t tell you much. (Yes, I’m being snarky… but I’ve been good all day, so here it goes.)

Personally, I found myself trying to read the lines of Perl code that would scroll by periodically in batches of random numbers. I did catch the line:

while (1) {

which has me worrying about the origin of the code and where it came from. (This is one of those things that good coders just don’t do.) I got a copy of the video so I’ll try to figure this out later. My guess is it wasn’t pac bio’s code. I think much more highly of them than that, and this movie was really not designed for an audience of bioinformaticians. I hope the biologists in the audience got more out of it.

I’m also a little wary of the “new biology” paradigm, which was alternately defined as personalized medicine, drug screening, network biology and next generation sequencing. They can’t ALL be new biology… can they? Or did I miss the memo that everything in the future is new biology… hrm.

I suppose it also didn’t help that there were a lot of facebook analogies in the introduction… I’m rather anti-facebook because of its policies, and really, I think my database of a billion rows of searchable variations across 2000 samples faces entirely different challenges than the mechanisms used when my nephew tells all of his friends about how much he hates math class… Don’t get me wrong – I love social networking in the abstract, but facebook isn’t my device of choice….and then there was the Monsanto thing, but let’s not get into that.

Anyhow, I guess I can say the movie wasn’t to my taste, unfortunately. I can see it doing well as a one hour TV special on the National Geographic channel – or even uploaded to YouTube, where I’m sure it would quickly accrue several million hits, but my further viewing pleasure will all be with an eye to figuring out where the code came from… or possibly as a drinking game. (A shot every time someone says “new biology” might work well.) Bottoms up!

Ah, Pac Bio, I was hoping for more snazzy technology this year, rather than a Disney-esque version of the future. But that’s ok, you’re still my favorite technology… Long live single molecule sequencing!