Copenhagenomics 2011, in review

It’s early Saturday morning in Copenhagen and Copenhagenomics 2011 is done.  I was going to say that the sun has set on it, but the city is far enough north that the sun really doesn’t do much more than sink a bit below the horizon at night.  That said, the bright summer sunshine has me up early – and ready to write out a few thoughts about the conference.

[Yes, for what it’s worth, I was invited to blog the conference so I may not be completely impartial in my evaluation, but I think my comments also reflect the general consensus of the other attendees I spoke to as well.  Dissenters are welcome to comment below.]

First, I have to say that I think it was an unqualified success.  Any comments I might have can’t possibly amount to more than suggestions for the next year.  The conference successfully brought together a lot of European bioinformaticians and biologists and provided a forum in which some great science could be shown off.

The choice of venue was inspired and the execution was flawless, despite a few last minute cancellations.  These things happen, and the conference rolled on without a pause.  Even the food was good (I didn’t even hear Sverker, a vegetarian Swede, complain much on that count) and the weather cooperated, clearing up after the first morning.

As well, the conference organizers’ enlightened blogging and twittering policy was nothing short of brilliant, as it provided ways for people to engage in the conversation without being here first hand.  Of course, notes and tweets can only give you so much of the flavour – so those who did attend had the benefits of the networking sessions and the friendly discussions over coffee and meals.  The online presence of the conference seemed disproportionately high for such a young venue and the chat on the #CPHx hashtag was lively.  I was impressed.

With all that said, there were things that could be suggested for next year.  Personally, I would have liked to have seen a poster session as part of the conference.  It would have been a great opportunity to showcase next-gen and bioinformatics work from across Europe.  I know that the science must be there, hiding in the woodwork somewhere, but it didn’t have the opportunity to shine as brightly as it might have.  It also would have served to bring out more graduate students, who made up a small proportion of the attendees (as far as I could tell). Next year, I imagine that this conference will be an ideal place for European companies and labs to do some recruiting of young scientists – and encouraging more graduate students to attend by submitting posters and abstracts would be a great way to facilitate that.

Another element that seemed slightly off for me was the vendors.  They certainly had a presence and made themselves noticed, but the booths at the back of the room might not have been the best way for companies to showcase their contributions.  That said, I suspect that Copenhagenomics will have already outgrown this particular venue by next year anyhow and that it won’t be a concern moving forward.

While I’m on the subject of vendors, what happened to European companies like Oxford Nanopore, or the usual editor or two from Nature?  Were some UK attendees scared off by the name of the conference?  I’m just putting it out there – it’s entirely possible that I simply failed to bump into their reps.

In any case, the main focus of the conference, the science, was excellent.  There were a few fantastic highlights for me.  Dr. John Quackenbush’s talk challenged everyone to seriously re-consider how we make sense of our data – and more importantly, the biology it represents.  Dr. Elizabeth Murchison’s talk on transmissible cancers was excellent as well and became a topic of much conversation.  Heck, three of my fellow twitter-ers were there and each one did a great job with their respective talks. (@rforsberg, @dgmacarthur and @bioinfo)

In summary, I think the conference came off about as smoothly as any I’ve seen before – and better than most.  If I were given the opportunity, this would be a conference I’d pick to come back to again. Congratulations to the organizers and the speakers!

CPHx: Morten Rasmussen, National High-Throughput Sequencing Centre, sponsored by Illumina – Exploring ancient human genomes

Exploring ancient human genomes
Morten Rasmussen, National High-Throughput Sequencing Centre, sponsored by Illumina


Why study ancient DNA?  By studying modern species, we can only add leaves to the end of the phylogenetic tree; we can’t study the internal nodes or extinct branches. [my interpretation.]

How do you get ancient DNA? Bones and Teeth, mainly.  Coprolites are now used as well, and soft tissue, if available.  Ice and sediments can also be used in some cases.

Characteristics: The colder and dryer the environment, the better quality of the DNA preservation.  Age is also a factor.  The older the DNA, the less likely it is to have survived.  More than 1 million years is the limit, if conditions were optimal.

Goldilocks principle.  There is a sensitivity limit – you need enough.  Some is too short – you need longer strands.  You also need to worry about modern DNA contamination – mostly microbial.  Thus, within those constraints, you need to work carefully.

Some advantages in next-gen seq, though – no need for internal primers, size constraints are OK, etc.

DNA barcodes are frequently used to look at biodiversity.  Align the sequences to look for conserved regions surrounding a variable region – allowing primers to be designed for either end of the variable region.  If sequences are identical, you can’t distinguish the origin of the DNA.  [obviously a different type of bar-coding than what we usually discuss in NGS.]
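The conserved-flanks-around-a-variable-region idea can be sketched in a few lines of Python. Everything here – the toy alignment, the flank length – is invented for illustration; real barcode primer design works on much larger alignments and real primer-design constraints.

```python
# Sketch: find a variable (barcode) stretch in a multiple alignment that
# is flanked on both sides by fully conserved columns, where primers
# could be placed. Toy data only.

def conserved_columns(alignment):
    """Return a bool per column: True if every sequence agrees there."""
    return [len(set(col)) == 1 for col in zip(*alignment)]

def find_barcode_region(alignment, flank=3):
    """Return (start, end) of the first variable stretch with `flank`
    conserved columns on each side, or None if there isn't one."""
    cons = conserved_columns(alignment)
    n = len(cons)
    start = flank
    while start < n:
        if cons[start]:                      # still in conserved sequence
            start += 1
            continue
        end = start
        while end < n and not cons[end]:     # walk across the variable run
            end += 1
        if end + flank <= n and all(cons[start - flank:start]) \
                and all(cons[end:end + flank]):
            return start, end
        start = end
    return None

aln = ["ACGTAAATTGCA",
       "ACGTACGTTGCA",
       "ACGTAGCTTGCA"]
region = find_barcode_region(aln, flank=4)   # columns 5-6 vary
```

If the sequences are identical across the variable window, no region is returned – which is exactly the “can’t distinguish the origin” limitation the talk mentioned.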

Ice core genetics.  Willerslev et al, Science (2007).  Interesting results found in the “silty” ice, which included DNA from warmer climate plants.

Late survival of mammoth and horse…  can apply similar techniques to soil cores as to ice cores.

Paleogenomics.  DNA is often highly fragmented and full of bacterial contamination.  A big part of this is finding the right sample.  E.g., look in Greenland for good samples where the cold will have preserved them well.  A hair sample was found, which was eventually moved to Denmark.

The big issue of contamination, however, still has to be dealt with.  Fortunately, DNA is held inside the hair, so washing hair with bleach removes most surface contaminants without harming the DNA sample.  Gives good results – vastly better than bone results that can’t use that method.  (84% in this case is Homo sapiens, versus 1% recovery for Neanderthal bone.)

DNA damage:  Expected damage from ancient DNA as previously observed, but the bioinformaticians did not see significant damage.  Turns out that Pfu was used in the protocol this round, and Pfu does not amplify uracil-containing templates.  This has the unexpected side effect of “removing” the damage.

Standard pipeline was used, mapping to hg18.  Only 46% of reads mapped, because only uniquely mapped reads were used for the analysis.  Multi-mapped reads were discarded, and clonal reads were also “collapsed”.  Still, 2.4 billion basepairs covered, 79% of hg18, 20X depth.
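As a rough sketch of the two filters described – discarding multi-mapped reads and collapsing clonal (PCR-duplicate) reads – assuming an invented record format rather than real SAM/BAM:

```python
# Toy version of the read filtering described in the talk. A real
# pipeline works on SAM/BAM flags; here a read is just a tuple of
# (name, chrom, pos, n_hits), where n_hits is how many places it maps.

def filter_reads(reads):
    """Keep reads mapping to exactly one location, then keep only the
    first read seen at each (chrom, pos) start coordinate."""
    unique = [r for r in reads if r[3] == 1]          # drop multi-mapped
    seen, kept = set(), []
    for name, chrom, pos, _ in unique:
        if (chrom, pos) not in seen:                  # collapse clonal reads
            seen.add((chrom, pos))
            kept.append((name, chrom, pos))
    return kept

reads = [("r1", "chr1", 100, 1),
         ("r2", "chr1", 100, 1),   # clonal duplicate of r1: collapsed
         ("r3", "chr1", 250, 3),   # multi-mapped: discarded
         ("r4", "chr2", 500, 1)]
kept = filter_reads(reads)         # only r1 and r4 survive
```

Both filters throw away real data (hence only 46% of reads contributing), but they protect the downstream variant calls from mapping artefacts and PCR amplification bias.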

Inference about phenotypic traits:

  • dark eyes
  • brown hair
  • dry earwax
  • tendency to go bald

Of course, many of those could have been predicted anyhow, but nice to confirm.

Compared to other populations with SNP chip data.  Confirmed that the ancient Greenland DNA places the sequenced individual near the Chukchis and Koryaks (populations from northern Siberia).  That’s good, because it also rules out contamination from the people who did the sequencing (Europeans).  Thus, this was probably from an earlier migration than that of the current Greenlanders, consistent with known data about migrations to the region.

What does the future hold:

  • More ancient genomes
  • Targeted sequencing for larger samples.

Why targeted sequencing of ancient DNA?  If you capture the most important bits of DNA, you would generate more interesting data with less effort, giving the same results.


CPHx: Daniel MacArthur, Wellcome Trust Sanger Institute & Wired Science – Functional annotation of “healthy” genomes: implications for clinical application.

Functional annotation of “healthy” genomes: implications for clinical application.
Daniel MacArthur, Wellcome Trust Sanger Institute & Wired Science


The sequence-function intersection.

What we need are tools and resources for researchers and clinicians to merge information together to utilize this data.  Many things need to be done, including improving annotations, fixing the human reference sequence and improved databases of variation and disease mutations.

Data sets used – a single high-quality individual genome.  Anonymous European from the HapMap project.  One of the most highly sequenced individuals in the world.

Also working on a pilot study with 1000 genomes, 179 individuals from 4 populations.

Focussing on loss of function (LOF) variants.  SNPs introducing stop codons, disrupting splice sites, large deletions and frame-shift mutations.  Expected to be enriched for deleterious mutations.  Have been found in ALL published genomes – all genomes are “dysfunctional”.  Some genomes are more dysfunctional than others… however, it might be an enrichment of sequencing errors.

Functional sites are typically under selective pressure, leading to less variation.  The more likely something is to be functional, the more likely an apparent variant there is to be an error. [I didn’t express it well, but the noise has a greater influence on highly conserved regions with low variation than on regions with higher variation.]

Hunting mistakes

  1. sequencing errors.  This gets easier to find as time goes by and tech. improves.
  2. reference or annotation artefacts.  False intron in annotation of genes, or otherwise.
  3. Unlikely to cause true loss of function.  eg, truncation in last amino acid of protein.

Loss of function filtering.  Done with experimental genotyping, manual annotation and informatic filtering.  Finally, after all that filtering, you get down to the “true LOF variants.”

Example: 600 raw variants become 200 after filtering on any transcript, down to 130 after filtering on all transcripts.
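The any-transcript vs all-transcripts distinction is easy to miss: a variant may truncate one isoform of a gene while leaving another intact. A toy sketch, with completely invented data structures:

```python
# A variant carries a per-transcript LOF annotation. Filtering on "any
# transcript" keeps variants that knock out at least one isoform;
# "all transcripts" demands that every isoform is disrupted.

def lof_in_any_transcript(variant):
    return any(variant["lof_by_transcript"].values())

def lof_in_all_transcripts(variant):
    return all(variant["lof_by_transcript"].values())

variants = [
    {"id": "v1", "lof_by_transcript": {"tx1": True,  "tx2": True}},
    {"id": "v2", "lof_by_transcript": {"tx1": True,  "tx2": False}},
    {"id": "v3", "lof_by_transcript": {"tx1": False, "tx2": False}},
]
any_hits = [v["id"] for v in variants if lof_in_any_transcript(v)]
all_hits = [v["id"] for v in variants if lof_in_all_transcripts(v)]
```

The all-transcripts filter is the stricter one, which is why the counts in the talk drop from 200 to 130.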

Homozygous loss of function variants were observed in the high quality genome.  The ones observed cover a range of genes.  The real LOF variants tend to be rare, enriched for mildly deleterious effects.

LOF variants affect RNA expression.  Variants predicted to undergo nonsense mediated decay are less frequent. [I may have made a mistake here.]

Can use LOF variants to inform clinical outcomes.  You can distinguish LOF variant genes from recessive disease genes.  ROC AUC = 0.81 (Reasonably modest but predictive model.) Applying this to disease studies at Sanger.


  • More LOF variants for better classification
  • Improve upstream processes
  • Improve human ref seq
  • Use catalogs of LOF tolerant genes for better disease gene prediction
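For what it’s worth, an ROC AUC like the 0.81 quoted above has a simple interpretation: the probability that a randomly chosen positive (true LOF-tolerant gene, say) scores higher than a randomly chosen negative. The labels and scores below are invented just to show the computation.

```python
# Rank-based AUC, computed directly from its probabilistic definition.
# Fine for small illustrative data; real work would use a library.

def roc_auc(labels, scores):
    pairs, wins = 0, 0.0
    for li, si in zip(labels, scores):
        for lj, sj in zip(labels, scores):
            if li == 1 and lj == 0:          # one positive/negative pair
                pairs += 1
                if si > sj:
                    wins += 1
                elif si == sj:
                    wins += 0.5              # ties count half
    return wins / pairs

labels = [1, 1, 1, 0, 0]                     # invented ground truth
scores = [0.9, 0.8, 0.3, 0.6, 0.2]           # invented classifier scores
auc = roc_auc(labels, scores)                # 5 of 6 pairs ordered correctly
```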

CPHx: Kevin Davies, Bio-IT World – The $1,000 genome, the $1,000,000 interpretation

“The $1,000 genome, the $1,000,000 interpretation”
Kevin Davies, Bio-IT World


Taking notes on a talk by a journalist is pretty much a bad idea.   Frankly, it would be akin to reducing a work of art to a mere grunt.  The jokes, nuances and elegance would all be lost – and if I were somehow able to do a good job, it would have the nasty side effect of putting Kevin out of work when everyone spends their time reading my blog instead of inviting him to speak himself  – or worse, instead of reading his book.  (Alas, I haven’t read it myself, either.)

However, in the vein of letting people know what’s happening here, Kevin has taken the opportunity to review some of the early history of next gen sequencing.  It’s splashed with all sorts of wonderful artefacts that represent the milestones: the first solexa genome sequenced (A phage), James Watson’s genome, the first prescription for human sequencing, etc.

More importantly, the talk also wandered into some of the more useful applications and work done on building the genomic revolution for personalized medicine.  (You might consider checking for one great example.  Pulitzer prize winning journalism, we’re told.)  Kevin managed to cover plenty of ways in which the new technologies have been applied to human health and disease – as well as to discover common human traits like freckling, hair curl and yes, even Asparagus anosmia!

Finally, the talk headed towards some of the sequencing centres and technologies we’ve seen here, including Complete Genomics, PacBio and a brief sojourn past Oxford Nanopore.  Some of my favourite technologies – and endlessly interesting topics for discussion over beer.  And naturally, as every conversation on next-gen sequencing must do, Kevin reminds us that the cost of the human genome has dropped from millions of dollars for the first set, down to the sub $10,000 specials.  Genomes for all!



CPHx: Anne Palser, Wellcome Trust Sanger Inst., Sponsored by Agilent Technologies – Whole genome sequencing of human herpesviruses

Whole genome sequencing of human herpesviruses
Anne Palser, Wellcome Trust Sanger Inst., Sponsored by Agilent Technologies


Herpes virus review.  dsDNA, enveloped viruses.  3 major classes, alpha, beta, gamma.

Diseases include Kaposi’s sarcoma (KSHV – 140 kb genome) and Burkitt’s lymphoma (EBV – 170 kb genome).

Hard to isolate viruses to sequence.  In some clinical samples, not all cells are infected.  When you sequence samples, you get more human DNA than you do virus.  Little is known about genome diversity; all sequences come from cell lines and tumours.  There is no wild type full genome sequence.

Target enrichment method used to try to enrich for virus DNA.

Samples of cell lines used.  Tried 5 primary effusion lymphoma cell lines (3 have EBV, all 5 have KSHV) and 2 Burkitt lymphoma cell lines (EBV).

Custom baits designed using 120-mers, each base covered by 5 probes for KSHV.  Similar done for EBV1 and EBV2. [skipping some details of how this was done.]
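The “each base covered by 5 probes” tiling can be sketched with a bit of arithmetic: 120-mer baits stepping every 120/5 = 24 bp give five baits over every interior base. The genome length below is invented (KSHV itself is ~140 kb per the talk).

```python
# Back-of-envelope bait tiling: 120-mer baits, 5x per-base probe
# coverage implies a start-to-start step of 24 bp. Toy genome length.

def tile_baits(genome_len, bait_len=120, per_base=5):
    """Return bait start positions tiled so each interior base falls
    under `per_base` baits."""
    step = bait_len // per_base              # 24 bp between bait starts
    return list(range(0, genome_len - bait_len + 1, step))

starts = tile_baits(1200)
# an interior base like 600 sits under bait_len / step = 5 baits
cov_600 = sum(1 for s in starts if s <= 600 < s + 120)
```

Bases near the genome ends get fewer baits; real designs handle the edges (and the EBV1/EBV2 variants) separately, which the talk skipped over too.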

Flow chart for “SureSelect target enrichment system capture process” from illustration.

Multiplexed 6 samples per lane.  Sequenced on Illumina GAII.

Walk through analysis pipeline.  Bowtie and Samtools used at final stages.

Specific capture of virus DNA.

  • KSHV.  77-91% reads map to reference sequence.  Capture looked good.
  • EBV: 52-82% mapping to ref.

Coverage looks good, and high for most of the genome.   Typical for viral sequencing.

SNPs relative to ref. sequence.  500-700 for KSHV, 2-2.5k for EBV relative to reference seq. Nice Circos-like figure showing distribution.


  • Custom SureSelect to isolate virus DNA from human DNA is successful.
  • Full virus genome sequences obtained.
  • Analysing SNPs and minority species present.
  • Currently looking at saliva samples, looking to estimate genomic diversity.
  • Looking at clinical pathologies.
  • High throughput, cost effective, applicable as a method to analyse other pathogen sequences.

CPHx: Elizabeth A Worthey, Medical College of Wisconsin – Making a Definitive Diagnosis: Successful clinical application of diagnostic whole genome sequencing

Making a Definitive Diagnosis: Successful clinical application of diagnostic whole genome sequencing
Elizabeth A Worthey, Medical College of Wisconsin


Making a Definitive Diagnosis.

Original request came in 2009: a young child with intractable inflammatory bowel disease.  No known test was diagnostic.   The primary physician went to a talk on WGS, and wanted to know if it would work on the child.

Case: poor weight gain at 15 months with perianal abscess.  Symptoms consistent with severe Crohn’s disease.  90 trips to the OR by the age of 3.  Disease progressed even after severe operations.  [Some very graphic photos here.]

Time is of the essence; the bottleneck was in the analysis.  The child was very ill, so they had to work fast.  Expected about 15,000 variants.  Used an Adobe Flex UI, Java middleware layer, Oracle 11g DB.

CarpeNovo.  Gives variant reports, etc.  Used the tool for about 4-5 months to narrow down 16k variants to just 2 at highly conserved positions, not found in additional human genome sequences.  Only 1 variant was left after more analysis.

XIAP gene.  The mutation changed a single amino acid.  Clinical diagnostics done to confirm the sequence variant.  Also not in other family members.  Conservation of this position is extreme, including in non-mammal model organisms.

Mutation would be predicted to affect release of inflammatory molecules.  Used assays to confirm this was the case in vitro.

The diagnosis was then made, and compared to other XIAP deficiencies, such as XLP2.  Standard treatment for XLP2 is allogeneic hematopoietic transplant.  After this treatment, the child progressed very well, and has few recurring symptoms, etc.  Doing well!

Not the end of the story.  After this, other physicians started to request similar programs.  Did not have resources to do this for everyone who requested it.  Went to the hospital and looked for funds to continue this program with additional children.

Multi-disciplinary, multi-institutional review process.  Patients receiving care at the hospital can be nominated.  A review committee makes decisions.  This is NOT a research project – it’s focussed on treatment of the patient.  Is there a likely outcome for potentially changing treatment, can it reduce the cost of diagnostic testing, etc.?

Structure of review board covered.  External expert physicians, committee review and nominating physician.  It takes 8-10 hours of work per patient nominated.

Discussion of ethics of what to return.  Data observed from NGS is not added to electronic health record.

WGS done on 6 individuals since.

Case #2.

Intractable seizures, neurological symptoms.  Also: was the twin sibling at risk?

Found two mutations that cause Joubert syndrome, but the presentation was not classic.  Unfortunately, no direct actions were possible.

Case #3

Full term infant born, seemed normal, but at 10 weeks rushed to the hospital.  [missed the why though]  Two mutations in the Twinkle gene.  The child died at 6 months of age.  Avoided major futile surgery.

Broader findings.

Have pre-authorization from insurance.  Education of providers and patients is necessary.  Large diverse teams required.  Diseases will be redefined: known phenotype but different gene.  You don’t always have improved treatment options, and sometimes there are none.




CPHx: Lisa D. White – Baylor College of Medicine – Chromosomal Microarray (aCGH) Applications in the Clinical Setting

Chromosomal Microarray (aCGH) Applications in the Clinical Setting
Lisa D. White – Baylor College of Medicine


Work shown here is the work of a large number of people.

Conflict of interest statement.  Baylor does get revenue from its sequencing services.

Custom targeted arrays.  180k postnatal CMA – high resolution, related to MR, DD, DF, autism, heart defects, seizure disorder.  Entire mitochondrial genome.  Recently upgraded to a 400k postnatal array, which has the same coverage as the other array but includes 120k new SNPs across the genome.

Interested in detecting absence of heterozygosity, e.g., consanguinity.

How does it work?  Uses the same labeling protocol with restriction digestion.  SNPs are recognized by whether the site is cut or not.

Absence of heterozygosity is not loss of heterozygosity.  It happens with consanguinity, e.g., identical regions inherited, not loss of a chromosome.

Also, Uniparental disomy, when one parent gives you both copies of the same chromosome, rather than one from each parent.  [is that correct?]

Examples given, showing Illumina 610 Quad vs Agilent custom.  Looks good.

Discovery of incest in assessment of AOH detection.  In clinical setting, it’s possible to identify cases of incest based on chromosomal data, eg. consanguinity.  Raises ethical issues, however.

Limits to the array.  [general array stats, like regions it doesn’t cover, situations like balanced translocations.  etc.]

Other situations: DNA extraction of uncultured amniocytes.  Informed consent.  MD collects sample and ships to lab.  DNA extraction is done (Bi et al, 2008, Prenatal Diagnosis).  3-5 ml.  Do three things: maternal cell contamination test, gender PCR and quantitation.  Average turnaround time is 6 days. (Some info about a backup culture, from a set-aside portion of the sample, but it’s rarely needed.)

Prenatal example… indication of abnormal hands and feet… a 500 kb duplication was detected.  Able to show it was de novo, not tied to either parent.


  • Arrays are important for diagnostics, even given NGS.
  • Can do valuable work, and can be offered more universally for all pregnancies.
  • Recently launched a cancer genetics lab, which will also use array CGH and NGS as part of the test.

Also developing NGS tests, moving forward.  Looking for diagnostic tests that can move into the CLIA lab for proper applications.

Big effort with lots of people working on it.

CPHx: Edwin Cuppen, Hubrecht Inst. and Utrecht University – Are we looking at the right places for pathogenic human genetic variation?

Are we looking at the right places for pathogenic human genetic variation?
Edwin Cuppen, Hubrecht Inst. and Utrecht University

Where are we looking, typically?  Genomes.  Thus, we search for variations across the genome.  We then end up sequencing the whole genome, but lack the tools to interpret it all.

Reduce work and costs by multiplexing.  Typically, we multiplex sample prep, multiplex enrichment, then barcode.  Instead, multiplex enrichment would be more cost effective.

Barcode Blocking would be the way to go.

Example shown comparing to Agilent SureSelect – exactly the same.  Have pushed so far to 5 samples in this case.

You can also show this scales to 96-plex; however, then you need proportionally more sequencing for large data sets (e.g., you wouldn’t want to do this for a genome).

Average base coverage per sample using 96-plex.  It is between 40-100x, so there’s only a 2-fold distribution.

Do you see allelic competition in enrichment pool? It’s possible, but in practice, you don’t see it.

Example given for an X-exome screen.  Only one enrichment with all of the different families, so it’s more cost effective.  Shows ability to identify causative variants.

Are we looking at the right places?  UTRs, promoters, enhancers, insulators, chromatin organizers, non-coding RNA.  There is much more than just protein coding sections in the genome.  However, if we look at the whole genome, there are limitations there too.  And still, what about structural variations?

Structural variation detection using mate-pair sequencing.  Not only do you get the distance of structural variations, you also get direction information.
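The “distance plus direction” point can be sketched as a tiny classifier for discordant pairs. Everything here is an assumption for illustration: the pair format, the expected insert size and tolerance, and a paired-end-style facing-each-other orientation convention; the talk didn’t specify any of these.

```python
# Toy structural-variant hints from mate pairs: flag pairs whose mapped
# separation or relative orientation departs from the library's
# expectation. Thresholds and record format are invented.

def classify_pair(pair, expected=10_000, tol=2_000):
    """pair: (chrom1, pos1, strand1, chrom2, pos2, strand2)."""
    c1, p1, s1, c2, p2, s2 = pair
    if c1 != c2:
        return "translocation candidate"
    if s1 == s2:                               # mates should face each other
        return "inversion candidate"
    span = abs(p2 - p1)
    if span > expected + tol:
        return "deletion candidate"            # span too large
    if span < expected - tol:
        return "insertion candidate"           # span too small
    return "concordant"

pairs = [("chr1", 1000, "+", "chr1", 11200, "-"),   # normal span
         ("chr1", 1000, "+", "chr2",  5000, "-"),   # different chromosomes
         ("chr1", 1000, "+", "chr1", 30000, "-")]   # span far too large
calls = [classify_pair(p) for p in pairs]
```

Real callers cluster many supporting pairs before making a call – single discordant pairs are mostly noise – but the orientation logic is what gives the direction information the talk highlighted.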

Proof of principle.  Detection of a three way translocation.  Started with a diagnosed patient.  It was found by standard cytogenetic analysis, but the question was if they could find it using structural variant detection. Sequenced father, mother, child.

Thousands of predicted structural events.  It includes errors in reference, it includes artefacts.  Some are just found in mother or father – inherited.  Are you finding the known breakpoints?  Yes… but they found more.  It was not just 3, it was far more.

10 of them were then confirmed, including the three that were expected.  The original data set did not make predictions about disrupted genes.  Looking at the new breakpoints observed, however, one was in protocadherin-15, which is mutated in Usher syndrome – which explains the phenotype.

Cytogenetics gives you less information, which is simple, but next gen sequencing gives you way more information, and can then give explanatory power.  In fact, you can use de novo assembly to make sense of the data more effectively and reconstruct the chromosomes.  You can even get single bp resolution.

Chromothripsis.  Shattering and reassembling of chromosomes.  Some pieces are lost, others are mixed, and reassembly occurs giving you information that would be challenging to identify otherwise.  Were able to reconstruct this data.

Mate pair seq in diagnostics.  Tag-density and mate-pair information can be used. Trio based approach used.  They were able to identify exact gene disrupted, where other approaches failed.  Single bp resolution.

You can also use it to resolve complex rearrangements that could not otherwise be visualized with other techniques.  Chromothripsis may be much more common than expected, as it has been observed in other samples.

Can be applied to cancer research.  Tumour specific structural variation.  There are significant differences even between two tumours of the same type.

Chromothripsis looked for in cancer samples… expected 2-4%.  Found it in all but one sample (looking in metastatic colorectal cancer).  Chromothripsis seems to be a common phenomenon driving cancer events.

Able to find expected events as well, and able to find known cancer genes affected by the rearrangements.

Also went back to exome sequencing.  Found a few interesting mutations in known cancer genes.


Multiplexed targeted sequencing approaches are effective for large and small sample sets.

Structural variation can be relevant and is largely missed, but can be assayed by using mate pair sequencing.

Chromothripsis is a novel and frequent process that contributes to dramatic somatic and germline structural variation and disease.

For understanding disease, we need to evaluate genomes at the nucleotide AND the structural level.

CPHx: Steve Glavas, Sponsored by Life Technologies

Sponsored talk: Introducing 5500 Series Genetic Analysis Systems
Steve Glavas, Sponsored by Life Technologies


Some features: Pay per lane sequencing, application per lane sequencing, optimized tools.

In one flow cell, you can use each lane independently.

Barcoding is supported.  Up to 1152 samples simultaneously.  Also have a library builder, runs in 2.5 hours, hands free.  Supports PGM.

Colourspace: only do it if you want to.  You can get out a FASTQ.  You get an .xsq file, one per lane.  You can then switch to FASTQ, etc.

[Actually, I found this online –  you might want to look there instead of having me copy out a vendor’s notes.  link.]

Per lane…. can reuse flow cells with unused lanes.

Dollar figures given in Euros… 99 euros per sample.

Review of data generated by an example flow cell.

10kb mate pairs look good – same as microarrays.

Also strand specific kits available.

Take away message:  Flexible, accurate (99.99%), walk away sample prep, barcoding, your choice of data format.

CPHx: Kasper Daniel Hansen, Johns Hopkins – Generalized Loss of Stability of Epigenetic Domains Across Cancer Types

Generalized Loss of Stability of Epigenetic Domains Across Cancer Types
Kasper Daniel Hansen, Johns Hopkins


The basis of phenotypic variation between species is frequently thought of as genotype-based.  However, if you consider two organs in the same individual, the tissues are clearly sharing genomes, but are phenotypically different.

Cancer and DNA methylation.  DNA methylation was the first epigenetic change identified in cancer.  Focus, recently, has been on hypermethylation of CpG islands.  We now see more on global hypomethylation, and hypomethylation of selected genes (oncogenes).

CpG island Shores.  Many changes are not in CpG islands, but in regions bordering CpG islands, termed CpG shores. (Irizarry et al, 2009)

Increased methylation variation across all cancers. The same regions that distinguish cancers from normals also distinguish normal tissue types.  [sorry, a spilled water glass behind me distracted me for a minute… missed a bit.]

Study design.  Whole genome bisulfite sequencing on tumours and matched normals: 3 colon cancers with normal mucosa, and 2 adenomas.

Wrote a custom aligner, Merman, to process bisulfite sequencing data.
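Merman’s internals weren’t described in the talk, but a common trick in bisulfite aligners is in-silico C→T conversion: bisulfite turns unmethylated C into T, so you convert both read and reference before matching, then call methylation by comparing the originals. A minimal sketch with invented sequences (Merman itself surely does far more):

```python
# Toy bisulfite handling: C->T convert read and reference for matching,
# then call methylation at reference cytosines from the raw read.

def ct_convert(seq):
    return seq.replace("C", "T")

def methylation_calls(read, ref):
    """Assume the read aligns at ref[0:len(read)] after C->T conversion;
    report a call at each reference cytosine: a C surviving in the read
    was protected (methylated), a T means it was converted."""
    assert ct_convert(read) == ct_convert(ref[:len(read)])
    calls = []
    for r_base, g_base in zip(read, ref):
        if g_base == "C":
            calls.append("methylated" if r_base == "C" else "unmethylated")
    return calls

ref  = "ACGTCGAC"
read = "ACGTTGAC"          # the middle C was unmethylated, read as T
calls = methylation_calls(read, ref)
```

The conversion is also why bisulfite alignment needs a custom or adapted aligner: a three-letter alphabet loses complexity, and reads must be compared against converted versions of both strands.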

Global levels of methylation – it’s not particularly interesting, but is representative.  Some clustering between tumour types, normals.

Loss of methylation boundaries:  There are sharp methylation boundaries in normals, but the boundaries are lost in cancers.  In cancers, methylation appears constant across shores and islands in the graph shown.

Boundary shifts also occur and novel hypomethylation exists as well.

Capture bisulfite – 40,000 capture regions.  Great data, consistent across labs doing the work, across samples, etc.

Large blocks of hypomethylation observed in cancers.  Increased variability in cancer samples compared to normals.  Consistent boundaries, covering more than 1/2 of the genome.   Related to structural conformation of the DNA in the nucleus.  Some cell type specificity as well.  [A lot of data is being presented here in graphical form, so if the above notes are confused, it’s because I’m having a bit of trouble following the flow.]

What predicts hypomethylation?  Silenced genes are correlated?

Blocks are enriched for hyper-variables [hypervariable methylation status?].  Some of the genes are associated with tumour progression.

[And a very abrupt end to the talk!  That’s it.]