Functional annotation of “healthy” genomes: implications for clinical application.
Daniel MacArthur, Wellcome Trust Sanger Institute & Wired Science
The sequence-function intersection.
What we need are tools and resources that let researchers and clinicians merge information together and make use of this data. Many things need to be done, including improving annotations, fixing the human reference sequence, and improving databases of variation and disease mutations.
Data set used: a single high-quality individual genome, from an anonymous European in the HapMap project. One of the most highly sequenced individuals in the world.
Also working on a pilot study with the 1000 Genomes Project: 179 individuals from 4 populations.
Focussing on loss of function (LOF) variants: SNPs that introduce stop codons, variants that disrupt splice sites, large deletions and frame-shift mutations. These are expected to be enriched for deleterious mutations. They have been found in ALL published genomes; all genomes are “dysfunctional”. Some genomes are more dysfunctional than others… however, that might be an enrichment of sequencing errors.
Functional sites are typically under selective pressure, leading to less variation. The more likely something is to be functional, the more likely an apparent variant there is to be an error. [I didn’t express it well, but the noise has a greater influence on highly conserved regions with low variation than on regions with higher variation.] Reasons an apparent LOF variant may not be real:
- Sequencing errors. These get easier to find as time goes by and the technology improves.
- Reference or annotation artefacts, e.g. a false intron in a gene annotation.
- Variants unlikely to cause true loss of function, e.g. a truncation in the last amino acid of a protein.
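As a rough illustration of the last point, a premature stop can be scored by how much of the protein it removes. The sketch below is hypothetical: the function names and the 5% cutoff are assumptions made for illustration, not values given in the talk.

```python
def truncated_fraction(stop_position: int, protein_length: int) -> float:
    """Fraction of the protein lost when a premature stop replaces
    the residue at 1-based position `stop_position`."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    if not 1 <= stop_position <= protein_length:
        raise ValueError("stop_position must lie within the protein")
    # Every residue from the stop position onward is lost.
    return (protein_length - stop_position + 1) / protein_length


def likely_true_lof(stop_position: int, protein_length: int,
                    min_fraction_lost: float = 0.05) -> bool:
    """Keep a stop variant only if it removes at least `min_fraction_lost`
    of the protein. The 0.05 default is an assumed, illustrative cutoff."""
    return truncated_fraction(stop_position, protein_length) >= min_fraction_lost


# A stop at the very last residue of a 500-residue protein is filtered out;
# a stop at residue 100 is kept as a candidate.
last_residue_kept = likely_true_lof(500, 500)
early_stop_kept = likely_true_lof(100, 500)
```

A real pipeline would also need to consider which domains are lost, not just how many residues.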
Loss of function filtering: done with experimental genotyping, manual annotation and informatic filtering. Finally, after all that filtering, you get down to the “true LOF variations.”
Example: 600 raw variants become 200 after filtering by any transcript, down to 130 after filtering on all transcripts.
Homozygous loss of function variants were observed in the high-quality genome. The ones observed cover a range of genes. The real LOF variants tend to be rare and enriched for mildly deleterious effects.
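One way to read the any/all distinction is that a variant can be predicted LOF in at least one transcript of a gene, or in every transcript of that gene. The sketch below assumes that reading; the variant and transcript names and the per-transcript predictions are invented for illustration.

```python
from typing import Dict


def lof_in_any_transcript(per_transcript: Dict[str, bool]) -> bool:
    """True if the variant is predicted LOF in at least one transcript."""
    return any(per_transcript.values())


def lof_in_all_transcripts(per_transcript: Dict[str, bool]) -> bool:
    """True if the variant is predicted LOF in every annotated transcript.
    A variant with no annotated transcripts is not counted."""
    return bool(per_transcript) and all(per_transcript.values())


# Hypothetical predictions: variant -> {transcript -> predicted LOF?}
variants = {
    "var_a": {"tx1": True, "tx2": True},
    "var_b": {"tx1": True, "tx2": False},
    "var_c": {"tx1": False, "tx2": False},
}

any_set = sorted(v for v, tx in variants.items() if lof_in_any_transcript(tx))
all_set = sorted(v for v, tx in variants.items() if lof_in_all_transcripts(tx))
```

The all-transcripts set is always a subset of the any-transcript set, which matches the shrinking counts in the example.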
LOF variants affect RNA expression. Variants predicted to undergo nonsense mediated decay are less frequent. [I may have made a mistake here.]
Can use LOF variants to inform clinical outcomes. You can distinguish genes carrying LOF variants from recessive disease genes. ROC AUC = 0.81 (a reasonably modest but predictive model). Applying this to disease studies at Sanger.
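For reference, an ROC AUC can be read as the probability that a randomly chosen gene from one class scores higher than a randomly chosen gene from the other. A minimal sketch of that calculation follows; the scores are made up, and the talk did not describe the model or its inputs.

```python
from typing import Sequence


def roc_auc(scores_positive: Sequence[float],
            scores_negative: Sequence[float]) -> float:
    """ROC AUC computed as the fraction of (positive, negative) pairs in
    which the positive example scores higher; ties count as half."""
    if not scores_positive or not scores_negative:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical classifier scores for two classes of genes.
disease_gene_scores = [0.9, 0.8, 0.4]
lof_tolerant_gene_scores = [0.7, 0.3, 0.2]
auc = roc_auc(disease_gene_scores, lof_tolerant_gene_scores)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so 0.81 sits between the two.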
Next steps:
- More LOF variants for better classification
- Improve upstream processes
- Improve the human reference sequence
- Use catalogs of LOF-tolerant genes for better disease gene prediction