Jonathan Marchini, University of Oxford – Phasing, imputation and analysis of 500,000 UK individuals genotyped for UK biobank #AGBTPH

Major health resource aimed at improving the prevention and treatment of disease. Available to academic and commercial researchers worldwide. (Not completely free: you have to have a good reason to use it, etc.)

Baseline questionnaire (touch screen), 4 minute interview, baseline measures.  Some subsets had additional tests.  Participants in the enhanced phenotyping subsets were asked to do further specific tests and questionnaires as well.

Whole genome genotyping with a bespoke array.

Axiom SNP array – 830k.  Run on all participants.

First step: Quality control.  Provide a robust set of quality control measures. Also provide researchers with useful derived genetic properties, such as genetic ancestry.

PCA done on individuals, showing geographic genetic ancestry.  [Very typical plot of first 2 PCs.]

Family relatedness: “Found a considerable number, rather more than we expected.”  148,000 individuals have a relative (cousin or closer) in the data set.  Can be useful, but important to know that not all individuals are independent data points.
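
[Not from the talk: a minimal sketch of one common way to deal with the "not independent data points" issue, which is to keep a subset of individuals in which no two are related. It assumes you already have a list of related pairs; the function name and the greedy strategy are my own illustration, not the UK Biobank procedure.]

```python
from collections import defaultdict

def unrelated_subset(individuals, related_pairs):
    """Greedy heuristic: repeatedly drop the individual with the most
    remaining relatives until no related pair is left.
    (Hypothetical helper for illustration only.)"""
    relatives = defaultdict(set)
    for a, b in related_pairs:
        relatives[a].add(b)
        relatives[b].add(a)

    kept = set(individuals)
    while kept:
        # Individual with the most relatives still in the kept set;
        # ties are broken by name so the result is deterministic.
        worst = max(kept, key=lambda i: (len(relatives[i] & kept), str(i)))
        if not (relatives[worst] & kept):
            break  # nobody left has a relative in the kept set
        kept.remove(worst)
    return kept

pairs = [("a", "b"), ("a", "c"), ("d", "e")]
subset = unrelated_subset(["a", "b", "c", "d", "e"], pairs)
```

The greedy choice does not guarantee the largest possible unrelated subset, and this simple loop would be far too slow for 500,000 people; it is only meant to show the idea.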

3.2 billion bases in the genome, but only about 800,000 positions were measured.  What can be said about the unmeasured fraction?  Use statistical methods to estimate haplotypes (haplotype estimation, i.e. phasing).  Used their tool “Shapeit2”, which was ok, but not great because one step had O(n^2) behaviour.  Modified the code to O(n*log(n)).  Uses hierarchical clustering in a local area.

Applied method to data set – (Nature Genetics)

Tested software using 72 trios.  Run time: 15 minutes. Switch error rate: 2.6%. Total sample size: 1,072.  Method was to phase the children within the trio, and then remove the parents and phase the children again as part of the group.  If the phase changes, that’s an error.
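
[Not from the talk: a small sketch of how a switch error rate can be computed for one individual, assuming the trio-based phase is treated as truth. The function name and the input layout (one (allele on haplotype 1, allele on haplotype 2) pair per site) are assumptions for illustration.]

```python
def switch_error_rate(truth, inferred):
    """Fraction of consecutive heterozygous sites where the inferred phase
    flips relative to the truth. Each input is a list of (hap1, hap2)
    allele pairs, one per site. (Hypothetical helper.)"""
    orientation = []
    for (t1, t2), (i1, i2) in zip(truth, inferred):
        if t1 == t2:
            continue  # homozygous site: phase is not defined
        if {t1, t2} != {i1, i2}:
            raise ValueError("genotypes differ between truth and inferred")
        # True if the inferred haplotypes are in the same order as the truth
        orientation.append(t1 == i1)

    if len(orientation) < 2:
        return 0.0  # fewer than two het sites: no switch is possible

    switches = sum(1 for prev, cur in zip(orientation, orientation[1:])
                   if prev != cur)
    return switches / (len(orientation) - 1)

truth    = [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0)]
inferred = [(0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]
rate = switch_error_rate(truth, inferred)  # one switch over three het-to-het steps
```

Note that a haplotype pair that is swapped everywhere counts as zero switches, since only changes of orientation between neighbouring heterozygous sites are errors.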

If sample size is increased to 10,000, you do much better: error rate goes to 1.5%.  At 150,000 samples, error goes to 0.3% (run time: 38 hours). “Making just a handful of errors”.


To fill in the unmeasured positions, use existing data sets where haplotypes are known.  This is called imputation.  You can use matches between your known SNPs and existing haplotypes to guess at what is in between.  (In practice, you can use many matches, and an HMM to make the best guess at the answer.)  Algorithm is called IMPUTE4 – 10min per sample.
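
[Not from the talk: a toy sketch of the "many matches plus an HMM" idea, for a single haplotype. This is not IMPUTE4; it is a simplified copying model in which the hidden state is which reference haplotype is being copied. The function name and the `switch_prob` and `err` parameters are assumptions for illustration.]

```python
def impute_haplotype(ref_panel, target, switch_prob=0.05, err=0.01):
    """ref_panel: list of K reference haplotypes (lists of 0/1 alleles).
    target: list of 0/1 at typed sites and None at untyped sites.
    Returns the posterior expected allele (dosage) at every site."""
    K, L = len(ref_panel), len(target)

    def emit(k, s):
        obs = target[s]
        if obs is None:
            return 1.0  # untyped site: no information
        return 1.0 - err if ref_panel[k][s] == obs else err

    # Forward pass, normalised at each site to avoid underflow
    fwd = [[0.0] * K for _ in range(L)]
    fwd[0] = [emit(k, 0) / K for k in range(K)]
    for s in range(1, L):
        total = sum(fwd[s - 1])
        row = [((1 - switch_prob) * fwd[s - 1][k] + switch_prob * total / K)
               * emit(k, s) for k in range(K)]
        norm = sum(row)
        fwd[s] = [v / norm for v in row]

    # Backward pass
    bwd = [[1.0] * K for _ in range(L)]
    for s in range(L - 2, -1, -1):
        weighted = [bwd[s + 1][k] * emit(k, s + 1) for k in range(K)]
        total = sum(weighted)
        row = [(1 - switch_prob) * weighted[k] + switch_prob * total / K
               for k in range(K)]
        norm = sum(row)
        bwd[s] = [v / norm for v in row]

    # Posterior over copied haplotypes, averaged into an allele dosage
    dosages = []
    for s in range(L):
        post = [fwd[s][k] * bwd[s][k] for k in range(K)]
        norm = sum(post)
        dosages.append(sum(p * ref_panel[k][s] for k, p in enumerate(post)) / norm)
    return dosages

panel = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 1, 0, 1, 0]]
dosages = impute_haplotype(panel, [1, None, 1, None, 1])
```

In this example the typed alleles only match the second reference haplotype, so the imputed dosages at the two untyped sites come out close to 1.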

800,000 SNPs –>  80 Million Imputed SNPs.  [Mostly accurate, from tests shown and getting better all the time.]

Example: standing height.

Using biobank SNPs, you don’t see much with 10,000 individuals.  With 150,000 biobank individuals, you can see a few more regions of interest.  At 350,000 individuals (the subset with homogeneous ancestry), you can find several regions that are relevant.  If you apply imputation on top, you can see many regions that are likely to be associated with the trait. (Adding imputation actually lets you see details on genes that aren’t covered by the original SNP set.)
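
[Not from the talk: a minimal sketch of a single-SNP association test for a quantitative trait like height, using ordinary least squares of the trait on the allele count. The talk did not say which test was used, so this is an assumption, and the function name and numbers are made up. The standard error shrinks as the sample grows, which is why more regions show up at 350,000 individuals than at 10,000.]

```python
import math

def snp_association(genotypes, phenotype):
    """Regress phenotype on genotype (0/1/2 allele counts or dosages).
    Returns (effect size, standard error, test statistic).
    Assumes the genotype is not constant and the fit is not perfect."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotype))

    beta = sxy / sxx                      # effect per copy of the allele
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2
              for g, y in zip(genotypes, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    return beta, se, beta / se

# Made-up heights (cm) for six individuals, for illustration only
genotypes = [0, 1, 2, 0, 1, 2]
heights = [170.0, 171.1, 171.9, 169.9, 170.9, 172.1]
beta, se, stat = snp_association(genotypes, heights)
```

A real analysis would also adjust for covariates such as ancestry PCs and would account for related individuals.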

Leave the validation of this data for other researchers.

Full release will probably happen early next year.

