Genotype Phasing and Imputation in 1x Sequencing Data Warren W. Kretzschmar DPhil Genomic Medicine and Statistics Wellcome Trust Centre for Human Genetics, Oxford, UK Supervisor: Jonathan Marchini


Page 1

Genotype Phasing and Imputation in 1x Sequencing Data

Warren W. Kretzschmar

DPhil Genomic Medicine and Statistics

Wellcome Trust Centre for Human Genetics, Oxford, UK

Supervisor: Jonathan Marchini

Page 2

Major Depression

• Commonest psychiatric disorder and the second-ranking cause of morbidity worldwide.
• Affects 1 in 10 people in their lifetime.
• Estimates of heritability range from 30% to 40%.

Page 3

Top Ten Causes of DALYs

DALY (disability-adjusted life year): number of years lost due to ill health, disability or early death

Figure labels:
• Major depressive disorders
• Violence
• Ischaemic heart disease
• Alcohol use disorders
• Road traffic accidents
• Diabetes mellitus
• Cerebrovascular disease
• Other unintentional injuries
• Lower respiratory infections
• Chronic obstructive pulmonary disease

Page 4

Genetics of Major Depression

Study Design
• Unrelated Europeans
• 9240 cases
• 9519 controls
• 1.2 million SNPs

Hypotheses
• Depression has heterogeneous environmental and genetic causes
• Depression is a complex trait with genetic components of small effect size

Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2012). A mega-analysis of genome-wide association studies for major depressive disorder. Molecular Psychiatry 18.4:497-511.

Page 5

CONVERGE (China, Oxford and VCU Experimental Research on Genetic Epidemiology)

Genetically Homogeneous : All subjects are female and their grandparents are Han Chinese

6,000 cases : typically severely affected: 85% qualify for a diagnosis of melancholia by DSM-IV, and >25% reported a family history of MD in one or more first-degree relatives.

6,000 controls : patients undergoing minor surgical procedures.

Extensive Phenotyping : primary disorder of major depression, common comorbid disorders (e.g. generalized anxiety disorder, panic disorder), within disorder symptoms (e.g. suicidal ideation), disorder subtypes (e.g. melancholia, dysthymia), possible endophenotypes (e.g. neuroticism) and a range of risk factors (e.g. child abuse, stressful life events, social and marital relationships, parenting, post-natal depression, demographics).

Sequencing : mean depth 1.7X using Illumina HiSeq at Beijing Genomics Institute

Current status : Sequencing finished. We have data on 12,000 samples. For now we have only considered the ~13M sites that are polymorphic in 1000 Genomes Asian samples. Analysis ongoing…

59 hospitals, 45 cities, 21 provinces.

Page 6

Sequence analysis pipeline

Phase 1: genotype likelihood estimation (one sample at a time)

Raw reads → Mapping (Stampy) → Duplicate marking (Picard) → Base quality recalibration (GATK) → Genotype likelihood estimation (SNPTools) → Genotype likelihoods

Phase 2: phasing and imputation (all samples together)

Genotype likelihoods → Phasing and imputation (My focus!) → Genotype probabilities

Resource labels on the figure: 48 TB; 650 GB, 4.6 CPU years; 350 GB, 2.7 CPU years; 5 CPU years

Page 7

GENOTYPE PHASING AND IMPUTATION

Page 8

Genotype Phasing

Example SNP chip data

Unphased: G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C

After Phasing

Hap 1: G A A T T T T A G C

Hap 2: G T A T G A T A G G

Phase-informative Sites
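A small sketch may help make the "phase-informative sites" label concrete. It assumes that these are the heterozygous sites, since only there do the two haplotypes differ and the phase has to be inferred. The function names `heterozygous_sites` and `is_consistent` are made up for illustration; the data are the unphased genotypes and the phased haplotypes from the slide.

```python
# Unphased genotypes from the slide, one "a/b" string per site.
unphased = "G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C".split()

# Phased haplotypes from the slide.
hap1 = "G A A T T T T A G C".split()
hap2 = "G T A T G A T A G G".split()


def heterozygous_sites(genotypes):
    """Return the 1-based positions whose two alleles differ."""
    return [
        position
        for position, genotype in enumerate(genotypes, start=1)
        if len(set(genotype.split("/"))) == 2
    ]


def is_consistent(h1, h2, genotypes):
    """Check that a haplotype pair carries exactly the alleles of each genotype."""
    return all(
        sorted((a, b)) == sorted(genotype.split("/"))
        for a, b, genotype in zip(h1, h2, genotypes)
    )


het = heterozygous_sites(unphased)

# With k heterozygous sites there are 2**(k - 1) distinct haplotype pairs
# that are consistent with the unphased genotypes.
n_possible_phasings = 2 ** (len(het) - 1)

print(het, n_possible_phasings, is_consistent(hap1, hap2, unphased))
```

Homozygous sites contribute nothing to the ambiguity, which is why the number of candidate phasings depends only on the count of heterozygous sites.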

Page 9

Genotype Imputation from Haplotypes

J Marchini and B Howie. Nature Rev. Genet. 2010

Page 10

GENOTYPE LIKELIHOODS

Page 11

What is a Genotype Likelihood?

Genotype Likelihood = Pr( R | G )

R = Reads; also known as the “observed data”

G = Genotype; usually one of ref/ref, ref/alt, alt/alt

Genotype likelihoods (aka GL) are defined on a site by site basis.

GLs are conditional probabilities.
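To illustrate Pr( R | G ) at a single site, here is a minimal sketch under a deliberately simple read model. It is not the SNPTools model: it assumes that reads are independent, that each read comes from either of the two alleles with probability 1/2, and that a base is misread as any specific other base with probability `error_rate / 3`. The function name and the default `error_rate` are hypothetical.

```python
def genotype_likelihoods(read_bases, ref, alt, error_rate=0.01):
    """Return Pr(R | G) for G in ref/ref, ref/alt, alt/alt at one site.

    Assumed toy model: independent reads, each sampled from one of the two
    alleles with probability 1/2, with a uniform base-calling error.
    """

    def p_base_given_allele(base, allele):
        return 1.0 - error_rate if base == allele else error_rate / 3.0

    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    likelihoods = {}
    for name, (allele1, allele2) in genotypes.items():
        likelihood = 1.0
        for base in read_bases:
            likelihood *= 0.5 * p_base_given_allele(base, allele1) \
                + 0.5 * p_base_given_allele(base, allele2)
        likelihoods[name] = likelihood
    return likelihoods


# A single read showing the reference base, as is typical at ~1x coverage.
example = genotype_likelihoods(["G"], ref="G", alt="T")
print(example)
```

The three values do not sum to one, because each is a probability of the reads conditional on a different genotype. With a single read, ref/alt remains about half as likely as ref/ref, which shows how little one read can say about a genotype on its own.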

Page 12

How are Genotype Likelihoods Useful?

Genotype Probability = Pr ( G | R ) proportional to Pr( R | G ) * Pr( G )

Genotype likelihoods allow us to quantify how much the reads support each possible genotype independent of other information.

To determine the most likely genotype call, we need a genotype probability.

Pr( G ) = prior probability of G. May be determined through haplotype phasing and imputation approaches.
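The proportionality above becomes a genotype probability once it is normalised over the three genotypes. The sketch below shows that step. The GL values and the prior are made-up numbers chosen for illustration, and `genotype_probabilities` is a hypothetical helper name; in practice the prior would come from phasing and imputation, not be typed in by hand.

```python
def genotype_probabilities(gls, prior):
    """Apply Bayes' rule: Pr(G | R) = Pr(R | G) * Pr(G) / sum over all G."""
    unnormalised = {g: gls[g] * prior[g] for g in gls}
    total = sum(unnormalised.values())
    return {g: value / total for g, value in unnormalised.items()}


# Hypothetical GLs for one low-coverage site: the reads favour ref/ref.
gls = {"ref/ref": 0.99, "ref/alt": 0.5, "alt/alt": 0.003}

# With a flat prior the call simply follows the GLs.
flat_prior = {"ref/ref": 1 / 3, "ref/alt": 1 / 3, "alt/alt": 1 / 3}
flat_posterior = genotype_probabilities(gls, flat_prior)

# Hypothetical informative prior, standing in for what imputation might supply.
informative_prior = {"ref/ref": 0.01, "ref/alt": 0.18, "alt/alt": 0.81}
posterior = genotype_probabilities(gls, informative_prior)

best_flat = max(flat_posterior, key=flat_posterior.get)
best_informed = max(posterior, key=posterior.get)
print(best_flat, best_informed)
```

The same reads lead to different calls under the two priors, which is the reason low-coverage data benefits from borrowing information across samples.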

Page 13

Genotype Likelihood Creation with SNPTools

Y Wang, J Lu, J Yu, RA Gibbs, FL Yu. Genome Research. 2013

observed reads

Three distributions

Pr(R|G = alt/alt) = 10e-6

Pr(R|G = ref/alt) = 10e-3

Pr(R|G = ref/ref) = 0.06

Page 14

Genotype Phasing using Genotype Likelihoods

Example GL data

Pr(ref/ref): G/G A/A A/A T/T G/G A/A T/T A/A G/G G/G

Pr(ref/alt): G/A A/T A/G T/A G/T A/T T/C A/G G/C G/C

Pr(alt/alt): A/A T/T G/G A/A T/T T/T C/C G/G C/C C/C

Reference Haplotypes

Hap 1: G A A T T A C A G G

Hap 2: G T A T T A T A G G

Hap 3: G T A T G A C A G G

Hap 4: G T A T G A T A G C

Plausible Haplotypes after Phasing

Hap 5: G A A T T A T A G C

Hap 6: G T A T T A T A G G

Page 15

General MCMC Scheme for Phasing from GLs

When using GLs, haplotype estimation is currently done in an iterative Markov Chain Monte Carlo (MCMC) scheme:

1. Initialize haplotypes for each sample randomly
2. For a predetermined number of iterations:
   1. For each sample:
      1. Find a plausible haplotype pair using its GLs and all other haplotypes as a reference panel
      2. Update that sample’s haplotypes with the plausible haplotype pair
3. Return each sample’s current pair of haplotypes
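The loop structure of this scheme can be sketched in a few lines. This is a toy under strong assumptions, not the method used in practice: the "find a plausible haplotype pair" step here simply draws two whole haplotypes from the panel with probability proportional to the product of the sample's GLs, whereas real phasing tools typically use a hidden Markov model that builds each haplotype as a mosaic of panel haplotypes. Alleles are coded 0 (ref) and 1 (alt), `gls[s][site][g]` is indexed by the number of alt alleles `g`, and all function names are hypothetical.

```python
import random


def pair_likelihood(gl, h1, h2):
    """Product over sites of the GL for the genotype implied by (h1, h2)."""
    likelihood = 1.0
    for site, (a, b) in enumerate(zip(h1, h2)):
        likelihood *= gl[site][a + b]
    return likelihood


def sample_pair(gl, panel, rng):
    """Draw a haplotype pair from the panel, weighted by its likelihood."""
    pairs = [(h1, h2) for h1 in panel for h2 in panel]
    weights = [pair_likelihood(gl, h1, h2) for h1, h2 in pairs]
    if sum(weights) == 0:
        # No panel pair is compatible with the GLs; fall back to a uniform draw.
        return rng.choice(pairs)
    return rng.choices(pairs, weights=weights, k=1)[0]


def phase(gls, n_iterations=20, seed=0):
    """Run the iterative scheme and return one haplotype pair per sample."""
    rng = random.Random(seed)
    n_sites = len(gls[0])
    # Step 1: initialise every sample's two haplotypes at random.
    haps = [
        [[rng.randint(0, 1) for _ in range(n_sites)] for _ in range(2)]
        for _ in gls
    ]
    # Step 2: repeatedly update each sample against all other haplotypes.
    for _ in range(n_iterations):
        for s, gl in enumerate(gls):
            panel = [h for t, pair in enumerate(haps) if t != s for h in pair]
            h1, h2 = sample_pair(gl, panel, rng)
            haps[s] = [list(h1), list(h2)]
    # Step 3: return the current haplotypes.
    return haps


# Three samples, four sites, made-up GLs ordered as (ref/ref, ref/alt, alt/alt).
toy_gls = [
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.1, 0.8, 0.1], [0.3, 0.6, 0.1], [0.9, 0.1, 0.0], [0.2, 0.5, 0.3]],
    [[0.7, 0.3, 0.0], [0.1, 0.3, 0.6], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]],
]
toy_haps = phase(toy_gls, n_iterations=5, seed=1)
```

Because this toy only copies whole haplotypes, it can never produce a haplotype that was not present after random initialisation; the mosaic step of an HMM-based sampler is what removes that limitation.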

Page 16

The Tools/Languages I use

Coding: Emacs

Scripting: Perl with DistributedMake for pipelines

Statistical Methods: C++

Figure Generation: R

Statistical Analysis & Report Writing: LaTeX with Sweave

Presentations: PowerPoint or LaTeX

Page 17

A Bioinformatician’s Best Practices

- Understand your goals and choose appropriate methods
- Be suspicious and trust nobody
- Set traps for your own scripts and other people’s
- Be a detective
- You're a scientist, not a programmer
- Use version control software
- Pipelineitis is a nasty disease
- An Obama frame of mind
- Someone has already done this. Find them!

according to Nick Loman & Mick Watson. Nature Biotechnology. 2013

see also: W. S. Noble. PLoS Computational Biology. 2009

Page 18

Good Directory Structure

according to W. S. Noble. PLoS Computational Biology. 2009
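The figure for this slide did not survive extraction. As a rough sketch only, the layout below follows what I understand Noble's guide to recommend: top-level `doc`, `data`, `src`, `bin` and `results` directories, with date-stamped subdirectories under `results`. The helper name `make_project` and the project name `converge_phasing` are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path


def make_project(root, name, today=None):
    """Create a project skeleton and return the dated results directory."""
    today = today or date.today()
    project = Path(root) / name
    # Fixed top-level directories for documents, data, source and binaries.
    for sub in ("doc", "data", "src", "bin"):
        (project / sub).mkdir(parents=True, exist_ok=True)
    # Results are organised chronologically, one directory per working day.
    run_dir = project / "results" / today.isoformat()
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# Demonstration in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    make_project(tmp, "converge_phasing", date(2013, 1, 15))
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))

print("\n".join(created))
```

Keeping each day's experiments in its own dated directory makes it easier to work out later which script produced which result.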

Page 19

Thank you. Questions?