The distribution of the IBD sharing and applications Tel Aviv University July 23, 2012 Shai Carmi Itsik Pe’er’s lab Department of Computer Science Columbia University

The distribution of the IBD sharing and applications


Page 1: The  distribution of  the IBD sharing and applications

The distribution of the IBD sharing and applications

Tel Aviv University
July 23, 2012

Shai Carmi
Itsik Pe’er’s lab

Department of Computer Science
Columbia University

Page 2: The  distribution of  the IBD sharing and applications

Outline

• IBD: introduction
• Coalescent theory of IBD
  – Distribution of pairwise sharing.
  – The variance.
  – The variance of the cohort-averaged sharing.
• Applications
  – Imputation by IBD.
  – Sequencing study design.
  – Siblings.
  – Demographic inference.
  – Jewish genetics.
• Summary

Page 3: The  distribution of  the IBD sharing and applications

Identity-by-descent (IBD)

• In isolated, small populations, all individuals have a recent common ancestor.
• Abundance of long haplotypes which are IBD.

[Figure (credit: L. Macmillan, UNC): individuals A and B, their common ancestor AB, and a shared segment.]

Page 4: The  distribution of  the IBD sharing and applications

IBD detection

• Until the last decade, IBD was usually defined for single markers.
• Genome-wide SNP arrays enable detection of long segments.
• GERMLINE (Gusev et al., Genome Res., 2009): a fast algorithm for detection of IBD segments in large cohorts.
• Divide the chromosomes into small windows.
• For each window, hash the genotypes of each individual and search for perfect matches.
• Extend seeds, as long as the match is good enough.
• Record matches longer than a cutoff m.
• Other methods exist.

[Figure: chromosomes of individuals A and B.]
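The window-hash-extend idea can be sketched in a few lines of Python. This is a toy illustration only, not GERMLINE's actual implementation: the function name `find_shared_segments`, the string encoding of haplotypes, and the parameters `window` and `min_windows` are assumptions made for the example, and the "extension" step here only joins adjacent windows that match perfectly, whereas the slide describes extension that tolerates an imperfect match.

```python
from collections import defaultdict
from itertools import combinations


def find_shared_segments(haplotypes, window, min_windows):
    """Toy seed-and-extend matcher over equal-length haplotype strings.

    Returns {(name_i, name_j): [(start, end), ...]} in marker units, end exclusive.
    """
    n_markers = len(next(iter(haplotypes.values())))
    n_windows = n_markers // window

    # Seeds: for each pair, the windows in which the two haplotypes are identical.
    seeds = defaultdict(list)
    for w in range(n_windows):
        buckets = defaultdict(list)  # hash table keyed by the window's alleles
        for name, hap in haplotypes.items():
            buckets[hap[w * window:(w + 1) * window]].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                seeds[pair].append(w)

    # Extension (simplified): merge runs of consecutive matching windows.
    segments = {}
    for pair, windows in seeds.items():
        runs, start, prev = [], windows[0], windows[0]
        for w in windows[1:]:
            if w != prev + 1:
                runs.append((start, prev))
                start = w
            prev = w
        runs.append((start, prev))
        # Keep only runs at least min_windows long (the cutoff m, in windows).
        long_runs = [(a * window, (b + 1) * window)
                     for a, b in runs if b - a + 1 >= min_windows]
        if long_runs:
            segments[pair] = long_runs
    return segments
```

Hashing each window keeps the search close to linear in the number of individuals, because only haplotypes that land in the same bucket are ever compared.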

Page 5: The  distribution of  the IBD sharing and applications

Questions

• How much IBD is expected in model populations?
  – Consider the fraction of genome shared between all possible pairs.
  – Mean?
  – Variance?
  – Distribution?
• Applications
  – Demographic inference
  – Study design
  – Positive selection detection
  – Phasing and imputation
  – Pedigree reconstruction

Page 6: The  distribution of  the IBD sharing and applications

Sequencing study design

• A large genotyped cohort.
• A subset is selected for sequencing.
• Look for IBD segments between sequenced and non-sequenced individuals.
• Impute variants along IBD segments.
• To maximize utility, select the individuals with the most sharing (Gusev et al., Genetics, 2012 (INFOSTIP)).

[Figure: individual A is selected for sequencing.]

Page 7: The  distribution of  the IBD sharing and applications

Sequencing study design

• A large genotyped cohort.
• A subset is selected for sequencing.
• Look for IBD segments between sequenced and non-sequenced individuals.
• Is the strategy useful? Is it worth prioritizing?
• How is the average sharing of each individual to the rest of the cohort distributed?

[Figure: individual A is selected for sequencing.]

Page 8: The  distribution of  the IBD sharing and applications

Wright-Fisher model

• Non-overlapping, discrete generations.
• A population of constant size of N haploid individuals.
• Ignore mutations (when studying IBD).
• Recombination is a Poisson process.
• Each pair of individuals (lineages) has probability 1/N to coalesce in the previous generation.
• In the limit of continuous time and large population size, the model is approximated by the coalescent.
• The (scaled) time to the MRCA is exponential with rate 1.

[Figure: a Wright-Fisher genealogy with N=10.]
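A quick simulation makes the last two bullets concrete. The sketch below follows two lineages back in time, letting each pick a parent uniformly among N in every generation, and checks that the time to the MRCA, in units of N generations, has mean close to 1. The function names and the parameter values in the usage note are illustrative choices, not taken from the talk.

```python
import random


def pairwise_coalescence_generations(N, rng):
    """Generations back until two lineages pick the same parent (probability 1/N per generation)."""
    g = 1
    while rng.randrange(N) != rng.randrange(N):
        g += 1
    return g


def mean_scaled_tmrca(N, n_pairs, seed=0):
    """Average time to the MRCA over n_pairs independent pairs, in units of N generations."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_generations(N, rng) for _ in range(n_pairs))
    return total / (n_pairs * N)
```

For example, `mean_scaled_tmrca(50, 20000)` should come out near 1, since the number of generations is geometric with mean N.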

Page 9: The  distribution of  the IBD sharing and applications

Mosaic of segments

• Consider two (unrelated) chromosomes.
• The total sharing fT: the fraction of the chromosome in shared segments of length ≥m.
• Observation: all sites are in shared segments, but a segment can be short if the common ancestor is ancient.

[Figure: chromosomes A and B along the coordinate from 0 to L, divided into segments ℓ1 to ℓ11; segments longer than m count as shared, ℓT=ℓ1+ℓ5+ℓ9.]

Page 10: The  distribution of  the IBD sharing and applications

Mosaic of segments

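• Assume the (scaled) coalescence time at a site is t.
• A segment of length ℓ is shared if there is no recombination event in the history of the two lineages.
• Number of meioses: 2Nt.

[Figure: chromosomes A and B along the coordinate from 0 to L, divided into segments ℓ1 to ℓ11; segments longer than m count as shared, ℓT=ℓ1+ℓ5+ℓ9. Inset: lineages A and B coalescing in the common ancestor AB at time t.]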
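As a worked illustration of these bullets: if ℓ is measured in Morgans (an assumption here, so that each meiosis contributes one expected crossover per unit length) and recombination is Poisson, the probability that a stretch of length ℓ survives all 2Nt meioses without a breakpoint is exponential in 2Ntℓ. The numbers in the comments are illustrative.

```latex
P(\text{no recombination over } \ell \mid t) = e^{-2Nt\,\ell},
\qquad \ell \text{ in Morgans}
% Example: N = 10^4, \ell = 0.01 (1 cM)
% t = 0.005 (50 generations):   2Nt\ell = 1,   probability e^{-1} \approx 0.37
% t = 1     (10^4 generations): 2Nt\ell = 200, probability e^{-200}, effectively zero
```

This is why long shared segments point to recent common ancestors.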

Page 11: The  distribution of  the IBD sharing and applications

Mosaic of segments

• Li and Durbin (Nature, 2011) found that at the end of a segment,
• Therefore,

[Figure: chromosomes A and B along the coordinate from 0 to L, divided into segments ℓ1 to ℓ11; segments longer than m count as shared, ℓT=ℓ1+ℓ5+ℓ9.]

Page 12: The  distribution of  the IBD sharing and applications

Renewal theory

• Distribution of waiting times:

[Figure: the equivalent renewal process on the time axis from 0 to T, with waiting times τ1 to τ11; waiting times longer than m are summed, tS=τ1+τ5+τ9.]
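The renewal picture can be simulated directly: lay independent segments end to end until the chromosome is covered, and sum the lengths of those that reach the cutoff. The sketch below is generic in the segment-length sampler, because the waiting-time distribution used in the talk is not preserved in this text; the name `shared_fraction` and the exponential sampler in the usage note are placeholders.

```python
import random


def shared_fraction(sample_length, L, m, rng):
    """Lay segments end to end on [0, L]; return the fraction covered by segments of length >= m.

    sample_length(rng) must return a positive segment length.
    """
    position, shared = 0.0, 0.0
    while position < L:
        # The last segment is truncated at the end of the chromosome.
        length = min(sample_length(rng), L - position)
        if length >= m:
            shared += length
        position += length
    return shared / L
```

With a placeholder sampler such as `lambda rng: rng.expovariate(1.0)`, repeated calls give an empirical distribution of the total sharing, which is the quantity the Laplace-transform solution describes analytically.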

Page 13: The  distribution of  the IBD sharing and applications

Renewal theory: solution

• Laplace transform T→s, tS→u

Page 14: The  distribution of  the IBD sharing and applications

Mean IBD sharing

• Can be derived in many ways.

• (1)

• (2)
• The average number of segments ≥m is 2NL·P(ℓ≥m).

• (3) Palamara, …, Pe’er, AJHG, 2012.
• At the end of the talk (time permitting).
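The mean can also be checked numerically. The sketch below is a reconstruction from the model assumptions stated earlier in the talk (scaled coalescence time t ~ Exp(1), 2Nt meioses, Poisson recombination, chromosome ends ignored, m in Morgans), not a transcription of the slide's own formulas, which are not preserved here. Under those assumptions the segment around a site is the sum of two exponential distances to the nearest breakpoints, and integrating over t gives the closed form (1+4mN)/(1+2mN)². Function names are illustrative.

```python
import random


def mean_sharing_closed_form(N, m):
    """(1 + 4mN) / (1 + 2mN)^2, the mean sharing under the assumptions above."""
    x = 2.0 * m * N
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


def mean_sharing_monte_carlo(N, m, n_sites, seed=0):
    """Fraction of simulated sites whose surrounding segment has length >= m."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sites):
        t = rng.expovariate(1.0)   # scaled coalescence time at the site
        rate = 2.0 * N * t         # breakpoints per Morgan (2Nt meioses)
        # Segment length is (e1 + e2) / rate with e1, e2 ~ Exp(1): one distance per side.
        # Comparing against m * rate avoids dividing by a rate of zero.
        if rng.expovariate(1.0) + rng.expovariate(1.0) >= m * rate:
            hits += 1
    return hits / n_sites
```

For N=1000 and m=0.001 (0.1 cM), both functions should give a value near 5/9.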

Page 15: The  distribution of  the IBD sharing and applications

Varying population size

• Use results of Li and Durbin (Nature, 2011).

and then proceed as before.

• The mean IBD sharing:

Page 16: The  distribution of  the IBD sharing and applications

The variance of the IBD sharing

• Can also be calculated in a number of ways.

• (1)

• (2) Define I(s), the indicator that site s is in a shared segment between two given chromosomes; it equals 1 with probability π (=<fT>).
• Define the number of sites as M.

• The variance requires calculating two-site probabilities.
• Almost-exact solution at the end of the talk (time permitting).
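To see why two-site probabilities are needed, write the total sharing as the average of the indicator over the M sites. This is a sketch that assumes this averaging definition and uses π2 for the two-site probability; the slide's own expression is not preserved in this text.

```latex
f_T = \frac{1}{M} \sum_{s=1}^{M} I(s), \qquad \langle f_T \rangle = \pi

\mathrm{Var}[f_T] = \frac{1}{M^2} \sum_{s_1=1}^{M} \sum_{s_2=1}^{M}
    \left[ \pi_2(s_1, s_2) - \pi^2 \right],
\qquad \pi_2(s_1, s_2) = P\big( I(s_1) = 1,\ I(s_2) = 1 \big)
```

The diagonal terms contribute π(1−π)/M; everything else depends on how strongly sharing at one site predicts sharing at another.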

Page 17: The  distribution of  the IBD sharing and applications

The variance: simplified

• (3) Idea:

• Two distant sites will always be on a shared segment if there was no recombination event in their history.

• If there was, treat sites as independent.• Neglect some small terms.

• The probability of no recombination:

• The variance:

For the human genome,

d≥m

Page 18: The  distribution of  the IBD sharing and applications

The cohort-averaged sharing

• The distribution is close to normal.
• But with a variance that approaches a constant even for large sample size n. Why?

• Scales as 1/n for small n.
• Approaches a constant for large samples.

• For the human genome,
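One way to see why the variance levels off: the n−1 pairwise values averaged for a given individual all involve that individual's own chromosome, so they are positively correlated. The sketch below assumes, for illustration only, a common variance `sigma2` for each pairwise value and a common covariance `c` between any two of them; the numbers in the usage note are made up, and the talk's exact expression is not preserved here.

```python
def cohort_average_variance(n, sigma2, c):
    """Variance of the mean of n-1 pairwise values with common variance sigma2 and covariance c."""
    k = n - 1  # number of other individuals in the cohort
    return sigma2 / k + c * (k - 1) / k
```

With `sigma2=1.0` and `c=0.1`, the first term dominates for small n and decays like 1/n, while for large n the result approaches `c`, which no amount of extra sampling removes.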

Page 19: The  distribution of  the IBD sharing and applications

The tail of the cohort-averaged sharing: ‘hyper sharing’

• Even for large cohorts, the distribution of the cohort-averaged sharing retains a constant width.

• Some individuals will be in the tails of this distribution! ‘hyper sharing’.

• Can be taken advantage of in sequencing studies.
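A minimal sketch of how such a selection could look, given a symmetric matrix of pairwise sharing values: compute each individual's cohort-averaged sharing and take the top ns. This simple ranking is an illustration only and is not the INFOSTIP procedure cited earlier; the function names and the matrix layout are assumptions.

```python
def cohort_averaged_sharing(sharing):
    """Average sharing of each individual with all others (diagonal ignored)."""
    n = len(sharing)
    return [sum(row[j] for j in range(n) if j != i) / (n - 1)
            for i, row in enumerate(sharing)]


def select_highest_sharing(sharing, n_s):
    """Indices of the n_s individuals with the highest cohort-averaged sharing."""
    avg = cohort_averaged_sharing(sharing)
    order = sorted(range(len(avg)), key=lambda i: avg[i], reverse=True)
    return sorted(order[:n_s])
```

Because the width of the distribution does not shrink with cohort size, the individuals picked this way can share noticeably more than a random subset even in large cohorts.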

Page 20: The  distribution of  the IBD sharing and applications

Imputation by IBD

• Our results can be used to calculate the expected imputation power when sequencing a subset of a cohort.

• Assume a cohort of size n, of which ns are sequenced.
• Random selection of individuals:

• Selection of highest-sharing individuals:

• where

Page 21: The  distribution of  the IBD sharing and applications

Increase in association power

• The imputed genomes can be thought of as increasing the effective number of sequences.

• A simple model (Shen et al., Bioinformatics, 2011):
• Variant appears in cases only.
• Carrier frequency in cases equals β.
• Dominant effect.
• Association detected if the P-value is below a threshold.
• For a fixed budget, there is a trade-off in the number of cases/controls to sequence.
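The simple model can be turned into a power calculation. The sketch below assumes a one-sided Fisher exact test on carrier counts, which is a choice made for this example; the test used in the cited paper and in the talk may differ, and the effect of imputation on the effective sample size is not modeled. Function and parameter names are illustrative.

```python
from math import comb


def fisher_p_cases_only(k, n_cases, n_controls):
    """One-sided Fisher exact P-value when all k carriers are cases (none in controls)."""
    return comb(n_cases, k) / comb(n_cases + n_controls, k)


def detection_power(beta, n_cases, n_controls, alpha):
    """Probability that the carrier count among cases gives a P-value below alpha."""
    power = 0.0
    for k in range(n_cases + 1):
        # Number of carriers among cases is Binomial(n_cases, beta).
        prob_k = comb(n_cases, k) * beta ** k * (1.0 - beta) ** (n_cases - k)
        if fisher_p_cases_only(k, n_cases, n_controls) < alpha:
            power += prob_k
    return power
```

Scanning `detection_power` over splits of a fixed total number of sequenced individuals between cases and controls exposes the trade-off mentioned in the last bullet.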

Page 22: The  distribution of  the IBD sharing and applications

Siblings

• Siblings share, on average, 50% of their genomes.
• What is the variance?
• A classic problem.

• (Visscher et al. PLoS Genet. 2006).
• Used the variance to estimate heritability from sibling studies.
• Genome-wide SD 5.5%.
• But what if parents are inbred?

• Assume shared segments are either from parents or are more remote.
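For non-inbred parents, the spread of sibling sharing can be explored with a small simulation. The sketch assumes Poisson crossovers without interference, equal map lengths in both sexes, and user-supplied chromosome lengths in Morgans; all of these are simplifications chosen for the example, and the code makes no attempt to reproduce the SD quoted above.

```python
import random


def sibling_sharing_one_parent(length_morgans, rng):
    """Fraction of a chromosome on which two siblings got the same grandparental copy from one parent."""
    same = rng.random() < 0.5  # state at the start of the chromosome
    position, shared = 0.0, 0.0
    while position < length_morgans:
        # A crossover in either sibling's meiosis flips the state: total rate 2 per Morgan.
        step = min(rng.expovariate(2.0), length_morgans - position)
        if same:
            shared += step
        position += step
        same = not same
    return shared / length_morgans


def sibling_sharing(chromosome_lengths, rng):
    """Genome-wide sharing: length-weighted average over chromosomes and over the two parents."""
    total = sum(chromosome_lengths)
    shared = 0.0
    for length in chromosome_lengths:
        paternal = sibling_sharing_one_parent(length, rng)
        maternal = sibling_sharing_one_parent(length, rng)
        shared += length * 0.5 * (paternal + maternal)
    return shared / total
```

Calling `sibling_sharing` many times with a list of chromosome lengths gives an empirical mean near 0.5 and an empirical SD that shrinks as the total map length grows.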

Page 23: The  distribution of  the IBD sharing and applications

Estimator of population size

• Given one genome, estimate the population size N.
• Calculate the total sharing fT. We know that

• Invert to suggest an estimator:

• Not very useful: estimator is biased

• and has SD

• Compared to for Watterson’s estimator (based on the number of het sites).
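The inversion step can be sketched as follows, assuming the mean sharing takes the form (1+4mN)/(1+2mN)², which follows from the model assumptions of the talk (Exp(1) scaled coalescence time, 2Nt meioses, m in Morgans) but is a reconstruction rather than the slide's preserved formula. Setting x=2mN and solving fT(1+x)²=1+2x for the positive root gives x=(1−fT+√(1−fT))/fT. Function names are illustrative.

```python
from math import sqrt


def mean_sharing(N, m):
    """Assumed mean sharing: (1 + 4mN) / (1 + 2mN)^2."""
    x = 2.0 * m * N
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


def estimate_population_size(f_T, m):
    """Invert mean_sharing for N; requires 0 < f_T < 1."""
    if not 0.0 < f_T < 1.0:
        raise ValueError("f_T must lie strictly between 0 and 1")
    return (1.0 - f_T + sqrt(1.0 - f_T)) / (2.0 * m * f_T)
```

Feeding the exact mean back in recovers N, but a single genome gives a noisy fT, and because the inversion is nonlinear the resulting estimate is biased, as the slide notes.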

Page 24: The  distribution of  the IBD sharing and applications

Ashkenazi Jews

• In recent years, shown to be a genetically distinct group.
• Close to Middle Easterners and Europeans (particularly Italians and Adygei).

• (Atzmon et al., Am. J. Hum. Genet., 2010)
• Very large amounts of IBD (Gusev et al., Mol. Biol. Evol., 2011), likely due to a recent, severe bottleneck.

Page 25: The  distribution of  the IBD sharing and applications

IBD in Ashkenazi Jews

• 2,600 Ashkenazi Jews, 1M SNP array (Guha et al., Genome Biol. 2012).
• Use GERMLINE to detect IBD segments.
• Compare the total sharing to simulations of inferred demography based on mean IBD in different length ranges (Palamara et al., AJHG, 2012).

Excess of ‘hyper sharing’ in AJ

Page 26: The  distribution of  the IBD sharing and applications

Admixture in AJ

• Most plausible explanation: correct for admixture.

The AJ component was calculated in comparison to CEU.

When considering only individuals with close to median AJ ancestry, most of the unexplained variance disappears.

Page 27: The  distribution of  the IBD sharing and applications

Summary

• We calculated the distribution of the total IBD sharing in the Wright-Fisher model using renewal theory.

• We obtained explicit expressions for the variance of both the pairwise sharing and the cohort-averaged sharing.

• We calculated the expected gain in imputation and association power if individuals at the tail of the cohort-averaged sharing distribution are selected for sequencing.

• The variance/distribution of IBD has many applications; we presented some of them, and others are left for future work.

• In the AJ population, individuals differ in cohort-averaged sharing by up to 30%. Admixture explains some of the variance.

Page 28: The  distribution of  the IBD sharing and applications

The end

Thanks to:
• Pier Francesco Palamara
• Vladimir Vacic
• Itsik Pe’er
• Todd Lencz, Ariel Darvasi (for AJ genotypes)

• Human Frontier Science Program Cross-Disciplinary Fellowship.

Page 29: The  distribution of  the IBD sharing and applications

Identity-by-descent (IBD)

[Figure: founder chromosomes and contemporary chromosomes, illustrating identity-by-descent.]

Page 30: The  distribution of  the IBD sharing and applications

Mean IBD (Palamara et al.)

• See (Palamara et al., AJHG, 2012).
• Assume shared segments must have length at least m.

• Define I(s): the indicator, with probability π, that site s is in a shared segment between two given chromosomes.

• Define fT: the mean fraction of the chromosome found in shared segments, or the total sharing.

• Given g, the number of generations to the MRCA:

• In the coalescent, g→Nt:

• Then, <fT>=π.
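The two missing expressions can be reconstructed from the definitions above. This is a sketch under the talk's assumptions (N haploid individuals, Poisson recombination, m in Morgans, chromosome ends ignored); it is a derivation from those assumptions, not a transcription of the slide.

```latex
% Segment around a site, given g generations to the MRCA (2g meioses):
% the distance to the nearest breakpoint on each side is Exp(2g), so the length is Erlang-2.
P(\ell \ge m \mid g) = \int_m^\infty (2g)^2\, \ell\, e^{-2g\ell}\, d\ell = (1 + 2mg)\, e^{-2mg}

% Coalescent limit, g \to Nt with t \sim \mathrm{Exp}(1):
\pi = \int_0^\infty e^{-t}\, (1 + 2mNt)\, e^{-2mNt}\, dt
    = \frac{1}{1 + 2mN} + \frac{2mN}{(1 + 2mN)^2}
    = \frac{1 + 4mN}{(1 + 2mN)^2}
```

For example, N=1000 and m=0.001 give 2mN=2 and π=5/9.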

Page 31: The  distribution of  the IBD sharing and applications

The variance of the total sharing (1)

• The variance requires calculating two-sites probabilities.

• Idea:
• For one site, the PDF of the coalescence time is Φ(t)~Exp(1).
• For two sites, calculate the joint PDF Φ(t1,t2).
• Φ(t1,t2) takes into account the interaction between the sites.
• Given t1, t2, calculate π2 as if the sites are independent.

Page 32: The  distribution of  the IBD sharing and applications

The variance of the total sharing (2)

• Express π2 in terms of the Laplace transform of Φ(t1,t2).

• π2

• Use the coalescent with recombination to find

where A-E are defined in terms of q1, q2, and the scaled recombination rate ρ.

Page 33: The  distribution of  the IBD sharing and applications

IBD in AJ

Are ‘hyper-sharing’ individuals sharing more with everyone else, or just with other ‘hyper-sharing’ individuals?

[Figure: each curve represents the average of 1/7 of the individuals, in order of their cohort-averaged sharing; both the curves and the axis run from highest sharing to lowest sharing.]