Data Standards and Statistical Issues for Immunogenetic Data

Richard M. Single
Associate Professor of Statistics
Department of Mathematics & Statistics
University of Vermont



Outline

• HLA nomenclature: Why it matters for analysis and interpretation
– Challenges for combining HLA data from different sources
• Data Standardization to facilitate meta-analyses and reproducibility
– Developing a community standard for HLA & KIR data reporting
• Overview of HLA data curation & ambiguity resolution
– Example, ImmPort, Next steps: GL strings & QR codes
• HLA (chrom 6) and KIR (chrom 19) interactions
– A brief overview
• HLA and KIR: population-level evidence of co-evolution
– Population-genetic evidence of co-evolution
– Randomization tests and genomic controls

HLA Nomenclature and why it matters


• Challenges for HLA data management and analysis
– The HLA genes are very polymorphic;
– HLA nomenclature is complicated;
– There are multiple ways to generate HLA data;
– All common typing systems generate ambiguous data;
– There are multiple ways to report alleles and ambiguities.

These issues make meta-analyses of HLA data from different sources very difficult.

[Figure: the MHC, an extremely gene-rich region. Klein J. et al., New Engl J Med 2000; 343:702-709.]

Structure of HLA molecules

• HLA molecules are cell-surface proteins that present peptide fragments to T-cells
• They bind specific sets of peptides based on structure

[Figure: HLA class I and HLA class II molecules, each presenting a peptide fragment to a T-cell receptor (TCR); β2-m = β2-microglobulin.]

[Figure: ribbon drawing of the HLA-C binding pocket, with positions 7, 73, 77, 80, and 90 labeled. From Hedrick et al., PNAS 88:5897-5901.]

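HLA classical loci and polymorphism

[Figure: map of the HLA classical loci, showing the class II loci (DP, DQ, DR, with their A1 and B1 genes) and the class I loci (B, C, A), the distances between loci in kb, and the protein-level allele numbers for each locus (values shown: 1612, 2211, 1280, 2, 980, 31, 216, 19, 153). Source: IMGT/HLA Database Release 3.12.0, April 17, 2013.]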

HLA Allele Nomenclature

Example: HLA-A*24:02:01:02L

• Locus: HLA-A
• Field 1 (“2-digit”): serological level (where possible)
• Field 2 (“4-digit”): peptide level (amino acid difference)
• Field 3 (“6-digit”): nucleotide level [silent] (synonymous substitutions)
• Field 4 (“8-digit”): intron level (3’ or 5’ polymorphism)
• Expression suffix: N = null, L = low, S = soluble

• For most analyses, we want to distinguish among unique peptide sequences, i.e., the 2-field (“4-digit”) level
• This level of resolution treats alleles with the same peptide sequence for exons 2 & 3 (class I) or exon 2 (class II) as being equivalent [“binning” alleles]
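The field structure of an allele name can be illustrated with a small parser. This is only a sketch under stated assumptions: the function names `parse_allele` and `to_two_fields` are hypothetical, the pattern handles colon-delimited names with the three expression suffixes listed on the slide (N, L, S), and truncation drops the suffix, which may not be appropriate for null alleles in a real analysis.

```python
import re

# Colon-delimited allele names, e.g. "HLA-A*24:02:01:02L".
# Assumption: only the suffixes N, L and S (those named on the slide) are handled.
ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z0-9]+)\*(?P<fields>\d+(?::\d+){0,3})(?P<suffix>[NLS]?)$"
)


def parse_allele(name):
    """Split an allele name into locus, list of fields, and expression suffix."""
    match = ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised allele name: {name!r}")
    return {
        "locus": match["locus"],
        "fields": match["fields"].split(":"),
        "suffix": match["suffix"],
    }


def to_two_fields(name):
    """Truncate a name to the 2-field ("4-digit") peptide level."""
    allele = parse_allele(name)
    return f"{allele['locus']}*{':'.join(allele['fields'][:2])}"


parsed = parse_allele("HLA-A*24:02:01:02L")
print(parsed["locus"], parsed["fields"], parsed["suffix"])
print(to_two_fields("HLA-A*24:02:01:02L"))  # A*24:02
```

Truncating both DRB1*14:01:01 and DRB1*14:01:02 with `to_two_fields` gives the same name, which is the "binning" of alleles with identical peptide sequence described above.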

HLA Nomenclature & Polymorphism

• HLA alleles are defined by a “patchwork” of sequence-level polymorphisms.
• Most typing systems do not interrogate the same set of polymorphisms
- e.g., DRB1*14:01:01 vs. *14:54 differ only in exon 3
• There is currently no simple way to identify which alleles could (could not) have been detected by a given typing system.

[Figure: distinctive geographical distribution of subtypes of HLA-DRB1*08.]


Data Standardization to facilitate Meta-analyses

Data standardization methods …

• Document the typing method (SSOP, SSP, SBT, …), version, exons interrogated, and the set of detectable alleles
• Perform data validation by checking against IMGT & IPD-KIR allele lists

These methods
– allow re-evaluation of raw data in future contexts
– allow information/results to be combined across datasets more easily
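A validation step of this kind can be sketched as a lookup against a reference allele list. Everything below is illustrative: `validate_alleles` is a hypothetical helper, `reference_list` is a four-allele stand-in for a real IMGT or IPD-KIR allele list, and the `typing_record` fields are invented examples of the metadata worth documenting.

```python
def validate_alleles(reported, reference):
    """Split reported allele names into (recognised, unrecognised) lists."""
    known = set(reference)
    recognised = [a for a in reported if a in known]
    unrecognised = [a for a in reported if a not in known]
    return recognised, unrecognised


# Hypothetical excerpt standing in for a full reference allele list.
reference_list = {"A*02:01", "A*24:02", "B*27:05", "B*44:02"}

# Hypothetical record documenting how the typing was generated.
typing_record = {
    "method": "SBT",
    "exons_interrogated": [2, 3],
    "alleles": ["A*02:01", "B*27:5", "B*44:02"],  # "B*27:5" is a deliberate typo
}

ok, bad = validate_alleles(typing_record["alleles"], reference_list)
print("recognised:", ok)
print("needs review:", bad)
```

Names that fail the lookup are flagged for review rather than silently corrected, so the raw data stay available for re-evaluation.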

Extending STREGA to Immunogenomic Studies

• The STrengthening the REporting of Genetic Association studies (STREGA) statement provides community-based data reporting and analysis standards for genomic disease association studies
• The IDAWG (immunogenomics.org) has proposed an extension of STREGA: STrengthening the REporting of Immunogenomic Studies (STREIS)

From STREGA to STREIS

Extensions to the STREGA guidelines for immunogenomic data include:

• Describing the system(s) used to store, manage, and validate genotype and allele data
• Documenting all methods applied to resolve ambiguity
• Defining any codes used to represent ambiguities
• Describing any binning or combining of alleles into common categories
• Avoiding the use of subjective terms (e.g., “high-resolution typing”) that may change over time


Allele-level Ambiguity

Ambiguous alleles result from polymorphisms outside of assessed regions:
• outside of exons 2 & 3, or
• in sections of those exons that were not interrogated.

Ambiguous allele set: A*0201/0209/0243N/0266/0275/0283N/0289

Group codes (“g”-codes) for alleles identical in exons 2 & 3 for class I, or exon 2 for class II:
A*0201/0209/0243N/0266/0275/0283N/0289 = “A020101g”

NMDP ambiguity codes for 4-digit non-null alleles:
A*0201/0209 = A*02AF
A*0201/0209/0266 = A*02AJEY
A*0201/0209/0266/0275/0289 = A*02BSFJ

Genotype-level Ambiguity

Ambiguous genotypes result from an inability to establish the phase of individual polymorphisms or entire exons. Different combinations of alleles can lead to the same typing result.

Example: A typing result for one individual that could be explained by any of four different possible genotypes at HLA-B.

Genotype 1: B*2705 + B*4402, or
Genotype 2: B*2705 + B*4411, or
Genotype 3: B*2709 + B*4402, or
Genotype 4: B*2709 + B*4411

Most analytical methods require a single genotype call for each individual sample.
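The four candidate genotypes in this example are the Cartesian product of the possibilities for each allele. A minimal sketch of that enumeration follows; the helper name `possible_genotypes` is hypothetical, and real typing results can carry constraints between the two alleles that a plain product would not capture.

```python
from itertools import product


def possible_genotypes(allele1_options, allele2_options):
    """List every genotype consistent with the options for each allele."""
    return [tuple(sorted(pair)) for pair in product(allele1_options, allele2_options)]


genotypes = possible_genotypes(["B*2705", "B*2709"], ["B*4402", "B*4411"])
for number, genotype in enumerate(genotypes, start=1):
    print(f"Genotype {number}: {' + '.join(genotype)}")
```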

Standardized Ambiguity Reduction

Sample #001: possible genotypes at HLA-B

Genotype 1
  HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
  HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 2
  HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
  HLA-B allele 2: 440202, 4411
Genotype 3
  HLA-B allele 1: 2709
  HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 4
  HLA-B allele 1: 2709
  HLA-B allele 2: 440202, 4411

After peptide-level filtering, removing non-CWD (common and well-documented) alleles, and binning alleles identical over exons 2 & 3:
  HLA-B allele 1: 2703, 2705
  HLA-B allele 2: 4402

After applying regional population-level frequency data (unambiguous data):
  HLA-B allele 1: 2705
  HLA-B allele 2: 4402

immunogenomics.org
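The filtering and binning steps for Sample #001 can be sketched as follows. This is not the ImmPort or IDAWG implementation: the function names are hypothetical, peptide-level truncation is approximated by keeping the first four digits of the old-style names, and the `CWD` set is an invented three-allele stand-in chosen only so that the sketch reproduces the intermediate result shown on the slide. The final step, which uses regional population-level frequency data, is not modelled.

```python
def to_peptide_level(allele):
    """Truncate an old-style digit name to 4 digits, keeping an N (null) suffix."""
    suffix = "N" if allele.endswith("N") else ""
    return allele[:4] + suffix


def reduce_genotypes(genotypes, cwd):
    """Bin alleles at the peptide level, drop non-CWD alleles, merge duplicates."""
    reduced = []
    for allele1_list, allele2_list in genotypes:
        side1 = sorted({to_peptide_level(a) for a in allele1_list} & cwd)
        side2 = sorted({to_peptide_level(a) for a in allele2_list} & cwd)
        # A genotype survives only if both alleles still have a candidate.
        if side1 and side2 and (side1, side2) not in reduced:
            reduced.append((side1, side2))
    return reduced


B27_LIST = ["2703", "270502", "270503", "270504", "270505", "270506",
            "270508", "2710", "2713", "2717"]
B44_LONG = ["44020101", "44020102S", "440203", "4419N", "4423N", "4424",
            "4427", "4433"]
B44_SHORT = ["440202", "4411"]

sample_001 = [
    (B27_LIST, B44_LONG),   # Genotype 1
    (B27_LIST, B44_SHORT),  # Genotype 2
    (["2709"], B44_LONG),   # Genotype 3
    (["2709"], B44_SHORT),  # Genotype 4
]

# Hypothetical CWD set, NOT taken from a published CWD catalogue.
CWD = {"2703", "2705", "4402"}

print(reduce_genotypes(sample_001, CWD))
```

With these assumptions the four genotypes collapse to a single entry with 2703 or 2705 as the first allele and 4402 as the second.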

Genotype List (GL) Strings

• Use a hierarchical set of operators to describe the relationships between
– alleles, lists of possible alleles, phased alleles, genotypes, lists of possible genotypes, and multilocus unphased genotypes,
– without losing typing information or increasing typing ambiguity.
• Are proposed to replace NMDP codes

Milius et al. (2013) Tissue Antigens

• Example GL string for the genotype:
  A*02:69 + A*23:30 or
  A*02:302 + A*23:26 or
  A*02:302 + A*23:39
  and
  B*44:02 + B*49:08
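A GL string for this example can be assembled from nested lists. The sketch below assumes the operator characters from the GL string proposal, which the slide text does not list: "+" joins the two alleles of a genotype, "|" separates alternative genotypes at a locus, and "^" separates loci. The function name `gl_string` is hypothetical, and the sketch writes each alternative genotype out in full rather than compressing shared alleles with the "/" allele-ambiguity operator.

```python
def gl_string(loci):
    """Build a GL string from a list of loci.

    Each locus is a list of possible genotypes; each genotype is a list
    of allele names.
    """
    locus_parts = []
    for genotypes in loci:
        locus_parts.append("|".join("+".join(genotype) for genotype in genotypes))
    return "^".join(locus_parts)


example = [
    [  # HLA-A: three possible genotypes
        ["A*02:69", "A*23:30"],
        ["A*02:302", "A*23:26"],
        ["A*02:302", "A*23:39"],
    ],
    [  # HLA-B: one genotype
        ["B*44:02", "B*49:08"],
    ],
]

print(gl_string(example))
# A*02:69+A*23:30|A*02:302+A*23:26|A*02:302+A*23:39^B*44:02+B*49:08
```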

Resources for HLA Data Validation & Analysis

• Immunology Database and Analysis Portal (www.ImmPort.org)
Developed under the Bioinformatics Integration Support Contract (BISC) for NIH, NIAID, & DAIT (Division of Allergy, Immunology, and Transplantation)
– Data validation pipeline
– Analysis tools
– Standardized ambiguity reduction tools
– Data from a large number of immunogenomic studies

• ImmunoGenomics Data Analysis Working Group (www.immunogenomics.org, www.IgDAWG.org)
An international collaborative group working to …
– facilitate the sharing of immunogenomic data (HLA, KIR, etc.) and
– foster consistent analysis and interpretation of immunogenomic data


Killer cell Immunoglobulin-like Receptor (KIR)

• The KIR gene complex is located on Chromosome 19 (19q13.4)
• KIR are expressed on natural killer (NK) cells and a subset of T cells
• Certain HLA alleles serve as ligands for KIR

KIR Gene   Function     Ligand
2DL1       Inhibitory   HLA-C group 2
2DS1       Activating   HLA-C group 2
2DL2/3     Inhibitory   HLA-C group 1
2DS2       Activating   HLA-C group 1
3DL1       Inhibitory   HLA-Bw4
3DS1       Activating   HLA-Bw4

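KIR regulate NK cell activity

[Figure, two panels. Dominant inhibition: the inhibitory KIR (iKIR) on an NK cell engages HLA on a normal cell while an activating receptor (Act. rec.) engages its ligand; result: no lysis (protection). Missing-self recognition: an NK cell, with iKIR and an activating receptor engaging its ligand, facing HIV+ targets; result: lysis and cytokines.]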

HLA-C alleles can be divided into two groups based on the amino acid at position 80 (& 77), which determines KIR recognition

• HLA-C1 (Ser77, Asn80): Cw1, Cw3, Cw7, Cw8, Cw12, Cw13, Cw14
  Ligand for KIR2DL3/2DL2
• HLA-C2 (Asn77, Lys80): Cw2, Cw4, Cw5, Cw6, Cw15, Cw17
  Ligand for KIR2DL1

Engagement of these inhibitory KIR by their HLA-C ligands results in NK cell inhibition.
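The grouping can be expressed as a lookup from serological HLA-C type to group and from group to inhibitory KIR. The sets below copy the lists on the slide; the function names are hypothetical, and types outside the two lists are rejected rather than guessed.

```python
# Serological HLA-C types per group, as listed on the slide.
C1_TYPES = {"Cw1", "Cw3", "Cw7", "Cw8", "Cw12", "Cw13", "Cw14"}
C2_TYPES = {"Cw2", "Cw4", "Cw5", "Cw6", "Cw15", "Cw17"}

# Inhibitory KIR that recognise each group, as listed on the slide.
INHIBITORY_KIR = {"C1": ["KIR2DL2", "KIR2DL3"], "C2": ["KIR2DL1"]}


def c_group(c_type):
    """Return "C1" or "C2" for a serological HLA-C type such as "Cw4"."""
    if c_type in C1_TYPES:
        return "C1"
    if c_type in C2_TYPES:
        return "C2"
    raise ValueError(f"HLA-C type not in either list: {c_type!r}")


def ligand_groups(genotype):
    """Sorted list of the HLA-C groups carried by an individual."""
    return sorted({c_group(c_type) for c_type in genotype})


groups = ligand_groups(["Cw4", "Cw7"])
print(groups)                                  # ['C1', 'C2']
print([INHIBITORY_KIR[g] for g in groups])
```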

Bifurcation of HLA-B allotypes

HLA-B
• Bw4 (40%): KIR3DL1 ligands (also shown: KIR3DS1); subdivided by position 80 into 80I and 80T
• Bw6 (60%): not a ligand for KIR


KIR & HLA in 30 Global Populations

Population-level evidence for Co-evolution & Natural Selection for KIR and HLA

• Several studies have hypothesized selection for KIR that suit the locale-specific HLA repertoire.
– Population-level data suggest a balanced relationship between activating receptors and their ligands across populations.
• Disease association studies point to HLA-Bw4 alleles with Isoleucine at position 80 (“Bw4-80I”) as the strongest ligand for KIR3DS1
– Population-level data show the strongest relationship between KIR3DS1 and Bw4-80I frequencies.

Correlations between frequencies for KIR and HLA Ligands

Inhibitory KIR
• KIR2DL3 vs. HLA-C group 1: r = 0.184
• KIR3DL1 vs. HLA-Bw4: r = 0.426
• KIR2DL1 vs. HLA-C group 2: r = 0.046

Activating KIR
• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR2DS1 vs. HLA-C group 2: r = -0.478
• KIR2DS2 vs. HLA-C group 1: r = -0.371

Activating KIR3DS1: subsets of Bw4 alleles based on amino acid position 80
• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR3DS1 vs. HLA-Bw4-80I: r = -0.657
• KIR3DS1 vs. HLA-Bw4-80T: r = -0.190

Single et al., Nature Genetics

Statistical Issues

• Challenges for these and other population studies
– Demographic history shapes patterns of variation & can mimic the effects of selection.
– Gene frequencies are not statistically independent among populations, due to shared demographic history.
• Ordinary Pearson correlation p-values assume independence among the observations.
• We constructed a randomization test to account for the demographic histories of the populations and focus on the genetic effect.

Hypothesis Test for a Correlation Coefficient

Assessing the significance of ρ = cor(X,Y)

• Null Hypothesis: H0: ρ = 0
• Statistic: Pearson’s correlation coefficient

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$

X     Y
4.1   4.9
8.6   5.4
2.3   4.2
5.4   7.4
9.2   8.8
7.7   6.7
6.4   8.8
4.3   5.1
7.6   9.4
3.4   5.3

r_observed = .674
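The observed correlation for the ten (X, Y) pairs on this slide can be checked directly. The sketch below uses only the standard library; the t statistic on n - 2 degrees of freedom is the usual one for testing ρ = 0 and is included as an illustration, not as something shown on the slide.

```python
from math import sqrt

x = [4.1, 8.6, 2.3, 5.4, 9.2, 7.7, 6.4, 4.3, 7.6, 3.4]
y = [4.9, 5.4, 4.2, 7.4, 8.8, 6.7, 8.8, 5.1, 9.4, 5.3]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(x, y)
# Test statistic for H0: rho = 0, referred to a t distribution with n - 2 df.
t = r * sqrt(len(x) - 2) / sqrt(1 - r * r)

print(round(r, 3))  # 0.674
```

The ordinary p-value obtained from `t` relies on independent observations, which is the assumption questioned for population frequency data.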

Randomization Test

Population  HLA-B (1)  HLA-B (2)  B-grp (1)  B-grp (2)  HLA-C (1)  HLA-C (2)  C-grp (1)  C-grp (2)
Biaka       0702       1503       Bw6        Bw6        0202       0702       C2         C1
Biaka       0702       4403       Bw6        Bw4        0401       0702       C2         C1
Biaka       1302       3701       Bw4        Bw4        0202       0602       C2         C2
Biaka       4901       5301       Bw4        Bw4        0401       0701       C2         C1
Biaka       3701       3910       Bw4        Bw6        0202       1203       C2         C1
…           …          …          …          …          …          …          …          …

• Bw4 alleles: 1301, 1302, 1516, 1517, 2702, 2703, 2704, 2705, 3701, 3801, 3802, 4402, 4403, 4404, 4405, ...
• Bw6 alleles: 0702, 0705, 0799, 0801, 1401, 1402, 1403, 1501, 1502, 1503, 1504, 1506, 1507, 1508, 1510, ...
• Reassign Bw4/Bw6 status to simulate the null hypothesis
• Compute correlation of frequencies for KIR3DS1 & reassigned HLA
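The reassignment idea can be sketched as follows. Several things here are assumptions rather than the study's procedure: Bw4 status is reshuffled among alleles while keeping the number of Bw4 alleles fixed, the p-value is two-sided, and the five populations with their allele and KIR3DS1 frequencies are invented toy values. The function names are hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def group_frequency(allele_freqs, group):
    """Summed frequency of the alleles that belong to `group`."""
    return sum(f for allele, f in allele_freqs.items() if allele in group)


def randomization_test(pop_allele_freqs, kir_freqs, bw4_alleles, n_perm=2000, seed=0):
    """Return (observed r, two-sided permutation p-value)."""
    rng = random.Random(seed)
    alleles = sorted({a for pop in pop_allele_freqs for a in pop})
    n_bw4 = len(set(alleles) & set(bw4_alleles))

    def corr(group):
        return pearson_r(kir_freqs, [group_frequency(p, group) for p in pop_allele_freqs])

    r_obs = corr(set(bw4_alleles))
    hits = 0
    for _ in range(n_perm):
        relabelled = set(rng.sample(alleles, n_bw4))  # random Bw4 assignment
        if abs(corr(relabelled)) >= abs(r_obs) - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy data (invented): HLA-B allele frequencies in five populations.
populations = [
    {"2705": 0.10, "4402": 0.30, "0702": 0.40, "1501": 0.20},
    {"2705": 0.05, "4402": 0.25, "0702": 0.30, "1501": 0.40},
    {"2705": 0.20, "4402": 0.30, "0702": 0.25, "1501": 0.25},
    {"2705": 0.15, "4402": 0.45, "0702": 0.20, "1501": 0.20},
    {"2705": 0.02, "4402": 0.18, "0702": 0.50, "1501": 0.30},
]
kir3ds1 = [0.30, 0.40, 0.20, 0.10, 0.50]  # invented KIR3DS1 frequencies

r_obs, p_perm = randomization_test(populations, kir3ds1, {"2705", "4402"})
print(round(r_obs, 3), round(p_perm, 3))
```

Because the population frequencies themselves are left untouched, the shared demographic history among populations is retained in every relabelled data set.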

Permutation Distribution

[Figure: density of the permutation distribution of correlations (x-axis: correlation; y-axis: density), with the observed value marked.]

KIR3DS1 – HLA-Bw4 correlation: r = -0.632, permutation p-value = 0.012

Genomic Controls

• Empirical comparisons based on genomic data or other methods that incorporate information about the demographic histories of populations (Pritchard and Donnelly, 2001).
– Our study used data from the ALFRED database (http://alfred.med.yale.edu) to assess statistical significance
– We selected 538 neutral sites from 202 genes typed in the same individuals

Genomic Data for Empirical Tests

• Randomly select two SNP sites from different chromosomes
• Find the frequencies in each population and compute the correlation
• Repeat

[Figure: scatter plot of population frequencies at SNP site 1 (x-axis) against SNP site 2 (y-axis).]

Genomic Data – Empirical Distribution

[Figure: empirical distribution for correlations among unlinked SNPs (x-axis: correlation; y-axis: density), with the observed value marked.]

KIR3DS1 – HLA-Bw4 correlation: r = -0.632, empirical p-value = 0.041
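An empirical p-value of this kind can be sketched by building a null distribution from pairs of SNPs on different chromosomes. The data below are random placeholders, not ALFRED data; the function name, the two-sided comparison, and the number of sampled pairs are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def empirical_p_value(r_obs, snp_freqs, snp_chrom, n_pairs=2000, seed=0):
    """Share of unlinked SNP pairs whose |r| is at least |r_obs|.

    snp_freqs: {snp_id: [frequency in each population]}
    snp_chrom: {snp_id: chromosome label}
    """
    if len(set(snp_chrom.values())) < 2:
        raise ValueError("need SNPs on at least two chromosomes")
    rng = random.Random(seed)
    ids = sorted(snp_freqs)
    null = []
    while len(null) < n_pairs:
        a, b = rng.sample(ids, 2)
        if snp_chrom[a] == snp_chrom[b]:
            continue  # keep only pairs from different chromosomes
        null.append(pearson_r(snp_freqs[a], snp_freqs[b]))
    return sum(1 for r in null if abs(r) >= abs(r_obs)) / len(null)


# Placeholder data: 40 SNPs with random frequencies in 30 populations.
_rng = random.Random(1)
snp_freqs = {f"snp{i}": [_rng.random() for _ in range(30)] for i in range(40)}
snp_chrom = {f"snp{i}": 1 + i % 22 for i in range(40)}

print(empirical_p_value(-0.632, snp_freqs, snp_chrom))
```

With real population data the sampled SNP pairs share the demographic history of the populations, so the resulting null distribution is wider than the one implied by independent observations.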

Significance of Correlations *

Locus pair       Correlation   p-value (1)     p-value (2)     p-value (3)
                               (correlation)   (permutation)   (empirical)
3DS1 - Bw4       -0.632        0.000           0.012           0.041
3DS1 - Bw4-80I   -0.657        0.000           0.009           0.038
3DS1 - Bw4-80T   -0.190        0.316           0.532           0.534
3DL1 - Bw4        0.426        0.019           0.106           0.218
3DL1 - Bw4-80I    0.416        0.022           0.115           0.191
3DL1 - Bw4-80T    0.171        0.367           0.540           0.758
2DS1 - C2        -0.478        0.008           0.243           0.149
2DL1 - C2         0.046        0.810           0.891           0.924
2DL2 - C1        -0.366        0.047           0.193           0.542
2DL3 - C1         0.184        0.331           0.458           0.328
2DS2 - C1        -0.371        0.044           0.170           0.479

(1) P-correlation is the ordinary Pearson product-moment correlation p-value.
(2) P-permutation is based on the permutation distribution under the null hypothesis.
(3) P-empirical is based on the empirical distribution for unlinked SNPs from ALFRED.

* Ordinary Pearson p-values (shown in red on the original slide) overestimate the significance of trends


Acknowledgements

NCI: Mary Carrington, Pat Martin, Gao Xiaojiang
USP: Diogo Meyer, Rodrigo dos Santos Francisco
Yale University: Ken and Judy Kidd
Children's Hospital Oakland Research Inst.: Steven J. Mack, Jill A. Hollenbach
Harvard Medical School: Alex Lancaster
UC San Francisco: Owen Solberg
Roche Molecular Systems: Henry A. Erlich
Anthony Nolan Research Inst.: Steven G.E. Marsh
NCBI/NIH: Mike Feolo
NGIT: Jeff Wiser, Patrick Dunn, Tom Smith

If time allows …

Linkage Disequilibrium (LD) Measures

The two most common measures of the strength of LD are:
(1) the normalized measure of the individual LD values, namely $D'_{ij} = D_{ij} / D_{max}$ (Lewontin 1964); and
(2) the correlation coefficient r for bi-allelic data, which is most often reported as $r^2 = D^2 / (p_{A1} p_{A2} p_{B1} p_{B2})$.

r = 1 only when the allelic variations at the two loci show 100% correlation.

Their multi-allelic extensions are:

$$ D' = \sum_{i=1}^{I} \sum_{j=1}^{J} p_i \, q_j \, |D'_{ij}| $$

$$ W_n = \left[ \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} D_{ij}^2 / (p_i q_j)}{\min(I-1,\, J-1)} \right]^{1/2} = \left[ \frac{X^2_{LD} / 2N}{\min(I-1,\, J-1)} \right]^{1/2} $$
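Both multi-allelic measures can be computed from a table of haplotype frequencies. The sketch below assumes $D_{ij} = f_{ij} - p_i q_j$ and the usual Lewontin bounds for $D_{max}$, neither of which is spelled out on the slide; the function name `ld_measures` and the example table are illustrative.

```python
from math import sqrt


def ld_measures(f):
    """Return (overall D', Wn) for haplotype frequencies f[i][j].

    Rows index alleles at the first locus, columns alleles at the second.
    """
    p = [sum(row) for row in f]        # allele frequencies, first locus
    q = [sum(col) for col in zip(*f)]  # allele frequencies, second locus
    d_prime = 0.0
    chi = 0.0
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            if p_i == 0 or q_j == 0:
                continue
            d = f[i][j] - p_i * q_j
            # Lewontin's bound on |D_ij| depends on the sign of D_ij.
            if d < 0:
                d_max = min(p_i * q_j, (1 - p_i) * (1 - q_j))
            else:
                d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j)
            if d_max > 0:
                d_prime += p_i * q_j * abs(d) / d_max
            chi += d * d / (p_i * q_j)
    w_n = sqrt(chi / min(len(p) - 1, len(q) - 1))
    return d_prime, w_n


# Bi-allelic example: D = 0.15 and r^2 = 0.36, so Wn should equal |r| = 0.6.
haplotypes = [[0.4, 0.1],
              [0.1, 0.4]]
d_prime, w_n = ld_measures(haplotypes)
print(round(d_prime, 3), round(w_n, 3))  # 0.6 0.6
```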

Standard LD measures D’ and Wn

Standard LD measures (overall D’ & Wn) assume/force symmetry, even though with >2 alleles per locus that is not the case.

Asymmetric Linkage Disequilibrium (ALD)

[Figure: ALD; row gene conditional on column gene.]

Data Source: ImmPort Study #SDY26: Identifying polymorphisms associated with risk for the development of myopericarditis following smallpox vaccine

Interpretation:
• ALD for HLA-DRB1 conditioning on HLA-DQA1: $W_{DRB1/DQA1}$ = .58
  The overall variation for DRB1 is relatively high given specific DQA1 alleles.
• ALD for HLA-DQA1 conditioning on HLA-DRB1: $W_{DQA1/DRB1}$ = .95
  The overall variation for DQA1 is relatively low given specific DRB1 alleles.

Thomson and Single, 2014 Genetics

Balancing Selection Operates at Most HLA Loci

• Balancing selection can result from:
- Overdominance/Heterozygote advantage
- Frequency-dependent selection
- Selective regimes that change over time/space
• For HLA, the common factor in these models is rare allele advantage, which is consistent with a pathogen-directed frequency-dependent selection model.
• At the Amino Acid (AA) level we see
- High AA variability at antigen recognition sites (ARS)
- Relatively even AA frequencies at ARS
- Higher rates of non-synonymous vs. synonymous changes at ARS

Meyer & Mack, 2008

Homozygosity (F) and the Normalized Deviate (Fnd)

$$ F = \sum_{i=1}^{k} p_i^2 $$

$$ F_{nd} = (F_{OBS} - F_{EQ}) / SD(F_{EQ}) $$

• Neutrality: FOBS ≈ FEQ, Fnd ≈ 0
• Directional Selection: FOBS > FEQ, Fnd > 0
• Balancing Selection: FOBS < FEQ, Fnd < 0

[Figure: three bar charts of allele frequency by allele, one for each of the three cases.]
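The two formulas translate directly into code. In the sketch below the allele frequencies and the neutral expectation (`f_eq`, `sd_eq`) are invented numbers used only to show the sign of Fnd; in practice the expected homozygosity and its standard deviation come from a neutral model for the observed sample size and number of alleles, which is not reproduced here.

```python
def homozygosity(freqs):
    """Expected homozygosity F: the sum of squared allele frequencies."""
    return sum(p * p for p in freqs)


def normalized_deviate(f_obs, f_eq, sd_eq):
    """Fnd = (F_OBS - F_EQ) / SD(F_EQ)."""
    return (f_obs - f_eq) / sd_eq


even = [0.1] * 10                    # even frequencies
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]   # one predominant allele

f_eq, sd_eq = 0.25, 0.05             # hypothetical neutral expectation and SD

fnd_even = normalized_deviate(homozygosity(even), f_eq, sd_eq)
fnd_skewed = normalized_deviate(homozygosity(skewed), f_eq, sd_eq)
print(round(fnd_even, 2), round(fnd_skewed, 2))  # -3.0 3.0
```

Even frequencies give a lower F than expected and a negative Fnd, the direction associated with balancing selection; a single predominant allele gives the opposite sign.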

Fnd for DRB1 AA sites in a EUR population

• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.

LD for DRB1 AAs

[Figure: two panels, Asymmetric LD (ALD; row gene conditional on column gene) and Wn (symmetric).]

Fnd for DRB1 AA sites (Meta-Analysis)

Fnd for all polymorphic sites in a meta-analysis of 57 populations

• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.


Asymmetric Linkage Disequilibrium (ALD)

Table 1. Linkage disequilibrium and genetic diversity measures

Description: Definition of measure
1. Single locus homozygosity (F): $F_A = \sum_i p_{Ai}^2$
2. Haplotype specific homozygosity (HSF): $F_{A/B_j} = \sum_i (f_{ij} / p_{Bj})^2$
3. Overall weighted HSF values, $F_{A/B}$ (and $F_{B/A}$): $F_{A/B} = \sum_j (F_{A/B_j})(p_{Bj}) = F_A + \sum_i \sum_j D_{ij}^2 / p_{Bj}$
4. Multi-allelic ALD squared, $W_{A/B}$ (and $W_{B/A}$): $W_{A/B}^2 = (F_{A/B} - F_A) / (1 - F_A)$

If both loci are bi-allelic:
$W_{A/B}^2 = [\sum_i \sum_j (D_{ij}^2 / p_{Bj})] / (1 - F_A) = D^2 / (p_{A1} p_{A2} p_{B1} p_{B2}) = r^2$, since $D_{11} = -D_{12} = -D_{21} = D_{22} = D$

Thomson and Single (2014) Genetics
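The Table 1 definitions can be applied to a small haplotype frequency table to see the asymmetry. This sketch follows the formulas as written above; the function name `ald` and the 3 x 2 example table are invented, and the code assumes both loci are polymorphic so that 1 - F is nonzero.

```python
from math import sqrt


def ald(f):
    """Return (W_A/B, W_B/A) for haplotype frequencies f[i][j].

    Rows index alleles at locus A, columns alleles at locus B.
    """
    p_a = [sum(row) for row in f]
    p_b = [sum(col) for col in zip(*f)]
    f_a = sum(p * p for p in p_a)  # single locus homozygosity, locus A
    f_b = sum(p * p for p in p_b)  # single locus homozygosity, locus B
    # Overall weighted haplotype specific homozygosity:
    # F_A/B = sum_j p_Bj * sum_i (f_ij / p_Bj)^2 = sum_ij f_ij^2 / p_Bj
    f_a_given_b = sum(f[i][j] ** 2 / p_b[j]
                      for i in range(len(p_a)) for j in range(len(p_b)) if p_b[j] > 0)
    f_b_given_a = sum(f[i][j] ** 2 / p_a[i]
                      for i in range(len(p_a)) for j in range(len(p_b)) if p_a[i] > 0)
    # max(0.0, ...) guards against tiny negative values from rounding.
    w_ab = sqrt(max(0.0, (f_a_given_b - f_a) / (1 - f_a)))
    w_ba = sqrt(max(0.0, (f_b_given_a - f_b) / (1 - f_b)))
    return w_ab, w_ba


# Invented example: each A allele occurs with only one B allele,
# but B allele 1 occurs with two different A alleles.
table = [[0.3, 0.0],
         [0.2, 0.0],
         [0.0, 0.5]]
w_ab, w_ba = ald(table)
print(round(w_ab, 3), round(w_ba, 3))  # 0.783 1.0

# Bi-allelic check: both values equal |r| (here 0.6).
w_ab2, w_ba2 = ald([[0.4, 0.1], [0.1, 0.4]])
```

In the example, knowing the A allele fixes the B allele ($W_{B/A} = 1$), whereas knowing the B allele leaves some variation at A ($W_{A/B} < 1$), which is the kind of asymmetry a single symmetric measure cannot show.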