Data Standards and Statistical Issues
for Immunogenetic Data
Richard M. Single
Associate Professor of Statistics
Department of Mathematics & Statistics
University of Vermont
• HLA nomenclature: Why it matters for analysis and interpretation
– Challenges for combining HLA data from different sources
• Data Standardization to facilitate meta-analyses and reproducibility
– Developing a community standard for HLA & KIR data reporting
• Overview of HLA data curation & ambiguity resolution
– Example, ImmPort, Next steps: GL strings & QR codes
• HLA (chrom 6) and KIR (chrom 19) interactions
– A brief overview
• HLA and KIR: population-level evidence of co-evolution
– Population-genetic evidence of co-evolution
– Randomization tests and genomic controls
Outline
HLA Nomenclature and why it matters
• Challenges for HLA data management and analysis
– The HLA genes are very polymorphic;
– HLA nomenclature is complicated;
– There are multiple ways to generate HLA data;
– All common typing systems generate ambiguous data;
– There are multiple ways to report alleles and ambiguities;
These issues make meta-analyses of HLA data from
different sources very difficult.
[Figure: HLA class I and HLA class II molecules, each presenting a peptide fragment to a T-cell receptor (TCR). Legend: TCR = T-cell receptor; β2-m = β2-microglobulin.]
Structure of HLA molecules
• HLA molecules are cell-surface proteins that present peptide fragments to T-cells
• They bind specific sets of peptides based on structure
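[Figure: map of the classical HLA loci. Class II loci: DP, DQ, DR (genes B1, A1, B1, A1, B1, A); class I loci: B, C, A. Distances shown: 50 kb, 850 kb, 100 kb, 1270 kb, 400 kb, 250 kb. Protein-level allele numbers shown: 1612, 2211, 1280, 2, 980, 31, 216, 19, 153.]
HLA classical loci and polymorphism
Protein-level allele numbers from IMGT/HLA Database Release 3.12.0, April 17, 2013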
Example allele name: HLA-A*24:02:01:02L
• Locus: HLA-A
• Field 1 (“2-digit”): serological level (where possible)
• Field 2 (“4-digit”): peptide level (amino acid difference)
• Field 3 (“6-digit”): nucleotide level (silent, i.e., synonymous, substitutions)
• Field 4 (“8-digit”): intron level (3’ or 5’ polymorphism)
• Expression suffix: N = null, L = low, S = soluble, …
• For most analyses, we want to distinguish among unique peptide sequences,
i.e., the 2-field (“4-digit”) level
• This level of resolution treats alleles with the same peptide sequence for
exons 2 & 3 (class I) or exon 2 (class II) as being equivalent [“binning” alleles]
HLA Allele Nomenclature
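A minimal sketch of reducing allele names to the 2-field level, assuming colon-delimited names like the example above. The names `ALLELE_RE`, `parse_allele`, and `to_two_field` are illustrative and do not come from any HLA library; the suffix set covers only the expression codes listed above.

```python
import re

# Locus, one to four colon-delimited fields, optional expression suffix.
# Only the suffixes listed on the slide (N, L, S) are recognized here.
ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+(?::\d+){0,3})([NLS]?)$")

def parse_allele(name):
    """Split an allele name into (locus, list of fields, expression suffix)."""
    match = ALLELE_RE.match(name.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    locus, fields, suffix = match.groups()
    return locus, fields.split(":"), suffix

def to_two_field(name):
    """Truncate to the first two fields (peptide level).

    The expression suffix is dropped, so null alleles need separate
    handling if they must stay distinguishable.
    """
    locus, fields, _suffix = parse_allele(name)
    return f"{locus}*{':'.join(fields[:2])}"

print(parse_allele("HLA-A*24:02:01:02L"))
print(to_two_field("HLA-A*24:02:01:02L"))
```

Alleles that map to the same 2-field name after this step are treated as equivalent in downstream analyses, which is the "binning" described above.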
• HLA alleles are defined by a “patchwork” of sequence-level polymorphisms.
• Most typing systems do not interrogate the same set of polymorphisms
- e.g., DRB1*14:01:01 vs. *14:54 differ only in exon 3
• There is currently no simple way to identify which alleles could (or could not)
have been detected by a given typing system.
HLA Nomenclature & Polymorphism
Data Standardization to facilitate Meta-analyses
Data standardization methods …
• Document the typing method (SSOP, SSP, SBT, …), version, exons interrogated,
and the set of detectable alleles
• Perform data validation by checking against IMGT & IPD-KIR allele lists
These steps
– allow re-evaluation of raw data in future contexts
– allow information/results to be combined across datasets more easily
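The two steps above can be sketched as a small record plus a lookup. This is only an illustration: `TypingRecord`, `unknown_alleles`, the field names, and the tiny `reference_alleles` set are hypothetical stand-ins, not an ImmPort or IMGT interface.

```python
from dataclasses import dataclass, field

@dataclass
class TypingRecord:
    """Typing metadata kept alongside the reported alleles (hypothetical layout)."""
    method: str                 # e.g. "SSOP", "SSP", "SBT"
    version: str                # kit or protocol version
    exons: tuple                # exons interrogated
    alleles: list = field(default_factory=list)

def unknown_alleles(record, reference):
    """Return reported alleles that are absent from the reference allele list."""
    return [a for a in record.alleles if a not in reference]

# Hypothetical stand-in for an allele list exported from IMGT/HLA.
reference_alleles = {"A*02:01", "A*02:09", "B*27:05", "B*44:02"}

record = TypingRecord(method="SBT", version="1.0", exons=(2, 3),
                      alleles=["A*02:01", "B*27:05", "B*99:99"])
unknown = unknown_alleles(record, reference_alleles)
print(unknown)
```

Keeping the method, version, and exons with each record is what makes it possible to re-evaluate the same raw data against a later allele list.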
Extending STREGA to Immunogenomic Studies
• The STrengthening the REporting of Genetic Association studies
(STREGA) statement provides community-based data reporting and
analysis standards for genomic disease association studies
• The IDAWG (immunogenomics.org) has proposed an extension of
STREGA: STrengthening the REporting of Immunogenomic Studies
(STREIS)
From STREGA to STREIS
Extensions to the STREGA guidelines for immunogenomic data include:
• Describing the system(s) used to store, manage, and validate genotype
and allele data
• Documenting all methods applied to resolve ambiguity
• Defining any codes used to represent ambiguities
• Describing any binning or combining of alleles into common categories
• Avoiding the use of subjective terms (e.g., “high-resolution typing”) whose
meaning may change over time
Allele-level Ambiguity
Ambiguous allele set: A*0201/0209/0243N/0266/0275/0283N/0289
• Group codes (“g”-codes) for alleles identical in exons 2 & 3 for class I, or exon 2 for class II:
– A*0201/0209/0243N/0266/0275/0283N/0289 = “A020101g”
• NMDP ambiguity codes for 4-digit non-null alleles:
– A*0201/0209 = A*02AF
– A*0201/0209/0266 = A*02AJEY
– A*0201/0209/0266/0275/0289 = A*02BSFJ
Ambiguous alleles result from polymorphisms outside of the assessed regions:
• outside of exons 2 & 3, or
• in sections of those exons that were not interrogated.
Genotype-level Ambiguity
Ambiguous genotypes result from an inability to establish the phase of individual polymorphisms
or entire exons.
Different combinations of alleles can lead to the same typing result.
Example: a typing result for one individual that could be explained by any of four different
possible genotypes at HLA-B.
Genotype     Allele 1   Allele 2
Genotype 1   B*2705     B*4402
Genotype 2   B*2705     B*4411
Genotype 3   B*2709     B*4402
Genotype 4   B*2709     B*4411
Most analytical methods require a single genotype call for each individual sample.
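To illustrate how the four genotypes arise, here is a minimal sketch that enumerates every allele pair consistent with one list of candidates per gene copy. That one-list-per-copy summary is an assumption; real typing results may constrain the combinations further. The function name `possible_genotypes` is illustrative.

```python
from itertools import product

def possible_genotypes(candidates_1, candidates_2):
    """Enumerate unordered allele pairs consistent with two candidate lists."""
    seen = set()
    genotypes = []
    for a, b in product(candidates_1, candidates_2):
        key = tuple(sorted((a, b)))   # B*2705+B*4402 equals B*4402+B*2705
        if key not in seen:
            seen.add(key)
            genotypes.append(key)
    return genotypes

genotypes = possible_genotypes(["B*2705", "B*2709"], ["B*4402", "B*4411"])
for allele_1, allele_2 in genotypes:
    print(f"{allele_1} + {allele_2}")
```

An analysis that needs a single call per sample has to choose among these pairs, which is the purpose of the ambiguity-reduction steps.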
Standardized Ambiguity Reduction
Sample #001: possible genotypes before reduction
• Genotype 1
– HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
– HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
• Genotype 2
– HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
– HLA-B allele 2: 440202, 4411
• Genotype 3
– HLA-B allele 1: 2709
– HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
• Genotype 4
– HLA-B allele 1: 2709
– HLA-B allele 2: 440202, 4411
Reduction steps
• Peptide-level filtering
• Remove non-CWD (common and well-documented) alleles
• Bin alleles identical over exons 2 & 3
Remaining calls: 2703, 2705 (allele 1); 4402 (allele 2)
• Regional population-level frequency data
Unambiguous data
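A rough sketch of the peptide-level filtering and CWD steps, applied to the Genotype 1 allele lists for Sample #001. The function names are illustrative, and `CWD` is a hypothetical set chosen only so that the output mirrors the calls shown on the slide; the real CWD catalogue is not reproduced here. Keeping the `N` suffix on null alleles is also an assumption.

```python
import re

def to_peptide_level(allele):
    """Truncate an old-style digit name (e.g. '270502') to its first four digits.

    Null alleles keep their 'N' suffix so they are not merged with an
    expressed allele; other expression suffixes are dropped (an assumption).
    """
    match = re.fullmatch(r"(\d{4})\d*([A-Z]?)", allele)
    if match is None:
        raise ValueError(f"unrecognized allele name: {allele!r}")
    four_digit, suffix = match.groups()
    return four_digit + ("N" if suffix == "N" else "")

def reduce_alleles(candidates, cwd):
    """Collapse candidates to the peptide level, then keep only CWD alleles."""
    collapsed = {to_peptide_level(a) for a in candidates}
    return sorted(a for a in collapsed if a in cwd)

# Hypothetical CWD set, for illustration only.
CWD = {"2703", "2705", "4402"}

allele_1 = ["2703", "270502", "270503", "270504", "270505", "270506",
            "270508", "2710", "2713", "2717"]
allele_2 = ["44020101", "44020102S", "440203", "4419N", "4423N", "4424",
            "4427", "4433"]

print(reduce_alleles(allele_1, CWD))
print(reduce_alleles(allele_2, CWD))
```

Any ambiguity left after this step (here, 2703 versus 2705) is what the regional population-level frequency data are used to resolve.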
Genotype List (GL) Strings
• Use a hierarchical set of operators to describe the relationships between
– alleles, lists of possible alleles, phased alleles, genotypes, lists of
possible genotypes, and multilocus unphased genotypes,
– without losing typing information or increasing typing ambiguity.
• Are proposed to replace NMDP codes
Milius et al. (2013) Tissue Antigens
Genotype List (GL) Strings
• Example GL string
for the genotype: A*02:69 + A*23:30 or
A*02:302 + A*23:26 or
A*02:302 + A*23:39
and B*44:02 + B*49:08
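As a sketch of how the operators nest, the genotype above can be assembled with the GL string delimiters described by Milius et al. (2013): `/` for a list of possible alleles, `+` for the two gene copies of a genotype, `|` for a list of possible genotypes, and `^` for unphased loci. The helper names are illustrative, and the nested-list input layout is an assumption made for this example.

```python
def allele_list(alleles):
    """Possible alleles for one gene copy, joined with '/'."""
    return "/".join(alleles)

def genotype(copies):
    """Gene copies of one locus, joined with '+'."""
    return "+".join(allele_list(c) for c in copies)

def genotype_list(genotypes):
    """Alternative genotypes at one locus, joined with '|'."""
    return "|".join(genotype(g) for g in genotypes)

def multilocus(loci):
    """Unphased genotypes at several loci, joined with '^'."""
    return "^".join(genotype_list(locus) for locus in loci)

hla_a = [
    [["HLA-A*02:69"], ["HLA-A*23:30"]],
    [["HLA-A*02:302"], ["HLA-A*23:26", "HLA-A*23:39"]],
]
hla_b = [
    [["HLA-B*44:02"], ["HLA-B*49:08"]],
]

gl_string = multilocus([hla_a, hla_b])
print(gl_string)
```

The two alternatives that share A*02:302 collapse into a single genotype with an allele list, so no typing information is lost and no ambiguity is added.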
• Immunology Database and Analysis Portal (www.ImmPort.org)
Developed under the Bioinformatics Integration Support Contract (BISC) for
NIH, NIAID, & DAIT (Division of Allergy, Immunology, and Transplantation)
– Data validation pipeline
– Analysis tools
– Standardized ambiguity reduction tools
– Data from a large number of immunogenomic studies
• ImmunoGenomics Data Analysis Working Group (www.immunogenomics.org)
(www.IgDAWG.org)
An international collaborative group working to …
– facilitate the sharing of immunogenomic data (HLA, KIR, etc.) and
– foster consistent analysis and interpretation of immunogenomic data
Resources for HLA Data Validation & Analysis
• The KIR gene complex is located on Chromosome 19 (19q13.4)
• KIR are expressed on natural killer (NK) cells and a subset of T cells
• Certain HLA alleles serve as ligands for KIR
KIR Gene   Function     Ligand
2DL1       Inhibitory   HLA-C group 2
2DS1       Activating   HLA-C group 2
2DL2/3     Inhibitory   HLA-C group 1
2DS2       Activating   HLA-C group 1
3DL1       Inhibitory   HLA-Bw4
3DS1       Activating   HLA-Bw4
Killer cell Immunoglobulin-like Receptor (KIR)
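[Figure: two panels. Dominant inhibition: an NK cell meets a normal cell; the inhibitory KIR (iKIR) binds HLA while the activating receptor (Act. rec.) binds its ligand; outcome: no lysis (protection). Missing-self recognition: an NK cell meets HIV+ targets; the iKIR is shown without its HLA ligand while the activating receptor binds its ligand; outcome: lysis and cytokines.]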
KIR regulate NK cell activity
HLA-C alleles can be divided into two groups
based on the amino acid at position 80 (& 77),
which determines KIR recognition
• HLA-C1 (Ser77, Asn80): Cw1, Cw3, Cw7, Cw8, Cw12, Cw13, Cw14
– recognized by KIR2DL3/2DL2 (NK cell inhibition)
• HLA-C2 (Asn77, Lys80): Cw2, Cw4, Cw5, Cw6, Cw15, Cw17
– recognized by KIR2DL1 (NK cell inhibition)
Bifurcation of HLA-B allotypes
• Bw4 (40%): ligands for KIR3DL1 and KIR3DS1; subdivided by amino acid position 80 into 80I and 80T
• Bw6 (60%): not a ligand for KIR
• Several studies have hypothesized selection for KIR that suit the locale-
specific HLA repertoire.
– Population-level data suggest a balanced relationship between activating
receptors and their ligands across populations.
• Disease association studies point to HLA-Bw4 alleles with Isoleucine at
position 80 (“Bw4-80I”) as the strongest ligand for KIR3DS1
– Population-level data show the strongest relationship between KIR3DS1 and
Bw4-80I frequencies.
Population-level evidence for Co-evolution
& Natural Selection for KIR and HLA
Correlations between frequencies for KIR and HLA Ligands
Inhibitory KIR
• KIR2DL3 vs. HLA-C group 1: r = 0.184
• KIR3DL1 vs. HLA-Bw4: r = 0.426
• KIR2DL1 vs. HLA-C group 2: r = 0.046
Activating KIR
• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR2DS1 vs. HLA-C group 2: r = -0.478
• KIR2DS2 vs. HLA-C group 1: r = -0.371
Activating KIR3DS1: subsets of Bw4 alleles based on amino acid position 80
• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR3DS1 vs. HLA-Bw4-80I: r = -0.657
• KIR3DS1 vs. HLA-Bw4-80T: r = -0.190
Single et al., Nature Genetics
• Challenges for these and other population studies
– Demographic history shapes patterns of variation & can mimic the
effects of selection.
– Gene frequencies are not statistically independent among populations,
due to shared demographic history.
• Ordinary Pearson correlation p-values assume independence among the
observations.
• We constructed a randomization test to account for the demographic
histories of the populations and focus on the genetic effect.
Statistical Issues
Hypothesis Test for a Correlation Coefficient
Assessing the significance of ρ = cor(X,Y)
• Null Hypothesis: H0: ρ = 0
• Statistic: Pearson’s correlation coefficient
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$
Example data, with r_observed = 0.674:
X     Y
4.1   4.9
8.6   5.4
2.3   4.2
5.4   7.4
9.2   8.8
7.7   6.7
6.4   8.8
4.3   5.1
7.6   9.4
3.4   5.3
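The observed value can be reproduced directly from the ten (X, Y) pairs on this slide. The sketch below implements Pearson's r from its definition in plain Python. It also computes the usual t statistic for H0: ρ = 0 (n − 2 degrees of freedom), which is the standard textbook test and is not shown on the slide.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

X = [4.1, 8.6, 2.3, 5.4, 9.2, 7.7, 6.4, 4.3, 7.6, 3.4]
Y = [4.9, 5.4, 4.2, 7.4, 8.8, 6.7, 8.8, 5.1, 9.4, 5.3]

r = pearson_r(X, Y)
print(round(r, 3))  # 0.674

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
n = len(X)
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(t)
```

The p-value from this t test assumes independent observations, which is the assumption that fails for population frequency data.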
Randomization Test
Population  HLA-B (1)  HLA-B (2)  B-grp (1)  B-grp (2)  HLA-C (1)  HLA-C (2)  C-grp (1)  C-grp (2)
Biaka       0702       1503       Bw6        Bw6        0202       0702       C2         C1
Biaka       0702       4403       Bw6        Bw4        0401       0702       C2         C1
Biaka       1302       3701       Bw4        Bw4        0202       0602       C2         C2
Biaka       4901       5301       Bw4        Bw4        0401       0701       C2         C1
Biaka       3701       3910       Bw4        Bw6        0202       1203       C2         C1
…           …          …          …          …          …          …          …          …
• Bw4 alleles: 1301, 1302, 1516, 1517, 2702, 2703, 2704, 2705, 3701,
3801, 3802, 4402, 4403, 4404, 4405, ...
• Bw6 alleles: 0702, 0705, 0799, 0801, 1401, 1402, 1403, 1501, 1502,
1503, 1504, 1506, 1507, 1508, 1510, ...
• Reassign Bw4/Bw6 status to simulate the null hypothesis
• Compute correlation of frequencies for KIR-3DS1 & reassigned HLA
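The reassignment procedure in the last two bullets can be sketched as follows. Everything numeric here is hypothetical toy data (population names, allele frequencies, KIR3DS1 frequencies). The sketch also assumes that the number of Bw4 alleles is held fixed when labels are shuffled and that the p-value is two-sided; the slides do not specify either choice.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return sxy / denom if denom > 0 else 0.0  # guard against constant input

def bw4_frequency(allele_freqs, status):
    """Total frequency of alleles labelled Bw4 in one population."""
    return sum(f for allele, f in allele_freqs.items() if status[allele] == "Bw4")

def randomization_test(kir_freqs, hla_freqs, status, n_perm=2000, seed=1):
    rng = random.Random(seed)
    pops = sorted(kir_freqs)
    kir = [kir_freqs[p] for p in pops]

    def corr(labels):
        return pearson_r(kir, [bw4_frequency(hla_freqs[p], labels) for p in pops])

    r_obs = corr(status)
    alleles = sorted(status)
    labels = [status[a] for a in alleles]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)               # reassign Bw4/Bw6 status to alleles
        if abs(corr(dict(zip(alleles, labels)))) >= abs(r_obs):
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: Bw4/Bw6 status follows the lists above,
# but all frequencies are made up for illustration.
status = {"2705": "Bw4", "4402": "Bw4", "5301": "Bw4",
          "0702": "Bw6", "0801": "Bw6", "1503": "Bw6"}
alleles = ["2705", "4402", "5301", "0702", "0801", "1503"]
hla_freqs = {
    "pop1": dict(zip(alleles, [0.05, 0.10, 0.05, 0.30, 0.30, 0.20])),
    "pop2": dict(zip(alleles, [0.10, 0.15, 0.05, 0.25, 0.25, 0.20])),
    "pop3": dict(zip(alleles, [0.10, 0.20, 0.10, 0.20, 0.25, 0.15])),
    "pop4": dict(zip(alleles, [0.15, 0.25, 0.10, 0.20, 0.15, 0.15])),
    "pop5": dict(zip(alleles, [0.20, 0.25, 0.15, 0.15, 0.15, 0.10])),
}
kir3ds1 = {"pop1": 0.40, "pop2": 0.30, "pop3": 0.25, "pop4": 0.15, "pop5": 0.10}

r_obs, p_perm = randomization_test(kir3ds1, hla_freqs, status)
print(r_obs, p_perm)
```

Because only the allele labels are shuffled, each permuted data set keeps the same populations and allele frequencies, so the shared demographic history is preserved under the null.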
Permutation Distribution
[Figure: permutation distribution of the correlation (x-axis: correlation; y-axis: density), with the observed KIR3DS1 – HLA-Bw4 correlation, r = -0.632, marked. Permutation p-value = 0.012.]
• Empirical comparisons based on genomic data or other methods that
incorporate information about the demographic histories of populations
(Pritchard and Donnelly, 2001).
– Our study used data from the ALFRED database to assess statistical significance
http://alfred.med.yale.edu
– We selected 538 neutral sites from 202 genes typed in the same individuals
Genomic Controls
• Randomly select two SNP sites from different chromosomes
• Find the frequencies in each population and compute the correlation
• Repeat
Genomic Data for Empirical Tests
[Figure: scatter plot of population frequencies for two SNP sites (x-axis: SNP site 1; y-axis: SNP site 2).]
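A sketch of the genomic-control procedure: draw pairs of SNPs from different chromosomes, correlate their population frequencies, and compare the observed correlation with the resulting empirical distribution. The SNP frequencies below are randomly generated placeholders, not ALFRED data, and the number of populations (30) is an arbitrary choice for the example. The two-sided comparison is also an assumption.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return sxy / denom if denom > 0 else 0.0

def empirical_p_value(r_obs, snps, n_pairs=5000, seed=1):
    """snps: list of (chromosome, [frequency in each population])."""
    if len({chrom for chrom, _ in snps}) < 2:
        raise ValueError("need SNPs on at least two chromosomes")
    rng = random.Random(seed)
    null = []
    while len(null) < n_pairs:
        (chrom_1, freqs_1), (chrom_2, freqs_2) = rng.sample(snps, 2)
        if chrom_1 == chrom_2:
            continue                      # keep only unlinked pairs
        null.append(pearson_r(freqs_1, freqs_2))
    extreme = sum(abs(r) >= abs(r_obs) for r in null)
    return extreme / len(null)

# Placeholder data: 200 synthetic SNPs across 22 chromosomes, 30 populations.
gen = random.Random(0)
snps = [(gen.randint(1, 22), [gen.random() for _ in range(30)])
        for _ in range(200)]

p_emp = empirical_p_value(-0.632, snps)
print(p_emp)
```

With real data the SNP frequencies share the populations' demographic history, so the empirical distribution is typically wider than the one these independent placeholder frequencies produce.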
[Figure: empirical distribution for correlations among unlinked SNPs (x-axis: correlation, -1.0 to 1.0; y-axis: density), with the observed KIR3DS1 – HLA-Bw4 correlation, r = -0.632, marked. Empirical p-value = 0.041.]
Genomic Data – Empirical Distribution
Significance of Correlations *
Locus pair       Correlation   p-value (1)     p-value (2)     p-value (3)
                               (correlation)   (permutation)   (empirical)
3DS1 - Bw4       -0.632        0.000           0.012           0.041
3DS1 - Bw4-80I   -0.657        0.000           0.009           0.038
3DS1 - Bw4-80T   -0.190        0.316           0.532           0.534
3DL1 - Bw4        0.426        0.019           0.106           0.218
3DL1 - Bw4-80I    0.416        0.022           0.115           0.191
3DL1 - Bw4-80T    0.171        0.367           0.540           0.758
2DS1 - C2        -0.478        0.008           0.243           0.149
2DL1 - C2         0.046        0.810           0.891           0.924
2DL2 - C1        -0.366        0.047           0.193           0.542
2DL3 - C1         0.184        0.331           0.458           0.328
2DS2 - C1        -0.371        0.044           0.170           0.479
* Ordinary Pearson p-values, column (1), overestimate the significance of trends.
(1) P-correlation is the ordinary Pearson product-moment correlation p-value.
(2) P-permutation is based on the permutation distribution under the null hypothesis.
(3) P-empirical is based on the empirical distribution for unlinked SNPs from ALFRED.
Acknowledgements
• NCI: Mary Carrington, Pat Martin, Gao Xiaojiang
• USP: Diogo Meyer, Rodrigo dos Santos Francisco
• Yale University: Ken and Judy Kidd
• Children's Hospital Oakland Research Inst.: Steven J. Mack, Jill A. Hollenbach
• Harvard Medical School: Alex Lancaster
• UC San Francisco: Owen Solberg
• Roche Molecular Systems: Henry A. Erlich
• Anthony Nolan Research Inst.: Steven G.E. Marsh
• NCBI/NIH: Mike Feolo
• NGIT: Jeff Wiser, Patrick Dunn, Tom Smith
Linkage Disequilibrium (LD) Measures
The two most common measures of the strength of LD are:
(1) the normalized measure of the individual LD values, namely
$D'_{ij} = D_{ij} / D_{max}$ (Lewontin 1964); and
(2) the correlation coefficient r for bi-allelic data, which is most often
reported as $r^2 = D^2 / (p_{A1} \, p_{A2} \, p_{B1} \, p_{B2})$.
r = 1 only when the allelic variations at the two loci show 100% correlation
Their multi-allelic extensions, with $p_i$ and $q_j$ the allele frequencies at the two loci, are:
$$ D' = \sum_{i=1}^{I} \sum_{j=1}^{J} p_i \, q_j \, |D'_{ij}| $$
$$ W_n = \left[ \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} D_{ij}^2 / (p_i \, q_j)}{\min(I-1,\, J-1)} \right]^{1/2} = \left[ \frac{X^2_{LD}}{2N \, \min(I-1,\, J-1)} \right]^{1/2} $$
Standard LD measures D’ and Wn
Standard LD measures (overall D’ & Wn) assume or force symmetry,
even though LD is generally not symmetric when a locus has more than 2 alleles
Data Source: ImmPort Study #SDY26: Identifying polymorphisms associated with
risk for the development of myopericarditis following smallpox vaccine
Asymmetric Linkage Disequilibrium (ALD)
[Figure: ALD matrix; row gene conditional on column gene.]
• ALD for HLA-DRB1 conditioning on HLA-DQA1: W_DRB1/DQA1 = 0.58
– Interpretation: the overall variation for DRB1 is relatively high given
specific DQA1 alleles.
• ALD for HLA-DQA1 conditioning on HLA-DRB1: W_DQA1/DRB1 = 0.95
– Interpretation: the overall variation for DQA1 is relatively low given
specific DRB1 alleles.
Thomson and Single, 2014 Genetics
• Balancing selection can result from:
- Overdominance/Heterozygote advantage
- Frequency-dependent selection
- Selective regimes that change over time/space
• For HLA, the common factor in these models is rare allele advantage,
which is consistent with a pathogen-directed frequency-dependent
selection model.
• At the Amino Acid (AA) level we see
- High AA variability at antigen recognition sites (ARS)
- Relatively even AA frequencies at ARS
- Higher rates of non-synonymous vs. synonymous changes at ARS
Balancing Selection Operates at Most HLA Loci
Meyer & Mack, 2008
Homozygosity (F) and the
Normalized Deviate (Fnd)
[Figure: three allele-frequency distributions (x-axis: allele; y-axis: allele frequency), one for each case below.]
• Neutrality: FOBS ≈ FEQ, Fnd ≈ 0
• Directional Selection: FOBS > FEQ, Fnd > 0
• Balancing Selection: FOBS < FEQ, Fnd < 0
$$ F = \sum_{i=1}^{k} p_i^2 $$
$$ F_{nd} = (F_{OBS} - F_{EQ}) / SD(F_{EQ}) $$
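The two formulas can be sketched as below. The expected homozygosity under neutrality (FEQ) and its standard deviation come from a neutral model not reproduced here, so `f_eq` and `sd_eq` are passed in as hypothetical values chosen only to show the sign of Fnd. The function names are illustrative.

```python
def homozygosity(freqs):
    """F = sum of squared allele frequencies."""
    return sum(p * p for p in freqs)

def normalized_deviate(f_obs, f_eq, sd_eq):
    """Fnd = (F_OBS - F_EQ) / SD(F_EQ)."""
    return (f_obs - f_eq) / sd_eq

# Hypothetical neutral expectation for a sample with k = 10 alleles.
f_eq, sd_eq = 0.20, 0.05

even = [0.1] * 10                 # even frequencies
skewed = [0.6] + [0.05] * 8       # one predominant allele

fnd_even = normalized_deviate(homozygosity(even), f_eq, sd_eq)
fnd_skewed = normalized_deviate(homozygosity(skewed), f_eq, sd_eq)
print(fnd_even, fnd_skewed)
```

Even allele frequencies give a low F and a negative Fnd, the direction associated with balancing selection; one predominant allele gives a high F and a positive Fnd.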
Fnd for DRB1 AA sites in a EUR population
• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.
Fnd for DRB1 AA sites (Meta-Analysis)
Fnd for all polymorphic sites in a meta-analysis of 57
populations
• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.
Asymmetric Linkage Disequilibrium (ALD)
Table 1. Linkage disequilibrium and genetic diversity measures
Description: definition of measure
1. Single locus homozygosity (F):
$$ F_A = \sum_i p_{Ai}^2 $$
2. Haplotype specific homozygosity (HSF):
$$ F_{A/B_j} = \sum_i (f_{ij} / p_{Bj})^2 $$
3. Overall weighted HSF values, $F_{A/B}$ (and $F_{B/A}$):
$$ F_{A/B} = \sum_j F_{A/B_j} \, p_{Bj} = F_A + \sum_i \sum_j D_{ij}^2 / p_{Bj} $$
4. Multi-allelic ALD squared, $W_{A/B}$ (and $W_{B/A}$):
$$ W_{A/B}^2 = (F_{A/B} - F_A) / (1 - F_A) $$
Thomson and Single (2014) Genetics
If both loci are bi-allelic:
$$ W_{A/B}^2 = \frac{\sum_i \sum_j D_{ij}^2 / p_{Bj}}{1 - F_A} = \frac{D^2}{p_{A1} \, p_{A2} \, p_{B1} \, p_{B2}} = r^2, $$
since $D_{11} = -D_{12} = -D_{21} = D_{22} = D$.
Thomson and Single (2014) Genetics
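The measures in Table 1 can be computed directly from a table of two-locus haplotype frequencies. The sketch below is illustrative only: the function name `ald` and the toy haplotype frequencies are hypothetical, and both loci are assumed to be polymorphic so that 1 − F is non-zero.

```python
from collections import defaultdict
from math import sqrt

def ald(hap_freqs):
    """Return (W_A/B, W_B/A) from {(allele_A, allele_B): haplotype frequency}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), f in hap_freqs.items():
        p_a[a] += f
        p_b[b] += f
    f_a = sum(p * p for p in p_a.values())          # single-locus homozygosity
    f_b = sum(p * p for p in p_b.values())
    # Overall weighted haplotype-specific homozygosity: sum_ij f_ij^2 / p_Bj
    f_a_given_b = sum(f * f / p_b[b] for (a, b), f in hap_freqs.items() if f > 0)
    f_b_given_a = sum(f * f / p_a[a] for (a, b), f in hap_freqs.items() if f > 0)
    # max(0, ...) guards against tiny negative values from rounding.
    w_ab = sqrt(max(0.0, (f_a_given_b - f_a) / (1 - f_a)))
    w_ba = sqrt(max(0.0, (f_b_given_a - f_b) / (1 - f_b)))
    return w_ab, w_ba

# Toy example: the B allele is fully determined by the A allele, but not vice versa.
haplotypes = {("A1", "B1"): 0.25, ("A2", "B1"): 0.25, ("A3", "B2"): 0.50}
w_ab, w_ba = ald(haplotypes)
print(w_ab, w_ba)

# Bi-allelic check: both ALD values equal |r|.
biallelic = {("A1", "B1"): 0.4, ("A1", "B2"): 0.1,
             ("A2", "B1"): 0.1, ("A2", "B2"): 0.4}
print(ald(biallelic))
```

In the toy example W_B/A is 1 because locus B shows no variation once the A allele is known, while W_A/B is smaller because A still varies on B1 haplotypes. This is the kind of asymmetry a single symmetric measure cannot express.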