Saurabh Sinha
Mayo-Illinois Computational Genomics Workshop
June 14, 2019
Acknowledgment for some slides to Arián Avalos
§ Molecular Markers
§ Genome Wide Association Studies (GWAS)
§ Functional Effects
§ What is a SNP and a SNV?
§ Single Nucleotide Polymorphism
§ Single Nucleotide Variant
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGTI3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGTI6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
§ A SNV is any change (e.g. a somatic mutation, even an artifact).
§ A SNP has defining criteria
§ Polymorphic SNV, with “Major” and “minor” alleles
§ Sometimes defined by frequency level (e.g. minimum minor allele frequency of 5%)
§ For reference, the 1000 Genomes project identified ~41 Million SNPs across ~1000 Individuals.
§ Both types of variants are relevant depending on the field
§ Population geneticists conducting association tests will focus on SNPs
§ Cancer geneticists will instead be interested in SNVs
§ The terminology is further complicated in non-human biology (e.g. polyploidy, horizontal gene transfer, etc.)
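The SNP criteria above can be checked directly on the eight example sequences I1 to I8. The following is a minimal sketch, assuming each individual is represented by a single (haploid) sequence as on the slide; the function names are illustrative, and the 5% cutoff is just the example threshold mentioned above.

```python
from collections import Counter

# The eight example sequences I1..I8 from the slide.
sequences = [
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I1
    "AACGAGCTAGCGATCGATCGACAACGACTACGAGGT",  # I2
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I3
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I4
    "AACGAGCTAGCGATCGATCGACAACGACTACGAGGT",  # I5
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I6
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I7
    "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT",  # I8
]


def variable_sites(seqs):
    """Return {0-based position: Counter of alleles} for columns with >1 allele."""
    sites = {}
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        if len(counts) > 1:
            sites[pos] = counts
    return sites


def minor_allele_frequency(counts):
    """Frequency of the second most common allele at a variable site."""
    total = sum(counts.values())
    return counts.most_common()[1][1] / total


sites = variable_sites(sequences)
for pos, counts in sites.items():
    maf = minor_allele_frequency(counts)
    # Report position (1-based), allele counts, MAF, and whether MAF >= 5%.
    print(pos + 1, dict(counts), maf, maf >= 0.05)
```

Only one column varies in this toy alignment: the major allele is T and the minor allele is A, carried by I2 and I5.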
§ Example:
§ Cystic Fibrosis and the CFTR gene mutations
§ Approach: Genetic Linkage Analysis
§ Genotype family members (some individuals carrying the disease)
§ Find a marker that correlates with the disease
§ Disease gene lies close to this marker
§ Limitations of Genetic Linkage Analysis
§ Requires data from entire families, preferably large ones, where the trait is segregating
§ Linkage analysis less successful with common diseases, e.g., heart disease or cancers.
§ Requires single, large effect loci
§ Hypothesize that common diseases are influenced by common genetic variation in the population
§ Implications:
§ Any individual variation (SNP) will have relatively small correlation with the disease
§ Multiple common alleles together influence the disease phenotype
§ This argues for population- rather than family-based studies.
Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822
§ Zhang X. et al. (2012). PLoS Comput Biol 8(12): e1002828.
§ Microarray – can assay 0.5 – 1.0 Million or more SNPs
§ Whole-genome sequencing (WGS) – assays (near) complete SNP profile
§ In non-human genetics, reduced-representation methods provide a middle-ground.
§ Case / Control – qualitative, usually binary measure (e.g. disease vs. no disease)
§ Quantitative – continuous measure, usually of complex phenotypes (e.g. blood pressure, LDL levels)
§ Possible to look at more than one phenotype?
§ Case / Control
                                          Disease?
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
§ Before analysis and interpretation, a few considerations:
§ Correlation is not causation
§ Linkage disequilibrium (see later)
§ Population structure (see later)
§ Phenotyping
§ Further consider that even if the analysis is successful, findings can be hard to interpret
§ Example:
§ SNP correlates well with heart disease
§ Biochemical link? Behavioral link (you particularly like bacon…)?
§ Case vs. Control
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I9: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I10: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  +
I11: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I12: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I13: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  -
I14: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +

          A    T
Case      3    1
Control   1    9
§ The Fisher’s Exact Test
[Figure: Venn diagram of All (14), Case (4) and A (4), with 3 individuals in the overlap of Case and A]
p-value < 0.05
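To see where the p-value for the Case vs. Control table comes from, the test can be written out with nothing more than binomial coefficients. This is a minimal sketch rather than a reference implementation: the function names are illustrative, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention, assumed rather than taken from the slides.

```python
from math import comb


def hypergeom_pmf(k, n_total, n_case, n_a):
    """P(exactly k of the n_a A-carriers are cases) if allele and disease are unrelated."""
    return comb(n_case, k) * comb(n_total - n_case, n_a - k) / comb(n_total, n_a)


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n_total = a + b + c + d
    n_case = a + b          # row total: cases
    n_a = a + c             # column total: A allele
    p_obs = hypergeom_pmf(a, n_total, n_case, n_a)
    lo = max(0, n_a - (n_total - n_case))
    hi = min(n_case, n_a)
    probs = [hypergeom_pmf(k, n_total, n_case, n_a) for k in range(lo, hi + 1)]
    # Sum over all tables that are no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


# Rows: Case, Control. Columns: A, T (the table from the slide).
table = [[3, 1], [1, 9]]
p_value = fisher_exact_two_sided(table)
print(f"p = {p_value:.4f}")
```

For this table the result is 41/1001, roughly 0.041, which agrees with the "p-value < 0.05" shown on the slide.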
§ A favored alternative to the Fisher’s Exact Test is the Chi-Squared Test.
§ We conduct this test on EACH SNP separately, and get a corresponding p-value.
§ The smallest p-values point to the SNPs most associated with the disease.
§ Both Fisher’s Exact Test and the Chi-Squared Test, as used here, are allelic association tests, i.e. we test if A instead of T at the polymorphic site correlates with the disease.
§ In a genotypic association test each position is a combination of two alleles, e.g. AA, TT, AT
§ We therefore correlate genotype with phenotype of the individual
§ There are various options for a Case vs. Control genotypic association test
§ Example:
§ Dominant Model
          AA or AT   TT
Case      ?          ?
Control   ?          ?
§ Recessive Model
          AA   AT or TT
Case      ?    ?
Control   ?    ?
§ 2x3 Table
          AA    AT    TT
Case      O11   O12   O13
Control   O21   O22   O23
Chi-Squared Test:
$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
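The Chi-Squared statistic for the 2x3 genotype table can be sketched in a few lines. Two things here are assumptions rather than content of the slides: the genotype counts are made up for illustration, and the expected counts are taken to be the usual independence values, E_ij = (row i total) x (column j total) / (grand total).

```python
from math import exp


def chi_squared(observed):
    """Chi-squared statistic: sum over cells of (O_ij - E_ij)^2 / E_ij."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if genotype and disease status were independent.
            e_ij = row_totals[i] * col_totals[j] / total
            stat += (o_ij - e_ij) ** 2 / e_ij
    return stat


# Hypothetical genotype counts (columns AA, AT, TT); not data from the slides.
observed = [
    [30, 50, 20],  # Case
    [10, 40, 50],  # Control
]
stat = chi_squared(observed)

# A 2x3 table has (2-1)*(3-1) = 2 degrees of freedom. For exactly 2 degrees
# of freedom the chi-squared upper-tail probability is exp(-stat / 2).
p_value = exp(-stat / 2)
print(f"chi2 = {stat:.3f}, p = {p_value:.2e}")
```

The closed-form tail probability only holds for 2 degrees of freedom; for other table shapes a statistics library would be needed for the p-value.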
[Figure: scatter plot of Y = Phenotype against X = Genotype, with fitted line y = 1.683x + 5.834, R² = 0.8644]
§ Quantitative Phenotypes
• 𝑌 = 𝑎 + 𝑏𝑋
• If no association, 𝑏 ≈ 0
• The more 𝑏 differs from 0, the stronger the association
• This is called linear regression
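A least-squares fit of Y = a + bX can be written out by hand to make the slope and R² concrete. This is a sketch under stated assumptions: genotype is coded as 0, 1 or 2 copies of one allele, and the phenotype values below are invented for illustration (they are not the data behind the plotted line).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope: change in phenotype per allele copy
    a = mean_y - b * mean_x       # intercept
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, 1 - ss_res / ss_tot


# Hypothetical data: genotype coded 0/1/2, one quantitative phenotype each.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
phenotypes = [5.1, 6.0, 6.3, 7.2, 7.9, 7.4, 9.0, 9.6, 8.8]

a, b, r2 = fit_line(genotypes, phenotypes)
print(f"Y = {a:.3f} + {b:.3f} X, R^2 = {r2:.3f}")
```

A slope b near 0 would indicate no association; in a real analysis the slope would also be tested for significance, which this sketch leaves out.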
§ Another statistical test commonly used on GWAS matrices is Analysis of Variance (ANOVA)
§ Statistical models for GWAS can get quite involved (can give references on request)
Lambert et al., 2013: Nature Genetics 45, 1452
§ Multiple Hypothesis Correction
§ What does p-value = 0.01 mean?
§ It means that, if there were no real association, a Genotype x Phenotype correlation at least this strong would arise by chance with only 1% probability.
§ What if we repeat the test for 1 Million SNPs? Of those tests, about 1% (10,000 SNPs) will show this level of correlation just by chance (and by definition).
§ Bonferroni (seen in Statistics lecture)
§ Multiply the p-value by the number of tests
§ So if the original SNP had p-value $p$, the new p-value is defined as $p' = p \times N$
§ With $N = 10^6$, a p-value of $10^{-9}$ is downgraded to: $p' = 10^{-9} \times 10^{6} = 10^{-3}$
§ This is quite good!
§ False Discovery Rate (seen in Statistics lecture)
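The Bonferroni adjustment is simple enough to write out. In this sketch the raw p-values are illustrative, the number of tests is the 1 Million SNPs used as the running example, and capping the adjusted value at 1 is a common convention added here (the rule as stated is just p times N).

```python
def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


n_tests = 10**6  # one test per SNP

# With no true associations, about 1% of tests still reach p <= 0.01.
expected_false_hits = 0.01 * n_tests

# Illustrative raw p-values for three SNPs (not real results).
raw = [1e-9, 1e-7, 1e-3]
adjusted = bonferroni(raw, n_tests)

print(expected_false_hits)
for p, p_adj in zip(raw, adjusted):
    print(f"raw p = {p:.0e} -> adjusted p = {p_adj:.3g}")
```

Only the first SNP stays comfortably below 0.05 after correction, which shows how small a raw p-value has to be at this scale.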
§ So far we have tested each SNP separately; however, recall our hypothesis that common diseases are influenced by common variants
§ Maybe considering two SNPs together will identify a stronger correlation with phenotype
§ Main problem: Number of pairs ~ $N^2$
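A quick count makes the scale of the pairwise problem concrete. This sketch assumes a panel of 1 Million SNPs and, purely for illustration, a Bonferroni-style threshold at a 0.05 significance level.

```python
from math import comb

n_snps = 10**6

# Number of unordered SNP pairs: N * (N - 1) / 2, which grows like N^2.
n_pairs = comb(n_snps, 2)

# A Bonferroni-style per-test threshold for all pairs at the 0.05 level.
pair_threshold = 0.05 / n_pairs

print(n_pairs)
print(f"{pair_threshold:.1e}")
```

That is roughly 5 x 10^11 tests, so both the computation and the multiple-testing burden grow sharply compared with single-SNP analysis.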
§ Further consider, in genotyping we may be using a Microarray (e.g. 0.5 – 1 Million SNPs)
§ But there are many more sites in the human genome where variation may exist. Will we then miss any causal variant outside the panel of ~1 Million?
§ Not necessarily
§ Two sites close to each other may vary in a highly correlated manner; this is Linkage Disequilibrium (LD)
§ In this situation, a lack of recombination events has made the inheritance of those two sites dependent
§ If two such sites have high LD, then one site can serve as proxy for the other
§ So if sites X & Y have high LD, and X is in the Microarray, then knowing the allelic form of X informs the allelic form at Y
§ In this way a reduced panel can represent a larger number (all?) of the common SNPs
§ A problem is that if X correlates with a disease, the causal variant may be either X or Y
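How well one site can act as a proxy for another can be quantified. The sketch below uses r², a commonly used LD measure that is assumed here rather than defined in the slides: with D = p_XY - p_X p_Y, r² = D² / (p_X (1 - p_X) p_Y (1 - p_Y)). The haplotypes are hypothetical.

```python
def r_squared(haplotypes, allele_x, allele_y):
    """LD between two sites from a list of (allele at X, allele at Y) haplotypes."""
    n = len(haplotypes)
    p_x = sum(1 for x, _ in haplotypes if x == allele_x) / n
    p_y = sum(1 for _, y in haplotypes if y == allele_y) / n
    p_xy = sum(1 for x, y in haplotypes if x == allele_x and y == allele_y) / n
    d = p_xy - p_x * p_y  # departure from independence between the two sites
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))


# Hypothetical haplotypes. In 'linked', A at site X always travels with G at site Y.
linked = [("A", "G")] * 6 + [("T", "C")] * 4
# In 'unlinked', the allele at X says nothing about the allele at Y.
unlinked = [("A", "G")] * 3 + [("A", "C")] * 3 + [("T", "G")] * 2 + [("T", "C")] * 2

r2_linked = r_squared(linked, "A", "G")
r2_unlinked = r_squared(unlinked, "A", "G")
print(r2_linked, r2_unlinked)
```

An r² near 1 means genotyping X effectively tells us Y; an r² near 0 means X is no proxy for Y. The function assumes both sites are polymorphic in the sample, otherwise the denominator is zero.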
§ In many cases, GWAS is able to find SNPs that have significant association with disease.
§ GWAS Catalog : http://www.genome.gov/26525384
§ Yet, final predictive power (ability to predict disease from genotype) is limited for complex diseases.
§ “Finding the Missing Heritability of Complex Diseases” http://www.genome.gov/27534229
§ Increasingly, whole-exome and even whole-genome sequencing used for variant detection
§ Taking on the non-coding variants. Use functional genomics data as template
§ Network-based analysis rather than single-site or site-pairs analysis
§ Complement GWAS with family-based studies
§ How do we predict how a variant is likely to be affecting protein function?
§ Case:
§ I found a SNP inside the coding sequence. Knowing how to translate the gene sequence to a protein sequence, I discovered that this is a non-synonymous change, i.e., the encoded amino acid changes. This is an nsSNP.
§ Will that impact the protein’s function?
§ (And I don’t quite know how the protein functions in the first place ...)
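The first step in the case above, deciding whether a coding SNP is synonymous or non-synonymous, can be sketched with the standard genetic code. The coding sequence and SNP positions below are hypothetical, and the function names are illustrative.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence codon by codon ('*' marks a stop codon)."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def classify_snp(cds, position, alt_base):
    """Return (reference aa, alternate aa, kind) for a SNP at a 0-based position."""
    mutated = cds[:position] + alt_base + cds[position + 1:]
    codon_index = position // 3
    ref_aa = translate(cds)[codon_index]
    alt_aa = translate(mutated)[codon_index]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


cds = "ATGGAGCTGTAA"  # hypothetical toy gene: Met-Glu-Leu-stop
print(translate(cds))
print(classify_snp(cds, 4, "T"))  # GAG -> GTG changes the amino acid
print(classify_snp(cds, 5, "A"))  # GAG -> GAA keeps the amino acid
```

This only answers whether the amino acid changes; whether the change affects protein function is the harder question that the tools discussed next try to address.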
§ Two popular approaches:
§ PolyPhen 2.0
§ Adzhubei, I. A. et al. (2010). Nat Methods 7(4):248-249
§ SIFT
§ Kumar P. et al. (2009). Nat Protoc 4(7):1073-1081
§ PolyPhen 2.0
§ The PolyPhen 2.0 pipeline uses existing data sets for training and later evaluation of target data.
§ Specifically the HumDiv database, which is
§ A compilation of all the damaging mutations with known effects on molecular function
§ A collection of non-damaging differences between human proteins and those of closely related mammalian homologs
§ A look at the Multiple Sequence Alignment (MSA) part of the PolyPhen 2.0 pipeline:
§ Of interest is the Position-Specific Independent Counts (PSIC) score.
§ This score reflects the amino acid’s frequency at the specific position in the sequence given an MSA
§ Example:
§ To derive the PSIC score we first calculate the frequency of each amino acid:
$p(a, i) = \frac{n(a, i)^{\text{eff}}}{\sum_b n(b, i)^{\text{eff}}}$
§ The idea:
§ $n(a, i)^{\text{eff}}$ is not the raw count of amino acid “$a$” at position $i$; rather, it is adjusted for the many closely related sequences in the MSA
§ The PSIC score of a SNP $a \to b$ at position $i$ is given by:
$\text{PSIC}(a \to b, i) \propto \ln \frac{p(b, i)}{p(a, i)}$
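The shape of the PSIC calculation can be illustrated on a single MSA column. This is a deliberately simplified sketch: it uses raw counts in place of the effective counts (so it skips the adjustment for closely related sequences, which is the main point of PSIC), and the alignment is hypothetical.

```python
from collections import Counter
from math import log


def column_frequencies(msa, i):
    """Frequency of each amino acid at column i (raw counts stand in for n^eff)."""
    counts = Counter(seq[i] for seq in msa)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def psic_like_score(msa, i, a, b):
    """ln(p(b, i) / p(a, i)) for a substitution a -> b at column i."""
    p = column_frequencies(msa, i)
    return log(p[b] / p[a])


# Hypothetical alignment of ten sequences; column 2 holds L x7, V x2, I x1.
msa = ["MKLA"] * 7 + ["MKVA"] * 2 + ["MKIA"]

freqs = column_frequencies(msa, 2)
score = psic_like_score(msa, 2, "L", "V")
print(freqs)
print(round(score, 4))
```

The score is negative here because V is rarer than L at this column. A substitution to an amino acid never seen at the column would need separate handling (this sketch would raise a KeyError).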
§ Ultimately your derived score can be compared with the existing scores from HumDiv
§ Classification
§ Naive Bayes method
§ A type of classifier. Other classification algorithms include “Support Vector Machine”, “Decision Tree”, “Neural Net”, “Random Forest” etc.
§ Sometimes called “Machine Learning”
§ What is a classification algorithm?
§ What is a Naive Bayes method/classifier?
“Supervised Learning”
Training Data:
Positive examples
§ $x_{11}, x_{12}, \ldots, x_{1n}$  +
§ $x_{21}, x_{22}, \ldots, x_{2n}$  +
§ …
Negative examples
§ $x_{i+1,1}, x_{i+1,2}, \ldots, x_{i+1,n}$  -
§ $x_{i+2,1}, x_{i+2,2}, \ldots, x_{i+2,n}$  -
§ …
Training Data → MODEL
Data Vector $x_1, x_2, \ldots, x_n$ → MODEL → Yes or No
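Training Data → $\Pr(x_1 \mid +), \Pr(x_1 \mid -), \Pr(x_2 \mid +), \Pr(x_2 \mid -), \ldots, \Pr(x_n \mid +), \Pr(x_n \mid -)$
$\Pr(+ \mid x_1, x_2, \ldots, x_n) \propto \Pr(x_1 \mid +) \Pr(x_2 \mid +) \cdots \Pr(x_n \mid +) \Pr(+)$
$\Pr(- \mid x_1, x_2, \ldots, x_n) \propto \Pr(x_1 \mid -) \Pr(x_2 \mid -) \cdots \Pr(x_n \mid -) \Pr(-)$
Output: + or −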
• Bayesian Inference:
• Expresses how a subjective assessment of likelihood should rationally change to account for evidence
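The Naive Bayes recipe (estimate Pr(x_j | +) and Pr(x_j | −) from training data, then compare the two products) can be sketched for binary features. Everything specific here is an assumption for illustration: the features are 0/1, the training examples are made up, and add-one (Laplace) smoothing is used so that no estimated probability is exactly 0 or 1.

```python
from math import log


def train_naive_bayes(examples, labels):
    """Estimate Pr(label) and Pr(x_j = 1 | label) from binary feature vectors."""
    model = {}
    n_features = len(examples[0])
    for label in sorted(set(labels)):
        rows = [x for x, y in zip(examples, labels) if y == label]
        prior = len(rows) / len(examples)
        # Add-one smoothing: (count + 1) / (n + 2) for a binary feature.
        likelihoods = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[label] = (prior, likelihoods)
    return model


def predict(model, x):
    """Return the label with the larger log Pr(label) + sum_j log Pr(x_j | label)."""
    scores = {}
    for label, (prior, likelihoods) in model.items():
        score = log(prior)
        for x_j, p_j in zip(x, likelihoods):
            score += log(p_j if x_j == 1 else 1 - p_j)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: three positive and three negative examples.
examples = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
labels = ["+", "+", "+", "-", "-", "-"]

model = train_naive_bayes(examples, labels)
print(predict(model, [1, 1, 0]))
print(predict(model, [0, 0, 1]))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when the number of features is large; the ranking of the two classes is unchanged.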
§ Evaluating a classifier: Cross-validation
[Figure: the data are split into k parts. In FOLD 1, FOLD 2, …, FOLD k a different part is held out each time: train on the remaining parts, then predict and evaluate on the held-out part]
§ Collect all evaluation results (from k “FOLD”s)
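The cross-validation loop can be sketched generically. The helper names are illustrative, the toy "classifier" simply predicts the majority label of its training set, and the data are hypothetical; in practice the examples would usually be shuffled before being split into folds.

```python
def k_fold_splits(n_examples, k):
    """Yield (train indices, test indices) so each example is tested exactly once."""
    indices = list(range(n_examples))
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, folds[f]


def cross_validate(examples, labels, k, train, predict):
    """Train on k-1 folds, evaluate on the held-out fold, and pool the results."""
    correct = 0
    for train_idx, test_idx in k_fold_splits(len(examples), k):
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(predict(model, examples[i]) == labels[i] for i in test_idx)
    return correct / len(examples)


# Toy stand-in classifier: always predict the most common training label.
def train_majority(xs, ys):
    return max(sorted(set(ys)), key=ys.count)


def predict_majority(model, x):
    return model


examples = list(range(9))          # placeholder feature values
labels = ["+"] * 6 + ["-"] * 3     # hypothetical labels
accuracy = cross_validate(examples, labels, 3, train_majority, predict_majority)
print(accuracy)
```

Because every example lands in exactly one held-out fold, the pooled accuracy is computed over the whole data set without any example being scored by a model that saw it during training.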
§ Evaluating Classification Performance
[Figure source: Wikipedia]
§ The Receiver Operating Characteristic (ROC) curve
§ True positive rate vs. false positive rate
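The points of a ROC curve can be computed by sweeping a threshold over the classifier's scores. This sketch assumes that higher scores mean "more likely positive" and that all scores are distinct (ties would need to be grouped); the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points as the threshold drops."""
    n_pos = sum(1 for is_pos in labels if is_pos)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    tp = fp = 0
    for _, is_pos in ranked:
        # Lower the threshold just enough to call one more example positive.
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical classifier scores and true labels (True = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

A curve that rises steeply toward the top-left corner indicates a classifier that ranks most positives above most negatives; the diagonal corresponds to random guessing.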
§ What about those SNPs outside the coding regions?
§ Generally hard enough to predict within coding regions; regulatory sequences are notoriously hard to pin down
§ Interesting new approaches are emerging that use machine learning to predict impact on TF binding strength or DNA accessibility