
Saurabh Sinha

Mayo-Illinois Computational Genomics Workshop

June 14, 2019

Acknowledgment for some slides to Arián Avalos

§ Molecular Markers

§ Genome Wide Association Studies (GWAS)

§ Functional Effects

§ What is a SNP and a SNV?

§ Single Nucleotide Polymorphism

§ Single Nucleotide Variant

I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGTI3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGTI6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGTI8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT

§ A SNV is any change (e.g. a somatic mutation, even an artifact).

§ A SNP has defining criteria

§ Polymorphic SNV, with “Major” and “minor” alleles

§ Sometimes defined by frequency level (e.g. a minor allele frequency of at least 5%)
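As a rough illustration of the frequency criterion, the sketch below computes allele frequencies at the variable site of the eight example sequences and applies a 5% cutoff. The function names are hypothetical, and each individual is treated as contributing a single allele, as in the example sequences.

```python
from collections import Counter

# Alleles observed at the variable site in the example individuals I1..I8.
alleles = ["T", "A", "T", "T", "A", "T", "T", "T"]

def allele_frequencies(observed):
    """Return {allele: frequency} for a list of observed alleles."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def is_snp(observed, maf_threshold=0.05):
    """Call a site a SNP if it is polymorphic and its minor allele
    frequency reaches the threshold (0.05 is the example cutoff)."""
    freqs = allele_frequencies(observed)
    if len(freqs) < 2:
        return False  # monomorphic site: no minor allele at all
    return min(freqs.values()) >= maf_threshold

freqs = allele_frequencies(alleles)
major = max(freqs, key=freqs.get)
minor = min(freqs, key=freqs.get)
print(major, minor, freqs[minor], is_snp(alleles))
```

Here T is the major allele and A the minor allele, with a minor allele frequency of 2/8 = 0.25.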

§ For reference, the 1000 Genomes project identified ~41 Million SNPs across ~1000 Individuals.

§ Both types of variants are relevant depending on the field

§ Population geneticists conducting association tests will focus on SNPs

§ Cancer geneticists will instead be interested in SNVs

§ The terminology is further complicated in non-human biology (e.g. polyploidy, horizontal gene transfer, etc.)

§ Example:

§ Cystic Fibrosis and the CFTR gene mutations

§ Approach: Genetic Linkage Analysis

§ Genotype family members (some individuals carrying the disease)

§ Find a marker that correlates with the disease

§ Disease gene lies close to this marker

§ Limitations of Genetic Linkage Analysis

§ Requires data from entire families, preferably large ones, where the trait is segregating

§ Linkage analysis is less successful with common diseases, e.g., heart disease or cancers.

§ Requires single, large effect loci

§ Hypothesize that common diseases are influenced by common genetic variation in the population

§ Implications:

§ Any individual variation (SNP) will have relatively small correlation with the disease

§ Multiple common alleles together influence the disease phenotype

§ This argues for population- rather than family-based studies.

Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822

§ Zhang X. et al. (2012). PLoS Comput Biol 8(12): e1002828. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002828

§ Bush W. S. & Moore J. H. (2012). PLoS Comput Biol 8(12): e1002822. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002822

§ Microarray – can assay 0.5 – 1.0 Million or more SNPs

§ Whole-genome sequencing (WGS) – assays (near) complete SNP profile

§ In non-human genetics, reduced-representation methods provide a middle-ground.

§ Case / Control – qualitative, usually binary measure (e.g. disease vs. no disease)

§ Quantitative – continuous measure, usually of complex phenotypes (e.g. blood pressure, LDL levels)

§ Possible to look at more than one phenotype?

§ Case / Control

                                          Disease?
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -

§ Before analysis and interpretation, a few considerations:

§ Correlation is not causation

§ Linkage disequilibrium (see later)

§ Population structure (see later)

§ Phenotyping

§ Further consider that even if the analysis is successful, findings can be hard to interpret

§ Example:

§ SNP correlates well with heart disease

§ Biochemical link? Behavioral link (you particularly like bacon…)?

§ Case vs. Control

I1:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I2:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I3:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I4:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I5:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I6:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I7:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I8:  AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +
I9:  AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I10: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  +
I11: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I12: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT  -
I13: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  -
I14: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT  +

         A   T
Case     3   1
Control  1   9
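The counts in this table can be reproduced by tallying the 14 example individuals. The sketch below does that; the allele and disease status of each individual are read off the sequences above, and the function name is hypothetical.

```python
from collections import Counter

# (allele at the variable site, disease status) for I1..I14, read off the example.
individuals = [
    ("T", "-"), ("T", "-"), ("T", "-"), ("T", "-"), ("A", "+"),
    ("T", "-"), ("T", "-"), ("A", "+"), ("T", "-"), ("T", "+"),
    ("T", "-"), ("T", "-"), ("A", "-"), ("A", "+"),
]

def allele_by_status(records):
    """Build the 2x2 allele-by-status table from (allele, status) pairs."""
    counts = Counter(records)
    return {
        "Case": {a: counts[(a, "+")] for a in ("A", "T")},
        "Control": {a: counts[(a, "-")] for a in ("A", "T")},
    }

table = allele_by_status(individuals)
for status, row in table.items():
    print(status, row["A"], row["T"])
```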

§ The Fisher’s Exact Test

[Figure: Venn diagram of all 14 individuals, with 4 cases, 4 carriers of allele A, and 3 individuals in the overlap (cases carrying A); p-value < 0.05]
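To make the p-value concrete, here is a minimal sketch of a two-sided Fisher's exact test written with the standard library only. It assumes the common convention of summing the probabilities of all tables that are no more likely than the observed one; the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one
    # (the small factor guards against floating-point ties).
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Case: A=3, T=1; Control: A=1, T=9
p_value = fisher_exact_two_sided(3, 1, 1, 9)
print(round(p_value, 4))
```

For the example table this gives 41/1001, about 0.041, which agrees with the p-value < 0.05 shown on the slide.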

§ A favored alternative to the Fisher’s Exact Test is the Chi-Squared Test.

§ We conduct this test on EACH SNP separately, and get a corresponding p-value.

§ The smallest p-values point to the SNPs most associated with the disease.

§ Used this way, both Fisher’s Exact Test and the Chi-Squared Test are allelic association tests, i.e. we test if A instead of T at the polymorphic site correlates with the disease.

§ In a genotypic association test each position is a combination of two alleles, e.g. AA, TT, AT

§ We therefore correlate genotype with phenotype of the individual

§ There are various options for a Case vs. Control genotypic association test

§ Example:

§ Dominant Model

         AA or AT   TT
Case     ?          ?
Control  ?          ?


§ Recessive Model

         AA   AT or TT
Case     ?    ?
Control  ?    ?


§ 2x3 Table

         AA    AT    TT
Case     O11   O12   O13
Control  O21   O22   O23

Chi-Squared Test:

$\chi^2 = \sum_i \sum_j \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
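The three genotypic tables above can be sketched in a few lines. The genotype counts below are made up for illustration, the function names are hypothetical, and the expected counts $E_{ij}$ are taken to be (row total × column total) / grand total, the usual choice for a test of independence.

```python
from math import exp

def chi_squared(table):
    """Pearson chi-squared statistic for a table given as a list of rows.
    Assumes every row and column total is non-zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def collapse(table, model):
    """Collapse [AA, AT, TT] counts into a 2x2 table for allele A."""
    if model == "dominant":      # AA or AT vs. TT
        return [[aa + at, tt] for aa, at, tt in table]
    if model == "recessive":     # AA vs. AT or TT
        return [[aa, at + tt] for aa, at, tt in table]
    raise ValueError(f"unknown model: {model}")

# Hypothetical genotype counts (rows: Case, Control; columns: AA, AT, TT).
genotypes = [[30, 50, 20],
             [10, 40, 50]]

stat = chi_squared(genotypes)
# A 2x3 table has (2-1)*(3-1) = 2 degrees of freedom, and the chi-squared
# tail probability with 2 degrees of freedom is exactly exp(-stat / 2).
p_value = exp(-stat / 2)

print(round(stat, 3), p_value)
print(collapse(genotypes, "dominant"), collapse(genotypes, "recessive"))
```

The closed-form p-value only holds for 2 degrees of freedom; the collapsed 2x2 tables have 1 degree of freedom and would need a chi-squared distribution function from a statistics library.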

[Figure: scatter plot of phenotype (Y) against genotype (X), with fitted line y = 1.683x + 5.834 and R² = 0.8644]

§ Quantitative Phenotypes

• 𝑌 = 𝑎 + 𝑏𝑋

• If no association, 𝑏 ≈ 0

• The more 𝑏 differs from 0, the stronger the association

• This is called linear regression
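A minimal least-squares sketch of the fit described above. The data are made up, and coding the genotype as the number of copies of one allele (0, 1 or 2) is an assumption made here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                 # slope: change in phenotype per allele copy
    a = mean_y - b * mean_x       # intercept
    r_squared = sxy ** 2 / (sxx * syy)
    return a, b, r_squared

# Hypothetical data: X = number of copies of allele A, Y = a made-up phenotype.
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

a, b, r2 = fit_line(genotype, phenotype)
print(a, b, round(r2, 4))
```

With these values the slope b is 2.0, well away from 0, which is what an association looks like under this model.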

§ Quantitative Phenotypes

§ Another statistical test commonly used on GWAS matrices is Analysis of Variance (ANOVA)

§ Statistical models for GWAS can get quite involved (can give references on request)
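For the ANOVA mentioned above, a rough sketch of the one-way F statistic is shown below, with phenotype values grouped by genotype. The data and the function name are hypothetical, and turning F into a p-value would need an F-distribution function from a statistics library, which is left out here.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of measurements."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)          # number of groups (genotypes)
    n = len(all_values)      # total number of individuals
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individuals around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical phenotype values grouped by genotype.
by_genotype = {
    "TT": [5.0, 6.0, 5.5],
    "AT": [7.0, 8.0, 7.5],
    "AA": [9.0, 10.0, 9.5],
}
f_stat = one_way_anova_f(list(by_genotype.values()))
print(f_stat)
```

Unlike the linear regression, this treats the three genotypes as unordered groups, so it does not assume that the phenotype changes by the same amount per allele copy.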

Lambert et al., 2013: Nature Genetics 45, 1452

§ Multiple Hypothesis Correction

§ What does p-value = 0.01 mean?

§ It means that, if there were no true association, a Genotype x Phenotype correlation at least this strong would arise just by chance only 1% of the time.

§ What if we repeat the test for 1 Million SNPs? If none is truly associated, about 1% of those tests (10,000 SNPs) will still show this level of correlation, just by chance (and by definition)

§ Multiple Hypothesis Correction

§ Bonferroni (seen in Statistics lecture)

§ Multiply the p-value by the number of tests

§ So if the original SNP had p-value $p$, the new p-value is defined as $p' = p \times N$

§ With $N = 10^6$, a p-value of $10^{-9}$ is downgraded to: $p' = 10^{-9} \times 10^6 = 10^{-3}$

This is quite good!

§ False Discovery Rate (seen in Statistics lecture)
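The Bonferroni adjustment is simple enough to sketch directly. The raw p-values below are hypothetical, the function name is made up, and capping the adjusted value at 1 is a common convention added here rather than something stated on the slide.

```python
def bonferroni(p_values, n_tests=None):
    """Bonferroni-adjusted p-values: p' = min(1, p * N)."""
    n = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * n) for p in p_values]

n_snps = 1_000_000

# Expected number of SNPs reaching p <= 0.01 by chance if none is associated.
expected_false_hits = 0.01 * n_snps

# Hypothetical raw p-values, adjusted as if they came from a 1-million-SNP scan.
raw = [1e-9, 1e-6, 0.01]
adjusted = bonferroni(raw, n_tests=n_snps)
significant = [p_adj <= 0.05 for p_adj in adjusted]

print(expected_false_hits, adjusted, significant)
```

Only the first SNP survives the correction at the 0.05 level; a raw p-value of 0.01 is nowhere near enough when a million tests are run.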

§ So far we have tested each SNP separately, however recall our hypothesis that common diseases are influenced by common variants

§ Maybe considering two SNPs together will identify a stronger correlation with phenotype

§ Main problem: Number of pairs ~ $N^2$
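A quick calculation shows how fast the number of pairs grows. The panel size of one million SNPs is taken from the microarray example; applying a Bonferroni-style threshold to pairs is an assumption made here to show the statistical cost.

```python
from math import comb

n_snps = 1_000_000

# Number of unordered SNP pairs: N * (N - 1) / 2, which grows like N^2.
n_pairs = comb(n_snps, 2)

# Bonferroni-style significance thresholds at an overall level of 0.05.
single_threshold = 0.05 / n_snps
pair_threshold = 0.05 / n_pairs

print(n_pairs, single_threshold, pair_threshold)
```

Testing every pair means roughly 5 × 10^11 tests, so both the computing time and the multiple-testing burden increase by several orders of magnitude.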

§ Further consider, in genotyping we may be using a Microarray (e.g. 0.5 – 1 Million SNPs)

§ But there are many more sites in the human genome where variation may exist. Will we then miss any causal variant outside the panel of ~1 Million?

§ Not necessarily

§ Two sites close to each other may vary in a highly correlated manner; this is Linkage Disequilibrium (LD)

§ In this situation, a lack of recombination events has made the inheritance of those two sites dependent

§ If two such sites have high LD, then one site can serve as proxy for the other

§ So if sites X & Y have high LD, and X is in the Microarray, then knowing the allelic form of X informs the allelic form at Y

§ In this way a reduced panel can represent a larger number (all?) of the common SNPs

§ A problem is that if X correlates with a disease, the causal variant may be either X or Y
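One common way to quantify how well site X serves as a proxy for site Y is the squared correlation r² between their alleles. The sketch below computes it from phased haplotypes; the haplotypes and the function name are hypothetical, and the slides do not say which LD measure is intended.

```python
def ld_r_squared(haplotypes, allele_x, allele_y):
    """r^2 between two biallelic sites from phased haplotypes.

    haplotypes: list of (allele at X, allele at Y) pairs, one per chromosome.
    Assumes both sites are polymorphic in the sample.
    """
    n = len(haplotypes)
    p_x = sum(1 for x, _ in haplotypes if x == allele_x) / n
    p_y = sum(1 for _, y in haplotypes if y == allele_y) / n
    p_xy = sum(1 for x, y in haplotypes if x == allele_x and y == allele_y) / n
    d = p_xy - p_x * p_y                      # disequilibrium coefficient D
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))

# Hypothetical phased haplotypes: (allele at site X, allele at site Y).
perfect = [("A", "G")] * 3 + [("T", "C")] * 5
independent = [("A", "G"), ("A", "C"), ("T", "G"), ("T", "C")] * 2

print(ld_r_squared(perfect, "A", "G"))       # X fully predicts Y
print(ld_r_squared(independent, "A", "G"))   # X says nothing about Y
```

An r² near 1 means that genotyping X is nearly as informative as genotyping Y, which is also why an association at X cannot by itself tell the two sites apart.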

§ In many cases, GWAS are able to find SNPs that have significant association with disease.

§ GWAS Catalog : http://www.genome.gov/26525384

§ Yet, final predictive power (ability to predict disease from genotype) is limited for complex diseases.

§ “Finding the Missing Heritability of Complex Diseases” http://www.genome.gov/27534229

§ Increasingly, whole-exome and even whole-genome sequencing used for variant detection

§ Taking on the non-coding variants. Use functional genomics data as template

§ Network-based analysis rather than single-site or site-pairs analysis

§ Complement GWAS with family-based studies

§ How do we predict how a variant is likely to be affecting protein function?

§ Case:

§ I found a SNP inside the coding sequence. Knowing how to translate the gene sequence to a protein sequence, I discovered that this is a non-synonymous change, i.e., the encoded amino acid changes. This is an nsSNP.

§ Will that impact the protein’s function?

§ (And I don’t quite know how the protein functions in the first place ...)
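The translation step in this case can be sketched as follows: look up the codon before and after the change in the standard genetic code and compare the amino acids. The coding sequence and the function name are hypothetical, and the sequence is assumed to start in frame.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_coding_snp(cds, position, alt_base):
    """Classify a single-base change in a coding sequence.

    cds: coding sequence starting at the first base of a codon (in frame).
    position: 0-based index of the changed base.
    Returns (reference amino acid, alternate amino acid, label).
    """
    offset = position % 3
    start = position - offset
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, label

cds = "ATGGAAGTT"   # hypothetical coding sequence: Met-Glu-Val

print(classify_coding_snp(cds, 4, "T"))   # GAA -> GTA
print(classify_coding_snp(cds, 5, "G"))   # GAA -> GAG
```

The first change swaps glutamate for valine and is an nsSNP; the second leaves the amino acid unchanged. Whether the nsSNP actually damages the protein is the question the tools below try to answer.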

§ Two popular approaches:

§ PolyPhen 2.0

§ Adzhubei, I. A. et al. (2010). Nat Methods 7(4):248-249. https://www.nature.com/articles/nmeth0410-248

§ SIFT

§ Kumar P. et al. (2009). Nat Protoc 4(7):1073-1081. https://www.nature.com/articles/nprot.2009.86

§ PolyPhen 2.0

§ The PolyPhen 2.0 pipeline uses existing data sets for training and later evaluation of target data.

§ Specifically the HumDiv database, which is

§ A compilation of all the damaging mutations with known effects on molecular function

§ A collection of non-damaging differences between human proteins and those of closely related mammalian homologs

§ A look at the Multiple Sequence Alignment (MSA) part of the PolyPhen 2.0 pipeline:

§ Of interest is the Position-Specific Independent Counts (PSIC) score.

§ This score reflects the amino acid’s frequency at the specific position in the sequence given an MSA

§ Example:

§ To derive the PSIC score we first calculate the frequency of each amino acid:

$p(a, i) = \dfrac{n(a, i)^{\mathrm{eff}}}{\sum_b n(b, i)^{\mathrm{eff}}}$

§ The idea:

§ $n(a, i)^{\mathrm{eff}}$ is not the raw count of amino acid “$a$” at position $i$; rather, it is adjusted for the many closely related sequences in the MSA

§ The PSIC score of a SNP 𝑎 → 𝑏 at position 𝑖 is given by:

$\mathrm{PSIC}(a \to b, i) \propto \ln \dfrac{p(b, i)}{p(a, i)}$
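The shape of this calculation can be sketched on a single alignment column. This is a simplified stand-in rather than PolyPhen's actual computation: raw residue counts are used in place of the effective counts, gaps are ignored, and both residues are assumed to occur in the column. The column and the function names are hypothetical.

```python
from collections import Counter
from math import log

def column_frequencies(column):
    """p(a, i) for one MSA column, using raw counts instead of effective counts."""
    counts = Counter(residue for residue in column if residue != "-")
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

def psic_like_score(column, ref, alt):
    """ln(p(alt, i) / p(ref, i)), following the proportionality above.
    Raises KeyError if either residue is absent from the column."""
    p = column_frequencies(column)
    return log(p[alt] / p[ref])

# Hypothetical alignment column at position i (one residue per aligned sequence).
column = list("LLLLLLLVV-")
score = psic_like_score(column, ref="L", alt="V")
print(round(score, 4))
```

The score is negative here because the substituted residue V is rarer at this position than the reference residue L.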

§ Ultimately your derived score can be compared with the existing scores from HumDiv

§ Classification

§ Naive Bayes method

§ A type of classifier. Other classification algorithms include “Support Vector Machine”, “Decision Tree”, “Neural Net”, “Random Forest” etc.

§ Sometimes called “Machine Learning”

§ What is a classification algorithm?

§ What is a Naive Bayes method/classifier?

Training Data (“Supervised Learning”)

Positive examples:

§ $x_{11}, x_{12}, \ldots, x_{1n}$ +

§ $x_{21}, x_{22}, \ldots, x_{2n}$ +

§ …

Negative examples:

§ $x_{i+1,1}, x_{i+1,2}, \ldots, x_{i+1,n}$ -

§ $x_{i+2,1}, x_{i+2,2}, \ldots, x_{i+2,n}$ -

§ …

Training Data → MODEL

Data Vector $x_1, x_2, \ldots, x_n$ → MODEL → Yes or No

Estimated from Training Data: Pr(x1 | +), Pr(x1 | −), Pr(x2 | +), Pr(x2 | −), ..., Pr(xn | +), Pr(xn | −)

Pr(+ | x1, x2, ..., xn) ∝ Pr(x1 | +) Pr(x2 | +) ... Pr(xn | +) Pr(+)

Pr(− | x1, x2, ..., xn) ∝ Pr(x1 | −) Pr(x2 | −) ... Pr(xn | −) Pr(−)

Output: + or −

• Bayesian Inference:

• Expresses how a subjective assessment of likelihood should rationally change to account for evidence
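The Naive Bayes formulas above can be sketched for binary features as follows. The training data are made up, the function names are hypothetical, and the Laplace pseudocount is an assumption added here so that no estimated probability is exactly zero. This is a generic illustration, not PolyPhen's implementation.

```python
from collections import defaultdict
from math import log

def train_naive_bayes(examples, labels, alpha=1.0):
    """Estimate Pr(label) and Pr(x_j = v | label) for binary (0/1) features.

    alpha is a Laplace pseudocount (an assumption, not from the slides).
    """
    n_features = len(examples[0])
    class_counts = defaultdict(int)
    value_counts = defaultdict(lambda: [[0, 0] for _ in range(n_features)])
    for x, label in zip(examples, labels):
        class_counts[label] += 1
        for j, v in enumerate(x):
            value_counts[label][j][v] += 1
    model = {}
    for label, n in class_counts.items():
        prior = n / len(labels)
        cond = [[(c[v] + alpha) / (n + 2 * alpha) for v in (0, 1)]
                for c in value_counts[label]]
        model[label] = (prior, cond)
    return model

def predict(model, x):
    """Return the label maximizing Pr(x1|c) ... Pr(xn|c) Pr(c), in log space."""
    def log_score(label):
        prior, cond = model[label]
        return log(prior) + sum(log(cond[j][v]) for j, v in enumerate(x))
    return max(model, key=log_score)

# Hypothetical training data: 3 binary features per variant, "+" = damaging.
X = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
y = ["+", "+", "+", "-", "-", "-"]

model = train_naive_bayes(X, y)
print(predict(model, (1, 1, 0)), predict(model, (0, 0, 1)))
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied, and it does not change which label wins.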

§ Evaluating a classifier: Cross-validation

[Figure: k-fold cross-validation. In each fold (FOLD 1, FOLD 2, ..., FOLD k), the model is trained on one portion of the data and used to predict and evaluate on the held-out portion.]

Collect all evaluation results (from k “FOLD”s)
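A minimal sketch of the k-fold procedure is given below. The majority-label classifier is only a toy stand-in so that the example runs on its own, all names are hypothetical, and the data are split in their given order, whereas in practice they are usually shuffled first.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if f < n_items % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(examples, labels, k, train_fn, predict_fn):
    """Collect (true label, predicted label) pairs over all k folds."""
    results = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            results.append((labels[i], predict_fn(model, examples[i])))
    return results

# Toy stand-in classifier (hypothetical): predict the majority training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

labels = ["+", "-", "-", "+", "-", "-", "-"]
examples = list(range(len(labels)))

results = cross_validate(examples, labels, 3, train_majority, predict_majority)
accuracy = sum(t == p for t, p in results) / len(results)
print(len(results), round(accuracy, 3))
```

Every example is used for evaluation exactly once, and never by a model that saw it during training.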

§ Evaluating Classification Performance

[Figure; source: Wikipedia]

§ The Receiver Operating Characteristic (ROC) curve

§ True +ve vs False +ve
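An ROC curve can be traced by sweeping a threshold over the classifier's scores and recording the true positive rate against the false positive rate. The sketch below does this for made-up scores; the function names are hypothetical, and the area under the curve is included as a common one-number summary that the slides do not mention.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    scores: classifier scores, higher means more likely positive.
    labels: True for positive examples, False for negative ones.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for idx, (score, is_positive) in enumerate(ranked):
        if is_positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for four positives and four negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [True, True, False, True, True, False, False, False]

points = roc_points(scores, labels)
print(points)
print(auc(points))
```

A classifier that ranks every positive above every negative would give an area of 1, and random scores would give about 0.5; this example gives 0.875.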

§ What about those SNPs outside the coding regions?

§ Generally hard enough to predict within coding regions; regulatory sequences are notoriously hard to pin down

§ Interesting new approaches are coming up that use machine learning to predict impact on TF binding strength or DNA accessibility