Computational problems involving Single Nucleotide Polymorphisms

1

Pritam Chanda

2

Agenda

• Biological background
• SNP representation
• Tag SNP selection
• Haplotype analysis
• SNP-disease association study
• Discussion

3

Central Dogma

4

A cell and its chromosomes

5

DNA structure

Base pairs: A-T, G-C

5’ A T A T A T G C A G C A 3’
3’ T A T A T A C G T C G T 5’

The two chains are anti-parallel; one of them serves as the template strand.

Thus, each chromosome can be thought of as a sequence of A's, T's, G's and C's.

6

Genomic Variation and SNPs

• Human genome: ≈ 3 × 10^9 base pairs.
• Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
  – Single-base changes in the genome sequence that occur in a significant proportion (more than 1 percent) of the population
  – Most SNPs are bi-allelic (two variants)
• Total number of SNPs: ≈ 1 × 10^7
• Difference between any two individuals: ≈ 3 × 10^6 SNPs (≈ 0.1% of the entire genome)

Sequences on a pair of homologous chromosomes

7

Why important ?

• A SNP (pronounced ‘snip’) can alter the amino acid sequence of the protein produced.
• Not always:
  – A protein consists of a sequence of amino acids.
  – There are 20 amino acids in total.
  – The genetic code produces amino acids by reading groups of 3 nucleotides at a time.
    • 4^3 = 64 different combinations of A, T, G, C.
  – Thus not all combinations of 3 nucleotides produce different amino acids.
• Redundancy in the genetic code:
  – A SNP in which both alleles lead to the same protein sequence is termed synonymous.
  – If different proteins are produced, the SNP is non-synonymous.

8

Why important ?

• SNPs that are not in protein coding regions may still have consequences for gene splicing, transcription factor binding, or the sequence of non-coding RNA.

• SNPs in humans can affect how humans develop diseases, respond to pathogens, chemicals, drugs, etc.

• SNPs are inherited, change little from generation to generation, and do not change in an individual over time.
  – SNPs are therefore of great value to biomedical research and in developing diagnostic and pharmaceutical products.

9

Bioinformatics representation

• Assumption: a SNP is bi-allelic.
• Major allele
  – The most frequent allele
• Minor allele
  – The other one
• Example
  – Given DNA sequence
    • Major allele (A) - 67%
    • Minor allele (C) - 33%
• Encoding
  – Major allele: 0
  – Minor allele: 1

A G A T A G T A A T A G A T C G T A A T

A G A T A G T A A T A G A T A G T A A T

A G A T A G T A A T A G A T C G T A A T

Sample 1

Sample 2

Sample 3

A G A T 0 G T A A T A G A T 1 G T A A T



Sample 1

Sample 2

Sample 3

Sequences on a pair of homologous chromosomes
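The 0/1 encoding described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `encode_snp` and the three-sample input are made up for the example, and the sketch assumes the SNP is bi-allelic, as the slide does.

```python
from collections import Counter

def encode_snp(alleles):
    """Encode one SNP column: 0 for the major (most frequent) allele, 1 for the minor."""
    counts = Counter(alleles)
    if len(counts) > 2:
        raise ValueError("bi-allelic SNP expected")
    # most_common(1) returns the most frequent allele; on a tie the first one seen wins
    major = counts.most_common(1)[0][0]
    return [0 if allele == major else 1 for allele in alleles]

# Hypothetical column: A in two of three samples (67%), C in one (33%)
print(encode_snp(["A", "C", "A"]))  # [0, 1, 0]
```

In practice the major allele would be determined from the whole population sample rather than from three sequences.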

10

Haplotypes and Genotypes

• Diploid organisms: cells have two homologous sets of chromosomes.

• Haplotype: description of SNP alleles on a single chromosome

  – 0/1 vector, e.g., 00110101 (here, 0 is for major, 1 is for minor allele).

• Genotype: combined description of SNP alleles on pairs of homologous chromosomes

  – 0/1/2 vector, e.g., 01122110 (0=0+0, 1=1+1, 2=0+1 or 1+0)

  – Each genotype with k 2’s (heterozygotes) can be explained by 2^(k-1) pairs of haplotypes

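Example (figure): in the sequence A G A T A G T A A T A C A T G G T A A A, only the SNP positions are encoded; the other nucleotides are ignored.

Haplotype 1: 0 1 0 1 1
Haplotype 2: 1 0 0 1 0   (0 = major allele, 1 = minor allele)
Genotype:    2 2 0 1 2   (2 = heterozygous; 0 or 1 = homozygous)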
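A small sketch of the relationship between haplotypes and genotypes, using the slide's encoding (0 = 0+0, 1 = 1+1, 2 = heterozygous). The function names are illustrative, not taken from any library.

```python
from itertools import product

def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (2 = heterozygous site)."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]

def compatible_pairs(genotype):
    """List every unordered haplotype pair that explains the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix allele 0 at the first heterozygous site of h1 so each pair is listed once
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

genotype = merge_haplotypes([0, 1, 0, 1, 1], [1, 0, 0, 1, 0])
print(genotype)                         # [2, 2, 0, 1, 2]
print(len(compatible_pairs(genotype)))  # 4, i.e. 2**(3-1) pairs for k = 3
```

The exponential growth in the number of candidate pairs is what makes haplotype inference from genotypes a non-trivial computational problem.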

11

SNP databases

• HapMap project (www.hapmap.org)
  – The aim of the project is to record the significant SNPs.
  – Started in October 2002.
  – Phase 1 data have been published, and analysis of Phase 2 data is underway as of October 2006.
• dbSNP
  – A database of SNPs and short deletion and insertion polymorphisms at NCBI.
• CGAP
  – Genetic variation in genes important in cancer (at the National Cancer Institute).
• EnsEMBL
  – Joint project between EMBL-EBI and the Sanger Centre to develop a system which produces and maintains automatic annotation on eukaryotic genomes.
• The SNP Consortium
  – Information about up to 300,000 SNPs.
• Many more…

http://www.hapmap.org/

12

Linkage Disequilibrium (LD)

• LD measures the correlation between two SNPs.
  – Some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies.
  – Non-random associations between genes at different loci are measured by the degree of linkage disequilibrium (D).
  – Consider the two-locus case (i.e. two SNPs)
    • SNP1 has alleles A, a
    • SNP2 has alleles B, b
  – When the two loci are independent, the expected frequency of haplotype AB is pAB = pA pB
  – LD measure: D = pAB − pA pB

Haplotype frequencies:

        A      a
B       pAB    paB
b       pAb    pab

Allele frequencies:

A: pA = pAB + pAb
a: pa = paB + pab
B: pB = pAB + paB
b: pb = pAb + pab

13

LD measures

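• D
  D = pAB − pA pB, so pAB = pA pB + D
  pAb = pA − pAB = pA − pA pB − D = pA(1 − pB) − D = pA pb − D

          A                  a                  Total
B         pAB = pA pB + D    paB = pa pB − D    pB
b         pAb = pA pb − D    pab = pa pb + D    pb
Total     pA                 pa                 1

• D’ = D/Dmax, where Dmax is the largest magnitude D can attain given the allele frequencies: min(pA pb, pa pB) when D > 0 and min(pA pB, pa pb) when D < 0

• r² = D²/(pA pa pB pb)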
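The three LD measures can be computed directly from the four haplotype frequencies. The sketch below assumes both SNPs are polymorphic (no allele frequency is 0) and reports the magnitude of D’; the function name and the example frequencies are illustrative.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, |D'|, r^2) for two bi-allelic SNPs given haplotype frequencies."""
    # Allele frequencies are the margins of the 2x2 haplotype table
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    D = p_AB - p_A * p_B
    # Largest |D| compatible with the allele frequencies
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = abs(D) / d_max if d_max > 0 else 0.0
    r2 = D * D / (p_A * p_a * p_B * p_b)
    return D, d_prime, r2

# Hypothetical haplotype frequencies with moderate LD
D, d_prime, r2 = ld_measures(0.4, 0.1, 0.1, 0.4)
print(round(D, 4), round(d_prime, 4), round(r2, 4))  # 0.15 0.6 0.36
```

When the two SNPs are independent, D and r² are both 0; when only two of the four haplotypes exist, |D’| and r² reach 1.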

14

Types of Diseases

• Monogenic diseases: rarer (<0.1%)
  – Mutated gene is entirely responsible for the disease
  – Easy to locate the disease gene using LD-based association studies
• Complex diseases (more common)
  – Interaction of multiple genes in a complicated fashion
    • One mutation does not cause disease
    • Hard to analyze: a single SNP may show weak association
    • A specific combination may show strong association, but what combination?
  – Multiple independent causes
    • There are different causes, and each of these causes can be the result of interaction of several genes
    • Each cause explains a certain percentage of cases

Monogenic & Complex Diseases

15

Tag SNP selection

16

Tag SNP

• SNPs are inherited from one generation to another in blocks.

• Each block contains a few common haplotypes and the SNPs in the block are in LD.

• Because of LD, each block contains a minimal informative set of SNPs that can represent the rest of the SNPs with high accuracy and can also identify all the haplotypes of the block.
  – These are the tag SNPs.

• Study of genetic factors for complex diseases
  – Several genes contribute together to the disease.
  – Need to study a relatively large number of SNPs.

• Also need a bigger sample size of individuals.

17

Tag SNP problem definition

• Genotyping a large number of SNPs is cost-prohibitive.
  – It is essential to choose a set of SNPs to be genotyped such that this set predicts the rest of the SNPs (not typed) with high accuracy.
  – This set of SNPs is called the tag SNPs.

• Tag SNP selection deals with finding a set of tag SNPs of minimum size that has very good prediction ability for the rest of the SNPs.

18

LD based tag SNP selection

• Greedy algorithm to identify subsets of tag SNPs for genotyping.
• Start with all SNPs above a minor allele frequency (MAF) threshold and calculate pairwise LD.
• Select the SNP that exceeds an LD threshold with the maximum number of other sites.
  – This maximally informative SNP and all associated SNPs are grouped as a bin of associated sites.
• All pairwise LD values within the bin are re-evaluated, and any SNP exceeding the threshold LD with all other sites in the bin is specified as a tag SNP for the bin.
• Repeat the binning process, analyzing all as-yet-unbinned SNPs at each round, until all sites exceeding the MAF threshold are binned.
• If a SNP does not exceed the LD threshold with any other SNP in the region, it is placed in a singleton bin.
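The binning steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the SNPs have already been filtered by MAF, that LD is given as a symmetric matrix of pairwise r² values, and that "exceeds the threshold" means r² ≥ threshold. The default threshold of 0.8 and the toy matrix are assumptions made for the example.

```python
def greedy_ld_bins(r2, threshold=0.8):
    """Greedy LD binning. r2[i][j] is the pairwise LD between SNPs i and j."""
    unbinned = set(range(len(r2)))
    bins = []
    while unbinned:
        def partners(s):
            # Unbinned SNPs in LD with s at or above the threshold
            return [t for t in unbinned if t != s and r2[s][t] >= threshold]
        # Seed: the SNP in LD with the largest number of unbinned SNPs
        seed = max(sorted(unbinned), key=lambda s: len(partners(s)))
        members = sorted([seed] + partners(seed))
        # A tag SNP must reach the threshold with every other member of its bin
        tags = [s for s in members
                if all(r2[s][t] >= threshold for t in members if t != s)]
        bins.append({"members": members, "tags": tags})
        unbinned -= set(members)
    return bins

# Hypothetical r^2 matrix for four SNPs
r2 = [[1.0, 0.9, 0.85, 0.1],
      [0.9, 1.0, 0.5, 0.2],
      [0.85, 0.5, 1.0, 0.1],
      [0.1, 0.2, 0.1, 1.0]]
for b in greedy_ld_bins(r2):
    print(b)
# {'members': [0, 1, 2], 'tags': [0]}
# {'members': [3], 'tags': [3]}
```

SNP 3 is not in LD with any other site, so it ends up in a singleton bin and tags itself; genotyping SNPs 0 and 3 would cover all four sites in this toy example.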

19

Tag SNP using feature selection

• Given an N x M matrix
  – N haploid sequences
  – M SNPs
• Each SNP is a feature.
• Select the minimum set of features to classify all haplotypes accurately.
• r² = (pAB pab − pAb paB)² / (pA pa pB pb)
• FSFS selects the most informative set of SNPs by first grouping them into homogeneous subsets and then choosing a representative SNP from each group.
• Designed only for haplotypes

Phuong T. M., Lin Z., Altman R. B. Choosing SNPs Using Feature Selection. Proc IEEE Comput Syst Bioinform Conf. 2005; 301-9.

20

Feature selection algorithm

• Let the set of all SNPs be S = {F1, F2, ..., FN}.
• D(Fi, Fj) represents the dissimilarity between two SNPs Fi and Fj and is calculated using r².
• R represents the final set of SNPs chosen as the tag SNPs.
• FSFS takes as input S and K (the number of nearest neighbors of a SNP to consider).
• During each iteration, FSFS calculates the distance D(i,k) between each SNP F(i) in R and its kth nearest neighboring SNP.
• The algorithm then finds the SNP F0 for which D(0,k) is minimum, retains this SNP in R and removes its K nearest SNPs from R.
  – Thus the algorithm always discards SNPs from the most compact cluster, causing the minimum information loss.
• FSFS gradually decreases K and re-computes D(0,k) until D(0,k) is less than or equal to a threshold.

Phuong T. M., Lin Z., Altman R. B. Choosing SNPs Using Feature Selection. Proc IEEE Comput Syst Bioinform Conf. 2005; 301-9.

21

A Regression based method

• Uses Multivariate Linear Regression (MLR)
• SNP value prediction
  – An (n+1) x (k+1) matrix M: its rows are the n sample individuals plus the individual x; its columns are the k tag SNPs (assumed already known for prediction purposes) plus a single non-tag SNP s, whose value the tag SNPs will predict.
  – All SNP values in M are known except the value of s in x.
  – In the case of haplotypes, there are only two possible resolutions of s: s0 (for SNP value 0) and s1 (for SNP value 1).
  – For genotypes, there are 3 possible resolutions: s0 (SNP value 0), s1 (SNP value 1), and s2 (SNP value 2).
  – The SNP prediction method should predict the correct resolution of s.

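[Figure: the matrix M. Rows are the n samples plus the individual x; columns are the k tag SNPs plus the non-tag SNP s. Every entry is a known SNP value except s for individual x.]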

He J. and Zelikovsky A. Tag SNP Selection Based on Multivariate Linear Regression. Proc. of Intl Conf on Computational Science (ICCS 2006), May 2006, LNCS 3992, pp. 750-757.

22

MLR

• The tag SNPs in T are vectors in (n+1)-dimensional Euclidean space.

• Project the vectors s0, s1 and s2 onto the span of the set of tag SNPs.

• The most probable resolution of s should be the one closest to the span of T.

A Greedy Algorithm
1. Start by selecting the best tag t0, the one that alone predicts all other SNPs with minimum prediction error.
2. In each iteration, continue to add tags to the set T such that T best predicts the remaining SNPs.

He J. and Zelikovsky A. Tag SNP Selection Based on Multivariate Linear Regression. Proc. of Intl Conf on Computational Science (ICCS 2006), May 2006, LNCS 3992, pp. 750-757.

23

Other methods

• Entropy based methods
• Support vector machines
• Bayesian methods
• Principal component analysis

He J., Zhang J., Altun G., Zelikovsky A., Zhang Y. Haplotype tagging using support vector machines. 2006 IEEE International Conference on Granular Computing, pp. 758-761.

Zhang K., Qin Z. S., Liu J. S., Chen T., Waterman M. S., Sun F. Haplotype Block Partitioning and Tag SNP Selection Using Genotype Data and Their Applications to Association Studies. Genome Research 14:908-916, 2004.

Lin Z., Altman R. B. Finding haplotype tagging SNPs by use of principal components analysis. Am J Hum Genet. 2004 Nov;75(5):850-61.

Hampe J., Schreiber S., Krawczak M. Entropy-based SNP selection for genetic association studies. Hum Genet. 2003; 114:36-43.

24

Haplotype analysis

25

Haplotype Estimation

• Each individual has two “copies” of each chromosome.
• At each site, each chromosome has one of two alleles (states), denoted by 0 and 1 (0 = major allele, 1 = minor allele).

Two haplotypes per individual:
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0

Merge the haplotypes to obtain the genotype for the individual:
2 1 2 1 0 0 1 2 0

HapMap Project
• NIH-led project ($100M) to find common haplotypes in the human population.
• Haplotyping individuals is expensive.

26

Haplotyping issues

• Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect.

• Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes.

Genotype: 2 0 1 2

Possible valid haplotype pairs:
0 0 1 0 and 1 0 1 1
1 0 1 0 and 0 0 1 1

Each genotype with k 2’s (heterozygotes) can be explained by 2^k haplotypes, i.e. 2^(k-1) haplotype pairs.

27

Need for haplotype inference

• Why do we want to determine haplotypes for individuals at tightly linked SNP loci?
  – Haplotypes are more powerful discriminators between cases and controls in disease association studies.
  – With haplotypes we can conduct evolutionary studies.
  – Use of haplotypes in disease association studies reduces the number of tests to be carried out, and hence the penalty for multiple testing.
• Two aspects of the problem
  – Estimate the frequencies of all possible haplotypes in the population.
  – Infer the haplotypes of all individuals in the given sample.

28

Clark’s method

• Haplotype inference method proposed by A. Clark in 1990.
• With a reasonable sample size, we expect to have some individuals homozygous at every locus, e.g. 1-0-1, or heterozygous at just one locus, e.g. 1-0-2.
  – In the first case, the haplotype (1-0-1) is identified unambiguously.
  – In the second case, two haplotypes (1-0-0 and 1-0-1) are present in the population.
  – The algorithm begins by finding all homozygotes and single-SNP heterozygotes and tallying the resulting known haplotypes.
• For each known haplotype, check whether it can be made from some combination of the ambiguous sites of an unresolved case.
  – 1-0-1 is known, so resolve 2-0-2 as (1-0-1) + (0-0-0).
• This chain of inferences is continued until either all haplotypes have been recovered or no more new haplotypes can be found in this way.
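The inference chain can be sketched in Python as below. This is a simplified illustration of the idea on this slide, with genotypes as 0/1/2 tuples (2 = heterozygous); the function name is made up, and the result can depend on the order in which genotypes and known haplotypes are examined.

```python
def clark_inference(genotypes):
    """Resolve 0/1/2 genotypes into haplotype pairs by Clark-style inference."""
    known = []      # haplotypes in order of discovery
    resolved = {}   # genotype -> (haplotype, haplotype)

    def add(h):
        if h not in known:
            known.append(h)

    # Seed with genotypes that have at most one heterozygous site
    for g in genotypes:
        if sum(1 for v in g if v == 2) <= 1:
            h1 = tuple(0 if v == 2 else v for v in g)
            h2 = tuple(1 if v == 2 else v for v in g)
            resolved[g] = (h1, h2)
            add(h1)
            add(h2)

    # Repeatedly explain unresolved genotypes with a known haplotype
    progress = True
    while progress:
        progress = False
        for g in genotypes:
            if g in resolved:
                continue
            for h in known:
                # h fits g if it agrees with g at every homozygous site
                if all(v == 2 or v == a for v, a in zip(g, h)):
                    mate = tuple(1 - a if v == 2 else v for v, a in zip(g, h))
                    resolved[g] = (h, mate)
                    add(mate)
                    progress = True
                    break
    return resolved, known

resolved, known = clark_inference([(1, 0, 1), (1, 0, 2), (2, 0, 2)])
print(resolved[(2, 0, 2)])  # ((1, 0, 1), (0, 0, 0))
print(known)                # [(1, 0, 1), (1, 0, 0), (0, 0, 0)]
```

Genotypes that are compatible with no known haplotype stay unresolved, which is the "no more new haplotypes can be found" stopping case.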

29

Hardy Weinberg Equilibrium

• Consider a SNP with two alleles A, a
  – 3 possible genotypes: A/A, A/a and a/a.
  – pA, pa are the individual allele frequencies.

• HWE assumes that a child inherits the two alleles independently from its parents.

• A population in which A/A occurs with probability pA², A/a with 2 pA pa and a/a with pa² is said to be in HWE.
  – Under a certain set of assumptions, such as infinite population size and random mating, the genotype frequencies stabilize.

30

Maximum Likelihood Estimation

• Given a SNP with alleles M, m.
• Possible genotypes are M/M, M/m, m/m.
• What is the probability of seeing a M/M’s, b M/m’s and c m/m’s?
• According to HWE, the probability that any one particular individual selected is MM, Mm or mm is pM², 2 pM pm, pm², so with N = a + b + c:

$$P(a, b, c;\, p_M, p_m) = \binom{N}{a,\, b,\, c}\,(p_M^2)^a\,(2 p_M p_m)^b\,(p_m^2)^c$$

• Taking log, differentiating and setting to 0 gives the maximum likelihood estimates

• pM = (2a+b)/2N, pm = (2c+b)/2N
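As a quick numerical check of the estimates pM = (2a+b)/2N and pm = (2c+b)/2N, the sketch below computes them for hypothetical genotype counts and confirms that the log-likelihood (with the constant multinomial coefficient dropped) is lower at nearby values of pM. The function names and counts are illustrative.

```python
from math import log

def hwe_mle(a, b, c):
    """Maximum likelihood allele frequencies from counts of M/M, M/m, m/m."""
    n = a + b + c
    p_M = (2 * a + b) / (2 * n)
    p_m = (2 * c + b) / (2 * n)
    return p_M, p_m

def log_likelihood(p_M, a, b, c):
    """Log-likelihood under HWE, omitting the constant multinomial coefficient."""
    p_m = 1 - p_M
    return a * log(p_M ** 2) + b * log(2 * p_M * p_m) + c * log(p_m ** 2)

# Hypothetical counts: 30 M/M, 73 M/m, 26 m/m
a, b, c = 30, 73, 26
p_M, p_m = hwe_mle(a, b, c)
print(round(p_M, 4), round(p_m, 4))  # 0.5155 0.4845
print(log_likelihood(p_M, a, b, c) > log_likelihood(p_M + 0.01, a, b, c))  # True
print(log_likelihood(p_M, a, b, c) > log_likelihood(p_M - 0.01, a, b, c))  # True
```

The estimate is simply allele counting: each M/M individual carries two M alleles and each M/m individual carries one, out of 2N alleles in total.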

31

Expectation Maximization (EM)

θ = parameters used to calculate the missing data
Data (D) = available data + missing data

• E-step
  – The missing data is calculated using θ. Together with the available data, this forms the complete data (D).

• M-step
  – θ’ = the maximum likelihood estimates of θ, recalculated from D. Repeat the E-step with θ = θ’.

32

Using EM

• Consider a 2-locus case
  – Bi-allelic loci
• So the possible haplotypes are
  – AB, Ab, aB, ab.
• We are given observed counts of each possible genotype
  – 9 possible genotypes
  – AABB, AABb, AAbb, AaBB, …
• Observe that only genotype AaBb is ambiguous: it can arise from two different haplotype pairs (AB/ab or Ab/aB)

         BB    Bb    bb    Total
AA       10    15     5     30
Aa       10    50    13     73
aa        3    13    10     26
Total    23    78    28    129

x = fraction of AaBb genotypes that are AB/ab; 1−x = fraction that are Ab/aB

33

Using EM

Parameters = pAB, pAb, paB, pab (haplotype frequencies)
• Calculate pAB etc. from the given genotype counts.
  – The allele frequencies are
    • pA = (30+73/2)/129 = 0.5155
    • pa = (26+73/2)/129 = 0.4845
    • pB = (23+78/2)/129 = 0.4806
    • pb = (28+78/2)/129 = 0.5194
  – The haplotype frequencies are
    • pAB = [2(10)+15+10+50x]/[129(2)]
    • pAb = [15+2(5)+13+50(1−x)]/[129(2)]
    • paB = [10+2(3)+13+50(1−x)]/[129(2)]
    • pab = [13+13+2(10)+50x]/[129(2)]

• The problem is to estimate the 4 haplotype frequencies despite not knowing the value of x (our missing data).

34

E-step

• E-step: obtain some initial values for the haplotype frequencies
  – Assume initially that each haplotype frequency is simply the product of the respective allele frequencies.
    • p0AB = (0.5155)(0.4806)
    • p0Ab = (0.5155)(0.5194)
    • p0aB = (0.4845)(0.4806)
    • p0ab = (0.4845)(0.5194)
  – The ‘expected’ value of x (the fraction of AaBb individuals carrying the haplotype pair AB/ab) given these haplotype frequencies is

$$x_0 = \frac{p^0_{AB}\,p^0_{ab}}{p^0_{AB}\,p^0_{ab} + p^0_{Ab}\,p^0_{aB}} = 0.5$$

35

M-step

• M-step: maximize the parameters (haplotype frequencies) using x0 calculated at the E-step.
  – Substitute x0 = 0.5 into the haplotype frequencies.
    • p1AB = [2(10)+15+10+50x0]/[129(2)] = 0.27131
    • p1Ab = [15+2(5)+13+50(1−x0)]/[129(2)] = 0.24418
    • p1aB = [10+2(3)+13+50(1−x0)]/[129(2)] = 0.20930
    • p1ab = [13+13+2(10)+50x0]/[129(2)] = 0.27519

• Repeat the E-step and M-step until the haplotype frequencies do not change much.
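The whole E-step/M-step loop for the two-locus example fits in a short function. The sketch below is one way to write it, not the only one: the haplotype counts contributed by unambiguous genotypes are passed in directly (45 AB, 38 Ab, 29 aB, 46 ab for the genotype table used in this example), together with the 50 double heterozygotes; the function name, tolerance and iteration cap are assumptions.

```python
def em_two_locus(n_AB, n_Ab, n_aB, n_ab, n_double_het, tol=1e-10, max_iter=1000):
    """EM estimate of two-locus haplotype frequencies.

    n_AB .. n_ab: haplotype counts from unambiguous genotypes;
    n_double_het: number of AaBb individuals, whose phase is unknown.
    """
    total = n_AB + n_Ab + n_aB + n_ab + 2 * n_double_het  # 2N chromosomes
    # Start from independence: haplotype frequency = product of allele frequencies
    p_A = (n_AB + n_Ab + n_double_het) / total
    p_B = (n_AB + n_aB + n_double_het) / total
    p = {"AB": p_A * p_B, "Ab": p_A * (1 - p_B),
         "aB": (1 - p_A) * p_B, "ab": (1 - p_A) * (1 - p_B)}
    history = []
    for _ in range(max_iter):
        # E-step: expected fraction of AaBb individuals that are AB/ab
        x = p["AB"] * p["ab"] / (p["AB"] * p["ab"] + p["Ab"] * p["aB"])
        # M-step: re-estimate haplotype frequencies from the completed counts
        new = {"AB": (n_AB + n_double_het * x) / total,
               "Ab": (n_Ab + n_double_het * (1 - x)) / total,
               "aB": (n_aB + n_double_het * (1 - x)) / total,
               "ab": (n_ab + n_double_het * x) / total}
        history.append((x, new))
        converged = max(abs(new[h] - p[h]) for h in p) < tol
        p = new
        if converged:
            break
    return p, history

freqs, history = em_two_locus(45, 38, 29, 46, 50)
x0, first = history[0]
print(round(x0, 4), round(first["AB"], 4))  # 0.5 0.2713
print({h: round(f, 4) for h, f in freqs.items()})
```

The first iteration gives x0 = 0.5 and pAB = 70/258, matching the M-step values on the slide; later iterations move x away from 0.5 as the estimated LD between the two loci builds up.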

36

Other methods

• Bayesian methods
• Combinatorial methods
• Dynamic programming

Zhang K., Qin Z. S., Liu J. S., Chen T., Waterman M. S., Sun F. Haplotype Block Partitioning and Tag SNP Selection Using Genotype Data and Their Applications to Association Studies. Genome Research 14:908-916, 2004.

Bafna V., Gusfield D., Lancia G., Yooseph S. Haplotyping as perfect phylogeny: A direct approach. Technical report, UC Davis, Department of Computer Science, 2002.

Xing et al. Bayesian Haplotype Inference via the Dirichlet Process. Proceedings of the Second RECOMB Satellite Workshop on Computational Methods for SNP and Haplotypes, pp. 99-112.

Zhao J., Boerwinkle E., Xiong M. An Entropy-Based Statistic for Genomewide Association Studies. Am J Hum Genet. 2005 July; 77(1): 27–40.

37

SNP-disease association study

38

Support Vector Machines

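• Given a training set of instance-label pairs (xi, yi), i = 1, ..., L, where xi ∈ R^n and y ∈ {1, −1}^L, the support vector machine (SVM) seeks the solution to the following optimization problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\, w^T w + C \sum_{i=1}^{L} \xi_i \quad \text{subject to} \quad y_i\,(w^T \Phi(x_i) + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$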

• Training vectors xi are mapped into a higher dimensional space by the function Φ.

• SVM finds a linear separating hyper-plane with the maximal margin in this higher dimensional space.

• C > 0 is the penalty parameter of the error term.

39

Support Vector Machines

• SVM for binary classification. The margin w to be maximized separates the hyperplane (shown with a dotted line) from the two classes of data.

40

• Multiple myeloma (a type of cancer) is studied.
• The data set consists of genotypes at 3000 SNPs for 80 patients; the SNPs are selected so that they are evenly spaced about 1 Mb apart, giving good overall coverage of the human genome.
• Each heterozygous SNP is coded as 0; one homozygous genotype is arbitrarily coded as +1 and the other as -1.
• Entropy-based feature selection
  – Select the most informative top 10% of SNPs from the set of 3000 SNPs.
  – The entropy of a data set is given by - p log2(p) - (1 - p) log2(1 - p), where p is the fraction of examples that belong to the class “predisposed”.
  – The information gain of a split is given by the entropy of the original data set minus the weighted sum of the entropies of the two data sets resulting from the split, where these entropies are weighted by the fraction of data points in each set.
  – The SNP features are ranked by information gain, and the top-scoring 10% of the features are selected.
• Classification of the diseased and control cases using a leave-one-out cross-validation approach yields an overall classification accuracy of 71%, which is significantly better than chance (50%).

Waddell M., Page D., Zhan F., Barlogie B., Shaughnessy J. Jr. Predicting Cancer Susceptibility from Single-Nucleotide Polymorphism Data: A Case Study in Multiple Myeloma. Proceedings of BIOKDD '05, Chicago, Illinois, August 2005.
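The entropy-based ranking can be illustrated with a short sketch. It is not the authors' code: the function names and the toy data are made up, and the split is taken over every distinct coded value of a SNP (-1, 0, +1), which generalizes the two-way split described above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(snp_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the SNP."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(snp_values):
        subset = [y for x, y in zip(snp_values, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def top_snps(genotypes, labels, fraction=0.10):
    """Indices of the top-scoring fraction of SNPs, ranked by information gain."""
    m = len(genotypes[0])
    gains = [information_gain([row[j] for row in genotypes], labels) for j in range(m)]
    keep = max(1, round(fraction * m))
    return sorted(range(m), key=lambda j: gains[j], reverse=True)[:keep]

# Toy data: 4 individuals x 2 SNPs; SNP 0 separates the classes, SNP 1 does not
genotypes = [[1, 0], [1, 1], [-1, 0], [-1, 1]]
labels = ["predisposed", "predisposed", "control", "control"]
print(information_gain([row[0] for row in genotypes], labels))  # 1.0
print(information_gain([row[1] for row in genotypes], labels))  # 0.0
print(top_snps(genotypes, labels, fraction=0.5))                # [0]
```

With leave-one-out cross-validation, the ranking would normally be recomputed inside each fold so that the held-out individual does not influence which SNPs are selected.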

41

Case/Control study

A Combinatorial approach

Given: A population of n genotypes, each containing values of m SNPs and a disease status.

Genotypes (0: homozygous major allele, 1: homozygous minor allele, 2: heterozygous):

SNP:  1 2 3 4 5 6 7 8 9 10   Disease status
      0 1 0 1 2 0 1 0 2 0    1
      2 0 1 1 0 2 1 0 1 2    1
      1 2 0 0 1 2 2 2 1 0    1
      1 1 0 1 2 0 2 0 2 0    2
      2 0 1 2 0 0 1 0 1 2    2
      2 0 0 0 2 0 2 1 0 0    2
      2 1 0 1 1 0 0 0 2 1    2

The figure groups the rows into healthy genotypes (control) and diseased genotypes (case).

Disease association analysis searches for risk (resistance) factors whose frequency among case (control) individuals is considerably higher than among control (case) individuals.

42

Multi-SNP extension

• Snp(C): subset of the given SNPs.
• MSC(C): a specific value of Snp(C).
• Cluster(C): subset of individuals that coincides with {Snp(C), MSC(C)} in the given genotype data.

Multi-SNP Combination (MSC)[1,2]

1 2 3 4 5 6 7 8 9   status
0 1 1 0 1 2 0 0 2   control
0 1 1 0 1 2 1 0 2   case
0 1 1 1 0 2 0 0 1   case
0 0 1 0 0 0 0 2 1   case
0 1 1 1 1 2 0 0 1   case
0 0 1 0 1 2 1 0 2   control
0 1 0 0 1 1 0 0 2   control

Individuals (rows) are numbered 1-7 in the figure.

MSC(C) = x x 1 x x 2 x x x, present in 4 cases : 1 control
Snp(C) = (3,6)
D(C) = (1,2,4)
H(C) = (5,7)
C = (1,2,4,5,7)

How significant is this cluster?

[1] Combinatorial Search Methods for Multi-SNP Disease Association. Brinza et al., 2006.

[2] Combinatorial Methods for Disease Association Search and Susceptibility Prediction. Brinza et al., 2006.

43

P-value of MSC[1,2]

• Measured p-value
  – Probability that the diseased/healthy distribution among those exposed to the risk factor happened by chance
  – Computed using the binomial distribution

• Searching for risk factors among many SNPs requires multiple-testing adjustment of the p-value
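One plausible way to compute such a p-value is sketched below. The exact test used in [1,2] is not spelled out on this slide, so the details are assumptions: a one-sided upper-tail binomial test, with the null probability of a case set to the overall fraction of cases in the sample, followed by a simple Bonferroni adjustment. Function names and the example counts are illustrative.

```python
from math import comb

def msc_p_value(cases_in_cluster, controls_in_cluster, total_cases, total_controls):
    """Probability of seeing at least this many cases in the cluster by chance."""
    n = cases_in_cluster + controls_in_cluster
    q = total_cases / (total_cases + total_controls)  # chance that a member is a case
    return sum(comb(n, k) * q ** k * (1 - q) ** (n - k)
               for k in range(cases_in_cluster, n + 1))

def bonferroni(p_value, num_tests):
    """Multiple-testing adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * num_tests)

# Hypothetical cluster with 4 cases and 1 control, in a sample of 4 cases and 3 controls
p = msc_p_value(4, 1, 4, 3)
print(round(p, 4))                 # 0.2894
print(round(bonferroni(p, 3), 4))  # 0.8682
```

With so few individuals nothing reaches significance, which illustrates why the multiple-testing adjustment matters once thousands of SNP combinations are examined.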



44

Disease Association problem formulation

Given: Case/control study data consisting of n genotypes, each containing values of m SNPs and a disease status

Find: All Risk/Resistance factors (MSCs) with p-value below 0.05

45

Searching Approaches

Exhaustive search (ES)[1,2]

• Computationally infeasible: exponential number of combinations
• Searching for 3-SNP MSCs on a sample with n genotypes and m SNPs requires O(n m³) time
• Cluster C: subset of genotypes which share the same MSC
• The case-closure of an MSC C is an MSC C’, with the maximum number of SNPs with fixed values, whose cluster consists of the same set of cases and the minimum number of controls.
• Efficient way of finding the case-closure: extend the MSC with those SNPs that have common values in all cases.

Example:

0 1 1 0 1 2 0 1 2   control
0 1 1 0 1 2 1 0 2   case
2 0 1 1 0 2 0 0 2   case
0 0 1 0 0 0 0 2 1   case
0 1 1 0 1 2 0 0 2   control

MSC  = x x 1 x x 2 x x x   present in 2 cases : 2 controls
MSC’ = x x 1 x x 2 x 0 x   (case-closure) present in 2 cases : 1 control
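The case-closure step can be sketched as follows, using the five example genotypes from this slide. An MSC is represented here as a dictionary from 0-based SNP index to required value; that representation and the function names are assumptions made for illustration.

```python
def cluster(rows, msc):
    """Individuals whose genotype matches every fixed SNP value of the MSC."""
    return [(g, status) for g, status in rows
            if all(g[i] == v for i, v in msc.items())]

def case_closure(rows, msc):
    """Extend the MSC with every SNP on which all cases in its cluster agree."""
    cases = [g for g, status in cluster(rows, msc) if status == "case"]
    closed = dict(msc)
    if not cases:
        return closed
    for i in range(len(cases[0])):
        values = {g[i] for g in cases}
        if len(values) == 1:
            closed[i] = values.pop()
    return closed

rows = [((0, 1, 1, 0, 1, 2, 0, 1, 2), "control"),
        ((0, 1, 1, 0, 1, 2, 1, 0, 2), "case"),
        ((2, 0, 1, 1, 0, 2, 0, 0, 2), "case"),
        ((0, 0, 1, 0, 0, 0, 0, 2, 1), "case"),
        ((0, 1, 1, 0, 1, 2, 0, 0, 2), "control")]

msc = {2: 1, 5: 2}  # SNP 3 = 1 and SNP 6 = 2, in 1-based numbering
print([s for _, s in cluster(rows, msc)])     # ['control', 'case', 'case', 'control']
closed = case_closure(rows, msc)
print(closed)                                 # {2: 1, 5: 2, 7: 0, 8: 2}
print([s for _, s in cluster(rows, closed)])  # ['case', 'case', 'control']
```

On these rows the closure fixes SNP 8 = 0, which drops one control from the cluster, and also SNP 9 = 2, since both cases share that value; the cluster keeps the same 2 cases and 1 control either way.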



46

Combinatorial Search

• Combinatorial Search Method (CS)

  – Searches only among case-closed MSCs
  – Avoids checking clusters with a small number of cases
  – Finds significant MSCs faster than ES
  – Still too slow for large data
  – Further speedup by reducing the number of SNPs

• Indexing: compress S by extracting the most informative SNPs
  – Tag SNP selection
  – Apply ES/CS on the selected tag SNPs

Combinatorial search (CS)[1,2]



47

Discussion

• Other approaches: neural networks, hidden Markov models, interaction information, linkage analysis, etc.

• In general, machine learning methods tend to do better than purely combinatorial methods and are also applicable to bigger data sets with hundreds of SNPs.
  – Scalability

• Identifying SNPs in disease association studies is more difficult: it largely depends on the population under study and often faces the problem of replication.
