
BNFO 602 Lecture 2

Usman Roshan

Bioinformatics problems

• Sequence alignment: oldest and still actively studied

• Genome-wide association studies: new problem, great potential for personalized medicine and personal genomics

• Phylogenetics: understanding evolutionary histories

Pairwise sequence alignment

• How to align two sequences?

• We use dynamic programming

• Treat DNA sequences as strings over the alphabet {A, C, G, T}

Dynamic programming

• Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S| = m, |T| = n)

• Time and space complexity is O(mn)
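The slides define V(i,j) but the recurrence itself is not in the text. A minimal sketch of one common way to fill the table is given here, assuming a linear gap penalty and simple match/mismatch scores. The function name and the score values (match = 1, mismatch = -1, gap = -2) are illustrative assumptions, not parameters taken from the lecture.

```python
def alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t (scores are assumed values)."""
    m, n = len(s), len(t)
    # V[i][j] = optimal score for aligning s[:i] with t[:j]
    V = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix against the empty string costs one gap per character
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + substitution,  # align s[i-1] with t[j-1]
                V[i - 1][j] + gap,               # s[i-1] aligned to a gap
                V[i][j - 1] + gap,               # t[j-1] aligned to a gap
            )
    return V[m][n]


print(alignment_score("ACGT", "AGT"))  # three matches and one gap
```

The table has (m+1)(n+1) cells and each cell takes constant work, which is where the O(mn) time and space bound comes from.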

How do we pick gap parameters?

Structural alignments

• Recall that proteins have 3-D structure.

Structural alignment - example 1

Alignment of thioredoxins from human and fly, taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals.

PDB IDs are 3TRX and 1XWC.

Structural alignment - example 2

Computer-generated aligned proteins.

Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively.

Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

Structural alignments

• We can produce high-quality manual alignments if the structure is available.

• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Benchmark alignments

• Protein alignment benchmarks

– BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in studies of protein alignment.

– Protein benchmarks are generally large and have been in the research community for some time now.

– BAliBASE 3.0

Biologically realistic scoring matrices

• PAM and BLOSUM are most popular

• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins

• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

PAM

• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j

• Examine a set of closely related sequences which are easy to align (for PAM, 1572 mutations between 71 families)

• Compute probabilities of change and background probabilities by simple counting
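The counting step can be sketched as follows. This is a simplified illustration only: it assumes pairwise alignments given as equal-length strings, counts each aligned column symmetrically, and skips the tree-based counting and the scaling to 1% accepted mutations used in the actual PAM construction. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def estimate_probabilities(aligned_pairs):
    """Estimate background and change probabilities by counting columns."""
    pair_counts = Counter()     # pair_counts[(a, b)]: a observed opposite b
    residue_counts = Counter()  # residue_counts[a]: occurrences of a
    for s, t in aligned_pairs:
        for a, b in zip(s, t):
            if a == "-" or b == "-":
                continue  # ignore gapped columns in this sketch
            # Count both directions, since the direction of change is unknown
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    total = sum(residue_counts.values())
    background = {a: c / total for a, c in residue_counts.items()}
    # M[(a, b)]: estimated probability that a is aligned to b, given a
    M = {(a, b): c / residue_counts[a] for (a, b), c in pair_counts.items()}
    return background, M


# Toy alignments over two amino acids (made-up data, for illustration only)
background, M = estimate_probabilities([("LLI", "LLI"), ("LI", "II")])
print(background["L"], M[("L", "I")], M[("L", "L")])
```

With symmetric counting, the entries M[(a, b)] for a fixed a sum to 1, so each row is a probability distribution.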

Genome wide association studies

Application of SNPs: association with disease

• Experimental design to detect cancer-associated SNPs:

– Pick random humans with and without cancer (say breast cancer)

– Perform SNP genotyping

– Look for associated SNPs

– Also called genome-wide association study

Case-control example

• Study of 100 people:

– Case: 50 subjects with cancer

– Control: 50 subjects without cancer

• Count number of alleles and form a contingency table

          #Allele1   #Allele2
Case      10         90
Control   2          98

Odds ratio

• Odds of allele 1 in cancer = a/b = e

• Odds of allele 1 in healthy = c/d = f

• Odds ratio of allele 1 in cancer vs healthy = e/f

          #Allele1   #Allele2
Cancer    a          b
Healthy   c          d

Risk ratio (Relative risk)

• Probability of allele 1 in cancer = a/(a+b) = e

• Probability of allele 1 in healthy = c/(c+d) = f

• Risk ratio of allele 1 in cancer vs healthy = e/f

            #Allele1   #Allele2
Cancer      a          b
No cancer   c          d

Odds ratio vs Risk ratio

• Risk ratio has a natural interpretation since it is based on probabilities

• In a case-control model we cannot calculate the probability of cancer given the recessive allele, because subjects are chosen based on disease status and not allele type

• Odds ratio shows up in logistic regression models

Example

• Odds of allele 1 in case = 15/35

• Odds of allele 1 in control = 2/48

• Odds ratio of allele 1 in case vs control = (15/35)/(2/48) = 10.3

• Risk of allele 1 in case = 15/50

• Risk of allele 1 in control = 2/50

• Risk ratio of allele 1 in case vs control = (15/50)/(2/50) = 7.5

          #Allele1   #Allele2
Case      15         35
Control   2          48
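The arithmetic in this example can be checked with a short script. This is only a sketch: the function names are illustrative, and a, b, c, d follow the layout of the contingency tables in the slides (case row first, allele 1 column first).

```python
def odds_ratio(a, b, c, d):
    """Odds of allele 1 in cases (a/b) divided by odds in controls (c/d)."""
    return (a / b) / (c / d)


def risk_ratio(a, b, c, d):
    """Proportion of allele 1 in cases divided by proportion in controls."""
    return (a / (a + b)) / (c / (c + d))


# Counts from the example table: Case 15/35, Control 2/48
print(round(odds_ratio(15, 35, 2, 48), 1))  # about 10.3
print(round(risk_ratio(15, 35, 2, 48), 1))  # 7.5
```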

Odds ratios in genome-wide association studies

• Higher odds ratio means stronger association

• Therefore SNPs with highest odds ratios should be used as predictors or risk estimators of disease

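• Odds ratio is generally higher than the risk ratio (when both exceed 1)

• Both are similar when the probabilities involved are small, since the odds p/(1-p) are close to p for small p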

Statistical test of association (P-values)

• P-value = probability of the observed data (or more extreme data) under the null hypothesis

• Example:

– Suppose we are given a series of coin tosses

– We feel that a biased coin produced the tosses

– We can ask the following question: what is the probability that a fair coin produced the tosses?

– If this probability is very small then we can say there is a small chance that a fair coin produced the observed tosses.

– In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin
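For the coin example, the P-value can be computed exactly from the binomial distribution. The sketch below assumes a one-sided test (the coin is suspected of favoring heads) and uses a made-up observation of 9 heads in 10 tosses; the function name is illustrative.

```python
from math import comb


def p_value_at_least(heads, tosses, p=0.5):
    """P(at least `heads` heads in `tosses` tosses) for a coin with P(heads) = p."""
    return sum(
        comb(tosses, k) * p**k * (1 - p) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


# Null hypothesis: fair coin (p = 0.5). Observation: 9 heads in 10 tosses.
print(p_value_at_least(9, 10))  # 11/1024, roughly 0.011
```

A value this small says that a fair coin would rarely produce 9 or more heads in 10 tosses, which is evidence against the null hypothesis.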

Effect of population structure on genome-wide association studies

• Suppose our sample is drawn from a population of two groups, I and II

• Assume that group I has a majority of allele type I and group II has mostly the second allele.

• Further assume that most case subjects belong to group I and most control to group II

• This leads to the false association that the major allele is associated with the disease
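The effect can be illustrated with made-up allele counts. In the sketch below, all numbers are hypothetical: allele 1 has frequency 0.8 in group I and 0.2 in group II for cases and controls alike, so there is no association within either group, but group I contributes mostly cases and group II mostly controls.

```python
def odds_ratio(a, b, c, d):
    """Odds of allele 1 in cases (a/b) divided by odds in controls (c/d)."""
    return (a / b) / (c / d)


# (case allele 1, case allele 2, control allele 1, control allele 2)
group_1 = (128, 32, 32, 8)   # 80 cases, 20 controls, allele 1 frequency 0.8
group_2 = (8, 32, 32, 128)   # 20 cases, 80 controls, allele 1 frequency 0.2

# Pooling the two groups ignores population structure
pooled = tuple(x + y for x, y in zip(group_1, group_2))

print(odds_ratio(*group_1))  # 1.0: no association within group I
print(odds_ratio(*group_2))  # 1.0: no association within group II
print(odds_ratio(*pooled))   # about 4.5: spurious association
```

The pooled odds ratio is far from 1 even though neither group shows any association on its own.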

Effect of population structure on genome-wide association studies

• We can correct this effect if case and control are equally sampled from all sub-populations

• To do this we need to know the population structure

Population structure prediction

• Treated as an unsupervised learning problem (i.e. clustering)
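The slides do not name a clustering method. As one possible illustration, the sketch below runs a basic k-means on genotype vectors, assuming each individual is coded by the number of copies (0, 1, or 2) of one allele at each SNP. The data, the function name, and the choice of starting centers are all hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Basic k-means: returns a cluster label for each point."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        labels = [
            min(
                range(len(centers)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep a center with no members
        centers = new_centers
    return labels


# Six individuals genotyped at four SNPs (made-up data)
genotypes = [
    [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0],
    [2, 2, 1, 2], [2, 1, 2, 2], [1, 2, 2, 2],
]
labels = kmeans(genotypes, centers=[genotypes[0], genotypes[3]])
print(labels)  # the first three individuals fall in one cluster, the last three in the other
```

The resulting labels play the role of predicted sub-populations, which could then be used to balance the sampling of cases and controls.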