Previous Lecture: Descriptive Statistics Complex Normal Skewed Long tails

Previous Lecture: Descriptive Statistics

Complex Normal Skewed Long tails

Introduction to Biostatistics and Bioinformatics

Sequence Alignment Concepts

This Lecture

Sequence Alignment

Stuart M. Brown, Ph.D.
Center for Health Informatics and Bioinformatics

NYU School of Medicine

Slides/images/text/examples borrowed liberally from: Torgeir R. Hvidsten, Michael Schatz, Bill Pearson, Fourie Joubert, others …

Learning Objectives

• Identity, similarity, homology
• Analyze sequence similarity by dotplots
  – window/stringency
• Alignment of text strings by edit distance
• Scoring of aligned amino acids
• Gap penalties
• Global vs. local alignment
• Dynamic Programming (Smith-Waterman)
• FASTA method

Why Compare Sequences?

• Identify sequences found in lab experiments
  – What is this thing I just found?
• Compare new genes to known ones
• Compare genes from different species
  – information about evolution
• Guess functions for entire genomes full of new gene sequences
• Map sequence reads to a Reference Genome (ChIP-seq, RNA-seq, etc.)

Are there other sequences like this one?

1) Huge public databases - GenBank, Swissprot, etc.

2) “Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes” - W. R. Pearson

3) Similarity searching is based on alignment

4) BLAST and FASTA provide rapid similarity searching

a. rapid = approximate (heuristic)

b. produces false positives and false negatives

Similarity ≠ Homology

1) 25% similarity over 100 or more amino acids is strong evidence for homology

2) Homology is an evolutionary statement which means “descent from a common ancestor”
  – common 3D structure
  – usually common function
  – homology is all or nothing; you cannot say "50% homologous"

Similarity is Based on Dot Plots

1) two sequences on vertical and horizontal axes of graph

2) put dots wherever there is a match

3) diagonal line is region of identity (local alignment)

4) apply a window filter - look at a group of bases, must meet % identity to get a dot
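
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `dot_plot` and `render` are made up for this example, and the window is scored by a simple count of identical characters (a stringency of 3 in a window of 4 corresponds to 75% identity). The two sequences are the ones used in the simple dot plot on the following slides.

```python
def dot_plot(seq_x, seq_y, window=1, stringency=1):
    """Return the set of (i, j) positions where the window starting at
    seq_x[i] and seq_y[j] contains at least `stringency` identical characters."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for j, ch in enumerate(seq_y):
        row = " ".join("*" if (i, j) in dots else "." for i in range(len(seq_x)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


x = "GATCAACTGACGTA"
y = "GTTCAGCTGCGTAC"
unfiltered = dot_plot(x, y)                        # a dot for every single-base match
filtered = dot_plot(x, y, window=4, stringency=3)  # 4-base window, 75% identity
print(render(x, y, filtered))
```

With the filter on, isolated chance matches mostly disappear and the main diagonal stands out.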

Simple Dot Plot

G A T C A A C T G A C G T A

G T T C A G C T G C G T A C

Window / Stringency

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM

Window = 12
Stringency = 9

[Figure: the sequence pair is shown three times with Score = 7, Score = 11, and Score = 11, under the label "Filtering"]

Dot plot filtered with 4 base window and 75% identity

G A T C A A C T G A C G T A

G T T C A G C T G C G T A C

Dot plot of real data

[Figure: dot plot of CVJB; both axes are labeled from 20 to 220 in steps of 20]

Window Size = 8
Scoring Matrix: pam250 matrix
Min. % Score = 30
Hash Value = 2

Hemoglobin -chain

Hemoglobin-chain

Dotplot (Window = 130 / Stringency = 9)

Dotplot (Window = 18 / Stringency = 10)

Hemoglobin-chain

Hemoglobin -chain

How to Align Sequences?

• Manually line them up and count?
  – an alignment program can do it for you
  – or just use a text editor
• Dot Plot – shows regions of similarity as diagonals

GATGCCATAGAGCTGTAGTCGTACCCT <--

--> CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC

Percent Sequence Identity

• The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G - A G
A C G T G - G C A G

70% identical (7 of the 10 aligned positions match; the other three are 1 mismatch and 2 indels)
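
As a rough sketch, percent identity for the alignment on this slide can be computed as below. The function name `percent_identity` is made up for this example, and dividing by the number of aligned columns (gaps included) is an assumed convention; other tools divide by the shorter sequence length or by ungapped columns only.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns where both sequences carry the same residue.

    Assumes both strings come from one alignment (equal length, gaps as `gap`).
    The denominator is the total number of columns, which is one of several
    conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have equal length")
    matches = sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
```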

Hamming Distance

• The minimum number of base changes that will convert one (ungapped) sequence into another
• The Hamming distance is named after Richard Hamming, who introduced it in his fundamental paper on Hamming codes: “Error detecting and error correcting codes” (1950) Bell System Technical Journal 29 (2): 147–160.

Python function hamming_distance

def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

Hamming Distance can be unrealistic

v: ATATATAT
w: TATATATA

Hamming Dist = 8 (no gaps, no shifts)

• Levenshtein (1966) introduced edit distance

v = _ATATATAT
w = TATATATA_
edit distance: d(v, w) = 2

Levenshtein, Vladimir I. (February 1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10 (8): 707–710.
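
A minimal sketch of edit distance, assuming unit cost for each insertion, deletion, and substitution. The name `edit_distance` is chosen for this example; it mirrors the `hamming_distance` function from the previous slide, but it allows gaps and only keeps two rows of the table in memory.

```python
def edit_distance(v, w):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn v into w."""
    prev = list(range(len(w) + 1))  # distances from the empty prefix of v
    for i, a in enumerate(v, start=1):
        curr = [i]  # turning v[:i] into "" takes i deletions
        for j, b in enumerate(w, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from v
                curr[j - 1] + 1,         # insert b into v
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ATATATAT", "TATATATA"))
```

For the pair on this slide the result is 2, compared with a Hamming distance of 8.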

Affine Gap Penalties

Gap Penalties

• With unlimited gaps (no penalty), unrelated sequences can align (especially DNA)
• Gap should cost much more than a mismatch
• Multi-base gap should cost only a little bit more than a single base gap
• Adding an additional gap near another gap should cost more (not implemented in most algorithms)
• Score for a gap of length x is: -(p + σx)
  – p is gap open penalty
  – σ is gap extend penalty
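
The gap formula can be tried out directly. In this sketch the function name `gap_score` and the default values p = 10 and σ = 1 are assumptions picked only for illustration; they are not taken from the lecture.

```python
def gap_score(length, gap_open=10, gap_extend=1):
    """Affine gap score -(p + sigma * x) for a gap of x positions.

    gap_open (p) and gap_extend (sigma) are illustrative defaults only.
    """
    if length <= 0:
        return 0  # no gap, no penalty
    return -(gap_open + gap_extend * length)


print(gap_score(1))      # one single-base gap
print(gap_score(5))      # one five-base gap
print(2 * gap_score(1))  # two separate single-base gaps
```

One long gap costs only a little more than a short one, while opening a second gap pays the open penalty again.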

Global vs Local similarity

1) Global similarity uses complete aligned sequences
  – total % matches
  – Needleman & Wunsch algorithm

2) Local similarity looks for best internal matching region between 2 sequences
  – find a diagonal region on the dotplot
  – Smith-Waterman algorithm
  – BLAST and FASTA

3) dynamic programming – optimal computer solution, not approximate
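
To make the global/local distinction concrete, here is a score-only sketch that runs the same dynamic programming table in either mode. The function name `alignment_score`, the match/mismatch values, and the linear gap cost are all assumptions for this example. In global mode the first row and column are charged for gaps and the answer is the bottom-right cell; in local mode scores are floored at 0 and the answer is the best cell anywhere.

```python
def alignment_score(x, y, match=1, mismatch=-1, gap=2, local=False):
    """Best alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style), whole sequences aligned.
    local=True:  local (Smith-Waterman style), best internal region only.
    """
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            F[i][0] = -gap * i
        for j in range(1, cols):
            F[0][j] = -gap * j
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score = max(F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
            F[i][j] = score
            best = max(best, score)
    return best if local else F[-1][-1]


x = "GGGGACGTACGT"
y = "ACGTACGTCCCC"
print(alignment_score(x, y, local=False))  # penalized by the unrelated ends
print(alignment_score(x, y, local=True))   # only the shared ACGTACGT region counts
```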

Global vs. Local Alignments

Smith-Waterman Method

[Essentially finding the diagonals on the dotplot]

• Basic principles of dynamic programming
  – Creation of an alignment path matrix
  – Stepwise calculation of score values
  – Backtracking (evaluation of the optimal path)

Creation of an alignment path matrix

Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences

• Construct matrix F indexed by i and j (one index for each sequence)

• F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y

• Build F(i,j) recursively beginning with F(0,0) = 0

Michael Schatz

Creation of an alignment path matrix

• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
• Three possibilities:
  – xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi, yj)
  – xi is aligned to a gap, F(i,j) = F(i-1,j) - d
  – yj is aligned to a gap, F(i,j) = F(i,j-1) - d
• The best score up to (i,j) will be the largest of the three options
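
The recurrence can be turned into a small local aligner. This sketch is not the lecture's own implementation: the name `smith_waterman`, the match/mismatch scores standing in for s(xi, yj), and the linear gap cost d = 2 are assumptions. For local alignment a fourth option, 0, is added to the three listed above so that an alignment can start anywhere, and backtracking starts from the highest-scoring cell.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment with a linear gap penalty d.

    Returns (best score, aligned part of x, aligned part of y).
    """
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]  # F(0, j) = F(i, 0) = 0
    best, best_pos = 0, (0, 0)

    # Stepwise calculation of score values.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                    # start a new local alignment
                F[i - 1][j - 1] + s,  # xi aligned to yj
                F[i - 1][j] - d,      # xi aligned to a gap
                F[i][j - 1] - d,      # yj aligned to a gap
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Backtracking from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, a, b = smith_waterman("GGGGACGTACGT", "ACGTACGTCCCC")
print(score)
print(a)
print(b)
```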

Michael Schatz

Choose the best option

Michael Schatz

Smith-Waterman is OPTIMAL but computationally slow

• SW search requires computing a matrix of scores at every possible alignment position with every possible gap.

• Compute task increases with the product of the lengths of the two sequences to be compared (N²)

• Difficult for comparison of one small sequence to a much larger one, very difficult for two large sequences, essentially impossible to search very large databases.

Scoring Similarity

1) Can only score aligned sequences

2) DNA is usually scored as identical or not

3) modified scoring for gaps - single vs. multiple base gaps (gap extension)

4) Protein AAs have varying degrees of similarity
  – a. # of mutations to convert one to another
  – b. chemical similarity
  – c. observed mutation frequencies

5) PAM matrix calculated from observed mutations in protein families

The 20 amino acids used in proteins have different chemical structures

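Protein Scoring Systems

• Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

[Figure: diagram grouping the amino acids by property. Group labels: hydrophobic, aliphatic, aromatic, polar, charged, positive, small, tiny. Residues shown: C (as CSH and S+S), P, G, A, V, I, L, M, F, Y, W, H, K, R, E, Q, D, N, S, T]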

Evolutionary Conservation: how often is one AA replaced by another in the same position in orthologous proteins?

The PAM 250 scoring matrix

Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. in Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.

PAM vs. BLOSUM

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
(more distant sequences toward the bottom of the list)

BLOSUM62 for general use
BLOSUM80 for close relations
BLOSUM45 for distant relations

PAM120 for general use
PAM60 for close relations
PAM250 for distant relations

Search with Protein, not DNA Sequences

1) 4 DNA bases vs. 20 amino acids - fewer chance similarities between protein sequences

2) can have varying degrees of similarity between different AAs
  – # of mutations, chemical similarity, PAM matrix

3) protein databanks are much smaller than DNA databanks

FASTA

1) A faster method to find similarities and make alignments – capable of searching many sequences (an entire database)

2) Only searches near the diagonal of the alignment matrix

3) Produces a statistic for each alignment (more on this in the next lecture)

FASTA

1) Derived from logic of the dot plot
  – compute best diagonals from all frames of alignment

2) Word method looks for exact matches between words in query and test sequence
  – hash tables (fast computer technique)
  – Only matches exactly identical words
  – DNA words are usually 6 bases
  – protein words are 1 or 2 amino acids
  – only searches for diagonals in region of word matches = faster searching
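
The word method can be illustrated with an ordinary Python dictionary as the hash table. This is only a sketch of the first step, not the FASTA program itself: the names `word_index` and `diagonal_hits` are invented for this example, and the two sequences are made-up test data. Every exact word match at query position q and subject position s lies on diagonal s - q, so counting hits per diagonal points to the regions worth aligning.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Hash table mapping every k-letter word in seq to its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject, k=6):
    """Count exact word matches on each diagonal (subject pos - query pos)."""
    index = word_index(query, k)
    hits = Counter()
    for s_pos in range(len(subject) - k + 1):
        # Only exactly identical words are found; no mismatches allowed.
        for q_pos in index.get(subject[s_pos:s_pos + k], ()):
            hits[s_pos - q_pos] += 1
    return hits


query = "ACGTACGGTCA"
subject = "TTTACGTACGGTCATT"
hits = diagonal_hits(query, subject, k=6)
print(hits.most_common(1))  # the diagonal with the most word matches
```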

FASTA Format

• simple format used by almost all programs
• >header line with a [return] at end
• Sequence (no specific requirements for line length, characters, etc)
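
Because the format is this simple, a reader for it fits in a few lines. The function below is a minimal sketch with an invented name (`parse_fasta`) and made-up sample records; established libraries such as Biopython handle many more edge cases.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (header, sequence) pairs.

    A record starts at a line beginning with '>' and runs until the next one;
    sequence lines of any length are joined together.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = ">seq1 demo record\nACGT\nTTGA\n>seq2\nGGCC\n"
for name, seq in parse_fasta(sample):
    print(name, len(seq))
```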

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATGGAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTCCATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATCCCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT

FASTA Algorithm

Makes Longest Diagonal

3) after all diagonals found, tries to join diagonals by adding gaps

4) computes alignments in regions of best diagonals

FASTA Alignments

FASTA Results - Alignment

SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957:151-1022)

             60 70 80 90 100 110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
             || ||| | ||||| | ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
             130 140 150 160 170 180

             120 130 140 150 160 170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             ||||||||| || ||| | | || ||| | || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
             190 200 210 220 230 240

             180 190 200 210 220 230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
             ||| | ||||| || ||| |||| | || | |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
             250 260 270 280 290 300

             240 250 260 270 280 290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             |||||||||| ||||| | |||||| |||| ||| || ||| || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
             310 320 330 340 350 360

Similarity Search Statistics

• Searches with Needleman-Wunsch and Smith-Waterman have shown problems with simple “score” functions

• Score varies with length of sequences, gap penalties, composition, protein scoring matrix, etc.

• Need an unbiased method to compare alignments and judge if they are “significant” in a biological sense

How to score alignments?

• In a large database search, the scores of good alignments differ from those of random alignments.
• Many are random, very few good ones.
• The scores of random alignments follow an extreme value distribution, not a normal distribution. So we need to use an appropriate statistical test.

Pearson: FASTA Statistics

• http://people.virginia.edu/~wrp/talk95/prot_talk12-95_4.html

Compute p-value from the extreme value distribution

E is the e-value (significance score), m is your query length, n is the length of a matching database sequence, S is the score (computed from a count of matching letters with a scoring matrix and gap penalty), K is a constant that scales the size of the search space, and lambda is a constant that models the scoring system.
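
The symbols defined above match the Karlin-Altschul relation E = K · m · n · e^(−λS), and that form is assumed in the sketch below, together with the usual conversion P = 1 − e^(−E). The function names and the numeric values of K, lambda, and the scores are placeholders chosen for illustration; real values depend on the scoring matrix and gap penalties used by the search program.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`,
    assuming the Karlin-Altschul form E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance alignment: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


# Placeholder parameters, for illustration only.
K, lam = 0.1, 0.3
for S in (30, 60, 90):
    E = e_value(S, m=250, n=1_000_000, K=K, lam=lam)
    print(S, E, p_value(E))
```

Higher scores give exponentially smaller E, and for small E the p-value and E are nearly identical.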

Similarity Statistics

• E() value is closely related to a p value; the two are nearly equal when small
• Significant if E() < 0.05 (smaller numbers are more significant)
  – The E-value is the number of alignments scoring at least this well that are expected by chance alone in a search of this database. A value of 1 indicates that one alignment this good would be expected by chance when a random sequence is searched against this database.

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

The official NCBI explanation of BLAST statistics by Stephen Altschul

Summary

• Identity, similarity, homology
• Analyze sequence similarity by dotplots
  – window/stringency
• Alignment of text strings by edit distance
• Scoring of aligned amino acids
• Gap penalties
• Global vs. local alignment
• Dynamic Programming (Smith-Waterman)
• FASTA method

Next Lecture: Searching Sequence Databases