CZ5225 Methods in Computational Biology Lecture 2-3: Protein Families and Family Prediction Methods Prof. Chen Yu Zong Tel: 6874-6877 Email: [email protected] http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS August 2004. Protein Evolution: SARS coronavirus as an example.
CZ5225 Methods in Computational Biology
Lecture 2-3: Protein Families and Family Prediction Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
August 2004
2
Protein Evolution: SARS coronavirus as an example
3
SARS Coronavirus
A novel coronavirus identified as the cause of severe acute respiratory syndrome (SARS)
4
SARS Infection
How the SARS coronavirus enters a cell and reproduces
5
Protein Evolution
Generation of different species
6
Protein Families
• Sequence alignment-based families.
– Based on the principle of the sequence-structure-function relationship
– Derived by multiple sequence alignment
– Database: PFAM (Nucleic Acids Res. 30:276-280)
• Structure-based families.
– Derived by visual inspection and comparison of structures
– Database: SCOP (J. Mol. Biol. 247, 536-540)
• Functional families.
– Databases:
  • G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346-349), ORDB (Nucleic Acids Res. 30:354-360)
  • Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349)
  • Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
  • Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411)
  • Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294-295)
  • Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415)
  • Drug side-effect targets: DART (Drug Safety 26: 685-690)
7
Protein Families
Sequence families ≠ Structural families ≠ Functional families
Sequence similar, structure different
Sequence different, structure similar
Sequence similar, function different (distantly related proteins)
Sequence different, function similar
Homework: find examples
8
Protein Family Prediction Methods
Sequence alignment-based families:
• Multiple sequence alignment (HMM): HMMER; JMB 235, 1501-153; JMB 301, 173-190
Structure-based families:
• Visual inspection and comparison of structures
Functional families:
• Statistical learning methods:
– Neural network: ProtFun (Bioinformatics, 19:635-642)
– Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)
9
Sequence Comparison as a Mathematical Problem:
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap):
ATTCTTGC
ATCCTATTCTAGC
Bad alignment (two gaps):
AT TCTT GC
ATCCTATTCTAGC
Construction of many alignments => which is the best?
10
How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
C  -  -  -  T  T  A  A  C  T
C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
Alignment score: +12
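The scoring rule above can be checked mechanically. The following is a minimal Python sketch, not part of the original lecture; the function names `w` and `alignment_score` are chosen here for illustration, and the default weights are the ones stated on the slide.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column with the slide's weights."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum the column scores of two equally long, gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# The alignment shown on the slide.
print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```

Any other candidate alignment of the same two sequences can be rated the same way, which is what makes "which alignment is best?" a well-defined optimization problem.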
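11
Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
Figure: grid with the letters of b (C G G A T C A T) across the top and the letters of a (C T T A A C T) down the side. A path through the grid corresponds to an alignment, for example:
C---TTAACT
CGGATCA--T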
12
An optimal alignment -- the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj
• With proper initializations, Si,j can be computed as follows.
$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
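13
Computing Si,j
Figure: cell (i, j) is reached from cell (i-1, j) with weight w(ai, -), from cell (i, j-1) with weight w(-, bj), and from cell (i-1, j-1) with weight w(ai, bj). Sm,n is the last cell of the table.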
14
Initializations
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21
15
S3,5 = ?
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21
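As a worked instance of the recurrence, the missing entry can be filled in from its three neighbours in the table. Here a3 = T and b5 = T, so the diagonal step is a match:

```latex
\begin{aligned}
S_{3,5} &= \max\{\, S_{2,5} + w(a_3,-),\; S_{3,4} + w(-,b_5),\; S_{2,4} + w(a_3,b_5) \,\} \\
        &= \max\{\, 7 - 3,\; -5 - 3,\; -3 + 8 \,\} \\
        &= \max\{\, 4,\; -8,\; 5 \,\} = 5
\end{aligned}
```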
16
S3,5 = ?
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14
optimal score: S7,8 = 14
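The table can be reproduced by filling the matrix row by row with the recurrence for Si,j. This is a minimal Python sketch under the slide's scoring scheme (match +8, mismatch -5, gap -3); the names `w` and `global_score_table` are illustrative and do not come from the lecture.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column with the slide's weights."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def global_score_table(a, b):
    """Return the (len(a)+1) x (len(b)+1) table of global alignment scores."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned against nothing costs one gap per letter.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + w(a[i - 1], "-")
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + w("-", b[j - 1])
    # Recurrence: best of "gap in b", "gap in a", and "align a_i with b_j".
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
            )
    return S


S = global_score_table("CTTAACT", "CGGATCAT")
print(S[3][5])    # 5
print(S[-1][-1])  # 14
```

The table has (m+1)(n+1) cells and each cell takes constant work, so the fill runs in time proportional to the product of the two sequence lengths.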
17
C T T A A C - T
C G G A T C A T
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
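The alignment itself is recovered by walking back from the last cell to the first, at each step choosing a neighbour whose score explains the current entry. The sketch below is one way to do this in Python. The function name `global_align` and the tie-breaking order (diagonal first, then up, then left) are assumptions made here; other orders may return a different alignment with the same optimal score.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column with the slide's weights."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def global_align(a, b):
    """Return (optimal score, gapped row for a, gapped row for b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + w(a[i - 1], "-")
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + w("-", b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
            )

    # Traceback from the bottom-right cell to the top-left cell.
    i, j = m, n
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + w(a[i - 1], "-"):
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = global_align("CTTAACT", "CGGATCAT")
print(score)   # 14
print(top)     # CTTAAC-T
print(bottom)  # CGGATCAT
```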
18
Global Alignment vs. Local Alignment
• global alignment: aligns the two sequences over their entire lengths
• local alignment: aligns the best-matching pair of substrings of the two sequences
19
An optimal local alignment
• Si,j: the score of an optimal local alignment ending at ai and bj
• With proper initializations, Si,j can be computed as follows.
$$
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$
20
local alignment
          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3    ?
A    0
C    0
T    0
Match: 8
Mismatch: -5
Gap symbol: -3
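The entry marked "?" is S4,7. Applying the local recurrence to its neighbours in the table, with a4 = A and b7 = A (a match):

```latex
\begin{aligned}
S_{4,7} &= \max\{\, 0,\; S_{3,7} + w(a_4,-),\; S_{4,6} + w(-,b_7),\; S_{3,6} + w(a_4,b_7) \,\} \\
        &= \max\{\, 0,\; 2 - 3,\; 3 - 3,\; 5 + 8 \,\} \\
        &= \max\{\, 0,\; -1,\; 0,\; 13 \,\} = 13
\end{aligned}
```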
21
local alignment
          C    G    G    A    T    C    A    T
     0    0    0    0    0    0    0    0    0
C    0    8    5    2    0    0    8    5    2
T    0    5    3    0    0    8    5    3   13
T    0    2    0    0    0    8    5    2   11
A    0    0    0    0    8    5    3   13   10
A    0    0    0    0    8    5    2   11    8
C    0    8    5    2    5    3   13   10    7
T    0    5    3    0    2   13   10    8   18
The best score: 18
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18
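The local table and its best alignment can be reproduced with a small change to the global procedure: scores are floored at 0, the traceback starts at the largest entry, and it stops at the first 0. The sketch below is a minimal Python version under the slide's scoring scheme. The name `local_align` and the tie-breaking order in the traceback (diagonal, then up, then left) are assumptions made here.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column with the slide's weights."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def local_align(a, b):
    """Return (best local score, gapped piece of a, gapped piece of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                0,
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
            )
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Traceback from the best cell until a zero entry is reached.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + w(a[i - 1], "-"):
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_align("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```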
22
Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
GC-TC
A---C
G-ATC
23
How to score an MSA?
• Sum-of-Pairs (SP-score)
GC-TC
A---C
G-ATC
Score(GC-TC, A---C, G-ATC) = Score(GC-TC, A---C) + Score(GC-TC, G-ATC) + Score(A---C, G-ATC)
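The sum-of-pairs score can be sketched in Python as follows. The pairwise weights are the ones used earlier in the lecture (match +8, mismatch -5, gap -3). The slide does not say how a column with a gap in both rows is scored; treating it as 0 is an assumption made here, as are the function names.

```python
from itertools import combinations


def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one column of a pairwise projection of the MSA."""
    if x == "-" and y == "-":
        return 0  # assumption: a gap facing a gap costs nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def pair_score(row_a, row_b):
    """Score two rows taken from the same multiple alignment."""
    return sum(w(x, y) for x, y in zip(row_a, row_b))


def sp_score(rows):
    """Sum the pairwise scores over every pair of rows."""
    return sum(pair_score(r, s) for r, s in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
print(pair_score(msa[0], msa[1]))  # -3
print(pair_score(msa[0], msa[2]))  # 18
print(pair_score(msa[1], msa[2]))  # -3
print(sp_score(msa))               # 12
```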
24
Functional Classification by SVM
• A protein is classified as either belonging (+) or not belonging (-) to a functional family.
• By screening against all families, the function of this protein can be identified (example: SVMProt).
• What is SVM? Support vector machines are a statistical machine learning method that learns from examples and classifies objects into one of two classes.
• Advantages of SVM: diversity of class members is accommodated (family members are not required to share similar sequences); sequence-derived physico-chemical features are used as the basis for classification; suitable for functional family classification.
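To make the two-class idea concrete, here is a toy linear SVM trained by sub-gradient descent on the regularized hinge loss, written in plain Python. It is only a sketch: it is not how SVMProt is implemented, the two-dimensional feature vectors are hypothetical stand-ins for real sequence-derived features, and the parameter names and values (`lam`, `lr`, `epochs`) are assumptions. Practical work would use an established SVM library, typically with a non-linear kernel.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b)))."""
    dim, n = len(X[0]), len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:  # only points inside the margin contribute
                for k in range(dim):
                    grad_w[k] -= yi * xi[k] / n
                grad_b -= yi / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (belongs to the family) or -1 (does not belong)."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1


# Hypothetical feature vectors: three family members (+1), three non-members (-1).
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
y = [1, 1, 1, -1, -1, -1]
w_vec, bias = train_linear_svm(X, y)
print([predict(w_vec, bias, x) for x in X])  # [1, 1, 1, -1, -1, -1]
```

Screening a protein against many families, as the slide describes, would amount to training one such classifier per family and reporting the families whose classifier returns +1.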
25
SVM References
• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes
26
Introduction to Machine Learning
Goal:
To “improve” (gaining knowledge, enhancing computing capability)
Tasks:
• Forming concepts by data generalization.
• Compiling knowledge into compact form.
• Finding useful explanations for valid concepts.
• Clustering data into classes.
Reference:
Machine Learning in Molecular Biology Sequence Analysis.
Internet links:
http://www.ai.univie.ac.at/oefai/ml/ml-resources.html
27
Introduction to Machine Learning
Categories:
• Inductive learning.
– Forming concepts from data without much domain knowledge (learning from examples).
• Analytic learning.
– Using existing knowledge to derive new useful concepts (explanation-based learning).
• Connectionist learning.
– Using artificial neural networks to search for or represent concepts.
• Genetic algorithms.
– Searching for the most effective concept by means of Darwin’s “survival of the fittest” approach.
28
Machine Learning Methods
Inductive learning:
Concept learning and example-based learning
Concept learning:
29
Machine Learning Methods
Analytic learning:
30
Machine Learning Methods
Neural network:
31
Machine Learning Methods
Genetic algorithms:
Figure labels: Strength, Pattern, Classification
32
33-43
SVM (figure-only slides)
44
SVM for Classification of Proteins
How to represent a protein?
• Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
– amino acid composition
– hydrophobicity
– normalized van der Waals volume
– polarity
– polarizability
– charge
– surface tension
– secondary structure
– solvent accessibility
• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.
Nucleic Acids Res., 31: 3692-3697
45
SVM for Classification of Proteins
Descriptors for amino acid composition of protein:
C=(53.33, 46.67)
T=(51.72)
D=(3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)
Nucleic Acids Res., 31: 3692-3697
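The three descriptors can be computed with a few lines of Python. The sketch below rests on several assumptions that the slide does not state: the input is an illustrative 30-letter sequence over two classes (written "A" and "E"), chosen here because it reproduces the values shown above; C is the percentage of each class; T is the percentage of adjacent positions where the class changes; and D gives, for each class, the position (as a percentage of the length) of its first, 25%, 50%, 75% and last occurrence, with fractional ranks rounded down.

```python
def ctd_descriptors(encoded, classes):
    """Return (composition, transition, distribution) as percentages."""
    n = len(encoded)
    # C: share of each class in the whole sequence.
    composition = [round(100.0 * encoded.count(c) / n, 2) for c in classes]
    # T: share of adjacent pairs in which the class changes.
    changes = sum(1 for x, y in zip(encoded, encoded[1:]) if x != y)
    transition = round(100.0 * changes / (n - 1), 2)
    # D: where the first, 25%, 50%, 75% and 100% occurrence of each class falls.
    distribution = []
    for c in classes:
        positions = [i + 1 for i, x in enumerate(encoded) if x == c]
        k = len(positions)
        ranks = [1] + [max(1, (k * q) // 4) for q in (1, 2, 3, 4)]
        distribution += [round(100.0 * positions[r - 1] / n, 2) for r in ranks]
    return composition, transition, distribution


# Illustrative two-class sequence (an assumption, not shown on the slide).
seq = "AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE"
C, T, D = ctd_descriptors(seq, "AE")
print(C)  # [53.33, 46.67]
print(T)  # 51.72
print(D)  # [3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0]
```

With more than two classes, a transition value would normally be reported for each pair of classes; the single number here is specific to the two-class case.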
46
CZ5225 Methods in Computational Biology
Assignment 1
• Project 1: Protein family classification by SVM
– Construction of training and testing datasets
– Generating feature vectors
– SVM classification and analysis
– Write a report and include a softcopy of your datasets
• Project 2: Develop a program for pair-wise sequence alignment using a simple scoring scheme.
– Write the code in any programming language
– Test it on a few examples (such as estrogen receptor and progesterone receptor)
– Can you extend your program to multiple alignment?
– Write a report and include a softcopy of your program