Protein motif extraction with neuro-fuzzy optimization

Author: Bill C. H. Chang and Saman K. Halgamuge
Adviser: K. T. Sun
Presenter: Wei-Liang Liu

BIOINFORMATICS Vol. 18 no. 8 2002 Pages 1084–1090

Introduction (1/2)

We present a new algorithm for extracting the consensus pattern, or motif, from a group of related protein sequences.

This algorithm involves a statistical method to find short patterns with high frequency and then neural network training to optimize the final classification accuracies.

Fuzzy logic is used to increase the flexibility of protein motifs.

Introduction (2/2)

Sequence motif discovery algorithms can be generally categorized into three types:

(1) string alignment algorithms,
(2) exhaustive enumeration algorithms,
(3) heuristic methods.

String alignment algorithms

Find sequence motifs by minimizing a cost function which is related to the edit distances between sequences.

Multiple alignment of sequences is an NP-hard problem and its computational time increases exponentially with the sequence size.

Exhaustive enumeration algorithms

Exhaustive enumeration algorithms are guaranteed to find the optimal motif, but run in exponential time with respect to the length of the motif.

Heuristic methods

Heuristic methods can have a better performance but are usually less flexible.

Neuro-Fuzzy system

A neuro-fuzzy system is a neural network and a fuzzy system mapped to each other, thus providing advantages of both systems (Halgamuge and Glesner, 1994).

When it is used as a classifier, the outputs are class labels and therefore no conventional defuzzification is applied.

Example of a sequence

One example of sequence data is the human zinc finger sequence data ZNF117 [6]:

MKRHEMVAKHLVMFYYFAQHLWPEQNIRDSFQKVTLRRYRKCGYENLQLRKGCKSVVECKQHKGDYSGLNQCLKTTLSKIFQCNKYVEVFHKISNSNRHKMRHTENKHFKCKECRKTFCMLSHLTQHKRIHTRVNFYKCEAYGRAFNWSSTLNKHKRIHTGEKPYKCKECGKAFNQTSHLIRHKRIHTEEKPYKCEECGKAFNQSSTLTTHNIIHTGEIPYKCEKCVRAFNQASKLTEHKLIHTGEKRYECEECGKAFNRSSKLTEHKYIHTGEKLYKCEECDKAFNLSSTLTKHKVIHTGEKLYKCKECGKAFKQFSHLAIHNIIHTGEKLYKCEECGKAFNSSSNLTAHKKNRTGEKPYKCEECGKANLSSTLTPHKTIHI

Algorithm

The aim of this algorithm is to find a consensus pattern, or motif, from sequences belonging to the same family.

This motif can be either a rigid or flexible pattern. A rigid pattern may be A–x(5)–B, where there is a fixed number of gaps/wildcards (in this case, five) between two patterns A and B.

In a flexible pattern, the number of gaps is represented by a lower bound and an upper bound, such as x(2,4).
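As an illustration of rigid versus flexible gaps, the sketch below translates a pattern written in this notation into a regular expression. The helper names `pattern_to_regex` and `matches` are hypothetical and are not part of the presented algorithm; the sketch only assumes that x(n) means exactly n wildcards and x(m,n) means between m and n wildcards.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate e.g. 'A-x(5)-B' or 'A-x(2,4)-B' into a regular expression."""
    parts = []
    # Accept both the en dash used on the slides and a plain hyphen.
    for token in pattern.replace("–", "-").split("-"):
        gap = re.fullmatch(r"x\((\d+)(?:,(\d+))?\)", token)
        if gap:
            low = gap.group(1)
            high = gap.group(2) or low      # rigid gap: lower bound == upper bound
            parts.append(".{%s,%s}" % (low, high))
        else:
            parts.append(re.escape(token))  # literal symbols such as 'A' or 'ABC'
    return "".join(parts)

def matches(pattern: str, sequence: str) -> bool:
    """True if the pattern occurs anywhere in the sequence."""
    return re.search(pattern_to_regex(pattern), sequence) is not None
```

With this sketch, `matches("A-x(2,4)-B", "AHHHB")` holds because three wildcards fall inside the bounds, while `matches("A-x(2,4)-B", "AHB")` does not.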

Algorithm has four main steps

The proposed motif extraction algorithm has four main steps: sequence preprocessing, motif generation, motif selection and motif optimization.

Overview of the algorithm

Sequence Preprocessing

The aim of the preprocessing step is to select the ‘more’ important ‘features’ within a single family of sequences so that actual motif extraction becomes faster.

Example (1/2)

ABC–x(1,3)–DEF, where x(1,3) represents wild cards of length 1 to 3. Any amino acid symbol can match a wild card. Sequences ABCHHDEF and ABCAAADEF both satisfy the above consensus pattern.

The consensus pattern ABC–x(1,3)–DEF can also be written as A–x(0)–B–x(0)–C–x(1,3)–D–x(0)–E–x(0)–F.

Example (2/2)

As a general form, a sequence pattern can be represented as a series of events and intervals (Chang and Halgamuge, 2001):

E_1 – I_{1,2} – E_2 – I_{2,3} – … – I_{(N−1),N} – E_N

where E_1 is the first event and I_{1,2} is the interval gap between the first and second events.
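The event/interval form can be made concrete with a small sketch that splits a pattern string into a list of events and a list of interval bounds. The function name `to_events_and_intervals` and the (low, high) tuple representation are assumptions made for illustration; adjacent symbols are given an implicit x(0) gap, as in the rewriting of ABC–x(1,3)–DEF shown earlier.

```python
import re

def to_events_and_intervals(pattern: str):
    """Return (events, intervals); intervals[k] is the (low, high) gap
    between events[k] and events[k + 1]."""
    events, intervals = [], []
    pending = None  # explicit gap waiting for the next event, if any
    for token in pattern.replace("–", "-").split("-"):
        gap = re.fullmatch(r"x\((\d+)(?:,(\d+))?\)", token)
        if gap:
            low = int(gap.group(1))
            high = int(gap.group(2)) if gap.group(2) else low
            pending = (low, high)
            continue
        for symbol in token:
            if events:
                # Adjacent symbols without an explicit gap get x(0).
                intervals.append(pending if pending is not None else (0, 0))
            pending = None
            events.append(symbol)
    return events, intervals
```

For ABC–x(1,3)–DEF this yields six events and five intervals, of which only the third has bounds (1, 3).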

Vector generation

Each element of the vector represents a combination of two events, E_i and E_j, and their gap I_{i,j} (where E_i occurs before E_j), and the value of each element of the vector is either 1 or 0.

A value of 1 translates to ‘in this sequence, there is an occurrence of character E_i with interval I_{i,j} before E_j’, and a value of zero means otherwise (there is no such occurrence).

Example

Let us assume the first element of a vector represents ‘A–x(0)–A’.

The value of this element will be 1 for sequence ‘AABCD’ and 0 for sequence ‘ABACD’, as the short pattern A–x(0)–A occurs in the first sequence but not the second.
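The value of a single vector element can be computed with a direct scan. The function below is a minimal sketch (the name `has_pair` is hypothetical) that returns 1 when the first symbol is followed by the second with exactly the given number of symbols in between.

```python
def has_pair(sequence: str, first: str, gap: int, second: str) -> int:
    """Return 1 if `first` occurs with exactly `gap` symbols before `second`, else 0."""
    step = gap + 1  # distance between the two positions
    for i in range(len(sequence) - step):
        if sequence[i] == first and sequence[i + step] == second:
            return 1
    return 0
```

Applied to the example above, `has_pair("AABCD", "A", 0, "A")` gives 1 and `has_pair("ABACD", "A", 0, "A")` gives 0.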

Size of Vector

For protein sequences, the number of possible events is 20 (there are 20 amino acids).

By considering that only nine patterns in PROSITE out of around 1300 motif patterns have interval gaps of more than 20 (Hart et al., 2000), a maximum gap considered between any two events of 20 should be satisfactory.

Therefore the size of the vector is 20 × 20 × 20 = 8000.

The vector can be implemented as 13-bit (2^13 = 8192) binary data.
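One way to lay out the 8000 elements is sketched below. The ordering of the index, the alphabet string and the choice of gap values 0 to 19 (so that there are exactly 20 gap values) are assumptions, since the slides do not specify them. Under this layout the largest index is 7999, which fits in 13 bits; this is one possible reading of the 13-bit remark.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
GAP_VALUES = 20  # assumption: gaps 0..19, giving 20 x 20 x 20 elements

def feature_index(first: str, gap: int, second: str) -> int:
    """Map (first event, gap, second event) to a position in 0..7999."""
    return (INDEX[first] * GAP_VALUES + gap) * len(AMINO_ACIDS) + INDEX[second]

def sequence_vector(sequence: str) -> list:
    """Binary vector with a 1 for every (event, gap, event) found in the sequence."""
    vector = [0] * (len(AMINO_ACIDS) * GAP_VALUES * len(AMINO_ACIDS))
    for i, first in enumerate(sequence):
        if first not in INDEX:
            continue  # skip symbols outside the 20-letter alphabet
        for gap in range(GAP_VALUES):
            j = i + gap + 1
            if j >= len(sequence):
                break
            if sequence[j] in INDEX:
                vector[feature_index(first, gap, sequence[j])] = 1
    return vector
```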

Protein sequences

Feature selection

Features are selected by keeping the elements above a certain threshold value (e.g. 0.90).

The value of each vector element represents the frequency of occurrence of a particular E_i – I_{i,j} – E_j pattern.

For example, if an element which represents A–x(0)–A has a value of 0.99, then 99% of this group of sequences have ‘AA’ somewhere in their sequences.
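The thresholding step can be sketched as follows. Each sequence contributes at most once to the count of a short pattern, so the resulting value is the fraction of sequences containing it. The function names, the sparse set representation and the use of "greater than or equal to" for the threshold are assumptions for illustration.

```python
from collections import Counter

def pair_features(sequence: str, max_gap: int = 19) -> set:
    """All (first, gap, second) short patterns present in one sequence."""
    features = set()
    for i, first in enumerate(sequence):
        for j in range(i + 1, min(i + max_gap + 2, len(sequence))):
            features.add((first, j - i - 1, sequence[j]))
    return features

def select_features(sequences, threshold: float = 0.90) -> dict:
    """Keep short patterns whose frequency across the family reaches the threshold."""
    counts = Counter()
    for sequence in sequences:
        counts.update(pair_features(sequence))  # a set, so one vote per sequence
    total = len(sequences)
    return {f: c / total for f, c in counts.items() if c / total >= threshold}
```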

Motif generation (1/3)

For example, if a motif pattern C–x(2)–C–x(3)–F occurs in 90% of the sequences in the family, the short patterns (or important features):

(1) C–x(2)–C,
(2) C–x(3)–F, and
(3) C–x(6)–F

must all exist at a frequency of 90% or greater in the sequences. But the reverse is not always true.
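The short patterns implied by a rigid motif follow from the positions of its events. The sketch below (with the hypothetical name `implied_pairs`) lists every pair of events together with the total gap between them, where a gap spanning intermediate events counts those events as well.

```python
from itertools import combinations

def implied_pairs(events, gaps):
    """events: e.g. ['C', 'C', 'F']; gaps: e.g. [2, 3] between consecutive events.
    Returns every (first, gap, second) short pattern the motif implies."""
    positions = [0]
    for gap in gaps:
        positions.append(positions[-1] + gap + 1)
    return [
        (events[i], positions[j] - positions[i] - 1, events[j])
        for i, j in combinations(range(len(events)), 2)
    ]
```

For C–x(2)–C–x(3)–F the events sit at positions 0, 3 and 7, which gives C–x(2)–C, C–x(6)–F and C–x(3)–F.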

Motif generation (2/3)

Fig. 2. Connect important features to form a motif candidate.

Motif generation (3/3)

In Figure 2, F–x(2)–S is not connected because for a motif C–x(2)–C–x(3)–F–x(2)–S to occur frequently, the short patterns C–x(9)–S and C–x(6)–S should have occurred frequently as well (which is not the case here).
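The connection test described here can be sketched as a check that every event already in the candidate forms a frequent short pattern with the new event. The function name `can_extend` and the set-of-tuples representation of frequent features are assumptions; the slides do not show how the test is implemented.

```python
def can_extend(events, gaps, new_gap, new_event, frequent) -> bool:
    """True only if each existing event paired with the new event is a frequent feature."""
    positions = [0]
    for gap in gaps:
        positions.append(positions[-1] + gap + 1)
    new_position = positions[-1] + new_gap + 1
    return all(
        (event, new_position - position - 1, new_event) in frequent
        for event, position in zip(events, positions)
    )
```

Extending C–x(2)–C–x(3)–F with x(2)–S requires C–x(9)–S, C–x(6)–S and F–x(2)–S to be frequent; if only F–x(2)–S is, the check fails.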

A good motif pattern

A good motif pattern can be simply described as one that:

(1) Correctly identifies protein sequences belonging to the family it represents, or maximizes ‘true-positives’.

(2) Does not identify protein sequences belonging to the other families, or minimizes ‘false-positives’.
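The two criteria can be measured as rates over a labelled set of sequences. The sketch below assumes the motif has been expressed as a regular expression and that both lists are non-empty; the function name `motif_rates` and the toy sequences in the usage note are made up for illustration.

```python
import re

def motif_rates(regex: str, family, others):
    """Return (true-positive rate, false-positive rate) of a motif regex."""
    true_positives = sum(1 for s in family if re.search(regex, s))
    false_positives = sum(1 for s in others if re.search(regex, s))
    return true_positives / len(family), false_positives / len(others)
```

For example, the regex `C.{2}C` matches three of the four toy family sequences `["ACAACA", "CBBC", "CCCC", "CBC"]` and one of the two other sequences `["AAAA", "CAAC"]`.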

Motif optimization (1/2)

Motif optimization (2/2)

The inputs to the network are event intervals. The simple rule (black node in ‘rule base’ layer of Figure 3) in the neuro-fuzzy system is: ‘IF I_1 is μ_1 and I_2 is μ_1, THEN output is μ_class’.

μ_class is the output of the neuro-fuzzy network.

Fuzzy inference system

A fuzzy inference system embedded in a neural network has three main steps: fuzzification, fuzzy inference and defuzzification.

Sequence Preprocessing (1/3)

For example, let T = AGCCTGAT. The first and second level distribution matrices are shown in Table 1:

Sequence Preprocessing (2/3)

Sequence Preprocessing (3/3)

Sequence Fuzzification (1/2)

The value of the event interval is also fuzzified. For example, if pattern P = TφφG, the event interval fuzzy membership function can be defined as shown in Figure 4.

P = TφφG = T–x(2)–G
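Figure 4 is not reproduced in this transcript, so the shape of the membership function is unknown here. Purely as an assumed illustration, the sketch below uses a triangular function that peaks at the expected interval (2 for T–x(2)–G) and falls to zero two positions away; the shape, the width and the name `triangular_membership` are not taken from the source.

```python
def triangular_membership(interval: float, centre: float = 2.0, width: float = 2.0) -> float:
    """Assumed triangular membership: 1 at `centre`, 0 at `centre` +/- `width`."""
    return max(0.0, 1.0 - abs(interval - centre) / width)
```

Under these assumptions an observed interval of 2 has membership 1.0, an interval of 1 or 3 has membership 0.5, and an interval of 4 or more has membership 0.0.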

Sequence Fuzzification (2/2)

Sequence Inference

This step aims to find the subsequence in Text T that is most “similar” to Pattern P.

The inference rule used here is: IF event A_1 occurs AND event A_2 occurs AND event interval between A_1 and A_2 is I_1 AND … event A_{n-1} occurs AND event A_n occurs AND event interval between A_{n-1} and A_n is I_{n-1}, THEN Pattern P exists in Text T with degree Y_i.

Fuzzy Sequence Pattern Matching Algorithm (example)

The general structure of a C2H2 zinc finger protein motif (a motif is the signature of a particular group of sequences) is [2]:

CφφCφφφφφφφφφφφφHφφH

Sequence Preprocessing (example)

CφφCφφφφφφφφφφφφHφφH

Sequence Fuzzification (example)

We use the following fuzzy rules to describe the event intervals:

R1: If event interval is I_1 between the first two C, then the membership value is μ_1

R2: If event interval is I_2 between C and H, then the membership value is μ_2

R3: If event interval is I_3 between the last two H, then the membership value is μ_3

Sequence Inference (example)

The inference rule used here is:

IF event interval between the first two Cs is I_1 AND event interval between C and H is I_2 AND event interval between the last two Hs is I_3, THEN Pattern P exists in Text T with degree Y_i.

Where Y_i = μ_1 × μ_2 × μ_3 and Y = max(Y_1, Y_2, Y_3, …, Y_m)
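The product-then-maximum rule can be sketched as below. Each candidate occurrence is described by its three observed intervals; the membership of each interval is multiplied to give Y_i, and the overall degree Y is the maximum over candidates. The triangular membership shape, its width, and the expected intervals (2, 12, 2), read off the motif CφφCφφφφφφφφφφφφHφφH, are assumptions made for illustration only.

```python
from math import prod

def triangular(interval: float, centre: float, width: float = 2.0) -> float:
    """Assumed membership function: 1 at `centre`, 0 at `centre` +/- `width`."""
    return max(0.0, 1.0 - abs(interval - centre) / width)

def match_degree(candidates, centres) -> float:
    """Y = max over candidates of the product of interval memberships."""
    degrees = [
        prod(triangular(value, centre) for value, centre in zip(candidate, centres))
        for candidate in candidates
    ]
    return max(degrees, default=0.0)  # no candidate at all means degree 0
```

For instance, with expected intervals (2, 12, 2), a candidate with intervals (2, 12, 3) scores 1 × 1 × 0.5 = 0.5, and a candidate with intervals (4, 12, 2) scores 0.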

Classify

Sum of square error

For example, sequence Z is ACCABBDACA, and the preliminary motif is A–x(2)–A–x(2)–A. The possible matches are

(a) ACCABBDA (A–x(2)–A–x(3)–A) and
(b) ABBDACA (A–x(3)–A–x(1)–A).

The sum of square error is:

(a): (2 − 2)^2 + (3 − 2)^2 = 1
(b): (3 − 2)^2 + (1 − 2)^2 = 2

So (a) is the ‘most similar match’ and its event interval values (2, 3) are used as training input data.
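The selection of the most similar match can be reproduced with a brute-force sketch that tries every in-order placement of the motif's events in the sequence and keeps the placement with the smallest sum of square error. The exhaustive search and the name `most_similar_match` are assumptions for illustration; the slides do not say how candidate matches are enumerated.

```python
from itertools import product

def most_similar_match(sequence: str, events: str, expected_gaps):
    """Return (error, gaps) for the in-order placement with the lowest
    sum of square error, or None if the events cannot be placed."""
    candidates = [[i for i, s in enumerate(sequence) if s == e] for e in events]
    best = None
    for positions in product(*candidates):
        if any(b <= a for a, b in zip(positions, positions[1:])):
            continue  # events must appear in order
        gaps = tuple(b - a - 1 for a, b in zip(positions, positions[1:]))
        error = sum((g - e) ** 2 for g, e in zip(gaps, expected_gaps))
        if best is None or error < best[0]:
            best = (error, gaps)
    return best
```

For Z = ACCABBDACA and the motif A–x(2)–A–x(2)–A, the lowest error is 1, obtained with the intervals (2, 3), which agrees with match (a) above.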

Result of C2H2 zinc finger protein (1/3)

Result of C2H2 zinc finger protein (2/3)

Result of C2H2 zinc finger protein (3/3)

Result of EGF Protein (1/3)

Result of EGF Protein (2/3)

Result of EGF Protein (3/3)

Discussion

The optimization of motif patterns in both the EGF and zinc finger protein families increases the rate of true positives.

However, with an increase in the true-positive rate, the rate of false positives also increases.

An interesting observation is that in comparison to the motifs suggested in PROSITE, the motifs identified by our method are more flexible and broad.

Conclusion and future work

For future research, optimization of the neuro-fuzzy system will be further investigated in order to implement fuzzy membership functions for events.