Protein motif extraction with neuro-fuzzy optimization

Author: Bill C. H. Chang and Saman K. Halgamuge
Adviser: K. T. Sun
Presenter: Wei-Liang Liu

BIOINFORMATICS Vol. 18 no. 8 2002 Pages 1084–1090

Introduction (1/2)

We present a new algorithm for extracting the consensus pattern, or motif, from a group of related protein sequences.

This algorithm involves a statistical method to find short patterns with high frequency and then neural network training to optimize the final classification accuracies.

Fuzzy logic is used to increase the flexibility of protein motifs.

Introduction (2/2)

Sequence motif discovery algorithms can be generally categorized into three types:

(1) string alignment algorithms,
(2) exhaustive enumeration algorithms,
(3) heuristic methods.

String alignment algorithms

String alignment algorithms find sequence motifs by minimizing a cost function which is related to the edit distances between sequences.

Multiple alignment of sequences is an NP-hard problem and its computational time increases exponentially with the sequence size.
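To make the notion of edit distance concrete, here is a minimal sketch of the pairwise case. It assumes unit costs for insertion, deletion and substitution (the Levenshtein distance); the slides do not state which cost function is meant, and the function name `edit_distance` is only illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, assuming unit edit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("ABCHHDEF", "ABCAAADEF"))  # 3
```

The pairwise computation is quadratic in the sequence length; the exponential cost mentioned above arises when many sequences are aligned simultaneously.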

Exhaustive enumeration algorithms

Exhaustive enumeration algorithms are guaranteed to find the optimal motif, but run in exponential time with respect to the length of the motif.

Heuristic methods

Heuristic methods can have better performance but are usually less flexible.

Neuro-Fuzzy system

A neuro-fuzzy system is a neural network and a fuzzy system mapped to each other, thus providing the advantages of both systems (Halgamuge and Glesner, 1994).

When it is used as a classifier, the outputs are class labels and therefore no conventional defuzzification is applied.

Example of a sequence

One example of sequence data is the human zinc finger sequence ZNF117 [6]:

MKRHEMVAKHLVMFYYFAQHLWPEQNIRDSFQKVTLRRYRKCGYENLQLRKGCKSVVECKQHKGDYSGLNQCLKTTLSKIFQCNKYVEVFHKISNSNRHKMRHTENKHFKCKECRKTFCMLSHLTQHKRIHTRVNFYKCEAYGRAFNWSSTLNKHKRIHTGEKPYKCKECGKAFNQTSHLIRHKRIHTEEKPYKCEECGKAFNQSSTLTTHNIIHTGEIPYKCEKCVRAFNQASKLTEHKLIHTGEKRYECEECGKAFNRSSKLTEHKYIHTGEKLYKCEECDKAFNLSSTLTKHKVIHTGEKLYKCKECGKAFKQFSHLAIHNIIHTGEKLYKCEECGKAFNSSSNLTAHKKNRTGEKPYKCEECGKANLSSTLTPHKTIHI

Algorithm

The aim of this algorithm is to find a consensus pattern, or motif, from sequences belonging to the same family.

This motif can be either a rigid or flexible pattern. A rigid pattern may be A–x(5)–B, where there exists a fixed number of gaps/wildcards (in this case, five) between two patterns A and B.

In a flexible pattern, the number of gaps is represented by a lower bound and an upper bound, such as x(2,4).

Algorithm has four main steps

The proposed motif extraction algorithm has four main steps: sequence preprocessing, motif generation, motif selection and motif optimization.

Overview of the algorithm

Sequence Preprocessing

The aim of the preprocessing step is to select the ‘more’ important ‘features’ within the sequences of a single family so that actual motif extraction becomes faster.

Example (1/2)

Consider the consensus pattern ABC–x(1,3)–DEF, where x(1,3) represents wild cards of length 1 to 3. Any amino acid symbol can match a wild card. Sequences ABCHHDEF and ABCAAADEF both satisfy the above consensus pattern.

The consensus pattern ABC–x(1,3)–DEF can also be written as A–x(0)–B–x(0)–C–x(1,3)–D–x(0)–E–x(0)–F.
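The wildcard notation can be checked mechanically. The following sketch translates a pattern written in this notation into a regular expression; the helper names `pattern_to_regex` and `matches` are hypothetical, and matching "anywhere in the sequence" is an assumption of this sketch.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate e.g. 'ABC-x(1,3)-DEF' into a regular expression string."""
    parts = []
    # Accept both the en dash used on the slides and a plain hyphen.
    for token in pattern.replace("–", "-").split("-"):
        gap = re.fullmatch(r"x\((\d+)(?:,(\d+))?\)", token)
        if gap:
            lo = gap.group(1)
            hi = gap.group(2) or lo  # x(n) is a rigid gap of exactly n
            parts.append(".{%s,%s}" % (lo, hi))
        else:
            parts.append(re.escape(token))
    return "".join(parts)


def matches(pattern: str, sequence: str) -> bool:
    """True if the pattern occurs anywhere in the sequence."""
    return re.search(pattern_to_regex(pattern), sequence) is not None


print(matches("ABC-x(1,3)-DEF", "ABCHHDEF"))   # True
print(matches("ABC-x(1,3)-DEF", "ABCAAADEF"))  # True
print(matches("ABC-x(1,3)-DEF", "ABCDEF"))     # False: at least one wild card is required
```

Both spellings of the pattern, the compact one and the fully expanded one with x(0) between adjacent symbols, accept the same sequences under this translation.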

Example (2/2)

As a general form, a sequence pattern can be represented as a series of events and intervals (Chang and Halgamuge, 2001):

$$E_1 - I_{1,2} - E_2 - I_{2,3} - \dots - I_{(N-1),N} - E_N$$

where $E_1$ is the first event and $I_{1,2}$ is the interval gap between the first and second events.

Vector generation

Each element of the vector represents a combination of two events, $E_i$ and $E_j$, and their gap $I_{i,j}$ (where $E_i$ occurs before $E_j$), and the value of each element of the vector is either 1 or 0.

A value of 1 translates to ‘in this sequence, there is an occurrence of character $E_i$ with interval $I_{i,j}$ before $E_j$’, and a value of zero means otherwise (there is no such occurrence).

Example

Let us assume the first element of a vector represents ‘A–x(0)–A’.

The value of this element will be 1 for sequence ‘AABCD’ and 0 for sequence ‘ABACD’, as the short pattern A–x(0)–A occurs in the first sequence but not the second.
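A single vector element can be computed with a short scan. This is a minimal sketch; the function name `has_feature` and its argument order are assumptions made for illustration.

```python
def has_feature(seq: str, first: str, gap: int, second: str) -> int:
    """Return 1 if `first` is followed by `second` after exactly `gap` symbols."""
    # Position i holds `first`; position i + gap + 1 must hold `second`.
    found = any(seq[i] == first and seq[i + gap + 1] == second
                for i in range(len(seq) - gap - 1))
    return 1 if found else 0


print(has_feature("AABCD", "A", 0, "A"))  # 1
print(has_feature("ABACD", "A", 0, "A"))  # 0
```

Note that ‘ABACD’ does contain A–x(1)–A, so the same sequence would score 1 on a different element of the vector.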

Size of Vector

For protein sequences, the number of possible events is 20 (there are 20 amino acids).

By considering that only nine patterns in PROSITE out of around 1300 motif patterns have interval gaps of more than 20 (Hart et al., 2000), a maximum gap of 20 between any two events should be satisfactory.

Therefore the size of the vector is 20 × 20 × 20 = 8000.

The vector can be implemented with 13-bit ($2^{13} = 8192$) binary data.
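One way to read the 13-bit remark is that every (first event, gap, second event) triple maps to an index below 8192. The sketch below shows such a mapping. The alphabet ordering, the gap range 0 to 19 (20 distinct values) and the name `feature_index` are all assumptions of this sketch, not details given on the slide.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acids
MAX_GAP = 20                          # assumed: gaps 0..19, i.e. 20 distinct values


def feature_index(first: str, gap: int, second: str) -> int:
    """Map (first event, gap, second event) to a unique index in 0..7999."""
    if not 0 <= gap < MAX_GAP:
        raise ValueError("gap outside the assumed range 0..19")
    i = AMINO_ACIDS.index(first)
    j = AMINO_ACIDS.index(second)
    return (i * MAX_GAP + gap) * len(AMINO_ACIDS) + j


largest = feature_index("Y", MAX_GAP - 1, "Y")
print(largest, largest.bit_length())  # 7999 13
```

Since the largest index is 7999, 13 bits are enough to address every element, with 192 codes left unused.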

Protein sequences

Feature selection

Important features are selected by keeping the vector elements above a certain threshold value (e.g. 0.90).

The value of each vector element represents the frequency of occurrence of a particular $E_i$ – $I_{i,j}$ – $E_j$ pattern.

For example, if an element which represents A–x(0)–A has a value of 0.99, then 99% of this group of sequences have ‘AA’ somewhere in their sequences.
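The frequency computation and thresholding can be sketched as follows. The function names, the dictionary representation and the gap range 0 to `max_gap - 1` are assumptions for illustration; a feature is kept when its frequency is at or above the threshold.

```python
from collections import Counter


def feature_frequencies(sequences, max_gap=20):
    """Fraction of sequences containing each (first, gap, second) feature."""
    counts = Counter()
    for seq in sequences:
        present = set()  # count each feature at most once per sequence
        for i, first in enumerate(seq):
            for gap in range(max_gap):
                j = i + gap + 1
                if j >= len(seq):
                    break
                present.add((first, gap, seq[j]))
        counts.update(present)
    return {feat: c / len(sequences) for feat, c in counts.items()}


def select_features(sequences, threshold=0.90, max_gap=20):
    """Keep only the features whose frequency reaches the threshold."""
    freqs = feature_frequencies(sequences, max_gap)
    return {feat: f for feat, f in freqs.items() if f >= threshold}


family = ["CAAC", "CTTC", "GCAGC"]  # toy sequences, not real protein data
print(select_features(family))      # {('C', 2, 'C'): 1.0}
```

In this toy family only C–x(2)–C is present in every sequence, so it is the only feature that survives a 0.90 threshold.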

Motif generation (1/3)

For example, if a motif pattern C–x(2)–C–x(3)–F occurs in 90% of the sequences in the family, the short patterns (or important features):

(1) C–x(2)–C,
(2) C–x(3)–F, and
(3) C–x(6)–F

must all exist at a frequency of 90% or greater in the sequences. But the reverse is not always true.
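The short patterns implied by a rigid motif follow from simple position arithmetic: each gap between two events is the number of symbols strictly between them. A minimal sketch, with `implied_features` as a hypothetical helper name:

```python
from itertools import combinations


def implied_features(events, gaps):
    """All pairwise (first, gap, second) features implied by a rigid motif.

    events: the motif's event symbols in order, e.g. "CCF"
    gaps:   the rigid gap after each event except the last, e.g. [2, 3]
    """
    # Absolute position of each event, with the first event at position 0.
    positions = [0]
    for g in gaps:
        positions.append(positions[-1] + g + 1)
    return [(events[a], positions[b] - positions[a] - 1, events[b])
            for a, b in combinations(range(len(events)), 2)]


print(implied_features("CCF", [2, 3]))
# [('C', 2, 'C'), ('C', 6, 'F'), ('C', 3, 'F')]
```

The gap of 6 between the first C and F is 2 + 1 + 3: the two explicit gaps plus the middle C itself. A candidate motif can be rejected as soon as any one of its implied features falls below the frequency threshold.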

Motif generation (2/3)

Fig. 2. Connect important features to form a motif candidate.

Motif generation (3/3)

In Figure 2, F–x(2)–S is not connected because for a motif C–x(2)–C–x(3)–F–x(2)–S to occur frequently, the short patterns C–x(9)–S and C–x(6)–S should have occurred frequently as well (which is not the case here).

A good motif pattern

A good motif pattern can be simply described as one that:

(1) Correctly identifies protein sequences belonging to the family it represents, i.e. maximizes ‘true-positives’.

(2) Does not identify protein sequences belonging to other families, i.e. minimizes ‘false-positives’.
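These two criteria can be measured as rates over labeled sequences. The sketch below assumes a motif is available as a predicate on sequences; the function name `classification_rates` and the toy data are illustrative only.

```python
import re


def classification_rates(matches, family_sequences, other_sequences):
    """Return (true-positive rate, false-positive rate) of a motif predicate."""
    tp = sum(1 for s in family_sequences if matches(s))
    fp = sum(1 for s in other_sequences if matches(s))
    return tp / len(family_sequences), fp / len(other_sequences)


def c2c(seq):
    """Toy motif C-x(2)-C, written directly as a regular expression."""
    return re.search(r"C.{2}C", seq) is not None


tpr, fpr = classification_rates(c2c,
                                ["ACAACA", "CTTCG", "AAAA"],  # toy family members
                                ["GGGG", "CAAC"])             # toy non-members
print(round(tpr, 3), fpr)  # 0.667 0.5
```

A more permissive motif tends to raise both numbers at once, which is the trade-off a motif optimizer has to balance.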

Motif optimization (1/2)

Motif optimization (2/2)

The inputs to the network are event intervals. The simple rule (black node in the ‘rule base’ layer of Figure 3) in the neuro-fuzzy system is: ‘IF I1 is μ1 and I2 is μ1, THEN output is μclass’.

μclass is the output of the neuro-fuzzy network.

Fuzzy inference system

A fuzzy inference system embedded in a neural network has three main steps: fuzzification, fuzzy inference and defuzzification.

Sequence Preprocessing (1/3)

For example, let T = AGCCTGAT. The first and second level distribution matrices are shown in Table 1:

Sequence Preprocessing (2/3)

Sequence Preprocessing (3/3)

Sequence Fuzzification (1/2)

The value of the event interval is also fuzzified. For example, if pattern P = TφφG, the event interval fuzzy membership function can be defined as shown in Figure 4.

P = TφφG = T–x(2)–G

Sequence Fuzzification (2/2)

Sequence Inference

This step aims to find the subsequence in text T that is most “similar” to pattern P.

The inference rule used here is: IF event A1 occurs AND event A2 occurs AND event interval between A1 and A2 is I1 AND … event An-1 occurs AND event An occurs AND event interval between An-1 and An is In-1, THEN pattern P exists in text T with degree Yi.

Fuzzy Sequence Pattern Matching Algorithm (example)

The general structure of a C2H2 zinc finger protein motif (a motif is the signature of a particular group of sequences) is [2]:

CφφCφφφφφφφφφφφφHφφH

Sequence Preprocessing (example)

CφφCφφφφφφφφφφφφHφφH

Sequence Fuzzification (example)

We use the following fuzzy rule to describe the event interval:

R1: If event interval is I1 between the first two C, then the membership value is μ1

R2: If event interval is I2 between C and H, then the membership value is μ2

R3: If event interval is I3 between the last two H, then the membership value is μ3

Sequence Inference (example)

The inference rule used here is:

IF event interval between the first two Cs is I1 AND event interval between C and H is I2 AND event interval between the last two Hs is I3, THEN Pattern P exists in Text T with degree Yi.

where Yi = μ1 × μ2 × μ3 and Y = max(Y1, Y2, Y3, …, Ym).
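The product-and-maximum rule above can be sketched in code. The shape of the membership function is not recoverable from this text, so the sketch assumes a triangular function centered on the expected interval; the names `triangular` and `match_degree`, the `width` parameter and the interval centers (2, 12, 2) are illustrative assumptions.

```python
def triangular(x, center, width):
    """Assumed membership function: 1 at the center, falling linearly to 0."""
    return max(0.0, 1.0 - abs(x - center) / width)


def match_degree(seq, events, centers, width=3.0):
    """Y = max over candidate alignments of the product of interval memberships."""
    best = 0.0

    def extend(k, pos, degree):
        nonlocal best
        if k == len(events):          # all events placed: one candidate Yi
            best = max(best, degree)
            return
        for nxt in range(pos + 1, len(seq)):
            if seq[nxt] == events[k]:
                mu = triangular(nxt - pos - 1, centers[k - 1], width)
                if mu > 0.0:          # prune alignments whose degree is already 0
                    extend(k + 1, nxt, degree * mu)

    for start, ch in enumerate(seq):
        if ch == events[0]:
            extend(1, start, 1.0)
    return best


exact = "C" + "AA" + "C" + "A" * 12 + "H" + "AA" + "H"
shifted = "C" + "AAA" + "C" + "A" * 12 + "H" + "AA" + "H"
print(match_degree(exact, "CCHH", [2, 12, 2]))              # 1.0
print(round(match_degree(shifted, "CCHH", [2, 12, 2]), 3))  # 0.667
```

A sequence whose first interval is one position too long still matches, but with a reduced degree, which is how fuzzification makes the motif flexible.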

Classify

Sum of square error

For example, sequence Z is ACCABBDACA, and the preliminary motif is A–x(2)–A–x(2)–A. The possible matches are (a) ACCABBDA (A–x(2)–A–x(3)–A) and (b) ABBDACA (A–x(3)–A–x(1)–A).

The sum of square error is:

(a): $(2 - 2)^2 + (3 - 2)^2 = 1$
(b): $(3 - 2)^2 + (1 - 2)^2 = 2$

So (a) is the ‘most similar match’ and its event interval values (2, 3) are used as training input data.
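The selection of the most similar match can be reproduced with a brute-force search. This sketch enumerates every in-order occurrence of the motif's events, which includes candidates beyond the two listed above; the function name `most_similar_match` is hypothetical, and exhaustive enumeration is only practical for short sequences.

```python
from itertools import combinations


def most_similar_match(seq, events, target_gaps):
    """Return (sum of square error, gaps) of the occurrence closest to the motif."""
    best = None
    for combo in combinations(range(len(seq)), len(events)):
        # Keep only position tuples that spell out the motif's events in order.
        if any(seq[p] != e for p, e in zip(combo, events)):
            continue
        gaps = tuple(b - a - 1 for a, b in zip(combo, combo[1:]))
        sse = sum((g - t) ** 2 for g, t in zip(gaps, target_gaps))
        if best is None or sse < best[0]:
            best = (sse, gaps)
    return best  # None if the events never occur in order


print(most_similar_match("ACCABBDACA", "AAA", (2, 2)))  # (1, (2, 3))
```

The result agrees with the worked example: match (a) has the smallest error, and its intervals (2, 3) would become the training input.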

Result of C2H2 zinc finger protein (1/3)

Result of C2H2 zinc finger protein (2/3)

Result of C2H2 zinc finger protein (3/3)

Result of EGF Protein (1/3)

Result of EGF Protein (2/3)

Result of EGF Protein (3/3)

Discussion

The optimization of motif patterns in both the EGF and zinc finger protein families increases the rate of true positives.

However, as the true-positive rate increases, the false-positive rate also increases.

An interesting observation is that, in comparison to the motifs suggested in PROSITE, the motifs identified by our method are more flexible and broad.

Conclusion and future work

For future research, optimization of the neuro-fuzzy system will be further investigated in order to implement fuzzy membership functions for events.