Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids Y. Wang, O. Zaiane, R. Goebel



Introduction
- Protein: linear sequence of amino acids
- Protein subcellular localization
  - Plant: nuclear, cytoplasmic, mitochondria, extracellular, …
- Intracellular vs. extracellular
  - Sequence information alone
  - Class imbalance
  - Transparency

Related Work
- N-terminal sorting signals
- Amino acid composition
- Lexical analysis
- Integrative approach
- Subsequence methods

Predicting Extracellular Proteins
- Feature Extraction
- Support Vector Machine
- Boosting
- Frequent Pattern Method

Feature Extraction
- Frequent subsequences: subsequences that occur in more than a certain percentage of extracellular proteins
  - Strong discriminative power
  - Perform similar functions via related biochemical mechanisms
  - Capture local similarity
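The definition of a frequent subsequence above can be illustrated with a small brute-force sketch. It assumes that "subsequence" means a contiguous run of amino acids and that support is counted once per protein; the function name, the `max_len` cap, and the toy sequences are illustrative and not taken from the slides. It enumerates substrings directly rather than using a generalized suffix tree.

```python
from collections import Counter


def frequent_subsequences(sequences, min_sup, min_len=3, max_len=5):
    """Return {substring: support} for substrings found in at least
    a fraction `min_sup` of the given sequences (brute-force sketch)."""
    support = Counter()
    for seq in sequences:
        seen = set()
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        # A protein contributes at most one count per substring.
        support.update(seen)
    threshold = min_sup * len(sequences)
    return {s: c / len(sequences) for s, c in support.items() if c >= threshold}


# Toy "extracellular" set; real input would be the training proteins.
toy = ["MKTLLA", "AKTLLG", "MKTAAA", "GGGGGG"]
result = frequent_subsequences(toy, min_sup=0.5)
```

On the toy set, only substrings shared by at least two of the four sequences survive. The brute-force enumeration is quadratic in sequence length per protein, which is one reason an index structure is attractive for real data.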


Generalized Suffix Tree

Support Vector Machine
- Input data represented as feature vectors
- Find a linear separator that separates the data and maximizes the margin
- Kernel function: nonlinear separator
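For reference, the maximum-margin idea is usually written as the following optimization problem. This is the standard textbook hard-margin form, not a formula from the slides, and the slides do not say which SVM variant was used. The symbols $w$, $b$, $x_i$, $y_i \in \{-1, +1\}$, $m$, and $\phi$ are notation introduced here.

```latex
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, m

K(x, z) = \langle \phi(x), \phi(z) \rangle
```

Minimizing $\lVert w \rVert$ maximizes the margin $2 / \lVert w \rVert$. Replacing inner products with a kernel $K$ yields a separator that is linear in the mapped space $\phi(\cdot)$ but nonlinear in the original features.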

SVM for extracellular protein prediction
- Data transformation (sequence → vector)
  - Frequent subsequences as features
  - Transform protein sequences into binary vectors
- Kernel functions
  - Linear kernel
  - Polynomial kernel
  - RBF kernel
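A minimal sketch of the sequence-to-vector step and the three kernels, assuming a feature is set to 1 when the frequent subsequence occurs anywhere in the protein. The function names and the parameters `degree`, `coef0`, and `gamma` are illustrative choices, not values from the slides.

```python
import math


def to_binary_vector(sequence, features):
    """1 if the frequent subsequence occurs in the protein, else 0."""
    return [1 if f in sequence else 0 for f in features]


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


features = ["KTL", "TLL", "MKT"]          # hypothetical frequent subsequences
u = to_binary_vector("MKTLLA", features)  # [1, 1, 1]
v = to_binary_vector("MKTAAA", features)  # [0, 0, 1]
```

For binary vectors, the linear kernel simply counts the frequent subsequences that two proteins share.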

Boosting
- Iterative algorithms to improve a weak classifier
- Different weighted distribution of examples in each iteration
- Increase the weights of incorrectly classified examples, and decrease the weights of correctly classified ones


AdaBoost
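The algorithm listing for this slide did not survive in the transcript. As a stand-in, here is a sketch of one reweighting round in the standard discrete AdaBoost formulation, with labels and predictions in {-1, +1}. The function name and the clamping constant are assumptions made for this sketch.

```python
import math


def adaboost_round(weights, predictions, labels):
    """One AdaBoost reweighting step; returns (alpha, new_weights)."""
    # Weighted error of the weak classifier on the current distribution.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # Clamp so that a perfect or fully wrong classifier does not hit log(0).
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified examples (y * p == -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * y * p)
               for w, p, y in zip(weights, predictions, labels)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


alpha, new_weights = adaboost_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    predictions=[1, 1, -1, 1],
    labels=[1, 1, -1, -1],
)
```

After normalization, the misclassified examples together carry half of the total weight, so the next weak classifier is pushed toward the examples the current one got wrong.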

Frequent Pattern Method
- Frequent pattern: *X1*X2*…*Xn* → extracellular
  - X1, X2, …, Xn are frequent subsequences
  - "*" can be replaced by zero to MaxGap amino acids when matching a protein sequence
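The matching rule can be sketched as a small recursive search. This version reads the slide literally and bounds every "*" by MaxGap, including the leading and trailing ones; whether the original method also bounds those two is an assumption. The function name and the example sequences are illustrative.

```python
def matches_pattern(sequence, parts, max_gap):
    """True if `sequence` matches *X1*X2*...*Xn* with each '*' standing
    for 0..max_gap amino acids (parts = [X1, ..., Xn])."""

    def search(pos, idx):
        # pos: index just after the previously placed part.
        if idx == len(parts):
            return len(sequence) - pos <= max_gap  # trailing '*'
        part = parts[idx]
        last_start = min(pos + max_gap, len(sequence) - len(part))
        for start in range(pos, last_start + 1):
            if sequence.startswith(part, start) and search(start + len(part), idx + 1):
                return True
        return False

    return search(0, 0)


ok = matches_pattern("MKTAALLG", ["KT", "LL"], max_gap=2)
too_tight = matches_pattern("MKTAALLG", ["KT", "LL"], max_gap=1)
wrong_order = matches_pattern("MLLAAKTG", ["KT", "LL"], max_gap=5)
```

The search backtracks over every admissible start position, which is simple but can be slow for long patterns; memoizing on `(pos, idx)` would be a natural refinement.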


FOIL algorithm
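The algorithm listing for this slide is not preserved. FOIL-style rule learners typically grow a rule greedily by adding the condition with the highest FOIL gain, so a sketch of that gain measure may help. This is the standard definition from the rule-learning literature; whether the slides used exactly this form is an assumption, as are the function and argument names.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """FOIL gain of extending a rule.

    p0, n0: positives/negatives covered before the extension (p0 > 0).
    p1, n1: positives/negatives covered after the extension.
    """
    if p1 == 0:
        # Convention used here: an extension covering no positives gains nothing.
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


# Starting from the empty rule on the plant dataset (171 of 3293 extracellular),
# a hypothetical condition that keeps 40 positives and 10 negatives:
gain = foil_gain(p0=171, n0=3293 - 171, p1=40, n1=10)
no_change = foil_gain(p0=10, n0=10, p1=5, n1=5)
```

The gain is positive when the extended rule covers a higher proportion of positives than before, and it is weighted by how many positives remain covered.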

Z-number
- Defined in terms of the accuracy of rule R and the support of rule R (formula and symbols not preserved)


Experiments
- Dataset (PASub project at UofA)
  - Plant: 3293 proteins, 171 extracellular
- Five-fold cross-validation
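With only 171 of 3293 proteins in the positive class, how the folds are drawn matters. The sketch below builds five stratified folds so that each fold keeps roughly the same class ratio. The slides do not state whether the split was stratified, so this is an assumption, and the function name and seed are illustrative.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


# Class sizes from the slide: 171 extracellular (1), the rest intracellular (0).
labels = [1] * 171 + [0] * (3293 - 171)
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold serves once as the test set while the other four are used for training.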

Evaluation Metric
- Overall accuracy is not good enough
- F-measure
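The point about accuracy can be made concrete with the dataset sizes given on the Experiments slide. A classifier that labels every protein as intracellular is right about 94.8% of the time yet finds no extracellular protein, and the F-measure exposes this. The sketch uses the standard F-beta definition; the function name and the default `beta` are choices made here.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


total, positives = 3293, 171

# Trivial classifier: predict "intracellular" for everything.
accuracy_all_negative = (total - positives) / total
f_all_negative = f_measure(tp=0, fp=0, fn=positives)
```

Because the F-measure is computed on the extracellular class only, the large intracellular majority cannot inflate it.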

Result (SVM with subsequence)

Result (Boosting with subsequence)

Result (Frequent Pattern)
- MinLen = 3
- Min_gain = 0.1
- MinSup = 5%
- MinConf = 80%
- MaxGap = 300

Result (SVM with composition)

Result (Boosting with composition)

Cross Comparison


SVM with combined features


Boosting with combined features


Effects of MinLen on SVM


Effects of MinLen on boosting

Conclusion
- Presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids
- SVM achieves the best result
- FSP method provides easily interpretable rules

Future Work
- Use more information about proteins (e.g., structure, function, …)
- Integrate amino acid composition into the FSP method
- Incorporate more biological knowledge