Bioinformatics of Protein Domains: New Computational Approach for


Maricel Kann. Feb-08

Bioinformatics of Protein Domains: New Computational Approach for the Detection of Protein Domains

Maricel Kann, Assistant Professor

University of Maryland, Baltimore County
mkann@umbc.edu


GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAGACAGGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAAGTTCTACTAAGGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATATTTAGGATATACCTCGAAAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGCAGTACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCACAACCTAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAACTTATTTTCTTATTCTTTACTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGAAGAACAGAACAATTACTTAATAGAAAAATATCTTCCTCGAAGGCTAATCGATAACTGACGATTTCCTGCTTCCAACATCTACGTATATCAAGAAGCATTCACTTAATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATATATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACAGTTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCACGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATCGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATCAACTTCGAAGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAGTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAACAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATTTCGACAAGTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACATCATTGGCATGGATTCAAAGAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAA
GAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTACTTACACTGCAAAAATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAAAACTTCATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTACTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGATTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAACCAAAGTAAAGAAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATACTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGAAGCACCAGAGAAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAAAATCAGTAACACCATCACCATATAACGTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAGGTAAAAACGGAATCACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAA

The Human Genome Project

© Sidney Harris


Protein Classification

A L I G N M E N T

A L I G G N M E N

QUERY

Set of related sequences or protein family from database

A L I G N M E N T

A L I G G N M E N

4 3 4 7 1 2 -2 0 0

Alignment Algorithm

Scoring Function

Accurate Statistics

A L I G - N M E N T

A L I G G N M E N -

score=19
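As a worked check of the score above, here is a minimal sketch that sums per-column scores for a pair of aligned rows. It assumes that the nine numbers on the slide are the substitution scores of the nine columns of the ungapped alignment, in order; the `PAIR_SCORES` table and the `alignment_score` helper are hypothetical names built only from those nine values, not a real PAM, BLOSUM or OPTIMA matrix.

```python
# Assumption: column scores 4 3 4 7 1 2 -2 0 0 belong, in order, to the
# nine columns of the ungapped alignment ALIGNMENT / ALIGGNMEN.
PAIR_SCORES = {
    ("A", "A"): 4, ("L", "L"): 3, ("I", "I"): 4, ("G", "G"): 7,
    ("N", "G"): 1, ("M", "N"): 2, ("E", "M"): -2, ("N", "E"): 0,
    ("T", "N"): 0,
}


def alignment_score(row_a, row_b, pair_scores, gap_penalty=0):
    """Sum the column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap_penalty          # one penalty per gapped column
        else:
            total += pair_scores[(a, b)]  # substitution score for this column
    return total


total = alignment_score("ALIGNMENT", "ALIGGNMEN", PAIR_SCORES)
print(total)
```

Under that assumption the column scores add up to the total shown on the slide.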

PAM: Dayhoff et al. (1978); BLOSUM: Henikoff & Henikoff (1992); OPTIMA: Kann et al. (2000).


Significance of a score

Estimated number of non-related sequences in the database that score higher than the query

$$E = p(S_Q < S_R)\, D$$

D= size of database


[Figure: distribution of random alignment scores. x-axis: alignment score S, with the query score S_Q marked; y-axis: number of alignments with score S.]

$$p(S_Q < S_R) = 1 - \exp\left[-K M N e^{-\lambda S_Q}\right]$$
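The two formulas on these slides can be evaluated directly. The sketch below assumes the standard extreme-value form in which the p-value depends on the query score S_Q; the parameter values for K, λ, M, N and D are purely illustrative and do not come from the talk.

```python
import math


def p_value(score, K, lam, M, N):
    """P(random score exceeds `score`) = 1 - exp(-K*M*N*exp(-lam*score))."""
    expected_hits = K * M * N * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)


def e_value(score, K, lam, M, N, D):
    """E = p * D: expected number of unrelated database entries scoring higher."""
    return p_value(score, K, lam, M, N) * D


# Illustrative numbers only (not taken from the slides).
K, lam, M, N, D = 0.13, 0.32, 300, 300, 100_000
for s in (40, 50, 60):
    print(s, p_value(s, K, lam, M, N), e_value(s, K, lam, M, N, D))
```

For small p-values the expression reduces to roughly K·M·N·e^(−λS), which is why the E-value drops exponentially as the score grows.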


Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
  • Definition of protein domain
  • Main features of the Conserved Domain Database (CDD)
  • Position-specific scoring matrices (PSSM)
  • Classification of alignment methods
– Current methods for protein domain searches
– Our approach (Global Blocks Aligned Locally)
– Results


The term protein domain (or domain) refers to a region of the protein with compact structure, usually with a hydrophobic core.


Conserved Domains

• In 1974 Michael Rossmann recognized the NADH-binding domain in several dehydrogenases (the Rossmann fold is named after him).

• Conserved domains are determined by comparative sequence analysis.

• Molecular evolution uses such domains as building blocks.

• They may be recombined in different arrangements to make proteins with different functions.

• Most proteins contain multiple domains (65% in eukaryotes, 40% in prokaryotes), giving rise to a variety of combinations of domains.


CDD: a collection of domain multiple alignments linked to protein 3D structure

Marchler-Bauer et al. (2003) NAR 383:387

heme-binding site

It combines information about protein sequences, their conservation patterns across evolution, and protein structure, and provides useful functional annotation.


Protein Classification

QUERY

Set of related sequences or protein family from database

Alignment Algorithm

Scoring Function

Accurate Statistics

PSSM can be derived from the MSA

A PSSM, or Position-Specific Scoring Matrix (or profile), is a type of scoring matrix in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment.
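To make the definition concrete, here is a minimal sketch of deriving a log-odds PSSM from a toy multiple sequence alignment. It is not the procedure used by CDD or PSI-BLAST: the uniform background frequencies, the flat pseudocount and the names `build_pssm` and `score_ungapped` are all simplifying assumptions for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*msa):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_ungapped(pssm, segment):
    """Score a segment placed column by column against the PSSM."""
    return sum(col[res] for col, res in zip(pssm, segment))


toy_msa = ["ALIGN", "ALIGN", "AVIGN"]  # hypothetical aligned family
pssm = build_pssm(toy_msa)
print(score_ungapped(pssm, "ALIGN"), score_ungapped(pssm, "WWWWW"))
```

Unlike a single substitution matrix, the score for a given residue differs from column to column: in the second column above, L scores higher than V, and both score higher than residues never observed there.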


MSA contains conserved blocks


Protein Structure Alignment

[Figure labels: α-helix; β-strand; loops; red sequence; blue sequence]

Subsequences corresponding to secondary structure elements (SSEs: α-helices and β-strands) are more conserved than the intervening loops.

Protein Sequence Conservation Occurs in Blocks with Intervening Gaps


CDD representation

[Diagram labels: gap, gap; 1, 2; CDD footprint]


Sequence-PSSM alignment

A L I G N M E N T


Sequence-PSSM alignment

[Diagram: the query on one axis and the PSSM, made of three blocks, on the other. Labels: block, block, block; query; PSSM; Gaps in Query; Gaps in PSSM]


Three Types of Sequence Alignments

Semi-Global Alignment: Subsequence Onto Sequence

Local Alignment: Subsequence To Subsequence

Global Alignment: Sequence To Sequence

BW Erickson & P Sellers (1983) Time Warps, String Edits, and Macromolecules, p. 55
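The three alignment types differ only in which end gaps are free and where the best score is read from the dynamic-programming table. The sketch below is a textbook formulation with a linear gap penalty and a simple match/mismatch score, both of which are illustrative assumptions; `align_score` is a hypothetical helper, not code from any of the tools discussed in this talk.

```python
def align_score(a, b, mode="global", match=1, mismatch=-1, gap=-2):
    """Best alignment score of a against b.

    global:     all of a against all of b
    semiglobal: all of a against any subsequence of b (end gaps in b are free)
    local:      any subsequence of a against any subsequence of b
    """
    if mode not in ("global", "semiglobal", "local"):
        raise ValueError("unknown mode: " + mode)
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Leaving part of a unaligned is free only for local alignment.
        H[i][0] = 0 if mode == "local" else i * gap
    for j in range(1, m + 1):
        # Skipping a prefix of b costs gaps only in global alignment.
        H[0][j] = j * gap if mode == "global" else 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if mode == "local":
                h = max(h, 0)        # a local alignment may restart anywhere
                best = max(best, h)  # and may end anywhere
            H[i][j] = h
    if mode == "global":
        return H[n][m]
    if mode == "semiglobal":
        return max(H[n])             # a is used up; the suffix of b is free
    return best


for mode in ("global", "semiglobal", "local"):
    print(mode, align_score("ALIGN", "XXALIGNXX", mode))
```

In the usage example the short sequence plays the role of a domain and the long one the role of a query protein: the global score is dragged down by the flanking residues, while the semi-global and local scores are not.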


Semi-global Alignment

• For finding a complete domain in the query, semi-global alignment is the natural choice in the context of protein structure, function and evolution.

[Diagram label: query sequence]


Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
– Current methods for protein domain searches
  • RPS-BLAST
  • HMMer
  • SALTO
– (Global Blocks Aligned Locally)
– Derivation of Statistics
– Results


Reverse Position-Specific BLAST (RPS-BLAST)

[Diagram labels: block, block, block; query; PSSM]

rpsBLAST doesn't incorporate the concept of "block"

A Schaffer et al (1999) Bioinformatics 15:1000

The role of the PSSM has changed from being the "query" in PSI-BLAST to being the "subject", hence the term "reverse" in RPS-BLAST

(Reverse Position-Specific)


HMM

HMMer is trained on the CDD sequences.

HMMer does not specifically incorporate the concept of "block".


HMMer's Statistics are a (Poor) Empirical Fit

• HMMer fits the extreme value distribution (EVD) parameters λ and K to simulated sequences with a Gaussian length distribution.

• The Gumbel E-value approximation of HMMer semi-global is sometimes very inaccurate.

ftp://ftp.genetics.wustl.edu.pub/eddy/hmmer_CURRENT/Userguide.pdf
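For readers unfamiliar with empirical EVD fitting, the sketch below shows the general idea on synthetic data: simulate scores, fit a Gumbel distribution, and use the fit to estimate p-values. It uses a method-of-moments fit, which is a deliberately simple stand-in and not HMMer's own fitting procedure; the location parameter μ relates to the K of the slides through μ = ln(KMN)/λ. All function names and parameter values are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores):
    """Method-of-moments fit; returns (mu, lam) of a Gumbel distribution."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6) / math.pi   # Gumbel scale parameter
    mu = mean - EULER_GAMMA * beta       # Gumbel location parameter
    return mu, 1.0 / beta


def gumbel_p_value(score, mu, lam):
    """P(S >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def sample_gumbel(rng, mu, lam):
    """Draw one Gumbel-distributed score by inverting the CDF."""
    u = rng.random()
    while u == 0.0:                      # avoid log(0)
        u = rng.random()
    return mu - math.log(-math.log(u)) / lam


rng = random.Random(1)
simulated = [sample_gumbel(rng, 20.0, 0.3) for _ in range(20000)]
mu_hat, lam_hat = fit_gumbel(simulated)
print(mu_hat, lam_hat, gumbel_p_value(45.0, mu_hat, lam_hat))
```

Any such fit is only as good as the match between the simulated scores and the Gumbel form, which is the weakness this slide points at: small errors in λ are amplified exponentially in the tail, where the significant scores lie.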


SALTO

Structure-based ALignment TOol

[Diagram labels: gap, gap; SALTO; G-SALTO]

Kann MG et al. Bioinformatics, 21(8):1451-6. (2005)


Properties of an Ideal Alignment Method

• Semi-global alignment is intrinsically the right tool for searching for domains within proteins.
  – Local alignment methods match only a portion of a domain against a query.
  – Reverse Position-Specific BLAST (rpsBLAST)

• Screening a database for matches needs to be fast.
  – HMMs have no intrinsic heuristics to speed computation.
  – The word heuristics in rpsBLAST speed screening and are available for any local alignment method.

• Accurate statistics.


GLOBAL (Global Blocks Aligned Locally)

A Semi-global Alignment Method for Querying a Database of Protein Domains with Accurate Statistics

M G. Kann et al (2007) NAR, 35(14):4678-4685.


Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
– Current methods for protein domain searches
– Method (Global Blocks Aligned Locally)
  • Algorithm and scoring scheme
  • Derivation of Statistics
– Results


GLOBAL: aligns blocks locally

[Diagram labels: gap, gap; G-SALTO; GLOBAL]


GLOBAL

[Diagram: QUERY on one axis, PSSM on the other]

GLOBAL algorithm:

• Uses dynamic programming (DP) to find the alignment of a protein query sequence to all blocks of the PSSM (in order).

• Penalty = 0 both for unaligned regions of the PSSM at the ends of the blocks and for unaligned regions of the query between blocks.
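The two bullets above can be illustrated with a small dynamic program. This is only a sketch of the idea, not the published GLOBAL implementation: it assumes ungapped alignments inside each block, requires every block to align at least one column, and uses hypothetical names (`chain_blocks`, `default`) and toy PSSM columns.

```python
NEG = float("-inf")


def chain_blocks(query, blocks, default=-1):
    """Best total score for aligning every block, in order, to the query.

    blocks: list of blocks; each block is a list of {residue: score} columns.
    Unaligned block ends and query residues between blocks cost nothing.
    Assumption: no gaps inside a block; unlisted residues score `default`.
    """
    n = len(query)
    # prev[j]: best score of the blocks handled so far within query[:j]
    prev = [0.0] * (n + 1)
    for block in blocks:
        diag = [NEG] * (n + 1)   # segments ending at the previous column
        ends = [NEG] * (n + 1)   # best segment of this block ending at j
        for col in block:
            h = [NEG] * (n + 1)
            for j in range(1, n + 1):
                # Start the block at this column (free) or extend a segment.
                start = max(prev[j - 1], diag[j - 1])
                if start > NEG:
                    h[j] = start + col.get(query[j - 1], default)
                    ends[j] = max(ends[j], h[j])
            diag = h
        cur = [NEG] * (n + 1)
        for j in range(1, n + 1):
            cur[j] = max(cur[j - 1], ends[j])  # query after the block is free
        prev = cur
    return prev[n]


def toy_block(motif, match=5):
    """Hypothetical block: each column rewards exactly one residue."""
    return [{res: match} for res in motif]


blocks = [toy_block("ALI"), toy_block("MEN")]
print(chain_blocks("XXALIGGGMENXX", blocks))
print(chain_blocks("MENALI", blocks))
```

In the second call the motifs appear in the wrong order, so the ordered chain cannot collect both full block scores, which is the behaviour that distinguishes this scheme from an unconstrained local search.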


GLOBAL: statistics for b blocks

[Diagram: QUERY on one axis, PSSM on the other]

$L_{\mathrm{eff}}$ = effective length

Assuming the scores of the blocks are independent of each other, GLOBAL estimates the total alignment p-value with a convolution algorithm.

For b blocks, the total alignment score T is:

$$T = \sum_{i=1}^{b} \hat{M}_i(L_{\mathrm{eff}})$$

$$L_{\mathrm{eff}} = \left[ \frac{(n + b - 1)!}{(n - 1)!\; b!} \right]^{1/b}$$

n = size of query
b = number of blocks in the PSSM

e.g., n=160, b=3, Leff=89
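The effective-length example and the convolution step can both be checked numerically. The sketch below reads the effective length as the b-th root of a binomial coefficient, C(n+b−1, b), an interpretation that reproduces the slide's example. The convolution part assumes per-block score distributions given as discrete probability mass functions over integer scores starting at 0; that representation, and the function names, are illustrative assumptions rather than the published derivation.

```python
import math


def effective_length(n, b):
    """L_eff = [(n + b - 1)! / ((n - 1)! b!)]^(1/b) = C(n + b - 1, b)^(1/b)."""
    return math.comb(n + b - 1, b) ** (1.0 / b)


def convolve(p, q):
    """Distribution of the sum of two independent integer-valued scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def total_p_value(block_pmfs, t):
    """P(T >= t) for T = sum of independent block scores (pmf index = score)."""
    total = [1.0]                 # distribution of an empty sum
    for pmf in block_pmfs:
        total = convolve(total, pmf)
    return sum(total[t:])


print(round(effective_length(160, 3)))
# Toy example: two blocks, each scoring 0 or 1 with equal probability.
print(total_p_value([[0.5, 0.5], [0.5, 0.5]], 2))
```

The independence assumption stated on the slide is what justifies multiplying probabilities inside `convolve`.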


Outline

A new computational approach for the detection of protein domains: Semi-Global Alignment of Protein Domains with accurate statistics.

– Introduction
– Current methods for protein domain searches
– Method (Global Blocks Aligned Locally)
– Results
  • Benchmarking database
  • ROC (L-ROC) curves
  • P-value accuracy


Benchmarking test set

Database of queries: ~10,000 sequences with known structure (from the MMDB database).

To define a true relationship to a CDD entry, a query sequence needs to be a structure neighbor (using VAST) of a protein in the CD for which the structure is known.

The resulting test set has >300 families with almost 30,000 known true positives.


Benchmarking test set

[Figure: two plots of the number of true positives against the percentage of sequence identity (VAST). x-axis: 20 to 100; y-axes: up to 2,000 and up to 18,000.]


ROC curve for GLOBAL

[Figure: ROC curves for GLOBAL, HMMer-semi-global, HMMer-local and RPS-BLAST. x-axis: fraction of false positives (0.00 to 0.05); y-axis: fraction of true positives (0.00 to 0.40).]

                    LROC10000*   LROC50000   LROC200000
GLOBAL              0.181        0.224       0.313
HMMer semiglobal    0.185        0.224       0.299
HMMer local         0.169        0.194       0.239
rpsBLAST            0.168        0.192       0.229

*LROC: Swensson RG: Med Phys 1996, 23(10):1709-25
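As background for reading truncated ROC summaries, the sketch below computes a generic ROC_n score: the normalized area under the ROC curve up to the first n false positives. This is a related, commonly used measure and is not claimed to be the LROC statistic of Swensson cited above; the `roc_n` name and the toy rankings are illustrative.

```python
def roc_n(ranked_labels, n):
    """Normalized area under the ROC curve up to the n-th false positive.

    ranked_labels: booleans sorted by decreasing score (True = true positive).
    Returns a value between 0 and 1; 1 means every true positive ranks
    above the first false positive.
    """
    total_tp = sum(ranked_labels)
    if total_tp == 0 or n <= 0:
        raise ValueError("need at least one true positive and n > 0")
    tp_seen = fp_seen = area = 0
    for is_tp in ranked_labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen          # true positives ranked above this FP
            if fp_seen == n:
                break
    # If the list holds fewer than n false positives, the curve stays flat.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_tp)


print(roc_n([True, True, False, False], 2))
print(roc_n([True, False, True, False], 2))
print(roc_n([False, True], 1))
```

Truncating at a fixed number of false positives focuses the comparison on the high-confidence end of the ranking, which is the region shown on the x-axis of the figure.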


P-value accuracy

[Figure: two panels, GLOBAL and HMMer, for the domains Cd00030, Cd00083 and Cd00288. x-axis: true P-value (1.E-07 to 1.E+00); y-axis: estimated P-value / true P-value (0.1 to 1000, log scale).]

1,000,000 simulations using random sequences of length 350
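The quantity plotted in this figure, estimated p-value divided by true p-value, can be obtained by simulation. The sketch below shows the mechanics on a toy problem: the "true" p-value is the fraction of random sequences of length 350 that reach a score threshold. The scoring function (best ungapped match to a short motif), the uniform residue frequencies and all names are stand-in assumptions, and the default number of trials is far below the 1,000,000 used in the talk.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def best_window_score(seq, motif, match=1, mismatch=-1):
    """Toy score: best ungapped match of the motif anywhere in the sequence."""
    w = len(motif)
    if len(seq) < w:
        raise ValueError("sequence shorter than motif")
    return max(
        sum(match if a == b else mismatch for a, b in zip(seq[i:i + w], motif))
        for i in range(len(seq) - w + 1)
    )


def empirical_p_value(threshold, motif, length=350, trials=2000, seed=0):
    """Fraction of random sequences whose score reaches the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if best_window_score(seq, motif) >= threshold:
            hits += 1
    return hits / trials


def accuracy_ratio(estimated_p, true_p):
    """Ratio plotted on the y-axis; 1 means a perfectly calibrated estimate."""
    return estimated_p / true_p


true_p = empirical_p_value(3, "ALIGN", trials=500)
print(true_p)
```

Resolving p-values as small as 1.E-07 this way requires on the order of millions of random sequences, which is why the simulation behind the figure is so large.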


Conclusions

• The GLOBAL algorithm and p-value provide a flexible format for semi-global sequence alignments.

• GLOBAL respects block structure but adds flexibility at the ends of each block.

• The GLOBAL p-value is based on local alignment p-values. BLAST heuristics from local alignment therefore apply to GLOBAL.

• While the overall performance is similar to that of HMMer semi-global, GLOBAL has more accurate statistics, and the possibility of implementing heuristics similar to those used in local methods could make it orders of magnitude faster.


Future work

• Implementation of GLOBAL:
  – "Blockalizer": creates blocks within the MSA.
  – Heuristics to increase the speed.

• Optimization of domain discovery: Can we mix and match methods/CDs?


Acknowledgments

SALTO:
• Stephen Altschul, Anna Panchenko, Paul Thiessen and Steve Bryant.

GLOBAL:
• John Spouge, Sergey Sheetlin and Yonil Park.

PROTEIN INTERACTIONS:
Predicting protein-protein interaction by searching evolutionary tree automorphism space
• Teresa Przytycka and Raja Jothi.
Predicting protein domain interactions from co-evolution of conserved regions: Teresa Przytycka, Praveen Cherukuri and Raja Jothi.

• UMBC Computational Biology lab team.


Kann's Computational Biology lab.
