www.sciencesignaling.org/cgi/content/full/5/215/rs1/DC1
Supplementary Materials for
Proteome-Wide Discovery of Evolutionary Conserved Sequences in Disordered Regions
Alex N. Nguyen Ba, Brian J. Yeh, Dewald van Dyk, Alan R. Davidson, Brenda J. Andrews, Eric L. Weiss, Alan M. Moses*
*To whom correspondence should be addressed. E-mail: [email protected]
Published 13 March 2012, Sci. Signal. 5, rs1 (2012) DOI: 10.1126/scisignal.2002515
This PDF file includes:
Fig. S1. Schematic of the phylo-HMM approach.
Fig. S2. Regions with no conserved segments are not detected by the phylo-HMM approach.
Fig. S3. Newly identified KEN box in Spt21 and Cbk1 interaction motif in Ssd1 are conserved in further yeast species.
Fig. S4. Simulation of protein evolution.
Fig. S5. Performance of the phylo-HMM approach on literature-curated short linear motifs.
Fig. S6. Binding of FxFP peptides to Cbk1.
Fig. S7. Phylogenetic tree of species used for this study.
Tables S1 to S4 legends
Table S5. Annotation of top 20 clusters from different distance metrics of predicted short conserved sequences.
Other Supplementary Material for this manuscript includes the following: (available at www.sciencesignaling.org/cgi/content/full/5/215/rs1/DC1)
Table S1 (Microsoft Excel format). Predictions on the yeast proteome by the phylo-HMM approach.
Table S2 (Microsoft Excel format). Literature-curated characterized short linear motifs.
Table S3 (Microsoft Excel format). Enrichment analysis of motifs matching known consensus sequences.
Table S4 (Microsoft Excel format). Clusters of similar short conserved sequences.
[Fig. S1 image: two states labeled "Conserved" and "Background", with rate parameters αc and αw, and example alignment columns.]
Fig. S1. Schematic of the phylo-HMM approach. In the phylo-HMM framework, a column of the sequence alignment is assumed to belong to either a conserved state (“Conserved”) or a background state (“Background”) and probabilities of observing alignment columns in each state depend on a phylogenetic model of protein evolution with a rate parameter α.
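The two-state idea in this caption can be illustrated with a standard forward-backward pass over a two-state hidden Markov model. The following is a minimal sketch, not the authors' implementation: it assumes the per-column likelihoods under each state have already been computed by some phylogenetic model, and the function name `posterior_conserved` and the transition and start probabilities (`stay_c`, `stay_b`, `start_c`) are hypothetical values chosen only for illustration.

```python
def posterior_conserved(emit_c, emit_b, stay_c=0.9, stay_b=0.99, start_c=0.5):
    """Posterior probability of the conserved state for each alignment column.

    emit_c[t], emit_b[t]: likelihood of column t under the conserved and
    background states (assumed precomputed and strictly positive).
    stay_c, stay_b, start_c: hypothetical self-transition and start
    probabilities, not values taken from the paper.
    """
    n = len(emit_c)
    if n == 0:
        return []
    emit = [emit_c, emit_b]              # state 0 = conserved, 1 = background
    trans = [[stay_c, 1.0 - stay_c],     # transitions out of conserved
             [1.0 - stay_b, stay_b]]     # transitions out of background

    # Forward pass, rescaled at every column to avoid numerical underflow.
    fwd = []
    cur = [start_c * emit_c[0], (1.0 - start_c) * emit_b[0]]
    total = sum(cur)
    cur = [v / total for v in cur]
    fwd.append(cur)
    for t in range(1, n):
        prev = cur
        cur = [emit[j][t] * sum(prev[i] * trans[i][j] for i in range(2))
               for j in range(2)]
        total = sum(cur)
        cur = [v / total for v in cur]
        fwd.append(cur)

    # Backward pass, also rescaled at every column.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        cur = [sum(trans[i][j] * emit[j][t + 1] * bwd[t + 1][j] for j in range(2))
               for i in range(2)]
        total = sum(cur)
        bwd[t] = [v / total for v in cur]

    # Combine and normalize per column; the scaling factors cancel out.
    post = []
    for t in range(n):
        c = fwd[t][0] * bwd[t][0]
        b = fwd[t][1] * bwd[t][1]
        post.append(c / (c + b))
    return post
```

Called with column likelihoods that favor the conserved state over a short run of columns, the function returns values near 1 inside that run and near 0 in the flanks, which is the kind of posterior trace plotted in Fig. S2.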
Fig. S2. Regions with no conserved segments are not detected by the phylo-HMM approach. Posterior trace of the region 420-520 in the alignment of Swi5. No locally conserved segments were identified. The region shown corresponds to positions 266-322 in S. cerevisiae. Red color intensity represents the posterior probability of the conserved state.
[Fig. S2 image: multiple sequence alignment of Swi5 over alignment positions 420-520, with a trace of posterior probability (0 to 1) plotted against position in alignment.]
Fig. S3. Newly identified KEN box in Spt21 and Cbk1 interaction motif in Ssd1 are conserved in further yeast species. (A) Alignment of the KEN box (in rectangle) in Spt21 with distantly related yeast species. We note that the KEN motif in the Candida clade does not align with species used in our study when performing a whole gene alignment. However, the motif is well conserved. (B) Alignment of one of the Cbk1 interaction motifs (in rectangle) in Ssd1 with distantly related yeast species. In both panels, species used for the phylo-HMM analysis are labelled with a vertical black bar.
[Fig. S3 image: panel A, Spt21; panel B, Ssd1. Sequence labels (species/residue range) of the two alignments, in the order extracted:]
S.cer/458-469, S.par/460-471, S.mik/462-473, S.bay/462-473, K.lac/371-392, K.the/387-398, K.wal/389-400, A.gos/372-383, S.cas/392-403, Z.rou/373-384, S.klu/418-429, C.gla/416-427, C.lus/586-597, D.han/611-622, C.gui/548-559, C.tro/603-614, C.alb/616-627, C.par/620-631, L.elo/334-345
S.cer/233-241, S.par/240-248, S.mik/235-443, S.bay/241-249, K.lac/194-203, K.the/230-239, K.wal/232-241, A.gos/223-232, S.cas/231-239, Z.rou/242-251, S.klu/227-236, C.gla/302-310, K.pol/243-252, P.sti/191-200, C.lus/176-184, C.tro/198-207, D.han/231-240, L.elo/259-268, Y.lip/236-245, U.ree/275-284, A.nig/265-274, P.chr/261-270, S.scl/290-299
Fig. S4. Simulation of protein evolution. (A) Simulations of protein evolution were performed by randomly generating a protein sequence containing a motif in an unstructured region. An alignment of a typical simulated protein is shown with the motif, properly aligned, boxed in red. (B) Alignment accuracy of the motif using MAFFT depends on the background rate of evolution and on the rate of motif evolution. (C) The sensitivity of the phylo-HMM on simulated data shows strong dependence on the rate of evolution of the motif relative to the rate of evolution of the background. (D) The rate of computational artifacts of the phylo-HMM on simulated data depends on the background rate of evolution. Each point represents the results from 100 simulated proteins.
[Fig. S4 image. Panel A: simulated protein layout labeled "Unstructured region", "Unstructured region", "Domain". Panel B: alignment accuracy (fraction of correct motif alignment) versus motif rate of evolution. Panel C: sensitivity versus motif rate of evolution relative to background. Panel D: rate of computational artifacts (per unstructured amino acids, log scale) versus motif rate of evolution. Plot legends: background rate = 0.7, 1, and 1.3.]
[Fig. S5 image: sensitivity (y-axis, 0 to 0.8) for three bins of fraction of conservation: >=0.9, 0.5-0.9, and <0.5.]
Fig. S5. Performance of the phylo-HMM approach on literature-curated short linear motifs. On characterized short linear motifs, the phylo-HMM performs best on conserved regulatory sequences. The data shown here include only motifs with consensus sequences (no localization signals). Regulatory motifs were binned on the basis of the fraction of species in which the consensus sequence could be found (Fraction of conservation) and sensitivity of the phylo-HMM was calculated for each bin (see Methods).
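The binning described in this caption can be sketched as follows. This is an illustrative sketch only: the bin labels follow the figure's axis (>=0.9, 0.5-0.9, <0.5), the assignment of a fraction of exactly 0.9 to the top bin is an assumption, and the function names and the input records are hypothetical, not data from Table S2.

```python
def bin_label(fraction):
    """Map a fraction of conservation to one of the three bins on the figure axis."""
    if fraction >= 0.9:
        return ">=0.9"
    if fraction >= 0.5:
        return "0.5-0.9"
    return "<0.5"


def sensitivity_by_bin(motifs):
    """Compute per-bin sensitivity.

    motifs: iterable of (fraction_of_conservation, predicted) pairs, where
    predicted is True if the known motif was recovered by the predictor.
    Returns a dict mapping each non-empty bin label to recovered / total.
    """
    counts = {}
    for fraction, predicted in motifs:
        label = bin_label(fraction)
        found, total = counts.get(label, (0, 0))
        counts[label] = (found + (1 if predicted else 0), total + 1)
    return {label: found / total for label, (found, total) in counts.items()}


# Made-up records for illustration only.
example = [(0.95, True), (0.92, False), (0.60, True), (0.30, False)]
result = sensitivity_by_bin(example)
```

Here sensitivity is taken to mean the fraction of known motifs in a bin that were predicted; empty bins are simply absent from the returned dictionary.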
[Fig. S6 image: pulldown blot probed with α-GST. Lane labels: MBP alone; MBP fusions of Ace2 270-290, Bop3 135-178, Fir1 396-439, Ptp3 351-397, Ssd1 185-258, and Tao3 1-41; 10% load of Cbk1∆1-351.]
Fig. S6. Binding of FxFP peptides to Cbk1. Fragments from proteins identified in the [YF]xFP cluster (Fig. 5C) were expressed as maltose-binding protein (MBP) fusions and immobilized on amylose resin. The beads were assayed for binding to GST-tagged Cbk1 (Cbk1∆1-351) in a pulldown assay. At the exposure time shown in Figure 6, the lane showing 10% of the GST-tagged Cbk1 input (on a different nitrocellulose membrane but imaged at the same time and incubated under the same conditions) is overexposed. A reduced exposure of the assay is shown in this figure for comparison.
Fig. S7. Phylogenetic tree of species used for this study. Protein sequences from related yeast species were used to predict short conserved sequences by our phylo-HMM approach. The vertical black line indicates the four closest species of yeast that were used to obtain the amino acid substitution model. The branch lengths were estimated from random concatenations of proteins (See Methods) and the Newick format tree representation is the following: ((((((((Scer: 0.0317431, Spar: 0.022837): 0.0200533, Smik: 0.0499671): 0.0302187, Sbay: 0.0545286): 0.1984531, Cgla: 0.3382801): 0.042494, Scas: 0.2796456): 0.0506147, Kpol: 0.3061641): 0.0374508, Zrou: 0.2684242): 0.0489862, ((Klac: 0.3075333, Agos: 0.3326945): 0.0600818, ((Kwal: 0.1277869, Kthe: 0.1223815): 0.1697185, Sklu: 0.1779484): 0.0492655): 0.0622827);
[Fig. S7 image: phylogenetic tree with tip labels S.cer, S.par, S.mik, S.bay, C.gla, S.cas, K.pol, S.klu, Z.rou, A.gos, K.lac, K.wal, K.the.]
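The Newick string given in the Fig. S7 caption can be read with a small recursive parser. The sketch below is a minimal illustration rather than a general Newick reader: it handles only unquoted names and numeric branch lengths, which is all this tree uses, and the names `TREE`, `parse_newick`, and `leaf_depths` are hypothetical.

```python
TREE = ("((((((((Scer: 0.0317431, Spar: 0.022837): 0.0200533, Smik: 0.0499671): "
        "0.0302187, Sbay: 0.0545286): 0.1984531, Cgla: 0.3382801): 0.042494, "
        "Scas: 0.2796456): 0.0506147, Kpol: 0.3061641): 0.0374508, "
        "Zrou: 0.2684242): 0.0489862, ((Klac: 0.3075333, Agos: 0.3326945): "
        "0.0600818, ((Kwal: 0.1277869, Kthe: 0.1223815): 0.1697185, "
        "Sklu: 0.1779484): 0.0492655): 0.0622827);")


def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = "".join(text.split()).rstrip(";")   # drop whitespace and final ';'
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                            # consume '('
            children.append(node())
            while text[pos] == ",":
                pos += 1                        # consume ','
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                            # consume ')'
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = 0.0                            # missing length treated as zero
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()


def leaf_depths(node, depth=0.0, out=None):
    """Return a dict mapping each leaf name to its summed branch length from the root."""
    if out is None:
        out = {}
    here = depth + node["length"]
    if node["children"]:
        for child in node["children"]:
            leaf_depths(child, here, out)
    else:
        out[node["name"]] = here
    return out


depths = leaf_depths(parse_newick(TREE))
```

The resulting `depths` dictionary has one entry per species in the tree and gives the root-to-tip distance in the same units as the branch lengths.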
Legends to Tables S1-S4
Table S1. Predictions on the yeast proteome by the phylo-HMM approach. Short conserved sequences were predicted on the yeast proteome using our phylo-HMM approach. Each motif is represented by a unique identifier, as well as the gene that contains it (systematic name), its position and sequence within the S. cerevisiae protein.
Table S2. Literature-curated characterized short linear motifs. A set of short linear motifs in yeast proteins was curated from the literature. Each short linear motif is defined by the gene that contains it (standard name and systematic name), the reported function of the motif within the gene, the regulator that interacts with the motif, its position within the protein and the sequence of amino acids reported. The reference for this information is included for each short linear motif. The following columns indicate whether the motif is found in unstructured regions (by our definition) and whether the motif was predicted by the phylo-HMM. If the motif could be defined by a consensus sequence and was found in unstructured regions, the fraction of species that contained this consensus sequence is shown. For short linear motifs that matched a known consensus, if this fraction was above 90%, we considered the consensus sequence to be conserved. For short linear motifs that did not match a known consensus, in some cases we inspected the alignment by eye and indicated these as ‘considered conserved’ if we could identify them in many species.
Table S3. Enrichment analysis of motifs matching known consensus sequences. Three tables describe the functional enrichment analysis of proteins (labeled by their systematic name) with motifs matching three known consensus sequences. The first table describes whether proteins known to be phosphorylated and known not to be phosphorylated by the cyclin-dependent kinase Cdc28 contain a phylo-HMM prediction with a conserved phosphorylation site consensus sequence ([ST]Px[RK]). The second table lists the proteins with an FG-dipeptide sequence, whether they are FG-repeat containing nucleoporins, whether they are involved in non-nuclear protein transport or sorting, and whether they contain a phylo-HMM prediction with a conserved FG-dipeptide sequence. The last table indicates the proteins that contain a predicted KEN-box consensus sequence, and the experimental line of evidence that describes APC/C-mediated degradation.
Table S4. Clusters of similar short conserved sequences. Three tables show the results from three different cluster analyses that used different metrics to assign sequence similarity between motifs. The first table describes the results using an all-by-all pairwise distance measure (described in Methods) between the predicted segments. The second table describes the results after first extending either side of the motifs by five residues before applying the all-by-all distance measure. Using the same extended motifs, the third table describes the results considering only the top 10 most similar motifs as similar. Each motif is represented by its identifier, the sequence that was used in the cluster analysis, the gene that contains the motif, its position within the S. cerevisiae protein, and the cluster where it is found.
Table S5. Annotation of top 20 clusters from different distance metrics of predicted short conserved sequences. Three tables show annotations of the top 20 clusters from each of the three cluster analyses, which used three different metrics to assign sequence similarity between motifs (see Table S4). Each set of annotations includes a sequence logo representing the pattern formed by the motifs, whether this pattern is known, and notes regarding functional enrichments. The first table describes the results using an all-by-all pairwise distance measure (described in Methods) between the predicted segments. The second table describes the results after first extending either side of the motifs by five residues before applying the all-by-all distance measure. Using the same extended motifs, the third table describes the results considering only the top 10 most similar motifs as similar.
Table S5-1. Patterns obtained from clustering of pairwise sequence distance between conserved sequences. Motif profiles were aligned using the average pairwise substitution score derived from the empirical substitution matrix described in Methods. Pairs of profiles (mi and mj) were then clustered using the Smith-Waterman score (S[mi,mj]) divided by the square root of the aligned region length (li,j). Edges were pruned if the score exceeded a threshold of min(7.7,S[mi,mi]/sqrt(li,i)).
[Sequence logo images from the original table are not reproduced.]
Cluster number   Known Motif   Notes
1                Yes           Enriched in cell-cycle proteins; Probably proline-directed phosphorylation sites
2                Yes           Enriched in nuclear proteins; Probably related to nuclear localization signals
3                Yes           Enriched in nucleoporins; GLFG-repeat motif
4                Yes           Enriched in cell-cycle proteins; Probably proline-directed phosphorylation sites
5                Yes           Proline-rich motif
6                Yes           Basophilic-kinase phosphorylation sites
7                No
7                No
8                No
9                Yes           Probably related to nuclear localization signals
10               Yes           Enriched in endocytosis genes; EH-interacting motif
10               Yes           Proline-rich motif
11               Yes           Similar to Ime2 phosphorylation sites
12               Yes           Probably related to nuclear localization signals
13               No
14               No
15               Yes           Probably related to nuclear localization signals
16               No
17               Yes           Proline-rich motif
17               Yes           Probably related to proline-directed phosphorylation sites
18               Yes           Glutamine-repeat
19               No
19               No
20               No
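The length normalization described in the Table S5-1 legend can be written out as a short calculation. The sketch below computes only the two quantities named in the legend, the normalized pairwise score S[mi,mj]/sqrt(li,j) and the threshold min(7.7, S[mi,mi]/sqrt(li,i)); it does not implement the alignment or the clustering, and the function names and the numeric inputs are hypothetical.

```python
import math


def normalized_score(raw_score, aligned_length):
    """Smith-Waterman score divided by the square root of the aligned length."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return raw_score / math.sqrt(aligned_length)


def edge_threshold(self_score, self_length, cap=7.7):
    """Threshold min(cap, S[mi,mi]/sqrt(li,i)); cap defaults to 7.7 as in the legend."""
    return min(cap, normalized_score(self_score, self_length))


# Made-up numbers for illustration only.
pair = normalized_score(20.0, 4)       # 20 / sqrt(4)
limit = edge_threshold(50.0, 25)       # min(7.7, 50 / sqrt(25))
```

Dividing by the square root of the aligned length keeps long alignments from dominating purely because they accumulate more scoring positions; how `pair` is compared with `limit` to decide on an edge is left to the legend's wording.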
Table S5-2. Patterns obtained from clustering of pairwise sequence distance between conserved sequences. Each predicted conserved sequence was first extended on both sides and trimmed (see Methods). Motif profiles were aligned and scored as in the previous section of this table.
[Sequence logo images from the original table are not reproduced.]
Cluster number   Known Motif   Notes
1                Yes           Enriched in nucleoporins; GLFG-repeat motif
2                Yes           Proline-rich motif
3                Yes           Enriched in nuclear localization; Probably related to nuclear localization signal
4                Yes           Enriched in endocytosis genes; EH-interacting motif
5                Yes           Enriched in cell-cycle proteins; Probably proline-directed phosphorylation sites
6                Yes           Enriched in nucleoporins; FxFG-repeat motif
7                Yes           Enriched in cell-cycle proteins; Probably proline-directed phosphorylation sites
7                No
8                No            Enriched in vesicle and nuclear membrane proteins, enriched in protein transport process. Probably related to NPF motif
9                No
10               Yes           Probably basophilic-kinase phosphorylation sites
10               Yes           Probably related to nuclear localization signal
11               Yes           Probably related to nuclear localization signals
12               Yes           Probably related to nuclear localization signals
12               Yes           Probably related to nuclear localization signals
12               Yes           Basophilic-kinase phosphorylation sites
12               Yes           Probably proline-directed phosphorylation sites
13               Yes           Probably acidophilic-kinase phosphorylation sites
13               No
14               Yes           Enriched in ER localization; ER-localization signal
15               Yes           KEN-box APC/C degradation signal
16               Yes           Proline-rich motif
16               Yes           Probably proline-directed phosphorylation sites
17               No            Enriched in amino acid permeases
17               No
17               No
18               No
19               Yes           Probably proline-directed phosphorylation sites
20               Yes           Enriched in mitochondrial localization; Mitochondrial targeting signal
Table S5-3. Patterns obtained from clustering of conserved sequences and their top ten most similar conserved sequences, without allowing matches to paralogs or to the same protein. Each predicted conserved sequence was first extended on both sides and trimmed (see Methods). Motif profiles were aligned using the average pairwise substitution score derived from the empirical substitution matrix described in Methods. These profiles were then clustered using the Smith-Waterman score (S[mi,mj]). The top ten most similar profiles were taken.
[Sequence logo images from the original table are not reproduced.]
Cluster number   Known Motif   Notes
1                Yes           Resembles motif in vacuolar proteins in yeast
2                Yes           Proline-rich motif
3                Yes           Enriched in ER localization; ER-localization signal
4                Yes           Proline-rich motif of class II (PxxPx+)
5                Yes           Enriched in mitochondrial localization; Mitochondrial targeting signal
6                Yes           Probably related to nuclear localization signal
6                Yes           PCNA-interacting motif
6                No
7                No
8                Yes           Probably basophilic-kinase phosphorylation sites
9                No            Enriched in Cbk1 interactors
10               Yes           Cbk1 phosphorylation motif
11               No
12               Yes           Disulfide isomerase motif
13               Yes           Probably related to nuclear localization signal
14               Yes           eIF4e binding site
15               Yes           EH-interacting motif
16               No            N-terminal motif
17               No            No particular motif
18               No
19               No
19               Yes           Disulfide isomerase motif
19               No
20               Yes           FxFG-repeat motif