8
Nick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which detailed a machine-learning approach to the prediction of amino-acids encoded by a given A-domain in non-ribosomal peptide synthesis. The training data used for this approach was taken from this paper, in .xls format. Then, all unnatural amino acid residues were stripped, leaving 19 amino-acid encoding sequence types (each type contained 1-~30 sequences). No methionine encoding A-domains were in the training data, leaving only 19 amino-acids rather than 20. The resulting trimmed down .xls file was converted to .csv format, and read into the python script as a dictionary of lists (see .py file). All further data manipulations and analysis was performed in Python (see .py file). B. High Level Overview The first aspect of the project was essentially visualizing the IC of the A- domains to more easily make a hypothesis about whether a method simpler than the machine learning approach (reported by Rottig et al.) could be used to accomplish AA prediction based on a signature sequence of an A-domain. This visualization was done with IC Bar Charts. Through the following steps: 1. Read in CSV, to get sequence data into usable dictionary format (readseq). 2. Construct a matrix of amino acid frequency of occurrence for each position in each set of A-domain sequences corresponding to one amino-acid recognition (AAfrequency). 3. Use the frequencies of each amino acid at each position in each set of A- domain sequences to construct a consensus sequence for each A-domain type (consensus)

Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

  • Upload
    vuliem

  • View
    228

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

Nick Till

NRPS Code Project Summary

A. Data formatting/trimming The data used in this project was obtained from a paper which detailed a

machine-learning approach to the prediction of amino-acids encoded by a given

A-domain in non-ribosomal peptide synthesis. The training data used for this

approach was taken from this paper, in .xls format. Then, all unnatural amino

acid residues were stripped, leaving 19 amino-acid encoding sequence types

(each type contained 1-~30 sequences). No methionine encoding A-domains

were in the training data, leaving only 19 amino-acids rather than 20. The

resulting trimmed down .xls file was converted to .csv format, and read into the

python script as a dictionary of lists (see .py file). All further data manipulations

and analysis was performed in Python (see .py file).

B. High Level Overview The first aspect of the project was essentially visualizing the IC of the A-

domains to more easily make a hypothesis about whether a method simpler than

the machine learning approach (reported by Rottig et al.) could be used to

accomplish AA prediction based on a signature sequence of an A-domain. This

visualization was done with IC Bar Charts. Through the following steps:

1. Read in CSV, to get sequence data into usable dictionary format

(readseq).

2. Construct a matrix of amino acid frequency of occurrence for each position

in each set of A-domain sequences corresponding to one amino-acid

recognition (AAfrequency).

3. Use the frequencies of each amino acid at each position in each set of A-

domain sequences to construct a consensus sequence for each A-domain

type (consensus)

Page 2: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

4. Calculate entropy at each position of each consensus (Entropy)

5. Calculate information content at each position of each consensus (IC)

6. Plot bar charts of sequence position vs information content (plotBarChart)

7. Plot bar charts for all A-domain sequences regardless of amino-acid

specificity to see overall conservation

8. Make a weighted hamming distance function, where hamming distance

scales with information content at a given position (WeightedHamming).

9. Take an input sequence and predict its amino-acid specificity by

minimizing its weighted hamming distance against all possible (18 others)

amino-acid A-domain consensus sequences, and return the minimized

amino acid. Furthermore, check to make sure the returned amino acid

actually contains a sequence identical to the one input – output match or

mismatch to indicate success of predictor.

C. Discussion In both using 34-mer inputs and 10-mer inputs, the predictor is correct a

bit over half of the time (Tables 1 and 2). Given the low information content for

the amino-acid sequences which are predicted poorly, this makes some sense

(Figures 1 and 2). Typically, A-domain consensuses with high information

content are predicted well (Arginine, Alanine, Glutamine), however there certainly

are outliers to this trend. While Serine and Threonine are highly conserved

amongst many A-domains, they are each only predicted in one of the two

prediction models (one in the 10-mer model, and one in the 34-mer model). A

very odd successful prediction is that of valine. While the only well conserved

residues in valine (of which we have 28 examples, figure 2) are well conserved

across all A-domains, valine is still predicted by both the 10-mer and 34-mer

predictors. More complex weighting in the hamming distance function might

improve prediction power to include serine and threonine, and further not predict

valine. While it is nice that valine is predicted well, it is likely only an anomaly,

and would not hold if the predictor was applied to larger data sets.

Page 3: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

The most obvious problem with the current predictor is the necessity for

extensive preparation of the data beforehand. The 34 residues taken as a

signature sequence in half of the predictions are selected from an 8 Å radius ball

centered in the amino acid binding site. Thus, the ordering of the amino acids

does not reflect the ordering present in the primary structure. Furthermore, the

assumption that these residues will still lie within the 8 Å radius ball in all A-

domains relies on the assumption that the conformation of all A-domains will be

very similar to the only one we have a crystal structure for (Marahiel et al.). This

is an even more obvious problem when we take the 10-mer sequences as the

signatures, since one of these 10 residues lying outside the recognition site will

unduly affect the prediction.

The weighted Hamming function could be improved with a more detailed

look at the logos (figures 3 and 4) of each A-domain sequence. With an

understanding of which residues in the A-domain are crucial to recognition of a

given AA, the Hamming function could be weighted accordingly.

Page 4: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

REFERENCE

Figure 1. Information content by position for all 19 characterized amino acid A-domains, and a reference information content for all sequences (with length 10), regardless of A-domain specificity.

Page 5: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

REFERENCE

Figure 2. Information content by position for all 19 characterized amino acid A-domains, and a reference information content for all sequences (length 32), regardless of A-domain specificity.

Page 6: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

Table 1. Prediction of AA for a 34-mer sequence randomly selected from each known AA sequence set. Match indicates prediction success.

34-mer sequence Known

AA Predicted

AA Full

Match? 34-mer Match?

YWNPFDLSVMDPVSLFCGEYNTYGPTEATVAVTG ala ala Y Y TYATFDVSVWESTCIVGGEYNAYGPTEVAVETTI arg arg Y Y YWASFDLTVTAAKIVAGGEVNEYGPTETTVGCCA asn ala N N YWFSFDLGYTSPKLVLGGEINHYGPTETTIGAIA asp asp Y Y LSLSFDHFVEQDSGDCVGEINGYGPTEVSITTHK cys ala N N LGLAFDASVQQTDGLVGGETNVYGPTETCVDASS gln gln Y Y LGLAFDASVKQADMIVGGDTNVYGPTECCVDAAS glu val N N FAMTFDISALELQALVGGETNLYGPTETTIWSTF gly gly Y Y VNTSFDGSVFDGFILFGGEIHVYGPTESTVYATY ile ala N N LWDAFDASIWEPFLLTGGDVNNYGPTENTVVATS leu leu Y Y YDHWFDAAWQPADTALGGEFNCYGPTETTVEAVV lys val N N TAQAFDAAVWESALIVAGDVNAYGLTETTVCATM phe phe Y Y LFEAFDVCYQESVSITAGEHNHYGPSETHVVSAY pro pro Y Y RWMTFDVSVWEWHFMCSGEHNLYGPTEAAVDVTA ser ser Y Y LHQHFDFSVWEGNQIFGGEINMYGITETTVHVTY thr val N N LDRVFDVSMADPVMVSGGDHNEYGVTEATVVSTV trp val N N TWRFFDGCVTSTLITFAGEANEYGPTENSVATTI tyr ala N N LNAGFDASTFEGWLIIGGDWNGYGPTENTTFSTC val val Y Y

Table 2. Prediction of AA for a 10-mer sequence randomly selected from each known AA sequence set. Match indicates prediction success.

10-mer sequence Known AA Predicted AA Full Match? 10-mer match?

DLMVLCTVA- ala ala Y Y DVWTIGAVE- arg arg Y Y DLTKVGEVG- asn ala N N DLTKVGHIG- asp ala N N DHESDVGIT- cys ala N N DAQDLGVVD- gln gln Y Y DAKDIGVVD- glu ala N N DILQLGLIW- gly gly Y Y

DGFFLGVVY- ile ala N N DAWFLGNVV- leu leu Y Y DAQDAGCVE- lys lys Y Y DAWAIAAVC- phe phe Y Y DVQVIAHVV- pro pro Y Y

DVWHMSLVD- ser ala N N DFWNIGMVH- thr thr Y Y DVAVVGEVV- trp ala N N DGTLTAEVA- tyr ala N N DAFWIGGTF- val val Y Y

Page 7: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

All 10-mers Logo

WebLogo 3.4

0.0

1.0

2.0

3.0

4.0

bits DIGFVLAGAEMQYLTFWVWSKTHFN

5

N

M

T

Y

F

VIL

D

V

C

SAGN

H

V

E

M

G

AL

TIVH

G

C

A

Y

W

F

V

D

10

EK

Fig. 3
Page 8: Nick Till NRPS Code Project Summary - Reed · PDF fileNick Till NRPS Code Project Summary A. Data formatting/trimming The data used in this project was obtained from a paper which

All 32-mers Logo

WebLogo 3.4

0.0

1.0

2.0

3.0

4.0

bits

V

S

R

TYFLY

G

SDNAWI

H

A

L

T

M

G

P

HTAS

5

H

FDIGFVLAM

C

G

T

FASY

L

A

TIV

10

G

A

E

M

Q

YLTFW

Y

A

S

QDEM

S

W

L

T

APG

V

W

S

K

T

H

FN

I

T

A

G

Q

S

FL

15

N

M

T

Y

F

VIL

A

I

TCLFV

D

V

C

SAGG

S

DE

20

C

W

L

T

F

YHVI

Q

S

HNN

H

V

E

M

G

AL

W

GYG

25

V

AIPA

STEV

S

NAT

V

H

CAST

30

TIVH

G

C

A

Y

W

F

V

D

C

TVAS

N

V

M

I

AST

W

F

S

I

Y

A

Figure 4.