CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS, TECHNICAL UNIVERSITY OF DENMARK, DTU
T cell epitope predictions using bioinformatics (neural networks and hidden Markov models)
Morten Nielsen, CBS, BioCentrum, DTU
Processing of intracellular proteins
http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm
MHC binding
What makes a peptide a potential and effective epitope?
• Part of a pathogen protein
• Successful processing
 – Proteasome cleavage
 – TAP binding
• Binds to MHC molecule
• Protein function
 – Early in replication
• Sequence conservation in evolution
SARS virus
From proteins to immunogens
Lauemøller et al., 2000
• 20% processed
• 0.5% bind MHC
• 50% CTL response
=> 1/2000 peptides are immunogenic
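The 1/2000 figure is the product of the three stage fractions quoted above. A minimal check, assuming the three stages act as independent filters (the variable names are illustrative and not from the slides):

```python
from fractions import Fraction

# Stage fractions quoted on the slide (Lauemøller et al., 2000).
processed = Fraction(20, 100)     # 20% of peptides are processed
bind_mhc = Fraction(5, 1000)      # 0.5% bind MHC
ctl_response = Fraction(50, 100)  # 50% of those give a CTL response

# Assumption: the stages are treated as independent filters.
immunogenic = processed * bind_mhc * ctl_response
print(immunogenic)  # 1/2000
```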
Location of class I epitopes
GP1200 protein structure (1GM9)
MHC class I with peptide
http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm
Anchor positions
Prediction of HLA binding specificity
• Simple motifs
 – Allowed/non-allowed amino acids
• Extended motifs
 – Amino acid preferences (SYFPEITHI)
 – Anchor/preferred/other amino acids
• Hidden Markov models
 – Peptide statistics from sequence alignment
• Neural networks
 – Can take sequence correlations into account
Where to get data?
• SYFPEITHI database
 – 3500 peptides known to bind to MHC class I and II
 – Only published data
• MHCpep
 – 13000 peptides known to bind to MHC class I and II
 – Published data and direct submission
 – No update since 1998
• Binding affinity assays
 – Quantitative data: how strongly does a peptide bind to the MHC molecule?
 – Costly, and people do not publish negative results
Databases and web resources
• HLA Informatics Group, ANRI (HLA sequence database)
• IMGT/HLA Database (HLA sequence database)
• SYFPEITHI (Database of HLA Class I and II peptides)
• MHCPEP (Database of HLA Class I and II peptides)
• BIMAS (HLA Class I predictor)
• SYFPEITHI (HLA Class I predictor)
• NetMHC (HLA Class I prediction)
Sequence information
SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV
LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL
HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI
ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL
LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV
PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV
ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM
KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV
KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV
SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV
ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV
TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA
GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA
KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV
AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV
GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV
VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC
ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA
YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI
NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV
VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ
GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL
EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV
YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL
FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL
AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI
AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV
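Sequence logo
• Height of a column equals its information content, I = log2(20) + Σ_a p_a log2(p_a), where p_a is the frequency of amino acid a in the column
• Relative height of a letter is p_a
• Highly useful tool to visualize sequence motifs
[Figure: sequence logo for MHC class I, HLA-A0201, with the high information positions marked]
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html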
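The column height rule can be sketched in a few lines. This is a minimal illustration, assuming base-2 logarithms (bits) and raw frequencies without any small-sample correction; the function names and the two example columns are made up for the illustration:

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content of one alignment column: log2(20) + sum_a p_a * log2(p_a)."""
    total = len(column)
    entropy_term = sum((n / total) * math.log2(n / total)
                       for n in Counter(column).values())
    return math.log2(alphabet_size) + entropy_term

def letter_heights(column):
    """Each letter gets the share p_a of the column height."""
    info = column_information(column)
    total = len(column)
    return {aa: info * n / total for aa, n in Counter(column).items()}

# Hypothetical columns: one fully conserved, one split evenly between L and M.
print(round(column_information("LLLLLLLLLL"), 2))  # 4.32
print(round(column_information("LLLLLMMMMM"), 2))  # 3.32
print(letter_heights("LLLLLMMMMM"))
```

A fully conserved column reaches the maximum height of log2(20), while a column where all 20 amino acids are equally frequent would have height 0.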
Characterizing a binding motif
10 peptides known to bind MHC:
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
What can we learn?
1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?
Sequence information
• Description of binding motif
• Example (amino acid frequencies at position P1)
 P_A = 6/10
 P_G = 2/10
 P_T = P_K = 1/10
 P_C = P_D = … = P_V = 0
• Problems
 – Few data
 – Data redundancy/duplication
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
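The P1 frequencies in the example can be reproduced by plain counting over the ten peptides. A minimal sketch (the helper name `position_frequencies` is an illustration, not part of any tool named in the slides):

```python
from collections import Counter

# The ten example peptides from the slide.
PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_frequencies(peptides, position):
    """Raw amino acid frequencies at a 1-based peptide position."""
    counts = Counter(p[position - 1] for p in peptides)
    return {aa: n / len(peptides) for aa, n in counts.items()}

p1 = position_frequencies(PEPTIDES, 1)
p9 = position_frequencies(PEPTIDES, 9)
print(p1)                 # {'A': 0.6, 'G': 0.2, 'T': 0.1, 'K': 0.1}
print(p9.get("I", 0.0))   # 0.0, because I is never observed at P9
```

Raw counting gives a frequency of exactly zero to every amino acid that happens to be absent, which is the small-sample problem the slide points at.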
Sequence information: raw sequence counting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Pseudo-count and sequence weighting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV (similar sequences: weight 1/5 each)
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
• Poor or biased sampling of sequence space
• I is not found at position P9. Does this mean that I is forbidden?
• No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9
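One simple way to arrive at the weight of 1/5 is to cluster similar peptides and give each peptide the weight 1/(cluster size). The sketch below is only an illustration of that idea: the greedy clustering and the 80% identity threshold are assumptions, not the scheme used in the slides.

```python
from collections import defaultdict

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(peptides, threshold=0.8):
    """Greedy clustering against cluster representatives; weight = 1 / cluster size."""
    representatives = []  # first member of each cluster
    assignment = []       # cluster index of each peptide
    for pep in peptides:
        for idx, rep in enumerate(representatives):
            if identity(pep, rep) >= threshold:
                assignment.append(idx)
                break
        else:
            representatives.append(pep)
            assignment.append(len(representatives) - 1)
    sizes = defaultdict(int)
    for idx in assignment:
        sizes[idx] += 1
    return [1.0 / sizes[idx] for idx in assignment]

def weighted_frequency(peptides, weights, position, aa):
    """Weighted frequency of amino acid aa at a 1-based position."""
    hit = sum(w for p, w in zip(peptides, weights) if p[position - 1] == aa)
    return hit / sum(weights)

weights = cluster_weights(PEPTIDES)
print(weights[:5])  # [0.2, 0.2, 0.2, 0.2, 0.2]
print(round(weighted_frequency(PEPTIDES, weights, 1, "A"), 2))  # 0.33
```

With these weights the five near-identical ALAKAAAA peptides count as one sequence, so the frequency of A at P1 drops from 6/10 to 2/6.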
The Blosum matrix
Sequence weighting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
Pseudo counts
• Sequence weighting and pseudo count
 – Prediction accuracy 0.60
• Motif found on all data (485)
 – Prediction accuracy 0.79
Weight matrices
• Estimate amino acid frequencies from alignment
• Now a weight matrix is given as W_ij = log(p_ij / q_j)
 – Here i is a position in the motif, and j an amino acid. p_ij is the estimated frequency of amino acid j at position i, and q_j is the background frequency for amino acid j.
• W is an L x 20 matrix, where L is the motif length
• Score sequences to the weight matrix by looking up and adding L values from the matrix
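The construction W_ij = log(p_ij / q_j) can be sketched as follows. This is a minimal illustration with several assumptions that are not from the slides: a uniform background q_j = 1/20, base-2 logarithms, and a flat pseudocount of 1 per amino acid to avoid log(0) (the slides use Blosum-based pseudo counts and sequence weighting instead).

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def build_weight_matrix(peptides, pseudocount=1.0):
    """Return W as a list of dicts with W[i][aa] = log2(p_i(aa) / q(aa))."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    denom = len(peptides) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for i in range(len(peptides[0])):
        counts = Counter(p[i] for p in peptides)
        matrix.append({
            aa: math.log2((counts[aa] + pseudocount) / denom / background)
            for aa in AMINO_ACIDS
        })
    return matrix

def score(peptide, matrix):
    """Sum the L matrix entries selected by the peptide, one per position."""
    return sum(row[aa] for row, aa in zip(matrix, peptide))

W = build_weight_matrix(PEPTIDES)
print(len(W), len(W[0]))             # 9 20
print(round(W[0]["A"], 2))           # positive: A is enriched at P1
print(round(score("ALAKAAAAV", W), 2))
```

Positive entries mark amino acids that are more frequent at a position than expected from the background, negative entries the opposite.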
Scoring sequences to a weight matrix
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
Which peptide is most likely to bind? Which peptide second?
ILYQVPFSV  15.0
ALPYWNFAT  -3.4
MTAQWWLDA   0.8
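The three scores can be checked by looking up one matrix entry per peptide position and adding them. A minimal sketch using the matrix values as printed on the slide (the names `ROWS` and `score` are illustrative). With these rounded entries the third peptide sums to 0.7 rather than the 0.8 quoted on the slide, presumably because the displayed matrix is rounded to one decimal.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # column order used on the slide

# One row per peptide position P1..P9, values copied from the slide.
ROWS = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0, 0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0, 5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0, 0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1, -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2, -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8, 0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6, 0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0, 1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9, 2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]

def score(peptide):
    """Add the matrix entry for each (position, amino acid) pair of a 9-mer."""
    return sum(ROWS[i][AMINO_ACIDS.index(aa)] for i, aa in enumerate(peptide))

candidates = ["ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"]
for pep in sorted(candidates, key=score, reverse=True):
    print(pep, round(score(pep), 1))
```

Sorting by score ranks ILYQVPFSV first and MTAQWWLDA second, with ALPYWNFAT last mainly because of the strongly disfavored Y at P4.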
How to predict
• The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations).
 – Two adjacent amino acids may, for example, compete for the space in a pocket in the MHC molecule.
• Artificial neural networks (ANN) are ideally suited to take such correlations into account
Neural networks
• Neural networks can learn higher order correlations!
 – What does this mean?
A A => 0
A C => 1
C A => 1
C C => 0
No linear function can learn this pattern
Neural networks
[Figure: network with two inputs, two hidden units (weights w11, w12, w21, w22) and one output (weights v1, v2)]
XOR(x1, x2) = (x1 + x2) − 2 ⋅ x1 ⋅ x2 = y − z
y = x1 + x2
z = 2 ⋅ x1 ⋅ x2
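The decomposition into y and z can be checked over all four binary inputs. The sketch below also shows one possible threshold network with two hidden units that computes the same function; its weights and thresholds are hypothetical choices for illustration, since the slides do not give values for w11, w12, w21, w22, v1 and v2.

```python
from itertools import product

def xor_via_hidden_units(x1, x2):
    """XOR written as y - z, with y = x1 + x2 and z = 2 * x1 * x2."""
    y = x1 + x2
    z = 2 * x1 * x2  # the product term is what a single linear unit cannot provide
    return y - z

def step(value, threshold):
    """Threshold activation: 1 if the input exceeds the threshold, else 0."""
    return 1 if value > threshold else 0

def xor_threshold_network(x1, x2):
    """Two hidden threshold units (OR-like and AND-like) feeding one output unit."""
    h_or = step(x1 + x2, 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2, 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and, 0.5)  # OR but not AND

for x1, x2 in product((0, 1), repeat=2):
    print(x1, x2, xor_via_hidden_units(x1, x2), xor_threshold_network(x1, x2))
```

Reading A as 0 and C as 1, this is the A/C pattern from the previous slide: the output depends on the combination of the two positions, not on either position alone.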
Evaluation of prediction accuracy
• ROC curves
 – True positive proportion = TP/AP (AP: actual positives)
 – False positive proportion = FP/AN (AN: actual negatives)
 – [Figure: ROC curves with Aroc = 0.5 and Aroc = 0.8]
• Pearson correlation
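Both measures can be computed with a few lines of plain Python. This is a minimal sketch: the toy predictions and labels are hypothetical, and the area under the ROC curve is computed through the equivalent pairwise ranking formulation rather than by integrating the curve.

```python
import math

def rates_at_threshold(scores, labels, threshold):
    """Return (TP/AP, FP/AN) when everything scoring >= threshold is called positive."""
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= threshold)
    fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold)
    ap = sum(labels)        # actual positives
    an = len(labels) - ap   # actual negatives
    return tp / ap, fp / an

def roc_auc(scores, labels):
    """Aroc as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical toy data: predicted scores and binder (1) / non-binder (0) labels.
predicted = [0.9, 0.8, 0.4, 0.35, 0.1]
is_binder = [1, 1, 0, 1, 0]
print(rates_at_threshold(predicted, is_binder, 0.5))
print(round(roc_auc(predicted, is_binder), 3))
print(round(pearson([1, 2, 3], [2, 4, 6]), 3))
```

An Aroc of 0.5 corresponds to random ranking and 1.0 to perfect separation of binders from non-binders.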
Epitope predictions: sequence motif and HMMs
• Sequence motif: cc = 0.76, Aroc = 0.92
• HMM: cc = 0.80, Aroc = 0.95
Epitope prediction: neural networks
• cc = 0.91, Aroc = 0.98
Evaluation of prediction accuracy
[Figure: bar chart of Pearson correlation (Pear) and Aroc for the Motif, HMM and ANN methods; vertical axis from 0 to 1]
Hepatitis C virus. Epitope predictions
Proteasomal cleavage
• NetChop (http://www.cbs.dtu.dk/services/NetChop-3.0/)
 – Epitopes have strong C terminal cleavage
 – Epitopes can have strong internal cleavage sites
• Selection strategy
 – High binding peptides
 – High cleavage probability at C terminal
NMVPFFPPV
..S.....S
What next?
• 29 March. Introduction to hidden Markov models and weight matrices
• 5 April. Introduction to neural networks
• 12 April. Introduction to the project
• 10 May. Hand in the project