Computational Biology
Dr. Jens Allmer
Lecture Slides Week 3
MBG404 Overview
Data
Generation
Processing
Storage
Mining
Pipelining
Sample preparation for mass spectrometry
Sucrose gradient (0.5 M, 1.3 M and 1.8 M layers; figure labels: Thylakoids; Starch, etc.)
• Centrifugation of crude cell extracts in a sucrose gradient
• Separation of the thylakoid fractions via SDS-PAGE
• Cutting of interesting bands from the gel
• Proteolytic (trypsin) digestion in gel
• Liquid chromatography of the resulting peptides
• Mass spectrometry (MS)
1D SDS PAGE of thylakoid fraction from crude cell extracts of Chlamydomonas reinhardtii.
Mass spectrometric methods for
protein identification
Peng, J. and Gygi, S.P. (2001) Proteomics: the move to mixtures. J. Mass Spectrom., 36,
1083-1091.
Schematic depiction of an ion trap mass spectrometer
[Figure: three spectra, relative abundance (0-100) versus m/z. (1) Full MS scan, m/z 400.00-2000.00; labeled peaks include 626.3, 835.5 and 982.4. (2) Z ms (zoom) scan, m/z 622.30-632.30; peaks at 626.1, 626.6, 627.1 and 627.7. (3) Spectrum over m/z 200-1800; labeled peaks include 479.4, 535.8, 828.2 and 957.3.]
Example tandem MS spectrum
Scan 4501
Scan 4502
Ion trap mass spectrometry (ITMS)
• Single peptide ions
• Collision-induced dissociation (CID)
• Sequentially fragmented peptide ions
• Mass spectra
• Tandem mass spectra
• De novo sequencing
  – Peptide amino acid sequences
  – BLAST
• Database
  – Translated DNA or protein sequences
  – 'In silico' tryptic digestion
  – Theoretical MS/MS fragmentation pattern
• Database search
  – 'Hit': significant match with the theoretical fragmentation pattern of a database sequence
Mass spectrometric peptide fragmentation spectrum analysis (Sequest or Mascot)
Cross correlation
• Digesting the database with the enzyme in question.
• Picking all fragments within a mass window close to the precursor mass of the peptide in the mass spectrum
• Calculating an artificial spectrum from all those fragments
• Cross correlate spectra to original mass spectrum
WLQYSEVIHAR
Theoretical spectrum in red (a,b,c,x,y,z ions) and measured spectrum in blue
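The four cross-correlation steps can be sketched in a few lines of Python. This is a simplified illustration, not the Sequest or Mascot implementation. It assumes singly charged b and y ions only (the slide also shows a, c, x and z ions), unit-width m/z bins, no missed cleavages, and a hypothetical precursor tolerance of 1 Da. All function names are made up for this example.

```python
import numpy as np

# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def tryptic_digest(protein):
    """Step 1: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def theoretical_spectrum(peptide, n_bins=2001):
    """Step 3: artificial spectrum with b and y ions at intensity 1."""
    residues = [RESIDUE_MASS[aa] for aa in peptide]
    spectrum = np.zeros(n_bins)
    for i in range(1, len(peptide)):
        b_ion = sum(residues[:i]) + PROTON
        y_ion = sum(residues[i:]) + WATER + PROTON
        for mz in (b_ion, y_ion):
            if round(mz) < n_bins:
                spectrum[int(round(mz))] = 1.0
    return spectrum


def xcorr(measured, theoretical, max_shift=75):
    """Step 4: overlap at zero shift minus the mean overlap at other shifts."""
    direct = float(np.dot(measured, theoretical))
    shifted = [np.dot(measured, np.roll(theoretical, s))
               for s in range(-max_shift, max_shift + 1) if s != 0]
    return direct - float(np.mean(shifted))


def search(measured, precursor_mass, proteins, tolerance=1.0):
    """Step 2 plus scoring: keep peptides in the mass window, best score first."""
    candidates = [p for protein in proteins for p in tryptic_digest(protein)
                  if abs(peptide_mass(p) - precursor_mass) <= tolerance]
    scored = [(xcorr(measured, theoretical_spectrum(p)), p) for p in candidates]
    return sorted(scored, reverse=True)
```

Calling `search` with a binned measured spectrum, its precursor mass and a list of protein sequences returns the candidate peptides ranked by score. Real search engines also preprocess the measured spectrum, which is omitted here.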
Mass spectrometric peptide fragmentation spectrum analysis (Sequest or Mascot)
Limitation: Identification is limited to peptide sequences present in the database.
Database Search Software
• Many tools have been developed
  – OMSSA (NCBI, discontinued)
  – X!Tandem (The Global Proteome Machine)
X!Tandem
• http://www.thegpm.org/tandem/
X!Tandem Initialization Files
• X!Tandem
  – Taxonomy.xml
  – Default_Input.xml
  – Input.xml
• Running X!Tandem
  – tandem.exe input.xml
• That was easy
  – But behold, what about the input?
OMSSA
• Open Mass Spectrometry Search Algorithm
• Discontinued
  – Due to problems?
• Still existing uses
  – PeptideShaker
  – SearchGUI
Sequence Alignment
• Exact
  – Simple
• Approximate
  – More difficult
[Figure: two sketches of a pattern aligned against a target]
Sequence Alignment
• Exact pattern matching
  – Naive method aligns the pattern with each location of the target
  – Boyer-Moore indexes the pattern to skip some alignments
  – Wu-Manber indexes many patterns and skips some alignments
  – Indexing
    • Suffix tree indexes the target and then quickly finds each pattern
    • Many other methods
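To make the difference between the naive method and pattern indexing concrete, here is a minimal Python sketch. The second function uses Horspool's simplification of Boyer-Moore (bad-character shifts only), not the full Boyer-Moore algorithm. Function names are chosen for this example.

```python
def naive_search(target, pattern):
    """Align the pattern with every location of the target."""
    m = len(pattern)
    if m == 0:
        return []
    return [i for i in range(len(target) - m + 1) if target[i:i + m] == pattern]


def horspool_search(target, pattern):
    """Index the pattern once, then skip alignments that cannot match."""
    m, n = len(pattern), len(target)
    if m == 0 or m > n:
        return []
    # Shift table: distance from the last occurrence of each character
    # (excluding the final pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits, i = [], 0
    while i <= n - m:
        if target[i:i + m] == pattern:
            hits.append(i)
        # Characters absent from the pattern allow a full-length jump.
        i += shift.get(target[i + m - 1], m)
    return hits


print(horspool_search("ACGTACGTTACG", "ACG"))  # [0, 4, 9]
```

Both functions return the same positions. The indexed version simply inspects fewer alignments when the target contains characters that are rare in the pattern.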
Sequence Alignment
• Approximate pattern matching
  – Pairwise
    • Local
      – Smith-Waterman
      – BLAST
      – FASTA
    • Global
      – Needleman-Wunsch
  – Multiple
    • T-Coffee
    • ClustalW
    • ...
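The local and global pairwise methods share the same dynamic programming core. The sketch below computes only the optimal score (no traceback) and assumes a simple scoring scheme of +1 for a match, -1 for a mismatch and -1 per gap position. These values are illustrative, not the substitution matrices used by real tools.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Needleman-Wunsch (global) or Smith-Waterman (local) alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(align_score("ACGT", "TTACGTTT"))              # 0 (global)
print(align_score("ACGT", "TTACGTTT", local=True))  # 4 (local)
```

The example shows why the distinction matters: the short pattern matches perfectly inside the target, which the local score rewards, while the global score is pulled down by the four unmatched positions.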
Basic Local Alignment Search Tool
• Input
  – Pattern
  – Target
  – Search parameters and settings
• Output
  – Alignments in various formats
    • XML
• Help
  – http://www.ncbi.nlm.nih.gov/books/NBK1763/
BLAST
• Target
  – Needs to be indexed
  – Cannot be a plain FASTA file
  – Must fit the pattern and the BLAST variant
    • A protein target and a protein pattern can be searched using blastp
• Target indexing
  – makeblastdb, in the BLAST package, can index FASTA files
  – Needs sequence input (e.g. FASTA, ASN.1)
  – Needs the sequence type to be provided, e.g. protein
BLAST
• blastp
  – Needs indexed database
  – Needs query sequence (can be unindexed FASTA)
  – Produces alignments
Blast flavors
• BlastN - nt versus nt database
• BlastP - protein versus protein database
• BlastX - translated nt (6 frames) versus protein database
• tBlastN - protein versus translated nt database (6 frames)
• tBlastX - translated nt versus translated nt database (both 6 frames)
Query: DNA Protein
DB: DNA Protein
BLAST Output
• XML
  – -outfmt 5
    • This switch leads to XML output
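Once BLAST has written XML, the hits can be read with Python's standard library. The sketch below parses a heavily trimmed, made-up document that uses the element names of the BLAST XML format (Iteration, Hit, Hsp and so on). The query name, hit identifier and numbers are hypothetical, and real output contains many more fields.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for a file produced with -outfmt 5.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>hypothetical_id_1</Hit_id>
          <Hit_def>hypothetical protein A</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>42.7</Hsp_bit-score>
              <Hsp_evalue>1e-05</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def list_hits(xml_text):
    """Return (query, hit id, e-value, bit score) for the first HSP of each hit."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")
            rows.append((
                query,
                hit.findtext("Hit_id"),
                float(hsp.findtext("Hsp_evalue")),
                float(hsp.findtext("Hsp_bit-score")),
            ))
    return rows


for row in list_hits(SAMPLE_XML):
    print(row)
```

For a real result file, read its content into a string and pass it to `list_hits`.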
End Theory I
• 5 min mindmapping
• 10 min break
Practice I
Download Blast
• http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
  – Get blastp and makeblastdb from mbg404 since you are not allowed to install anything
• Download a FASTA file (protein, genome, collection of sequences in FASTA format)
  – Database must consist of amino acids since we only have access to blastp today
• Use makeblastdb from the BLAST package to index the file
• Several files will be created when you do it right
MakeDB
• Example
  – makeblastdb -in seq.fasta -dbtype prot -out seqBl -title seqBlastDB
• More information?
  – Go to the doc folder of BLAST
  – Documentation is there
  – http://www.ncbi.nlm.nih.gov/books/NBK1763/
BLAST
• Now that we have an indexed database, try to run BLAST
• Read the documentation and try to solve the simplest case
  – You will need the indexed database and you will need a FASTA file as query
  – You could create queries from the database and slightly change them
• Good luck
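One possible way to create such a query is to copy a sequence from the database FASTA file and change a few residues. The sketch below is only a suggestion. The function names, the number of changes and the `_mutated` header suffix are assumptions made for this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mutate(sequence, n_changes, rng):
    """Replace n_changes distinct positions with a different amino acid."""
    residues = list(sequence)
    for pos in rng.sample(range(len(residues)), n_changes):
        residues[pos] = rng.choice([a for a in AMINO_ACIDS if a != residues[pos]])
    return "".join(residues)


def make_query(fasta_text, n_changes=2, seed=0):
    """Pick one database sequence and return a slightly changed FASTA query."""
    rng = random.Random(seed)
    header, sequence = rng.choice(read_fasta(fasta_text))
    return f">{header}_mutated\n{mutate(sequence, n_changes, rng)}\n"
```

Writing the returned text to a file gives an unindexed FASTA query that should still hit its source sequence in the database.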
OMSSA
• Unzip folder and check
  – Alternatively, download from NCBI
• MS/MS mgf file
• Database file as FASTA
• makeblastdb.exe
• omssacl.exe
• usermods.xml
OMSSA
Before running OMSSA, database file must be converted to BLAST-like format.
So let’s run makeblastdb.exe to create a hash-indexed database
OMSSA
Here 2 different settings are used.
• The first one is with 0.05 product ion tolerance
• The second one is with the default product ion tolerance
For variable modifications (-mv) check usermods.xml
X!Tandem
• Unzip folder and check
• MGF-formatted spectra (file)
• Database file (FASTA)
• tandem-win32-10-12-01-1 folder
• Used .xml configuration files (default_input.xml, input.xml and taxonomy.xml)
• To get the same output given in the zip folder:
  – Replace the configuration files in the «tandem-win\bin» folder with the ones in the «used» folder.
  – Also copy the database file to the «fasta» folder and the .mgf file to «bin» in «tandem-win»
X!Tandem Console Application
X!Tandem Default Input
Parameters such as mass tolerances, enzyme type and the charge states considered in the search can be changed in default_input.xml
X!Tandem Input.xml
In the input.xml file, you should specify the path of:
• taxonomy.xml
• default_input.xml
• Spectra filename
• Output filename
NOTE: Here input.xml and all files above are in the same folder (directory).
X!Tandem Taxonomy
In taxonomy file, you should specify «database file path». In this example, database file is in «fasta» folder in «Xtandem\tandem-win32-10-12-01-1» folder.
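As a rough guide to what the two files contain, the following Python sketch generates a minimal input.xml and taxonomy.xml. The note labels follow the X!Tandem parameter naming. The taxon name and all file names passed in are placeholders, and a real setup will usually need additional parameters from default_input.xml.

```python
import xml.etree.ElementTree as ET


def build_input_xml(default_params, taxonomy, taxon, spectra, output):
    """Create input.xml content that points X!Tandem at the other files."""
    root = ET.Element("bioml")
    settings = [
        ("list path, default parameters", default_params),
        ("list path, taxonomy information", taxonomy),
        ("protein, taxon", taxon),
        ("spectrum, path", spectra),
        ("output, path", output),
    ]
    for label, value in settings:
        note = ET.SubElement(root, "note", {"type": "input", "label": label})
        note.text = value
    return ET.tostring(root, encoding="unicode")


def build_taxonomy_xml(taxon, fasta_path):
    """Create taxonomy.xml content that maps a taxon name to a database file."""
    root = ET.Element("bioml", {"label": "x! taxon-to-file matching list"})
    entry = ET.SubElement(root, "taxon", {"label": taxon})
    ET.SubElement(entry, "file", {"format": "peptide", "URL": fasta_path})
    return ET.tostring(root, encoding="unicode")


# Placeholder names; replace them with the files from the practice folder.
print(build_input_xml("default_input.xml", "taxonomy.xml", "mydb",
                      "spectra.mgf", "output.xml"))
print(build_taxonomy_xml("mydb", "../fasta/database.fasta"))
```

The taxon label in input.xml must match a taxon label in taxonomy.xml, which is how X!Tandem finds the database file.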
X!Tandem Output
End Practice I
• 15 min break
Theory II
Automation
• BLAST needed sequence file preprocessing
• OMSSA, X!Tandem, etc may need conversion of spectra files– A lot of manual processes
• Needed: an automation facility
• Solution: computational pipelines
Computational Pipeline
[Diagram elements: Spectra (mgf, mzXML, dta, mz2, ...); Spectra Format Converter; dta; PepNovo; Result Converter; DB]
Analysis Network
[Diagram elements: Spectra (mgf, mzXML, dta, mz2, ...); Spectra Format Converter; PepNovo; GPF; Lutefisk; Result Converter; Result Converter 2; DB]
General Pipeline Considerations
• Data cannot be connected to data
• Operations cannot be connected directly either
• Data needs to be transformed (operation)
• In the example, the data element cannot be directly connected to the DB
• The data element is also not necessary; it has been added to clarify that the process generates data which will go directly to the DB
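These rules amount to saying that every link in a pipeline joins a data element to an operation or the reverse. A small Python check of that rule might look as follows. The node names are illustrative, and treating the DB as a kind of data node is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "data" (including data stores) or "operation"


def invalid_links(links):
    """Return the links that join two nodes of the same kind."""
    return [(a.name, b.name) for a, b in links if a.kind == b.kind]


spectra = Node("Spectra", "data")
converter = Node("Spectra Format Converter", "operation")
dta = Node("dta", "data")
pepnovo = Node("PepNovo", "operation")
predictions = Node("Predictions", "data")
result_converter = Node("Result Converter", "operation")
db = Node("DB", "data")

valid = [(spectra, converter), (converter, dta), (dta, pepnovo),
         (pepnovo, predictions), (predictions, result_converter),
         (result_converter, db)]
print(invalid_links(valid))                  # []
print(invalid_links([(predictions, db)]))    # [('Predictions', 'DB')]
```

The second call reproduces the forbidden case from the slide: a data element wired straight into the DB without an operation in between.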
[Diagram: legend (Data, Operation, Data Flow, DataStore/DB) and example connections, one of them marked X as not allowed]
OpenMS
Pipeline Examples
• We will see a few examples
  – OpenMS/TOPP
  – Trans-Proteomics Pipeline
  – Proteomatic
  – Ensembl
TOPP - The OpenMS Proteomics Pipeline
http://open-ms.sourceforge.net/
Trans-Proteomics Pipeline TPP
http://sourceforge.net/projects/sashimi/
Proteomatic
http://www.uni-muenster.de/hippler/proteomatic/
Ensembl
http://genome.cshlp.org/content/14/5/934.full http://www.ensembl.org
Standardization
• Some programs have the same aim
  – Unfortunately, they produce largely different output
  – They depend on different input formats
  – One need for pipelines arises from this
• Standardization can alleviate that problem
• Currently mostly XML
  – Controlled vocabularies are being developed
• In ten years, a full transition to ontologies is expected
Standardization (HUPO PSI)
Selfmade
• Windows
  – Batch script
  – PowerShell
• Linux
  – Bash script
  – Shell script
• Common
  – A file that contains instructions
  – The instructions are those usually typed in the console
Delete Temp
• Batch script
  – cd c:\
  – cd Windows\Temp
  – del /s /q *.*
• Save file as
  – DeleteTemp.bat
• Put the file into
  – C:\Users\%USERNAME%\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup
• Next startup
  – The temporary files will be deleted
Pipelining
• The previous example performed pipelining
• You can use this for anything, like
  – First making a BLAST DB
  – Second searching it
• Advantage
  – You have a log of the settings etc.
  – You can repeat it at any time
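The same idea can be written as a small Python script instead of a batch file. The sketch below builds the two BLAST commands with the options used earlier in these slides, logs each command before running it, and can be repeated at any time. The file names (seq.fasta, query.fasta, hits.xml) and the log file name are placeholders.

```python
import subprocess


def makeblastdb_cmd(fasta, db_name, title):
    """Step 1: index a protein FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot",
            "-out", db_name, "-title", title]


def blastp_cmd(query, db_name, out_file):
    """Step 2: search the indexed database and write XML output."""
    return ["blastp", "-query", query, "-db", db_name,
            "-outfmt", "5", "-out", out_file]


def run_pipeline(steps, log_path="pipeline.log", runner=subprocess.run):
    """Run each command in order and keep a log of the exact settings."""
    with open(log_path, "a") as log:
        for cmd in steps:
            log.write(" ".join(cmd) + "\n")
            log.flush()
            # check=True stops the pipeline if a step fails.
            runner(cmd, check=True)


steps = [
    makeblastdb_cmd("seq.fasta", "seqBl", "seqBlastDB"),
    blastp_cmd("query.fasta", "seqBl", "hits.xml"),
]
# run_pipeline(steps)  # uncomment where BLAST and the input files are available
```

The `runner` argument exists only so the pipeline can be tried out without BLAST installed, for example by passing a function that prints the command.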
End Theory II
• 5 min mindmapping
• 10 min break
Practice II
Raw Data
• Screenshots
• Copy paste
• Unstructured
• Not integrated
• Unreflected
Information
• Structured Data
• Sorted
• Integrated
• Properly graphed
• Figure
  – Number
  – Caption
  – Reference
[Chart: x-axis Spectral Quality (groups 0.22-0.43, 0.55-0.66, 0.67-0.83); y-axis Prediction Distance (0 to 1); series PepNovo, PEAKS, Lutefisk, OMSSA]
Figure 1: Spectral Quality (present fragment ions/ expected fragment ions) versus Prediction Distance to the true sequence (normalized edit distance; 0:great and 1:poor). Predictions were done by PepNovo, PEAKS, and Lutefisk while identification was done with OMSSA. All MS/MS spectra were of charge 1.
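The normalized edit distance named in the caption can be computed as below. The caption does not say how the distance was normalized. Dividing the Levenshtein distance by the length of the longer sequence is an assumption made here so that 0 means identical and 1 means nothing in common.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted, true):
    """Scale the distance to the range 0 (identical) to 1 (completely different)."""
    longest = max(len(predicted), len(true))
    return edit_distance(predicted, true) / longest if longest else 0.0


print(normalized_edit_distance("WLQYSEVLHAR", "WLQYSEVIHAR"))  # one substitution in 11 residues
```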
Even Better
[Box-and-whisker plot: x-axis Spectral Quality (groups 0.22-0.43, 0.55-0.66, 0.67-0.83); y-axis Prediction Quality (0 to 1); series PepNovo, COMAS, Lutefisk, OMSSA, PEAKS]
Figure 5: Spectral Quality (present fragment ions / expected fragment ions) versus Prediction Quality (normalized edit distance). The box-and-whisker plot presents three groups at different spectral quality. Note there were no measurements between 0.43 and 0.55, before 0.22 and after 0.83.
Presenting Data
• When presenting data in your manuscripts:
• Raw data (not acceptable)
• Information (minimum)
• Knowledge (strive for this)
www.similima.com 63
Median, Quartiles, Inter-Quartile Range and Box Plots.
Measures of Spread
Remember: The range is the measure of spread that goes with the mean.
Example 1. Two dice were thrown 10 times and their scores were added together and recorded. Find the mean and range for this data.
7, 5, 2, 7, 6, 12, 10, 4, 8, 9
Mean = (7 + 5 + 2 + 7 + 6 + 12 + 10 + 4 + 8 + 9) / 10 = 70 / 10 = 7
Range = 12 – 2 = 10
Measures of Spread
The range is not a good measure of spread because one extreme (very high or very low) value can have a big effect. The measure of spread that goes with the median is called the inter-quartile range and is generally a better measure of spread because it is not affected by extreme values.
A reminder about the median
Averages (The Median)
The median is the middle value of a set of data once the data has been ordered.
Example 1. Robert hit 11 balls at Grimsby driving range. The recorded distances of his drives, measured in yards, are given below. Find the median distance for his drives.
85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70
Ordered data: 50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 140
Single middle value: Median drive = 85 yards
Averages (The Median)
The median is the middle value of a set of data once the data has been ordered.
Example 2. Robert hit 12 balls at Grimsby driving range. The recorded distances of his drives, measured in yards, are given below. Find the median distance for his drives.
85, 125, 130, 65, 100, 70, 75, 50, 140, 135, 95, 70
Ordered data: 50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 135, 140
Two middle values (85 and 95), so take the mean: Median drive = (85 + 95) / 2 = 90 yards
Finding the median, quartiles and inter-quartile range.
Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Order the data: 4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9
Inter-Quartile Range = 9 - 5½ = 3½
Finding the median, quartiles and inter-quartile range.
Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
Order the data: 3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10
Inter-Quartile Range = 10 - 4 = 6
Discuss the calculations below.
Battery Life: The life of 12 batteries recorded in hours is:
2, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 15
Mean = 93/12 = 7.75 hours and the range = 15 – 2 = 13 hours.
Median = 8 hours and the inter-quartile range = 9 – 6 = 3 hours.
The averages are similar but the measures of spread are significantly different since the extreme values of 2 and 15 are not included in the inter-quartile range.
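The worked examples can be reproduced with a few lines of Python. The sketch follows the method used on these slides: the quartiles are the medians of the lower and upper halves of the ordered data, and the middle value is left out of both halves when the number of values is odd. Other quartile conventions exist and can give slightly different numbers. The function expects at least two values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return median(lower), median(data), median(upper)


def inter_quartile_range(values):
    q1, _, q3 = quartiles(values)
    return q3 - q1


batteries = [2, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 15]
print(sum(batteries) / len(batteries), max(batteries) - min(batteries))  # 7.75 13
print(quartiles(batteries), inter_quartile_range(batteries))             # (6.0, 8.0, 9.0) 3.0
```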
Box Plots
Box and Whisker Diagrams.
Box plots are useful for comparing two or more sets of data like that shown below for heights of boys and girls in a class.
[Figure: box plots for Boys and Girls on a 130-190 cm axis]
Anatomy of a Box and Whisker Diagram.
[Figure: box plot on a 4-12 axis labeled Lowest Value, Lower Quartile, Median, Upper Quartile and Highest Value; the box spans the quartiles, with a whisker on each side]
Drawing a Box Plot.
Example 1: Draw a Box plot for the data below
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9
[Figure: box plot on a 4-12 axis]
Drawing a Box Plot.
Example 2: Draw a Box plot for the data below
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10
[Figure: box plot on a 3-15 axis]
Drawing a Box Plot.
Question: Stuart recorded the heights in cm of boys in his class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
Lower Quartile (QL) = 158
Median (Q2) = 171
Upper Quartile (QU) = 180
[Figure: box plot on a 130-190 cm axis]
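A box plot is fully determined by five numbers: lowest value, lower quartile, median, upper quartile and highest value. The sketch below computes them for the boys' heights with the same median-of-halves method as the slides and draws a rough text version of the plot. The character layout is an invention of this example, with one column per cm on a 130-190 axis.

```python
from statistics import median


def five_number_summary(values):
    """Lowest value, lower quartile, median, upper quartile, highest value."""
    data = sorted(values)
    half = len(data) // 2
    return data[0], median(data[:half]), median(data), median(data[-half:]), data[-1]


def text_box_plot(values, axis_min, axis_max, width=61):
    """Draw whiskers as '-', the box as '=', the ends as '|' and the median as 'M'."""
    low, q1, q2, q3, high = five_number_summary(values)

    def col(x):
        return round((x - axis_min) / (axis_max - axis_min) * (width - 1))

    row = [" "] * width
    for c in range(col(low), col(high) + 1):
        row[c] = "-"
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="
    row[col(low)] = row[col(high)] = "|"
    row[col(q2)] = "M"
    return "".join(row)


boys = [137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186]
print(five_number_summary(boys))  # (137, 158, 171, 180, 186)
print(text_box_plot(boys, 130, 190))
```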
Drawing a Box Plot.
Question: Gemma recorded the heights in cm of girls in the same class and constructed a box plot from the data. The box plots for both boys and girls are shown below. Use the box plots to choose some correct statements comparing heights of boys and girls in the class. Justify your answers.
[Figure: box plots for Boys and Girls on a 130-190 cm axis]
1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
4. The boys show less variability in height.
5. The smallest person is a girl.
6. The tallest person is a boy.
Konstanz Information Miner
• We will use the Workflow Management and Data Analytics Platform
• First we need to find out how to get our data into KNIME
Create Data
• Use Excel to create two columns
  – Girls, boys
• Make a few hundred random numbers (randbetween)
  – 140 - 170 for girls
  – 150 - 180 for boys
• Copy the table
• Paste into Notepad++
• Save as Distribution.txt
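If Excel is not at hand, an equivalent file can be produced with a short Python script. This is an alternative to the steps above, not part of the original exercise. The tab-separated layout with a header row is an assumption about what the copy and paste from Excel produces, and the row count of 300 is arbitrary.

```python
import random


def make_distribution(n_rows=300, seed=1):
    """Tab-separated table with random heights, like Excel's RANDBETWEEN."""
    rng = random.Random(seed)
    rows = ["Girls\tBoys"]
    for _ in range(n_rows):
        # randint includes both end points, as RANDBETWEEN does.
        rows.append(f"{rng.randint(140, 170)}\t{rng.randint(150, 180)}")
    return "\n".join(rows) + "\n"


def write_distribution(path="Distribution.txt", **kwargs):
    with open(path, "w") as handle:
        handle.write(make_distribution(**kwargs))
```

Calling `write_distribution()` creates Distribution.txt in the current folder, ready to be dragged into the KNIME workflow.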
KNIME Data Import
• Open KNIME
• Select the folder containing the data as workspace
• Right click LOCAL
  – Select new workflow
  – Name it HeightAnalysis
• Drag and drop Distribution.txt into the workflow
Box Plot
• Type box to find the Box Plot node
• Double click
• Right click the Box Plot node
  – Select Execute and open views
• Done
Workflow