Computational Biology Dr. Jens Allmer Lecture Slides Week 3


MBG404 Overview

Data

Generation

Processing

Storage

Mining

Pipelining

Sample preparation for mass spectrometry

[Figure: sucrose gradient (0.5 M, 1.3 M, 1.8 M) separating thylakoids from starch, etc.]

• Centrifugation of crude cell extracts in sucrose gradient
• Separation of the thylakoid fractions via SDS PAGE
• Cutting of interesting bands from the gel
• Proteolytic (trypsin) digestion in gel
• Liquid chromatography of resulting peptides
• Mass Spectrometry (MS)

1D SDS PAGE of thylakoid fraction from crude cell extracts of Chlamydomonas reinhardtii.

Mass spectrometric methods for protein identification

Peng, J. and Gygi, S.P. (2001) Proteomics: the move to mixtures. J. Mass Spectrom., 36, 1083-1091.

Schematic depiction of an ion trap mass spectrometer

[Figure: three ion trap spectra, each plotting relative abundance (0 to 100) against m/z. Panel 1: full MS scan, m/z 400-2000, labeled peaks include 626.3, 835.5, and 982.4. Panel 2: zoom scan, m/z 622.30-632.30, peaks at 626.1, 626.6, 627.1, and 627.7. Panel 3: MS/MS spectrum, m/z 200-1800, labeled peaks include 479.4, 535.8, 828.2, and 957.3.]

Example tandem MS spectrum

Scan 4501

Scan 4502

Mass spectrometric peptide fragmentation spectrum analysis (Sequest or Mascot)

[Flow diagram]
• Ion trap mass spectrometry (ITMS): single peptide ions → mass spectra
• Collision-induced dissociation (CID): sequentially fragmented peptide ions → tandem mass spectra
• De novo sequencing: tandem mass spectra → peptide amino acid sequences → BLAST against the database
• Database: translated DNA or protein sequences → 'in silico' tryptic digestion → theoretical MS/MS fragmentation pattern
• Database search: a 'hit' is a significant match with the theoretical fragmentation pattern of a database sequence

Cross correlation

• Digest the database in silico with the enzyme in question.

• Pick all peptides whose mass lies within a window around the precursor mass of the peptide in the mass spectrum.

• Calculate a theoretical spectrum for each of those peptides.

• Cross-correlate each theoretical spectrum with the measured mass spectrum.

WLQYSEVIHAR

Theoretical spectrum in red (a,b,c,x,y,z ions) and measured spectrum in blue
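The four steps above can be sketched in a few lines of Python. This is a simplified illustration, not the actual Sequest or Mascot scoring: it assumes tryptic cleavage after K or R (not before P) without missed cleavages, singly charged b and y ions only, unit-width m/z bins, and a background-corrected dot product as the correlation score. All function names and the default tolerance are hypothetical.

```python
import re

# Monoisotopic residue masses in Da
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def tryptic_digest(protein):
    """Step 1: cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[a] for a in peptide) + WATER


def fragment_mzs(peptide):
    """Step 3: singly charged b and y ions (a deliberate simplification)."""
    masses = [RESIDUE_MASS[a] for a in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions + y_ions


def to_bins(mzs, size=2000):
    """Turn a peak list into a 0/1 vector with one bin per m/z unit."""
    bins = [0.0] * size
    for mz in mzs:
        index = int(round(mz))
        if 0 <= index < size:
            bins[index] = 1.0
    return bins


def xcorr(observed, theoretical, max_shift=75):
    """Step 4: dot product at zero shift minus the mean over shifted copies."""
    def dot(shift):
        return sum(o * theoretical[i + shift]
                   for i, o in enumerate(observed)
                   if o and 0 <= i + shift < len(theoretical))
    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    return dot(0) - sum(dot(s) for s in shifts) / len(shifts)


def search(observed_mzs, precursor_mass, proteins, tolerance=1.0):
    """Steps 1 to 4 combined; returns (score, peptide) pairs, best first."""
    observed = to_bins(observed_mzs)
    # Step 2: keep only peptides inside the precursor mass window
    candidates = {pep for protein in proteins for pep in tryptic_digest(protein)
                  if abs(peptide_mass(pep) - precursor_mass) <= tolerance}
    scored = [(xcorr(observed, to_bins(fragment_mzs(p))), p) for p in candidates]
    return sorted(scored, reverse=True)
```

Calling `search` with the fragment list of WLQYSEVIHAR and a small protein list ranks that peptide above an isobaric variant such as WLQYSEVHIAR, because two of its fragment ions no longer match.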

Mass spectrometric peptide fragmentation spectrum analysis (Sequest or Mascot)

Limitation: Identification is limited to peptide sequences present in the database.

Database Search Software

• Many tools have been developed
  – OMSSA (NCBI, discontinued)
  – X!Tandem (The Global Proteome Machine)

X!Tandem

• http://www.thegpm.org/tandem/

X!Tandem Initialization Files

• X!Tandem
  – Taxonomy.xml
  – Default_Input.xml
  – Input.xml

• Running X!Tandem
  – ?>tandem.exe input.xml

• That was easy
  – But behold, what about the input?

OMSSA

• Open Mass Spectrometry Search Algorithm
• Discontinued
  – Due to problems?

• Still used by
  – PeptideShaker
  – SearchGUI

Sequence Alignment

• Exact
  – Simple

• Approximate
  – More difficult

[Figure: two sketches of a pattern aligned against a target]

Sequence Alignment

• Exact pattern matching
  – Naive method aligns pattern with each location of the target
  – Boyer-Moore indexes the pattern to skip some alignments
  – Wu-Manber indexes many patterns and skips some alignments
  – Indexing
    • Suffix tree indexes target and then quickly finds each pattern
    • Many other methods
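To make the difference between the naive method and a pattern-indexing method concrete, here is a minimal sketch. The second function is the Horspool simplification of Boyer-Moore (bad-character shifts only), not the full Boyer-Moore algorithm; the function names are illustrative.

```python
def naive_search(pattern, target):
    """Align the pattern with every location of the target."""
    m = len(pattern)
    return [i for i in range(len(target) - m + 1) if target[i:i + m] == pattern]


def horspool_search(pattern, target):
    """Index the pattern once, then skip alignments that cannot match."""
    m, n = len(pattern), len(target)
    if m == 0 or m > n:
        return []
    # For each character (except the last), distance from its rightmost
    # occurrence to the end of the pattern; later occurrences overwrite earlier.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits, i = [], 0
    while i <= n - m:
        if target[i:i + m] == pattern:
            hits.append(i)
        # Jump according to the target character under the pattern's last position
        i += shift.get(target[i + m - 1], m)
    return hits
```

Both functions return the same start positions; the second one simply inspects fewer alignments when the target contains characters that are rare or absent in the pattern.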

Sequence Alignment

• Approximate pattern matching
  – Pairwise
    • Local
      – Smith-Waterman
      – BLAST
      – FASTA
    • Global
      – Needleman-Wunsch
  – Multiple
    • T-Coffee
    • ClustalW
    • ...
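The difference between global and local pairwise alignment shows up directly in the dynamic programming recurrences. The sketch below computes only the optimal scores (no traceback) and assumes a simple scoring scheme with a linear gap penalty; the default values are illustrative and not taken from any particular tool.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: the best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can restart anywhere
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best
```

For two sequences that share only a short common region, the local score stays positive while the global score is pulled down by the unmatched flanks.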

Basic Local Alignment Search Tool

• Input
  – Pattern
  – Target
  – Search parameters and settings

• Output
  – Alignments in various formats
    • XML

• Help
  – http://www.ncbi.nlm.nih.gov/books/NBK1763/

BLAST

• Target
  – Needs to be indexed
  – Cannot be a plain FASTA file
  – Must match the pattern type and the BLAST variant
    • A protein target and a protein pattern can be searched using blastp

• Target indexing
  – makeblastdb, part of the BLAST package, can index FASTA files
  – Needs sequence input (e.g. FASTA, asn.1)
  – Needs the sequence type to be provided, e.g. protein

BLAST

• blastp
  – Needs indexed database
  – Needs query sequence (can be unindexed FASTA)
  – Produces alignments

Blast flavors

• BlastN - nt versus nt database
• BlastP - protein versus protein database
• BlastX - translated nt (6 frames) versus protein database
• tBlastN - protein versus translated nt database (6 frames)
• tBlastX - translated nt versus translated nt database (both 6 frames)

Query: DNA Protein

DB: DNA Protein

BLAST Output

• XML
  – -outfmt 5
    • This switch leads to XML output
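XML output is convenient because it can be read with any XML parser. The sketch below uses Python's standard library to pull hits out of such a file. The embedded sample is hand-written and heavily abbreviated; it only mimics the element names of BLAST's XML output (Iteration, Hit, Hsp and so on) and is not real program output. The function name and the E-value cutoff are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample in the style of BLAST XML output
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>example protein A</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>52.4</Hsp_bit-score>
              <Hsp_evalue>1e-10</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>example protein B</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>18.1</Hsp_bit-score>
              <Hsp_evalue>2.5</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def significant_hits(xml_text, max_evalue=1e-3):
    """Return (query, hit description, E-value, bit score) for good alignments."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            for hsp in hit.iter("Hsp"):
                evalue = float(hsp.findtext("Hsp_evalue"))
                if evalue <= max_evalue:
                    rows.append((query, hit.findtext("Hit_def"), evalue,
                                 float(hsp.findtext("Hsp_bit-score"))))
    return rows
```

For a real result file, read its contents into a string (or use `ET.parse`) and pass it to the same function.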

End Theory I

• 5 min mindmapping
• 10 min break

Practice I

Download Blast

• http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
  – Get blastp and makeblastdb from mbg404 since you are not allowed to install anything

• Download a FASTA file (protein, genome, collection of sequences in FASTA format)
  – Database must consist of amino acids since we only have access to blastp today

• Use makeblastdb from the BLAST package to index the file

• Several files will be created when you do it right

MakeDB

• Example
  – makeblastdb -in seq.fasta -dbtype prot -out seqBl -title seqBlastDB

• More information?
  – Go to the doc folder of BLAST
  – Documentation is there
  – http://www.ncbi.nlm.nih.gov/books/NBK1763/

BLAST

• Now that we have an indexed database try to run BLAST

• Read documentation and try to solve the simplest case
  – You will need the indexed database and you will need a FASTA file as query
  – You could create queries from the database and slightly change them

• Good luck
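One way to create queries from the database and slightly change them is sketched below: it reads FASTA records, substitutes a few residues at random positions, and writes the result as a new FASTA text. The function names, the number of changes, and the fixed random seed are illustrative choices, not part of the exercise.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mutate(sequence, n_changes, rng):
    """Replace n_changes distinct positions with a different amino acid."""
    residues = list(sequence)
    for position in rng.sample(range(len(residues)), n_changes):
        choices = [a for a in AMINO_ACIDS if a != residues[position]]
        residues[position] = rng.choice(choices)
    return "".join(residues)


def make_queries(fasta_text, n_changes=2, seed=0):
    """Build a query FASTA text with slightly changed database sequences."""
    rng = random.Random(seed)
    lines = []
    for header, sequence in read_fasta(fasta_text):
        lines.append(f">{header} mutated")
        lines.append(mutate(sequence, min(n_changes, len(sequence)), rng))
    return "\n".join(lines) + "\n"
```

Saving the returned text to a file gives a query that blastp should align back to the original database entries with a small number of mismatches.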

OMSSA

• Unzip folder and check
  – Alternatively, download from NCBI

• MS/MS mgf file
• Database file as FASTA
• makeblastdb.exe
• omssacl.exe
• usermods.xml

OMSSA

Before running OMSSA, database file must be converted to BLAST-like format.

So let’s run makeblastdb.exe to create a hash-indexed database

OMSSA

Here two different settings are used:
• The first uses a product ion tolerance of 0.05
• The second uses the default product ion tolerance

For variable modifications (-mv), check usermods.xml

X!Tandem

• Unzip folder and check
  – MGF-formatted spectra (file)
  – Database file (FASTA)
  – tandem-win32-10-12-01-1 folder
  – Used .xml configuration files (default_input.xml, input.xml, and taxonomy.xml)

• To get the same output as given in the zip folder:
  – Replace the configuration files in the «tandem-win\bin» folder with the ones in the «used» folder.
  – Also copy the database file to the «fasta» folder and the .mgf file to «bin» in «tandem-win»

X!Tandem Console Application

X!Tandem Default Input

Parameters such as mass tolerances, enzyme type, and the number of charges for the search can be reset in default_input.xml

X!Tandem Input.xml

In the input.xml file, you should specify the path of:
• taxonomy.xml
• default_input.xml
• Spectra filename
• Output filename

NOTE: Here input.xml and all files above are in the same folder (directory).

X!Tandem Taxonomy

In the taxonomy file, you should specify the database file path. In this example, the database file is in the «fasta» folder inside the «Xtandem\tandem-win32-10-12-01-1» folder.
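The two small configuration files can also be generated by a script, which helps once several searches have to be run. The sketch below writes them with Python's standard library. The note labels follow the conventions of the example files shipped with X!Tandem as far as known; check them against the default_input.xml in your own download. The taxon name and all file names in the usage are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_input(default_params, taxonomy, taxon, spectra, output):
    """Return the text of a minimal input.xml pointing at the other files."""
    root = ET.Element("bioml")
    settings = [
        ("list path, default parameters", default_params),
        ("list path, taxonomy information", taxonomy),
        ("protein, taxon", taxon),          # must match a label in taxonomy.xml
        ("spectrum, path", spectra),
        ("output, path", output),
    ]
    for label, value in settings:
        note = ET.SubElement(root, "note", {"type": "input", "label": label})
        note.text = value
    return ET.tostring(root, encoding="unicode")


def build_taxonomy(taxon, fasta_path):
    """Return the text of a taxonomy.xml mapping a taxon label to a FASTA file."""
    root = ET.Element("bioml", {"label": "x! taxon-to-file matching list"})
    entry = ET.SubElement(root, "taxon", {"label": taxon})
    ET.SubElement(entry, "file", {"format": "peptide", "URL": fasta_path})
    return ET.tostring(root, encoding="unicode")
```

For example, `build_input("default_input.xml", "taxonomy.xml", "mydb", "spectra.mgf", "output.xml")` together with `build_taxonomy("mydb", "../fasta/database.fasta")` yields a pair of files in which the taxon label links the search to the database.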

X!Tandem Output

End Practice I

• 15 min break

Theory II

Automation

• BLAST needed sequence file preprocessing

• OMSSA, X!Tandem, etc. may need conversion of spectra files
  – A lot of manual processes

• Needed: an automation facility

• Solution: computational pipelines
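A typical manual step that a pipeline can take over is spectra format conversion. As an illustration, the sketch below reads spectra in MGF format and writes each one in dta style, where the first line holds the singly protonated precursor mass (M+H)+ and the charge. It assumes a single charge per spectrum (for example `CHARGE=2+`) and falls back to charge 1 if none is given; both are simplifications, and the function names are hypothetical.

```python
PROTON = 1.00728


def parse_mgf(text):
    """Parse MGF text into a list of {'params': {...}, 'peaks': [...]} dicts."""
    spectra, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line:
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


def to_dta(spectrum):
    """Format one parsed spectrum as dta text."""
    precursor_mz = float(spectrum["params"]["PEPMASS"].split()[0])
    # Assumption: one charge state per spectrum, default 1 if absent
    charge = int(spectrum["params"].get("CHARGE", "1+").rstrip("+-"))
    # Convert the observed m/z to the singly protonated mass (M+H)+
    mh = precursor_mz * charge - (charge - 1) * PROTON
    lines = [f"{mh:.4f} {charge}"]
    lines += [f"{mz:.4f} {intensity:.1f}" for mz, intensity in spectrum["peaks"]]
    return "\n".join(lines) + "\n"
```

Looping over `parse_mgf(text)` and writing each `to_dta` result to its own file replaces one of the manual conversions mentioned above.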

Computational Pipeline

[Diagram: Spectra (mgf, mzXML, dta, mz2, ...) → Spectra Format Converter → dta → PepNovo → Result Converter → DB]

Analysis Network

[Diagram: Spectra (mgf, mzXML, dta, mz2, ...) → Spectra Format Converter → PepNovo, GPF, Lutefisk → Result Converters → DB]

General Pipeline Considerations

• Data cannot be connected to data

• Operations cannot be connected directly either

• Data needs to be transformed (operation)

• In the example, the data element cannot be directly connected to the DB

• The data element is also not necessary; it has been added to clarify that the process generates data which will go directly to the DB
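These connection rules can be expressed as a simple check: every connection must have an operation on exactly one side. The sketch below is one possible way to model this; the class names and the three node kinds ("data", "operation", "store") are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "data", "operation", or "store" (e.g. the DB)


class Pipeline:
    def __init__(self):
        self.edges = []

    def connect(self, source, target):
        """Add a data flow, rejecting connections that skip an operation."""
        operations = [n for n in (source, target) if n.kind == "operation"]
        if len(operations) == 0:
            raise ValueError(f"{source.name} -> {target.name}: "
                             "data must pass through an operation")
        if len(operations) == 2:
            raise ValueError(f"{source.name} -> {target.name}: "
                             "operations cannot be connected directly")
        self.edges.append((source, target))


spectra = Node("Spectra", "data")
converter = Node("ResultConverter", "operation")
database = Node("DB", "store")

pipeline = Pipeline()
pipeline.connect(spectra, converter)   # data -> operation: allowed
pipeline.connect(converter, database)  # operation -> store: allowed
```

Trying `pipeline.connect(spectra, database)` raises an error, which mirrors the crossed-out connection between the data element and the DB.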

[Diagram: legend (Data, Operation, Data Flow, DataStore DB) and an example connecting Operation, Data, and DataStore DB, with one connection crossed out (X)]

OpenMS

Pipeline Examples

• We will see a few examples
  – OpenMS/TOPP
  – Trans-Proteomic Pipeline
  – Proteomatic
  – Ensembl

TOPP - The OpenMS Proteomics Pipeline

http://open-ms.sourceforge.net/


Trans-Proteomic Pipeline (TPP)

http://sourceforge.net/projects/sashimi/

Proteomatic

http://www.uni-muenster.de/hippler/proteomatic/


Ensembl

http://genome.cshlp.org/content/14/5/934.full

http://www.ensembl.org

Standardization

• Some programs have the same aim
  – Unfortunately, they produce largely different output
  – They depend on different input formats
  – One need for pipelines arises from this

• Standardization can alleviate that problem

• Currently mostly XML
  – Development of controlled vocabularies is under way

• In ten years, a full transition to ontologies is expected

Standardization (HUPO PSI)

Selfmade

• Windows
  – Batch script
  – PowerShell

• Linux
  – Bash script
  – Shell script

• Common
  – A file that contains instructions
  – Usually run from the console

Delete Temp

• Batch script
  – cd c:\
  – cd Windows\Temp
  – del /s /q *.*

• Save file as
  – DeleteTemp.bat

• Put the file into
  – C:\Users\%USERNAME%\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup

• Next startup
  – The temporary files will be deleted

Pipelining

• The previous example performed pipelining

• You can use this for anything, for example
  – First making a BLAST DB
  – Second searching it

• Advantage
  – You have a log of the settings etc.
  – You can repeat it at any time
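The same two-step idea (make a BLAST DB, then search it) can be written as a small Python script instead of a batch file. The sketch below builds both command lines, appends them to a log file, and only runs them if the programs are found on the PATH. The file names, the database name, and the log file name are hypothetical; the command-line switches are the ones used in the earlier slides.

```python
import shutil
import subprocess


def build_commands(fasta, query, db_name, result="result.xml"):
    """Return the two command lines: index the database, then search it."""
    return [
        ["makeblastdb", "-in", fasta, "-dbtype", "prot",
         "-out", db_name, "-title", db_name],
        ["blastp", "-query", query, "-db", db_name,
         "-out", result, "-outfmt", "5"],
    ]


def run_pipeline(commands, log_path="pipeline.log"):
    """Log every command, then run it; stop if a program is missing."""
    with open(log_path, "a") as log:
        for command in commands:
            log.write(" ".join(command) + "\n")
            if shutil.which(command[0]) is None:
                log.write("  skipped: program not found on PATH\n")
                return False
            # check=True aborts the pipeline if a step fails
            subprocess.run(command, check=True)
    return True
```

A call such as `run_pipeline(build_commands("seq.fasta", "query.fasta", "seqBl"))` then leaves a log of the exact settings and can be repeated at any time.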

End Theory II

• 5 min mindmapping
• 10 min break

Practice II

Raw Data

• Screenshots

• Copy paste

• Unstructured

• Not integrated

• Unreflected

Information

• Structured Data

• Sorted

• Integrated

• Properly graphed
• Figure
  – Number
  – Caption
  – Reference

[Figure: chart of Prediction Distance (y-axis, 0 to 1) by Spectral Quality group (x-axis: 0.22-0.43, 0.55-0.66, 0.67-0.83) for PepNovo, PEAKS, Lutefisk, and OMSSA]

Figure 1: Spectral Quality (present fragment ions/ expected fragment ions) versus Prediction Distance to the true sequence (normalized edit distance; 0:great and 1:poor). Predictions were done by PepNovo, PEAKS, and Lutefisk while identification was done with OMSSA. All MS/MS spectra were of charge 1.
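The prediction distance used in the figure is a normalized edit distance. A minimal sketch of such a measure is given below; it assumes Levenshtein distance (insertions, deletions, substitutions) divided by the length of the longer sequence, which is one common normalization and not necessarily the one used for the figure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i]
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (a[i - 1] != b[j - 1])
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted, true):
    """0 means identical sequences, 1 means nothing in common."""
    longest = max(len(predicted), len(true))
    return edit_distance(predicted, true) / longest if longest else 0.0
```

With this definition, a de novo prediction that swaps two neighboring residues in an 11-residue peptide gets a distance of 2/11.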

Even Better

[Figure: box-and-whisker plot of Prediction Quality (y-axis, 0 to 1) by Spectral Quality group (x-axis: 0.22-0.43, 0.55-0.66, 0.67-0.83) for PepNovo, COMAS, Lutefisk, OMSSA, and PEAKS]

Figure 5: Spectral Quality (present fragment ions / expected fragment ions) versus Prediction Quality (normalized edit distance). The box-and-whisker plot presents three groups at different spectral quality. Note that there were no measurements between 0.43 and 0.55, below 0.22, or above 0.83.

Presenting Data

• When presenting data in your manuscripts:

• Raw data (not acceptable)

• Information (minimum)

• Knowledge (strive for this)


Whiteboardmaths.com

© 2004 - 2008 All rights reserved


www.similima.com 63

Median, Quartiles, Inter-Quartile Range and Box Plots.

Measures of Spread

Remember: The range is the measure of spread that goes with the mean.

Example 1. Two dice were thrown 10 times and their scores were added together and recorded. Find the mean and range for this data.

7, 5, 2, 7, 6, 12, 10, 4, 8, 9

Mean = (7 + 5 + 2 + 7 + 6 + 12 + 10 + 4 + 8 + 9) / 10 = 70 / 10 = 7

Range = 12 – 2 = 10

Median, Quartiles, Inter-Quartile Range and Box Plots.

Measures of Spread

The range is not a good measure of spread because one extreme (very high or very low) value can have a big effect. The measure of spread that goes with the median is called the inter-quartile range and is generally a better measure of spread because it is not affected by extreme values.

A reminder about the median

Averages (The Median)

The median is the middle value of a set of data once the data has been ordered.

Example 1. Robert hit 11 balls at Grimsby driving range. The recorded distances of his drives, measured in yards, are given below. Find the median distance for his drives.

85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70

Ordered data: 50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 140

Single middle value: Median drive = 85 yards

Averages (The Median)

The median is the middle value of a set of data once the data has been ordered.

Example 2. Robert hit 12 balls at Grimsby driving range. The recorded distances of his drives, measured in yards, are given below. Find the median distance for his drives.

85, 125, 130, 65, 100, 70, 75, 50, 140, 135, 95, 70

Ordered data: 50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 135, 140

Two middle values (85 and 95), so take the mean: Median drive = 90 yards

Finding the median, quartiles and inter-quartile range.

Example 1: Find the median and quartiles for the data below.

12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

Order the data: 4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9

Inter-Quartile Range = 9 - 5½ = 3½
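The quartiles in these examples are found as the medians of the lower and upper halves of the ordered data, leaving out the middle value when the number of values is odd. A minimal sketch of that rule is shown below; note that library routines may use other interpolation conventions and can therefore give slightly different quartiles. The function names are illustrative, and the input is assumed to contain at least two values.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n, mid = len(ordered), len(ordered) // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(ordered), median(upper)


def inter_quartile_range(values):
    q1, _, q3 = quartiles(values)
    return q3 - q1
```

Applied to 12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10 this gives Q1 = 5.5, Q2 = 8, Q3 = 9 and an inter-quartile range of 3.5, matching the worked example.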

Finding the median, quartiles and inter-quartile range.

Example 2: Find the median and quartiles for the data below.

6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10

Order the data: 3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10

Inter-Quartile Range = 10 - 4 = 6

Discuss the calculations below.

Battery Life: The life of 12 batteries recorded in hours is:

2, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10, 15

Mean = 93/12 = 7.75 hours and the range = 15 – 2 = 13 hours.

Median = 8 hours and the inter-quartile range = 9 – 6 = 3 hours.

The averages are similar but the measures of spread are significantly different since the extreme values of 2 and 15 are not included in the inter-quartile range.

Box Plots

Box and Whisker Diagrams.

Box plots are useful for comparing two or more sets of data like that shown below for heights of boys and girls in a class.

[Figure: box plots of the heights of boys and girls on a shared axis from 130 to 190 cm]

Anatomy of a Box and Whisker Diagram.

[Figure: box and whisker diagram on an axis from 4 to 12, labeled with lowest value, lower quartile, median, upper quartile, and highest value; the box spans the quartiles and a whisker extends to each extreme]

Drawing a Box Plot.

Example 1: Draw a Box plot for the data below

4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

Lower Quartile (Q1) = 5½
Median (Q2) = 8
Upper Quartile (Q3) = 9

[Figure: box plot drawn on an axis from 4 to 12]

Drawing a Box Plot.

Example 2: Draw a Box plot for the data below

3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15

Lower Quartile (Q1) = 4
Median (Q2) = 8
Upper Quartile (Q3) = 10

[Figure: box plot drawn on an axis from 3 to 15]

Drawing a Box Plot.

Question: Stuart recorded the heights in cm of boys in his class as shown below. Draw a box plot for this data.

137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186

Lower Quartile (QL) = 158
Median (Q2) = 171
Upper Quartile (QU) = 180

[Figure: box plot drawn on an axis from 130 to 190 cm]

Drawing a Box Plot.

Question: Gemma recorded the heights in cm of girls in the same class and constructed a box plot from the data. The box plots for both boys and girls are shown below. Use the box plots to choose some correct statements comparing heights of boys and girls in the class. Justify your answers.

[Figure: box plots for boys and girls on a shared axis from 130 to 190 cm]

1. The girls are taller on average.
2. The boys are taller on average.
3. The girls show less variability in height.
4. The boys show less variability in height.
5. The smallest person is a girl.
6. The tallest person is a boy.

Konstanz Information Miner

• We will use KNIME (Konstanz Information Miner), a workflow management and data analytics platform

• First we need to find out how to get our data into KNIME

Create Data

• Use Excel to create two columns
  – Girls, boys

• Make a few hundred random numbers (RANDBETWEEN)
  – 140 - 170 for girls
  – 150 - 180 for boys

• Copy the table
• Paste into Notepad++
• Save as Distribution.txt
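The same table can also be produced with a short script instead of Excel. The sketch below draws uniformly distributed whole numbers from the two ranges, as RANDBETWEEN does, and formats them as tab-separated text with a header row. The number of rows, the seed, and the function names are arbitrary choices.

```python
import random


def make_table(rows=300, seed=0):
    """Return tab-separated text with a Girls and a Boys column."""
    rng = random.Random(seed)
    lines = ["Girls\tBoys"]
    for _ in range(rows):
        girl = rng.randint(140, 170)  # both bounds inclusive
        boy = rng.randint(150, 180)
        lines.append(f"{girl}\t{boy}")
    return "\n".join(lines) + "\n"


def write_table(path="Distribution.txt", rows=300, seed=0):
    """Save the table so it can be dragged into a KNIME workflow."""
    with open(path, "w") as handle:
        handle.write(make_table(rows, seed))
```

Calling `write_table()` creates Distribution.txt in the current folder; changing the seed gives a different random sample.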

KNIME Data Import

• Open KNIME
• Select the folder containing the data as workspace

• Right click LOCAL
  – Select new workflow
  – Name it HeightAnalysis

• Drag and drop Distribution.txt into the workflow

Box Plot

• Type box to find the Box Plot node

• Double click
• Right click the Box Plot node
  – Select Execute and open views

• Done

Workflow
