
©CMBI 2005

Transfer of information

The main topic of this course is transfer of information.

A month in the lab can easily save you an hour in front of the computer.

Nothing is impossible for a man who doesn’t have to do it himself.

But to err is human; to really screw things up, you need a computer.



The main topic of this course is transfer of information.

In the protein world that leads to the questions:

1) From which protein can I transfer information?
2) How do I transfer what information from where to where?

Today’s answer is BLAST…


Database Searching with BLAST

Database searching with BLAST involves a series of topics we will deal with today:

• Database Searching
• Sequence Alignment
• Scoring Matrices
• Significance of an alignment
and:
• BLAST, algorithm
• BLAST, parameters
• BLAST, output


Database Searching

Identify similarities between:

your query sequence, likely with unknown structure and function

database subject sequences, with elucidated structures and functions


Database searching concept

The query sequence is compared/aligned with every subject sequence in the database.

High-scoring database sequences are assumed to be evolutionarily related to the query sequence.

If sequences are related by divergence from a common ancestor, they are said to be homologous.

We can only transfer information between homologs.

(And we will learn later that that is because structure is maintained longer during evolution than sequence).



We want to be able to say things like “this serine is phosphorylated in the database protein, so in my homologous protein the corresponding serine is likely to be phosphorylated too”.

That requires that the green serine and the purple serine both come from a common ancestor that was phosphorylated too.

And that, in turn, requires that both serines are located at the same location in their respective structures.


Equivalent structural positions

To know if positions in two different proteins are equivalent, we need to know both protein structures and compare them with protein structure comparison software.

But by the time you have solved one or two protein structures the four years of your PhD period are over...

So, we need a short-cut, and that, ladies and gentlemen, will be a sequence alignment (i.e. BLAST + ...).


Sequence alignment

Sequence alignment is a simple concept. You only have to find out which pairs of residues in two homologous sequences are derived from the same residue in the common ancestor.

TTSASDFRTRTTHIKILLMRL
STSATSYRTRSTHLRLMLMRI
seems easy, but:

ASDFTHGTREWDSTYHLIMNV
LTEYSHNSKDFETSFNILLQL
looks very hard...

(Still, both alignments seem correct to me, and four weeks from now, you will agree, I hope).


Sequence alignment is easy:

You only need three things:

1)A computer program that produces all possible alignments, and

2)A computer program that gives each alignment a score, and, the simplest,

3)A computer program that selects the highest scoring alignment from the very large number you tried.

(The next two weeks you will learn that only point 2 is difficult)
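The three programs above can be sketched in a few lines of Python. This is only a toy under strong assumptions: "all possible alignments" is reduced to every ungapped shift of one sequence against the other, and the score is just the number of identical residues. The function names are made up for this illustration.

```python
def ungapped_alignments(seq_a, seq_b):
    """Point 1 (simplified): yield every relative shift of seq_b against seq_a."""
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        yield offset


def score_alignment(seq_a, seq_b, offset):
    """Point 2 (toy scoring): count identical residues in the overlapping part."""
    score = 0
    for i, residue in enumerate(seq_b):
        j = i + offset
        if 0 <= j < len(seq_a) and seq_a[j] == residue:
            score += 1
    return score


def best_alignment(seq_a, seq_b):
    """Point 3: select the highest scoring alignment as (score, offset)."""
    scored = ((score_alignment(seq_a, seq_b, off), off)
              for off in ungapped_alignments(seq_a, seq_b))
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    # The 'easy' pair from the previous slide.
    print(best_alignment("TTSASDFRTRTTHIKILLMRL", "STSATSYRTRSTHLRLMLMRI"))
```

Allowing gaps makes the number of candidate alignments explode, and counting identities ignores that some substitutions are more likely than others; that is where the scoring matrix comes in.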


Scoring Matrix/Substitution Matrix

To score the quality of an alignment you need ‘something’ that compares amino acids: a matrix that contains scores for pairs of residues.

So, for protein/protein comparisons we need a 20 x 20 matrix of similarity scores where identical amino acids and those of similar character give higher scores compared to those of different character.

(And next week you will learn which residues are similar)


Substitution Matrices

Not all amino acids are equal
• Residues mutate more easily to similar ones
• Residues at surface mutate more easily
• Aromatics mutate preferably into aromatics

Mutations tend to favor some substitutions
• Core tends to be hydrophobic

Selection tends to favor some substitutions
• Cysteines are dangerous at the surface
• Cysteines in bridges seldom mutate


PAM250 Matrix


Scoring example

Score of an alignment is the sum of the scores of all pairs of residues in the alignment

sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
scores:      1 12 12  6  2  5 -1  2  6  1  0   => score = 46
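The same sum can be reproduced with a small lookup table. The sketch below only contains the pair scores that appear in this example (copied from the numbers on the slide), not the full 20 x 20 matrix; the names `PAIR_SCORES`, `pair_score` and `alignment_score` are chosen for this illustration.

```python
# Pair scores taken from the worked example above; a real run would use
# the complete 20 x 20 substitution matrix instead of this excerpt.
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(a, b):
    """Look up a pair in either order, because the matrix is symmetrical."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the scores of all aligned residue pairs (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(alignment_score("TCCPSIVARSN", "SCCPSISARNT"))  # prints 46
```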


Dayhoff Matrix (1)

The group of Dayhoff created a scoring matrix from a dataset of closely similar protein sequences that could be aligned unambiguously.

Then they counted all mutations (and non-mutations) and calculated the mutation frequencies

With a bit of math, they converted these frequencies into the famous Dayhoff matrix (also called PAM matrix).


Dayhoff Matrix (2)

Given the frequency of Leu and Val in my sequences, and the frequency of mutations, do I see more mutations of V → L than I would expect by chance alone?

Score of mutation A → B = log (observed A → B mutations / expected A → B mutations)

This is called a log odds score and can be negative, zero, or positive. Zero means no information, no contribution to the score of the alignment.

When using a log odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
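A minimal numeric sketch of the log odds idea. The base of the logarithm is not given on the slide, so base 10 is an assumption here, and the observed and expected values in the usage lines are invented purely for illustration.

```python
import math


def log_odds(observed, expected, base=10):
    """log(observed / expected); the base is an assumption of this sketch."""
    return math.log(observed / expected, base)


if __name__ == "__main__":
    # Invented numbers, only to show the three possible outcomes.
    print(log_odds(0.004, 0.001))  # seen more often than chance: positive
    print(log_odds(0.001, 0.001))  # seen as often as chance: zero
    print(log_odds(0.001, 0.004))  # seen less often than chance: negative
```

Because the scores are logarithms, adding them up over an alignment corresponds to multiplying the underlying odds ratios.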


Dayhoff Matrix (3)

This log odds matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 point mutation per 100 residues.

Matrices for greater evolutionary distances are generated from PAM 1 by repeatedly multiplying its underlying mutation probability matrix by itself and converting the result to log odds.

PAM250:
– 2.5 mutations per residue.
– equivalent to 20% matches remaining between two sequences, i.e. 80% of the amino acid positions are observed to have changed (one or more times).
– is default in many analysis packages.
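The repeated multiplication can be tried out on a toy example. The 3 x 3 matrix below is invented (a three-letter alphabet in which every residue has a 1% chance to change per step); it stands in for the real 20 x 20 mutation probability matrix, which is not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply a matrix by itself 'power' times, starting from the identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Invented toy matrix: row = original residue, column = residue after 1 PAM.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)

if __name__ == "__main__":
    for row in toy_pam250:
        print([round(p, 3) for p in row])
```

Each row still sums to 1, but the diagonal (the chance of seeing the original residue) has dropped far below 0.99. Converting such probabilities to log odds gives the scoring matrix.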


BLOSUM Matrix

Limit of Dayhoff matrix:

Matrices based on the Dayhoff model of evolutionary rates are derived from alignments of sequences that are at least 85% identical; that might not be optimal…

An alternative approach has been developed by Henikoff and Henikoff using local multiple alignments of more distantly related sequences.

All matrices are symmetrical...


BLOSUM Matrix (2)

The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database.

The BLOCKS database utilizes the concept of blocks (un-gapped amino acid patterns) that act as signatures of a family of proteins.

Substitution frequencies for all pairs of amino acids were then calculated, and these were used to calculate a log odds BLOSUM matrix.

Different matrices are obtained by varying the identity threshold. For example, BLOSUM80 was derived using blocks of 80% identity.

Which Matrix to use?

Close relationships (Low PAM, high BLOSUM)
Distant relationships (High PAM, low BLOSUM)

Often used defaults are: PAM250, BLOSUM62

BLOSUM 80        BLOSUM 62        BLOSUM 45
PAM 20           PAM 120          PAM 250
More conserved                    More variable


Significance of alignment (1)

When is an alignment statistically significant?

In other words:

How different is the alignment score that was found from the scores obtained by aligning arbitrary, unrelated sequences to the query sequence?

Or:

What is the probability that an alignment with this score could have arisen by chance?


Significance of alignment (2)

Database size = 20 x 10^6 amino acids

peptide    #hits
A          1 x 10^6
AP         50000
IAP        2500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015
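The numbers in this table follow from one simple assumption: all 20 amino acids are equally frequent, so every extra residue in the peptide divides the expected number of chance hits by 20. A small sketch (the function name is made up for this illustration):

```python
def expected_chance_hits(peptide, database_size=20_000_000):
    """Expected number of exact chance matches, assuming 20 equally
    frequent amino acids and ignoring edge effects at sequence ends."""
    return database_size / 20 ** len(peptide)


if __name__ == "__main__":
    for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
        print(f"{peptide:<8} {expected_chance_hits(peptide):g}")
```

Real amino acid frequencies are not equal, so in a real database a peptide rich in common residues gives more chance hits than one rich in rare residues.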


BLAST

Question: What database sequences are most similar to (or contain the most similar regions to) my own sequence?

• BLAST finds the highest scoring locally optimal alignments between a query sequence and all database sequences.
• Very fast algorithm
• Can be used to search extremely large databases
• Sufficiently sensitive and selective for most purposes
• Robust – the default parameters can usually be used


BLAST – Algorithm

Step 1: Read/understand user query sequence.

Step 2: Use hashing technology to select several thousand likely candidates.

Step 3: Do a real alignment between the query sequence and those likely candidates. ‘Real alignment’ is a main topic of this course.

Step 4: Present output to user.


BLAST Algorithm, Step 2

The program first looks for a series of short, highly similar fragments. It then extends these matching segments in both directions by adding residues. Residues are added until the incremental score drops below a threshold.
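The seed-and-extend idea can be sketched as follows. This is a strongly simplified illustration, not the BLAST implementation: seeds are exact three-letter words found through a hash table, the scores are a toy match/mismatch pair instead of a substitution matrix, and the word size, scores and drop-off value are all assumptions of this sketch.

```python
from collections import defaultdict

WORD_SIZE = 3             # assumed word length
MATCH, MISMATCH = 2, -1   # toy scores standing in for a substitution matrix
DROP_OFF = 3              # stop once the score falls this far below the best


def index_words(query, k=WORD_SIZE):
    """Hash every k-letter word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def extend(query, subject, q_pos, s_pos, step):
    """Extend without gaps in one direction; return (best gain, residues added)."""
    score = best = best_len = length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        score += MATCH if query[q_pos] == subject[s_pos] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= DROP_OFF:
            break
        q_pos += step
        s_pos += step
    return best, best_len


def seed_and_extend(query, subject, k=WORD_SIZE):
    """Return the best ungapped hit as (score, query segment, subject segment)."""
    index = index_words(query, k)
    best_hit = (0, "", "")
    for s in range(len(subject) - k + 1):
        for q in index.get(subject[s:s + k], []):
            right, r_len = extend(query, subject, q + k, s + k, +1)
            left, l_len = extend(query, subject, q - 1, s - 1, -1)
            total = k * MATCH + left + right
            if total > best_hit[0]:
                best_hit = (total,
                            query[q - l_len:q + k + r_len],
                            subject[s - l_len:s + k + r_len])
    return best_hit


if __name__ == "__main__":
    print(seed_and_extend("TTSASDFRTRTTHIKILLMRL", "STSATSYRTRSTHLRLMLMRI"))
```

The hash lookup is what makes the search fast: only database positions that share a word with the query are ever extended.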


Basic BLAST Algorithms

Program    Query             Database
BLASTP     protein           protein
BLASTN     DNA               DNA
BLASTX     translated DNA    protein
TBLASTN    protein           translated DNA
TBLASTX    translated DNA    translated DNA
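The table can be read as a lookup from (query type, database type) to program name. A small sketch, with type labels chosen for this illustration:

```python
# Keys are (query type, database type) as listed in the table above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("dna", "dna"): "BLASTN",
    ("translated dna", "protein"): "BLASTX",
    ("protein", "translated dna"): "TBLASTN",
    ("translated dna", "translated dna"): "TBLASTX",
}


def choose_program(query_type, database_type):
    """Return the BLAST flavour for a query/database combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for {key}")
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    # A DNA query translated and compared against a protein database.
    print(choose_program("translated DNA", "protein"))  # prints BLASTX
```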


PSI-BLAST

Position-Specific Iterated BLAST

• Distant relationships are often best detected by motif or profile searches rather than pair-wise comparisons
• PSI-BLAST first performs a BLAST search.
• PSI-BLAST uses the information from significant BLAST alignments returned to construct a position specific score matrix, which replaces the query sequence for the next round of database searching.
• PSI-BLAST may be iterated until no new significant alignments are found.
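What a position specific score matrix looks like can be shown with a toy sketch. It assumes ungapped hits of equal length, a uniform background frequency of 1/20 and a simple pseudocount; PSI-BLAST itself does considerably more than this. The three aligned peptides are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """One dict of log odds scores per alignment column (toy version)."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    pssm = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment against the profile, one column per residue."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    hits = ["KWLIAPY", "KWLVAPY", "RWLIAPF"]   # invented aligned hits
    pssm = build_pssm(hits)
    print(round(score_with_pssm(pssm, "KWLIAPY"), 2))
    print(round(score_with_pssm(pssm, "GGGGGGG"), 2))
```

Unlike a 20 x 20 matrix, the score for a residue now depends on the position: a tryptophan scores high in the second column and low everywhere else.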


BLAST Input

Steps in running BLAST:

• Enter your query sequence (cut-and-paste)
• Select the database(s) you want to search
And, optionally:
• Choose output parameters
• Choose alignment parameters (scoring matrix, filters, …)

Example query =
>something
AFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN

©CMBI 2010

BLAST Output

A high score indicates a likely relationship

A low probability indicates that a match is unlikely to have arisen by chance


BLAST Output

Low scores with high probabilities suggest that matches have arisen by chance


Alignment Significance in BLAST

P-value (probability)
Relates the score for an alignment to the likelihood that it arose by chance. The closer to zero, the greater the confidence that the hit is real.

E-value (expect value)
The number of alignments with a score at least this high that would be expected by chance in that database (e.g. if E=10, 10 matches with scores this high are expected to be found by chance).
A match will be reported if its E is below the threshold.
Lower E thresholds are more stringent, and report fewer matches.
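The two numbers are linked: if chance hits follow a Poisson distribution, the probability of finding at least one chance hit with such a score is P = 1 - exp(-E). The sketch below shows that conversion and the threshold rule; the hit names and E-values in the usage lines are invented.

```python
import math


def p_value_from_e(e_value):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson statistics."""
    return -math.expm1(-e_value)


def report_hits(hits, e_threshold=10.0):
    """Keep only (name, E) pairs whose E is below the threshold."""
    return [(name, e) for name, e in hits if e < e_threshold]


if __name__ == "__main__":
    hits = [("hit_a", 1e-50), ("hit_b", 0.02), ("hit_c", 7.5)]  # invented
    for name, e in report_hits(hits, e_threshold=1.0):
        print(name, e, p_value_from_e(e))
```

For small values E and P are practically identical; for E = 10, P is almost 1, which says that some chance hit with such a score is all but certain.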


BLAST result: easy


BLAST result: less easy


BLAST result: very difficult


Low complexity filter

Many sequences contain repeats or stretches that consist predominantly of one type of amino acid.

E.g. many nuclear proteins have a poly-asparagine tail, membrane proteins often consist mainly of hydrophobic amino acids, and many binding proteins have proline-rich stretches.

ASDFGTRGHPPPPPPPPPPP---------------NPPPPPPPPPLTSSDFRGT

Are NOT homologs, but analogs.

NNNNNNNN


BLAST - Low complexity filter

Your BLAST query sequence will look like this:

NNNNNNNN

(Figure: BLAST results with the filter ON and with the filter OFF.)
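The effect of such a filter can be imitated with a sliding window that masks stretches with very little variation. This is a stand-in based on Shannon entropy, not the filter BLAST itself uses; the window size, the entropy cut-off and the masking character X are all assumptions of this sketch.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(sequence, window=8, min_entropy=1.0, mask_char="X"):
    """Replace every window with entropy below min_entropy by mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # First part of the proline-rich example from the previous slide.
    print(mask_low_complexity("ASDFGTRGHPPPPPPPPPPP"))
```

The masked positions no longer contribute to seeds or scores, so two proteins can no longer hit each other merely because both contain a proline stretch.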


Demo

Ice, CNCZ, and the internet permitting, a demo follows now…
