CS177 Review/Summary of the Madej lectures Tom Madej 12.07.05

Overview

• Basic biology.

• Protein/DNA sequence comparison.

• Protein structure comparison/classification.

• NCBI databases overview.

• Miscellaneous topics.


Lodish et al. Molecular Cell Biology, W.H. Freeman 2000


Protein/DNA sequence comparison

• What is the meaning of a sequence alignment?

• Scoring methods; amino acid substitution matrices, PSSMs.

• Basic computational methods; e.g. BLAST.

• Know how to run PSI-BLAST, interpret the results.


Homology

“… whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology.”

E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function, Kluwer 2003


A simple phylogenetic tree…


Human hemoglobin and more distantly related globins

• Human and horse

• Human and fish

• Human and insect

• Human and bacteria


Alignment notation: different notations for the same alignment!

VISDWNMPN-------MDGLECILVV----AANDGPMPQTRE

VISDWnm---pnMDGLECILVVaandgpmPQTRE


Computing sequence alignments

• You must be able to recognize the “answer” (correct alignment) when you see it (scoring system).

• You must be able to find the answer; i.e. compute it efficiently.


Scoring and computing alignments

• “Position independent” amino acid substitution tables; e.g. BLOSUM62.

• Optimal alignment by dynamic programming, e.g. Needleman-Wunsch (global) or Smith-Waterman (local); or fast heuristics such as BLAST.
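
To make the dynamic programming idea concrete, here is a minimal sketch of Smith-Waterman local alignment. It is a simplification and not what a production tool does: it assumes a flat match/mismatch score and a linear gap penalty (the `match`, `mismatch` and `gap` values are arbitrary illustration choices), whereas real searches use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # start a new local alignment
                          H[i - 1][j - 1] + s,    # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTACGTTT", "GGACGTGG")
print(score, top, bottom)
```

The table has one cell per pair of positions, so the running time grows with the product of the two sequence lengths; that cost is what heuristics such as BLAST avoid when searching a whole database.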


Score this alignment:

VISDWnm---pnMDGLECILVVaandgpmPQTRE

Use: BLOSUM62 matrix; gap opening penalty 10; gap extension penalty 1

(-1 + 4 - 2 - 3 - 3) - 10 - 1*11 + (-2 + 0 - 2 - 2 + 5) = -27
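
The same arithmetic as a small sketch. The substitution scores are copied from the sum on the slide rather than looked up in BLOSUM62, and the gap convention (a gap of length n costs opening + extension * n, charged once for 11 gap positions) is read off the slide's calculation; other programs define the affine penalty slightly differently. The function name is hypothetical.

```python
def alignment_score(pair_scores, gap_lengths, gap_open=10, gap_extend=1):
    """Sum substitution scores and subtract affine gap penalties.

    Assumed convention: a gap of length n costs gap_open + gap_extend * n.
    """
    substitution_total = sum(pair_scores)
    gap_total = sum(gap_open + gap_extend * n for n in gap_lengths)
    return substitution_total - gap_total


# Substitution scores for the two aligned blocks, as given on the slide.
left_block = [-1, 4, -2, -3, -3]
right_block = [-2, 0, -2, -2, 5]

score = alignment_score(left_block + right_block, gap_lengths=[11])
print(score)
```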


BLAST (Basic Local Alignment Search Tool)

• Extremely fast, can be on the order of 50-100 times faster than Smith-Waterman.

• Method of choice for database searches.

• Statistical theory for significance of results (extreme value distribution).

• Heuristic; does not guarantee optimal results.

• Many variants, e.g. PHI-, PSI-, RPS-BLAST.


Why database searches?

• Gene finding.

• Assigning likely function to a gene.

• Identifying regulatory elements.

• Understanding genome evolution.

• Assisting in sequence assembly.

• Finding relations between genes.


Issues in database searches

• Speed.

• Relevance of the search results (selectivity).

• Recovering all information of interest (sensitivity).

– The results depend on the search parameters, e.g. gap penalty, scoring matrix.

– Sometimes searches with more than one matrix should be performed.


E-values, P-values

• E-value (expectation value): the number of hits with at least the given score that you would expect to find by random chance in the search database.

• P-value (probability value): the probability that at least one hit attains at least the given score by random chance in the search database.

• E-values are easier to interpret than P-values.

• If the E-value is small enough, e.g. no more than 0.10, then it is essentially equal to the P-value, since P = 1 - exp(-E) is approximately E for small E.
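
A quick numerical check of how close the two values are. This sketch assumes the standard model in which the number of chance hits is Poisson-distributed with mean E, so that P = 1 - exp(-E); the function name is made up for illustration.

```python
import math


def p_value_from_e_value(e_value):
    """P(at least one chance hit) when the expected number of chance hits is E."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)


for e in (10.0, 1.0, 0.1, 0.01):
    print(f"E = {e:<5}  P = {p_value_from_e_value(e):.6f}")
```

For E of 0.1 or less the two numbers agree to within about 5%, while for large E the P-value saturates near 1 and stops being informative. That is one reason E-values are the easier quantity to read.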


PSI-BLAST

• Position Specific Iterated BLAST

• As a first step runs a (regular) BLAST.

• Hits that cross the threshold are used to construct a position specific score matrix (PSSM).

• A new search is done using the PSSM to find more remotely related sequences.

• The last two steps are iterated until convergence.


PSSM (Position Specific Score Matrix)

• One column per residue in the query sequence.

• Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column.

• There are difficulties; e.g. pseudo-counts are needed if there are not many sequences, and the sequences must be weighted to compensate for redundancy.
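
A minimal sketch of the log-odds idea, under simplifying assumptions that PSI-BLAST itself does not make: a uniform background frequency of 1/20, a fixed pseudo-count added to every residue type, and no sequence weighting at all. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudocount=1.0, background=None):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if background is None:
        # Assumption: uniform background; real tools use observed frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Count residues in this column, skipping gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            # The pseudo-count keeps unseen residues from scoring minus infinity.
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background[aa])
        pssm.append(column)
    return pssm


toy_alignment = ["GSA", "GSC", "GTA", "GSA"]
pssm = build_pssm(toy_alignment)
print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2))
```

With only four sequences the pseudo-counts dominate, which is the difficulty the last bullet points at: the fewer the sequences, the more the scores reflect the prior rather than the family.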


Two key advantages of PSSMs

• More sensitive scoring because of improved estimates of the probabilities of amino acids at specific positions.

• Describes the important motifs that occur in the protein family and therefore enhances the selectivity.


Position Specific Substitution Rates

Active site serine

Weakly conserved serine


Position Specific Score Matrix (PSSM)

Pos Res    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D      0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G     -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V     -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I     -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 D     -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S      4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C     -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N     -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G     -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D     -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S     -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G     -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G     -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P     -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L     -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N     -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C      0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q      0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A     -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Active site nucleophile

Serine scored differently in these two positions
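
The two serine rows can be compared directly. The sketch below copies rows 211 and 216 from the matrix on this slide; the helper name is hypothetical, and reading position 216 as the active-site serine (it is the more strongly conserved of the two) is my interpretation of the slide labels.

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows copied from the PSSM on the slide (query residue S at both positions).
ROW_211_S = [4, -4, -4, -4, -4, -1, -4, -2, -3, -3,
             -5, -4, -4, -5, -1, 4, 3, -6, -5, -3]
ROW_216_S = [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6,
             -6, -3, -5, -6, -4, 7, -2, -6, -5, -5]


def pssm_score(row, residue):
    """Score for placing `residue` at the position described by `row`."""
    return row[COLUMNS.index(residue)]


for residue in ("S", "A", "T"):
    print(residue, pssm_score(ROW_211_S, residue), pssm_score(ROW_216_S, residue))
```

At position 211 serine, alanine and threonine score about equally (4, 4, 3), so substitutions are tolerated. At position 216 serine scores 7 and the same substitutions score -2. A position-independent table such as BLOSUM62 cannot make that distinction, because it assigns one score to S against S wherever it occurs.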


PSI-BLAST key points

• The first PSSM is constructed from all hits that cross the significance threshold using “standard” BLAST.

• The search is then carried out with the PSSM to draw in new significant hits.

• If new hits are found then a new PSSM is constructed; these last two steps are iterated.

• The computation terminates upon “convergence”, i.e. when no new sequences are found to cross the significance threshold.
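
The control flow of these key points, as a sketch only. The three callables are hypothetical stand-ins (no real BLAST API is used), each search is assumed to return just the hits that cross the significance threshold, and the iteration cap is an added safety assumption rather than part of the method.

```python
def psi_blast_loop(query, search_with_query, make_pssm, search_with_pssm,
                   max_iterations=20):
    """Iterate PSSM construction and search until no new hits appear."""
    # Step 1: a regular BLAST search with the query sequence.
    included = set(search_with_query(query))
    for iteration in range(1, max_iterations + 1):
        # Step 2: build a PSSM from everything found so far.
        pssm = make_pssm(query, included)
        # Step 3: search again with the PSSM.
        new_hits = set(search_with_pssm(pssm)) - included
        if not new_hits:
            return included, iteration  # convergence: nothing new was found
        included |= new_hits
    return included, max_iterations


# Toy stand-ins: each sequence "finds" only its immediate relatives.
RELATIVES = {"q": {"a"}, "a": {"b"}, "b": {"c"}, "c": set()}


def toy_search_with_query(query):
    return RELATIVES[query]


def toy_make_pssm(query, included):
    return frozenset(included) | {query}


def toy_search_with_pssm(pssm):
    found = set()
    for member in pssm:
        found |= RELATIVES[member]
    return found


result = psi_blast_loop("q", toy_search_with_query, toy_make_pssm,
                        toy_search_with_pssm)
print(result)
```

In the toy run the query alone reaches only "a", but each rebuilt profile reaches one relative further out. This mirrors how a PSSM draws in more remotely related sequences, and also how a single false positive could pull in unrelated sequences on later rounds.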


Protein structure comparison/classification

• Protein secondary structure elements.

• Supersecondary structures (simple structure motifs).

• Folds and domains.

• Comparing structures (VAST).

• Superfolds.

• Fold classification (SCOP).

• Conserved Domain Database (CDD).


α-helix (3chy)

backbone atoms with sidechains


Parallel β-strands (3chy)


Anti-parallel β-strands (1hbq)


Higher level organization

• A single protein may consist of multiple domains. Examples: 1liy A, 1bgc A. The domains may or may not perform different functions.

• Proteins may form higher-level assemblies. Useful for complicated biochemical processes that require several steps, e.g. processing/synthesis of a molecule. Example: 1l1o chains A, B, C.


Supersecondary structures

• β-hairpin

• α-hairpin

• βαβ-unit

• β4 Greek key

• βα Greek key


Supersecondary structure: simple units

G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981


Supersecondary structure: Greek key motifs

G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981


Protein folds

• There is a continuum of similarity!

• Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing.

• Fold classification: To get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%).


Vector Alignment Search Tool (VAST)

• Fast structure comparison based on representing SSEs by vectors.

• A measure of statistical significance (VAST E-value) is computed (very differently from a BLAST E-value).

• VAST structure neighbor lists are useful for recognizing structural similarity.


Superfolds (Orengo, Jones, Thornton)

• Distribution of fold types is highly non-uniform.

• There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar. There are many examples of “isolated” fold types.

• Superfolds are characterized by wide sequence diversity and span a range of dissimilar functions.

• It is a research question as to the evolutionary relationships of the superfolds, i.e. do they arise by divergent or convergent evolution?


Superfolds and examples

• Globin: 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin

• α-up-down: 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe apolipoprotein E3

• Trefoil: 1i1b interleukin-1β; 1aaiB ricin; 1tie erythrina trypsin inhibitor

• TIM barrel: 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco

• OB fold: 1quqA replication protein A 32kDa subunit; 1mjc major cold-shock protein; 1bcpD pertussis toxin S5 subunit

• α/β doubly-wound: 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY

• Immunoglobulin: 2rhe Bence-Jones protein; 2cd4 CD4; 1ten tenascin

• UB αβ roll: 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G

• Jelly roll: 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin

• Plaitfold (Split αβ sandwich): 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier


Fold classification (when you have the structure…)

• First, look up PubMed abstracts for any relevant papers. E.g. if this is from a PDB file there will be references in it.

• Try checking SCOP or CATH.

• Look at VAST neighbors. See if the structure in question is highly similar to another structure with a known fold.


SCOP (Structural Classification of Proteins)

• http://scop.mrc-lmb.cam.ac.uk/scop/

• Levels of the SCOP hierarchy:

– Family: clear evolutionary relationship

– Superfamily: probable common evolutionary origin

– Fold: major structural similarity


Bioinformatics databases

• Entrez is by far the most useful, because of the links between the individual databases, e.g. literature, sequence, structure, taxonomy, etc.

• Other specialty databases available on the internet can also be very useful, of course!


Links Between and Within Nodes

(Figure: the Entrez databases as linked nodes. Nodes: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structures, Genomes, Taxonomy. Computational links: Word weight, VAST, BLAST, Phylogeny.)


Entrez queries

• Be able to formulate queries using index terms (Preview/Index), and limits.


Exercises!

• How many protein structures are there that include DNA and are from bacteria?

• In PubMed, how many articles are there from the journal Science and have “Alzheimer” in the title or abstract, and “amyloid beta” anywhere? How many since the year 2000?

• Notice that the results are not 100% accurate!

• In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse?
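
For the PubMed exercise, the query can be assembled term by term. The sketch below only builds the query string and does not contact NCBI. The field tags ([Journal], [Title/Abstract], [All Fields], and a date range ending in [dp]) follow PubMed's field-tag convention; that they correspond exactly to what the Preview/Index page generates is an assumption, and the helper names are made up.

```python
def field(term, tag):
    """Attach an Entrez field tag; quote multi-word terms as a phrase."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{tag}]"


def all_of(*clauses):
    """Combine clauses with Boolean AND."""
    return " AND ".join(clauses)


query = all_of(
    field("Science", "Journal"),
    field("Alzheimer", "Title/Abstract"),
    field("amyloid beta", "All Fields"),
)

# Restrict to publication dates from 2000 onward (3000 as an open upper bound).
since_2000 = all_of(query, "2000:3000[dp]")

print(query)
print(since_2000)
```

Pasting either string into the PubMed search box should behave like the corresponding index/limits selection. The counts themselves are not reproduced here, since they change as the database grows.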


p53 tumor suppressor protein

• Li-Fraumeni syndrome: having only one functional copy of p53 predisposes to cancer.

• Mutations in p53 are found in most tumor types.

• p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein, cdk2. This prevents the cell from progressing through the cell cycle.


G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21 217-228.


Exercise!

• Use Cn3D to investigate the binding of p53 to DNA.

• Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).


Miscellaneous topics

• BLAST a sequence against a genome; locate hits on chromosomes with map viewer.

• Obtain genomic sequence with map viewer.

• Spidey to predict intron/exon structure (but we won’t use spidey on the exam!).

• How sequence variations can affect protein structure/function.


“EST exercise” summary

• BLAST the EST (or other DNA seq) against the genome.

• From the BLAST output you can get the genomic coordinates of any nucleotide differences.

• Use map viewer to locate the hit on a chromosome; assume the hit is in the region of a gene.

• By following the gene link you can get an accession for mRNA.

• By using the “dl” link you can get an accession for the genomic sequence.

• Use “spidey” with the mRNA and genomic sequence to locate changed residues in the protein.


“EST exercise” summary (cont.)

• From the gene report you can follow the protein link, and then “Blink”.

• From the BLAST link page you can get to CDD and related structures.

• Since you know where the changed residues are, you can use the structures to study what effect the changes might have on the function of the protein.


Gene variants that can affect protein function

• Mutation to a stop codon; truncates the protein product!

• Insertion/deletion of multiple bases; changes the sequence of amino acid residues.

• Single point change could alter folding properties of the protein.

• Single point change could affect the active site of the protein.

• Single point change could affect an interaction site with another molecule.
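
The first few cases can be seen by translating a short coding sequence before and after a change. This is a toy sketch: the 15-base sequence is invented, the standard genetic code is assumed, and a single-base insertion is used as the simplest stand-in for an insertion/deletion that shifts the reading frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


wild_type  = "ATGGCTTGGAAATAA"   # invented example sequence
nonsense   = "ATGGCTTGAAAATAA"   # TGG -> TGA: tryptophan codon becomes a stop
missense   = "ATGGCTCGGAAATAA"   # TGG -> CGG: tryptophan becomes arginine
frameshift = "ATGCGCTTGGAAATAA"  # one base inserted after the start codon

for name, seq in [("wild type", wild_type), ("nonsense", nonsense),
                  ("missense", missense), ("frameshift", frameshift)]:
    print(f"{name:<10} {translate(seq)}")
```

Whether a single changed residue (the missense case) matters depends on where it falls in the structure, e.g. in the core, the active site or an interaction surface. That is why the last three bullets call for looking at the 3-D structure rather than the sequence alone.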


Important note!

• Most diseases (e.g. cancer) are complex and involve multiple factors (not just a single malfunctioning protein!).