69
Bioinformatics in the Mathematics Curriculum Jennifer R. Galovich MAA Short Course Mathfest 2007 San Jose, CA

Bioinformatics in the Mathematics Curriculum

  • Upload
    lang

  • View
    19

  • Download
    2

Embed Size (px)

DESCRIPTION

Bioinformatics in the Mathematics Curriculum. Jennifer R. Galovich MAA Short Course Mathfest 2007 San Jose, CA. HELP!!!!. Need judges for Math/Bio student talks (Janet Andersen prize) Saturday afternoon (talks are at Fairmont). What is Bioinformatics?. Outline. - PowerPoint PPT Presentation

Citation preview

Page 1: Bioinformatics in the Mathematics Curriculum

Bioinformatics in the Mathematics Curriculum

Jennifer R. Galovich

MAA Short Course

Mathfest 2007

San Jose, CA

Page 2: Bioinformatics in the Mathematics Curriculum

HELP!!!!

Need judges for Math/Bio student talks

(Janet Andersen prize)

Saturday afternoon (talks are at Fairmont)

Page 3: Bioinformatics in the Mathematics Curriculum

What is Bioinformatics?

Mathematics

Computer ScienceMolecular Biology

Page 4: Bioinformatics in the Mathematics Curriculum

Outline

• Level I: Algorithms for sequence (DNA, RNA, amino acid) alignment

• Level II: Getting a better handle on microarray data

• Level III: RNA secondary structure• College of St. Benedict/St. John’s University

Page 5: Bioinformatics in the Mathematics Curriculum

I.

Algorithms for Sequence Alignment

Page 6: Bioinformatics in the Mathematics Curriculum

Algorithms for Sequence Alignment – Biological Context –

• Similar sequences similar structure/function.

• Explore frequently occurring patterns to identify important functional motifs

• Starting point for phylogenetic analysis to measure variation between species and among populations:

Similarity of sequences Evolutionary conservation

Page 7: Bioinformatics in the Mathematics Curriculum

Algorithms for Sequence Alignment – Mathematical Learning Goals –

• Concept of an algorithm

• Matrix language and notation

• Recursions and dynamic programming

Page 8: Bioinformatics in the Mathematics Curriculum

Potential Audience

• Mathematics for Liberal Arts

• Mathematics for Allied Health Professions

• Freshman Seminar

Page 9: Bioinformatics in the Mathematics Curriculum

Global vs Local

• The global alignment problem: Measure the similarity between two sequences considered in their entirety.

[Needleman-Wunsch]

• The local alignment problem: Identify strongly similar subsequences (and ignore the rest)

[Smith-Waterman]

Page 10: Bioinformatics in the Mathematics Curriculum

Sequences differ because of mutations occurring over the course of evolution.

Three types of mutation:

• Substitution of one base by another

• Insertion of one or more bases

• Deletion of one or more bases

Insertion of base X into S gives S*

Deletion of base X from S* gives S

“Indel”

Page 11: Bioinformatics in the Mathematics Curriculum

Scoring functions• Reward matches• Penalize mismatches• Penalize indels (gap penalty)

+1

X

X

Y

X

X

_

X

or

Match Mismatch-1

Indel-2

Page 12: Bioinformatics in the Mathematics Curriculum

Pairwise Alignment

• Let S and T be (DNA) sequences. Insert spaces (gaps) into S and/or T so that S and T are the same length.

• Score the alignment according to a previously constructed scoring function

• Reinsert gaps, as needed, to find the maximum score alignment

Page 13: Bioinformatics in the Mathematics Curriculum

Example

S: A T C T G A T

T: T G C A T A

A T C T G A T

T G C _ A T A

Can you do better?

-2-1 -1 +1 -1 -1 -1 = -6

Page 14: Bioinformatics in the Mathematics Curriculum

Naïve Sequence Alignment

If S has length n and T has length m then there are = way too many

possible alignments.

A better way….

n

mn

Page 15: Bioinformatics in the Mathematics Curriculum

Needleman-Wunsch (1970)

Definition: The ith prefix of a sequence is the subsequence consisting of the first i letters

N-W solves the alignment problem by constructing a DP (dynamic programming) matrix A where A(i,j) gives the score of the best alignment between the ith prefix of S and the jth prefix of T, keeping track of the best alignment as it builds.

Page 16: Bioinformatics in the Mathematics Curriculum

A(i,j) = max

1)1,1(

2)1,(

2),1(

jiA

jiA

jiA Align Si with a gap -- vertical

Align Tj with a gap -- horizontal

Si /Tj match/mismatch -- diagonal

T G C A T A

0 -2 -4 -6 -8 -10 -12

A -2 -1 -3 -5 -5 -7 -9

T -4 -1 -2

C -6

T -8 ETC

G -10

A -12

T -14

Page 17: Bioinformatics in the Mathematics Curriculum

T G C A T A

0 -2 -4 -6 -8 -10 -12

A -2 -1 -3 -5 -5 -7 -9

T -4 -1 -2 -4 -6 -4 -6

C -6 -3 -2 -1 -3 -5 -5

T -8 -5 -4 -3 -2 -2 -4

G -10 -7 -4 -5 -4 -3 -3

A -12 -9 -6 -5 -4 -5 -2

T -14 -11 -8 -7 -6 -3 -4

S A T C T G A T

T T G C A T A --

Score -1 -1 +1 -1 -1 +1 -2 = -4

Page 18: Bioinformatics in the Mathematics Curriculum

Your turn!

Align

S: AAAC

T: AGC

1. Make a best guess

2. Use N-W to check

Page 19: Bioinformatics in the Mathematics Curriculum

A G C

0 -2 -4 -6

A -2

A -4

A -6

C -8

Page 20: Bioinformatics in the Mathematics Curriculum

A G C

0 -2 -4 -6

A -2 1 -1 3

A -4 -1 0 -2

A -6 -3 -2 -1

C -8 -5 -4 -1

--AGC A – GC A G--CAAAC A A AC A AAC

Page 21: Bioinformatics in the Mathematics Curriculum

Local alignment:Smith-Waterman

Same idea, but adapt scoring function to ignore negative scores:

0

1)1,1(

2)1,(

2),1(

jiA

jiA

jiA

A(i,j) = max

Align Si with a gap -- vertical

Align Tj with a gap -- horizontal

Si /Tj match/mismatch – diagonal

Start over

Page 22: Bioinformatics in the Mathematics Curriculum

ExampleT G C A T A

0 0 0 0 0 0 0

A 0 0 0 0 1 0 1

T 0 1 0 0 0 2 0

C 0 0 0 1 0 0 1

T 0 1 0 0 0 1 0

G 0 0 2 0 0 0 0

A 0 0 0 1 1 0 1

T 0 1 0 0 0 2 0

TG or AT TG AT

Page 23: Bioinformatics in the Mathematics Curriculum

A Riff on Gaps

So far, our gaps have been linear, but they could be…• Affine (to penalize opening a gap differently from

extending of a gap)• Your favorite increasing concave down function of the

length of the gap

P. Higgs and T. Attwood,Bioinformatics and Molecular Evolution, p. 122.

Page 24: Bioinformatics in the Mathematics Curriculum

Other variations on the theme

I. Multiple sequence alignment – align many sequences in order to uncover regions conserved across an entire group.

Progressive alignment:

All pairwise alignments Distance matrix Cluster diagram (guide tree)

Align clusters to form larger clusters

Page 25: Bioinformatics in the Mathematics Curriculum

II. Align proteins (sequences of amino acides)

Scoring matrix for DNA is 4 X 4.

Scoring matrices for amino acids are 20 x 20, e.g.

PAM (Point Accepted Mutation) and

BLOSUM (BLocks SUbstitution Matrices)

Both based on estimates of probabilities of substitution of one amino acid for another, using different data bases

Page 26: Bioinformatics in the Mathematics Curriculum

Software – CLUSTALX (via MEGA)

Example: Compare mitochondrial DNA sequences of primates

(Human, chimp, gorilla, orangutan, gibbon)

Data from Brown et al. J. Molecular Evolution 18 (1982) 225 – 239.

Page 27: Bioinformatics in the Mathematics Curriculum
Page 28: Bioinformatics in the Mathematics Curriculum

Resulting phylogeny

Pan troglodytes V00672

Gorilla V00658

Hylobates lar V00659

P pygmaeus V00675

Homo sapiens D38112

0.05

(Chimpanzee)

(Gibbon)

(Orangutan)

Page 29: Bioinformatics in the Mathematics Curriculum

II.

Managing Microarray Data

Page 30: Bioinformatics in the Mathematics Curriculum

Managing Microarray Data – Biological Context –

Goal: Measure (simultaneously) the level of expression of the genes in a cell (by measuring concentration of mRNA)

Applications:

• Compare mRNA levels in different types of cells

• Characterize different types of cancer

• ETC

Page 31: Bioinformatics in the Mathematics Curriculum

Managing Microarray Data – Mathematical Learning Goals– • Matrix operations

• Reinforce theorems about and properties of eigenvalues and eigenvectors

• Diagonalization of a symmetric matrix

Page 32: Bioinformatics in the Mathematics Curriculum

What is a microarray?

• “A” microarray experiment is actually the same experiment performed on many genes or proteins at the same time, hence LOTS of data to examine for trends and other features.

• Physically, a microarray is a slide onto which a rectangular array of spots of DN A sequences ( aka “probes”) have been deposited.

Page 33: Bioinformatics in the Mathematics Curriculum

How to make a microarray

http://www.accessexcellence.org/RC/VL/GG/microArray.html

Page 34: Bioinformatics in the Mathematics Curriculum

Microarray Matrix

• Compute R = log2(red/green intensity ratios) • Compare arrays for many samples (time points,

organisms, tumors,….) with all intensities computed relative to the same reference.

• Produce an p x N matrix (p indexes the genes, N indexes the samples) where

R < 0 : gene is down-regulated in test (red) sample compared to reference (green) sampleR = 0: gene is equally expressed in both samples

R > 0: gene is up-regulated in test (red) sample compared to reference (green) sample

Page 35: Bioinformatics in the Mathematics Curriculum

Example

http://media.pearsoncmg.com/bc/bc_campbell_genomics_2/medialib/web_art/Web_Art_Ch_6.pdf

Page 36: Bioinformatics in the Mathematics Curriculum

Problem

Somehow -- Find patterns of expression in what is typically something like thousands of genes expressed in tens or hundreds of different tumor cells types – the p x N matrix.

Page 37: Bioinformatics in the Mathematics Curriculum

Linear Algebra to the Rescue!!

• Goal – Engineer a projection of the high-dimensional data space onto a lower dimensional space – that is, find a point of view from which to observe the higher dimensional space that captures as much of the variability in the data as possible, and ignores the “noise”.

Principal Component Analysis

Page 38: Bioinformatics in the Mathematics Curriculum

The Algorithm

• Center the p x N matrix [X1 XN]:

Let M = (X1 ++ XN),

let B = (X1-M ++ XN-M), and

let S = BBT

• Diagonalize the p x p covariance matrix S.

N

1

1

1

N

Page 39: Bioinformatics in the Mathematics Curriculum

• Since S is positive semi-definite,the eigenvalues 1, …, p are non-negative.

• Order the eigenvalues of S in decreasing order and let u1,…,up be the corresponding (unit) eigenvectors. Let P = [u1,…,up]

• Define the change of variable Y = PX. Then the variance of y1 is maximized. Thus y1 is the first principal component; Similarly, y2, the second principal component, is orthogonal to y1 and maximizes the remaining variance, etc.

Page 40: Bioinformatics in the Mathematics Curriculum

Bottom Line

Instead of trying to understand the data in a p-dimensional space, reduce the dimensionality of the data space by choosing as many of the principal components yi as are needed to account for as much of the variance as desired.

Page 41: Bioinformatics in the Mathematics Curriculum

Payoffs

• Identify the genes with the largest (absolute) coefficients in the principal components to give some biological interpretation to the components

• Use this biological interpretation to assist in classifying the samples

• Plot the data with respect to the principal components to visualize clusters

Page 42: Bioinformatics in the Mathematics Curriculum

Crescenzi and Giuliani, FEBS Letters 507 (2001)

Page 43: Bioinformatics in the Mathematics Curriculum

Higgs and Attwood, Bioinformatics and Molecular Evolution

Page 44: Bioinformatics in the Mathematics Curriculum

Extensions…

• Singular Value Decomposition (Alter et al.)

• Clustering methods – hierarchical and otherwise, including gene shaving (Hastie, et al.)

• Machine learning, e.g. support vector machines. (Moore)

Page 45: Bioinformatics in the Mathematics Curriculum

III.

Combinatorics

and

RNA Folding

Page 46: Bioinformatics in the Mathematics Curriculum

Combinatorics and RNA Folding – Biological Context –

Crick’s Central Dogma

DNA

RNA

Proteins

Page 47: Bioinformatics in the Mathematics Curriculum

B. Subtilis RNase P RNA

GUUCUUAACGUUCGGGUAAUCGCUGCAGAUCUUGAAUCUGUAGAGGAAAGUCCAUGCUCGCACGGUGCUGAGAUGCCCGUAGUGUUCGUGCCUAGCGAAGUCAUAAGCUAGGGCAGUCUUUAGAGGCUGACGGCAGGAAAAAAGCCUACGUCUUCGGAUAUGGCUGAGUAUCCUUGAAAGUGCCACAGUGACGAAGUCUCACUAGAAAUGGUGAGAGUGGAACGCGGUAAACCCCUCGAGCGAGAAACCCAAAUUUUGGUAGGGGAACCUUCUUAACGGAAUUCAACGGAGAGAAGGACAGAAUGCUUUCUGUAGAUAGAUGAUUGCCGCCUGAGUACGAGGUGAUGAGCCGUUUGCAGUACGAUGGAACAAAACAUGGCUUACAGAACG UUAGACCACU

Page 48: Bioinformatics in the Mathematics Curriculum

http://www.bioinfo.rpi.edu/~zukerm/lectures/RNAfold-html/img24.gif

Page 49: Bioinformatics in the Mathematics Curriculum

B. Subtilis RNase P RNA

http://www.pharmazie.uni-marburg.de/pharmchem/akhartmann/bilder/rnase_p_bsubtilis.gif

Page 50: Bioinformatics in the Mathematics Curriculum

Folding Structure Function

Challenge:

Predict/describe RNA secondary structure

Page 51: Bioinformatics in the Mathematics Curriculum

Combinatorics and RNA Folding – Mathematical Learning Goals –

Exploration of various topics in combinatorics:

• Catalan numbers

• Binary trees

• Non-crossing set partitions

• 3-2-1 avoiding permutations

• Permutation statistics and properties of their distributions

Page 52: Bioinformatics in the Mathematics Curriculum

Definition: A secondary structure on {1, 2, ..., n} is a simple graph (i.e., a set of unordered pairs of elements of {1, 2, ..., n}) such that(i) the degree of every vertex is at most 1(ii) if (i, j) is an edge, then |i - j| >1(iii) if i < h < j < k and (i, j) is an edge, then (h, k) cannot be an edge.

NO… NO…

NO…YES!!!

RNA secondary structure as a combinatorial object(certain non-crossing set partitions)

Page 53: Bioinformatics in the Mathematics Curriculum

Is there a formula for s(n, k), the number of secondary structures on {1, 2, ..., n} which have exactly k edges?

Yes.

Do we know what it is?

Yes!!Theorem: (Schmitt and Waterman, 1994)The number of secondary structures on {1, 2, ..., n} whichhave exactly k edges (bonds) is

k

kn

kn

knkn

1

2

11

1

Page 54: Bioinformatics in the Mathematics Curriculum

Proof idea:

Secondary structures on {1, 2, ..., n} with k edges

(Unlabelled) ordered trees

with n – k + 1 vertices and n - 2k leaves

So…Count the ordered trees.

Page 55: Bioinformatics in the Mathematics Curriculum

The S-W Bijection by example: n = 13 k = 4

2

1 2 3 4 5 6 7 8 9 10 11 12 13

1 121

7

8

0

5

3

14

13

10

9 11

4 6

Page 56: Bioinformatics in the Mathematics Curriculum

A Permutation Model: Exploiting the Catalan connection

• Recall the Catalan number:

and that MANY sets of combinatorial objects are counted by Cn:

n

n

nCn

2

1

1

Non-crossing set partitions of {1,2,...,n) Ordered trees on n+1 vertices and...a host* of others

*Current number of combinatorial interpretations of Cn: 135. See Richard Stanley’s webpage for the latest additions.

http://www-math.mit.edu/~rstan/ec/

Page 57: Bioinformatics in the Mathematics Curriculum

3-2-1 Avoidance

Definition: A 3-2-1 avoiding permutation is a permutation with no decreasing substring of length 3.

Example: 1 4 6 2 5 3 is not 3-2-1 avoiding. 1 4 6 2 3 5 is 3-2-1 avoiding

Important Fact:3-2-1 avoiding permutations on n letters are counted by Cn.

Page 58: Bioinformatics in the Mathematics Curriculum

Non-crossingset partitions

3-2-1 avoidingpermutations

RNA secondary

structures???

Page 59: Bioinformatics in the Mathematics Curriculum

Some Definitions

Let π1 π2... πn be a permutation. πk is an excedance if k < πk πk is a strict non-excedance if k > πk

a pair (πi, πj) is an inversion in π if i < j but πi > πj. π has a descent at position i if πi > πi+1

Example:k: 1 2 3 4 5 6 7 πk : 4 7 3 2 6 5 1

Page 60: Bioinformatics in the Mathematics Curriculum

Let Πn be the set of all 3-2-1 avoiding permutations such that:

(i) If position i has a strict non-excedance, position i+1 does not.(ii) If c is a strict non-excedance, then c+1 is not.(iii) Every strict non-excedance is the second element of at least

two inversion pairs.

Let Πn,k be the set of all permutations in Πn which have exactly k strict non-excedances.

Example: n = 13; k = 4k: 1 2 3 4 5 6 7 8 9 10 11 12 13πk : 2 4 5 1 7 3 8 9 11 6 12 10 13

Page 61: Bioinformatics in the Mathematics Curriculum

Main Theorem

Let SSn,k be the set of all RNA secondary structures with k bonds.

Then there is a bijection from SSn,k to Πn,k.

Page 62: Bioinformatics in the Mathematics Curriculum

1 2 3 4 5 6 7 8 9 10 11 12 13

n = 13, k = 4

2 4 5 1 7 3 8 9 11 6 12 10 13

Page 63: Bioinformatics in the Mathematics Curriculum

Permutation Statistics

exc(π) = number of excedances in π

inv(π) = number of inversions in π

maj(π) = sum of the descent positions in π

Example:

k = 4exc(π) =inv(π) =maj(π) =

812

28

1 2 3 4 5 6 7 8 9 10 11 12 13

2 4 5 1 7 3 8 9 11 6 12 10 13

Page 64: Bioinformatics in the Mathematics Curriculum

Two RNA SS Statistics

Tau: Let vi be the number of unpaired bases internal to bond i. Then we define

τ(s) = i iv

Bond Index : B(s) = sum of the positions corresponding to left or right bonds.

1 2 3 4 5 6 7 8 9 10 11 12 13

τ(s) =

B(s) =

(4 + 2 + 1 + 1) = 8

(1+2+4+6+7+9+11 +12) = 52

Page 65: Bioinformatics in the Mathematics Curriculum

Our example: n = 13, k = 4s:

π: 2 4 5 1 7 3 8 9 11 6 12 10 13

inv(π) = 12maj(π) = 28

τ(s) = 8B(s) = 52

Theorem:

1 2 3 4 5 6 7 8 9 10 11 12 13

(1) inv = τ + k

(2) B = 2 (maj + k) - inv

Page 66: Bioinformatics in the Mathematics Curriculum

Distributions

Fact: The statistics inv and maj are equidistributed on the set of all permutations, and are both symmetric and unimodal.

What about B and τ?

Fact: B is symmetric on SSn,k and is unimodal when k = 1 (boring)

Conjectures: B is unimodal for any value of k. τ is unimodal, but not symmetric.

Page 67: Bioinformatics in the Mathematics Curriculum

Future directions

• Prove the unimodality conjectures• Use the B and τ statistics to evaluate folding

algorithms or find potential novel RNA structures or to distinguish among various types of small RNAs

• Investigate: Is there a useful connection between these statistics and RNA function?

• Use permutation statistics to describe/identify RNA motifs

• Look at B statistic on experimentally verified structures

Page 68: Bioinformatics in the Mathematics Curriculum

In summary….

• Bioinformatics tools can be introduced at all levels of the mathematics curriculum to reinforce standard content.

• You don’t need to be an expert (in biology or computer science) to do this!

Page 69: Bioinformatics in the Mathematics Curriculum

Opportunity!!!

Invited paper session

“Mathematical Questions in Bioinformatics”

Friday, 2 - 4