CS262 Lecture 16, Win07, Batzoglou

Gene Recognition

Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov


Gene structure

[Figure: a gene laid out as exon1, intron1, exon2, intron2, exon3, with the steps transcription, splicing, and translation]

exon = protein-coding
intron = non-coding

Codon: A triplet of nucleotides that is converted to one amino acid


Hidden Markov Models for Gene Finding

GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

[Figure: the sequence is segmented into regions labeled intergene, exon, intron, exon, intron, exon, intergene; the HMM states shown are Intergene State, First Exon State, and Intron State]
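The idea of labeling each base with a hidden state can be sketched with a small Viterbi decoder. This is a minimal illustration only: the three states come from the slide, but every probability below is a made-up placeholder, and real gene finders use many more states and trained parameters.

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for seq (log-space Viterbi)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Scores for the first symbol.
    score = [{s: log(start.get(s, 0.0)) + log(emit[s][seq[0]]) for s in states}]
    back = []
    for ch in seq[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in states:
            # Best predecessor of state s.
            best = max(states, key=lambda r: prev[r] + log(trans[r].get(s, 0.0)))
            cur[s] = prev[best] + log(trans[best].get(s, 0.0)) + log(emit[s][ch])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical toy parameters (NOT from the lecture).
STATES = ["intergene", "exon", "intron"]
START = {"intergene": 1.0}
TRANS = {
    "intergene": {"intergene": 0.9, "exon": 0.1},
    "exon": {"exon": 0.8, "intron": 0.1, "intergene": 0.1},
    "intron": {"intron": 0.8, "exon": 0.2},
}
EMIT = {
    "intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

SEQ = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
PATH = viterbi(SEQ, STATES, START, TRANS, EMIT)
```

`PATH` holds one state label per base of the slide's example sequence; with different placeholder parameters the segmentation changes accordingly.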



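Duration HMM for Gene Finding

[Figure: a nucleotide sequence segmented into Exon1, Exon2, Exon3, with an exon of duration d and position j+2 marked]

Intron emissions: $\prod_i P_{\mathrm{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$

Exon duration and emissions: $P_{\mathrm{EXON\_DUR}}(d) \prod_i P_{\mathrm{EXON}((i - j + 2) \bmod 3)}(x_i \mid x_{i-1} \ldots x_{i-w})$

5' splice site: $P_{5'\mathrm{SS}}(x_{i-3} \ldots x_{i+4})$

Stop codon: $P_{\mathrm{STOP}}(x_{i-4} \ldots x_{i+3})$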


HMM-based Gene Finders

• GENMARK (Borodovsky & McIninch 1993)

• GENIE (Kulp 1996)

• GENSCAN (Burge 1997)
    Big jump in accuracy of de novo gene finding
    Currently, one of the best
    HMM with duration modeling for Exon states

• FGENESH (Solovyev 1997) Currently one of the best

• HMMgene (Krogh 1997)

• VEIL (Henderson, Salzberg, & Fasman 1997)


Better way to do it: negative binomial

• EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)

• Negative binomial with n = 3
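To see why a negative binomial is a better length model than a single self-looping state, the sketch below compares the two. It assumes the usual construction, which the slide does not spell out: a length is the total time spent in n identical stages, each of which ends with probability p per step. The function names are hypothetical.

```python
from math import comb

def neg_binomial_pmf(d, n, p):
    """P(length = d) when the length is the sum of n geometric stages,
    each stage ending with probability p at every step."""
    if d < n:
        return 0.0  # at least one step is spent in every stage
    return comb(d - 1, n - 1) * p**n * (1 - p) ** (d - n)

def geometric_pmf(d, p):
    """Length distribution of a single self-looping HMM state (n = 1)."""
    return neg_binomial_pmf(d, 1, p)
```

With n = 1 the most likely length is always 1, while with n = 3 the distribution peaks away from the shortest lengths, which is closer to how real gene lengths behave.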


GENSCAN’s hidden weapon

• C+G content is correlated with:
    Gene content (+)
    Mean exon length (+)
    Mean intron length (–)

• These quantities affect parameters of model

• Solution: Train parameters of model in four different C+G content ranges!
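Selecting a parameter set by C+G content can be sketched as a simple binning step. The slide says only that there are four ranges; the threshold values and function names below are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Assumed boundaries between the four C+G ranges (illustrative only).
GC_THRESHOLDS = (0.43, 0.51, 0.57)

def gc_fraction(seq):
    """Fraction of A/C/G/T bases in seq that are C or G."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("C") + seq.count("G")) / total

def gc_bin(seq, thresholds=GC_THRESHOLDS):
    """Index (0..3) of the C+G range, i.e. which parameter set to load."""
    return bisect_right(thresholds, gc_fraction(seq))
```

A gene finder would then train and store one set of HMM parameters per bin and decode each input region with the set matching its bin.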


Evaluation of Accuracy

(Slide by NF Samatova)

• Sensitivity (Sn): Fraction of exons (coding nucleotides) whose boundaries are predicted exactly (that are predicted as coding)

• Specificity (Sp): Fraction of the predicted exons (coding nucleotides) that are exactly correct (that are coding)

• Correlation Coefficient (CC): Combined measure of Sensitivity & Specificity
    Range: -1 (always wrong) to +1 (always right)

Confusion matrix (nucleotide level):

                            Actual: Coding    Actual: No Coding
    Predicted: Coding            TP                 FP
    Predicted: No Coding         FN                 TN
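The nucleotide-level measures can be computed directly from the four confusion-matrix counts. This is a sketch under stated assumptions: annotations are encoded as strings of 'C' (coding) and 'N' (non-coding), which is a hypothetical encoding, and CC is taken to be the standard correlation coefficient of the 2x2 table.

```python
from math import sqrt

def confusion_counts(actual, predicted):
    """Count (TP, FP, TN, FN) over two equal-length 'C'/'N' strings."""
    if len(actual) != len(predicted):
        raise ValueError("annotations must have the same length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1
        elif a == "N" and p == "C":
            fp += 1
        elif a == "N" and p == "N":
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC); assumes no denominator is zero."""
    sn = tp / (tp + fn)   # coding bases that were predicted as coding
    sp = tp / (tp + fp)   # predicted coding bases that are coding
    cc = (tp * tn - fp * fn) / sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc
```

Note that Sp here follows the gene-finding convention on the slide (fraction of predictions that are correct), not the TN-based definition used in some other fields.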


Results of GENSCAN

• On the initial test dataset (Burset & Guigo)
    80% exact exon detection
    10% partial exons
    10% wrong exons

• In general
    HMMs have been best in de novo prediction
    In practice they overpredict human genes by ~2x


Comparison-based Methods


Cross-species gene finding

[Figure: gene with Exon1, Intron1, Exon2, Intron2, Exon3, oriented 5’ to 3’]

[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
          |     ||||| ||||| |||   ||||||||||||| |||||    |  |
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-


Comparison of 1196 orthologous genes (Makalowski et al., 1996)

• Sequence identity between genes in human/mouse
    – exons: 84.6%
    – protein: 85.4%
    – introns: 35%
    – 5’ UTRs: 67%
    – 3’ UTRs: 69%

• 27 proteins were 100% identical


Not always: HoxA human-mouse


Patterns of Conservation

                   Genes     Intergenic    Separation
    Mutations      30%       58%           2-fold
    Gaps           1.3%      14%           10-fold
    Frameshifts    0.14%     10.2%         75-fold


Twinscan

• Twinscan is an augmented version of the Genscan HMM.

[Figure: HMM with exon (E) and intron (I) states, annotated with transitions, duration, and emissions over the sequence ACUAUACAGACAUAUAUCAU]


Twinscan Algorithm

1. Align the two sequences (e.g. from human and mouse)

2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )

    New “alphabet”: 4 x 3 = 12 letters = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

3. Run Viterbi using emissions ek(b) where b ∈ { A-, A:, A|, …, U| }

    Emission distributions ek(b) estimated from real genes from human/mouse

    eI(x|) < eE(x|): matches favored in exons
    eI(x-) > eE(x-): gaps (and mismatches) favored in introns


Example

Human: ACGGCGACGUGCACGU

Mouse: ACUGUGACGUGCACUU

Alignment: ||:|:|||||||||:|

Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|

Recall, eE(A|) > eI(A|)

eE(A-) < eI(A-)

Likely exon
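The conversion from an alignment to the 12-letter alphabet can be sketched in a few lines. This is an illustration of step 2 only, assuming the two sequences are already aligned to equal length with '-' for gaps; the function name is hypothetical and alignment columns without a human base are skipped.

```python
def conservation_symbols(human, mouse):
    """Pair each human base with '-', ':' or '|' (gap, mismatch, match)."""
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have the same length")
    symbols = []
    for h, m in zip(human, mouse):
        if h == "-":
            continue  # no human base in this column
        if m == "-":
            mark = "-"
        elif h == m:
            mark = "|"
        else:
            mark = ":"
        symbols.append(h + mark)
    return symbols

HUMAN = "ACGGCGACGUGCACGU"
MOUSE = "ACUGUGACGUGCACUU"
TWINSCAN_INPUT = " ".join(conservation_symbols(HUMAN, MOUSE))
```

For the slide's example, `TWINSCAN_INPUT` reproduces the symbol string shown above; these symbols then replace plain nucleotides as the HMM's emissions.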


HMMs for simultaneous alignment and gene finding:

Generalized Pair HMMs


The SLAM hidden Markov model


Exon GPHMM

[Figure: exons of lengths d and e]

1. Choose exon lengths (d, e).
2. Generate alignment of length d + e.


Approximate alignment


Measuring Performance


Example: HoxA2 and HoxA3

[Figure: annotation tracks compared: SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]


Gene Regulation and Microarrays


Overview

• A. Gene Expression and Regulation

• B. Measuring Gene Expression: Microarrays

• C. Finding Regulatory Motifs


Cells respond to environment

Cell responds to environment: various external messages


Genome is fixed – Cells are dynamic

• A genome is static
    Every cell in our body has a copy of the same genome

• A cell is dynamic
    Responds to external conditions
    Most cells follow a cell cycle of division

• Cells differentiate during development

• Gene expression varies according to:
    Cell type
    Cell cycle
    External conditions
    Location

slide credits: M. Kellis


Where gene regulation takes place

• Opening of chromatin

• Transcription

• Translation

• Protein stability

• Protein modifications


Transcriptional Regulation

• Efficient place to regulate:

No energy wasted making intermediate products

• However, slowest response time

After a receptor notices a change:

1. Cascade message to nucleus

2. Open chromatin & bind transcription factors

3. Recruit RNA polymerase and transcribe

4. Splice mRNA and send to cytoplasm

5. Translate into protein


Transcription Factors Binding to DNA

Transcription regulation:

Certain transcription factors bind DNA

Binding recognizes DNA substrings:

Regulatory motifs


Promoter and Enhancers

• Promoter necessary to start transcription

• Enhancers can affect transcription from afar


Regulation of Genes

[Figure, three animation frames. Labels: DNA, Regulatory Element, Gene, Transcription Factor (Protein), RNA polymerase (Protein); the final frame adds New protein]


TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCA
TAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT

[Figure: the sequence above, with promoter motifs, 3’ UTR motifs, exons, and introns highlighted]


Example: A Human heat shock protein

• TATA box: positioning transcription start

• TATA, CCAAT: constitutive transcription

• GRE: glucocorticoid response

• MRE: metal response

• HSE: heat shock element

[Figure: promoter of heat shock hsp70, spanning positions -158 to 0 upstream of GENE, with sites labeled TATA, SP1, CCAAT, AP2, HSE, AP2, CCAAT, SP1]


The Cell as a Regulatory Network

• Genes = wires
• Motifs = gates

[Figure: circuit with inputs A, B, C. Gene D ("Make D") carries the motifs: If C then D; If B then NOT D; If A and B then D. Gene B ("Make B") carries the motif: If D then B.]


The Cell as a Regulatory Network (2)


DNA Microarrays

Measuring gene transcription in a high-throughput fashion


What is a microarray

• Measure the level of mRNA messages in a cell

[Figure: array spots DNA 1 to DNA 6 correspond to Gene 1 to Gene 6; RNA 4 and RNA 6 are reverse-transcribed (RT) into cDNA 4 and cDNA 6, hybridized to the array, and measured]

slide credits: M. Kellis


• A 2D array of DNA sequences from thousands of genes

• Each spot has many copies of same gene

• Measure number of hybridizations per spot

Result:
• Thousands of “experiments” – one per gene – in one go

• Perform many microarrays for different conditions:
    Time during cell cycle
    Temperature
    Nutrient level


Goal of Microarray Experiments

• Measure level of gene expression across many different conditions:

Expression Matrix M: {genes} × {conditions}:

$M_{ij} = |\mathrm{gene}_i|$ in $\mathrm{condition}_j$

• Group genes into coregulated sets

Observe cells under different conditions

Find genes with similar expression profiles

• Potentially regulated by same TF

slide credits: M. Kellis


Clustering vs. Classification

• Clustering
    Idea: Groups of genes that share similar function have similar expression patterns
    • Hierarchical clustering
    • k-means
    • Bayesian approaches
    • Projection techniques
        • Principal Component Analysis
        • Independent Component Analysis

• Classification
    Idea: A cell can be in one of several states
    (Diseased vs. Healthy, Cancer X vs. Cancer Y vs. Normal)
    Can we train an algorithm to use the gene expression patterns to determine which state a cell is in?
    • Support Vector Machines
    • Decision Trees
    • Neural Networks
    • K-Nearest Neighbors


Clustering Algorithms

• K-means

[Figure: points a to h assigned to centers c1, c2, c3]

• Hierarchical

[Figure: points a to h joined into a tree with leaf order a, b, d, e, f, g, h, c]

slide credits: M. Kellis


Hierarchical clustering

• Bottom-up algorithm:
    Initialization: each point in a separate cluster

• At each step:
    Choose the pair of closest clusters
    Merge

• The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y

• Avoids the problem of specifying the number of clusters

[Figure: points a to h]

slide credits: M. Kellis


Distance between clusters

• $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
    Single-link method

• $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
    Complete-link method

• $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
    Average-link method

• $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$
    Centroid method

[Figure: each method illustrated on points d, e, f, g, h]

slide credits: M. Kellis
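The bottom-up procedure and the four cluster distances can be sketched together. This is a minimal, unoptimized illustration: it assumes points are numeric tuples, D is Euclidean distance, and merging stops at a requested number of clusters k (the stopping rule and function names are assumptions, not from the slides).

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def cluster_distance(X, Y, method="single"):
    """CD(X, Y) under the chosen linkage method."""
    pairwise = [dist(x, y) for x in X for y in Y]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    if method == "centroid":
        return dist(centroid(X), centroid(Y))
    raise ValueError(f"unknown method: {method}")

def agglomerate(points, k, method="single"):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # each point starts in its own cluster
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

Recording the order of merges instead of stopping at k would give the full tree; tools built for expression data use faster update rules than this quadratic search per step.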


Results of Clustering Gene Expression

• CLUSTER is simple and easy to use

• De facto standard for microarray analysis

Time: O(N²M)

N: #genes
M: #conditions


K-Means Clustering Algorithm

• Each cluster Xi has a center ci

• Define the clustering cost criterion
    $\mathrm{COST}(X_1, \ldots, X_k) = \sum_{X_i} \sum_{x \in X_i} |x - c_i|^2$

• Algorithm tries to find clusters X1…Xk and centers c1…ck that minimize COST

• K-means algorithm:
    Initialize centers
    Repeat:
        Compute best clusters for given centers
            → Attach each point to the closest center
        Compute best centers for given clusters
            → Choose the centroid of points in cluster
    Until the changes in COST are “small”

[Figure: points a to h with centers c1, c2, c3]

slide credits: M. Kellis
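The alternating procedure above can be sketched as follows. This is a minimal illustration, assuming points are numeric tuples and the caller supplies the initial centers (the slides initialize randomly); the tolerance and iteration cap are assumed parameters.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, centers, tol=1e-9, max_iter=100):
    """Alternate assignment and centroid steps until COST stops improving."""
    centers = [tuple(c) for c in centers]
    prev_cost = float("inf")
    for _ in range(max_iter):
        # Best clusters for the given centers: attach each point to the closest.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Best centers for the given clusters: centroid of each cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coords) / len(pts) for coords in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
        cost = sum(
            dist(p, centers[i]) ** 2 for i, pts in enumerate(clusters) for p in pts
        )
        if prev_cost - cost <= tol:  # change in COST is "small"
            break
        prev_cost = cost
    return centers, clusters, cost
```

Because each step can only lower COST, the loop terminates, but the result depends on the initial centers, so in practice it is common to run several initializations and keep the lowest-cost result.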

K-Means Algorithm

• Randomly initialize clusters
• Assign data points to nearest clusters
• Recalculate clusters
• Repeat … until convergence

Time: O(KNM) per iteration

N: #genes
M: #conditions


Mixture of Gaussians – Probabilistic K-means

• Data is modeled as mixture of K Gaussians $N(\mu_1, \sigma^2 I), \ldots, N(\mu_K, \sigma^2 I)$
    Prior probabilities $\pi_1, \ldots, \pi_K$

• Different $\sigma_i$ for every Gaussian i, or even different covariance matrices are possible, but learning becomes harder

$P(x) = \sum_i P(x \mid N(\mu_i, \sigma^2 I)) \, \pi_i$

Use EM to learn parameters
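A minimal EM sketch for this model is given below. It makes simplifying assumptions that are not in the slides: the shared standard deviation sigma is held fixed rather than learned, priors start uniform, and the initial means are supplied by the caller. With a shared sigma the Gaussian normalizing constants cancel in the E-step, so only the exponent is computed.

```python
from math import exp

def em_isotropic_mixture(points, means, sigma=1.0, n_iter=50):
    """Learn means and priors of a mixture of K Gaussians N(mu_i, sigma^2 I)."""
    k, dim = len(means), len(points[0])
    means = [tuple(m) for m in means]
    priors = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point.
        resp = []
        for x in points:
            weights = [
                priors[i]
                * exp(-sum((a - b) ** 2 for a, b in zip(x, means[i])) / (2 * sigma**2))
                for i in range(k)
            ]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: re-estimate priors and means from the soft assignments.
        for i in range(k):
            mass = sum(r[i] for r in resp)
            priors[i] = mass / len(points)
            means[i] = tuple(
                sum(r[i] * x[d] for r, x in zip(resp, points)) / mass
                for d in range(dim)
            )
    return means, priors
```

Compared with k-means, each point contributes to every center in proportion to its responsibility; as sigma shrinks the assignments become nearly hard and the updates approach the k-means steps.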


Analysis of Clustering Data

• Statistical Significance of Clusters

Gene Ontology http://www.geneontology.org/

KEGG http://www.genome.jp/kegg/

• Regulatory motifs responsible for common expression

• Regulatory Networks

• Experimental Verification


Evaluating clusters – Hypergeometric Distribution

$P(\mathrm{pos} \ge r) = \sum_{m \ge r} \frac{\binom{p}{m} \binom{N-p}{k-m}}{\binom{N}{k}}$

• N experiments, p labeled ++, (N-p) ––
• Cluster: k elements, m labeled ++
• P-value of single cluster containing k elements of which at least r are ++

Each term of the sum: Prob that a randomly chosen set of k experiments would result in m positive and k-m negative

The sum: P-value of uniformity in computed cluster

slide credits: M. Kellis
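The hypergeometric P-value can be evaluated exactly with integer binomial coefficients. This is a small sketch using the slide's symbols N, p, k, m, r; the function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(m, N, p, k):
    """Probability that a random set of k out of N items contains exactly
    m of the p positively labeled items."""
    return comb(p, m) * comb(N - p, k - m) / comb(N, k)

def cluster_p_value(r, N, p, k):
    """P(pos >= r): chance of at least r positives in a random set of size k."""
    upper = min(p, k)  # cannot draw more positives than exist or than drawn
    return sum(hypergeom_pmf(m, N, p, k) for m in range(r, upper + 1))
```

For example, with N = 10 items of which p = 5 are positive, a cluster of k = 5 that is entirely positive has P-value 1/252, so such uniformity is unlikely to arise by chance.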