ezra-peters
Filters: Information reducers
Sequence filter
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT
CTCCGTAAAC CTCTAAC...
How organism is made
How organism works
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code Rules of folding
Active site
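The genetic-code step on this slide can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code; the names `CODON_TABLE` and `translate` are made up for the example and do not come from the slides.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate DNA codon by codon; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# The slide's example: Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu
print(translate("ATGACTTATGATCAACGCACAGGGCTA"))  # MTYDQRTGL
```

The rules of folding, by contrast, have no comparably simple table, which is the point the slide goes on to make.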
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Active site
Cell interaction
Metabolism, Architecture
Genetic code Rules of folding
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Active site
Gives us:
• Custom antibiotics
Genetic code Rules of folding
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Gives us:
• Custom antibiotics
• Custom antibodies
• Custom enzymes
• New materials
Genetic code Rules of folding
Active site
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of transcriptional and post-transcriptional control
• Transcriptional initiation
• Transcriptional termination / polyA tailing
• Splicing
• Translational initiation
?
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
ATGACTTATGATCAACGCACAGGGCTA
3%
TCTACTTATATTCAATCCACAGGGCTACACCTAGTTCTTGAAGAGTCTGTTGAATGAACACATACATGGTTTATCTGTTTTTCTGTCTGCTCTGACCTCTGGCAGCTT
TAGCCTGCCCCACTCTTAGATAAACGAACCTTAGTGACTTCTGCTATACCAAAGTCTCCACGCCCCTCCGTAAACCTCTAACATGATGTCAGCAAATATTAAAAATGA
97%
From Sequence to Organism
How does Nature do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
DNA Functional protein
From Sequence to Organism
How does Nature do it?
Natural filters/transformations
DNA Functional protein
From Sequence to Organism
How can WE do it?
Simulation of Nature
Utterance of Wm Shakespeare
Utterance of George W Bush
“Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...”
“We must give our military every tool and weapon it needs to prevail...”
???
From Sequence to Organism
How can WE do it?
Surrogate Processes
Utterance of Wm Shakespeare
Utterance of George W Bush
“Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...”
“We must give our military every tool and weapon it needs to prevail...”
Words/sentence; Choice of words; Sentence structure; …
From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
Surrogate filters
Characteristics of coding sequences/introns
My sequence
• Gene finders
Predicted coding regions
From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
Surrogate filters
• Gene finders
• Similarity finders
Sequence/motif Databases My sequence
From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
Surrogate filters
• Gene finders
• Similarity finders
• Feature finders
Predicted features
Characteristics of features
My sequence
From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
Surrogate filters
• Gene finders
• Similarity finders
• Feature finders
• Pattern finders
My sequences Statistical engine
Surrogate Filters
• Gene finders
• Similarity finders
• Feature finders
• Pattern finders
How do they work?
Case studies
• Real problems
• Mixed strategies
You do it
Surrogate Filters
Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA
CTC CAC GCC CCT CCG TAC ACC TCT AAC ATG ATC TCA GCA AAT ATT AAA AAT GAA TAA ACT TTG TGA CAT GTA CAA ATG GAA ATA TGC AA
C TCC ACG CCC CTC CGT ACA CCT CTA ACA TGA TCT CAG CAA ATA TTA AAA ATG AAT AAA CTT TGT GAC ATG TAC AAA TGG AAA TAT GCA A
Look for start codons (ATG) (GTG,TTG)
Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
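A start/stop codon search over all six reading frames can be sketched as below. This is only an illustration of the Class 1 idea, not the code of Map, Frames, or OrfFinder; the function names, the `min_codons` cutoff, and the choice to report the first start codon in each open stretch are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG"}  # add "GTG" and "TTG" to allow the alternative start codons

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_strand(seq, starts=STARTS, min_codons=2):
    """Scan three frames of one strand; return (frame, start, end) with end exclusive."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                      # first start codon of this open stretch
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None                   # look for the next start codon
    return found

def find_orfs(seq, **options):
    """Search both strands; coordinates on '-' refer to the reverse complement."""
    seq = seq.upper()
    plus = [("+",) + orf for orf in orfs_in_strand(seq, **options)]
    minus = [("-",) + orf for orf in orfs_in_strand(reverse_complement(seq), **options)]
    return plus + minus

print(find_orfs("CCATGAAATTTTAGCC"))  # [('+', 2, 2, 14)]
```

Open reading frames that run off the end of the sequence without a stop codon are not reported in this sketch.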
Surrogate Filters
Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Look for start codons (ATG) (GTG,TTG)
Look for stop codons (TAA,TAG,TGA)
Pro: Quick, simple
Con: Useless for eukaryotic genomic sequences (introns)
     Inaccurate (start codon problem)
     Inaccurate (doubtful short open reading frames)
Surrogate Filters
Gene finders
Genetic Code
UUU Phe   UCU Ser   UAU Tyr     UGU Cys
UUC Phe   UCC Ser   UAC Tyr     UGC Cys
UUA Leu   UCA Ser   UAA ochre   UGA opal
UUG Leu   UCG Ser   UAG amber   UGG Trp
CUU Leu   CCU Pro   CAU His     CGU Arg
CUC Leu   CCC Pro   CAC His     CGC Arg
CUA Leu   CCA Pro   CAA Gln     CGA Arg
CUG Leu   CCG Pro   CAG Gln     CGG Arg
AUU Ile   ACU Thr   AAU Asn     AGU Ser
AUC Ile   ACC Thr   AAC Asn     AGC Ser
AUA Ile   ACA Thr   AAA Lys     AGA Arg
AUG Met   ACG Thr   AAG Lys     AGG Arg
GUU Val   GCU Ala   GAU Asp     GGU Gly
GUC Val   GCC Ala   GAC Asp     GGC Gly
GUA Val   GCA Ala   GAA Glu     GGA Gly
GUG Val   GCG Ala   GAG Glu     GGG Gly
The code is degenerate
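Degeneracy can be made concrete by counting how many codons map to each amino acid. A minimal sketch, assuming the standard genetic code written in RNA letters as in the table; the variable names are illustrative only.

```python
from collections import Counter

# Standard genetic code in the conventional UCAG ordering ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Number of synonymous codons per amino acid (one-letter codes).
codons_per_residue = Counter(CODON_TABLE.values())
for residue, n in sorted(codons_per_residue.items(), key=lambda item: -item[1]):
    print(residue, n)
```

Leu, Ser, and Arg each have six codons, while Met and Trp have one; that surplus of synonymous codons is what leaves room for codon bias.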
Class 2: Codon bias recognition (TestCode)
Are codons equally used?
Surrogate Filters
Gene finders
Genetic Code (human)
UUU Phe   UCU Ser   UAU Tyr     UGU Cys
UUC Phe   UCC Ser   UAC Tyr     UGC Cys
UUA Leu   UCA Ser   UAA ochre   UGA opal
UUG Leu   UCG Ser   UAG amber   UGG Trp
CUU Leu   CCU Pro   CAU His     CGU Arg
CUC Leu   CCC Pro   CAC His     CGC Arg
CUA Leu   CCA Pro   CAA Gln     CGA Arg
CUG Leu   CCG Pro   CAG Gln     CGG Arg
AUU Ile   ACU Thr   AAU Asn     AGU Ser
AUC Ile   ACC Thr   AAC Asn     AGC Ser
AUA Ile   ACA Thr   AAA Lys     AGA Arg
AUG Met   ACG Thr   AAG Lys     AGG Arg
GUU Val   GCU Ala   GAU Asp     GGU Gly
GUC Val   GCC Ala   GAC Asp     GGC Gly
GUA Val   GCA Ala   GAA Glu     GGA Gly
GUG Val   GCG Ala   GAG Glu     GGG Gly
Codon usage is biased
Most frequently used codons
Class 2: Codon bias recognition (TestCode)
Codon bias universal?
Surrogate Filters
Gene finders
Class 2: Codon bias recognition (TestCode)
Pro: Quick, simple, available through GCG
     Better than Class 1 in excluding false open reading frames
Con: Useless for eukaryotic genomic sequences (introns)
     Gives only general areas of open reading frames
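The general idea behind codon bias recognition can be sketched as follows: learn codon frequencies from known coding sequence, then ask which reading frame of a window looks most like that usage. This is not the TestCode algorithm itself, only an illustration; `codon_usage`, `frame_scores`, and the `floor` value for unseen codons are assumptions made for the example.

```python
import math
from collections import Counter

def codon_usage(coding_seq):
    """Return {codon: fraction} for the in-frame codons of a known coding sequence."""
    seq = coding_seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def frame_scores(window, usage, floor=1e-3):
    """Average log-frequency of the codons read in each of the three frames.

    Codons absent from the usage table get the small probability `floor`.
    """
    window = window.upper()
    scores = []
    for frame in range(3):
        codons = [window[i:i + 3] for i in range(frame, len(window) - 2, 3)]
        total = sum(math.log(usage.get(codon, floor)) for codon in codons)
        scores.append(total / max(len(codons), 1))
    return scores

# Toy data: a reference made only of CTG codons favours frame 0 of a CTG repeat.
usage = codon_usage("CTGCTGCTGCTG")
print(frame_scores("CTGCTGCTG", usage))
```

A frame that scores clearly higher than the other two suggests coding sequence in that frame, which is why this approach marks general areas rather than exact gene boundaries.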
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Principle
Step 1: Create model through extensive training set
  * Training set = proven or suspected genes
  * Organism-specific
Step 2: Assess candidate genes through filter of model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 1: Create model through extensive training set
AAAA: 33%
AAAC: 25%
AAAG: 12%
AAAT: 30%
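Step 1 amounts to counting, for every three-base context in the training set, how often each base follows it. A minimal sketch of that counting, assuming a plain third-order Markov chain; `train_markov` is a name invented for the example, and real gene finders add refinements (for instance separate tables per codon position) that are not shown here.

```python
from collections import Counter, defaultdict

def train_markov(training_seqs, order=3):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in training_seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            counts[context][seq[i + order]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {base: following[base] / total for base in "ACGT"}
    return model

# Toy training set: after AAA, the next base is A four times and C once.
model = train_markov(["AAAAAAAC"])
print(model["AAA"])
```

Each row of the resulting table corresponds to one context on the slide (AAA, AAC, ...), and the percentages shown there are the entries of that row.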
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 1: Create model through extensive training set
AACA: 30%
AACC: 20%
AACG: 15%
AACT: 35%
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
Step 2: Assess candidate genes
       A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
AAAGCAA…
0.12
3rd order Markov model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
AAAGCAA…
0.12 x 0.15
3rd order Markov model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
       A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
Step 2: Assess candidate genes
AAAGCTA…
0.12 x 0.15 . . .
So far, not a good candidate!
3rd order Markov model
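The running product on these slides can be reproduced with the two table rows that are needed for the first two steps. A minimal sketch: the rows for AAA and AAG are copied from the table, while the comparison against a uniform background of 0.25 per base is an assumption added here to show why the candidate looks poor so far.

```python
import math

# Partial table from the slide: P(next base | previous three bases).
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}

def score_prefix(seq, model, order=3):
    """Multiply P(base | preceding `order` bases) along seq.

    Stops at the first context missing from the (partial) table and
    returns the product together with the number of steps scored.
    """
    prob, steps = 1.0, 0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        if context not in model:
            break
        prob *= model[context][seq[i]]
        steps += 1
    return prob, steps

prob, steps = score_prefix("AAAGC", MODEL)   # P(G|AAA) * P(C|AAG) = 0.12 * 0.15
log_odds = math.log2(prob / 0.25 ** steps)   # assumed uniform background
print(prob, steps, log_odds)
```

Both transitions are less likely under the model than under the uniform background, so the log-odds score is negative after two steps.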
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
       A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Pro: Among the most accurate methods known
Con: Needs big training set
     May miss genes of foreign origin
     Will miss very small genes
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome Transposon
1. Use transposon mutagenesis
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome Transposon
1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
2. Sequence out from transposon
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA
1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
2. Sequence out from transposon
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA
1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation
3. Find gene boundaries
4. Identify gene
Do it
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Go to http://www.vcu.edu/~elhaij/BioInf
2. Open second browser (Ctrl-N in Netscape)
Go to same site (copy and paste URL)
3. In 1st browser, go to Program List, click on Gene Finders, open GeneMark
4. In 2nd browser, open Nostoc sequence
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEALENVK*
… or was it?
Check predicted protein against databases
Surrogate Filters
Similarity finders
Blast
• BlastP: Protein sequence to search protein database
• BlastN: Nucleotide sequence to search nucleotide database
• BlastX: Nucleotide sequence (translated) to search protein database
• TBlastN: Protein sequence to search (translated) nucleotide database
• Blast2Seq: Compare two sequences you specify
Do it
FastA
• (Various flavors)
Pfam (Protein motif families)
Finds conserved motifs similar to protein sequence
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEALENVK*
Why?
• GeneMark correct: Conservation of noncoding regions
VLGSK
• GeneMark wrong: Fooled by weird aa sequence or start codon
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Moral
Automated gene finders are wonderful, but common sense is better
Don’t trust automated annotation
Surrogate Filters
Feature finders
Hidden Markov model-based methods
• Good for contiguous features (e.g. signal sequences)
• Not good with features with gaps (e.g. promoters)
Ad hoc methods
• Feature-specific rules (e.g. tandem repeats, terminators)
Position-dependent frequency tables = Position-specific scoring matrix (PSSM) = Weight table
Surrogate Filters
Feature finders
Position-dependent frequency tables
CCCTATATAAGGC...  histone H1t
CGCTATAAAAACT...  HMG-17
GGGTATATAAGCG...  β-tubulin β2
GGCTATATAAAAC...  α-actin skel-m.
TTCTATAAAGCGG...  α-cardiac actin
CCCTATAAAACCC...  β-actin
GAGTATAAAGCAC...  keratin I 50K
GGTTATAAAAACA...  vimentin
CAGTATAAAAGGG...  α1(I) collagen
CCGTATAAATAGG...  α2(I) collagen
TCCCATATAAGCC...  fibronectin
Some of 106 aligned human promoter sequences (near -26)
Consensus TATAAA
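A position-dependent frequency table is just a per-column base count over the aligned sequences. The sketch below builds one from the eleven sequences listed on the slide (a subset of the 106 used for the slide's own table, so the numbers will differ from it); the function names are illustrative.

```python
from collections import Counter

# The eleven aligned promoter segments shown on the slide.
ALIGNED = [
    "CCCTATATAAGGC", "CGCTATAAAAACT", "GGGTATATAAGCG", "GGCTATATAAAAC",
    "TTCTATAAAGCGG", "CCCTATAAAACCC", "GAGTATAAAGCAC", "GGTTATAAAAACA",
    "CAGTATAAAAGGG", "CCGTATAAATAGG", "TCCCATATAAGCC",
]

def position_frequencies(aligned):
    """Return one {base: fraction} dictionary per alignment column."""
    table = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        table.append({base: counts[base] / len(aligned) for base in "ACGT"})
    return table

def consensus(table):
    """Most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in table)

table = position_frequencies(ALIGNED)
print(consensus(table)[3:9])  # TATAAA
```

The consensus keeps only the winner at each position, whereas the table also records how strongly each position is conserved, which is what a scoring matrix needs.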
Surrogate Filters
Feature finders
Position-dependent frequency tables
Frequency (%) of each base at each aligned position:
Position   1    2    3    4    5    6    7    8    9   10   11   12
A         21   29    0  100    0  100   81   91   57   32   15   26
T         16   22   87    0  100    0   19    0   21    6   10   11
C         28   24   13    0    0    0    0    0    0   15   33   28
G         35   25    0    0    0    0    0    9   22   47   42   34
aceB  ACTATGGAGCATCTGCACATGAAAACC
atpI  ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB  ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA  ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ  TTCACACAGGAAACAGCTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGGGAAATGGCTCAA
sucA  GATGCTTAAGGGATCACGATGCAGAAC
trpE  CAAAATTAGAGAATAACAATGCAAACA
Position-Specific Scoring Matrix in action
Surrogate Filters
Feature finders
Experimentally proven
start sites
unknown
aceB  ACTATGGAGCATCTGCACATGAAAACC
atpI  ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB  ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA  ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ  TTCACACAGGAAACAGCTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGGGAAATGGCTCAA
sucA  GATGCTTAAGGGATCACGATGCAGAAC
trpE  CAAAATTAGAGAATAACAATGCAAACA
Position-Specific Scoring Matrix in action
Surrogate Filters
Feature finders
Experimentally proven
start sites
unknown
aceB  ACCACATAACTATGGAGCATCTGCACATGAAAACC
atpI  ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB  ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA  ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ  TTCACACAGGAAACAG....CTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA  GATGCTTAAGGGATCA....CGATGCAGAAC
trpE  CAAAATTAGAGAATA...ACAATGCAAACA
Surrogate Filters
Feature finders
Position-Specific Scoring Matrix in action
ACGT
aceB  ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI  ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB  ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA  ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ  TTCACACAGGAAACAG....CTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA  GATGCTTAAGGGATCA....CGATGCAGAAC
trpE  CAAAATTAGAGAATA...ACAATGCAAACA
Surrogate Filters
Feature finders
Position-Specific Scoring Matrix in action
ACGT
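A position-specific scoring matrix "in action" means sliding the matrix along a new sequence and scoring every window. The sketch below shows one common way to do that, using log-odds scores with a pseudocount; the toy sites, the pseudocount of 1, and the uniform background of 0.25 are assumptions for the example, not values taken from the slides.

```python
import math
from collections import Counter

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds score (base 2) for each base at each position of aligned sites."""
    n = len(sites)
    pssm = []
    for col in range(len(sites[0])):
        counts = Counter(site[col] for site in sites)
        pssm.append({
            base: math.log2((counts[base] + pseudocount) / (n + 4 * pseudocount) / background)
            for base in "ACGT"
        })
    return pssm

def scan(seq, pssm):
    """Score every window of an uppercase ACGT sequence; return (best score, offset)."""
    width = len(pssm)
    scored = [
        (sum(pssm[j][seq[i + j]] for j in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scored)

# Hypothetical aligned sites, loosely resembling a ribosome-binding site.
pssm = build_pssm(["AGGAG", "AGGAG", "AGGAA", "TGGAG"])
score, offset = scan("TTTAGGAGTTT", pssm)
print(offset, round(score, 2))
```

Applied to sequences with unknown start sites, the highest-scoring windows are the candidate features; the pseudocount keeps a base never seen at a position from receiving a score of minus infinity.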
Surrogate Filters
Pattern finders
Specified patterns (FindPatterns, PatScan) e.g. Find instances of restriction sites
New pattern discovery (Meme, Gibbs sampler)
snRNA U1 (pU1-6)   AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t        GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14             CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1       CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin          GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E            TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14             GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17             TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19   ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1      GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin β2       GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m.    CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin    TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin            CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
Human sequences 5’ to transcriptional start
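Searching for a specified pattern such as a restriction site is a plain string search. The sketch below is not how FindPatterns or PatScan are implemented; it only illustrates the task with Python's `re` module, using the rp S17 upstream sequence from the slide and the HindIII recognition site AAGCTT.

```python
import re

def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of `site`, overlaps included."""
    pattern = re.compile("(?=" + re.escape(site) + ")")  # lookahead allows overlapping hits
    return [match.start() for match in pattern.finditer(seq.upper())]

RP_S17 = "TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT"
print(find_sites(RP_S17, "AAGCTT"))  # [6]
```

Discovering a new pattern is harder, because neither the pattern nor its positions are known in advance.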
Surrogate Filters
Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)   AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t        GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14             CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1       CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
GACAGGGCAGAA
GCCCGGGTGTTT
GCCGGGGACGCG
GCCCCCGGGCCT
GCCGCAGAGCTG
Position   1      2      3      4      5      6      7      8      9      10     11     12
A          0.208  0.292  0.000  0.999  0.000  0.999  0.811  0.905  0.575  0.321  0.151  0.264
T          0.160  0.217  0.867  0.000  0.999  0.000  0.189  0.000  0.208  0.057  0.104  0.113
C          0.283  0.236  0.132  0.000  0.000  0.000  0.000  0.000  0.000  0.151  0.330  0.283
G          0.349  0.255  0.000  0.000  0.000  0.000  0.000  0.095  0.217  0.472  0.415  0.340
Surrogate Filters
Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)   AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t        GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14             CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1       CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps 1 - 5
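The six steps can be sketched as a small loop. This is a deliberately simplified illustration of the procedure as listed, not the Meme or Gibbs sampler algorithm: matches are chosen by counting identical positions, the score is a log-likelihood ratio against an assumed uniform background, and the width, number of rounds, pseudocount, and random seed are arbitrary choices for the example.

```python
import math
import random
from collections import Counter

SEQS = [  # the five upstream sequences shown on the slide
    "AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC",
    "GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT",
    "CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG",
    "GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT",
    "CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT",
]

def best_match(seq, pattern):
    """Step 2: the window of seq with the most positions identical to pattern."""
    w = len(pattern)
    windows = [seq[i:i + w] for i in range(len(seq) - w + 1)]
    return max(windows, key=lambda win: sum(a == b for a, b in zip(win, pattern)))

def frequency_table(matches, pseudocount=0.5):
    """Step 3: position-dependent base frequencies, smoothed by a pseudocount."""
    n = len(matches)
    table = []
    for col in range(len(matches[0])):
        counts = Counter(m[col] for m in matches)
        table.append({b: (counts[b] + pseudocount) / (n + 4 * pseudocount) for b in "ACGT"})
    return table

def log_likelihood_ratio(matches, table, background=0.25):
    """Step 4: how much better the table explains the matches than background does."""
    return sum(math.log2(table[j][m[j]] / background)
               for m in matches for j in range(len(m)))

def discover(seqs, width=6, rounds=200, seed=0):
    rng = random.Random(seed)
    best_score, best_matches = float("-inf"), None
    for _ in range(rounds):                                  # Step 6: repeat
        source = rng.choice(seqs)                            # Step 1: arbitrary candidate
        start = rng.randrange(len(source) - width + 1)
        pattern = source[start:start + width]
        matches = [best_match(s, pattern) for s in seqs]     # Step 2
        table = frequency_table(matches)                     # Step 3
        score = log_likelihood_ratio(matches, table)         # Step 4
        if score > best_score:                               # Step 5: remember the best
            best_score, best_matches = score, matches
    return best_score, best_matches

score, matches = discover(SEQS)
print(round(score, 2), matches)
```

Real pattern finders refine the matches iteratively and assess statistical significance, so the output of this sketch should be read only as a demonstration of the loop.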
Surrogate Filters
Scenario II – Case of the Masked Motif
• You’ve found a gene related to Purple Tongue Syndrome
• BlastP: Encoded protein related to cAMP-binding proteins
• Are the similarities trivial? Related to cAMP binding?
• Does your protein contain cAMP-binding site?
• What IS a cAMP-binding site?
Task
1. Determine what is a cAMP-binding site
2. Determine if your protein has one
Surrogate Filters
Scenario II – Case of the Masked Motif
Strategy
1. Collect sequences of known cAMP-binding proteins
2. Run Meme, a pattern-finding program. Ask it to find any significant motifs
3. Rerun Meme. Demand that every protein has identified motifs
4. Run Pfam over known sequence to check
Do it
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Progressive External Ophthalmoplegia (PEO)
• Slow paralysis of voluntary eye muscles
• Many other symptoms (e.g., frequent deafness)
• Loss of mitochondrial DNA
Inheritance
• Mendelian
• Autosomal dominant
• Linked to chromosome 4q34
Your task
• Examine sequence of 4q34 region
• Assess likelihood that a gene in the area could cause disease symptoms
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Examining Sequence of 4q34 Regiontctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatgccctctgtggccctggaaccttagtgacttctgctataccaaagtctccacgcccagggtgacacgcagctgcagctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagtttataaaaacaatgaataaactttgttaaaggtacaaatgaaaattagcaaacatgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagtcattctaggggaaggaacagttgtatttgaaaacctgtatggttacatgaactgcctaaaaaacaagctaaggaaaattaaagctcagatttatatattttaagaaattaattgcaattaatttcctgggattaaatagcatttcctcaaccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttctccggaaggctgacagcactgaccctcaagaaggcaccggctgacagacagaacattctgccctaatatgtgctgaaattccgctgagagcagagtggtacattgaaccctttaggggcttacaaaagaagtgtcctgtgttttagagtcacagagttttgcagaaacaagtatgaattcacctagtggccccctgcaccaggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcagaatgaatgactgaacgaacgattgaatgaaaagaaatgagaggcagcaggttgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatttgagaggactgccatttattctcgggagcgcacggctctaaagaggcccatatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggaggaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagggtcaagaactctccaccggcggcagcggcccggtgtctgccccggcttcgccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcggtgacacggtgttccctgggctcggcgggacagataacatgaatgtgccctttaaacgtcccaagttgcagggacagcccccggcccagcctcgctcccggaagcgccttcgcccccgatgccctctgcagctgggaggagggggcgccccgcacctgcccagccaatgcgcggcgcgagcgccggccgcgacccgcctcctctcgcgagagcccggcggggatataagggggagctgcgggccaggcggcggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgggcgtgggcgagagcacgaacgggctgcctgcgggctgagagcgtcgagctgtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgggggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagagggtcaaactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgggcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcggggcgccgcggaaaatctgcgccaggccacaggcccgggcgcccgcccgcccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtcgcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatggaaacccacccggagccg
gtttacgtgtgccagatcctgcgcccgtgacagcacgggcgtgcactcaggcccggaggcacctagtgattgccagtatttttggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaataatcatcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtaggctgacgccttcatctttatgtaacctctgtgagagagttattcttctccattttacagatgaagctgaggttttgaaatattaagaaacaattttcggaataaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgctgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtcagggcccctcaatgaggaagagcccaatttgggagtcagaattactaacaacaaaacccccacaaattgctcacaacggcagcaaacccttaataattgattacttggattatctgcttgaaaactttggaggcctaatgtttagtggatttattctccttcctctattagagcatctagtagagatcctcatctccagggtgatcagagtgacactgagaaattgtcattttttggccatcatgtctattaaatccaaagccctttgaagcagggagtgttactcatttctgtcccccagtaagcccctcatacagttctcaaacctagggaaagtgaaataaataaatggctatagctttatataattcaatcaccttttcagtttatttggggcaatacctttccctcaaataccctaataattgaagcaacattggattattttggcttgttatccagtaactaacatggataacagtatccatttacacgtcctcgtatccatttgatttcctcatcctttttttcttcaaaaaaaaaatctaggaagtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctgcctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagcaaacagatcagtgctgagaagcagtacaaagggatcattgattgtgtggtgagaatccctaaggagcagggcttcctctccttctggaggggtaacctggccaacgtgatccgttacttccccacccaagctctcaacttcgccttcaaggacaagtacaagcagctcttcttagggggtgtggatcggcataagcagttctggcgctactttgctggtaacctggcgtccggtggggccgctggggccacctccctttgctttgtctacccgctggactttgctaggaccaggttggctgctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgactgtatcatcaagatcttcaagtctgatggcctgagggggctctaccagggtttcaacgtctctgtccaaggcatcattatctatagagctgcctacttcggagtctatgatactgccaagggtgagagaggggcatcggggagaaggagggtggtgtggaaagaggatcctatgggatctataactcacaaaggacctgatatatattgatcttgttttttctagtctctgggataattgaggcttctgaatgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcctctactgaaataaactctggcctttagttattcagagaggaggaggggggagcctgtctccctctagacacagccatagcagttactgagtttaacttgaagccacttccaatgccctgtatacaagctgagcactgcccctccggggtccggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcacctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgcagtggcctctctccctccacctgctttctgctgagaacaggcacttcatagccgttcggcttctggg
ctctgtccacagggatgctgcctgaccccaagaacgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtcgcagggctggtgtcctacccctttgacactgttcgtcgtagaatgatgatgcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttgtttggttttgcccgaggagaacattttacagggctcctttcagtcttccttactggaaattaattttcaaaattatttgataaggacttagggaagaaagatggtattaattccccctaacgttctcaactatcctattagggaaaagtattttccattttattagagatgataagaacatgaatagtaagacatttagatgtgaatttaactaggtatccagcattatagagaccctaggccctcttcccttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttcttacaaagaactcttgcttccctcctagttacaggtgttagtgggatggggtgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaagttttggcttctataggttgaaccatatgaaattgccactttaaaagtcaaaaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagccttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctcagttcggtcctccataaaaaaaggtaaccgcgtagcataatactcctgctccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggttaattgccccagttcttcactgaccttgaactaatggagtaggaatgacaggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgttgcatggagctgggactccatgcccagatgaccctgattttataaaactggtaacagtgtgtacagatatgtttcaggggaaaagtctctttcctccagcgttacggagccctcaccagcatttgtttccacagccgatattatgtacacggggacagttgactgctggaggaagattgcaaaagacgaaggagccaaggccttcttcaaaggtgcctggtccaatgtgctgagaggcatgggcggtgcttttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaattaaaacacaagttcacagatttacatgaacttgatctacaagttcacagatccattgtgtggtttaatagactattcctaggggaagtaaaaagatctgggataaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttcattaaaccacacatgtattttgtatttattttacatttaaattcccacagcaaatagaaaataatttatcatacttgtacaattaactgaagaattgataataactgaatgtgaaacatcaataaagaccacttaatgcacgctttctattttattgaactcttattaactgtaaaatgcatttttaaaagatcaaaaatgcatattttctagcatgattcatgtatcagtcagcagccaagcttctaaatgccagatattatattgagaatgtattatatgagaacgtacaatgcttaaagttccggttttcaaacttaggcaggtcatattctatctatcttatccagcgttactgtaggctagaaagtgataatggctttcataatcctgccttgtcttaggcactttcctgcag
Strategy
• Protein has function associated with mitochondrial location?
• Protein has structure associated with mitochondrial location?
• Assume that encoded protein is in mitochondria
– Use Gene finder to identify protein sequence(s)
– Use Similarity finder to identify possible function
– Use Feature finders to identify pertinent regions
– (What ARE pertinent regions?)
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
fgene Wed Feb 27 16:55:29 GMT 2002
>PEO-related_gene?
length of sequence - 5768
number of predicted exons - 5
positions of predicted exons:
  1607 - 1717   w= 17.84   ORF: 1607 - 1717
  2985 - 3231   w=  9.13   ORF: 2985 - 3230
  3421 - 3471   w=  6.08   ORF: 3423 - 3470
  3980 - 4120   w= 12.62   ORF: 3982 - 4119
  5035 - 5192   w=  1.93   ORF: 5037 - 5192
Length of Coding region- 708bp Amino acid sequence - 235aa
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVRIPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV*
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through FGene
Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768  GC content: 48  Zone: 2
Positions of predicted genes and exons:
  G Str   Feature   Start    End    Score   ORF           Len
  1 +     TSS       1216            -2.70
  1 +   1 CDSf      1607 - 1717     18.01   1607 - 1717   111
  1 +   2 CDSi      2985 - 3471     52.41   2985 - 3470   486
  1 +   3 CDSi      3980 - 4120     20.99   3982 - 4119   138
  1 +   4 CDSl      5035 - 5192      2.32   5037 - 5192   156
  1 +     PolA      5471             0.92
Predicted protein(s):
>FGENESH 1 4 exon (s) 1607 - 5192 298 aa, chain +
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVRIPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASGGAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVSVQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
FGENE output
  1607 - 1717  w= 17.84
  2985 - 3231  w=  9.13
  3421 - 3471  w=  6.08
  3980 - 4120  w= 12.62
  5035 - 5192  w=  1.93
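The coding lengths and protein lengths reported by the two gene finders can be checked directly from the exon coordinates. The sketch below is only a sanity check, not part of either program: the function name `coding_summary` is made up for illustration, and it assumes 1-based inclusive coordinates and that the last codon of the coding region is a stop codon.

```python
def coding_summary(exons):
    """Return (coding length in bp, protein length in aa).

    `exons` is a list of (start, end) pairs, 1-based and inclusive,
    as printed by the gene finders. Assumes the final codon is a stop.
    """
    coding_bp = sum(end - start + 1 for start, end in exons)
    if coding_bp % 3 != 0:
        raise ValueError("coding length is not a multiple of 3")
    return coding_bp, coding_bp // 3 - 1  # drop the stop codon


# Exon coordinates copied from the FGene and FGeneSH outputs.
FGENE_EXONS = [(1607, 1717), (2985, 3231), (3421, 3471),
               (3980, 4120), (5035, 5192)]
FGENESH_EXONS = [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)]

print(coding_summary(FGENE_EXONS))    # (708, 235)
print(coding_summary(FGENESH_EXONS))  # (897, 298)
```

The two predictions differ only in the second exon: FGene splits 2985-3471 into two exons, which removes 189 bp (63 codons) from the coding region.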
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through FGeneSH
How to decide where exons are?
[Figure: DNA (labelled P, Exon, Intron, Exon, Intron, Exon), hnRNA, and mRNA with poly(A) tail (AAAAAAAA)]
Strategy
• Compare sequence of 4q34 region to sequence of mRNA
• Sequence of mRNA may be in cDNA library
• Expressed Sequence Tag (EST) library
Problems
• Library may not exist
• Expression of gene may be low
MORAL: Trust, but verify.
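The idea of comparing the genomic region with an mRNA sequence can be sketched as follows. This is a toy illustration under strong assumptions (error-free sequences, exact matches only, exons in order on the forward strand); it is not how BlastN works, and the function name `locate_exons` and the example sequences are invented for this sketch.

```python
def locate_exons(genomic, mrna, seed_len=8):
    """Greedily map an mRNA onto genomic DNA using exact matches.

    Returns a list of (start, end) exon coordinates, 1-based inclusive.
    Toy assumptions: no sequencing errors, exons appear in order on the
    forward strand, and each seed first occurs at the true exon start.
    """
    exons = []
    g_pos = 0  # where to resume searching in the genomic sequence
    m_pos = 0  # how much of the mRNA has been placed so far
    while m_pos < len(mrna):
        seed = mrna[m_pos:m_pos + seed_len]
        start = genomic.find(seed, g_pos)
        if start == -1:
            raise ValueError(f"mRNA position {m_pos + 1} not found in genomic DNA")
        length = len(seed)
        # Extend the match base by base until the sequences diverge
        # (the start of an intron) or either sequence ends.
        while (m_pos + length < len(mrna)
               and start + length < len(genomic)
               and mrna[m_pos + length] == genomic[start + length]):
            length += 1
        exons.append((start + 1, start + length))
        m_pos += length
        g_pos = start + length
    return exons


# Made-up example: two exons separated by one intron.
exon1 = "ATGGCCATTGTAATG"
exon2 = "CCTGAAAGGGTGCCCGATAG"
intron = "GTAAGTTTTTTTCAG"
genomic = "CC" + exon1 + intron + exon2 + "GG"
print(locate_exons(genomic, exon1 + exon2))  # [(3, 17), (33, 52)]
```

With real data the boundaries found this way can be off by a few bases when the first bases of an intron happen to match the next exon, which is one more reason to compare against the splice-site predictions.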
Final Score Card for Gene Finders

Feature                  | FGene (splice site recognition) | FGeneSH (FGene + HMM model) | BlastN of EST library (compare with known)
Transcription Start Site |                                 | 1216                        | 1501
Exon 1                   | …1607-1717                      | …1607-1717                  | …1607-1717
Exon 2                   | 2985-3231, 3421-3471            | 2985-3471                   | 2985-3471
Exon X                   | 3980-4120                       | 3980-4120                   | 3980-4120
Exon 3                   | 5035-5192…                      | 5035-5192…                  | 5035-5192…
PolyA site               | ? ?                             | ? ?                         | ? ?
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastN (vs. human ESTs)
Strategy
• Protein has function associated with mitochondrial location?
• Protein has structure associated with mitochondrial location?
• Assume that encoded protein is in mitochondria
– Use Gene finder to identify protein sequence(s)
– Use Similarity finder to identify possible function
– Use Feature finders to identify pertinent structures
– (What ARE pertinent structures?)
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
Summary
• One protein in region
• Contains mitochondrial carrier motifs
• Similar to ATP/ADP transporter
• Mitochondrial signal sequence?
Reasonable candidate for PEO-related protein
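The "mitochondrial carrier motifs" point can be illustrated with a simple pattern scan of the FGeneSH-predicted protein. The regular expression below is an assumption: it is a commonly quoted consensus for the mitochondrial carrier signature, written from memory, and should be checked against PROSITE before being relied on. The function name `find_motifs` is likewise made up for this sketch.

```python
import re

# Assumed consensus for the mitochondrial carrier signature
# (verify against PROSITE): P-x-[DE]-x-[LIVAT]-[RK]-x-[LRH]-[LIVMFY]-[QGAIVM]
CARRIER_MOTIF = re.compile(r"P.[DE].[LIVAT][RK].[LRH][LIVMFY][QGAIVM]")

# Protein predicted by FGeneSH (298 aa), copied from its output.
FGENESH_PROTEIN = (
    "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVRIPKEQGFLSF"
    "WRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASGGAAGATSLCFVYPLDFARTR"
    "LAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVSVQGIIIYRAAYFGVYDTAKGMLPDPKNVHI"
    "FVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLR"
    "GMGGAFVLVLYDEIKKYV"
)


def find_motifs(protein, pattern=CARRIER_MOTIF):
    """Return (1-based start position, matched text) for every motif hit."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(protein)]


for position, text in find_motifs(FGENESH_PROTEIN):
    print(position, text)
```

Under this assumed pattern the scan reports three hits (PIERVKLLLQ, PLDFARTRLA, PFDTVRRRMM), consistent with a carrier built from three repeated units. The middle hit lies in the stretch that the FGene prediction omits.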
Complex gene discovery
Your turn: Repeat and extend characterization of PEO-related gene
1. Take same sequence (FastA format) e-mailed to you
2. Get better estimate of promoter and polyA site (e.g. by TSSW and PolyASH)
   (Is there a TATA box upstream from the predicted promoter?)
3. Find encoded protein sequence by suitable method (e.g. FGeneSH(GC) or comparison with cDNA)
4. Continue characterization of protein
   * Contains signal sequence?
   * Contains transmembrane domains?
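For the transmembrane-domain question in step 4, one classic first pass is a sliding-window hydropathy scan. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6, which are the conventional settings for that heuristic; the function name `hydrophobic_windows` is invented here, and dedicated predictors should be preferred for real work.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(protein, window=19, threshold=1.6):
    """Return (1-based start, mean hydropathy) for each window at or above threshold.

    Windows that pass are candidate membrane-spanning segments; this is a
    rough heuristic, not a prediction of topology.
    """
    hits = []
    for i in range(len(protein) - window + 1):
        segment = protein[i:i + window]
        mean = sum(KD[aa] for aa in segment) / window
        if mean >= threshold:
            hits.append((i + 1, round(mean, 2)))
    return hits


# Made-up test sequence: a 20-residue leucine stretch between charged flanks.
toy = "DEKR" * 5 + "L" * 20 + "DEKR" * 5
print(hydrophobic_windows(toy))
```

Running the same function on the predicted protein from step 3 gives a list of candidate segments to compare with the output of a dedicated transmembrane predictor.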