CSE182-L10 MS Spec Applications + Gene Finding + Projects


Page 1: CSE182-L10 MS Spec Applications + Gene Finding + Projects

CSE182-L10

MS Spec Applications + Gene Finding + Projects

Page 2: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Relative abundance computation

• Once we have features matched across runs, we have data identical in form to microarray data.

• Features can be ‘identified’ in separate MS2 experiments

[Figure: intensity values indexed by run and feature]

Page 3: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Structural genomics via MS

Page 4: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Cross-linking

• Cross-linkers are molecules of ‘fixed’ length that bind to amino acids.

• How can they help predict structure?

• Protocol:
  – Cross-link native protein
  – Denature, digest
  – MS/MS (identify cross-linked peptides)

• Potentially valuable, but not widely used

Page 5: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Identifying Cross-linked peptides

• Identify all peptide pairs whose combined mass explains the parent mass.

• Given a list of peptide pairs, find the pair, and the linked position that best explains the MS2 data.

• What is the number of possible candidate pairs?

• Fragmentation in the presence of linkers is poorly understood.

• How do you separate cross-linked peptides from singly linked, and non-cross-linked peptides?
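The first step above, listing peptide pairs whose combined mass explains the parent mass, might be sketched as follows. This is only an illustration under a simplifying assumption that is not in the slides: the parent mass is taken to be the sum of the two peptide masses plus a fixed linker mass. The function name `candidate_pairs`, the tolerance, and the masses in the usage note are hypothetical.

```python
from bisect import bisect_left, bisect_right

def candidate_pairs(peptide_masses, parent_mass, linker_mass, tol=0.5):
    """Return index pairs (i, j), i < j, that could explain the parent mass.

    Assumption (hypothetical): parent = m_i + m_j + linker, within +/- tol.
    """
    # Sort indices by mass so that partners can be found by binary search.
    order = sorted(range(len(peptide_masses)), key=lambda k: peptide_masses[k])
    sorted_masses = [peptide_masses[k] for k in order]
    pairs = []
    for pos, mass in enumerate(sorted_masses):
        target = parent_mass - linker_mass - mass
        # Only look at partners after `pos` so each pair is reported once.
        lo = max(bisect_left(sorted_masses, target - tol), pos + 1)
        hi = bisect_right(sorted_masses, target + tol)
        for q in range(lo, hi):
            pairs.append(tuple(sorted((order[pos], order[q]))))
    return sorted(pairs)
```

For example, `candidate_pairs([500.0, 700.0, 800.0, 1000.0], 1638.0, 138.0)` pairs the first peptide with the fourth and the second with the third. With n peptides there are n(n-1)/2 possible pairs, so sorting and searching avoids testing every pair explicitly.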

Page 6: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Identifying cross-linked peptides

• Use isotopically labeled cross-linking agents.

• Cross-linked peptides will show up as pairs separated by a small mass.

• Non-cross-linked peptides appear at one position only.
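A minimal sketch of the pairing idea: scan a peak list for pairs separated by the mass difference between the light and heavy forms of the cross-linker. The function name `find_doublets`, the tolerance, and the assumption that peaks are given as plain masses (not m/z values at mixed charge states) are illustrative choices, not part of the slides.

```python
def find_doublets(peak_masses, delta, tol=0.01):
    """Return (light, heavy) mass pairs whose separation is delta +/- tol.

    `delta` is the (assumed known) mass shift introduced by the isotopic label.
    """
    peaks = sorted(peak_masses)
    doublets = []
    for i, light in enumerate(peaks):
        for heavy in peaks[i + 1:]:
            gap = heavy - light
            if gap > delta + tol:
                break  # peaks are sorted, so later gaps only grow
            if abs(gap - delta) <= tol:
                doublets.append((light, heavy))
    return doublets
```

Peaks that appear in such a pair are candidates for cross-linked peptides; peaks left unpaired would be treated as non-cross-linked.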

Page 7: CSE182-L10 MS Spec Applications + Gene Finding + Projects

MS application: Protein-protein interaction

• Proteins combine to form functional complexes.

• An antibody is a special kind of protein that can recognize a specific protein.

• Use an antibody to recognize a protein in a complex. Isolate and purify the complex that binds to the antibody.

• Identify all the proteins in the complex via mass spectrometry.

Page 8: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Mass Spectrometry: conclusion

• Mass spectrometry can be used for peptide identification, detection of modifications, quantitation, protein structure determination, and the study of protein-protein interaction (complex formation).

• Each of these poses significant computational challenges.

Page 9: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Proteomic Databases/Tools

Page 10: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Eukaryotic Gene Prediction

Page 11: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Eukaryotic gene structure

Page 12: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Translation

Page 13: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Gene Features

[Figure: gene features. Labels: transcription start, 5’ UTR, translation start (ATG), exon, intron, donor splice site, acceptor splice site, 3’ UTR]

Page 14: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Gene identification

• Eukaryotic gene definitions:
  – Location that codes for a protein
  – The transcript sequence(s) that encodes the protein
  – The protein sequence(s)

• Suppose you want to know all of the genes in an organism.

• This was a major problem in the 70s. PhDs and careers were spent isolating a single gene sequence.

• All of that changed with the development of high-throughput methods like EST sequencing.

Page 15: CSE182-L10 MS Spec Applications + Gene Finding + Projects

EST Sequencing

• Suppose we could collect all of the mRNA.

• However, mRNA is unstable.

• An enzyme called reverse transcriptase is used to make a DNA copy of the RNA.

• Use DNA polymerase to get a complementary DNA strand.

• Sequence the (stable) cDNA from both ends.

• This leads to a collection of transcripts/expressed sequences (ESTs).

• Many might be from the same gene

[Figure: sequence diagrams labeled AAAA and TTTT]

Page 16: CSE182-L10 MS Spec Applications + Gene Finding + Projects

EST Sequencing

• Often, reverse transcriptase breaks off early. Why is this a good thing?

• The 3’ end may not have much coding sequence.

• We can assemble the 5’ end to get more of the coding sequence.

Page 17: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Project 2

• EST assembly

• Given a collection of EST (3’) sequences, your goal is to cluster all ESTs from the same gene, and produce a consensus.

• How would you do it if we also had 5’ EST sequences?

Page 18: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Project 1

• Goal: Look for signals in the UTR.

• The UTR is not boring. It often folds into a 2D structure and subsequently affects transcription/translation of genes.

• What are riboswitches?

• miRNA?

Page 19: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Project 3

• Goal is to predict expressed genes using ESTs/proteins and mass spectrometry.

Page 20: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Project guidelines

• 4 checkpoints.

• The first is mainly to identify a project, project partners, and answer a few simple questions to get started.

• Deadline 11/3/05.

Page 21: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Gene Finding: The 1st generation

• Given genomic DNA, does it contain a gene (or not)?

• Key idea: The distribution of nucleotides differs between coding (translated exons) and non-coding regions.

• Therefore, a statistical test can be used to discriminate between coding and non-coding regions.

Page 22: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Coding versus Non-coding

• You are given a collection of exons, and a collection of intergenic sequences.

• Count the number of occurrences of ATGATG in introns and exons.
  – Suppose 1% of the hexamers in exons are ATGATG
  – Only 0.01% of the hexamers in introns are ATGATG

• How can you use this idea to find genes?

Page 23: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Generalizing

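[Table: one row per hexamer (AAAAAA, AAAAAC, AAAAAG, AAAAAT, and so on), with a frequency column for introns (I) and one for exons (E)]

Compute a frequency count for all hexamers. Use this to decide whether a sequence is an exon or an intron.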
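One way to turn hexamer frequency counts into a decision is a log-odds score, sketched below. The function names, the pseudocount, and the use of overlapping hexamers are assumptions made for illustration; the slides do not prescribe a particular scoring rule.

```python
from collections import Counter
from math import log2

def hexamer_freqs(sequences, k=6, pseudocount=1.0):
    """Return a function mapping a k-mer to its smoothed frequency in `sequences`."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # The pseudocount keeps unseen hexamers from getting probability zero.
    total = sum(counts.values()) + pseudocount * 4 ** k
    return lambda kmer: (counts[kmer] + pseudocount) / total

def coding_score(seq, exon_freq, intron_freq, k=6):
    """Sum of log2(exon frequency / intron frequency) over the hexamers in seq."""
    return sum(
        log2(exon_freq(seq[i:i + k]) / intron_freq(seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )
```

A positive score suggests the sequence looks more like the exon training set, a negative score more like the intron set. A hexamer that makes up 1% of exon hexamers but only 0.01% of intron hexamers would contribute log2(100), roughly 6.6 bits, each time it occurs.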

Page 24: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Coding versus non-coding

• Fickett and Tung (1992) compared various measures.

• Measures that preserve the triplet frame are the most successful.

• Genscan: 5th order Markov Model

• Conservation across species

Page 25: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Coding vs. non-coding regions

Given: three 5th-order transition matrices $C^{(1)}, C^{(2)}, C^{(3)}$ trained on coding exons.

$$P_h(X_{a,b}) = \prod_{i=0}^{b-a} C^{((h+i) \bmod 3 + 1)}[X_{a+i}]$$

Coding ratio:

$$r = \frac{P_h(X_{a,b})}{P_D(X_{a,b})}$$

Coding score:

$$s = \log_2(r)$$

Compute average coding score (per base) of exons and introns, and take the difference. If the measure is good, the difference must be biased away from 0.

Page 26: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Coding differential for 380 genes

Page 27: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Other Signals

[Figure: signals along a gene. Labels: ATG, GT, AG, Coding]

Page 28: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Coding region can be detected

[Figure label: Coding]

• Plot the coding score using a sliding window of fixed length.

• The (large) exons will show up reliably.

• Not enough to predict gene boundaries reliably.
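The sliding-window scan can be sketched as below. The helper names are hypothetical, and `gc_fraction` is only a stand-in scoring function so that the example runs on its own; in practice a coding score would be passed in its place.

```python
def sliding_window_scores(seq, score_fn, window=6, step=1):
    """Return (start, score) for each fixed-length window along seq."""
    return [
        (start, score_fn(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

def gc_fraction(window_seq):
    # Placeholder score for illustration only, not a real coding measure.
    return sum(base in "GC" for base in window_seq) / len(window_seq)
```

Calling `sliding_window_scores("AAAAGGGG", gc_fraction, window=4, step=2)` scores the windows starting at positions 0, 2, and 4. Because each score is averaged over a whole window, the transition between regions is smeared across roughly one window length, which is one reason this approach alone cannot place exon boundaries precisely.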

Page 29: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Other Signals

[Figure: signals along a gene. Labels: ATG, GT, AG, Coding]

• Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise.

• When combined they can be effective

Page 30: CSE182-L10 MS Spec Applications + Gene Finding + Projects

The second generation of Gene finding

• Ex: Grail II. Used statistical techniques to combine various signals into a coherent gene structure.

• It was not easy to train on many parameters. The Guigó & Burset test revealed that accuracy was still very low.

• Problem with multiple genes in a genomic region

Page 31: CSE182-L10 MS Spec Applications + Gene Finding + Projects
Page 32: CSE182-L10 MS Spec Applications + Gene Finding + Projects

HMMs and gene finding

• HMMs allow for a systematic approach to merging many signals.

• They can model multiple genes and partial genes in a genomic region, as well as genes on both strands.

Page 33: CSE182-L10 MS Spec Applications + Gene Finding + Projects

The Viterbi Algorithm

Let $v_k(i)$ be the probability of the most likely path that ends in state $k$ and emits symbols $x_1 \ldots x_i$.

Then,

$$v_k(i+1) = e_k(x_{i+1}) \max_l \left( v_l(i)\, a_{lk} \right)$$
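The recurrence above, together with backtracking, can be sketched in a few lines. This version works in log space to avoid underflow and assumes all probabilities are nonzero. The two-state model (C for coding, N for non-coding) and its numbers are invented purely for illustration.

```python
from math import log

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for obs.

    v[k] plays the role of log v_k(i); trans_p[l][k] is a_lk; emit_p[k][x] is e_k(x).
    """
    v = {k: log(start_p[k]) + log(emit_p[k][obs[0]]) for k in states}
    back = []  # back[i][k] = best predecessor of state k at step i + 1
    for x in obs[1:]:
        prev, v, ptr = v, {}, {}
        for k in states:
            best = max(states, key=lambda l: prev[l] + log(trans_p[l][k]))
            v[k] = log(emit_p[k][x]) + prev[best] + log(trans_p[best][k])
            ptr[k] = best
        back.append(ptr)
    # Backtrack from the best final state.
    path = [max(states, key=lambda k: v[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical toy model: C favors G/C, N favors A/T.
STATES = ["C", "N"]
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
```

Running `viterbi("AAAAGGGG", STATES, START, TRANS, EMIT)` labels the A-rich half as N and the G-rich half as C, which is the sense in which the algorithm parses a string through the states of the model.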

Page 34: CSE182-L10 MS Spec Applications + Gene Finding + Projects

HMMs and gene finding

• The Viterbi algorithm (and backtracking) allows us to parse a string through the states of an HMM

• Can we describe Eukaryotic gene structure by the states of an HMM?

• This could be a solution to the gene-finding (GF) problem.

Page 35: CSE182-L10 MS Spec Applications + Gene Finding + Projects

An HMM for Gene structure

Page 36: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Generalized HMMs, and other refinements

• A probabilistic model for each of the states (ex: Exon, Splice site) needs to be described

• In standard HMMs, the time spent in a state follows a geometric (discrete exponential) distribution.

• This is violated by many states of the gene structure HMM. Solution is to model these using generalized HMMs.
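To see where this duration distribution comes from, suppose a state of a standard HMM has self-transition probability $p$ (a symbol introduced here only for illustration). Staying exactly $d$ steps means taking the self-transition $d-1$ times and then leaving:

```latex
P(\text{duration} = d) = p^{\,d-1}(1 - p), \qquad d = 1, 2, 3, \ldots
\qquad\text{with mean}\qquad
\mathbb{E}[d] = \frac{1}{1 - p}
```

This probability always decreases as $d$ grows, so the shortest durations are the most likely. A state whose typical length peaks away from the minimum cannot be fit by any choice of $p$, which is why an explicit duration distribution is needed.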

Page 37: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Length distributions of Introns & Exons

Page 38: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Generalized HMM for gene finding

• Each state also emits a ‘duration’ for which it will cycle in the same state. The time is generated according to a random process that depends on the state.

Page 39: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Forward algorithm for gene finding

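[Figure: a segment of the sequence between positions j and i, emitted in state q_k]

$$F_k(i) = \sum_{j<i} P_{q_k}(X_{j+1,i}) \, f_{q_k}(i-j) \sum_{l \in Q} a_{lk} F_l(j)$$

Forward prob. $F_k(i)$: probability that you emitted the first i symbols and ended up in state q_k.

Emission prob. $P_{q_k}(X_{j+1,i})$: probability that you emitted X_{j+1}..X_i in state q_k (given by the 5th-order Markov model).

Duration prob. $f_{q_k}(i-j)$: probability that you stayed in state q_k for i-j steps.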

Page 40: CSE182-L10 MS Spec Applications + Gene Finding + Projects

HMMs and Gene finding

• Generalized HMMs are an attractive model for computational gene finding.
  – Allow incorporation of various signals
  – Quality of gene finding depends upon quality of signals.

Page 41: CSE182-L10 MS Spec Applications + Gene Finding + Projects

DNA Signals

• Coding versus non-coding

• Splice signals

• Translation start

Page 42: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Splice signals

• GT is the donor signal, and AG is the acceptor signal.

GT AG

Page 43: CSE182-L10 MS Spec Applications + Gene Finding + Projects

PWMs

• Fixed length for the splice signal.

• Each position is generated independently according to a distribution.

• Figure shows data from > 1200 donor sites.

Position: 3 2 1 1 2 3 4 5 6
          A A G G T G A G T
          C C G G T A A G T
          G A G G T G A G G
          T A G G T A A G G
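A position weight matrix can be estimated by counting bases column by column, as in the sketch below. It uses only the four example sequences readable on the slide, not the full set of more than 1200 donor sites. The function names, the pseudocount, and the uniform background are assumptions made for illustration.

```python
from math import log2

# The four distinct 9-mers shown on the slide (3 exon bases, then 6 intron bases).
DONOR_SITES = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]

def build_pwm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Per-position base probabilities, with each column estimated independently."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({b: (column.count(b) + pseudocount) / total for b in alphabet})
    return pwm

def pwm_log_odds(seq, pwm, background=0.25):
    """log2 of P(seq | PWM) over P(seq | uniform background)."""
    return sum(log2(pwm[i][base] / background) for i, base in enumerate(seq))
```

A candidate site is scored by summing the per-position log-odds, so a sequence with GT at the fourth and fifth positions scores higher than one without it. Because each column is treated independently, this model cannot express dependencies between positions.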

Page 44: CSE182-L10 MS Spec Applications + Gene Finding + Projects

MDD

• PWMs do not capture correlations between positions.

• Many position pairs in the donor signal are correlated.

Page 45: CSE182-L10 MS Spec Applications + Gene Finding + Projects

• Choose the position which has the highest correlation score.

• Split sequences into two: those which have the consensus at position i, and the remaining.

• Recurse until <Terminating conditions>

Page 46: CSE182-L10 MS Spec Applications + Gene Finding + Projects

MDD for Donor sites

Page 47: CSE182-L10 MS Spec Applications + Gene Finding + Projects

De novo Gene prediction: Summary

• Various signals distinguish coding regions from non-coding

• HMMs are a reasonable model for Gene structures, and provide a uniform method for combining various signals.

• Further improvement may come from improved signal detection

Page 48: CSE182-L10 MS Spec Applications + Gene Finding + Projects

How many genes do we have?

Nature

Science

Page 49: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Alternative splicing

Page 50: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Comparative methods

• Gene prediction is harder with alternative splicing.

• One approach might be to use comparative methods to detect genes.

• Given a similar mRNA/protein (from another species, perhaps?), can you find the best parse of a genomic sequence that matches that target sequence?
  – Yes, with a variant on alignment algorithms that penalizes separately for introns versus other gaps.

Page 51: CSE182-L10 MS Spec Applications + Gene Finding + Projects

Comparative gene finding tools

• Procrustes/Sim4: mRNA vs. genomic

• Genewise: proteins versus genomic

• CEM: genomic versus genomic

• Twinscan: Combines comparative and de novo approach.