Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence, LO Leung Yau, 7th May, 2009


Page 1: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

LO Leung Yau

7th May, 2009

Page 2: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Outline

Biological Background
Objective
Current Approaches
  Various Models
  Problem: Insufficient Data
Proposed Approach
  Predict TFBS from protein sequence
  Predict from protein sequence whether it is TF
  Gene sequence → Protein sequence
Short Term Subtasks
  Better Motif Model
  Better tool to calculate P-value of Patterns

Page 3: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Biological Background – Cell

Basic unit of organisms
  Prokaryotic
  Eukaryotic
A bag of chemicals
Metabolism controlled by various enzymes
Correct working needs suitable amounts of various proteins

Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

Page 4: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Biological Background – Protein

Polymer of 20 types of Amino Acids
Folds into 3D structure
Shape determines the function
Many types
  Transcription Factors
  Enzymes
  Structural Proteins
  …

Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid

Page 5: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Biological Background – DNA & RNA

DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil

Picture taken from http://en.wikipedia.org/wiki/Gene
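The pairing rules on this slide (A-T, G-C, and Uracil replacing Thymine in RNA) can be sketched in a few lines of Python. The function names are illustrative only and are not part of the original talk.

```python
# Watson-Crick pairing as listed on the slide: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the opposite strand of a double-stranded DNA segment, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand):
    """Write the RNA copy of a DNA coding strand: same letters, with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(transcribe("ATGC"))
```

Applying `reverse_complement` twice gives back the original strand, which is a quick sanity check on the pairing table.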

Page 6: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Biological Background – DNA → RNA → Protein

Picture taken from http://en.wikipedia.org/wiki/Gene

gene

Page 7: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Biological Background – DNA → RNA → Protein

Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS).

Diagram labels: Genes, Promoter regions, Binding sites, Transcription Factors, Other functions

Page 8: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Complex Interactions between Genes, TFs and TFBSs


Page 10: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Importance of Inferring Transcriptional Regulatory Network

Revealing the working of a cell and life
Related to many diseases
  Genetic disorders
Understanding them will help us
  Understand the diseases
  Design drugs to cure the diseases
  Engineering genetics

Page 11: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Objective

To infer transcriptional regulatory network (gene network) from genetic and experimental data, utilizing different data sources as/when appropriate


Page 13: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Current Approaches

Main Data Source
  Gene Expression Microarray Data
Models
  Parts Lists
  Topology Models
  Control Logic Models
  Dynamic Models
Problem
  Insufficient Data

Page 14: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Gene Expression Microarray Data

High throughput
Measures RNA level
Relies on A-T, G-C pairing
Can monitor expression of many genes

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

Page 15: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Gene Expression Microarray Data

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

Page 16: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Various Models of Transcriptional Regulatory Network (Gene Network)

Different levels of detail
  Parts Lists
  Topology Models
  Control Logic Models
  Dynamic Models
    Boolean Network
    Petri Nets
    Difference and Differential Equations
    Finite State Linear Model (FSLM)
    Stochastic Networks

[86, 87, 88]

Page 17: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Parts List

The basic components of the gene network that we model

Including
  Genes
  Transcription Factors
  Promoters
  Transcription Factor Binding Sites
  …

gene

Page 18: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Topology Models – Example

Page 19: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Control Logic Models

Page 20: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Dynamic Models

Describe and simulate the dynamic changes in the state of the system
Predicting the network’s response to various environmental changes and stimuli.
  Boolean Network
  Petri Nets
  Difference and Differential Equations
  Hybrid: Finite State Linear Model (FSLM)
  Stochastic Networks

Page 21: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Boolean Network

[42, 93, 1, 55]

Page 22: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Boolean Network – Yeast Fission Example

10 Genes, 1024 States

[22]
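A Boolean network with n genes has 2^n states, which is where the 1024 states for 10 genes come from. The sketch below uses a hypothetical 3-gene network (the rules are made up for illustration and are not the yeast model of [22]) with synchronous updates, and enumerates its attractors.

```python
from itertools import product

# Hypothetical update rules for genes (A, B, C); each rule reads the whole state.
RULES = (
    lambda s: not s[2],       # A is on unless C represses it
    lambda s: s[0],           # B follows A
    lambda s: s[0] and s[1],  # C needs both A and B
)


def step(state, rules):
    """Synchronous update: every gene is recomputed from the previous state."""
    return tuple(int(rule(state)) for rule in rules)


def find_attractors(rules):
    """Follow every one of the 2^n states until it revisits a state; collect the cycles."""
    attractors = set()
    for start in product((0, 1), repeat=len(rules)):
        seen = {}
        state = start
        while state not in seen:
            seen[state] = len(seen)
            state = step(state, rules)
        # 'state' is the first repeated state, so the cycle is everything from it onwards.
        cycle = frozenset(s for s, i in seen.items() if i >= seen[state])
        attractors.add(cycle)
    return attractors


if __name__ == "__main__":
    print("states for 10 genes:", 2 ** 10)
    for cycle in find_attractors(RULES):
        print("attractor of length", len(cycle), sorted(cycle))
```

Exhaustive enumeration is only feasible for small networks; it is used here because the state space of the toy example has just 8 states.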

Page 23: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Petri Nets - Example

[79, 34, 67, 92]

Page 24: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Difference and Differential Equations

Continuous concentration of various molecules
For difference equation, time is discrete
For differential equation, time is continuous
In general, they have the form

\[ g_i(t + \Delta t) = F_i(g_1(t), \ldots, g_n(t), t), \qquad g(t + \Delta t) = F(g(t), t) \]

\[ g_i'(t) = F_i(g_1(t), \ldots, g_n(t), t), \qquad g'(t) = F(g(t), t) \]

[15, 24, 96]

Page 25: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Difference and Differential Equations

Usually, the interactions are assumed to be linear
The model needs many parameters
Interpretation: a coefficient > 0 for gene n in the equation of gene 1 means gene n activates gene 1
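As an illustration of the linear case, the sketch below assumes the difference equation g(t + Δt) = g(t) + Δt · W g(t), where W is an n × n weight matrix (so n² parameters) and a positive entry W[i][j] means gene j activates gene i. The matrix values and the function name are hypothetical.

```python
def simulate_linear(W, g0, dt, steps):
    """Iterate g(t + dt) = g(t) + dt * W g(t) and return the whole trajectory."""
    g = list(g0)
    trajectory = [g]
    for _ in range(steps):
        g = [
            gi + dt * sum(w * gj for w, gj in zip(row, g))
            for gi, row in zip(g, W)
        ]
        trajectory.append(g)
    return trajectory


if __name__ == "__main__":
    # Hypothetical 2-gene system: gene 2 activates gene 1 (W[0][1] > 0),
    # and both genes decay on their own (negative diagonal).
    W = [[-1.0, 0.5],
         [0.0, -0.2]]
    for g in simulate_linear(W, g0=[0.0, 1.0], dt=0.1, steps=3):
        print([round(x, 4) for x in g])
```

Fitting W from data means estimating n² numbers, which is why the number of measured conditions matters so much for this model class.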

Page 26: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Finite State Linear Model (FSLM)

[91, 2, 66]

Page 27: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Stochastic Networks

In the real world, stochastic effects may play an important role

Some stochastic models have been proposed
  Noisy Networks
  Probabilistic Boolean Networks

Simulating a stochastic model is more computationally expensive

Depending on the purpose, stochastic models may not be necessary


Page 29: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Problem – Insufficient Data

In microarray data
  Many genes
  Small number of conditions/time points
This leads to an unreliable estimated model

[17, 53]

Page 30: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Current Directions to Solve Insufficiency Problem

Analysis Techniques for Small Sample Size
  Regularization
  Akaike Information Criterion (AIC)
  Bayesian Information Criterion (BIC)
  Minimum Description Length (MDL)
  …
  New model
Integrate Multiple Microarray Data
  Heterogeneous sources
  Different experiment settings

[21, 77, 54, 62, 104, 72, 84] [60, 107, 48, 8, 38]
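For reference, AIC and BIC both trade goodness of fit against the number of free parameters: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of parameters, n the number of observations and L the maximized likelihood. The sketch below uses made-up log-likelihood values purely for illustration.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: lower is better."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: penalizes parameters more as n grows."""
    return k * math.log(n) - 2 * log_likelihood


if __name__ == "__main__":
    n = 30  # hypothetical number of microarray conditions
    # Hypothetical candidate models: (name, maximized log-likelihood, parameter count).
    candidates = [("sparse model", -120.0, 5), ("dense model", -115.0, 20)]
    for name, ll, k in candidates:
        print(name, "AIC =", aic(ll, k), "BIC =", round(bic(ll, k, n), 2))
```

With these numbers the dense model fits slightly better but is penalized for its extra parameters, so both criteria favour the sparse model.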


Page 32: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Proposed Approach – Use Sequence

Diagram labels: Genes, Promoter regions, Binding sites, Transcription Factors, Other functions

There is a lot of information in the genome sequence. We should try to use it!

Page 33: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Proposed Approach – Core Components

Diagram (drawn for two genes): DNA → RNA → Protein, then Protein → Binding Sites? and Protein → Transcription Factor?, with the steps numbered 1, 2 and 3

The interaction between genes can therefore be inferred.

Page 34: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Proposed Approach – Core Components

Diagram: a network of TF and Gene nodes, compared against Microarray Data, with some edges marked “Extra!” and some marked “Missed!”

Our approach gives initial network!
Can be used together with other approaches

Page 35: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Component 1: Protein Sequence → Binding Sites

Need to predict
  Binding domains of a protein
  The DNA segment bound by the domain
  The pattern bound by the protein
Need to search for occurrence of the pattern
  Better motif model is helpful

Protein: ……………..LYDVAEYAGVSYQTVSRVV …………….
DNA: ……………..gaaggGGTCAAGGTGACCgg……………

Picture taken from http://en.wikipedia.org/wiki/DNA-binding_domain

Page 36: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Component 2: Protein Sequence → Transcription Factor?

Need to distinguish between
  Transcription factors, and
  Other proteins
Characteristic motifs in binding domains are helpful features

Diagram: ……………..LYDVAEYAGVSYQTVSRVV ……………. → Transcription Factor or Other Proteins

Page 37: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Component 3: DNA → RNA → Protein Sequence

DNA → pre-mRNA: Trivial, only T → U
Pre-mRNA → mRNA: Alternative splicing!
mRNA → Protein sequence: Genetic code of amino acids is known and quite universal

Picture taken from http://en.wikipedia.org/wiki/Alternative_splicing
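The last step, mRNA → protein sequence, is a table lookup. The sketch below uses the standard genetic code and makes two simplifying assumptions that are not from the slides: the input is an already spliced mRNA, and translation starts at the first AUG and stops at the first stop codon.

```python
# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate from the first start codon to the first stop codon (or the end)."""
    dna = mrna.upper().replace("U", "T")  # the lookup table is written with T
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUAA"))
```

Alternative splicing is deliberately left out: it decides which mRNA is produced, and that is the hard part of this component.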

Page 38: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Proposed Plan and Phases

Started!

Preparatory

Main Classifiers

Initial Network Construction & Testing Stage

Will start soon


Page 40: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Short Term Subtasks

Q-gram Indexed Approximate String Matching Tool (Already Done.)
Exploring Different Motif Models
  Motifs with gaps
Develop an Improved Tool to Search Significant Patterns and Calculate p-value
  Deterministic Finite Automata (DFA)
  Finite Markov Chain Imbedding (FMCI)
  Pattern Markov Chain (PMC)

Page 41: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Q-gram Indexed Approximate String Matching Tool

IDEA: quickly discard parts of the target which CANNOT contain a match
A kind of pruning
Pruning is a successful strategy in many problems

Diagram: the Pattern is compared against the Target (Text/DB/…) sequence; filtered out regions are skipped, do not bother to do fully sensitive checking there
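The filtering idea can be sketched with the q-gram lemma. The sketch assumes matching with at most k mismatches (Hamming distance), which may differ from what the actual tool supports: a pattern of length m aligned with at most k mismatches still shares at least (m − q + 1) − k·q exact q-grams with the target at the same offsets. All names below are hypothetical.

```python
from collections import Counter, defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of the target to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(text) - q + 1):
        index[text[pos:pos + q]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def search(text, pattern, k, q):
    """Return start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    threshold = (m - q + 1) - k * q
    if threshold <= 0:
        raise ValueError("q is too large for this pattern length and error bound")
    index = build_qgram_index(text, q)
    votes = Counter()
    # Each exact q-gram hit votes for the alignment start it implies.
    for offset in range(m - q + 1):
        for pos in index.get(pattern[offset:offset + q], ()):
            start = pos - offset
            if 0 <= start <= len(text) - m:
                votes[start] += 1
    # Only alignments with enough votes are checked in full; the rest are pruned.
    candidates = [s for s, v in votes.items() if v >= threshold]
    return sorted(s for s in candidates if hamming(text[s:s + m], pattern) <= k)


if __name__ == "__main__":
    print(search("AAAAGGTCAAGGTGACCAAAA", "GGTCATGGTGACC", k=1, q=3))
```

The index is built once per target, so many patterns can be searched against the same sequence without rescanning it.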

Page 42: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Q-gram Indexed Approximate String Matching Tool


Page 44: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Exploring Different Motif Model

Popular Motif Model
  Position Weight Matrix (PWM)
Assumptions
  Fixed-length contiguous
  Independence of nucleotides
Easily handles wildcards
  But difficult to handle gaps
Has been successful in some datasets
  But performs poorly in Tompa (2005) dataset
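A minimal PWM sketch follows, assuming log-odds scores against a uniform background and a pseudocount of 1 (both are illustrative choices, and the binding sites are invented). It makes the two assumptions above visible: every site has the same length, and each column is scored independently.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds matrix from aligned, equal-length binding sites."""
    background = 1.0 / len(alphabet)  # assumed uniform background
    pwm = []
    for column in range(len(sites[0])):
        counts = {letter: pseudocount for letter in alphabet}
        for site in sites:
            counts[site[column]] += 1
        total = sum(counts.values())
        pwm.append({a: math.log2(counts[a] / total / background) for a in alphabet})
    return pwm


def score(pwm, window):
    """Sum of per-position scores: positions are treated as independent."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Start position of the highest-scoring window in the sequence."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(pwm, sequence[i:i + width]))


if __name__ == "__main__":
    pwm = build_pwm(["ACGT", "ACGT", "ACGA"])
    print(best_hit(pwm, "TTACGTTT"))
```

A site with a variable-length gap has no fixed column alignment, which is why this representation struggles with gapped motifs.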

Page 45: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Exploring Different Motif Model

Aim:
  To explore if motifs with gaps fit the data
  To explore different notions of “over-represented”
Approach: de novo motif discovery on existing dataset
  Assuming different models
  Assuming different notions of “over-represented”

Page 46: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Exploring Different Motif Model

Models Tested

Model      Wildcard?  Gaps?
Exact      No         No
No Gap     Yes        No
No IUPAC   No         Yes
General    Yes        Yes

Page 47: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

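Exploring Different Motif Model - Notions of “over-represented”

Count score: the average of the scores s1, s2, s3, s4, i.e. (s1 + s2 + s3 + s4) / 4
P-value: the pattern occurs X times; P(> X times in background) under the Background Model
Estimated probability: from c1, c2, c3, c4,

\[ P(\mathrm{TFBS} \mid c_1, c_2, \ldots, c_4) = \frac{P(\mathrm{TFBS}) \, P(c_1, c_2, \ldots, c_4 \mid \mathrm{TFBS})}{P(c_1, c_2, \ldots, c_4)} \]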

Page 48: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Preliminary Results – Max F-measure

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F-Measure = 2pr/(p+r)
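The three measures can be computed directly from the counts of true positives (TP), false positives (FP) and false negatives (FN). The counts in the example are invented for illustration, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision(tp, fp):
    """Fraction of predicted sites that are real sites."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of real sites that were predicted."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


if __name__ == "__main__":
    p = precision(tp=8, fp=2)
    r = recall(tp=8, fn=8)
    print(p, r, round(f_measure(p, r), 4))
```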

Page 49: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Preliminary Results – Tompa



Page 52: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Preliminary Results

Performance is data-set dependent
Motif model with gaps is worth exploring more
P-value and Est-Prob are worth exploring


Page 54: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Develop an Improved Tool to Search Significant Pattern and Calculate p-value

Given
  A pattern (possibly as general as regular expression)
  A sequence
  A model of the sequence
Want
  The occurrence number of the pattern
  The distribution of occurrence number
  In particular, the p-value

Page 55: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Develop an Improved Tool to Search Significant Pattern and Calculate p-value

Usual Sequence Models

M00: i.i.d., letters equally likely

M0: General i.i.d.

M1: First Order Markov Chain

Mk: Kth Order Markov Chain
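These background models differ only in how much context each letter is conditioned on. The sketch below estimates a Kth order Markov chain from a training sequence, with k = 0 giving the general i.i.d. model M0. The pseudocount and the uniform fallback for unseen contexts are assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_markov(seq, k, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next letter | previous k letters) from one training sequence."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, pseudocount))
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1
    return {
        context: {a: c / sum(nxt.values()) for a, c in nxt.items()}
        for context, nxt in counts.items()
    }


def log_prob(seq, model, k, alphabet="ACGT"):
    """Log-probability of seq given its first k letters; unseen contexts are uniform."""
    total = 0.0
    for i in range(k, len(seq)):
        row = model.get(seq[i - k:i])
        p = row[seq[i]] if row else 1.0 / len(alphabet)
        total += math.log(p)
    return total


if __name__ == "__main__":
    m1 = fit_markov("ACACACAC", k=1)
    print(m1["A"]["C"], log_prob("ACAC", m1, k=1))
```

The number of contexts grows as 4^k for DNA, so higher orders need much longer training sequences to be estimated reliably.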

Page 56: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Finite Markov Chain Imbedding and Pattern Markov Chain

Diagram: C+1 copies of the DFA for the pattern, one per occurrence count (0 times, 1 time, 2 times, …, c times; reaching the last copy means “Occur C times”), driven by the Background Markov Model. Transition blocks are labelled P and Q, with a final 1.

A large Markov Chain, which can be used to calculate the desired probability easily

[27, 58, 28, 76]
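The construction can be sketched for the simplest case, under assumptions that go beyond the slide: the pattern is a single word, the background is i.i.d. (M0) rather than a general Markov model, and overlapping occurrences are counted. The state of the large chain is the pair (DFA state, occurrences so far), with the count capped at C, which corresponds to the C+1 copies of the DFA.

```python
from collections import defaultdict


def next_state(word, state, letter):
    """DFA transition: longest prefix of word that is a suffix of the text read so far."""
    text = word[:state] + letter
    for length in range(min(len(word), len(text)), -1, -1):
        if text.endswith(word[:length]):
            return length


def occurrence_distribution(word, n, probs, max_count):
    """Return [P(N=0), ..., P(N=max_count-1), P(N>=max_count)] for a length-n sequence."""
    m = len(word)
    delta = {(s, a): next_state(word, s, a) for s in range(m + 1) for a in probs}
    dist = {(0, 0): 1.0}  # (DFA state, capped occurrence count) -> probability
    for _ in range(n):
        new_dist = defaultdict(float)
        for (state, count), p in dist.items():
            for letter, p_letter in probs.items():
                target = delta[(state, letter)]
                # Entering the accepting state moves the chain to the next DFA copy.
                new_count = min(count + (target == m), max_count)
                new_dist[(target, new_count)] += p * p_letter
        dist = new_dist
    result = [0.0] * (max_count + 1)
    for (_, count), p in dist.items():
        result[count] += p
    return result


if __name__ == "__main__":
    dist = occurrence_distribution("ABA", n=6, probs={"A": 0.6, "B": 0.4}, max_count=2)
    print("p-value for observing at least 2 occurrences:", dist[2])
```

The last entry of the returned list is the p-value for seeing at least `max_count` occurrences, obtained without enumerating all sequences of length n.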

Page 57: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Existing Tool: SPatt

SPatt (Nuel 2008) allows
  Arbitrary finite alphabet
  Wildcard “.”, which matches any character
  Wildcard “[abc]”, which matches any of a,b,c
  Gaps “.(a-b)”, which means a to b wildcards
  Alternative “p1 | p2”, which means p1 or p2

But currently
  Only 1st Order Markov Model
  No full regular expression

Want to Improve
  Any Kth order Markov Model
  Allow Regular Expression

[76]
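For comparison, the same kinds of pattern can be written as ordinary regular expressions. The sketch below uses Python's `re` syntax, which is not SPatt's own syntax: a gap of a to b wildcards is written `.{a,b}`. A lookahead is used so that overlapping occurrences are found, and the function reports the start positions at which the pattern matches.

```python
import re


def match_starts(pattern, sequence):
    """Start positions where pattern matches, including overlapping occurrences."""
    lookahead = re.compile("(?=(?:%s))" % pattern)
    return [m.start() for m in lookahead.finditer(sequence)]


if __name__ == "__main__":
    # 'A', then C or G, then a gap of 1 to 2 arbitrary letters, then 'T'.
    print(match_starts("A[CG].{1,2}T", "ACGTAGTTACCT"))
    # Alternative of two words; the two occurrences overlap.
    print(match_starts("ACG|CGT", "ACGT"))
```

Counting start positions is one possible definition of the occurrence number; a pattern with gaps can match several substrings from the same start, and those are counted once here.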

Page 58: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Summary

Work Done
  Developed an approximate string matching tool
  Collected TRANSFAC data
  Started exploring different motif models

Page 59: Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

Thanks for your attention!

Q & A