Computational Investigation of Gene Regulatory Elements
Ryan Weddle, Computational Biosciences
Internship Presentation, 12/15/2004




Table of Contents

- Introduction
- Goals
- Methods
- Results
- Discussion
- Acknowledgements


Introduction


Invasive Glioma

- Glioma is a particularly devastating type of brain cancer caused by mutations in glial cells.
- While tumors may be treated through traditional means such as chemotherapy and radiation therapy, these means are less effective at preventing spread and recurrence.
- This is because invasive glioma migrates into other parts of the brain by way of phenotypically different invasive cells.
- These cells are not rapidly dividing and are thus less affected by traditional anti-cancer therapy.


Different Tumor Cells

- Tumor composed of core and periphery
- Motile cells are more prevalent in periphery
- Laser capture microdissection used to separate cell populations

[Figure: tumor core and tumor periphery, with motile cells labeled]


What makes them different?

- Microarray analysis indicated a set of 15 differentially expressed genes.
- The differential levels of mRNA between the two cell populations were verified with qPCR analysis.


What does this mean?

- When a set of genes is differentially expressed in this manner, it is often hypothesized that the genes are co-regulated.
- If they are co-regulated, then understanding their regulation is useful if we wish to block their function through some therapeutic means.


Gene Regulation

- Eukaryotic gene regulation is much more complicated than bacterial gene regulation.
- Takes place on several levels:
  - Chromatin remodeling
  - Transcriptional control
  - Message control
  - Translational control
- We hope to understand the transcriptional control through computational means.


Project Goals


Exploratory Investigation

- This project aims to gain understanding of the mechanisms that regulate these differentially expressed genes.
  - Leverage sequence data
  - Investigate known methods
  - Investigate new methods
  - Generate and test hypotheses


Leveraging Sequence Data

- Two senses in which we are taking advantage of the DNA sequence resources now available:
  - Searching genomic sequence data around our genes for transcription factor binding sites
  - Using sequence data from multiple genomes to narrow our search


Methods


Investigating Known Methods

- Phylogenetic Footprinting
- Transfac Database
- Pattern Detection Algorithms
- Association Rule Mining


Phylogenetic Footprinting

- Look at sequence which has been conserved over evolutionary time:
  - Ignore coding sequences
  - Ignore known repeating sequences
- Hypothesis is that conserved elements are under selective pressure due to some functional role.
- We used PipMaker to create visualizations, and the blastz program to compute ungapped alignments.
- Because few genomes were available at the onset of the project, we used only the human and mouse genomes.
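The idea of scoring conservation by percent identity over an ungapped alignment can be illustrated with a short sketch. This is not blastz or PipMaker: it assumes the two sequences are already aligned without gaps and are the same length, and the function names, window size, and threshold are hypothetical choices made for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of positions at which two equal-length, ungapped sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("an ungapped alignment requires equal-length sequences")
    if not seq_a:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


def conserved_windows(human: str, mouse: str, window: int = 10,
                      min_identity: float = 70.0):
    """Return (start, identity) for each window at or above the threshold.

    Assumption: `human` and `mouse` are already aligned position by position.
    """
    hits = []
    for start in range(len(human) - window + 1):
        identity = percent_identity(human[start:start + window],
                                    mouse[start:start + window])
        if identity >= min_identity:
            hits.append((start, identity))
    return hits
```

Windows that pass the threshold are the candidates a footprinting analysis would keep after coding and repeat sequence have been masked.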


Example: BCL2L2 Gene Pip

- Black regions are ungapped alignments:
  - Human vs. mouse
  - Long segments are often coding sequence
  - Notice some upstream conservation
- Percent identity indicated by y-axis.


Transfac Database

- Database of known transcription factor binding sites (TFBS)
  - Catalogues known occurrences
  - Represents TFBS by consensus sequences and weight matrix methods


Matrix Example: TATA

XX
PO       A       C       G       T
01    3.00    3.89    2.48    3.99    N
02    0.00   10.19    2.66    0.52    C
03    0.33    3.33    0.00    9.71    T
04    9.76    0.00    0.00    3.61    A
05    0.00    0.00    0.00   13.36    T
06   13.36    0.00    0.00    0.00    A
07   12.40    0.00    0.00    0.96    A
08   13.36    0.00    0.00    0.00    A
09   13.36    0.00    0.00    0.00    A
10    3.92    1.36    6.11    1.97    R
XX
BA  total weight of sequences: 13.36
XX
CC  consind generated matrix (random_expectation: 0.30)
XX
//


Transfac Utilization

- We can use Transfac to scan DNA sequences:
  - Find potential occurrences
  - Different scores for different quality of matches
- Cannot be used to find novel binding sites, only novel occurrences of known binding sites.
- Useful tool, but too noisy to be relied on in automated processes.
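To make the scanning step concrete, here is a minimal sketch that slides the TATA weight matrix from the earlier slide along a sequence. The scoring is a deliberate simplification (sum of the weights of the observed bases, divided by the best possible sum) and is an assumption, not the scoring used by the Transfac tools; the function names and the 0.85 threshold are hypothetical.

```python
# Weights copied from the TATA matrix slide; columns are A, C, G, T.
TATA_MATRIX = [
    (3.00, 3.89, 2.48, 3.99),
    (0.00, 10.19, 2.66, 0.52),
    (0.33, 3.33, 0.00, 9.71),
    (9.76, 0.00, 0.00, 3.61),
    (0.00, 0.00, 0.00, 13.36),
    (13.36, 0.00, 0.00, 0.00),
    (12.40, 0.00, 0.00, 0.96),
    (13.36, 0.00, 0.00, 0.00),
    (13.36, 0.00, 0.00, 0.00),
    (3.92, 1.36, 6.11, 1.97),
]
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def match_score(site, matrix=TATA_MATRIX):
    """Simplified score in 0..1: observed weight sum over the best possible sum."""
    best = sum(max(row) for row in matrix)
    observed = sum(row[BASE_INDEX[base]] for row, base in zip(matrix, site.upper()))
    return observed / best


def scan(sequence, matrix=TATA_MATRIX, threshold=0.85):
    """Return (position, site, score) for each window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for pos in range(len(sequence) - width + 1):
        site = sequence[pos:pos + width].upper()
        if any(base not in BASE_INDEX for base in site):
            continue  # skip windows containing N or other masked characters
        score = match_score(site, matrix)
        if score >= threshold:
            hits.append((pos, site, round(score, 3)))
    return hits
```

Lowering the threshold quickly increases the number of hits, which is one way to see why matrix scans are noisy when used without further filtering.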


Pattern Detection Algorithms

- Pattern detection algorithms are useful when we are looking for novel motifs.
- We used the MEME/MAST tools to search our conserved sequences for novel motifs:
  - Most interesting result was an already known splice sequence
  - MEME works best when you know how many occurrences you are expecting and where you are expecting them.


Association Rule Mining

- ARM is a mechanism for finding rules about association between different elements.
  - Classical example is “market basket analysis”
  - Here we are interested in any interesting patterns in the occurrence of TFBS identified by Transfac in our conserved sequences.
- Results in many low quality rules:
  - Typically infrequent or low confidence
  - Best rules found due to overlapping putative binding sites - little informational content
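The two quantities behind "infrequent" and "low confidence" are support and confidence. The sketch below computes both for rules between single items, treating each gene as a "basket" of putative TFBS names. It is an illustrative simplification rather than the mining tool used in the project; the function name, thresholds, and the data layout (a dict mapping gene name to a set of site names) are assumptions.

```python
from itertools import combinations


def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return sorted rules (lhs, rhs, support, confidence) between single items.

    support(lhs -> rhs)    = fraction of transactions containing both items
    confidence(lhs -> rhs) = count(both) / count(lhs)
    """
    n = len(transactions)
    item_counts = {}
    pair_counts = {}
    for items in transactions.values():
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each co-occurring pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)
```

With only a handful of genes, almost every rule either fails the support threshold or is trivially confident, which matches the low quality rules described above.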


Exploratory Investigation Results


Investigating Novel Methods

- All existing methods had shortcomings when applied to our dataset:
  - Transfac highly uncertain
  - Pattern detection and association rule mining failed to yield interesting results
  - Too few elements for meaningful clustering, etc.
- How can we reframe the problem?


Scaling It All Up

- Association rule mining is intended for large databases.
  - Our gene/TF universe was probably too small to result in interesting rules.
- What if we could scale it up?
  - Look at every subsequence up to a certain length in each genomic region
  - Determine identity between short sequences by allowing slight mismatch
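One common way to express "identity allowing slight mismatch" is a cap on the Hamming distance between two equal-length subsequences. The sketch below takes that interpretation as an assumption; the slides do not say which mismatch measure was used, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is defined for equal-length strings")
    return sum(1 for x, y in zip(a, b) if x != y)


def find_occurrences(query: str, sequence: str, max_mismatches: int = 1):
    """Start positions in `sequence` where `query` matches within the mismatch cap."""
    k = len(query)
    return [
        pos
        for pos in range(len(sequence) - k + 1)
        if hamming(query, sequence[pos:pos + k]) <= max_mismatches
    ]
```

Setting `max_mismatches=0` reduces this to exact matching, the restriction adopted later in the analysis.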


Kmer Analysis

- ARM can be modified to find very low support rules that have high certainty - the “needle in haystack.”
- We can build a database of all TFBS-sized short sequences in our conserved sequence data:
  - Mine this database for association rules
  - Interesting rules might indicate functional relationships.


Building the Kmer Database

- Sequence data for each gene was obtained from both mouse and human genomes
  - Repeat sequences and coding regions were masked out.
- Kmer library for all 6-11mers with several degrees of mismatch was constructed
  - 150,000 occurrences of 80,000 unique kmers
  - 550MB on disk
  - 40MB when we exclude all but perfect matches
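A kmer library of this kind can be sketched as an index from each kmer to the places it occurs. The version below covers exact kmers only and assumes masked bases appear as `N` (hard masking); the original database layout and its handling of mismatches are not described in the slides, so the function name and data structures here are hypothetical.

```python
from collections import defaultdict

VALID_BASES = frozenset("ACGT")


def build_kmer_index(sequences, k_min=6, k_max=11):
    """Map each kmer to a list of (sequence_name, position) occurrences.

    Windows containing anything other than A, C, G, T (for example the N
    characters left by masking) are skipped.
    """
    index = defaultdict(list)
    for name, seq in sequences.items():
        seq = seq.upper()
        for k in range(k_min, k_max + 1):
            for pos in range(len(seq) - k + 1):
                kmer = seq[pos:pos + k]
                if VALID_BASES.issuperset(kmer):
                    index[kmer].append((name, pos))
    return dict(index)
```

Because every position contributes one entry per value of k, the index grows quickly with sequence length, which is consistent with the database sizes reported above.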


Refining the Kmer Database

- This is still a very large database!
  - Likely to result in many rules
  - Hard to analyze
- How can we easily measure the similarity within this database, before devoting time to implementing new algorithms?
- Narrow database to include only exactly matching 11mers
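A single similarity number can be obtained by counting exactly matching 11mers. The slides do not define the count precisely, so the sketch below makes an assumption: it counts distinct 11mers that occur in at least two different sequences. Reverse complements are not considered, and the function name is hypothetical.

```python
VALID = frozenset("ACGT")


def shared_kmer_count(sequences, k=11):
    """Count distinct kmers that appear in two or more of the named sequences."""
    owners = {}  # kmer -> set of sequence names containing it
    for name, seq in sequences.items():
        seq = seq.upper()
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if VALID.issuperset(kmer):  # skip windows with masked bases
                owners.setdefault(kmer, set()).add(name)
    return sum(1 for names in owners.values() if len(names) >= 2)
```

The same function can be applied unchanged to a target gene set and to any comparison set, so the resulting counts are directly comparable.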


Research Hypothesis

- “There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be expected from random sequence data.”
- If we can confirm this hypothesis, we can assert that there is interesting informational content at the sequence level.
  - Worthwhile to investigate further


Randomized Sequence Data

- We needed a basis for comparison to determine whether the short sequence similarity observed in our data set was significant.
  - Generate random sequence data that maintains the same nucleotide bias for each sequence fragment
  - Perform kmer analyses on each of these random trials
  - 100 trials in total
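Generating composition-matched random sequence can be sketched as follows. The approach shown, drawing each base independently with the fragment's own base frequencies as weights, is one reasonable reading of "maintains the same nucleotide bias" and is an assumption; the function names and the fixed seed are hypothetical.

```python
import random
from collections import Counter


def random_like(fragment: str, rng: random.Random) -> str:
    """Random sequence with the same length and base frequencies (in expectation)."""
    counts = Counter(base for base in fragment.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return ""  # nothing but masked bases: no composition to imitate
    bases = "ACGT"
    weights = [counts[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=total))


def random_trials(fragments, n_trials=100, seed=0):
    """Build n_trials randomized copies of the whole fragment collection."""
    rng = random.Random(seed)
    return [[random_like(fragment, rng) for fragment in fragments]
            for _ in range(n_trials)]
```

Each trial can then be run through the same kmer counting as the real data to build a null distribution of match counts.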


Research Hypothesis Results

Randomly Generated Sequences


Summary Statistics

- 11mer distributions calculated for both:
  - Uniform nucleotide distribution
  - Same distribution as in target data: A=26.7% C=23.0% G=26.1% T=24.2%
- Z-test: Is our observed count of 73 11mers higher than the population mean?
  - Z score = -46
  - P-value < 10^-6
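The z-test here compares one observed count against the counts from the random trials. A minimal sketch follows, assuming the usual definition z = (observed - mean) / standard deviation and a normal approximation for the p-value. The slides do not state in which direction the difference was taken, so the sign produced here may differ from the reported one; the function name and any numbers passed in are hypothetical.

```python
import math
from statistics import mean, stdev


def z_test(observed, null_counts):
    """Return (z, upper_tail_p) for `observed` against a sample of null counts.

    upper_tail_p is P(Z >= z) under a standard normal distribution.
    """
    mu = mean(null_counts)
    sigma = stdev(null_counts)  # sample standard deviation of the trials
    if sigma == 0:
        raise ValueError("null counts have zero variance; z is undefined")
    z = (observed - mu) / sigma
    upper_tail_p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, upper_tail_p
```

The normal approximation is only as good as the normality of the trial counts, which is why the shape of the null distribution is worth checking.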


Hypothesis Revisited

- Results looked promising...
- However, they depended on assumptions about random sequence data.
- Therefore, we revised our hypothesis:
  - “There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be found by randomly sampling sets of genes from the human and mouse genomes.”
- Confirming this hypothesis would provide concrete evidence that our observed 11mer similarity constituted a meaningful departure from the norm.


Analyzing Random Genes

- Downloaded all human-mouse homologs from EMBL
- Performed pre-processing on all homolog pairs
  - Repeatmasking
  - Blastz for phylogenetic footprinting
- Randomly selected 100 sets of genes
  - Performed 11mer analysis on every set
  - Catalogued results
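The random gene set comparison can be sketched as sampling without replacement followed by an empirical tail fraction. The set size of 15 used in the usage note matches the number of differentially expressed genes mentioned earlier, but the slides do not state the size of the random sets, so that value, the function names, and the seed are assumptions.

```python
import random


def sample_gene_sets(gene_ids, set_size, n_sets=100, seed=0):
    """Draw n_sets random gene sets, each sampled without replacement."""
    rng = random.Random(seed)
    pool = list(gene_ids)
    return [rng.sample(pool, set_size) for _ in range(n_sets)]


def fraction_at_least(observed, null_counts):
    """Fraction of null counts that are greater than or equal to the observed count."""
    return sum(1 for count in null_counts if count >= observed) / len(null_counts)
```

For example, `sample_gene_sets(homolog_ids, 15)` would give 100 sets to run through the 11mer analysis, and `fraction_at_least(73, counts)` would report how often random gene sets reach the observed count, where `homolog_ids` and `counts` are placeholders for the real inputs.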


Research Hypothesis Results

Randomly Selected Sequences


Distributions Overlay


Comparing the Distributions

- All distributions appear normal.
- 73 observed 11mer matches are clearly:
  - More occurrences than expected from random sequence
  - Far fewer than expected from randomly selected genes
- What’s going on here?


Discussion


Conclusions

- 73 observed 11mer matches are anecdotally interesting.
  - Transfac matches for TATA, various TFs
- However, our most exhaustive results indicate that we cannot claim that the number of matches is statistically significant.
- But there are more variables involved in the final analysis, which could be controlled for in further analyses.


Possible Confounding Factors

- Amount of conserved sequence may differ due to:
  - Percent conservation
  - Size of genes
- Controlled for in random sequence generation, but not in random gene selection
  - Assumes all genes are comparable
- Controlling for these factors could be a good avenue for future research


Questioning Assumptions

- Everything rests on the assumption that our target set of genes is co-regulated by common elements at the DNA sequence level.
  - Further assumption that regulatory mechanism is local to the genes
  - What about chromatin and its role in regulation?


Suggestions for Future Work

- It would be useful to repeat the final tests while controlling for gene size and conservation.
- Consider testing these same methods on an already well characterized set of co-regulated genes, rather than on an investigative data set.
- Research methods for taking chromatin and DNA sequence structure into account.


Things I Learned

- In exploratory investigations, Perl is your best friend.
  - Much of this would have been impossible to do manually.
  - Perl really is faster for rapid prototyping when you don’t know in advance what your needs will be.
- You can try new methods on old data, or old methods on new data, but developing new methods on new data is difficult.


Acknowledgements

- Dr. Jeff Touchman (Tgen, ASU)
- Dr. Phillip Stafford (Tgen, ASU)
- Dr. Rosemary Renaut (ASU)
- Dr. Michael Berens (Tgen)
- Dr. Huan Liu (ASU)
- Dominique Hoelzinger (Tgen)
- Maulik Shah (Tgen, ASU)