Computational Investigation of Gene Regulatory Elements · cbs/projects/2004_presentation... · 2008-09-25

1

Computational Investigation of Gene Regulatory Elements

Ryan Weddle
Computational Biosciences

Internship Presentation
12/15/2004

2

Table of Contents

Introduction . . . . 3
Goals . . . . . 9
Methods . . . . 12
Results . . . . . 21
Discussion . . . . 37
Acknowledgements . . 43

3

Introduction

4

Invasive Glioma

- Glioma is a particularly devastating type of brain cancer caused by mutations to glial cells.
- While tumors may be treated through traditional means such as chemotherapy and radiation therapy, these means are less effective at preventing spread and recurrence.
- This is because invasive glioma spreads into other parts of the brain through phenotypically different invasive cells.
- These cells are not rapidly dividing and are thus less affected by traditional anti-cancer therapy.

5

Different Tumor Cells

- Tumor composed of core and periphery
- Motile cells are more prevalent in periphery
- Laser capture microdissection used to separate cell populations

[Figure: tumor image labeled Tumor Periphery, Tumor Core, and Motile Cells]

6

What makes them different?

- Microarray analysis indicated a set of 15 differentially expressed genes.
- The differential levels of mRNA between the two cell populations were verified with qPCR analysis.

7

What does this mean?

- When a set of genes is differentially expressed in this manner, it is often hypothesized that they may be co-regulated.
- If they are co-regulated, then understanding their regulation is useful if we wish to prevent their function through some therapeutic means.

8

Gene Regulation

- Eukaryotic gene regulation is much more complicated than bacterial gene regulation.
- Takes place on several levels:
  - Chromatin remodeling
  - Transcriptional control
  - Message control
  - Translational control
- We hope to understand the transcriptional control through computational means.

9

Project Goals

10

Exploratory Investigation

- This project aims to gain understanding of the mechanisms that regulate these differentially expressed genes.
  - Leverage sequence data
  - Investigate known methods
  - Investigate new methods
  - Generate and test hypotheses

11

Leveraging Sequence Data

- Two senses in which we are taking advantage of the DNA sequence resources now available:
  - Searching genomic sequence data around our genes for transcription factor binding sites
  - Using sequence data from multiple genomes to narrow our search

12

Methods

13

Investigating Known Methods

- Phylogenetic Footprinting
- Transfac Database
- Pattern Detection Algorithms
- Association Rule Mining

14

Phylogenetic Footprinting

- Look at sequence which has been conserved over evolutionary time:
  - Ignore coding sequences
  - Ignore known repeating sequences
- Hypothesis is that conserved elements are under selective pressure due to some functional role.
- We used PipMaker to create visualizations, and the blastz software program to compute ungapped alignments.
- Due to limited availability at onset of project, we used only human and mouse genomes.

15

Example: BCL2L2 Gene Pip

- Black regions are ungapped alignments:
  - Human vs mouse
  - Long segments often codons
  - Notice some upstream conservation

Percent identity indicated by y-axis.

16

Transfac Database

- Database of known transcription factor binding sites
  - Catalogues known occurrences
  - Represent TFBS by consensus sequences and weight matrix methods

17

Matrix Example: TATA

XX
PO      A      C      G      T
01   3.00   3.89   2.48   3.99   N
02   0.00  10.19   2.66   0.52   C
03   0.33   3.33   0.00   9.71   T
04   9.76   0.00   0.00   3.61   A
05   0.00   0.00   0.00  13.36   T
06  13.36   0.00   0.00   0.00   A
07  12.40   0.00   0.00   0.96   A
08  13.36   0.00   0.00   0.00   A
09  13.36   0.00   0.00   0.00   A
10   3.92   1.36   6.11   1.97   R
XX
BA  total weight of sequences: 13.36
XX
CC  consind generated matrix (random_expectation: 0.30)
XX
//

18

Transfac Utilization

- We can use Transfac to scan DNA sequences:
  - Find potential occurrences
  - Different scores for different quality of matches
- Cannot be used to find novel binding sites, only novel occurrences of known binding sites.
- Useful tool, but too noisy to be relied on in automated processes.

19

Pattern Detection Algorithms

- Pattern detection algorithms are useful when we are looking for novel motifs.
- We used the MEME/MAST tools to search our conserved sequences for novel motifs:
  - Most interesting result was an already known splice sequence
  - MEME works best when you know how many occurrences you are expecting and where you are expecting them.

20

Association Rule Mining

- ARM is a mechanism for finding rules about association between different elements.
  - Classical example is “market basket analysis”
  - Here we are interested in any interesting patterns in the occurrence of TFBS identified by Transfac in our conserved sequences.
- Results in many low quality rules:
  - Typically infrequent or low confidence
  - Best rules found due to overlapping putative binding sites - little informational content
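The two measures behind "infrequent or low confidence" are support and confidence. The sketch below computes both for pairwise rules of the form "if site X occurs in a sequence, site Y occurs too". It is a toy version restricted to pairs, not the mining software used in the project; the function name, thresholds, and any input data are assumptions.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return sorted (lhs, rhs, support, confidence) tuples for rules lhs -> rhs.

    transactions: list of sets, one set of binding-site labels per sequence.
    support    = fraction of transactions containing both items
    confidence = count(both) / count(lhs)
    """
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for t in transactions:
        for item in t:
            item_count[item] = item_count.get(item, 0) + 1
        for a, b in combinations(sorted(t), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)
```

With only a handful of genes, few pairs clear both thresholds, which matches the low-quality rules reported above.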

21

Exploratory Investigation
Results

22

Investigating Novel Methods

- All existing methods had shortcomings when applied to our dataset:
  - Transfac highly uncertain
  - Pattern detection and association rule mining failed to yield interesting results
  - Too few elements for meaningful clustering, etc.
- How can we reframe the problem?

23

Scaling It All Up

- Association rule mining is intended for large databases.
  - Our gene/TF universe was probably too small to result in interesting rules.
- What if we could scale it up?
  - Look at every subsequence up to a certain length in each genomic region
  - Determine identity between short sequences by allowing slight mismatch

24

Kmer Analysis

- ARM can be modified to find very low support rules that have high certainty - the “needle in haystack.”
- We can build a database of all TFBS sized short sequences in our conserved sequence data:
  - Mine this database for association rules
  - Interesting rules might indicate functional relationships.

25

Building the Kmer Database

- Sequence data for each gene was obtained from both mouse and human genomes
  - Repeat sequences and coding regions were masked out.
- Kmer library for all 6-11mers with several degrees of mismatch was constructed
  - 150,000 occurrences of 80,000 unique kmers
  - 550MB on disk
  - 40MB when we exclude all but perfect matches
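A minimal sketch of such a library for exact matches only, assuming masked bases have been replaced with `N` (hard masking). The dictionary layout, function name, and parameters are illustrative assumptions, not the project's actual on-disk format.

```python
from collections import defaultdict


def build_kmer_index(sequences, k_min=6, k_max=11, mask_char="N"):
    """Map each kmer to the list of (sequence_id, position) where it occurs.

    sequences: dict of sequence_id -> hard-masked sequence string.
    Kmers that overlap a masked base are skipped.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        for k in range(k_min, k_max + 1):
            for pos in range(len(seq) - k + 1):
                kmer = seq[pos:pos + k]
                if mask_char in kmer:
                    continue
                index[kmer].append((seq_id, pos))
    return dict(index)
```

Storing mismatch neighbours as well as exact occurrences multiplies the number of entries, which is consistent with the size difference the slide reports between the full library and the perfect-match subset.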

26

Refining the Kmer Database

- This is still a very large database!
  - Likely to result in many rules
  - Hard to analyze
- How can we easily measure the similarity within this database, before devoting time to implementing new algorithms?
  - Narrow database to include only exactly matching 11mers

27

Research Hypothesis

- “There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be expected from random sequence data.”
- If we can confirm this hypothesis, we can assert that there is interesting informational content at the sequence level.
  - Worthwhile to investigate further

28

Randomized Sequence Data

- We needed a basis for comparison to determine whether the short sequence similarity observed in our data set was significant.
  - Generate random sequence data that maintains the same nucleotide bias for each sequence fragment
  - Perform kmer analyses on each of these random trials
  - 100 trials in total
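A sketch of this procedure, assuming each random fragment is drawn base by base from the nucleotide frequencies of the fragment it replaces. Shuffling each fragment would be another way to keep the bias; the slides do not say which was used. The function names, the seed, and the `statistic` callback are assumptions.

```python
import random
from collections import Counter


def random_like(fragment, rng):
    """Random sequence of the same length, drawn from the fragment's base frequencies."""
    counts = Counter(fragment)
    bases = sorted(counts)
    weights = [counts[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(fragment)))


def run_trials(fragments, statistic, trials=100, seed=0):
    """Apply `statistic` to `trials` randomized copies of the fragment list."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        randomized = [random_like(f, rng) for f in fragments]
        results.append(statistic(randomized))
    return results
```

Here `statistic` stands for whatever kmer analysis is being compared, for example a function that counts shared 11mers in a list of sequences.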

29

Research Hypothesis Results

Randomly Generated Sequences

30

31

Summary Statistics

- 11mer distributions calculated for both
  - Uniform nucleotide distribution
  - Same distribution as in target data
    A=26.7% C=23.0% G=26.1% T=24.2%
- Z-test: Is our observed count of 73 11mers higher than the population mean?
  - Z score = -46
  - P-value < 10^-6
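For reference, a Z score and one-sided p-value can be computed from the random-trial counts as sketched below. The sign of a Z score depends on the order of subtraction; the observed-minus-mean order used here is an assumption, as is the function name. No trial counts from the project are reproduced.

```python
import math
import statistics


def z_test(observed, null_values):
    """Return (z, p_upper) comparing `observed` with a list of null-trial values.

    z       = (observed - mean) / sample standard deviation
    p_upper = P(Z >= z) under a standard normal distribution
    """
    mean = statistics.mean(null_values)
    sd = statistics.stdev(null_values)
    z = (observed - mean) / sd
    # Upper-tail probability of the standard normal via the complementary error function.
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_upper
```

The test assumes the null counts are approximately normal, which is worth checking against the histogram of the trials.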

32

Hypothesis Revisited

- Results looked promising...
- However, they depended on assumptions about random sequence data.
- Therefore, we revised our hypothesis:
  “There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be found by randomly sampling sets of genes from the human and mouse genomes.”
- Confirming this hypothesis would provide concrete evidence that our observed 11mer similarity constituted a meaningful departure from the norm.

33

Analyzing Random Genes

- Downloaded all human-mouse homologs from EMBL
- Performed pre-processing on all homolog pairs
  - Repeatmasking
  - Blastz for phylogenetic footprinting
- Randomly selected 100 sets of genes
- Performed 11mer analysis on every set
- Catalogued results
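The sampling step and a simple comparison against the resulting counts can be sketched as follows. The set size would presumably match the 15 target genes, but that, the function names, and the seed are assumptions.

```python
import random


def sample_gene_sets(all_genes, set_size, n_sets=100, seed=0):
    """Draw n_sets random gene sets, each sampled without replacement."""
    rng = random.Random(seed)
    return [rng.sample(all_genes, set_size) for _ in range(n_sets)]


def fraction_at_least(observed, null_counts):
    """Fraction of random-set counts that are >= the observed count."""
    return sum(1 for c in null_counts if c >= observed) / len(null_counts)
```

A fraction close to 1 would mean the observed count is small relative to random gene sets, and a fraction close to 0 would mean it is large.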

34

Research Hypothesis Results

Randomly Selected Sequences

35

Distributions Overlay

36

Comparing the Distributions

- All distributions appear normal.
- 73 observed 11mer matches are clearly
  - More occurrences than expected from random sequence
  - Far fewer than expected from randomly selected genes
- What’s going on here?

37

Discussion

38

Conclusions

- 73 observed 11mer matches are anecdotally interesting.
  - Transfac matches for TATA, various TFs
- Our most exhaustive results indicate, however, that we cannot claim that the number of matches is statistically significant.
- But there are more variables involved in the final analysis, which could be controlled for in further analyses.

39

Possible Confounding Factors

- Amount of conserved sequence may differ due to:
  - Percent conservation
  - Size of genes
- Controlled for in random sequence generation, but not in random gene selection
  - Assumes all genes are comparable
- Controlling for these factors could be a good avenue for future research
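One hedged sketch of such a control is matched sampling: for each target gene, pick a random gene whose amount of conserved sequence is within a tolerance of the target's. This was not part of the project; the function name, the tolerance, and the dictionary input are assumptions.

```python
import random


def matched_sample(target_lengths, pool, tolerance=0.2, seed=0):
    """Pick one unused pool gene per target with a similar conserved length.

    target_lengths: conserved-sequence lengths of the target genes.
    pool: dict of gene_id -> conserved-sequence length for candidate genes.
    A candidate matches when its length is within tolerance * target length.
    """
    rng = random.Random(seed)
    available = dict(pool)
    chosen = []
    for length in target_lengths:
        candidates = [g for g, n in available.items()
                      if abs(n - length) <= tolerance * length]
        if not candidates:
            raise ValueError("no unused gene within tolerance of length %d" % length)
        pick = rng.choice(sorted(candidates))
        chosen.append(pick)
        # Remove the pick so that no gene is used twice in one set.
        del available[pick]
    return chosen
```

Repeating this draw many times would give random gene sets that are comparable to the target set in conserved length.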

40

Questioning Assumptions

- Everything rests on the assumption that our target set of genes is co-regulated by common elements at the DNA sequence level.
  - Further assumption that regulatory mechanism is local to the genes
  - What about chromatin and its role in regulation?

41

Suggestions for Future Work

- It would be useful to repeat the final tests while controlling for gene size and conservation.
- Consider testing these same methods on an already well characterized set of co-regulated genes, rather than on an investigative data set.
- Research methods for taking chromatin and DNA sequence structure into account.

42

Things I Learned

- In exploratory investigations, Perl is your best friend.
  - Much of this would have been impossible to do manually.
  - Perl really is faster for rapid prototyping when you don’t know in advance what your needs will be.
- You can try new methods on old data, or old methods on new data, but developing new methods on new data is difficult.

43

Acknowledgements

Dr. Jeff Touchman . . Tgen, ASU
Dr. Phillip Stafford . Tgen, ASU
Dr. Rosemary Renaut . ASU
Dr. Michael Berens . . Tgen
Dr. Huan Liu . . . . ASU
Dominique Hoelzinger . Tgen
Maulik Shah . . . . Tgen, ASU
