1
Computational Investigation of Gene Regulatory Elements
Ryan Weddle
Computational Biosciences
Internship Presentation
12/15/2004
2
Table of Contents
Introduction . . . . 3
Goals . . . . . 9
Methods . . . . 12
Results . . . . . 21
Discussion . . . . 37
Acknowledgements . . 43
3
Introduction
4
Invasive Glioma
Glioma is a particularly devastating type of brain cancer caused by mutations to glial cells.
While tumors may be treated through traditional means such as chemotherapy and radiation therapy, these means are less effective at preventing spread and recurrence.
This is because invasive glioma migrates into other parts of the brain by means of phenotypically different invasive cells.
These cells are not rapidly dividing and are thus less affected by traditional anti-cancer therapy.
5
Different Tumor Cells
Tumor composed of core and periphery
Motile cells are more prevalent in periphery
Laser capture microdissection used to separate cell populations
[Figure labels: Tumor Periphery, Tumor Core, Motile Cells]
6
What makes them different?
Microarray analysis indicated a set of 15 differentially expressed genes.
The differential levels of mRNA between the two cell populations were verified with qPCR analysis.
7
What does this mean?
When a set of genes is differentially expressed in this manner, it is often hypothesized that they may be co-regulated.
If they are co-regulated, then understanding their regulation is useful if we wish to prevent their function through some therapeutic means.
8
Gene Regulation
Eukaryotic gene regulation is much more complicated than bacterial gene regulation.
Takes place on several levels:
  Chromatin remodeling
  Transcriptional control
  Message control
  Translational control
We hope to understand the transcriptional control through computational means.
9
Project Goals
10
Exploratory Investigation
This project aims to gain understanding of the mechanisms that regulate these differentially expressed genes.
  Leverage sequence data
  Investigate known methods
  Investigate new methods
  Generate and test hypotheses
11
Leveraging Sequence Data
Two senses in which we are taking advantage of the DNA sequence resources now available:
  Searching genomic sequence data around our genes for transcription factor binding sites
  Using sequence data from multiple genomes to narrow our search
12
Methods
13
Investigating Known Methods
Phylogenetic Footprinting
Transfac Database
Pattern Detection Algorithms
Association Rule Mining
14
Phylogenetic Footprinting
Look at sequence which has been conserved over evolutionary time:
  Ignore coding sequences
  Ignore known repeating sequences
Hypothesis is that conserved elements are under selective pressure due to some functional role.
We used PipMaker to create visualizations, and the blastz software program to compute ungapped alignments.
Due to limited availability at onset of project, we used only human and mouse genomes.
15
Example: BCL2L2 Gene Pip
Black regions are ungapped alignments:
  Human vs mouse
  Long segments often codons
  Notice some upstream conservation
Percent identity indicated by y-axis.
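The y-axis quantity can be illustrated with a small Python sketch. The function name and the example segments are hypothetical; PipMaker and blastz compute this internally, and this is not their code.

```python
def percent_identity(human_segment, mouse_segment):
    """Percent of aligned positions with the same base in an ungapped alignment."""
    if len(human_segment) != len(mouse_segment) or not human_segment:
        raise ValueError("ungapped segments must be non-empty and equal in length")
    # Count columns where both species carry the same base.
    matches = sum(1 for h, m in zip(human_segment, mouse_segment) if h == m)
    return 100.0 * matches / len(human_segment)
```

For two made-up 8-base segments that differ at one position, this gives 87.5.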
16
Transfac Database
Database of known transcription factor binding sites (TFBS)
  Catalogues known occurrences
  Represents TFBS by consensus sequences and weight matrix methods
17
Matrix Example: TATA
XX
PO      A      C      G      T
01   3.00   3.89   2.48   3.99   N
02   0.00  10.19   2.66   0.52   C
03   0.33   3.33   0.00   9.71   T
04   9.76   0.00   0.00   3.61   A
05   0.00   0.00   0.00  13.36   T
06  13.36   0.00   0.00   0.00   A
07  12.40   0.00   0.00   0.96   A
08  13.36   0.00   0.00   0.00   A
09  13.36   0.00   0.00   0.00   A
10   3.92   1.36   6.11   1.97   R
XX
BA  total weight of sequences: 13.36
XX
CC  consind generated matrix (random_expectation: 0.30)
XX
//
18
Transfac Utilization
We can use Transfac to scan DNA sequences:
  Find potential occurrences
  Different scores for different quality of matches
Cannot be used to find novel binding sites, only novel occurrences of known binding sites.
Useful tool, but too noisy to be relied on in automated processes.
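A minimal Python sketch of weight-matrix scanning, using the TATA matrix from the previous slide. The scoring rule (raw weight sum rescaled between the matrix minimum and maximum), the 0.8 threshold, and all function names are assumptions for illustration; Transfac's own scanning tools may score matches differently.

```python
# Weights copied from the TATA matrix example (columns A, C, G, T).
TATA_MATRIX = [
    (3.00, 3.89, 2.48, 3.99),
    (0.00, 10.19, 2.66, 0.52),
    (0.33, 3.33, 0.00, 9.71),
    (9.76, 0.00, 0.00, 3.61),
    (0.00, 0.00, 0.00, 13.36),
    (13.36, 0.00, 0.00, 0.00),
    (12.40, 0.00, 0.00, 0.96),
    (13.36, 0.00, 0.00, 0.00),
    (13.36, 0.00, 0.00, 0.00),
    (3.92, 1.36, 6.11, 1.97),
]
BASES = "ACGT"


def window_score(matrix, window):
    """Sum the matrix weight of the observed base at each position."""
    return sum(row[BASES.index(base)] for row, base in zip(matrix, window))


def relative_score(matrix, window):
    """Rescale the raw score to the range 0..1 (assumed scoring scheme)."""
    lowest = sum(min(row) for row in matrix)
    highest = sum(max(row) for row in matrix)
    return (window_score(matrix, window) - lowest) / (highest - lowest)


def scan(matrix, sequence, threshold=0.8):
    """Return (offset, window, score) for each window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if not set(window) <= set(BASES):
            continue  # skip windows containing N or other masked characters
        score = relative_score(matrix, window)
        if score >= threshold:
            hits.append((i, window, round(score, 3)))
    return hits
```

Lowering the threshold returns more, weaker matches, which is one way to see why such scans are noisy on long genomic regions.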
19
Pattern Detection Algorithms
Pattern detection algorithms are useful when we are looking for novel motifs.
We used the MEME/MAST tools to search our conserved sequences for novel motifs:
  Most interesting result was an already known splice sequence
  MEME works best when you know how many occurrences you are expecting and where you are expecting them.
20
Association Rule Mining
ARM is a mechanism for finding rules about association between different elements.
  Classical example is “market basket analysis”
  Here we are interested in any interesting patterns in the occurrence of TFBS identified by Transfac in our conserved sequences.
Results in many low quality rules:
  Typically infrequent or low confidence
  Best rules found due to overlapping putative binding sites - little informational content
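The support and confidence measures behind these rules can be sketched in Python. This is a simplified pairwise version, not the ARM software used in the project; the function name, thresholds, and the input layout (one set of TFBS names per conserved sequence) are assumptions.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Find rules 'lhs -> rhs' between pairs of items.

    transactions: dict mapping a sequence id to the set of TFBS names found in it.
    support    = fraction of sequences containing both items
    confidence = fraction of sequences containing lhs that also contain rhs
    """
    n = len(transactions)
    item_counts = {}
    pair_counts = {}
    for items in transactions.values():
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # too infrequent to report
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)
```

With only a handful of genes, few pairs clear both thresholds, which matches the "infrequent or low confidence" outcome described above.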
21
Exploratory Investigation
Results
22
Investigating Novel Methods
All existing methods had shortcomings when applied to our dataset:
  Transfac highly uncertain
  Pattern detection and association rule mining failed to yield interesting results
  Too few elements for meaningful clustering, etc.
How can we reframe the problem?
23
Scaling It All Up
Association rule mining is intended for large databases.
  Our gene/TF universe was probably too small to result in interesting rules.
What if we could scale it up?
  Look at every subsequence up to a certain length in each genomic region
  Determine identity between short sequences by allowing slight mismatch
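One common way to express "identity with slight mismatch" is a Hamming-distance cutoff. The Python sketch below assumes that interpretation; the function names and the default of one allowed mismatch are illustrative, not taken from the project code.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def same_kmer(a, b, max_mismatch=1):
    """Treat two short sequences as identical if they differ in few positions."""
    return len(a) == len(b) and hamming(a, b) <= max_mismatch
```

Setting max_mismatch to 0 reduces this to exact matching.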
24
Kmer Analysis
ARM can be modified to find very low support rules that have high certainty - the “needle in haystack.”
We can build a database of all TFBS-sized short sequences in our conserved sequence data:
  Mine this database for association rules
  Interesting rules might indicate functional relationships.
25
Building the Kmer Database
Sequence data for each gene was obtained from both mouse and human genomes
  Repeat sequences and coding regions were masked out.
Kmer library for all 6-11mers with several degrees of mismatch was constructed
  150,000 occurrences of 80,000 unique kmers
  550MB on disk
  40MB when we exclude all but perfect matches
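A rough Python sketch of building such a library for exact kmers only (mismatch handling is left out). It assumes masked bases appear as N or lowercase so that any window touching them is skipped; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(regions, k_min=6, k_max=11):
    """Index every unmasked k-mer (k_min..k_max) by where it occurs.

    regions: dict mapping a region name to its masked sequence.
    Returns a dict: kmer -> list of (region name, offset).
    """
    index = defaultdict(list)
    for name, seq in regions.items():
        for k in range(k_min, k_max + 1):
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Keep only windows made entirely of unmasked uppercase bases.
                if set(kmer) <= set("ACGT"):
                    index[kmer].append((name, i))
    return index
```

Every position contributes up to six kmers (one per length), which is part of why the on-disk size grows quickly.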
26
Refining the Kmer Database
This is still a very large database!
  Likely to result in many rules
  Hard to analyze
How can we easily measure the similarity within this database, before devoting time to implementing new algorithms?
Narrow database to include only exactly matching 11mers
27
Research Hypothesis
“There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be expected from random sequence data.”
If we can confirm this hypothesis, we can assert that there is interesting informational content at the sequence level.
  Worthwhile to investigate further
28
Randomized Sequence Data
We needed a basis for comparison to determine whether the short sequence similarity observed in our data set was significant.
  Generate random sequence data that maintains the same nucleotide bias for each sequence fragment
  Perform kmer analyses on each of these random trials
  100 trials in total
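The randomization step might look like the following Python sketch. Two points are assumptions: random fragments are drawn base by base from each fragment's own nucleotide frequencies, and an "11mer match" is counted as a distinct 11mer present in at least two different fragments. The slides do not spell out either detail, and the function names are hypothetical.

```python
import random
from collections import Counter


def random_like(fragment, rng):
    """Draw a random sequence with the fragment's length and base frequencies."""
    counts = Counter(b for b in fragment if b in "ACGT")
    if not counts:
        return ""  # nothing unmasked to model
    bases = sorted(counts)
    weights = [counts[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(fragment)))


def shared_kmer_count(fragments, k=11):
    """Count distinct k-mers that occur in at least two different fragments."""
    owners = {}
    for idx, frag in enumerate(fragments):
        for i in range(len(frag) - k + 1):
            owners.setdefault(frag[i:i + k], set()).add(idx)
    return sum(1 for found_in in owners.values() if len(found_in) > 1)


def null_distribution(fragments, trials=100, k=11, seed=0):
    """Shared k-mer counts from repeated randomizations of the fragments."""
    rng = random.Random(seed)
    return [
        shared_kmer_count([random_like(f, rng) for f in fragments], k)
        for _ in range(trials)
    ]
```

The list returned by null_distribution is the kind of data a histogram of randomly generated sequences would be drawn from.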
29
Research Hypothesis Results
Randomly Generated Sequences
30
31
Summary Statistics
11mer distributions calculated for both
  Uniform nucleotide distribution
  Same distribution as in target data: A=26.7% C=23.0% G=26.1% T=24.2%
Z-test: Is our observed count of 73 11mers higher than the population mean?
  Z score = -46
  P-value < 10^-6
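For reference, a Z-test of an observed count against a set of random-trial counts can be sketched in Python as below. The sign convention (observed minus mean) and the one-sided upper-tail p-value are assumptions; the score reported on this slide may have been computed with a different convention, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev


def z_test_upper(observed, null_counts):
    """Z score of an observed count and its one-sided (upper tail) p-value.

    null_counts: counts from the random trials, treated as a sample from a
    normal population. Raises ZeroDivisionError if all counts are identical.
    """
    mu = mean(null_counts)
    sigma = stdev(null_counts)  # sample standard deviation
    z = (observed - mu) / sigma
    # Upper-tail probability of a standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

A large positive z under this convention means the observed count sits far above the random-trial mean.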
32
Hypothesis Revisited
Results looked promising...
However, they depended on assumptions about random sequence data.
Therefore, we revised our hypothesis:
“There is more short sequence similarity, as measured by exactly matching 11mers, in our target sequence corpus, than would be found by randomly sampling sets of genes from the human and mouse genomes.”
Confirming this hypothesis would provide concrete evidence that our observed 11mer similarity constituted a meaningful departure from the norm.
33
Analyzing Random Genes
Downloaded all human-mouse homologs from EMBL
Performed pre-processing on all homolog pairs
  Repeatmasking
  Blastz for phylogenetic footprinting
Randomly selected 100 sets of genes
Performed 11mer analysis on every set
Catalogued results
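The sampling step can be sketched in Python as follows. The set size is left as a parameter because the slides do not state it (matching the 15 target genes would be one natural choice), and the helper for comparing an observed count with the sampled counts is an illustrative addition rather than the project's own analysis.

```python
import random


def sample_gene_sets(homolog_ids, set_size, n_sets=100, seed=0):
    """Draw n_sets random gene sets, each without replacement."""
    rng = random.Random(seed)
    pool = sorted(homolog_ids)  # fixed order so a given seed is reproducible
    return [rng.sample(pool, set_size) for _ in range(n_sets)]


def fraction_at_least(observed, sampled_counts):
    """Fraction of sampled gene sets whose count is at least the observed count."""
    return sum(c >= observed for c in sampled_counts) / len(sampled_counts)
```

A fraction near 1 would mean that most random gene sets show at least as many matches as the target set.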
34
Research Hypothesis Results
Randomly Selected Sequences
35
Distributions Overlay
36
Comparing the Distributions
All distributions appear normal.
73 observed 11mer matches are clearly
  More occurrences than expected from random sequence
  Far fewer than expected from randomly selected genes
What’s going on here?
37
Discussion
38
Conclusions
73 observed 11mer matches are anecdotally interesting.
  Transfac matches for TATA, various TFs
However, our most exhaustive results indicate that we cannot claim that the number of matches is statistically significant.
Still, there are more variables involved in the final analysis, which could be controlled for in further analyses.
39
Possible Confounding Factors
Amount of conserved sequence may differ due to:
  Percent conservation
  Size of genes
Controlled for in random sequence generation, but not in random gene selection
  Assumes all genes are comparable
Controlling for these factors could be a good avenue for future research
40
Questioning Assumptions
Everything rests on the assumption that our target set of genes is co-regulated by common elements at the DNA sequence level.
  Further assumption that regulatory mechanism is local to the genes
What about chromatin and its role in regulation?
41
Suggestions for Future Work
It would be useful to repeat the final tests while controlling for gene size and conservation.
Consider testing these same methods on an already well characterized set of co-regulated genes, rather than on an investigative data set.
Research methods for taking chromatin and DNA sequence structure into account.
42
Things I Learned
In exploratory investigations, Perl is your best friend.
  Much of this would have been impossible to do manually.
  Perl really is faster for rapid prototyping when you don’t know in advance what your needs will be.
You can try new methods on old data, or old methods on new data, but developing new methods on new data is difficult.
43
Acknowledgements
Dr. Jeff Touchman . . . Tgen, ASU
Dr. Phillip Stafford . . Tgen, ASU
Dr. Rosemary Renaut . . ASU
Dr. Michael Berens . . . Tgen
Dr. Huan Liu . . . . . . ASU
Dominique Hoelzinger . . Tgen
Maulik Shah . . . . . . Tgen, ASU