Applications of scan statistics in molecular biology and neuroscience
by Chan Hock Peng
Dept of Statistics and Applied Probability

Outline
• 1. General introduction
• 2. Applications in molecular biology (weighted scan statistics)
• 3. Tail probability computations
• 4. Applications in neuroscience (template matching problem)
• 5. Tail probability computations
• 6. Extensions and other applications

Notation
• $M_u$: The maximum score in any window of length u.
• $\lambda$: The underlying rate of events occurring under normal circumstances.
• n: The length of the interval under consideration.
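
When every event scores 1, the maximum score in a window of length u is simply the largest number of events that fit in one window. A minimal sketch of that computation follows; the function name `scan_statistic`, the half-open window convention, and the sample event days are illustrative assumptions, not taken from the talk.

```python
from typing import Sequence


def scan_statistic(event_times: Sequence[float], u: float) -> int:
    """Largest number of events in any half-open window [t, t + u)."""
    times = sorted(event_times)
    best = 0
    left = 0
    for right, t in enumerate(times):
        # Drop events from the left until the window spans less than u.
        while t - times[left] >= u:
            left += 1
        best = max(best, right - left + 1)
    return best


# Hypothetical event days (not data from the talk).
days = [3, 40, 41, 55, 68, 300]
print(scan_statistic(days, u=30))  # 4: days 40, 41, 55, 68 span 28 < 30
```

The two-pointer pass only needs windows that end at an event, which is enough because sliding a window until its right edge meets an event never loses a point.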

Example 1
• (USA Today, 1996) On Feb 22, the US Navy suspended all operations of the F-14 jet after a third crash in one month.
• Three crashes in a month was seven times the expected rate based on a 5-year period.
• $M_{30} = 3$, $n = 5 \times 365$, $\lambda = 1/70$ (time in days).

Example 2
• (Home News, 1995) In a 10-month period, 11 residents died at a Tennessee state institution. The number was twice what was expected.
• The judge was angry and ordered the mental health commissioner to spend one in four weekends at the institution.
• $M_{10} = 11$, $n = ?$, $\lambda = 11/20$ (time in months).
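
The underlying rates quoted in the two examples follow from the statement that the observed count was a given multiple of the expected count. A short arithmetic sketch, assuming a 30-day month for the F-14 example (the helper name `baseline_rate` is hypothetical):

```python
from fractions import Fraction


def baseline_rate(observed: int, ratio: int, window: int) -> Fraction:
    """Rate per unit time when `observed` events in `window` time units
    is `ratio` times the expected count for that window."""
    return Fraction(observed, ratio * window)


# Example 1: 3 crashes in 30 days, seven times the expected rate (per day).
rate_f14 = baseline_rate(observed=3, ratio=7, window=30)
# Example 2: 11 deaths in 10 months, twice the expected number (per month).
rate_institution = baseline_rate(observed=11, ratio=2, window=10)

print(rate_f14, rate_institution)  # 1/70 11/20
```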

Clusters of DAM sites in E. coli DNA
• Karlin and Brendel (1992).
• DAM site: an occurrence of the pattern GATC.
• Important in repair and replication of DNA.
• $M_{245} = 8$, $n = 4.7$ million, $\lambda = 1.1/250$.
• P-value approximation of Naus (1982): $P\{M_{245} \ge 8\} \approx 0.87$, $P\{M_{245} \ge 10\} \approx 0.03$.

Palindromes in DNA
• A-T and C-G are complementary bases.
• The complement of CCACGTGG is GGTGCACC.
• CCACGTGG is a palindromic pattern because its complement reads the same as itself backwards.
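
The definition can be checked mechanically: complement each base, reverse the result, and compare with the original. A minimal sketch, assuming sequences over the alphabet A, C, G, T only (the function names are illustrative):

```python
# Translation table for the base pairings A-T and C-G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)


def is_palindromic(seq: str) -> bool:
    """True if the complement, read backwards, equals the sequence."""
    return complement(seq)[::-1] == seq.upper()


print(complement("CCACGTGG"))      # GGTGCACC
print(is_palindromic("CCACGTGG"))  # True
```

The DAM-site pattern GATC passes the same check, since its complement CTAG reads GATC backwards.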

Palindromic sequences in viruses
• Masse et al. (1992) & Leung et al. (1994).
• Palindromic sequences cluster around the origin of replication.
• An event occurs if there is a palindromic pattern of length at least 10 base pairs.
• HCMV sequence: $M_{1000} = 10$, $n = 229354$, $\lambda = 0.001$, p-value = 0.00195.

Extensions to general scoring functions (weighted scan)
• In Chew, Choi and Leung (2005), longer palindromic patterns are given larger weights.
• For example, a pattern of length k can be given a score of k/10.
• p-value computations?
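
To make the weighted scan concrete, the sketch below sums the scores k/10 of palindromes whose start positions fall in a common window and takes the maximum over windows. It assumes nonnegative scores (so it suffices to check windows ending at an event) and half-open windows; the positions and lengths are hypothetical, not data from the cited papers.

```python
from typing import Sequence, Tuple


def weighted_scan(events: Sequence[Tuple[int, float]], u: int) -> float:
    """Maximum total score of events with positions in some window [t, t + u).

    `events` holds (position, score) pairs; scores are assumed nonnegative.
    """
    ordered = sorted(events)
    best = 0.0
    total = 0.0
    left = 0
    for pos, score in ordered:
        total += score
        # Remove events that are u or more positions behind the current one.
        while pos - ordered[left][0] >= u:
            total -= ordered[left][1]
            left += 1
        best = max(best, total)
    return best


# Hypothetical palindromes as (start position, length k).
palindromes = [(100, 10), (450, 14), (900, 12), (5000, 20)]
events = [(pos, k / 10) for pos, k in palindromes]
print(weighted_scan(events, u=1000))
```

Here the first three palindromes share a window of length 1000 and contribute 1.0 + 1.4 + 1.2, which exceeds the single long palindrome at position 5000.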

Other applications of weighted scan
• Rajewsky et al. (2002) & Lifanov et al. (2003): scanning for clusters of transcription factor binding sites.
• Position weight matrices are used to score words for similarity to a given motif.
• Siepel et al. (2005): searching for segments of high evolutionary conservation.
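
P-value computations for weighted scan
• Chan and Zhang (2006):
  $$P\{M_u \ge k\} \approx 1 - \exp\left\{-(n-u)\,[\,\cdots\,]\,e^{-u I(k/u)}\right\},$$
  where $[\,\cdots\,]$ is a prefactor in $u$, $k/u$ and $K''$ that is illegible in the extracted slide, and
• I is a large deviation rate function.
• The prefactor includes an overshoot function.
• K is the moment generating function of the scores.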

Template matching in neuroscience
• Neurons are the basic units of information processing in the brain.
• They generate small and highly peaked electric potentials known as spikes.
• The pattern of spikes is modeled as a point or counting process, e.g. a Poisson process.

Template pattern
• Dave and Margoliash (2000) and Mooney (2000): let $w = (w^{(1)}, \ldots, w^{(d)})$ be the spike patterns of a zebra finch when it is listening to a bird song.
• Each $w^{(i)}$ contains the times at which spikes were generated by the ith neuron in an interval of time [0,T).

Longer spike train patterns
• Let $y = (y^{(1)}, \ldots, y^{(d)})$ be the corresponding spike train patterns when the finch is sleeping, observed over a longer period of time [0,a).
• If w matches well with a segment of y, then there is evidence of bird song replay and hence of song learning during sleep.

Scoring function
• Consider a kernel function f of the distance x from a point of y to the nearest point of w, e.g. let f(x) = 1 if x < 0.025 ms and f(x) = -0.3 if x > 0.025 ms.
• For the illustration below, consider d = 1 and T = 0.2 ms.
• Let w = {.01, .05, .09, .12}.
• Let y = {.32, .75, 1.03, 1.15, 1.25}.
• To check whether w matches the segment of y starting at time t = 1, compare w = {.01, .05, .09, .12} against y - 1 = {.03, .15}, the points of y in [1, 1 + T) shifted back by 1.
• The point .03 scores 1 because there is a point in w less than 0.025 ms away.
• The point .15 scores -0.3 because the nearest point in w is more than 0.025 ms away.
• The overall score at time t = 1 is 1 - 0.3 = 0.7.
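
The worked example can be reproduced in a few lines. This is a sketch for d = 1 only; the function names are illustrative, and assigning the score -0.3 at a distance of exactly 0.025 is an assumption, since the slide leaves that boundary case undefined.

```python
from typing import Sequence


def kernel(x: float, tol: float = 0.025) -> float:
    """Score 1 for a distance below tol, -0.3 otherwise (boundary assumed -0.3)."""
    return 1.0 if x < tol else -0.3


def match_score(w: Sequence[float], y: Sequence[float], t: float, T: float) -> float:
    """Score of template w against the segment of y on [t, t + T)."""
    total = 0.0
    for point in y:
        shifted = point - t
        if 0 <= shifted < T:
            # Distance from the shifted point of y to the nearest template point.
            nearest = min(abs(shifted - s) for s in w)
            total += kernel(nearest)
    return total


w = [0.01, 0.05, 0.09, 0.12]
y = [0.32, 0.75, 1.03, 1.15, 1.25]
print(round(match_score(w, y, t=1.0, T=0.2), 10))  # 0.7
```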

Scan statistics
• For d > 1, add up the scores over all neurons starting at the same time t.
• The scan statistic $M_T$ is the maximum possible score over all t in the interval [0,a-T).
• Chi (2004) obtained an approximation of $\log P\{M_T \ge c\}$.
• Chan & Loh (2005) obtained a more precise approximation of $P\{M_T \ge c\}$.
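
A rough way to see the scan statistic in action is to evaluate the score on a grid of start times and take the maximum. The sketch below does this for d = 1 with the template and spike train from the scoring example; the observation length a = 1.3 and the grid step are assumptions made for illustration, and a grid only approximates the supremum over all t.

```python
from typing import Sequence


def match_score(w: Sequence[float], y: Sequence[float], t: float, T: float) -> float:
    """Score of template w against the segment of y on [t, t + T)."""
    total = 0.0
    for point in y:
        shifted = point - t
        if 0 <= shifted < T:
            nearest = min(abs(shifted - s) for s in w)
            # Kernel from the scoring example; the boundary case is assumed -0.3.
            total += 1.0 if nearest < 0.025 else -0.3
    return total


def scan_statistic_grid(w: Sequence[float], y: Sequence[float],
                        T: float, a: float, step: float = 0.01) -> float:
    """Maximum score over start times t = 0, step, 2*step, ... in [0, a - T)."""
    n_steps = int(round((a - T) / step))
    return max(match_score(w, y, k * step, T) for k in range(n_steps))


w = [0.01, 0.05, 0.09, 0.12]
y = [0.32, 0.75, 1.03, 1.15, 1.25]
print(scan_statistic_grid(w, y, T=0.2, a=1.3))
```

On this toy input the maximum is attained at start times such as t = 1.02, where both 1.03 and 1.15 land within 0.025 of a template point.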

Assumptions and related information
• Each $w^{(i)}$ is stationary while $y^{(1)}, \ldots, y^{(d)}$ are independent Poisson processes.
• There are separate formulas for when the kernel f is continuous and when it is not continuous.
• The number of times a large score c is exceeded is a Poisson random variable.
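
The Poisson property is what turns a count of exceedances into a tail probability. As a sketch, write $M_T$ for the scan statistic, $N_c$ for the number of times the score exceeds c, and $\mu_c$ for its mean ($N_c$ and $\mu_c$ are labels introduced here, not notation from the talk):

```latex
P\{M_T \ge c\} \;=\; P\{N_c \ge 1\} \;=\; 1 - P\{N_c = 0\} \;\approx\; 1 - e^{-\mu_c}
```

For large c the mean $\mu_c$ is small, so the tail probability is close to $\mu_c$ itself.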

Table of approximations
c        Monte Carlo (s.e.)    Chan & Loh
0.017    0.0387 (0.0019)       0.0383
0.018    0.0237 (0.0012)       0.0241
0.019    0.0158 (0.0008)       0.0149
0.020    0.0095 (0.0005)       0.0091
0.021    0.0054 (0.0003)       0.0055
0.022    0.0033 (0.0002)       0.0033
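
One way to read the table is to express the gap between each analytic approximation and the simulated value in units of the simulation's standard error. The sketch below does that arithmetic on the six rows; reading "MC" as Monte Carlo and "C & L" as the Chan & Loh approximation is an interpretation of the column labels.

```python
# Rows: (c, Monte Carlo estimate, standard error, analytic approximation).
rows = [
    (0.017, 0.0387, 0.0019, 0.0383),
    (0.018, 0.0237, 0.0012, 0.0241),
    (0.019, 0.0158, 0.0008, 0.0149),
    (0.020, 0.0095, 0.0005, 0.0091),
    (0.021, 0.0054, 0.0003, 0.0055),
    (0.022, 0.0033, 0.0002, 0.0033),
]

# Difference between approximation and simulation, in standard errors.
z_scores = [(approx - mc) / se for _, mc, se, approx in rows]

for (c, *_), z in zip(rows, z_scores):
    print(f"c = {c:.3f}: {z:+.2f} s.e.")
```

Every row lies within two standard errors of the simulated value, with the largest gap at c = 0.019.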

Future work
• Higher-dimensional Poisson processes, e.g. 2 or 3 dimensions.
• Applications in astronomy and imaging.
• Varying window sizes.