Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
A nonparametric analysis of the spatial
distribution of the earthquakes magnitude�
Mario Francisco-Fernández
Universidad de A Coruña yAlejandro Quintela-del-Río
Universidad de A Coruña z
Rubén Fernández Casal
Universidad de A Coruña x
February 3, 2009
Abstract
This article describes an application of nonparametric local linear regres-
sion to study the spatial structure of the magnitude of earthquakes. If spatial
correlation is suspected in a data set of earthquakes in a concrete geographic
zone, the smoothing parameter needed to obtain the estimator of the mean
magnitude will be computed using a version of the �bias-corrected and esti-
mated�GCV method presented in Francisco-Fernández and Opsomer (2005).
This method allow us to take this spatial dependence into account to obtain
better smoothing parameters. Additionally, a parametric bootstrap technique
is used to quantify the variability of the spatial maps produced with the non-
parametric estimation method and to generate maps that shows the proba-
bility of being at high and low seismic risk in the considered area. These
techniques are applied to two di¤erent earthquakes data sets: the historic
�Short title: Nonparametric Analysis of Spatial DistributionyDepartamento de Matemáticas, Facultad de Informática, La Coruña,15071, Spain.zDepartamento de Matemáticas, Facultad de Informática, La Coruña,15071, Spain.xDepartamento de Matemáticas, Facultad de Informática, La Coruña,15071, Spain.
1
catalogue of the Galician region at the Northwest of Spain and the last ten
years in California.
Key Words: earthquakes; magnitude; local polynomial regression; nonpara-
metric estimation; parametric bootstrap;
1 Introduction
A seismic series is a set of earthquakes occurring in a given period of time in a given
area. Earthquakes of a seismic series are considered as stochastic mathematical
variables, belonging to a continuous space-time-energy medium with dimension 5
(X1i ; X
2i ; X
3i ; ti; Yi), whereX
1i andX
2i are the latitude and longitude of the epicenter,
X3i the depth of the focus, ti the origin time and Yi the magnitude. If the activity
develops without abrupt changes, it is possible to know the structure that exists
between the earthquakes (Torcal et al., 1999). One interesting approach is to study
the behavior of the seismic series using stochastic methods to try to specify the
relation between the �ve variables mentioned. These methods include multivariate
models.
There have been numerous studies of seismic series using stochastic methods,
such as, e.g. Kagan (1990), Ogata (1988), Udias and Rice (1975), or Vere-Jones
(1978, 1992) (see the exhaustive bibliography of page 726 of Torcal et al. (1999)).
However, standard parametric models applied to seismic data do not always �t those
data well. In part, this is because parametric models are usually well suited only to
a sequence of seismic events that have similar causes. Moreover, parametric models
can be insensitive to anomalous events, since models tend to be formulated through
experience of relatively conventional seismic activity. From 1970, approximately, an
exhaustive statistical literature exists inside what has been called �nonparametric
curve estimation�(beginning with Parzen (1962), Nadaraya (1964), or Watson and
Leadbetter (1964)). This methodology is a �exible and potent tool to describe
the behavior of univariate and multivariate data sets, because it does not need
the speci�cation of a concrete model to work (such as the normal distribution, or
a linear relation). The nonparametric statistical techniques are, in some cases, a
2
supplement for parametric models. In other cases they represent for themselves a
valid alternative to the classic parametric techniques, where it is presupposed that
the curve has a certain functional form depending on some parameters that it is
necessary to estimate. Particular monographs about this topic -that include great
quantity of application examples- are the following ones: Härdle (1990), Hastie and
Tibshirani (1990), Silverman (1986), Simono¤ (1996) or Wand and Jones (1995).
In this paper we suggest the application of nonparametric methods for analyzing
seismic data. More concretely, we are interested in mapping the (bidimensional)
spatial distribution of the earthquakes magnitudes, by means of the model
Yi = m(X1i ; X
2i ) + "i; i = 1; 2; : : : ; n:
where m(�) is a regression function, in which we do not suppose any concrete para-metric model and "i are random errors that may or may not be spatially correlated.
In statistical analysis of earthquakes, nonparametric methods have been used
by several authors to study the behavior of the stochastic variables involved in the
process of occurrence of the same. If we consider the nonparametric estimation of
the intensity function, we have the papers of Vere-Jones (1992), Bailey and Gatrell
(1995), Estévez et al. (2002) and Stock and Smith (2002). Regarding regression
estimation, the work of Grillenzoni (2005) develops adaptive modellings for earth-
quake data of Northern California. Failure rate or hazard estimation is analyzed in
the paper of Estévez et al. (2002) for Northwest data of the Iberian Peninsula (the
same zone considered in this work), and other European zones.
The goal of this article is to show the utility of a particular type of nonparametric
method, the named �local linear regression�(Fan and Gijbels, 1996), to the spatial
statistical analysis of earthquakes data. We will do this by considering the problem of
visualizing the spatial pattern of earthquakes magnitude, and applying to two seismic
data sets of two di¤erent zones, respectively. The representation of the magnitude
as a smooth spatial function, jointly with the use of bootstrap techniques, provides
a useful ��rst step� in identifying areas of high and low seismic risk, for instance,
or in spatial prediction (i.e. kriging, see e.g. Cressie (1993)). The organization
of the paper is as follows. Section 2 describes the statistical model and reviews
3
the nonparametric estimator, as well as the bootstrap method used. Section 3
provides information on the study areas and data, and shows the behavior of the
nonparametric spatial methods described in Section 2 on those real earthquake data.
Section 4 is devoted to conclusions.
2 Nonparametric regression models
A regression model tries to describe the existing relation between a explicative vari-
able, in general of dimension d; X = (X1; X2; :::; Xd), and a response variable Y;
based on a real data set f(X i; Yi)gni=1 ; obtained by the relationship
Yi = m (X i) + "i; i = 1; 2; : : : ; n; (1)
where m(�) is the unknown regression function, and "i are the random errors. Underthe assumption of E("ijX i) = 0, if X and Y are random variables, the regression
function at a location x is m(x) = E (Y jX = x).
In this paper, the following spatial nonparametric regression model for the earth-
quakes data will be used. Assume that a set ofR3-valued random vectors, f(X i; Yi)gni=1 ;are observed in a concrete interval of time, where the Yi are scalar responses vari-
ables and the X i are predictor variables with a common density f and compact
support � R2: In this article, we will refer to the X i as the locations (latitude
and longitude, expressed in degrees) corresponding to the Yi (magnitude). The re-
lationship between the locations and the responses variable is assumed to be of the
form (1), where m(�) is an unknown continuous and smooth function, and " is asecond order stationary process with:
Cov ("i; "jjX i;Xj) = C(X i �Xj); (2)
where C(u) is a positive-de�nite function, called the covariogram (with C(0) =
Var ("ijX i) = �2). The case of C(u) � �2If0g(u), where If0g is the indicatorfunction of the origin, corresponds to independent errors, whereas other covariance
functions may allow for di¤erent types of spatial dependence. For instance, one of
most well-known covariogram families is the Matern class (see e.g. Stein, 1999, pp.
4
48-52):
C�(u) = c0If0g(u) +c1
2��1�(�)
�kuka
��K��kuka
�; (3)
where kXk is the euclidean distance in Rd, K is the modi�ed Bessel function of
the second type, c0 is the nugget e¤ect, c1 is the partial sill (the variance �2 =
c0 + c1 is also called sill in this context), a is a scale parameter (proportional to the
autocorrelation range) and � is a smoothness parameter (which determines the shape
of the covariogram at small lags). The case of � = 0:5 corresponds to the exponential
model. In geostatistics, covariogram models that allow a lack of continuity at the
origin (non zero nugget e¤ect) are traditionally considered in practice, and we can
think that this corresponds with independent local variability (for more details see
e.g., Cressie, 1993, pp. 127-130).
This particular model assumes that the response variable Y is an unknown
smooth function of location, �masked�by zero-mean (second-order stationary) er-
rors. This is in contrast to the traditional kriging models, which typically assume
that the response variable is a simpler (linear or constant) function of location sup-
plemented by spatially correlated noise. In those models, a larger fraction of the
observed behavior of the data is therefore attributed to noise instead of to the under-
lying mean function, and consequently the error term should be modelled carefully
(even considering a complicated spatial dependence structure). It should be ex-
pected that with a model like (1) assumptions about the error term will have less
e¤ect on the conclusions obtained. It is also important to point out that determining
which model is in fact correct for the data cannot be done without replication at
the same locations, and in practice, di¤erent approaches could lead to similar esti-
mated (or predicted) spatial maps. However, the interpretation of the map and the
accompanying inference statements are di¤erent (for more details see e.g. Altman,
1997, or Cressie, 1993, section 3.1).
Obviously, we could consider a more general model if we include the depth in
the variable X, having a 3-dimensional explicative variable. Because one of the
earthquake catalogue in our work (the corresponding to Spanish data) not always
include the depth component for each recorded earthquake, and also for the di¢ culty
5
of interpreting graphs in more that three dimensions, we have not included this
variable in our analysis. Note also that we can modify model (1) in order to take
into account the complete spatial dimension or even the temporal component only
through the errors "i, considering a more complicated spatio-temporal dependence
(avoiding multidimensionality problems in the nonparametric estimation).
Our �rst goal is to estimate the mean function m(�) using a nonparametric es-timator. Nonparametric techniques try to estimate the regression function without
assuming a speci�c parametric family of functions for m(�). Classical nonparametricregression estimators are based on explaining the relationship between the data by
using weighted local means, that is, the estimator of m(x) can be written as
mH(x) =nXi=1
wH(X;x)Yi;
wherefwH(�)gni=1 is a weight sequence, characterized by a form and a scale. The
form corresponds to a kernel function K : Rd ! R; and the scale to a smoothing
parameter or bandwidth H :
While the election of the function K has not play an important role (normally,
K is a density function with some regularity conditions) in the estimation (consult
e.g. Härdle (1990)), the election of the bandwidth is more crucial, because the
shape of the resulting estimator varies greatly according to its value. In general, the
bandwidth matrixH controls the shape and the size of the local neighborhood used
for estimating m(x). So, ifH is �small�, we will obtain an undersmooth estimator,
with high variability; and, on the other hand, if H is �big�, the resulting estimator
will be very smooth and possibly with larger bias (see Section 2.2 below).
2.1 Local linear regression for spatial data.
When the explicative variables are bivariate variables, the local linear estimator for
m(�) at a location x is the solution for to the least squares minimization problem
min ;�
nXi=1
�Yi � � �T (X i � x)
2KH(X i � x);
where H is a 2 � 2 symmetric positive de�nite matrix; K is a bivariate ker-
nel and KH(u) = jHj�1K(H�1u). The Epanechnikov kernel function K(x) =
6
2�max
��1� kxk2
�; 0, a common choice for local linear regression, will be used
throughout this article. The local linear regression estimator can be written explic-
itly as
mH(x) = eT1
�XT
xW xXx
��1XT
xW xY � sTxY; (4)
where e1 is a vector with 1 in the �rst entry and all other entries 0, Y = (Y1; : : : ; Yn)T ,
W x = diag fKH(X1 � x); :::; KH(Xn � xg, and
Xx =
0BB@1 (X1 � x)T...
...
1 (Xn � x)T
1CCA :Local linear regression has been broadly studied in the statistical context of uni-
variate regression, and we refer to Wand and Jones (1995) for an overview. For
bivariate local linear regression, Ruppert and Wand (1994) provide the relevant as-
ymptotic theory for the case in which the errors are independently distributed. In
the spatial context, this assumption of independence is often not appropriate, and
accounting for possible correlation is required for both inference and smoothing pa-
rameter selection. For a review of issues related to nonparametric regression with
correlated errors, see Hart (1996) and Opsomer et al. (2001). Recently, Francisco-
Fernández and Opsomer (2005) discussed spatial smoothing and proposed a band-
width selection method that allows for the presence of correlated errors.
A description of the bandwidth selection methods used to analyze the earth-
quakes data will be presented in the following Section.
2.2 Bandwidth selection.
As it was stated above, smoothing parameter selection in kernel regression is a very
important issue in order to obtain reliable estimators. The in�uence of bandwidthH
in the estimator of m(x) can be observed in Figure 1, where three estimators of the
regression function using the local linear estimator with three di¤erent bandwidths
are presented. The data used to plot this Figure 1 will be described and studied
more deeply in Section 3.1. They consists of the locations and the corresponding
magnitude of 1397 earthquakes succeed in the Northwest of the Iberian Peninsula
7
from 25/November/1944 to 03/April/2008. We will use the model (1) and the local
linear regression estimator to map the estimated mean magnitude in that area.
Each one of the rows of this Figure shows, in the left map, the locations of the
observations and the bandwidth used to estimate the regression function at point
of coordinates 43o N and 8o W, and, in the right map, the corresponding estimator
in that area computed with this bandwidth. The bandwidth used in the �rst row
of the Figure 1 is the bandwidth HGCV ce, obtained using the �bias-corrected and
estimated�GCV criterion, described below. The bandwidths used in the second
and third rows are 0:5 �HGCV ce and 2 �HGCV ce, respectively. More details on the
parameters used to compute the bandwidth HGCV ce can be found in Section 3.1.
It can be observed that, compared with the optimal election (�rst row), when a
small bandwidth is used (second row), the estimator shows a signi�cant amount of
variation that appears to be spurious. It is important to note that the variability
shown here is actually less than that actually obtained by the regression, since the
plot trims o¤ several large spikes that go well beyond the values on the scale, both
in the positive and negative direction (this was done to maintain comparability with
the map in the �rst row of the Figure 1). On the other hand, the estimator obtained
with a larger bandwidth (third row) is much smoother than the optimal election.
This could lead to estimators removing some signi�cant parts of the mean pattern.
An important observation about the estimators showed in Figure 1 (especially in
the optimal case -�rst row-) is that the extreme low and high values occur at the
boundary of the estimation region, where smoothing methods such as local linear
regression often result in unreliable estimates. However, in this case, the areas of
interest are in the central part of the region.
There are several methods to select the smoothing parameter. Some of these
methods are specially designed for correlated observations, which is quite usual in the
spatial smoothing context. In the case of spatial independent observations, classical
smoothing parameters criteria for nonparametric regression are, for example, of
cross-validation type (Craven and Whaba, 1979). These methods consider selecting
8
Figure 1: Locations, bandwidth and regression estimator with three di¤erente band-
widths
the bandwidth H that minimizes the GCV function
GCV (H) =1
n
nXi=1
�Yi � mH(X i)
1� 1ntr(S)
�2: (5)
with S the n�n matrix whose ith row is equal to sTXi, the smoother vector for x =
X i; and tr(S) is the corresponding trace. Finding the minimizer of this function over
the d(d+1)=2 (d = 2 when we only consider longitude and latitude) parameters inH
can be performed using numerical algorithms as implemented in statistical software.
However, we should not use this criterion directly for selecting the bandwidth in
the presence of correlated errors, because its expectation is severely a¤ected by the
correlation (Liu, 2001). In this case, we propose to use the �bias-corrected�GCV
9
criterion, proposed in Francisco-Fernández and Opsomer (2005), based on selecting
the bandwidth H that minimizes the function
GCVc(H) =1
n
nXi=1
�Yi � mH(X i)
1� 1ntr (SR)
�2; (6)
with R the correlation matrix of the errors.
In practice, matrix R is unknown, so that, (6) is not yet a practical bandwidth
selection criterion. Following Francisco-Fernández and Opsomer (2005) we will as-
sume a parametric form for the covariogram, from which the correlogram can be
obtained ��(u) = C�(u)=C�(0), and then replace the unknown R(�) in (6) by an
estimate R(�). This method is called the �bias-corrected and estimated� GCV
criterion
The theoretical optimality properties of this last criterion were discussed in
Francisco-Fernández and Opsomer (2005). In that paper, the authors considered
the isotropic exponential model (corresponding to (3) with c0 = 0 and � = 0:5)
for the correlation function, and the dependence parameters were estimated using a
simple method-of-moments estimators speci�cally obtained for this model. A similar
approach will be used here, but traditional geostatistical methods will be employed
in the dependence modelling (in order to avoid bias in the dependence parameter
estimation).
The estimation of the spatial dependence was done through the variogram,
(u) = C(0) � C(u) (see e.g. Cressie, 1993, section 2.4.1, for a explanation aboutwhy variogram estimation is preferred to covariogram estimation). The general al-
gorithm is as follows:
1. Obtain a pilot bandwidth matrixH = �pilot (for instance, using the standard
GCV selection criterion (5)).
2. Using local linear estimator (4), with bandwidth matrix H, obtain the resid-
uals:
"i = Yi � mH(X i); i = 1; 2; : : : ; n:
10
3. Compute the classical variogram pilot estimator from the residuals:
(uk) =1
2jN(uk)jXN(uk)
(b"i � b"j)2 ; k = 1; : : : ; K:
where N(u) = f(i; j) : X i � Xj 2 Tol(u)g, Tol(u) is a tolerance regionaround u and jN(u)j denote the number of contributing pairs at lag u.
4. Fit a valid variogram model using a Weighted Least Squares (WLS) criterion:
� = argmin�
KXk=1
!k ( �(uk)� (uk))2 :
5. Update the bandwidth matrix H, using the �bias-corrected and estimated�
GCV method (6).
6. Repeat steps 2-5 until convergence is obtained.
In practice, few iterations of previous algorithm are usually needed (and very
similar results were observed considering just one iteration). In step 4, following the
recommendations of Journel and Huijbregts (1978, p.74), the �t of a valid model is
usually done up to half the maximum possible lag and considering only pilot estima-
tions with at least 30 contributing pairs (i.e. jN(u)j � 30) The weights !i are usu-ally chosen following the idea proposed by Cressie (1985) (i.e. !i = jN(u)j= �(ui);and proceeding iteratively). In the examples showed in this work, a Matern model
(3) was considered, allowing for the possibility of geometrical anisotropy (i.e. cor-
relation depending on direction as well as distance between observations; see e.g.
Cressie, 1993, pp. 62-64). Note that the case of c1 = 0 corresponds with (esti-
mated) independence. Other goodness-of-�t criteria could be used in step 4, for
instance, the approach in Francisco-Fernández and Opsomer (2005) or maximum
likelihood estimation.
2.3 Parametric bootstrap.
In a general way, bootstrap (Efron, 1979) is a resampling method that attempts to
estimate the sampling distribution of a population by generating new samples by
11
drawing (with replacement) from the original data. Under certain probabilistic con-
ditions (which are often true in reality), bootstrap usually produces more accurate
and reliable results than traditional methods which are either based upon the large
number distribution properties of the observations (see for example Efron and Tib-
shirani (1993) about the bootstrap and other resampling procedures). Nowadays,
as a computer-aided statistical technique, bootstrap is largely used in many �elds of
applied statistics such as medicine, biostatistics, environmental sciences, chemistry,
or seismology (Lamarre et al. (1992), Pisarenko and Sornette (2004))
Now, in order to incorporate variability assessments in our analysis of the earth-
quakes magnitude, we extended the parametric bootstrap for correlated data dis-
cussed in Vilar-Fernández and González-Manteiga (1996). It follows the same steps:
1. Obtain, using the algorithm described in previous section, the optimal band-
width matrixH (with the �bias-corrected and estimated�GCV criterion), the
corresponding residuals "i and the covariance function parameter estimates �.
2. Bootstrap datasets are generated by taking the estimated mean spatial surface
mH(X i) and adding bootstrap errors generated as a spatially correlated set
of errors. Bootstrap errors are obtained by the following steps:
(a) Using �, compute the variance-covariance matrices of the residuals " =
("1; "2; : : : ; "n) ; denoted by V .
(b) Using Cholesky decomposition, �nd the matrix P such that V = PP T :
(c) Obtain the �independent�variables e = (e1; e2; : : : ; en), given by
e = P�1":
(d) These independent variables are centered and, from them, we obtain
an independent bootstrap sample of sample size n, denoted by e� =
(e�1; e�2; : : : ; e
�n) :
(e) Finally, the bootstrap errors "� = ("�1; : : : ; "�n) are
"� = Pe�
12
and the bootstrap samples are:
Y �i = mH(X i) + "�i ; i = 1; 2; : : : ; n:
3. Once the bootstrap datasets are obtained, the above nonparametric regressions
are repeated for each bootstrap sample using the same bandwidthH as for the
original analysis, and a map of at-risk areas (magnitude larger or equal than a
threshold) is produced. This process is repeated a large number of times B (in
our analysis, we use B =1000). Finally, a map showing the frequency -across
bootstrap replicates, for each location- of how often that location is included
in the at-risk area is computed.
It is important to note that this procedure is especially designed for when the er-
rors are supposed to be spatially correlated. In the case of having independent errors,
the previous algorithm can be easily simpli�ed, since once the residuals ("1; : : : ; "n)
are obtained (using in this case the GCV method to select the bandwidth) in the
�rst step, the bootstrap errors ("�1; : : : ; "�n) are immediately obtained by resampling
with replacement among the original residuals.
3 Data analysis.
3.1 Northwest of the Iberian Peninsula.
The Hesperic Massif, composed by areas that conform a continuous outcrop and
occupy most of the western half of the Iberian Peninsula, is limited by other areas,
mainly of Mesozoic and Tertiary age. It is part of the Europe Hercinic chain, whose
trace can be seen from central Europe to the north-western extreme of France, fol-
lowing a general trend from east to west; next, it hides under the Atlantic Ocean,
forming a broad arc connecting with the Northwest coast of the Iberian penin-
sula. These areas are the Cantabrian Zone, Asturian-Leonese and Galician-Castilian
(Julivert et al. (1973), Julivert and Arboleya (1984)).
Our area of interest focus on the Northwest of the Iberian Peninsula, so our
statistical analysis will focus on data from the earthquakes succeed in the last two
13
areas. The Cantabrian zone has limits from the east and south by lands that form
the Mesozoic and tertiary coverture in Cantabria and the Plateau of Duero. The
western boundary is the other area, Asturian-Leonese (Perez-Estaún et al. (1991)).
A detailed study of the seismo-tectonics and dangerousness of the zone is done in
Rueda and Mezcua (2001).
For our study, we have considered the area limited by the coordinates 42o N �
44o N and 6o W �10o W, that involves the autonomic region of Galicia; this zone is
the most important respecting to the number of earthquakes and their magnitude.
We have selected the data bank of the National Geographic Institute (IGN) of Spain
(until April of 2008) at http://www.ign.es/ign/es/IGN/SisCatalogo.jsp, where the
seismic data by this public agency to current date in the Iberian Peninsula, Balearic
and Canary Islands are registered. Seismic activity increased considerably in Gali-
cia during the 1990�s, generating an important social alarm in the population, not
accustomed to feeling earthquakes of a notable magnitude. Actually, it is object of
research of several groups of work, among which is that of the authors of this work.
In Figure 2 we show a map of this zone, together with the selected data set, com-
posed by N= 1397 earthquakes succeed from 25/November/1944 to 03/April/2008.
As we can see in this map, there exist small spatial groups located in zones of high
risk: Sarria, Monforte and Celanova. A study of the distribution of the magnitude,
as well as of temporary cluster grouping, and their associated seismic risk can be
seen in Estévez et al. (2002), concerning to the earthquakes happened between the
years 1987 - 2000.
The �t of the nonparametric regression estimator (4) to the set of data was
done using the algorithm described in Section 2.2. In order to produce a map for
the survey area of interest, the local linear estimates were computed on a regular
50 � 50 grid overlaying the �eld. As noted earlier, the adjusted GCV method
for bandwidth selection of Francisco-Fernández and Opsomer (2005) requires the
speci�cation of a model for the correlation. Following an exploratory analysis of
the data, it could be seen that an isotropic exponential covariogram model (eq.
(3) with � = 0:5) is apparently adequate to describe the spatial dependence of the
residuals. As already noted at the end of Section 2, this model speci�cation is used
14
Figure 2: Locations of the observations
in the selection of bandwidth values and does not determine the actual shape of the
spatial distribution function m(x): Therefore, modest di¤erences between the true
spatial correlation and the assumed correlation model would have a negligible e¤ect
on m(x). Figure 3 shows the pilot variogram estimations and the �tted variogram
model. The estimated covariogram parameters were c0 = 0:080, c1 = 0:107 and
a = 0:044 (which corresponds to a practical range of 0.132).
Using these estimated parameters, the bandwidth matrix obtained with the se-
lection criterion (6) was:
H =
24 0:54 0
0 1:13
35 ;where the shape of this bandwidth at location of coordinates 43o N and 8oW, as well
as the corresponding regression estimator in the area of interest was previously shown
in the �rst row of Figure 1. This bandwidth corresponds to a moderate amount of
smoothing, since this bandwidth matrix implies that for any location x not on the
boundary of the study region, 20-40% of the observations are contributing (have non-
15
0.0 0.2 0.4 0.6 0.8
0.00
0.05
0.10
0.15
0.20
distance
sem
ivar
ianc
e
EmpiricalFitted model
Figure 3: Classical semivariogram estimations and WLS �tted exponential model
zero weight) to the nonparametric regression �t. Next, we use the bootstrap method
describe in Section 2.3 to plot maps of estimates of the likelihood of an earthquake
with magnitude larger or equal than a threshold occurring in each location of the
area of interest. Figure 4 show the maps with pointwise bootstrap probabilities
of being considered at risk of occurring an earthquake with magnitude larger or
equal than the threshold considered. To evaluate the sensitivity to the choice of the
threshold, we considered two thresholds: 2.5 in the left picture and 2.75 in the right
picture. We can observe that there is an important di¤erence between both maps.
So, while a big proportion of the area is in high risk of occurring an earthquake with
a magnitude larger or equal than 2.5, only the area limited, approximately, by the
coordinates 42:75o N �43:5o N and 6:5oW�8oWhas an important risk of occurring
16
Figure 4: Maps with bootstrap probabilities of areas with seismic risk for di¤erent
values of the threshold
an earthquake with a magnitude larger or equal than 2.75. On the other hand, in
this map, the highest values in the North limit are possibly due to a boundary e¤ect,
making it likely these very high risk values (close to one) are indeed spurious.
3.2 California data.
We use the composite seismicity catalog from the Advanced National Seismic System
(ANSS), available at http://quake.geo.berkeley.edu/anss/catalog-search.html. The
selected grid is 30:08o N to 44:99o N in latitude and 125o W to 115o W in longitude.
We selected earthquakes above magnitude 3.5 for the period 01/January/1998 to
01/April/2008. The reduced time period considered is due to the high number of
existing data in this area. Indeed, the number of data recorded in this period is
N = 4888, considerably greater than all the data recorded in the Galician region
previously analyzed. The reason of considering only a short period of time is the
following: the number of calculus necessary in the nonparametric estimator (4) is
linear in the number of the observations (n) and in the number of points in which we
realize the estimation (a bidimensional grid 50�50), but the cross-validation function(6) depends directly of the square of the number of observations. Also, the bootstrap
procedure implies to multiply by B (number or bootstrap replications) the volume
17
of calculations. Therefore, with the aim of showing the estimation and bootstrap
procedures in a di¤erent zone (relatively to its seismic risk) than the Galician one,
we have decided to analyze only the ten last years in the above mentioned catalogue.
Figure 5 shows the area of study and the locations of the observations.
Figure 5: Locations of the observations
Nonparametric analysis for California earthquake was developed by Grillenzoni
(2005). In Quintela-del-Río (2008) can be found an example of application of the
nonparametric functional data analysis (Ferraty and Vieu, 2006) to this zone.
We follow the same steps as in the previous example. Firstly, we �t the local
linear estimator to the earthquakes magnitude in this area. Nevertheless, in this
18
case, the spatial variability of the data was apparently captured by the trend, so
the residuals seems to exhibit no spatial correlation. For example, Figure 6 shows
an envelope for the empirical variogram obtained under spatial independence (by
random permutation of the data values on the spatial locations). Almost all values
lie inside the envelope, except the estimate at �rst lag. A deeper look to the residuals
shown that this was due to the presence of some atypical observations (with respect
to values in their neighborhood). After the exploratory analysis of the data, the
spatial independence of the data was assumed.
0 1 2 3
0.10
0.15
0.20
0.25
0.30
distance
sem
ivar
ianc
e
Figure 6: Envelope for the empirical variogram obtained under spatial independence
Due to this fact, we use the GCV algorithm for independent data, given in (5),
to obtain a suitable bandwidth. The bandwidth obtained with this criterion was:
19
H =
24 2:25 0
0 1:51
35 :So, with this bandwidth, we obtain an estimator for the mean magnitude in the
area of interest and then, applying the bootstrap method described in Section 2.3,
maps of estimates of the likelihood of an earthquake with magnitude larger or equal
than a threshold occurring in each location of the area of interest. We show in Figure
7 the local linear regression estimator with that bandwidth, in the left picture, and
the map with bootstrap probabilities of areas with seismic risk for threshold equals
to 4.0, in the right picture.
Figure 7: Estimated mean magnitude and estimated probability of ocurring an
earthquake with magnitude large of equal than 4
20
In this case, we can observe clearly which are the areas of highest seismic risk,
being of special interest the area located in the interior between San Francisco
and Los Angeles, with estimated probabilities (of occurring an earthquake with
magnitude larger or equal than four in those locations) close to one.
4 Conclusions.
In this article, we have developed a method for detecting areas that are at risk of
occurring an earthquake of a magnitude larger or equal than a given threshold. The
technique is a combination of nonparametric methods and a bootstrap algorithm.
To estimate the mean magnitude in the area of interests, we use the nonparamet-
ric local linear estimator for bivariate observations. To use this estimator a matrix
bandwidth is needed. In this paper, we suggest to apply cross-validation methods to
select this bandwidth. In the spatial �eld (as the addressed in this paper), it is quite
common to have spatial correlation between the observations. In that case, it is im-
portant to take this dependence into account in the bandwidth selection method.
So, in a practical situation, before applying the method itself, it is necessary to check
if the observations are or not spatially correlated. This can be done, for example,
studying the estimated semivariogram. If a spatial dependence is suspected, we pro-
pose to use the �bias-corrected�GCV criterion. To apply this method, a parametric
dependence structure for the data must be assumed and the corresponding parame-
ters estimated from the observations. Once again, the estimated semivariograms can
help us to do this process. The spatial dependence structure is also needed in the
resampling process in the bootstrap method. On the other hand, if the observations
can be assumed to be independent, the bandwidth matrix can be selected using the
GCV method and the previous bootstrap method must be modi�ed for independent
observations. These methods were applied to two di¤erent earthquakes data sets:
the historic catalogue of the Galician region at the Northwest of Spain and the last
ten years in California. Di¤erent threshold values were used in each of these exam-
ples. More generally, the overall approach of spatial smoothing and bootstrap-based
density mapping provides a useful and �exible set of statistical tools with which is
21
better to visualize and analyze spatial distribution data of earthquakes.
5 Acknowledgments.
The research of Mario Francisco-Fernández has been partially supported by Grant
MTM2008-00166 (ERDF included) and Grant PGIDIT07PXIB105259PR. The re-
search of Alejandro Quintela-del-Río has been partially supported by MEC Grant
MTM2006-03523.
6 Bibliography.
References
[1] Altman, N. S. (1997), Krige, smooth, both or neither? ASA Proceedings of the
Section on Statistics and the Environment, 60�65, American Statistical Associ-
ation, Alexandria, VA.
[2] Bailey, T.C. and Gatrell, A.C. (1995), Interactive Spatial Data Analysis, Long-
man, Essex, UK.
[3] Craven, P. and Wahba, G. (1979), Smoothing noisy data with spline functions:
estimating the correct degree of smoothing by the method of generalized cross-
validation, Numer. Math. 31, 377�403.
[4] Cressie, N. (1985), Fitting variogram models by weighted least squares, Math.
Geol. 17, 563�586.
[5] Cressie, N. (1993), Statistics for Spatial Data. John Wiley & Sons, New York,
2nd edition.
[6] Efron, B. (1979), Bootstrap Methods: Another Look at the Jackknife, Ann.
Statist. 7, 1�26.
22
[7] Efron, B., and Tibshirani, R. J. (1993), An introduction to the bootstrap, Chap-
man & Hall, New-York.
[8] Estévez-Pérez, G., Lorenzo-Cimadevila, H. and Quintela-del-Río, A. (2002),
Nonparametric analysis of the time structure of seismicity in a geographic re-
gion, Ann. Geophys.-Italy 45, 497-512.
[9] Fan, J. and Gijbels, I. (1996), Local Polynomial Modeling and its Applications,
Chapman and Hall, London.
[10] Francisco-Fernández, M. and Opsomer, J.D. (2005), Smoothing parameter se-
lection methods for nonparametric regression with spatially correlated errors,
Canad. J. Statist. 33, 539�558.
[11] Ferraty, F. and P. Vieu (2006), Nonparametric Functional Data Analysis,
Springer-Verlag.
[12] Grillenzoni, C. (2005), Non-parametric smoothing of spatio-temporal point
processes, J. Statist. Plann. Inference 128, 1, 61-78.
[13] Härdle, W. (1990), Applied nonparametric regression, Cambridge University
Press.
[14] Hart, J. (1996), Some automated methods of smoothing time-dependent data,
J. Nonparametr. Stat. 6, 115�142.
[15] Hastie, T. and Tibshirani, R. (1990), Generalized additive models, Chapman
and Hall, London.
[16] Journel, A.G. and Huijbregts, C.J. (1978), Mining geostatistics, Academic
Press, London. 22-47.
[17] Julivert, M., Fontbote, J.M., Ribeiro, A. and Conde, L. (1973), Mapa tectónico
de la Península Ibérica y Baleares E: 1:1.000.000. Inst. Geol. Min. Esp., Spain.
23
[18] Julivert, M. and Arboleya, M.L. (1984), A geometrical and kinematical ap-
proach to the nappe structure in a arcuate fold belt: the cantabrian nappes
(Hercynian chain, NW Spain), Jour. of Struct. Geology 6, 499-519.
[19] Kagan, Y.Y. (1990), Random stress and earthquake statistics: spatial depen-
dence. Geophys. J. Int., 102, 573-583.
[20] Lamarre, M., Townshend, B. and Shah, H.C. (1992), Application of the boot-
strap method to quantify uncertainty in seismic hazard estimates, Bull. Seismol.
Soc. Amer. 82,104-119.
[21] Liu, X. (2001), Kernel smoothing for spatially correlated data, Ph. D. thesis,
Department of Statistics, Iowa State University.
[22] Nadaraya, E. A. (1964), On Estimating Regression, Theory Probab. Appl. 9,
141-142.
[23] Ogata, Y. (1988), Statistical models for earthquake ocurrence and residual
analysis for point processes., J. Amer. Statist. Assoc. 83, 9-27.
[24] Opsomer, J. D., Wang, Y. and Yang, Y. (2001), Nonparametric regression with
correlated errors, Statist. Sci. 16, 134�153.
[25] Parzen, E. (1962), On estimation of a probability density function and mode,
Ann. Math. Statis. 32, 1065-1076.
[26] Pérez-Estaún, A., Martínez-Catalán, J.R. and Bastida, F. (1991), Crustal thick-
ening and deformation sequence in the footwall to the suture of the Variscan
belt of northwest Spain, Tectonophysics 191, 243-253.
[27] Pisarenko, V.F. and Sornette, D. (2004), Statistical detection and characteriza-
tion of a deviation from the Gutenberg-Richter distribution above magnitude
8, Pure Appl. Geophys. 161, 839�864.
[28] Quintela-del-Río, A. (2008), Hazard function given a functional variable: non-
parametric estimation under strong mixing conditions, J. Nonparametr. Stat.
20, 413-430.
24
[29] Rueda, J. and Mezcua, J. (2001), Sismicidad, sismotectónica y peligrosidad
sísmica en Galicia, IGN Technical Publication 35.
[30] Ruppert, D. andWand, M.P. (1994), Multivariate locally weighted least squares
regression, Ann. Statist. 22, 1346�1370.
[31] Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis,
Monographs on Statistics and Applied Probability, Chapman & Hall.
[32] Simono¤, J. (1996), Smoothing Methods in Statistics, Springer-Verlag, New
York.
[33] Stein, M. L. (1999), Interpolation of spatial data, some theory for Kriging.
Springer Series in Statistics, Springer-Verlag, New York.
[34] Stock, C. and Smith, E. (2002), Comparison between seismicity models gener-
ated by di¤erent kernel estimations. Bull. Seismol. Soc. Amer. 92, 913�922.
[35] Torcal, F., Posadas, A., Chica, M. and Serrano, I. (1999), Application of con-
ditional geostatistical simulation to calculate the probability of occurrence of
earthquakes belonging to a seismic series, Geophys. J. Int. 139, 703-725.
[36] Udias, A. and Rice, J. (1975),. Statistical analysis of microearthquakes activity
near San Andres Geophysical Observatory, Hollister, California, Bull. Seismol.
Soc. Amer. 65, 809-828.
[37] Vere-Jones, D. (1978), Earthquake prediction - A statician�s view, J. Phys.
Earth, 26, 129-146.
[38] Vere-Jones, D. (1992), Statistical methods for the description and display of
earthquake catlogues, in Statistics in the Enviromnmental and Earth Sciences,
pp. 220-244, eds Walden, A.T. & Guttorp, P., London.
[39] Vilar-Fernández, J.M. and González-Manteiga, W. (1996), Bootstrap test of
goodness of �t to a linear model when errors are correlated, Comm. Statist.
Theory Methods 25, 2925�2953.
25
[40] Wand, M. P. and Jones, M. C. (1995), Kernel Smoothing, Chapman and Hall,
London.
[41] Watson, G.S. and Leadbetter, M.R. (1964), Hazard analysis I, Biometrika 51,
175-184.
26