A nonparametric analysis of the spatial distribution of the …dm.udc.es/profesores/mario/ficheros/terremotos... · 2009-03-06 · as a smooth spatial function, jointly with the use

A nonparametric analysis of the spatial

distribution of the earthquakes magnitude�

Mario Francisco-Fernández

Universidad de A Coruña yAlejandro Quintela-del-Río

Universidad de A Coruña z

Rubén Fernández Casal

Universidad de A Coruña x

February 3, 2009

Abstract

This article describes an application of nonparametric local linear regres-

sion to study the spatial structure of the magnitude of earthquakes. If spatial

correlation is suspected in a data set of earthquakes in a concrete geographic

zone, the smoothing parameter needed to obtain the estimator of the mean

magnitude will be computed using a version of the �bias-corrected and esti-

mated�GCV method presented in Francisco-Fernández and Opsomer (2005).

This method allow us to take this spatial dependence into account to obtain

better smoothing parameters. Additionally, a parametric bootstrap technique

is used to quantify the variability of the spatial maps produced with the non-

parametric estimation method and to generate maps that shows the proba-

bility of being at high and low seismic risk in the considered area. These

techniques are applied to two di¤erent earthquakes data sets: the historic

�Short title: Nonparametric Analysis of Spatial DistributionyDepartamento de Matemáticas, Facultad de Informática, La Coruña,15071, Spain.zDepartamento de Matemáticas, Facultad de Informática, La Coruña,15071, Spain.xDepartamento de Matemáticas, Facultad de Informática, La Coruña,15071, Spain.

1

catalogue of the Galician region at the Northwest of Spain and the last ten

years in California.

Key Words: earthquakes; magnitude; local polynomial regression; nonpara-

metric estimation; parametric bootstrap;

1 Introduction

A seismic series is a set of earthquakes occurring in a given period of time in a given

area. Earthquakes of a seismic series are considered as stochastic mathematical

variables, belonging to a continuous space-time-energy medium with dimension 5

(X1i ; X

2i ; X

3i ; ti; Yi), whereX

1i andX

2i are the latitude and longitude of the epicenter,

X3i the depth of the focus, ti the origin time and Yi the magnitude. If the activity

develops without abrupt changes, it is possible to know the structure that exists

between the earthquakes (Torcal et al., 1999). One interesting approach is to study

the behavior of the seismic series using stochastic methods to try to specify the

relation between the �ve variables mentioned. These methods include multivariate

models.

There have been numerous studies of seismic series using stochastic methods,

such as, e.g. Kagan (1990), Ogata (1988), Udias and Rice (1975), or Vere-Jones

(1978, 1992) (see the exhaustive bibliography of page 726 of Torcal et al. (1999)).

However, standard parametric models applied to seismic data do not always �t those

data well. In part, this is because parametric models are usually well suited only to

a sequence of seismic events that have similar causes. Moreover, parametric models

can be insensitive to anomalous events, since models tend to be formulated through

experience of relatively conventional seismic activity. From 1970, approximately, an

exhaustive statistical literature exists inside what has been called �nonparametric

curve estimation�(beginning with Parzen (1962), Nadaraya (1964), or Watson and

Leadbetter (1964)). This methodology is a �exible and potent tool to describe

the behavior of univariate and multivariate data sets, because it does not need

the speci�cation of a concrete model to work (such as the normal distribution, or

a linear relation). The nonparametric statistical techniques are, in some cases, a

2

supplement for parametric models. In other cases they represent for themselves a

valid alternative to the classic parametric techniques, where it is presupposed that

the curve has a certain functional form depending on some parameters that it is

necessary to estimate. Particular monographs about this topic -that include great

quantity of application examples- are the following ones: Härdle (1990), Hastie and

Tibshirani (1990), Silverman (1986), Simono¤ (1996) or Wand and Jones (1995).

In this paper we suggest the application of nonparametric methods for analyzing

seismic data. More concretely, we are interested in mapping the (bidimensional)

spatial distribution of the earthquakes magnitudes, by means of the model

Yi = m(X1i ; X

2i ) + "i; i = 1; 2; : : : ; n:

where m(�) is a regression function, in which we do not suppose any concrete para-metric model and "i are random errors that may or may not be spatially correlated.

In statistical analysis of earthquakes, nonparametric methods have been used

by several authors to study the behavior of the stochastic variables involved in the

process of occurrence of the same. If we consider the nonparametric estimation of

the intensity function, we have the papers of Vere-Jones (1992), Bailey and Gatrell

(1995), Estévez et al. (2002) and Stock and Smith (2002). Regarding regression

estimation, the work of Grillenzoni (2005) develops adaptive modellings for earth-

quake data of Northern California. Failure rate or hazard estimation is analyzed in

the paper of Estévez et al. (2002) for Northwest data of the Iberian Peninsula (the

same zone considered in this work), and other European zones.

The goal of this article is to show the utility of a particular type of nonparametric

method, the named �local linear regression�(Fan and Gijbels, 1996), to the spatial

statistical analysis of earthquakes data. We will do this by considering the problem of

visualizing the spatial pattern of earthquakes magnitude, and applying to two seismic

data sets of two di¤erent zones, respectively. The representation of the magnitude

as a smooth spatial function, jointly with the use of bootstrap techniques, provides

a useful ��rst step� in identifying areas of high and low seismic risk, for instance,

or in spatial prediction (i.e. kriging, see e.g. Cressie (1993)). The organization

of the paper is as follows. Section 2 describes the statistical model and reviews

3

the nonparametric estimator, as well as the bootstrap method used. Section 3

provides information on the study areas and data, and shows the behavior of the

nonparametric spatial methods described in Section 2 on those real earthquake data.

Section 4 is devoted to conclusions.

2 Nonparametric regression models

A regression model tries to describe the existing relation between a explicative vari-

able, in general of dimension d; X = (X1; X2; :::; Xd), and a response variable Y;

based on a real data set f(X i; Yi)gni=1 ; obtained by the relationship

Yi = m (X i) + "i; i = 1; 2; : : : ; n; (1)

where m(�) is the unknown regression function, and "i are the random errors. Underthe assumption of E("ijX i) = 0, if X and Y are random variables, the regression

function at a location x is m(x) = E (Y jX = x).

In this paper, the following spatial nonparametric regression model for the earth-

quakes data will be used. Assume that a set ofR3-valued random vectors, f(X i; Yi)gni=1 ;are observed in a concrete interval of time, where the Yi are scalar responses vari-

ables and the X i are predictor variables with a common density f and compact

support � R2: In this article, we will refer to the X i as the locations (latitude

and longitude, expressed in degrees) corresponding to the Yi (magnitude). The re-

lationship between the locations and the responses variable is assumed to be of the

form (1), where m(�) is an unknown continuous and smooth function, and " is asecond order stationary process with:

Cov ("i; "jjX i;Xj) = C(X i �Xj); (2)

where C(u) is a positive-de�nite function, called the covariogram (with C(0) =

Var ("ijX i) = �2). The case of C(u) � �2If0g(u), where If0g is the indicatorfunction of the origin, corresponds to independent errors, whereas other covariance

functions may allow for di¤erent types of spatial dependence. For instance, one of

most well-known covariogram families is the Matern class (see e.g. Stein, 1999, pp.

4

48-52):

C�(u) = c0If0g(u) +c1

2��1�(�)

�kuka

��K��kuka

�; (3)

where kXk is the euclidean distance in Rd, K is the modi�ed Bessel function of

the second type, c0 is the nugget e¤ect, c1 is the partial sill (the variance �2 =

c0 + c1 is also called sill in this context), a is a scale parameter (proportional to the

autocorrelation range) and � is a smoothness parameter (which determines the shape

of the covariogram at small lags). The case of � = 0:5 corresponds to the exponential

model. In geostatistics, covariogram models that allow a lack of continuity at the

origin (non zero nugget e¤ect) are traditionally considered in practice, and we can

think that this corresponds with independent local variability (for more details see

e.g., Cressie, 1993, pp. 127-130).

This particular model assumes that the response variable Y is an unknown

smooth function of location, �masked�by zero-mean (second-order stationary) er-

rors. This is in contrast to the traditional kriging models, which typically assume

that the response variable is a simpler (linear or constant) function of location sup-

plemented by spatially correlated noise. In those models, a larger fraction of the

observed behavior of the data is therefore attributed to noise instead of to the under-

lying mean function, and consequently the error term should be modelled carefully

(even considering a complicated spatial dependence structure). It should be ex-

pected that with a model like (1) assumptions about the error term will have less

e¤ect on the conclusions obtained. It is also important to point out that determining

which model is in fact correct for the data cannot be done without replication at

the same locations, and in practice, di¤erent approaches could lead to similar esti-

mated (or predicted) spatial maps. However, the interpretation of the map and the

accompanying inference statements are di¤erent (for more details see e.g. Altman,

1997, or Cressie, 1993, section 3.1).

Obviously, we could consider a more general model if we include the depth in

the variable X, having a 3-dimensional explicative variable. Because one of the

earthquake catalogue in our work (the corresponding to Spanish data) not always

include the depth component for each recorded earthquake, and also for the di¢ culty

5

of interpreting graphs in more that three dimensions, we have not included this

variable in our analysis. Note also that we can modify model (1) in order to take

into account the complete spatial dimension or even the temporal component only

through the errors "i, considering a more complicated spatio-temporal dependence

(avoiding multidimensionality problems in the nonparametric estimation).

Our �rst goal is to estimate the mean function m(�) using a nonparametric es-timator. Nonparametric techniques try to estimate the regression function without

assuming a speci�c parametric family of functions for m(�). Classical nonparametricregression estimators are based on explaining the relationship between the data by

using weighted local means, that is, the estimator of m(x) can be written as

mH(x) =nXi=1

wH(X;x)Yi;

wherefwH(�)gni=1 is a weight sequence, characterized by a form and a scale. The

form corresponds to a kernel function K : Rd ! R; and the scale to a smoothing

parameter or bandwidth H :

While the election of the function K has not play an important role (normally,

K is a density function with some regularity conditions) in the estimation (consult

e.g. Härdle (1990)), the election of the bandwidth is more crucial, because the

shape of the resulting estimator varies greatly according to its value. In general, the

bandwidth matrixH controls the shape and the size of the local neighborhood used

for estimating m(x). So, ifH is �small�, we will obtain an undersmooth estimator,

with high variability; and, on the other hand, if H is �big�, the resulting estimator

will be very smooth and possibly with larger bias (see Section 2.2 below).

2.1 Local linear regression for spatial data.

When the explicative variables are bivariate variables, the local linear estimator for

m(�) at a location x is the solution for to the least squares minimization problem

min ;�

nXi=1

�Yi � � �T (X i � x)

2KH(X i � x);

where H is a 2 � 2 symmetric positive de�nite matrix; K is a bivariate ker-

nel and KH(u) = jHj�1K(H�1u). The Epanechnikov kernel function K(x) =

6

2�max

��1� kxk2

�; 0, a common choice for local linear regression, will be used

throughout this article. The local linear regression estimator can be written explic-

itly as

mH(x) = eT1

�XT

xW xXx

��1XT

xW xY � sTxY; (4)

where e1 is a vector with 1 in the �rst entry and all other entries 0, Y = (Y1; : : : ; Yn)T ,

W x = diag fKH(X1 � x); :::; KH(Xn � xg, and

Xx =

0BB@1 (X1 � x)T...

...

1 (Xn � x)T

1CCA :Local linear regression has been broadly studied in the statistical context of uni-

variate regression, and we refer to Wand and Jones (1995) for an overview. For

bivariate local linear regression, Ruppert and Wand (1994) provide the relevant as-

ymptotic theory for the case in which the errors are independently distributed. In

the spatial context, this assumption of independence is often not appropriate, and

accounting for possible correlation is required for both inference and smoothing pa-

rameter selection. For a review of issues related to nonparametric regression with

correlated errors, see Hart (1996) and Opsomer et al. (2001). Recently, Francisco-

Fernández and Opsomer (2005) discussed spatial smoothing and proposed a band-

width selection method that allows for the presence of correlated errors.

A description of the bandwidth selection methods used to analyze the earth-

quakes data will be presented in the following Section.

2.2 Bandwidth selection.

As it was stated above, smoothing parameter selection in kernel regression is a very

important issue in order to obtain reliable estimators. The in�uence of bandwidthH

in the estimator of m(x) can be observed in Figure 1, where three estimators of the

regression function using the local linear estimator with three di¤erent bandwidths

are presented. The data used to plot this Figure 1 will be described and studied

more deeply in Section 3.1. They consists of the locations and the corresponding

magnitude of 1397 earthquakes succeed in the Northwest of the Iberian Peninsula

7

from 25/November/1944 to 03/April/2008. We will use the model (1) and the local

linear regression estimator to map the estimated mean magnitude in that area.

Each one of the rows of this Figure shows, in the left map, the locations of the

observations and the bandwidth used to estimate the regression function at point

of coordinates 43o N and 8o W, and, in the right map, the corresponding estimator

in that area computed with this bandwidth. The bandwidth used in the �rst row

of the Figure 1 is the bandwidth HGCV ce, obtained using the �bias-corrected and

estimated�GCV criterion, described below. The bandwidths used in the second

and third rows are 0:5 �HGCV ce and 2 �HGCV ce, respectively. More details on the

parameters used to compute the bandwidth HGCV ce can be found in Section 3.1.

It can be observed that, compared with the optimal election (�rst row), when a

small bandwidth is used (second row), the estimator shows a signi�cant amount of

variation that appears to be spurious. It is important to note that the variability

shown here is actually less than that actually obtained by the regression, since the

plot trims o¤ several large spikes that go well beyond the values on the scale, both

in the positive and negative direction (this was done to maintain comparability with

the map in the �rst row of the Figure 1). On the other hand, the estimator obtained

with a larger bandwidth (third row) is much smoother than the optimal election.

This could lead to estimators removing some signi�cant parts of the mean pattern.

An important observation about the estimators showed in Figure 1 (especially in

the optimal case -�rst row-) is that the extreme low and high values occur at the

boundary of the estimation region, where smoothing methods such as local linear

regression often result in unreliable estimates. However, in this case, the areas of

interest are in the central part of the region.

There are several methods to select the smoothing parameter. Some of these

methods are specially designed for correlated observations, which is quite usual in the

spatial smoothing context. In the case of spatial independent observations, classical

smoothing parameters criteria for nonparametric regression are, for example, of

cross-validation type (Craven and Whaba, 1979). These methods consider selecting

8

Figure 1: Locations, bandwidth and regression estimator with three di¤erente band-

widths

the bandwidth H that minimizes the GCV function

GCV (H) =1

n

nXi=1

�Yi � mH(X i)

1� 1ntr(S)

�2: (5)

with S the n�n matrix whose ith row is equal to sTXi, the smoother vector for x =

X i; and tr(S) is the corresponding trace. Finding the minimizer of this function over

the d(d+1)=2 (d = 2 when we only consider longitude and latitude) parameters inH

can be performed using numerical algorithms as implemented in statistical software.

However, we should not use this criterion directly for selecting the bandwidth in

the presence of correlated errors, because its expectation is severely a¤ected by the

correlation (Liu, 2001). In this case, we propose to use the �bias-corrected�GCV

9

criterion, proposed in Francisco-Fernández and Opsomer (2005), based on selecting

the bandwidth H that minimizes the function

GCVc(H) =1

n

nXi=1

�Yi � mH(X i)

1� 1ntr (SR)

�2; (6)

with R the correlation matrix of the errors.

In practice, matrix R is unknown, so that, (6) is not yet a practical bandwidth

selection criterion. Following Francisco-Fernández and Opsomer (2005) we will as-

sume a parametric form for the covariogram, from which the correlogram can be

obtained ��(u) = C�(u)=C�(0), and then replace the unknown R(�) in (6) by an

estimate R(�). This method is called the �bias-corrected and estimated� GCV

criterion

The theoretical optimality properties of this last criterion were discussed in

Francisco-Fernández and Opsomer (2005). In that paper, the authors considered

the isotropic exponential model (corresponding to (3) with c0 = 0 and � = 0:5)

for the correlation function, and the dependence parameters were estimated using a

simple method-of-moments estimators speci�cally obtained for this model. A similar

approach will be used here, but traditional geostatistical methods will be employed

in the dependence modelling (in order to avoid bias in the dependence parameter

estimation).

The estimation of the spatial dependence was done through the variogram,

(u) = C(0) � C(u) (see e.g. Cressie, 1993, section 2.4.1, for a explanation aboutwhy variogram estimation is preferred to covariogram estimation). The general al-

gorithm is as follows:

1. Obtain a pilot bandwidth matrixH = �pilot (for instance, using the standard

GCV selection criterion (5)).

2. Using local linear estimator (4), with bandwidth matrix H, obtain the resid-

uals:

"i = Yi � mH(X i); i = 1; 2; : : : ; n:

10

3. Compute the classical variogram pilot estimator from the residuals:

(uk) =1

2jN(uk)jXN(uk)

(b"i � b"j)2 ; k = 1; : : : ; K:

where N(u) = f(i; j) : X i � Xj 2 Tol(u)g, Tol(u) is a tolerance regionaround u and jN(u)j denote the number of contributing pairs at lag u.

4. Fit a valid variogram model using a Weighted Least Squares (WLS) criterion:

� = argmin�

KXk=1

!k ( �(uk)� (uk))2 :

5. Update the bandwidth matrix H, using the �bias-corrected and estimated�

GCV method (6).

6. Repeat steps 2-5 until convergence is obtained.

In practice, few iterations of previous algorithm are usually needed (and very

similar results were observed considering just one iteration). In step 4, following the

recommendations of Journel and Huijbregts (1978, p.74), the �t of a valid model is

usually done up to half the maximum possible lag and considering only pilot estima-

tions with at least 30 contributing pairs (i.e. jN(u)j � 30) The weights !i are usu-ally chosen following the idea proposed by Cressie (1985) (i.e. !i = jN(u)j= �(ui);and proceeding iteratively). In the examples showed in this work, a Matern model

(3) was considered, allowing for the possibility of geometrical anisotropy (i.e. cor-

relation depending on direction as well as distance between observations; see e.g.

Cressie, 1993, pp. 62-64). Note that the case of c1 = 0 corresponds with (esti-

mated) independence. Other goodness-of-�t criteria could be used in step 4, for

instance, the approach in Francisco-Fernández and Opsomer (2005) or maximum

likelihood estimation.

2.3 Parametric bootstrap.

In a general way, bootstrap (Efron, 1979) is a resampling method that attempts to

estimate the sampling distribution of a population by generating new samples by

11

drawing (with replacement) from the original data. Under certain probabilistic con-

ditions (which are often true in reality), bootstrap usually produces more accurate

and reliable results than traditional methods which are either based upon the large

number distribution properties of the observations (see for example Efron and Tib-

shirani (1993) about the bootstrap and other resampling procedures). Nowadays,

as a computer-aided statistical technique, bootstrap is largely used in many �elds of

applied statistics such as medicine, biostatistics, environmental sciences, chemistry,

or seismology (Lamarre et al. (1992), Pisarenko and Sornette (2004))

Now, in order to incorporate variability assessments in our analysis of the earth-

quakes magnitude, we extended the parametric bootstrap for correlated data dis-

cussed in Vilar-Fernández and González-Manteiga (1996). It follows the same steps:

1. Obtain, using the algorithm described in previous section, the optimal band-

width matrixH (with the �bias-corrected and estimated�GCV criterion), the

corresponding residuals "i and the covariance function parameter estimates �.

2. Bootstrap datasets are generated by taking the estimated mean spatial surface

mH(X i) and adding bootstrap errors generated as a spatially correlated set

of errors. Bootstrap errors are obtained by the following steps:

(a) Using �, compute the variance-covariance matrices of the residuals " =

("1; "2; : : : ; "n) ; denoted by V .

(b) Using Cholesky decomposition, �nd the matrix P such that V = PP T :

(c) Obtain the �independent�variables e = (e1; e2; : : : ; en), given by

e = P�1":

(d) These independent variables are centered and, from them, we obtain

an independent bootstrap sample of sample size n, denoted by e� =

(e�1; e�2; : : : ; e

�n) :

(e) Finally, the bootstrap errors "� = ("�1; : : : ; "�n) are

"� = Pe�

12

and the bootstrap samples are:

Y �i = mH(X i) + "�i ; i = 1; 2; : : : ; n:

3. Once the bootstrap datasets are obtained, the above nonparametric regressions

are repeated for each bootstrap sample using the same bandwidthH as for the

original analysis, and a map of at-risk areas (magnitude larger or equal than a

threshold) is produced. This process is repeated a large number of times B (in

our analysis, we use B =1000). Finally, a map showing the frequency -across

bootstrap replicates, for each location- of how often that location is included

in the at-risk area is computed.

It is important to note that this procedure is especially designed for when the er-

rors are supposed to be spatially correlated. In the case of having independent errors,

the previous algorithm can be easily simpli�ed, since once the residuals ("1; : : : ; "n)

are obtained (using in this case the GCV method to select the bandwidth) in the

�rst step, the bootstrap errors ("�1; : : : ; "�n) are immediately obtained by resampling

with replacement among the original residuals.

3 Data analysis.

3.1 Northwest of the Iberian Peninsula.

The Hesperic Massif, composed by areas that conform a continuous outcrop and

occupy most of the western half of the Iberian Peninsula, is limited by other areas,

mainly of Mesozoic and Tertiary age. It is part of the Europe Hercinic chain, whose

trace can be seen from central Europe to the north-western extreme of France, fol-

lowing a general trend from east to west; next, it hides under the Atlantic Ocean,

forming a broad arc connecting with the Northwest coast of the Iberian penin-

sula. These areas are the Cantabrian Zone, Asturian-Leonese and Galician-Castilian

(Julivert et al. (1973), Julivert and Arboleya (1984)).

Our area of interest focus on the Northwest of the Iberian Peninsula, so our

statistical analysis will focus on data from the earthquakes succeed in the last two

13

areas. The Cantabrian zone has limits from the east and south by lands that form

the Mesozoic and tertiary coverture in Cantabria and the Plateau of Duero. The

western boundary is the other area, Asturian-Leonese (Perez-Estaún et al. (1991)).

A detailed study of the seismo-tectonics and dangerousness of the zone is done in

Rueda and Mezcua (2001).

For our study, we have considered the area limited by the coordinates 42o N �

44o N and 6o W �10o W, that involves the autonomic region of Galicia; this zone is

the most important respecting to the number of earthquakes and their magnitude.

We have selected the data bank of the National Geographic Institute (IGN) of Spain

(until April of 2008) at http://www.ign.es/ign/es/IGN/SisCatalogo.jsp, where the

seismic data by this public agency to current date in the Iberian Peninsula, Balearic

and Canary Islands are registered. Seismic activity increased considerably in Gali-

cia during the 1990�s, generating an important social alarm in the population, not

accustomed to feeling earthquakes of a notable magnitude. Actually, it is object of

research of several groups of work, among which is that of the authors of this work.

In Figure 2 we show a map of this zone, together with the selected data set, com-

posed by N= 1397 earthquakes succeed from 25/November/1944 to 03/April/2008.

As we can see in this map, there exist small spatial groups located in zones of high

risk: Sarria, Monforte and Celanova. A study of the distribution of the magnitude,

as well as of temporary cluster grouping, and their associated seismic risk can be

seen in Estévez et al. (2002), concerning to the earthquakes happened between the

years 1987 - 2000.

The �t of the nonparametric regression estimator (4) to the set of data was

done using the algorithm described in Section 2.2. In order to produce a map for

the survey area of interest, the local linear estimates were computed on a regular

50 � 50 grid overlaying the �eld. As noted earlier, the adjusted GCV method

for bandwidth selection of Francisco-Fernández and Opsomer (2005) requires the

speci�cation of a model for the correlation. Following an exploratory analysis of

the data, it could be seen that an isotropic exponential covariogram model (eq.

(3) with � = 0:5) is apparently adequate to describe the spatial dependence of the

residuals. As already noted at the end of Section 2, this model speci�cation is used

14

Figure 2: Locations of the observations

in the selection of bandwidth values and does not determine the actual shape of the

spatial distribution function m(x): Therefore, modest di¤erences between the true

spatial correlation and the assumed correlation model would have a negligible e¤ect

on m(x). Figure 3 shows the pilot variogram estimations and the �tted variogram

model. The estimated covariogram parameters were c0 = 0:080, c1 = 0:107 and

a = 0:044 (which corresponds to a practical range of 0.132).

Using these estimated parameters, the bandwidth matrix obtained with the se-

lection criterion (6) was:

H =

24 0:54 0

0 1:13

35 ;where the shape of this bandwidth at location of coordinates 43o N and 8oW, as well

as the corresponding regression estimator in the area of interest was previously shown

in the �rst row of Figure 1. This bandwidth corresponds to a moderate amount of

smoothing, since this bandwidth matrix implies that for any location x not on the

boundary of the study region, 20-40% of the observations are contributing (have non-

15

0.0 0.2 0.4 0.6 0.8

0.00

0.05

0.10

0.15

0.20

distance

sem

ivar

ianc

e

EmpiricalFitted model

Figure 3: Classical semivariogram estimations and WLS �tted exponential model

zero weight) to the nonparametric regression �t. Next, we use the bootstrap method

describe in Section 2.3 to plot maps of estimates of the likelihood of an earthquake

with magnitude larger or equal than a threshold occurring in each location of the

area of interest. Figure 4 show the maps with pointwise bootstrap probabilities

of being considered at risk of occurring an earthquake with magnitude larger or

equal than the threshold considered. To evaluate the sensitivity to the choice of the

threshold, we considered two thresholds: 2.5 in the left picture and 2.75 in the right

picture. We can observe that there is an important di¤erence between both maps.

So, while a big proportion of the area is in high risk of occurring an earthquake with

a magnitude larger or equal than 2.5, only the area limited, approximately, by the

coordinates 42:75o N �43:5o N and 6:5oW�8oWhas an important risk of occurring

16

Figure 4: Maps with bootstrap probabilities of areas with seismic risk for di¤erent

values of the threshold

an earthquake with a magnitude larger or equal than 2.75. On the other hand, in

this map, the highest values in the North limit are possibly due to a boundary e¤ect,

making it likely these very high risk values (close to one) are indeed spurious.

3.2 California data.

We use the composite seismicity catalog from the Advanced National Seismic System

(ANSS), available at http://quake.geo.berkeley.edu/anss/catalog-search.html. The

selected grid is 30:08o N to 44:99o N in latitude and 125o W to 115o W in longitude.

We selected earthquakes above magnitude 3.5 for the period 01/January/1998 to

01/April/2008. The reduced time period considered is due to the high number of

existing data in this area. Indeed, the number of data recorded in this period is

N = 4888, considerably greater than all the data recorded in the Galician region

previously analyzed. The reason of considering only a short period of time is the

following: the number of calculus necessary in the nonparametric estimator (4) is

linear in the number of the observations (n) and in the number of points in which we

realize the estimation (a bidimensional grid 50�50), but the cross-validation function(6) depends directly of the square of the number of observations. Also, the bootstrap

procedure implies to multiply by B (number or bootstrap replications) the volume

17

of calculations. Therefore, with the aim of showing the estimation and bootstrap

procedures in a di¤erent zone (relatively to its seismic risk) than the Galician one,

we have decided to analyze only the ten last years in the above mentioned catalogue.

Figure 5 shows the area of study and the locations of the observations.

Figure 5: Locations of the observations

Nonparametric analysis for California earthquake was developed by Grillenzoni

(2005). In Quintela-del-Río (2008) can be found an example of application of the

nonparametric functional data analysis (Ferraty and Vieu, 2006) to this zone.

We follow the same steps as in the previous example. Firstly, we �t the local

linear estimator to the earthquakes magnitude in this area. Nevertheless, in this

18

case, the spatial variability of the data was apparently captured by the trend, so

the residuals seems to exhibit no spatial correlation. For example, Figure 6 shows

an envelope for the empirical variogram obtained under spatial independence (by

random permutation of the data values on the spatial locations). Almost all values

lie inside the envelope, except the estimate at �rst lag. A deeper look to the residuals

shown that this was due to the presence of some atypical observations (with respect

to values in their neighborhood). After the exploratory analysis of the data, the

spatial independence of the data was assumed.

0 1 2 3

0.10

0.15

0.20

0.25

0.30

distance

sem

ivar

ianc

e

Figure 6: Envelope for the empirical variogram obtained under spatial independence

Due to this fact, we use the GCV algorithm for independent data, given in (5),

to obtain a suitable bandwidth. The bandwidth obtained with this criterion was:

19

H =

24 2:25 0

0 1:51

35 :So, with this bandwidth, we obtain an estimator for the mean magnitude in the

area of interest and then, applying the bootstrap method described in Section 2.3,

maps of estimates of the likelihood of an earthquake with magnitude larger or equal

than a threshold occurring in each location of the area of interest. We show in Figure

7 the local linear regression estimator with that bandwidth, in the left picture, and

the map with bootstrap probabilities of areas with seismic risk for threshold equals

to 4.0, in the right picture.

Figure 7: Estimated mean magnitude and estimated probability of ocurring an

earthquake with magnitude large of equal than 4

20

In this case, we can observe clearly which are the areas of highest seismic risk,

being of special interest the area located in the interior between San Francisco

and Los Angeles, with estimated probabilities (of occurring an earthquake with

magnitude larger or equal than four in those locations) close to one.

4 Conclusions.

In this article, we have developed a method for detecting areas that are at risk of

occurring an earthquake of a magnitude larger or equal than a given threshold. The

technique is a combination of nonparametric methods and a bootstrap algorithm.

To estimate the mean magnitude in the area of interests, we use the nonparamet-

ric local linear estimator for bivariate observations. To use this estimator a matrix

bandwidth is needed. In this paper, we suggest to apply cross-validation methods to

select this bandwidth. In the spatial �eld (as the addressed in this paper), it is quite

common to have spatial correlation between the observations. In that case, it is im-

portant to take this dependence into account in the bandwidth selection method.

So, in a practical situation, before applying the method itself, it is necessary to check

if the observations are or not spatially correlated. This can be done, for example,

studying the estimated semivariogram. If a spatial dependence is suspected, we pro-

pose to use the �bias-corrected�GCV criterion. To apply this method, a parametric

dependence structure for the data must be assumed and the corresponding parame-

ters estimated from the observations. Once again, the estimated semivariograms can

help us to do this process. The spatial dependence structure is also needed in the

resampling process in the bootstrap method. On the other hand, if the observations

can be assumed to be independent, the bandwidth matrix can be selected using the

GCV method and the previous bootstrap method must be modi�ed for independent

observations. These methods were applied to two di¤erent earthquakes data sets:

the historic catalogue of the Galician region at the Northwest of Spain and the last

ten years in California. Di¤erent threshold values were used in each of these exam-

ples. More generally, the overall approach of spatial smoothing and bootstrap-based

density mapping provides a useful and �exible set of statistical tools with which is

21

better to visualize and analyze spatial distribution data of earthquakes.

5 Acknowledgments.

The research of Mario Francisco-Fernández has been partially supported by Grant

MTM2008-00166 (ERDF included) and Grant PGIDIT07PXIB105259PR. The re-

search of Alejandro Quintela-del-Río has been partially supported by MEC Grant

MTM2006-03523.

6 Bibliography.

References

[1] Altman, N. S. (1997), Krige, smooth, both or neither? ASA Proceedings of the

Section on Statistics and the Environment, 60�65, American Statistical Associ-

ation, Alexandria, VA.

[2] Bailey, T.C. and Gatrell, A.C. (1995), Interactive Spatial Data Analysis, Long-

man, Essex, UK.

[3] Craven, P. and Wahba, G. (1979), Smoothing noisy data with spline functions:

estimating the correct degree of smoothing by the method of generalized cross-

validation, Numer. Math. 31, 377�403.

[4] Cressie, N. (1985), Fitting variogram models by weighted least squares, Math.

Geol. 17, 563�586.

[5] Cressie, N. (1993), Statistics for Spatial Data. John Wiley & Sons, New York,

2nd edition.

[6] Efron, B. (1979), Bootstrap Methods: Another Look at the Jackknife, Ann.

Statist. 7, 1�26.

22

[7] Efron, B., and Tibshirani, R. J. (1993), An introduction to the bootstrap, Chap-

man & Hall, New-York.

[8] Estévez-Pérez, G., Lorenzo-Cimadevila, H. and Quintela-del-Río, A. (2002),

Nonparametric analysis of the time structure of seismicity in a geographic re-

gion, Ann. Geophys.-Italy 45, 497-512.

[9] Fan, J. and Gijbels, I. (1996), Local Polynomial Modeling and its Applications,

Chapman and Hall, London.

[10] Francisco-Fernández, M. and Opsomer, J.D. (2005), Smoothing parameter se-

lection methods for nonparametric regression with spatially correlated errors,

Canad. J. Statist. 33, 539�558.

[11] Ferraty, F. and P. Vieu (2006), Nonparametric Functional Data Analysis,

Springer-Verlag.

[12] Grillenzoni, C. (2005), Non-parametric smoothing of spatio-temporal point

processes, J. Statist. Plann. Inference 128, 1, 61-78.

[13] Härdle, W. (1990), Applied nonparametric regression, Cambridge University

Press.

[14] Hart, J. (1996), Some automated methods of smoothing time-dependent data,

J. Nonparametr. Stat. 6, 115�142.

[15] Hastie, T. and Tibshirani, R. (1990), Generalized additive models, Chapman

and Hall, London.

[16] Journel, A.G. and Huijbregts, C.J. (1978), Mining geostatistics, Academic

Press, London. 22-47.

[17] Julivert, M., Fontbote, J.M., Ribeiro, A. and Conde, L. (1973), Mapa tectónico

de la Península Ibérica y Baleares E: 1:1.000.000. Inst. Geol. Min. Esp., Spain.

23

[18] Julivert, M. and Arboleya, M.L. (1984), A geometrical and kinematical ap-

proach to the nappe structure in a arcuate fold belt: the cantabrian nappes

(Hercynian chain, NW Spain), Jour. of Struct. Geology 6, 499-519.

[19] Kagan, Y.Y. (1990), Random stress and earthquake statistics: spatial depen-

dence. Geophys. J. Int., 102, 573-583.

[20] Lamarre, M., Townshend, B. and Shah, H.C. (1992), Application of the boot-

strap method to quantify uncertainty in seismic hazard estimates, Bull. Seismol.

Soc. Amer. 82,104-119.

[21] Liu, X. (2001), Kernel smoothing for spatially correlated data, Ph. D. thesis,

Department of Statistics, Iowa State University.

[22] Nadaraya, E. A. (1964), On Estimating Regression, Theory Probab. Appl. 9,

141-142.

[23] Ogata, Y. (1988), Statistical models for earthquake ocurrence and residual

analysis for point processes., J. Amer. Statist. Assoc. 83, 9-27.

[24] Opsomer, J. D., Wang, Y. and Yang, Y. (2001), Nonparametric regression with

correlated errors, Statist. Sci. 16, 134�153.

[25] Parzen, E. (1962), On estimation of a probability density function and mode,

Ann. Math. Statis. 32, 1065-1076.

[26] Pérez-Estaún, A., Martínez-Catalán, J.R. and Bastida, F. (1991), Crustal thick-

ening and deformation sequence in the footwall to the suture of the Variscan

belt of northwest Spain, Tectonophysics 191, 243-253.

[27] Pisarenko, V.F. and Sornette, D. (2004), Statistical detection and characteriza-

tion of a deviation from the Gutenberg-Richter distribution above magnitude

8, Pure Appl. Geophys. 161, 839�864.

[28] Quintela-del-Río, A. (2008), Hazard function given a functional variable: non-

parametric estimation under strong mixing conditions, J. Nonparametr. Stat.

20, 413-430.

24

[29] Rueda, J. and Mezcua, J. (2001), Sismicidad, sismotectónica y peligrosidad

sísmica en Galicia, IGN Technical Publication 35.

[30] Ruppert, D. andWand, M.P. (1994), Multivariate locally weighted least squares

regression, Ann. Statist. 22, 1346�1370.

[31] Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis,

Monographs on Statistics and Applied Probability, Chapman & Hall.

[32] Simono¤, J. (1996), Smoothing Methods in Statistics, Springer-Verlag, New

York.

[33] Stein, M. L. (1999), Interpolation of spatial data, some theory for Kriging.

Springer Series in Statistics, Springer-Verlag, New York.

[34] Stock, C. and Smith, E. (2002), Comparison between seismicity models gener-

ated by di¤erent kernel estimations. Bull. Seismol. Soc. Amer. 92, 913�922.

[35] Torcal, F., Posadas, A., Chica, M. and Serrano, I. (1999), Application of con-

ditional geostatistical simulation to calculate the probability of occurrence of

earthquakes belonging to a seismic series, Geophys. J. Int. 139, 703-725.

[36] Udias, A. and Rice, J. (1975),. Statistical analysis of microearthquakes activity

near San Andres Geophysical Observatory, Hollister, California, Bull. Seismol.

Soc. Amer. 65, 809-828.

[37] Vere-Jones, D. (1978), Earthquake prediction - A statician�s view, J. Phys.

Earth, 26, 129-146.

[38] Vere-Jones, D. (1992), Statistical methods for the description and display of

earthquake catlogues, in Statistics in the Enviromnmental and Earth Sciences,

pp. 220-244, eds Walden, A.T. & Guttorp, P., London.

[39] Vilar-Fernández, J.M. and González-Manteiga, W. (1996), Bootstrap test of

goodness of �t to a linear model when errors are correlated, Comm. Statist.

Theory Methods 25, 2925�2953.

25

[40] Wand, M. P. and Jones, M. C. (1995), Kernel Smoothing, Chapman and Hall,

London.

[41] Watson, G.S. and Leadbetter, M.R. (1964), Hazard analysis I, Biometrika 51,

175-184.

26

Documents

A nonparametric analysis of the spatial distribution of the …dm.udc.es/profesores/mario/ficheros/terremotos... · 2009-03-06 · as a smooth spatial function, jointly with the use