
Scalable Parallel Algorithms for Surface Fitting and Data Mining

Peter Christen a,1, Markus Hegland a, Ole M. Nielsen a, Stephen Roberts b, Peter E. Strazdins c and Irfan Altas d

a Computer Sciences Laboratory, RSISE, Australian National University, Canberra, ACT 0200, Australia
b School of Mathematical Sciences, Australian National University
c Department of Computer Science, Australian National University
d School of Information Studies, Charles Sturt University, Wagga Wagga, NSW 2678, Australia

Abstract

This paper presents parallel scalable algorithms for high dimensional surface fitting and predictive modelling which can be used in data mining applications. The presented algorithms are based on techniques like finite elements, thin plate splines, additive models and wavelets. They consist of two phases: first, data is read from secondary storage and a linear system is assembled; second, the linear system is solved. The assembly can be done with almost no communication and the size of the linear system is independent of the data size. Thus the presented algorithms are both scalable with the data size and the number of processors.

Key words: Data Mining, Thin Plate Splines, Additive Models, Wavelets, Parallel Linear System

1 Introduction

In the last few years there has been an explosive growth in the amount of data being collected. The computerisation of business transactions and the use of bar codes in commercial outlets has provided businesses with enormous amounts of data. In science, projects like the Human Genome Project deal with Terabytes of data. Revealing patterns and relationships in a data set can

1 Corresponding author, E-Mail: [email protected]

Preprint submitted to Elsevier Preprint 6 June 2000

improve the goals, missions and objectives of many organisations and research projects. For example, sales records can reveal highly profitable retail sales patterns. As such it is important to develop automatic techniques to process and detect patterns in very large data sets. This process is called Data Mining [6].

Algorithms applied in data mining have to deal with two major challenges: large data sets and high dimensions (many attributes). Until recently data sets were measured in Gigabytes, but Terabyte data collections are now being used in business and the first Petabyte collections are appearing in science [11]. It has also been suggested that the size of databases in an average company doubles every 18 months [5], which is similar to the growth of hardware performance according to Moore's law. Consequently, data mining algorithms have to be able to scale from smaller to larger data sizes when more data becomes available. The complexity of data is also growing as more attributes tend to be logged in each record. Data mining algorithms must, therefore, also be able to handle high dimensions in order to process such data sets.

This combination of large data size with high data complexity poses a tough challenge for all data mining algorithms, and parallel processing is a must in order to get reasonable response times. Moreover, algorithms which do not scale linearly with data size are not feasible. In this paper, we present scalable parallel algorithms for predictive modelling and high dimensional surface fitting that successfully deal with these issues and are being applied to real world data mining problems where data sizes consist of up to 36 million records with a dimensionality as large as 44 attributes.

An important technique applied in data mining is multivariate regression, which is used to determine functional relationships in high dimensional data sets. A major difficulty which one faces when applying nonparametric methods is that the complexity grows exponentially with the dimension of the data set. This has been called the curse of dimensionality [15]. In Section 2 we introduce our basic algorithms and explain how they deal with this curse. More detailed descriptions of the three methods are then given in Sections 3, 4 and 5. A discussion of the complexity and scalability aspects of our algorithms follows in Section 6. Before we present implementation details in Sections 8 and 9 we give a short overview of the data mining process in Section 7. Section 10 concludes the paper and presents some ideas for future work.


2 A Finite Element Approach to Data Mining

An important task in data mining is predictive modelling. As a predictive model in some way describes the average behaviour of a data set, one can use it to find data records that lie outside of the expected behaviour. These outliers often have simple natural explanations but, in some cases, may be linked to fraudulent behaviour.

A predictive model is described by a function y = f(x_1, \ldots, x_d) from the set T of attribute vectors of dimension d into the response set S. If S is discrete (often binary), the determination of f is a classification problem, and if S is a set of real numbers one speaks of regression. In the following it will mainly be assumed that all the attributes x_i as well as y are real values and we set x = (x_1, \ldots, x_d)^T.

In many applications, the response variable y is known to depend in a smooth way on the values of the attributes, so it is natural to compute f as a least squares approximation to the data with an additional smoothness component imposed. In this paper, we state the problem formally as follows. Given n data records (x^{(i)}, y^{(i)}), i = 1, \ldots, n, where x^{(i)} \in \Omega with \Omega = [0,1]^d the compact d-dimensional unit cube, we wish to minimise the following functional subject to some constraints:

J_\alpha(f) = \sum_{i=1}^{n} \left( f(x^{(i)}) - y^{(i)} \right)^2 + \alpha \int_\Omega |Lf(x)|^2 \, dx    (1)

where α is the smoothing parameter and L is a differential operator whose different choices may lead to different approximation techniques. For example, for d = 2, the standard thin plate splines method can be obtained by selecting L as the summation of the second derivatives of f(x). The smoothing parameter α controls the trade-off between smoothness and fit: in the limit α → 0 the function f becomes an interpolant. If α is large, f becomes very smooth but may not reflect the data very well. The choice of α is data dependent. Some techniques such as generalised cross validation can be employed to pick an appropriate value of α for a given data set. More detailed discussion on the choice of α can be found in [25].

One can choose different function spaces to approximate the minimiser f of equation (1). For example, picking the approximating functions in the space of piecewise multi-linear functions requires the use of a non-conforming finite element approach for the Thin Plate Splines Finite Element Method (TPSFEM) as discussed in Section 3, and requires the addition of some constraints for Additive Models (ADDFIT) as discussed in Section 5. As such we generally need to consider the addition of constraints when we discretise the minimisation problem (1).

In this paper, we describe and compare three different methods to approximate the minimiser of Equation (1). These methods are:

• TPSFEM: Piecewise multilinear finite elements (Section 3)
• HISURF: Hierarchical multilinear finite elements (interpolatory wavelets) (Section 4)
• ADDFIT: Additive models (Section 5)

These methods have been implemented as data mining tools and are used for predictive modelling. All of these tools approximate the minimiser in Equation (1) but they differ in how well they approximate f and, more importantly, in their algorithmic complexities. Roughly speaking, TPSFEM gives the most accurate approximation at the highest computational cost whereas ADDFIT has the lowest cost but the coarsest approximation. HISURF sits somewhere in between these two extremes and provides good approximations at a reasonable cost. Details about these methods are given in Sections 3 to 5.

For the TPSFEM and ADDFIT methods we use piecewise multi-linear basis functions to approximate f. That is, in vector form,

f(x) = b(x)^T c    (2)

where typically the vector b(x) is sparse and the vector c represents the linear combination coefficients. (Note that f in the HISURF method is approximated using wavelet functions as explained in Section 4.)

By using Equation (2) the least squares part of Equation (1) takes the form

A = \sum_{i=1}^{n} b(x^{(i)}) \, b(x^{(i)})^T    (3)

with a right-hand side vector

b = \sum_{i=1}^{n} b(x^{(i)}) \, y^{(i)}.    (4)

These sums are evaluated by reading the data records from disk and computing the sum, which is then accumulated into global data structures. Thus, the formation of A and b is scalable with respect to the data size. The operation also parallelises well: each processor first sums up a local matrix, and all local matrices are summed at the end, as sketched below.
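As an illustration, the following sketch (in C, the implementation language mentioned in Section 8) shows the structure of this summation for a dense matrix stored row by row. The record layout, the helper eval_basis and all other names are illustrative assumptions, not the actual interface of our code.

#include <mpi.h>

/* Illustrative record layout: up to 64 attribute values and one response y. */
typedef struct { double x[64]; double y; } record_t;

/* Assumed helper: writes the positions and values of the non-zero entries
 * of b(x) into idx[] and val[] and returns their number k (small, O(d)).   */
extern int eval_basis(const double *x, int d, int *idx, double *val);

/* Each processor assembles its n_local records into local copies of the
 * m x m matrix A (row-major) and the vector b; cf. Equations (3) and (4).  */
void assemble(const record_t *recs, long n_local, int d, int m,
              double *A_local, double *b_local)
{
    int idx[64];
    double val[64];

    for (long r = 0; r < n_local; r++) {
        int k = eval_basis(recs[r].x, d, idx, val);
        for (int i = 0; i < k; i++) {
            b_local[idx[i]] += val[i] * recs[r].y;           /* Equation (4) */
            for (int j = 0; j < k; j++)                      /* Equation (3) */
                A_local[idx[i] * m + idx[j]] += val[i] * val[j];
        }
    }
}

/* Sum the local systems into the final system on processor 0. */
void reduce_system(const double *A_local, const double *b_local,
                   double *A_global, double *b_global, int m)
{
    MPI_Reduce(A_local, A_global, m * m, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(b_local, b_global, m, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}

Since every record only touches the few entries selected by its non-zero basis values, the work per record is independent of n, which is what makes the assembly scalable.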

The approximation of Equation (1) together with the additional constraints leads to a linear system of equations of the form

\begin{pmatrix} A & B \\ B^T & 0 \end{pmatrix} \begin{pmatrix} c \\ w \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}    (5)

The size m of the system is independent of the number of data records n, but depends on the number of basis functions, γ, used for the discretisation. The explicit form of such a system is presented in Section 3.

Our three methods have two main steps to approximate f in Equation (1):

I Assembly: The γ × γ matrix A and the γ × 1 vector b are assembled using Equation (3) and Equation (4). This step requires access to all n data records and it can be organised such that the computational work is linear in n. As usually γ ≪ n, this step can be interpreted as a reduction operation on the original data. Note that the construction of the matrices coming from the smoothing part of Equation (1) and the constraints does not require accessing the data at all. These matrices have sizes similar to A; hence the cost of assembling them is negligible.

II Solving: This step assembles the m × m matrices coming from the smoothing part of Equation (1) and solves the entire system. It does not involve the n data records and the computational work depends only on m, typically as O(m^3).

Note that for large n step I will dominate whereas for large m step II will dominate. As the number of data records n is usually very large for data mining applications, the overall complexity is mainly dominated by n.

3 Surface Fitting and Thin Plate Splines

In this section we describe the Thin Plate Spline Finite Element Method (TPSFEM). The fitting of a function to a set of data records is a recurring problem across numerous disciplines such as analysing meteorological data, reconstructing three dimensional scanned images and regression type analyses of very large data sets in the data mining area. One of the most frequently used and efficient techniques is the thin plate splines approach. The standard thin plate spline is the function that minimises the functional

J_\alpha(f) = \sum_{i=1}^{n} \left( f(x^{(i)}) - y^{(i)} \right)^2 + \alpha \int_\Omega \sum_{|\nu|=2} \binom{2}{\nu} \left( D^\nu f(x) \right)^2 dx    (6)

where ν = (ν_1, \ldots, ν_d) is a d-dimensional multi-index and |ν| = \sum_{s=1}^{d} ν_s. The case d = 1 corresponds to the natural cubic spline smoother and the case d = 2 to the thin plate spline smoother. For d > 2 the smoothing term in the functional can be extended to

\int_\Omega \sum_{|\nu|=2} \binom{2}{\nu} \left( D^\nu f(x) \right)^2 dx + \int_\Omega \sum_{|\nu|_\infty = 1} \left( D^\nu f(x) \right)^2 dx    (7)

which ensures that pointwise evaluation of functions is well defined on the Hilbert space defined by the boundedness of the functional. However, with this extension we lose the rotation invariance of our functional.

An explicit representation of the thin plate spline as a sum of radial basis functions was obtained by Duchon [10]. This approach requires the solution of a symmetric indefinite dense linear system of equations that has a size proportional to the number of data records n. Although this initial approach was improved by a number of later works [4,22], these techniques require complex data structures and algorithms, with O(n) workspace. Thus, standard thin plate splines may not be practical for applications that have very large data sets, which is the case for data mining.

In this section we present a discrete thin plate spline which combines the favourable properties of finite element surface fitting with those of thin plate splines. The mathematical foundations of the method are briefly presented here. The details of the method were given in [16].

The smoothing problem can be approximated by minimising the functional over a finite dimensional space of piecewise multi-linear functions. In vector notation, f in this space will be of the form

f(x) = b(x)^T c.

The idea is to minimise J_α over all f of this form. Unfortunately the smoothing term in Equation (6) is not defined for piecewise multi-linear functions. We can either work with a more complicated set of approximation functions, or we can use a non-conforming finite element principle and introduce piecewise multi-linear functions to represent the gradient of f. We have chosen the latter course. Let u = (b^T g_1, \ldots, b^T g_d), which will represent the gradient of f. The functions f and u satisfy the relationship

\int_\Omega \nabla f(x) \cdot \nabla v(x) \, dx = \int_\Omega u(x) \cdot \nabla v(x) \, dx    (8)

for all piecewise multi-linear functions v. This is equivalent to the relationship

L c = \sum_{s=1}^{d} G_s g_s    (9)

where L is a discrete approximation to the negative Laplacian operator and (G_1, \ldots, G_d) is a discrete approximation to the gradient operator. We now consider the minimiser of the functional

J_\alpha(c, g_1, \ldots, g_d) = \frac{1}{n} \sum_{i=1}^{n} \left( b(x^{(i)})^T c - y^{(i)} \right)^2 + \alpha \int_\Omega \sum_{s=1}^{d} \nabla(b^T g_s) \cdot \nabla(b^T g_s) \, dx
                            = \frac{1}{n} \sum_{i=1}^{n} \left( b(x^{(i)})^T c - y^{(i)} \right)^2 + \alpha \sum_{s=1}^{d} g_s^T L g_s.    (10)

Our smoothing problem now consists of minimising this functional over all vectors c, g_1, \ldots, g_d subject to the constraint (9).

The functional has exactly one minimum if the observation points x^{(i)} are not collinear. The function defined by f(x) = b(x)^T c provides a smoother which has essentially the same smoothing properties as the original thin plate smoothing spline, provided the discretisation is small enough.

For the rest of this section we will consider the case d = 2, but other cases of d are completely analogous. The discrete minimisation problem (10) is equivalent to the linear system

\begin{pmatrix}
A & 0 & 0 & L^T \\
0 & \alpha L & 0 & -G_1^T \\
0 & 0 & \alpha L & -G_2^T \\
L & -G_1 & -G_2 & 0
\end{pmatrix}
\begin{pmatrix} c \\ g_1 \\ g_2 \\ w \end{pmatrix}
=
\begin{pmatrix} b \\ 0 \\ 0 \\ 0 \end{pmatrix}    (11)

where w is a Lagrange multiplier associated with the constraint (9). The matrices L, G_1 and G_2 are independent of the data points, while the matrix A and the vector b, as defined in the previous section, depend on the data. The assembly of these matrices will be discussed in Section 8. The system matrix of (11) is symmetric indefinite and, in our case, sparse.

If m_s is the size of the discretisation in the s-th dimension then the number of basis functions for TPSFEM is γ = \prod_{s=1}^{d} m_s. If m_s = µ for all s then γ = µ^d.

There are many possible ways to solve equation (11). In particular it is possible to eliminate all the variables except g_1 and g_2. The equations for g_1 and g_2 then become

\begin{pmatrix}
\alpha L + G_1^T Z G_1 & G_1^T Z G_2 \\
G_2^T Z G_1 & \alpha L + G_2^T Z G_2
\end{pmatrix}
\begin{pmatrix} g_1 \\ g_2 \end{pmatrix}
=
\begin{pmatrix} G_1^T z \\ G_2^T z \end{pmatrix}    (12)

where Z = L^{-1} A L^{-1} and z = L^{-1} b. Strictly speaking, L^{-1} does not exist, but we can define c = L^{-1} b to be that vector c which satisfies the augmented system

\begin{pmatrix} L & e \\ e^T & 0 \end{pmatrix} \begin{pmatrix} c \\ g \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}.

In fact, in our code, all calculations involving L^{-1} are provided by a multigrid Poisson solver.

It can be seen that the system (12) is symmetric positive definite and so we use the conjugate gradient method to solve the equation. In fact we use the preconditioned conjugate gradient method, with preconditioner matrix

Q = \begin{pmatrix} L^{-1} & 0 \\ 0 & L^{-1} \end{pmatrix}.

It can be seen that each iteration of our preconditioned conjugate gradient solver involves four applications of L^{-1}. For smooth problems we need on the order of twenty iterations of our PCG method to reduce the residual of the equation by a factor of 10^{-10}.
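For illustration only, a minimal preconditioned conjugate gradient loop is sketched below in C. Here apply_A stands for multiplication by the coefficient matrix of (12) (which internally requires two applications of L^{-1} through Z) and apply_Q for the preconditioner Q (two further applications of L^{-1}); both callbacks, the multigrid solver behind them and all names are assumptions, not our actual code.

#include <math.h>
#include <string.h>

/* Callbacks assumed to be provided elsewhere: y = A x for the system (12)
 * and y = Q x for the block-diagonal preconditioner; each is expected to
 * apply L^{-1} twice via a multigrid Poisson solver.                       */
typedef void (*matvec_t)(const double *x, double *y, int n, void *ctx);

static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Preconditioned conjugate gradients: solves A x = rhs, updating x in place.
 * Work arrays r, z, p, q of length n must be supplied by the caller.        */
int pcg(matvec_t apply_A, matvec_t apply_Q, void *ctx,
        const double *rhs, double *x, int n,
        double *r, double *z, double *p, double *q,
        int max_iter, double tol)
{
    apply_A(x, r, n, ctx);                        /* r = rhs - A x           */
    for (int i = 0; i < n; i++) r[i] = rhs[i] - r[i];

    apply_Q(r, z, n, ctx);                        /* z = Q r                 */
    memcpy(p, z, n * sizeof(double));
    double rho  = dot(r, z, n);
    double res0 = sqrt(dot(r, r, n));

    for (int it = 0; it < max_iter; it++) {
        apply_A(p, q, n, ctx);                    /* q = A p                 */
        double alpha = rho / dot(p, q, n);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }

        if (sqrt(dot(r, r, n)) <= tol * res0)     /* relative residual test  */
            return it + 1;

        apply_Q(r, z, n, ctx);                    /* z = Q r                 */
        double rho_new = dot(r, z, n);
        double beta = rho_new / rho;
        rho = rho_new;
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
    }
    return -1;                                    /* not converged           */
}

One matrix-vector product and one preconditioner application per iteration account for the four applications of L^{-1} mentioned above.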

The details of this approach, the solution techniques for the system and its parallel implementation are presented in [8]. One of the advantages of this approach is that the size of the linear system depends on the discretisation size instead of the number of data records.

The TPSFEM method can be used for surface fitting problems with dimensions 1 to 3. For problems with higher dimensions the curse of dimensionality becomes an obstacle due to the exponential growth of the dimension of the finite element space and the need to use more complicated smoothing terms. However, TPSFEM can still be employed to handle surface fitting problems with more than 3 variables by combining it with the methods discussed in Sections 4 and 5.

4 High Dimensional Smoothing Using Wavelets

This section describes the High Dimensional Surface Smoothing (HISURF) method. It uses a hierarchical interpolatory wavelet basis and tensor products thereof to approximate f as given in (1). Wavelets provide a multi-level decomposition of f in each dimension which can lead to very compact approximations of reasonably well behaved functions. See [9,21,23] for introductions to wavelet theory.

Let I_{j,l} = \left[ \frac{l}{2^j}, \frac{l+1}{2^j} \right] be an interval. The 1D hierarchical basis functions are then

\sigma_{j,l} = \begin{cases} \varphi_{0,l}, & j = 0 \\ \psi_{j-1,l}, & j > 0 \end{cases}    (13)

where

\varphi_{j,l}(x) = \begin{cases} 1 + 2^j x - l, & x \in I_{j,l-1} \\ 1 - 2^j x + l, & x \in I_{j,l} \\ 0, & \text{otherwise} \end{cases}

and ψ_{j-1,l} = \varphi_{j,2l+1}, l = 0, \ldots, 2^j. The high dimensional basis functions are formed as the tensor products σ_{j,l} = \bigotimes_{s=1}^{d} σ_{j_s,l_s}, where σ_{j,l} are tensorial multi-dimensional wavelet functions, j = (j_1, \ldots, j_d)^T is a vector of scales in each dimension and l = (l_1, \ldots, l_d)^T is a vector of positions in each dimension.

Let j ∈ Z be a fixed maximal resolution, let γ = 2^j + 1 be the number of grid points in each dimension, and let U_j^d be the space generated by piecewise linear d-dimensional functions interpolated from the γ^d total grid points. These functions are the generic finite element basis used in the previous section. It can be shown (see [18]) that the functions σ_{j,l} also form a basis for U_j^d. We call this basis the rectangular wavelet basis.

A function f(x_1, \ldots, x_d) ∈ U_j^d then has an expansion in terms of the rectangular basis as follows:

u = \sum_{j_1,\ldots,j_d=0}^{j} \sum_{l_1=0}^{\theta(j_1)} \cdots \sum_{l_d=0}^{\theta(j_d)} d_{j,l} \, \sigma_{j,l}    (14)

where the function θ(i) denotes the last local index of coefficients at scale i, defined as θ(i) = 2^{|i-1|} - 1 for i ≥ 0.

Fig. 1. Comparison of four different coefficient matrices.

It can be shown [18] that terms of this expansion where j_1 + \cdots + j_d > j can be deactivated without sacrificing the essential approximation power. This is data independent or a-priori compression, as opposed to the more common (in the wavelet literature) data dependent compression where wavelet coefficients are discarded based on their magnitude. The data dependent compression is efficient for a function with isolated singularities. On the other hand, for fitting a high dimensional smooth surface, singularities are unlikely to occur, so good approximations can be achieved by using a-priori compression. Since this compression scheme is data-independent the algorithm can be very fast.

The compression error is bounded by the expression \mathrm{const} \binom{j+d-1}{j} 2^{-2j}, where the constant depends only on the smoothness of f. In return, the dimension of the compressed system is of the order j^{d-1} 2^j, which is a significant reduction compared to TPSFEM, especially for large γ and d. The approximated smoothing surface is then computed in terms of the active coefficients only.

Figure 1 shows the four different matrices for the smoothing problem with d = 4, j = 2. The upper left corner shows it in terms of the generic basis for U_j. It is a large banded matrix with bandwidth increasing with d. The upper right corner shows the coefficient matrix in terms of the rectangular wavelet basis for U_j. Because of scale interactions, the number of nonzeros has increased. The lower left corner shows the same matrix with inactive coefficients set to zero, using only active basis functions. The lower right corner shows the compressed matrix where all inactive rows and columns have been removed. This is the matrix we use to compute the compressed surface.

Fig. 2. Modelling 2D aeromagnetic data with 735,700 records (TPSFEM: 16641 vars, HISURF: 833 vars, ADDFIT: 388 vars).

5 Additive Models

In this section we describe our method for Additive Model Fitting, called ADDFIT. Functions of d variables can be represented as sums of the form

f(x_1, \ldots, x_d) = f_0 + \sum_{i=1}^{d} f_i(x_i) + \sum_{i<j} f_{i,j}(x_i, x_j) + \cdots.

Such decompositions originate from the Analysis of Variance and have thus been called ANOVA-decompositions [12]. They can be viewed as generalisations of Taylor and Fourier series. However, the terms are only uniquely determined if additional constraints are imposed. If this is not done, the component f_1(x_1), for example, is a special case of f_{1,2}(x_1, x_2) and thus cannot be determined.

Including so-called interaction terms f_{i_1,\ldots,i_k}(x_{i_1}, \ldots, x_{i_k}) up to order k = d allows the exact representation of f. This, however, is computationally infeasible in general due to the curse of dimensionality, which also posed a major challenge to TPSFEM and HISURF. Luckily, for high-dimensional data only the inclusion of lower-order terms is required, as they give approximations which converge with the dimension d for smooth functions [17]. Practical algorithms [3,13,15,25] typically give good approximations for k = 1 or k = 2, and it is common folklore that interactions with higher order than k = 5 are highly unusual. (Of course one also requires enough data in order to identify such high-dimensional interactions.) The terms in the ANOVA decomposition are represented using the same basis functions which have been used for TPSFEM and HISURF. The more general cases will be discussed elsewhere; here we only discuss functions of the form

f(x) = f_0 + \sum_{s=1}^{d} f_s(x_s).

Fig. 3. Elementary matrix structure, d = 3 (TPSFEM: nz = 64; HISURF: nz = 3969; ADDFIT: nz = 36).

These additive models are discussed extensively in [15] from a statistical viewpoint. The predictor variables x_s can be real numbers, categories or even more complex objects like sets, graphs and vectors. (Vectors allow the inclusion of higher order interactions.) In the following, however, only simple data types (real and categorical) will be discussed. Additive models have many advantages. First, they are easy to interpret as the overall effect is given as a sum of effects of single variables. When interpreting additive models, however, one has to take into account that the variables x_s might be correlated.

Our implementation of additive models uses a basis representation of the component functions f_s as

f_s(x_s) = \sum_{i=1}^{\gamma_s} c_{s,i} \, b_{s,i}(x_s)

with the basis functions b_{s,i}, the coefficients c_{s,i} and where γ_s is the number of basis functions characterising f_s. The basis functions are such that for any x_s only a small number of basis functions have nonzero values. For categorical variables the basis functions are just the category indicator functions and for real variables we use piecewise linear functions. A difficulty is that these functions may be linearly dependent over the data set and this has to be addressed with constraints.

While the components of the sum defining A in Equation (3) are typically sparse (see Figure 3), the final matrix A is more or less dense, as can be seen in Figure 4. In the parallel implementation each processor needs storage of the size of A to store the partial sum. This means that we will not be able to increase the accuracy of the model in a scalable way with the number of processors, but this was not essential for our project. The accuracy of the estimate is given by the accuracy of the model (bias) and the variance of the estimate. The number of data records controls the variance of the estimate.

The total number of basis functions used is m = 1 + \sum_{s=1}^{d} γ_s. Thus for constant γ_s = γ the number of unknowns scales linearly with the dimension d:

m = 1 + dγ.

Fig. 4. Matrix structure and complexity for a 3D data set (3D finite element matrix: nz = 15625; 3D compressed matrix: nz = 13401; 3D additive model: nz = 561; storage requirements as a function of dimension for TPSFEM, HISURF and ADDFIT).

This scalability with dimension d is an important advantage of the additive model.

Figure 4 shows clearly that additive models lead to small dense linear systems, which makes high dimensional modelling feasible. The somewhat poor approximation can be improved by introducing derived variables. These new variables should be chosen to suit particular applications. For higher dimensional problems it can be shown that the approximation power gets better with the number of dimensions [17].

6 Complexity and Scalability

The process of assembling the linear systems has the same structure for all three presented methods. For each data record, the elementary vector b(x) is computed. The number of non-zero elements in this vector is O(d); it depends on the method, but typically each variable contributes only a small number of non-zero elements to this vector. Forming the normal equations matrix A is thus of order O(d^2) for each data record. The total complexity of assembling n data records sequentially is therefore

T_{assem}(1) = O(d^2 n).

As each processor can assemble a local linear system reading a fraction n/p of the data set without communication, the parallel complexity of the assembly process becomes

T_{assem}(p) = O\!\left(\frac{d^2 n}{p}\right).

Before it can be solved, the local linear systems have to be summed into the final linear system. The algorithm used to solve this system not only depends on the chosen method but also on the size of the assembled linear system. If the size is below a given threshold (which depends on the parallel hardware) we collect and sum the local linear systems on one processor and solve the system sequentially. A reduction operation is needed in this case, which has a complexity of O(log(p) d m) for TPSFEM and O(log(p) m^2) for HISURF and ADDFIT, respectively. If we aim to solve the linear system in parallel, we can use a block-cyclic distribution which has the same complexity as the reduction operation, but the scalability is better as none of the processors becomes a bottleneck. For a scalable algorithm for very large p the size of the matrix would actually have to decrease like \sqrt{\log(p)}; however, this decrease is not thought to be too dramatic with the current numbers of processors available. The time for reducing or redistributing the linear system is still much smaller than the time to assemble the local matrices, as we deal with a very large number of data records n.

From the storage point of view each processor has to store the whole matrix, as every data record can contribute non-zero elements anywhere in the matrix. As can be seen from Figure 4 the matrix data structure depends on the chosen method. For TPSFEM only a small number of diagonals are filled with non-zero elements. For a d-dimensional data set 3^d diagonals will be filled, but as the matrix is symmetric only (3^d + 1)/2 of them actually need to be stored (for d = 3, for example, 14 of the 27 diagonals). Using HISURF we get a dense matrix because of scale interactions, and for ADDFIT a block matrix with d^2 blocks is assembled. As all off-diagonal blocks can get dense we treat the complete matrix as dense. The size of the linear system also depends on the method used. Although TPSFEM results in a sparse matrix (which gets denser with increasing d), the method is not scalable with respect to the number of dimensions, as can be seen from Figure 4.

Fig. 5. The Data Mining Process (steps: Understand Customer, Analyse Data, Prepare Data, Build Model(s), Evaluate Model, Take Action).

7 The Data Mining Process

Before we start describing the implementation of our algorithms we give a short introduction to the process of data mining [7] as presented in Figure 5. A data mining project usually starts with a customer who has a large database and who wants this data to be analysed in order to find valuable information that was previously unknown and hidden in the data.

The first step in data mining – after understanding what a customer wants – is the extraction and preprocessing of suitable data out of a given data collection. Once data is understood and available, it can be used by modelling algorithms like neural networks, decision trees, association rules or the predictive models presented here. The results of this modelling process then have to be evaluated and deployed back to the customer's organisation. As Figure 5 shows, data mining is not a linear process. Some steps have to be done several times until the desired results are achieved. Different modelling techniques, for example, require different data preparation.

Figure 6 shows the steps actually involved in our data mining algorithms. The main purpose of a data collection is usually not data mining. Therefore, a suitable subset of the data often has to be extracted and converted so it can be used for further analysis by data mining tools. This extraction may include choosing a subset of attributes and/or a subset of records. Once a data set is available, it has to be cleaned and preprocessed. Data can be noisy or contain missing values that have to be converted into some well defined values so a modelling technique can deal with them. This data preparation phase can also include a normalisation or conversion of data into numerical values (if e.g. a data mining tool only works with numerical data).

Fig. 6. Steps of our Data Mining Algorithms (Raw Data → Preprocess, Normalise → Binary Data → Assemble Linear System (ADDFIT / TPSFEM / HISURF) → Linear System → Solve Linear System → Model y = f(x) → Visualize, Postprocess → Graphs, Reports).

For our data mining algorithms we need to convert categorical attributes into category numbers and normalise continuous attributes into the interval [0.0 . . . 1.0]. This preprocessing has to be done only once for a given data set. A straightforward parallelisation distributes the attributes over the processors. Unfortunately this has two drawbacks. First, every processor still has to read the whole data file (assuming all attributes are stored in a common file) and, second, load imbalance may occur if some processors have to process more attributes than others. Moreover, the time to process categorical attributes depends heavily on the number of categories, as for each data record a category has to be checked against all the categories already stored.

Table 1. Preprocessing times in seconds for 56 million records

                      Number of processors
                      1      2      4      6      8      10
First processor    3328   1865   1054    827    829     802
Last processor        –   2027   1551   1367   1342    1067

Table 1 presents timings we measured for an example data set consisting of almost 56 million data records, each with ten attributes (two continuous and eight categorical). These tests have been run on a Sun Enterprise 4500 SMP machine with twelve 400 MHz Ultra-Sparc processors, 6.75 Gigabytes of main memory and an attached 256 Gigabytes RAID disk array. As can be seen, there is quite a difference between the first and last processor to finish, which is due to one categorical attribute that has 885,351 categories. Since the number of categories is often unknown prior to preprocessing, it is hard to achieve good load balancing. For continuous attributes – as well as categorical attributes with only a small number of categories – good load balancing can be achieved, as the time in such a case is bounded by the time to read and parse the (text) file from secondary storage.
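A minimal sketch of this preprocessing step is given below, again in C; the fixed-size category table, the linear search and the function names are illustrative simplifications of the behaviour described above, not our actual code.

#include <string.h>

/* Normalise a continuous attribute value into [0.0, 1.0], given the minimum
 * and maximum observed for that attribute.                                  */
double normalise(double value, double min, double max)
{
    return (max > min) ? (value - min) / (max - min) : 0.0;
}

/* Map a categorical attribute value to a category number.  Every new value
 * is compared against all categories stored so far, which is why attributes
 * with very many categories dominate the preprocessing time.                */
#define MAX_CAT 100000                /* would be allocated dynamically       */
#define MAX_LEN 64

typedef struct {
    char name[MAX_CAT][MAX_LEN];
    int  count;
} category_table_t;

int category_number(category_table_t *tab, const char *value)
{
    for (int i = 0; i < tab->count; i++)              /* linear search        */
        if (strcmp(tab->name[i], value) == 0)
            return i;
    if (tab->count >= MAX_CAT)
        return -1;                                    /* table full           */
    strncpy(tab->name[tab->count], value, MAX_LEN - 1);
    tab->name[tab->count][MAX_LEN - 1] = '\0';
    return tab->count++;
}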

Once the binary normalised files are available we can run our data mining algorithms with different parameters (grid resolution, smoothing parameter, choosing predictor and response variable out of the set of attributes, etc.) until the desired model accuracy is achieved.

The postprocessing and visualisation step finally includes a suitable presentation of the achieved results. This can either be a printed report describing the outcomes in customer-related language or an interactive tool where a customer can interactively run the models and explore the data.

Fig. 7. Assembly on Sun Enterprise (time in seconds versus number of processes for the Master-Worker and SPMD implementations, with block lengths 1000 and 100,000).

8 Assembling the Linear System

Once a data set has been normalised it is available as binary files on secondary storage. The next step in our algorithms, which also requires access to the data, is the assembly of a linear system. For each method a different linear system has to be assembled, but the structure of the assembly remains the same. Basically, each data record adds some values into the linear system at some places.

The assembly of data records into the linear system is additive and thus each data record can be assembled independently from all others. Therefore, parallelisation can be achieved in a way that each processor assembles a fraction n/p of all n data records into a local linear system. Load balancing in the assembly is easily achieved, because each processor can read and assemble the same number of data records. However, if the processors are loaded differently, the most loaded node may become a bottleneck.

The assembly has been implemented for the ADDFIT method using C and MPI [20]. As the basic assembly structure is the same for all three methods, it is simple to include new routines for TPSFEM and HISURF. The matrix data structure and the assembly of a data record are the only parts that have to be changed. All results presented in this paper are based on the assembly of a linear system for ADDFIT.

We implemented both an SPMD and a master-worker (MW) version of the assembly. In the SPMD implementation each processor computes the partition of a file it has to assemble. It then opens the file, seeks to the start position, reads the appropriate number of data records and assembles them into a local linear system. In the MW implementation the master sends short messages to workers containing the file name, the start position and the number of records a worker has to assemble. The worker then opens the file, seeks to the given position and reads and assembles these records into a local linear system. Once a worker is finished it sends a ready message back to the master. The master sends tasks to workers as long as there are data records to assemble. With this method automatic load balancing is achieved, as more heavily loaded processors (e.g. by tasks from other users) take longer to assemble records but automatically get less data. However, the additional cost of this load balancing is the small messages that are communicated between master and workers.

Fig. 8. Assembly on Beowulf (time in seconds versus number of processes for the Master-Worker and SPMD implementations, with block lengths 1000 and 100,000).

For both implementations the innermost loop operates in a blocking structure, i.e. a block (a given number of data records) is loaded from file and assembled into the local linear system, then the next block is loaded, etc. Using different block sizes it is possible to trade off between disk reading, memory usage and communication (for the MW implementation only).
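The following sketch illustrates the master-worker protocol with MPI; the message layout, the tags and the helper assemble_block are assumptions made for the illustration and differ from the actual code.

#include <mpi.h>

#define TAG_READY 1
#define TAG_TASK  2
#define TAG_STOP  3

/* Assumed helper: reads 'count' records starting at record 'offset' from the
 * data file and assembles them into the local linear system (cf. Section 2). */
extern void assemble_block(long offset, long count);

/* Master (rank 0): hand out blocks of records until the file is processed.   */
void master(long n_records, long block, int n_workers)
{
    long task[2];
    long next = 0;
    int  active = n_workers, dummy;
    MPI_Status st;

    while (active > 0) {
        /* Wait for any worker to report that it is ready for more work.      */
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_READY,
                 MPI_COMM_WORLD, &st);
        if (next < n_records) {
            task[0] = next;                                /* start position   */
            task[1] = (next + block <= n_records) ? block : n_records - next;
            MPI_Send(task, 2, MPI_LONG, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
            next += task[1];
        } else {
            task[0] = task[1] = 0;                         /* no work left     */
            MPI_Send(task, 2, MPI_LONG, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            active--;
        }
    }
}

/* Worker (ranks 1..p-1): request tasks, assemble received blocks, stop on
 * TAG_STOP.  Load balancing is automatic: slower workers simply ask for less. */
void worker(void)
{
    long task[2];
    int  dummy = 0;
    MPI_Status st;

    for (;;) {
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_READY, MPI_COMM_WORLD);
        MPI_Recv(task, 2, MPI_LONG, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP)
            break;
        assemble_block(task[0], task[1]);
    }
}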

Once all data records are assembled the local linear systems have to be collected to form the final linear system. Depending on the algorithm used to solve the linear system (sequentially or in parallel) we either reduce and sum the linear system on one processor or we use a block-cyclic distribution. The results presented in this section include the time to reduce and sum the linear system on one processor.

Based on real-world data we created a random test set containing three continuous and four categorical attributes. The times for assembling about eleven million records on two different parallel architectures can be seen in Figures 7 and 8. The first machine used was the twelve processor Sun Enterprise described earlier, while the second machine was the 196 processor Beowulf Linux cluster Bunyip at the Australian National University [1]. The cluster is built with 98 dual 550 MHz Pentium III nodes, each containing 384 Megabytes of RAM (about 36 Gigabytes in total), 13 Gigabytes of disk space (1.3 Terabytes in total) and 3 × 100 MBit/s fast Ethernet cards. Logically, 96 nodes are connected in four groups of 24 nodes arranged as a tetrahedron with a group of nodes at each vertex. Two nodes are designated as servers. All data files have been distributed onto local disks.

Fig. 9. Assembly on Beowulf (Speedup and Efficiency with block length 10,000, for the Master-Worker and SPMD implementations).

Notable is the increased run time of the master-worker implementation on the Sun Enterprise server with a block length of 1000 data records if more than 15 processes are started (twelve processors are available). This increase is due to the larger number of messages exchanged between the master and worker processes. No messages are needed for the SPMD implementation and so the time needed for assembly does not increase if more processes are started. As communication is implemented over shared memory access – which is very fast – there is not much overhead for communication on the Sun Enterprise SMP machine.

On the Beowulf cluster the time needed for communication is clearly visible, as for both block lengths 1000 and 100,000 the master-worker implementation needs more time than the SPMD implementation.

The speedup and efficiency on the Beowulf (Figure 9) are almost ideal for up to 48 nodes with the same data set as before. As the data files are replicated on local disks on every node, there is no communication involved in file access. The assembled linear system in the presented example is very small (m < 100). But even when a larger system has to be communicated, the time for the reduction is small compared to the time for the assembly process.

9 Solving the Linear System

The systems currently generated from HISURF (Section 4) and ADDFIT (Section 5) are dense and symmetric, positive definite in the former case and semi-definite in the latter case. However, in future refinements of these models the definiteness property may be lost, for example because of the addition of extra constraints or, in the case of additive models, because of an extension to a second-order model.

The size of the assembled linear system depends on the total number of categories for categorical variables and on the number of grid points for continuous variables. Solving this system can be done with either a sequential or a parallel solver, depending on the size of the system and the availability of a parallel machine.

We thus require a solver that will be accurate for any symmetric dense system, and also have good parallel and sequential performance. The former requirement argues for a direct solver with good stability properties; the latter argues for one that exploits symmetry to require only m^3/3 + O(m^2) floating point operations, and that has been shown to have an efficient parallelisation. A direct solver for general symmetric (indefinite) systems based on the diagonal pivoting method [2,14] meets these requirements.

In the diagonal pivoting method, the decomposition A = LDL^T is performed, where L is an m × m lower triangular matrix with a unit diagonal, and D is a block diagonal matrix with either 1 × 1 or 2 × 2 sub-blocks [14]. The factorisation of A proceeds column by column; in the elimination of column j, three cases arise:

D1 Eliminate using a 1 × 1 pivot from A_{j,j}. This corresponds to the definite case, and will be used when A_{j,j} is sufficiently large (compared with max(A_{j+1:m,j})).

D2 Eliminate using a 1 × 1 pivot from A_{i,i}, where i > j. This corresponds to the semi-definite case; a symmetric interchange with rows/columns i and j must be performed.

D3 Eliminate using a 2 × 2 pivot using columns j and i (this case produces a 2 × 2 sub-block at column j of D). This corresponds to the indefinite case; a symmetric interchange with rows/columns i and j + 1 must be performed. However, columns j and j + 1 are eliminated in this case.

The tests used to decide between these cases, and the searches used to select column i, yield several algorithms based on the method, the most well-known being the several variants of the Bunch-Kaufman algorithm (see [14] and the references cited therein).
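As an illustration, the standard Bunch-Kaufman pivot test is sketched below for column j of a dense symmetric matrix stored in a full array with the lower triangle referenced; this is the textbook criterion [14] rather than the exact variant used in our solver, and the names are illustrative.

#include <math.h>

enum pivot_case { D1, D2, D3 };

/* Standard Bunch-Kaufman pivot selection for column j of the symmetric
 * matrix A (n x n, row-major, lower triangle referenced).  On return,
 * *piv is the row index r involved in cases D2 and D3.                    */
enum pivot_case choose_pivot(const double *A, int n, int j, int *piv)
{
    const double alpha = (1.0 + sqrt(17.0)) / 8.0;   /* growth constant    */

    /* lambda = largest off-diagonal magnitude in column j, found at row r. */
    double lambda = 0.0;
    int r = j;
    for (int i = j + 1; i < n; i++)
        if (fabs(A[i * n + j]) > lambda) { lambda = fabs(A[i * n + j]); r = i; }

    *piv = r;
    if (lambda == 0.0 || fabs(A[j * n + j]) >= alpha * lambda)
        return D1;                                   /* 1 x 1 pivot A(j,j)  */

    /* sigma = largest off-diagonal magnitude in row/column r.              */
    double sigma = 0.0;
    for (int i = j; i < n; i++) {
        if (i == r) continue;
        double a = (i > r) ? fabs(A[i * n + r]) : fabs(A[r * n + i]);
        if (a > sigma) sigma = a;
    }

    if (fabs(A[j * n + j]) * sigma >= alpha * lambda * lambda)
        return D1;                                   /* A(j,j) still safe   */
    if (fabs(A[r * n + r]) >= alpha * sigma)
        return D2;                                   /* 1 x 1 pivot A(r,r)  */
    return D3;                                       /* 2 x 2 pivot (j, r)  */
}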

It has recently been shown for the Bunch-Kaufman algorithm that there is no guarantee that the growth of L is bounded [2]. Variants such as the bounded Bunch-Kaufman and fast Bunch-Parlett algorithms have been devised which overcome this problem. The extra accuracy of these methods results from more extensive searching for stable pivot columns i for cases D2 and D3, with a correspondingly more frequent use of these cases.

For linear systems that are close to definite, such as those likely to be generated by our models, the diagonal pivoting methods permit most columns to be eliminated by case D1, requiring no symmetric interchanges. For a parallel implementation this is a highly useful property, as even for large matrices the communication startup and volume overheads of symmetric interchanges are very high [24]. Such a system of order m = 10,000 was solved on a 16 node Fujitsu AP3000 distributed memory multicomputer (having 300 MHz UltraSPARC II nodes) with a sustained speed of 5.6 Gigaflops; this represents a parallel speedup of 12.8 [24]. Here, a square block-cyclic distribution with a storage block size of r = 44 and an algorithmic blocking factor of ω = r was found to be optimal.

This parallel solver achieved its high performance using a version of the Bunch-Kaufman algorithm enabling further reduction of the proportion of cases D2 and D3. Here growth 'credits' from large diagonal elements in preceding columns were used to create a more relaxed criterion for applying case D1; however, this still achieves the same growth bound as the original Bunch-Kaufman method [24].

Similar parallel performance could be achieved with a more accurate algorithm that searched (first) for candidate columns i for cases D2 and D3 from the current storage block [19]. If this search was successful, the symmetric interchange would require no communication, resulting in no parallel overhead. Such a strategy could be based on the Parlett-Reid algorithm used for sparse matrices [2,19]. As it provides stronger guarantees of accuracy, this algorithm applied to the dense case is preferred for our parallel solver.

Results for this new method will be published elsewhere.

10 Conclusions

We have presented a three-fold approach to predictive modelling of very large data sets that is scalable with both the data size and the number of processors. All three methods are based on approximate smoothing splines using finite element techniques. While all methods are equally scalable, they differ in their approximation power and the range of dimensions for which they are applicable. The first method, called TPSFEM, is based on piecewise multilinear elements and is most suited to low-dimensional problems. For the intermediate range of dimensions we suggest HISURF, which uses a sparse-grid type approach based on bi-orthogonal wavelets. Finally, for very high dimensions we have developed ADDFIT, implementing additive models.

Timing results on two different parallel architectures show the parallel scalability of the first step of our methods – the assembly of a linear system. Work on an optimised parallel solver for the linear systems resulting from our algorithms is an ongoing project.

In future work we plan to add support for data types other than continuous and categorical variables. We hope to include, in a first instance, support for sets, time series and graphs. We also plan to explore how the reduction techniques could be used to address data mining problems other than prediction.

Acknowledgements

This research is supported by the Australian Advanced Computational Systems CRC. Peter Christen is funded by grants from the Swiss National Science Foundation (SNF) and the Novartis Stiftung, vormals Ciba-Geigy Jubiläums-Stiftung, Switzerland.

References

[1] Douglas Aberdeen, Jonathan Baxter and Robert Edwards, 98 c/MFlop, Ultra-Large-Scale Neural Network Training on a PIII Cluster, Gordon Bell award price/performance entry, and student paper award entry. Submitted to the High-Performance Networking and Computing Conference, Dallas, November 2000.

[2] Cleve Ashcraft, Roger G. Grimes and John G. Lewis, Accurate Symmetric Indefinite Linear Equation Solvers, SIMAX, 1999.

[3] Sergey Bakin, Markus Hegland and Mike Osborne, Can MARS be Improved with B-Splines?, Computational Techniques and Applications: CTAC97, World Scientific Publishing Co., pages 75-82, 1998.

[4] R.K. Beatson, G. Goodsell and M.J.D. Powell, On Multigrid Techniques for Thin Plate Spline Interpolation in Two Dimensions, in The Mathematics of Numerical Analysis, Lectures in Appl. Math., pages 77-97, 1996.

[5] G. Bell and J.N. Gray, The revolution yet to happen, Beyond Calculation (P.J. Denning and R.M. Metcalfe, eds.), Springer Verlag, 1997.

[6] Michael J.A. Berry and Gordon Linoff, Data Mining, Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.

[7] P. Chapman, R. Kerber, J. Clinton, T. Khabaza, T. Reinartz and R. Wirth, The CRISP-DM Process Model, Discussion paper, March 1999. www.crisp.org

[8] Peter Christen, Markus Hegland, Stephen Roberts and Irfan Altas, A Scalable Parallel FEM Surface Fitting Algorithm for Data Mining, submitted to the International Journal of Computer Research (IJCR), Special Issue on Industrial Applications of Parallel Computing, November 1999.

[9] I. Daubechies, Orthonormal bases of compactly supported wavelets, Comm. Pure and Appl. Math., pages 909-996, 1988.

[10] J. Duchon, Splines Minimizing Rotation-Invariant Semi-norms in Sobolev Spaces, in Lecture Notes in Math., vol. 571, pages 85-100, 1977.

[11] D. Dullmann, Petabyte databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-99), ACM Press, July 1999.

[12] B. Efron and C. Stein, The jackknife estimate of variance, The Annals of Statistics, vol. 9, num. 3, pages 586-596, 1981.

[13] J.H. Friedman, Multivariate adaptive regression splines, The Annals of Statistics, vol. 19, pages 1-141, 1991.

[14] Gene Golub and Charles Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, second edition, 1989.

[15] T.J. Hastie and R.J. Tibshirani, Generalized additive models, Chapman and Hall, Monographs on Statistics and Applied Probability 43, 1990.

[16] Markus Hegland, Stephen Roberts and Irfan Altas, Finite Element Thin Plate Splines for Data Mining Applications, in Mathematical Methods for Curves and Surfaces II, M. Daehlen, T. Lyche and L.L. Schumaker, Eds., pages 245-253, Vanderbilt University Press, 1998.

[17] Markus Hegland and Vladimir Pestov, Additive Models in High Dimensions, School of Mathematical and Computing Sciences, Victoria University of Wellington, 1999.

[18] Markus Hegland, Ole M. Nielsen and Zuowei Shen, High Dimensional Smoothing based on Multi-level Analysis. In preparation.

[19] John G. Lewis, Private communication, 1999.

[20] W. Gropp, E. Lusk and A. Skjellum, Using MPI – Portable Parallel Programming with the Message-Passing Interface. The MIT Press, Cambridge, Massachusetts, 1994.

[21] Ole Møller Nielsen, Wavelets in Scientific Computing, PhD thesis, Technical University of Denmark, 1998. Available at www.bigfoot.com/~Ole.Nielsen/thesis.html

[22] R. Sibson and G. Stone, Computation of Thin Plate Splines, SIAM J. Sci. Statist. Comput., 12:1304-1313, 1991.

[23] Gilbert Strang and Truong Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, 1996.

[24] Peter E. Strazdins, Accelerated Methods for Performing the LDL^T Decomposition, Proceedings of CTAC'99: The 9th Biennial Computational Techniques and Applications Conference and Workshops, Canberra, September 1999.

[25] G. Wahba, Spline models for observational data, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 59, SIAM, 1990.
