
7/31/2019 Handouts on Data-driven Modelling, part 3 (UNESCO-IHE)


Data-driven modelling in water-related problems. PART 3

Dimitri P. Solomatine

www.ihe.nl/hi/sol sol@ihe.nl

UNESCO-IHE Institute for Water EducationHydroinformatics Chair

D.P. Solomatine. Data-driven modelling (part 3). 2

Finding groups (clusters) in data (unsupervised learning)


Clustering

classification is aimed at identifying a mapping (function) that maps any given input xi to a nominal variable (class) yi.

finding the groups (clusters) in an input data set is clustering

Clustering is often the preparation phase for classification:

the identified clusters can be labelled as classes; each input instance can then be associated with an output value (class) and the instance set {xi, yi} can be built

(Figure: an example data set partitioned into Cluster 1, Cluster 2 and Cluster 3; panels a and b)


Reasons to use clustering

labelling large data sets can be very costly;

clustering may actually give an insight into the data and help discover classes which are not known in advance;

clustering may find features that can be used for categorization.


Voronoi diagrams


Methods for clustering

partition-based clustering (K-means, fuzzy C-means, based on Euclidean distance);

hierarchical clustering (agglomerative hierarchical clustering, nearest-neighbour algorithm);

feature extraction methods: principal component analysis (PCA), self-organizing feature (SOF) maps (also referred to as Kohonen neural networks).


k-means clustering

find the best division of N samples into K clusters \(C_i\) such that the total distance between the clustered samples and their respective centers (that is, the total variance) is minimized:

\[ J = \sum_{i=1}^{K} \sum_{n \in C_i} \| x_n - \mu_i \|^2 \]

where \(\mu_i\) is the center of cluster i.


k-means clustering: algorithm

1 randomly assign instances to the clusters

2 compute the centers according to

\[ \mu_i = \frac{1}{N_i} \sum_{n \in C_i} x_n \]

3 reassign the instances to the nearest cluster center

4 recalculate the centers

5 reassign the instances to the new centers

repeat 2-5 until the total variance J stops decreasing (or the centers stop moving).
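The steps above can be sketched in Python (a minimal illustration; the function name `k_means` and the convergence check are my own, not from the handout):

```python
import numpy as np

def k_means(X, K, iters=100, seed=0):
    """Minimal k-means: alternate assignment and center updates."""
    rng = np.random.default_rng(seed)
    # step 1: use K randomly chosen instances as initial centers
    centres = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # steps 3/5: assign every instance to its nearest center
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # steps 2/4: recompute each center as the mean of its cluster
        new_centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centres, centres):  # centers stopped moving
            break
        centres = new_centres
    # total variance J: sum of squared distances to the assigned centers
    J = float(((X - centres[labels]) ** 2).sum())
    return centres, labels, J
```

With well-separated data the loop converges in a few iterations; since k-means can get stuck in a local minimum of J, several random restarts are common in practice.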


k-means clustering: illustration


Kohonen network (Self-organizing feature map - SOFM)


SOFM: main idea

(Figure: SOFM architecture: inputs x1, x2, ..., xM connected by weights w11, w12, ..., wNM to a grid of output nodes 1, 2, ..., N; panels a and b show the weight vector of a node j moving from its position at t = 0 towards the data at t1 and t2 during training.)


SOFM: algorithm (1)

0 Initialize weights, normally with small random values.

Set topological neighborhood parameters.

Set learning rate parameters.

Iteration number t = 1.

1 While the stopping condition is false, do iteration t (steps 2-8):

2 For each input vector x = {x1, ..., xN} do steps 3-8:

3 For each output node k calculate the similarity measure (in this case the Euclidean distance) between the input and the weight vector:

\[ D(k) = \sum_{i=1}^{N} (w_{ik} - x_i)^2 \]


SOFM: algorithm (2)

4 Find the index kmax such that D(k) is a minimum; this will refer to the winning node.

5 Update the weights for the node kmax and for all nodes k within a specified neighborhood radius r from kmax:

\[ w_{ik}(t+1) = w_{ik}(t) + \alpha(t)\, N(r,t)\, [x_i - w_{ik}(t)] \]

6 Update the learning rate \(\alpha(t)\)

7 Reduce the radius r used in the neighborhood function N (this can be done less frequently than at each iteration).

8 Test the stopping condition.
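A compact one-dimensional version of steps 0-8 might look as follows (a sketch under my own assumptions: a Gaussian neighborhood function and linearly decaying learning rate and radius; the name `train_sofm` is hypothetical, not from the handout):

```python
import numpy as np

def train_sofm(X, n_nodes=10, epochs=50, lr0=0.5, r0=3.0, seed=0):
    """Train a 1-D self-organizing feature map on data X (samples x features)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # step 0: initialize weights with small random values
    W = rng.normal(scale=0.1, size=(n_nodes, n_features))
    for t in range(epochs):                      # step 1: iterate
        lr = lr0 * (1 - t / epochs)              # step 6: decaying learning rate
        r = max(r0 * (1 - t / epochs), 0.5)      # step 7: shrinking radius
        for x in X:                              # step 2: for each input vector
            D = ((W - x) ** 2).sum(axis=1)       # step 3: distance to each node
            k_max = int(D.argmin())              # step 4: winning node
            # step 5: update the winner and its neighbours (Gaussian weighting
            # plays the role of the neighborhood function N)
            grid_dist = np.abs(np.arange(n_nodes) - k_max)
            h = np.exp(-(grid_dist ** 2) / (2 * r ** 2))
            W += lr * h[:, None] * (x - W)
    return W
```

After training, the weight vectors of the output nodes approximate the distribution of the input data.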


SOFM: example

Input set: points sampled randomly in a square (the probability of sampling a point in the central square region was 20 times greater than elsewhere in the square)

The target space is discrete and includes 100 output nodes arranged in 2 dimensions

SOFM is able to find the cluster: the area where the points concentrate


SOFM: visualisation and interpretation

count maps, the easiest and most widely used method: a plot showing, for each output node, the number of times it was the winning one; it can be interpolated into colour shading as well

distance matrix (of size K x K) whose elements are the Euclidean distances of each output unit to its immediate neighbouring units


SOFM: visualization and interpretation

vector position or cluster maps:

colours are coded according to their similarity in the input space

each dot corresponds to one output map unit

each map unit is connected to its neighbours by a line


SOFM: visualization and interpretation

vector position or cluster maps: in 3D


Instance-based learning (lazy learning)


Lazy and eager learning

Eager learning:

first ML (data-driven) model is built

then it is tested and used

Lazy learning:

no ML model is built (i.e. lazy)

when new examples come, the output is generated immediately on the basis of the training examples

Other names for lazy learning:

Instance-based

Exemplar-based

Case-based

Experience-based

Edited k-nearest neighbor


k-Nearest neighbors method: classification

instances are points in 2-dim. space, the output is boolean (+ or -)

a new instance xq is classified w.r.t. the proximity of the nearest training instances:

to class + (if 1 neighbor is considered)

to class - (if 4 neighbors are considered)

for discrete-valued outputs, assign the most common value

Voronoi diagram for 1-Nearest neighbor


Notations

instance x is written as {a1(x), ..., an(x)}, where ar(x) denotes the value of the r-th attribute of instance x.

the distance between two instances xi and xj is defined to be d(xi, xj), where

\[ d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2 } \]


k-Nearest neighbor algorithm

Training

Build the set of training examples D.

Classification

Given a query instance xq to be classified,

let x1, ..., xk denote the k instances from D that are nearest to xq. Return

\[ F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)) \]

where \(\delta(a, b) = 1\) if a = b, and \(\delta(a, b) = 0\) otherwise;

V = {v1, ..., vs} is the set of possible output values.
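As a sketch (the helper name `knn_classify` is mine, not from the handout), the classification step amounts to a majority vote among the k nearest training instances:

```python
import numpy as np
from collections import Counter

def knn_classify(D_X, D_y, x_q, k=3):
    """Return the most common class among the k training instances nearest to x_q."""
    dists = np.linalg.norm(D_X - x_q, axis=1)  # d(x_q, x_i) for every training instance
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbours
    # F(x_q) = argmax_v sum_i delta(v, f(x_i)): the majority vote
    return Counter(D_y[i] for i in nearest).most_common(1)[0][0]
```

Note that all computation happens at query time: nothing is "trained" beyond storing D, which is what makes the method lazy.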


k-Nearest neighbors: regression (target function is real-valued)

model a real-valued target function \(F: \mathbb{R}^n \to \mathbb{R}\).

instances are points in n-dim. space, the output is a real number

a new instance xq is valued w.r.t.:

the values of the nearest training instances (the average of k instances is taken, or the weighted average)

the values and proximity of the nearest training instances (a locally weighted regression model is built and used to predict the value of the new instance)

In this case the final line of the k-NN algorithm should be replaced by the line

\[ F(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k} \]
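A minimal sketch of the simple-averaging variant (the helper name `knn_regress` is mine, not from the handout):

```python
import numpy as np

def knn_regress(D_X, D_y, x_q, k=3):
    """Return the average target value of the k training instances nearest to x_q."""
    dists = np.linalg.norm(D_X - x_q, axis=1)  # d(x_q, x_i) for every training instance
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbours
    return float(np.mean(D_y[nearest]))        # F(x_q) = (1/k) * sum f(x_i)
```

Replacing the plain mean with a distance-weighted mean gives the weighted-average variant mentioned above.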


Distance weighted k-NN algorithm