Handouts on Data-driven Modelling, part 3 (UNESCO-IHE)



Data-driven modelling in water-related problems. PART 3

    Dimitri P. Solomatine

    www.ihe.nl/hi/sol sol@ihe.nl

UNESCO-IHE Institute for Water Education, Hydroinformatics Chair


Finding groups (clusters) in data (unsupervised learning)


Clustering

Classification is aimed at identifying a mapping (function) that maps any given input xi to a nominal variable (class) yi.

Finding groups (clusters) in an input data set is clustering.

Clustering is often the preparation phase for classification:

the identified clusters can be labelled as classes; each input instance can then be associated with an output value (class), and the set of instances {xi, yi} can be built.

[Figure: a) an input data set; b) the same data grouped into Cluster 1, Cluster 2 and Cluster 3]


Reasons to use clustering

labelling large data sets can be very costly;

clustering may actually give an insight into the data and help discover classes which are not known in advance;

clustering may find features that can be used for categorization.


Voronoi diagrams


Methods for clustering

partition-based clustering (K-means, fuzzy C-means; based on Euclidean distance);

hierarchical clustering (agglomerative hierarchical clustering, nearest-neighbour algorithm);

feature extraction methods: principal component analysis (PCA), self-organizing feature (SOF) maps (also referred to as Kohonen neural networks).


k-means clustering

Find the best division of N samples into K clusters C_i such that the total distance between the clustered samples and their respective centers (that is, the total variance) is minimized:

$$J = \sum_{i=1}^{K} \sum_{n \in C_i} \| x_n - \mu_i \|^2$$

where $\mu_i$ is the center of cluster i.
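For concreteness, J can be evaluated directly for any candidate partition. A minimal sketch in Python (numpy assumed; total_variance is an illustrative name, not from the handout):

```python
import numpy as np

def total_variance(X, labels, centers):
    """J: total squared distance of all samples to their respective cluster centers."""
    return sum(
        float(np.sum((X[labels == i] - centers[i]) ** 2))
        for i in range(len(centers))
    )
```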

    D.P. Solomatine. Data-driven modelling (part 3). 8

k-means clustering: algorithm

1 Randomly assign instances to the clusters.

2 Compute the centers according to

$$\mu_i = \frac{1}{N_i} \sum_{n \in C_i} x_n$$

3 Reassign the instances to the nearest cluster center.

4 Recalculate the centers.

5 Reassign the instances to the new centers.

Repeat steps 2-5 until the total variance J stops decreasing (or the centers stop moving).
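A runnable sketch of this loop in Python (numpy assumed; k_means and its parameters are illustrative, and the initialization here picks k random instances as starting centers rather than randomly assigning instances):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Cluster the (N, d) array X into k clusters; returns centers and labels."""
    rng = np.random.default_rng(seed)
    # initialization: use k randomly chosen instances as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # steps 3/5: assign every instance to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # steps 2/4: recompute each center as the mean of its instances
        # (an empty cluster keeps its previous center in this simple sketch)
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):   # stop when centers stop moving
            break
        centers = new_centers
    return centers, labels
```

For example, `centers, labels = k_means(np.random.rand(200, 2), k=3)` groups 200 random 2-D points into three clusters.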


k-means clustering: illustration


Kohonen network (Self-organizing feature map - SOFM)


SOFM: main idea

[Figure: a) SOFM architecture: inputs x1, x2, ..., xM connected by weights w11, w12, ..., wNM to an N x M grid of output nodes; b) the neighbourhood of a winning node j shrinking over the iterations: j(0), j(t1), j(t2)]


SOFM: algorithm (1)

0 Initialize weights, normally with small random values.

Set topological neighborhood parameters.

Set learning rate parameters.

Iteration number t = 1.

1 While the stopping condition is false, do iteration t (steps 2-8):

2 For each input vector x = {x1, ..., xN} do steps 3-8:

3 For each output node k calculate the similarity measure (in this case the Euclidean distance) between the input and the weight vector:

$$D(k) = \sum_{i=1}^{N} (w_{ik} - x_i)^2$$


SOFM: algorithm (2)

4 Find the index kmax such that D(k) is a minimum; this will refer to the winning node.

5 Update the weights for the node kmax and for all nodes k within a specified neighborhood radius r from kmax:

$$w_{ik}(t+1) = w_{ik}(t) + \alpha(t) \, N(r, t) \, [x_i - w_{ik}(t)]$$

6 Update the learning rate α(t).

7 Reduce the radius r used in the neighborhood function N (this can be done less frequently than at each iteration).

8 Test the stopping condition.
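Putting steps 0-8 together, a compact training loop might look as follows. This is a sketch, not the handout's implementation: numpy is assumed, train_sofm and its parameters are illustrative, and a Gaussian neighbourhood with exponentially decaying learning rate and radius is just one common choice for N(r, t) and α(t):

```python
import numpy as np

def train_sofm(X, grid=(10, 10), iters=100, lr0=0.5, r0=5.0, seed=0):
    """Train an SOFM on the (N, d) array X; returns the weight matrix W."""
    rng = np.random.default_rng(seed)
    n_nodes = grid[0] * grid[1]
    # step 0: initialize weights with small random values
    W = rng.normal(scale=0.1, size=(n_nodes, X.shape[1]))
    # (row, col) position of every output node on the 2-D map
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], dtype=float)
    for t in range(iters):                      # step 1: iterate
        lr = lr0 * np.exp(-t / iters)           # step 6: decaying learning rate alpha(t)
        r = r0 * np.exp(-t / iters)             # step 7: shrinking neighborhood radius
        for x in X:                             # step 2: for each input vector
            D = ((W - x) ** 2).sum(axis=1)      # step 3: distance D(k) to every node
            kmax = int(D.argmin())              # step 4: winning node
            # step 5: Gaussian neighborhood around the winner on the map grid
            d_map = np.linalg.norm(coords - coords[kmax], axis=1)
            h = np.exp(-(d_map ** 2) / (2 * r ** 2))
            W += lr * h[:, None] * (x - W)      # w(t+1) = w(t) + alpha(t) N(r,t) [x - w(t)]
    return W
```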


SOFM: example

Input set: points sampled randomly in a square (the probability of sampling a point in the central square region was 20 times greater than elsewhere in the square).

The target space is discrete and includes 100 output nodes arranged in 2 dimensions.

SOFM is able to find the cluster: the area of the points' concentration.


SOFM: visualisation and interpretation

count maps, which are the easiest and most widely used method. This is a plot showing, for each output node, the number of times it was the winning one. It can be interpolated into colour shading as well;

distance matrix (of size K x K) whose elements are the Euclidean distances of each output unit to its immediate neighbouring units.


SOFM: visualization and interpretation

vector position or cluster maps:

colours are coded according to their similarity in the input space

each dot corresponds to one output map unit

each map unit is connected to its neighbours by a line


SOFM: visualization and interpretation

vector position or cluster maps: in 3D


Instance-based learning (lazy learning)


Lazy and eager learning

Eager learning:

first the ML (data-driven) model is built

then it is tested and used

Lazy learning:

no ML model is built (i.e., lazy)

when new examples come, the output is generated immediately on the basis of the training examples

    Other names for lazy learning:

    Instance-based

    Exemplar-based

    Case-based

    Experience-based

    Edited k-nearest neighbor


k-Nearest neighbors method: classification

instances are points in 2-dim. space, the output is boolean (+ or -)

a new instance xq is classified w.r.t. the proximity of the nearest training instances:

to class + (if 1 neighbor is considered)

to class - (if 4 neighbors are considered)

for discrete-valued outputs, assign the most common value

[Figure: Voronoi diagram for 1-Nearest neighbor]


Notations

an instance x is described as {a1(x), ..., an(x)}, where ar(x) denotes the value of the r-th attribute of instance x.

The distance between two instances xi and xj is defined to be d(xi, xj), where

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$$
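In code this distance is a one-liner; a sketch assuming numpy, with an instance represented as the array of its n attribute values:

```python
import numpy as np

def distance(x_i, x_j):
    """Euclidean distance d(xi, xj) over the n attribute values of two instances."""
    return float(np.sqrt(np.sum((np.asarray(x_i) - np.asarray(x_j)) ** 2)))
```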


k-Nearest neighbor algorithm

Training

Build the set of training examples D.

Classification

Given a query instance xq to be classified:

let x1 ... xk denote the k instances from D that are nearest to xq; return

$$\hat{F}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta\left( v, f(x_i) \right)$$

where δ(a, b) = 1 if a = b, δ(a, b) = 0 otherwise, and

V = {v1, ..., vs} is the set of possible output values.
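A minimal sketch of this classification rule in Python (numpy assumed; knn_classify is an illustrative name, not from the handout):

```python
import numpy as np
from collections import Counter

def knn_classify(D_x, D_y, x_q, k=3):
    """Classify query x_q by majority vote among its k nearest training instances.

    D_x: (N, n) array of training instances; D_y: length-N sequence of class labels.
    """
    # Euclidean distance d(x_q, x_i) to every training instance
    dists = np.sqrt(((D_x - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k nearest instances
    votes = Counter(D_y[i] for i in nearest)  # count class labels among neighbours
    return votes.most_common(1)[0][0]         # argmax over v in V
```

For example, `knn_classify(np.array([[0, 0], [1, 1], [5, 5]]), ['+', '+', '-'], np.array([0.5, 0.5]), k=1)` returns '+'.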


k-Nearest neighbors: regression (target function is real-valued)

model a real-valued target function F: R^n → R

instances are points in n-dim. space, the output is a real number

a new instance xq is valued w.r.t.:

the values of the nearest training instances (the average of the k instances is taken, or the weighted average)

the values and proximity of the nearest training instances (a locally weighted regression model is built and used to predict the value of the new instance)

In this case the final line of the k-NN algorithm should be replaced by the line

$$\hat{F}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$$
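A sketch of the regression variant under the same assumptions as the classification sketch above (knn_regress is an illustrative name; D_y now holds the real-valued outputs f(xi)):

```python
import numpy as np

def knn_regress(D_x, D_y, x_q, k=3):
    """Predict a real value for x_q as the mean output of its k nearest training instances."""
    dists = np.sqrt(((D_x - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return float(np.mean(np.asarray(D_y)[nearest]))  # F(x_q) = (1/k) * sum of f(x_i)
```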


Distance weighted k-NN algorithm