Handouts on Data-driven Modelling, part 3 (UNESCO-IHE)



Data-driven modelling in water-related problems. PART 3

    Dimitri P. Solomatine

    www.ihe.nl/hi/sol [email protected]

UNESCO-IHE Institute for Water Education, Hydroinformatics Chair

    D.P. Solomatine. Data-driven modelling (part 3). 2

Finding groups (clusters) in data (unsupervised learning)


    D.P. Solomatine. Data-driven modelling (part 3). 3

Clustering

Classification is aimed at identifying a mapping (function) that maps any given input xi to a nominal variable (class) yi.

Finding the groups (clusters) in an input data set is clustering.

Clustering is often the preparation phase for classification:

the identified clusters can be labelled as classes; each input instance can then be associated with an output value (class), and the set of instances {xi, yi} can be built

[Figure: example data set with three identified clusters (Cluster 1, Cluster 2, Cluster 3)]

    D.P. Solomatine. Data-driven modelling (part 3). 4

Reasons to use clustering

labelling large data sets can be very costly;

clustering may actually give an insight into the data and help discover classes which are not known in advance;

clustering may find features that can be used for categorization.


    D.P. Solomatine. Data-driven modelling (part 3). 5

Voronoi diagrams

    D.P. Solomatine. Data-driven modelling (part 3). 6

Methods for clustering

partition-based clustering (K-means, fuzzy C-means, based on Euclidean distance);

hierarchical clustering (agglomerative hierarchical clustering, nearest-neighbour algorithm);

feature extraction methods: principal component analysis (PCA), self-organizing feature (SOF) maps (also referred to as Kohonen neural networks).


    D.P. Solomatine. Data-driven modelling (part 3). 7

k-means clustering

find the best division of N samples into K clusters Ci such that the total distance between the clustered samples and their respective centers (that is, the total variance) is minimized:

$$J = \sum_{i=1}^{K} \sum_{n \in C_i} \| x_n - \mu_i \|^2$$

where μi is the center of cluster i.

    D.P. Solomatine. Data-driven modelling (part 3). 8

k-means clustering: algorithm

1 randomly assign instances to the clusters

2 compute the centers according to the formula below

3 reassign the instances to the nearest cluster centers

4 recalculate centers

5 reassign the instances to the new centers

repeat 2-5 until the total variance J stops decreasing (or the centers stop moving).

$$\mu_i = \frac{1}{N_i} \sum_{n \in C_i} x_n$$

where Ni is the number of instances in cluster Ci.
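A minimal NumPy sketch of the k-means loop above; the names and the stopping test are illustrative, and empty clusters are not handled.

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (N, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))      # step 1: random assignment
    for _ in range(max_iter):
        # steps 2/4: each center is the mean of the instances in its cluster
        centers = np.array([X[labels == i].mean(axis=0) for i in range(K)])
        # steps 3/5: reassign every instance to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # centers stopped moving
            break
        labels = new_labels
    return labels, centers
```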


    D.P. Solomatine. Data-driven modelling (part 3). 9

k-means clustering: illustration

    D.P. Solomatine. Data-driven modelling (part 3). 10

Kohonen network (Self-organizing feature map - SOFM)


    D.P. Solomatine. Data-driven modelling (part 3). 11

SOFM: main idea

[Figure: inputs x1 ... xM are connected through weights w11 ... wNM to output nodes arranged in a grid; during training the weight vector of node j moves through positions j(0), j(t1), j(t2) in the input space]

    D.P. Solomatine. Data-driven modelling (part 3). 12

SOFM: algorithm (1)

0 Initialize weights, normally with small random values.

Set topological neighborhood parameters.

Set learning rate parameters.

Iteration number t = 1.

1 While the stopping condition is false, do iteration t (steps 2-8):

2 For each input vector x = {x1, ..., xN} do steps 3-8:

3 For each output node k calculate the similarity measure (in this case the Euclidean distance) between the input and the weight vector:

$$D(k) = \sum_{i=1}^{N} (w_{ik} - x_i)^2$$


    D.P. Solomatine. Data-driven modelling (part 3). 13

SOFM: algorithm (2)

4 Find the index kmax such that D(k) is a minimum; this identifies the winning node.

5 Update the weights for the node kmax and for all nodes k within a specified neighborhood radius r from kmax:

$$w_{ik}(t+1) = w_{ik}(t) + \alpha(t)\, N(r, t)\, [x_i - w_{ik}(t)]$$

6 Update the learning rate α(t).

7 Reduce the radius r used in the neighborhood function N (this can be done less frequently than at each iteration).

8 Test the stopping condition.
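A compact sketch of steps 0-8 under common assumptions (exponentially decaying learning rate, Gaussian neighborhood function, linearly shrinking radius); all names and schedules are illustrative.

```python
import numpy as np

def train_sofm(X, grid=(10, 10), iters=5000, lr0=0.5, r0=5.0, seed=0):
    """SOFM sketch: X is (S, N); returns weights of shape (rows, cols, N)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = 0.1 * rng.random((rows, cols, X.shape[1]))          # step 0: small random weights
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)  # grid positions of the nodes
    for t in range(iters):
        x = X[rng.integers(len(X))]                    # step 2: pick an input vector
        D = ((W - x) ** 2).sum(axis=2)                 # step 3: distance to every node
        kmax = np.unravel_index(D.argmin(), D.shape)   # step 4: winning node
        lr = lr0 * np.exp(-t / iters)                  # step 6: decaying learning rate
        r = max(r0 * (1.0 - t / iters), 1.0)           # step 7: shrinking radius
        g = ((coords - np.array(kmax)) ** 2).sum(axis=2)
        N_rt = np.exp(-g / (2.0 * r * r))              # Gaussian neighborhood N(r, t)
        W += lr * N_rt[:, :, None] * (x - W)           # step 5: weight update
    return W
```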

    D.P. Solomatine. Data-driven modelling (part 3). 14

SOFM: example

Input set: points sampled randomly in a square (the probability of sampling a point in the central square region was 20 times greater than elsewhere in the square)

The target space is discrete and includes 100 output nodes arranged in 2 dimensions

SOFM is able to find the cluster: the area where the points concentrate


    D.P. Solomatine. Data-driven modelling (part 3). 15

SOFM: visualisation and interpretation

count maps, the easiest and most widely used method: a plot showing, for each output node, the number of times it was the winning one. It can be interpolated into colour shading as well

distance matrix (of size K x K) whose elements are the Euclidean distances of each output unit to its immediate neighbouring units

    D.P. Solomatine. Data-driven modelling (part 3). 16

SOFM: visualization and interpretation

vector position or cluster maps:

colours are coded according to their similarity in the input space

each dot corresponds to one output map unit

each map unit is connected to its neighbours by lines


    D.P. Solomatine. Data-driven modelling (part 3). 17

SOFM: visualization and interpretation

vector position or cluster maps: in 3D

    D.P. Solomatine. Data-driven modelling (part 3). 18

Instance-based learning (lazy learning)


    D.P. Solomatine. Data-driven modelling (part 3). 19

Lazy and eager learning

Eager learning:

first the ML (data-driven) model is built

then it is tested and used

Lazy learning:

no ML model is built (i.e. "lazy")

when new examples come, the output is generated immediately on the basis of the training examples

    Other names for lazy learning:

    Instance-based

    Exemplar-based

    Case-based

    Experience-based

    Edited k-nearest neighbor

    D.P. Solomatine. Data-driven modelling (part 3). 20

k-Nearest neighbors method: classification

instances are points in 2-dim. space, output is boolean (+ or -)

a new instance xq is classified w.r.t. the proximity of the nearest training instances:

to class + (if 1 neighbor is considered)

to class - (if 4 neighbors are considered)

for discrete-valued outputs assign the most common value

Voronoi diagram for 1-Nearest neighbor


    D.P. Solomatine. Data-driven modelling (part 3). 21

Notations

instance x is described as {a1(x), ..., an(x)}, where ar(x) denotes the value of the r-th attribute of instance x.

the distance between two instances xi and xj is defined to be d(xi, xj), where

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \big(a_r(x_i) - a_r(x_j)\big)^2}$$

    D.P. Solomatine. Data-driven modelling (part 3). 22

k-Nearest neighbor algorithm

Training

Build the set of training examples D.

Classification

Given a query instance xq to be classified,

let x1, ..., xk denote the k instances from D that are nearest to xq. Return

$$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise;

V = {v1, ..., vs} is the set of possible output values.
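A minimal sketch of the classification step above (Euclidean distance, majority vote); D_X, D_y and x_q are illustrative names.

```python
import numpy as np
from collections import Counter

def knn_classify(D_X, D_y, x_q, k=3):
    """k-NN classification: D_X is (S, n) training inputs, D_y the class
    labels, x_q the query instance; returns the most common neighbor class."""
    dist = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))   # d(x_q, x) for all training x
    nearest = np.argsort(dist)[:k]                   # indices of the k nearest instances
    return Counter(D_y[i] for i in nearest).most_common(1)[0][0]
```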


    D.P. Solomatine. Data-driven modelling (part 3). 23

k-Nearest neighbors: regression (target function is real-valued)

model a real-valued target function F: R^n -> R.

instances are points in n-dim. space, the output is a real number

a new instance xq is valued w.r.t.:

the values of the nearest training instances (the average of k instances is taken, or the weighted average)

the values and proximity of the nearest training instances (a locally weighted regression model is built and used to predict the value of the new instance)

In this case the final line of the k-NN algorithm should be replaced by the line

$$F(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$$

    D.P. Solomatine. Data-driven modelling (part 3). 24

Distance-weighted k-NN algorithm (classification)

weigh the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight wi to closer neighbors

This can be accomplished by replacing the final line in the algorithm by

$$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i\, \delta(v, f(x_i))$$

where the weight is

$$w_i = \frac{1}{d(x_q, x_i)^2}$$


    D.P. Solomatine. Data-driven modelling (part 3). 25

Distance-weighted k-NN algorithm (numerical prediction)

for real-valued output this is accomplished by replacing the final line in the algorithm by

$$F(x_q) = \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i}$$

where the weight is

$$w_i = \frac{1}{d(x_q, x_i)^2}$$
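A sketch of the distance-weighted regression variant just defined; the small eps guarding against a zero distance is an added assumption.

```python
import numpy as np

def knn_regress(D_X, D_y, x_q, k=3, eps=1e-12):
    """Distance-weighted k-NN regression: weighted average of the k nearest
    training outputs with w_i = 1 / d(x_q, x_i)^2."""
    dist = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    w = 1.0 / (dist[nearest] ** 2 + eps)   # eps avoids division by zero when d = 0
    return (w * D_y[nearest]).sum() / w.sum()
```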

    D.P. Solomatine. Data-driven modelling (part 3). 26

k-Nearest neighbors: using all examples

for classification:

$$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{\text{all instances}} w_i\, \delta(v, f(x_i))$$

for regression:

$$F(x_q) = \frac{\sum_{i=1}^{\text{all instances}} w_i\, f(x_i)}{\sum_{i=1}^{\text{all instances}} w_i}$$


    D.P. Solomatine. Data-driven modelling (part 3). 27

k-Nearest neighbors: comments

k-NN creates a local model in the proximity of the new instance, instead of a global model of all training instances

robust to noisy training data

requires a considerable amount of data

the distance between instances is calculated based on all attributes (and not on 1 as in decision trees). Possible problem: imagine instances described by 20 attributes, of which only 2 are relevant to the target function

curse of dimensionality: the nearest neighbor method is easily misled when X is high-dimensional

solution: stretch the j-th axis by a weight zj chosen to minimize the prediction error

as the number of training instances approaches infinity, k-NN approaches Bayesian optimal classification

    D.P. Solomatine. Data-driven modelling (part 3). 28

Locally weighted regression (1)

construct an explicit approximation F(x) of the target function f(x) over a local region surrounding the new query point xq

If F(x) is linear, then this is called locally weighted linear regression:

$$F(x) = w_0 + w_1 a_1(x) + \ldots + w_n a_n(x)$$

Instead of minimizing the global error E, here the local error E(xq) has to be minimized


    D.P. Solomatine. Data-driven modelling (part 3). 29

Locally weighted regression (2)

Various approaches to minimizing the error E(xq):

1 Minimize the squared error over just the k nearest neighbors:

$$E_1(x_q) = \frac{1}{2} \sum_{x \,\in\, k \text{ nearest nbrs of } x_q} \big(f(x) - F(x)\big)^2$$

2 Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:

$$E_2(x_q) = \frac{1}{2} \sum_{x \in D} \big(f(x) - F(x)\big)^2\, K\big(d(x_q, x)\big)$$

3 Combine 1 and 2 (to reduce computational costs):

$$E_3(x_q) = \frac{1}{2} \sum_{x \,\in\, k \text{ nearest nbrs of } x_q} \big(f(x) - F(x)\big)^2\, K\big(d(x_q, x)\big)$$
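A sketch of approach 3 (weighted least squares on the k nearest neighbors); the Gaussian kernel K and its bandwidth are illustrative choices.

```python
import numpy as np

def lwr_predict(D_X, D_y, x_q, k=10):
    """Locally weighted linear regression: fit F(x) = w0 + w1 a1(x) + ...
    on the k nearest neighbors of x_q, weighting errors by a kernel K."""
    dist = np.sqrt(((D_X - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]
    h = np.median(dist[nearest]) + 1e-12            # bandwidth of the kernel
    Kw = np.exp(-(dist[nearest] / h) ** 2)          # decreasing function of distance
    A = np.hstack([np.ones((k, 1)), D_X[nearest]])  # design matrix with intercept
    sw = np.sqrt(Kw)                                # weighted least squares via sqrt weights
    coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * D_y[nearest], rcond=None)
    return np.r_[1.0, x_q] @ coef                   # evaluate F at the query point
```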

    D.P. Solomatine. Data-driven modelling (part 3). 30

Case-based reasoning (CBR)

instance-based learning, but the output is not real-valued: it is represented by symbolic descriptions

the methods used to retrieve similar instances are more elaborate (not just Euclidean distance)

Applications:

conceptual design of mechanical devices based on a stored library of previous designs (Sycara, 1992)

new legal cases based on previous rulings (Ashley, 1990)

selection of an appropriate hydrological model based on previous experience (Kukuric, 1997, PhD at IHE)


    D.P. Solomatine. Data-driven modelling (part 3). 31

Remarks on Lazy and Eager learning

Lazy methods: k-NN, locally weighted regression, CBR

Eager learners are "eager": before they observe the testing instance xq they have already built a global approximation of the target function.

Lazy learners:

defer the decision of how to generalize beyond the training data until each new instance is encountered,

when new examples come, the output is generated immediately on the basis of the nearest training examples

Lazy learners have a richer set of hypotheses: they select an appropriate hypothesis (e.g. a linear function) for each new instance

So lazy methods are better suited to customize to unknown future instances

    D.P. Solomatine. Data-driven modelling (part 3). 32

Fuzzy rule-based systems


    D.P. Solomatine. Data-driven modelling (part 3). 33

Fuzzy logic

introduced in 1965 by Lotfi Zadeh, University of California, Berkeley

Boolean logic is two-valued (False, True). Fuzzy logic is multi-valued (False ... AlmostFalse ... AlmostTrue ... True)

Fuzzy set theory deals with the degree of truth that an outcome belongs to a certain category (partial truth)

a fuzzy set A on a universe U: for any u ∈ U there is a corresponding real number μA(u) ∈ [0, 1] called the grade of membership of u belonging to A

the mapping μA: U → [0, 1] is called the membership function of A

    D.P. Solomatine. Data-driven modelling (part 3). 34

Example of an ordinary and a fuzzy set "tall people"


    D.P. Solomatine. Data-driven modelling (part 3). 35

Various shapes of membership functions

[α⁻, α⁺] is the support of the fuzzy set, α¹ is its kernel

[Figure: four membership function shapes: a) triangular, b) bell-shaped, c) dome-shaped, d) inverted cycloid]

    D.P. Solomatine. Data-driven modelling (part 3). 36

Example of a membership function "appropriate water level in the reservoir"

[Figure: membership function with its support and kernel indicated]


    D.P. Solomatine. Data-driven modelling (part 3). 37

Alpha-cut

The α-cut of a fuzzy set is the crisp set of all u with μA(u) ≥ α. Example: 0.5-cut = [4.5, 7.0]

    D.P. Solomatine. Data-driven modelling (part 3). 38

Fuzzy numbers

Special cases of fuzzy sets are fuzzy numbers.

A fuzzy subset A of the set of real numbers is called a fuzzy number if:

there is at least one z such that μA(z) = 1 (normality assumption)

for all real numbers a, b, c with a < c < b:

μA(c) ≥ min(μA(a), μA(b))

(convexity assumption, meaning that the membership function of a fuzzy number consists of an increasing and a decreasing part, and possibly flat parts)


    D.P. Solomatine. Data-driven modelling (part 3). 39

Linguistic variable: example

[Figure: the linguistic variable WATER LEVEL takes fuzzy values such as "Enough volume for flood detention", "Navigable" and "Environmentally friendly", linked by compatibility links (fuzzy restrictions) to the base variable water level (m) on the universe 0-50]

A linguistic variable can take linguistic values (like low, high, navigable) associated with fuzzy subsets M of the universe U (here U = [0, 50])

    D.P. Solomatine. Data-driven modelling (part 3). 40

Operations on fuzzy sets


    D.P. Solomatine. Data-driven modelling (part 3). 41

Fuzzy rules

Fuzzy rules are linguistic constructs of the type

IF A THEN B

where A and B are collections of propositions containing linguistic variables (i.e. variables with linguistic values). A is called the premise and B is the consequence of the rule.

If there are K premises in a system, the i-th rule has the form:

If a1 is A_{i,1} ⊗ a2 is A_{i,2} ⊗ ... ⊗ aK is A_{i,K} then B_i

where a is a crisp input, A and B are linguistic variables, and ⊗ is one of the operators AND, OR, XOR.

    D.P. Solomatine. Data-driven modelling (part 3). 42

Additive model of combining rules


    D.P. Solomatine. Data-driven modelling (part 3). 43

Fuzzy rule-based systems (FS)

use linguistic variables based on fuzzy logic

based on encoding relationships between variables in the form of rules

rules are generated through the analysis of large data samples

such rules are used to produce the values of the output variables given new input values

    D.P. Solomatine. Data-driven modelling (part 3). 44

Example: Fuzzy rules in control

[Figure: membership functions STOP, SLOW, MEDIUM, FAST, BLAST over AIR MOTOR SPEED (0-100) and COLD, COOL, RIGHT, WARM, HOT over TEMPERATURE °C (5-35); the fired rule responses are combined by the weighted sum or the crested weighted sum method and defuzzified using the centroid of the area]

rules like: IF Temperature is Cool THEN AirMotorSpeed := Slow

If Cold, then stop. If Cool, then slow. If Right, then medium. If Warm, then fast. If Hot, then blast.

Input: Temperature = 22. What will be the AirMotorSpeed?

Temperature is RIGHT with degree of fulfillment (DOF) = 0.6 and WARM with DOF = 0.2

two rules are fired


    D.P. Solomatine. Data-driven modelling (part 3). 45

Combining premises in a rule

Degree of fulfillment (DOF) is the extent to which the premise (left) part of a fuzzy rule is satisfied

The means to combine the memberships of the inputs to the corresponding fuzzy sets into a DOF is called inference

Product inference for rule i is defined as:

$$DOF_i = \prod_{k=1}^{K} \mu_{A_{i,k}}(a_k)$$

(the rule is sensitive to the change in the amount of truth contained in each premise)

Minimum inference for rule i is defined as:

$$DOF_i = \min_{k=1..K} \mu_{A_{i,k}}(a_k)$$

    D.P. Solomatine. Data-driven modelling (part 3). 46

Combining rules: example for 2 inputs

[Figure: rule table combining the fuzzy values (L, M, H) of two inputs into the fuzzy value of the output]


    D.P. Solomatine. Data-driven modelling (part 3). 47

Combining rules: weighted sum combination

the weighted sum combination uses the DOF of each rule as a weight

If there are I rules, each having a response fuzzy set Bi with DOF ωi, the combined membership function is

$$\mu_B(x) = \frac{\sum_{i=1}^{I} \omega_i\, \mu_{B_i}(x)}{\displaystyle \max_u \sum_{i=1}^{I} \omega_i\, \mu_{B_i}(u)}$$

[Figure: the weighted sum combination method over AIR MOTOR SPEED (0-100)]

    D.P. Solomatine. Data-driven modelling (part 3). 48

Combining rules: crested weighted sum combination

the crested weighted sum combination is obtained when each output membership function is clipped off at a height corresponding to the rule's degree of fulfillment

If there are I rules, each having a response fuzzy set Bi with DOF ωi, the combined membership function is

$$\mu_B(x) = \frac{\sum_{i=1}^{I} \min\big(\omega_i,\, \mu_{B_i}(x)\big)}{\displaystyle \max_u \sum_{i=1}^{I} \min\big(\omega_i,\, \mu_{B_i}(u)\big)}$$

[Figure: the crested weighted sum combination method over AIR MOTOR SPEED (0-100)]


    D.P. Solomatine. Data-driven modelling (part 3). 49

Combining rules: defuzzification

Defuzzification is a mapping from the fuzzy combination of consequences Bi to a crisp consequence

this is actually the identification of the fuzzy mean

the most widely used method is:

find the centroid (center of gravity) of the area below the membership function and take its abscissa as the crisp output.

[Figure: defuzzification of the weighted sum combination using the centroid of the area, over AIR MOTOR SPEED (0-100)]
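A minimal end-to-end sketch of the chain above (triangular membership functions, weighted sum combination, centroid defuzzification); the rule parameters and temperatures are illustrative, not taken from the slides.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support (a, c) and kernel b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# illustrative rules: (temperature premise, motor-speed response)
rules = [
    ((15.0, 20.0, 25.0), (20.0, 40.0, 60.0)),   # IF Temperature is Right THEN Medium
    ((20.0, 25.0, 30.0), (40.0, 60.0, 80.0)),   # IF Temperature is Warm  THEN Fast
]

def infer(temp):
    speed = np.linspace(0.0, 100.0, 501)          # discretized output universe
    combined = np.zeros_like(speed)
    for (ta, tb, tc), (sa, sb, sc) in rules:
        dof = tri(np.asarray(temp, dtype=float), ta, tb, tc)  # DOF of the rule
        combined += dof * tri(speed, sa, sb, sc)  # weighted sum combination
    # defuzzification: abscissa of the centroid of the combined area
    return (speed * combined).sum() / combined.sum()

print(infer(22.0))   # crisp AirMotorSpeed for Temperature = 22
```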

    D.P. Solomatine. Data-driven modelling (part 3). 50

In the previous example the rules were given. But how can we build them from data?

the following is given/assumed:

the known rule structure, that is, the number of premises in each rule

the shapes of the membership functions

the number of rules

the training set T is given: a set of S observed input (a) and output (b) real-valued vectors:

$$T = \{\,(a_1(s), \ldots, a_K(s),\, b(s));\; s = 1, \ldots, S\,\}$$

It is assumed that we are training I rules with K premises in a system, where the i-th rule has the following form:

If a1 is A_{i,1} AND a2 is A_{i,2} AND ... AND aK is A_{i,K} then B_i

where a is a crisp input, and A and B are triangular fuzzy numbers.

the parameters of A and B (supports and kernels) are to be found


    D.P. Solomatine. Data-driven modelling (part 3). 51

Building rules from data: weighted counting algorithm (1)

    D.P. Solomatine. Data-driven modelling (part 3). 52

Building rules from data: weighted counting algorithm (2)

uses the subset of the training set that satisfies the premises of a rule at least to a degree of fulfilment threshold ε to construct the shape of the corresponding consequence

It is accomplished with the following steps (i is the rule number, k is the premise number)


    D.P. Solomatine. Data-driven modelling (part 3). 53

Building rules from data: weighted counting algorithm (3)

1 Define the support (α⁻_{i,k}, α⁺_{i,k}) of the i-th rule's premise A_{i,k}.

2 A_{i,k} is assumed to be a triangular fuzzy number (α⁻_{i,k}, α¹_{i,k}, α⁺_{i,k})_T, where α¹_{i,k} is the mean of all a_k(s) values which fulfil the i-th rule at least partially:

$$\alpha^1_{i,k} = \frac{1}{N_i} \sum_{s \in R_i} a_k(s)$$

3 Calculate the DOFs ω_i(s) for each premise vector (a1(s), ..., aK(s)) corresponding to the training set T and each rule i whose premises were determined in step 1.

4 Select a threshold ε > 0 such that only responses with DOF > ε will be considered in the construction of the rule response. The corresponding response is assumed to be also a triangular fuzzy number (β⁻_i, β¹_i, β⁺_i)_T defined by:

$$\beta_i^- = \min_{s:\,\omega_i(s) > \varepsilon} b(s), \qquad \beta_i^1 = \frac{\sum_{s:\,\omega_i(s) > \varepsilon} \omega_i(s)\, b(s)}{\sum_{s:\,\omega_i(s) > \varepsilon} \omega_i(s)}, \qquad \beta_i^+ = \max_{s:\,\omega_i(s) > \varepsilon} b(s)$$
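A compact sketch of steps 1-4 for single-premise rules; the supports chosen in step 1 and the threshold eps are illustrative inputs.

```python
import numpy as np

def tri_mf(x, lo, mid, hi):
    """Triangular membership function with support (lo, hi) and kernel mid."""
    return np.maximum(np.minimum((x - lo) / (mid - lo), (hi - x) / (hi - mid)), 0.0)

def weighted_counting(a, b, supports, eps=0.1):
    """a, b: 1-D arrays of observed inputs/outputs; supports: list of premise
    supports (alpha-, alpha+) from step 1. Returns triangular premise and
    response parameters for each rule."""
    rules = []
    for lo, hi in supports:
        part = (a > lo) & (a < hi)          # samples fulfilling the rule at least partially
        mid = a[part].mean()                # step 2: kernel = mean of those a(s)
        dof = tri_mf(a, lo, mid, hi)        # step 3: DOF of every training sample
        sel = dof > eps                     # step 4: keep responses with DOF > eps
        beta = (b[sel].min(),                                 # beta-
                (dof[sel] * b[sel]).sum() / dof[sel].sum(),   # beta1, weighted mean
                b[sel].max())                                 # beta+
        rules.append(((lo, mid, hi), beta))
    return rules
```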

    D.P. Solomatine. Data-driven modelling (part 3). 54

Fuzzy rule-based system: learning rules from data

[Figure: HISTORICAL DATA feed TRAINING, which produces the RULES (EXPERT JUDGEMENTS are not considered here); the CRISP INPUT (X) passes through a FUZZIFIER, the FUZZY INFERENCE ENGINE applies the rules, and a DEFUZZIFIER produces the CRISP OUTPUT (Y)]


    D.P. Solomatine. Data-driven modelling (part 3). 55

Modeling spatial rainfall distribution using a Fuzzy rule-based system:

filling missing data in past records

estimating rainfall depth at the station Caprile (based on data for Arabba and Andraz) in case of a sudden equipment failure

[Figure: map showing the stations Arabba, Andraz and Caprile]

Case study: catchment in the Veneto region, Italy

    D.P. Solomatine. Data-driven modelling (part 3). 56

Problem formulation

Daily precipitation at three stations in 1985-91

Data split for training and verification

Daily precipitation at Andraz & Arabba used to determine the daily precipitation at Caprile

Performance indices:

Mean square error (MSE) between modeled & observed data

Percentage of predictions within a predefined tolerance target (5% is used)

Problems:

missing records in training data

non-uniform distribution of data


    D.P. Solomatine. Data-driven modelling (part 3). 57

Methods considered

Traditional Normal ratio method:

$$P_X = \frac{1}{3}\left( \frac{N_X}{N_A} P_A + \frac{N_X}{N_B} P_B + \frac{N_X}{N_C} P_C \right)$$

Neural network

Fuzzy rule-based system
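A small worked sketch of the normal ratio formula; P holds the same-day precipitation and N the annual normals at stations A, B, C, with purely illustrative numbers.

```python
def normal_ratio(P, N, N_X):
    """Normal ratio estimate: P_X = (1/3) * sum over stations of (N_X / N_s) * P_s."""
    return sum((N_X / N[s]) * P[s] for s in P) / len(P)

P_X = normal_ratio(P={"A": 12.0, "B": 10.5, "C": 14.0},
                   N={"A": 900.0, "B": 850.0, "C": 1000.0},
                   N_X=950.0)
print(P_X)
```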

    D.P. Solomatine. Data-driven modelling (part 3). 58

How many rules to use?

Too many rules lead to overfitting and a higher error on verification.

[Figure: effect of the number of rules (4, 9, 16, 25, 36) on the mean square error for the training set 1988-91 (T) and the verification set 1985-87 (V)]


    D.P. Solomatine. Data-driven modelling (part 3). 59

Results: best performance

[Figure: scatter plots of simulated vs. observed precipitation (0-80) for the training period (1989-91) and the verification period (1985-88), and the precipitation at CAPRILE for the first 120 days of 1987]

    D.P. Solomatine. Data-driven modelling (part 3). 60

Veneto case study: comparison of fuzzy rules, neural network and the normal ratio method

[Figure: performance comparison (Case 1) of FRBS, NN and the traditional normal ratio method (TRAD): mean square error and percentage of predictions within the 5% tolerance, for the 1989-91 training period and the 1985-88 verification years]


    D.P. Solomatine. Data-driven modelling (part 3). 61

Veneto case study: conclusions

FRBS was more accurate than the ANN and the Normal ratio method

its training is faster than that of an ANN

Issues to pay attention to:

curse of dimensionality: more than 5 inputs is very difficult to handle

too many rules may cause overfitting

non-uniformly distributed data lead to empty areas where rules cannot be trained

    D.P. Solomatine. Data-driven modelling (part 3). 62

Case study Delfland: training an ANN or Fuzzy controller on data obtained from an optimal controller in water level control

[Figure: hydrological processes in the polders produce the water level y(t); the Aquarius optimal controller computes the control signal u(t) (pumping rate) from the target water level y(t)d; the ANN or FRBS model is trained on the error in the control signal]

a data-driven controller (ANN or Fuzzy rule-based system) is trained on data generated by the optimal controller, and can then replace it


    D.P. Solomatine. Data-driven modelling (part 3). 63

Case study: Delfland

    D.P. Solomatine. Data-driven modelling (part 3). 64

Replicating the controller by an ANN (output: pump status at time t)

Input variables in Local control:

water level at time t-1

water level at time t

pump status at time t-1

Input variables in Centralised dynamic control:

precipitation at time t-2

precipitation at time t-1

precipitation at time t

water level at time t-1

water level at time t

groundwater level at time t

pump status at time t-1


    D.P. Solomatine. Data-driven modelling (part 3). 65

Performance of the Neural network reproducing the behaviour of an optimal controller

[Figure: pump status over time as reproduced by the neural network]

    D.P. Solomatine. Data-driven modelling (part 3). 66

Fuzzy rules reproducing optimal control of the water level in Delfland

[Figure: pump status over time as reproduced by the fuzzy rules]


    D.P. Solomatine. Data-driven modelling (part 3). 67

Bayesian learning

    D.P. Solomatine. Data-driven modelling (part 3). 68

Bayes theorem

we are interested in determining the best hypothesis h from some space H, given the observed data D

Some notations:

P(h) = prior probability that hypothesis h holds

P(D) = prior probability that the training data D will be observed (without knowledge of which hypothesis holds)

P(D|h) = probability of observing data D given that h holds

P(h|D) = probability that h holds given the observed data D

Bayes theorem:

$$P(h|D) = \frac{P(D|h)\, P(h)}{P(D)}$$


    D.P. Solomatine. Data-driven modelling (part 3). 69

    Selecting "best" hypothesis usingSelecting "best" hypothesis using BayesBayes theoremtheorem

learning in the Bayesian sense: selecting the most probable hypothesis (maximum a posteriori hypothesis, MAP):

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

P(D|h) is called the likelihood of data D given h

if all hypotheses are equally probable, then the maximum likelihood (ML) hypothesis is:

$$h_{ML} = \arg\max_{h \in H} P(D|h)$$

    D.P. Solomatine. Data-driven modelling (part 3). 70

Bayesian learning: example

    hypothesis h = "patient has cancer", alternative = "no cancer"

    prior knowledge (without data): P(h)=0.008

    data that can be observed: test with 2 outcomes (+ or -):

    right results:

    P(+/cancer) = 0.98 P(-/nocancer) = 0.97

    errors:

    P(-/cancer) = 0.02 P(+/nocancer) = 0.03

    suppose data is observed: a patient is tested and result is +is then hypothesis correct?: choose hypothesis with MAP, that ishypothesis for which P(D/h)P(h) = max

    P(+/cancer) P(cancer) = 0.98 * 0.008 = 0.0078

    P(+/nocancer) P(nocancer) = 0.03 * 0.992 = 0.0298

    --> hypothesis "no cancer" wins
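The same MAP computation in a few lines; the dictionaries just restate the numbers from the slide.

```python
priors = {"cancer": 0.008, "nocancer": 0.992}
pos_likelihood = {"cancer": 0.98, "nocancer": 0.03}   # P(+ | h)

# unnormalized posteriors P(+ | h) * P(h) for the observed '+' result
scores = {h: pos_likelihood[h] * priors[h] for h in priors}
print(scores)                       # {'cancer': 0.00784, 'nocancer': 0.02976}
print(max(scores, key=scores.get))  # 'nocancer' -> the "no cancer" hypothesis wins
```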


    D.P. Solomatine. Data-driven modelling (part 3). 71

Naive Bayes classifier

assume that each instance x of the data set is characterized by several attributes {a1, ..., an}

the target function F(x) can take on any value from a finite set V

a set of training examples {xi} is provided

when a new instance <a1, ..., an> is presented, the classifier should identify the most probable target value vMAP.

    D.P. Solomatine. Data-driven modelling (part 3). 72

Naive Bayes classifier (2)

This condition can be written as:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \ldots, a_n)$$

or, by applying Bayes theorem:

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)$$

P(vj) can be estimated simply by counting the frequency with which each target value vj occurs in the data


    D.P. Solomatine. Data-driven modelling (part 3). 73

Naive Bayes classifier (3)

the terms P(a1, ..., an | vj) can be estimated by counting in a similar way; however, the total number of these terms is equal to the number of possible instances times the number of possible target values, so this is difficult.

The solution is a simplifying assumption: the attribute values a1, ..., an are conditionally independent given the target value. In this case P(a1, ..., an | vj) = ∏i P(ai | vj), and estimating P(ai | vj) is much easier, also by counting frequencies.

This gives the rule of the naive Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$
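A counting-based sketch of the classifier just derived; frequency estimates only, with no smoothing for unseen attribute values.

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """X: list of attribute tuples, y: target values. Returns the counts
    needed for the frequency estimates of P(v) and P(a_i | v)."""
    priors = Counter(y)
    cond = defaultdict(Counter)            # (class, attribute index) -> value counts
    for xs, v in zip(X, y):
        for i, a in enumerate(xs):
            cond[(v, i)][a] += 1
    return priors, cond, len(y)

def classify_nb(xs, priors, cond, S):
    """Return argmax over v of P(v) * prod_i P(a_i | v)."""
    best_v, best_p = None, -1.0
    for v, nv in priors.items():
        p = nv / S                         # frequency estimate of P(v)
        for i, a in enumerate(xs):
            p *= cond[(v, i)][a] / nv      # frequency estimate of P(a_i | v)
        if p > best_p:
            best_v, best_p = v, p
    return best_v
```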

    D.P. Solomatine. Data-driven modelling (part 3). 74

Modular models: committee machines, ensembles, mixtures of experts, boosting


    D.P. Solomatine. Data-driven modelling (part 3). 75

Committee machine (modular model)

Instead of building one model, several models are built, each responsible for a particular situation.

Consider a forecasting model Q(t+1) = f(R(t-2), R(t-3), Q(t-1)).

[Figure: past records in the space Rainfall(t-3), Rainfall(t-2), Flow Q(t) are split into high, medium and low flows, and separate models are built; a new record (hydrometeorological condition) is attributed to one (or several) classes, and the corresponding models are run]

    D.P. Solomatine. Data-driven modelling (part 3). 76

Committee machine (modular model)


    D.P. Solomatine. Data-driven modelling (part 3). 77

Committee machines (modular model)

input data is split into subsets and separate data-driven models are trained:

hard split: sort according to the position in the input space (low - high rainfall); this allows bringing in physical insight

no split: do not sort, but train several models on the same data and then combine the results by some voting scheme (committee machine)

voting by majority, weighted majority, or by averaging

soft split: split according to how well a given model trained with this data, and then train also other models. Example: boosting (a code sketch follows after the boosting diagram below):

present the original training data set (N examples) to machine 1

assign higher probability to samples that are badly classified; sample N examples from the training set based on the new distribution

train machine 2

continue, ending with n machines

    D.P. Solomatine. Data-driven modelling (part 3). 78

Committee machine with hard split, expert (specialised) models trained on subsets

[Figure: a splitting (gating) machine routes the input x to machines 1 ... n, which produce the outputs y1 ... yn]


    D.P. Solomatine. Data-driven modelling (part 3). 79

Committee machine with no split (ensemble), all models are trained on the same set

[Figure: the input x feeds machines 1 ... n in parallel (no splitting); their outputs y1 ... yn pass through a combiner (averaging scheme) to produce y]

    D.P. Solomatine. Data-driven modelling (part 3). 80

Committee machine with soft split of data. Boosting

[Figure: machines 1 ... n are trained in sequence; before each machine, N training examples are sampled from a distribution in which badly predicted examples are given higher probability (redistribution); the outputs y1 ... yn are combined by a weighted averaging scheme into y]
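A sketch of the resampling idea in the diagram above; the doubling of the probability of badly predicted examples and the train_machine interface are illustrative assumptions, not a specific published boosting variant.

```python
import numpy as np

def boost_by_resampling(X, y, train_machine, n_machines=5, seed=0):
    """Train n machines in sequence; each sees N examples resampled from a
    distribution that favors previously badly predicted examples.
    train_machine(X, y) must return a predict function (assumed interface)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    p = np.full(N, 1.0 / N)                # start from the uniform distribution
    machines = []
    for _ in range(n_machines):
        idx = rng.choice(N, size=N, p=p)   # sample N examples from the distribution
        m = train_machine(X[idx], y[idx])
        machines.append(m)
        wrong = m(X) != y                  # which original examples are mispredicted
        p = np.where(wrong, 2.0 * p, p)    # redistribution: raise their probability
        p /= p.sum()
    return machines
```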


    D.P. Solomatine. Data-driven modelling (part 3). 81

Using mixtures of experts (models): each model is for a particular hydrological condition

[Figure: a decision tree routes inputs by hydrological condition using thresholds such as Pa(t-1) > 50 and Pa Mov2(t-2) at 200 (Condition 3); each Y/N branch leads to Module 1 or Module 2, each of which can be an M5 model tree or an ANN]

    D.P. Solomatine. Data-driven modelling (part 3). 82

Combining physically-based and data-driven models. Complementary use of a data-driven model

[Figure: input data and model parameters feed a hydrologic forecasting model of the physical system; the model errors with respect to the observed output train a data-driven error forecasting model, and the forecasted errors are combined with the model output to produce the improved output]


    D.P. Solomatine. Data-driven modelling (part 3). 83

End of Part 3