Transcript
Page 1: Introduction to Machine Learning

Introduction to Machine Learning

Jinhyuk Choi

Human-Computer Interaction Lab @ Information and Communications University

Page 2: Introduction to Machine Learning

Contents

Concepts of Machine Learning

Multilayer Perceptrons

Decision Trees

Bayesian Networks

Page 3: Introduction to Machine Learning

What is Machine Learning?

Large storage / large amount of data

Data looks random but contains certain patterns

Web log data

Medical record

Network optimization

Bioinformatics

Machine vision

Speech recognition…

No complete identification of the process

A good or useful approximation

Page 4: Introduction to Machine Learning

What is Machine Learning? - Definition

Programming computers to optimize a performance criterion using example data or past experience

Role of Statistics

Inference from a sample

Role of Computer science

Efficient algorithms to solve the optimization problem

Representing and evaluating the model for inference

Descriptive (training) / predictive (generalization)

Learning from Human-generated data??

Page 5: Introduction to Machine Learning

What is Machine Learning? - Concept Learning

• Inducing general functions from specific training examples (positive or negative)

• Looking for the hypothesis that best fits the training examples

• Concepts:

- describing some subset of objects or events defined over a larger set

- a boolean-valued function

Objects: eyes, nose, legs, reproductive ability, inanimate things, …

Bird: wings, beak, feathers, …

Concept = boolean function: Bird(animal) → “true or not”

Page 6: Introduction to Machine Learning

What is Machine Learning? - Concept Learning

Inferring a boolean-valued function from training examples of its input and output

Positive examples

Negative examples

Hypothesis 1

Hypothesis 2

Concept


Page 7: Introduction to Machine Learning

What is Machine Learning? - Learning Problem Design

Do you enjoy sports?

Learn to predict the value of “EnjoySports” for an arbitrary day, based on the value of its other attributes

What problem?

Why learning?

Attribute selection

Effective?

Enough?

What learning algorithm?

Page 8: Introduction to Machine Learning

Applications

Learning associations

Classification

Regression

Unsupervised learning

Reinforcement learning

Page 9: Introduction to Machine Learning

Examples (1)

TV program preference inference based on web usage data

Web page #1

Web page #2

Web page #3

Web page #4

….

Classifier

TV Program #1

TV Program #2

TV Program #3

TV Program #4

….


What are we supposed to do at each step?

Page 10: Introduction to Machine Learning

Examples (2) - from a HW of Neural Networks Class (KAIST-2002)

Function approximation (Mexican hat)

f3(x1, x2) = sin( 2π √(x1² + x2²) ),  x1, x2 ∈ [-1, 1]

Page 11: Introduction to Machine Learning

Examples (3) - from a HW of Machine Learning Class (ICU-2006)

Face image classification

Page 12: Introduction to Machine Learning

Examples (4) - from a HW of Machine Learning Class (ICU-2006)

Page 13: Introduction to Machine Learning

Examples (5) - from a HW of Machine Learning Class (ICU-2006)

Sensay

Page 14: Introduction to Machine Learning

Examples (6)

A. Krause et al., “Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable Computing”, ISWC 2005

Page 15: Introduction to Machine Learning

#1. Multilayer Perceptrons

Page 16: Introduction to Machine Learning

Neural Network?

VS. Adaline

MLP

SOM

Hopfield network

RBFN

Bifurcating neuron networks

Page 17: Introduction to Machine Learning

Multilayer Networks of Sigmoid Units

• Supervised learning

• 2-layer

• Fully connected

Really looks like the brain??

Page 18: Introduction to Machine Learning

Sigmoid Unit

Page 19: Introduction to Machine Learning

The back-propagation algorithm

Network model

Input layer: x_i    Hidden layer: y_j    Output layer: o_k

Hidden units:  y_j = σ( Σ_i v_ji x_i )

Output units:  o_k = σ( Σ_j w_kj y_j )

Weights: v_ji (input to hidden), w_kj (hidden to output)

Error function:  E(v, w) = (1/2) Σ_k (t_k − o_k)²

Stochastic gradient descent

Page 20: Introduction to Machine Learning

Gradient-Descent Function Minimization

Page 21: Introduction to Machine Learning

Gradient-descent function minimization

In order to find a vector parameter x that minimizes a function f(x):

Start with a random initial value x = x_0.

Determine the direction of steepest descent in the parameter space:
  ∇f = ( ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n )

Move a step in that (negative gradient) direction:
  x(i+1) = x(i) − η ∇f

Repeat the above two steps until there is no more change in x.

For gradient-descent to work…

The function to be minimized should be continuous.

The function should not have too many local minima.
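
A minimal Python sketch of this loop; the quadratic test function, step size η = 0.1, and stopping tolerance below are illustrative choices, not from the slides:

import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=10000):
    """Generic gradient descent: x(i+1) = x(i) - eta * grad_f(x(i))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:   # no more (significant) change in x
            break
    return x

# Example: minimize f(x) = (x1 - 3)^2 + (x2 + 1)^2, whose gradient is
# (2(x1 - 3), 2(x2 + 1)); the minimum is at (3, -1).
grad = lambda x: np.array([2 * (x[0] - 3), 2 * (x[1] + 1)])
print(gradient_descent(grad, x0=[0.0, 0.0]))   # approximately [ 3. -1.]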

Page 22: Introduction to Machine Learning

Back-propagation

Page 23: Introduction to Machine Learning

Derivation of back-propagation algorithm

Adjustment of w_kj:

∂E/∂w_kj = ∂/∂w_kj [ (1/2) Σ_k (t_k − o_k)² ]
         = −(t_k − o_k) ∂o_k/∂w_kj
         = −(t_k − o_k) o_k (1 − o_k) y_j          (since o_k = σ( Σ_j w_kj y_j ))

Δw_kj = −η ∂E/∂w_kj = η o_k (1 − o_k) (t_k − o_k) y_j

Page 24: Introduction to Machine Learning

Derivation of back-propagation algorithm

Adjustment of v_ji:

∂E/∂v_ji = ∂/∂v_ji [ (1/2) Σ_k (t_k − o_k)² ]
         = −Σ_k (t_k − o_k) o_k (1 − o_k) w_kj ∂y_j/∂v_ji
         = −Σ_k (t_k − o_k) o_k (1 − o_k) w_kj y_j (1 − y_j) x_i          (since y_j = σ( Σ_i v_ji x_i ))

Δv_ji = −η ∂E/∂v_ji = η y_j (1 − y_j) [ Σ_k w_kj o_k (1 − o_k) (t_k − o_k) ] x_i
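
A small numpy sketch of these two update rules for the 2-layer sigmoid network of page 19 (stochastic updates, squared error). The update rules follow the derivation above; the layer sizes, learning rate, and toy data are illustrative assumptions, and bias terms are omitted as in the slide formulas:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, V, W, eta=0.5):
    """One stochastic gradient step. V: hidden weights (n_hidden x n_in),
    W: output weights (n_out x n_hidden). Returns the error E for this case."""
    # Forward pass: y_j = sigma(sum_i v_ji x_i),  o_k = sigma(sum_j w_kj y_j)
    y = sigmoid(V @ x)
    o = sigmoid(W @ y)
    # delta_k = o_k (1 - o_k) (t_k - o_k);  Delta w_kj = eta * delta_k * y_j
    delta_k = o * (1 - o) * (t - o)
    # delta_j = y_j (1 - y_j) sum_k w_kj delta_k;  Delta v_ji = eta * delta_j * x_i
    delta_j = y * (1 - y) * (W.T @ delta_k)
    W += eta * np.outer(delta_k, y)
    V += eta * np.outer(delta_j, x)
    return 0.5 * np.sum((t - o) ** 2)

# Illustrative run: 2 inputs, 3 hidden units, 1 output, a smooth toy target.
rng = np.random.default_rng(0)
V = rng.normal(scale=0.5, size=(3, 2))
W = rng.normal(scale=0.5, size=(1, 3))
X = rng.uniform(-1, 1, size=(20, 2))
T = sigmoid(X[:, :1] - X[:, 1:])          # arbitrary target values in (0, 1)
for epoch in range(2000):
    total = sum(backprop_step(x, t, V, W) for x, t in zip(X, T))
print(total)   # the summed error typically shrinks as training proceeds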

Page 25: Introduction to Machine Learning

Backpropagation

Page 26: Introduction to Machine Learning

Batch learning vs. Incremental learning

Batch standard backprop proceeds as follows:

Initialize the weights W.
Repeat the following steps:
  Process all the training data DL to compute the gradient of the average error function AQ(DL, W).
  Update the weights by subtracting the gradient times the learning rate.

Incremental standard backprop can be done as follows:

Initialize the weights W.
Repeat the following steps for j = 1 to NL:
  Process one training case (y_j, X_j) to compute the gradient of the error (loss) function Q(y_j, X_j, W).
  Update the weights by subtracting the gradient times the learning rate.
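
The two schedules differ only in when the gradient is accumulated and the weights are updated. A sketch on a plain least-squares model; the data, model, and learning rate are illustrative stand-ins for DL, Q, and AQ above:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                  # training data D_L
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

def grad_case(w, xj, yj):
    """Gradient of the per-case squared error Q(y_j, x_j, w)."""
    return (xj @ w - yj) * xj

# Batch: one update per pass, using the gradient of the average error.
w = np.zeros(3)
for epoch in range(200):
    g = np.mean([grad_case(w, xj, yj) for xj, yj in zip(X, y)], axis=0)
    w -= 0.1 * g

# Incremental (stochastic): one update per training case.
w_inc = np.zeros(3)
for epoch in range(20):
    for xj, yj in zip(X, y):
        w_inc -= 0.1 * grad_case(w_inc, xj, yj)

print(np.round(w, 2), np.round(w_inc, 2))   # both should end up close to w_true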

Page 27: Introduction to Machine Learning

Training

Page 28: Introduction to Machine Learning

Overfitting

Page 29: Introduction to Machine Learning

#2. Decision Trees

Page 30: Introduction to Machine Learning

Introduction

Divide & conquer

Hierarchical model

Sequence of recursive splits

Decision node vs. leaf node

Advantage

Interpretability

IF-THEN rules

Page 31: Introduction to Machine Learning

Divide and Conquer

Internal decision nodes
  Univariate: uses a single attribute, x_i
    Numeric x_i: binary split: x_i > w_m
    Discrete x_i: n-way split for n possible values
  Multivariate: uses all attributes, x

Leaves
  Classification: class labels, or proportions
  Regression: numeric; the average of r, or a local fit

Learning
  Construction of the tree using training examples
  Looking for the simplest tree among the trees that code the training data without error
  Finding the smallest such tree is NP-complete, so construction is based on heuristics: "greedy", find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)

Page 32: Introduction to Machine Learning

Classification Trees

Split is the main procedure for tree construction
  By an impurity measure

For node m, N_m instances reach m, and N_m^i of them belong to C_i:
  P̂(C_i | x, m) ≡ p_m^i = N_m^i / N_m

Node m is pure if p_m^i is 0 or 1

Measure of impurity is entropy (to be pure!!!):
  I_m = − Σ_{i=1}^{K} p_m^i log₂ p_m^i

Page 33: Introduction to Machine Learning

Representation

Each node specifies a test of some attribute of the instance

Each branch corresponds to one of the possible values for this attribute

Page 34: Introduction to Machine Learning

Best Split

If node m is pure, generate a leaf and stop, otherwise split and continue recursively

Impurity after split: N_mj of N_m take branch j; N_mj^i of them belong to C_i
  P̂(C_i | x, m, j) ≡ p_mj^i = N_mj^i / N_mj

Total impurity after the split:
  I'_m = − Σ_{j=1}^{n} (N_mj / N_m) Σ_{i=1}^{K} p_mj^i log₂ p_mj^i

Find the variable and split that minimize impurity (among all variables, and among all split positions for numeric variables)

Q) “Which attribute should be tested at the root of the tree?”
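
One way to answer this programmatically is to score every candidate split by the weighted impurity I'_m above and keep the lowest. A minimal sketch for a numeric attribute; the attribute values and labels are made up for illustration:

import numpy as np
from collections import Counter

def entropy(labels):
    """I_m = - sum_i p_i log2 p_i over the class proportions at a node."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted impurity I'_m after a binary split 'value > threshold'."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

# Illustrative numeric attribute with binary class labels.
values = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0]
labels = ['+', '+', '+', '-', '-', '-']
candidates = [(a + b) / 2 for a, b in zip(sorted(values), sorted(values)[1:])]
best = min(candidates, key=lambda t: split_impurity(values, labels, t))
print(best)   # 2.75: the midpoint that separates the two classes perfectly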

Page 35: Introduction to Machine Learning

Top-Down Induction of Decision Trees

Page 36: Introduction to Machine Learning

Entropy “Measure of uncertainty”

“Expected number of bits to resolve uncertainty”

Suppose Pr{X = 0} = 1/8

If other events are equally likely, the number of events is 8. To indicate one out of so many events, one needs lg 8 bits.

Consider a binary random variable X s.t. Pr{X = 0} = 0.1.

The expected number of bits:
  0.1 · lg(1/0.1) + (1 − 0.1) · lg(1/(1 − 0.1))

In general, if a random variable X has c values with probabilities p_1, …, p_c, the expected number of bits is:
  H = Σ_{i=1}^{c} p_i lg(1/p_i) = − Σ_{i=1}^{c} p_i lg p_i
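
A small sketch of this computation; the example calls use the distributions discussed above plus a fair coin for comparison:

import numpy as np

def entropy_bits(probs):
    """H = sum_i p_i lg(1/p_i) = - sum_i p_i lg p_i (terms with p_i = 0 dropped)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy_bits([0.1, 0.9]))   # about 0.469 bits: the Pr{X = 0} = 0.1 case above
print(entropy_bits([1/8] * 8))    # 3.0 bits: one out of 8 equally likely events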

Page 37: Introduction to Machine Learning

Entropy - Example

14 examples

Entropy 0 : all members positive or negative

Entropy 1 : equal number of positive & negative

0 < Entropy < 1 : unequal number of positive & negative

Entropy([9+, 5−]) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940

Page 38: Introduction to Machine Learning

Information Gain

Measures the expected reduction in entropy caused by partitioning the examples

Page 39: Introduction to Machine Learning

Information Gain

ICU-Student tree

[Diagram] Root node "Candidate" with candidate attributes Gender, Height, IQ

• # of samples = 100

• # of positive samples = 50

• Entropy = 1

Male Female

Left side:

• # of samples = 50

• # of positive samples = 40

• Entropy = 0.72

Right side:

• # of samples = 50

• # of positive samples = 10

• Entropy = 0.72

On average

• Entropy = 0.5 * 0.72 + 0.5*0.72

= 0.72

• Reduction in entropy = 0.28

Information gain
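
The numbers on this slide can be reproduced directly; a minimal sketch using the counts from the ICU-Student example above:

import math

def entropy(pos, n):
    """Entropy of a node with pos positive examples out of n."""
    p = pos / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Root: 100 samples, 50 positive.
root = entropy(50, 100)                              # 1.0
# After splitting on Gender: 40/50 positive on the left, 10/50 on the right.
left, right = entropy(40, 50), entropy(10, 50)       # both about 0.72
remainder = (50 / 100) * left + (50 / 100) * right   # about 0.72
gain = root - remainder                              # about 0.28
print(round(left, 2), round(remainder, 2), round(gain, 2))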

Page 40: Introduction to Machine Learning

Training Examples

Page 41: Introduction to Machine Learning

Selecting the Next Attribute

Page 42: Introduction to Machine Learning

Partially learned tree

Page 43: Introduction to Machine Learning

Hypothesis Space Search

Hypothesis space: the set of all possible decision trees

DT is guided by the information gain measure.

Occam’s razor ??

Page 44: Introduction to Machine Learning

Overfitting

• Why “over”-fitting?

– A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well

Page 45: Introduction to Machine Learning

Avoiding over-fitting the data

Two classes of approaches to avoid overfitting

Stop growing the tree earlier.

Post-prune the tree after overfitting

Ok, but how to determine the optimal size of a tree?

Use validation examples to evaluate the effect of pruning (stopping)

Use a statistical test to estimate the effect of pruning (stopping)

Use a measure of complexity for encoding decision tree.

Approaches based on the first strategy

Reduced error pruning

Rule post-pruning

Page 46: Introduction to Machine Learning

Rule Extraction from Trees

C4.5Rules (Quinlan, 1993)

Page 47: Introduction to Machine Learning

#3. Bayesian Networks

Page 48: Introduction to Machine Learning

Bayes’ Rule - Introduction

P(C | x) = p(x | C) P(C) / p(x)

posterior = likelihood × prior / evidence

P(C = 0) + P(C = 1) = 1
p(x) = p(x | C = 0) P(C = 0) + p(x | C = 1) P(C = 1)
P(C = 0 | x) + P(C = 1 | x) = 1
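
A two-class numeric check of the rule; the prior and likelihood values below are arbitrary illustrations, not from the slides:

# posterior = likelihood * prior / evidence, for two classes C = 0 and C = 1
prior = {0: 0.6, 1: 0.4}          # P(C=0), P(C=1): illustrative priors
likelihood = {0: 0.2, 1: 0.7}     # p(x | C=0), p(x | C=1) for one observed x
evidence = sum(likelihood[c] * prior[c] for c in (0, 1))              # p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in (0, 1)}
print(posterior, sum(posterior.values()))   # the posteriors sum to 1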

Page 49: Introduction to Machine Learning

Bayes’ Rule: K>2 Classes - Introduction

P(C_i | x) = p(x | C_i) P(C_i) / p(x)
           = p(x | C_i) P(C_i) / Σ_{k=1}^{K} p(x | C_k) P(C_k)

P(C_i) ≥ 0 and Σ_{i=1}^{K} P(C_i) = 1

Choose C_i if P(C_i | x) = max_k P(C_k | x)

Page 50: Introduction to Machine Learning

Bayesian Networks - Introduction

Graphical models, probabilistic networks: causality and influence

Nodes are hypotheses (random vars) and the prob corresponds to our belief in the truth of the hypothesis

Arcs are direct influences between hypotheses

The structure is represented as a directed acyclic graph (DAG)
Representation of the dependencies among random variables

The parameters are the conditional probs in the arcs

B.N.: a small set of probabilities relating only neighboring nodes, instead of all possible combinations of circumstances

Page 51: Introduction to Machine Learning

Bayesian Networks - Introduction

Learning
  Inducing a graph
    From prior knowledge
    From structure learning
  Estimating parameters
    EM

Inference
  Beliefs from evidence
  Especially among the nodes not directly connected

Page 52: Introduction to Machine Learning

Structure - Introduction

Initial configuration of BN

Root nodes

Prior probabilities

Non-root nodes

Conditional probabilities given all possible combinations of direct predecessors

[Diagram] DAG with nodes A, B, C, D, E and arcs A→C, A→D, B→D, D→E

P(a), P(b)
P(c | a), P(c | ¬a)
P(d | a,b), P(d | a,¬b), P(d | ¬a,b), P(d | ¬a,¬b)
P(e | d), P(e | ¬d)

Page 53: Introduction to Machine Learning

Causes and Bayes’ Rule - Introduction

Diagnostic inference: Knowing that the grass is wet, what is the probability that rain is the cause?

P(R | W) = P(W | R) P(R) / P(W)
         = P(W | R) P(R) / [ P(W | R) P(R) + P(W | ~R) P(~R) ]
         = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6)
         = 0.75

(causal direction: R → W; diagnostic direction: W → R)

Page 54: Introduction to Machine Learning

Causal vs Diagnostic Inference - Introduction

Causal inference: If the sprinkler is on, what is the probability that the grass is wet?

P(W|S) = P(W|R,S) P(R|S) + P(W|~R,S) P(~R|S)

= P(W|R,S) P(R) + P(W|~R,S) P(~R)

= 0.95*0.4 + 0.9*0.6 = 0.92

Diagnostic inference: If the grass is wet, what is the probability that the sprinkler is on?
P(S | W) = 0.35 > 0.2 = P(S)
P(S | R, W) = 0.21
Explaining away: Knowing that it has rained decreases the probability that the sprinkler is on.
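
A sketch that reproduces the causal computation above, P(W|S) = 0.92, using only the CPT entries quoted on this slide; since R and S have no arc between them here, P(R|S) = P(R):

# CPT entries used on this slide
P_R = 0.4        # P(R)
P_W_RS = 0.95    # P(W | R, S)
P_W_nRS = 0.90   # P(W | ~R, S)

# Causal inference: P(W|S) = P(W|R,S) P(R) + P(W|~R,S) P(~R)
P_W_S = P_W_RS * P_R + P_W_nRS * (1 - P_R)
print(P_W_S)     # 0.92, matching the slide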

Page 55: Introduction to Machine Learning

Bayesian Networks: Causes - Introduction

Causal inference:
P(W|C) = P(W|R,S) P(R,S|C) + P(W|~R,S) P(~R,S|C) + P(W|R,~S) P(R,~S|C) + P(W|~R,~S) P(~R,~S|C)

and use the fact that P(R,S|C) = P(R|C) P(S|C)

Diagnostic: P(C|W ) = ?

Page 56: Introduction to Machine Learning

Bayesian Nets: Local structure - Introduction

P (F | C) = ?

P(X_1, …, X_d) = Π_{i=1}^{d} P(X_i | parents(X_i))

Page 57: Introduction to Machine Learning

Bayesian Networks: Inference - Introduction

P (C,S,R,W,F ) = P (C ) P (S |C ) P (R |C ) P (W |R,S ) P (F |R )

P (C,F ) = ∑S ∑R ∑W P (C,S,R,W,F )

P (F |C ) = P (C,F ) / P (C )   -- Not efficient!

Belief propagation (Pearl, 1988)
Junction trees (Lauritzen and Spiegelhalter, 1988)
Independence assumption
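
A brute-force sketch of the "not efficient" route: build the joint from the factorization above and sum out S, R, W. The CPT numbers below are illustrative placeholders, since the slide does not list them:

from itertools import product

# Illustrative CPTs (the slide gives only the factorization, not these values).
P_C = 0.5                                   # P(C = T)
P_S_given_C = {True: 0.10, False: 0.50}     # P(S = T | C)
P_R_given_C = {True: 0.80, False: 0.20}     # P(R = T | C)
P_W_given_RS = {(True, True): 0.95, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.10}   # P(W = T | R, S)
P_F_given_R = {True: 0.70, False: 0.05}     # P(F = T | R)

def bern(p_true, value):
    """P(X = value) for a binary variable with P(X = True) = p_true."""
    return p_true if value else 1 - p_true

def joint(c, s, r, w, f):
    """P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|R,S) P(F|R)."""
    return (bern(P_C, c) * bern(P_S_given_C[c], s) * bern(P_R_given_C[c], r)
            * bern(P_W_given_RS[(r, s)], w) * bern(P_F_given_R[r], f))

# P(C=T, F=T) = sum over S, R, W of the joint; then P(F=T | C=T) = P(C,F) / P(C).
p_cf = sum(joint(True, s, r, w, True) for s, r, w in product([True, False], repeat=3))
print(p_cf / bern(P_C, True))   # P(F = T | C = T)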

Page 58: Introduction to Machine Learning

Inference - Evidence & Belief Propagation

Evidence – values of observed nodes

V3 = T, V6 = 3

Our belief in what the value of Vi 'should' be changes.

This belief is propagated

As if the CPTs became

P(V3):   V3=T: 1.0,  V3=F: 0.0

P(V6 | V2):        V2=T    V2=F
    V6=1           0.0     0.0
    V6=2           0.0     0.0
    V6=3           1.0     1.0

[Diagram: network with nodes V1, V2, V3, V4, V5, V6]

Page 59: Introduction to Machine Learning

Specifically:


Belief Propagation

Message

Messages

Going down arrow, sum out parent
Going up arrow, Bayes Law

Bayes Law:  P(A | B) = P(B | A) P(A) / P(B)   (normalization constant 1/α)

“Causal” message “Diagnostic” message

* some figures from: Peter Lucas BN lecture course

Page 60: Introduction to Machine Learning

The Messages

• What are the messages?

• For simplicity, let the nodes be binary

[Diagram: V1 → V2]

P(V1):   V1=T: 0.8,  V1=F: 0.2

P(V2 | V1):        V1=T    V1=F
    V2=T           0.4     0.9
    V2=F           0.6     0.1

The message passes on information.

What information? Observe:

P(V2) = P(V2 | V1=T) P(V1=T) + P(V2 | V1=F) P(V1=F)

The information needed is the CPT of V1, i.e. π(V1)

Messages capture information passed from parent to child

Page 61: Introduction to Machine Learning

The Messages

• We know what the messages are
• What about λ?

[Diagram: V1 → V2]

Assume E = { V2 } and compute by Bayes' rule:

P(V1 | V2) = P(V1) P(V2 | V1) / P(V2) = α P(V1) P(V2 | V1)

The information not available at V1 is P(V2 | V1), to be passed upwards by a λ-message. Again, this is not in general exactly the CPT, but the belief based on evidence down the tree.

Page 62: Introduction to Machine Learning

Belief Propagation

[Diagram] Node V with parents U1, U2 and children V1, V2, showing the messages π(U1), π(U2), π(V1), π(V2) and λ(U1), λ(U2), λ(V1), λ(V2)

Page 63: Introduction to Machine Learning

Evidence & Belief

[Diagram] Network with nodes V1-V6, with evidence entered at two nodes and the resulting belief shown at another

Works for classification ??

Page 64: Introduction to Machine Learning

Naive Bayes’ Classifier

Given C, xj are independent:

p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
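
A compact sketch of a naive Bayes classifier over discrete attributes, built directly on the independence assumption above; the toy (Outlook, Windy) data are invented for illustration, loosely in the spirit of the EnjoySports example earlier in the slides:

from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label). Returns class priors and
    per-attribute conditional probability tables (no smoothing, for brevity)."""
    class_counts = defaultdict(int)
    cond_counts = defaultdict(lambda: defaultdict(int))   # per class: (j, value) counts
    for x, c in examples:
        class_counts[c] += 1
        for j, v in enumerate(x):
            cond_counts[c][(j, v)] += 1
    n = len(examples)
    priors = {c: k / n for c, k in class_counts.items()}
    cond = {c: {jv: k / class_counts[c] for jv, k in tbl.items()}
            for c, tbl in cond_counts.items()}
    return priors, cond

def classify(x, priors, cond):
    """Pick the class maximizing P(C) * prod_j p(x_j | C)."""
    def score(c):
        s = priors[c]
        for j, v in enumerate(x):
            s *= cond[c].get((j, v), 0.0)
        return s
    return max(priors, key=score)

data = [(('sunny', 'no'), 'yes'), (('sunny', 'yes'), 'no'),
        (('rain', 'yes'), 'no'), (('overcast', 'no'), 'yes'),
        (('rain', 'no'), 'yes'), (('overcast', 'yes'), 'yes')]
priors, cond = train_naive_bayes(data)
print(classify(('sunny', 'no'), priors, cond))   # expected: 'yes'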

Page 65: Introduction to Machine Learning

Application Procedures - For classification

MLP

Data collection & Pre-processing (Training data / Test data)

Decision node selection (output node)

Network training

Generalization

Parameter tuning & Pruning

Final network

Decision Trees

Data collection & Pre-processing (Training data / Test data)

Decision attribute selection

Tree construction

Pruning

Final tree

Bayesian Networks

Data collection & Pre-processing (Training data / Test data)

Structure configuration

Prior knowledge

Parameter learning

Decision node selection

Inference (classification)

Evidence & belief

Final network

Page 66: Introduction to Machine Learning

Simulation

Simulation Packages

WEKA (JAVA)

http://www.cs.waikato.ac.nz/ml/weka/

FullBNT (MATLAB)

http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html

MSBNx

http://research.microsoft.com/msbn/

MATLAB Neural Networks Toolbox

http://www.mathworks.com/products/neuralnet/

C4.5

http://www.rulequest.com/Personal/

Page 67: Introduction to Machine Learning

WEKA

Page 68: Introduction to Machine Learning

FullBNT

clear all

N = 4; % number of nodes

dag = zeros(N,N); % network structure shell

C = 1; S = 2; R = 3; W = 4; % naming each node

dag(C,[R S]) = 1; % specify the network structure

dag(R,W) = 1;

dag(S,W)=1;

%discrete_nodes = 1:N;

node_sizes = 2*ones(1,N); % number of values each node can take

%node_sizes = [4 2 3 5];

%onodes = [];

%bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);

bnet = mk_bnet(dag, node_sizes, 'names', {'C','S','R','W'}, 'discrete', 1:4);

%C = bnet.names('cloudy'); % bnet.names is an associative array

%bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

%%%%%% Specified Parameters

%bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

%bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);

%bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);

%bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

Page 69: Introduction to Machine Learning

MSBNx

Page 70: Introduction to Machine Learning

References

Textbooks

Ethem Alpaydin, Introduction to Machine Learning, The MIT Press, 2004

Tom Mitchell, Machine Learning, McGraw Hill, 1997

Neapolitan, R.E., Learning Bayesian Networks, Prentice Hall, 2003

Materials

Serafín Moral, Learning Bayesian Networks, University of Granada, Spain

Zheng Rong Yang, Connectionism, Exeter University

KyuTae Cho, Jeong KiYoo, HeeJin Lee, Uncertainty in AI: Probabilistic Reasoning, Especially for Bayesian Networks

Gary Bradski, Sebastian Thrun, Bayesian Networks in Computer Vision, Stanford University

Recommended Textbooks

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992

Simon S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999

Finn V. Jensen, Bayesian Networks and Decision Graphs, Springer, 2007

