
Machine Learning: Naïve Bayes, Neural Networks, Clustering

Skim 20.5

CMSC 471


The Naïve Bayes Classifier

Some material adapted from slides by Tom Mitchell, CMU.


The Naïve Bayes Classifier

Recall Bayes rule:

P(Y_i | X_j) = P(X_j | Y_i) P(Y_i) / P(X_j)

Which is short for:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)

We can re-write this as:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)
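As a quick sanity check of the rewritten form, here is a small worked example with made-up numbers (not from the slides): suppose Y takes values spam and ham with P(Y = spam) = 0.4 and P(Y = ham) = 0.6, and P(X = x | Y = spam) = 0.3 while P(X = x | Y = ham) = 0.1. Then

P(Y = spam | X = x) = (0.3)(0.4) / [(0.3)(0.4) + (0.1)(0.6)] = 0.12 / 0.18 = 2/3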


Deriving Naïve Bayes

Idea: use the training data to directly estimate P(Y) and P(X | Y).

Then, we can use these values to estimate P(Y | X_new) using Bayes rule.

Recall that representing the full joint probability P(X_1, X_2, ..., X_n | Y) is not practical.


Deriving Naïve Bayes

However, if we make the assumption that the attributes are independent, estimation is easy!

In other words, we assume all attributes are conditionally independent given Y. Often this assumption is violated in practice, but more on that later…

P(X_1, ..., X_n | Y) = Π_i P(X_i | Y)


Deriving Naïve Bayes

Let X_1, ..., X_n and label Y be discrete.

Then, we can estimate P(Y_i) and P(X_i | Y_i) directly from the training data by counting!

Sky    Temp  Humid   Wind    Water  Forecast  Play?
sunny  warm  normal  strong  warm   same      yes
sunny  warm  high    strong  warm   same      yes
rainy  cold  high    strong  warm   change    no
sunny  warm  high    strong  cool   change    yes

P(Sky = sunny | Play = yes) = ?

P(Humid = high | Play = yes) = ?
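A worked answer, obtained by counting in the four training rows above: Play = yes in 3 of the 4 rows; among those 3 rows, Sky = sunny in all 3 and Humid = high in 2. So:

P(Sky = sunny | Play = yes) = 3/3 = 1.0
P(Humid = high | Play = yes) = 2/3 ≈ 0.67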


The Naïve Bayes Classifier

Now we have:

P(Y = y_j | X_1, ..., X_n) = P(Y = y_j) Π_i P(X_i | Y = y_j) / Σ_k P(Y = y_k) Π_i P(X_i | Y = y_k)

which is just a one-level Bayesian network: a single label node Y (the hypotheses), with P(Y_j), connected to the attribute nodes X_1, ..., X_n (the evidence), with P(X_i | Y_j).

To classify a new point X_new:

Y_new = argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)


The Naïve Bayes Algorithm

For each value y_k:
  Estimate P(Y = y_k) from the data.
  For each value x_ij of each attribute X_i:
    Estimate P(X_i = x_ij | Y = y_k)

Classify a new point via:

Y_new = argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)

In practice, the independence assumption often does not hold, but Naïve Bayes performs very well despite it. (A code sketch of this procedure appears below.)
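To make the counting concrete, here is a minimal Python sketch of the algorithm above for discrete attributes (the function names are illustrative, not from the slides; no smoothing is applied):

from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Estimate P(Y = y_k) and P(X_i = x_ij | Y = y_k) by counting."""
    label_counts = Counter(labels)                 # counts for each label value y_k
    cond_counts = defaultdict(Counter)             # (attribute index, y_k) -> counts of attribute values
    for x, y in zip(examples, labels):
        for i, value in enumerate(x):
            cond_counts[(i, y)][value] += 1
    priors = {y: n / len(labels) for y, n in label_counts.items()}
    def likelihood(i, value, y):
        # P(X_i = value | Y = y); unseen values get probability 0 (no smoothing here)
        return cond_counts[(i, y)][value] / label_counts[y]
    return priors, likelihood

def classify(x_new, priors, likelihood):
    """Return argmax_y P(Y = y) * prod_i P(X_i = x_i | Y = y)."""
    best_y, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(x_new):
            score *= likelihood(i, value, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# Usage on the small Play table from the earlier slide:
examples = [("sunny", "warm", "normal", "strong", "warm", "same"),
            ("sunny", "warm", "high",   "strong", "warm", "same"),
            ("rainy", "cold", "high",   "strong", "warm", "change"),
            ("sunny", "warm", "high",   "strong", "cool", "change")]
labels = ["yes", "yes", "no", "yes"]
priors, likelihood = train_naive_bayes(examples, labels)
print(classify(("sunny", "warm", "high", "strong", "cool", "same"), priors, likelihood))  # -> "yes"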


Naïve Bayes Applications

Text classification
  Which e-mails are spam?
  Which e-mails are meeting notices?
  Which author wrote a document?

Classifying mental states
  Learning P(BrainActivity | WordCategory) for people words vs. animal words
  Pairwise classification accuracy: 85%


Neural Networks

Some material adapted from lecture notes by Lise Getoor and Ron Parr.
Adapted from slides by Tim Finin and Marie desJardins.


Neural function

Brain function (thought) occurs as the result of the firing of neurons.

Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters.

Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds.

Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength.

There are about 10^11 neurons and about 10^14 synapses in the human brain(!)


Biology of a neuron


Brain structure

Different areas of the brain have different functions.
  Some areas seem to have the same function in all humans (e.g., Broca's region for motor speech); the overall layout is generally consistent.
  Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly.

We don't know how different functions are "assigned" or acquired.
  Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors).
  Partly the result of experience (learning).

We really don't understand how this neural structure leads to what we perceive as "consciousness" or "thought".

Artificial neural networks are not nearly as complex or intricate as the actual brain structure.


Comparison of computing power

Computers are way faster than neurons…
But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel.
Neural networks are designed to be massively parallel.
The brain is effectively a billion times faster.

Information circa 1995:

                     Computer                         Human Brain
Computation units    1 CPU, 10^5 gates                10^11 neurons
Storage units        10^4 bits RAM, 10^10 bits disk   10^11 neurons, 10^14 synapses
Cycle time           10^-8 sec                        10^-3 sec
Bandwidth            10^4 bits/sec                    10^14 bits/sec
Updates / sec        10^5                             10^14


Neural networks

Neural networks are made up of nodes or units, connected by links.

Each link has an associated weight and activation level.

Each node has an input function (typically summing over weighted inputs), an activation function, and an output.

Layered feed-forward network: input units feed hidden units, which feed output units.


Model of a neuron

A neuron is modeled as a unit i; w_ji is the weight on the input from unit j to unit i.

The net input to unit i is:

in_i = Σ_j w_ji o_j

The activation function g() determines the neuron's output.
  g() is typically a sigmoid.
  Output is either 0 or 1 (no partial activation).
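A minimal sketch of this unit model in Python, assuming a sigmoid activation (the names and numbers are illustrative, not from the slides):

import math

def sigmoid(x):
    # Logistic activation g(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs):
    # Net input in_i = sum_j w_ji * o_j, passed through the activation function
    net = sum(w * o for w, o in zip(weights, inputs))
    return sigmoid(net)

# Example: a unit with a bias input fixed at 1 (the first weight acts as a negative threshold)
print(unit_output([-1.5, 1.0, 1.0], [1, 1, 1]))  # approximately 0.62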


"Executing" neural networks

Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level.

Working forward through the network, the input function of each unit is applied to compute the input value.
  Usually this is just the weighted sum of the activation on the links feeding into this node.

The activation function transforms this input function into a final value.
  Typically this is a nonlinear function, often a sigmoid function corresponding to the "threshold" of that node.


Learning rules

Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, we can incrementally change the weights to learn to produce these outputs using the perceptron learning rule.
  Assumes binary-valued inputs/outputs.
  Assumes a single linear threshold unit.


Perceptron learning rule

If the target output for unit i is t_i:

w_ji ← w_ji + α (t_i − o_i) o_j

Equivalent to the intuitive rules:
  If the output is correct, don't change the weights.
  If the output is low (o_i = 0, t_i = 1), increment the weights for all the inputs which are 1.
  If the output is high (o_i = 1, t_i = 0), decrement the weights for all inputs which are 1.

Must also adjust the threshold, or equivalently assume there is a weight w_0i for an extra input unit that has an output of 1.


Perceptron learning algorithm

Repeatedly iterate through the examples, adjusting the weights according to the perceptron learning rule, until all outputs are correct (see the sketch below):
  Initialize the weights to all zero (or random).
  Until outputs for all training examples are correct:
    for each training example e do:
      compute the current output o_j
      compare it to the target t_j and update the weights

Each execution of the outer loop is called an epoch.

For multiple-category problems, learn a separate perceptron for each category and assign a new point to the class whose perceptron most exceeds its threshold.
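A minimal sketch of this loop, assuming binary {0, 1} inputs and targets and a bias handled as an extra always-1 input (the function name and learning rate value are illustrative, not from the slides):

def perceptron_train(examples, alpha=0.1, max_epochs=100):
    """examples: list of (inputs, target) pairs with binary targets.
    Returns the weights; weights[0] is the bias weight for an implicit always-1 input."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                        # initialize the weights to zero
    for _ in range(max_epochs):                # each pass over the data is one epoch
        all_correct = True
        for x, t in examples:
            x = [1] + list(x)                  # prepend the always-1 bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if o != t:
                all_correct = False
                # perceptron learning rule: w_j <- w_j + alpha * (t - o) * x_j
                w = [wi + alpha * (t - o) * xi for wi, xi in zip(w, x)]
        if all_correct:
            break
    return w

# Learning the (linearly separable) AND function
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(perceptron_train(and_examples))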


Representation limitations of a perceptron

Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data.
  I.e., the positive and negative examples are separable by a hyperplane in n-dimensional space:

⟨W, X⟩ − θ = 0, with ⟨W, X⟩ − θ > 0 on one side of the hyperplane and < 0 on the other.


Perceptron learnability

Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969).

Unfortunately, many functions (like parity) cannot be represented by a linear threshold unit.


Learning: Backpropagation

Similar to the perceptron learning algorithm, we cycle through our examples:
  If the output of the network is correct, no changes are made.
  If there is an error, the weights are adjusted to reduce the error.

The trick is to assess the blame for the error and divide it among the contributing weights.


Output layer

As in the perceptron learning algorithm, we want to minimize the difference between the target output and the output actually computed:

W_ji ← W_ji + α × a_j × Err_i × g′(in_i)

where a_j is the activation of hidden unit j, Err_i = (T_i − O_i), and g′ is the derivative of the activation function.

Defining Δ_i = Err_i × g′(in_i), the update becomes:

W_ji ← W_ji + α × a_j × Δ_i


Hidden layers

Need to define the error; we do error backpropagation.

Intuition: each hidden node j is "responsible" for some fraction of the error Δ_i in each of the output nodes to which it connects.

Δ_i is divided according to the strength of the connection between the hidden node and the output node, and propagated back to provide the Δ_j values for the hidden layer:

Δ_j = g′(in_j) Σ_i W_ji Δ_i

Update rule:

W_kj ← W_kj + α × I_k × Δ_j


Backpropagation algorithm

Compute the Δ values for the output units using the observed error.

Starting with the output layer, repeat the following for each layer in the network, until the earliest hidden layer is reached:
  Propagate the Δ values back to the previous layer.
  Update the weights between the two layers.
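A minimal sketch of these updates for a network with one hidden layer and sigmoid units, following the slides' notation (Δ for the propagated errors, α for the learning rate); this is an illustrative implementation with biases omitted for brevity, not code from the course:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, target, W_in, W_out, alpha=0.5):
    """One weight update for a one-hidden-layer network.
    W_in[j][k]: weight from input k to hidden unit j; W_out[i][j]: hidden j to output i."""
    # Forward pass
    hidden = [sigmoid(sum(w * xk for w, xk in zip(W_in[j], x))) for j in range(len(W_in))]
    out = [sigmoid(sum(w * h for w, h in zip(W_out[i], hidden))) for i in range(len(W_out))]
    # Delta for output units: Err_i * g'(in_i), where g'(in) = out * (1 - out) for a sigmoid
    delta_out = [(target[i] - out[i]) * out[i] * (1 - out[i]) for i in range(len(out))]
    # Delta for hidden units: g'(in_j) * sum_i W_ji * Delta_i
    delta_hid = [hidden[j] * (1 - hidden[j]) *
                 sum(W_out[i][j] * delta_out[i] for i in range(len(out)))
                 for j in range(len(hidden))]
    # Update hidden -> output weights: W_ji <- W_ji + alpha * a_j * Delta_i
    for i in range(len(W_out)):
        for j in range(len(hidden)):
            W_out[i][j] += alpha * hidden[j] * delta_out[i]
    # Update input -> hidden weights: W_kj <- W_kj + alpha * I_k * Delta_j
    for j in range(len(W_in)):
        for k in range(len(x)):
            W_in[j][k] += alpha * x[k] * delta_hid[j]
    return out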


Backprop issues

"Backprop is the cockroach of machine learning. It's ugly, and annoying, but you just can't get rid of it." (Geoff Hinton)

Problems:
  black box
  local minima


Unsupervised Learning: Clustering

Some material adapted from slides by Andrew Moore, CMU.

Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.


Unsupervised Learning

Supervised learning uses labeled data pairs (x, y) to learn a function f : X → Y.

But what if we don't have labels?
  No labels = unsupervised learning.
  Only some points are labeled = semi-supervised learning.
    Labels may be expensive to obtain, so we only get a few.

Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.


Clustering Data


K-Means Clustering

K-Means(k, data):
  • Randomly choose k cluster center locations (centroids).
  • Loop until convergence:
    • Assign each point to the cluster of the closest centroid.
    • Re-estimate the cluster centroids based on the data assigned to each (see the sketch below).
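A minimal Python sketch of this loop using Euclidean distance and random initial centroids (the function name and sample points are illustrative, not from the slides; requires Python 3.8+ for math.dist):

import math
import random

def kmeans(k, data, max_iters=100):
    """data: list of points (tuples of floats). Returns (centroids, assignments)."""
    centroids = random.sample(data, k)             # randomly choose k initial centroids
    assignments = None
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid
        new_assignments = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                           for p in data]
        if new_assignments == assignments:         # converged: no point changed cluster
            break
        assignments = new_assignments
        # Re-estimate each centroid as the mean of the points assigned to it
        for c in range(k):
            members = [p for p, a in zip(data, assignments) if a == c]
            if members:                            # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
print(kmeans(2, points))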



K-Means Animation

Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.


Problems with K-Means

Very sensitive to the initial points.
  Do many runs of K-Means, each with different initial centroids.
  Seed the centroids using a better method than random (e.g., farthest-first sampling).

Must manually choose k.
  Learn the optimal k for the clustering. (Note that this requires a performance measure.)


Problems with K-Means

How do you tell it which clustering you want?

Constrained clustering techniques:
  Same-cluster constraint (must-link)
  Different-cluster constraint (cannot-link)