Theory Naïve Bayes in SQL
Naïve Bayes Classification
Nickolai Riabov, Kenneth Tiong
Brown University
Fall 2013
Nickolai Riabov, Kenneth Tiong Naïve Bayes Classification
Structure of the Talk
Theory of Naïve Bayes classification
Naïve Bayes in SQL
Notation
X – Set of features of the data
Y – Set of classes of the data
Bayes’ Theorem
$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$$
P(y) – Prior probability of being in class y
P(x) – Probability of features x
P(x|y) – Likelihood of features x given class y
P(y|x) – Posterior probability of y
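As a sanity check, the theorem can be evaluated on a small made-up example (all probabilities below are invented for illustration):

```python
# Worked check of Bayes' theorem with made-up numbers:
# 20% of emails are spam, and the word "offer" appears in 50% of
# spam emails and in 5% of ham emails.
p_spam = 0.2                 # prior P(y = spam)
p_word_given_spam = 0.5      # likelihood P(x | y = spam)
p_word_given_ham = 0.05      # likelihood P(x | y = ham)

# P(x) by the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(y = spam | x) = P(x | y = spam) P(y = spam) / P(x)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)     # 0.1 / 0.14, roughly 0.71
```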
Maximum a posteriori estimate
Based on Bayes’ theorem, we can compute which of the classes y maximizes the posterior probability:
\begin{align*}
y^* &= \arg\max_{y \in Y} P(y \mid x) \\
    &= \arg\max_{y \in Y} \frac{P(x \mid y)\,P(y)}{P(x)} \\
    &= \arg\max_{y \in Y} P(x \mid y)\,P(y)
\end{align*}
(Note: we can drop P(x) since it is common to all posteriors)
Commonality with maximum likelihood
Assume that all classes are equally likely a priori:
$$P(y) = \frac{1}{|Y|} \quad \forall\, y \in Y$$
Then,
$$y^* = \arg\max_{y \in Y} P(x \mid y)$$
That is, $y^*$ is the y that maximizes the likelihood function.
Desirable Properties of the Bayes Classifier
- Incrementality: each new element of the training set leads to an update in the likelihood function. This makes the estimator robust.
- Combines prior knowledge and observed data.
- Outputs a probability distribution in addition to a classification.
Bayes Classifier
Assumption: the training set consists of instances of different classes y that are functions of features x. (In this case, assume each point has k features, and there are n points in the training set.)
Task: classify a new point $x_{:,n+1}$ as belonging to a class $y_{n+1} \in Y$ on the basis of its features by using a MAP classifier:
$$y^* \in \arg\max_{y_{n+1} \in Y} P(x_{1,n+1}, x_{2,n+1}, \dots, x_{k,n+1} \mid y_{n+1})\,P(y_{n+1})$$
Bayes Classifier
- P(y) can either be externally specified (i.e. it can actually be a prior), or can be estimated as the frequency of classes in the training set.
- $P(x_1, x_2, \dots, x_k \mid y)$ has $O(|X|^k\,|Y|)$ parameters – it can only be estimated with a very large number of data points.
Bayes Classifier
We can reduce the dimensionality of the problem by assuming that features are conditionally independent given the class (this is the Naïve Bayes assumption):
$$P(x_1, x_2, \dots, x_k \mid y) = \prod_{i=1}^{k} P(x_i \mid y)$$
Now there are only $O(|X|\,|Y|)$ parameters to estimate.
If the distribution of $x_1, \dots, x_k \mid y$ is continuous, this result is even more important: $P(x_1, x_2, \dots, x_k \mid y)$ would have to be estimated nonparametrically, and nonparametric methods are very sensitive to high-dimensional problems.
Bayes Classifier
The learning step consists of estimating $P(x_i \mid y)$ for all $i \in \{1, 2, \dots, k\}$.
Data with an unknown class are classified by computing the $y^*$ that maximizes the posterior:
$$y^* \in \arg\max_{y_{n+1} \in Y} P(y_{n+1}) \prod_{i=1}^{k} P(x_{n+1,i} \mid y_{n+1})$$
Note: due to underflow, the above is usually replaced with the numerically tractable expression
$$y^* \in \arg\max_{y_{n+1} \in Y} \ln P(y_{n+1}) + \sum_{i=1}^{k} \ln P(x_{n+1,i} \mid y_{n+1})$$
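The log-space rule can be sketched in a few lines of plain Python (the priors and likelihood tables below are invented for illustration; in practice they come from the learning step):

```python
import math

# Hypothetical learned parameters for a 2-class, 3-feature problem.
# priors[y] = P(y); likelihood[y][i][v] = P(x_i = v | y) for binary features.
priors = {0: 0.5, 1: 0.5}
likelihood = {
    0: [{0: 0.9, 1: 0.1}, {0: 0.7, 1: 0.3}, {0: 0.8, 1: 0.2}],
    1: [{0: 0.2, 1: 0.8}, {0: 0.4, 1: 0.6}, {0: 0.3, 1: 0.7}],
}

def classify(x):
    """Return the class maximizing ln P(y) + sum_i ln P(x_i | y)."""
    def log_posterior(y):
        return math.log(priors[y]) + sum(
            math.log(likelihood[y][i][v]) for i, v in enumerate(x))
    return max(priors, key=log_posterior)

print(classify([1, 1, 1]))  # mostly-1 features favor class 1
print(classify([0, 0, 0]))  # mostly-0 features favor class 0
```

Summing logs instead of multiplying probabilities gives the same argmax while avoiding underflow when k is large.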
Example
Classifying emails into spam or ham
Training set: n tuples that contain the text of the email and its class.
$$x_{i,j} = \begin{cases} 1 & \text{if word } i \text{ appears in email } j \\ 0 & \text{otherwise} \end{cases} \qquad y_j = \begin{cases} 1 & \text{if ham} \\ 0 & \text{if spam} \end{cases}$$
Calculate likelihood of each word by class:
$$P(x_i \mid y = 1) = \frac{\sum_{j=1}^{n} x_{i,j}\, y_j}{\sum_{j=1}^{n} y_j} \qquad P(x_i \mid y = 0) = \frac{\sum_{j=1}^{n} x_{i,j}\,(1 - y_j)}{\sum_{j=1}^{n} (1 - y_j)}$$
Example
Define prior, calculate numerator of posterior probability:
$$P(y_{n+1} = 1 \mid x_{1,n+1}, x_{2,n+1}, \dots, x_{k,n+1}) \propto P(y_{n+1} = 1) \prod_{i=1}^{k} P(x_{i,n+1} \mid y_{n+1} = 1)$$
$$P(y_{n+1} = 0 \mid x_{1,n+1}, x_{2,n+1}, \dots, x_{k,n+1}) \propto P(y_{n+1} = 0) \prod_{i=1}^{k} P(x_{i,n+1} \mid y_{n+1} = 0)$$
If $P(y_{n+1} = 1 \mid \vec{x}_{n+1}) > P(y_{n+1} = 0 \mid \vec{x}_{n+1})$, classify as ham.
If $P(y_{n+1} = 1 \mid \vec{x}_{n+1}) < P(y_{n+1} = 0 \mid \vec{x}_{n+1})$, classify as spam.
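The count formulas for the likelihoods can be sketched on a toy word matrix (the data below are invented for illustration; a real filter would also smooth zero counts, e.g. with Laplace smoothing):

```python
# Toy training set: x[i][j] = 1 if word i appears in email j, else 0;
# y[j] = 1 for ham, 0 for spam.
x = [
    [1, 1, 1, 0],   # word 0: both ham emails and one spam email
    [0, 0, 1, 1],   # word 1: only the two spam emails
]
y = [1, 1, 0, 0]    # first two emails ham, last two spam
n = len(y)

def likelihoods(i):
    """P(x_i | y=1) and P(x_i | y=0) via the count formulas above."""
    p_ham = sum(x[i][j] * y[j] for j in range(n)) / sum(y)
    p_spam = sum(x[i][j] * (1 - y[j]) for j in range(n)) / sum(1 - yj for yj in y)
    return p_ham, p_spam

print(likelihoods(0))  # word 0: in 2 of 2 ham, 1 of 2 spam -> (1.0, 0.5)
print(likelihoods(1))  # word 1: in 0 of 2 ham, 2 of 2 spam -> (0.0, 1.0)
```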
Naïve Bayes in SQL
Why SQL?
- Standard language in a DBMS
- Eliminates the need to understand and modify internal source code
Drawbacks:
- Limitations in manipulating vectors and matrices
- More overhead than systems languages (e.g. C)
Efficient SQL implementations of Naïve Bayes
Numeric attributes:
- Binning is required (create k uniform intervals between min and max, or take intervals around the mean based on multiples of the standard deviation)
- Two passes over the data set are needed to transform numerical attributes into discrete ones: a first pass for the minimum, maximum, and mean, and a second pass for the variance (due to numerical issues)
Discrete attributes:
- We can compute histograms on each attribute with SQL aggregations
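The discrete-attribute histogram can be sketched with a single SQL aggregation (sqlite3 in-memory database; the table and column names are made up for illustration):

```python
import sqlite3

# Hypothetical training table: one discrete attribute x1 plus the class label y.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE train (x1 TEXT, y INTEGER)")
con.executemany("INSERT INTO train VALUES (?, ?)", [
    ("a", 1), ("a", 1), ("b", 1), ("a", 0), ("b", 0), ("b", 0),
])

# Histogram of x1 per class; dividing each cell count by the per-class
# total turns the histogram into an estimate of P(x1 | y).
rows = con.execute("""
    SELECT t1.y, t1.x1, COUNT(*) * 1.0 / tot.n AS p
    FROM train t1
    JOIN (SELECT y, COUNT(*) AS n FROM train GROUP BY y) tot
      ON tot.y = t1.y
    GROUP BY t1.y, t1.x1, tot.n
    ORDER BY t1.y, t1.x1
""").fetchall()
print(rows)  # one (class, value, probability) row per histogram cell
```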
Generalisations of Naive-Bayes
- Bayesian K-means (BKM) is a generalisation of Naïve Bayes (NB)
- NB has 1 cluster per class; BKM has k > 1 clusters per class
- The class decomposition is found by the K-means algorithm
K-Means algorithm
The K-means algorithm finds k clusters by choosing k data points at random as initial cluster centers. Each data point is then assigned to the cluster whose center is closest to that point. Each cluster center is then replaced by the mean of all data points that have been assigned to that cluster. This process is iterated until no data point is reassigned to a different cluster.
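The steps above can be sketched in plain Python (1-D points and a fixed seed for reproducibility; a production implementation would vectorize the distance computations and cap the number of iterations):

```python
import random

def kmeans(points, k, seed=0):
    """Lloyd's algorithm as described above: assign, re-center, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k random points as initial centers
    assignment = None
    while True:
        # Assign each point to the cluster whose center is closest.
        new_assignment = [min(range(k), key=lambda c: abs(p - centers[c]))
                          for p in points]
        if new_assignment == assignment:     # no point changed cluster: stop
            return centers, assignment
        assignment = new_assignment
        # Replace each center by the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = sum(members) / len(members)

centers, labels = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
print(sorted(centers))  # one center near 1, one near 9
```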
Tables needed for Bayesian K-means
Example SQL queries for K-Means algorithm
The following SQL statement computes k distances for each point, corresponding to the gth class.
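The slide's SQL statement is not preserved in this transcript; a sketch of such a distance computation, assuming hypothetical tables X (points) and C (the k cluster centers of each class g) in an sqlite3 in-memory database, might look like:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: X holds 2-D points; C holds k centers per class g.
con.execute("CREATE TABLE X (i INTEGER, x1 REAL, x2 REAL)")
con.execute("CREATE TABLE C (g INTEGER, j INTEGER, c1 REAL, c2 REAL)")
con.executemany("INSERT INTO X VALUES (?, ?, ?)",
                [(1, 0.0, 0.0), (2, 3.0, 4.0)])
con.executemany("INSERT INTO C VALUES (?, ?, ?, ?)",
                [(0, 1, 0.0, 0.0), (0, 2, 3.0, 0.0)])

# Squared Euclidean distance from every point to each of class g's k
# centers: the cross join yields one row per (point, center) pair.
rows = con.execute("""
    SELECT X.i, C.j,
           (X.x1 - C.c1)*(X.x1 - C.c1) + (X.x2 - C.c2)*(X.x2 - C.c2) AS d2
    FROM X, C
    WHERE C.g = 0
    ORDER BY X.i, C.j
""").fetchall()
print(rows)  # (point id, center id, squared distance) triples
```

Taking the argmin of d2 per point id then gives the cluster assignment step of K-means.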
Results
- Experiments with 4 real data sets, comparing NB, BKM, and decision trees (DT)
- Numeric and discrete versions of Naïve Bayes had similar accuracy
- BKM was more accurate than NB and similar to decision trees in global accuracy; however, BKM is more accurate when computing a breakdown of accuracy per class
Results
- Low numbers of clusters produced good results
- Equivalent implementations of NB in SQL and C++: SQL is four times slower
- SQL queries were faster than user-defined functions (SQL optimisations are important!)
- NB and BKM exhibited linear scalability in data set size and dimensionality