
ABHIJIT SENGUPTA, MBA(D), IVth Semester,

Roll no: 79

Naïve Bayes Classification Model

A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".

In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
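To make the fruit example concrete, the independence assumption lets the classifier score the class "apple" simply by multiplying per-feature probabilities against the class prior. This is a worked form of the model derived in the next section, using only the features named in the example:

$$p(\text{apple} \mid \text{red}, \text{round}, 4'') \propto p(\text{apple})\, p(\text{red} \mid \text{apple})\, p(\text{round} \mid \text{apple})\, p(4'' \mid \text{apple})$$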

Probabilistic model

Abstractly, the probability model for a classifier is a conditional model

$$p(C \mid F_1, \dots, F_n)$$

over a dependent class variable $C$ with a small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. The problem is that if the number of features $n$ is large or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.

Using Bayes' theorem, this can be written

$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$$

In plain English, using Bayesian probability terminology, the above equation can be written as

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$


In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $F_i$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

$$p(C, F_1, \dots, F_n)$$

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$p(C, F_1, \dots, F_n) = p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1)\, p(F_3 \mid C, F_1, F_2) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})$$

Now the "naive" conditional independence assumptions come into play: assume that each feature is conditionally independent of every other feature for

given the category . This means that

,

, ,

and so on, for . Thus, the joint model can be expressed as


This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is:

$$p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$$

where the evidence $Z = p(F_1, \dots, F_n)$ is a scaling factor dependent only on $F_1, \dots, F_n$, that is, a constant if the values of the feature variables are known.

Since the denominator $Z$ is constant, comparing only the numerators tells us which outcome or class $C$ is most likely to occur given a set of values of the features $F_i$.
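As a small illustration of this comparison (not from the original text), the sketch below scores each class by its numerator $p(C) \prod_i p(F_i \mid C)$ and then divides by $Z$, the sum of the numerators, to recover proper posteriors; the function name and the dictionary layout it expects are assumptions chosen only to show the mechanics.

```python
# Minimal sketch: score each class by p(C) * prod_i p(F_i | C), then
# normalise by Z (the sum of the numerators) to obtain posteriors.
# The data structures are hypothetical; any table of estimated
# probabilities in this shape would work.

def naive_bayes_posteriors(priors, likelihoods, observed):
    """priors: {class: p(C)}
    likelihoods: {class: {feature: {value: p(value | C)}}}
    observed: {feature: value}"""
    numerators = {}
    for c, prior in priors.items():
        score = prior
        for feature, value in observed.items():
            score *= likelihoods[c][feature][value]
        numerators[c] = score
    z = sum(numerators.values())            # the evidence Z
    return {c: s / z for c, s in numerators.items()}

# The most likely class is the one with the largest numerator;
# dividing by Z changes the scale but not the ranking.
```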

Let us discuss this with the following example:

Example

In the following table, there are four attributes (age, income, student and credit_rating) describing 14 tuples. The records R1, R2, …, R14 are described by their attribute values. Let us consider a tuple X having the following attributes: X = (age = youth (i.e. <=30), income = medium, student = yes, credit_rating = fair). Will a person belonging to tuple X buy a computer?

Records   Age     Income   Student   Credit_rating   Buys_computer
R1        <=30    high     no        fair            no
R2        <=30    high     no        excellent       no
R3        31-40   high     no        fair            yes
R4        >40     medium   no        fair            yes
R5        >40     low      yes       fair            yes
R6        >40     low      yes       excellent       no
R7        31-40   low      yes       excellent       yes
R8        <=30    medium   no        fair            no
R9        <=30    low      yes       fair            yes
R10       >40     medium   yes       fair            yes
R11       <=30    medium   yes       excellent       yes
R12       31-40   medium   no        excellent       yes
R13       31-40   high     yes       fair            yes
R14       >40     medium   no        excellent       no

D: the set of training tuples. Each tuple is an n-dimensional attribute vector X = (x1, x2, x3, …, xn). Let there be m classes: C1, C2, C3, …, Cm.

The naïve Bayes classifier predicts that X belongs to class Ci if and only if P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.

Maximum posteriori hypothesis: P(Ci|X) = P(X|Ci) P(Ci) / P(X). We maximize P(X|Ci) P(Ci), as P(X) is constant across classes.

With many attributes, it is computationally expensive to evaluate P(X|Ci) directly, so we make the naïve assumption of "class conditional independence".

Under class conditional independence, P(X|Ci) is the product of the individual attribute probabilities: P(X|Ci) = P(x1|Ci) · P(x2|Ci) · … · P(xn|Ci), and the quantity to maximize is P(X|Ci) · P(Ci).

Theory applied to the previous example:

P(C1) = P(buys_computer = yes) = 9/14 = 0.643
P(C2) = P(buys_computer = no) = 5/14 = 0.357
P(age = youth | buys_computer = yes) = 2/9 = 0.222
P(age = youth | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400


P(X | buys_computer = yes) = P(age = youth | buys_computer = yes) × P(income = medium | buys_computer = yes) × P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044

Similarly, P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

Now, find the class Ci that maximizes P(X|Ci) × P(Ci):

P(X | buys_computer = yes) × P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no) × P(buys_computer = no) = 0.019 × 0.357 = 0.007

Prediction: the person described by tuple X buys a computer.
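As a sanity check of the arithmetic above, here is a short Python sketch (not part of the original text) that recomputes the two numerators directly from the table by counting; the variable and function names are illustrative.

```python
# Recompute the buys_computer example by counting directly from the table.
# Column order: (age, income, student, credit_rating, buys_computer).
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31-40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31-40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no",  "excellent", "yes"),
    ("31-40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

# Tuple X: age = youth (<=30), income = medium, student = yes, credit_rating = fair.
x = ("<=30", "medium", "yes", "fair")

def score(label):
    rows = [r for r in data if r[4] == label]
    prior = len(rows) / len(data)                    # P(Ci)
    likelihood = 1.0
    for i, value in enumerate(x):                    # P(xk | Ci) for each attribute
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prior * likelihood                        # numerator P(X | Ci) * P(Ci)

for label in ("yes", "no"):
    print(label, round(score(label), 3))             # ~0.028 for yes, ~0.007 for no
print("Prediction:", max(("yes", "no"), key=score))  # yes -> buys a computer
```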

In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more recent approaches, such as boosted trees or random forests.

An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix. It is particularly suited to problems where the dimensionality of the inputs is high. Parameter estimation for naive Bayes models uses the method of maximum likelihood. In spite of its over-simplified assumptions, it often performs well in many complex real-world situations.
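To illustrate the parameter-estimation point for numeric inputs, the following is a minimal sketch (my illustration, assuming Gaussian class-conditional densities, one common choice); only a per-class mean and variance is estimated for each variable by maximum likelihood, never a full covariance matrix. The toy data is invented.

```python
import numpy as np

# Maximum-likelihood estimation for a Gaussian naive Bayes model:
# one mean and one variance per (class, feature). Toy data, two features.
X = np.array([[1.0, 2.1], [0.9, 1.9], [3.2, 4.0], [3.0, 4.2]])
y = np.array([0, 0, 1, 1])

params = {}
for c in np.unique(y):
    Xc = X[y == c]
    params[c] = {
        "prior": len(Xc) / len(X),     # P(C)
        "mean": Xc.mean(axis=0),       # per-feature mean for class c
        "var": Xc.var(axis=0) + 1e-9,  # per-feature ML variance, small floor added
    }

def log_posterior(x):
    # log P(C) + sum_i log N(x_i; mean_i, var_i), compared across classes
    out = {}
    for c, p in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        out[c] = np.log(p["prior"]) + log_lik
    return out

print(log_posterior(np.array([1.0, 2.0])))  # class 0 should score higher here
```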


Uses of Naive Bayes classification:

1. Naive Bayes text classification: Bayesian classification is used as a probabilistic learning method for text (Naive Bayes text classification). Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents; a minimal sketch appears after this list.

2. Spam filtering: Spam filtering is the best known use of naive Bayesian text classification. It makes use of a naive Bayes classifier to identify spam e-mail. Bayesian spam filtering has become a popular mechanism to distinguish illegitimate spam email from legitimate email (sometimes called "ham"). Many modern mail clients implement Bayesian spam filtering, and users can also install separate email filtering programs. Server-side email filters, such as DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within mail server software itself.

3. Hybrid recommender system using a Naive Bayes classifier and collaborative filtering: Recommender systems apply machine learning and data mining techniques to filter unseen information and can predict whether a user would like a given resource. A switching hybrid recommendation approach has been proposed that combines a Naïve Bayes classification approach with collaborative filtering. Experimental results on two different data sets show that the proposed algorithm is scalable and provides better performance, in terms of accuracy and coverage, than other algorithms, while at the same time eliminating some recorded problems with recommender systems.

4. Online applications: One such online application has been set up as a simple example of supervised machine learning and affective computing. Using a training set of examples which reflect nice, nasty or neutral sentiments, the system (named Ditto) is trained to distinguish between them. Simple emotion modelling combines a statistically based classifier with a dynamical model. The Naive Bayes classifier employs single words and word pairs as features and allocates user utterances into nice, nasty and neutral classes, labelled +1, -1 and 0 respectively. This numerical output drives a simple first-order dynamical system, whose state represents the simulated emotional state of the experiment's personification.
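For items 1 and 2, a minimal sketch of Naive Bayes text classification is given below, using scikit-learn's CountVectorizer and MultinomialNB (my choice of library, assumed to be installed); the tiny spam/ham training messages are invented purely for illustration.

```python
# Minimal sketch of Naive Bayes text classification / spam filtering with
# scikit-learn. The toy messages and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win money now, claim your free prize",
    "limited offer, free cash, click here",
    "meeting moved to 3pm, see agenda attached",
    "lunch tomorrow? also sending the project notes",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()    # bag-of-words features (word counts)
X = vectorizer.fit_transform(messages)

model = MultinomialNB()           # multinomial naive Bayes over word counts
model.fit(X, labels)

test = vectorizer.transform(["free prize, click now"])
print(model.predict(test))        # expected: ['spam']
```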


Bibliography:

http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
http://en.wikipedia.org
http://eprints.ecs.soton.ac.uk/18483/
http://www.convo.co.uk/x02/
http://www.statsoft.com
Data Mining: Concepts and Techniques, Jiawei Han
