1 A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs Ph.D Showcase, Dept. of Computer Science Sasi Kumar Pitchaimalai Ph.D Candidate Database

1

A K-Means Based Bayesian Classifier Inside a DBMS

Using SQL & UDFsPh.D Showcase, Dept. of Computer Science

Sasi Kumar PitchaimalaiPh.D Candidate

Database Systems Group, Department of Computer ScienceUniversity of Houston

Advisor: Dr. Carlos Ordonez

2

Motivation

• Naïve Bayes Classifier(NB)–One of the most popular and

important classifiers in Machine Learning–Robust, Powerful, Fast to Compute

And Easy to Understand• Programming Inside A DBMS–SQL can easily handle complex

computations–UDFs can use arrays and processed

in memory

Data Mining Inside A DBMS

•Avoids Exporting the data outside the DBMS

•Major overhead•Data Security

•Scales Linearly with large data sets•Exploit parallelism provided by a DBMS•Use optimized queries with simple database operations•Objective: Push computations involving large data sets inside the DBMS

4

Bayesian Classifier Based On K-Means (BKM)

• A Generalization Of Naïve Bayes(NB)• The Algorithm

– Initialization: Randomly initialize k clusters per class from the data set.

– E-Step: Compute Euclidean distance, find nearest cluster and then compute sufficient statistics.

– M-Step: Re-compute cluster centers and radii. Check Convergence.

• The E-Step and M-Step are repeated until model converges i.e clusters do not move

BKM: Finding the clusters per class

6

Database Optimizations

• Five different query optimization techniques for distance computation were introduced.

• User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF.

• Using CASE statement instead of aggregations.

• Sufficient Statistics of the clusters were computed in a single table scan.

7

Comparing Accuracy – NB Vs BKM Vs DT

Data Set

Algorithm

Global

Class-0

Class-1

pima NB 76% 80% 68%

BKM 76% 87% 53%

DT 68% 76% 53%

Spam NB 70% 87% 45%

BKM 73% 91% 43%

DT 80% 85% 72%

Bscale NB 50% 51% 30%

BKM 59% 59% 60%

DT 89% 96% 0%

Wbcancer

NB 93% 91% 95%

BKM 93% 84% 97%

DT 95% 94% 96%

•Global Accuracy: BKM better than NB and worse than DT(Decision Tree) in most cases

•Class Breakdown Accuracy:

BKM better than NB except 2 cases proving class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here and really worse in case of the bscale.

8

BKM Scalability- Varying n,d,k

Times per Iteration. Defaults: d=4,k=4,n=100k

Comparing DBMS with MapReduce

MapReduce: A distributed non-transactional high performance data intensive processing framework.

Incremental Mining

• An UDF performing incremental data mining exploiting data parallelism

• Minimizing the number of scans(1-3) on the data set

• Provides an approximation of the model before we scan through the complete data set

• Requires thread safe sharing of the model without affecting performance

Papers• Carlos Ordonez, Sasi K. Pitchaimalai: One-pass

data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220

• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado : Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan, CloudDB, CIKM 2010

• Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling, DKE 2010

• Carlos Ordonez, Sasi K. Pitchaimalai - Bayesian Classifiers Programmed in SQL, TKDE 2008

• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado – Efficient Distance Computation Using SQL Queries and UDFs, ICDM 2008

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/o/Ordonez:Carlos.html

http://www.informatik.uni-trier.de/~ley/db/conf/sigmod/sigmod2011.html

Documents

1 A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs Ph.D Showcase, Dept. of Computer Science Sasi Kumar Pitchaimalai Ph.D Candidate Database