Upload
shauna-hancock
View
216
Download
2
Embed Size (px)
Citation preview
1
A K-Means Based Bayesian Classifier Inside a DBMS
Using SQL & UDFsPh.D Showcase, Dept. of Computer Science
Sasi Kumar PitchaimalaiPh.D Candidate
Database Systems Group, Department of Computer ScienceUniversity of Houston
Advisor: Dr. Carlos Ordonez
2
Motivation
• Naïve Bayes Classifier(NB)–One of the most popular and
important classifiers in Machine Learning–Robust, Powerful, Fast to Compute
And Easy to Understand• Programming Inside A DBMS–SQL can easily handle complex
computations–UDFs can use arrays and processed
in memory
Data Mining Inside A DBMS
•Avoids Exporting the data outside the DBMS
•Major overhead•Data Security
•Scales Linearly with large data sets•Exploit parallelism provided by a DBMS•Use optimized queries with simple database operations•Objective: Push computations involving large data sets inside the DBMS
4
Bayesian Classifier Based On K-Means (BKM)
• A Generalization Of Naïve Bayes(NB)• The Algorithm
– Initialization: Randomly initialize k clusters per class from the data set.
– E-Step: Compute Euclidean distance, find nearest cluster and then compute sufficient statistics.
– M-Step: Re-compute cluster centers and radii. Check Convergence.
• The E-Step and M-Step are repeated until model converges i.e clusters do not move
BKM: Finding the clusters per class
6
Database Optimizations
• Five different query optimization techniques for distance computation were introduced.
• User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF.
• Using CASE statement instead of aggregations.
• Sufficient Statistics of the clusters were computed in a single table scan.
7
Comparing Accuracy – NB Vs BKM Vs DT
Data Set
Algorithm
Global
Class-0
Class-1
pima NB 76% 80% 68%
BKM 76% 87% 53%
DT 68% 76% 53%
Spam NB 70% 87% 45%
BKM 73% 91% 43%
DT 80% 85% 72%
Bscale NB 50% 51% 30%
BKM 59% 59% 60%
DT 89% 96% 0%
Wbcancer
NB 93% 91% 95%
BKM 93% 84% 97%
DT 95% 94% 96%
•Global Accuracy: BKM better than NB and worse than DT(Decision Tree) in most cases
•Class Breakdown Accuracy:
BKM better than NB except 2 cases proving class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here and really worse in case of the bscale.
8
BKM Scalability- Varying n,d,k
Times per Iteration. Defaults: d=4,k=4,n=100k
Comparing DBMS with MapReduce
MapReduce: A distributed non-transactional high performance data intensive processing framework.
Incremental Mining
• An UDF performing incremental data mining exploiting data parallelism
• Minimizing the number of scans(1-3) on the data set
• Provides an approximation of the model before we scan through the complete data set
• Requires thread safe sharing of the model without affecting performance
Papers• Carlos Ordonez, Sasi K. Pitchaimalai: One-pass
data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220
• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado : Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan, CloudDB, CIKM 2010
• Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling, DKE 2010
• Carlos Ordonez, Sasi K. Pitchaimalai - Bayesian Classifiers Programmed in SQL, TKDE 2008
• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado – Efficient Distance Computation Using SQL Queries and UDFs, ICDM 2008