
Data Stream Mining and Incremental Discretization John Russo CS561 Final Project April 26, 2007


Page 1

Data Stream Mining and Incremental Discretization

John Russo

CS561 Final Project

April 26, 2007

Page 2

Overview

- Introduction
- Data Mining: A Brief Overview
- Histograms
- Challenges of Streaming Data to Data Mining
- Using Histograms for Incremental Discretization of Data Streams
- Fuzzy Histograms
- Future Work

Page 3

Introduction

- Data mining
  - A class of algorithms for knowledge discovery
  - Patterns, trends, predictions
  - Utilizes statistical methods, neural networks, genetic algorithms, decision trees, etc.
- Streaming data presents unique challenges to traditional data mining
  - Non-persistence: one opportunity to mine
  - High data rates
  - Non-discrete values
  - Changing over time
  - Huge volumes of data

Page 4

Data Mining: Types of Relationships

- Classes: predetermined groups
- Clusters: groups of related data
- Sequential patterns: used to predict behavior
- Associations: rules are built from associations between data

Page 5

Data Mining: Algorithms

- K-means clustering
  - Unsupervised learning algorithm
  - Partitions a data set into a pre-defined number of clusters
- Decision trees
  - Used to generate rules for classification
  - Two common types: CART and CHAID
- Nearest neighbor
  - Classifies a record in a dataset based upon similar records in a historical dataset

Page 6

Data Mining: Algorithms (continued)

- Rule induction: uses statistical significance to find interesting rules
- Data visualization: uses graphics for mining

Page 7

Histograms and Data Mining

[Figure: "Histogram of Wire Diameters (Small Bins)". X-axis: Diameter (mm), 0.85 to 1.40; y-axis: Frequency, 0 to 7.]

Page 8

Histograms and Supervised Learning: An Example

Age    Income  Marital Status  Credit Rating  Mortgage Approval
<=30   Low     Single          Excellent      Yes
<=30   Medium  Divorced        Good           No
31-40  High    Married         Poor           No
31-40  High    Married         Excellent      Yes
<=30   High    Married         Good           Yes
41-50  Low     Married         Excellent      Yes
41-50  Medium  Single          Poor           Yes
>50    High    Married         Good           No
>50    Low     Single          Excellent      No
<=30   Low     Married         Excellent      No

Table 1 - Training Data for a Naïve Bayesian Classification

Page 9

Histograms and Supervised Learning: An Example

We have two classes:
- Mortgage approval = "Yes": P(mortgage approval = "Yes") = 5/10 = .5
- Mortgage approval = "No": P(mortgage approval = "No") = 5/10 = .5

Let's calculate some of the conditional probabilities based upon the training data:
- P(age <= 30 | mortgage approval = "Yes") = 2/5 = .4
- P(age <= 30 | mortgage approval = "No") = 2/5 = .4
- P(income = "Low" | mortgage approval = "Yes") = 2/5 = .4
- P(income = "Low" | mortgage approval = "No") = 2/5 = .4
- P(income = "Medium" | mortgage approval = "Yes") = 1/5 = .2
- P(income = "Medium" | mortgage approval = "No") = 1/5 = .2
- P(marital status = "Married" | mortgage approval = "Yes") = 3/5 = .6
- P(marital status = "Married" | mortgage approval = "No") = 3/5 = .6
- P(credit rating = "Good" | mortgage approval = "Yes") = 1/5 = .2
- P(credit rating = "Good" | mortgage approval = "No") = 2/5 = .4

Page 10

Histograms and Supervised Learning: An Example

We will use Bayes' rule and the naïve assumption that all attributes are independent:

P(C = c | A1 = a1, ..., Ak = ak) = P(A1 = a1, ..., Ak = ak | C = c) * P(C = c) / P(A1 = a1, ..., Ak = ak)

P(A1 = a1, ..., Ak = ak) is irrelevant, since it is the same for every class.

Now, let's predict the class for one observation:

X = (age <= 30, income = "medium", marital status = "married", credit rating = "good")

Page 11

Histograms and Supervised Learning: An Example

- P(X | mortgage approval = "Yes") = .4 * .2 * .6 * .2 = 0.0096
- P(X | mortgage approval = "No") = .4 * .2 * .6 * .4 = 0.0192
- Multiplying by the priors, P(X | C = c) * P(C = c): 0.0096 * .5 = 0.0048 for "Yes" and 0.0192 * .5 = 0.0096 for "No"
- X belongs to the "No" class.
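The worked example can be checked in a few lines. This sketch is not part of the original slides; the `predict` helper is an illustrative name, and the training tuples are the rows of Table 1.

```python
from collections import Counter

# (age, income, marital status, credit rating, approval) from Table 1
train = [
    ("<=30", "Low", "Single", "Excellent", "Yes"),
    ("<=30", "Medium", "Divorced", "Good", "No"),
    ("31-40", "High", "Married", "Poor", "No"),
    ("31-40", "High", "Married", "Excellent", "Yes"),
    ("<=30", "High", "Married", "Good", "Yes"),
    ("41-50", "Low", "Married", "Excellent", "Yes"),
    ("41-50", "Medium", "Single", "Poor", "Yes"),
    (">50", "High", "Married", "Good", "No"),
    (">50", "Low", "Single", "Excellent", "No"),
    ("<=30", "Low", "Married", "Excellent", "No"),
]

def predict(x):
    """Return the class c maximizing P(X | C=c) * P(C=c)."""
    priors = Counter(row[-1] for row in train)
    best, best_score = None, -1.0
    for c, n_c in priors.items():
        rows = [row for row in train if row[-1] == c]
        score = n_c / len(train)  # prior P(C=c)
        for i, value in enumerate(x):
            # frequency-count estimate of P(A_i = a_i | C = c)
            score *= sum(1 for row in rows if row[i] == value) / n_c
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(("<=30", "Medium", "Married", "Good")))  # No
```

Computed from the table, the scores are .5 * .4 * .2 * .6 * .2 = 0.0048 for "Yes" and .5 * .4 * .2 * .6 * .4 = 0.0096 for "No", so "No" wins.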

The probabilities are determined by frequency counts; the frequencies are tabulated in bins.

Two common types of histograms:
- Equal-width: the range of observed values is divided into k equal intervals
- Equal-frequency: the frequencies are equal in all bins

The difficulty is determining the number of bins, k:
- Sturges' rule
- Scott's rule

Determining k for a data stream is problematic.
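Both rules illustrate the problem: they fix k from quantities (the sample size n and, for Scott's rule, the standard deviation) that are only known once all the data has been seen, which never happens for an unbounded stream. A quick sketch (function names are mine):

```python
import math
import statistics

def sturges_k(n):
    """Sturges' rule: k = 1 + ceil(log2 n)."""
    return 1 + math.ceil(math.log2(n))

def scott_k(data):
    """Scott's rule: bin width h = 3.49 * sigma * n^(-1/3),
    so k is the observed range divided by h."""
    n = len(data)
    h = 3.49 * statistics.stdev(data) * n ** (-1 / 3)
    return math.ceil((max(data) - min(data)) / h)

print(sturges_k(1000))  # 11
```

Neither formula can be evaluated incrementally without already committing to an n, which motivates the incremental discretization approach below.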

Page 12

Challenges of Streaming Data to Data Mining

- Determining k for a histogram or for machine learning
- Concept drift
  - Data from the past is no longer valid for today's model
  - Several approaches:
    - Incremental learning - CVFDT
    - Ensemble classifiers
    - Ambiguous decision trees
- What about the "ebb and flow" problem?

Page 13

Incremental Discretization

- A way to create discrete intervals from a data stream
- Partition Incremental Discretization (PID) algorithm (Gama and Pinto)
  - Two-level algorithm
  - Creates intervals at level 1
    - Only one pass over the stream
  - Aggregates level 1 intervals into level 2 intervals

Page 14

Incremental Discretization: Example

Temp  Soil Moisture  Sprinkler Flow
45    .3             Medium
86    .1             High
67    .8             Low
32    .98            Off
91    .1             High
85    .8             Medium
75    .5             Medium
56    .1             Medium
82    .9             Low
83    .5             Medium
84    .6             Medium
26    .35            Off
82    .55            Low
83    0.0            High
84    .25            Low

Page 15

Incremental Discretization: Example

- Sensor data reporting on air temperature, soil moisture, and flow of water in a sprinkler
- The data shown on the previous slide is training data
- Once trained, the model can predict what we should set the sprinkler to based upon conditions
- A 4-class problem (High, Medium, Low, Off)

Page 16

Incremental Discretization: Example

We will walk through level 1 for the temperature attribute:
- Decide an estimated range: 30 - 85
- Pick a number of intervals (11), so the step is set to 5
- Maintain 2 vectors: breaks and counts
- Set a threshold for splitting an interval: 33% of all observed values
- Begin to work through the training set:
  - If a value falls below the lower bound of the range, add a new interval before the first interval
  - If a value falls above the upper bound of the range, add a new interval after the last interval
  - If an interval reaches the threshold, split it evenly and divide the count between the old interval and the new one
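The walkthrough above can be sketched in code. This is my reading of the slide's bullet points, not Gama and Pinto's reference implementation; in particular, the short warm-up before splitting is an assumption (without it, the very first value would already hold more than 33% of the observations and trigger a split).

```python
class PIDLayer1:
    """Sketch of PID layer 1: one pass, one update per stream value."""

    WARMUP = 10  # don't split until this many values seen (an assumption)

    def __init__(self, lo, hi, step, threshold=1/3):
        self.step = step
        self.threshold = threshold  # split when one bin holds this share
        self.breaks = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
        self.counts = [0.0] * (len(self.breaks) - 1)
        self.total = 0

    def update(self, x):
        self.total += 1
        if x < self.breaks[0]:      # below range: new interval on the left
            self.breaks.insert(0, self.breaks[0] - self.step)
            self.counts.insert(0, 1.0)
        elif x >= self.breaks[-1]:  # above range: new interval on the right
            self.breaks.append(self.breaks[-1] + self.step)
            self.counts.append(1.0)
        else:
            i = next(j for j in range(len(self.counts))
                     if self.breaks[j] <= x < self.breaks[j + 1])
            self.counts[i] += 1.0
            if (self.total >= self.WARMUP
                    and self.counts[i] > self.threshold * self.total):
                # split evenly, dividing the count between old and new
                mid = (self.breaks[i] + self.breaks[i + 1]) / 2
                self.breaks.insert(i + 1, mid)
                self.counts[i] /= 2
                self.counts.insert(i + 1, self.counts[i])

layer1 = PIDLayer1(lo=30, hi=85, step=5)
for t in [45, 86, 67, 32, 91, 85, 75, 56, 82, 83, 84, 26, 82, 83, 84]:
    layer1.update(t)
print(layer1.breaks)
```

Running the training temperatures through it yields the breaks vector 25, 30, ..., 80, 82.5, 85, 90, 95 shown on the next slide: 26 and 91 grow the range, and the 80-85 bin splits at 82.5 once it holds 5 of 14 values (over 33%).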

Page 17

Incremental Discretization: Example

Breaks vector for our sample after training:

25  30  35  40  45  50  55  60  65  70  75  80  82.5  85  90  95

Counts vector for our sample after training:

1  1  0  0  0  0  1  0  1  0  2.5  3.5  2  1  0

Page 18

Second LayerSecond Layer

The second layer is invoked whenever The second layer is invoked whenever necessary. necessary. User interventionUser intervention Changes in intervals of first layerChanges in intervals of first layer

InputInput Breaks and counters from layer 1Breaks and counters from layer 1 Type of histogram to be generatedType of histogram to be generated

Page 19

Second Layer

- The objective is to create a smaller number of intervals based upon the layer 1 intervals
- For equal-width histograms:
  - Computes the number of intervals based upon the range observed in layer 1
  - Traverses the vector of breaks once and adds the counters of consecutive intervals
- For equal-frequency histograms:
  - Computes the exact number of data points for each interval
  - Traverses the counters and adds counts of consecutive intervals
  - Stops for each layer 2 interval when the target frequency is reached
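The equal-frequency pass can be sketched directly over the layer-1 vectors. This is a minimal sketch under the description above (the function name is mine), ignoring corner cases such as a single bin larger than the target:

```python
def layer2_equal_frequency(breaks, counts, k):
    """Merge consecutive layer-1 bins into k layer-2 intervals
    holding roughly equal counts (PID's equal-frequency mode)."""
    target = sum(counts) / k          # exact number of points per interval
    out, acc = [breaks[0]], 0.0
    for right_edge, c in zip(breaks[1:], counts):
        acc += c
        if acc >= target and len(out) < k:
            out.append(right_edge)    # close the current layer-2 interval
            acc = 0.0
    if out[-1] != breaks[-1]:
        out.append(breaks[-1])        # last interval absorbs the remainder
    return out

# 10 layer-1 bins with one point each, merged down to k = 2
print(layer2_equal_frequency(list(range(11)), [1] * 10, 2))  # [0, 5, 10]
```

The equal-width mode is even simpler: recompute k uniform breaks over the observed layer-1 range, then sum each run of layer-1 counters that falls inside one new interval.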

Page 20

Application of PID for Data MiningApplication of PID for Data Mining

Add a data structure to both layer 1 and Add a data structure to both layer 1 and layer 2.layer 2.

Matrix:Matrix: Columns: intervalsColumns: intervals Rows: classesRows: classes

Naïve Bayesian classification can be Naïve Bayesian classification can be easily doneeasily done
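A minimal sketch of that matrix for the temperature attribute, using the breaks from the running example (the helper names are mine): each class keeps one row of interval counts, and the conditional P(interval | class) needed for a naïve Bayes step is just a row frequency.

```python
import bisect

breaks = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 82.5, 85, 90, 95]
matrix = {c: [0] * (len(breaks) - 1) for c in ["High", "Med", "Low", "Off"]}

def _bin(value):
    """Index of the layer interval containing value (clamped to range)."""
    i = bisect.bisect_right(breaks, value) - 1
    return min(max(i, 0), len(breaks) - 2)

def observe(value, label):
    """Count one training value in its class row / interval column."""
    matrix[label][_bin(value)] += 1

def p_value_given_class(value, label):
    """P(attribute falls in value's interval | class) from row counts."""
    row = matrix[label]
    return row[_bin(value)] / sum(row)

for temp, label in [(45, "Med"), (86, "High"), (67, "Low"), (32, "Off"),
                    (91, "High"), (85, "Med"), (75, "Med"), (56, "Med")]:
    observe(temp, label)
print(p_value_given_class(75, "Med"))  # 0.25
```

Because both the matrix update and the probability lookup touch a single cell and one row sum, the classifier stays incremental: each stream value costs one count increment.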

Page 21

Example Matrix: Temperature Attribute

Class  25  30  35  40  45  50  55  60  65  70  75  80  82.5  85  90  95
High    0   0   0   0   0   0   0   0   0   0   0   0    1    1   1   0
Med     0   0   0   0   1   0   1   0   0   0   1   0    2    1   0   0
Low     0   0   0   0   0   0   0   0   1   0   0   2    1    0   0   0
Off     1   1   0   0   0   0   0   0   0   0   0   0    0    0   0   0

Page 22

Dealing with Concept Drift

- What happens when training is no longer valid (for example, in winter)?
- Assume the sensors are still on in winter but the sprinklers are not:

Temp  Soil Moisture  Sprinkler Flow
26    .3             Off
32    .1             Off
35    .8             Off
21    .98            Off
-9    .1             Off
0     .8             Off
7     .5             Off
23    .1             Off
18    .9             Off
10    .5             Off
34    .6             Off
32    .35            Off
20    .55            Off
12    0.0            Off
14    .25            Off

Page 23

Dealing with Concept Drift: Fuzzy Histograms

- Fuzzy histograms are used for visual content representation
- A given attribute value can be a member of more than one interval
  - With varying degrees of membership
  - The degree of membership is determined by a membership function

Page 24

Fuzzy Histograms with PID

- Use a membership function to build layer 2 intervals based upon a determinant in layer 1
- Sprinkler example:
  - Soil moisture is potentially a member of more than one interval
  - One interval is a high value
  - During winter, ensure that all values of moisture fall into the highest end of the range
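The membership function itself is a design choice; a common concrete choice is a triangular function. A sketch with illustrative soil-moisture intervals (the interval parameters are mine, not from the slides):

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside (a, c), rising to 1 at x == b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# A reading near a bin boundary belongs partly to two intervals,
# with degrees of membership that sum over overlapping neighbors.
m = 0.75
print(triangular(m, 0.3, 0.5, 0.8))   # membership in a "medium" interval
print(triangular(m, 0.5, 0.8, 1.0))   # membership in a "high" interval
```

Under concept drift, the membership function can be reshaped (for example, shifted toward the high end in winter) without rebuilding the layer-1 counts, which is the appeal of combining fuzzy intervals with PID.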

Page 25

References

[1] Hand, David, Heikki Mannila, and Padhraic Smyth. Principles of Data Mining. Cambridge, MA: MIT Press, 2001.

[2] Sturges, H. (1926). The choice of a class-interval. J. Amer. Statist. Assoc., 21, 65-66.

[3] Scott, D.W. (1979). On optimal and data-based histograms. Biometrika, 66, 605-610.

[4] Freedman, David and Persi Diaconis (1981). "On the histogram as a density estimator: L2 theory." Probability Theory and Related Fields, 57(4), 453-476.

[5] Zhang, Jianping, Huan Liu, and Paul P. Wang (2006). Some current issues of streaming data mining. Information Sciences, 176(14), Streaming Data Mining, 22 July 2006, 1949-1951.

[6] Hulten, G., Spencer, L., and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, CA, August 26-29, 2001), KDD '01. ACM Press, New York, NY, 97-106.

[7] Wang, H., Fan, W., Yu, P. S., and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24-27, 2003), KDD '03. ACM Press, New York, NY, 226-235.

[8] Natwichai, J. and Li, X. (2004). Knowledge maintenance on data streams with concept drifting. In Zhang, J., He, J., and Fu, Y. (eds.), 2004, 705-710, Shanghai, China.

[9] Gama, J. and Pinto, C. (2006). Discretization from data streams: applications to histograms and data mining. In Proceedings of the 2006 ACM Symposium on Applied Computing (Dijon, France, April 23-27, 2006), SAC '06. ACM Press, New York, NY, 662-667.

[10] Doulamis, Anastasios and Nikolaos Doulamis (2001). Fuzzy histograms for efficient visual content representation: application to content-based image retrieval. In IEEE International Conference on Multimedia and Expo (ICME '01), p. 227. IEEE Press.

[11] Gaber, M.M., Zaslavsky, A., and Krishnaswamy, S. (2005). "Mining data streams: a review." SIGMOD Rec., 34(2), 18-26.

Page 26

Questions?