Mining High-Speed Data Streams

Pedro Domingos, Geoff Hulten

Sixth ACM SIGKDD International Conference, 2000

Presented by: Afsoon Yousefi



Outline

Introduction

Hoeffding Trees

The VFDT System

Performance Study

Conclusion

Qs & As






Introduction

In today’s information society, extraction of knowledge is becoming a very important task for many people. We live in an age of knowledge revolution.

Many organizations have more than very large databases: their databases grow at a rate of several million records per day.

This brings both opportunities and challenges.

Main limited resources in knowledge discovery systems:

• Time
• Memory
• Sample size



Introduction (cont.)

Traditional systems:

• Only a small amount of data is available.
• Use a fraction of the available computational power.

Current systems:

• The bottleneck is time and memory.
• Use a fraction of the available samples of data.
• Try to mine databases that don't fit in main memory.

Available algorithms:

• Efficient, but do not guarantee a learned model similar to the one obtained in batch mode.
  o Never recover from an unfavorable set of early examples.
  o Sensitive to example ordering.
• Produce the same model as the batch version, but not efficiently.
  o Slower than the batch algorithm.




Requirements of algorithms to overcome these problems:

Operate continuously and indefinitely

Incorporate examples as they arrive

Never lose potentially valuable information.

Build a model using at most one scan of the data.

Use only a fixed amount of main memory.

Require small constant time per record.

Make a usable model available at any point in time.

Produce a model equivalent to the one obtained by an ordinary database mining algorithm.

When the data-generating process changes over time, the model should be up to date at any time.




Such requirements are fulfilled by:

Incremental learning methods

Online methods

Successive methods

Sequential methods






Hoeffding Trees

Classic decision tree learners:

• CART, ID3, C4.5
• All examples are held simultaneously in main memory.

Disk-based decision tree learners:

• SLIQ, SPRINT
• Examples are stored on disk.
• Expensive for learning complex trees or very large datasets.

Consider a subset of training examples to find the best attribute:

• Suitable for extremely large datasets.
• Read each example at most once.
• Directly mine online data sources.
• Build complex trees with acceptable computational cost.



Hoeffding Trees (cont.)

Given a set of $N$ examples of the form $(x, y)$:

• $N$: number of examples
• $y$: discrete class label
• $x$: a vector of attributes (symbolic or numeric)

Goal: produce a model $y = f(x)$ that will predict the classes $y$ of future examples $x$ with high accuracy.




Given a stream of examples:

• Use the first ones to choose the root test.
• Pass succeeding ones to the corresponding leaves.
• Pick the best attributes there.
• … and so on recursively.

How many examples are necessary at each node?

• Hoeffding bound
• Additive Chernoff bound
• A statistical result




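Hoeffding bound:

• $G$: heuristic measure used to choose test attributes
  o C4.5: information gain
  o CART: Gini index
  o Assume $G$ is to be maximized
• $\overline{G}$: heuristic measure after seeing $n$ examples
• $X_a$: attribute with the highest observed $\overline{G}$
• $X_b$: second-best attribute
• $\Delta\overline{G} = \overline{G}(X_a) - \overline{G}(X_b)$: difference between the two
• $\delta$: probability of choosing the wrong attribute

The Hoeffding bound guarantees that $X_a$ is the correct choice with probability $1 - \delta$ if $n$ examples have been seen at this node and $\Delta\overline{G} > \epsilon$.



Hoeffding bound:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

where $R$ is the range of $G$.

If $\Delta\overline{G} > \epsilon$, then $X_a$ is the best attribute with probability $1 - \delta$.

A node needs to accumulate examples from the stream until $\epsilon$ becomes smaller than $\Delta\overline{G}$.

The bound is independent of the probability distribution generating the observations.

It is more conservative than distribution-dependent bounds.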
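
The split test can be written in a few lines. The following is a minimal sketch, assuming the standard form of the bound, ε = sqrt(R² ln(1/δ) / (2n)); the function names `hoeffding_bound` and `can_split` and the numbers in the loop are illustrative and do not come from the slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(g_best: float, g_second: float, value_range: float,
              delta: float, n: int) -> bool:
    """True once the observed gap between the two best attributes exceeds epsilon."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)


# The same observed gap of 0.05 only becomes significant after enough examples.
for n in (100, 1_000, 10_000):
    eps = hoeffding_bound(1.0, 1e-7, n)
    print(n, round(eps, 4), can_split(0.30, 0.25, 1.0, 1e-7, n))
```

Because ε shrinks as 1/sqrt(n), a node with a small gap between its two best attributes simply waits longer before splitting.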



Hoeffding Tree algorithm

Inputs:

• $S$: a sequence of examples.
• $\mathbf{X}$: a set of discrete attributes.
• $G(\cdot)$: a split evaluation function.
• $\delta$: desired probability of choosing the wrong attribute at any given node.

Output:

• $HT$: a decision tree.



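Hoeffding Tree algorithm (cont.)

Procedure HoeffdingTree($S$, $\mathbf{X}$, $G$, $\delta$)

Let $HT$ be a tree with a single leaf $l_1$ (the root).
Let $\mathbf{X}_1 = \mathbf{X} \cup \{X_\emptyset\}$, where $X_\emptyset$ is the null attribute (no split).
Let $l_1$ predict the most frequent class in $S$.
For each class $y_k$
  For each value $x_{ij}$ of each attribute $X_i \in \mathbf{X}$
    Let $n_{ijk}(l_1) = 0$.



Hoeffding Tree algorithm (cont.)

For each example $(x, y_k)$ in $S$
• Sort $(x, y_k)$ into a leaf $l$ using $HT$.
• For each $x_{ij}$ in $x$ such that $X_i \in \mathbf{X}_l$
  o Increment $n_{ijk}(l)$.
• Label $l$ with the majority class among the examples seen at $l$.
• Compute $\overline{G}_l(X_i)$ for each attribute $X_i \in \mathbf{X}_l - \{X_\emptyset\}$.
• Let $X_a$ be the attribute with the highest $\overline{G}_l$.
• Let $X_b$ be the attribute with the second-highest $\overline{G}_l$.
• Compute $\epsilon$.
• If $\overline{G}_l(X_a) - \overline{G}_l(X_b) > \epsilon$, then
  o Replace $l$ by an internal node that splits on $X_a$.
  o For each branch of the split
    - Add a new leaf $l_m$, and let $\mathbf{X}_m = \mathbf{X} - \{X_a\}$.
    - Let $l_m$ predict the most frequent class.
    - For each class $y_k$ and each value $x_{ij}$ of each attribute $X_i \in \mathbf{X}_m - \{X_\emptyset\}$
      . Let $n_{ijk}(l_m) = 0$.
Return $HT$.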
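
The procedure can be sketched compactly in Python. This is an illustrative sketch rather than the authors' implementation. It assumes discrete attributes passed as dictionaries, information gain as the split measure with range R = log2(number of classes), and a null attribute whose gain is fixed at 0. The class and method names (`Leaf`, `Split`, `HoeffdingTree`, `learn_one`) are hypothetical.

```python
import math
import random
from collections import defaultdict


def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum(c / total * math.log2(c / total) for c in class_counts.values() if c)


class Leaf:
    def __init__(self, attributes, default=None):
        self.attributes = set(attributes)      # attributes still unused on this path
        self.default = default                 # prediction before any example arrives
        self.class_counts = defaultdict(int)   # examples per class seen at this leaf
        self.counts = defaultdict(int)         # n_ijk: (attribute, value, class) -> count

    def update(self, x, y):
        self.class_counts[y] += 1
        for a in self.attributes:
            self.counts[(a, x[a], y)] += 1

    def predict(self):
        if not self.class_counts:
            return self.default
        return max(self.class_counts, key=self.class_counts.get)

    def info_gain(self, a):
        n = sum(self.class_counts.values())
        by_value = defaultdict(lambda: defaultdict(int))
        for (attr, value, y), c in self.counts.items():
            if attr == a:
                by_value[value][y] += c
        rest = sum(sum(cc.values()) / n * entropy(cc) for cc in by_value.values())
        return entropy(self.class_counts) - rest


class Split:
    def __init__(self, attribute, remaining, default):
        self.attribute, self.remaining, self.default = attribute, remaining, default
        self.children = {}                     # attribute value -> subtree

    def child(self, value):
        if value not in self.children:         # branches are created lazily
            self.children[value] = Leaf(self.remaining, self.default)
        return self.children[value]


class HoeffdingTree:
    def __init__(self, attributes, n_classes, delta=1e-6):
        self.root = Leaf(attributes)
        self.delta = delta
        self.value_range = math.log2(n_classes)   # range R of information gain

    def _sort(self, x):
        parent, node = None, self.root
        while isinstance(node, Split):
            parent, node = node, node.child(x[node.attribute])
        return parent, node

    def predict(self, x):
        return self._sort(x)[1].predict()

    def learn_one(self, x, y):
        parent, leaf = self._sort(x)
        leaf.update(x, y)
        if len(leaf.class_counts) < 2 or not leaf.attributes:
            return                                 # pure leaf or nothing left to test
        n = sum(leaf.class_counts.values())
        gains = [(leaf.info_gain(a), a) for a in leaf.attributes]
        gains.append((0.0, None))                  # null attribute: do not split
        gains.sort(key=lambda pair: pair[0], reverse=True)
        (g_a, x_a), (g_b, _) = gains[0], gains[1]
        eps = math.sqrt(self.value_range ** 2 * math.log(1 / self.delta) / (2 * n))
        if x_a is not None and g_a - g_b > eps:
            split = Split(x_a, leaf.attributes - {x_a}, leaf.predict())
            if parent is None:
                self.root = split
            else:
                parent.children[x[parent.attribute]] = split


# Toy stream: the class equals attribute "a"; "b" is noise.
rng = random.Random(0)
tree = HoeffdingTree(attributes=["a", "b"], n_classes=2)
for _ in range(500):
    x = {"a": rng.randrange(2), "b": rng.randrange(2)}
    tree.learn_one(x, x["a"])
print(type(tree.root).__name__, tree.predict({"a": 1, "b": 0}))
```

Each example is seen once and then discarded. Only the per-leaf counts are kept, which is why memory depends on the size of the tree and not on the length of the stream.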




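• $p$: leaf probability (assume this is constant).
• $HT_\delta$: tree produced by the Hoeffding tree algorithm with desired $\delta$, given an infinite sequence of examples $S$.
• $DT_*$: decision tree induced by choosing at each node the attribute with the true greatest $G$.
• $\Delta_i$: intensional disagreement between two decision trees:
  $\Delta_i(DT_1, DT_2) = \sum_x P(x)\, I[Path_1(x) \neq Path_2(x)]$
• $P(x)$: probability that the attribute vector $x$ will be observed.
• $I(\cdot)$: indicator function (1 if its argument is true, 0 otherwise).

THEOREM: $E[\Delta_i(HT_\delta, DT_*)] \leq \delta / p$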


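Suppose that the best and second-best attribute differ by 10% (i.e., $\epsilon / R = 0.1$).

According to the Hoeffding bound:

• $\delta = 0.1\%$ requires 380 examples.
• $\delta = 0.0001\%$ requires only 345 more examples.

An exponential improvement in $\delta$ can be obtained with a linear increase in the number of examples.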
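
The trade-off follows from solving the bound for the number of examples. A short derivation, assuming the standard form of the Hoeffding bound with $R$ the range of the heuristic:

```latex
\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
\quad\Longrightarrow\quad
n = \frac{\ln(1/\delta)}{2\,(\epsilon/R)^2}
```

With $\epsilon / R = 0.1$ this gives $n = 50 \ln(1/\delta)$, so $n$ grows only with the logarithm of $1/\delta$. For instance, dividing $\delta$ by 1000 adds $50 \ln 1000 \approx 345$ examples, however small $\delta$ already is.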






The VFDT System

Very Fast Decision Tree learner (VFDT):

• A decision tree learning system based on the Hoeffding tree algorithm.
• Uses either information gain or the Gini index as the attribute evaluation measure.

Includes a number of refinements to the Hoeffding tree algorithm:

• Ties
• $G$ computation
• Memory
• Poor attributes
• Initialization
• Rescans



The VFDT System (cont.)

Ties

Two or more attributes have very similar $G$'s.

Potentially many examples will be required to decide between them with high confidence.

It makes little difference which attribute is chosen.

If $\Delta\overline{G} < \epsilon < \tau$, where $\tau$ is a user-specified threshold: split on the current best attribute.
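
A sketch of the tie rule, assuming the condition from the VFDT paper (declare a tie when the observed gap is below ε and ε has itself fallen below a user-specified threshold τ). The function name `should_split` and the numeric values are illustrative.

```python
import math


def should_split(g_best, g_second, n, delta, tau, value_range=1.0):
    """Split on a clear winner, or on an effective tie once epsilon < tau."""
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    gap = g_best - g_second
    clear_winner = gap > eps
    tie = gap < eps < tau
    return clear_winner or tie


# Two nearly identical attributes: the plain bound would keep waiting,
# but the tie rule lets the node split once epsilon drops below tau.
print(should_split(0.301, 0.300, n=1000, delta=1e-7, tau=0.05))
print(should_split(0.301, 0.300, n=5000, delta=1e-7, tau=0.05))
```

Without τ, a node whose two best attributes are almost equally good could absorb examples indefinitely without ever growing.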




$G$ computation

The most significant part of the time cost per example is recomputing $G$.

Computing $G$ for every new example is inefficient.

$n_{min}$ new examples must be accumulated at a leaf before recomputing $G$.
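
The batching rule can be sketched as a small counter kept at each leaf. The class name `LeafCounter` and the value 200 for the minimum batch size are assumptions chosen for illustration.

```python
class LeafCounter:
    """Tracks when a leaf is due for another (costly) evaluation of G."""

    def __init__(self, n_min: int = 200):
        self.n_min = n_min
        self.seen = 0            # examples seen at this leaf
        self.last_checked = 0    # value of `seen` at the last G evaluation

    def add_example(self) -> bool:
        """Record one example; return True if G should be recomputed now."""
        self.seen += 1
        if self.seen - self.last_checked >= self.n_min:
            self.last_checked = self.seen
            return True
        return False


leaf = LeafCounter(n_min=200)
evaluations = sum(leaf.add_example() for _ in range(1000))
print(evaluations)  # 5 evaluations instead of 1000
```

The price is that a split may be detected up to one batch later than it strictly could have been.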




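Memory

VFDT's memory use is dominated by the memory required to keep counts for all growing leaves.

If the maximum available memory is reached, VFDT deactivates the least promising leaves.

The least promising leaves are considered to be the ones with the lowest values of $p_l e_l$, where $p_l$ is the probability that an arbitrary example falls into leaf $l$ and $e_l$ is the observed error rate at that leaf.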
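
A sketch of the deactivation step, assuming a leaf's promise is the product of the probability of reaching it and its observed error rate, as in the VFDT paper. The `LeafStats` record, the `max_active` budget, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LeafStats:
    name: str
    reach_prob: float   # p_l: estimated probability that an example reaches the leaf
    error_rate: float   # e_l: observed error rate at the leaf
    active: bool = True

    @property
    def promise(self) -> float:
        return self.reach_prob * self.error_rate


def deactivate_least_promising(leaves, max_active):
    """Keep only the `max_active` most promising leaves growing."""
    ranked = sorted(leaves, key=lambda leaf: leaf.promise, reverse=True)
    for rank, leaf in enumerate(ranked):
        leaf.active = rank < max_active
    return [leaf.name for leaf in ranked if not leaf.active]


leaves = [
    LeafStats("A", 0.50, 0.02),  # promise 0.010
    LeafStats("B", 0.10, 0.30),  # promise 0.030
    LeafStats("C", 0.40, 0.01),  # promise 0.004
]
print(deactivate_least_promising(leaves, max_active=2))  # ['C']
```

Leaf C is reached often but is already accurate, so refining it further would gain little. It is the first one whose counts are dropped.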




Poor attributes

VFDT's memory usage is also minimized by dropping early on attributes that do not look promising.

As soon as the difference between an attribute's $G$ and the best one's becomes greater than $\epsilon$, the attribute can be dropped.

The memory used to store the corresponding counts can be freed.
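
The pruning rule can be sketched as a filter over the current per-attribute scores. It assumes the same Hoeffding ε used for the split test. The function name and the attribute names are made up for illustration.

```python
import math


def prune_poor_attributes(gains, n, delta, value_range=1.0):
    """Return the attributes worth keeping counts for at this leaf.

    `gains` maps attribute name -> observed G after `n` examples.
    """
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    best = max(gains.values())
    return {a for a, g in gains.items() if best - g <= eps}


gains = {"age": 0.40, "income": 0.35, "zip": 0.02}
print(sorted(prune_poor_attributes(gains, n=100, delta=1e-7)))
print(sorted(prune_poor_attributes(gains, n=5000, delta=1e-7)))
```

As more examples arrive and ε shrinks, the set of surviving attributes narrows toward the eventual split attribute.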




Initialization

VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.

The tree can either be input as it is, or over-pruned.

Gives VFDT a “head start”.




Rescans

VFDT can rescan previously-seen examples.

This can be activated if:

The data arrives slowly enough that there is time for it.

The dataset is finite and small enough that it is feasible.






Synthetic Data Study

VFDT was compared with C4.5 release 8.

Both systems were restricted to using the same amount of RAM.

VFDT used information gain as the $G$ function.

14 concepts were used, all with 2 classes and 100 attributes.

The concepts were random decision trees. For each level after the first 3:

• A fraction of the nodes was replaced by leaves.
• The rest became splits on a random attribute.
• At a depth of 18, all the nodes were replaced with leaves.
• Each leaf was randomly assigned a class.

A stream of training examples was then generated by:

• Sampling uniformly from the instance space.
• Assigning classes according to the target tree.
• Adding various levels of class and attribute noise.
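
The generation procedure can be sketched as follows. This is an illustrative reading of the description above, not the authors' generator. It assumes binary attributes, a leaf fraction of 0.15, and noise modeled as independent flips of each attribute value and of the class. All names and default values are hypothetical.

```python
import random


def random_tree(rng, unused, depth=0, leaf_fraction=0.15, max_depth=18, min_split_levels=3):
    """Build a random target tree over binary attributes.

    A node is a class label (int) or a tuple (attribute, subtree_for_0, subtree_for_1).
    """
    at_limit = depth >= max_depth or not unused
    becomes_leaf = depth >= min_split_levels and rng.random() < leaf_fraction
    if at_limit or becomes_leaf:
        return rng.randrange(2)                       # random class at each leaf
    attr = rng.choice(sorted(unused))                 # attribute not yet used on this path
    rest = unused - {attr}
    return (attr,
            random_tree(rng, rest, depth + 1, leaf_fraction, max_depth, min_split_levels),
            random_tree(rng, rest, depth + 1, leaf_fraction, max_depth, min_split_levels))


def classify(tree, x):
    """Follow the target tree down to a leaf and return its class."""
    while not isinstance(tree, int):
        attr, low, high = tree
        tree = high if x[attr] else low
    return tree


def example_stream(tree, n_attributes, noise=0.0, seed=0):
    """Yield (x, y) pairs forever; `noise` flips each attribute and the class independently."""
    rng = random.Random(seed)
    while True:
        x = [rng.randrange(2) for _ in range(n_attributes)]   # uniform over instance space
        y = classify(tree, x)                                  # label from the target tree
        if noise:
            x = [v ^ 1 if rng.random() < noise else v for v in x]
            y = y ^ 1 if rng.random() < noise else y
        yield x, y


rng = random.Random(42)
target = random_tree(rng, frozenset(range(100)))
stream = example_stream(target, n_attributes=100, noise=0.05)
x, y = next(stream)
print(len(x), y in (0, 1))
```

Because the stream is a generator, a learner can draw as many examples as it needs without the dataset ever being materialized.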




Synthetic Data Study (cont.)

Figure: Accuracy as a function of the number of training examples.



Synthetic Data Study (cont.)

Figure: Tree size as a function of the number of training examples.



Synthetic Data Study (cont.)

Figure: Accuracy as a function of the noise level, 4 runs on the same concept (C4.5: 100k examples, VFDT: 20 million examples).



Lesion Study

Effect of initializing VFDT with C4.5 with and without over-pruning.



Web Data

Applying VFDT to mining the stream of Web page requests from the whole University of Washington main campus.

To mine 1.6 million examples:

• VFDT took 1540 seconds to do one pass over the training data.
• 983 of those seconds were spent reading data from disk.
• C4.5 took 24 hours to mine the 1.6 million examples.



Web Data (cont.)

Figure: Performance on Web data.






Conclusion

Hoeffding trees:

• A method for learning online.
• Learn from high-volume data streams.
• Allow learning in very small constant time per example.
• Guarantee high similarity to the corresponding batch trees.

VFDT system:

• A high-performance data mining system.
• Based on Hoeffding trees.
• Effective in taking advantage of massive numbers of examples.






Qs & As

Name four requirements that algorithms must meet to overcome the limitations of the currently available disk-based algorithms.

• Operate continuously and indefinitely.
• Incorporate examples as they arrive.
• Never lose potentially valuable information.
• Build a model using at most one scan of the data.
• Use only a fixed amount of main memory.
• Require small constant time per record.
• Make a usable model available at any point in time.
• Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
• When the data-generating process changes over time, keep the model up to date at any time.



Qs & As

What are the benefits of considering a subset of training examples to find the best attribute?

Suitable for extremely large datasets.

Read each example at most once.

Directly mine online data sources.

Build complex trees with acceptable computational cost.



Qs & As

How does VFDT's tie refinement to the Hoeffding tree algorithm work?

Two or more attributes have very similar $G$'s.

Potentially many examples will be required to decide between them with high confidence.

It makes little difference which attribute is chosen.

If $\Delta\overline{G} < \epsilon < \tau$, where $\tau$ is a user-specified threshold: split on the current best attribute.

