Mining High-Speed Data Streams
Pedro Domingos, Geoff Hulten
Sixth ACM SIGKDD International Conference, 2000
Presented by: Afsoon Yousefi


Outline

• Introduction
• Hoeffding Trees
• The VFDT System
• Performance Study
• Conclusion
• Qs & As

Introduction

In today’s information society, extracting knowledge is becoming a very important task for many people. We live in an age of knowledge revolution.

Many organizations have more than very large databases: they have databases that grow at a rate of several million records per day. This brings both:
• Opportunities
• Challenges

Main limited resources in knowledge discovery systems:
• Time
• Memory
• Sample size

Introduction (cont.)

Traditional systems:
• Only a small amount of data is available.
• They use a fraction of the available computational power.

Current systems:
• The bottleneck is time and memory.
• They use a fraction of the available samples of data.
• They try to mine databases that don’t fit in main memory.

Available algorithms:
• Some are efficient, but do not guarantee a learned model similar to the one obtained in batch mode.
  o They never recover from an unfavorable set of early examples.
  o They are sensitive to example ordering.
• Others produce the same model as the batch version, but not efficiently.
  o They are slower than the batch algorithm.

Introduction (cont.)

Requirements of algorithms to overcome these problems:
• Operate continuously and indefinitely.
• Incorporate examples as they arrive.
• Never lose potentially valuable information.
• Build a model using at most one scan of the data.
• Use only a fixed amount of main memory.
• Require small constant time per record.
• Make a usable model available at any point in time.
• Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
• When the data-generating process changes over time, keep the model up to date at any time.

Introduction (cont.)

Such requirements are fulfilled by:
• Incremental learning methods
• Online methods
• Successive methods
• Sequential methods

Hoeffding Trees

Classic decision tree learners:
• CART, ID3, C4.5
• All examples must be in main memory simultaneously.

Disk-based decision tree learners:
• SLIQ, SPRINT
• Examples are stored on disk.
• Expensive for learning complex trees or for very large datasets.

Consider a subset of the training examples to find the best attribute:
• Suitable for extremely large datasets.
• Reads each example at most once.
• Directly mines online data sources.
• Builds complex trees with acceptable computational cost.

Hoeffding Trees (cont.)

Given a set of examples of the form $(x, y)$:
• $N$: number of examples
• $y$: discrete class label
• $x$: a vector of $d$ attributes (symbolic or numeric)

Goal: produce $y = f(x)$, a model that will predict the classes of future examples with high accuracy.

Hoeffding Trees (cont.)

Given a stream of examples:
• Use the first ones to choose the root test.
• Pass succeeding ones to the corresponding leaves.
• Pick the best attributes there.
• ... and so on recursively.

How many examples are necessary at each node?
• Hoeffding bound (also known as the additive Chernoff bound)
• A statistical result

Hoeffding Trees (cont.)

Hoeffding bound:
• $G(X_i)$: heuristic measure used to choose test attributes
  o C4.5: information gain
  o CART: Gini index
  o Assume $G$ is to be maximized.
• $\bar{G}(X_i)$: heuristic measure after seeing $n$ examples
• $X_a$: attribute with the highest observed $\bar{G}$
• $X_b$: second-best attribute
• $\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b) \ge 0$: difference between $\bar{G}(X_a)$ and $\bar{G}(X_b)$
• $\delta$: probability of choosing the wrong attribute

The Hoeffding bound guarantees that $X_a$ is the correct choice with probability $1 - \delta$ if $n$ examples have been seen at this node and $\Delta\bar{G} > \epsilon$, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

and $R$ is the range of $G$.

Hoeffding Trees (cont.)

Hoeffding bound:
• If $\Delta\bar{G} > \epsilon$, then $X_a$ is the best attribute with probability $1 - \delta$.
• A node needs to accumulate examples from the stream until $\epsilon$ becomes smaller than $\Delta\bar{G}$.
• The bound is independent of the probability distribution generating the observations.
• It is more conservative than distribution-dependent bounds.
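
As a rough illustration of this stopping rule, the sketch below computes the Hoeffding bound $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$ and checks whether the observed gap between the two best attributes exceeds it. The function names `hoeffding_bound` and `can_split`, and the choice of range $R = 1$ in the usage lines, are assumptions made for this example and do not come from the slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n
    independent observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(g_best: float, g_second: float, value_range: float,
              delta: float, n: int) -> bool:
    """True when the observed gap between the best and second-best attribute
    is larger than the Hoeffding bound."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)


# Illustrative values only: range 1 and delta = 1e-6.
eps_small_n = hoeffding_bound(1.0, 1e-6, 100)
eps_large_n = hoeffding_bound(1.0, 1e-6, 10_000)
```

Because $\epsilon$ shrinks with $1/\sqrt{n}$, a gap that is too small to act on after 100 examples can become decisive after 10,000.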

Hoeffding Tree algorithm

Inputs:
• $S$: a sequence of examples.
• $X$: a set of discrete attributes.
• $G(\cdot)$: a split evaluation function.
• $\delta$: desired probability of choosing the wrong attribute at any given node.

Output:
• $HT$: a decision tree.

Hoeffding Tree algorithm (cont.)

Procedure HoeffdingTree($S$, $X$, $G$, $\delta$)

Let $HT$ be a tree with a single leaf $l_1$ (the root).
Let $X_1 = X \cup \{X_\emptyset\}$.
Let $\bar{G}_1(X_\emptyset)$ be the $\bar{G}$ obtained by predicting the most frequent class in $S$.
For each class $y_k$
  For each value $x_{ij}$ of each attribute $X_i \in X$
    Let $n_{ijk}(l_1) = 0$.

Hoeffding Tree algorithm (cont.)

For each example $(x, y_k)$ in $S$
• Sort $(x, y)$ into a leaf $l$ using $HT$.
• For each $x_{ij}$ in $x$ such that $X_i \in X_l$
  o Increment $n_{ijk}(l)$.
• Label $l$ with the majority class among the examples seen at $l$.
• Compute $\bar{G}_l(X_i)$ for each attribute $X_i \in X_l - \{X_\emptyset\}$.
• Let $X_a$ be the attribute with the highest $\bar{G}_l$.
• Let $X_b$ be the attribute with the second-highest $\bar{G}_l$.
• Compute $\epsilon$.
• If $\bar{G}_l(X_a) - \bar{G}_l(X_b) > \epsilon$ and $X_a \neq X_\emptyset$, then
  o Replace $l$ by an internal node that splits on $X_a$.
  o For each branch of the split
    - Add a new leaf $l_m$, and let $X_m = X - \{X_a\}$.
    - Let $\bar{G}_m(X_\emptyset)$ be the $\bar{G}$ obtained by predicting the most frequent class at $l_m$.
    - For each class $y_k$ and each value $x_{ij}$ of each attribute $X_i \in X_m - \{X_\emptyset\}$
      . Let $n_{ijk}(l_m) = 0$.
Return $HT$.
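
The procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes information gain as $G$, examples given as dictionaries of discrete attribute values, and child leaves created lazily when the first example reaches them. All class and method names (`Node`, `HoeffdingTree`, `learn_one`, `predict`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(class_counts):
    """Entropy in bits of a Counter mapping class label -> count."""
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)


class Node:
    """A leaf until `split_attr` is set; afterwards an internal node."""

    def __init__(self, attrs):
        self.attrs = set(attrs)            # attributes still available here
        self.class_counts = Counter()      # examples seen per class
        # counts[attr][value][label] plays the role of the n_ijk counters
        self.counts = defaultdict(lambda: defaultdict(Counter))
        self.split_attr = None
        self.children = {}

    def info_gain(self, attr):
        n = sum(self.class_counts.values())
        remainder = sum(sum(cc.values()) / n * entropy(cc)
                        for cc in self.counts[attr].values())
        return entropy(self.class_counts) - remainder


class HoeffdingTree:
    def __init__(self, attrs, n_classes, delta=1e-6):
        self.root = Node(attrs)
        self.delta = delta
        self.value_range = math.log2(n_classes)  # range of information gain

    def _leaf_for(self, x):
        node = self.root
        while node.split_attr is not None:
            value = x[node.split_attr]
            if value not in node.children:       # first example on this branch
                node.children[value] = Node(node.attrs - {node.split_attr})
            node = node.children[value]
        return node

    def predict(self, x):
        node = self.root
        while node.split_attr is not None and x[node.split_attr] in node.children:
            node = node.children[x[node.split_attr]]
        counts = node.class_counts
        return counts.most_common(1)[0][0] if counts else None

    def learn_one(self, x, y):
        leaf = self._leaf_for(x)
        leaf.class_counts[y] += 1
        for attr in leaf.attrs:
            leaf.counts[attr][x[attr]][y] += 1
        if len(leaf.class_counts) < 2 or not leaf.attrs:
            return                               # pure leaf, or nothing to split on
        n = sum(leaf.class_counts.values())
        gains = sorted(((leaf.info_gain(a), a) for a in leaf.attrs),
                       key=lambda pair: pair[0], reverse=True)
        best_gain, best_attr = gains[0]
        # Assumption: the "no split" option has gain 0 and acts as runner-up
        # when only one real attribute is left.
        second_gain = gains[1][0] if len(gains) > 1 else 0.0
        eps = math.sqrt(self.value_range ** 2 * math.log(1 / self.delta) / (2 * n))
        if best_gain - second_gain > eps:
            leaf.split_attr = best_attr          # the leaf becomes an internal node


tree = HoeffdingTree(attrs=["a", "b"], n_classes=2)
for i in range(200):
    x = {"a": i % 2, "b": (i // 2) % 2}
    tree.learn_one(x, y=x["a"])                  # the class simply copies "a"
```

In this toy stream the class is fully determined by attribute `a`, so the root splits on `a` as soon as the bound allows it. None of the VFDT refinements (ties, grace period, memory management) are included here.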

Hoeffding Trees (cont.)

• $p$: leaf probability (assume this is constant).
• $HT_\delta$: tree produced by the Hoeffding tree algorithm with desired $\delta$, given an infinite sequence of examples $S$.
• $DT_*$: decision tree induced by choosing at each node the attribute with the true greatest $G$.
• $\Delta_i$: intensional disagreement between two decision trees:

$$\Delta_i(DT_1, DT_2) = \sum_x P(x)\, I[\mathrm{Path}_1(x) \neq \mathrm{Path}_2(x)]$$

• $P(x)$: probability that the attribute vector $x$ will be observed.
• $I(\cdot)$: indicator function (1 if its argument is true, 0 otherwise).

THEOREM: $E[\Delta_i(HT_\delta, DT_*)] \le \delta / p$

Hoeffding Trees (cont.)

Suppose that the best and second-best attributes differ by 10% (that is, $\epsilon / R = 0.1$).

According to the Hoeffding bound:
• $\delta = 0.1\%$ requires 380 examples.
• $\delta = 0.0001\%$ requires only 345 more examples.

An exponential improvement in $\delta$ can be obtained with a linear increase in the number of examples.
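
This trade-off can be checked numerically by solving the bound for the number of examples, $n = \ln(1/\delta) / (2 (\epsilon/R)^2)$. The sketch below assumes this one-sided form of the bound and the hypothetical helper name `examples_needed`. Under that assumption the first figure comes out at about 346 rather than 380 (a two-sided variant with $\ln(2/\delta)$ gives about 380), while the increment of about 345 examples is the same in both cases.

```python
import math


def examples_needed(gap_over_range: float, delta: float) -> int:
    """Smallest n for which epsilon / R drops to the given gap, using
    n = ln(1 / delta) / (2 * (epsilon / R)^2)."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * gap_over_range ** 2))


n_first = examples_needed(0.1, 1e-3)    # delta = 0.1%
n_second = examples_needed(0.1, 1e-6)   # delta = 0.0001%
extra = n_second - n_first              # additional examples for the tighter delta
```

Tightening $\delta$ by a factor of 1000 only adds a constant number of examples, because $n$ grows with $\ln(1/\delta)$.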

The VFDT System

Very Fast Decision Tree learner (VFDT):
• A decision tree learning system.
• Based on the Hoeffding tree algorithm.
• Uses either information gain or the Gini index as the attribute evaluation measure.
• Includes a number of refinements to the Hoeffding tree algorithm:
  o Ties
  o $\bar{G}$ computation
  o Memory
  o Poor attributes
  o Initialization
  o Rescans

The VFDT System (cont.)

Ties
• Two or more attributes may have very similar $\bar{G}$’s.
• Potentially many examples will be required to decide between them with high confidence.
• In this case it makes little difference which attribute is chosen.
• If $\Delta\bar{G} < \epsilon < \tau$, where $\tau$ is a user-specified threshold: split on the current best attribute.
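
A minimal sketch of the tie rule, assuming a user-specified threshold `tau` and the Hoeffding bound $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$. The function name `split_decision` and its parameters are illustrative, not VFDT's actual interface.

```python
import math


def split_decision(g_best, g_second, n, delta, tau, value_range=1.0):
    """Return True if the leaf should split on the current best attribute."""
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    gap = g_best - g_second
    if gap > eps:      # clear winner under the Hoeffding bound
        return True
    if eps < tau:      # tie: the bound is already tighter than tau
        return True
    return False


# Two nearly identical attributes: no split at n = 100, split at n = 10000.
early = split_decision(0.300, 0.299, n=100, delta=1e-6, tau=0.05)
late = split_decision(0.300, 0.299, n=10_000, delta=1e-6, tau=0.05)
```

Without the `tau` check, the leaf in this example would keep waiting for a gap of 0.001 to exceed $\epsilon$, which would need far more examples than the choice is worth.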

The VFDT System (cont.)

$\bar{G}$ computation
• The most significant part of the time cost per example is recomputing $\bar{G}$.
• Computing $\bar{G}$ for every new example is inefficient.
• $n_{min}$ new examples must be accumulated at a leaf before recomputing $\bar{G}$.
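
One way to picture this refinement is a per-leaf counter that signals when enough new examples have arrived to justify re-evaluating the split measure. The class name `GracePeriod` and the value `n_min=200` are assumptions for this sketch.

```python
class GracePeriod:
    """Tracks when a leaf has seen enough new examples to re-evaluate splits."""

    def __init__(self, n_min: int):
        self.n_min = n_min
        self.seen_since_last_check = 0

    def observe(self) -> bool:
        """Record one example; return True when G should be recomputed."""
        self.seen_since_last_check += 1
        if self.seen_since_last_check >= self.n_min:
            self.seen_since_last_check = 0
            return True
        return False


grace = GracePeriod(n_min=200)
checks = sum(grace.observe() for _ in range(1000))  # number of recomputations
```

With these values, 1000 examples trigger only 5 recomputations instead of 1000, at the cost of possibly delaying a split by up to `n_min` examples.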

The VFDT System (cont.)

Memory
• VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves.
• If the maximum available memory is reached, VFDT deactivates the least promising leaves.
• The least promising leaves are considered to be the ones with the lowest values of $p_l e_l$, where $p_l$ is the probability that an arbitrary example will fall into leaf $l$ and $e_l$ is the observed error rate at that leaf.
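
The ranking step can be sketched as follows, assuming each leaf carries an estimated reach probability `p` and an observed error rate `e`, and that leaves are scored by their product. The `LeafStats` record, the `max_active` budget, and the numbers in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LeafStats:
    name: str
    p: float              # estimated probability that an example reaches the leaf
    e: float              # observed error rate at the leaf
    active: bool = True


def deactivate_least_promising(leaves, max_active):
    """Keep the max_active leaves with the highest p * e; deactivate the rest."""
    ranked = sorted(leaves, key=lambda leaf: leaf.p * leaf.e, reverse=True)
    for rank, leaf in enumerate(ranked):
        leaf.active = rank < max_active
    return [leaf for leaf in ranked if leaf.active]


leaves = [LeafStats("l1", 0.50, 0.02),   # often reached, but already accurate
          LeafStats("l2", 0.10, 0.40),   # rarely reached, high error
          LeafStats("l3", 0.40, 0.20)]   # often reached and still error-prone
kept = deactivate_least_promising(leaves, max_active=2)
```

The product `p * e` estimates how much overall error could still be removed by refining a leaf, so `l1` is the first to be deactivated despite receiving the most examples.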

The VFDT System (cont.)

Poor attributes
• VFDT’s memory usage is also minimized by dropping early on attributes that do not look promising.
• As soon as the difference between an attribute’s $\bar{G}$ and the best one’s becomes greater than $\epsilon$, the attribute can be dropped.
• The memory used to store the corresponding counts can then be freed.
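
A small sketch of this pruning rule, assuming the observed measures are held in a dictionary and that the threshold is the Hoeffding bound $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$. The function name `prune_poor_attributes` and the example values are illustrative.

```python
import math


def prune_poor_attributes(gains, n, delta, value_range=1.0):
    """Return the attributes worth keeping: those whose observed G is within
    epsilon of the best one. `gains` maps attribute name -> observed G."""
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
    best = max(gains.values())
    return {attr: g for attr, g in gains.items() if best - g <= eps}


gains = {"a": 0.40, "b": 0.38, "c": 0.05}
kept = prune_poor_attributes(gains, n=1000, delta=1e-6)
```

Here `c` trails the best attribute by far more than $\epsilon$ and is dropped, while `b` stays because it could still turn out to be the better choice.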

The VFDT System (cont.)

Initialization
• VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
• The tree can either be input as it is, or over-pruned.
• This gives VFDT a “head start”.

The VFDT System (cont.)

Rescans
• VFDT can rescan previously seen examples.
• Rescanning can be activated if:
  o The data arrives slowly enough that there is time for it.
  o The dataset is finite and small enough that it is feasible.

Synthetic Data Study

• VFDT was compared with C4.5 release 8.
• Both systems were restricted to using the same amount of RAM.
• VFDT used information gain as the $G$ function.
• 14 concepts were used, all with 2 classes and 100 attributes:
  o For each level after the first 3, a fraction of the nodes was replaced by leaves.
  o The rest became splits on a random attribute.
  o At a depth of 18, all the nodes were replaced with leaves.
  o Each leaf was randomly assigned a class.
• A stream of training examples was then generated by:
  o Sampling uniformly from the instance space.
  o Assigning classes according to the target tree.
  o Adding various levels of class and attribute noise.
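
The generation procedure can be sketched roughly as below. Several details are assumptions made for this example rather than facts from the slides: attributes are binary, an attribute is not reused along a path, `leaf_fraction=0.15` is an arbitrary value, and noise is modeled by re-drawing a value uniformly at random with probability `noise`. The function names are hypothetical.

```python
import random


def random_tree(attrs, depth, rng, leaf_fraction=0.15, min_depth=3, max_depth=18):
    """Build a random target tree over binary attributes (nested dicts)."""
    force_leaf = depth >= max_depth or not attrs
    if force_leaf or (depth >= min_depth and rng.random() < leaf_fraction):
        return {"label": rng.randrange(2)}           # leaf with a random class
    attr = rng.choice(attrs)
    rest = [a for a in attrs if a != attr]           # assumption: no reuse on a path
    return {"attr": attr,
            "children": [random_tree(rest, depth + 1, rng, leaf_fraction,
                                     min_depth, max_depth) for _ in range(2)]}


def classify(tree, x):
    """Follow the splits of the target tree down to a leaf label."""
    while "label" not in tree:
        tree = tree["children"][x[tree["attr"]]]
    return tree["label"]


def example_stream(tree, n_attrs, noise, rng):
    """Yield (x, y) pairs sampled uniformly from the instance space; each
    attribute value and the class are re-drawn at random with probability
    `noise`."""
    while True:
        x = [rng.randrange(2) for _ in range(n_attrs)]
        y = classify(tree, x)
        x = [rng.randrange(2) if rng.random() < noise else v for v in x]
        if rng.random() < noise:
            y = rng.randrange(2)
        yield x, y


# Small configuration so the example runs quickly.
rng = random.Random(0)
target = random_tree(list(range(10)), 0, rng, max_depth=6)
stream = example_stream(target, n_attrs=10, noise=0.0, rng=rng)
sample = [next(stream) for _ in range(100)]
```

Because the stream is a generator, a learner can consume as many examples as it needs without the dataset ever being materialized in memory.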

Synthetic Data Study (cont.)

Accuracy as a function of the number of training examples.

Synthetic Data Study (cont.)

Tree size as a function of the number of training examples.

Synthetic Data Study (cont.)

Accuracy as a function of the noise level. 4 runs on the same concept (C4.5: 100k examples, VFDT: 20 million examples).

Lesion Study

Effect of initializing VFDT with C4.5, with and without over-pruning.

Web Data

Applying VFDT to mining the stream of Web page requests from the whole University of Washington main campus.

To mine 1.6 million examples:
• VFDT took 1540 seconds to do one pass over the training data.
  o 983 seconds were spent reading data from disk.
• C4.5 took 24 hours to mine 1.6 million examples.

Web Data (cont.)

Performance on Web data

Conclusion

Hoeffding trees:
• A method for learning online.
• Learn from high-volume data streams.
• Allow learning in very small constant time per example.
• Guarantee high similarity to the corresponding batch trees.

VFDT system:
• A high-performance data mining system.
• Based on Hoeffding trees.
• Effective in taking advantage of massive numbers of examples.

Qs & As

Name four requirements that algorithms must meet to overcome the problems of currently available disk-based algorithms.
• Operate continuously and indefinitely.
• Incorporate examples as they arrive.
• Never lose potentially valuable information.
• Build a model using at most one scan of the data.
• Use only a fixed amount of main memory.
• Require small constant time per record.
• Make a usable model available at any point in time.
• Produce a model equivalent to the one obtained by an ordinary database mining algorithm.
• When the data-generating process changes over time, keep the model up to date at any time.

Qs & As

What are the benefits of considering a subset of training examples to find the best attribute?
• Suitable for extremely large datasets.
• Reads each example at most once.
• Directly mines online data sources.
• Builds complex trees with acceptable computational cost.

Qs & As

How does VFDT’s tie refinement to the Hoeffding tree algorithm work?
• Two or more attributes may have very similar $\bar{G}$’s.
• Potentially many examples will be required to decide between them with high confidence.
• In this case it makes little difference which attribute is chosen.
• If $\Delta\bar{G} < \epsilon < \tau$, where $\tau$ is a user-specified threshold: split on the current best attribute.