18
[email protected] 1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Le e Department of Computer Science National Chiao-Tung University Hsinchu, Taiwan 300, R.O.C. http://www.csie.nctu.edu.tw/~hfli/ *: corresponding author

[email protected] Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

Embed Size (px)

Citation preview

Page 1: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 1

Online Mining of Frequent Query Trees over XML

Data Streams

Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee

Department of Computer ScienceNational Chiao-Tung University

Hsinchu, Taiwan 300, R.O.C.http://www.csie.nctu.edu.tw/~hfli/

*: corresponding author

Page 2: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 2

Outline

Introduction Mining of Data Streams, Tree Mining

Problem Definition Online Mining of Frequent Query Trees over

XML Data Streams The Proposed Algorithm

FQT-Stream (Frequent Query Trees of Streams)

Conclusions and Future Work

Page 3: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 3

Mining of Data Streams: Motivations Many Applications generate data streams

Day to day business (credit card, ATM transactions, etc)

Hot Web services (XML data, record and click streams) Telecommunication (call records) Financial market (stock exchange) Surveillance (sensor network, audio/video) System management (network events)

Application characteristics Massive volumes of data (several terabytes) Records arrive at a rapid rate Data distribution changes on the fly

What do we want to get from data streams ? Real time query answering, Statistics, and Pattern

discovery

Page 4: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 4

Mining of Data Streams: Computation Model

Requirements of Mining Data Streams

Single pass: each record is examined at most once Bounded storage: Limited Memory for storing

synopsis Real-time: Per record processing time (to maintain

synopsis) must be low

StreamMining

Processor

Synopsisin Memory

Buffer(Approximate)

Results

Data Streams

Page 5: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 5

Problem Definition of Frequent Query Tree Mining (1/2)

XML Query Tree Stream (XQTS) A sequence of query trees (QTs) QT1, QT2, …, QTN

N is tree id the latest incoming query tree Support of a Query Tree QTi

sup(QTi): the number of QTs in XQTS containing QTi as a subtree

Page 6: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 6

Problem Definition of Frequent Query Tree Mining (2/2)

A QTi is a Frequent Query Tree (FQT) if and only if sup(QTi) sN

s is a user-defined minimum support threshold in the range of [0, 1]

Our Task To mine the set of all frequent query tree

s (FQTs) by one scan of the XQTS Using as smaller memory as possible

Page 7: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 7

Proposed Algorithm FQT-Stream (Frequent Query Trees of Streams)

FQT-Stream consists of 5 phases 1. read a QT (Query Tree) from the buffer in the main me

mory 2. transform the QT into a new NQTS (Normalized Query T

ree Sequence) representation 3. construct a in-memory summary data structure called

FQT-forest (a forest of Frequent Query Trees) by projecting the NQTSs

4. prune the infrequent query trees from FQT-forest 5. find the set of all FQTs (Frequent Query Trees) from cur

rent FQT-forest Since phase 1 is straightforward,

We focus on phases 2-5

Page 8: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 8

Phase 2 of FQT-Stream: NQTS Transformation

NQTS Transformation of QT Using DFS on the QT

A sequence of triple (node-id, level, order) level: the level of the QT order: sequence order of the NQTS

For example (5-NQTS in Figure 1)

Page 9: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 9

Phase 3 of FQT-Stream: FQT-forest Construction (1/4)

For each NQTS, 2 steps are performed to construct the FQT-forest Step 1: enumerate each NQTS into a

set of sub-sequences using Order-Break (OB) technique

OB is a level-wise method

Page 10: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 10

Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (2/4)

For example, a 5-NQTS = <(A, 0, 1), (B, 1, 2), (D, 2, 3), (E, 2, 4), (C, 1, 5)> First, the 5-NQTS is broken into three 4-

NQTSs <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> <(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)> <(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)>

These sequences are 1-OB (One Order Break) 1-OB sequences have one order break in the

sequence order The original 5-NQTS is called 0-OB

Page 11: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 11

Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (3/4) After delete the duplicates

Three 4-NQTSs Two 3-NQTSs with One Order Break

Two 3-NQTSs One 2-NQTS <(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1), (B, 1, 2),

(C, 1, 5)><(A, 0, 1), (C, 1, 5)> Finally, the set of 1-OB contains 8 NQT

Ss

Page 12: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 12

Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (4/4) Set of 2-OB is generated from the set of

1-OB For example

2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is generated from 1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>

Repeat this process until no candidate k-OB

Property 1 The maximum size of order break is k-3, i.e.,

(k-3)-OB, if the query tree has k nodes

Page 13: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 13

Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (1/3) The OBs (0-OB, 1-OB, 2-OB) are projec

ted and inserted into a FQT-forest using Incremental Projection (IP) technique A NQTS, <X1X2…Xi>, with i nodes is projec

ted into i sub-NQTSs (also called node-suffix NQTSs)

<Xi>, <XiXi-1>, …, <X2>, <X1> We use one field node-id to represent the fields (n

ode-id, level, order) for simplicity

Page 14: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 14

Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (2/3) Example of IP

1-OB: <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> is projected into 4 node-suffix NQTSs as follows

<(C, 1, 5)> <(E, 2, 4), (C, 1, 5)> <(D, 2, 3), (E, 2, 4), (C, 1, 5)> <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>

After projection, a tree structure checking is preformed

If the level of the first node in a node-suffix NQTS is not the smallest level

the node-suffix NQTS is deleted

Page 15: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 15

Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (3/3) After tree structure checking

The node-suffix NQTSs are inserted into FQT-forest Update the corresponding nodes’ supports

FQT-forest consists of 2 parts FN-list

A list of Frequent Nodes Each node Xi in FN-list has a NQTS-tree (Xi.NQTS-tree)

NQTS-trees (trees of Normalized Query Tree Sequences) A sequence (NQTS) is represented by a path And its appearance frequent is maintained in the last of nod

e of the path

Page 16: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 16

Phase 4 of FQT-Stream: Infrequent Information Pruning In order to guarantee the limited spac

e requirement Pruning Infrequent Information

Pruning steps Check each node Xi in the FN-list of FQT-f

orest If its sup(Xi) < sN delete Xi and its NQTS-tre

e Check other NQTS-trees to prune these infre

quent nodes

Page 17: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 17

Phase 4 of FQT-Stream: Frequent Query Tree Mining Assume that there are k frequent nodes, <X1,

X2, …, Xk>, in the FN-list FQT-Stream traverses the Xi.NQTS-tree (i, i = 1,

2, …, k) to find the sequences with prefix Xi whose estimated support is greater than or equal to sN in a DFS manner

These frequent query trees are stored into a temporal list, called FQT-List

Page 18: Hfli@csie.nctu.edu.tw1 Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science

[email protected] 18

Conclusions and Future Work We propose an efficient one-pass

algorithm FQT-Stream (Frequent Query Trees of Streams) To find the set of all frequent query

trees over the entire history of online XML data streams

Future Work Online Mining of Frequent Query

Trees over Sliding Windows