56
Statistical Approaches to Mining Multivariate Data Streams Eric Vance, Duke University Department of Statistical Science David Banks, Duke University Tamraparni Dasu, AT&T Labs - Research JSM July 31, 2007 Salt Lake City, Utah

Statistical Approaches to Mining Multivariate Data Streams

  • Upload
    tudor

  • View
    46

  • Download
    0

Embed Size (px)

DESCRIPTION

Statistical Approaches to Mining Multivariate Data Streams. Eric Vance, Duke University Department of Statistical Science David Banks, Duke University Tamraparni Dasu, AT&T Labs - Research. JSM July 31, 2007 Salt Lake City, Utah. Data Streams. Huge amounts of complex data - PowerPoint PPT Presentation

Citation preview

Page 1: Statistical Approaches to Mining Multivariate Data Streams

Statistical Approaches to Mining Multivariate Data Streams

Eric Vance, Duke University Department of Statistical Science

David Banks, Duke University

Tamraparni Dasu, AT&T Labs - Research

JSM July 31, 2007Salt Lake City, Utah

Page 2: Statistical Approaches to Mining Multivariate Data Streams

Data Streams

• Huge amounts of complex data

• Rapid rate of accumulation

• One-time access to raw data

Page 3: Statistical Approaches to Mining Multivariate Data Streams

Change Detection• Problem: Change detection in complicated data

streams

• Three criteria– Nonparametric– Fast– Statistical guarantees

Page 4: Statistical Approaches to Mining Multivariate Data Streams

E-Commerce Server Data• Data description

– One server in a network of servers– Data polled every 5 minutes– 27 variables– 5 week time period

• Quality issues– Many variables unimportant or unchanging– Missing data

Page 5: Statistical Approaches to Mining Multivariate Data Streams

E-Commerce Server Data• Variable Elimination

– Several variables remain constant (Total Swap) – Predictable and non-informative (Used SysDisk Space)– Correlated (CPU Used%, CPU User%)

• 6 Variables Selected– CPU Used% – Number of Procs – Number of Threads – Ping Latency – Used Swap – Log Packets = log(In Packets + Out Packets)

Page 6: Statistical Approaches to Mining Multivariate Data Streams

Daily BoxplotsCPU

Procs

Threads

Page 7: Statistical Approaches to Mining Multivariate Data Streams

Daily BoxplotsPing

Swap

Packets

Page 8: Statistical Approaches to Mining Multivariate Data Streams

Our Approach

• Partitioning scheme (Multivariate Histogram)

• Data depth: Rank each point in relation to Mahalanobis distance from center of data

• Data Pyramid: Determine which direction in the data is most extreme

• Profile based comparison– Identify changes in profiles over time (week to week)

Page 9: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Example in 2D• 5 “center-outward” depth layers

• 4 pyramidsA: depth 4 pyramid +y

B: depth 3 pyramid -x C: depth 1 pyramid +x

A

BC

Page 10: Statistical Approaches to Mining Multivariate Data Streams

Identify Depth and Direction1. Compute center of comparison Data Sphere

2. Calculate Mahalanobis distance for each point

3. In which of the 5 quantiles of depth is

4. Determine direction of greatest variation for

X

M i = (x i − X ′ ) S−1(x i − X )

x i

M i

x i

Page 11: Statistical Approaches to Mining Multivariate Data Streams

Data Partition in 6 Dimensions

Page 12: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

Page 13: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

CPU

Page 14: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

CPU

CP

U

Page 15: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Procs

Page 16: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Procs

Page 17: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Threads

Page 18: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Threads

Th

read

s

Page 19: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Threads

Th

read

s

Threads

Page 20: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Threads

Th

read

s

Threads

Th

read

s

Page 21: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Th

read

s

Ping

Page 22: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Th

read

s

Ping

Pin

g

Page 23: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Th

read

s

Pin

g

Swap

Page 24: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Th

read

s

Pin

g

Swap

Sw

ap

Page 25: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Th

read

s

Pin

g

Sw

ap

Packets

Page 26: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into BinsC

PU

Pro

cs

Th

read

s

Pin

g

Sw

ap

Packets

Pac

ket

s

Page 27: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Page 28: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Swap

Page 29: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Swap

Sw

ap

Page 30: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Swap

Sw

ap

Swap

Page 31: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Swap

Sw

ap

Page 32: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Threads

Page 33: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Threads

Th

read

s

Page 34: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Page 35: Statistical Approaches to Mining Multivariate Data Streams

Week 1 Results

Page 36: Statistical Approaches to Mining Multivariate Data Streams

Week 2 Results

Page 37: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Page 38: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Threads

Page 39: Statistical Approaches to Mining Multivariate Data Streams

Week 3 Results

Page 40: Statistical Approaches to Mining Multivariate Data Streams

Week 3 Results

Th

read

s

Page 41: Statistical Approaches to Mining Multivariate Data Streams

Week 4 Results

Page 42: Statistical Approaches to Mining Multivariate Data Streams

Week 4 Results

Pro

cs

Page 43: Statistical Approaches to Mining Multivariate Data Streams

Week 4 Results

Th

rea d

s

Page 44: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Page 45: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Packets

Page 46: Statistical Approaches to Mining Multivariate Data Streams

Week 5 Results

Page 47: Statistical Approaches to Mining Multivariate Data Streams

Week 5 Results

Pac

ket

s

Page 48: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Mean and Covariance• Applying partitioning method using a trimmed

mean and trimmed covariance matrix – Use 90% most central data points– Recompute center– Recompute Mahalanobis distances using new center

and new covariance matrix

• Bins are more uniformly filled– More points appear closer to trimmed center– More variables become “most extreme”

Page 49: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Comparison: Week 1

Page 50: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 1

Page 51: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 2

Page 52: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 3

Page 53: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 4

Page 54: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 5

Page 55: Statistical Approaches to Mining Multivariate Data Streams

Multivariate Tests of Statistical Significance

• Multinomial test – Chi-squared test, G-test

• Bootstrap and Resampling

• Bayesian methods

Page 56: Statistical Approaches to Mining Multivariate Data Streams

Conclusion

• Quickly categorize data points non-parametrically

• Compare bins

• Identify which variables change most

• Questions or comments to

ervance @ stat.duke.edu