Statistical Data Mining: A Short Course
for the Army Conference on Applied Statistics

Edward J. Wegman, George Mason University
Jeffrey L. Solka, Naval Surface Warfare Center
Statistical Data Mining Agenda
- Introduction and Complexity
- Data Preparation and Compression
- Databases and Data Mining via Association Rules
- Clustering, Classification, and Discrimination
- Pattern Recognition and Intrusion Detection
- Color Theory and Design
- Visual Data Mining
- CrystalVision Installation and Practice
Introduction to Data Mining
- What is Data Mining All About
- Hierarchy of Data Set Size
- Computational Complexity and Feasibility
- Data Mining Defined & Contrasted with EDA
- Examples
Introduction to Data Mining
Why Data Mining?
What is Knowledge Discovery in Databases?
Potential Applications:
- Fraud Detection
- Manufacturing Processes
- Targeting Markets
- Scientific Data Analysis
- Risk Management
- Web Intelligence
Introduction to Data Mining
Data Mining: On what kind of data?
- Relational Databases
- Data Warehouses
- Transactional Databases
- Advanced:
  - Object-relational
  - Spatial, Temporal, Spatiotemporal
  - Text, WWW
  - Heterogeneous, Legacy, Distributed
Introduction to Data Mining
Data Mining: Why now? Confluence of multiple disciplines:
- Database systems, data warehouses, OLAP
- Machine learning
- Statistical and data analysis methods
- Visualization
- Mathematical programming
- High performance computing
Introduction to Data Mining
Why do we need data mining?
- Large number of records (cases): 10^8 to 10^12 bytes
- High dimensional data (variables): 10 to 10^4 attributes
How do you explore millions of records, tens or hundreds of fields, and find patterns?
Introduction to Data Mining
Why do we need data mining?
- Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
- Data that may never be explored continues to be collected, out of fear that something which may prove important in the future would otherwise be missed.
- The magnitude of the data precludes most traditional analysis (more on complexity later).
Introduction to Data Mining
KDD and data mining have roots in traditional database technology.
As databases grow, the ability of the decision support process to exploit traditional (i.e., Boolean) query languages is limited. Many queries of interest are difficult or impossible to state in traditional query languages:
- "Find all cases of fraud in IRS tax returns."
- "Find all individuals likely to ignore Census questionnaires."
- "Find all documents relating to this customer's problem."
Complexity
The Huber-Wegman Taxonomy of Data Set Sizes

Descriptor     Data Set Size in Bytes   Storage Mode
Tiny           10^2                     Piece of paper
Small          10^4                     A few pieces of paper
Medium         10^6                     A floppy disk
Large          10^8                     Hard disk
Huge           10^10                    Multiple hard disks
Massive        10^12                    Robotic magnetic tape storage silos
Supermassive   10^15                    Distributed data archives
Complexity
Algorithmic Complexity

O(n)         Calculate means, variances, kernel density estimates
O(n log(n))  Calculate fast Fourier transforms
O(nc)        Calculate singular value decomposition of an r x c matrix;
             solve a multiple linear regression
O(n^2)       Solve most clustering algorithms
O(a^n)       Detect multivariate outliers
Complexity
Table 2: Number of Operations for Algorithms of Various
Computational Complexities and Various Data Set Sizes

         n^(1/2)   n        n log(n)   n^(3/2)   n^2
tiny     10        10^2     2x10^2     10^3      10^4
small    10^2      10^4     4x10^4     10^6      10^8
medium   10^3      10^6     6x10^6     10^9      10^12
large    10^4      10^8     8x10^8     10^12     10^16
huge     10^5      10^10    10^11      10^15     10^20
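The operation counts above follow directly from the complexity formulas applied to each data set size. A minimal sketch (note that base-10 logarithms are what reproduce the n log(n) column of Table 2):

```python
import math

# Data set sizes from the Huber-Wegman taxonomy, used here as n
sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

def op_counts(n):
    """Operation counts for each complexity class at data set size n."""
    return {
        "n^(1/2)":  n ** 0.5,
        "n":        n,
        "n log(n)": n * math.log10(n),  # base-10 log, as in Table 2
        "n^(3/2)":  n ** 1.5,
        "n^2":      n ** 2,
    }

for name, n in sizes.items():
    print(name, {k: f"{v:.3g}" for k, v in op_counts(n).items()})
```

For a tiny (n = 10^2) set, n log(n) gives 2x10^2; for a huge (n = 10^10) set, n^2 gives 10^20, matching the table entries.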
Complexity
Table 4: Computational Feasibility on a Pentium PC
(10 megaflop performance assumed)

         n^(1/2)      n           n log(n)      n^(3/2)      n^2
tiny     10^-6 sec    10^-5 sec   2x10^-5 sec   .0001 sec    .001 sec
small    10^-5 sec    .001 sec    .004 sec      .1 sec       10 sec
medium   .0001 sec    .1 sec      .6 sec        1.67 min     1.16 days
large    .001 sec     10 sec      1.3 min       1.16 days    31.7 years
huge     .01 sec      16.7 min    2.78 hours    3.17 years   317,000 years
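The feasibility times in Tables 4 through 7 are simply the operation counts of Table 2 divided by the machine's flop rate. A small sketch (the unit-conversion helper and the 3.15x10^7 seconds-per-year figure are my own choices, not from the slides):

```python
def runtime(ops, flops):
    """Convert an operation count into a human-readable runtime,
    assuming one floating-point operation per counted operation."""
    seconds = ops / flops
    for unit, scale in [("years", 3.15e7), ("days", 8.64e4),
                        ("hours", 3600.0), ("minutes", 60.0)]:
        if seconds >= scale:
            return f"{seconds / scale:.3g} {unit}"
    return f"{seconds:.3g} seconds"

# Worst case of Table 4: n^2 operations on a huge set (10^20 ops)
# at 10 megaflops -- roughly 317,000 years.
print(runtime(1e20, 1e7))
```

Swapping in 3x10^8 flops (the Onyx) or 10^12 flops (the teraflop machine) reproduces the corresponding entries of Tables 5 and 7.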
Complexity
Table 5: Computational Feasibility on a Silicon Graphics Onyx Workstation
(300 megaflop performance assumed)

         n^(1/2)         n               n log(n)        n^(3/2)        n^2
tiny     3.3x10^-8 sec   3.3x10^-7 sec   6.7x10^-7 sec   3.3x10^-6 sec  3.3x10^-5 sec
small    3.3x10^-7 sec   3.3x10^-5 sec   1.3x10^-4 sec   3.3x10^-3 sec  .33 sec
medium   3.3x10^-6 sec   3.3x10^-3 sec   .02 sec         3.3 sec        55 min
large    3.3x10^-5 sec   .33 sec         2.7 sec         55 min         1.04 years
huge     3.3x10^-4 sec   33 sec          5.5 min         38.2 days      10,464 years
Complexity
Table 6: Computational Feasibility on an Intel Paragon XP/S A4
(4.2 gigaflop performance assumed)

         n^(1/2)         n               n log(n)        n^(3/2)        n^2
tiny     2.4x10^-9 sec   2.4x10^-8 sec   4.8x10^-8 sec   2.4x10^-7 sec  2.4x10^-6 sec
small    2.4x10^-8 sec   2.4x10^-6 sec   9.5x10^-6 sec   2.4x10^-4 sec  .024 sec
medium   2.4x10^-7 sec   2.4x10^-4 sec   .0014 sec       .24 sec        4.0 min
large    2.4x10^-6 sec   .024 sec        .19 sec         4.0 min        27.8 days
huge     2.4x10^-5 sec   2.4 sec         24 sec          66.7 hours     761 years
Complexity
Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer
(1000 gigaflop performance assumed)

         n^(1/2)      n            n log(n)       n^(3/2)     n^2
tiny     10^-11 sec   10^-10 sec   2x10^-10 sec   10^-9 sec   10^-8 sec
small    10^-10 sec   10^-8 sec    4x10^-8 sec    10^-6 sec   10^-4 sec
medium   10^-9 sec    10^-6 sec    6x10^-6 sec    .001 sec    1 sec
large    10^-8 sec    10^-4 sec    8x10^-4 sec    1 sec       2.8 hours
huge     10^-7 sec    .01 sec      .1 sec         16.7 min    3.2 years
Complexity
Table 8: Types of Computers for Interactive Feasibility
(Response Time < 1 second; PC = Personal Computer)

         n^(1/2)   n              n log(n)       n^(3/2)        n^2
tiny     PC        PC             PC             PC             PC
small    PC        PC             PC             PC             Supercomputer
medium   PC        PC             PC             Supercomputer  Teraflop Computer
large    PC        Workstation    Supercomputer  Teraflop Computer  ---
huge     PC        Supercomputer  Teraflop Computer  ---        ---
Complexity
Table 9: Types of Computers for Feasibility
(Response Time < 1 week; PC = Personal Computer)

         n^(1/2)   n     n log(n)   n^(3/2)        n^2
tiny     PC        PC    PC         PC             PC
small    PC        PC    PC         PC             PC
medium   PC        PC    PC         PC             PC
large    PC        PC    PC         PC             Teraflop Computer
huge     PC        PC    PC         Supercomputer  ---
Complexity
Table 10: Transfer Rates for a Variety of Data Transfer Regimes

          standard ethernet   fast ethernet       hard disk transfer   cache transfer
          10 megabits/sec     100 megabits/sec    2027 kilobytes/sec   @ 200 megahertz
          = 1.25x10^6 B/sec   = 1.25x10^7 B/sec   = 2.027x10^6 B/sec   = 2x10^8 B/sec
tiny      8x10^-5 sec         8x10^-6 sec         4.9x10^-5 sec        5x10^-7 sec
small     8x10^-3 sec         8x10^-4 sec         4.9x10^-3 sec        5x10^-5 sec
medium    .8 sec              .08 sec             .49 sec              5x10^-3 sec
large     1.3 min             8 sec               49 sec               .5 sec
huge      2.2 hours           13.3 min            1.36 hours           50 sec
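Each entry in Table 10 is the data set size in bytes divided by the channel's transfer rate in bytes per second. A quick sketch reproducing two of the entries:

```python
# Transfer rates in bytes/second for the four regimes of Table 10
rates = {
    "standard ethernet": 1.25e6,   # 10 megabits/sec
    "fast ethernet":     1.25e7,   # 100 megabits/sec
    "hard disk":         2.027e6,  # 2027 kilobytes/sec
    "cache @ 200 MHz":   2e8,
}

def transfer_seconds(nbytes, rate):
    """Time in seconds to move nbytes over a channel of the given rate."""
    return nbytes / rate

# A huge (10^10 byte) data set over standard ethernet:
hours = transfer_seconds(1e10, rates["standard ethernet"]) / 3600
print(f"{hours:.1f} hours")   # 2.2 hours, as in Table 10
```

Note that even the fastest regime, cache transfer, needs 50 seconds to move a huge data set, so data movement alone can dominate interactive analysis.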
Complexity
Table 11: Resolvable Number of Pixels Across
Screen for Several Viewing Scenarios

                                      19-inch monitor  25-inch TV  15-foot screen
                                      @ 24 inches      @ 12 feet   @ 20 feet       immersion
Viewing angle                         39.005°          9.922°      41.112°         140°
5 arc-seconds resolution (Valyus)     28,084           7,144       29,601          100,800
1 arc-minute resolution               2,340            595         2,467           8,400
3.6 arc-minutes resolution (Wegman)   650              165         685             2,333
4.38 arc-minutes resolution (Maar 1)  534              136         563             1,918
.486 arc-minutes/foveal cone (Maar 2) 4,815            1,225       5,076           17,284
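The pixel counts in Table 11 come from dividing the viewing angle by the assumed angular resolution of the eye. A minimal sketch reproducing the 19-inch-monitor column:

```python
# Resolvable pixels across the screen = viewing angle / angular resolution,
# with both expressed in arc-seconds (1 degree = 3600 arc-seconds).
def resolvable_pixels(angle_degrees, resolution_arc_seconds):
    return angle_degrees * 3600 / resolution_arc_seconds

# 19-inch monitor at 24 inches subtends a 39.005 degree viewing angle:
print(round(resolvable_pixels(39.005, 5)))         # 28,084 (Valyus, 5 arc-seconds)
print(round(resolvable_pixels(39.005, 60)))        # 2,340  (1 arc-minute)
print(round(resolvable_pixels(39.005, 3.6 * 60)))  # 650    (Wegman, 3.6 arc-minutes)
```

The same formula with a 140 degree immersive field gives the last column, e.g. 100,800 pixels at the Valyus resolution.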
Complexity
Scenarios:
- Typical high resolution workstation: 1280x1024 = 1.31x10^6 pixels
- Realistic, using Wegman, immersion, 4:5 aspect ratio: 2333x1866 = 4.35x10^6 pixels
- Very optimistic, using 1 minute of arc, immersion, 4:5 aspect ratio: 8400x6720 = 5.65x10^7 pixels
- Wildly optimistic, using Maar (2), immersion, 4:5 aspect ratio: 17,284x13,828 = 2.39x10^8 pixels
Massive Data Sets
One Terabyte Data Set
vs.
One Million One-Megabyte Data Sets
Both are difficult to analyze, but for different reasons.
Massive Data Sets: Commonly Used Language
Data Mining = DM
Knowledge Discovery in Databases = KDD
Massive Data Sets = MD
Data Analysis = DA
Massive Data Sets
DM ≠ MD
DM ≠ DA
Even DA + MD ≠ DM
Data mining also requires:
1. Computationally feasible algorithms
2. Little or no human intervention
Data Mining of Massive Datasets
Data Mining is a kind of Exploratory Data Analysis, with Little or No Human Interaction, using Computationally Feasible Techniques,
i.e., the Attempt to find Interesting Structure unknown a priori
Massive Data Sets
Major Issues:
- Complexity
- Non-homogeneity

Examples:
- Huber's Air Traffic Control
- Highway Maintenance
- Ultrasonic NDE
Massive Data Sets
Air Traffic Control
- 6 to 12 radar stations, several hundred aircraft, a 64-byte record per radar per aircraft per antenna turn
- Roughly a megabyte of data per minute
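The megabyte-per-minute figure is easy to sanity-check. A back-of-the-envelope sketch (the 300-aircraft count and 10-second antenna turn are assumed values for illustration, not from the slide; the slide gives 12 radars and 64-byte records):

```python
# Assumed: 300 aircraft in view, one antenna turn every 10 seconds.
radars, aircraft, record_bytes = 12, 300, 64
turns_per_minute = 60 / 10

bytes_per_minute = radars * aircraft * record_bytes * turns_per_minute
print(f"{bytes_per_minute / 1e6:.2f} MB per minute")  # about 1.4 MB/minute
```

Under these assumptions the stream is on the order of a megabyte per minute, consistent with the slide.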
Massive Data Sets
Highway Maintenance
- Maintenance records and measurements of road quality spanning several decades
- Records of uneven quality
- Missing records
Massive Data Sets
NDE using Ultrasound
- Inspection of cast iron projectiles
- Time series of length 256, at 360 degrees and 550 levels = 50,688,000 observations per projectile
- Several thousand projectiles per day
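The per-projectile count is just the product of the three acquisition dimensions. A quick check (the 2,000-projectiles-per-day figure is an assumed value within the slide's "several thousand"):

```python
# Data volume per projectile: a time series of length 256
# at each of 360 degrees and 550 levels.
samples = 256 * 360 * 550
print(samples)  # 50,688,000 observations per projectile

# Assumed daily throughput of 2,000 projectiles:
per_day = samples * 2000
print(f"{per_day:.3g} observations per day")
```

At that throughput the plant generates on the order of 10^11 observations per day, i.e. a huge-to-massive data set every day in the Huber-Wegman taxonomy.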
Massive Data Sets: A Distinction
Human Analysis of the Structure of the Data and its Pitfalls
vs.
Human Analysis of the Data Itself
Limits of the human visual system (HVS) and computational complexity constrain the latter; the former is the basis for the design of the analysis engine.