
Data Mining and Analysis: Fundamental Concepts and Algorithms

Mohammed J. Zaki
Wagner Meira Jr.

Contents

Preface

1 Data Mining and Analysis
  1.1 Data Matrix
  1.2 Attributes
  1.3 Data: Algebraic and Geometric View
    1.3.1 Distance and Angle
    1.3.2 Mean and Total Variance
    1.3.3 Orthogonal Projection
    1.3.4 Linear Independence and Dimensionality
  1.4 Data: Probabilistic View
    1.4.1 Bivariate Random Variables
    1.4.2 Multivariate Random Variable
    1.4.3 Random Sample and Statistics
  1.5 Data Mining
    1.5.1 Exploratory Data Analysis
    1.5.2 Frequent Pattern Mining
    1.5.3 Clustering
    1.5.4 Classification
  1.6 Further Reading
  1.7 Exercises

I Data Analysis Foundations

2 Numeric Attributes
  2.1 Univariate Analysis
    2.1.1 Measures of Central Tendency
    2.1.2 Measures of Dispersion
  2.2 Bivariate Analysis
    2.2.1 Measures of Location and Dispersion
    2.2.2 Measures of Association
  2.3 Multivariate Analysis
  2.4 Data Normalization
  2.5 Normal Distribution
    2.5.1 Univariate Normal Distribution
    2.5.2 Multivariate Normal Distribution
  2.6 Further Reading
  2.7 Exercises

3 Categorical Attributes
  3.1 Univariate Analysis
    3.1.1 Bernoulli Variable
    3.1.2 Multivariate Bernoulli Variable
  3.2 Bivariate Analysis
    3.2.1 Attribute Dependence: Contingency Analysis
  3.3 Multivariate Analysis
    3.3.1 Multi-way Contingency Analysis
  3.4 Distance and Angle
  3.5 Discretization
  3.6 Further Reading
  3.7 Exercises

4 Graph Data
  4.1 Graph Concepts
  4.2 Topological Attributes
  4.3 Centrality Analysis
    4.3.1 Basic Centralities
    4.3.2 Web Centralities
  4.4 Graph Models
    4.4.1 Erdös-Rényi Random Graph Model
    4.4.2 Watts-Strogatz Small-world Graph Model
    4.4.3 Barabási-Albert Scale-free Model
  4.5 Further Reading
  4.6 Exercises

5 Kernel Methods
  5.1 Kernel Matrix
    5.1.1 Reproducing Kernel Map
    5.1.2 Mercer Kernel Map
  5.2 Vector Kernels
  5.3 Basic Kernel Operations in Feature Space
  5.4 Kernels for Complex Objects
    5.4.1 Spectrum Kernel for Strings
    5.4.2 Diffusion Kernels on Graph Nodes
  5.5 Further Reading
  5.6 Exercises

6 High-Dimensional Data
  6.1 High-Dimensional Objects
  6.2 High-Dimensional Volumes
  6.3 Hypersphere Inscribed within Hypercube
  6.4 Volume of Thin Hypersphere Shell
  6.5 Diagonals in Hyperspace
  6.6 Density of the Multivariate Normal
  6.7 Appendix: Derivation of Hypersphere Volume
  6.8 Further Reading
  6.9 Exercises

7 Dimensionality Reduction
  7.1 Background
  7.2 Principal Component Analysis
    7.2.1 Best Line Approximation
    7.2.2 Best Two-dimensional Approximation
    7.2.3 Best r-dimensional Approximation
    7.2.4 Geometry of PCA
  7.3 Kernel Principal Component Analysis (Kernel PCA)
  7.4 Singular Value Decomposition
    7.4.1 Geometry of SVD
    7.4.2 Connection between SVD and PCA
  7.5 Further Reading
  7.6 Exercises

II Frequent Pattern Mining

8 Itemset Mining
  8.1 Frequent Itemsets and Association Rules
  8.2 Itemset Mining Algorithms
    8.2.1 Level-Wise Approach: Apriori Algorithm
    8.2.2 Tidset Intersection Approach: Eclat Algorithm
    8.2.3 Frequent Pattern Tree Approach: FPGrowth Algorithm
  8.3 Generating Association Rules
  8.4 Further Reading
  8.5 Exercises

9 Summarizing Itemsets
  9.1 Maximal and Closed Frequent Itemsets
  9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm
  9.3 Mining Closed Frequent Itemsets: Charm Algorithm
  9.4 Non-Derivable Itemsets
  9.5 Further Reading
  9.6 Exercises

10 Sequence Mining
  10.1 Frequent Sequences
  10.2 Mining Frequent Sequences
    10.2.1 Level-Wise Mining: GSP
    10.2.2 Vertical Sequence Mining: SPADE
    10.2.3 Projection-Based Sequence Mining: PrefixSpan
  10.3 Substring Mining via Suffix Trees
    10.3.1 Suffix Tree
    10.3.2 Ukkonen's Linear Time Algorithm
  10.4 Further Reading
  10.5 Exercises

11 Graph Pattern Mining
  11.1 Isomorphism and Support
  11.2 Candidate Generation
    11.2.1 Canonical Code
  11.3 The gSpan Algorithm
    11.3.1 Extension and Support Computation
    11.3.2 Canonicality Checking
  11.4 Further Reading
  11.5 Exercises

12 Pattern and Rule Assessment
  12.1 Rule and Pattern Assessment Measures
    12.1.1 Rule Assessment Measures
    12.1.2 Pattern Assessment Measures
    12.1.3 Comparing Multiple Rules and Patterns
  12.2 Significance Testing and Confidence Intervals
    12.2.1 Fisher Exact Test for Productive Rules
    12.2.2 Permutation Test for Significance
    12.2.3 Bootstrap Sampling for Confidence Interval
  12.3 Further Reading
  12.4 Exercises

III Clustering

13 Representative-based Clustering
  13.1 K-means Algorithm
  13.2 Kernel K-means
  13.3 Expectation Maximization (EM) Clustering
    13.3.1 EM in One Dimension
    13.3.2 EM in d-Dimensions
    13.3.3 Maximum Likelihood Estimation
    13.3.4 Expectation-Maximization Approach
  13.4 Further Reading
  13.5 Exercises

14 Hierarchical Clustering
  14.1 Preliminaries
  14.2 Agglomerative Hierarchical Clustering
    14.2.1 Distance between Clusters
    14.2.2 Updating Distance Matrix
    14.2.3 Computational Complexity
  14.3 Further Reading
  14.4 Exercises and Projects

15 Density-based Clustering
  15.1 The DBSCAN Algorithm
  15.2 Kernel Density Estimation
    15.2.1 Univariate Density Estimation
    15.2.2 Multivariate Density Estimation
    15.2.3 Nearest Neighbor Density Estimation
  15.3 Density-based Clustering: DENCLUE
  15.4 Further Reading
  15.5 Exercises

16 Spectral and Graph Clustering
  16.1 Graphs and Matrices
  16.2 Clustering as Graph Cuts
    16.2.1 Clustering Objective Functions: Ratio and Normalized Cut
    16.2.2 Spectral Clustering Algorithm
    16.2.3 Maximization Objectives: Average Cut and Modularity
  16.3 Markov Clustering
  16.4 Further Reading
  16.5 Exercises

17 Clustering Validation
  17.1 External Measures
    17.1.1 Matching Based Measures
    17.1.2 Entropy Based Measures
    17.1.3 Pair-wise Measures
    17.1.4 Correlation Measures
  17.2 Internal Measures
  17.3 Relative Measures
    17.3.1 Cluster Stability
    17.3.2 Clustering Tendency
  17.4 Further Reading
  17.5 Exercises

IV Classification

18 Probabilistic Classification
  18.1 Bayes Classifier
    18.1.1 Estimating the Prior Probability
    18.1.2 Estimating the Likelihood
  18.2 Naive Bayes Classifier
  18.3 Further Reading
  18.4 Exercises

19 Decision Tree Classifier
  19.1 Decision Trees
  19.2 Decision Tree Algorithm
    19.2.1 Split-point Evaluation Measures
    19.2.2 Evaluating Split-points
    19.2.3 Computational Complexity
  19.3 Further Reading
  19.4 Exercises

20 Linear Discriminant Analysis
  20.1 Optimal Linear Discriminant
  20.2 Kernel Discriminant Analysis
  20.3 Further Reading
  20.4 Exercises

21 Support Vector Machines
  21.1 Linear Discriminants and Margins
  21.2 SVM: Linear and Separable Case
  21.3 Soft Margin SVM: Linear and Non-Separable Case
    21.3.1 Hinge Loss
    21.3.2 Quadratic Loss
  21.4 Kernel SVM: Nonlinear Case
  21.5 SVM Training Algorithms
    21.5.1 Dual Solution: Stochastic Gradient Ascent
    21.5.2 Primal Solution: Newton Optimization

22 Classification Assessment
  22.1 Classification Performance Measures
    22.1.1 Contingency Table Based Measures
    22.1.2 Binary Classification: Positive and Negative Class
    22.1.3 ROC Analysis
  22.2 Classifier Evaluation
    22.2.1 K-fold Cross-Validation
    22.2.2 Bootstrap Resampling
    22.2.3 Confidence Intervals
    22.2.4 Comparing Classifiers: Paired t-Test
  22.3 Bias-Variance Decomposition
    22.3.1 Ensemble Classifiers
  22.4 Further Reading
  22.5 Exercises

Index


    Preface

This book is an outgrowth of data mining courses at RPI and UFMG; the RPI course has been offered every Fall since 1998, whereas the UFMG course has been offered since 2002. While there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text which focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.

The main parts of the book include exploratory data analysis, frequent pattern mining, clustering and classification. The book lays the basic foundations of these tasks, and it also covers cutting edge topics like kernel methods, high dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines like machine learning and statistics, and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially on linear algebra, and probability and statistics.

The book includes many examples to illustrate the main technical concepts. It also has end-of-chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that the reader use their favorite data analysis and mining software to work through our examples, and to implement the algorithms we describe in the text; we recommend the R software, or the Python language with its NumPy package. The datasets used and other supplementary material like project ideas, slides, and so on, are available online at the book's companion site and its mirrors at RPI and UFMG:

• http://dataminingbook.info
• http://www.cs.rpi.edu/~zaki/dataminingbook
• http://www.dcc.ufmg.br/dataminingbook

Having understood the basic principles and algorithms in data mining and data analysis, the readers will be well equipped to develop their own methods or use more advanced techniques.


    Suggested Roadmaps

The chapter dependency graph is shown in Figure 1. We suggest some typical roadmaps for courses and readings based on this book. For an undergraduate level course, we suggest the following chapters: 1-3, 8, 10, 12-15, 17-19, and 21-22. For an undergraduate course without exploratory data analysis, we recommend Chapters 1, 8-15, 17-19, and 21-22. For a graduate course, one possibility is to quickly go over the material in Part I, or to assume it as background reading and to directly cover Chapters 9-23; the other parts of the book, namely frequent pattern mining (Part II), clustering (Part III), and classification (Part IV), can be covered in any order. For a course on data analysis the chapters must include 1-7, 13-14, 15 (Section 2), and 20. Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5, 7 (Sections 1-3), 11-12, 13 (Sections 1-2), 16-17, 20-22.

[Figure 1: Chapter Dependencies]

    Acknowledgments

Initial drafts of this book have been used in many data mining courses. We received many valuable comments and corrections from both the faculty and students. Our thanks go to

    • Muhammad Abulaish, Jamia Millia Islamia, India


• Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
• Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
• Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
• Ayhan Demiriz, Sakarya University, Turkey
• Murat Dundar, Indiana University Purdue University at Indianapolis
• Jun Luke Huan, University of Kansas
• Ruoming Jin, Kent State University
• Latifur Khan, University of Texas, Dallas
• Pauli Miettinen, Max-Planck-Institut für Informatik, Germany
• Suat Ozdemir, Gazi University, Turkey
• Naren Ramakrishnan, Virginia Polytechnic and State University
• Leonardo Chaves Dutra da Rocha, Universidade Federal de São João del-Rei, Brazil
• Saeed Salem, North Dakota State University
• Ankur Teredesai, University of Washington, Tacoma
• Hannu Toivonen, University of Helsinki, Finland
• Adriano Alonso Veloso, Universidade Federal de Minas Gerais, Brazil
• Jason T.L. Wang, New Jersey Institute of Technology
• Jianyong Wang, Tsinghua University, China
• Jiong Yang, Case Western Reserve University
• Jieping Ye, Arizona State University

We would like to thank all the students enrolled in our data mining courses at RPI and UFMG, and also the anonymous reviewers who provided technical comments on various chapters. In addition, we thank CNPq, CAPES, FAPEMIG, Inweb (the National Institute of Science and Technology for the Web), and Brazil's Science without Borders program for their support. We thank Lauren Cowles, our editor at Cambridge University Press, for her guidance and patience in realizing this book.

Finally, on a more personal front, MJZ would like to dedicate the book to Amina, Abrar, Afsah, and his parents, and WMJ would like to dedicate the book to Patricia, Gabriel, Marina and his parents, Wagner and Marlene. This book would not have been possible without their patience and support.

Troy                Mohammed J. Zaki
Belo Horizonte      Wagner Meira, Jr.
Summer 2013


    Chapter 1

    Data Mining and Analysis

Data mining is the process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable and predictive models from large-scale data. We begin this chapter by looking at basic properties of data modeled as a data matrix. We emphasize the geometric and algebraic views, as well as the probabilistic interpretation of data. We then discuss the main data mining tasks, which span exploratory data analysis, frequent pattern mining, clustering and classification, laying out the roadmap for the book.

    1.1 Data Matrix

Data can often be represented or abstracted as an n × d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or properties of interest. Each row in the data matrix records the observed attribute values for a given entity. The n × d data matrix is given as

$$
\mathbf{D} = \begin{pmatrix}
 & X_1 & X_2 & \cdots & X_d \\
\mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
\mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
$$

    where xi denotes the i-th row, which is a d-tuple given as

    xi = (xi1, xi2, · · · , xid)

    and where Xj denotes the j-th column, which is an n-tuple given as

    Xj = (x1j , x2j , · · · , xnj)

Depending on the application domain, rows may also be referred to as entities, instances, examples, records, transactions, objects, points, feature-vectors, tuples and so on. Likewise, columns may also be called attributes, properties, features, dimensions, variables, fields, and so on. The number of instances n is referred to as the size of the data, whereas the number of attributes d is called the dimensionality of the data. The analysis of a single attribute is referred to as univariate analysis, whereas the simultaneous analysis of two attributes is called bivariate analysis and the simultaneous analysis of more than two attributes is called multivariate analysis.

        X1             X2            X3             X4            X5
        sepal length   sepal width   petal length   petal width   class
x1      5.9            3.0           4.2            1.5           Iris-versicolor
x2      6.9            3.1           4.9            1.5           Iris-versicolor
x3      6.6            2.9           4.6            1.3           Iris-versicolor
x4      4.6            3.2           1.4            0.2           Iris-setosa
x5      6.0            2.2           4.0            1.0           Iris-versicolor
x6      4.7            3.2           1.3            0.2           Iris-setosa
x7      6.5            3.0           5.8            2.2           Iris-virginica
x8      5.8            2.7           5.1            1.9           Iris-virginica
...     ...            ...           ...            ...           ...
x149    7.7            3.8           6.7            2.2           Iris-virginica
x150    5.1            3.4           1.5            0.2           Iris-setosa

Table 1.1: Extract from the Iris Dataset

Example 1.1: Table 1.1 shows an extract of the Iris dataset; the complete data forms a 150 × 5 data matrix. Each entity is an Iris flower, and the attributes include sepal length, sepal width, petal length and petal width in centimeters, and the type or class of the Iris flower. The first row is given as the 5-tuple

    x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor)
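As a quick illustration of this row/column view, here is a minimal NumPy sketch (following the software suggestion in the Preface, and not part of the original text), using only the first four flowers and the four numeric attributes from Table 1.1:

```python
import numpy as np

# First four rows of Table 1.1, numeric attributes only (sepal/petal length and width).
D = np.array([[5.9, 3.0, 4.2, 1.5],
              [6.9, 3.1, 4.9, 1.5],
              [6.6, 2.9, 4.6, 1.3],
              [4.6, 3.2, 1.4, 0.2]])

n, d = D.shape     # size n = 4 and dimensionality d = 4
x1 = D[0, :]       # row x1: the attribute values of the first flower
X2 = D[:, 1]       # column X2: sepal width across all flowers
print(n, d, x1, X2)
```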

Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences (e.g., DNA, Proteins), text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. For example, given a database of images, we can create a data matrix where rows represent images and columns correspond to image features like color, texture, and so on. Sometimes, certain attributes may have special semantics associated with them requiring special treatment. For instance, temporal or spatial attributes are often treated differently. It is also worth noting that traditional data analysis assumes that each entity or instance is independent. However, given the interconnected nature of the world we live in, this assumption may not always hold. Instances may be connected to other instances via various kinds of relationships, giving rise to a data graph, where a node represents an entity and an edge represents the relationship between two entities.

    1.2 Attributes

Attributes may be classified into two main types depending on their domain, i.e., depending on the types of values they take on.

Numeric Attributes  A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers), is numeric, and so is petal length in Table 1.1, with domain(petal length) = R+ (the set of all positive real numbers). Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. As a special case of discrete, if an attribute has as its domain the set {0, 1}, it is called a binary attribute. Numeric attributes can be further classified into two types:

• Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For example, attribute temperature measured in °C or °F is interval-scaled. If it is 20 °C on one day and 10 °C on the following day, it is meaningful to talk about a temperature drop of 10 °C, but it is not meaningful to say that it is twice as cold as the previous day.

• Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for attribute Age, we can say that someone who is 20 years old is twice as old as someone who is 10 years old.

Categorical Attributes  A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex and Education could be categorical attributes with their domains given as

domain(Sex) = {M, F}
domain(Education) = {HighSchool, BS, MS, PhD}

    Categorical attributes may be of two types:

• Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful. That is, we can check only whether the value of the attribute for two given instances is the same or not. For example, Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with domain(class) = {iris-setosa, iris-versicolor, iris-virginica}.

• Ordinal: The attribute values are ordered, and thus both equality comparisons (is one value equal to another) and inequality comparisons (is one value less than or greater than another) are allowed, though it may not be possible to quantify the difference between values. For example, Education is an ordinal attribute, since its domain values are ordered by increasing educational qualification.

    1.3 Data: Algebraic and Geometric View

If the d attributes or dimensions in the data matrix D are all numeric, then each row can be considered as a d-dimensional point

    xi = (xi1, xi2, · · · , xid) ∈ Rd

or equivalently, each row may be considered as a d-dimensional column vector (all vectors are assumed to be column vectors by default)

$$
\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix}
= \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d
$$

where T is the matrix transpose operator.

The d-dimensional Cartesian coordinate space is specified via the d unit vectors, called the standard basis vectors, along each of the axes. The j-th standard basis vector ej is the d-dimensional unit vector whose j-th component is 1 and the rest of the components are 0

    ej = (0, . . . , 1j , . . . , 0)T

Any other vector in Rd can be written as a linear combination of the standard basis vectors. For example, each of the points xi can be written as the linear combination

$$
\mathbf{x}_i = x_{i1}\mathbf{e}_1 + x_{i2}\mathbf{e}_2 + \cdots + x_{id}\mathbf{e}_d = \sum_{j=1}^{d} x_{ij}\,\mathbf{e}_j
$$

    where the scalar value xij is the coordinate value along the j-th axis or attribute.

[Figure 1.1: Row x1 as a point and vector in (a) R2 and (b) R3]

[Figure 1.2: Scatter plot of sepal length versus sepal width. The solid circle shows the mean point.]

Example 1.2: Consider the Iris data in Table 1.1. If we project the entire data onto the first two attributes, then each row can be considered as a point or a vector in 2-dimensional space. For example, the projection of the 5-tuple x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor) on the first two attributes is shown in Figure 1.1a. Figure 1.2 shows the scatter plot of all the n = 150 points in the 2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows x1 as a point and vector in 3-dimensional space, by projecting the data onto the first three attributes. The point (5.9, 3.0, 4.2) can be seen as specifying the coefficients in the linear combination of the standard basis vectors in R3

$$
\mathbf{x}_1 = 5.9\,\mathbf{e}_1 + 3.0\,\mathbf{e}_2 + 4.2\,\mathbf{e}_3
= 5.9\begin{pmatrix}1\\0\\0\end{pmatrix} + 3.0\begin{pmatrix}0\\1\\0\end{pmatrix} + 4.2\begin{pmatrix}0\\0\\1\end{pmatrix}
= \begin{pmatrix}5.9\\3.0\\4.2\end{pmatrix}
$$
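This linear combination is trivial to verify in NumPy (an illustrative sketch, not part of the original text):

```python
import numpy as np

e1, e2, e3 = np.eye(3)               # standard basis vectors of R^3
x1 = 5.9 * e1 + 3.0 * e2 + 4.2 * e3  # the coordinates act as coefficients of the basis vectors
print(x1)                            # [5.9 3.  4.2]
```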

    Each numeric column or attribute can also be treated as a vector in an n-dimensional space Rn

$$
X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}
$$

If all attributes are numeric, then the data matrix D is in fact an n × d matrix, also written as D ∈ Rn×d, given as

$$
\mathbf{D} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
= \begin{pmatrix}
\text{---}\;\mathbf{x}_1^T\;\text{---} \\
\text{---}\;\mathbf{x}_2^T\;\text{---} \\
\vdots \\
\text{---}\;\mathbf{x}_n^T\;\text{---}
\end{pmatrix}
= \begin{pmatrix}
| & | & & | \\
X_1 & X_2 & \cdots & X_d \\
| & | & & |
\end{pmatrix}
$$

As we can see, we can consider the entire dataset as an n × d matrix, or equivalently as a set of n row vectors xi^T ∈ Rd or as a set of d column vectors Xj ∈ Rn.

    1.3.1 Distance and Angle

Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks.

    Let a,b ∈ Rm be two m-dimensional vectors given as

$$
\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix}
\qquad
\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}
$$

Dot Product  The dot product between a and b is defined as the scalar value

$$
\mathbf{a}^T\mathbf{b} = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}
= a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i
$$

Length  The Euclidean norm or length of a vector a ∈ Rm is defined as

$$
\|\mathbf{a}\| = \sqrt{\mathbf{a}^T\mathbf{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}
$$

The unit vector in the direction of a is given as

$$
\mathbf{u} = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \left(\frac{1}{\|\mathbf{a}\|}\right)\mathbf{a}
$$

By definition u has length ‖u‖ = 1, and it is also called a normalized vector, which can be used in lieu of a in some analysis tasks.

The Euclidean norm is a special case of a general class of norms, known as the Lp-norm, defined as

$$
\|\mathbf{a}\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{m} |a_i|^p\right)^{\frac{1}{p}}
$$

for any p ≠ 0. Thus, the Euclidean norm corresponds to the case when p = 2.

Distance  From the Euclidean norm we can define the Euclidean distance between a and b, as follows

$$
\delta(\mathbf{a},\mathbf{b}) = \|\mathbf{a}-\mathbf{b}\| = \sqrt{(\mathbf{a}-\mathbf{b})^T(\mathbf{a}-\mathbf{b})} = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2} \tag{1.1}
$$

Thus, the length of a vector is simply its distance from the zero vector 0, all of whose elements are 0, i.e., ‖a‖ = ‖a − 0‖ = δ(a, 0).

From the general Lp-norm we can define the corresponding Lp-distance function, given as follows

$$
\delta_p(\mathbf{a},\mathbf{b}) = \|\mathbf{a}-\mathbf{b}\|_p \tag{1.2}
$$

Angle  The cosine of the smallest angle between vectors a and b, also called the cosine similarity, is given as

$$
\cos\theta = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^T\!\left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right) \tag{1.3}
$$

Thus, the cosine of the angle between a and b is given as the dot product of the unit vectors a/‖a‖ and b/‖b‖.

The Cauchy-Schwartz inequality states that for any vectors a and b in Rm

|aTb| ≤ ‖a‖ · ‖b‖

It follows immediately from the Cauchy-Schwartz inequality that

−1 ≤ cos θ ≤ 1

Since the smallest angle θ ∈ [0°, 180°] and since cos θ ∈ [−1, 1], the cosine similarity value ranges from +1, corresponding to an angle of 0°, to −1, corresponding to an angle of 180° (or π radians).

Orthogonality  Two vectors a and b are said to be orthogonal if and only if aTb = 0, which in turn implies that cos θ = 0, that is, the angle between them is 90° or π/2 radians. In this case, we say that they have no similarity.

[Figure 1.3: Distance and Angle. Unit vectors are shown in gray.]

    Example 1.3 (Distance and Angle): Figure 1.3 shows the two vectors

$$
\mathbf{a} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} 1 \\ 4 \end{pmatrix}
$$

Using (1.1), the Euclidean distance between them is given as

$$
\delta(\mathbf{a},\mathbf{b}) = \sqrt{(5-1)^2 + (3-4)^2} = \sqrt{16+1} = \sqrt{17} = 4.12
$$

The distance can also be computed as the magnitude of the vector

$$
\mathbf{a}-\mathbf{b} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} - \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 4 \\ -1 \end{pmatrix}
$$

since ‖a − b‖ = √(4² + (−1)²) = √17 = 4.12.

    The unit vector in the direction of a is given as

$$
\mathbf{u}_a = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \frac{1}{\sqrt{5^2+3^2}}\begin{pmatrix}5\\3\end{pmatrix}
= \frac{1}{\sqrt{34}}\begin{pmatrix}5\\3\end{pmatrix} = \begin{pmatrix}0.86\\0.51\end{pmatrix}
$$

The unit vector in the direction of b can be computed similarly

$$
\mathbf{u}_b = \begin{pmatrix}0.24\\0.97\end{pmatrix}
$$

These unit vectors are also shown in gray in Figure 1.3.

By (1.3) the cosine of the angle between a and b is given as

$$
\cos\theta = \frac{\begin{pmatrix}5\\3\end{pmatrix}^T\begin{pmatrix}1\\4\end{pmatrix}}{\sqrt{5^2+3^2}\,\sqrt{1^2+4^2}} = \frac{17}{\sqrt{34 \times 17}} = \frac{1}{\sqrt{2}}
$$

We can get the angle by computing the inverse of the cosine

$$
\theta = \cos^{-1}\!\left(1/\sqrt{2}\right) = 45^{\circ}
$$

    Let us consider the Lp-norm for a with p = 3; we get

$$
\|\mathbf{a}\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34
$$

The distance between a and b using (1.2) for the Lp-norm with p = 3 is given as

$$
\|\mathbf{a}-\mathbf{b}\|_3 = \left\|(4,-1)^T\right\|_3 = \left(|4|^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02
$$
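These computations are easy to reproduce numerically; a short NumPy sketch (illustrative, not part of the original text):

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dist = np.linalg.norm(a - b)                                   # Euclidean distance, sqrt(17) ≈ 4.12
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity, 1/sqrt(2)
theta = np.degrees(np.arccos(cos_theta))                       # 45 degrees

l3_norm = np.linalg.norm(a, ord=3)                             # (5^3 + 3^3)^(1/3) ≈ 5.34
l3_dist = np.linalg.norm(a - b, ord=3)                         # (|4|^3 + |-1|^3)^(1/3) ≈ 4.02

print(round(dist, 2), round(theta, 1), round(l3_norm, 2), round(l3_dist, 2))
```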


    1.3.2 Mean and Total Variance

Mean  The mean of the data matrix D is the vector obtained as the average of all the row-vectors

$$
\text{mean}(\mathbf{D}) = \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i
$$

Total Variance  The total variance of the data matrix D is the average squared distance of each point from the mean

$$
\text{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n} \delta(\mathbf{x}_i,\boldsymbol{\mu})^2
= \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2 \tag{1.4}
$$

Simplifying (1.4) we obtain

$$
\begin{aligned}
\text{var}(\mathbf{D}) &= \frac{1}{n}\sum_{i=1}^{n}\left(\|\mathbf{x}_i\|^2 - 2\,\mathbf{x}_i^T\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\Big(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\Big) + n\,\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\boldsymbol{\mu} + n\,\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2\right) - \|\boldsymbol{\mu}\|^2
\end{aligned}
$$

The total variance is thus the difference between the average of the squared magnitude of the data points and the squared magnitude of the mean (average of the points).
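A quick numerical check of this identity (an illustrative sketch on a small made-up matrix):

```python
import numpy as np

D = np.array([[5.9, 3.0],
              [6.9, 3.1],
              [4.6, 3.2],
              [6.0, 2.2]])
mu = D.mean(axis=0)

# Definition (1.4): average squared distance of each point from the mean.
var1 = np.mean(np.sum((D - mu) ** 2, axis=1))

# Simplified form: average squared magnitude minus the squared magnitude of the mean.
var2 = np.mean(np.sum(D ** 2, axis=1)) - np.sum(mu ** 2)

print(np.isclose(var1, var2))   # True
```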

Centered Data Matrix  Often we need to center the data matrix by making the mean coincide with the origin of the data space. The centered data matrix is obtained by subtracting the mean from all the points

$$
\mathbf{Z} = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}^T =
\begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} -
\begin{pmatrix} \boldsymbol{\mu}^T \\ \boldsymbol{\mu}^T \\ \vdots \\ \boldsymbol{\mu}^T \end{pmatrix} =
\begin{pmatrix} \mathbf{x}_1^T - \boldsymbol{\mu}^T \\ \mathbf{x}_2^T - \boldsymbol{\mu}^T \\ \vdots \\ \mathbf{x}_n^T - \boldsymbol{\mu}^T \end{pmatrix} =
\begin{pmatrix} \mathbf{z}_1^T \\ \mathbf{z}_2^T \\ \vdots \\ \mathbf{z}_n^T \end{pmatrix} \tag{1.5}
$$

where zi = xi − µ represents the centered point corresponding to xi, and 1 ∈ Rn is the n-dimensional vector all of whose elements have value 1. The mean of the centered data matrix Z is 0 ∈ Rd, since we have subtracted the mean µ from all the points xi.
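In NumPy, centering is a one-line broadcast (illustrative sketch):

```python
import numpy as np

D = np.array([[5.9, 3.0],
              [6.9, 3.1],
              [4.6, 3.2],
              [6.0, 2.2]])

Z = D - D.mean(axis=0)                     # subtract the mean row-vector from every row
print(np.allclose(Z.mean(axis=0), 0.0))    # True: the centered data has zero mean
```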

[Figure 1.4: Orthogonal Projection]

    1.3.3 Orthogonal Projection

Often in data mining we need to project a point or vector onto another vector, for example to obtain a new point after a change of the basis vectors. Let a, b ∈ Rm be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a, illustrated in Figure 1.4, is given as

$$
\mathbf{b} = \mathbf{b}_{\parallel} + \mathbf{b}_{\perp} = \mathbf{p} + \mathbf{r} \tag{1.6}
$$

where p = b‖ is parallel to a, and r = b⊥ is perpendicular or orthogonal to a. The vector p is called the orthogonal projection or simply projection of b on the vector a. Note that the point p ∈ Rm is the point closest to b on the line passing through a. Thus, the magnitude of the vector r = b − p gives the perpendicular distance between b and a, which is often interpreted as the residual or error vector between the points b and p.

We can derive an expression for p by noting that p = ca for some scalar c, since p is parallel to a. Thus, r = b − p = b − ca. Since p and r are orthogonal, we have

$$
\mathbf{p}^T\mathbf{r} = (c\mathbf{a})^T(\mathbf{b} - c\mathbf{a}) = c\,\mathbf{a}^T\mathbf{b} - c^2\,\mathbf{a}^T\mathbf{a} = 0
$$

which implies that

$$
c = \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}
$$

Therefore, the projection of b on a is given as

$$
\mathbf{p} = \mathbf{b}_{\parallel} = c\,\mathbf{a} = \left(\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\right)\mathbf{a} \tag{1.7}
$$
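Equation (1.7) translates directly into NumPy (illustrative sketch):

```python
import numpy as np

def project(b, a):
    """Orthogonal projection of b onto the line spanned by a, following Eq. (1.7)."""
    c = (a @ b) / (a @ a)      # scalar c = a^T b / a^T a
    return c * a

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])
p = project(b, a)              # component of b parallel to a
r = b - p                      # residual, orthogonal to a
print(np.isclose(a @ r, 0.0))  # True
```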

[Figure 1.5: Projecting the Centered Data onto the Line ℓ]

Example 1.4: Restricting the Iris dataset to the first two dimensions, sepal length and sepal width, the mean point is given as

$$
\text{mean}(\mathbf{D}) = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}
$$

which is shown as the black circle in Figure 1.2. The corresponding centered data is shown in Figure 1.5, and the total variance is var(D) = 0.868 (centering does not change this value).

Figure 1.5 shows the projection of each point onto the line ℓ, which is the line that maximizes the separation between the class iris-setosa (squares) and the other two classes (circles and triangles). The line ℓ is given as the set of all the points (x1, x2)^T satisfying the constraint

$$
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = c\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix} \quad \text{for all scalars } c \in \mathbb{R}
$$

    1.3.4 Linear Independence and Dimensionality

    Given the data matrix

$$
\mathbf{D} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \end{pmatrix}^T
= \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}
$$

we are often interested in the linear combinations of the rows (points) or the columns (attributes). For instance, different linear combinations of the original d attributes yield new derived attributes, which play a key role in feature extraction and dimensionality reduction.

Given any set of vectors v1, v2, · · · , vk in an m-dimensional vector space Rm, their linear combination is given as

c1v1 + c2v2 + · · · + ckvk

where ci ∈ R are scalar values. The set of all possible linear combinations of the k vectors is called the span, denoted as span(v1, · · · , vk), which is itself a vector space being a subspace of Rm. If span(v1, · · · , vk) = Rm, then we say that v1, · · · , vk is a spanning set for Rm.

Row and Column Space  There are several interesting vector spaces associated with the data matrix D, two of which are the column space and row space of D. The column space of D, denoted col(D), is the set of all linear combinations of the d column vectors or attributes Xj ∈ Rn, i.e.,

col(D) = span(X1, X2, · · · , Xd)

By definition col(D) is a subspace of Rn. The row space of D, denoted row(D), is the set of all linear combinations of the n row vectors or points xi ∈ Rd, i.e.,

row(D) = span(x1, x2, · · · , xn)

By definition row(D) is a subspace of Rd. Note also that the row space of D is the column space of DT

row(D) = col(DT)

Linear Independence  We say that the vectors v1, · · · , vk are linearly dependent if at least one vector can be written as a linear combination of the others. Alternatively, the k vectors are linearly dependent if there are scalars c1, c2, · · · , ck, at least one of which is not zero, such that

c1v1 + c2v2 + · · · + ckvk = 0

On the other hand, v1, · · · , vk are linearly independent if and only if

c1v1 + c2v2 + · · · + ckvk = 0 implies c1 = c2 = · · · = ck = 0

Simply put, a set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.

Dimension and Rank  Let S be a subspace of Rm. A basis for S is a set of vectors in S, say v1, · · · , vk, that are linearly independent and span S, i.e., span(v1, · · · , vk) = S. In fact, a basis is a minimal spanning set. If the vectors in the basis are pair-wise orthogonal, they are said to form an orthogonal basis for S. If, in addition, they are also normalized to be unit vectors, then they make up an orthonormal basis for S. For instance, the standard basis for Rm is an orthonormal basis consisting of the vectors

$$
\mathbf{e}_1 = \begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix} \quad
\mathbf{e}_2 = \begin{pmatrix}0\\1\\\vdots\\0\end{pmatrix} \quad \cdots \quad
\mathbf{e}_m = \begin{pmatrix}0\\0\\\vdots\\1\end{pmatrix}
$$

Any two bases for S must have the same number of vectors, and the number of vectors in a basis for S is called the dimension of S, denoted as dim(S). Since S is a subspace of Rm, we must have dim(S) ≤ m.

It is a remarkable fact that, for any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix D ∈ Rn×d, we have rank(D) ≤ min(n, d), which follows from the fact that the column space can have dimension at most d, and the row space can have dimension at most n. Thus, even though the data points are ostensibly in a d-dimensional attribute space (the extrinsic dimensionality), if rank(D) < d, then the data points reside in a lower dimensional subspace of Rd, and in this case rank(D) gives an indication about the intrinsic dimensionality of the data. In fact, with dimensionality reduction methods it is often possible to approximate D ∈ Rn×d with a derived data matrix D′ ∈ Rn×k, which has much lower dimensionality, i.e., k ≪ d. In this case k may reflect the "true" intrinsic dimensionality of the data.
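The rank of a data matrix can be checked directly with NumPy; the sketch below (illustrative, not from the text) uses a made-up matrix whose third column is a linear combination of the first two, so its intrinsic dimensionality is lower than d:

```python
import numpy as np

# The third attribute satisfies X3 = X1 + 2*X2, so the points lie in a 2-dimensional subspace of R^3.
D = np.array([[1.0, 2.0, 5.0],
              [2.0, 0.0, 2.0],
              [3.0, 1.0, 5.0],
              [0.0, 4.0, 8.0]])

print(np.linalg.matrix_rank(D))   # 2, even though the extrinsic dimensionality is d = 3
```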

Example 1.5: The line ℓ in Figure 1.5 is given as ℓ = span((−2.15, 2.75)^T), with dim(ℓ) = 1. After normalization, we obtain the orthonormal basis for ℓ as the unit vector

$$
\frac{1}{\sqrt{12.19}}\begin{pmatrix}-2.15\\2.75\end{pmatrix} = \begin{pmatrix}-0.615\\0.788\end{pmatrix}
$$

    1.4 Data: Probabilistic View

The probabilistic view of the data assumes that each numeric attribute X is a random variable, defined as a function that assigns a real number to each outcome of an experiment (i.e., some process of observation or measurement). Formally, X is a function X : O → R, where O, the domain of X, is the set of all possible outcomes of the experiment, also called the sample space, and R, the range of X, is the set of real numbers. If the outcomes are numeric, and represent the observed values of the random variable, then X : O → O is simply the identity function: X(v) = v for all v ∈ O. The distinction between the outcomes and the value of the random variable is important, since we may want to treat the observed values differently depending on the context, as seen in Example 1.6.

A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, whereas X is called a continuous random variable if it can take on any value in its range.

5.9  6.9  6.6  4.6  6.0  4.7  6.5  5.8  6.7  6.7  5.1  5.1  5.7  6.1  4.9
5.0  5.0  5.7  5.0  7.2  5.9  6.5  5.7  5.5  4.9  5.0  5.5  4.6  7.2  6.8
5.4  5.0  5.7  5.8  5.1  5.6  5.8  5.1  6.3  6.3  5.6  6.1  6.8  7.3  5.6
4.8  7.1  5.7  5.3  5.7  5.7  5.6  4.4  6.3  5.4  6.3  6.9  7.7  6.1  5.6
6.1  6.4  5.0  5.1  5.6  5.4  5.8  4.9  4.6  5.2  7.9  7.7  6.1  5.5  4.6
4.7  4.4  6.2  4.8  6.0  6.2  5.0  6.4  6.3  6.7  5.0  5.9  6.7  5.4  6.3
4.8  4.4  6.4  6.2  6.0  7.4  4.9  7.0  5.5  6.3  6.8  6.1  6.5  6.7  6.7
4.8  4.9  6.9  4.5  4.3  5.2  5.0  6.4  5.2  5.8  5.5  7.6  6.3  6.4  6.3
5.8  5.0  6.7  6.0  5.1  4.8  5.7  5.1  6.6  6.4  5.2  6.4  7.7  5.8  4.9
5.4  5.1  6.0  6.5  5.5  7.2  6.9  6.2  6.5  6.0  5.4  5.5  6.7  7.7  5.1

Table 1.2: Iris Dataset: sepal length (in centimeters)

Example 1.6: Consider the sepal length attribute (X1) for the Iris dataset in Table 1.1. All n = 150 values of this attribute are shown in Table 1.2, which lie in the range [4.3, 7.9] with centimeters as the unit of measurement. Let us assume that these constitute the set of all possible outcomes O.

By default, we can consider the attribute X1 to be a continuous random variable, given as the identity function X1(v) = v, since the outcomes (sepal length values) are all numeric.

On the other hand, if we want to distinguish between Iris flowers with short and long sepal lengths, with long being, say, a length of 7cm or more, we can define a discrete random variable A as follows

$$
A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \geq 7 \end{cases}
$$

In this case the domain of A is [4.3, 7.9], and its range is {0, 1}. Thus, A assumes non-zero probability only at the discrete values 0 and 1.
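An illustrative NumPy sketch of this discretization; the array below holds only the first few values from Table 1.2 to keep the example short (on the full 150 values the estimate of P(A = 1) is 13/150 ≈ 0.087, as used in Example 1.7):

```python
import numpy as np

# First few sepal length values from Table 1.2 (the full data has n = 150 values).
sepal_length = np.array([5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8, 6.7, 6.7])

A = (sepal_length >= 7.0).astype(int)   # discrete random variable A: 1 = long sepal, 0 = otherwise
p_hat = A.mean()                        # empirical estimate of P(A = 1)
print(A, p_hat)
```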

Probability Mass Function  If X is discrete, the probability mass function of X is defined as

f(x) = P(X = x) for all x ∈ R

In other words, the function f gives the probability P(X = x) that the random variable X has the exact value x. The name "probability mass function" intuitively conveys the fact that the probability is concentrated or massed at only discrete values in the range of X, and is zero for all other values. f must also obey the basic rules of probability. That is, f must be non-negative

f(x) ≥ 0

and the sum of all probabilities should add to 1

$$
\sum_{x} f(x) = 1
$$

Example 1.7 (Bernoulli and Binomial Distribution): In Example 1.6, A was defined as a discrete random variable representing long sepal length. From the sepal length data in Table 1.2 we find that only 13 Irises have a sepal length of at least 7cm. We can thus estimate the probability mass function of A as follows

$$
f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p
$$

and

$$
f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p
$$

In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1], which denotes the probability of a success, i.e., the probability of picking an Iris with a long sepal length at random from the set of all points. On the other hand, 1 − p is the probability of a failure, i.e., of not picking an Iris with long sepal length.

Let us consider another discrete random variable B, denoting the number of Irises with long sepal length in m independent Bernoulli trials with probability of success p. In this case, B takes on the discrete values [0, m], and its probability mass function is given by the Binomial distribution

$$
f(k) = P(B = k) = \binom{m}{k} p^k (1-p)^{m-k}
$$

The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking k long sepal length Irises out of the m trials. For each selection of k long sepal length Irises, the total probability of the k successes is p^k, and the total probability of the m − k failures is (1 − p)^{m−k}. For example, since p = 0.087 from above, the probability of observing exactly k = 2 Irises with long sepal length in m = 10 trials is given as

$$
f(2) = P(B = 2) = \binom{10}{2}(0.087)^2(0.913)^8 = 0.164
$$

Figure 1.6 shows the full probability mass function for different values of k for m = 10. Since p is quite small, the probability of k successes in so few trials falls off rapidly as k increases, becoming practically zero for values of k ≥ 6.
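This value is easy to reproduce with Python's standard library (an illustrative check, not part of the original text):

```python
from math import comb

m, p, k = 10, 0.087, 2
f_k = comb(m, k) * p**k * (1 - p)**(m - k)   # binomial probability mass at k
print(round(f_k, 3))                          # 0.164
```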

[Figure 1.6: Binomial Distribution: Probability Mass Function (m = 10, p = 0.087)]

Probability Density Function  If X is continuous, its range is the entire set of real numbers R. The probability of any specific value x is only one out of the infinitely many possible values in the range of X, which means that P(X = x) = 0 for all x ∈ R. However, this does not mean that the value x is impossible, since in that case we would conclude that all values are impossible! What it means is that the probability mass is spread so thinly over the range of values, that it can be measured only over intervals [a, b] ⊂ R, rather than at specific points. Thus, instead of the probability mass function, we define the probability density function, which specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R

$$
P\big(X \in [a,b]\big) = \int_{a}^{b} f(x)\,dx
$$

As before, the density function f must satisfy the basic laws of probability

f(x) ≥ 0, for all x ∈ R

and

$$
\int_{-\infty}^{\infty} f(x)\,dx = 1
$$

We can get an intuitive understanding of the density function f by considering the probability density over a small interval of width 2ε > 0, centered at x, namely [x − ε, x + ε]

$$
P\big(X \in [x-\epsilon, x+\epsilon]\big) = \int_{x-\epsilon}^{x+\epsilon} f(x)\,dx \simeq 2\epsilon \cdot f(x)
$$

$$
f(x) \simeq \frac{P\big(X \in [x-\epsilon, x+\epsilon]\big)}{2\epsilon} \tag{1.8}
$$

f(x) thus gives the probability density at x, given as the ratio of the probability mass to the width of the interval, i.e., the probability mass per unit distance. Thus, it is important to note that P(X = x) ≠ f(x).

Even though the probability density function f(x) does not specify the probability P(X = x), it can be used to obtain the relative probability of one value x1 over another x2, since for a given ε > 0, by (1.8), we have

$$
\frac{P(X \in [x_1-\epsilon, x_1+\epsilon])}{P(X \in [x_2-\epsilon, x_2+\epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \tag{1.9}
$$

Thus, if f(x1) is larger than f(x2), then values of X close to x1 are more probable than values close to x2, and vice versa.

Example 1.8 (Normal Distribution): Consider again the sepal length values from the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a Gaussian or normal density function, given as

$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}
$$

[Figure 1.7: Normal Distribution: Probability Density Function (µ = 5.84, σ² = 0.681)]

    There are two parameters of the normal density distribution, namely, µ, whichrepresents the mean value, and σ2, which represents the variance of the values(these parameters will be discussed in Chapter 2). Figure 1.7 shows the character-istic “bell” shape plot of the normal distribution. The parameters, µ = 5.84 andσ2 = 0.681, were estimated directly from the data for sepal length in Table 1.2.

Whereas f(x = µ) = f(5.84) = 1/√(2π · 0.681) · exp{0} = 0.483, we emphasize that the probability of observing X = µ is zero, i.e., P(X = µ) = 0. Thus, P(X = x) is not given by f(x); rather, P(X = x) is given as the area under the curve for an infinitesimally small interval [x − ǫ, x + ǫ] centered at x, with ǫ > 0. Figure 1.7 illustrates this with the shaded region centered at µ = 5.84. From (1.8), we have

$$P(X = \mu) \simeq 2\epsilon \cdot f(\mu) = 2\epsilon \cdot 0.483 = 0.967\epsilon$$

As ǫ → 0, we get P(X = µ) → 0. However, based on (1.9) we can claim that the probability of observing values close to the mean value µ = 5.84 is about 2.69 times the probability of observing values close to x = 7, since

$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$
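The numbers quoted in this example are easy to verify; a minimal sketch, reusing the estimated parameters µ = 5.84 and σ² = 0.681:

```python
from math import exp, pi, sqrt

mu, var = 5.84, 0.681  # estimates for sepal length

def f(x):
    """Normal probability density with the parameters above."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

print(f(mu))           # ~0.483, the density at the mean
print(f(7.0))          # ~0.18
print(f(mu) / f(7.0))  # ~2.69
```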


Cumulative Distribution Function For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) F : R → [0, 1], which gives the probability of observing a value at most some given value x

$$F(x) = P(X \leq x) \quad \text{for all } -\infty < x < \infty$$


Figure 1.9: Cumulative Distribution Function for the Normal Distribution; the marked point is (µ, F(µ)) = (5.84, 0.5)

Figure 1.9 shows the cumulative distribution function for the normal density function shown in Figure 1.7. As expected, for a continuous random variable, the CDF is also continuous, and non-decreasing. Since the normal distribution is symmetric about the mean, we have F(µ) = P(X ≤ µ) = 0.5.
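For the normal distribution the CDF has the closed form F(x) = ½(1 + erf((x − µ)/(σ√2))); a small sketch using Python's math.erf, with the parameters from Example 1.8:

```python
from math import erf, sqrt

mu, var = 5.84, 0.681
sigma = sqrt(var)

def F(x):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

print(F(mu))            # 0.5, by symmetry about the mean
print(F(7.0) - F(5.0))  # P(5 <= X <= 7), the area under the density over [5, 7]
```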

    1.4.1 Bivariate Random Variables

Instead of considering each attribute as a random variable, we can also perform pair-wise analysis by considering a pair of attributes, X1 and X2, as a bivariate random variable

$$\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

X : O → R² is a function that assigns to each outcome in the sample space a pair of real numbers, i.e., a 2-dimensional vector (x1, x2)^T ∈ R². As in the univariate case, if the outcomes are numeric, then the default is to assume X to be the identity function.

Joint Probability Mass Function If X1 and X2 are both discrete random variables then X has a joint probability mass function given as follows

$$f(\mathbf{x}) = f(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = P(\mathbf{X} = \mathbf{x})$$

f must satisfy the following two conditions

$$f(\mathbf{x}) = f(x_1, x_2) \geq 0 \ \text{ for all } -\infty < x_1, x_2 < \infty \qquad \text{and} \qquad \sum_{x_1} \sum_{x_2} f(x_1, x_2) = 1$$


Figure 1.10: Joint Probability Mass Function: X1 (long sepal length), X2 (long sepal width)

Example 1.10 (Bivariate Distributions): Consider the sepal length and sepal width attributes in the Iris dataset, plotted in Figure 1.2. Let A denote the Bernoulli random variable corresponding to long sepal length (at least 7 cm), as defined in Example 1.7.

Define another Bernoulli random variable B corresponding to long sepal width, say, at least 3.5 cm. Let X = (A, B)^T be the discrete bivariate random variable; then the joint probability mass function of X can be estimated from the data as follows

$$f(0, 0) = P(A = 0, B = 0) = \frac{116}{150} = 0.773$$
$$f(0, 1) = P(A = 0, B = 1) = \frac{21}{150} = 0.140$$
$$f(1, 0) = P(A = 1, B = 0) = \frac{10}{150} = 0.067$$
$$f(1, 1) = P(A = 1, B = 1) = \frac{3}{150} = 0.020$$
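These estimates are simply relative frequencies of the four (A, B) outcomes; a sketch of the computation, assuming scikit-learn's copy of the Iris data (counts may differ slightly across copies of the dataset):

```python
import numpy as np
from sklearn.datasets import load_iris  # assumes scikit-learn is installed

data = load_iris().data  # 150 rows; column 0: sepal length, column 1: sepal width
A = (data[:, 0] >= 7.0)  # long sepal length (at least 7 cm)
B = (data[:, 1] >= 3.5)  # long sepal width (at least 3.5 cm)
n = len(data)

for a in (0, 1):
    for b in (0, 1):
        f_ab = np.sum((A == a) & (B == b)) / n  # relative frequency estimate of f(a, b)
        print(f"f({a}, {b}) = {f_ab:.3f}")
```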

Figure 1.10 shows a plot of this probability mass function.

Treating attributes X1 and X2 in the Iris dataset (see Table 1.1) as continuous random variables, we can define a continuous bivariate random variable X = (X1, X2)^T. Assuming that X follows a bivariate normal distribution, its joint probability density function is given as

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left\{ -\frac{(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}{2} \right\}$$


Here µ and Σ are the parameters of the bivariate normal distribution, representing the 2-dimensional mean vector and covariance matrix, which will be discussed in detail in Chapter 2. Further, |Σ| denotes the determinant of Σ. The plot of the bivariate normal density is given in Figure 1.11, with mean

$$\boldsymbol{\mu} = (5.843, 3.054)^T$$

and covariance matrix

$$\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$

It is important to emphasize that the function f(x) specifies only the probability density at x, and f(x) ≠ P(X = x). As before, we have P(X = x) = 0.

Figure 1.11: Bivariate Normal Density: µ = (5.843, 3.054)^T (black point)
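A sketch of how the bivariate normal density above can be evaluated numerically with NumPy, using the mean vector and covariance matrix quoted in Example 1.10:

```python
import numpy as np

mu = np.array([5.843, 3.054])
Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])

def bivariate_normal_pdf(x, mu, Sigma):
    """Joint density f(x | mu, Sigma) of the 2-dimensional normal distribution."""
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff         # (x - mu)^T Sigma^{-1} (x - mu)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-maha / 2.0) / norm)

print(bivariate_normal_pdf(mu, mu, Sigma))                    # density peaks at the mean
print(bivariate_normal_pdf(np.array([7.0, 3.5]), mu, Sigma))  # much lower away from it
```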

Joint Cumulative Distribution Function The joint cumulative distribution function for two random variables X1 and X2 is defined as the function F such that for all values x1, x2 ∈ (−∞, ∞),

$$F(\mathbf{x}) = F(x_1, x_2) = P(X_1 \leq x_1 \text{ and } X_2 \leq x_2) = P(\mathbf{X} \leq \mathbf{x})$$

Statistical Independence Two random variables X1 and X2 are said to be (statistically) independent if, for every W1 ⊂ R and W2 ⊂ R, we have

$$P(X_1 \in W_1 \text{ and } X_2 \in W_2) = P(X_1 \in W_1) \cdot P(X_2 \in W_2)$$


Furthermore, if X1 and X2 are independent, then the following two conditions are also satisfied

$$F(\mathbf{x}) = F(x_1, x_2) = F_1(x_1) \cdot F_2(x_2)$$
$$f(\mathbf{x}) = f(x_1, x_2) = f_1(x_1) \cdot f_2(x_2)$$

where Fi is the cumulative distribution function, and fi is the probability mass or density function for random variable Xi.
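As a quick check, the joint mass function estimated in Example 1.10 does not factor into the product of its marginals, so A and B are not independent; a small sketch using those estimates:

```python
# Joint mass function estimated in Example 1.10 (counts out of n = 150 flowers).
joint = {(0, 0): 116/150, (0, 1): 21/150, (1, 0): 10/150, (1, 1): 3/150}

# Marginals f1(a) = sum_b f(a, b) and f2(b) = sum_a f(a, b).
f1 = {a: sum(p for (x, _), p in joint.items() if x == a) for a in (0, 1)}
f2 = {b: sum(p for (_, y), p in joint.items() if y == b) for b in (0, 1)}

for (a, b), p in joint.items():
    # Independence would require f(a, b) = f1(a) * f2(b); e.g. f(1, 1) = 0.020 vs ~0.014.
    print(f"f({a},{b}) = {p:.3f}   f1(a) * f2(b) = {f1[a] * f2[b]:.3f}")
```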

    1.4.2 Multivariate Random Variable

A d-dimensional multivariate random variable X = (X1, X2, ..., Xd)^T, also called a vector random variable, is defined as a function that assigns a vector of real numbers to each outcome in the sample space, i.e., X : O → R^d. The range of X can be denoted as a vector x = (x1, x2, ..., xd)^T. In case all Xj are numeric, then X is by default assumed to be the identity function. In other words, if all attributes are numeric, we can treat each outcome in the sample space (i.e., each point in the data matrix) as a vector random variable. On the other hand, if the attributes are not all numeric, then X maps the outcomes to numeric vectors in its range.

If all Xj are discrete, then X is jointly discrete and its joint probability mass function f is given as

$$f(\mathbf{x}) = P(\mathbf{X} = \mathbf{x})$$
$$f(x_1, x_2, \ldots, x_d) = P(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d)$$

If all Xj are continuous, then X is jointly continuous and its joint probability density function is given as

$$P(\mathbf{X} \in W) = \int \cdots \int_{\mathbf{x} \in W} f(\mathbf{x})\,d\mathbf{x}$$
$$P\bigl((X_1, X_2, \ldots, X_d)^T \in W\bigr) = \int \cdots \int_{(x_1, x_2, \ldots, x_d)^T \in W} f(x_1, x_2, \ldots, x_d)\,dx_1\,dx_2 \cdots dx_d$$

for any d-dimensional region W ⊆ R^d. The laws of probability must be obeyed as usual, i.e., f(x) ≥ 0 and the sum of f over all x in the range of X must be 1. The joint cumulative distribution function of X = (X1, ..., Xd)^T is given as

$$F(\mathbf{x}) = P(\mathbf{X} \leq \mathbf{x})$$
$$F(x_1, x_2, \ldots, x_d) = P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_d \leq x_d)$$

for every point x ∈ R^d.


We say that X1, X2, ..., Xd are independent random variables if and only if, for every region Wi ⊂ R, we have

$$P(X_1 \in W_1 \text{ and } X_2 \in W_2 \cdots \text{ and } X_d \in W_d) = P(X_1 \in W_1) \cdot P(X_2 \in W_2) \cdot \ldots \cdot P(X_d \in W_d) \tag{1.10}$$

If X1, X2, ..., Xd are independent then the following conditions are also satisfied

$$F(\mathbf{x}) = F(x_1, \ldots, x_d) = F_1(x_1) \cdot F_2(x_2) \cdot \ldots \cdot F_d(x_d)$$
$$f(\mathbf{x}) = f(x_1, \ldots, x_d) = f_1(x_1) \cdot f_2(x_2) \cdot \ldots \cdot f_d(x_d) \tag{1.11}$$

where Fi is the cumulative distribution function, and fi is the probability mass or density function for random variable Xi.

    1.4.3 Random Sample and Statistics

The probability mass or density function of a random variable X may follow some known form, or as is often the case in data analysis, it may be unknown. When the probability function is not known, it may still be convenient to assume that the values follow some known distribution, based on the characteristics of the data. However, even in this case, the parameters of the distribution may still be unknown. Thus, in general, either the parameters, or the entire distribution, may have to be estimated from the data.

In statistics, the word population is used to refer to the set or universe of all entities under study. Usually we are interested in certain characteristics or parameters of the entire population (e.g., the mean age of all computer science students in the US). However, looking at the entire population may not be feasible or may be too expensive. Instead, we try to make inferences about the population parameters by drawing a random sample from the population, and by computing appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.

Univariate Sample Given a random variable X, a random sample of size n from X is defined as a set of n independent and identically distributed (IID) random variables S1, S2, ..., Sn, i.e., all of the Si's are statistically independent of each other, and follow the same probability mass or density function as X.

If we treat attribute X as a random variable, then each of the observed values of X, namely, xi (1 ≤ i ≤ n), are themselves treated as identity random variables, and the observed data is assumed to be a random sample drawn from X. That is, all xi are considered to be mutually independent and identically distributed as X. By (1.11) their joint probability function is given as

$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$$


    where fX is the probability mass or density function for X.
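In practice this product is usually evaluated on the log scale to avoid numerical underflow; a minimal sketch, assuming a normal f_X with the parameters estimated in Example 1.8, applied to a few of the sepal length values quoted in Example 1.11:

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu=5.84, var=0.681):
    """Assumed form of f_X: the normal density estimated in Example 1.8."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def log_likelihood(sample, pdf):
    """log f(x1, ..., xn) = sum_i log f_X(x_i), under the IID assumption."""
    return sum(log(pdf(x)) for x in sample)

# A few of the sepal length values quoted in Example 1.11, used only as an illustration.
sample = [5.9, 6.9, 7.7, 5.1]
print(log_likelihood(sample, normal_pdf))
```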

Multivariate Sample For multivariate parameter estimation, the n data points xi (with 1 ≤ i ≤ n) constitute a d-dimensional multivariate random sample drawn from the vector random variable X = (X1, X2, ..., Xd). That is, xi are assumed to be independent and identically distributed, and thus their joint distribution is given as

$$f(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) = \prod_{i=1}^{n} f_{\mathbf{X}}(\mathbf{x}_i) \tag{1.12}$$

where fX is the probability mass or density function for X.

Estimating the parameters of a multivariate joint probability distribution is usually difficult and computationally intensive. One common simplifying assumption that is typically made is that the d attributes X1, X2, ..., Xd are statistically independent. However, we do not assume that they are identically distributed, because that is almost never justified. Under the attribute independence assumption, (1.12) can be rewritten as

$$f(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) = \prod_{i=1}^{n} f(\mathbf{x}_i) = \prod_{i=1}^{n} \prod_{j=1}^{d} f(x_{ij})$$
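A sketch of this factored likelihood, assuming (purely for illustration) a univariate normal density for each attribute, with per-attribute parameters taken from the estimates quoted earlier in this section:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def log_likelihood_independent(D, mus, variances):
    """log f(x1, ..., xn) = sum_i sum_j log f_j(x_ij), with a separate
    (here normal) density f_j assumed for each attribute j."""
    total = 0.0
    for j in range(D.shape[1]):
        total += np.sum(np.log(normal_pdf(D[:, j], mus[j], variances[j])))
    return total

# Tiny illustrative 3 x 2 data matrix (sepal length, sepal width).
D = np.array([[5.9, 3.0],
              [6.9, 3.1],
              [5.1, 3.5]])
# Per-attribute parameters reuse the mean and variance estimates quoted earlier.
print(log_likelihood_independent(D, mus=[5.843, 3.054], variances=[0.681, 0.187]))
```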

Statistic We can estimate a parameter of the population by defining an appropriate sample statistic, which is defined as a function of the sample. More precisely, let {Si}_{i=1}^m denote the random sample of size m drawn from a (multivariate) random variable X. A statistic θ̂ is a function θ̂ : (S1, S2, ..., Sm) → R. The statistic is an estimate of the corresponding population parameter θ. As such, the statistic θ̂ is itself a random variable. If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter. In Chapter 2 we will study different estimators for population parameters that reflect the location (or centrality) and dispersion of values.

Example 1.11 (Sample Mean): Consider attribute sepal length (X1) in the Iris dataset, whose values are shown in Table 1.2. Assume that the mean value of X1 is not known. Let us assume that the observed values {xi}_{i=1}^n constitute a random sample drawn from X1.

The sample mean is a statistic, defined as the average

$$\hat\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$


Plugging in values from Table 1.2, we obtain

$$\hat\mu = \frac{1}{150}(5.9 + 6.9 + \cdots + 7.7 + 5.1) = \frac{876.5}{150} = 5.84$$

The value µ̂ = 5.84 is a point estimate for the unknown population parameter µ, the (true) mean value of variable X1.
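The same point estimate can be reproduced in a couple of lines, assuming scikit-learn's copy of the Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris  # assumes scikit-learn is installed

X1 = load_iris().data[:, 0]      # sepal length values, n = 150
mu_hat = float(np.mean(X1))      # sample mean: a point estimate of the population mean
print(f"mu_hat = {mu_hat:.2f}")  # ~5.84
```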

    1.5 Data Mining

Data mining comprises the core algorithms that enable one to gain fundamental insights and knowledge from massive data. It is an interdisciplinary field merging concepts from allied areas like database systems, statistics, machine learning, and pattern recognition. In fact, data mining is part of a larger knowledge discovery process, which includes pre-processing tasks like data extraction, data cleaning, data fusion, data reduction and feature construction, as well as post-processing steps like pattern and model interpretation, hypothesis confirmation and generation, and so on. This knowledge discovery and data mining process tends to be highly iterative and interactive.

The algebraic, geometric and probabilistic viewpoints of data play a key role in data mining. Given a dataset of n points in a d-dimensional space, the fundamental analysis and mining tasks covered in this book include exploratory data analysis, frequent pattern discovery, data clustering, and classification models, which are described next.

    1.5.1 Exploratory Data Analysis

Exploratory data analysis aims to explore the numeric and categorical attributes of the data individually or jointly to extract key characteristics of the data sample via statistics that give information about the centrality, dispersion, and so on. Moving away from the IID assumption among the data points, it is also important to consider the statistics that deal with the data as a graph, where the nodes denote the points and weighted edges denote the connections between points. This enables one to extract important topological attributes that give insights into the structure and models of networks and graphs. Kernel methods provide a fundamental connection between the independent point-wise view of data, and the viewpoint that deals with pair-wise similarities between points. Many of the exploratory data analysis and mining tasks can be cast as kernel problems via the kernel trick, i.e., by showing that the operations involve only dot-products between pairs of points. However, kernel methods also enable us to perform non-linear analysis by using familiar linear algebraic and statistical methods in high-dimensional spaces comprising "non-linear" dimensions. They further allow us to mine complex data as long as we have a way to measure the pair-wise similarity between two abstract objects. Given that data mining deals with massive datasets with thousands of attributes and millions of points, another goal of exploratory analysis is to reduce the amount of data to be mined. For instance, feature selection and dimensionality reduction methods are used to select the most important dimensions, discretization methods can be used to reduce the number of values of an attribute, data sampling methods can be used to reduce the data size, and so on.

Part I begins with basic statistical analysis of univariate and multivariate numeric data in Chapter 2. We describe measures of central tendency such as mean, median and mode, and then we consider measures of dispersion such as range, variance, and covariance. We emphasize the dual algebraic and probabilistic views, and highlight the geometric interpretation of the various measures. We especially focus on the multivariate normal distribution, which is widely used as the default parametric model for data in both classification and clustering. In Chapter 3 we show how categorical data can be modeled via the multivariate binomial and the multinomial distributions. We describe the contingency table analysis approach to test for dependence between categorical attributes. Next, in Chapter 4 we show how to analyze graph data in terms of the topological structure, with special focus on various graph centrality measures such as closeness, betweenness, prestige, pagerank, and so on. We also study basic topological properties of real-world networks such as the small world property, which states that real graphs have small average path length between pairs of points, the clustering effect, which indicates local clustering around nodes, and the scale-free property, which manifests itself in a power-law degree distribution. We describe models that can explain some of these characteristics of real-world graphs; these include the Erdös-Rényi random graph model, the Watts-Strogatz model, and the Barabási-Albert model. Kernel methods are then introduced in Chapter 5, which provide new insights and connections between linear, non-linear, graph and complex data mining tasks. We briefly highlight the theory behind kernel functions, with the key concept being that a positive semi-definite kernel corresponds to a dot product in some high-dimensional feature space, and thus we can use familiar numeric analysis methods for non-linear or complex object analysis provided we can compute the pair-wise kernel matrix of similarities between point pairs. We describe various kernels for numeric or vector data, as well as sequence and graph data. In Chapter 6 we consider the peculiarities of high-dimensional space, colorfully referred to as the curse of dimensionality. In particular, we study the scattering effect, i.e., the fact that data points lie along the surface and corners in high dimensions, with the "center" of the space being virtually empty. We show the proliferation of orthogonal axes and also the behavior of the multivariate normal distribution in high dimensions. Finally, in Chapter 7 we describe the widely used dimensionality reduction methods like principal component analysis (PCA) and singular value decomposition (SVD). PCA finds the optimal k-dimensional subspace that captures most of the variance in the data. We also show how kernel PCA can be used to find non-linear directions that capture the most variance. We conclude with the powerful SVD spectral decomposition method, studying its geometry, and its relationship to PCA.

    1.5.2 Frequent Pattern Mining

Frequent pattern mining refers to the task of extracting informative and useful patterns in massive and complex datasets. Patterns comprise sets of co-occurring attribute values, called itemsets, or more complex patterns, such as sequences, which consider explicit precedence relationships (either positional or temporal), and graphs, which consider arbitrary relationships between points. The key goal is to discover hidden trends and behaviors in the data to better understand the interactions among the points and attributes.

Part II begins by presenting efficient algorithms for frequent itemset mining in Chapter 8. The key methods include the level-wise Apriori algorithm, the "vertical" intersection based Eclat algorithm, and the frequent pattern tree and projection based FPGrowth method. Typically the mining process results in too many frequent patterns that can be hard to interpret. In Chapter 9 we consider approaches to summarize the mined patterns; these include maximal (GenMax algorithm), closed (Charm algorithm) and non-derivable itemsets. We describe effective methods for frequent sequence mining in Chapter 10, which include the level-wise GSP method, the vertical SPADE algorithm, and the projection-based PrefixSpan approach. We also describe how consecutive subsequences, also called substrings, can be mined much more efficiently via Ukkonen's linear time and space suffix tree method. Moving beyond sequences to arbitrary graphs, we describe the popular and efficient gSpan algorithm for frequent subgraph mining in Chapter 11. Graph mining involves two key steps, namely graph isomorphism checks to eliminate duplicate patterns during candidate enumeration, and subgraph isomorphism checks during frequency computation. These operations can be performed in polynomial time for sets and sequences, but for graphs it is known that subgraph isomorphism is NP-hard, and thus there is no polynomial time method possible unless P = NP. The gSpan method proposes a new canonical code and a systematic approach to subgraph extension, which allow it to efficiently detect duplicates and to perform several subgraph isomorphism checks much more efficiently than performing them individually. Given that pattern mining methods generate many output results, it is very important to assess the mined patterns. We discuss strategies for assessing both the frequent patterns and rules that can be mined from them in Chapter 12, especially emphasizing methods for significance testing.

    1.5.3 Clustering

Clustering is the task of partitioning the points into natural groups called clusters, such that points within a group are very similar, whereas points across clusters are as dissimilar as possible. Depending on the data and desired cluster characteristics, there are different types of clustering paradigms such as representative-based, hierarchical, density-based, graph-based and spectral clustering.

Part III starts with representative-based clustering methods (Chapter 13), which include the K-means and Expectation-Maximization (EM) algorithms. K-means is a greedy algorithm that minimizes the squared error of points from their respective cluster means, and it performs hard clustering, that is, each point is assigned to only one cluster. We also show how kernel K-means can be used for non-linear clusters. EM generalizes K-means by modeling the data as a mixture of normal distributions, and it finds the cluster parameters (the mean and covariance matrix) by maximizing the likelihood of the data. It is a soft clustering approach, i.e., instead of making a hard assignment, it returns the probability of each cluster for each of the points. In Chapter 14 we consider various agglomerative hierarchical clustering methods, which start from each point in its own cluster, and successively merge (or agglomerate) pairs of clusters until the desired number of clusters has been found. We consider various cluster proximity measures which distinguish the different hierarchical methods. There are some datasets where the points from different clusters may in fact be closer in distance than points from the same cluster; this usually happens when the clusters are non-convex in shape. Density-based clustering methods described in Chapter 15 use the density or connectedness properties to find such non-convex clusters. The two main methods are DBSCAN and its generalization DENCLUE, which is based on kernel density estimation. We consider graph clustering methods in Chapter 16, which are typically based on spectral analysis of graph data. Graph clustering can be considered as an optimization problem over a k-way cut in a graph; different objectives can be cast as spectral decomposition of different graph matrices, such as the (normalized) adjacency matrix, Laplacian matrix, and so on, derived from the original graph data or from the kernel matrix. Finally, given the proliferation of different types of clustering methods, it is important to assess the mined clusters as to how good they are in capturing the natural groups in data. In Chapter 17, we describe various clustering validation and evaluation strategies, spanning external and internal measures to compare a clustering with the ground-truth if it is available, or to compare two clusterings. We also highlight methods for clustering stability, i.e., the sensitivity of the clustering to data perturbation, and clustering tendency, i.e., the clusterability of the data. We also consider methods to choose the parameter k, which is the user-specified value for the number of clusters to discover.

    1.5.4 Classification

The classification task is to predict the label or class for a given unlabeled point. Formally, a classifier is a model or function M that predicts the class label ŷ for a given input example x, that is, ŷ = M(x), where ŷ ∈ {c1, c2, ..., ck} is the predicted class label (a categorical attribute value). To build the model we require a set of points with their correct class labels, which is called a training set. After learning the model M, we can automatically predict the class for any new point. Many different types of classification models have been proposed such as decision trees, probabilistic classifiers, support vector machines, and so on.

Part IV starts with the powerful Bayes classifier, which is an example of the probabilistic classification approach (Chapter 18). It uses the Bayes theorem to predict the class as the one that maximizes the posterior probability P(ci|x). The main task is to estimate the joint probability density function f(x) for each class, which is modeled via a multivariate normal distribution. One limitation of the Bayes approach is the number of parameters to be estimated, which scales as O(d²). The naive Bayes classifier makes the simplifying assumption that all attributes are independent, which requires the estimation of only O(d) parameters. It is, however, surprisingly effective for many datasets. In Chapter 19 we consider the popular decision tree classifier, one of whose strengths is that it yields models that are easier to understand compared to other methods. A decision tree recursively partitions the data space into "pure" regions that contain data points from only one class, with relatively few exceptions. Next, in Chapter 20, we consider the task of finding an optimal direction that separates the points from two classes via linear discriminant analysis. It can be considered as a dimensionality reduction method that also takes the class labels into account, unlike PCA, which does not consider the class attribute. We also describe the generalization of linear to kernel discriminant analysis, which allows us to find non-linear directions via the kernel trick. In Chapter 21 we describe the support vector machine (SVM) approach in detail, which is one of the most effective classifiers for many different problem domains. The goal of SVMs is to find the optimal hyperplane that maximizes the margin between the classes. Via the kernel trick, SVMs can be used to find non-linear boundaries, which nevertheless correspond to some linear hyperplane in some high-dimensional "non-linear" space. One of the important tasks in classification is to assess how good the models are. We conclude this part with Chapter 22, which presents the various methodologies for assessing classification models. We define various classification performance measures including ROC analysis. We then describe the bootstrap and cross-validation approaches for classifier evaluation. Finally, we discuss the bias-variance tradeoff in classification, and how ensemble classifiers can help improve the variance or the bias of a classifier.

    1.6 Further Reading

For a review of the linear algebra concepts see (Strang, 2006; Poole, 2010), and for the probabilistic view see (Evans and Rosenthal, 2011). There are several good books on data mining, and machine and statistical learning; these include (Hand, Mannila, and Smyth, 2001; Han, Kamber, and Pei, 2006; Witten, Frank, and Hall, 2011; Tan, Steinbach, and Kumar, 2013; Bishop, 2006; Hastie, Tibshirani, and Friedman, 2009).


Bishop, C. (2006), Pattern Recognition and Machine Learning, Information Science and Statistics, Springer.

Evans, M. and Rosenthal, J. (2011), Probability and Statistics: The Science of Uncertainty, 2nd Edition, W. H. Freeman.

Hand, D., Mannila, H., and Smyth, P. (2001), Principles of Data Mining, Adaptive Computation and Machine Learning Series, MIT Press.

Han, J., Kamber, M., and Pei, J. (2006), Data Mining: Concepts and Techniques, Second Edition, The Morgan Kaufmann Series in Data Management Systems, Elsevier Science.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning, Second Edition, Springer Series in Statistics, Springer-Verlag New York.

Poole, D. (2010), Linear Algebra: A Modern Introduction, 3rd Edition, Cengage Learning.

Strang, G. (2006), Linear Algebra and Its Applications, 4th Edition, Thomson Brooks/Cole, Cengage Learning.

Tan, P., Steinbach, M., and Kumar, V. (2013), Introduction to Data Mining, Second Edition, Prentice Hall.

Witten, I., Frank, E., and Hall, M. (2011), Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, The Morgan Kaufmann Series in Data Management Systems, Elsevier Science.

    1.7 Exercises

Q1. Show that the mean of the centered data matrix Z in (1.5) is 0.

Q2. Prove that for the Lp-distance in (1.2), we have

$$\delta_\infty(\mathbf{x}, \mathbf{y}) = \lim_{p \to \infty} \delta_p(\mathbf{x}, \mathbf{y}) = \max_{i=1}^{d} \bigl\{ |x_i - y_i| \bigr\}$$

for x, y ∈ R^d.


    Part I

    Data Analysis Foundations


    Chapter 2

    Numeric Attributes

In this chapter, we discuss basic statistical methods for exploratory data analysis of numeric attributes. We look at measures of central tendency or location, measures of dispersion, and measures of linear dependence or association between attributes. We emphasize the connection between the probabilistic and the geometric and algebraic views of the data matrix.

    2.1 Univariate Analysis

Univariate analysis focuses on a single attribute at a time, thus the data matrix D can be thought of as an n × 1 matrix, or simply a column vector, given as

$$\mathbf{D} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

where X is the numeric attribute of interest, with xi ∈ R. X is assumed to be a random variable, with each point xi (1 ≤ i ≤ n) itself treated as an identity random variable. We assume that the observed data is a random sample drawn from X, i.e., each variable xi is independent and identically distributed as X. In the vector view, we treat the sample as an n-dimensional vector, and write X ∈ R^n.

In general, the probability density or mass function f(x) and the cumulative distribution function F(x), for attribute X, are both unknown. However, we can estimate these distributions directly from the data sample, which also allows us to compute statistics to estimate several important population parameters.


Empirical Cumulative Distribution Function (CDF) The empirical cumulative distribution function of X is given as