Upload
bilal-nasir
View
214
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Lec Slides
Citation preview
1 1
Data Mining: Concepts and Techniques
(3rd ed.)
— Chapter 1 —
Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-‐Champaign &
Simon Fraser University ©2013 Han, Kamber & Pei. All rights reserved.
2
Introduction
n Why Data Mining?
n What Is Data Mining?
n A Mul2-‐Dimensional View of Data Mining
n What Kinds of Data Can Be Mined?
n What Kinds of Pa?erns Can Be Mined?
n What Kinds of Technologies Are Used?
n What Kinds of Applica2ons Are Targeted?
n Major Issues in Data Mining
n A Brief History of Data Mining and Data Mining Society
3
Why Data Mining?
n Major sources of abundant data n Business: Web, e-‐commerce, transac2ons, stocks n Science: Remote sensing, bioinforma2cs, scien2fic simula2on n Society and everyone: news, digital cameras, YouTube
n The Explosive Growth of Data: from terabytes to petabytes n Automated data collec2on using database systems, sensors and APIs
n Data availability on the Web and through APIs n We are drowning in data, but starving for knowledge! n “Necessity is the mother of inven2on”—Data mining—
Automated analysis of massive data [sets]
Data Mining Example
4
5
What Is Data Mining?
n Data mining (knowledge discovery from data) n Extrac2on of interes2ng (non-‐trivial, implicit, previously unknown and poten2ally useful) pa?erns or knowledge from huge amount of data
n Alterna2ve names n Knowledge discovery (mining) in databases, knowledge extrac2on, data/pa?ern analysis, informa2on harves2ng, business intelligence, etc.
n Watch out: Is everything “data mining”? n Simple search and query processing n (Deduc2ve) expert systems
6
Knowledge Discovery (KDD) Process
n This is a view from typical database systems and data warehousing communi2es
n Data mining plays an essen2al role in the knowledge discovery process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
7
Example: A Web Mining Framework
n Web mining usually involves n Data cleaning and integra2on from mul2ple sources n Warehousing the data n Data cube construc2on n Data selec2on for data mining n Data mining n Presenta2on of the mining results n Pa?erns and knowledge to be used or stored into knowledge-‐base
8
Data Mining in Business Intelligence
Increasing potential to support business decisions End User
Business Analyst
Data Analyst
DBA
Decision Making
Data Presentation Visualization Techniques
Data Mining Information Discovery
Data Exploration Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
9
KDD Process: A Typical View from ML and Statistics
Input Data Data Mining
Data Pre-Processing
Post-Processing
n This is a view from typical machine learning and sta2s2cs communi2es
Data integration Normalization Feature selection Dimension reduction
Pattern discovery Association & correlation Classification Clustering Outlier analysis … … … …
Pattern evaluation Pattern selection Pattern interpretation Pattern visualization
10
Which View Do You Prefer?
n Which view do you prefer? n KDD vs. ML/Stat. vs. Business Intelligence
n Depending on the data, applica2ons, and your focus
n Data Mining vs. Data Explora2on
n Business intelligence view
n Warehouse, data cube, repor2ng but not much mining n Business objects vs. data mining tools
n Supply chain example: mining vs. OLAP vs. presenta2on tools
n Data presenta2on vs. data explora2on
11
Multi-Dimensional View of Data Mining
n Data to be mined n Database data (extended-‐rela2onal, object-‐oriented, heterogeneous,
legacy), data warehouse, transac2onal data, stream, spa2otemporal, 2me-‐series, sequence, text and web, mul2-‐media, graphs & social and informa2on networks
n Knowledge to be mined (or: Data mining funcNons) n Characteriza2on, discrimina2on, associa2on, classifica2on, clustering,
trend/devia2on, outlier analysis, etc. n Descrip2ve vs. predic2ve data mining n Mul2ple/integrated func2ons and mining at mul2ple levels
n Techniques uNlized n Data-‐intensive, data warehouse (OLAP), machine learning, sta2s2cs,
pa?ern recogni2on, visualiza2on, high-‐performance, etc. n ApplicaNons adapted
n Retail, telecommunica2on, banking, fraud analysis, bio-‐data mining, stock market analysis, text mining, Web mining, etc.
12
On What Kinds of Data?
n Database-‐oriented data sets and applica2ons
n Rela2onal database, data warehouse, transac2onal database
n Object-‐rela2onal databases, Heterogeneous databases and legacy databases
n Advanced data sets and advanced applica2ons
n Data streams and sensor data
n Time-‐series data, temporal data, sequence data (incl. bio-‐sequences)
n Structure data, graphs, social networks and informa2on networks
n Spa2al data and spa2otemporal data
n Mul2media database
n Text databases
n The World-‐Wide Web
13
Data Mining Function: (1) Generalization
n Informa2on integra2on and data warehouse construc2on n Data cleaning, transforma2on, integra2on, and mul2dimensional data model
n Data cube technology
n Scalable methods for compu2ng (i.e., materializing) mul2dimensional aggregates
n OLAP (online analy2cal processing) n Mul2dimensional concept descrip2on: Characteriza2on and
discrimina2on n Generalize, summarize, and contrast data characteris2cs, e.g., dry vs. wet region
14
Data Mining Function: (2) Association and Correlation Analysis
n Frequent pa?erns (or frequent itemsets) n What items are frequently purchased together in your Walmart?
n Associa2on, correla2on vs. causality
n A typical associa2on rule n Diaper à Milk [0.5%, 95%] (support, confidence)
n Are strongly associated items also strongly correlated? n How to mine such pa?erns and rules efficiently in large
datasets? n How to use such pa?erns for classifica2on, clustering, and
other applica2ons?
15
Data Mining Function: (3) Classification
n Classifica2on and label predic2on n Construct models (func2ons) based on some training examples n Describe and dis2nguish classes or concepts for future predic2on
n E.g., classify countries based on (climate), or classify cars based on (gas mileage)
n Predict some unknown class labels n Typical methods
n Decision trees, naïve Bayesian classifica2on, support vector machines, neural networks, rule-‐based classifica2on, pa?ern-‐based classifica2on, logis2c regression, …
n Typical applica2ons: n Credit card fraud detec2on, direct marke2ng, classifying stars, diseases, web-‐pages, …
16
Data Mining Function: (4) Cluster Analysis
n Unsupervised learning (i.e., Class label is unknown)
n Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribu2on pa?erns
n Principle: Maximizing intra-‐class similarity & minimizing interclass similarity
n Many methods and applica2ons
17
Data Mining Function: (5) Outlier Analysis
n Outlier: A data object that does not comply with the general behavior of the data
n Noise or excep2on? ― One person’s garbage could be another person’s treasure
n Methods: by-‐product of clustering or regression analysis, …
n Useful in fraud detec2on, rare events analysis
18
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
n Sequence, trend and evolu2on analysis n Trend, 2me-‐series, and devia2on analysis: e.g., regression and value predic2on
n Sequen2al pa?ern mining n e.g., first buy digital camera, then buy large SD memory cards
n Periodicity analysis n Mo2fs and biological sequence analysis
n Approximate and consecu2ve mo2fs n Similarity-‐based analysis
n Mining data streams n Ordered, 2me-‐varying, poten2ally infinite, data streams
19
Structure and Network Analysis
n Graph mining n Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments) n Informa2on network analysis
n Social networks: actors (objects, nodes) and rela2onships (edges) n e.g., author networks in CS, terrorist networks
n Mul2ple heterogeneous networks n A person could be mul2ple informa2on networks: friends, family, classmates, …
n Links carry a lot of seman2c informa2on: Link mining n Web mining
n Web is a big informa2on network: from PageRank to Google n Analysis of Web informa2on networks
n Web community discovery, opinion mining, usage mining, …
20
Evaluation of Knowledge
n Are all mined knowledge interes2ng? n One can mine tremendous amount of “pa?erns” n Some may fit only certain dimension space (2me, loca2on, …)
n Some may not be representa2ve, may be transient, … n Evalua2on of mined knowledge → directly mine only interes2ng
knowledge? n Descrip2ve vs. predic2ve n Coverage n Typicality vs. novelty n Accuracy n Timeliness
21
Introduction
n Why Data Mining?
n What Is Data Mining?
n A Mul2-‐Dimensional View of Data Mining
n What Kinds of Data Can Be Mined?
n What Kinds of Pa?erns Can Be Mined?
n What Kinds of Technologies Are Used?
n What Kinds of Applica2ons Are Targeted?
n Major Issues in Data Mining
n A Brief History of Data Mining and Data Mining Society
n Summary
22
Data Mining: Confluence of Multiple Disciplines
Data Mining
Machine Learning
Statistics
Applications
Algorithm
Pattern Recognition
High-Performance Computing
Visualization
Database Technology
23
Why Confluence of Multiple Disciplines?
n Tremendous amount of data n Algorithms must be scalable to handle big data
n High-‐dimensionality of data n Micro-‐array may have tens of thousands of dimensions
n High complexity of data n Data streams and sensor data n Time-‐series data, temporal data, sequence data n Structure data, graphs, social and informa2on networks n Spa2al, spa2otemporal, mul2media, text and Web data n Sonware programs, scien2fic simula2ons
n New and sophis2cated applica2ons
24
Applications of Data Mining
n Web page analysis: from web page classifica2on, clustering to PageRank & HITS algorithms
n Collabora2ve analysis & recommender systems
n Basket data analysis to targeted marke2ng
n Biological and medical data analysis: classifica2on, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
n Data mining and sonware engineering
n From major dedicated data mining systems/tools (e.g., SAS, MS SQL-‐Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
25
Major Issues in Data Mining (1)
n Mining Methodology
n Mining various and new kinds of knowledge
n Mining knowledge in mul2-‐dimensional space
n Data mining: An interdisciplinary effort
n Boos2ng the power of discovery in a networked environment
n Handling noise, uncertainty, and incompleteness of data
n Pa?ern evalua2on and pa?ern-‐ or constraint-‐guided mining
n User Interac2on
n Interac2ve mining
n Incorpora2on of background knowledge
n Presenta2on and visualiza2on of data mining results
26
Major Issues in Data Mining (2)
n Efficiency and Scalability
n Efficiency and scalability of data mining algorithms
n Parallel, distributed, stream, and incremental mining methods
n Diversity of data types
n Handling complex types of data
n Mining dynamic, networked, and global data repositories
n Data mining and society
n Social impacts of data mining
n Privacy-‐preserving data mining
n Invisible data mining
27
A Brief History of Data Mining Society
n 1989 IJCAI Workshop on Knowledge Discovery in Databases
n Knowledge Discovery in Databases (G. Piatetsky-‐Shapiro and W. Frawley, 1991)
n 1991-‐1994 Workshops on Knowledge Discovery in Databases
n Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-‐Shapiro, P. Smyth, and R. Uthurusamy, 1996)
n 1995-‐1998 Interna2onal Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-‐98)
n Journal of Data Mining and Knowledge Discovery (1997)
n ACM SIGKDD conferences since 1998 and SIGKDD Explora2ons
n More conferences on data mining
n PAKDD (1997), PKDD (1997), SIAM-‐Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
n ACM Transac2ons on KDD (2007)
28
Conferences and Journals on Data Mining
n KDD Conferences n ACM SIGKDD Int. Conf. on Knowledge
Discovery in Databases and Data Mining (KDD)
n SIAM Data Mining Conf. (SDM) n (IEEE) Int. Conf. on Data Mining
(ICDM) n European Conf. on Machine Learning
and Principles and prac2ces of Knowledge Discovery and Data Mining (ECML-‐PKDD)
n Pacific-‐Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
n Int. Conf. on Web Search and Data Mining (WSDM)
n Other related conferences n DB conferences: ACM SIGMOD,
VLDB, ICDE, EDBT, ICDT, …
n Web and IR conferences: WWW, SIGIR, WSDM
n ML conferences: ICML, NIPS n PR conferences: CVPR,
n Journals n Data Mining and Knowledge
Discovery (DAMI or DMKD)
n IEEE Trans. On Knowledge and Data Eng. (TKDE)
n KDD Explorations n ACM Trans. on KDD
29
Where to Find References? DBLP, CiteSeer, Google
n Data mining and KDD (SIGKDD: CDROM) n Conferences: ACM-‐SIGKDD, IEEE-‐ICDM, SIAM-‐DM, PKDD, PAKDD, etc. n Journal: Data Mining and Knowledge Discovery, KDD Explora2ons, ACM TKDD
n Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM) n Conferences: ACM-‐SIGMOD, ACM-‐PODS, VLDB, IEEE-‐ICDE, EDBT, ICDT, DASFAA n Journals: IEEE-‐TKDE, ACM-‐TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
n AI & Machine Learning n Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc. n Journals: Machine Learning, Ar2ficial Intelligence, Knowledge and Informa2on Systems, IEEE-‐PAMI,
etc.
n Web and IR n Conferences: SIGIR, WWW, CIKM, etc. n Journals: WWW: Internet and Web Informa2on Systems,
n Sta2s2cs n Conferences: Joint Stat. Mee2ng, etc. n Journals: Annals of sta2s2cs, etc.
n Visualiza2on n Conference proceedings: CHI, ACM-‐SIGGraph, etc. n Journals: IEEE Trans. visualiza2on and computer graphics, etc.
30
Recommended Reference Books n E. Alpaydin. IntroducNon to Machine Learning, 2nd ed., MIT Press, 2011
n S. ChakrabarN. Mining the Web: StaNsNcal Analysis of Hypertex and Semi-‐Structured Data. Morgan Kaufmann, 2002
n R. O. Duda, P. E. Hart, and D. G. Stork, Pa[ern ClassificaNon, 2ed., Wiley-‐Interscience, 2000
n T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
n U. M. Fayyad, G. Piatetsky-‐Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/
MIT Press, 1996
n U. Fayyad, G. Grinstein, and A. Wierse, InformaNon VisualizaNon in Data Mining and Knowledge Discovery, Morgan Kaufmann,
2001
n J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011
n T. HasNe, R. Tibshirani, and J. Friedman, The Elements of StaNsNcal Learning: Data Mining, Inference, and PredicNon, 2nd ed.,
Springer, 2009
n B. Liu, Web Data Mining, Springer 2006
n T. M. Mitchell, Machine Learning, McGraw Hill, 1997
n Y. Sun and J. Han, Mining Heterogeneous InformaNon Networks, Morgan & Claypool, 2012
n P.-‐N. Tan, M. Steinbach and V. Kumar, IntroducNon to Data Mining, Wiley, 2005
n S. M. Weiss and N. Indurkhya, PredicNve Data Mining, Morgan Kaufmann, 1998
n I. H. Wi[en and E. Frank, Data Mining: PracNcal Machine Learning Tools and Techniques with Java ImplementaNons, Morgan Kaufmann, 2nd ed. 2005
31
Summary
n Data mining: Discovering interes2ng pa?erns and knowledge from massive amount of data
n A natural evolu2on of science and informa2on technology, in great demand, with wide applica2ons
n A KDD process includes data cleaning, data integra2on, data selec2on, transforma2on, data mining, pa?ern evalua2on, and knowledge presenta2on
n Mining can be performed in a variety of data
n Data mining func2onali2es: characteriza2on, discrimina2on, associa2on, classifica2on, clustering, trend and outlier analysis, etc.
n Data mining technologies and applica2ons
n Major issues in data mining