Data Mining: Concepts and Techniques (3rd ed.), Chapter 1. Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois at Urbana-Champaign & Simon Fraser University. ©2013 Han, Kamber & Pei. All rights reserved.


Page 1: 01Intro_


Data Mining: Concepts and Techniques

(3rd ed.)

— Chapter 1 —

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.

Page 2: 01Intro_

Introduction

- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kinds of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Kinds of Technologies Are Used?
- What Kinds of Applications Are Targeted?
- Major Issues in Data Mining
- A Brief History of Data Mining and Data Mining Society

Page 3: 01Intro_

Why Data Mining?

- Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks
  - Science: remote sensing, bioinformatics, scientific simulation
  - Society and everyone: news, digital cameras, YouTube
- The explosive growth of data: from terabytes to petabytes
  - Automated data collection using database systems, sensors, and APIs
  - Data availability on the Web and through APIs
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Page 4: 01Intro_

Data Mining Example


Page 5: 01Intro_

What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  - Knowledge discovery (mining) in databases, knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.
- Watch out: is everything "data mining"? The following are not:
  - Simple search and query processing
  - (Deductive) expert systems

Page 6: 01Intro_

Knowledge Discovery (KDD) Process

- This is a view from typical database systems and data warehousing communities
- Data mining plays an essential role in the knowledge discovery process

[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Page 7: 01Intro_

Example: A Web Mining Framework

- Web mining usually involves
  - Data cleaning and integration from multiple sources
  - Warehousing the data
  - Data cube construction
  - Data selection for data mining
  - Data mining
  - Presentation of the mining results
  - Patterns and knowledge to be used or stored in a knowledge base

Page 8: 01Intro_

Data Mining in Business Intelligence

[Figure: layered pyramid; the potential to support business decisions increases toward the top]

Layers, top to bottom:

- Decision Making
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles shown alongside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 9: 01Intro_

KDD Process: A Typical View from ML and Statistics

- This is a view from typical machine learning and statistics communities

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]

- Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
- Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
- Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
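As one small illustration of the pre-processing stage named on this slide, normalization might be sketched as below. This is a minimal sketch assuming min-max scaling to the range [0, 1]; the slide does not say which normalization method is meant, and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up attribute values, for illustration only.
scaled = min_max_normalize([10, 20, 30])
```

Scaling attributes to a common range keeps one large-valued attribute from dominating distance-based mining steps such as clustering.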

Page 10: 01Intro_

Which View Do You Prefer?

- Which view do you prefer?
  - KDD vs. ML/Stat. vs. Business Intelligence
  - Depends on the data, applications, and your focus
- Data mining vs. data exploration
  - Business intelligence view
    - Warehouse, data cube, reporting, but not much mining
  - Business objects vs. data mining tools
  - Supply chain example: mining vs. OLAP vs. presentation tools
  - Data presentation vs. data exploration

Page 11: 01Intro_

Multi-Dimensional View of Data Mining

- Data to be mined
  - Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs, and social and information networks
- Knowledge to be mined (or: data mining functions)
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Descriptive vs. predictive data mining
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Page 12: 01Intro_

On What Kinds of Data?

- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
  - Object-relational databases, heterogeneous databases, and legacy databases
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks, and information networks
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Page 13: 01Intro_

Data Mining Function: (1) Generalization

- Information integration and data warehouse construction
  - Data cleaning, transformation, integration, and multidimensional data model
- Data cube technology
  - Scalable methods for computing (i.e., materializing) multidimensional aggregates
  - OLAP (online analytical processing)
- Multidimensional concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
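The idea of materializing multidimensional aggregates can be sketched as follows. This is a minimal illustration, assuming a tiny made-up fact table with hypothetical dimensions `region` and `season` and a rainfall measure; it computes every group-by (cuboid) by brute force, which is not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, season, rainfall_mm). Values are made up.
ROWS = [
    ("north", "summer", 10),
    ("north", "winter", 40),
    ("south", "summer", 5),
    ("south", "winter", 15),
]
DIMENSIONS = ("region", "season")


def materialize_cube(rows, dimensions):
    """Return {group-by dimensions: {dimension values: summed measure}} for all cuboids."""
    cube = {}
    indexes = range(len(dimensions))
    for k in range(len(dimensions) + 1):
        for kept in combinations(indexes, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in kept)
                totals[key] += row[-1]  # the measure is the last column
            cube[tuple(dimensions[i] for i in kept)] = dict(totals)
    return cube


cube = materialize_cube(ROWS, DIMENSIONS)
```

Here `cube[()]` holds the grand total, `cube[("region",)]` the per-region totals, and so on. With d dimensions there are 2^d cuboids, which is why scalable computation matters.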

Page 14: 01Intro_

Data Mining Function: (2) Association and Correlation Analysis

- Frequent patterns (or frequent itemsets)
  - What items are frequently purchased together in your Walmart?
- Association, correlation vs. causality
  - A typical association rule: Diaper → Milk [0.5%, 95%] (support, confidence)
  - Are strongly associated items also strongly correlated?
- How to mine such patterns and rules efficiently in large datasets?
- How to use such patterns for classification, clustering, and other applications?
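The support and confidence of a rule such as Diaper → Milk can be computed directly from transaction data. The sketch below assumes a handful of made-up baskets (so the numbers do not match the 0.5% and 95% on the slide) and adds lift as one common way to check whether an associated pair is also positively correlated; the function names are illustrative.

```python
# Hypothetical market-basket transactions; items and counts are made up.
TRANSACTIONS = [
    {"diaper", "milk", "bread"},
    {"diaper", "milk"},
    {"diaper", "beer"},
    {"milk", "bread"},
    {"bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; above 1 suggests positive correlation."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


rule_support = support({"diaper", "milk"}, TRANSACTIONS)
rule_confidence = confidence({"diaper"}, {"milk"}, TRANSACTIONS)
rule_lift = lift({"diaper"}, {"milk"}, TRANSACTIONS)
```

In this toy data the rule has support 2/5 and confidence 2/3. A rule can have high confidence simply because the consequent is common, which is why a correlation measure such as lift is often reported alongside it.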

Page 15: 01Intro_

Data Mining Function: (3) Classification

- Classification and label prediction
  - Construct models (functions) based on some training examples
  - Describe and distinguish classes or concepts for future prediction
    - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Predict some unknown class labels
- Typical methods
  - Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
- Typical applications
  - Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
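Constructing a model from training examples and then predicting unknown labels can be illustrated with the simplest possible decision tree, a one-level "stump". The sketch assumes made-up cars described only by gas mileage, two hypothetical labels `"low"` and `"high"`, and that the `"high"` class lies above the threshold.

```python
# Hypothetical training examples: (gas mileage in mpg, class label). Values are made up.
TRAINING = [(12, "low"), (15, "low"), (18, "low"), (28, "high"), (33, "high"), (40, "high")]


def predict(threshold, mileage):
    """Label a car 'high' if its mileage exceeds the learned threshold, else 'low'."""
    return "high" if mileage > threshold else "low"


def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:]) if a != b]

    def errors(threshold):
        return sum(predict(threshold, x) != y for x, y in examples)

    return min(candidates, key=errors)


threshold = train_stump(TRAINING)
label_for_new_car = predict(threshold, 35)
```

On this data the learned threshold is 23.0, the midpoint of the gap between the two classes, so a previously unseen car with 35 mpg is labeled `"high"`.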

Page 16: 01Intro_

Data Mining Function: (4) Cluster Analysis

- Unsupervised learning (i.e., class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: maximizing intra-class similarity and minimizing inter-class similarity
- Many methods and applications
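The clustering principle can be sketched with k-means, one of the many methods available (the slide does not single out a method). The example assumes made-up one-dimensional house prices and hand-picked starting centers.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data: assign each point to its nearest center, then recompute means."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical house prices (in thousands); values are made up.
PRICES = [100, 110, 120, 400, 410, 420]
centers, clusters = kmeans_1d(PRICES, centers=[100, 420])
```

No labels are supplied; the two price groups emerge from the data. Points within each resulting cluster are close to their center (high intra-class similarity) while the two centers are far apart (low inter-class similarity).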

Page 17: 01Intro_

Data Mining Function: (5) Outlier Analysis

- Outlier: a data object that does not comply with the general behavior of the data
- Noise or exception? One person's garbage could be another person's treasure
- Methods: by-product of clustering or regression analysis, …
- Useful in fraud detection, rare events analysis
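Outlier detection as a by-product of regression analysis might look like the sketch below: fit a line, then flag points whose residual is unusually large. The data, the cutoff of two standard deviations, and the function names are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_outliers(points, k=2.0):
    """Return points whose residual exceeds k standard deviations of all residuals."""
    slope, intercept = fit_line(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    spread = pstdev(residuals)
    return [p for p, r in zip(points, residuals) if abs(r) > k * spread]


# Hypothetical readings following y = 2x + 1, with one made-up anomaly at x = 7.
POINTS = [(x, 2 * x + 1) for x in range(10)]
POINTS[7] = (7, 40)
outliers = residual_outliers(POINTS)
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) depends on the application.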

Page 18: 01Intro_

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

- Sequence, trend, and evolution analysis
  - Trend, time-series, and deviation analysis, e.g., regression and value prediction
  - Sequential pattern mining, e.g., first buy a digital camera, then buy large SD memory cards
  - Periodicity analysis
  - Motifs and biological sequence analysis
    - Approximate and consecutive motifs
  - Similarity-based analysis
- Mining data streams
  - Ordered, time-varying, potentially infinite data streams
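The camera-then-memory-card example can be made concrete by measuring how many customers exhibit the pattern in order. This is a minimal sketch over made-up purchase histories; it only checks the support of one given pattern and is not a sequential pattern mining algorithm.

```python
def contains_in_order(sequence, pattern):
    """True if `pattern` occurs in `sequence` as a subsequence (gaps allowed, order preserved)."""
    remaining = iter(sequence)
    # Each `in` test consumes the iterator up to the match, which enforces the ordering.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


# Hypothetical per-customer purchase histories in time order; items are made up.
HISTORIES = [
    ["camera", "sd_card", "tripod"],
    ["sd_card", "camera"],
    ["camera", "bag", "sd_card"],
    ["laptop"],
]
pattern_support = sequence_support(HISTORIES, ["camera", "sd_card"])
```

The second customer bought both items but in the opposite order, so only two of the four histories support the pattern. That ordering constraint is what distinguishes a sequential pattern from a frequent itemset.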

Page 19: 01Intro_

Structure and Network Analysis

- Graph mining
  - Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
- Information network analysis
  - Social networks: actors (objects, nodes) and relationships (edges)
    - E.g., author networks in CS, terrorist networks
  - Multiple heterogeneous networks
    - A person could be in multiple information networks: friends, family, classmates, …
  - Links carry a lot of semantic information: link mining
- Web mining
  - The Web is a big information network: from PageRank to Google
  - Analysis of Web information networks
    - Web community discovery, opinion mining, usage mining, …
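To show how links alone can carry information, here is a simplified PageRank sketch using power iteration. The four-page graph is made up, and the damping factor of 0.85 and the fixed iteration count are conventional choices rather than anything stated on the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; `links` maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no out-links spreads its rank evenly over all pages.
            targets = outgoing or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph; page names are made up.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
top_page = max(ranks, key=ranks.get)
```

Page `c` ends up ranked highest because three pages link to it, while `d`, which nothing links to, keeps only the baseline share.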

Page 20: 01Intro_

Evaluation of Knowledge

- Is all mined knowledge interesting?
  - One can mine a tremendous number of "patterns"
  - Some may fit only a certain dimension space (time, location, …)
  - Some may not be representative, may be transient, …
- Evaluation of mined knowledge → directly mine only interesting knowledge?
  - Descriptive vs. predictive
  - Coverage
  - Typicality vs. novelty
  - Accuracy
  - Timeliness

Page 21: 01Intro_

Introduction

- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kinds of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Kinds of Technologies Are Used?
- What Kinds of Applications Are Targeted?
- Major Issues in Data Mining
- A Brief History of Data Mining and Data Mining Society
- Summary

Page 22: 01Intro_

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by the contributing disciplines: Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, Database Technology]

Page 23: 01Intro_

Why Confluence of Multiple Disciplines?

- Tremendous amount of data
  - Algorithms must be scalable to handle big data
- High dimensionality of data
  - Microarray data may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structure data, graphs, social and information networks
  - Spatial, spatiotemporal, multimedia, text, and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Page 24: 01Intro_

Applications of Data Mining

- Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
- Collaborative analysis & recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
- Data mining and software engineering
- From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Page 25: 01Intro_

Major Issues in Data Mining (1)

- Mining methodology
  - Mining various and new kinds of knowledge
  - Mining knowledge in multi-dimensional space
  - Data mining: an interdisciplinary effort
  - Boosting the power of discovery in a networked environment
  - Handling noise, uncertainty, and incompleteness of data
  - Pattern evaluation and pattern- or constraint-guided mining
- User interaction
  - Interactive mining
  - Incorporation of background knowledge
  - Presentation and visualization of data mining results

Page 26: 01Intro_

Major Issues in Data Mining (2)

- Efficiency and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, stream, and incremental mining methods
- Diversity of data types
  - Handling complex types of data
  - Mining dynamic, networked, and global data repositories
- Data mining and society
  - Social impacts of data mining
  - Privacy-preserving data mining
  - Invisible data mining

Page 27: 01Intro_

A Brief History of Data Mining Society

- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998 and SIGKDD Explorations
- More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
- ACM Transactions on KDD (2007)

Page 28: 01Intro_

Conferences and Journals on Data Mining

- KDD conferences
  - ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  - SIAM Data Mining Conf. (SDM)
  - (IEEE) Int. Conf. on Data Mining (ICDM)
  - European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  - Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  - Int. Conf. on Web Search and Data Mining (WSDM)
- Other related conferences
  - DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
  - Web and IR conferences: WWW, SIGIR, WSDM
  - ML conferences: ICML, NIPS
  - PR conferences: CVPR
- Journals
  - Data Mining and Knowledge Discovery (DAMI or DMKD)
  - IEEE Trans. on Knowledge and Data Eng. (TKDE)
  - KDD Explorations
  - ACM Trans. on KDD

Page 29: 01Intro_

Where to Find References? DBLP, CiteSeer, Google

- Data mining and KDD (SIGKDD: CDROM)
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
- Database systems (SIGMOD: ACM SIGMOD Anthology, CD-ROM)
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI & machine learning
  - Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  - Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
  - Conferences: SIGIR, WWW, CIKM, etc.
  - Journals: WWW: Internet and Web Information Systems
- Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Page 30: 01Intro_

Recommended Reference Books

- E. Alpaydin. Introduction to Machine Learning, 2nd ed. MIT Press, 2011
- S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
- J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2009
- B. Liu. Web Data Mining. Springer, 2006
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997
- Y. Sun and J. Han. Mining Heterogeneous Information Networks. Morgan & Claypool, 2012
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005

Page 31: 01Intro_

Summary

- Data mining: discovering interesting patterns and knowledge from massive amounts of data
- A natural evolution of science and information technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed on a variety of data
- Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
- Data mining technologies and applications
- Major issues in data mining