Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Data Mining and Soft Computing
Session 1. Introduction to Data Mining and Knowledge DiscoveryIntroduction to Data Mining and Knowledge Discovery
Francisco HerreraResearch Group on Soft Computing andInformation Intelligent Systems (SCI2S)
Dept. of Computer Science and A.I. University of Granada SpainUniversity of Granada, Spain
Email: [email protected]://sci2s ugr eshttp://sci2s.ugr.es
http://decsai.ugr.es/~herrera
Data Mining and Soft Computing
Summary1. Introduction to Data Mining and Knowledge Discovery2 D t P ti2. Data Preparation 3. Introduction to Prediction, Classification, Clustering and Association4. Introduction to Soft Computing. Focusing our attention in Fuzzy Logic
and Evolutionary Computationand Evolutionary Computation5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and
Knowledge Extraction based on Evolutionary Learning6 Genetic Fuzzy Systems: State of the Art and New Trends6. Genetic Fuzzy Systems: State of the Art and New Trends7. Some Advanced Topics: Imbalanced data sets, subgroup discovery,
data complexity. 8. Final talk: How must I Do my Experimental Study? Design of8. Final talk: How must I Do my Experimental Study? Design of
Experiments in Data Mining/Computational Intelligence. Using Non-parametric Tests. Some Cases of Study.
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
Concluding Remarks3
Concluding Remarks
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
Concluding Remarks4
Concluding Remarks
Why DATA MINING?Why DATA MINING?
• Lots of data is being collected and warehoused – Web data, e-commerce– purchases at department/
tgrocery stores– Bank/Credit Card
transactionstransactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong – Provide better, customized services for an edge (e.g. inProvide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why DATA MINING?Why DATA MINING?
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data• Traditional techniques infeasible for raw data• Data mining may help scientists
– in classifying and segmenting datay g g g– in Hypothesis Formation
Why DATA MINING?Why DATA MINING?
• Huge amounts of data
• Data rich – but information poorData rich but information poor• Lying hidden in all this data is information!
7
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful informationM h f th d t i l d t ll• Much of the data is never analyzed at all
3 500 000
4,000,000
2,500,000
3,000,000
3,500,000
The Data Gap
1,500,000
2,000,000Total new disk (TB) since 1995
0
500,000
1,000,000Number of
analysts0
1995 1996 1997 1998 1999
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Data vs InformationData vs. Information
• Society produces massive amounts of data– business, science, medicine, economics, sports,
…P t ti ll l bl• Potentially valuable resource
• Raw data is useless– need techniques to automatically extract
information Data: recorded facts– Data: recorded facts
– Information: patterns underlying the data
9
Figura 1. Relación entre dato, información y conocimiento
http://www.uoc.edu/web/esp/art/uoc/molina1102/molina1102.html
10
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
Concluding Remarks11
Concluding Remarks
What is DATA MINING?What is DATA MINING?
M D fi iti
E t ti “ i i ” k l d f l t
Many Definitions
• Extracting or “mining” knowledge from large amounts of data
• Data -driven discovery and modeling of hidden patterns (we never new existed) in large volumes of datadata
• Extraction of implicit, previously unknown and yunexpected, potentially extremely useful information from data
12
What Is Data Mining?What Is Data Mining?
Data mining: Data mining: Extraction of interesting (non-trivial, implicit,
i l k d t ti ll f l)previously unknown and potentially useful)information or patterns from data in large databases
13
Data Mining is NOTData Mining is NOT• Data Warehousing What is not Data• Data Warehousing• (Deductive) query
processing
What is not Data Mining?
processing – SQL/ Reporting
• Software Agents– Look up phone number in phone • Software Agents
• Expert SystemsOnline Analytical Processing
directory
• Online Analytical Processing (OLAP)Statistical Analysis Tool
– Query a Web search engine for• Statistical Analysis Tool
• Data visualization
search engine for information about
“Amazon”14
Machine Learning TechniquesMachine Learning Techniques
Technical basis for data mining:
• algorithms for acquiring structural descriptions from examples
• Methods originate from artificial intelligence, hi l i i i d hmachine learning, statistics, and research on
databases15
Origins of Data Mining• Draws ideas from machine learning/AI, pattern
recognition statistics and database systemsrecognition, statistics, and database systems
AI/• Traditional Techniquesmay be unsuitable due to
AI/
Machine Learning/Pattern
Statistics
– Enormity of data– High dimensionality of data
Pattern Recognition
Data Miningg y– Heterogeneous,
distributed nature of data
Data Mining
Database systems
Multidisciplinary FieldMultidisciplinary Field
Database Technology StatisticsTechnology
Data MiningMachineLearning Visualization
OtherDisciplines
Artificial Intelligence p
Data Mining TasksData Mining TasksP di ti M th d• Prediction Methods– Use some variables to predict unknown or
future values of other variables.
• Description MethodsFind human interpretable patterns that– Find human-interpretable patterns that describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Summary in a PictureSummary in a Picture
How can I analyze this data?
Knowledge
We have rich data, but poor information
Data mining-searching for knowledge (interesting patterns) in your data
19
p (interesting patterns) in your data.
J. Han, M. Kamber. Data Mining. Concepts and Techniques Morgan Kaufmann, 2006 (Second Edition)
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
20
HistoryHistory• Emerged late 1980s
• Flourished –1990s
• Consolidation – 2000s
21
A Brief History of Data Mining SocietyA Brief History of Data Mining Society• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases1991 1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)1993 – Association Rule Mining Algorithm APRIORI proposed by Agraval Imielinski and Swamiproposed by Agraval, Imielinski and Swami.
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations• More conferences on data mining
22
More conferences on data mining
– PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
Concluding Remarks23
Concluding Remarks
TerminologyTerminology
• Gold Mining
• Knowledge mining from databases
• Knowledge extraction• Knowledge extraction
• Data/pattern analysisData/pattern analysis
• Knowledge Discovery Databases or KDD
• Business intelligence
24
Input Output ViewInput-Output View
ngng
Data(internal & external)
Reports
a M
inin
a M
inin
Dat
aD
ata
Decision ModelsObjective(s)
B i K l d N K l d
25
Business Knowledge New Knowledge
Process ViewProcess ViewInterpretation
&
Model
&Evaluation
Dissemination&
Build a decision tree
Check against hold-out set
Building Deployment
Aggregate individual incomes into household income
Learn about loans, repayments, etc.;Data
Pre-processing
PatternsModels
Learn about loans, repayments, etc.;Collect data about past performance
p g
UnderstandingDomain & Data
Pre-processedData
BusinessProblem
Formulation
RawData
SelectedData
26
Knowledge Discovery Processow edge scove y ocess
Interpretation
Integration
Raw
& EvaluationKnowledge
______
______
______
T f d
Patternsand
Rules
Data UnderstaTransformed DataTarget
Data
andingDATADATAWareWarehhhousehouse
27
KDD ProcessKDD Process
DatabaseDatabase
Selection Transformation
Data Preparation
Data Data MiningMining
Training Data
Model, Patternsp gg Patterns
Evaluation, Verification
28
Knowledge Discovery Processfl di t CRISP DMflow, according to CRISP-DM
see www.crisp-dm.org
ffor more information
29
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
Concluding Remarks30
Concluding Remarks
Data Mining TasksData Mining Tasks• Prediction Methods
– Use some variables to predict unknown or future values of other variables.
• Description Methods– Find human-interpretable patterns that describe the data.
Data Mining TasksData Mining Tasks...
• Classification [Predictive]• Clustering [Descriptive]Clustering [Descriptive]• Association Rule Discovery [Descriptive]• Sequential Pattern Discovery [Descriptive]• Regression [Predictive]• Regression [Predictive]• Deviation Detection [Predictive]• Time Series [Predictive]• Summarization [D i ti ]• Summarization [Descriptive]
Classification: DefinitionClassification: Definition• Given a collection of records (training set )( g )
– Each record contains a set of attributes, one of the attributes is the class.
• Find a model for class attribute as a function of the values of other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
A t t t i d t d t i th f th– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to buildtraining and test sets, with training set used to build the model and test set used to validate it.
Classification ExampleClassification Example
Tid Refund MaritalStatus
TaxableIncome Cheat
Refund MaritalStatus
TaxableIncome Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
No Single 75K ?
Yes Married 50K ?
No Married 150K ?
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
Yes Divorced 90K ?
No Single 40K ?
No Married 80K ? Test6 No Married 60K No7 Yes Divorced 220K No
8 No Single 85K Yes
9 N M i d 75K N
No Married 80K ?10
TestSet
L9 No Married 75K No10 No Single 90K Yes
10
Training Set
ModelLearn
Classifier
Classification: Application 1Classification: Application 1• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
– Approach:• Use the data for a similar product introduced before. • We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business where they stay how much they earn etc– Type of business, where they stay, how much they earn, etc.• Use this information as input attributes to learn a classifier
model.
Classification: Application 2Classification: Application 2• Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.A h– Approach:
• Use credit card transactions and the information on its account-holder as attributes.
– When does a customer buy, what does he buy, how often he pays on time, etc
• Label past transactions as fraud or fair transactions. This pforms the class attribute.
• Learn a model for the class of the transactions.• Use this model to detect fraud by observing credit cardUse this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3Classification: Application 3
• Customer Attrition/Churn:– Goal: To predict whether a customer is likely to be
lost to a competitor.– Approach:
• Use detailed record of transactions with each of the past and present customers, to find attributes.
– How often the customer calls, where he calls, what time-of-the , ,day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.Find a model for loyalty• Find a model for loyalty.
Classification: Application 4Classification: Application 4• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, i ll i ll f i t b d th t l iespecially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).– 3000 images with 23 040 x 23 040 pixels per image3000 images with 23,040 x 23,040 pixels per image.
– Approach:• Segment the image. g g• Measure image attributes (features) - 40 of them per object.• Model the class based on these features.• Success Story: Could find 16 new high red-shift quasars,
some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxiesy gCourtesy: http://aps.umn.edu
Early Class: • Stages of Formation
Attributes:• Image features,
• Characteristics of light waves received etc
Intermediatewaves received, etc.
Late
D t SiData Size: • 72 million stars, 20 million galaxies
• Object Catalog: 9 GB• Image Database: 150 GB• Image Database: 150 GB
Clustering DefinitionClustering DefinitionGi t f d t i t h h i t f• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such thatfind clusters such that– Data points in one cluster are more similar to one
anotheranother.– Data points in separate clusters are less similar to
one another.• Similarity Measures:
– Euclidean Distance if attributes are continuous.– Other Problem-specific Measures.
Illustrating ClusteringIllustrating ClusteringEuclidean Distance Based Clustering in 3-D space.
Intracluster distancesare minimized
Intracluster distancesare minimized
Intercluster distancesare maximized
Intercluster distancesare maximizedare minimizedare minimized are maximizedare maximized
Clustering: Application 1Clustering: Application 1• Market Segmentation:
G l bdi id k t i t di ti t b t f– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with aselected as a market target to be reached with a distinct marketing mix.
– Approach: • Collect different attributes of customers based on
their geographical and lifestyle related information.Fi d l t f i il t• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster vs thosepatterns of customers in same cluster vs. those from different clusters.
Clustering: Application 2Clustering: Application 2D t Cl t i• Document Clustering:– Goal: To find groups of documents that are similar to
each other based on the important terms appearing ineach other based on the important terms appearing in them.Approach: To identify frequently occurring terms in– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.q
– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
Illustrating Document ClusteringIllustrating Document Clustering• Clustering Points: 3204 Articles of Los Angeles Times• Clustering Points: 3204 Articles of Los Angeles Times.• Similarity Measure: How many words are common in
these documents (after some word filtering).these documents (after some word filtering).
Category TotalArticles
CorrectlyPlacedArticles Placed
Financial 555 364
Foreign 341 260
National 273 36
Metro 943 746Metro 943 746
Sports 738 573
Entertainment 354 278Entertainment 354 278
Association Rule Discovery: DefinitionAssociation Rule Discovery: Definition
• Given a set of records each of which contain some number of items from a given collection;
P d d d l hi h ill di t– Produce dependency rules which will predict occurrence of an item based on occurrences of other itemsitems.
TID Items
1 B d C k Milk1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk
Rules Discovered:{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Rules Discovered:{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk
{ p , } { }{ p , } { }
Association Rule Discovery: Application 1Association Rule Discovery: Application 1
• Marketing and Sales Promotion:g– Let the rule discovered be
{Bagels, … } --> {Potato Chips}– Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.B l i th t d t b d t– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.g g
– Bagels in antecedent and Potato chips in consequent=> Can be used to see what products should be sold
ith B l t t l f P t t hi !with Bagels to promote sale of Potato chips!
Association Rule Discovery: Application 2Association Rule Discovery: Application 2
S k t h lf t• Supermarket shelf management.– Goal: To identify items that are bought together by
s fficientl man c stomerssufficiently many customers.– Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies amongwith barcode scanners to find dependencies among items.
– A classic rule --A classic rule • If a customer buys diaper and milk, then he is very
likely to buy beer.likely to buy beer.• So, don’t be surprised if you find six-packs stacked
next to diapers!p
Association Rule Discovery: Application 3Association Rule Discovery: Application 3
• Inventory Management:– Goal: A consumer appliance repair company wants to
ti i t th t f i itanticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumerright parts to reduce on number of visits to consumer households.
– Approach: Process the data on tools and parts pp oac ocess e da a o oo s a d pa srequired in previous repairs at different consumer locations and discover the co-occurrence patterns.
DATA MINING EXAMPLEDATA MINING EXAMPLEBeer and Nappies/Diapers -- A Data Mining Urban LegendBeer and Nappies/Diapers -- A Data Mining Urban Legend
Introduction
There is a story that a large supermarket chain,usually Wal-Mart, did an analysis of customers' buying h bit d f d t ti ti ll i ifi t l tihabits and found a statistically significant correlation between purchases of beer and purchases of nappies (diapers in the US).
It was theorized that the reason for this was that fathers were stopping off at Wal-Mart to buy nappies for their babies, and since they could no longer go down to the pub as often, would buy beer as well. As a result of this g p yfinding, the supermarket chain is alleged to have the nappies next to the beer, resulting in increased sales of both.
The Original Version
49
The Original Version
DATA MINING EXAMPLEBeer and Nappies/Diapers -- A Data Mining Urban Legend
DATA MINING EXAMPLEBeer and Nappies/Diapers -- A Data Mining Urban Legend
Wh t i th i f ti ?What is the information?Diapers Bear
50
DATA MINING EXAMPLEBeer and Nappies/Diapers -- A Data Mining Urban Legend
DATA MINING EXAMPLEBeer and Nappies/Diapers -- A Data Mining Urban Legend
The stories are embellished with plausible sounding factoids, e.g. increased sales as a percentage:
"The man decided to b a si pack of beer hile he as"The man decided to buy a six pack of beer while he was stopped anyway. The discount chain moved the beer and
snacks such as peanuts and pretzels next to the disposablesnacks such as peanuts and pretzels next to the disposable diapers and increased sales on peanuts and pretzels by more
that 27%."
The now legendary revelation that men who bought diapers on Friday night at convenience stores were also likely to buy
51
on Friday night at convenience stores were also likely to buy beer is an example of market basket analysis.
Sequential Pattern Discovery: Definition• Given is a set of objects, with each object associated with its own
timeline of events find rules that predict strong sequentialtimeline of events, find rules that predict strong sequential dependencies among different events.
R l f d b fi t di i tt E t i
(A B) (C) (D E)
• Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints.
(A B) (C) (D E)ng
Sequential Pattern Discovery: Examples
• In telecommunications alarm logs,(I t P bl E i Li C t)– (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)• In point-of-sale transaction sequences,
– Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) -->
(Perl for dummies Tcl Tk)(Perl_for_dummies,Tcl_Tk)– Athletic Apparel Store:
(Shoes) (Racket Racketball) > (Sports Jacket)(Shoes) (Racket, Racketball) --> (Sports_Jacket)
RegressionRegressionPredict a value of a given continuous valued variable• Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.linear or nonlinear model of dependency.
• Greatly studied in statistics, neural network fields.• Examples:Examples:
– Predicting sales amounts of new product based on advetising expenditure.g p
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.p y p
– Time series prediction of stock market indices.
Deviation/Anomaly Detection• Detect significant deviations from normal
b h ibehavior• Applications:pp
– Credit Card Fraud Detection
– Network Intrusion et o t us oDetection
Typical network traffic at University level may reach over 100 million connections per day
Data Mining ApplicationsData Mining Applications• Science: Chemistry, Physics, Mediciney, y ,
– Biochemical analysis– Remote sensors on a satelliteRemote sensors on a satellite– Telescopes – star galaxy classification– Medical Image analysisMedical Image analysis
• BioscienceS b d l i– Sequence-based analysis
– Protein structure and function prediction– Protein family classification– Microarray gene expression
56
D t Mi i A li tiData Mining Applications• Pharmaceutical companies, Insurance and
Health care, Medicine– Drug development– Identify successful medical therapies– Claims analysis, fraudulent behavior– Medical diagnostic tools– Predict office visits
57
Data Mining Applications• Financial Industry, Banks, Businesses, E-
Data Mining Applicationsy, , ,
commerce– Stock and investment analysisy– Identify loyal customers vs. risky customer– Predict customer spendingPredict customer spending – Risk management– Sales forecastingSales forecasting
• Retail and MarketingCustomer buying patterns/demographic characteristics– Customer buying patterns/demographic characteristics
– Mailing campaignsMarket basket analysis
58
– Market basket analysis– Trend analysis
Data Mining ApplicationsData Mining Applications• Database analysis and decision support• Database analysis and decision support
– Market analysis and management• target marketing, customer relation management, market basket
analysis, cross selling, market segmentation
– Risk analysis and management• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and management
59
Data Mining ApplicationsData Mining ApplicationsS t d E t t i t• Sports and Entertainment – IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
• Astronomy– JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
60
Data Mining: Some applications in Granada
Problemas en la BancaProblemas en la Banca
61IMAGINÁTICA, Sevilla, 4 de Marzo de 2009
Data Mining: Some applications in Granada
Problemas en la Banca Contrato de Investigación: Uso de técnicas de minería de datos para la detección de patrones de baja calidad en operaciones bancariasEntidad Financiadora: CajaGranada- Ref. Fundacion Empresa Universidad de Granada 3022-00Duracion: 1 Marzo 2008 - 30 Marzo 2009Duracion: 1 Marzo 2008 30 Marzo 2009Contrato de Investigación: Técnicas de Minería de Datos y Soft Computing para la mejora de la calidad de datos introducidos en scoringEntidad Financiadora: Caja Navarra Ref. Fundacion Empresa Universidad de Granada 3245-00 Duracion: 1 Marzo 2008 - 30 Marzo 2009Duracion: 1 Marzo 2008 30 Marzo 2009
62Caja Navarra
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
Concluding Remarks63
Concluding Remarks
Major Issues in Data MiningMajor Issues in Data Mining
• Mining methodology and user interaction– Mining different kinds of knowledge in databases– Incorporation of background knowledge– Handling noise and incomplete data– Pattern evaluation: the interestingness problemPattern evaluation: the interestingness problem – Expression and visualization of data mining results
64
Major Issues in Data MiningMajor Issues in Data Mining
• Performance and scalability– Efficiency of data mining algorithms– Efficiency of data mining algorithms– Parallel, distributed and incremental mining methods
• Issues relating to the diversity of data typesH dli l ti l d l t f d t– Handling relational and complex types of data
– Mining information from diverse databases
65
Major Issues in Data MiningMajor Issues in Data Mining
• Issues related to applications and social impacts– Application of discovered knowledgepp g
• Domain-specific data mining tools• Intelligent query answering• Expert systems• Process control and decision making
A knowledge fusion problem– A knowledge fusion problem– Protection of data security, integrity, and privacy
66
Introduction to Data Mining and Knowledge Discovery
Outline
Discovery
Outline Motivation: Why Data Mining?y g
What is Data Mining?
History of Data Mining
Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process
Data Mining Applications
Issues in Data Mining
Concluding Remarks67
Concluding Remarks
Introduction to Data Mining and Knowledge Discovery
Concluding RemarksSummary
• Data mining: discovering interesting patterns from large amounts of data
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentationpresentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.g y
68
Introduction to Data Mining and Knowledge Discovery
Data Mining en la WEBL i l WEB
El Web mining o Webmininges una metodología de
La presencia en la WEB
recuperación de la información que usa
herramientas de la minería de datos para extraer información tanto del
contenido de las páginas decontenido de las páginas, de su estructura de relaciones (enlaces) y de los registro
de navegación de losde navegación de los usuarios.
69“En una sociedad donde lo que no se
encuentra en la web parece que no existe ...”
Bibliographyg p yJ. Han, M. Kamber.
Data Mining. Concepts and Techniques Morgan Kaufmann, 2006 (Second Edition)g , ( )
http://www.cs.sfu.ca/~han/dmbook
I.H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques,
Second Edition Morgan Kaufmann 2005Second Edition,Morgan Kaufmann, 2005.http://www.cs.waikato.ac.nz/~ml/weka/book.html
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar Introduction to Data Mining (First Edition)
Addi W l (M 2 2005)Addison Wesley, (May 2, 2005)http://www-users.cs.umn.edu/~kumar/dmbook/index.php
Margaret H. DunhamData Mining: Introductory and Advanced TopicsData Mining: Introductory and Advanced Topics
Prentice Hall, 2003http://lyle.smu.edu/~mhd/book
Dorian Pyle Data Preparation for Data MiningMorgan Kaufmann, Mar 15, 1999
70Mamdouh Refaat
Data Preparation for Data Mining Using SASMorgan Kaufmann, Sep. 29, 2006)
ISBN: 9781420089646Publication Date: April 09, 2009
Number of Pages: 208
ICDM’06 Panel on “Top 10 Algorithms in Data Mining”
71
Introduction to Data Mining and Knowledge Discovery
Concluding Remarks
Important steps for study: Knowledge
Targetd
Processed
Patterns
data data
D t Mi i
InterpretationEvaluation
data
Preprocessing
Data Mining Evaluation
Selection
p g& cleaning
7272
Introduction to Data Mining and Knowledge Discovery
73