Data Mining and Soft Computing - UVa · 2012. 5. 1. · Data Mining and Soft Computing Summary 1. Introduction to Data Mining and Knowledge Discovery 2. Dt P tiData Preparation 3

Data Mining and Soft Computing

Session 1. Introduction to Data Mining and Knowledge DiscoveryIntroduction to Data Mining and Knowledge Discovery

Francisco HerreraResearch Group on Soft Computing andInformation Intelligent Systems (SCI2S)

Dept. of Computer Science and A.I. University of Granada SpainUniversity of Granada, Spain

Email: [email protected]://sci2s ugr eshttp://sci2s.ugr.es

http://decsai.ugr.es/~herrera

Data Mining and Soft Computing

Summary1. Introduction to Data Mining and Knowledge Discovery2 D t P ti2. Data Preparation 3. Introduction to Prediction, Classification, Clustering and Association4. Introduction to Soft Computing. Focusing our attention in Fuzzy Logic

and Evolutionary Computationand Evolutionary Computation5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and

Knowledge Extraction based on Evolutionary Learning6 Genetic Fuzzy Systems: State of the Art and New Trends6. Genetic Fuzzy Systems: State of the Art and New Trends7. Some Advanced Topics: Imbalanced data sets, subgroup discovery,

data complexity. 8. Final talk: How must I Do my Experimental Study? Design of8. Final talk: How must I Do my Experimental Study? Design of

Experiments in Data Mining/Computational Intelligence. Using Non-parametric Tests. Some Cases of Study.

Introduction to Data Mining and Knowledge Discovery

Outline

Discovery

Outline Motivation: Why Data Mining?y g

What is Data Mining?

History of Data Mining

Data Mining Terminology / KDD Process Data Mining Terminology / KDD Process

Data Mining Applications

Issues in Data Mining

Concluding Remarks3

Concluding Remarks


Outline

Discovery







Concluding Remarks4

Concluding Remarks

Why DATA MINING?Why DATA MINING?

• Lots of data is being collected and warehoused – Web data, e-commerce– purchases at department/

tgrocery stores– Bank/Credit Card

transactionstransactions

• Computers have become cheaper and more powerful

• Competitive Pressure is Strong – Provide better, customized services for an edge (e.g. inProvide better, customized services for an edge (e.g. in

Customer Relationship Management)


• Data collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite

– telescopes scanning the skies

– microarrays generating gene expression data

– scientific simulations generating terabytes of data

• Traditional techniques infeasible for raw data• Traditional techniques infeasible for raw data• Data mining may help scientists

– in classifying and segmenting datay g g g– in Hypothesis Formation


• Huge amounts of data

• Data rich – but information poorData rich but information poor• Lying hidden in all this data is information!

7

Mining Large Data Sets - Motivation

• There is often information “hidden” in the data that is not readily evident

• Human analysts may take weeks to discover useful informationM h f th d t i l d t ll• Much of the data is never analyzed at all

3 500 000

4,000,000

2,500,000

3,000,000

3,500,000

The Data Gap

1,500,000

2,000,000Total new disk (TB) since 1995

0

500,000

1,000,000Number of

analysts0

1995 1996 1997 1998 1999

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

Data vs InformationData vs. Information

• Society produces massive amounts of data– business, science, medicine, economics, sports,

…P t ti ll l bl• Potentially valuable resource

• Raw data is useless– need techniques to automatically extract

information Data: recorded facts– Data: recorded facts

– Information: patterns underlying the data

9

Figura 1. Relación entre dato, información y conocimiento

http://www.uoc.edu/web/esp/art/uoc/molina1102/molina1102.html

10


Outline

Discovery







Concluding Remarks11

Concluding Remarks

What is DATA MINING?What is DATA MINING?

M D fi iti

E t ti “ i i ” k l d f l t

Many Definitions

• Extracting or “mining” knowledge from large amounts of data

• Data -driven discovery and modeling of hidden patterns (we never new existed) in large volumes of datadata

• Extraction of implicit, previously unknown and yunexpected, potentially extremely useful information from data

12

What Is Data Mining?What Is Data Mining?

Data mining: Data mining: Extraction of interesting (non-trivial, implicit,

i l k d t ti ll f l)previously unknown and potentially useful)information or patterns from data in large databases

13

Data Mining is NOTData Mining is NOT• Data Warehousing What is not Data• Data Warehousing• (Deductive) query

processing

What is not Data Mining?

processing – SQL/ Reporting

• Software Agents– Look up phone number in phone • Software Agents

• Expert SystemsOnline Analytical Processing

directory

• Online Analytical Processing (OLAP)Statistical Analysis Tool

– Query a Web search engine for• Statistical Analysis Tool

• Data visualization

search engine for information about

“Amazon”14

Machine Learning TechniquesMachine Learning Techniques

Technical basis for data mining:

• algorithms for acquiring structural descriptions from examples

• Methods originate from artificial intelligence, hi l i i i d hmachine learning, statistics, and research on

databases15

Origins of Data Mining• Draws ideas from machine learning/AI, pattern

recognition statistics and database systemsrecognition, statistics, and database systems

AI/• Traditional Techniquesmay be unsuitable due to

AI/

Machine Learning/Pattern

Statistics

– Enormity of data– High dimensionality of data

Pattern Recognition

Data Miningg y– Heterogeneous,

distributed nature of data

Data Mining

Database systems

Multidisciplinary FieldMultidisciplinary Field

Database Technology StatisticsTechnology

Data MiningMachineLearning Visualization

OtherDisciplines

Artificial Intelligence p

Data Mining TasksData Mining TasksP di ti M th d• Prediction Methods– Use some variables to predict unknown or

future values of other variables.

• Description MethodsFind human interpretable patterns that– Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Summary in a PictureSummary in a Picture

How can I analyze this data?

Knowledge

We have rich data, but poor information

Data mining-searching for knowledge (interesting patterns) in your data

19

p (interesting patterns) in your data.

J. Han, M. Kamber. Data Mining. Concepts and Techniques Morgan Kaufmann, 2006 (Second Edition)


Outline

Discovery







20

HistoryHistory• Emerged late 1980s

• Flourished –1990s

• Consolidation – 2000s

21

A Brief History of Data Mining SocietyA Brief History of Data Mining Society• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)

– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

• 1991-1994 Workshops on Knowledge Discovery in Databases1991 1994 Workshops on Knowledge Discovery in Databases

– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)1993 – Association Rule Mining Algorithm APRIORI proposed by Agraval Imielinski and Swamiproposed by Agraval, Imielinski and Swami.

• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)

– Journal of Data Mining and Knowledge Discovery (1997)• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations• More conferences on data mining

22

More conferences on data mining

– PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.


Outline

Discovery








Concluding Remarks

TerminologyTerminology

• Gold Mining

• Knowledge mining from databases

• Knowledge extraction• Knowledge extraction

• Data/pattern analysisData/pattern analysis

• Knowledge Discovery Databases or KDD

• Business intelligence

24

Input Output ViewInput-Output View

ngng

Data(internal & external)

Reports

a M

inin

a M

inin

Dat

aD

ata

Decision ModelsObjective(s)

B i K l d N K l d

25

Business Knowledge New Knowledge

Process ViewProcess ViewInterpretation

&

Model

&Evaluation

Dissemination&

Build a decision tree

Check against hold-out set

Building Deployment

Aggregate individual incomes into household income

Learn about loans, repayments, etc.;Data

Pre-processing

PatternsModels

Learn about loans, repayments, etc.;Collect data about past performance

p g

UnderstandingDomain & Data

Pre-processedData

BusinessProblem

Formulation

RawData

SelectedData

26

Knowledge Discovery Processow edge scove y ocess

Interpretation

Integration

Raw

& EvaluationKnowledge

______

______

______

T f d

Patternsand

Rules

Data UnderstaTransformed DataTarget

Data

andingDATADATAWareWarehhhousehouse

27

KDD ProcessKDD Process

DatabaseDatabase

Selection Transformation

Data Preparation

Data Data MiningMining

Training Data

Model, Patternsp gg Patterns

Evaluation, Verification

28

Knowledge Discovery Processfl di t CRISP DMflow, according to CRISP-DM

see www.crisp-dm.org

ffor more information

29


Outline

Discovery








Concluding Remarks

Data Mining TasksData Mining Tasks• Prediction Methods

– Use some variables to predict unknown or future values of other variables.

• Description Methods– Find human-interpretable patterns that describe the data.

Data Mining TasksData Mining Tasks...

• Classification [Predictive]• Clustering [Descriptive]Clustering [Descriptive]• Association Rule Discovery [Descriptive]• Sequential Pattern Discovery [Descriptive]• Regression [Predictive]• Regression [Predictive]• Deviation Detection [Predictive]• Time Series [Predictive]• Summarization [D i ti ]• Summarization [Descriptive]

Classification: DefinitionClassification: Definition• Given a collection of records (training set )( g )

– Each record contains a set of attributes, one of the attributes is the class.

• Find a model for class attribute as a function of the values of other attributes.

• Goal: previously unseen records should be assigned a class as accurately as possible.

A t t t i d t d t i th f th– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to buildtraining and test sets, with training set used to build the model and test set used to validate it.

Classification ExampleClassification Example

Tid Refund MaritalStatus

TaxableIncome Cheat

Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

No Single 75K ?

Yes Married 50K ?

No Married 150K ?

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

Yes Divorced 90K ?

No Single 40K ?

No Married 80K ? Test6 No Married 60K No7 Yes Divorced 220K No

8 No Single 85K Yes

9 N M i d 75K N

No Married 80K ?10

TestSet

L9 No Married 75K No10 No Single 90K Yes

10

Training Set

ModelLearn

Classifier

Classification: Application 1Classification: Application 1• Direct Marketing

– Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

– Approach:• Use the data for a similar product introduced before. • We know which customers decided to buy and which

decided otherwise. This {buy, don’t buy} decision forms the class attribute.

• Collect various demographic, lifestyle, and company-interaction related information about all such customers.

Type of business where they stay how much they earn etc– Type of business, where they stay, how much they earn, etc.• Use this information as input attributes to learn a classifier

model.

Classification: Application 2Classification: Application 2• Fraud Detection

– Goal: Predict fraudulent cases in credit card transactions.A h– Approach:

• Use credit card transactions and the information on its account-holder as attributes.

– When does a customer buy, what does he buy, how often he pays on time, etc

• Label past transactions as fraud or fair transactions. This pforms the class attribute.

• Learn a model for the class of the transactions.• Use this model to detect fraud by observing credit cardUse this model to detect fraud by observing credit card

transactions on an account.

Classification: Application 3Classification: Application 3

• Customer Attrition/Churn:– Goal: To predict whether a customer is likely to be

lost to a competitor.– Approach:

• Use detailed record of transactions with each of the past and present customers, to find attributes.

– How often the customer calls, where he calls, what time-of-the , ,day he calls most, his financial status, marital status, etc.

• Label the customers as loyal or disloyal.Find a model for loyalty• Find a model for loyalty.

Classification: Application 4Classification: Application 4• Sky Survey Cataloging

– Goal: To predict class (star or galaxy) of sky objects, i ll i ll f i t b d th t l iespecially visually faint ones, based on the telescopic

survey images (from Palomar Observatory).– 3000 images with 23 040 x 23 040 pixels per image3000 images with 23,040 x 23,040 pixels per image.

– Approach:• Segment the image. g g• Measure image attributes (features) - 40 of them per object.• Model the class based on these features.• Success Story: Could find 16 new high red-shift quasars,

some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxiesy gCourtesy: http://aps.umn.edu

Early Class: • Stages of Formation

Attributes:• Image features,

• Characteristics of light waves received etc

Intermediatewaves received, etc.

Late

D t SiData Size: • 72 million stars, 20 million galaxies

• Object Catalog: 9 GB• Image Database: 150 GB• Image Database: 150 GB

Clustering DefinitionClustering DefinitionGi t f d t i t h h i t f• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such thatfind clusters such that– Data points in one cluster are more similar to one

anotheranother.– Data points in separate clusters are less similar to

one another.• Similarity Measures:

– Euclidean Distance if attributes are continuous.– Other Problem-specific Measures.

Illustrating ClusteringIllustrating ClusteringEuclidean Distance Based Clustering in 3-D space.

Intracluster distancesare minimized

Intracluster distancesare minimized

Intercluster distancesare maximized

Intercluster distancesare maximizedare minimizedare minimized are maximizedare maximized

Clustering: Application 1Clustering: Application 1• Market Segmentation:

G l bdi id k t i t di ti t b t f– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with aselected as a market target to be reached with a distinct marketing mix.

– Approach: • Collect different attributes of customers based on

their geographical and lifestyle related information.Fi d l t f i il t• Find clusters of similar customers.

• Measure the clustering quality by observing buying patterns of customers in same cluster vs thosepatterns of customers in same cluster vs. those from different clusters.

Clustering: Application 2Clustering: Application 2D t Cl t i• Document Clustering:– Goal: To find groups of documents that are similar to

each other based on the important terms appearing ineach other based on the important terms appearing in them.Approach: To identify frequently occurring terms in– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.q

– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Illustrating Document ClusteringIllustrating Document Clustering• Clustering Points: 3204 Articles of Los Angeles Times• Clustering Points: 3204 Articles of Los Angeles Times.• Similarity Measure: How many words are common in

these documents (after some word filtering).these documents (after some word filtering).

Category TotalArticles

CorrectlyPlacedArticles Placed

Financial 555 364

Foreign 341 260

National 273 36

Metro 943 746Metro 943 746

Sports 738 573

Entertainment 354 278Entertainment 354 278

Association Rule Discovery: DefinitionAssociation Rule Discovery: Definition

• Given a set of records each of which contain some number of items from a given collection;

P d d d l hi h ill di t– Produce dependency rules which will predict occurrence of an item based on occurrences of other itemsitems.

TID Items

1 B d C k Milk1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk

Rules Discovered:{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}

Rules Discovered:{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk

{ p , } { }{ p , } { }

Association Rule Discovery: Application 1Association Rule Discovery: Application 1

• Marketing and Sales Promotion:g– Let the rule discovered be

{Bagels, … } --> {Potato Chips}– Potato Chips as consequent => Can be used to

determine what should be done to boost its sales.B l i th t d t b d t– Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.g g

– Bagels in antecedent and Potato chips in consequent=> Can be used to see what products should be sold

ith B l t t l f P t t hi !with Bagels to promote sale of Potato chips!


S k t h lf t• Supermarket shelf management.– Goal: To identify items that are bought together by

s fficientl man c stomerssufficiently many customers.– Approach: Process the point-of-sale data collected

with barcode scanners to find dependencies amongwith barcode scanners to find dependencies among items.

– A classic rule --A classic rule • If a customer buys diaper and milk, then he is very

likely to buy beer.likely to buy beer.• So, don’t be surprised if you find six-packs stacked

next to diapers!p


• Inventory Management:– Goal: A consumer appliance repair company wants to

ti i t th t f i itanticipate the nature of repairs on its consumer products and keep the service vehicles equipped with right parts to reduce on number of visits to consumerright parts to reduce on number of visits to consumer households.

– Approach: Process the data on tools and parts pp oac ocess e da a o oo s a d pa srequired in previous repairs at different consumer locations and discover the co-occurrence patterns.

DATA MINING EXAMPLEDATA MINING EXAMPLEBeer and Nappies/Diapers -- A Data Mining Urban LegendBeer and Nappies/Diapers -- A Data Mining Urban Legend

Introduction

There is a story that a large supermarket chain,usually Wal-Mart, did an analysis of customers' buying h bit d f d t ti ti ll i ifi t l tihabits and found a statistically significant correlation between purchases of beer and purchases of nappies (diapers in the US).

It was theorized that the reason for this was that fathers were stopping off at Wal-Mart to buy nappies for their babies, and since they could no longer go down to the pub as often, would buy beer as well. As a result of this g p yfinding, the supermarket chain is alleged to have the nappies next to the beer, resulting in increased sales of both.

The Original Version

49

The Original Version

DATA MINING EXAMPLEBeer and Nappies/Diapers -- A Data Mining Urban Legend


Wh t i th i f ti ?What is the information?Diapers Bear

50



The stories are embellished with plausible sounding factoids, e.g. increased sales as a percentage:

"The man decided to b a si pack of beer hile he as"The man decided to buy a six pack of beer while he was stopped anyway. The discount chain moved the beer and

snacks such as peanuts and pretzels next to the disposablesnacks such as peanuts and pretzels next to the disposable diapers and increased sales on peanuts and pretzels by more

that 27%."

The now legendary revelation that men who bought diapers on Friday night at convenience stores were also likely to buy

51

on Friday night at convenience stores were also likely to buy beer is an example of market basket analysis.

Sequential Pattern Discovery: Definition• Given is a set of objects, with each object associated with its own

timeline of events find rules that predict strong sequentialtimeline of events, find rules that predict strong sequential dependencies among different events.

R l f d b fi t di i tt E t i

(A B) (C) (D E)

• Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints.

(A B) (C) (D E)ng

Sequential Pattern Discovery: Examples

• In telecommunications alarm logs,(I t P bl E i Li C t)– (Inverter_Problem Excessive_Line_Current)

(Rectifier_Alarm) --> (Fire_Alarm)• In point-of-sale transaction sequences,

– Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) -->

(Perl for dummies Tcl Tk)(Perl_for_dummies,Tcl_Tk)– Athletic Apparel Store:

(Shoes) (Racket Racketball) > (Sports Jacket)(Shoes) (Racket, Racketball) --> (Sports_Jacket)

RegressionRegressionPredict a value of a given continuous valued variable• Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.linear or nonlinear model of dependency.

• Greatly studied in statistics, neural network fields.• Examples:Examples:

– Predicting sales amounts of new product based on advetising expenditure.g p

– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.p y p

– Time series prediction of stock market indices.

Deviation/Anomaly Detection• Detect significant deviations from normal

b h ibehavior• Applications:pp

– Credit Card Fraud Detection

– Network Intrusion et o t us oDetection

Typical network traffic at University level may reach over 100 million connections per day

Data Mining ApplicationsData Mining Applications• Science: Chemistry, Physics, Mediciney, y ,

– Biochemical analysis– Remote sensors on a satelliteRemote sensors on a satellite– Telescopes – star galaxy classification– Medical Image analysisMedical Image analysis

• BioscienceS b d l i– Sequence-based analysis

– Protein structure and function prediction– Protein family classification– Microarray gene expression

56

D t Mi i A li tiData Mining Applications• Pharmaceutical companies, Insurance and

Health care, Medicine– Drug development– Identify successful medical therapies– Claims analysis, fraudulent behavior– Medical diagnostic tools– Predict office visits

57

Data Mining Applications• Financial Industry, Banks, Businesses, E-

Data Mining Applicationsy, , ,

commerce– Stock and investment analysisy– Identify loyal customers vs. risky customer– Predict customer spendingPredict customer spending – Risk management– Sales forecastingSales forecasting

• Retail and MarketingCustomer buying patterns/demographic characteristics– Customer buying patterns/demographic characteristics

– Mailing campaignsMarket basket analysis

58

– Market basket analysis– Trend analysis

Data Mining ApplicationsData Mining Applications• Database analysis and decision support• Database analysis and decision support

– Market analysis and management• target marketing, customer relation management, market basket

analysis, cross selling, market segmentation

– Risk analysis and management• Forecasting, customer retention, improved underwriting, quality

control, competitive analysis

– Fraud detection and management

59

Data Mining ApplicationsData Mining ApplicationsS t d E t t i t• Sports and Entertainment – IBM Advanced Scout analyzed NBA game statistics

(shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat

• Astronomy– JPL and the Palomar Observatory discovered 22

quasars with the help of data mining

60

Data Mining: Some applications in Granada

Problemas en la BancaProblemas en la Banca

61IMAGINÁTICA, Sevilla, 4 de Marzo de 2009

Data Mining: Some applications in Granada

Problemas en la Banca Contrato de Investigación: Uso de técnicas de minería de datos para la detección de patrones de baja calidad en operaciones bancariasEntidad Financiadora: CajaGranada- Ref. Fundacion Empresa Universidad de Granada 3022-00Duracion: 1 Marzo 2008 - 30 Marzo 2009Duracion: 1 Marzo 2008 30 Marzo 2009Contrato de Investigación: Técnicas de Minería de Datos y Soft Computing para la mejora de la calidad de datos introducidos en scoringEntidad Financiadora: Caja Navarra Ref. Fundacion Empresa Universidad de Granada 3245-00 Duracion: 1 Marzo 2008 - 30 Marzo 2009Duracion: 1 Marzo 2008 30 Marzo 2009

62Caja Navarra


Outline

Discovery








Concluding Remarks

Major Issues in Data MiningMajor Issues in Data Mining

• Mining methodology and user interaction– Mining different kinds of knowledge in databases– Incorporation of background knowledge– Handling noise and incomplete data– Pattern evaluation: the interestingness problemPattern evaluation: the interestingness problem – Expression and visualization of data mining results

64


• Performance and scalability– Efficiency of data mining algorithms– Efficiency of data mining algorithms– Parallel, distributed and incremental mining methods

• Issues relating to the diversity of data typesH dli l ti l d l t f d t– Handling relational and complex types of data

– Mining information from diverse databases

65


• Issues related to applications and social impacts– Application of discovered knowledgepp g

• Domain-specific data mining tools• Intelligent query answering• Expert systems• Process control and decision making

A knowledge fusion problem– A knowledge fusion problem– Protection of data security, integrity, and privacy

66


Outline

Discovery








Concluding Remarks


Concluding RemarksSummary

• Data mining: discovering interesting patterns from large amounts of data

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentationpresentation

• Mining can be performed in a variety of information repositories

• Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.g y

68


Data Mining en la WEBL i l WEB

El Web mining o Webmininges una metodología de

La presencia en la WEB

recuperación de la información que usa

herramientas de la minería de datos para extraer información tanto del

contenido de las páginas decontenido de las páginas, de su estructura de relaciones (enlaces) y de los registro

de navegación de losde navegación de los usuarios.

69“En una sociedad donde lo que no se

encuentra en la web parece que no existe ...”

Bibliographyg p yJ. Han, M. Kamber.

Data Mining. Concepts and Techniques Morgan Kaufmann, 2006 (Second Edition)g , ( )

http://www.cs.sfu.ca/~han/dmbook

I.H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques,

Second Edition Morgan Kaufmann 2005Second Edition,Morgan Kaufmann, 2005.http://www.cs.waikato.ac.nz/~ml/weka/book.html

Pang-Ning Tan, Michael Steinbach, and Vipin Kumar Introduction to Data Mining (First Edition)

Addi W l (M 2 2005)Addison Wesley, (May 2, 2005)http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Margaret H. DunhamData Mining: Introductory and Advanced TopicsData Mining: Introductory and Advanced Topics

Prentice Hall, 2003http://lyle.smu.edu/~mhd/book

Dorian Pyle Data Preparation for Data MiningMorgan Kaufmann, Mar 15, 1999

70Mamdouh Refaat

Data Preparation for Data Mining Using SASMorgan Kaufmann, Sep. 29, 2006)

ISBN: 9781420089646Publication Date: April 09, 2009

Number of Pages: 208

ICDM’06 Panel on “Top 10 Algorithms in Data Mining”

71


Concluding Remarks

Important steps for study: Knowledge

Targetd

Processed

Patterns

data data

D t Mi i

InterpretationEvaluation

data

Preprocessing

Data Mining Evaluation

Selection

p g& cleaning

7272


73

Documents

Data Mining and Soft Computing - UVa · 2012. 5. 1. · Data Mining and Soft Computing Summary 1. Introduction to Data Mining and Knowledge Discovery 2. Dt P tiData Preparation 3