Data Mining and Soft Computing. Session 1. Introduction to Data Mining and Knowledge Discovery. Francisco Herrera, Research Group on Soft Computing and Information Intelligent Systems (SCI2S), Dept. of Computer Science and A.I., University of Granada, Spain. Email: [email protected] http://sci2s.ugr.es http://decsai.ugr.es/~herrera

Data Mining and Soft Computing - UVa · 2012. 5. 1.

  • Data Mining and Soft Computing

    Summary

    1. Introduction to Data Mining and Knowledge Discovery
    2. Data Preparation
    3. Introduction to Prediction, Classification, Clustering and Association
    4. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic and Evolutionary Computation
    5. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
    6. Genetic Fuzzy Systems: State of the Art and New Trends
    7. Some Advanced Topics: Imbalanced data sets, subgroup discovery, data complexity
    8. Final talk: How must I Do my Experimental Study? Design of Experiments in Data Mining/Computational Intelligence. Using Non-parametric Tests. Some Case Studies.

  • Introduction to Data Mining and Knowledge Discovery

    Outline

    Motivation: Why Data Mining?

    What is Data Mining?

    History of Data Mining

    Data Mining Terminology / KDD Process

    Data Mining Applications

    Issues in Data Mining

    Concluding Remarks

  • Why DATA MINING?

    • Lots of data is being collected and warehoused
    – Web data, e-commerce
    – Purchases at department/grocery stores
    – Bank/credit card transactions

    • Computers have become cheaper and more powerful

    • Competitive pressure is strong
    – Provide better, customized services for an edge (e.g. in Customer Relationship Management)

  • Why DATA MINING?

    • Data collected and stored at enormous speeds (GB/hour)
    – remote sensors on a satellite
    – telescopes scanning the skies
    – microarrays generating gene expression data
    – scientific simulations generating terabytes of data

    • Traditional techniques infeasible for raw data

    • Data mining may help scientists
    – in classifying and segmenting data
    – in hypothesis formation

  • Why DATA MINING?

    • Huge amounts of data
    • Data rich, but information poor
    • Lying hidden in all this data is information!

  • Mining Large Data Sets - Motivation

    • There is often information “hidden” in the data that is not readily evident
    • Human analysts may take weeks to discover useful information
    • Much of the data is never analyzed at all

    [Figure: “The Data Gap”. Total new disk (TB) since 1995 plotted against the number of analysts, for the years 1995 to 1999; vertical axis from 0 to 4,000,000.]

    From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

  • Data vs. Information

    • Society produces massive amounts of data
    – business, science, medicine, economics, sports, …
    • Potentially valuable resource
    • Raw data is useless
    – need techniques to automatically extract information
    – Data: recorded facts
    – Information: patterns underlying the data

  • Figure 1. Relationship between data, information and knowledge

    http://www.uoc.edu/web/esp/art/uoc/molina1102/molina1102.html

  • What is DATA MINING?

    Many Definitions

    • Extracting or “mining” knowledge from large amounts of data
    • Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data
    • Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data

  • What Is Data Mining?

    Data mining: Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

  • Data Mining is NOT

    • Data Warehousing
    • (Deductive) query processing – SQL/Reporting
    • Software Agents
    • Expert Systems
    • Online Analytical Processing (OLAP)
    • Statistical Analysis Tool
    • Data visualization

    What is not Data Mining?

    – Look up phone number in phone directory
    – Query a Web search engine for information about “Amazon”

  • Machine Learning Techniques

    Technical basis for data mining:

    • Algorithms for acquiring structural descriptions from examples
    • Methods originate from artificial intelligence, machine learning, statistics, and research on databases

  • Origins of Data Mining

    • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
    • Traditional techniques may be unsuitable due to
    – Enormity of data
    – High dimensionality of data
    – Heterogeneous, distributed nature of data

    [Figure: Data Mining at the overlap of Statistics, AI/Machine Learning/Pattern Recognition, and Database systems.]

  • Multidisciplinary Field

    [Figure: Data Mining surrounded by the fields it draws on: Database Technology, Statistics, Machine Learning, Visualization, Artificial Intelligence, Other Disciplines.]

  • Data Mining Tasks

    • Prediction Methods
    – Use some variables to predict unknown or future values of other variables.
    • Description Methods
    – Find human-interpretable patterns that describe the data.

    From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

  • Summary in a Picture

    How can I analyze this data?

    Knowledge

    We have rich data, but poor information

    Data mining: searching for knowledge (interesting patterns) in your data.

    J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006 (Second Edition)

  • History

    • Emerged – late 1980s
    • Flourished – 1990s
    • Consolidation – 2000s

  • A Brief History of Data Mining Society

    • 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
    – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
    • 1991-1994 Workshops on Knowledge Discovery in Databases
    – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
    • 1993 – Association rule mining proposed by Agrawal, Imielinski and Swami (the APRIORI algorithm followed in 1994, from Agrawal and Srikant).
    • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
    – Journal of Data Mining and Knowledge Discovery (1997)
    • 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
    • More conferences on data mining
    – PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.

  • Terminology

    • Gold Mining
    • Knowledge mining from databases
    • Knowledge extraction
    • Data/pattern analysis
    • Knowledge Discovery in Databases or KDD
    • Business intelligence

  • Input-Output View

    [Figure: a central “Data Mining” box with the labels Data (internal & external), Objective(s), Business Knowledge, Reports, Decision Models, and New Knowledge around it.]

  • Process View

    [Figure: process flow from business problem to deployment, with example activities for a loan application.]

    • Business Problem Formulation
    • Understanding Domain & Data: learn about loans, repayments, etc.; collect data about past performance
    • Data Pre-processing: aggregate individual incomes into household income
    • Model Building: build a decision tree
    • Interpretation & Evaluation: check against hold-out set
    • Dissemination & Deployment

    Data along the flow: Raw Data, Selected Data, Pre-processed Data, Patterns/Models
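The pre-processing example on this slide (aggregating individual incomes into household income) can be sketched in a few lines. This is only an illustration: the record layout, the household identifiers, and the function name `aggregate_household_income` are assumptions, not part of the slides.

```python
from collections import defaultdict

# Hypothetical raw records: (household_id, person, yearly_income)
raw_records = [
    ("h1", "ana", 30000),
    ("h1", "luis", 25000),
    ("h2", "eva", 40000),
]

def aggregate_household_income(records):
    """Sum individual incomes per household (a pre-processing step)."""
    totals = defaultdict(int)
    for household_id, _person, income in records:
        totals[household_id] += income
    return dict(totals)

household_income = aggregate_household_income(raw_records)
print(household_income)  # {'h1': 55000, 'h2': 40000}
```

The aggregated table, rather than the raw per-person records, would then be handed to the model-building step.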

  • Knowledge Discovery Process

    [Figure: Data Warehouse and Raw Data → Integration → Target Data → Understanding → Transformed Data → Patterns and Rules → Interpretation & Evaluation → Knowledge.]

  • KDD Process

    [Figure: Database → Selection, Transformation, Data Preparation → Training Data → Data Mining → Model, Patterns → Evaluation, Verification.]

  • Knowledge Discovery Process flow, according to CRISP-DM

    see www.crisp-dm.org for more information

  • Data Mining Tasks

    • Prediction Methods
    – Use some variables to predict unknown or future values of other variables.
    • Description Methods
    – Find human-interpretable patterns that describe the data.

  • Data Mining Tasks...

    • Classification [Predictive]
    • Clustering [Descriptive]
    • Association Rule Discovery [Descriptive]
    • Sequential Pattern Discovery [Descriptive]
    • Regression [Predictive]
    • Deviation Detection [Predictive]
    • Time Series [Predictive]
    • Summarization [Descriptive]

  • Classification: Definition

    • Given a collection of records (training set)
    – Each record contains a set of attributes; one of the attributes is the class.
    • Find a model for the class attribute as a function of the values of the other attributes.
    • Goal: previously unseen records should be assigned a class as accurately as possible.
    – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
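The training/test division described above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy records, the 30% test fraction, and the trivial majority-class "model" are all hypothetical and stand in for whatever classifier is actually being evaluated.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def majority_class(train):
    """A deliberately trivial 'model': always predict the most common class."""
    counts = {}
    for _attrs, label in train:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

def accuracy(predict, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attrs, label in test if predict(attrs) == label)
    return correct / len(test)

# Hypothetical toy records: (attributes, class)
records = [({"income": i * 10}, "high" if i >= 5 else "low") for i in range(10)]

train, test = train_test_split(records)
baseline = majority_class(train)                 # built from the training set only
acc = accuracy(lambda attrs: baseline, test)     # validated on the held-out test set
print(len(train), len(test), acc)
```

The point of the split is that `acc` is measured on records the model never saw while it was being built.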

  • Classification Example

    Training Set

    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

    Test Set

    Refund  Marital Status  Taxable Income  Cheat
    No      Single          75K             ?
    Yes     Married         50K             ?
    No      Married         150K            ?
    Yes     Divorced        90K             ?
    No      Single          40K             ?
    No      Married         80K             ?

    [Figure: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.]
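To make the example concrete, the sketch below encodes the training and test records from the slide and applies a small set of decision rules to them. The rules in `classify` are an assumption: they are written by hand here, not learned by an algorithm and not given on the slide, and were chosen only because they agree with all ten training records.

```python
# Training set from the slide: (refund, marital_status, taxable_income_k, cheat)
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test set from the slide: the class is unknown ("?")
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def classify(refund, marital_status, income_k):
    """Hand-written decision rules (an assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"

# Check the rules against the labelled records, then apply them to unseen ones.
training_accuracy = sum(
    classify(r, m, i) == c for r, m, i, c in training_set
) / len(training_set)
predictions = [classify(*record) for record in test_set]

print(training_accuracy)  # 1.0
print(predictions)        # ['No', 'No', 'No', 'No', 'No', 'No']
```

A perfect score on the training set says nothing about unseen data, which is why the definition asks for a separate test set with known classes.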

  • Classification: Application 1

    • Direct Marketing
    – Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
    – Approach:
      • Use the data for a similar product introduced before.
      • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
      • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
        – Type of business, where they stay, how much they earn, etc.
      • Use this information as input attributes to learn a classifier model.

  • Classification: Application 2

    • Fraud Detection
    – Goal: Predict fraudulent cases in credit card transactions.
    – Approach:
      • Use credit card transactions and the information on the account holder as attributes.
        – When does a customer buy, what does he buy, how often he pays on time, etc.
      • Label past transactions as fraud or fair transactions. This forms the class attribute.
      • Learn a model for the class of the transactions.
      • Use this model to detect fraud by observing credit card transactions on an account.

  • Classification: Application 3

    • Customer Attrition/Churn:
    – Goal: To predict whether a customer is likely to be lost to a competitor.
    – Approach:
      • Use detailed records of transactions with each of the past and present customers to find attributes.
        – How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
      • Label the customers as loyal or disloyal.
      • Find a model for loyalty.

  • Classification: Application 4

    • Sky Survey Cataloging
    – Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
    – 3000 images with 23,040 x 23,040 pixels per image.
    – Approach:
      • Segment the image.
      • Measure image attributes (features): 40 of them per object.
      • Model the class based on these features.
      • Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

    From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

  • Classifying Galaxies

    Courtesy: http://aps.umn.edu

    Class:
    • Stages of Formation (Early, Intermediate, Late)

    Attributes:
    • Image features
    • Characteristics of light waves received, etc.

    Data Size:
    • 72 million stars, 20 million galaxies
    • Object Catalog: 9 GB
    • Image Database: 150 GB

  • Clustering Definition

    • Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
    – Data points in one cluster are more similar to one another.
    – Data points in separate clusters are less similar to one another.
    • Similarity Measures:
    – Euclidean Distance if attributes are continuous.
    – Other Problem-specific Measures.
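As one possible illustration of clustering with Euclidean distance, the sketch below uses plain k-means. The slides do not name a specific algorithm, so k-means is swapped in here as an assumption; the 3-D points and the choice of initial centroids are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids

# Hypothetical 3-D data: two visibly separated groups.
points = [(1.0, 1.0, 1.0), (1.5, 2.0, 1.0), (1.0, 1.5, 2.0),
          (8.0, 8.0, 9.0), (9.0, 8.5, 8.0), (8.5, 9.0, 9.5)]

clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

For attributes that are not continuous, `euclidean` would be replaced by a problem-specific measure, as the slide notes.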

  • Illustrating Clustering

    Euclidean Distance Based Clustering in 3-D space.

    [Figure: three groups of points in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.]
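The two quantities named in the figure can be computed directly for a given partition. The sketch below is an assumption about how one might measure them (mean pairwise distance within clusters versus across clusters); the points are hypothetical and each cluster is assumed to hold at least two points.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_intracluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    dists = [euclidean(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists)

def mean_intercluster_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    dists = [euclidean(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(dists) / len(dists)

# Hypothetical partition of six 3-D points into two clusters.
clusters = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(10.0, 10.0, 10.0), (11.0, 10.0, 10.0), (10.0, 11.0, 10.0)],
]

intra = mean_intracluster_distance(clusters)
inter = mean_intercluster_distance(clusters)
print(intra, inter)
```

A good clustering in the sense of the figure is one where `intra` is small relative to `inter`.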

  • Clustering: Application 1

    • Market Segmentation:
    – Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
    – Approach:
      • Collect different attributes of customers based on their geographical and lifestyle related information.
      • Find clusters of similar customers.
      • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

  • Clustering: Application 2

    • Document Clustering:
    – Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
    – Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
    – Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
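One common way to turn term frequencies into a similarity measure is cosine similarity between term-count vectors. The slides do not say which measure was used, so cosine similarity is an assumption here, as are the tiny stop-word list and the three example sentences.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "to", "on", "as"}  # hypothetical, tiny stop list

def term_frequencies(text):
    """Count the terms of a document after lowercasing and simple word filtering."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents
docs = {
    "finance_1": "Stocks fell as the bank raised interest rates.",
    "finance_2": "The bank cut interest rates and stocks rallied.",
    "sports_1": "The team won the final game of the season.",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_same = cosine_similarity(tf["finance_1"], tf["finance_2"])
sim_diff = cosine_similarity(tf["finance_1"], tf["sports_1"])
print(sim_same, sim_diff)
```

Any clustering procedure that accepts a pairwise similarity can then group the documents; a new document or search term is related to a cluster by computing the same similarity against it.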

  • Illustrating Document Clustering

    • Clustering Points: 3204 Articles of Los Angeles Times.
    • Similarity Measure: How many words are common in these documents (after some word filtering).

    Category       Total Articles  Correctly Placed
    Financial      555             364
    Foreign        341             260
    National       273             36
    Metro          943             746
    Sports         738             573
    Entertainment  354             278

  • Association Rule Discovery: Definition

    • Given a set of records, each of which contains some number of items from a given collection;
    – Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk

    Rules Discovered:
    {Milk} --> {Coke}
    {Diaper, Milk} --> {Beer}
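The two rules on this slide can be checked against the five transactions with the standard support and confidence measures. These measures are not defined on the slide, so their use here is an assumption; the transactions themselves are copied from the table.

```python
# Transactions from the slide (TID 1 to 5)
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          "support:", support(antecedent | consequent),
          "confidence:", confidence(antecedent, consequent))
```

{Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} in 2 of the 3 that contain both Diaper and Milk.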

  • Association Rule Discovery: Application 1

    • Marketing and Sales Promotion:
    – Let the rule discovered be {Bagels, … } --> {Potato Chips}
    – Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
    – Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
    – Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

  • Association Rule Discovery: Application 2

    • Supermarket shelf management.
    – Goal: To identify items that are bought together by sufficiently many customers.
    – Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
    – A classic rule:
      • If a customer buys diaper and milk, then he is very likely to buy beer.
      • So, don’t be surprised if you find six-packs stacked next to diapers!

  • Association Rule Discovery: Application 3

    • Inventory Management:
    – Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
    – Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

  • DATA MINING EXAMPLE
    Beer and Nappies/Diapers -- A Data Mining Urban Legend

    Introduction

    There is a story that a large supermarket chain, usually Wal-Mart, did an analysis of customers' buying habits and found a statistically significant correlation between purchases of beer and purchases of nappies (diapers in the US).

    It was theorized that the reason for this was that fathers were stopping off at Wal-Mart to buy nappies for their babies, and since they could no longer go down to the pub as often, would buy beer as well. As a result of this finding, the supermarket chain is alleged to have placed the nappies next to the beer, resulting in increased sales of both.

    The Original Version

  • DATA MINING EXAMPLE
    Beer and Nappies/Diapers -- A Data Mining Urban Legend

    What is the information?

    [Figure: Diapers and Beer.]

  • DATA MINING EXAMPLE
    Beer and Nappies/Diapers -- A Data Mining Urban Legend

    The stories are embellished with plausible-sounding factoids, e.g. increased sales as a percentage:

    "The man decided to buy a six pack of beer while he was stopped anyway. The discount chain moved the beer and snacks such as peanuts and pretzels next to the disposable diapers and increased sales on peanuts and pretzels by more than 27%."

    The now legendary revelation that men who bought diapers on Friday night at convenience stores were also likely to buy beer is an example of market basket analysis.

  • Sequential Pattern Discovery: Definition

    • Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

    (A B) (C) --> (D E)

    • Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

  • Sequential Pattern Discovery: Examples

    • In telecommunications alarm logs,
    – (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
    • In point-of-sale transaction sequences,
    – Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
    – Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
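A sequential rule such as the bookstore example can be checked against customer histories by testing whether a pattern occurs, in order, inside each sequence. The sketch below is a simplified illustration: the three customer sequences are hypothetical, timing constraints are ignored, and the confidence measure is an assumption carried over from association rules.

```python
def contains_pattern(sequence, pattern):
    """True if the pattern's itemsets occur in order, each inside a distinct, later event."""
    position = 0
    for itemset in pattern:
        # Advance to the next event that contains every item of this itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True

# Hypothetical purchase histories: each customer is a list of events (sets of items).
sequences = [
    [{"Intro_To_Visual_C"}, {"C++_Primer"}, {"Perl_for_dummies", "Tcl_Tk"}],
    [{"Intro_To_Visual_C"}, {"Other_Book"}, {"C++_Primer"}],
    [{"C++_Primer"}, {"Intro_To_Visual_C"}, {"Perl_for_dummies", "Tcl_Tk"}],
]

antecedent = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
consequent = [{"Perl_for_dummies", "Tcl_Tk"}]

antecedent_count = sum(contains_pattern(s, antecedent) for s in sequences)
rule_count = sum(contains_pattern(s, antecedent + consequent) for s in sequences)
rule_confidence = rule_count / antecedent_count
print(antecedent_count, rule_count, rule_confidence)  # 2 1 0.5
```

The third customer bought the same books in a different order, so the pattern does not match: order matters, which is what separates sequential patterns from plain association rules.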

  • Regression

    • Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
    • Greatly studied in statistics and neural network fields.
    • Examples:
    – Predicting sales amounts of a new product based on advertising expenditure.
    – Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
    – Time series prediction of stock market indices.
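For the linear case, the simplest instance is a least-squares line through one predictor. The sketch below assumes a single input variable and uses made-up advertising and sales figures; it is an illustration of the idea, not a method taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising expenditure vs. sales, lying exactly on y = 10 + 2x
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

intercept, slope = fit_line(advertising, sales)
predicted = intercept + slope * 6.0   # predicted sales for an unseen expenditure
print(intercept, slope, predicted)    # 10.0 2.0 22.0
```

With several predictors (temperature, humidity, air pressure) the same idea generalizes to multiple regression, and a nonlinear model replaces the straight line when the dependency is not linear.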

  • Deviation/Anomaly Detection

    • Detect significant deviations from normal behavior
    • Applications:
    – Credit Card Fraud Detection
    – Network Intrusion Detection

    Typical network traffic at University level may reach over 100 million connections per day
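One very simple way to flag "significant deviations from normal behavior" is a z-score threshold. This is only a sketch under stated assumptions: the slides do not prescribe a method, the threshold of 2.0 is arbitrary, and the transaction amounts are invented.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts for one account
amounts = [20, 25, 22, 19, 24, 21, 23, 500, 20, 22]

deviations = find_deviations(amounts)
print(deviations)  # [(7, 500)]
```

A single extreme value also inflates the mean and standard deviation it is judged against, so practical systems tend to use more robust statistics or learned models of normal behavior.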

  • Data Mining Applications

    • Science: Chemistry, Physics, Medicine
    – Biochemical analysis
    – Remote sensors on a satellite
    – Telescopes: star/galaxy classification
    – Medical image analysis
    • Bioscience
    – Sequence-based analysis
    – Protein structure and function prediction
    – Protein family classification
    – Microarray gene expression

  • Data Mining Applications

    • Pharmaceutical companies, Insurance and Health care, Medicine
    – Drug development
    – Identify successful medical therapies
    – Claims analysis, fraudulent behavior
    – Medical diagnostic tools
    – Predict office visits

  • Data Mining Applications

    • Financial Industry, Banks, Businesses, E-commerce
    – Stock and investment analysis
    – Identify loyal customers vs. risky customers
    – Predict customer spending
    – Risk management
    – Sales forecasting
    • Retail and Marketing
    – Customer buying patterns/demographic characteristics
    – Mailing campaigns
    – Market basket analysis
    – Trend analysis

  • Data Mining Applications

    • Database analysis and decision support
    – Market analysis and management
      • Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
    – Risk analysis and management
      • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
    – Fraud detection and management

  • Data Mining Applications

    • Sports and Entertainment
    – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
    • Astronomy
    – JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

  • Data Mining: Some applications in Granada

    Problems in Banking

    IMAGINÁTICA, Seville, 4 March 2009

  • Data Mining: Some applications in Granada

    Problems in Banking

    Research contract: Use of data mining techniques for the detection of low-quality patterns in banking operations
    Funding entity: CajaGranada. Ref. Fundacion Empresa Universidad de Granada 3022-00
    Duration: 1 March 2008 - 30 March 2009

    Research contract: Data Mining and Soft Computing techniques for improving the quality of the data entered in scoring
    Funding entity: Caja Navarra. Ref. Fundacion Empresa Universidad de Granada 3245-00
    Duration: 1 March 2008 - 30 March 2009

  • Major Issues in Data Mining

    • Mining methodology and user interaction
    – Mining different kinds of knowledge in databases
    – Incorporation of background knowledge
    – Handling noise and incomplete data
    – Pattern evaluation: the interestingness problem
    – Expression and visualization of data mining results

  • Major Issues in Data Mining

    • Performance and scalability
    – Efficiency of data mining algorithms
    – Parallel, distributed and incremental mining methods
    • Issues relating to the diversity of data types
    – Handling relational and complex types of data
    – Mining information from diverse databases

  • Major Issues in Data Mining

    • Issues related to applications and social impacts
    – Application of discovered knowledge
      • Domain-specific data mining tools
      • Intelligent query answering
      • Expert systems
      • Process control and decision making
    – A knowledge fusion problem
    – Protection of data security, integrity, and privacy

  • Introduction to Data Mining and Knowledge Discovery

    Concluding Remarks

    Summary

    • Data mining: discovering interesting patterns from large amounts of data
    • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
    • Mining can be performed in a variety of information repositories
    • Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.

  • Introduction to Data Mining and Knowledge Discovery

    Data Mining on the Web

    Presence on the Web

    Web mining is an information retrieval methodology that uses data mining tools to extract information from the content of web pages, from their structure of relationships (links), and from the logs of users' navigation.

    “In a society where what cannot be found on the web seems not to exist ...”

  • Bibliography

    J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006 (Second Edition). http://www.cs.sfu.ca/~han/dmbook

    I.H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann, 2005. http://www.cs.waikato.ac.nz/~ml/weka/book.html

    Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining (First Edition). Addison Wesley (May 2, 2005). http://www-users.cs.umn.edu/~kumar/dmbook/index.php

    Margaret H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2003. http://lyle.smu.edu/~mhd/book

    Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann, Mar 15, 1999.

    Mamdouh Refaat. Data Preparation for Data Mining Using SAS. Morgan Kaufmann (Sep. 29, 2006).

    ISBN: 9781420089646. Publication Date: April 09, 2009. Number of Pages: 208.

  • ICDM’06 Panel on “Top 10 Algorithms in Data Mining”

  • Introduction to Data Mining and Knowledge Discovery

    Concluding Remarks

    Important steps for study:

    [Figure: Data → Selection → Target data → Preprocessing & cleaning → Processed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge.]
