datmindata minig

  • 7/30/2019 datmindata minig

    1/39

    Data Mining

    Rajagopal Sukumar

    Cognizant Technology Solutions

    Agenda

    What is Data Mining?

    Data Mining Techniques

    Data Mining Process

    Our work in Data Mining

    Tools available in the market

    What is Data Mining?

    Data mining is the search for relationships and global patterns that exist in large databases but are "hidden" among the vast amount of data.

    These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, about the real world registered by the database.

    What is Data Mining?

    The analogy with the mining process is described as:

    Data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful"

    Why do we need Data Mining?

    We need it because everybody needs it!

    To uncover strategic competitive insight to drive market share and profits

    What can we do with our data?

    Derive Quantitative Information
        How many people bought our products last month?

    Explain Past Results
        Why have monthly sales of our products declined sharply?

    Discover Hidden Patterns
        Households with a male head of household (HOH) are more likely to have both cats and dogs than those with a female HOH. The actual ratio is 7:3.

    Predict Future Results
        So those households in our customer base that have a male head of household are likely to have both cats and dogs. If we are a pet food supplier, think about the value of this prediction.

    Transforming Data

    Data

    Facts/Information

    Knowledge

    Recommendations/Decisions

    OLAP vs. Data Mining

                           OLAP                Data Mining
    Focus                  Summary data        Detail data
    Dimensions             Limited             Lots
    No. of attributes      Total in the tens   Hundreds
    Size of datasets       Small to medium     Millions
    Analysis               Deductive           Predictive
    Technique              Slice and dice      Automatic discovery
    State of technology    Mature              Mature in statistical analysis, emerging in knowledge discovery

    Data Mining Methods

    Decision Trees

    Case Based Reasoning

    Neural Networks

    Genetic Algorithms

    Linear and Non-Linear Regression Analysis

    Decision Tree

    Toy Type   Buyer Sex   Sales Month   Location   Qty
    Car        Boys        Jan           FL         50,000
    Car        Boys        Jan           GA         10,000
    Doll       Girls       Feb           FL         20,000
    Doll       Girls       Feb           CA         15,000
    Car        Boys        Mar           NY         20,000

    [Figure: decision tree over the toy sales table, branching on Car, then Boys/Girls, then Jan/Feb/..., then location, and ending in the leaves FL 50,000 (highest) and GA 10,000 (lowest)]
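
A minimal sketch of how a decision tree might pick its first split, using the five sample rows from the toy sales table. The slides do not say which splitting rule was used, so the variance-reduction criterion and all names below are assumptions for illustration.

```python
from statistics import pvariance

# Sample rows from the slide: (toy_type, buyer_sex, month, location, qty)
ROWS = [
    ("Car", "Boys", "Jan", "FL", 50_000),
    ("Car", "Boys", "Jan", "GA", 10_000),
    ("Doll", "Girls", "Feb", "FL", 20_000),
    ("Doll", "Girls", "Feb", "CA", 15_000),
    ("Car", "Boys", "Mar", "NY", 20_000),
]
COLUMNS = ["toy_type", "buyer_sex", "month", "location"]


def weighted_variance(rows, col):
    """Average variance of qty within the groups formed by splitting on col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) * pvariance(g) for g in groups.values()) / len(rows)


def best_split(rows):
    """Return the column whose split leaves the least variance in qty."""
    scores = {name: weighted_variance(rows, i) for i, name in enumerate(COLUMNS)}
    return min(scores, key=scores.get), scores


column, scores = best_split(ROWS)
print(column, scores)
```

With only five rows, a nearly unique column such as location scores best, which is a reminder that real trees need far more data and usually a limit on how finely they split.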

    Case Based Reasoning (CBR)

    Finds the closest situation that occurred in the past and adopts the solution that proved right then

    Disadvantage: CBR systems do not create rules or models summarizing past experiences

    Example: Help Desk Support Systems
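
The retrieval step of case based reasoning can be sketched as a nearest-case lookup. The help-desk cases, the symptom names, and the choice of Jaccard similarity are all hypothetical; the slides do not specify how closeness is measured.

```python
# Hypothetical past help-desk cases: (set of symptoms, solution that worked)
PAST_CASES = [
    ({"no power", "no lights"}, "Check the power cable"),
    ({"slow", "disk full"}, "Free up disk space"),
    ({"no network", "cable unplugged"}, "Reconnect the network cable"),
]


def similarity(a, b):
    """Jaccard similarity: shared symptoms divided by all symptoms."""
    return len(a & b) / len(a | b)


def closest_solution(symptoms):
    """Reuse the solution of the most similar past case."""
    case = max(PAST_CASES, key=lambda c: similarity(symptoms, c[0]))
    return case[1]


print(closest_solution({"slow", "disk full", "fan noise"}))
```

Note that nothing is learned here: the past cases are stored and searched as-is, which is exactly the disadvantage the slide mentions.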

    Neural Networks

    Mimic the way learning occurs in the brain

    They are used extensively in the business world as predictive models

    Each neuron takes many inputs and generates an output that is a non-linear function of the weighted sum of inputs

    Neural Networks

    [Figure: neural network with the inputs Toy Type, Buyer Sex, Location, Sale Month, and Quantity, the nodes n1 to n4, and the outputs Good and Bad]

    Neural Networks

    y = Good or Bad

    y = w1*n1 + w2*n2 + w3*n3 + w4*n4

    The weights w1..w4 can be calculated using backpropagation, by training the net on known values of y and the inputs

    Then the net can be used for predictions
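
A minimal sketch of training the weights for a single neuron with four inputs n1..n4. For one neuron, backpropagation reduces to a plain gradient update on the weighted sum. The sigmoid activation, the bias term, the learning rate, and the encoded training samples are assumptions, not taken from the slides.

```python
import math


def sigmoid(z):
    """Non-linear squashing function applied to the weighted sum."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output near 1 means Good, near 0 means Bad."""
    z = bias + sum(w * n for w, n in zip(weights, inputs))
    return sigmoid(z)


def train(samples, epochs=2000, rate=0.5):
    """Adjust the weights to shrink the error on known (inputs, y) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = predict(weights, bias, inputs) - target
            weights = [w - rate * error * n for w, n in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Hypothetical encoded training data: inputs n1..n4, y = 1 (Good) or 0 (Bad)
SAMPLES = [
    ([1, 0, 1, 0], 1),
    ([1, 1, 0, 0], 1),
    ([0, 0, 1, 1], 0),
    ([0, 1, 0, 1], 0),
]
weights, bias = train(SAMPLES)
print([round(predict(weights, bias, x), 3) for x, _ in SAMPLES])
```

A real network stacks several such neurons in layers, and backpropagation then passes the error backward through each layer in turn.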

    Genetic Algorithms

    Mimic the evolutionary process of natural selection

    A fitness function determines which solutions are better fits

    Then the genetic operations of mutation and mating are performed to generate more solutions

    Currently in research mode rather than in practical applications
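
The loop of fitness evaluation, mating, and mutation can be sketched on a toy problem. The bit-string encoding, the "count the 1-bits" fitness function, and every parameter value below are assumptions chosen only to keep the example small.

```python
import random

GENES = 12  # length of each candidate solution (a list of 0/1 bits)


def fitness(candidate):
    """Toy fitness function: more 1-bits means a better fit."""
    return sum(candidate)


def mate(a, b, rng):
    """Single-point crossover of two parents."""
    point = rng.randrange(1, GENES)
    return a[:point] + b[point:]


def mutate(candidate, rng, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - g if rng.random() < rate else g for g in candidate]


def evolve(generations=40, size=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENES)] for _ in range(size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: size // 2]  # keep the better half unchanged
        children = [
            mutate(mate(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```

Because the better half of each generation survives unchanged, the best fitness in the population never gets worse from one generation to the next.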

    Linear and Non-Linear Regression

    Searches for a dependence of the target variable on other variables, expressed as a function of some predetermined polynomial form

    Quantity = A*Buyer Sex + B*Location + C*Month (this is linear!)

    Solving this equation for A, B, and C using the available data yields a predictive model
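
Solving for A, B, and C can be sketched with ordinary least squares via the normal equations. The numeric encodings of buyer sex, location, and month, and the sample data generated from known coefficients, are hypothetical; the slides do not describe how the categorical fields were encoded.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    aug = [row[:] + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit(xs, ys):
    """Least-squares coefficients from the normal equations (X^T X) c = X^T y."""
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical numeric encodings: (buyer_sex, location, month) -> quantity
XS = [(1, 0, 1), (0, 1, 2), (1, 1, 3), (0, 0, 4), (1, 0, 5)]
YS = [2 * s + 3 * l + 0.5 * m for s, l, m in XS]  # generated with A=2, B=3, C=0.5

A, B, C = fit(XS, YS)
print(round(A, 3), round(B, 3), round(C, 3))
```

Since the sample quantities were generated from A=2, B=3, C=0.5 without noise, the fit recovers those coefficients up to floating-point error; real data would leave a residual.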

    Usage

    Clustering

    Groups data into disjoint sets whose members are similar in some respect, and attempts to place dissimilar data in different clusters.

    For example, in supermarket data, clustering sale items to organize shelf space effectively is a typical application

    Clustering algorithms typically use a distance function to separate data
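
A minimal sketch of distance-based clustering. The slides do not name an algorithm, so k-means with Euclidean distance is an assumption, as are the item attributes and the starting centers.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.dist(a, b)


def kmeans(points, centers, rounds=10):
    """Plain k-means: assign each point to its nearest center, then re-center."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return clusters, centers


# Hypothetical sale items described by (price, units sold per week)
ITEMS = [(1.0, 90), (1.2, 85), (0.9, 95), (9.5, 10), (10.0, 12), (11.0, 8)]
clusters, centers = kmeans(ITEMS, centers=[ITEMS[0], ITEMS[3]])
print(clusters)
```

In practice the attributes would be scaled first, since a distance function otherwise lets the attribute with the largest numeric range dominate.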

    Usage

    Classification

    Classifies data into distinct groups

    For example, people can be categorized as babies, children, teenagers, adults, and elderly. An age of two years or younger can be mapped to babies.

    Once data is classified, traits of these groups can be summarized
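
The age example can be sketched as a lookup followed by a per-class summary. Only the "two years or younger" bound for babies comes from the text; the other cut-offs and the spend figures are assumptions for illustration.

```python
from collections import defaultdict

# Upper age bound (inclusive) for each class. Only the "babies" bound comes
# from the slide; the other cut-offs are assumed.
AGE_CLASSES = [(2, "babies"), (12, "children"), (19, "teenagers"), (64, "adults")]


def classify(age):
    """Map an age in years to its class label."""
    for upper, label in AGE_CLASSES:
        if age <= upper:
            return label
    return "elderly"


def summarize(people):
    """Group (age, monthly_spend) records by class and average the spend."""
    groups = defaultdict(list)
    for age, spend in people:
        groups[classify(age)].append(spend)
    return {label: sum(v) / len(v) for label, v in groups.items()}


# Hypothetical (age, monthly spend) records
PEOPLE = [(1, 40.0), (2, 60.0), (15, 30.0), (40, 120.0), (70, 80.0)]
print(summarize(PEOPLE))
```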

    Usage

    Deviation Detection

    Extracting anomalies or deviations in the data

    An anomaly may reveal a new fact of great interest
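
One simple way to flag deviations is to measure how many standard deviations each value lies from the mean. The z-score rule, the threshold of 2, and the sample counts are assumptions; the slides do not prescribe a detection method.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical daily order counts with one unusual day
DAILY_ORDERS = [20, 22, 19, 21, 20, 95, 18]
print(find_anomalies(DAILY_ORDERS))  # → [95]
```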

    Usage

    Association Rules

    Extracting associations between data items. Can be used to predict the value of one object based on the value of another.

    Question: find a model that identifies the most predictive characteristics of people buying toy pickup trucks

    Answer: during summer vacation, single-parent families with certain income levels buy toy pickup trucks

    Association Rules

    70% of customers who order pens and pencils also order writing tablets

    If writing tablets are high-margin items, discover all associations that have writing tablets as a consequent

    If pencils are low-margin items, discover all associations that have pencils as an antecedent to determine the impact of discontinuing pencils
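
The 70% figure is the confidence of the rule "pens and pencils imply writing tablets". A minimal sketch of computing it follows; the order data is hypothetical and was built so that 7 of the 10 pen-and-pencil orders include a writing tablet.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical orders: 7 of the 10 pen-and-pencil orders include a writing tablet
ORDERS = (
    [{"pen", "pencil", "writing tablet"}] * 7
    + [{"pen", "pencil"}] * 3
    + [{"stapler"}] * 2
)

print(confidence(ORDERS, {"pen", "pencil"}, {"writing tablet"}))
```

Fixing the consequent (writing tablets) or the antecedent (pencils) and scanning the remaining item sets gives the two searches described above.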

    Data Preparation

    Data Cleansing

    Inconsistencies
        Toy types "soft" and "plush" mean the same

    Stale Data
        Address changes are not reflected correctly

    Typographical Errors
        Words are misspelled or typed incorrectly

    Missing Values
        Tough problem to address

    Data Cleansing - Missing Values

    Treatment of missing numeric values is more difficult

    Artificial assignment changes the distribution and statistics of the field

    Assign using average values

    Segment data using another variable and assign segment averages

    Build a model and impute the missing values (the best method)
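
The segment-average option can be sketched as follows. The record layout, the field names, and the fallback to the overall average for a segment with no known values are assumptions for illustration.

```python
from statistics import mean


def impute_by_segment(records, segment_key, value_key):
    """Fill missing (None) values with the average of the record's segment."""
    by_segment = {}
    for r in records:
        if r[value_key] is not None:
            by_segment.setdefault(r[segment_key], []).append(r[value_key])
    overall = mean(v for values in by_segment.values() for v in values)
    filled = []
    for r in records:
        value = r[value_key]
        if value is None:
            known = by_segment.get(r[segment_key])
            value = mean(known) if known else overall  # fall back to the overall average
        filled.append({**r, value_key: value})
    return filled


# Hypothetical customer records; None marks a missing income
CUSTOMERS = [
    {"region": "FL", "income": 40_000},
    {"region": "FL", "income": 60_000},
    {"region": "FL", "income": None},
    {"region": "GA", "income": 30_000},
    {"region": "GA", "income": None},
]
filled = impute_by_segment(CUSTOMERS, "region", "income")
print(filled)
```

Segment averages distort the field's distribution less than one global average does, though every imputed value still sits at the center of its segment.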

    Data Transformation

    Ratio Variables

    Time derivatives

    Discretization using quantiles

    Discretization using other mathematical transforms
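
Discretization using quantiles can be sketched with the standard library. The choice of four bins (quartiles) and the sample sales figures are assumptions.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bins(values, bins=4):
    """Replace each value with the index (0..bins-1) of its quantile bin."""
    cuts = quantiles(values, n=bins)  # bins - 1 cut points
    return [bisect_right(cuts, v) for v in values]


# Hypothetical monthly sales quantities
SALES = [10_000, 15_000, 20_000, 20_000, 50_000, 12_000, 18_000, 30_000]
binned = quantile_bins(SALES)
print(binned)
```

Each bin ends up holding roughly the same number of records, so a skewed field such as sales quantity is spread evenly across the categories.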

    Time Derivatives

    Variation of data over time is very important to understand

    For example, toy sales time derivative = toy sales of current month - toy sales of previous month

    Cyclic association rules can be identified: monthly sales of goods may have different correlations depending on the season
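
The month-over-month difference is a one-line transformation. The sales figures below are hypothetical.

```python
def month_over_month(sales):
    """Difference between each month's sales and the previous month's."""
    return [current - previous for previous, current in zip(sales, sales[1:])]


# Hypothetical toy sales for four consecutive months
TOY_SALES = [50_000, 20_000, 20_000, 35_000]
print(month_over_month(TOY_SALES))  # → [-30000, 0, 15000]
```

The derived series has one value fewer than the original, because the first month has no previous month to compare against.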

    Data Mining Process

    Choose the study
        Classification/Clustering
        Deviation Detection
        Affinity Analysis

    Run the algorithm on the prepared data

    Analyze the outputs

    Make decisions

    Our Approach

    Demystification of Data Mining

    Built a Windows-based prototype to demonstrate decision trees

    Working on adding a module to our ad hoc query generator, Extempore

    What is Extempore?

    EXTract M204 and Process On REquest

    Generates native M204 UL code

    Reports generated on multiple M204 files without any M204 coding

    Complex report formatting with the help of reporting tools like info-maker

    Provides user friendly GUI

    Dynamically generates customized reports

    What is Extempore?

    Structured user interface

    Point & click methodology

    Limited M204 knowledge required to use

    Quick access to M204 data

    Reports can be copied/saved and reused

    Data retrieved can be saved in formats like Excel, CSV or HTML tables to be used by other systems

    Online & batch modes of execution

    Extempore Architecture

    [Figure: Extempore architecture, with components labeled CT LIB and JANUS. Diagram captions: "RPC to Sybase & results from RPC to client"; "Sybase routes client RPC to M204"; "Hidden connection from M204 to Sybase to read report specification"]

    Tools in the market

    IBM Intelligent Miner

    Data Mind Corp's Data Mind Professional Edition

    Angoss Software's Knowledge Seeker

    Neuralware's Neuralworks Predict

    Pilot Software's Discovery Server

    Redbrick Systems' Data Mine

    Thinking Machines Corp's Darwin

    Web sites

    Excellent reference sites

    http://www.thearling.com

    http://www.kdnuggets.com

    Source code sites

    C4.5 Decision Tree Algorithm

    http://ftp.cs.su.oz.au/pub/ml/

    OC1 Decision Tree Algorithm

    http://www.cs.jhu.edu/

    Thank You !