Datamining Intro IEP 2

Embed Size (px)

Citation preview

  • 7/28/2019 Datamining Intro IEP 2

    1/16

    HITKARINI COLLAGE OF

    ENGGNERING ANDTECHNOLOGY

  • 7/28/2019 Datamining Intro IEP 2

    2/16

    Data mining

    Process of semi-automatically analyzinglarge databases to find patterns that are:

    valid: hold on new data with some certainity

    novel: non-obvious to the system

    useful: should be possible to act on the item

    understandable: humans should be able tointerpret the pattern

    Also known as Knowledge Discovery in

    Databases (KDD)

  • 7/28/2019 Datamining Intro IEP 2

    3/16

    Applications

    Banking: loan/credit card approvalpredict good customers based on old customers

    Customer relationship management:

    identify those who are likely to leave for a competitor.Targeted marketing:identify likely responders to promotions

    Manufacturing and production:automatically adjust knobs when process parameter changes

  • 7/28/2019 Datamining Intro IEP 2

    4/16

    Applications

    Medicine: disease outcome, effectiveness oftreatments

    analyze patient disease history: find relationshipbetween diseases

    Scientific data analysis:

    identify new galaxies by searching for sub clusters

    Web site/store design and promotion:find affinity of visitor to pages and modify layout

  • 7/28/2019 Datamining Intro IEP 2

    5/16

    The KDD process

    Data collectionsubset data: sampling might hurt if highly skewed data

    feature selection: principal component analysis, heuristic

    search

    Pre-processing: cleaningname/address cleaning, different meanings (annual, yearly),

    duplicate removal, supplying missing values

    Transformation:map complex objects e.g. time series data to features e.g.

    frequency

    Choosing mining task and mining method:

    Result evaluation and Visualization:

  • 7/28/2019 Datamining Intro IEP 2

    6/16

    Some basic operations

    Predictive:

    Regression

    Collaborative FilteringDescriptive:

    Clustering / similarity matching

    Association rules and variantsDeviation detection

  • 7/28/2019 Datamining Intro IEP 2

    7/16

    Clustering

    Unsupervised learning when old data with classlabels not available e.g. when introducing a new

    product.Group/cluster existing customers based on time

    series of payment history such that similarcustomers in same cluster.

    Key requirement: Need a good measure ofsimilarity between instances.

    Identify micro-markets and develop policies for

    each

  • 7/28/2019 Datamining Intro IEP 2

    8/16

    Clustering methods

    Hierarchical clustering

    agglomerative Vs divisive

    single link Vs complete link

    Partitional clustering

    distance-based: K-means

    model-based: EMdensity-based:

  • 7/28/2019 Datamining Intro IEP 2

    9/16

    Variants

    High confidence may not imply highcorrelation

    Use correlations. Find expected supportand large departures from thatinteresting..

    see statistical literature on contingency tables.

    Still too many rules, need to prune...

  • 7/28/2019 Datamining Intro IEP 2

    10/16

    Data Mining in Practice

  • 7/28/2019 Datamining Intro IEP 2

    11/16

    Application Areas

    Industry Application

    Finance Credit Card Analysis

    Insurance Claims, Fraud AnalysisTelecommunication Call record analysis

    Transport Logistics management

    Consumer goods promotion analysis

    Data Service providers Value added data

    Utilities Power usage analysis

  • 7/28/2019 Datamining Intro IEP 2

    12/16

    Data Mining works with

    Warehouse Data

    Data Warehousing provides theEnterprise with a memory

    Data Mining provides theEnterprise with intelligence

  • 7/28/2019 Datamining Intro IEP 2

    13/16

    Mining market

    Around 20 to 30 mining tool vendors

    Major tool players:Clementine,

    IBMs Intelligent Miner,

    SGIs MineSet,

    SASs Enterprise Miner.

    All pretty much the same set of toolsMany embedded products:fraud detection:

    electronic commerce applications,

    health care,

    customer relationship management: Epiphany

  • 7/28/2019 Datamining Intro IEP 2

    14/16

    OLAP Mining integration

    OLAP (On Line Analytical Processing)

    Fast interactive exploration of multidim.aggregates.

    Heavy reliance on manual operations foranalysis:

    Tedious and error-prone on large

    multidimensional dataIdeal platform for vertical integration of mining

    but needs to be interactive instead of batch.

  • 7/28/2019 Datamining Intro IEP 2

    15/16

  • 7/28/2019 Datamining Intro IEP 2

    16/16

    THANK YOU