7 Classification



    Data Mining: Concepts and Techniques

    Classification and Prediction


    What is classification?

    What is prediction?


    Classification vs. Prediction

    Classification:

    predicts categorical class labels

    classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data

    Prediction:

    models continuous-valued functions, i.e., predicts unknown or missing values

    Typical applications:

    credit approval

    target marketing

    medical diagnosis

    treatment effectiveness analysis


    Classification: A Two-Step Process

    Model construction: describing a set of predetermined classes

    Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

    The set of tuples used for model construction: the training set

    The model is represented as classification rules, decision trees, or mathematical formulae

    Model usage: classifying future or unknown objects

    Estimate the accuracy of the model

    The known label of each test sample is compared with the classified result from the model

    The accuracy rate is the percentage of test set samples that are correctly classified by the model

    The test set is independent of the training set; otherwise over-fitting will occur
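    A minimal sketch of this two-step process (added here; not part of the original slides), assuming scikit-learn and its bundled iris dataset: construct the model on a training set, then estimate accuracy on an independent test set.

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score

        X, y = load_iris(return_X_y=True)

        # Step 1: model construction on the training set
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
        model = DecisionTreeClassifier().fit(X_train, y_train)

        # Step 2: model usage -- estimate accuracy on the independent test set
        print("accuracy:", accuracy_score(y_test, model.predict(X_test)))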


    Classification Process (1): Model Construction

    Training

    Data

    NAME   RANK            YEARS   TENURED
    Mike   Assistant Prof  3       no
    Mary   Assistant Prof  7       yes
    Bill   Professor       2       yes
    Jim    Associate Prof  7       yes
    Dave   Assistant Prof  6       no
    Anne   Associate Prof  3       no

    Classification

    Algorithms

    IF rank = professor

    OR years > 6

    THEN tenured = yes

    Classifier

    (Model)


    Classification Process (2): Use the Model in Prediction

    Classifier

    Testing

    Data

    NAME     RANK            YEARS   TENURED
    Tom      Assistant Prof  2       no
    Merlisa  Associate Prof  7       no
    George   Professor       5       yes
    Joseph   Assistant Prof  7       yes

    Unseen Data

    (Jeff, Professor, 4)

    Tenured?
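    A small illustrative sketch (mine, not from the slides) that reproduces this toy example, assuming pandas and scikit-learn are available; it fits a tree to the training table above and predicts a tenure label for the unseen tuple (Jeff, Professor, 4).

        import pandas as pd
        from sklearn.tree import DecisionTreeClassifier

        train = pd.DataFrame({
            "name":  ["Mike", "Mary", "Bill", "Jim", "Dave", "Anne"],
            "rank":  ["Assistant Prof", "Assistant Prof", "Professor",
                      "Associate Prof", "Assistant Prof", "Associate Prof"],
            "years": [3, 7, 2, 7, 6, 3],
            "tenured": ["no", "yes", "yes", "yes", "no", "no"],
        })

        # encode the categorical attribute 'rank' as dummy indicators
        X = pd.get_dummies(train[["rank", "years"]])
        y = train["tenured"]
        clf = DecisionTreeClassifier().fit(X, y)

        # unseen sample (Jeff, Professor, 4) -- align its columns with the training encoding
        jeff = pd.get_dummies(pd.DataFrame({"rank": ["Professor"], "years": [4]}))
        jeff = jeff.reindex(columns=X.columns, fill_value=0)
        print("Tenured?", clf.predict(jeff)[0])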


    Supervised vs. Unsupervised Learning

    Supervised learning (classification)

    Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

    New data is classified based on the training set

    Unsupervised learning (clustering)

    The class labels of the training data are unknown

    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
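    An illustrative contrast (added; scikit-learn assumed): the same measurements are handled by a supervised classifier when class labels accompany them, and by a clustering algorithm when they do not.

        from sklearn.datasets import load_iris
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.cluster import KMeans

        X, y = load_iris(return_X_y=True)

        # supervised: labels y accompany the observations
        clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
        print("predicted class:", clf.predict(X[:1]))

        # unsupervised: only the measurements X are used; class structure is discovered
        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
        print("cluster assignment:", km.labels_[:1])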


    Issues regarding classification and prediction


    Issues regarding classification and prediction (1): Data Preparation

    Data cleaning

    Preprocess data in order to reduce noise and handle

    missing values

    Relevance analysis (feature selection)

    Remove the irrelevant or redundant attributes

    Data transformation

    Generalize and/or normalize data
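    A hedged sketch of these three preparation steps (the use of scikit-learn and the tiny array are my own choices): imputation for missing values, a simple relevance filter, and normalization.

        import numpy as np
        from sklearn.impute import SimpleImputer
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.preprocessing import StandardScaler

        X = np.array([[1.0, 200.0, np.nan],
                      [2.0, 180.0, 0.5],
                      [np.nan, 190.0, 0.4],
                      [4.0, 210.0, 0.8]])
        y = np.array([0, 0, 1, 1])

        X = SimpleImputer(strategy="mean").fit_transform(X)     # data cleaning: fill missing values
        X = SelectKBest(f_classif, k=2).fit_transform(X, y)     # relevance analysis: keep the 2 best attributes
        X = StandardScaler().fit_transform(X)                   # data transformation: normalize
        print(X)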


    Issues regarding classification and prediction (2): Evaluating Classification Methods

    Predictive accuracy: correctly predicting the class of new data

    Speed and scalability:

    time to construct the model

    time to use the model

    Robustness: handling noise and missing values

    Scalability: efficiency in disk-resident databases

    Interpretability: understanding and insight provided by the model

    Goodness of rules:

    decision tree size

    compactness of classification rules


    Classification by decision tree induction


    Classification by Decision Tree Induction

    Decision tree

    A flow-chart-like tree structure

    Internal node denotes a test on an attribute

    Branch represents an outcome of the test

    Leaf nodes represent class labels or class distributions

    Decision tree generation consists of two phases

    Tree construction

    At start, all the training examples are at the root

    Partition examples recursively based on selected attributes

    Tree pruning

    Identify and remove branches that reflect noise or outliers

    Use of decision tree: Classifying an unknown sample

    Test the attribute values of the sample against the decision tree


    Training Dataset

    age      income   student   credit_rating   buys_computer
    <=30     high     no        fair            no
    <=30     high     no        excellent       no
    31..40   high     no        fair            yes
    >40      medium   no        fair            yes
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31..40   low      yes       excellent       yes
    <=30     medium   no        fair            no
    <=30     low      yes       fair            yes
    >40      medium   yes       fair            yes
    <=30     medium   yes       excellent       yes
    31..40   medium   no        excellent       yes
    31..40   high     yes       fair            yes
    >40      medium   no        excellent       no


    Output: A Decision Tree for buys_computer

    age?

    <=30: student?
        no  -> no
        yes -> yes

    31..40: yes

    >40: credit_rating?
        excellent -> no
        fair      -> yes


    Algorithm for Decision Tree Induction

    Basic algorithm (a greedy algorithm)

    Tree is constructed in a top-down, recursive, divide-and-conquer manner

    At start, all the training examples are at the root

    Attributes are categorical (if continuous-valued, they are discretized in advance)

    Examples are partitioned recursively based on selected attributes

    Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

    Conditions for stopping partitioning

    All samples for a given node belong to the same class

    There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf

    There are no samples left
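    A compact, illustrative Python implementation of this greedy procedure (my own sketch, not code from the slides): categorical attributes, information gain as the selection measure, and the stopping conditions listed above.

        import math
        from collections import Counter

        def info(labels):
            # I(p, n) generalized to any number of classes
            total = len(labels)
            return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

        def gain(rows, labels, attr):
            # Gain(A) = I(p, n) - E(A), where E(A) weights the info of each partition
            total = len(labels)
            e_a = 0.0
            for v in set(r[attr] for r in rows):
                subset = [l for r, l in zip(rows, labels) if r[attr] == v]
                e_a += len(subset) / total * info(subset)
            return info(labels) - e_a

        def build_tree(rows, labels, attrs):
            # stopping conditions: pure node, or no remaining attributes (majority vote)
            if len(set(labels)) == 1:
                return labels[0]
            if not attrs:
                return Counter(labels).most_common(1)[0][0]
            best = max(attrs, key=lambda a: gain(rows, labels, a))
            node = {"attr": best, "branches": {}}
            for v in set(r[best] for r in rows):   # branches only for values that occur, so no empty partitions
                idx = [i for i, r in enumerate(rows) if r[best] == v]
                node["branches"][v] = build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                                 [a for a in attrs if a != best])
            return node

        rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
                {"age": "31..40", "student": "no"}, {"age": ">40", "student": "yes"}]
        labels = ["no", "yes", "yes", "yes"]
        print(build_tree(rows, labels, ["age", "student"]))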


    Attribute Selection Measure

    Information gain

    All attributes are assumed to be categorical

    Can be modified for continuous-valued attributes


    Information Gain

    Select the attribute with the highest information gain

    Assume there are two classes, P and N

    Let the set of examples S contain p elements of class P and n elements of class N

    The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

    I(p, n) = -(p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))


    Information Gain in Decision Tree Induction

    Assume that using attribute A, a set S will be partitioned into sets {S1, S2, ..., Sv}

    If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is

    E(A) = Σ (i = 1..v) ((pi + ni) / (p + n)) I(pi, ni)

    The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)


    Attribute Selection by Information Gain Computation

    Class P: buys_computer = "yes"

    Class N: buys_computer = "no"

    I(p, n) = I(9, 5) = 0.940

    Compute the entropy for age:

    age      pi   ni   I(pi, ni)
    <=30     2    3    0.971
    31..40   4    0    0
    >40      3    2    0.971

    E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

    Hence

    Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246

    Similarly,

    Gain(income) = 0.029

    Gain(student) = 0.151

    Gain(credit_rating) = 0.048
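    These numbers can be checked with a few lines of Python (added here for illustration):

        import math

        def I(p, n):
            def term(x):
                return 0.0 if x == 0 else -(x / (p + n)) * math.log2(x / (p + n))
            return term(p) + term(n)

        i_pn = I(9, 5)                                              # 0.940
        e_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)    # 0.694
        # prints 0.94 0.694 0.247 (0.246 when the rounded intermediate values above are used)
        print(round(i_pn, 2), round(e_age, 3), round(i_pn - e_age, 3))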


    Extracting Classification Rules from Trees

    Represent the knowledge in the form of IF-THEN rules

    One rule is created for each path from the root to a leaf

    Each attribute-value pair along a path forms a conjunction

    The leaf node holds the class prediction

    Rules are easier for humans to understand

    Example

    IF age = "<=30" AND student = "no" THEN buys_computer = "no"

    IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"

    IF age = "31..40" THEN buys_computer = "yes"

    IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"

    IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
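    A hedged sketch of doing this mechanically (scikit-learn is my choice of tool; the slides do not name one): fit a tree on one-hot-encoded attributes and print its root-to-leaf paths, each of which corresponds to one IF-THEN rule.

        import pandas as pd
        from sklearn.tree import DecisionTreeClassifier, export_text

        # toy data: two categorical attributes and a class label
        df = pd.DataFrame({
            "age":     ["<=30", "<=30", "31..40", ">40", ">40"],
            "student": ["no",   "yes",  "no",     "yes", "no"],
            "buys":    ["no",   "yes",  "yes",    "yes", "no"],
        })
        X = pd.get_dummies(df[["age", "student"]])
        clf = DecisionTreeClassifier().fit(X, df["buys"])

        # one root-to-leaf path per rule; the attribute-value tests along a path form the conjunction
        print(export_text(clf, feature_names=list(X.columns)))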


    Avoid Overfitting in Classification

    The generated tree may overfit the training data

    Too many branches, some of which may reflect anomalies due to noise or outliers

    The result is poor accuracy for unseen samples

    Two approaches to avoid overfitting

    Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

    Difficult to choose an appropriate threshold

    Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees

    Use a set of data different from the training data to decide which is the best pruned tree
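    A sketch of postpruning using scikit-learn's cost-complexity pruning (the mechanism is my choice, not prescribed by the slides): grow a full tree, derive a sequence of progressively pruned trees, and keep the one that does best on data held out from training.

        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

        # candidate pruning strengths derived from the fully grown tree
        alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train).ccp_alphas

        # sequence of progressively pruned trees; choose the best on the held-out data
        trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train) for a in alphas]
        best = max(trees, key=lambda t: t.score(X_valid, y_valid))
        print("leaves in best pruned tree:", best.get_n_leaves())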


    Approaches to Determine the Final Tree Size

    Separate training (2/3) and testing (1/3) sets

    Use cross validation, e.g., 10-fold cross validation

    Use all the data for training but apply a statistical test (e.g., chi-square) to

    estimate whether expanding or pruning a

    node may improve the entire distribution

    Use minimum description length (MDL) principle:

    halting growth of the tree when the encoding

    is minimized
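    A minimal sketch of the cross-validation option (scikit-learn assumed; the candidate depths are arbitrary): 10-fold cross-validation scores can guide the choice among differently sized trees.

        from sklearn.datasets import load_iris
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        for depth in (2, 4, None):   # candidate tree sizes
            scores = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=10)
            print(depth, round(scores.mean(), 3))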


    Enhancements to basic decision tree induction

    Allow for continuous-valued attributes

    Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

    Handle missing attribute values

    Assign the most common value of the attribute

    Assign probability to each of the possible values

    Attribute construction

    Create new attributes based on existing ones that are sparsely represented

    This reduces fragmentation, repetition, and replication
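    For the continuous-valued enhancement, a small sketch (mine) that derives a discrete-valued attribute by partitioning a continuous attribute into intervals, here with pandas and invented cut points:

        import pandas as pd

        ages = pd.Series([23, 35, 45, 29, 52])
        # dynamically defined discrete attribute partitioning the continuous values into intervals
        age_band = pd.cut(ages, bins=[0, 30, 40, 120], labels=["<=30", "31..40", ">40"])
        print(age_band.tolist())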


    Classification in Large Databases

    Classification is a classical problem extensively studied by statisticians and machine learning researchers

    Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed

    Why decision tree induction in data mining?

    relatively faster learning speed (than other classification methods)

    convertible to simple and easy-to-understand classification rules

    can use SQL queries for accessing databases

    comparable classification accuracy with other methods


    Scalable Decision Tree Induction Methods in Data Mining Studies

    SLIQ

    builds an index for each attribute; only the class list and the current attribute list reside in memory

    SPRINT

    constructs an attribute list data structure

    PUBLIC

    integrates tree splitting and tree pruning: stops growing the tree earlier

    RainForest

    separates the scalability aspects from the criteria that determine the quality of the tree

    builds an AVC-list (attribute, value, class label)


    Data Cube-Based Decision-Tree Induction

    Integration of generalization with decision-tree induction

    Classification at primitive concept levels

    E.g., precise temperature, humidity, outlook, etc.

    Low-level concepts, scattered classes, bushy classification trees

    Semantic interpretation problems

    Cube-based multi-level classification

    Relevance analysis at multiple levels

    Information-gain analysis with dimension + level


    Presentation of Classification Results


    Bayesian Classification


    Bayesian Classification: Why?

    Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems

    Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

    Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities

    Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured


    Bayesian Theorem

    Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

    P(h|D) = P(D|h) P(h) / P(D)

    Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost


    Naive Bayesian Classifier (Simple Bayesian Classifier)

    Each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn), depicting n measurements on the sample from n attributes A1, A2, ..., An, respectively

    Suppose there are m classes C1, C2, ..., Cm. Given a data sample X with no class label, the classifier predicts that X belongs to the class having the highest posterior probability conditioned on X, i.e., the naive Bayesian classifier assigns an unknown sample X to class Ci iff P(Ci|X) > P(Cj|X) for all j != i, where by Bayes' theorem

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)


    Naive Bayesian Classifier (continued)

    Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. Class prior probabilities may be estimated by P(Ci) = si / s, where si is the number of training samples of class Ci and s is the total number of training samples.

    For data sets with many attributes it would be computationally expensive to compute P(X|Ci). To reduce this computation, the naive assumption of class-conditional independence is made: the values of the attributes are presumed to be conditionally independent of one another given the class label of the sample, i.e., there are no dependence relationships among the attributes. Thus

    P(X|Ci) = Π (k = 1..n) P(xk|Ci)


    Naive Bayesian Classifier (continued)

    The probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be estimated from the training samples, where

    - If Ak is categorical, then P(xk|Ci) = sik / si, where sik is the number of training samples of class Ci having the value xk for Ak, and si is the number of training samples belonging to Ci

    - If Ak is continuous-valued, the attribute is assumed to have a Gaussian distribution, so that

    P(xk|Ci) = g(xk, μCi, σCi) = (1 / (√(2π) σCi)) exp(-(xk - μCi)² / (2 σCi²))

    where g(xk, μCi, σCi) is the Gaussian (normal) density function for attribute Ak, while μCi and σCi are the mean and standard deviation of Ak over the training samples of class Ci


    Naive Bayesian Classifier (continued)

    In order to classify an unknown sample X, P(X|Ci) P(Ci) is evaluated for each class Ci. Sample X is then assigned to class Ci iff

    P(X|Ci) P(Ci) > P(X|Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i
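    A from-scratch sketch of this procedure for categorical attributes (illustrative only; function and variable names are mine): estimate P(Ci) and P(xk|Ci) from counts, then pick the class maximizing P(X|Ci) P(Ci).

        from collections import Counter, defaultdict

        def train_nb(rows, labels):
            # rows: list of dicts {attribute: value}; labels: list of class labels
            s = len(labels)
            prior = {c: cnt / s for c, cnt in Counter(labels).items()}   # P(Ci) = si / s
            cond = defaultdict(lambda: defaultdict(Counter))             # counts s_ik per class and attribute
            for r, c in zip(rows, labels):
                for a, v in r.items():
                    cond[c][a][v] += 1
            return prior, cond, Counter(labels)

        def classify_nb(x, prior, cond, class_counts):
            best, best_score = None, -1.0
            for c, p_c in prior.items():
                score = p_c
                for a, v in x.items():
                    score *= cond[c][a][v] / class_counts[c]             # P(xk|Ci) = s_ik / s_i
                if score > best_score:
                    best, best_score = c, score
            return best

        rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
                {"age": ">40", "student": "yes"}, {"age": ">40", "student": "no"}]
        labels = ["no", "yes", "yes", "no"]
        prior, cond, counts = train_nb(rows, labels)
        print(classify_nb({"age": "<=30", "student": "yes"}, prior, cond, counts))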


    Bayesian classification

    The classification problem may be formalized using a-posteriori probabilities:

    P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C

    E.g., P(class = N | outlook = sunny, windy = true, ...)

    Idea: assign to sample X the class label C such that P(C|X) is maximal


    Estimating a-posteriori probabilities

    Bayes theorem:

    P(C|X) = P(X|C)P(C) / P(X)

    P(X) is constant for all classes

    P(C) = relative frequency of class C samples

    C such that P(C|X) is maximum = C such that P(X|C) P(C) is maximum

    Problem: computing P(X|C) is infeasible!


    The independence hypothesis

    makes computation possible

    yields optimal classifiers when satisfied

    but is seldom satisfied in practice, as attributes

    (variables) are often correlated.

    Attempts to overcome this limitation:

    Bayesian networks, that combine Bayesian reasoning

    with causal relationships between attributes

    Decision trees, which reason on one attribute at a time, considering the most important attributes first


    Bayesian Belief Networks

    [Figure: a belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer]

    The conditional probability table for the variable LungCancer:

            (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC      0.8       0.5        0.7        0.1
    ~LC     0.2       0.5        0.3        0.9
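    A tiny sketch (added; not from the slides) of how such a table is used: given the values of the parents FamilyHistory and Smoker, P(LungCancer | parents) is read directly from the CPT.

        # CPT for LungCancer, indexed by (FamilyHistory, Smoker); values are P(LC = yes | parents)
        cpt_lc = {(True, True): 0.8, (True, False): 0.5, (False, True): 0.7, (False, False): 0.1}

        def p_lung_cancer(family_history, smoker, lc=True):
            p = cpt_lc[(family_history, smoker)]
            return p if lc else 1.0 - p

        print(p_lung_cancer(True, False))              # P(LC | FH, ~S)   = 0.5
        print(p_lung_cancer(False, False, lc=False))   # P(~LC | ~FH, ~S) = 0.9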


    Bayesian Belief Networks

    A Bayesian belief network allows a subset of the variables to be conditionally independent

    A graphical model of causal relationships

    Several cases of learning Bayesian belief networks

    Given both network structure and all the variables:

    easy

    Given network structure but only some variables

    When the network structure is not known in advance


    Classification by backpropagation


    Neural Networks

    Advantages

    prediction accuracy is generally high

    robust, works when training examples contain errors

    output may be discrete, real-valued, or a vector of several discrete or real-valued attributes

    fast evaluation of the learned target function

    Criticism

    long training time

    difficult to understand the learned function (weights)

    not easy to incorporate domain knowledge


    A Neuron

    The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping

    [Figure: inputs x0, x1, ..., xn with weight vector w = (w0, w1, ..., wn) feed a weighted sum; the bias μk is subtracted and the activation function f produces the output y]

    y = f( Σi wi xi - μk )
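    A few lines (illustrative; NumPy assumed) computing that mapping for one neuron with a logistic activation function:

        import numpy as np

        def neuron(x, w, mu, f=lambda t: 1.0 / (1.0 + np.exp(-t))):
            # weighted sum of the inputs, minus the bias mu, passed through activation f
            return f(np.dot(w, x) - mu)

        print(neuron(np.array([1.0, 0.5, -0.2]), np.array([0.4, 0.3, 0.9]), mu=0.1))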


    Network Training

    The ultimate objective of training

    obtain a set of weights that makes almost all the

    tuples in the training data classified correctly

    Steps

    Initialize weights with random values

    Feed the input tuples into the network one by one

    For each unit

    Compute the net input to the unit as a linear combination of all the inputs to the unit

    Compute the output value using the activation function

    Compute the error

    Update the weights and the bias


    Multi-Layer Perceptron

    [Figure: a feed-forward network -- the input vector xi enters the input nodes, weights wij connect them to the hidden nodes, and the hidden nodes feed the output nodes, which produce the output vector]

    Net input to unit j:            Ij = Σi wij Oi + θj

    Output of unit j (sigmoid):     Oj = 1 / (1 + e^(-Ij))

    Error at an output unit j:      Errj = Oj (1 - Oj) (Tj - Oj)

    Error at a hidden unit j:       Errj = Oj (1 - Oj) Σk Errk wjk

    Weight update:                  wij = wij + (l) Errj Oi

    Bias update:                    θj = θj + (l) Errj
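    A compact NumPy sketch of training with these update rules (one hidden layer, sigmoid units, online updates; all names and the toy XOR task are mine):

        import numpy as np

        def sigmoid(t):
            return 1.0 / (1.0 + np.exp(-t))

        rng = np.random.default_rng(0)
        X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
        T = np.array([0.0, 1.0, 1.0, 0.0])             # target outputs (XOR as a toy task)

        W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)  # input -> hidden weights and biases
        W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # hidden -> output weights and biases
        l = 0.5                                        # learning rate

        for epoch in range(5000):
            for x, t in zip(X, T):
                # forward pass: O_j = sigmoid(I_j), I_j = sum_i w_ij O_i + theta_j
                h = sigmoid(x @ W1 + b1)
                o = sigmoid(h @ W2 + b2)
                # backward pass: Err_j at the output and hidden layers
                err_o = o * (1 - o) * (t - o)
                err_h = h * (1 - h) * (W2 @ err_o)
                # weight and bias updates
                W2 += l * np.outer(h, err_o); b2 += l * err_o
                W1 += l * np.outer(x, err_h); b1 += l * err_h

        print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel(), 2))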


    Classification based on concepts from association rule mining


    Association-Based Classification

    Several methods for association-based classification

    ARCS: quantitative association mining and clustering of association rules

    It beats C4.5 mainly in scalability, and also in accuracy

    Associative classification:

    It mines high-support, high-confidence rules of the form cond_set => y, where y is a class label

    CAEP (Classification by Aggregating Emerging Patterns)

    Emerging patterns (EPs): itemsets whose support increases significantly from one class to another

    Mine EPs based on minimum support and growth rate


    Other Classification Methods


    Other Classification Methods

    k-nearest neighbor classifier

    case-based reasoning

    Genetic algorithm

    Rough set approach

    Fuzzy set approaches


    Instance-Based Methods

    Instance-based learning: store training examples and delay the processing (lazy evaluation) until a new instance must be classified

    Typical approaches

    k-nearest neighbor approach

    Instances represented as points in a Euclidean space

    Locally weighted regression

    Constructs local approximation

    Case-based reasoning

    Uses symbolic representations and knowledge-based inference


    The k-Nearest Neighbor Algorithm

    All instances correspond to points in the n-dimensional space

    The nearest neighbors are defined in terms of Euclidean distance

    The target function could be discrete- or real-valued

    For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq

    Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

    [Figure: a query point xq surrounded by + and - training instances, with the Voronoi cells induced by 1-NN]
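    A small from-scratch sketch of the algorithm (NumPy assumed; names are mine): Euclidean distance, then the most common class among the k nearest training examples.

        import numpy as np
        from collections import Counter

        def knn_predict(X_train, y_train, xq, k=3):
            # Euclidean distance from the query point xq to every training instance
            dists = np.linalg.norm(X_train - xq, axis=1)
            nearest = np.argsort(dists)[:k]
            # discrete-valued target: return the most common class among the k nearest
            return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

        X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
        y_train = ["-", "-", "+", "+"]
        print(knn_predict(X_train, y_train, np.array([3.5, 3.9]), k=3))   # expected "+"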


    Discussion on the k-NN Algorithm

    The k-NN algorithm for continuous-valued target functions: calculate the mean value of the k nearest neighbors

    Distance-weighted nearest neighbor algorithm

    Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors:

    w ≡ 1 / d(xq, xi)²

    Similarly for real-valued target functions

    Robust to noisy data by averaging the k nearest neighbors

    Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes

    To overcome it, stretch the axes or eliminate the least relevant attributes


    Case-Based Reasoning

    Also uses lazy evaluation and analysis of similar instances

    Difference: instances are not points in a Euclidean space

    Example: the water faucet problem in CADET

    Methodology

    Instances represented by rich symbolic descriptions (e.g., function graphs)

    Multiple retrieved cases may be combined

    Tight coupling between case retrieval, knowledge-based reasoning, and problem solving

    Research issues

    Indexing based on a syntactic similarity measure and, when that fails, backtracking and adapting to additional cases


    Remarks on Lazy vs. Eager Learning

    Instance-based learning: lazy evaluation

    Decision-tree and Bayesian classification: eager evaluation

    Key differences

    A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D

    An eager method cannot, since it has already chosen its global approximation by the time it sees the query

    Efficiency: lazy methods spend less time training but more time predicting

    Accuracy

    A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function

    An eager method must commit to a single hypothesis that covers the entire instance space


    Genetic Algorithms

    GA: based on an analogy to biological evolution

    Each rule is represented by a string of bits

    An initial population is created consisting of randomly generated rules

    e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100

    Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring

    The fitness of a rule is represented by its classification accuracy on a set of training examples

    Offspring are generated by crossover and mutation
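    A toy sketch of this loop (entirely illustrative; the bit-string encoding and the placeholder fitness function are invented): sort by fitness, keep the fittest rules, and generate offspring by crossover and mutation.

        import random

        random.seed(0)
        RULE_LEN = 8                                 # each rule is a string of 8 bits

        def fitness(rule, examples):
            # placeholder "accuracy": fraction of examples whose label matches a toy rule evaluation
            return sum(1 for bits, label in examples
                       if sum(b & r for b, r in zip(bits, rule)) % 2 == label) / len(examples)

        def crossover(a, b):
            cut = random.randrange(1, RULE_LEN)      # single-point crossover
            return a[:cut] + b[cut:]

        def mutate(rule, p=0.05):
            return [bit ^ 1 if random.random() < p else bit for bit in rule]

        examples = [([random.randint(0, 1) for _ in range(RULE_LEN)], random.randint(0, 1))
                    for _ in range(50)]
        population = [[random.randint(0, 1) for _ in range(RULE_LEN)] for _ in range(20)]

        for generation in range(30):
            population.sort(key=lambda r: fitness(r, examples), reverse=True)
            fittest = population[:10]                # survival of the fittest
            offspring = [mutate(crossover(random.choice(fittest), random.choice(fittest)))
                         for _ in range(10)]
            population = fittest + offspring

        best = max(population, key=lambda r: fitness(r, examples))
        print("best rule:", best, "fitness:", round(fitness(best, examples), 2))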


    Rough Set Approach

    Rough sets are used to approximately (roughly) define equivalence classes

    A rough set for a given class C is approximated by two sets: a lower approximation (certainly in C) and an upper approximation (cannot be described as not belonging to C)

    Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix can be used to reduce the computational intensity


    Fuzzy Set Approaches

    Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g., using a fuzzy membership graph)

    Attribute values are converted to fuzzy values

    e.g., income is mapped into the discrete categories {low, medium, high}, with fuzzy values calculated for each

    For a given new sample, more than one fuzzy value may apply

    Each applicable rule contributes a vote for membership in the categories

    Typically, the truth values for each predicted category are summed
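    A small sketch of fuzzification (the breakpoints and the triangular membership function are invented for illustration): a continuous income value is mapped to degrees of membership in {low, medium, high}.

        def triangular(x, a, b, c):
            # triangular membership function rising from a to b, falling from b to c
            if x <= a or x >= c:
                return 0.0
            return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

        def fuzzify_income(income):
            return {
                "low":    triangular(income, 0, 20_000, 45_000),
                "medium": triangular(income, 30_000, 55_000, 80_000),
                "high":   triangular(income, 65_000, 100_000, 200_000),
            }

        print(fuzzify_income(40_000))   # partly "low" and partly "medium"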


    Prediction


    What Is Prediction?

    Prediction is similar to classification

    First, construct a model

    Second, use model to predict unknown value

    The major method for prediction is regression

    Linear and multiple regression

    Non-linear regression

    Prediction is different from classification

    Classification predicts categorical class labels

    Prediction models continuous-valued functions


    Predictive Modeling in Databases

    Predictive modeling: predict data values or construct generalized linear models based on the database data

    One can only predict value ranges or category distributions

    Method outline:

    Minimal generalization

    Attribute relevance analysis

    Generalized linear model construction

    Prediction

    Determine the major factors which influence the prediction

    Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.

    Multi-level prediction: drill-down and roll-up analysis


    Regression Analysis and Log-Linear Models in Prediction

    Linear regression: Y = α + β X

    The two parameters, α and β, specify the line and are to be estimated from the data at hand, using the least squares criterion applied to the known values Y1, Y2, ..., X1, X2, ...

    Multiple regression: Y = b0 + b1 X1 + b2 X2

    Many nonlinear functions can be transformed into the above

    Log-linear models:

    The multi-way table of joint probabilities is approximated by a product of lower-order tables

    Probability: p(a, b, c, d) = αab βac χad δbcd
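    A short sketch of estimating α and β by least squares (NumPy assumed; the data is invented):

        import numpy as np

        X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])        # roughly Y = 0.1 + 2.0 X

        # least squares estimates: beta = cov(X, Y) / var(X), alpha = mean(Y) - beta * mean(X)
        beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
        alpha = Y.mean() - beta * X.mean()
        print(round(alpha, 2), round(beta, 2))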


    Locally Weighted Regression

    Construct an explicit approximation to f over a local region surrounding the query instance xq

    Locally weighted linear regression:

    The target function f is approximated near xq using the linear function

    f^(x) = w0 + w1 a1(x) + ... + wn an(x)

    minimize the squared error over the k nearest neighbors of xq, with a distance-decreasing weight K:

    E(xq) = (1/2) Σ (x ∈ k nearest neighbors of xq) (f(x) - f^(x))² K(d(xq, x))

    the gradient descent training rule:

    Δwj = η Σ (x ∈ k nearest neighbors of xq) K(d(xq, x)) (f(x) - f^(x)) aj(x)

    In most cases, the target function is approximated by a constant, linear, or quadratic function
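    A NumPy sketch of locally weighted linear regression at a single query point (a Gaussian kernel is my choice for K, and the local model is solved in closed form rather than by the gradient rule above):

        import numpy as np

        def lwr_predict(X, y, xq, tau=1.0):
            # design matrix with a bias column, so f^(x) = w0 + w1 * x
            A = np.column_stack([np.ones_like(X), X])
            aq = np.array([1.0, xq])
            # distance-decreasing weights K(d(xq, x))
            K = np.exp(-((X - xq) ** 2) / (2 * tau ** 2))
            W = np.diag(K)
            # weighted least squares solution for the local linear model
            w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
            return aq @ w

        X = np.linspace(0, 10, 50)
        y = np.sin(X)
        print(round(lwr_predict(X, y, 2.0, tau=0.5), 2))   # close to sin(2.0) ~ 0.91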


    Prediction: Numerical Data



    Prediction: Categorical Data