
Machine Learning: Algorithms and Applications
Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012
Lecture 9: 7 May 2012

Ensemble methods

Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html


Combining classifiers

- So far, we have only discussed individual classifiers, i.e., how to build and use them
- Can we combine multiple classifiers to produce a better classifier?
- Yes, sometimes
- We discuss two main algorithms:
  - Bagging
  - Boosting

Bagging

- Breiman, 1996
- Bootstrap Aggregating = Bagging
  - Application of bootstrap sampling (see lecture 6)
- Given: a set D containing m training examples
- Create a sample S[i] of D by drawing m examples at random with replacement from D
- S[i] has size m, but is expected to leave out about 37% (≈ 1/e ≈ 0.368) of the examples in D (illustrated in the sketch below)
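As a quick illustration (not from the slides), a minimal Python/NumPy sketch of bootstrap sampling; the fraction of examples left out of a sample should come out near 1/e ≈ 0.37:

```python
import numpy as np

def bootstrap_sample(D, rng):
    """Draw len(D) examples from D at random with replacement."""
    m = len(D)
    idx = rng.integers(0, m, size=m)       # indices sampled with replacement
    return [D[i] for i in idx], set(idx.tolist())

rng = np.random.default_rng(0)
D = list(range(10_000))                    # toy training set with m examples
S, used = bootstrap_sample(D, rng)
print("fraction left out:", 1 - len(used) / len(D))   # typically ~0.37
```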


Bagging (cont…)

- Training
  - Create k bootstrap samples S[1], S[2], …, S[k]
  - Build a distinct classifier on each S[i] to produce k classifiers, using the same learning algorithm
- Testing
  - Classify each new instance by voting of the k classifiers (equal weights); see the sketch below
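A minimal sketch of bagging, assuming scikit-learn decision trees as the base learner and integer class labels (both are assumptions made for illustration, not requirements of the method):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, k, seed=0):
    """Train k classifiers, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)              # bootstrap sample S[i]
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Equal-weight majority vote of the k classifiers (labels assumed to be 0..C-1)."""
    votes = np.stack([clf.predict(X) for clf in models])   # shape (k, n)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```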

Bagging Example

Original training set D:  1 2 3 4 5 6 7 8
Bootstrap sample S[1]:    2 7 8 3 7 6 3 1  → classifier f1
Bootstrap sample S[2]:    7 8 5 6 4 2 7 1  → classifier f2
Bootstrap sample S[3]:    3 6 2 7 5 6 2 2  → classifier f3
Bootstrap sample S[4]:    4 5 1 4 6 4 3 8  → classifier f4

The final classifier predicts by majority vote over the k = 4 classifiers:

f_{\text{final}}(x) = \arg\max_{y \in \{\text{red},\,\text{white}\}} \sum_{t=1}^{4} \mathbb{1}[f_t(x) = y]


Bagging (cont …)

- When does it help?
  - When the learner is unstable
    - A small change to the training set causes a large change in the output classifier
    - True for decision trees, neural networks, evolutionary algorithms; not true for k-nearest neighbor, naïve Bayes, SVM
  - Experimentally, bagging can help substantially for unstable learners, and may somewhat degrade results for stable learners

Boosting
- A family of methods:
  - We only study AdaBoost (Freund & Schapire, 1996)
- Training
  - Produce a sequence of classifiers (using the same base learner)
  - Each classifier depends on the previous one and focuses on the previous one's errors
  - Examples that are incorrectly predicted by previous classifiers are given higher weights
- Testing
  - The results of the series of classifiers are combined to determine the final class of a test case


AdaBoost
- Maintain a weighted training set (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn), where the non-negative weights wi sum to 1
- At each round, build a classifier ft whose (weighted) accuracy on the training set is > ½, i.e., better than random; ft is called a weak classifier or base learner
- Then change the weights, giving higher weights to the examples ft misclassified, and repeat

AdaBoost algorithm
[The pseudocode box appears as a figure on the original slide; a sketch is given below]
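As a substitute for the figure, here is a minimal sketch of binary AdaBoost with labels in {-1, +1} and scikit-learn decision stumps as the weak learner; the label encoding and the stump choice are assumptions made for illustration, not part of the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T):
    """AdaBoost for a NumPy label vector y in {-1, +1}, using depth-1 trees (stumps)."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # non-negative weights summing to 1
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])               # weighted training error
        if err >= 0.5:                           # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes, down-weight correct ones
        w /= w.sum()                             # renormalize so the weights sum to 1
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Combine the weak classifiers by a weighted vote."""
    return np.sign(sum(a * m.predict(X) for m, a in zip(models, alphas)))
```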


Does AdaBoost always work?

- The actual performance of boosting depends on the data and the base learner
  - Like bagging, it requires the base learner to be unstable
- Boosting seems to be susceptible to noise
  - When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance

Unsupervised Learning

Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html


Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary

Supervised learning vs. unsupervised learning
- Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute
  - These patterns are then utilized to predict the values of the target attribute in future data instances
- Unsupervised learning: the data have no target attribute
  - We want to explore the data to find some intrinsic structures in them


Clustering

- Clustering is a technique for finding similarity groups in data, called clusters, i.e.,
  - it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters
- Clustering is often called an unsupervised learning task
  - no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning

An illustration
- The 2-dimensional data set has three natural groups of data points, i.e., 3 natural clusters
- Visually discovering clusters is quite easy in a 2- or 3-dimensional space, but hard when the number of dimensions increases
  - Automatic techniques are needed


What is clustering for?

Some real-life examples
- Example 1: Group people of similar sizes together to make "small", "medium" and "large" T-shirts
  - Tailor-made for each person: too expensive
  - One-size-fits-all: does not fit all
- Example 2: In marketing, segment customers according to their similarities
  - To do targeted marketing

What is clustering for? (cont…)
- Example 3: Given a collection of text documents, we want to organize them according to their content similarities
  - To produce a topic hierarchy
- In fact, clustering is a very popular technique
  - It has a long history, and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
  - In recent years, due to the rapid increase of online documents, text clustering has become important


Aspects of clustering
- A clustering algorithm
  - Partitional clustering
  - Hierarchical clustering
- A distance (similarity, or dissimilarity) function
- Clustering quality
  - Inter-cluster distance ⇒ maximized
  - Intra-cluster distance ⇒ minimized
- The quality of a clustering result depends on the algorithm, the distance function, and the application

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary


K-means clustering
- K-means is a partitional clustering algorithm
- Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data
- The k-means algorithm partitions the given data into k clusters
  - Each cluster has a cluster center, called the centroid, which is the mean of the data points in the cluster
  - k is specified by the user

K-means algorithm
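The pseudocode on the slide is shown as a figure; as a substitute, a minimal Python/NumPy sketch of the same procedure (random initial centroids, then alternating assignment and centroid-update steps until the centroids stop moving):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic k-means. X: array of shape (n, r); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial seeds
    for _ in range(n_iters):
        # assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # convergence check
            break
        centroids = new_centroids
    return centroids, labels
```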


Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE):

   SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2

   where Cj is the jth cluster, mj is the centroid of cluster Cj, and dist(x, mj) is the distance between data point x and centroid mj

An example
- We want to find 2 clusters → k = 2


An example (cont …)

An example distance function

- The k-means algorithm can be used for any data set where the mean can be computed
- In Euclidean space
  - the mean (centroid) of Cj is

    m_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i

    where |Cj| is the number of points in cluster Cj
  - the distance from a point xi to a centroid mj is

    \mathrm{dist}(x_i, m_j) = \|x_i - m_j\| = \sqrt{(x_{i1} - m_{j1})^2 + (x_{i2} - m_{j2})^2 + \ldots + (x_{ir} - m_{jr})^2}
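The same two quantities as a small NumPy sketch (the array shapes are assumptions made for illustration):

```python
import numpy as np

def centroid(C):
    """Mean of the points of a cluster C, given as an array of shape (|C|, r)."""
    return C.mean(axis=0)

def euclidean_dist(x, m):
    """Euclidean distance between a point x and a centroid m (both arrays of length r)."""
    return float(np.sqrt(np.sum((x - m) ** 2)))
```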


Strengths of k-means
- Strengths:
  - Simple: easy to understand and to implement
  - Efficient: time complexity is O(tkn), where
    - n is the number of data points
    - k is the number of clusters
    - t is the number of iterations
  - Since both k and t are small, k-means is considered a linear algorithm (in the number of data points)
- K-means is the most popular clustering algorithm
- It terminates at a local optimum if SSE is used
  - The global optimum is hard to find due to complexity

Weaknesses of k-means

- The algorithm is only applicable if the mean is defined
- For categorical data, use k-modes
  - the data instances are described by r categorical attributes
  - the mode of cluster Cj is a tuple mj = (mj1, …, mjr), where mji is the most frequent value of the ith attribute of the instances in Cj
- Example
  - Cj = {(apple, young), (orange, young), (apple, old), (peach, middle-age)}
  - mj = (apple, young)
  - The distance between an instance and the mode is the number of values they do not match (see later)
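A minimal sketch of the mode and the mismatch distance used by k-modes, in plain Python:

```python
from collections import Counter

def cluster_mode(C):
    """Mode of a cluster: the most frequent value of each categorical attribute."""
    return tuple(Counter(values).most_common(1)[0][0] for values in zip(*C))

def mismatch_dist(x, mode):
    """Number of attribute values on which instance x and the mode disagree."""
    return sum(a != b for a, b in zip(x, mode))

Cj = [("apple", "young"), ("orange", "young"), ("apple", "old"), ("peach", "middle-age")]
print(cluster_mode(Cj))                                       # ('apple', 'young')
print(mismatch_dist(("orange", "old"), cluster_mode(Cj)))     # 2
```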


Weaknesses of k-means (cont …)

- The user needs to specify k, which can be wrong
- Several values for k are tried and the best is selected
[Figure: the same data set clustered with k=3 (correct) and with k=4 (wrong)]

Weaknesses of k-means (cont …)

- The algorithm is sensitive to outliers
  - Outliers are data points that are very far away from other data points
  - Outliers could be errors in the data recording or some special data points with very different values


Weaknesses of k-means: problems with outliers

Weaknesses of k-means: dealing with outliers
1. Remove some data points that are much further away from the centroids than other data points
   - To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them
2. Perform random sampling
   - The chance of sampling an outlier is very small
   - Use the sample to do a pre-clustering
   - Assign the rest of the data points to the obtained clusters
     - by distance or similarity comparison, or
     - by doing supervised learning
       - each cluster is regarded as a class
       - the learned classifier is used to classify the remaining data points


Weaknesses of k-means (cont …)
- The algorithm is sensitive to initial seeds

Weaknesses of k-means (cont …)
- If we use different seeds: good results


Weaknesses of k-means: dealing with choice of seeds
- There are some methods to help choose good seeds
- A simple method (a sketch follows below):
  1. Compute the centroid m of the data set
  2. The first seed x1 is the point furthest from m
  3. The second seed x2 is the point furthest from x1
  4. Each subsequent seed xi is chosen so that the sum of its distances to the already selected seeds is the largest
- Dealing with outliers in the simple method
  - Randomly select a sample of data points
  - Apply the simple method to the sample
  - The chance of an outlier in the sample is small
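A minimal sketch of the simple seed-selection method in Python/NumPy (applying it to a random sample rather than the full data, as suggested above, is left to the caller):

```python
import numpy as np

def farthest_point_seeds(X, k):
    """Pick k seeds: first the point furthest from the data centroid, then repeatedly
    the point whose summed distance to the already selected seeds is largest."""
    m = X.mean(axis=0)                                   # centroid of the whole data set
    seeds = [int(np.linalg.norm(X - m, axis=1).argmax())]
    while len(seeds) < k:
        dist_sum = sum(np.linalg.norm(X - X[s], axis=1) for s in seeds)
        dist_sum[seeds] = -np.inf                        # never re-pick an existing seed
        seeds.append(int(dist_sum.argmax()))
    return X[seeds]
```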

Weaknesses of k-means (cont …)
- The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres)
- Are the 2 clusters in (B) necessarily bad?
  - No, it depends on the application (more on this later)


K-means summary
- Despite its weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency
  - other clustering algorithms have their own lists of weaknesses
- No clear evidence that any other clustering algorithm performs better in general
  - although they may be more suitable for some specific types of data or applications
- Comparing different clustering algorithms is a difficult task
  - No one knows the correct clusters!

Road map
- Basic concepts
- K-means algorithm
- Representation of clusters
- Hierarchical clustering
- Distance functions
- Data standardization
- Handling mixed attributes
- Which clustering algorithm to use?
- Cluster evaluation
- Summary


Common ways to represent clusters

- Use the centroid of each cluster to represent the cluster
  - compute the radius and the standard deviation of the cluster to determine its spread in each dimension (a sketch follows below)
  - The centroid representation alone works well if the clusters are of hyper-spherical shape
  - If clusters are elongated or of other shapes, centroids are not sufficient
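A sketch of this representation; here the radius is taken to be the distance from the centroid to the furthest cluster member, which is one common convention and an assumption rather than a definition from the slides:

```python
import numpy as np

def cluster_summary(C):
    """Represent a cluster (array of shape (|C|, r)) by centroid, radius, and spread."""
    centroid = C.mean(axis=0)
    dists = np.linalg.norm(C - centroid, axis=1)
    radius = float(dists.max())            # distance to the furthest member (assumed convention)
    spread = C.std(axis=0)                 # standard deviation in each dimension
    return centroid, radius, spread
```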

Using a classification model
- All the data points in a cluster are regarded as having the same class label, e.g., the cluster ID
  - run a supervised learning algorithm on the data to find a classification model (a sketch follows below)
- A cluster may be split into a few rules
  - But there is usually a dominant rule
- The set of rules can be used to check whether the clusters conform to some intuition or domain knowledge

  Example rules:
  x ≤ 2 → cluster 1
  x > 2, y > 1.5 → cluster 2
  x > 2, y ≤ 1.5 → cluster 3
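A sketch of this idea using a scikit-learn decision tree; the toy data and cluster labels below are hypothetical, chosen only to show the rule-extraction step:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical 2-D points and the cluster IDs assigned to them by some clustering run
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 2.5], [4.0, 2.0], [3.5, 1.0], [4.5, 0.5]])
labels = np.array([1, 1, 2, 2, 3, 3])            # cluster IDs used as class labels

tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)
print(export_text(tree, feature_names=["x", "y"]))   # human-readable rules, one path per cluster
```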


Use frequent values to represent cluster

- This method is mainly for clustering of categorical data (e.g., k-modes clustering)
- It is the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster (a sketch follows below)
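A tiny sketch of the frequent-words representation for a text cluster (documents assumed to be already tokenized):

```python
from collections import Counter

def top_words(cluster_docs, n=5):
    """Represent a cluster of tokenized documents by its n most frequent words."""
    counts = Counter(word for doc in cluster_docs for word in doc)
    return [word for word, _ in counts.most_common(n)]

print(top_words([["machine", "learning", "clustering"],
                 ["clustering", "text", "documents"]], n=3))
```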

Clusters of arbitrary shapes
- Hyper-elliptical and hyper-spherical clusters are easy to represent, using their centroid together with spreads
- Irregular-shape clusters are hard to represent
  - Using centroids is not suitable (upper figure) in general
- Irregular-shape clusters may not be useful in some applications
  - K-means clusters may be more useful (lower figure), e.g., for making 2 sizes of T-shirts