Knowledge Discovery Lecture 3 Babes-Bolyai University Summer term 2018-2019



7. AdaBoost algorithm


• AdaBoost is one of the first boosting algorithms to be adopted in practical applications.

• AdaBoost helps you combine multiple “weak classifiers” into a single “strong classifier”.

→ The weak learners in AdaBoost are typically decision trees with a single split, called decision stumps.

→ AdaBoost works by putting more weight on instances that are difficult to classify and less on those that are already handled well.

→ AdaBoost algorithms can be used for both classification and regression problems.
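As a quick illustration of these points, here is a minimal scikit-learn sketch; the synthetic data set and the choice of 50 estimators are placeholders rather than values from the lecture. By default, AdaBoostClassifier uses a depth-one decision tree, i.e. a decision stump, as its weak learner.

```python
# Minimal AdaBoost sketch with scikit-learn (synthetic data as a placeholder).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base learner is a decision stump (a depth-1 decision tree);
# 50 stumps are combined into one "strong" classifier.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```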


7. AdaBoost algorithm

• Ensemble learning is one of the most successful paradigms in the effort to build learning systems with strong generalization ability.

• In contrast to ordinary machine learning approaches, which try to generate one learner from the training data, ensemble methods construct a set of base learners and combine them.

• Base learners are usually generated from the training data by a base learning algorithm, which can be a decision tree, a neural network, or another kind of machine learning algorithm.

• Just like “many hands make light work,” the generalization ability of an ensemble is usually significantly better than that of a single learner. Ensemble methods are appealing mainly because they are able to boost weak learners, which are only slightly better than random guessing, into strong learners, which can make very accurate predictions. For this reason, “base learners” are also referred to as “weak learners.”

• AdaBoost is one of the most influential ensemble methods.

• AdaBoost and its variants have been applied to diverse domains with great success, owing to their solid theoretical foundation, accurate prediction, and great simplicity.

• We can combine AdaBoost with a cascade process for face detection; this face detector has been recognized as one of the most exciting breakthroughs in computer vision (in particular, face detection) during the past decade.

• “Boosting” has become a buzzword in computer vision and many other application areas.


AdaBoost algorithm

• Let X denote the instance space, or in other words, the feature space.

• Let Y denote the set of labels that express the underlying concepts which are to be learned.

• A training set D consists of m instances whose associated labels are observed, i.e., D = {(x_i, y_i)}, i ∈ {1, …, m}, while the label of a test instance is unknown and thus has to be predicted.

• We assume that both training and test instances are drawn independently and identically from an underlying distribution 𝒟 (not to be confused with the training set D).


AdaBoost algorithm

A general boosting procedure

• Suppose we are dealing with a binary classification problem, that is, we are trying to classify instances as positive or negative.

• Usually we assume that there exists an unknown target concept which correctly assigns “positive” labels to instances belonging to the concept and “negative” labels to all others.

• This unknown target concept is actually what we want to learn. We call this target concept the ground truth.

• For a binary classification problem, a classifier working by random guessing will have 50% 0/1-loss.

• Suppose we are unlucky and only have a weak classifier at hand, one that is only slightly better than random guessing on the underlying instance distribution 𝒟, say with 49% 0/1-loss.

• Let us denote this weak classifier by h1. Obviously, h1 is not what we want, so we will try to improve it.

• A natural idea is to correct the mistakes made by h1.


A general boosting procedure

• We can try to derive a new distribution 𝒟' from 𝒟 that makes the mistakes of h1 more evident, for example by focusing more on the instances wrongly classified by h1.

• We can then train a classifier h2 from 𝒟'. Again, suppose we are unlucky and h2 is also a weak classifier. Since 𝒟' was derived from 𝒟, if 𝒟' satisfies some conditions, h2 will achieve a better performance than h1 in some places of 𝒟 where h1 does not work well, without sacrificing the places where h1 performs well.

• Thus, by combining h1 and h2 in an appropriate way, the combined classifier will achieve a smaller loss than h1 alone. By repeating this process, we can expect to obtain a combined classifier with very small (ideally, zero) 0/1-loss on 𝒟.


AdaBoost

• AdaBoost, short for “Adaptive Boosting”, is the first practical boosting algorithm, proposed by Freund and Schapire in 1996. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one. The final equation for classification can be represented as
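The standard form of this weighted combination, consistent with the description of f_m and θ_m below (the symbol F(x) for the final classifier is simply a label chosen here), is

F(x) = sign( Σ_{m=1}^{M} θ_m · f_m(x) )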

where f_m stands for the m-th weak classifier and θ_m is the corresponding weight.

It is exactly the weighted combination of M weak classifiers. The whole procedure of the AdaBoost algorithm can be summarized as follows:

AdaBoost algorithm:

Given a data set containing n points, where
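Based on the next line, the condition on the labels is

y_i ∈ {−1, 1}, for i = 1, …, n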

Here -1 denotes the negative class while 1 represents the positive one.

• Initialize the weight for each data point as:
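The usual uniform initialization, consistent with the normalization below that keeps the instance weights summing to 1, is

w_i = 1/n, for i = 1, …, n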


AdaBoost

For iteration m=1,…,M:

• (1) Fit weak classifiers to the data set and select the one with the lowest weighted classification error:
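The weighted classification error referred to here is usually written as

ε_m = Σ_{i=1}^{n} w_i · 1[ y_i ≠ f_m(x_i) ]

where 1[·] is the indicator function and the weights w_i sum to 1.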

• (2) Calculate the weight for the m-th weak classifier:
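The standard AdaBoost choice, which matches the sign discussion below (positive for accuracy above 50%, negative below), is

θ_m = (1/2) · ln( (1 − ε_m) / ε_m )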

• For any classifier with accuracy higher than 50%, the weight is positive. The more accurate the classifier, the larger the weight. For a classifier with accuracy below 50%, the weight is negative, which means that we use its prediction with the sign flipped. For example, we can turn a classifier with 40% accuracy into one with 60% accuracy by flipping the sign of its predictions. Thus, even a classifier that performs worse than random guessing still contributes to the final prediction. The only classifier we cannot use is one with exactly 50% accuracy, which adds no information and thus contributes nothing to the final prediction.


AdaBoost

(3) Update the weight for each data point as:
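The standard multiplicative update described in the following bullets is

w_i ← w_i · exp( −θ_m · y_i · f_m(x_i) ) / Z_m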

where Z_m is a normalization factor that ensures the sum of all instance weights is equal to 1.

• If a case is misclassified by a classifier with positive weight, the “exp” term in the numerator is always larger than 1 (y·f is −1 while θ_m is positive). Thus misclassified cases are updated with larger weights after the iteration. The same logic applies to classifiers with negative weight; the only difference is that the originally correct classifications become misclassifications after flipping the sign.

• After M iterations we get the final prediction by summing up the weighted predictions of all classifiers and taking the sign, as in the combination formula above. A from-scratch sketch of the whole procedure is given below.
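Putting the three steps above together, the following is a minimal from-scratch sketch of AdaBoost with decision stumps as weak learners. The use of scikit-learn's DecisionTreeClassifier as the stump, the function names, and the numerical clipping of the error are implementation choices, not part of the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 tree used as the decision stump

def adaboost_fit(X, y, M=50):
    """Train AdaBoost on labels y encoded as -1/+1; returns the stumps and their weights theta."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform initial instance weights
    stumps, thetas = [], []
    for m in range(M):
        # (1) fit a weak classifier on the weighted data and compute its weighted error
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        # (2) classifier weight: positive when weighted accuracy is above 50%
        theta = 0.5 * np.log((1 - eps) / eps)
        # (3) re-weight the instances, emphasizing the misclassified ones, then normalize
        w = w * np.exp(-theta * y * pred)
        w = w / w.sum()
        stumps.append(stump)
        thetas.append(theta)
    return stumps, thetas

def adaboost_predict(X, stumps, thetas):
    """Final prediction: sign of the weighted sum of the weak classifiers' predictions."""
    scores = sum(theta * stump.predict(X) for theta, stump in zip(thetas, stumps))
    return np.sign(scores)
```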


8. kNN: k-Nearest-Neighbors


• The k-Nearest-Neighbors (kNN) method of classification is one of the simplest methods in machine learning, and it is a great way to introduce yourself to machine learning and classification in general.

• At its most basic level, it is essentially classification by finding the most similar data points in the training data and making an educated guess based on their classifications.

• Although very simple to understand and implement, this method has seen wide application in many domains, such as recommendation systems, semantic searching, and anomaly detection.

• As in any machine learning problem, we must first find a way to represent data points as feature vectors.

• A feature vector is our mathematical representation of the data, and since the desired characteristics of our data may not be inherently numerical, preprocessing and feature engineering may be required in order to create these vectors.

• Given data with N unique features, the feature vector is a vector of length N, where entry i represents that data point's value for feature i. Each feature vector can thus be thought of as a point in R^N.



8. kNN: k-Nearest-Neighbors

• kNN falls under lazy learning, which means that there is no explicit training phase before classification.

• Instead, any attempt to generalize or abstract the data is made at classification time.

• While this does mean that we can immediately begin classifying once we have our data, there are some inherent problems with this type of algorithm.

• We must be able to keep the entire training set in memory unless we apply some type of reduction to the data set, and performing classifications can be computationally expensive, since the algorithm parses through all data points for each classification.

• For these reasons, kNN tends to work best on smaller data sets that do not have many features.

• Once we have formed our training data set, represented as an M x N matrix where M is the number of data points and N is the number of features, we can begin classifying. The gist of the kNN method is, for each classification query, to (a minimal code sketch follows the steps below):

1. Compute a distance value between the item to be classified and every item in the training data set.

2. Pick the k closest data points (the items with the k lowest distances).

3. Conduct a majority vote among these data points – the dominating classification in that pool becomes the final classification.
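A minimal sketch of these three steps, assuming NumPy arrays (X_train as an M x N matrix, query as a length-N vector), Euclidean distance, and a hypothetical function name knn_classify:

```python
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=5):
    """Classify one query point by majority vote among its k nearest training points."""
    # 1. distance from the query to every training point (Euclidean)
    dists = np.linalg.norm(X_train - query, axis=1)
    # 2. indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # 3. majority vote among their class labels
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```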


8. kNN: k-Nearest-Neighbors

• There are two important decisions that must be made before making classifications:

1. The value of k that will be used; this can either be decided arbitrarily, or you can use cross-validation to find an optimal value.

2. The distance metric that will be used.

• Two common choices are Euclidean distance and cosine similarity.

• Euclidean distance: the straight-line distance between the two feature vectors.

• Cosine similarity: rather than calculating a magnitude, cosine similarity uses the difference in direction between two vectors.
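A sketch of the two metrics for feature vectors a and b, written with NumPy (the function names are just labels chosen here):

```python
import numpy as np

def euclidean_distance(a, b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms (direction, not magnitude)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```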


8. kNN: k-Nearest-Neighbors

• The result of the kNN algorithm is a decision boundary that partitions R^N into sections.

• Each section (shown in a distinct color on the original slide's figure) represents a class in the classification problem.

• The boundaries need not be formed with actual training examples; they are instead calculated using the distance metric and the available training points.

• By taking R^N in small chunks, we can calculate the most likely class for a hypothetical data point in that region, and we thus color that chunk as belonging to that class.


8. kNN: k-Nearest-Neighbors

• Given a training set D and a test object z, which is a vector of attribute values and has an unknown class label, the algorithm computes the distance (or similarity) between z and all the training objects to determine its nearest-neighbor list.

• It then assigns a class to z by taking the class of the majority of the neighboring objects.

• Ties are broken in an unspecified manner, for example randomly or by taking the most frequent class in the training set.


8. kNN: k-Nearest-Neighbors

• The storage complexity of the algorithm is O(n), where n is the number of training objects.

• The time complexity of classification is also O(n), since the distance needs to be computed between the target and each training object.

• However, no time is spent on constructing a classification model, such as a decision tree or a separating hyperplane.

• Thus, kNN differs from most other classification techniques, which have moderately to quite expensive model-building stages but very inexpensive, O(1) classification steps.


9. Naive Bayes Classification

• Naive Bayes is a simple yet effective and commonly used machine learning classifier.

• It is a probabilistic classifier that makes classifications using the Maximum A Posteriori decision rule in a Bayesian setting.

• It can also be represented using a very simple Bayesian network.

• Naive Bayes classifiers have been especially popular for text classification, and they are a traditional solution for problems such as spam detection.

The Model:

• The goal of any probabilistic classifier with features x0 through xn and classes c0 through ck is to determine the probability of the features occurring in each class, and to return the most likely class.

• Therefore, for each class, we want to be able to calculate P(ci | x0, …, xn).

• In order to do this, we use Bayes' rule:
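In the A/B notation used on the next slide, Bayes' rule reads

P(A | B) = P(B | A) · P(A) / P(B)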


Naive Bayes Classification

• In the context of classification, you can replace A with a class, ci, and B with the set of features, x0 through xn.

• Since P(B) serves as a normalization term, and we are usually unable to calculate P(x0, …, xn), we can simply ignore that term and instead just state that P(ci | x0, …, xn) ∝ P(x0, …, xn | ci) * P(ci), where ∝ means “is proportional to”.

• P(ci) is simple to calculate; it is just the proportion of the data set that falls in class i.

• P(x0, …, xn | ci) is more difficult to compute. In order to simplify its computation, we make the assumption that x0 through xn are conditionally independent given ci, which allows us to say that P(x0, …, xn | ci) = P(x0 | ci) * P(x1 | ci) * … * P(xn | ci).

• This assumption is most likely not true (hence the name naive Bayes classifier), but the classifier nonetheless performs well in most situations.

• Therefore, the final representation of the class probability is the following:
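Combining the proportionality and the conditional-independence assumption above, this is

P(ci | x0, …, xn) ∝ P(ci) · P(x0 | ci) · P(x1 | ci) · … · P(xn | ci)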

• Calculating the individual P(xj | ci) terms will depend on what distribution your features follow. In the context of text classification, where features may be word counts, the features may follow a multinomial distribution.

• In other cases, where features are continuous, they may follow a Gaussian distribution.


Naive Bayes Classification

• Note that there is very little explicit training in Naive Bayes compared to other common classification methods.

• The only work that must be done before prediction is finding the parameters for the features' individual probability distributions, which can typically be done quickly and deterministically.

• This means that Naive Bayes classifiers can perform well even with high-dimensional data points and/or a large number of data points.
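As an illustration of how little training is involved, here is a minimal sketch using scikit-learn's GaussianNB, which is appropriate when continuous features are modelled as Gaussian, as mentioned above; the toy data set is a placeholder.

```python
# Minimal Gaussian Naive Bayes sketch with scikit-learn (toy data as a placeholder).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" only estimates a per-class mean and variance for each feature.
model = GaussianNB().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```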


Naive Bayes Classification

Classification:

• Now that we have a way to estimate the probability of a given data point falling in a certain class, we need to be able to use this to produce classifications.

• Naive Bayes handles this in a very simple manner: simply pick the ci that has the largest probability given the data point's features.

• This is referred to as the Maximum A Posteriori decision rule.

• This is because, referring back to our formulation of Bayes' rule, we only use the P(B|A) and P(A) terms, which are the likelihood and prior terms, respectively.

• If we only used P(B|A), the likelihood, we would be using a Maximum Likelihood decision rule.


Naive Bayes Classification: Example

• Keep in mind that learning a Naive Bayes classifier is just a matter of counting how many times each attribute co-occurs with each class.

• P(c|x) is the posterior probability of class c given the predictor (the features).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.


Naive Bayes Classification: Example

• 50% of the fruits are bananas
• 30% are oranges
• 20% are other fruits

• Based on our training set of 1000 fruits we can also say the following:

• Of the 500 bananas, 400 (0.8) are Long, 350 (0.7) are Sweet and 450 (0.9) are Yellow.

• Of the 300 oranges, 0 are Long, 150 (0.5) are Sweet and 300 (1.0) are Yellow.

• Of the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet and 50 (0.25) are Yellow.


Naive Bayes Classification: Example

• We are given the features of a piece of fruit and we need to predict its class.

• If we are told that the additional fruit is Long, Sweet and Yellow, we can classify it by substituting these values into the Naive Bayes formula for each possible outcome: Banana, Orange or Other Fruit.

• The outcome with the highest probability (score) is the winner.


Naive Bayes Classification: Example

Banana:
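Using the proportions above, and ignoring the denominator P(Long, Sweet, Yellow) since it is the same for every class:

P(Banana | Long, Sweet, Yellow) ∝ P(Long | Banana) · P(Sweet | Banana) · P(Yellow | Banana) · P(Banana) = 0.8 × 0.7 × 0.9 × 0.5 = 0.252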


Naive Bayes Classification: Example

Orange:
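Using the proportions above (note that no orange in the training set is Long):

P(Orange | Long, Sweet, Yellow) ∝ 0 × 0.5 × 1.0 × 0.3 = 0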


Naive Bayes Classification: Example

Other fruit:
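Using the proportions above:

P(Other | Long, Sweet, Yellow) ∝ 0.5 × 0.75 × 0.25 × 0.2 = 0.01875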


Naive Bayes Classification: Example

• In this case, based on the highest score (0.252 for Banana), we conclude that this Long, Sweet and Yellow fruit is, in fact, a Banana.


Knowledge Discovery

• Knowledge discovery is the process of extracting useful knowledge from data.

Main Forms of Knowledge Discovery:

1. Prediction - deciding what the outcome or meaning of a particular situation is, by collecting and observing its properties.

2. Clustering - putting objects or situations into groups whose members resemble each other and are usefully different from the members of other groups.

3. Understanding connections -figuring out how objects, processes, and especially people are connected.

4. Understanding the internal world of others - being able to tell what another person is thinking or feeling.


Knowledge Discovery: The Larger Process

1. Data collection: Before any analysis can be done, data must be collected and put into usable form.

2. Analysis: This is the heart of the process, where the knowledge is extracted from the available data. An important part of this stage is evaluating the model that results from the analysis, both on its own terms (does it make sense internally?) and in terms of the situation (is it plausible?).

3. Decision and Action: Once the analysis has been done and a model built from the data, the model can be used to have some effect in the situation. For a predictive model, the action is to make a prediction; for a clustering model, a model of connections, or a model of other people, the result is a deeper understanding of the situation, which can then lead to action.


Prediction and Anomaly Detection

• Decision Trees

• Ensembles of Predictors

• Random Forests

• Support Vector Machines

• Neural Networks

• Rules

• Attribute Selection

• Distributed Prediction

• Symbiotic Prediction


Looking for Similarity: Clustering

• Distance-Based Clustering

• Density-Based Clustering

• Distribution-Based Clustering

• Decomposition-Based Clustering

• Hierarchical Clustering

• Biclustering

• Clusters and Prediction

• Symbiotic Clustering


Looking inside groups: relationship discovery

• Social Network Analysis

• Visualization

• Pattern Matching/Information Retrieval

• Single-Node Exploration

• Unusual-Region Detection

• Ranking on Graphs

• Graph Clustering

• Edge Prediction

• Anomalous-Substructure Discovery

• Using Graphs for Prediction


Knowledge Discovery with Formal Concept Analysis