RANDOM FOREST




WHAT IS RANDOM FOREST?

• Random forest is an ensemble classifier that uses many decision tree models.
• It can be used for both classification and regression.
• Accuracy and variable-importance information are provided with the result.
• A random forest is a collection of unpruned, CART-like trees following specific rules for tree growing, tree combination, self-testing, and post-processing.
• Trees are grown using binary partitioning.


COMPARISON

• Similar to a decision tree, with a few differences:
  - At each split point, the search is not over all variables but only over a subset of them.
  - No pruning is necessary; trees can be grown until each node contains just a few observations.
• Advantages over a single decision tree: better prediction (in general), and no parameter tuning necessary.
• Terminology: training size (N), total number of attributes (M), number of attributes used (m), total number of trees (n); these map onto common library parameters as sketched below.
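A minimal sketch of that mapping, using scikit-learn's parameter names (an assumption; note that scikit-learn draws the m attributes per split rather than per tree):

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(
        n_estimators=100,   # n: total number of trees
        max_features=3,     # m: number of attributes tried at a split (m < M)
        bootstrap=True,     # each tree trains on N samples drawn with replacement
    )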


HOW DOES RANDOM FOREST WORK?

• A random seed is chosen, which pulls out at random a collection of samples from the training dataset while maintaining the class distribution.
• From this selected dataset, a random set of attributes from the original dataset is chosen, based on user-defined values. Not all input variables are considered, because of the enormous computation and the high chance of overfitting.
• If M is the total number of input attributes in the dataset, only m attributes are chosen at random for each tree, where m < M.
• The attribute in this set that creates the best possible split, scored with the Gini index, is used to grow the decision tree model. This process repeats for each branch until the termination condition is met: the leaves are nodes that are too small to split.
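A minimal sketch of this per-tree randomization, assuming plain bootstrap sampling (the class-distribution-preserving draw and the CART growing step are left out; the function names are illustrative):

    import numpy as np

    rng = np.random.default_rng(42)                  # the chosen random seed

    def draw_tree_inputs(X, y, m):
        """Bootstrap N rows and pick m of the M attributes for one tree."""
        N, M = X.shape
        rows = rng.integers(0, N, size=N)            # N samples, with replacement
        cols = rng.choice(M, size=m, replace=False)  # m < M attributes at random
        return X[np.ix_(rows, cols)], y[rows], cols

    def gini(y):
        """Gini index used to score candidate split points."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)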


MORE ON RANDOM FOREST - I

• Information provided by a random forest:
  - Classification accuracy
  - Variable importance
  - Outliers (classification)
  - Missing-data estimation
  - Error rates for the random forest object
• Advantages:
  - No need to prune trees
  - Accuracy and variable importance are generated automatically (see the sketch after this list)
  - Overfitting is not a problem
  - Not very sensitive to outliers in the training data
  - Easy to set parameters
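A minimal sketch of reading the accuracy estimate (via out-of-bag self-testing) and the variable importances off a fitted forest; the dataset is illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    clf.fit(X, y)
    print("OOB accuracy:", clf.oob_score_)               # self-testing estimate
    print("importances:", clf.feature_importances_[:5])  # variable importance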


MORE ON RANDOM FOREST - II

• Limitations:
  - Regression cannot predict beyond the range of the targets in the training data (demonstrated in the sketch after this list).
  - Extreme values are not predicted accurately.
• Applications:
  - Classification: land cover classification, cloud screening
  - Regression: continuous field mapping, biomass mapping
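A minimal sketch of the extrapolation limitation, with illustrative data: the forest's prediction saturates at the edge of the training-target range.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X = np.linspace(0, 10, 200).reshape(-1, 1)
    y = 2 * X.ravel()                                 # training targets span [0, 20]
    reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    print(reg.predict([[15.0]]))                      # stays near 20, not the true 30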


MOTIVATION

• Efficient use of multi-core technology: although this is OS dependent, using Hadoop ensures that multiple cores are used efficiently.


THE WINNOW ALGORITHM

• A technique from machine learning for learning a linear classifier from labelled examples.
• Similar to the perceptron algorithm, but while the perceptron uses an additive weight-update scheme, Winnow uses a multiplicative weight-update scheme.
• Performs well when many of the features given to the learner turn out to be irrelevant.
• During training, it is shown a sequence of positive and negative examples, from which it learns a decision hyperplane that can classify novel examples as positive or negative.
• Uses a linear threshold function (like the perceptron training algorithm) as its hypothesis and performs incremental updates to the current hypothesis.
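Side by side, the two update schemes on a mistake look like this (a sketch; the perceptron is shown with learning rate 1, and the doubling/halving factors are the ones used in the example later in this deck):

    positive mistake:  $w \leftarrow w + x$  (perceptron, additive)   vs.   $w_i \leftarrow 2\,w_i$ for each $x_i = 1$  (Winnow, multiplicative)
    negative mistake:  $w \leftarrow w - x$  (perceptron, additive)   vs.   $w_i \leftarrow w_i/2$ for each $x_i = 1$  (Winnow, multiplicative)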


THE ALGORITHM

• Initialize the weights w1, …, wn to 1.
• Winnow and the perceptron algorithm use the same classification scheme: predict positive when w·x reaches the threshold.
• Winnow differs from the perceptron algorithm in its update scheme:
  - When misclassifying a positive training example x (i.e. the prediction was negative because w·x was too small), double the weight wi of each attribute with xi = 1.
  - When misclassifying a negative training example x (i.e. the prediction was positive because w·x was too large), halve the weight wi of each attribute with xi = 1.


THE WINNOW ALGORITHM: SPAM EXAMPLE

• Each email is represented as a Boolean vector indicating which phrases appear and which don't.
• An email is SPAM if at least one of the phrases in the set S is present, i.e. the target is a disjunction.


SIMPLE ALGORITHM FOR LEARNING A DISJUNCTION
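A minimal sketch of the standard elimination algorithm for learning a monotone disjunction (an assumption about which simple algorithm is meant): start with the OR of all n variables and drop every variable that fires in a misclassified negative example. Each mistake removes at least one variable, so it makes at most n mistakes.

    def learn_disjunction(examples, n):
        """examples: iterable of (x, label) pairs, x a 0/1 sequence of length n."""
        relevant = set(range(n))                 # hypothesis: OR of all n variables
        for x, label in examples:
            predicted = any(x[i] for i in relevant)
            if predicted and label == 0:         # mistake on a negative example
                relevant -= {i for i in relevant if x[i] == 1}
        return relevant                          # indices left in the disjunction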


EXAMPLE – THE WINNOW ALGORITHM

• Initialize the weights w1, …, wn = 1 on the n variables.
• Given an example x = (x1, …, xn), output 1 if w1x1 + … + wnxn ≥ n; else output 0.
• If the algorithm makes a mistake:
  - On a positive example: if it predicts 0 when f(x) = 1, then for each xi equal to 1, double the value of wi.
  - On a negative example: if it predicts 1 when f(x) = 0, then for each xi equal to 1, cut the value of wi in half.
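A minimal sketch of this procedure; the spam-style phrase vectors and the target disjunction are illustrative assumptions:

    import numpy as np

    def winnow(examples, n, epochs=10):
        """examples: list of (x, label), x a 0/1 array of length n, label in {0, 1}."""
        w = np.ones(n)                            # initialize w1, ..., wn to 1
        for _ in range(epochs):
            for x, label in examples:
                pred = 1 if w @ x >= n else 0     # output 1 iff w.x >= n
                if pred == 0 and label == 1:
                    w[x == 1] *= 2.0              # mistake on positive: double
                elif pred == 1 and label == 0:
                    w[x == 1] /= 2.0              # mistake on negative: halve
        return w

    # Illustrative target: SPAM iff phrase 0 or phrase 2 appears (a disjunction).
    data = [(np.array(x), int(x[0] or x[2]))
            for x in [(1,0,0,1), (0,1,0,0), (0,0,1,1), (0,1,1,0), (1,1,0,0)]]
    print(winnow(data, n=4))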


MAXIMUM ENTROPY

• The principle of maximum entropy states that, subject to precisely stated prior data, the probability distribution which best represents the current state of knowledge is the one with the largest entropy.
• Commonly used in natural language processing, speech, and information retrieval.
• What is a maximum entropy classifier?
  - A probabilistic classifier that belongs to the class of exponential models.
  - Does not assume that the features are conditionally independent of each other.
  - Based on the principle of maximum entropy: of all the models that fit the training data, it selects the one with the largest entropy.

Page 15: Random forest

TESTABLE INFORMATION

• A piece of information is testable if it can be determined whether a given distribution is consistent with it. For example,
  - "The expectation of the variable x is 2.87", and
  - "p2 + p3 > 0.6"
  are statements of testable information.
• The maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints imposed by that information (see the sketch after this list).
• In the simplest case, entropy maximization takes place under a single constraint: the sum of the probabilities must be one (which yields the uniform distribution).
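A minimal sketch of the procedure, using the slide's constraint E[x] = 2.87 and assuming, for illustration, that x takes the values 1 through 6:

    import numpy as np
    from scipy.optimize import minimize

    values = np.arange(1, 7)                      # support of x: 1..6 (assumption)

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)                # avoid log(0)
        return np.sum(p * np.log(p))              # minimizing this maximizes entropy

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # sum to 1
        {"type": "eq", "fun": lambda p: np.dot(p, values) - 2.87},  # E[x] = 2.87
    ]
    p0 = np.full(6, 1.0 / 6.0)                    # start from the uniform distribution
    res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * 6,
                   constraints=constraints, method="SLSQP")
    print(np.round(res.x, 4))                     # maxent distribution under the constraint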


APPLICATIONS

• When to use maximum entropy?
  - Since it makes minimal assumptions, use it when nothing is known about the prior distribution.
  - Use it when conditional independence of the features cannot be assumed.
• The principle of maximum entropy is commonly applied to inferential problems in two ways:
  - Prior probabilities: it is often used to obtain the prior probability distribution for Bayesian inference.
  - Maximum entropy models: model specifications that are widely used in natural language processing, e.g. logistic regression (a sketch follows).
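A minimal sketch of a maximum entropy model in practice: multinomial logistic regression via scikit-learn (the dataset and parameters are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The softmax (multinomial) formulation is a maxent model: among the
    # distributions matching the feature expectations, it has maximum entropy.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))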