Random Forest Model using Apache Mahout
CS 267 : Data Mining Presentation
Guided by : Dr. Tran
-Gaurav Kasliwal
Outline
RandomForest Model
Mahout Overview
RandomForest using Mahout
Problem Description
Working Environment
Data Preparation
ML Model Generation
Demo
Using Gini Index
RandomForest Model
Random forests are an ensemble learning method for classification: a multitude of decision trees is constructed at training time, and the output class is the mode of the classes output by the individual trees.
Developed by Leo Breiman and Adele Cutler.
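As a toy illustration of the majority vote (not from the slides): if five trees predict the classes below, the forest's output is the most frequent class.

```shell
# Toy majority vote over five hypothetical tree predictions (A, B, A, A, B):
# count each class, sort by count descending, print the most frequent one.
printf '%s\n' A B A A B | sort | uniq -c | sort -rn | awk '{ print $2; exit }'
```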
Mahout
Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.
Scalable to large data sets
RandomForest using Mahout
Generate a file descriptor for the dataset.
Run the example with train data and build Decision Forest model.
Use the Decision Forest model to classify the test data and get results.
Tune the model to get better results.
Problem Definition
Benchmark a machine learning model for page ranking on the Yahoo! Learning to Rank dataset.
Train data: 34815 records
Test data: 130166 records
Data description: {R} | {q_id} | {List: feature_id -> feature_value}
where R = {0, 1, 2, 3, 4}, q_id = query id (number), feature_id = number, feature_value = 0 to 1
Working Environment
Ubuntu
Hadoop 1.2.1
Mahout 0.9
Prepare Dataset
Take data from the input text file.
Make a .csv file
Make directory in HDFS and upload train.csv and test.csv to the folder.
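The text-to-CSV step above can be sketched as follows. This is a hypothetical sketch, not from the slides: the file names train.txt/train.csv, the feature count of 700, and filling missing feature ids with 0 are all assumptions; it expands the documented sparse format ("label qid:Q fid:val ...") into a dense CSV.

```shell
# Hypothetical conversion sketch (assumed names and feature count).
NUM_FEATURES=700
# Tiny illustrative input line in the documented sparse format:
printf '2 qid:10 1:0.5 3:0.2\n' > train.txt
awk -v nf="$NUM_FEATURES" '{
  split("", f)                       # clear the per-line feature map
  for (i = 3; i <= NF; i++) {        # fields after the label and qid
    split($i, kv, ":")
    f[kv[1]] = kv[2]
  }
  line = $1                          # relevance label first
  for (j = 1; j <= nf; j++)          # dense columns; missing ids -> 0
    line = line "," ((j in f) ? f[j] : 0)
  print line
}' train.txt > train.csv
```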
Data Loading (load data to HDFS)
#hadoop fs -put train.csv final_data
#hadoop fs -put test.csv final_data
#hadoop fs -ls final_data (check with the ls command)
Using Mahout, make metadata:
#hadoop jar mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p final_data/train.csv -f final_data/train.info1 -d 702 N L
This creates a metadata file train.info1 in the final_data folder.
Create Model (make model)
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 5 -p -t 100 -o final-forest
Test Model (test model)
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i final_data/test.csv -ds final_data/train.info1 -m final-forest -a -mr -o final-pred
Results
Summary of results: confusion matrix and statistics
Tuning
Change the parameters -t and -sl and check the results.
--nbtrees (-t) nbtrees Number of trees to grow
--selection (-sl) m Number of variables to select randomly at each tree-node.
Results
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d final_data/train.csv -ds final_data/train.info1 -sl 700 -p -t 600 -o final-forest2
#hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i final_data/test.csv -ds final_data/train.info1 -m final-forest2 -a -mr -o final-pred2
RF Split Selection
Typically we select about sqrt(K) variables at each split, where K is the total number of predictors available.
If we have 500 columns of predictors, we will select only about 23 of them.
We split our node with the best variable among those 23, not the best variable among all 500.
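A quick sanity check of the sqrt(K) rule of thumb for the K = 500 example (the slide rounds the result up):

```shell
# sqrt(500) ~ 22.4, which the rule of thumb rounds to roughly 23 variables per split.
awk 'BEGIN { printf "%.1f\n", sqrt(500) }'
```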
Using Gini Index
If a dataset T contains examples from n classes, the Gini index gini(T) is defined as:
gini(T) = 1 - sum_{j=1..n} (p_j)^2, where p_j is the relative frequency of class j in T.
If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the Gini index of the split data is:
gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2), where N = N1 + N2.
**The attribute value that provides the smallest gini_split(T) is chosen to split the node.
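A tiny numeric sketch of the split Gini with made-up counts (not from the slides): compute gini = 1 - sum p_j^2 for each subset, then weight by subset size.

```shell
# Toy two-class split: T1 has 4 of class A and 0 of class B (pure),
# T2 has 2 of class A and 4 of class B. Counts are illustrative only.
awk 'BEGIN {
  n1 = 4; g1 = 1 - (4/4)^2 - (0/4)^2   # gini(T1) = 0 (pure subset)
  n2 = 6; g2 = 1 - (2/6)^2 - (4/6)^2   # gini(T2) = 4/9 = 0.4444
  # weighted average over the split: (N1/N)*gini(T1) + (N2/N)*gini(T2)
  printf "%.4f\n", (n1/(n1+n2))*g1 + (n2/(n1+n2))*g2
}'
```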
Example
The example below shows the construction of a single tree using the dataset.
Only two of the original four attributes are chosen for this tree construction.
The table below tabulates the Gini index value for the HOME_TYPE attribute at all possible splits.
The split HOME_TYPE <= 10 has the lowest value.
Gini SPLIT values:
Gini SPLIT(HOME_TYPE<=6)  = 0.4000
Gini SPLIT(HOME_TYPE<=10) = 0.2671
Gini SPLIT(HOME_TYPE<=15) = 0.4671
Gini SPLIT(HOME_TYPE<=30) = 0.3000
Gini SPLIT(HOME_TYPE<=31) = 0.4800