RapidMiner is the predictive analytics tool of choice for Pi-Cube, for the following reasons:

1) Best Alternative. RapidMiner is an effective alternative to more costly software such as SAS, while offering a powerful computational platform compared to software such as R.

2) User Friendly. RapidMiner offers an interface built from operators and processes. This streamlines model building by eliminating, or at the very least minimizing, the coding required from the data analyst.

3) Robustness. RapidMiner lets the user customize operators by offering parameter options for the vast majority of them. Various plug-ins can also integrate outside code, for example R code, into a RapidMiner process.

In this tutorial, we (1) highlight some of the basics of RapidMiner’s interface and (2) demonstrate a simple process. Using a loan dataset (provided), we will predict whether a particular customer will default on their loan. The loan dataset has 7 features. Loan status is the dependent variable, with 2 levels: 1 = customer defaults, 0 = customer does not.

The 7 features are:

1) Employment Length. Measured in years.

2) Homeownership Status. 0 = Own, 1 = Rent, 2 = Mortgage.

3) Loan Amount.

4) Annual Income.

5) Debt-to-Income Ratio.

6) Unemployment Rate. Coded as an indicator: 1 = the unemployment rate was below 6.5% when the loan was issued.

7) Loan Status. The dependent variable: 1 = customer defaults, 0 = customer does not.

There are 2000 examples (observations) in the dataset. We recommend you save the dataset on the desktop for easy access.
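
To keep the coding scheme straight, it can help to write it down as a small record type. The sketch below is only an illustration: the field names are hypothetical (the actual column headers in the CSV may differ), and the example record is made up rather than taken from the loan dataset.

```python
from dataclasses import dataclass

# Codes taken from the feature list above.
HOME_OWNERSHIP = {0: "Own", 1: "Rent", 2: "Mortgage"}
LOAN_STATUS = {0: "no default", 1: "default"}


@dataclass
class LoanRecord:
    # Field names are hypothetical; the real CSV header may differ.
    employment_length: float  # years
    home_ownership: int       # 0 = Own, 1 = Rent, 2 = Mortgage
    loan_amount: float
    annual_income: float
    debt_to_income: float
    low_unemployment: int     # 1 = unemployment rate below 6.5% at issue
    loan_status: int          # 1 = default, 0 = no default (the label)

    def describe(self) -> str:
        """Return a readable summary of the two coded categorical fields."""
        return (f"{HOME_OWNERSHIP[self.home_ownership]}, "
                f"{LOAN_STATUS[self.loan_status]}")


# Made-up record for illustration only; not a row from the loan dataset.
example = LoanRecord(5, 1, 10000.0, 45000.0, 0.18, 1, 0)
print(example.describe())
```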

PART 0: The RapidMiner Interface

[Figure: the RapidMiner interface]

Operators: This is where all the different modeling methods, data cleaning, statistical techniques, etc. can be found.

Repositories: This is where you can save (and find) datasets and processes.

Parameters: Each operator is equipped with customizable parameters. You can change these parameters here.

Help: RapidMiner provides documentation, especially for operators, in this box.

This is where you can switch between process construction and results.

“Play” button: Runs the process.

PART 1: Importing the loan dataset.

1) Open RapidMiner and click on “New Process.”

2) To import data, go to File -> Import Data -> Import CSV File…

3) Step 1: The Data Import Wizard will appear. On the top menu bar, click “Desktop” to reveal the files saved on the desktop. Click “loandata.csv” and click “Next.”

4) Step 2: In the box titled “Column Separation,” select “Comma “,”.” Then, click “Next.”

5) Step 3: Nothing to do here. Click “Next.”

6) Step 4: Under “Loan Status,” change “real” to “nominal” and “attribute” to “label.” This is to make sure the types match when the modeling operator is used. Then, click “Next.”

7) Step 5: Name the dataset. Click on “Local Repository” to save the data. Then, click “Finish.”

8) After finishing the import, RapidMiner will shift to the “Results” tab to show the imported dataset.

- Note the green highlighting under “Loan Status.” This is because it was changed from “attribute” to “label.”

- Clicking on “Statistics” shows summary statistics and distributions of the attributes.

- Clicking on “Charts” gives the user the capability to plot the attributes against one another for data visualization.
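
For readers who want to see what the import wizard is doing conceptually, here is a minimal sketch in Python using only the standard library. It is not how RapidMiner works internally. The header names and the four sample rows are made up for illustration; with the real file, `open("loandata.csv", newline="")` would stand in for the in-memory sample, and the column names would need to match the actual header.

```python
import csv
import io
import statistics

# Made-up sample standing in for loandata.csv; header names are hypothetical.
SAMPLE_CSV = """employment_length,home_ownership,loan_amount,annual_income,debt_to_income,low_unemployment,loan_status
5,1,10000,45000,0.18,1,0
2,0,15000,60000,0.25,0,1
10,2,8000,52000,0.10,1,0
1,1,20000,30000,0.35,0,1
"""

LABEL = "loan_status"


def load_loans(handle):
    """Read comma-separated rows, keeping the label nominal and the rest numeric."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {name: float(value) for name, value in raw.items() if name != LABEL}
        row[LABEL] = raw[LABEL]  # left as text: a category, not a number
        rows.append(row)
    return rows


def summarize(rows, column):
    """Return (min, mean, max) of a numeric column, similar to a statistics view."""
    values = [row[column] for row in rows]
    return min(values), statistics.mean(values), max(values)


rows = load_loans(io.StringIO(SAMPLE_CSV))
print(summarize(rows, "loan_amount"))
print(sorted({row[LABEL] for row in rows}))
```

Keeping the label as text mirrors the “real” to “nominal” change in Step 4: the values 0 and 1 name two classes rather than measure a quantity.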

PART 2: Short Process.

1) Locate the imported dataset on the Repositories tab. Drag and drop the dataset onto the Process workspace.

2) Navigate to the Operators tab. Type “Logistic” into the text bar. Drag and drop the “Logistic Regression” operator onto the Process workspace.

3) Connect the “out” port from the dataset to the “tra” port on the logistic regression operator.

4) Connect the “mod” port from the logistic regression operator to the “res” port on the far right-hand side.

5) Click “Play” to run the process.

After clicking “Play,” the “Results” tab will show 2 windows once the process ends. The first is the “Kernel Model,” the model that results from the logistic regression operator. The second is the example dataset (this is basically just the original dataset). Notice there is no prediction here. Why? Because we need another operator.
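
To give a feel for what "training a model" means in this step, the sketch below fits a plain logistic regression by batch gradient descent. This is an assumption-laden illustration, not RapidMiner's implementation: the “Kernel Model” name suggests the operator uses a different, kernel-based algorithm. The two features, their scaling, and the six rows are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_rows, n_cols = len(features), len(features[0])
    weights, bias = [0.0] * n_cols, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_cols, 0.0
        for x, y in zip(features, labels):
            # Difference between the predicted default probability and the label.
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(n_cols):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_rows for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_rows
    return weights, bias


# Made-up toy data: [debt_to_income, low_unemployment]; 1 = default.
X = [[0.10, 1], [0.18, 1], [0.25, 0], [0.35, 0], [0.12, 1], [0.30, 0]]
y = [0, 0, 1, 1, 0, 1]

weights, bias = fit_logistic(X, y)
print(weights, bias)
```

The output is only a set of coefficients, which matches what we saw in RapidMiner: a trained model by itself contains no predictions.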

PART 3: More Interesting Process.

A more interesting process involves the “Apply Model” operator.

1) Navigate to the Operators tab. Type “Apply Model” into the text bar. Drag and drop the “Apply Model” operator onto the Process workspace.

2) Connect the “mod” port from the logistic regression operator to the “mod” port on the Apply Model operator.

3) Connect the “exa” port from the logistic regression operator to the “unl” port on the Apply Model operator.

4) Connect the “lab” port from the Apply Model operator to the “res” port on the far right-hand side.

5) Click “Play” to run the process.

After clicking “Play,” the “Results” tab will show 2 windows once the process ends. The first is what RapidMiner calls the “labeled” dataset. This is the dataset that RapidMiner has edited to include the predictions from the model. The second is just an output of the original dataset.

- The second column from the left is the “prediction.” This column contains what the model predicts will happen based on the borrower’s characteristics. It is SIMILAR to the “Loan Status” column, but not IDENTICAL, because it is the PREDICTION.

- The third and fourth columns (shaded yellow) are the probabilities that the model calculates. The first yellow column from the left shows the likelihood the client WILL NOT default. The second yellow column from the left shows the likelihood the client WILL default.
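
The relationship between the prediction column and the two probability columns can be sketched as follows. This is a rough illustration rather than RapidMiner's actual computation: the coefficients are hypothetical and chosen by hand, the borrowers are made up, the column names are illustrative, and the 0.5 cutoff is an assumption.

```python
import math

# Hypothetical coefficients, chosen by hand for illustration; a real run
# would use whatever the trained model contains.
WEIGHTS = {"debt_to_income": 8.0, "low_unemployment": -1.5}
BIAS = -1.0


def apply_model(row):
    """Return a copy of the row with prediction and probability columns added."""
    z = BIAS + sum(WEIGHTS[name] * row[name] for name in WEIGHTS)
    p_default = 1.0 / (1.0 + math.exp(-z))
    labeled = dict(row)
    labeled["confidence(0)"] = 1.0 - p_default  # likelihood of NO default
    labeled["confidence(1)"] = p_default        # likelihood of default
    labeled["prediction"] = 1 if p_default >= 0.5 else 0  # assumed 0.5 cutoff
    return labeled


# Made-up borrowers, not rows from the loan dataset.
borrowers = [
    {"debt_to_income": 0.10, "low_unemployment": 1, "loan_status": 0},
    {"debt_to_income": 0.35, "low_unemployment": 0, "loan_status": 1},
]

for labeled in map(apply_model, borrowers):
    print(labeled["loan_status"], labeled["prediction"],
          round(labeled["confidence(0)"], 3), round(labeled["confidence(1)"], 3))
```

The two probabilities in each row sum to 1, and the prediction is simply the class with the larger one. The original “Loan Status” value is carried along unchanged, which is why the two columns can disagree.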