Zeppelin quick start guide
1. Preface
Our Big Data Machine Learning Service (BDMLS) is built on the Hadoop
platform. It primarily provides big data analysis algorithms via Apache Spark. We package
the Spark algorithms into the runspark executable for a convenient command line
interface.
This document uses practical cases to introduce how to use runspark
to analyze data quickly on Zeppelin. Only some runspark commands are covered in this
document. For questions about other commands, refer to [BDMLS Internal
Application Structure Instructions].
The system is still under development; more functions will be added later.
2. Introduction to Zeppelin
Zeppelin is a notebook similar to IPython. It offers a web interface and was
developed by NFLabs. It can be used for data analysis and visualization.
Zeppelin supports Apache Spark Scala, Apache Spark Python, SparkSQL, Hive,
Markdown and Shell. On Zeppelin, we can use the following commands to easily
use runspark and the diagram functions we provide.
2.1 Introduction to interpreters
The basic interpreter offered by Zeppelin is %sh. We can use this interpreter to
execute the commands we would normally run in a terminal. For example, the user can
enter
%sh
run1
to look up the algorithms offered by runspark.
(The %sh interpreter will be removed later due to security concerns.)
Zeppelin also offers %python and %r for users familiar with Python or R to write
programs on Zeppelin, which is very convenient.
For quick data analysis, we developed an interpreter, %run, to call runspark. The
interpreter is used much like the terminal: you just add the command you want to run
after %run. For example,
%run
logreg -i [input data path] -f [data format (schema=…)] -o [output name]
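For instance, a hypothetical invocation (the file name train.csv and the 12-feature schema are illustrative assumptions, not fixed names):
%run
logreg -i train.csv -f header=true schema=features:vec[12],label:double -o logregModel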
3. Runspark examples
3.1 Upload the external analysis data
Our data is mainly stored in HDFS. For now, you can upload files from the
Zeppelin web page. By entering your [User Name], you can upload a file to your HDFS
folder.
Click “Upload” on the Home page to enter the data upload page.
Enter your user name in the user field.
Quickly look up data
ls
Use %run ls
to list the data currently on HDFS.
Use %run ls [Data Name to be Looked Up]
to look up its path.
show
Prints the content of a file generated by runspark.
Use %run show [Data Path].
You can look up the path with ls if you don’t know it.
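For example, assuming a runspark output named training already exists (an illustrative name; see Example 1):
%run
ls training
Then pass the path it prints to show:
%run
show [path printed by ls]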
Quickly look up algorithm and command usage
help
Use %run help
to list all the analysis commands provided by runspark.
Use %run help [Command Name]
or %run ? [Command Name]
to look up how to use a command.
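For example, either of the following looks up the usage of the linreg algorithm used later in this guide:
%run
help linreg
%run
? linreg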
3.2 Example 1: Regression algorithm
Data introduction
In this example, we use driving behavior data. We use the mileage and
average speed as references for car insurance premiums. We grade the premium,
and the higher the grade, the more likely it is that the driver has safe driving habits.
We want to predict these scores.
Before building the model, we need to fill in 0 for the Chinese characters
and the missing or empty parts of the data, or encode them into numbers. In this
example, we don’t use those fields.
Data preparation
Functions for data preparation are under development. We have organized the
data and uploaded it to HDFS. We need to define features (such as the average speed)
and labels (in this example, the score is the label) when building the regression
model. When organizing the data fields, we put the features first and the label after
them.
In the figure, the red frame indicates the feature columns (12 in total) and the blue
frame indicates the label column, whose dimension is 1.
Finally we generate two files like these: one for model training and
the other for model testing.
Before training models, we suggest defining the data format and saving it as
Spark parameter packets, which speeds up the calculation. To convert the data into
a parameter packet, we mostly use cast. The command is used as follows.
%run
cast -i Train.csv -f header=true schema=features:vec[12],label:double -o training
-i: the input data follows this flag.
-f: sets the data format. Our data has a header row with field names, so we set the
header to true; if it does not, we can omit the header option. “schema” indicates which
columns are features and which is the label. In the example above, the first 12 columns
are features of type vec, and the label’s type is double.
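For instance, if a hypothetical Train_noheader.csv had no header row, the same conversion would simply omit header=true:
%run
cast -i Train_noheader.csv -f schema=features:vec[12],label:double -o training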
Model building and assessment
Parameters are known
Runspark offers many regression algorithms. We take linreg (linear regression)
as an example. To look up how to use this algorithm, we just run %run help linreg
to see its parameters and instructions.
In this example, we use the following command to train models:
linreg regParam=0.01 -i training -o models
“regParam” is a model parameter. You can set it yourself, use
the default setting or adjust other parameters. There is no single best value; you need to
experiment with it.
-i: the training data follows this flag.
-o: the name of the model output after training.
We need to evaluate the outcome after model training. We use the testing data
obtained after cast conversion.
apply models -i testing -o Prediction
testing models task=reg -i testing -o report
apply outputs the model’s predicted scores (e.g. 80).
testing outputs the default evaluation methods and results (e.g. R-squared).
After both commands, add the name of the trained model, then specify the input data
and output result with -i and -o. The difference is that you must set the
model type with task=reg when using the testing command.
In Zeppelin, we can write all the commands above and execute them in one
paragraph.
%run
cast -i Train.csv -f header=true schema=features:vec[12],label:double -o training
linreg regParam=0.01 -i training -o models
cast -i Test.csv -f header=true schema=features:vec[12],label:double -o testing
apply models -i testing -o Prediction
testing models task=reg -i testing -o report
Parameters are unknown
You can also follow the method in the previous section and set each parameter value
one by one, but that is a hassle. Runspark offers the grid command to
set multiple values for a parameter at once.
By adding grid before the algorithm, you can set multiple values for the parameter
and output multiple models. These models are also picked up automatically during
the evaluation later on.
As for the example above, the command can be changed to the following:
%run
cast -i Train.csv -f header=true schema=features:vec[12],label:double -o training
grid linreg regParam=0.01,0.02,0.03,0.04,0.05 -i training -o models
cast -i Test.csv -f header=true schema=features:vec[12],label:double -o testing
testing models task=reg -i testing -o report
After checking which model performs better, take the parameter value you prefer and
use the apply command.
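For example, assuming regParam=0.02 performed best in the report (a hypothetical outcome), you could retrain with that single value and produce predictions:
%run
linreg regParam=0.02 -i training -o bestModel
apply bestModel -i testing -o Prediction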
Diagram functions currently supported
RegScatCSV
This function uses a scatter plot to show the distribution of Fields 1 and 2.
[Instruction]
%run
RegScatCSV -i [Data Path] -f1 [Field 1] -f2 [Field 2]
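A hypothetical call on the training CSV from Example 1, assuming -f1/-f2 accept column names and that the CSV has columns named mileage and avgSpeed (both assumptions for illustration):
%run
RegScatCSV -i Train.csv -f1 mileage -f2 avgSpeed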
RegBar
This function provides an analysis diagram of the proportional error between
actual and predicted values during model evaluation.
[Instruction]
%run
RegBar -i [The Data Path Output via the apply Command] -n [Number of Bars] -s [Error Interval]
-n: indicates the number of bars shown on the bar chart.
-s: indicates the numeric interval between bars; in this example, the error interval.
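For instance, using the Prediction output produced by apply in Example 1 (the bar count and interval are illustrative values):
%run
RegBar -i Prediction -n 10 -s 5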
RegScat
This function uses scatter plots to show the actual and predicted values.
[Instruction]
%run
RegScat -i [The Data Path Output via the apply Command] -n [Number of Dots (50 by default)]
-n: refers to the number of data entries displayed.
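For instance, with the Prediction output from Example 1:
%run
RegScat -i Prediction -n 50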
RegEval
This function is mainly used to present the outcome for the evaluation of various
parameters. For regression, R-squared, mse, rmse and mae are used for evaluation.
This diagram is usually used with the grid command.
[Instruction]
%run
RegEval -i [The Outcome Path Output via the testing Command] -d [Parameter Sequence]
-d: indicates the sequence of parameter values, such as 0.1,0.2,0.3,0.4,0.5.
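For instance, with the grid run above, whose testing command wrote its outcome to report:
%run
RegEval -i report -d 0.01,0.02,0.03,0.04,0.05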
3.3 Example 2: Regression model
Data introduction
In this example, we use open highway data from the Taiwan Area National
Freeway Bureau (freeway.csv). The data includes the Route ID, date fields (the month,
the day of the week, etc.) and the travel time. We aim to predict the time it takes to drive
through a section (travel time).
Data preparation
First, we select the section we wish to analyze. In this example, we select the data
with the condition RouteID = 10. We execute the following steps:
1. Convert data into the data frame
[Instruction]
cast -i freeway.csv -f csv header=true schema=RouteID:int,DateTime:string,TripTime:double,timeSlot:double,month:double,weekday:double,unknown:double -o freeway
Setting the schema here ensures each column in the data is read with the correct type.
2. We select the section data to be used, save it as another file and select features.
[Instruction]
filter -i freeway -f RouteID=10 -o freeway_10
Use filter to set conditions (-f) and save the data as another file (-o).
tovec inputCols=timeSlot,month,weekday,unknown outputCol=features -i freeway_10 -o freewaydata
We use tovec to select features: we set the fields to be used (timeSlot, month,
weekday, unknown), combine them into a new field (features) and output the file
(freewaydata).
split parts=0.2,0.8 -i freewaydata -o freeway_10_testing,freeway_10_training
Finally, we use split to divide the organized data into a test set and a training set
at a ratio of 2:8.
Model building and assessment
If you’re not sure which model or command to use, you can enter help to
list all command names, and use help [Command Name] to look up a command’s
detailed usage.
1. You can use the linear regression model. featuresCol is the features field
generated before. labelCol defaults to label; in this example, you need to set
labelCol to TripTime.
linreg labelCol=TripTime featuresCol=features -i freeway_10_training -o freeway_10_model
2. By using testing and setting labelCol to TripTime, you can output the preset
evaluation results.
testing freeway_10_model labelCol=TripTime task=reg -i freeway_10_testing -o freewayreport_10
3. Use apply to acquire the predicted outcome.
apply freeway_10_model -i freeway_10_testing -o freewayPrediction_10
4. Use RegScat to check the result with a scatter plot.
RegScat -i freewayPrediction_10 -l TripTime
5. Use RegBar to check the result with a bar chart.
RegBar -i freewayPrediction_10 -n 15 -s 500 -l TripTime
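As in Example 1, all the steps above can be written and executed as a single %run paragraph:
%run
cast -i freeway.csv -f csv header=true schema=RouteID:int,DateTime:string,TripTime:double,timeSlot:double,month:double,weekday:double,unknown:double -o freeway
filter -i freeway -f RouteID=10 -o freeway_10
tovec inputCols=timeSlot,month,weekday,unknown outputCol=features -i freeway_10 -o freewaydata
split parts=0.2,0.8 -i freewaydata -o freeway_10_testing,freeway_10_training
linreg labelCol=TripTime featuresCol=features -i freeway_10_training -o freeway_10_model
testing freeway_10_model labelCol=TripTime task=reg -i freeway_10_testing -o freewayreport_10
apply freeway_10_model -i freeway_10_testing -o freewayPrediction_10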
3.4 Example 3: Clustering algorithm
Data introduction
In this example, we use the driving behavior data. We use the mileage and
average speed as references for the car premium, which we want to determine based
on these indicators. We therefore grade the premium: the higher the grade, the more
safely the driver drives. We want to divide the driving behavior data into 10 groups.
Before building the model, we need to fill in 0 for the Chinese characters
and the missing or empty parts of the data, or encode them into numbers. In this
example, we don’t use those fields.
Data preparation
Functions for data preparation are under development. We have organized the
data and uploaded it to HDFS. We need to define features (such as the average speed)
when building the clustering model, so we organize the data fields as shown in the
diagram below.
The red frame indicates the feature columns (12 in total).
Finally, we generate one file, used both for training the model and for viewing the
actual clustering result (testing).
Before training models, we suggest defining the data format and saving it as
Spark parameter packets, which speeds up the calculation. To convert the data into
a parameter packet, we mostly use cast. The command is used as follows.
%run
cast -i Train.csv -f header=true schema=features:vec[12] -o training
-i: the input data follows this flag.
-f: sets the data format. Our data has a header row with field names, so we set the
header to true; if it does not, we can omit the header option. “schema” indicates which
columns are features. In the example above, the first 12 columns are features of
type vec.
Build models and output results
We use the common kmeans algorithm as an example.
To look up how to use this algorithm, we just run %run ? kmeans
to see its parameters and instructions.
In this example, we use the following command to train models:
%run
kmeans k=10 -i [Data Path to be Input] -o [Model Name]
k: indicates the number of groups to divide the data into.
-i: indicates the data path for input.
-o: indicates the model name for output.
Next, we check which group each data point is assigned to. We just need to feed
the data to the model with apply, as before.
%run
apply [Model Name] -i [Data Path to be Input] -o [Name of the Result to be Output]
-i: indicates the data path for input.
-o: indicates the result name for output.
If we want to see the output results, we can use the show command introduced earlier.
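Putting the clustering workflow together (the model and result names are illustrative):
%run
cast -i Train.csv -f header=true schema=features:vec[12] -o training
kmeans k=10 -i training -o clusterModel
apply clusterModel -i training -o clusterResult
show clusterResult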