Zeppelin quick start guide
1. Preface
Our Big Data Machine Learning Service (BDMLS) is built on the Hadoop
platform. It primarily provides big data analysis algorithms via Apache Spark. We package
the Spark algorithms into the runspark executable for a convenient command line
interface.
This document uses practical cases to introduce how to use runspark
to analyze data quickly on Zeppelin. Only some runspark commands are covered in this
document. For questions about other commands, refer to [BDMLS Internal
Application Structure Instructions].
The system is still under development; more functions will be added later.
2. Introduction to Zeppelin
Zeppelin is a notebook similar to IPython. It offers a web interface and was
developed by NFLabs. It can be used for data analysis and visualization.
Zeppelin supports Apache Spark Scala, Apache Spark Python, SparkSQL, Hive,
Markdown and Shell. On Zeppelin, we can use the following commands to easily
use runspark and the diagram functions we provide.
2.1 Introduction to interpreters
The basic interpreter offered by Zeppelin is %sh. We can use this interpreter to
execute the commands we would normally run in a terminal. For example, the user can
enter
%sh
run1
to look up the algorithms offered by runspark.
(The %sh interpreter will be removed later due to security concerns.)
Zeppelin also offers %python and %r for users familiar with Python or R to write
programs on Zeppelin, which is very convenient.
For quick data analysis, we developed an interpreter, %run, to call runspark. The
interpreter is used much like the terminal: you just add the command you want to run
after %run. For example,
%run
logreg -i [input data path] -f [data format (schema=…)] -o [output name]
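For instance, a hypothetical invocation (the file name train.csv and the 12-feature schema are illustrative assumptions, not fixed names):
%run
logreg -i train.csv -f header=true schema=features:vec[12],label:double -o logregModel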
3. Runspark examples
3.1 Upload the external analysis data
Our data is mainly stored in HDFS. For now, you can upload files from the
Zeppelin web page. By entering your [User Name], you can upload a file to your HDFS
folder.
Click “Upload” on the Home page to enter the data upload page.
Enter your user name in the user field.
Quickly look up data
ls
Use %run ls
to list the data currently on HDFS.
Use %run ls [Data Name to be Looked Up]
to look up its path.
show
Prints the content of a file generated by runspark.
Use %run show [Data Path].
You can look up the path with ls if you don’t know it.
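For example, assuming a runspark output named training already exists (an illustrative name; see Example 1):
%run
ls training
Then pass the path it prints to show:
%run
show [path printed by ls]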
Quickly look up algorithm and command usage
help
Use %run help
to list all the analysis commands provided by runspark.
Use %run help [Command Name]
or %run ? [Command Name]
to look up how to use a command.
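For example, either of the following looks up the usage of the linreg algorithm used later in this guide:
%run
help linreg
%run
? linreg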
3.2 Example 1: Regression algorithm
Data introduction
In this example, we use driving behavior data. We use the mileage and
average speed as references for car insurance premiums. We grade the premium,
and the higher the grade, the more likely it is that the driver has safe driving habits.
We want to predict these scores.
Before building the model, we need to fill in 0 for the Chinese characters
and the missing or empty parts of the data, or encode them into numbers. In this
example, we don’t use those fields.
Data preparation
Functions for data preparation are under development. We have organized the
data and uploaded it to HDFS. We need to define features (such as the average speed)
and labels (in this example, the score is the label) when building the regression
model. When organizing the data fields, we put the features first and the label after
them.
In the figure, the red frame indicates the feature columns (12 in total) and the blue
frame indicates the label column, whose dimension is 1.
Finally we generate two files like these: one for model training and
the other for model testing.
Before training models, we suggest defining the data format and saving it as
Spark parameter packets, which speeds up the calculation. To convert the data into
a parameter packet, we mostly use cast. The command is used as follows.
%run
cast -i Train.csv -f header=true schema=features:vec[12],label:double -o training
-i: the input data follows this flag.
-f: sets the data format. Our data has a header row with field names, so we set the
header to true; if it does not, we can omit the header option. “schema” indicates which
columns are features and which is the label. In the example above, the first 12 columns
are features of type vec, and the label’s type is double.
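For instance, if a hypothetical Train_noheader.csv had no header row, the same conversion would simply omit header=true:
%run
cast -i Train_noheader.csv -f schema=features:vec[12],label:double -o training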
Model building and assessment
Parameters are known
Runspark offers many regression algorithms. We take linreg (linear regression)
as an example. To look up how to use this algorithm, we just run %run help linreg
to see its parameters and instructions.
In this example, we use the following command to train models:
linreg regParam=0.01 -i training -o models
“regParam” is a model parameter. You can set it yourself, use
the default setting or adjust other parameters. There is no single best value; you need to
experiment with it.
-i: the training data follows this flag.
-o: the name of the model output after training.
We need to evaluate the outcome after model training. We use the testing data
obtained after cast conversion.
apply models -i testing -o Prediction
testing models task=reg -i testing -o report
apply outputs the model’s predicted scores (e.g. 80).
testing outputs the default evaluation methods and results (e.g. R-squared).
After both commands, add the name of the trained model, then specify the input data
and output result with -i and -o. The difference is that you must set the
model type with task=reg when using the testing command.
In Zeppelin, we can write all the commands above and execute them in one
paragraph.
%run
cast -i Train.csv -f header=true schema=features:vec[12],label:double -o training
linreg regParam=0.01 -i training -o models
cast -i Test.csv -f header=true schema=features:vec[12],label:double -o testing
apply models -i testing -o Prediction
testing models task=reg -i testing -o report
Parameters are unknown
You can also follow the method in the previous section and set each parameter value
one by one, but that is a hassle. Runspark offers the grid command to
set multiple values for a parameter at once.
By adding grid before the algorithm, you can set multiple values for the parameter
and output multiple models. These models are also picked up automatically during
the evaluation later on.
As for the example above, the command can be changed to the following:
%run
cast -i Train.csv -f header=true schema=features:vec[12],label:double -o training
grid linreg regParam=0.01,0.02,0.03,0.04,0.05 -i training -o models
cast -i Test.csv -f header=true schema=features:vec[12],label:double -o testing
testing models task=reg -i testing -o report
After checking which model performs better, take the parameter value you prefer and
use the apply command.
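For example, assuming regParam=0.02 performed best in the report (a hypothetical outcome), you could retrain with that single value and produce predictions:
%run
linreg regParam=0.02 -i training -o bestModel
apply bestModel -i testing -o Prediction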
Diagram functions currently supported
RegScatCSV
This function uses a scatter plot to show the distribution of Fields 1 and 2.
[Instruction]
%run
RegScatCSV -i [Data Path] -f1 [Field 1] -f2 [Field 2]
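A hypothetical call on the training CSV from Example 1, assuming -f1/-f2 accept column names and that the CSV has columns named mileage and avgSpeed (both assumptions for illustration):
%run
RegScatCSV -i Train.csv -f1 mileage -f2 avgSpeed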
RegBar
This function provides an analysis diagram of the proportional error between
actual and predicted values during model evaluation.
[Instruction]
%run
RegBar -i [The Data Path Output via the apply Command] -n [Number of Bars] -s [Error Interval]
-n: indicates the number of bars shown on the bar chart.
-s: indicates the numeric interval between bars; in this example, the error interval.
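For instance, using the Prediction output produced by apply in Example 1 (the bar count and interval are illustrative values):
%run
RegBar -i Prediction -n 10 -s 5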
RegScat
This function uses scatter plots to show the actual and predicted values.
[Instruction]
%run
RegScat -i [The Data Path Output via the apply Command] -n [Number of Dots (50 by default)]
-n: refers to the number of data entries displayed.
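For instance, with the Prediction output from Example 1:
%run
RegScat -i Prediction -n 50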
RegEval
This function is mainly used to present the outcome for the evaluation of various
parameters. For regression, R-squared, mse, rmse and mae are used for evaluation.
This diagram is usually used with the grid command.
[Instruction]
%run
RegEval -i [The Outcome Path Output via the testing Command] -d [Parameter Sequence]
-d: indicates the sequence of parameter values, such as 0.1,0.2,0.3,0.4,0.5.
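For instance, with the grid run above, whose testing command wrote its outcome to report:
%run
RegEval -i report -d 0.01,0.02,0.03,0.04,0.05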
3.3 Example 2: Regression model
Data introduction
In this example, we use open highway data from the Taiwan Area National
Freeway Bureau (freeway.csv). The data includes the Route ID, date fields (the month,
the day of the week, etc.) and the travel time. We aim to predict the time it takes to drive
through a section (travel time).
Data preparation
First, we select the section we wish to analyze. In this example, we select the data
with the condition RouteID = 10. We execute the following steps:
1. Convert data into the data frame
[Instruction]
cast -i freeway.csv -f csv header=true schema=RouteID:int,DateTime:string,TripTime:double,timeSlot:double,month:double,weekday:double,unknown:double -o freeway
Setting the schema here ensures each column in the data is read with the correct type.
2. We select the section data to be used, save it as another file and select features.
[Instruction]
filter -i freeway -f RouteID=10 -o freeway_10
Use filter to set conditions (-f) and save the data as another file (-o).
tovec inputCols=timeSlot,month,weekday,unknown outputCol=features -i freeway_10 -o freewaydata
We use tovec to select features: we set the fields to be used (timeSlot, month,
weekday, unknown), combine them into a new field (features) and output the file
(freewaydata).
split parts=0.2,0.8 -i freewaydata -o freeway_10_testing,freeway_10_training
Finally, we use split to divide the organized data into a test set and a training set
at a ratio of 2:8.
Model building and assessment
If you’re not sure which model or command to use, you can enter help to
list all command names, and use help [Command Name] to look up a command’s
detailed usage.
1. You can use the linear regression model. featuresCol is the features field
generated before. labelCol defaults to label; in this example, you need to set
labelCol to TripTime.
linreg labelCol=TripTime featuresCol=features -i freeway_10_training -o freeway_10_model
2. By using testing and setting labelCol to TripTime, you can output the preset
evaluation results.
testing freeway_10_model labelCol=TripTime task=reg -i freeway_10_testing -o freewayreport_10
3. Use apply to acquire the predicted outcome.
apply freeway_10_model -i freeway_10_testing -o freewayPrediction_10
4. Use RegScat to check the result with a scatter plot.
RegScat -i freewayPrediction_10 -l TripTime
5. Use RegBar to check the result with a bar chart.
RegBar -i freewayPrediction_10 -n 15 -s 500 -l TripTime
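As in Example 1, all the steps above can be written and executed as a single %run paragraph:
%run
cast -i freeway.csv -f csv header=true schema=RouteID:int,DateTime:string,TripTime:double,timeSlot:double,month:double,weekday:double,unknown:double -o freeway
filter -i freeway -f RouteID=10 -o freeway_10
tovec inputCols=timeSlot,month,weekday,unknown outputCol=features -i freeway_10 -o freewaydata
split parts=0.2,0.8 -i freewaydata -o freeway_10_testing,freeway_10_training
linreg labelCol=TripTime featuresCol=features -i freeway_10_training -o freeway_10_model
testing freeway_10_model labelCol=TripTime task=reg -i freeway_10_testing -o freewayreport_10
apply freeway_10_model -i freeway_10_testing -o freewayPrediction_10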
3.4 Example 3: Clustering algorithm
Data introduction
In this example, we use the driving behavior data. We use the mileage and
average speed as references for the car premium, which we want to determine based
on these indicators. We therefore grade the premium: the higher the grade, the more
safely the driver drives. We want to divide the driving behavior data into 10 groups.
Before building the model, we need to fill in 0 for the Chinese characters
and the missing or empty parts of the data, or encode them into numbers. In this
example, we don’t use those fields.
Data preparation
Functions for data preparation are under development. We have organized the
data and uploaded it to HDFS. We need to define features (such as the average speed)
when building the clustering model, so we organize the data fields as shown in the
diagram below.
The red frame indicates the feature columns (12 in total).
Finally, we generate one file, used both for training the model and for viewing the
actual clustering result (testing).
Before training models, we suggest defining the data format and saving it as
Spark parameter packets, which speeds up the calculation. To convert the data into
a parameter packet, we mostly use cast. The command is used as follows.
%run
cast -i Train.csv -f header=true schema=features:vec[12] -o training
-i: the input data follows this flag.
-f: sets the data format. Our data has a header row with field names, so we set the
header to true; if it does not, we can omit the header option. “schema” indicates which
columns are features. In the example above, the first 12 columns are features of
type vec.
Build models and output results
We use the common kmeans algorithm as an example.
To look up how to use this algorithm, we just run %run ? kmeans
to see its parameters and instructions.
In this example, we use the following command to train models:
%run
kmeans k=10 -i [Data Path to be Input] -o [Model Name]
k: indicates the number of groups to divide the data into.
-i: indicates the data path for input.
-o: indicates the model name for output.
Next, we check which group each data point is assigned to. We just need to feed
the data to the model with apply, as before.
%run
apply [Model Name] -i [Data Path to be Input] -o [Name of the Result to be Output]
-i: indicates the data path for input.
-o: indicates the result name for output.
If we want to see the output results, we can use the show command introduced earlier.
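Putting the clustering workflow together (the model and result names are illustrative):
%run
cast -i Train.csv -f header=true schema=features:vec[12] -o training
kmeans k=10 -i training -o clusterModel
apply clusterModel -i training -o clusterResult
show clusterResult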