Daniel Adanza Dopazo
MACHINE LEARNING ON BIG DATA USING MONGODB, R AND HADOOP
Master thesis
Maribor, December 2016
STROJNO UČENJE NA VELIKIH PODATKIH Z UPORABO
MONGODB, R IN HADOOP
MACHINE LEARNING ON BIG DATA USING MONGODB,
R AND HADOOP
Magistrsko delo
Študent: Daniel Adanza Dopazo
Študijski program: študijski program 2. stopnje
Informatika in tehnologije komuniciranja
Mentor: red. prof. dr. Vili Podgorelec
Lektor(ica):
Strojno učenje na velikih podatkih z uporabo MongoDB, R in Hadoop
Ključne besede: veliki podatki, strojno učenje, analiza podatkov
UDK: 004.8:004.65(043.2)
Povzetek
Osrednji namen tega magistrskega dela je testiranje različnih pristopov in izvedba več
eksperimentov na različnih podatkovnih zbirkah, razporejenih na infrastrukturi za obdelavo
velikih podatkov. Da bi dosegli ta cilj, smo magistrsko nalogo strukturirali v tri glavne dele.
Najprej smo pridobili deset javno dostopnih podatkovnih zbirk z različnih področij, ki so
dovolj kompleksne (glede na obseg podatkov in število atributov) za namen izvajanja analize
velikih podatkov na ustrezen način. Zbrane podatke smo najprej predhodno obdelali, da bi
bili združljivi s podatkovno bazo MongoDB.
V drugem delu smo analizirali zbrane podatke in izvedli različne poskuse s pomočjo
orodja R, ki omogoča izvedbo statistične obdelave podatkov. Orodje R smo pri tem povezali
s podatkovno bazo MongoDB.
V zadnjem delu smo uporabili še ogrodje Hadoop, s pomočjo katerega smo dokončali
načrtovano infrastrukturo za obdelavo in analizo velikih podatkov. Za namen tega
magistrskega dela smo vzpostavili sistem v načinu enega vozlišča v gruči. Analizirali smo
razlike z vidika učinkovitosti vzpostavljene infrastrukture in delo zaključili z razpravo o
prednostih in slabostih uporabe predstavljenih tehnologij za obdelavo velikih podatkov.
Machine learning on big data using MongoDB, R and Hadoop
Key words: big data, machine learning, data analysis
UDK: 004.8:004.65(043.2)
Abstract
The main purpose of this master thesis is to test different approaches and perform several
experiments on different datasets, deployed on a big data infrastructure. In order to achieve
that goal, we will structure the thesis in three parts.
First of all, we will obtain ten publicly available datasets from different domains, which
are complex enough (in terms of size and number of attributes) to perform big data
analysis properly. Once they are gathered, we will pre-process them so that they are
compatible with the MongoDB database.
Secondly, we will analyse the data and perform various experiments using the R
statistical and data analysis tool, which will in turn be linked to the MongoDB
database.
Finally, we will use Hadoop to complete the planned big data processing infrastructure. For
the purpose of this master thesis, we will use it in single-node cluster mode. We will analyse
the differences from the performance point of view and discuss the advantages and
disadvantages of using the presented big data technologies.
TABLE OF CONTENTS
1 INTRODUCTION ....................................................................................................... 1
2 MASTERING THE DATABASES ............................................................................ 2
2.1 Installing and configuring MongoDB server .......................................................... 2
2.2 Learning Mongo commands ..................................................................................... 6
2.3 Preparation of the datasets ...................................................................................... 19
3 MASTERING THE DATA MINING ...................................................................... 27
3.1 Installing and configuring R ................................................................................. 27
3.2 Learning to use R with its library rmongodb ........................................................ 31
4 MASTERING THE HADOOP ................................................................................ 50
4.1 What is big data and Hadoop? .............................................................................. 50
4.2 Installing and configuring Hadoop ....................................................................... 52
4.3 Deployment of R algorithms in Hadoop ............................................................... 63
5 PERFORMING THE EXPERIMENTS AND ANALYZING THE RESULTS .. 75
6 APPLYING MACHINE LEARNING ALGORITHMS ........................................ 87
7 CONCLUSION ........................................................................................................ 104
8 REFERENCES ...............................................................................................................
LIST OF FIGURES
Picture 2.1: Files contained in the MongoDB server version 3.0 folder. ............................... 3
Picture 2.2: Files contained in the bin folder of the MongoDB server. ................................. 3
Picture 2.3: Practical example of the usage of Mongo. ......................................................... 4
Picture 2.4: Practical example of Mongo running in command mode. ................................. 4
Picture 2.5: Window in the advanced configuration for the operating system Windows 8. . 5
Picture 2.6: Window with the environment variables already configured. ........................... 6
Picture 2.7: Connecting with the hockey dataset in Mongo .................................................. 7
Picture 2.8: Inserting data into our players dataset. ............................................................... 8
Picture 2.9: Output result after including a new row into our players dataset. ..................... 9
Picture 2.10: Accessing collections in Mongo. ................................................................... 10
Picture 2.11: Showing the differences after applying the »pretty« function. ...................... 10
Picture 2.12: Practical example of removing one row in Mongo. ....................................... 11
Picture 2.13: Practical example of updating a collection. ................................................... 11
Picture 2.14: Practical example of dropping a collection. ................................................... 12
Picture 2.15: Sample query in Mongo. ................................................................................ 12
Picture 2.16: Executing a query in our players dataset........................................................ 13
Picture 2.17: Executing a different query in our players dataset. ........................................ 16
Picture 2.18: Getting indexes in the player dataset. ............................................................. 18
Picture 2.19: Showing all implemented datasets in Mongo. ................................................ 20
Picture 2.20: Showing WEKA interface.............................................................................. 21
Picture 2.21: Showing WEKA interface. ............................................................................. 22
Picture 2.22: Showing all attributes of the arrythmia dataset. ............................................. 23
Picture 2.23: Showing the attributes of diabetic dataset. ..................................................... 23
Picture 2.24: Showing the attributes of letter dataset. ......................................................... 24
Picture 2.25: Showing the attributes of nursery dataset. ..................................................... 24
Picture 2.26: Showing the attributes of splice dataset. ........................................................ 24
Picture 2.27: Showing the attributes of student dataset. ...................................................... 25
Picture 2.28: Showing the attributes of tumor dataset. ........................................................ 25
Picture 2.29: Showing the attributes of waveform dataset. ................................................. 25
Picture 2.30: Showing the attributes of cmu dataset. .......................................................... 26
Picture 2.31: Showing the attributes of kddcup dataset....................................................... 26
Picture 3.1: Official documentation of R............................................................................. 29
Picture 3.2: Installing the package »rmongodb« in R. ......................................................... 30
Picture 3.3: Output result after installing »rmongodb«. ...................................................... 31
Picture 3.4: Connecting with Mongo datasets from the R IDE. ........................................... 33
Picture 3.5: Accessing databases and collections from R. ................................................... 35
Picture 3.6: Executing some queries with our sample data from R. .................................... 37
Picture 3.7: Executing some queries from R. ....................................................................... 39
Picture 3.8: Graphics showing the results after executing some queries............................. 41
Picture 3.9: Executing the »count« and »head« functions. .................................................. 42
Picture 3.10: Converting to BSON format in R. .................................................................. 44
Picture 3.11: Output result after executing queries in our sample data in R. ...................... 45
Picture 3.12: Sample using the »count« function in »rmongodb«. ...................................... 46
Picture 3.13: Executing some experiments in R. ................................................................. 48
Picture 3.14: Graphic showing the output results of the experiment. .................................. 49
Picture 3.15: Bar chart showing the output results of the experiment. ................................ 49
Picture 4.1: Window showing the environment variables. .................................................. 53
Picture 4.2: Command window showing the current Java version installed on my computer. ..........
Picture 4.3: Output result showing the Hadoop version installed. ...................................... 54
Picture 4.4: Output results after executing the »hdfs namenode -format« command. ......... 58
Picture 4.5: Environment variables of my computer. .......................................................... 59
Picture 4.6: Output results after running Hadoop. ............................................................... 60
Picture 4.7: Output results after executing »yarn«. ............................................................. 61
Picture 4.8: Initial page after running Hadoop. ................................................................... 62
Picture 4.9: Initial page showing the cluster configuration of Hadoop. .............................. 62
Picture 4.10: Initial configuration of Hadoop in R. ............................................................. 64
Picture 4.11: Executing »mapreduce«. ................................................................................. 66
Picture 4.12: Basic usage of »rhdfs« library........................................................................ 67
Picture 4.13: Unserializing data with »rhdfs«. .................................................................... 68
Picture 4.14: Executing different »hdfs« commands. ......................................................... 69
Picture 4.15: Using »mapreduce« in Hadoop. ..................................................................... 71
Picture 4.16: Practical example using »mapreduce«. ........................................................... 72
Picture 4.17: Graphic showing the output result of the previous example. ......................... 72
Picture 4.18: Set of commands processing data with »rhadoop«. ........................................ 73
Picture 4.19: Final results after applying »rmr«. ................................................................. 74
Picture 5.1: Results for the experiments of arrythmia dataset. ............................................ 76
Picture 5.2: More results about arrythmia dataset. .............................................................. 76
Picture 5.3: Results for the experiments of cmu dataset. .................................................... 77
Picture 5.4: Results for the experiments of diabetic dataset. ............................................... 78
Picture 5.5: Results for the experiments of tumor dataset. .................................................. 79
Picture 5.6: More results for tumors dataset. ....................................................................... 80
Picture 5.7: Results for the experiments of kddcup dataset. ................................................ 81
Picture 5.8: Results for the experiments of letter dataset. ................................................... 82
Picture 5.9: Results for the experiments of nursery dataset. ............................................... 83
Picture 5.10: Results for the experiments of splice dataset. ................................................ 84
Picture 5.11: Results for the experiments of waveform dataset. ......................................... 85
Picture 5.12: Results for the experiments of students dataset. ............................................ 86
Picture 5.13: More results for the experiments of students dataset. .................................... 86
Picture 6.1: Applying machine learning algorithms in R. ................................................... 88
Picture 6.2: Showing main features of iris dataset .............................................................. 89
Picture 6.3: Summary of iris dataset .................................................................................... 90
Picture 6.4: Applying machine learning algorithms to iris data .......................................... 91
Picture 6.5: Splitting the iris dataset into training and test sets. .......................................... 92
Picture 6.6: Final results after applying machine learning on iris dataset. .......................... 93
Picture 6.7: Applying regression tree algorithm to letter dataset ........................................ 95
Picture 6.8: Applying regression tree algorithm to arrythmia dataset ................................. 96
Picture 6.9: Applying regression tree algorithm to diabetic dataset .................................... 97
Picture 6.10: Applying regression tree algorithm to kddcup dataset ................................... 98
Picture 6.11: Applying regression tree algorithm to nursery dataset .................................. 99
Picture 6.12: Applying regression tree algorithm to splice dataset ................................... 100
Picture 6.13: Applying regression tree algorithm to student dataset ................................. 101
Picture 6.14: Applying regression tree algorithm to tumor dataset ................................... 102
Picture 6.15: Applying regression tree algorithm to waveform dataset ............................ 103
1 INTRODUCTION
Context
At the beginning I would like to make a brief comment about the context of this master
thesis, starting with data mining.
Data mining is a subfield of computer science that consists of discovering patterns in large
data sets; it draws on methods at the intersection of fields such as artificial intelligence,
machine learning and statistics.
I would like to point out that the main purpose of data mining is the extraction of information
from a data set and its transformation into another structure that can be used for further
purposes.
Main purpose of the master thesis
The main purpose of the master thesis is to analyse the different relationships among the
attributes of a database and to extract some additional information from them. In order to
do that, it will be necessary to use different tools: for storing data in the database
(MongoDB), for analysing the stored data (the R IDE) and finally for learning how to deploy
the analysis on big data by using tools like Hadoop.
Brief description of the content
The first part of the thesis is dedicated to MongoDB. Here I will install the tool, learn to use
it and load the different datasets that we are going to use for the project.
The second part of the report is about data mining. There we will install, configure and learn
how to use the R IDE with the necessary libraries in order to connect it with our datasets.
In the third part of the thesis we will talk about big data and we will use Hadoop. We will
also learn how to connect this tool with R and Mongo.
Finally, in the remaining parts I am going to analyse different data, perform different
experiments, apply different machine learning algorithms and draw some inferences about
big data and the obtained results.
Aims
Even though we will work on different databases, trying to make inferences about the data
and to obtain different statistics, the real goal of the project is the application of the
previously mentioned technologies and tools that allow us to work with big data, and to
demonstrate that they can be quite useful in a variety of situations. Therefore the main aim
of the study is the application of different technologies that allow us to work on big data.
Objectives
These are the other aims of the project:
• application of a NoSQL datastore, in this case MongoDB
• application and usage of different machine learning algorithms in R
• deployment of a big data infrastructure, in this case Hadoop
• deployment of some machine learning algorithms from R on Hadoop, as well as
setting up MongoDB for storing the datasets, and then using the algorithms from R deployed
on Hadoop in order to learn from the data in the MongoDB datasets
• performing different experiments on the selected datasets
Assumptions and limitations
The main purpose of the research is not to get information about the databases that I have
taken as examples, but to get familiar with and to try out different technologies that allow
us to analyse a big quantity of data.
I would also like to mention different assumptions and shortcomings of the research. We
should always keep in mind that our inferences are based on a sample of data. Hence the
numbers could be slightly different from the numbers we would get by analysing other
sources. Nevertheless, it is always good to make some inferences and a good estimation
about different features.
2 MASTERING THE DATABASES
In this section I am going to describe everything that is necessary for the database, including
the description of the necessary tools and the steps that I took in order to install them.
Furthermore, I am going to include a simple guide to the basic commands that will allow us
to inspect and manipulate the information inside the database. In order to achieve this goal,
we will use a NoSQL database, a type of database based on JSON-like documents that is
slightly different from the typical SQL databases we have usually used in our projects. The
tool that we will use to handle this type of database is called MongoDB [1], a cross-platform
tool that has been released under the GNU General Public License and that works through
the terminal in command mode.
2.1 Installing and configuring MongoDB server
The first step towards accomplishing the final goal of the project is to install and configure
the necessary tools. MongoDB [2] is an open-source document database that provides high
performance, high availability and automatic scaling. Its data structure is composed of field
and value pairs, and its documents are quite similar to JSON objects.
The first thing that we need to do is to go to its official web page [3] and download the
executable file, in my case the one specific for the operating system Windows 8 (64-bit
architecture).
Right after finishing the installation we will see the files shown in the screenshot below, in
the default path: c:/Program Files/MongoDB/Server/3.0/
Picture 2.1: Files contained in the MongoDB server version 3.0 folder.
And this is what we can see in the bin directory right after installing MongoDB:
Picture 2.2: Files contained in the bin folder of the MongoDB server.
In order to configure MongoDB for the very first time, it is necessary to open the terminal
cmd.exe and type the following commands in order to move to the necessary folder.
Picture 2.3: Practical example of the usage of Mongo.
As you can see in the image above, we have used the command “cd” in order to change to
the correct directory, where we can execute the command “mongod”. At first it will not find
any data directory and will not work as we might have expected. The reason for this is that
we first need to create the default data folder with the command “mkdir \data\db”. After
creating this directory, we will see that our tool works well.
Picture 2.4: Practical example of Mongo running in command mode.
Here we have a practical demonstration of the MongoDB tool working well. In one terminal
we used the “mongod” command to start the server, and in the other terminal we used the
“mongo” command, which initiates a dialog with it.
Finally, I would like to make an additional step that is not strictly necessary but simplifies
things when starting the Mongo tool. This step consists of adding MongoDB to the PATH
environment variable, for which we will need to open the following location on our
computer:
Control Panel > All Control Panel Items > System > Advanced system settings
Picture 2.5: Window in the advanced configuration for the operating system Windows 8.
After that we will see something similar to this window. We will then also need to go to
System Properties > Advanced > Environment Variables.
Picture 2.6: Window with the environment variables already configured.
In the environment variables we will need to add the value
C:/Program Files/MongoDB/Server/3.0/bin to the PATH variable. Finally, I would like to
say that this step is very useful because from now on it will not be necessary to navigate to
that hard-to-remember directory in order to start Mongo. Now it is enough to just open the
terminal and type “mongod”.
2.2 Learning Mongo commands
Installing the necessary tools is not enough if we want to accomplish our final goal. Hence
the next step would be to create a sample database and to fill it with some unimportant
sample data. In the next paragraphs I will also explain in greater detail how MongoDB works
and which functions its different commands provide.
As we saw in the previous step, we will need to open two terminals. In one of them we will
type the command “mongod”, so this window will act as the server. In the other window
we will type the command “mongo”, so this window will act as the client that will
connect with the server on the local host. It is in this second window that we will need to
type the different commands for inserting, deleting and modifying information.
STEP 1: BASIC DATABASE COMMANDS:
There are different commands that allow us to manipulate the databases. Fortunately, they
are quite straightforward to use. We will highlight the following:
db: it tells you which database you are using at the moment (by default you would be using
the database named “test”).
use <database name>: it switches the database that you intend to use. For example, if you
don’t want to use the database “local” anymore and you prefer the database named “hockey”,
you can type “use hockey”. I would also like to point out that if no database exists under
that name, Mongo will create it automatically as soon as you first store data in it.
db.dropDatabase(): this command will erase the database that you are currently using. In
order to facilitate your work, the terminal will provide feedback explaining whether
everything went as expected or not.
Finally, I would like to demonstrate all that I have mentioned by providing a screenshot of
my own terminal, where I used these commands, and the expected output:
Picture 2.7: Connecting with the hockey dataset in Mongo
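For readers who cannot reproduce the screenshot, the lazy-creation behaviour described above can also be sketched in a few lines of plain JavaScript. This is only an illustrative in-memory model with made-up helper names (use, insert, dropDatabase), not the actual mongo shell:

```javascript
// Toy model of the mongo shell's database-switching behaviour:
// "use <name>" only switches context; the database itself is
// materialized on the first write, and dropDatabase() removes it.
const server = { databases: {}, current: "test" };

function use(name) { server.current = name; }  // like typing: use hockey

function insert(collectionName, doc) {
  // the first write creates the database and the collection lazily
  if (!server.databases[server.current]) server.databases[server.current] = {};
  const db = server.databases[server.current];
  if (!db[collectionName]) db[collectionName] = [];
  db[collectionName].push(doc);
}

function dropDatabase() { delete server.databases[server.current]; }

use("hockey");
console.log("hockey" in server.databases);  // false: nothing written yet
insert("players", { name: "sample player" });
console.log("hockey" in server.databases);  // true after the first insert
```

This mirrors why a freshly "use"-d database does not appear until something is stored in it.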
STEP 2: INSERTING JSON OBJECTS INTO COLLECTIONS:
So far we have seen how to create, delete and switch between databases, but we have not
yet looked at manipulating the data inside them. For that it is necessary to mention that a
Mongo database organizes its information into collections. These collections are analogous
to tables with different rows, if we compare them with a normal SQL database. The truth is
that their syntax is more similar to collections in Java, where we insert one object that is
added to the overall collection.
db.collectionName.insert(<data that you want to insert>): in this command the word db
refers to the database that you are currently using (in our example the database named
“hockey”) and collectionName refers to the collection we are inserting into (in our example
“players”). By default, if no collection exists under that name, it will be created. Finally,
you need to pass the data that you want to add to that collection as a parameter in JSON
format.
Here we have a practical demonstration of the previously explained command. The database
is hockey and the collection is players. In the last line we can see that the information has
been added successfully.
Picture 2.8: Inserting data into our players dataset.
Finally, I would like to mention that this only works with one single JSON object. If we
would like to insert more than one object, we need to create an array and separate the
different JSON objects with commas. The basic structure would be something like this:
db.collection.insert ( [ {JSON OBJECT1} , {JSON OBJECT2} , … ] )
I would also like to mention that when introducing more than one JSON object, the output
result will be quite different than with a single one.
Picture 2.9: Output result after including a new row into our players dataset.
As we can see in the previous image, the tool gives us a lot of different information, such as
the number of inserted rows, the number of modified ones and whether there were any errors.
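To make the two payload shapes concrete, here is a small in-memory sketch in plain JavaScript. It is illustrative only: the sample documents are made up, and the returned object merely mimics the idea of the shell's insert result, not its exact fields:

```javascript
// In-memory stand-in for db.collection.insert(): it accepts either one
// JSON object or an array of objects, and reports how many were added,
// loosely mirroring the shell's single vs. bulk insert output.
const players = [];

function insert(collection, data) {
  const docs = Array.isArray(data) ? data : [data];  // normalize both shapes
  docs.forEach(d => collection.push(d));
  return { nInserted: docs.length, writeErrors: [] };
}

// single object, like db.players.insert({ ... })
const r1 = insert(players, { name: "John", age: 21, position: "defenseman" });

// array of objects, like db.players.insert([ { ... } , { ... } ])
const r2 = insert(players, [
  { name: "Peter", age: 30, position: "left wing" },
  { name: "Mark",  age: 34, position: "right wing" },
]);

console.log(r1.nInserted, r2.nInserted, players.length);  // 1 2 3
```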
STEP 3: DELVING MORE DEEPLY INTO COLLECTIONS:
In order to operate on collections and see their contents, we have the following commands:
show collections: this command is self-explanatory. It shows all the collections that we
have created in the database that we are using. By default, it will also include one collection
named “system.indexes”. Here we can see a practical demonstration of it.
Picture 2.10: Accessing collections in Mongo.
db.collection.find(): this command allows us to see all the data that is inside the referred
collection. One of its main disadvantages is that the view of the data is very compact.
Nevertheless, this can be remedied with the function pretty(). We can see the difference
between both in the following image:
Picture 2.11: Showing the differences after applying the »pretty« function.
db.collection.findOne(): it works exactly like find().pretty(), with the main difference that
it only shows the first JSON object of the collection.
db.collection.remove( { id of the object } ): this command removes only one row from the
overall collection. In order to distinguish this object from the others, it is necessary to pass
its identifier.
Picture 2.12: Practical example of removing one row in Mongo.
db.collection.update({identifier},{new object}): this function finds an object inside the
previously named collection and updates it with the information that you provide. As always,
I will give you a practical demonstration of it.
Picture 2.13: Practical example of updating a collection.
db.collection.drop(): it eliminates the previously mentioned collection completely,
including all the data it has inside and the collection itself.
Picture 2.14: Practical example of dropping a collection.
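The behaviour of these collection commands can be mimicked with another short in-memory JavaScript sketch. The helper names and sample data are made up; it only models the behaviour described above, not the real shell:

```javascript
// Toy in-memory collection mirroring find(), findOne(), remove() by id
// and drop().
const db = { players: [
  { _id: 1, name: "John",  age: 21 },
  { _id: 2, name: "Peter", age: 30 },
] };

const find    = (coll) => db[coll];                 // all documents
const findOne = (coll) => db[coll][0];              // first document only
const remove  = (coll, id) =>
  (db[coll] = db[coll].filter(d => d._id !== id));  // remove one row by its id
const drop    = (coll) => { delete db[coll]; };     // whole collection gone

console.log(findOne("players").name);  // John
remove("players", 1);
console.log(find("players").length);   // 1
drop("players");
console.log("players" in db);          // false
```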
STEP 4: HOW TO MAKE QUERIES IN THE COLLECTIONS:
db.collection.find/findOne({“parameter”: value}): we have already seen these commands
when we wanted to get the overall output of the entire collection. Still, we can call them
with conditions, even two at the same time. As a practical example, let’s show all the players
that satisfy two conditions: first, the position is defenseman, and second, the age has to be
twenty-one.
Picture 2.15: Sample query in Mongo.
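Listing both fields in one query document means that Mongo applies the conditions together, as an implicit AND. A plain-JavaScript sketch of that matching rule, using made-up sample players (illustrative only, not the real query engine):

```javascript
// Implicit AND: a document matches when every field in the query
// document equals the corresponding field of the candidate document.
const players = [
  { name: "John",  age: 21, position: "defenseman" },
  { name: "Peter", age: 30, position: "defenseman" },
  { name: "Mark",  age: 21, position: "center" },
];

const matches = (doc, query) =>
  Object.keys(query).every(k => doc[k] === query[k]);

// like: db.players.find({ "position": "defenseman", "age": 21 })
const result = players.filter(p =>
  matches(p, { position: "defenseman", age: 21 }));

console.log(result.map(p => p.name));  // [ 'John' ]
```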
$or: [condition1 , condition2]: lastly, I would like to talk about another type of query.
Sometimes we don’t want to be that strict and we would like to show all of the rows that
satisfy one condition or another. For those cases we need to introduce the variable “or”,
followed by an array; each element of the array is an object, separated by commas, that tells
Mongo which conditions can be accepted.
As always, the best way to understand it is a practical example with our previously
mentioned collection “players”. In the next image we can see how the “find()” and “pretty()”
functions are combined with the variable “or” in order to get what we want, which is in this
case to show all the players that play in the position of left wing or right wing.
Picture 2.16: Executing a query in our players dataset
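Modelled the same way, the $or variant accepts a document as soon as any of the sub-query documents in the array matches it completely. Again this is an illustrative in-memory sketch with made-up data, not the real query engine:

```javascript
// $or: the document is accepted if at least one of the sub-query
// documents in the array matches it.
const players = [
  { name: "John",  position: "defenseman" },
  { name: "Peter", position: "left wing" },
  { name: "Mark",  position: "right wing" },
];

const matchesPlain = (doc, q) =>
  Object.keys(q).every(k => doc[k] === q[k]);

const matchesOr = (doc, orList) =>
  orList.some(q => matchesPlain(doc, q));

// like: db.players.find({ $or: [ { "position": "left wing" },
//                                { "position": "right wing" } ] })
const wings = players.filter(p => matchesOr(p, [
  { position: "left wing" },
  { position: "right wing" },
]));

console.log(wings.map(p => p.name));  // [ 'Peter', 'Mark' ]
```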
Other variables for comparison: so far we have seen a lot of different types of queries
combining the logical operators OR and AND. Nevertheless, we can delve deeper with other
variables that allow us to make numerical comparisons, for example $gt: value. This
expression is used if we want to establish the condition that a chosen parameter should be
greater than the specified value. As a practical example, this is the query that we need to
type if we want to show all the players whose age is greater than 30:
db.players.find(
{ "age" : {$gt:30}}
).pretty()
Furthermore, I would also like to mention the variable $gte, which works exactly the same
but with the difference that it will show all the rows where the age is greater than or equal
to 30:
db.players.find(
{ "age" : {$gte:30}}
).pretty()
Finally, there are another two complementary variables: $lt shows all the rows where the
value is lower than the specified one.
db.players.find(
{ "age" : {$lt:30}}
).pretty()
As you can guess, we also have $lte in order to show all the rows whose value is lower than
or equal to the specified one, which is in this case age 30:
db.players.find(
{ "age" : {$lte:30}}
).pretty()
Finally, I would like to mention one more variable, $ne. In our practical example it shows
all those players whose age is NOT EQUAL to 30; for example, 29 would be accepted as well
as 31.
db.players.find(
{ "age" : {$ne:30}}
).pretty()
In addition, I would like to mention that we have not customized the queries as much as we
could. In all our previous cases we receive all the information from the matching rows; the
counterpart in SQL syntax would be "SELECT * FROM". If we want to show only the name, for
example, we need to specify a projection with the syntax { "parameter" : 1/0 }, where one
or zero indicates whether a parameter should be shown or not.
Finally, to clear things up: by default, as we saw in the previous queries, MongoDB shows
all the parameters that the collection contains, but once you start specifying a projection,
only the specified parameters are shown, plus the "_id", which is included by default and
has to be explicitly excluded if you do not want to see it:
Here is a practical example using our previous players collection: a query returning all the
players that play in the center position, showing only their names and hiding their id:
Picture 2.17: Executing a different query in our players dataset.
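The projection rules described above (listed fields are kept, "_id" stays unless explicitly set to 0) can be sketched in plain Python; the helper and sample document are hypothetical:

```python
def project(doc, projection):
    # Inclusion-style projection: fields marked 1 are kept, and "_id" is
    # kept by default unless the projection explicitly sets it to 0.
    keep = {k for k, v in projection.items() if v == 1}
    if projection.get("_id", 1) != 0:
        keep.add("_id")
    return {k: v for k, v in doc.items() if k in keep}

doc = {"_id": 7, "name": "Craig Adams", "position": "Center", "age": 38}
only_name = project(doc, {"name": 1, "_id": 0})   # hides the id
with_id = project(doc, {"name": 1})               # id shown by default
```

This mirrors the behaviour in the image: `{"name": 1, "_id": 0}` leaves only the name, while `{"name": 1}` alone still carries the "_id" field along.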
To finish this step, I would like to mention another two functions that, while not always
necessary, can be helpful in some contexts. The first is limit(number), which returns only
the first given number of rows and ignores the rest. For example, if in our previous query
we want to show only the first three rows, the query looks like this:
db.players.find( {"position":"Center"},{"name":1, _id:0} ).limit(3)
Another complementary function is skip(number). It has no exact counterpart in SQL: instead
of limiting the output, it ignores the first given number of rows and shows the rest. For
example, if there are ten rows and we use skip(3), it will show the last seven rows,
ignoring the first three. Here is a practical example:
db.players.find( {"position":"Center"},{"name":1, _id:0} ).skip(3)
STEP 4: USING INDEXES:
In order to understand why indexes are useful in this type of database, we need an overview
of how MongoDB works internally. When you run a query with one condition, MongoDB loops over
each row in the collection, checks whether that row satisfies the condition, and prints its
information if it does.
The best way of understanding the procedure is, as always, with an example: in our previous
collection, which contains around twenty players, suppose we want to show all of those whose
age is lower than twenty-one. MongoDB will go through the players one by one, checking
whether each player's age is lower than twenty-one.
Because it is only a sample database with a very limited number of rows, we see the result
in milliseconds. But what happens if the project uses a database with one hundred thousand
customers? In that case the performance will be very poor and it will take a long time until
the first results appear. For such cases it is very useful to know how to use indexes,
because on a real, large database the difference in performance is huge.
Finally, I would like to point out that for the end user it is difficult to appreciate
exactly how much time an index saves. There is, however, a way of seeing how long a query
takes to execute: appending the function .explain("executionStats") to it.
Now that we have seen why indexes are used and in which type of databases we should use
them, let us continue with the main commands for manipulating them:
db.collection.ensureIndex({parameter:1}): creates an index on the specified parameter.
db.collection.getIndexes(): shows all of the indexes created for the specified collection.
db.collection.dropIndex({parameter:1}): drops the previously created index.
The best way of understanding what I say is with a practical demonstration using the sample
collection players:
Picture 2.18: Getting indexes in the players dataset.
Finally, I would just like to mention that in order to achieve the best performance we need
to use indexes in a sensible way. A quite common mistake is to create an index for every
single parameter in the collection: each additional index degrades the write performance of
that collection. Hence it is recommended to create indexes only on the parameters that are
queried a lot, such as the name, the age and the player position, and to leave out the
unimportant parameters that barely appear in queries. Another disadvantage to take into
account is that every time you update the collection, its associated indexes have to be
updated as well.
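The performance argument can be made concrete with a small Python sketch: a linear scan examines every document, while an index is essentially a precomputed map from field value to matching documents (the thousand synthetic players below are illustrative):

```python
# 1000 hypothetical players; ages cycle through 18..37.
players = [{"name": "P%d" % i, "age": 18 + i % 20} for i in range(1000)]

def linear_scan(docs, field, value):
    # Without an index: every single document must be examined.
    examined, hits = 0, []
    for d in docs:
        examined += 1
        if d[field] == value:
            hits.append(d)
    return hits, examined

def build_index(docs, field):
    # An index maps each field value directly to its documents.
    index = {}
    for d in docs:
        index.setdefault(d[field], []).append(d)
    return index

scan_hits, examined = linear_scan(players, "age", 21)
age_index = build_index(players, "age")
index_hits = age_index.get(21, [])   # direct lookup, no per-document check
```

The scan touches all 1000 documents to find the 50 matches; the index lookup returns the same 50 documents without examining the rest, which is exactly the saving `.explain("executionStats")` makes visible.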
STEP 5: USING GROUPS AND AGGREGATION:
The last topic that I would like to cover about MongoDB commands is grouping. If we want to
group the rows by a given parameter, we can use the $group variable, although it has to be
combined with other variables.
For example, let us write a query on our previous example where we group all the players by
their position and, for each group, count the number of players in that role. Then we need
to combine $group with the $sum variable, with the following syntax:
db.players.aggregate( { $group : { _id : "$position", total : {$sum : 1} } } )
Other times when making groups we need to get the average value of another different
parameter. For example, let’s suppose that we want to get an average age for all the players
grouping them by position. Then we will need to use the variable $avg in the following way.
db.players.aggregate( { $group : { _id : "$position", avgAge : {$avg : $age} } } )
To end this section, I would like to mention the self-explanatory variables $min and $max,
which return the smallest and the largest value in each group. They are used in exactly the
same way as the previous group variables:
db.players.aggregate( { $group : { _id : "$position", maxAge : {$max : "$age"} } } )
db.players.aggregate( { $group : { _id : "$position", minAge : {$min : "$age"} } } )
2.3 Preparation of the datasets
For the purpose of this master thesis I am going to run different experiments on ten
datasets, which serve as examples to show the deployment of big data. In this part of the
report I will mention every one of them and summarize the procedure I followed in order to
load them into MongoDB. In addition, I will show several screenshots to demonstrate that
they were included correctly:
Arrhythmia
Cmu newsgroup clean 1000 sanitized
Diabetic data
Tumor data
Kddcup99
Letter data
Nursery data
Splice data
Wave form data
High school students data
Picture 2.19: Showing all implemented datasets in MongoDB.
ADDING ALL DATASETS TO MONGO
One of the main problems that I found when loading the datasets into MongoDB was the format
of the data. All datasets were in .arff format, so they had to be parsed to JSON in order to
be readable by MongoDB. The solution that I found was the following: first of all, I
downloaded and installed WEKA version 3.6. WEKA is one of the most popular suites of machine
learning software; it was developed at the University of Waikato, it is free software and it
is written in Java. The image below shows how I opened one of the sample databases, about
arrhythmia, in this tool.
Picture 2.20: Showing WEKA interface.
One of the many functions that this tool provides is the conversion of files from arff
format into csv. Because there is no direct conversion between arff and JSON, I found two
different solutions for handling it. The first solution worked with almost all the datasets
except the "cmu" sample data; it consists of parsing them first into .csv format and then
into an array in a JSON-readable format, so they can be successfully included in the
database.
Right after the first step I used another tool, the JSONBuddy desktop application, version
3.3. With this tool I managed to convert the files into the final format without any
trouble, preserving the structure and the content as they were before.
Picture 2.21: Showing the JSONBuddy interface.
Unfortunately, the first solution did not work with all my datasets: the dataset named CMU
is very large and, because of its size, it could not be parsed in the JSONBuddy application.
The second solution therefore consists of first parsing the data into csv, as in the first
solution, and then using the following command to import the data directly into Mongo:
mongoimport --host 127.0.0.1 --port 27017 --db cmu
--collection data --type csv --headerline --file ./cmu.csv -j 256
When importing into Mongo we have to specify the host and the port, in this case
127.0.0.1:27017. After that it is necessary to specify the name of the database and the name
of the collection that will be created. Finally we indicate that the file has a header line
with the names of the fields, give the path of the file, and add one last option, "-j 256",
which solved several CPU processing problems I ran into. As a final result I had the ten
databases imported and ready to work with in MongoDB.
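The header-line import that mongoimport performs can be sketched in plain Python: every csv row becomes one document keyed by the names in the first line (the tiny inline csv and its field names are hypothetical stand-ins for the real export):

```python
import csv
import io
import json

# A tiny stand-in for an exported .csv file; the first line is the header.
csv_text = """name,position,age
Craig Adams,Right Wing,38
Thomas Greiss,Goalie,29
"""

# Each data row becomes one document keyed by the header fields,
# which is essentially what "mongoimport --headerline" does.
docs = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
payload = json.dumps(docs)   # the JSON array form used by the first solution
```

Note that a plain csv import keeps every value as a string (age "38", not 38); mongoimport behaves similarly unless types are specified.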
DESCRIPTION OF THE DATASETS:
In order to give an overview of the content, I will show a summary of all the datasets that
I will use as examples for applying the different big data technologies. The first dataset
contains 452 attributes about 24528 different people.
Picture 2.22: Showing all attributes of the arrhythmia dataset.
The second database gathers data about diabetic people. In this case we have a much smaller
database, but each row has a much larger number of attributes.
Picture 2.23: Showing the attributes of diabetic dataset.
Finally, I will show a summary of the remaining datasets: letters, nursery, splice, high
schools, tumor, waveform and cmu:
Picture 2.24: Showing the attributes of letter dataset.
Picture 2.25: Showing the attributes of nursery dataset.
Picture 2.26: Showing the attributes of splice dataset.
Picture 2.27: Showing the attributes of student dataset.
Picture 2.28: Showing the attributes of tumor dataset.
Picture 2.29: Showing the attributes of waveform dataset.
Picture 2.30: Showing the attributes of cmu dataset.
Picture 2.31: Showing the attributes of kddcup dataset.
3 MASTERING DATA MINING
So far we have seen everything concerning the database; for that, we learned how to use the
MongoDB tool. Unfortunately, this is not enough to achieve the final goal of the project.
The second step is to analyze all the data that we previously loaded on our local host with
a quite popular data mining tool named R [4]. With its help we will be able to analyze our
data much better and to perform operations that we cannot do with the Mongo tool alone.
3.1 Installing and configuring R
So far we have seen how to install, create, manipulate and master the NoSQL database using
the Mongo tool. Unfortunately, this is just one part of the entire task. The next step
towards the goal of this master thesis is to analyze the data contained in the database with
a data mining tool and to be able to run operations from there. For this purpose we will use
the quite famous tool R.
WHAT IS R?
R [5] comprises a software environment for statistical computing and the programming
language that supports it. The software environment was first developed in 1993 and has been
in continuous development ever since; the last stable version is from April 16th, 2015. It
is developed by the R Development Core Team and its paradigm covers array, object-oriented,
imperative, functional, procedural and reflective programming.
The programming language, on the other hand, is very well known among data miners and
developers and is also widely used in other areas such as statistics, data analysis, polls,
surveys and studies of scholarly literature databases. The whole R project has been released
under the GNU General Public License. Its source code is written primarily in C, Fortran
and R.
Hence R is available for free, with different versions depending on the operating system. I
would like to point out that even though this tool supports some graphical front-ends, it
works primarily on a command line interface, allowing the user to work faster, not to waste
resources, and to run it on any machine regardless of its characteristics.
STATISTICAL AND PROGRAMMING FEATURES:
R can be combined with a large number of supported libraries. Together they implement a wide
variety of statistical and graphical techniques.
Among them we can highlight the classical statistical tests [6], linear and nonlinear
modeling, classification, clustering and so on. The reason for this amount of libraries is
the fact that R is easily extensible through functions and extensions, which can even be
written in its own language.
Regarding the programming features of the R language, I would like to highlight that it is
an interpreted language that supports matrix arithmetic and many data structures such as
vectors, arrays, matrices, data frames and lists. Moreover, R supports procedural
programming with functions and also object-oriented programming with some generic functions.
INSTALLING R:
The best way of getting the R tool is to visit its official page [7]. There we find a quite
plain web page with different download links depending on our operating system. To make it
clear, I will show an image of the official web page where it can be downloaded.
Picture 3.1: Official documentation of R.
After choosing Windows, I downloaded the file corresponding to the latest version at that
moment: 3.2.0 (2015-04-16). My current operating system is Windows 7 and, after following
the straightforward steps of the installer, I installed the 64-bit version.
I am not going to delve very deeply into the R interface, because we already saw how it
works in other subjects at the university and that would be redundant. What's more, I have
already explained the different areas that the programming language covers.
INSTALLING THE RMONGODB LIBRARY:
So far I have installed the main R tool and explained how it works, but if we want to
connect it with our Mongo database, discussed in the previous sections, this is not enough.
To accomplish this part we need to install a library named "rmongodb", which helps us
connect both tools.
Installing this library is quite straightforward and the same as for any other library in R.
We just have to run the command
install.packages("rmongodb")
In order to clear it up I am providing an image of the R interface with its output results at
the beginning of its installation:
Picture 3.2: Installing the packages »rmongodb« in R.
As we can appreciate in the previous image, the installer lets us choose among different
languages. Because it is easier for me, I decided to install it in Spanish, but the
functionality is the same regardless of the language.
Right after that we can see how all the necessary files and packages are downloaded
successfully, followed by a final message informing us that everything went as expected. To
make it clear, I will show a screenshot of my computer at the moment the installation was
successfully completed:
Picture 3.3: Output result after installing »rmongodb«.
With the steps explained in this part we have installed the R tool with its necessary
library. I would just like to mention that there is an alternative way of getting the
library: the install.packages command installs the latest stable version that has been
released, but we can also run the latest development version from the GitHub repository. In
that case it is necessary to run the following commands:
library(devtools)
install_github(repo = "mongosoup/rmongodb")
3.2 Learning to use R with its library rmongodb
In order to connect our R data mining tool with our Mongo database, a couple of requirements
must be met and a couple of lines have to be typed in command mode:
STEP 1: CONNECTING MONGODB TO R
As previously mentioned, we first need to do a couple of things in order to connect both
tools:
Install and run our Mongo database externally, outside of the R tool. To do this we
need to type "mongod" in our command window.
Run the R tool and load the "rmongodb" library that we installed in the previous
section. For that we just need to type the following command: "library(rmongodb)".
I will present a theoretical explanation of the basic mongo commands:
mongo.create(): with this function we connect to a Mongo database server, local or remote,
and get back an object of a class named mongo. This object is used for further communication
over the connection.
mongo.is.connected(variable): this function checks whether the variable is properly
connected to the Mongo database server. If it is connected it returns TRUE, otherwise
FALSE.
A variable of class mongo: if you type the name of a variable of class mongo, it prints all
the basic parameters attached to it, such as host or username.
Following the same scheme as always, after the theoretical explanation I would like to
clarify it by showing an image with the commands used in practice:
Picture 3.4: Connecting to the Mongo datasets from the R IDE.
STEP 2: GETTING BASIC INFORMATION ABOUT THE DATABASE AND THE COLLECTIONS:
In order to get the list of all the databases or collections in Mongo, we need to learn
these quite straightforward commands:
mongo.get.databases(variable): typing this command gives us the list of all databases found
by the object, which has to be of class mongo.
mongo.get.database.collections(mongo, db): on the other hand, if what we want is the list of
all collections, we have to type this second line of code.
mongo.count(mongo, coll): as its name implies, it counts the number of elements in the
previously specified collection.
I would also like to point out that the best way of accessing the information in the
database is to first check whether the variable has been correctly connected and, only if it
has, to access the information. So, as you can guess, we need to write a couple more lines
of code and the final result should look like this:
if(mongo.is.connected(mongo) == TRUE) {
mongo.get.databases(mongo)
}
if(mongo.is.connected(mongo) == TRUE) {
db <- "hockey"
mongo.get.database.collections(mongo, db)
}
if(mongo.is.connected(mongo) == TRUE) {
coll <- "hockey.players"
mongo.count(mongo, coll)
}
Finally, I will provide an image with a practical use of these functions when connecting R
with the Mongo databases on my local host. We can see that we still have the hockey database
that we used as an example in the previous sections:
Picture 3.5: Accessing databases and collections from R.
STEP 3: FINDING SOME DATA:
So far we have obtained only basic information, but now we will delve deeper into the
advanced options in order to retrieve exactly the information we want. For that we need to
learn the following commands:
mongo.find.one(mongo, coll): this command finds the first record inside the previously
specified collection that matches the query.
mongo.distinct(mongo, coll, key): it finds all the distinct elements in the specified
collection, according to the given key.
Once more we have the same issue as before, and it is better to encapsulate the queries in
an if statement to avoid possible errors, so these are the final lines of code that we need
to write:
if(mongo.is.connected(mongo) == TRUE) {
mongo.find.one(mongo, coll)
}
if(mongo.is.connected(mongo) == TRUE) {
res <- mongo.distinct(mongo, coll, "name")
head(res, 2)
}
if(mongo.is.connected(mongo) == TRUE) {
cityone <- mongo.find.one(mongo, coll, '{"name":"Craig Adams"}')
print( cityone )
mongo.bson.to.list(cityone)
}
Finally, I would like to show the output results that I got after typing those commands
into my R interface. I would like to mention that after the last command,
mongo.bson.to.list(cityone),
I got all the information relative to that object, not only the "_id"; but since it is very
long and not very necessary, I decided not to show it in the image and to expose only the
most meaningful information:
Picture 3.6: Executing some queries on our sample data from R.
STEP 4: CREATING BSON OBJECTS.
mongo.bson.from.list: This function converts a list into a BSON object. The process is very
natural because lists in R are very similar to the JSON objects in the Mongo database. I
would also like to point out that this process internally calls other functions such as
mongo.bson.buffer.create, mongo.bson.buffer.append and mongo.bson.from.buffer.
mongo.bson.from.JSON: alternatively, this function can be used if we want to create a BSON
object from a JSON string. It has the same result as the previous one.
mongo.bson.from.buffer: as you can guess, it creates a BSON object from a buffer built with
the buffer functions mentioned above. This is the last alternative option for creating BSON
objects.
Here are the lines of code with the correct use of the previously mentioned functions:
query <- mongo.bson.from.list(list('city' = 'COLORADO CITY'))
query <- mongo.bson.from.list(list('city' = 'COLORADO CITY', 'loc' = list(-112.952427,
36.976266)))
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "city", "COLORADO CITY")
query <- mongo.bson.from.buffer(buf)
mongo.bson.from.JSON('{"city":"COLORADO CITY", "loc":[-112.952427, 36.976266]}')
date_string <- "2014-10-11 12:01:06"
query <- mongo.bson.from.list(list(date = as.POSIXct(date_string, tz='MSK')))
Finally, I would like to provide a screenshot with a practical demonstration of these
functions:
Picture 3.7: Executing some queries from R.
STEP 5: EXAMPLE OF ANALYSIS.
In order to perform our first analysis, we will use our example collection named "coll",
which contains the hockey players. We will also need functions like mongo.distinct, which
gives us a vector with all the distinct values according to the given key. What's more, we
will use another two functions that are not from the library but are still useful for
representing our data graphically.
As a practical example we will take the collection of hockey players and analyze their age.
For that we use the following commands:
if(mongo.is.connected(mongo) == TRUE) {
pop <- mongo.distinct(mongo, coll, "age")
hist(pop)
boxplot(pop)
}
With these lines of code we first check whether we are correctly connected to the database
and, if so, we get the ages of the players inside the collection. After that we represent
this data in two plots: a histogram for the frequencies and a boxplot for a box-and-whisker
representation of the data.
I managed to get the following output results:
Picture 3.8: Graphics showing the results after executing some queries.
As we can see, with these plots we can analyze the average and the frequency of the
different age ranges much better. Finally, to end our analysis, I would like to find all the
players that are older than 18, which means they are adults, and look at the oldest among
them. For that I used the following code:
nr <- mongo.count(mongo, coll, list('age' = list('$gte' = 18)))
print( nr )
pops <- mongo.find.all(mongo, coll, list('age' = list('$gte' = 18)))
head(pops, 1)
Picture 3.9: Executing »count« function and »head function«.
As we can see, there are twenty-six adult players in the list, and we also got the
information of the first player in the list.
STEP 6: CHANGING THE DATABASE FROM R
To achieve this step we need one of the functions that we already saw before,
mongo.bson.from.JSON, which allows us to create a BSON object that we will add to our
collection a couple of lines later. After that we need the function mongo.insert.batch in
order to insert the previously created objects into the specified collection. Finally, in
the last step, we verify that the data has been successfully added to our collection.
a <- mongo.bson.from.JSON( '{"position":"Goalie", "id":8471306, "weight":220,
"height":"6 1", "imageUrl":"http://1.cdn.nhle.com/photos/mugs/8471306.jpg",
"birthplace":"Fussen, DEU", "age":29, "name":"Thomas Greiss", "birthdate":"January
29, 1986", "number":1 }' )
b <- mongo.bson.from.JSON( '{"position":"Goalie", "id":8471306, "weight":220,
"height":"6 1", "imageUrl":"http://1.cdn.nhle.com/photos/mugs/8471306.jpg",
"birthplace":"Fussen, DEU", "age":29, "name":"Thomas Greiss", "birthdate":"January
29, 1986", "number":1 }' )
icoll <- paste("hockey", "players", sep=".")
mongo.insert.batch(mongo, icoll, list(a,b) )
dbs <- mongo.get.database.collections(mongo, "hockey")
print(dbs)
mongo.find.all(mongo, icoll)
In this case we have added the JSON objects named a and b to our hockey players collection.
To prove the effectiveness of these lines of code, I will provide the screenshots that I got
in my own RStudio interface.
Picture 3.10: Converting to BSON format in R.
Finally, after the command mongo.find.all I got every single object in the collection. To
demonstrate that the data has been added successfully, I will show the information of the
last object in the collection. In the image I have highlighted the information of that
object in red, and we can see that it is the same information that we previously loaded into
the JSON object we created: for example, the position is goalie and the birthplace is
Fussen, DEU.
Picture 3.11: Output result after executing queries on our sample data in R.
APPLYING OUR KNOWLEDGE TO ONE OF OUR DATASETS
For this section we run our Mongo database and connect it with RStudio. This time we will
analyze the data in our student database, making queries in order to extract some useful
information. I am not going to delve very deeply into the commands used, because they were
covered in the previous section; instead, I will directly write down the commands used and
the output results.
QUESTION 1: WHICH GENDER DISTRIBUTION DO WE HAVE IN THE HIGH SCHOOLS?
To answer this I used the following commands, which count the rows where FIELD2 contains
the character "F" for females or "M" for males:
females.count <- mongo.count(mongo, coll, list(FIELD2="F"))
print(females.count)
males.count <- mongo.count(mongo, coll, list(FIELD2="M"))
print(males.count)
counts <-c(females.count,males.count)
barplot(counts, main="Gender Distribution",names.arg=c("Females", "Males"))
The final results showed that of the 649 students, 383 are girls and the other 266 are boys.
This means that the girls are the majority in the high school, with 59.01% of the total
students; the boys make up the other 40.99%.
Picture 3.12: Sample using the »count« function in »rmongodb«.
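The percentages follow directly from the two counts returned by mongo.count; a quick check of the arithmetic in Python:

```python
# Counts taken from the query results above.
females, males = 383, 266
total = females + males

# Share of each gender, rounded to two decimals as reported in the text.
pct_females = round(100 * females / total, 2)
pct_males = round(100 * males / total, 2)
```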
QUESTION 2: DOES THE LEVEL OF EDUCATION OF THEIR PARENTS
INFLUENCE THE MARKS OF THE STUDENTS?
To answer this question there are two attributes in the database (attributes number seven
and eight) that correspond to the level of education of the parents. The first thing I did
was to separate the students into four groups: all the students whose mother or father is at
a certain level of education are grouped together, with possible overlapping when the two
parents have different levels of education. For separating them I used the following
commands:
j12 = '{"$or": [{"FIELD7": "4"}, {"FIELD8": "4"} ] }'
query=mongo.bson.from.JSON(j12)
l1.count <- mongo.count(mongo, coll, query)
print(l1.count)
j12 = '{"$or": [{"FIELD7": "3"}, {"FIELD8": "3"} ] }'
query=mongo.bson.from.JSON(j12)
l2.count <- mongo.count(mongo, coll, query)
j12 = '{"$or": [{"FIELD7": "2"}, {"FIELD8": "2"} ] }'
query=mongo.bson.from.JSON(j12)
l3.count <- mongo.count(mongo, coll, query)
j12 = '{"$or": [{"FIELD7": "1"}, {"FIELD8": "1"} ] }'
query=mongo.bson.from.JSON(j12)
l4.count <- mongo.count(mongo, coll, query)
I would also like to show the output results in my terminal:
Picture 3.13: Executing some experiments in R.
QUESTION 3: HOW MUCH TIME ON AVERAGE DO THE STUDENTS DEDICATE
TO THEIR STUDIES PER DAY?
First I will show the different commands that I used to calculate this data. In this case the
needed functions are quite similar to those used in the first question.
v1 <- mongo.count(mongo, coll, list(FIELD14="1"))
v2 <- mongo.count(mongo, coll, list(FIELD14="2"))
v3 <- mongo.count(mongo, coll, list(FIELD14="3"))
v4 <- mongo.count(mongo, coll, list(FIELD14="4"))
variables <-c(v1,v2,v3,v4)
boxplot(variables)
barplot(variables, main="Amount of study time (h)",names.arg=c("1h ", "2h ","3h ","4h"))
hist(variables, main="Amount of study time (h)")
Below are some screenshots with the output results that I got in the graphs. We can see that
most of the students (305 out of 649) study 2 hours per day, and that the average study time
over all of them is 1.93 hours:
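The 1.93-hour figure is simply a weighted mean of the four category values (1 to 4 hours) using the counts per category. A minimal sketch follows; only the 305 students in the 2-hour group come from the text, the other counts are hypothetical placeholders, so with the real v1..v4 from the commands above the result would be the reported 1.93:

```r
# Weighted mean of the study-time categories. Only the 2-hour count (305)
# is stated in the text; the other counts are hypothetical placeholders
# that sum with it to the 649 students.
counts <- c("1h" = 200, "2h" = 305, "3h" = 100, "4h" = 44)
avg.study <- weighted.mean(1:4, w = counts)
round(avg.study, 2)
```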
Picture 3.14: Graphic showing the output results of the experiment.
Picture 3.15: Graphic with bars showing the output results of the experiment.
4 MASTERING THE HADOOP
4.1 What is big data and Hadoop?
WHAT IS BIG DATA [8]?
The term big data describes a huge volume of data, structured or not, that inundates a
business on an everyday basis. The amount of data itself is not the most important thing;
what really matters is what organizations do with it.
Secondly, I would like to point out that there is no exact size in bytes at which data
becomes big data; three main properties characterize it: the velocity at which the
information is produced and accessed, the volume of data, and its variety. Two additional
dimensions are variability and complexity.
Finally, to end this brief introduction to big data, I would like to explain briefly why it
is important and in which fields it is being used in today's world. Big data is important
because it allows you to save costs, reduce time, optimize your offering, make product
development easier and, when combined with high-powered analytics, make smart decisions.
Big data has applications in today's world in fields like banking, education, government,
health care, manufacturing and retail.
WHAT IS HADOOP [9]?
Hadoop is a software project developed by Apache and released as open source. Its main
purpose is to enable distributed processing of large data sets across clusters of commodity
servers. Its design focuses on scaling from a single server up to thousands of machines.
One of the main points of this technology is its good fault tolerance, which means that the
system as a whole tolerates a high degree of failures and unfortunate circumstances.
This fault tolerance is not achieved by relying on the hardware: the resilience of the
clusters comes from the ability of the software to detect and handle the failures that
occur in the application layer.
After that I would also like to mention that the Hadoop architecture described in [9] is
divided into three main layers: the ODP Core, which consists of a standalone interface; the
IBM Open Platform with Apache Hadoop; and the IBM Hadoop ecosystem.
Finally, I would like to mention the main features that Hadoop offers when operating with
big data. The first is its scalability: it can work with very large amounts of data. The
second is its low-cost architecture. The third, as mentioned before, is its good fault
tolerance. Finally, I would like to point out its flexibility: the tool can manage both
structured and unstructured data, and it is very easy to join and aggregate multiple
sources for a deeper analysis.
WHAT IS HDFS [10]?
HDFS is an acronym for Hadoop Distributed File System, and it follows the distributed file
system design. The main advantages of this distributed system are its low-cost hardware and
its fault tolerance. Other features worth highlighting are the ease of access to very large
amounts of data, as the data is stored across multiple machines. In addition, HDFS also
makes applications available for parallel processing.
Secondly, I am going to sum up the main features of HDFS: as previously mentioned, it is
suitable for distributed storage and processing; Hadoop provides a command interface to
interact with HDFS; and it offers streaming access to file system data.
Finally, to end this section, I am going to talk about two different elements that are part
of the HDFS architecture: the name node and the data node.
Name node: the commodity hardware that contains the GNU/Linux operating system and the name
node software. The system hosting the name node acts as the master server, and among its
tasks we can highlight managing the file system namespace, regulating clients' access to
files, and executing file system operations such as opening and closing files and
directories, renaming, and so on.
Data node: the commodity hardware that has the GNU/Linux operating system and the data node
software; its role is managing the data storage of its system. Among its main tasks we can
highlight performing read-write operations on the file system, and performing block
operations such as creation, deletion and replication according to the instructions of the
name node.
4.2 Installing and configuring Hadoop
The first thing necessary to install Hadoop is to get and unpack the source code. The files
are available in different places, but in order to rely on a more trusted source I decided
to download them from the official site [11]. I would also like to mention that I will use
Hadoop version 2.4.0 under the Windows 10 operating system. Even if the configuration is a
little more difficult, I decided not to use any virtual machine for that.
INSTALLING HADOOP IN STANDALONE MODE
The first step necessary to run Hadoop on our computer is to install Java. In my case it was
already installed, so what I did was make sure that it works correctly and set up the
following environment variables. We have to open System Properties > Environment Variables,
and then we should see something like this:
Picture 4.1: Window showing the environment variables.
In this menu it is necessary to include a new variable named JAVA_HOME with the value
C:/java, referring to the place in our local file system where this library is located.
Finally, we also need to change the system variable named Path, adding the value C:/java/bin
to the array. After that, I make sure that the Java tools are set up correctly on my
computer by running the following command in the cmd window:
Picture 4.2: Command window showing the current java version installed in my
computer.
As you can see, I have Java version 1.7.0_80 installed on my computer. Now let's move to the
next step, where we go back to the environment variables.
In this step it is necessary to set up a new variable named HADOOP_HOME containing the path
of the directory where we placed Hadoop (in my case C:\hadoop-2.4.0\). After that, it is
also necessary to modify the PATH variable that we already had on the computer. Fortunately
it is possible to set up more than one address in the same variable as long as we separate
them with semicolons, so I also added the Hadoop path (C:\hadoop-2.4.0\bin) to this
environment variable.
Picture 4.3: Output result showing the Hadoop version installed.
As you can see, I have Hadoop version 2.4.0 installed in my local file system.
INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:
To accomplish this part, I am going to change the Hadoop configuration by applying changes
to the different files mentioned below. Finally, I will also demonstrate that the tool is
well configured and ready to use.
Inside the directory C:\hadoop-2.4.0\etc\hadoop we can find the following files that we
will change:
core-site.xml:
For this file I will add the following property within the already existing configuration tag:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
In this case we also need to add some information within the configuration tag. I would
also like to mention that we assume that the name node and the data node use the following
paths: C:/hadoop/hadoopinfra/hdfs/namenode and
C:/hadoop/hadoopinfra/hdfs/datanode
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml:
We also need to add the following configuration to this file:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml:
We can also find a file named “mapred-site.xml.template”; we need to rename it to
“mapred-site.xml” and add the following configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
This finishes the configuration for the pseudo-distributed mode; I would also like to
mention that the configuration can be slightly different in older or newer versions.
Finally, I would like to verify that my Hadoop is working correctly. On the one hand it is
useful for me to know that everything works fine so far, and on the other hand it also
serves as a demonstration in this report.
The first verification is the name node setup: for that I navigate in the command prompt to
the folder hadoop-2.4.0/bin and type the command “hdfs namenode -format”.
Picture 4.4: Output results after executing »hdfs namenode -format« command.
Now I want to start the HDFS (Hadoop distributed file system) by running the commands
“hdfs namenode” and “hdfs datanode”.
In order to execute both commands successfully I had to handle different issues in the
configuration of Hadoop. On the one hand, the native libraries for Windows were not
included, so I downloaded [11] and placed them in the bin folder of Hadoop.
On the other hand, to run Hadoop on the Windows operating system it is necessary to solve
two main problems. First, it is necessary to install and correctly configure the following
tools: the Software Development Kit version 10, which is the only one compatible with my
operating system, Windows 10; Maven, whose files must be downloaded and extracted [12] into
the path C:/maven; and Protocol Buffers [13] version 2.5.0, downloaded and extracted into
the path C:/protobuf.
I also needed to add the following environment variables, and the following entry to the
PATH:
Picture 4.5: Environment variables of my computer.
Finally, I had to solve an incompatibility between the Java version of my computer and the
Java version that Hadoop used: they were not the same (1.7 versus 1.8), and one of them was
a 64-bit build whereas the other was 32-bit. There are different solutions to this problem;
the one I chose was to use the 32-bit build of the newer Java version 1.8 for both tools.
Right after that I managed to make HDFS work in a successful and satisfactory way, and in
order to demonstrate that I provide a screenshot of the daemons' activity while they are
running:
Picture 4.6: Output results after running Hadoop.
After that I ran the commands “yarn resourcemanager” and “yarn nodemanager”.
Because of the configuration we made with the environment variables, I do not have to run
the commands from any specific path; the system itself locates the files.
In order to demonstrate that I managed to get them running correctly, I show a screenshot
of these commands running, and I open the URLs http://localhost:50070/ and
http://localhost:8088/, where we can see the basic configuration of Hadoop in our system.
Picture 4.7: Output results after executing »yarn«.
Finally, I am also going to show two more screenshots demonstrating that I have both
services running on my computer:
Picture 4.8: Initial page after running Hadoop.
Picture 4.9: Initial page showing the cluster configuration of Hadoop.
Finally, I would like to mention another problem that we will face on future occasions if
we format the file system again: we can get errors with the cluster ID. To solve this we
need to get the cluster ID from the name node and add it to the format command. For
example, in my case:
hdfs namenode -format -clusterId CID-8bf63244-0510-4db6-a949-8f74b50f2be9.
In this way we will be able to format Hadoop again and run it without any problems.
4.3 Deployment of R algorithms in Hadoop
So far we have managed to install and run R separately and connect it with MongoDB, and in
the previous section we managed to run the Hadoop tools on our computer successfully. The
goal of this section is to configure R in such a way that it is able to connect with our
Hadoop; once we accomplish that, the technical part of this master thesis is complete. To
get there we will need to follow a couple of steps:
STEP 1: CONFIGURATION OF THE ENVIRONMENT
The first thing that we need to do is to open our R (in our case version 3.3) and run the
following commands:
Sys.setenv("HADOOP_CMD"="/hadoop-2.7.1/bin/hadoop.cmd")
Sys.setenv("HADOOP_PREFIX"="/hadoop-2.7.1")
Sys.setenv("HADOOP_STREAMING"="/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-
streaming-2.7.1.jar")
Sys.setenv("HADOOP_HOME"="/hadoop-2.7.1")
These commands set up the variables that refer to the Hadoop binary and the Hadoop
streaming JAR, whose paths can change depending on the version of Hadoop that you have.
Finally, we want to be sure that our variables were set up correctly, so this time we use
the following command:
Sys.getenv("HADOOP_CMD")
After that we should see the path that we previously entered, without any trouble. Here is
a screenshot of the command shell with all the commands that have been mentioned so far.
Picture 4.10: Initial configuration of Hadoop in R.
STEP 2: INSTALLING THE NECESSARY PACKAGES
The first thing we need to do is to download [14] and install Rtools. In our case the
latest version available was 3.3, so that is the one I downloaded and installed.
Once we have the necessary tools, we need to install nine different packages that are going
to be used. For that we use the command install.packages, and inside the parentheses we
create a vector naming all the packages that will be automatically installed.
install.packages(c("rJava","Rcpp","RJSONIO","bitops","digest","functional","stringr",
"plyr","reshape2"))
Finally, we also need to install the three main packages that compose RHadoop: rhdfs, rmr2
and rhbase. For that we need to type the following commands:
library(devtools)
install_github("rmr2", "RevolutionAnalytics", subdir="pkg")
install_github("rhdfs", "RevolutionAnalytics", subdir="pkg")
install_github("rhbase", "RevolutionAnalytics", subdir="pkg")
STEP 3: RUNNING FIRST TESTS TO CHECK THAT EVERYTHING WORKS AS
EXPECTED
First of all, I would like to briefly describe the role of the rmr2 package: its main
function is to perform statistical analysis in R via Hadoop MapReduce functionality on a
Hadoop cluster. Secondly, to show that rmr2 works correctly on my computer, I ran the
following basic commands:
library(rmr2)
from.dfs(to.dfs(1:100))
from.dfs(mapreduce(to.dfs(1:100)))
If everything has been set up correctly, you should not see any errors; instead you should
see an output like this:
Picture 4.11: Executing »mapreduce«.
I would also like to briefly describe the role of the rhdfs package. Its main task is to
provide basic connectivity with HDFS, the Hadoop distributed file system. With this package
you can perform different operations such as reading, writing and modifying files stored in
HDFS, and so on.
Finally, I am going to run a simple test showing whether the rhdfs package works correctly.
In order to verify that, I run the following commands in R:
library(rhdfs)
hdfs.init()
hdfs.ls("/")
These are the output results that I got; because I did not receive any error, I assume that
everything works correctly.
Picture 4.12: Basic usage of »rhdfs« library.
STEP 4: MAKING ADVANCED TESTS
Once we have proved that our RHadoop libraries work correctly, we are going to perform
different tests in order to explain how to implement machine learning algorithms in R on
Hadoop with the data extracted from our Mongo database.
The first thing we are going to do is operate on the Hadoop distributed file system with
rhdfs. For that, the following prerequisites are needed:
To start the following tasks in command mode: hdfs namenode, hdfs datanode,
yarn resourcemanager and yarn nodemanager.
To have imported and initialized all the libraries inside R that are needed to perform
the job correctly.
To initialize the environment variables HADOOP_CMD and
HADOOP_STREAMING.
Once we have all that, we can run the following commands. The first thing we do is write a
file called iris.txt to our HDFS:
library(rhdfs)
hdfs.init()
f = hdfs.file("iris.txt","w")
data(iris)
hdfs.write(iris,f)
hdfs.close(f)
f = hdfs.file("iris.txt", "r")
dfserialized = hdfs.read(f)
df = unserialize(dfserialized)
df
hdfs.close(f)
After that, the following screenshot shows my output results:
Picture 4.13: Unserializing data with »rhdfs«.
I would also like to demonstrate and explain the function of different commands from this
library that we will use in later sections.
Picture 4.14: Executing different »hdfs« commands.
hdfs.ls("./"): lists the files and directories in HDFS
hdfs.copy("name1","name2"): copies a file from one HDFS directory to another
hdfs.move("name1","name2"): moves a file from one HDFS directory to another
hdfs.delete("file"): deletes the file passed as a parameter
hdfs.get("name1","name2"): downloads a file located in HDFS to the local storage of
your computer
hdfs.rename("name1","name2"), hdfs.chmod("name1","name2") and
hdfs.file.info("./") are self-explanatory
Finally, I am also going to run some more tests in order to try out both libraries, rmr2
and rhdfs, at the same time. For that I am going to show two different examples.
In both cases the first thing I need is to import all the libraries. In some cases I am not
completely sure that they are all necessary, but what I am sure of is that it is better to
load them all and that everything works correctly this way.
library(rJava)
library(Rcpp)
library(RJSONIO)
library(bitops)
library(digest)
library(functional)
library(stringr)
library(plyr)
library(reshape2)
library(devtools)
library(methods)
Sys.setenv("HADOOP_CMD"="/hadoop-2.7.1/bin/hadoop.cmd")
Sys.setenv("HADOOP_PREFIX"="/hadoop-2.7.1")
Sys.setenv("HADOOP_STREAMING"="/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-
streaming-2.7.1.jar")
Sys.setenv("HADOOP_HOME"="/hadoop-2.7.1")
Sys.setenv("HADOOP_CONF"="/hadoop-2.7.1/libexec")
After that it is also necessary to import the rmr2 library and the rhdfs library, and to
initialize the rhdfs system in the following way:
library(rmr2)
library(rhdfs)
hdfs.init()
Using MapReduce for the first time: the word count problem:
In this example we use the mapreduce function for the first time, and because of that we
use an example that is as simple as possible. In this case we do not store the output in
any file; we store it in a variable and examine the results in different ways.
In the first line we call the function for the first time, passing as input a txt file that
contains the novel Moby-Dick; or, The Whale. After that we inspect the output of the “a”
variable, and finally we also fetch the contents of this temporary file into another
variable.
Picture 4.15: Using »mapreduce« in Hadoop.
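Since the word-count code is only visible in the screenshot, here is a hedged reconstruction of how such a job looks with rmr2. The file name mobydick.txt is a hypothetical local copy of the novel, and rmr.options(backend = "local") lets the sketch run without the Hadoop daemons:

```r
library(rmr2)
rmr.options(backend = "local")  # drop this line to run on the Hadoop cluster

# Load the text into the DFS abstraction, one element per line.
lines.dfs <- to.dfs(readLines("mobydick.txt"))

# Map: split each line into words and emit (word, 1) pairs.
# Reduce: sum the ones for every word.
a <- mapreduce(
  input  = lines.dfs,
  map    = function(k, v) keyval(unlist(strsplit(v, "\\s+")), 1),
  reduce = function(word, counts) keyval(word, sum(counts))
)
result <- from.dfs(a)  # list with $key (words) and $val (their counts)
head(result$key)
```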
So far we have made the simple example, but now we will make it a bit more complicated. Our
goal now is to process the input and count the length of each single row inside the text
file.
Finally, we show the results in a graph. At the same time, we also need to fetch the
results from the temporary file into different variables in order to be able to represent
them in a graph.
Picture 4.16: Practical example using »mapreduce«.
Picture 4.17: Graphic showing the output result of the previous example.
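The row-length variant can be sketched in the same spirit, again assuming a hypothetical local mobydick.txt and the rmr2 local backend:

```r
library(rmr2)
rmr.options(backend = "local")

# Map each line of the file to its length in characters; no reduce needed.
lens <- mapreduce(
  input = to.dfs(readLines("mobydick.txt")),
  map   = function(k, v) keyval(seq_along(v), nchar(v))
)
row.lengths <- from.dfs(lens)$val

# Represent the distribution of row lengths in a graph.
hist(row.lengths, main = "Length of each row", xlab = "characters per row")
```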
Comparing the performance between a standard R program and an R MapReduce
program:
The first commands implement a standard R program that squares all the numbers.
a.time = proc.time()
small.ints2=1:100000
result.normal = sapply(small.ints2, function(x) x^2)
proc.time() - a.time
In the second part of the exercise we do exactly the same as before, with the difference
that this time we implement it with MapReduce:
b.time = proc.time()
small.ints= to.dfs(1:100000)
result = mapreduce(input = small.ints,
map = function(k,v) cbind(v,v^2))
proc.time() - b.time
Picture 4.18: Set of commands processing data with »rhadoop«.
In the performance comparison we can see that the standard R program outperforms MapReduce
when we are processing small amounts of data. That is normal, because the Hadoop system
needs to spawn daemons, coordinate the job and fetch data from the data nodes; hence the
MapReduce version takes a few seconds more.
Testing and debugging the rmr2 program:
In this example I take a practical approach to some techniques for debugging and testing an
rmr2 program. To achieve that, I made the following steps: first of all, I configured rmr
to run locally; second, I performed the same basic example in order to obtain the squares
of the first million numbers; finally, I printed out the elapsed time and the structure of
the obtained information.
Picture 4.19: Final results after applying »rmr«.
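The debugging steps just described can be sketched as follows; the local backend keeps everything inside the current R session, which makes errors much easier to trace:

```r
library(rmr2)
rmr.options(backend = "local")  # run rmr jobs locally for debugging

t0  <- proc.time()
out <- from.dfs(mapreduce(input = to.dfs(1:1000000),
                          map   = function(k, v) cbind(v, v^2)))
proc.time() - t0  # elapsed time of the local run
str(out)          # structure of the fetched key/value data
```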
5 PERFORMING THE EXPERIMENTS AND ANALYZING
THE RESULTS
So far we have managed to do the following things: first, we installed and used the Mongo
database and stored the ten different datasets that we use as samples; secondly, we
installed our R tool and managed to connect R with Mongo using the library named “rmongodb”
in order to analyze the stored data; finally, we installed Hadoop, a tool that allows us to
manage big data, and in addition we connected our R tool with Hadoop using libraries like
“rmr2” and “rhdfs”.
For each dataset I will explain what you can do with the data, and I will also provide the
results of some experiments:
FIRST DATA SET: ARRHYTHMIA:
WHAT CAN YOU DO WITH THIS DATA SET?
This dataset contains 457 different attributes describing people who had this symptom. It
includes characteristics like age, gender, height and weight, or exactly which kind of
elliptic path the involuntary movements of these people follow. With this data you could
arguably guess whether this disease depends on weight or height, whether it is more likely
at a certain age, or whether it is more usual in males or females, and that could be
helpful for research, for example.
PROVIDED RESULTS:
On the one hand, I will show the age distribution of the people who had arrhythmia
detected. Here are the results:
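The graphs that follow could be produced with a query loop like the following sketch; the collection name and the age field FIELD1 are assumptions (the real field index depends on how the CSV was imported), and an open connection `mongo` is required:

```r
library(rmongodb)

# Fetch every document from the (hypothetical) arrhythmia collection and
# collect the age attribute, assumed here to be stored as FIELD1.
cursor <- mongo.find(mongo, "datasets.arrhythmia")
ages <- numeric(0)
while (mongo.cursor.next(cursor)) {
  doc  <- mongo.bson.to.list(mongo.cursor.value(cursor))
  ages <- c(ages, as.numeric(doc$FIELD1))
}
mongo.cursor.destroy(cursor)

hist(ages, main = "Age distribution", xlab = "age")  # frequency graph
boxplot(ages)  # highlights where the average age falls
```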
Picture 5.1: Results for the experiments of the arrhythmia dataset.
I would like to point out that in this case the dataset does not contain many samples,
which is why the frequency graph shows so few; the size of this dataset lies mostly in the
number of attributes. On the other hand, on the left side we can see that the average age
for having arrhythmia is approximately between 40 and 50 years.
I would also like to provide the gender distribution of the people who have arrhythmia, in
order to know whether one gender is weaker against this symptom than the other.
Picture 5.2: More results about the arrhythmia dataset.
The results show something quite unusual: there is exactly the same number of men as women
who have suffered arrhythmia. Usually one gender would be represented a bit more than the
other, though roughly the same.
SECOND DATASET CMU:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains a lot of different attributes about people; the acronym comes from
“central management unity”, and the data could be helpful for knowing where those people
are from, how many surveys they have completed or how much money they have.
PROVIDED RESULTS:
In this case the performed experiments show the distribution of the accuracy values
(graph 1), the relativity values (graph 2) and the amount of subjectiveness (graph 3).
Picture 5.3: Results for the experiments of cmu dataset.
THIRD DATA SET: DIABETIC DATA:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains 49 different attributes about diabetic people; the gathered features
include race, number of days in the hospital, the clinical specialty that treated the
patient, number of procedures, number of diagnoses and more medical data. It could be
helpful for knowing how the patients respond to the different procedures or how many days
they usually need to stay in the hospital.
PROVIDED RESULTS:
The first experiment that I performed determines how many of them take insulin and what the
gender distribution of those people is.
Picture 5.4: Results for the experiments of diabetic dataset.
In the first case we can see the numbers: 10 of them are female whereas 9 are male, so the
proportion is almost 50% even if it does not look like it at first sight. On the other
hand, in the second graph I would like to point out that in most of the cases there is no
information about this attribute; in the rest of the cases all diabetics take insulin,
which is what one would expect.
FOURTH DATA SET: TUMOR:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset is quite technical; it contains the different features that various tumors
have: which part of the body they affect, their size and their behavior. It can be very
useful to learn from all those features in order to make predictions for future tumors, or
to know which types of tumors are more aggressive and which ones are more likely to
appear.
PROVIDED RESULTS
In this case I performed different experiments in order to examine the average and the
frequency of two of the attributes. The first one is called HG2507-HT2603_at.
Picture 5.5: Results for the experiments of tumor dataset.
We can see that in the first attribute the values lie within a range between -1500 and 400.
Picture 5.6: More results for the tumor dataset.
FIFTH DATA SET: KDDCUP99:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset comes from the data mining and knowledge discovery competition of the year
1999; it contains different features, like which protocol the participants are using or the
number of failed logins. It can be useful, for example, to know which protocols are
becoming more popular among the participants, or to find different error-prone situations
with the contained error data.
PROVIDED RESULTS:
The kddcup_99 dataset has different attributes, and in this case I am going to show the
values that the destination host count and the destination host service count can take.
Picture 5.7: Results for the experiments of kddcup dataset.
SIXTH DATA SET: LETTER:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains the different features that characters have; we can highlight height,
width, which corners they touch, and so on. It can be useful for some areas of research to
identify the different Roman characters or to compare them with Chinese or Japanese
characters.
PROVIDED RESULTS:
In this case I show the different values that the attributes x-box and y-box can take.
Picture 5.8: Results for the experiments of letter dataset.
SEVENTH DATA SET: NURSERY
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains different information about children who were in a nursery. It can be
useful to analyze different data like health, finances or the number of siblings they have,
in order to know which of them are more likely to go to the nursery and whether it is
related to some of these attributes.
PROVIDED RESULTS:
In this case we look at the different values and frequencies that the database has for the
attributes “parents” and “has_nurs” (whether the child has an auxiliary nurse or not).
Picture 5.9: Results for the experiments of nursery dataset.
EIGHTH DATA SET: SPLICE
WHAT CAN YOU DO WITH THIS DATASET?
The dataset contains 61 different attributes describing DNA splice junctions: which class
of splice junction each sequence belongs to and which features it has. It can be useful for
learning to recognize the boundaries between coding and non-coding regions in gene
sequences.
PROVIDED RESULTS:
In this case I am going to examine two attributes, named “attribute_1” and “attribute_2” in the data set. Both can take four different values: C, A, G and T. We will examine the likelihood of each value for both attributes, and on the other hand we will examine the average, minimum and maximum values.
Picture 5.10: Results for the experiments of splice dataset.
NINTH DATA SET: WAVEFORM
WHAT CAN YOU DO WITH THIS DATASET?
In this case each row of the data set contains forty attributes, each holding one of the points that make up the waveform. It can be useful for knowing which waveforms have been gathered in nature, and also for estimating what a typical waveform graph looks like for different purposes.
PROVIDED RESULTS:
In this case I show the different values that the attributes “x1”, “x2” and “x3” can take.
Picture 5.11: Results for the experiments of waveform dataset.
TENTH DATA SET: STUDENTS DATA
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains data about high school students. With it you can find out, for example, whether the parents' level of education influences the students' marks, how much the students study, or whether girls get better marks than boys.
PROVIDED RESULTS:
I will show the results that I obtained for the high school dataset. In the first picture we see whether the parents' level of education influenced the students' marks.
Picture 5.12: Results for the experiments of students dataset.
Finally, I also show the gender distribution within the high school students.
Picture 5.13: More results for the experiments of students dataset.
6 APPLYING MACHINE LEARNING ALGORITHMS
Machine learning is a subfield of computer science that evolved from pattern recognition and from computational learning theory in artificial intelligence.
We could define machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed".
This subfield of computer science explores algorithms that are able to learn from data and to make decisions and predictions based on it. I would also like to point out that machine learning is closely related to, and sometimes overlaps with, computational statistics, a discipline that focuses on making predictions with computers.
It also has strong ties to mathematical optimization, which provides methods, theory and application domains to machine learning. Typical applications include spam filtering, computer vision, optical character recognition, search engines and so on.
Finally, the related subfield of data mining focuses more on exploratory data analysis and is known as unsupervised learning.
STEP 1: STARTING WITH MACHINE LEARNING ALGORITHMS
As an introduction to the topic, I am going to apply in R one of the simplest machine learning algorithms, named KNN (k-nearest neighbors); in this case I will apply it to a sample data set named iris. These are the first commands of the procedure, with their explanation:
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
names(iris)
library(ggvis)
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()
First of all, we assign the vector of attribute names. Next, we load a library named “ggvis”, which can draw the more complex graphics that are useful in this case. Finally, I create two plots: they show the clear relationship between the attributes petal length and petal width, while sepal length and sepal width turn out to be not as closely related.
Picture 6.1: Applying machine learning algorithms in R.
Secondly, I typed the following commands:
table(iris$Species)
round(prop.table(table(iris$Species)) * 100, digits = 1)
summary(iris)
The purpose of these commands is to see the distribution of the species attribute, first as absolute counts and then as percentages rounded to one decimal place, and finally to print a summary of the whole data set.
Picture 6.2: Showing main features of iris dataset
Thirdly, we also want to see a summary of the two attributes petal width and sepal width, in order to see the relationship between them and to get a better understanding of the data set we are experimenting with. After that we also prepare the workspace by importing the “class” library. Those two actions are performed by the following commands:
summary(iris[c("Petal.Width", "Sepal.Width")])
library(class)
After that we come to a very important step, named normalization, which makes the data more consistent. Normalization is not always strictly necessary: if the differences between the minimum and maximum values in the data set are small, this step can be skipped, but it is still always advisable. The following commands perform the normalization step:
normalize <- function(x) {
  num <- x - min(x)         # shift so the minimum becomes 0
  denom <- max(x) - min(x)  # range of the attribute
  return (num / denom)      # every value now lies in [0, 1]
}
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
summary(iris_norm)
Picture 6.3: Summary of iris dataset
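As a quick sanity check of the step above, the following self-contained sketch (iris ships with R, so no database connection is needed) verifies that every normalized column ends up exactly in the interval [0, 1]:

```r
# Min-max normalization: rescales each attribute into [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# iris is bundled with R; columns 1 to 4 are the numeric attributes
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))

# Every normalized column now starts at 0 and ends at 1
sapply(iris_norm, min)
sapply(iris_norm, max)
```

Note that if an attribute were constant, max(x) - min(x) would be 0 and the division would produce NaN, so constant columns should be removed before this step.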
The fourth thing we have to do is to prepare the training and the test sets. The first step is to set a random seed, so that the generated random numbers are reproducible. Then the sample function draws a sample whose size equals the number of rows of the iris data set. Finally, we use the vector returned by sample to define our training and our test sets:
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
iris.training <- iris[ind==1, 1:4]
iris.test <- iris[ind==2, 1:4]
Picture 6.4: Applying machine learning algorithms to iris data
Finally, I would like to point out that our training and test sets do not contain all five attributes, only four, because the fifth attribute is the one we actually want to predict. We then apply the “knn” function in order to predict the results. Even if it seems that the work is done, we still need to analyze the results; for that, the first thing we need to do is to import the “gmodels” library.
iris.trainLabels <- iris[ind==1, 5]
iris.testLabels <- iris[ind==2, 5]
iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)
iris_pred
library(gmodels)
Picture 6.5: Splitting the iris dataset into training and test sets.
Finally, we analyze the results and see that the algorithm worked quite well, being right in all cases except one. In order to analyze the results, I typed the following command:
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
Picture 6.6: Final results after applying machine learning on iris dataset.
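The same check can be reproduced without the gmodels library: base R's table() already produces the confusion matrix, and the accuracy follows from its diagonal. This sketch repeats the split and prediction from above; the exact number of errors depends on the chosen seed:

```r
library(class)  # provides the knn() function

set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
iris.training <- iris[ind == 1, 1:4]
iris.test <- iris[ind == 2, 1:4]
iris.trainLabels <- iris[ind == 1, 5]
iris.testLabels <- iris[ind == 2, 5]

iris_pred <- knn(train = iris.training, test = iris.test,
                 cl = iris.trainLabels, k = 3)

# Confusion matrix: rows are the true labels, columns the predictions
conf <- table(iris.testLabels, iris_pred)
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```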
STEP 2: APPLYING MACHINE LEARNING (REGRESSION TREE) TO OUR
DATASETS
In order to apply machine learning to our datasets we will use a regression tree, because it is one of the most popular and recommended algorithms, and we will apply it to all our datasets with the following command structure:
library(rmongodb)  # the connection object "mongo" is created beforehand with mongo.create()
library(rpart)

coll <- "dataset.collection"
dataset <- mongo.find.all(mongo, coll, data.frame=TRUE)

# keep only the attributes used in the model and drop incomplete rows
raw <- subset(dataset, select=c("x.box", "y.box", "width", "high"))
raw <- na.omit(raw)

frmla <- high ~ x.box + y.box + width
fit <- rpart(frmla, method="class", data=raw)
printcp(fit)  # display the results
plotcp(fit)   # visualize cross-validation results
summary(fit)
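Since the MongoDB connection and the collection name above are specific to my own setup, the same procedure can be sketched in a self-contained way on the bundled iris data; the attribute names naturally differ from those of the letter dataset:

```r
library(rpart)  # recursive partitioning (classification and regression trees)

# Classification tree predicting the species from the four measurements
frmla <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
fit <- rpart(frmla, method = "class", data = iris)

printcp(fit)  # complexity-parameter table with cross-validation results
summary(fit)  # node-by-node description of the fitted tree

# The tree separates the three species far better than chance
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)
```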
DATA SET: LETTER:
After applying the regression tree to this dataset with the previously mentioned commands, I got the following output results:
Picture 6.7: Applying regression tree algorithm to letter dataset
DATA SET: ARRYTHMIA:
After applying the regression tree to the attributes of this dataset, I obtained the following results:
Picture 6.8: Applying regression tree algorithm to arrythmia dataset
DATA SET: DIABETIC:
After applying the regression tree to the attributes gender, weight, race and age, I got the following output results:
Picture 6.9: Applying regression tree algorithm to diabetic dataset
DATA SET: KDDCUP99:
Now I will analyze the different attributes of the kddcup dataset from the year 1999. These are the output results that I got:
Picture 6.10: Applying regression tree algorithm to kddcup dataset
DATA SET: NURSERY:
Now I will apply the regression tree to the attributes of the nursery dataset; these are the output results:
Picture 6.11: Applying regression tree algorithm to nursery dataset
DATA SET: SPLICE:
Now I will apply the regression tree to the attributes of the splice dataset; these are the output results:
Picture 6.12: Applying regression tree algorithm to splice dataset
DATA SET: STUDENT:
Now I will apply the regression tree to the attributes of the student dataset; these are the output results:
Picture 6.13: Applying regression tree algorithm to student dataset
DATA SET: TUMOR:
Now I will apply the regression tree to the attributes of the tumor dataset; these are the output results:
Picture 6.14: Applying regression tree algorithm to tumor dataset
DATA SET: WAVEFORM:
Now I will apply the regression tree to the attributes of the waveform dataset; these are the output results:
Picture 6.15: Applying regression tree algorithm to waveform dataset
7 CONCLUSION
With the introduction of new technologies, devices and different means of communication, such as social networks, the quantity of data being produced is growing very fast year by year. To get a general idea of how much data we create: about 5 billion gigabytes of data were produced from the beginning of time until 2003. The amount of data that applications and technologies have to manage is becoming ever bigger, so there has to be a way of handling this issue. This is where big data comes into play.
Under the name big data we understand a collection of very large datasets that cannot be processed using traditional computing techniques. Furthermore, in recent years big data has ceased to be merely data and has become a subject of its own, involving different tools, techniques and frameworks.
With this master thesis I had the chance of working with big data, which was a good way of appreciating its main benefits first hand. I would like to mention the two main ones:
Using such a big quantity of information allows you to learn about the response to campaigns, promotions and other advertising media.
Using the information gives you more knowledge about your products, which can be useful for planning production or for future decision making.
I would also like to say a few words about Hadoop, the big data technology that I have been using throughout this master thesis. Hadoop is an open-source Apache project that started in 2005, inspired by papers that Google published about its MapReduce and distributed file system technologies. Hadoop is able to run applications using the MapReduce model on different CPU nodes that process the data in parallel. Through the work done here I was able to confirm that Hadoop is a strong solution, with very high fault tolerance and a very scalable approach to big data.
I would also like to give some conclusions and explanations about the other subject of this master thesis, machine learning. Machine learning is a subfield of computer science derived from the study of computational learning and pattern recognition. We could define it as "the field of study that gives computers the ability to learn without being explicitly programmed"; this subfield explores algorithms that can learn from experience and make predictions based on the gathered data.
Furthermore, I would like to point out that machine learning is closely related to the discipline of computational statistics, which also focuses on making predictions, in this case by means of computers.
Once we have a good definition of what machine learning is, I would like to give my own conclusions based on my experience. As I saw throughout the making of this master thesis, machine learning approaches can be applied to many different fields and are in this way able to improve product quality. They are of particular interest given the steadily increasing volume of published results, since keeping existing evidence accessible is a particular challenge of the quality-improvement research field.
At the end, I would also like to talk about the results of analyzing our different datasets. As we saw before, big data analysis helps you to identify the connections between different attributes, to gain a better understanding of already existing data and, at the same time, to make predictions about what future data entries are going to look like. In this way we could identify the different features of a tumor in order to classify it, or find out which features are related to the appearance of arrhythmia. Finally, we could also see other, less medical examples, such as the data about high school students and how its different variables relate to each other.
8 REFERENCES
[1] Official documentation with a general description of MongoDB: http://www.mongodb.org/about/introduction/
[2] Main description and basic features providing a theoretical base for MongoDB, from Wikipedia: http://en.wikipedia.org/wiki/MongoDB
[3] Main MongoDB page, where the necessary tools can be downloaded legally: http://www.mongodb.org
[4] R is a tool for analyzing data; its official web page is http://www.r-project.org/
[5] Basic features, history and information about the different versions of R: https://en.wikipedia.org/wiki/R_(programming_language); a tutorial for getting started with data mining in R: http://bigdatatechworld.blogspot.com/2014/01/video-tutorial-for-data-mining-with.html
[6] A theoretical base with the main information about machine learning algorithms and how they work: http://en.wikipedia.org/wiki/Weka_%28machine_learning%29; a basic guide for getting started with the rmongodb library inside R: https://github.com/selvinsource/mongodb-datamining-shell
[7] Main page with the official documentation for R: http://cran.r-project.org
[8] https://en.wikipedia.org/wiki/Machine_learning, the main source of information for describing the main features of machine learning
[9] The official documentation about Hadoop: http://www.apache.si/hadoop/common/hadoop-2.7.0/
[10] Main tutorial on how to use the Hadoop Distributed File System on Windows: http://wiki.apache.org/hadoop/Hadoop2OnWindows
[11] Main page for downloading the latest Hadoop version: http://hadoop.apache.org/releases.html
[12] The native Windows libraries can be downloaded at the following URL: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
[13] The Maven files (version 3.1.1) are needed for the purpose of this master thesis and can be downloaded at the following URL: http://maven.apache.org/download.cgi
[14] Protocol Buffers are needed for the entire system to work correctly and can be downloaded at the following URL: https://github.com/google/protobuf
[15] The R tools can be downloaded at the following URL: https://cran.r-project.org/bin/windows/Rtools/