Daniel Adanza Dopazo
MACHINE LEARNING ON BIG DATA USING MONGODB, R AND HADOOP
Master thesis
Maribor, December 2016
STROJNO UČENJE NA VELIKIH PODATKIH Z UPORABO
MONGODB, R IN HADOOP
MACHINE LEARNING ON BIG DATA USING MONGODB,
R AND HADOOP
Magistrsko delo
Študent: Daniel Adanza Dopazo
Študijski program: študijski program 2. stopnje
Informatika in tehnologije komuniciranja
Mentor: red. prof. dr. Vili Podgorelec
Lektor(ica):
Strojno učenje na velikih podatkih z uporabo MongoDB, R in Hadoop
Ključne besede: veliki podatki, strojno učenje, analiza podatkov
UDK: 004.8:004.65(043.2)
Povzetek
Osrednji namen tega magistrskega dela je testiranje različnih pristopov in izvedba več
eksperimentov na različnih podatkovnih zbirkah, razporejenih na infrastrukturi za obdelavo
velikih podatkov. Da bi dosegli ta cilj, smo magistrsko nalogo strukturirali v tri glavne dele.
Najprej smo pridobili deset javno dostopnih podatkovnih zbirk z različnih področij, ki so
dovolj kompleksne (glede na obseg podatkov in število atributov) za namen izvajanja analize
velikih podatkov na ustrezen način. Zbrane podatke smo najprej predhodno obdelali, da bi
bili združljivi s podatkovno bazo MongoDB.
V drugem delu smo analizirali zbrane podatke in izvedli različne poskuse s pomočjo
orodja R, ki omogoča izvedbo statistične obdelave podatkov. Orodje R smo pri tem povezali
s podatkovno bazo MongoDB.
V zadnjem delu smo uporabili še ogrodje Hadoop, s pomočjo katerega smo dokončali
načrtovano infrastrukturo za obdelavo in analizo velikih podatkov. Za namen tega
magistrskega dela smo vzpostavili sistem v načinu enega vozlišča v gruči. Analizirali smo
razlike z vidika učinkovitosti vzpostavljene infrastrukture in delo zaključili z razpravo o
prednostih in slabostih uporabe predstavljenih tehnologij za obdelavo velikih podatkov.
Machine learning on big data using MongoDB, R and Hadoop
Key words: big data, machine learning, data analysis
UDK: 004.8:004.65(043.2)
Abstract
The main purpose of this master thesis is to test different approaches and perform several
experiments on different datasets, deployed on a big data infrastructure. In order to achieve
that goal, we will structure the thesis in three parts.
First of all, we will obtain ten publicly available datasets from different domains, which
are complex enough (in terms of size and number of attributes) to perform big data
analysis properly. Once they are gathered, we will pre-process them so that they are
compatible with the MongoDB database.
Secondly, we will analyse the data and perform various experiments using the R
statistical and data analysis tool, which will in turn be linked to the MongoDB
database.
Finally, we will use Hadoop to complete the planned big data processing infrastructure. For
the purpose of this master thesis, we will use it in single-node cluster mode. We will analyse
the differences from the performance point of view and discuss the advantages and
disadvantages of using the presented big data technologies.
TABLE OF CONTENTS
1 INTRODUCTION ....................................................................................................... 1
2 MASTERING THE DATABASES ............................................................................ 2
2.1 Installing and configuring MongoDB server .......................................................... 2
2.2 Learning Mongo commands ..................................................................................... 6
2.3 Preparation of the datasets ...................................................................................... 19
3 MASTERING THE DATA MINING ...................................................................... 27
3.1 Installing and configuring R ................................................................................. 27
3.2 Learning to use R with its library rmongodb ........................................................ 31
4 MASTERING THE HADOOP ................................................................................ 50
4.1 What is big data and Hadoop? .............................................................................. 50
4.2 Installing and configuring Hadoop ....................................................................... 52
4.3 Deployment of R algorithms in Hadoop ............................................................... 63
5 PERFORMING THE EXPERIMENTS AND ANALYZING THE RESULTS .. 75
6 APPLYING MACHINE LEARNING ALGORITHMS ........................................ 87
7 CONCLUSION ........................................................................................................ 104
8 REFERENCES ...............................................................................................................
LIST OF FIGURES
Picture 2.1: Files contained in the MongoDB server version 3.0 folder. ............................... 3
Picture 2.2: Files contained in the bin folder of the MongoDB server. ................................. 3
Picture 2.3: Practical example of the usage of Mongo. ......................................................... 4
Picture 2.4: Practical example of Mongo running in command mode. ................................. 4
Picture 2.5: Window in the advanced configuration for the operating system Windows 8. . 5
Picture 2.6: Window with the environment variables already configured. ........................... 6
Picture 2.7: Connecting with the hockey dataset in Mongo .................................................. 7
Picture 2.8: Inserting data into our players dataset. ............................................................... 8
Picture 2.9: Output result after including a new row into our players dataset. ..................... 9
Picture 2.10: Accessing collections in Mongo. ................................................................... 10
Picture 2.11: Showing the differences after applying the »pretty« function. ...................... 10
Picture 2.12: Practical example of removing one row in Mongo. ....................................... 11
Picture 2.13: Practical example of updating a collection. ................................................... 11
Picture 2.14: Practical example of dropping a collection. ................................................... 12
Picture 2.15: Sample query in Mongo. ................................................................................ 12
Picture 2.16: Executing a query in our players dataset........................................................ 13
Picture 2.17: Executing a different query in our players dataset. ........................................ 16
Picture 2.18: Getting indexes in the player dataset. ............................................................. 18
Picture 2.19: Showing all implemented datasets in Mongo. ................................................ 20
Picture 2.20: Showing WEKA interface.............................................................................. 21
Picture 2.21: Showing WEKA interface. ............................................................................. 22
Picture 2.22: Showing all attributes of the arrythmia dataset. ............................................. 23
Picture 2.23: Showing the attributes of diabetic dataset. ..................................................... 23
Picture 2.24: Showing the attributes of letter dataset. ......................................................... 24
Picture 2.25: Showing the attributes of nursery dataset. ..................................................... 24
Picture 2.26: Showing the attributes of splice dataset. ........................................................ 24
Picture 2.27: Showing the attributes of student dataset. ...................................................... 25
Picture 2.28: Showing the attributes of tumor dataset. ........................................................ 25
Picture 2.29: Showing the attributes of waveform dataset. ................................................. 25
Picture 2.30: Showing the attributes of cmu dataset. .......................................................... 26
Picture 2.31: Showing the attributes of kddcup dataset....................................................... 26
Picture 3.1: Official documentation of R............................................................................. 29
Picture 3.2: Installing the package »rmongodb« in R. ......................................................... 30
Picture 3.3: Output result after installing »rmongodb«. ...................................................... 31
Picture 3.4: Connecting with Mongo datasets from the R IDE. ........................................... 33
Picture 3.5: Accessing databases and collections from R. ................................................... 35
Picture 3.6: Executing some queries with our sample data from R. .................................... 37
Picture 3.7: Executing some queries from R. ....................................................................... 39
Picture 3.8: Graphics showing the results after executing some queries............................. 41
Picture 3.9: Executing the »count« and »head« functions. .................................................. 42
Picture 3.10: Converting to BSON format in R. .................................................................. 44
Picture 3.11: Output result after executing queries in our sample data in R. ...................... 45
Picture 3.12: Sample using the »count« function in »rmongodb«. ...................................... 46
Picture 3.13: Executing some experiments in R. ................................................................. 48
Picture 3.14: Graphic showing the output results of the experiment. .................................. 49
Picture 3.15: Bar chart showing the output results of the experiment. ................................ 49
Picture 4.1: Window showing the environment variables. .................................................. 53
Picture 4.2: Command window showing the current Java version installed on my computer. ..........
Picture 4.3: Output result showing the Hadoop version installed. ...................................... 54
Picture 4.4: Output results after executing the »hdfs namenode -format« command. ......... 58
Picture 4.5: Environment variables of my computer. .......................................................... 59
Picture 4.6: Output results after running Hadoop. ............................................................... 60
Picture 4.7: Output results after executing »yarn«. ............................................................. 61
Picture 4.8: Initial page after running Hadoop. ................................................................... 62
Picture 4.9: Initial page showing the cluster configuration of Hadoop. .............................. 62
Picture 4.10: Initial configuration of Hadoop in R. ............................................................. 64
Picture 4.11: Executing »mapreduce«. ................................................................................. 66
Picture 4.12: Basic usage of »rhdfs« library........................................................................ 67
Picture 4.13: Unserializing data with »rhdfs«. .................................................................... 68
Picture 4.14: Executing different »hdfs« commands. ......................................................... 69
Picture 4.15: Using »mapreduce« in Hadoop. ..................................................................... 71
Picture 4.16: Practical example using »mapreduce«. ........................................................... 72
Picture 4.17: Graphic showing the output result of the previous example. ......................... 72
Picture 4.18: Set of commands processing data with »rhadoop«. ........................................ 73
Picture 4.19: Final results after applying »rmr«. ................................................................. 74
Picture 5.1: Results for the experiments of arrythmia dataset. ............................................ 76
Picture 5.2: More results about arrythmia dataset. .............................................................. 76
Picture 5.3: Results for the experiments of cmu dataset. .................................................... 77
Picture 5.4: Results for the experiments of diabetic dataset. ............................................... 78
Picture 5.5: Results for the experiments of tumor dataset. .................................................. 79
Picture 5.6: More results for tumors dataset. ....................................................................... 80
Picture 5.7: Results for the experiments of kddcup dataset. ................................................ 81
Picture 5.8: Results for the experiments of letter dataset. ................................................... 82
Picture 5.9: Results for the experiments of nursery dataset. ............................................... 83
Picture 5.10: Results for the experiments of splice dataset. ................................................ 84
Picture 5.11: Results for the experiments of waveform dataset. ......................................... 85
Picture 5.12: Results for the experiments of students dataset. ............................................ 86
Picture 5.13: More results for the experiments of students dataset. .................................... 86
Picture 6.1: Applying machine learning algorithms in R. ................................................... 88
Picture 6.2: Showing main features of iris dataset .............................................................. 89
Picture 6.3: Summary of iris dataset .................................................................................... 90
Picture 6.4: Applying machine learning algorithms to iris data .......................................... 91
Picture 6.5: Splitting the iris dataset into training and test sets. .......................................... 92
Picture 6.6: Final results after applying machine learning on iris dataset. .......................... 93
Picture 6.7: Applying regression tree algorithm to letter dataset ........................................ 95
Picture 6.8: Applying regression tree algorithm to arrythmia dataset ................................. 96
Picture 6.9: Applying regression tree algorithm to diabetic dataset .................................... 97
Picture 6.10: Applying regression tree algorithm to kddcup dataset ................................... 98
Picture 6.11: Applying regression tree algorithm to nursery dataset .................................. 99
Picture 6.12: Applying regression tree algorithm to splice dataset ................................... 100
Picture 6.13: Applying regression tree algorithm to student dataset ................................. 101
Picture 6.14: Applying regression tree algorithm to tumor dataset ................................... 102
Picture 6.15: Applying regression tree algorithm to waveform dataset ............................ 103
1 INTRODUCTION
Context
At the beginning I would like to make a brief comment about the context of this master
thesis, starting with data mining.
Data mining is a subfield of computer science that consists of discovering patterns in large
data sets; it draws on methods at the intersection of fields such as artificial intelligence,
machine learning and statistics.
I would like to point out that the main purpose of data mining is the extraction of information
from a data set and its transformation into another structure that can be used for further
purposes.
Main purpose of the master thesis
The main purpose of the master thesis is to analyse the different relationships among the
attributes of a database and to extract some additional information from them. In order to
do that, it will be necessary to use different tools: for storing data in the database
(MongoDB), for analysing the stored data (the R IDE) and finally for learning how to deploy
the analysis on big data by using tools like Hadoop.
Brief description of the content
The first part of the thesis is dedicated to MongoDB. Here I will install the tool, learn to use
it and load the different datasets that we are going to use for the project.
The second part of the report is about data mining. There we will install, configure and learn
how to use the R IDE with the necessary libraries in order to connect it with our datasets.
In the third part of the thesis we will talk about big data and we will use Hadoop. We will
also learn how to connect this tool with R and Mongo.
Finally, in the remaining parts I am going to analyse different data, perform different
experiments, apply different machine learning algorithms and draw some inferences about
big data and the obtained results.
Aims
Even though we will work on different databases, trying to make inferences about the data
and to obtain different statistics, the real goal of the project is the application of the
previously mentioned technologies and tools that allow us to work with big data, and to
demonstrate that they can be quite useful in a variety of situations. Therefore the main aim
of the study is the application of different technologies that allow us to work on big data.
Objectives
These are the other aims of the project:
• application of a NoSQL datastore, in this case MongoDB
• application and usage of different machine learning algorithms in R
• deployment of a big data infrastructure, in this case Hadoop
• deployment of some machine learning algorithms from R on Hadoop, as well as
setting up MongoDB for storing the datasets, and then using the algorithms from R deployed
on Hadoop in order to learn from the data in the MongoDB datasets
• performing different experiments on the selected datasets
Assumptions and limitations
The main purpose of the research is not to get information about the databases that I have
taken as examples, but to get familiar with and to try out different technologies that allow
us to analyse a big quantity of data.
I would also like to mention different assumptions and shortcomings of the research. We
should always keep in mind that our inferences are based on a sample of data. Hence the
numbers could be slightly different from the numbers we would get by analysing other
sources. Nevertheless, it is always good to make some inferences and a good estimation
about different features.
2 MASTERING THE DATABASES
In this section I am going to describe everything that is necessary for the database, including
the description of the necessary tools and the steps that I took in order to install them.
Furthermore, I am going to include a simple guide to the basic commands that will allow us
to inspect and manipulate the information inside the database. In order to achieve this goal,
we will use a NoSQL database, a type of database based on JSON-like documents that is
slightly different from the typical SQL databases we have usually used in our projects. The
tool that we will use to handle this type of database is called MongoDB [1], a cross-platform
tool that has been released under the GNU General Public License and that works through
the terminal in command mode.
2.1 Installing and configuring MongoDB server
The first step towards accomplishing the final goal of the project is to install and configure
the necessary tools. MongoDB [2] is an open-source document database that provides high
performance, high availability and automatic scaling. Its data structure is composed of field
and value pairs, and its documents are quite similar to JSON objects.
The first thing that we need to do is to go to its official web page [3] and download the
executable file, in my case the one specific for the operating system Windows 8 (64-bit
architecture).
Right after finishing the installation we will see the files shown in the screenshot below, in
the default path: c:/Program Files/MongoDB/Server/3.0/
Picture 2.1: Files contained in the MongoDB server version 3.0 folder.
And this is what we can see in the bin directory right after installing MongoDB:
Picture 2.2: Files contained in the bin folder of the MongoDB server.
In order to configure MongoDB for the very first time, it is necessary to open the terminal
cmd.exe and type the following commands in order to move to the necessary folder.
Picture 2.3: Practical example of the usage of Mongo.
As you can see in the image above, we have used the command “cd” in order to change to
the correct directory, where we can execute the command “mongod”. At first it will not find
any data directory and will not work as we might have expected. The reason for this is that
we first need to create the default data folder with the command “mkdir \data\db”. After
creating this directory, we will see that our tool works well.
Picture 2.4: Practical example of Mongo running in command mode.
Here we have a practical demonstration of the MongoDB tool working well. In one terminal
we used the “mongod” command to start the server, and in the other terminal we used the
“mongo” command, which initiates a dialog with it.
Finally, I would like to make an additional step that is not strictly necessary but simplifies
things when starting the Mongo tool. This step consists of adding MongoDB to the PATH
environment variable, for which we will need to open the following location on our
computer:
Control Panel > All Control Panel Items > System > Advanced system settings
Picture 2.5: Window in the advanced configuration for the operating system Windows 8.
After that we will see something similar to this window. We will then also need to go to
System Properties > Advanced > Environment Variables.
Picture 2.6: Window with the environment variables already configured.
In the environment variables we will need to add the value
C:/Program Files/MongoDB/Server/3.0/bin to the PATH variable. Finally, I would like to
say that this step is very useful because from now on it will not be necessary to navigate to
that hard-to-remember directory in order to start Mongo. Now it is enough to just open the
terminal and type “mongod”.
2.2 Learning Mongo commands
Installing the necessary tools is not enough if we want to accomplish our final goal. Hence
the next step would be to create a sample database and to fill it with some unimportant
sample data. In the next paragraphs I will also explain in greater detail how MongoDB works
and which functions its different commands provide.
As we saw in the previous step, we will need to open two terminals. In one of them we will
type the command “mongod”, so this window will act as the server. In the other window
we will type the command “mongo”, so this window will act as the client that will
connect with the server on the local host. It is in this second window that we will need to
type the different commands for inserting, deleting and modifying information.
STEP 1: BASIC DATABASE COMMANDS:
There are different commands that allow us to manipulate the databases. Fortunately, they
are quite straightforward to use. We will highlight the following:
db: it tells you which database you are using at the moment (by default you would be using
the database named “test”).
use <database name>: it switches the database that you intend to use. For example, if you
don’t want to use the database “local” anymore and you prefer the database named “hockey”,
you can type “use hockey”. I would also like to point out that if no database exists under
that name, Mongo will create it automatically as soon as you first store data in it.
db.dropDatabase(): this command will erase the database that you are currently using. In
order to facilitate your work, the terminal will provide feedback explaining whether
everything went as expected or not.
Finally, I would like to demonstrate all that I have mentioned by providing a screenshot of
my own terminal, where I used these commands, and the expected output:
Picture 2.7: Connecting with the hockey dataset in Mongo
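For readers who cannot reproduce the screenshot, the lazy-creation behaviour described above can also be sketched in a few lines of plain JavaScript. This is only an illustrative in-memory model with made-up helper names (use, insert, dropDatabase), not the actual mongo shell:

```javascript
// Toy model of the mongo shell's database-switching behaviour:
// "use <name>" only switches context; the database itself is
// materialized on the first write, and dropDatabase() removes it.
const server = { databases: {}, current: "test" };

function use(name) { server.current = name; }  // like typing: use hockey

function insert(collectionName, doc) {
  // the first write creates the database and the collection lazily
  if (!server.databases[server.current]) server.databases[server.current] = {};
  const db = server.databases[server.current];
  if (!db[collectionName]) db[collectionName] = [];
  db[collectionName].push(doc);
}

function dropDatabase() { delete server.databases[server.current]; }

use("hockey");
console.log("hockey" in server.databases);  // false: nothing written yet
insert("players", { name: "sample player" });
console.log("hockey" in server.databases);  // true after the first insert
```

This mirrors why a freshly "use"-d database does not appear until something is stored in it.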
STEP 2: INSERTING JSON OBJECTS INTO COLLECTIONS:
So far we have seen how to create, delete and switch between databases, but we have not
yet looked at manipulating the data inside them. For that it is necessary to mention that a
Mongo database organizes its information into collections. These collections are analogous
to tables with different rows, if we compare them with a normal SQL database. The truth is
that their syntax is more similar to collections in Java, where we insert one object that is
added to the overall collection.
db.collectionName.insert(<data that you want to insert>): in this command the word db
refers to the database that you are currently using (in our example the database named
“hockey”) and collectionName refers to the collection we are inserting into (in our example
“players”). By default, if no collection exists under that name, it will be created. Finally,
you need to pass the data that you want to add to that collection as a parameter in JSON
format.
Here we have a practical demonstration of the previously explained command. The database
is hockey and the collection is players. In the last line we can see that the information has
been added successfully.
Picture 2.8: Inserting data into our players dataset.
Finally, I would like to mention that this only works with one single JSON object. If we
would like to insert more than one object, we need to create an array and separate the
different JSON objects with commas. The basic structure would be something like this:
db.collection.insert ( [ {JSON OBJECT1} , {JSON OBJECT2} , … ] )
I would also like to mention that when introducing more than one JSON object, the output
result will be quite different than with a single one.
Picture 2.9: Output result after including a new row into our players dataset.
As we can see in the previous image, the tool gives us a lot of different information, such as
the number of inserted rows, the number of modified ones and whether there were any errors.
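To make the two payload shapes concrete, here is a small in-memory sketch in plain JavaScript. It is illustrative only: the sample documents are made up, and the returned object merely mimics the idea of the shell's insert result, not its exact fields:

```javascript
// In-memory stand-in for db.collection.insert(): it accepts either one
// JSON object or an array of objects, and reports how many were added,
// loosely mirroring the shell's single vs. bulk insert output.
const players = [];

function insert(collection, data) {
  const docs = Array.isArray(data) ? data : [data];  // normalize both shapes
  docs.forEach(d => collection.push(d));
  return { nInserted: docs.length, writeErrors: [] };
}

// single object, like db.players.insert({ ... })
const r1 = insert(players, { name: "John", age: 21, position: "defenseman" });

// array of objects, like db.players.insert([ { ... } , { ... } ])
const r2 = insert(players, [
  { name: "Peter", age: 30, position: "left wing" },
  { name: "Mark",  age: 34, position: "right wing" },
]);

console.log(r1.nInserted, r2.nInserted, players.length);  // 1 2 3
```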
STEP 3: DELVING MORE DEEPLY INTO COLLECTIONS:
In order to operate on collections and see their contents, we have the following commands:
show collections: this command is self-explanatory. It shows all the collections that we
have created in the database that we are using. By default, it will also include one collection
named “system.indexes”. Here we can see a practical demonstration of it.
Picture 2.10: Accessing collections in Mongo.
db.collection.find(): this command allows us to see all the data that is inside the referred
collection. One of its main disadvantages is that the view of the data is very compact.
Nevertheless, this can be remedied with the function pretty(). We can see the difference
between both in the following image:
Picture 2.11: Showing the differences after applying the »pretty« function.
db.collection.findOne(): it works exactly like find().pretty(), with the main difference that
it only shows the first JSON object of the collection.
db.collection.remove( { id of the object } ): this command removes only one row from the
overall collection. In order to distinguish this object from the others, it is necessary to pass
its identifier.
Picture 2.12: Practical example of removing one row in Mongo.
db.collection.update({identifier},{new object}): this function finds an object inside the
previously named collection and updates it with the information that you provide. As always,
I will give you a practical demonstration of it.
Picture 2.13: Practical example of updating a collection.
db.collection.drop(): it eliminates the previously mentioned collection completely,
including all the data it has inside and the collection itself.
Picture 2.14: Practical example of dropping a collection.
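The behaviour of these collection commands can be mimicked with another short in-memory JavaScript sketch. The helper names and sample data are made up; it only models the behaviour described above, not the real shell:

```javascript
// Toy in-memory collection mirroring find(), findOne(), remove() by id
// and drop().
const db = { players: [
  { _id: 1, name: "John",  age: 21 },
  { _id: 2, name: "Peter", age: 30 },
] };

const find    = (coll) => db[coll];                 // all documents
const findOne = (coll) => db[coll][0];              // first document only
const remove  = (coll, id) =>
  (db[coll] = db[coll].filter(d => d._id !== id));  // remove one row by its id
const drop    = (coll) => { delete db[coll]; };     // whole collection gone

console.log(findOne("players").name);  // John
remove("players", 1);
console.log(find("players").length);   // 1
drop("players");
console.log("players" in db);          // false
```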
STEP 4: HOW TO MAKE QUERIES IN THE COLLECTIONS:
db.collection.find/findOne({“parameter”: value}): we have already seen these commands
when we wanted to get the overall output of the entire collection. Still, we can call them
with conditions, even two at the same time. As a practical example, let’s show all the players
that satisfy two conditions: first, the position is defenseman, and second, the age has to be
twenty-one.
Picture 2.15: Sample query in Mongo.
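Listing both fields in one query document means that Mongo applies the conditions together, as an implicit AND. A plain-JavaScript sketch of that matching rule, using made-up sample players (illustrative only, not the real query engine):

```javascript
// Implicit AND: a document matches when every field in the query
// document equals the corresponding field of the candidate document.
const players = [
  { name: "John",  age: 21, position: "defenseman" },
  { name: "Peter", age: 30, position: "defenseman" },
  { name: "Mark",  age: 21, position: "center" },
];

const matches = (doc, query) =>
  Object.keys(query).every(k => doc[k] === query[k]);

// like: db.players.find({ "position": "defenseman", "age": 21 })
const result = players.filter(p =>
  matches(p, { position: "defenseman", age: 21 }));

console.log(result.map(p => p.name));  // [ 'John' ]
```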
$or: [condition1 , condition2]: lastly, I would like to talk about another type of query.
Sometimes we don’t want to be that strict and we would like to show all of the rows that
satisfy one condition or another. For those cases we need to introduce the variable “or”,
followed by an array; each element of the array is an object, separated by commas, that tells
Mongo which conditions can be accepted.
As always, the best way to understand it is a practical example with our previously
mentioned collection “players”. In the next image we can see how the “find()” and “pretty()”
functions are combined with the variable “or” in order to get what we want, which is in this
case to show all the players that play in the position of left wing or right wing.
Picture 2.16: Executing a query in our players dataset
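Modelled the same way, the $or variant accepts a document as soon as any of the sub-query documents in the array matches it completely. Again this is an illustrative in-memory sketch with made-up data, not the real query engine:

```javascript
// $or: the document is accepted if at least one of the sub-query
// documents in the array matches it.
const players = [
  { name: "John",  position: "defenseman" },
  { name: "Peter", position: "left wing" },
  { name: "Mark",  position: "right wing" },
];

const matchesPlain = (doc, q) =>
  Object.keys(q).every(k => doc[k] === q[k]);

const matchesOr = (doc, orList) =>
  orList.some(q => matchesPlain(doc, q));

// like: db.players.find({ $or: [ { "position": "left wing" },
//                                { "position": "right wing" } ] })
const wings = players.filter(p => matchesOr(p, [
  { position: "left wing" },
  { position: "right wing" },
]));

console.log(wings.map(p => p.name));  // [ 'Peter', 'Mark' ]
```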
Other variables for comparison: so far we have seen a lot of different types of queries
combining the logical operators OR and AND. Nevertheless, we can delve deeper with other
variables that allow us to make numerical comparisons, for example $gt: value. This
expression is used if we want to establish the condition that a chosen parameter should be
greater than the specified value. As a practical example, this is the query that we need to
type if we want to show all the players whose age is greater than 30:
db.players.find(
{ "age" : {$gt:30}}
).pretty()
Furthermore, I would also like to mention the variable $gte, which works exactly the same
but with the difference that it will show all the rows where the age is greater than or equal
to 30:
db.players.find(
{ "age" : {$gte:30}}
).pretty()
Finally, there are another two complementary variables: $lt shows all the rows where the
value is lower than the specified one.
db.players.find(
{ "age" : {$lt:30}}
).pretty()
As you can guess, we also have $lte in order to show all the rows whose value is lower than
or equal to the specified one, which is in this case age 30:
db.players.find(
{ "age" : {$lte:30}}
).pretty()
Finally, I would like to mention one more variable, $ne. In our practical example it shows
all those players whose age is NOT EQUAL to 30; for example, 29 would be accepted as well
as 31.
db.players.find(
{ "age" : {$ne:30}}
).pretty()
In addition, I would like to mention that we have not customized the queries as much as we
could. In all our previous cases we receive all the information from the matching rows; the
counterpart in SQL syntax would be "SELECT * FROM". If we want to show only the name, for
example, we need to specify a projection with the syntax { "parameter" : 1/0 }, where one
or zero indicates whether a parameter should be shown or not.
Finally, to clear things up: by default, as we saw in the previous queries, MongoDB shows
all the parameters that the collection contains, but once you start specifying a projection,
only the specified parameters are shown, plus the "_id", which is included by default and
has to be explicitly excluded if you do not want to see it:
Here is a practical example using our previous players collection: a query returning all the
players that play in the center position, showing only their names and hiding their id:
Picture 2.17: Executing a different query in our players dataset.
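The projection rules described above (listed fields are kept, "_id" stays unless explicitly set to 0) can be sketched in plain Python; the helper and sample document are hypothetical:

```python
def project(doc, projection):
    # Inclusion-style projection: fields marked 1 are kept, and "_id" is
    # kept by default unless the projection explicitly sets it to 0.
    keep = {k for k, v in projection.items() if v == 1}
    if projection.get("_id", 1) != 0:
        keep.add("_id")
    return {k: v for k, v in doc.items() if k in keep}

doc = {"_id": 7, "name": "Craig Adams", "position": "Center", "age": 38}
only_name = project(doc, {"name": 1, "_id": 0})   # hides the id
with_id = project(doc, {"name": 1})               # id shown by default
```

This mirrors the behaviour in the image: `{"name": 1, "_id": 0}` leaves only the name, while `{"name": 1}` alone still carries the "_id" field along.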
To finish this step, I would like to mention another two functions that, while not always
necessary, can be helpful in some contexts. The first is limit(number), which returns only
the first given number of rows and ignores the rest. For example, if in our previous query
we want to show only the first three rows, the query looks like this:
db.players.find( {"position":"Center"},{"name":1, _id:0} ).limit(3)
Another complementary function is skip(number). It has no exact counterpart in SQL: instead
of limiting the output, it ignores the first given number of rows and shows the rest. For
example, if there are ten rows and we use skip(3), it will show the last seven rows,
ignoring the first three. Here is a practical example:
db.players.find( {"position":"Center"},{"name":1, _id:0} ).skip(3)
STEP 4: USING INDEXES:
In order to understand why indexes are useful in this type of database, we need an overview
of how MongoDB works internally. When you run a query with one condition, MongoDB loops over
each row in the collection, checks whether that row satisfies the condition, and prints its
information if it does.
The best way of understanding the procedure is, as always, with an example: in our previous
collection, which contains around twenty players, suppose we want to show all of those whose
age is lower than twenty-one. MongoDB will go through the players one by one, checking
whether each player's age is lower than twenty-one.
Because it is only a sample database with a very limited number of rows, we see the result
in milliseconds. But what happens if the project uses a database with one hundred thousand
customers? In that case the performance will be very poor and it will take a long time until
the first results appear. For such cases it is very useful to know how to use indexes,
because on a real, large database the difference in performance is huge.
Finally, I would like to point out that for the end user it is difficult to appreciate
exactly how much time an index saves. There is, however, a way of seeing how long a query
takes to execute: appending the function .explain("executionStats") to it.
Now that we have seen why indexes are used and in which type of databases we should use
them, let us continue with the main commands for manipulating them:
db.collection.ensureIndex({parameter:1}): creates an index on the specified parameter.
db.collection.getIndexes(): shows all of the indexes created for the specified collection.
db.collection.dropIndex({parameter:1}): drops the previously created index.
The best way of understanding what I say is with a practical demonstration using the sample
collection players:
Picture 2.18: Getting indexes in the players dataset.
Finally, I would just like to mention that in order to achieve the best performance we need
to use indexes in a sensible way. A quite common mistake is to create an index for every
single parameter in the collection: each additional index degrades the write performance of
that collection. Hence it is recommended to create indexes only on the parameters that are
queried a lot, such as the name, the age and the player position, and to leave out the
unimportant parameters that barely appear in queries. Another disadvantage to take into
account is that every time you update the collection, its associated indexes have to be
updated as well.
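The performance argument can be made concrete with a small Python sketch: a linear scan examines every document, while an index is essentially a precomputed map from field value to matching documents (the thousand synthetic players below are illustrative):

```python
# 1000 hypothetical players; ages cycle through 18..37.
players = [{"name": "P%d" % i, "age": 18 + i % 20} for i in range(1000)]

def linear_scan(docs, field, value):
    # Without an index: every single document must be examined.
    examined, hits = 0, []
    for d in docs:
        examined += 1
        if d[field] == value:
            hits.append(d)
    return hits, examined

def build_index(docs, field):
    # An index maps each field value directly to its documents.
    index = {}
    for d in docs:
        index.setdefault(d[field], []).append(d)
    return index

scan_hits, examined = linear_scan(players, "age", 21)
age_index = build_index(players, "age")
index_hits = age_index.get(21, [])   # direct lookup, no per-document check
```

The scan touches all 1000 documents to find the 50 matches; the index lookup returns the same 50 documents without examining the rest, which is exactly the saving `.explain("executionStats")` makes visible.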
STEP 5: USING GROUPS AND AGGREGATION:
The last topic that I would like to cover about MongoDB commands is grouping. If we want to
group the rows by a given parameter, we can use the $group variable, although it has to be
combined with other variables.
For example, let us write a query on our previous example where we group all the players by
their position and, for each group, count the number of players in that role. Then we need
to combine $group with the $sum variable, with the following syntax:
db.players.aggregate( { $group : { _id : "$position", total : {$sum : 1} } } )
Other times when making groups we need to get the average value of another different
parameter. For example, let’s suppose that we want to get an average age for all the players
grouping them by position. Then we will need to use the variable $avg in the following way.
db.players.aggregate( { $group : { _id : "$position", avgAge : {$avg : $age} } } )
To end this section, I would like to mention the self-explanatory variables $min and $max,
which return the smallest and the largest value in each group. They are used in exactly the
same way as the previous group variables:
db.players.aggregate( { $group : { _id : "$position", maxAge : {$max : "$age"} } } )
db.players.aggregate( { $group : { _id : "$position", minAge : {$min : "$age"} } } )
2.3 Preparation of the datasets
For the purpose of this master thesis I am going to run different experiments on ten
datasets, which serve as examples to show the deployment of big data. In this part of the
report I will mention every one of them and summarize the procedure I followed in order to
load them into MongoDB. In addition, I will show several screenshots to demonstrate that
they were included correctly:
Arrhythmia
Cmu newsgroup clean 1000 sanitized
Diabetic data
Tumor data
Kddcup99
Letter data
Nursery data
Splice data
Wave form data
High school students data
Picture 2.19: Showing all implemented datasets in MongoDB.
ADDING ALL DATASETS TO MONGO
One of the main problems that I found when loading the datasets into MongoDB was the format
of the data. All datasets were in .arff format, so they had to be parsed to JSON in order to
be readable by MongoDB. The solution that I found was the following: first of all, I
downloaded and installed WEKA version 3.6. WEKA is one of the most popular suites of machine
learning software; it was developed at the University of Waikato, it is free software and it
is written in Java. The image below shows how I opened one of the sample databases, about
arrhythmia, in this tool.
Picture 2.20: Showing WEKA interface.
One of the many functions that this tool provides is the conversion of files from arff
format into csv. Because there is no direct conversion between arff and JSON, I found two
different solutions for handling it. The first solution worked with almost all the datasets
except the "cmu" sample data; it consists of parsing them first into .csv format and then
into an array in a JSON-readable format, so they can be successfully included in the
database.
Right after the first step I used another tool, the JSONBuddy desktop application, version
3.3. With this tool I managed to convert the files into the final format without any
trouble, preserving the structure and the content as they were before.
Picture 2.21: Showing the JSONBuddy interface.
Unfortunately, the first solution did not work with all my datasets: the dataset named CMU
is very large and, because of its size, it could not be parsed in the JSONBuddy application.
The second solution therefore consists of first parsing the data into csv, as in the first
solution, and then using the following command to import the data directly into Mongo:
mongoimport --host 127.0.0.1 --port 27017 --db cmu
--collection data --type csv --headerline --file ./cmu.csv -j 256
When importing into Mongo we have to specify the host and the port, in this case
127.0.0.1:27017. After that it is necessary to specify the name of the database and the name
of the collection that will be created. Finally we indicate that the file has a header line
with the names of the fields, give the path of the file, and add one last option, "-j 256",
which solved several CPU processing problems I ran into. As a final result I had the ten
databases imported and ready to work with in MongoDB.
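The header-line import that mongoimport performs can be sketched in plain Python: every csv row becomes one document keyed by the names in the first line (the tiny inline csv and its field names are hypothetical stand-ins for the real export):

```python
import csv
import io
import json

# A tiny stand-in for an exported .csv file; the first line is the header.
csv_text = """name,position,age
Craig Adams,Right Wing,38
Thomas Greiss,Goalie,29
"""

# Each data row becomes one document keyed by the header fields,
# which is essentially what "mongoimport --headerline" does.
docs = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
payload = json.dumps(docs)   # the JSON array form used by the first solution
```

Note that a plain csv import keeps every value as a string (age "38", not 38); mongoimport behaves similarly unless types are specified.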
DESCRIPTION OF THE DATASETS:
In order to give an overview of the content, I will show a summary of all the datasets that
I will use as examples for applying the different big data technologies. The first dataset
contains 452 attributes about 24528 different people.
Picture 2.22: Showing all attributes of the arrhythmia dataset.
The second database gathers data about diabetic people. In this case we have a much smaller
database, but each row has a much larger number of attributes.
Picture 2.23: Showing the attributes of diabetic dataset.
Finally, I will show a summary of the remaining datasets: letters, nursery, splice, high
schools, tumor, waveform and cmu:
Picture 2.24: Showing the attributes of letter dataset.
Picture 2.25: Showing the attributes of nursery dataset.
Picture 2.26: Showing the attributes of splice dataset.
Picture 2.27: Showing the attributes of student dataset.
Picture 2.28: Showing the attributes of tumor dataset.
Picture 2.29: Showing the attributes of waveform dataset.
Picture 2.30: Showing the attributes of cmu dataset.
Picture 2.31: Showing the attributes of kddcup dataset.
3 MASTERING DATA MINING
So far we have seen everything concerning the database; for that, we learned how to use the
MongoDB tool. Unfortunately, this is not enough to achieve the final goal of the project.
The second step is to analyze all the data that we previously loaded on our local host with
a quite popular data mining tool named R [4]. With its help we will be able to analyze our
data much better and to perform operations that we cannot do with the Mongo tool alone.
3.1 Installing and configuring R
So far we have seen how to install, create, manipulate and master the NoSQL database using
the Mongo tool. Unfortunately, this is just one part of the entire task. The next step
towards the goal of this master thesis is to analyze the data contained in the database with
a data mining tool and to be able to run operations from there. For this purpose we will use
the quite famous tool R.
WHAT IS R?
R [5] comprises a software environment for statistical computing and the programming
language that supports it. The software environment was first developed in 1993 and has been
in continuous development ever since; the last stable version is from April 16th, 2015. It
is developed by the R Development Core Team and its paradigm covers array, object-oriented,
imperative, functional, procedural and reflective programming.
The programming language, on the other hand, is very well known among data miners and
developers and is also widely used in other areas such as statistics, data analysis, polls,
surveys and studies of scholarly literature databases. The whole R project has been released
under the GNU General Public License. Its source code is written primarily in C, Fortran
and R.
Hence R is available for free, with different versions depending on the operating system. I
would like to point out that even though this tool supports some graphical front-ends, it
works primarily on a command line interface, allowing the user to work faster, not to waste
resources, and to run it on any machine regardless of its characteristics.
STATISTICAL AND PROGRAMMING FEATURES:
R can be combined with a large number of supported libraries. Together they implement a wide
variety of statistical and graphical techniques.
Among them we can highlight the classical statistical tests [6], linear and nonlinear
modeling, classification, clustering and so on. The reason for this amount of libraries is
the fact that R is easily extensible through functions and extensions, which can even be
written in its own language.
Regarding the programming features of the R language, I would like to highlight that it is
an interpreted language that supports matrix arithmetic and many data structures such as
vectors, arrays, matrices, data frames and lists. Moreover, R supports procedural
programming with functions and also object-oriented programming with some generic functions.
INSTALLING R:
The best way of getting the R tool is to visit its official page [7]. There we find a quite
plain web page with different download links depending on our operating system. To make it
clear, I will show an image of the official web page where it can be downloaded.
Picture 3.1: Official documentation of R.
After choosing Windows, I downloaded the file corresponding to the latest version at that
moment: 3.2.0 (2015-04-16). My current operating system is Windows 7 and, after following
the straightforward steps of the installer, I installed the 64-bit version.
I am not going to delve very deeply into the R interface, because we already saw how it
works in other subjects at the university and that would be redundant. What's more, I have
already explained the different areas that the programming language covers.
INSTALLING THE RMONGODB LIBRARY:
So far I have installed the main R tool and explained how it works, but if we want to
connect it with our Mongo database, discussed in the previous sections, this is not enough.
To accomplish this part we need to install a library named "rmongodb", which helps us
connect both tools.
Installing this library is quite straightforward and the same as for any other library in R.
We just have to run the command
install.packages("rmongodb")
In order to clear it up I am providing an image of the R interface with its output results at
the beginning of its installation:
Picture 3.2: Installing the packages »rmongodb« in R.
As we can appreciate in the previous image, the installer lets us choose among different
languages. Because it is easier for me, I decided to install it in Spanish, but the
functionality is the same regardless of the language.
Right after that we can see how all the necessary files and packages are downloaded
successfully, followed by a final message informing us that everything went as expected. To
make it clear, I will show a screenshot of my computer at the moment the installation was
successfully completed:
Picture 3.3: Output result after installing »rmongodb«.
With the steps explained in this part we have installed the R tool with its necessary
library. I would just like to mention that there is an alternative way of getting the
library: the install.packages command installs the latest stable version that has been
released, but we can also run the latest development version from the GitHub repository. In
that case it is necessary to run the following commands:
library(devtools)
install_github(repo = "mongosoup/rmongodb")
3.2 Learning to use R with its library rmongodb
In order to connect our R data mining tool with our Mongo database, a couple of requirements
must be met and a couple of lines have to be typed in command mode:
STEP 1: CONNECTING MONGODB TO R
As previously mentioned, we first need to do a couple of things in order to connect both
tools:
Install and run our Mongo database externally, outside of the R tool. To do this we
need to type "mongod" in our command window.
Run the R tool and load the "rmongodb" library that we installed in the previous
section. For that we just need to type the following command: "library(rmongodb)".
I will present a theoretical explanation of the basic mongo commands:
mongo.create(): with this function we connect to a Mongo database server, local or remote,
and get back an object of a class named mongo. This object is used for further communication
over the connection.
mongo.is.connected(variable): this function checks whether the variable is properly
connected to the Mongo database server. If it is connected it returns TRUE, otherwise
FALSE.
A variable of class mongo: if you type the name of a variable of class mongo, it prints all
the basic parameters attached to it, such as host or username.
Following the same scheme as always, after the theoretical explanation I would like to
clarify it by showing an image with the commands used in practice:
Picture 3.4: Connecting to the Mongo datasets from the R IDE.
STEP 2: GETTING BASIC INFORMATION ABOUT THE DATABASE AND THE COLLECTIONS:
In order to get the list of all the databases or collections in Mongo, we need to learn
these quite straightforward commands:
mongo.get.databases(variable): typing this command gives us the list of all databases found
by the object, which has to be of class mongo.
mongo.get.database.collections(mongo, db): on the other hand, if what we want is the list of
all collections, we have to type this second line of code.
mongo.count(mongo, coll): as its name implies, it counts the number of elements in the
previously specified collection.
I would also like to point out that the best way of accessing the information in the
database is to first check whether the variable has been correctly connected and, only if it
has, to access the information. So, as you can guess, we need to write a couple more lines
of code and the final result should look like this:
if(mongo.is.connected(mongo) == TRUE) {
mongo.get.databases(mongo)
}
if(mongo.is.connected(mongo) == TRUE) {
db <- "hockey"
mongo.get.database.collections(mongo, db)
}
if(mongo.is.connected(mongo) == TRUE) {
coll <- "hockey.players"
mongo.count(mongo, coll)
}
Finally, I will provide an image with a practical use of these functions when connecting R
with the Mongo databases on my local host. We can see that we still have the hockey database
that we used as an example in the previous sections:
Picture 3.5: Accessing databases and collections from R.
STEP 3: FINDING SOME DATA:
So far we have obtained only basic information, but now we will delve deeper into the
advanced options in order to retrieve exactly the information we want. For that we need to
learn the following commands:
mongo.find.one(mongo, coll): this command finds the first record inside the previously
specified collection that matches the query.
mongo.distinct(mongo, coll, key): it finds all the distinct elements in the specified
collection, according to the given key.
Once more we have the same issue as before, and it is better to encapsulate the queries in
an if statement to avoid possible errors, so these are the final lines of code that we need
to write:
if(mongo.is.connected(mongo) == TRUE) {
mongo.find.one(mongo, coll)
}
if(mongo.is.connected(mongo) == TRUE) {
res <- mongo.distinct(mongo, coll, "name")
head(res, 2)
}
if(mongo.is.connected(mongo) == TRUE) {
cityone <- mongo.find.one(mongo, coll, '{"name":"Craig Adams"}')
print( cityone )
mongo.bson.to.list(cityone)
}
Finally, I would like to show the output results that I got after typing those commands
into my R interface. I would like to mention that after the last command,
mongo.bson.to.list(cityone),
I got all the information relative to that object, not only the "_id"; but since it is very
long and not very necessary, I decided not to show it in the image and to expose only the
most meaningful information:
Picture 3.6: Executing some queries on our sample data from R.
STEP 4: CREATING BSON OBJECTS.
mongo.bson.from.list: This function converts a list into a BSON object. The process is very
natural because lists in R are very similar to the JSON objects in the Mongo database. I
would also like to point out that this process internally calls other functions such as
mongo.bson.buffer.create, mongo.bson.buffer.append and mongo.bson.from.buffer.
mongo.bson.from.JSON: alternatively, this function can be used if we want to create a BSON
object from a JSON string. It has the same result as the previous one.
mongo.bson.from.buffer: as you can guess, it creates a BSON object from a buffer built with
the buffer functions mentioned above. This is the last alternative option for creating BSON
objects.
Here are the lines of code with the correct use of the previously mentioned functions:
query <- mongo.bson.from.list(list('city' = 'COLORADO CITY'))
query <- mongo.bson.from.list(list('city' = 'COLORADO CITY', 'loc' = list(-112.952427,
36.976266)))
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "city", "COLORADO CITY")
query <- mongo.bson.from.buffer(buf)
mongo.bson.from.JSON('{"city":"COLORADO CITY", "loc":[-112.952427, 36.976266]}')
date_string <- "2014-10-11 12:01:06"
query <- mongo.bson.from.list(list(date = as.POSIXct(date_string, tz='MSK')))
Finally, I would like to provide a screenshot with a practical demonstration of these
functions:
Picture 3.7: Executing some queries from R.
STEP 5: EXAMPLE OF ANALYSIS.
In order to perform our first analysis, we will use our example collection named "coll",
which contains the hockey players. We will also need functions like mongo.distinct, which
gives us a vector with all the distinct values according to the given key. What's more, we
will use another two functions that are not from the library but are still useful for
representing our data graphically.
As a practical example we will take the collection of hockey players and analyze their age.
For that we use the following commands:
if(mongo.is.connected(mongo) == TRUE) {
pop <- mongo.distinct(mongo, coll, "age")
hist(pop)
boxplot(pop)
}
With these lines of code we first check whether we are correctly connected to the database
and, if so, we get the ages of the players inside the collection. After that we represent
this data in two plots: a histogram for the frequencies and a boxplot for a box-and-whisker
representation of the data.
I managed to get the following output results:
Picture 3.8: Graphics showing the results after executing some queries.
As we can see, with these plots we can analyze the average and the frequency of the
different age ranges much better. Finally, to end our analysis, I would like to find all the
players that are older than 18, which means they are adults, and look at the oldest among
them. For that I used the following code:
nr <- mongo.count(mongo, coll, list('age' = list('$gte' = 18)))
print( nr )
pops <- mongo.find.all(mongo, coll, list('age' = list('$gte' = 18)))
head(pops, 1)
Picture 3.9: Executing »count« function and »head function«.
As we can see, there are twenty-six adult players in the list, and we also got the
information of the first player in the list.
STEP 6: CHANGING THE DATABASE FROM R
To achieve this step we need one of the functions that we already saw before,
mongo.bson.from.JSON, which allows us to create a BSON object that we will add to our
collection a couple of lines later. After that we need the function mongo.insert.batch in
order to insert the previously created objects into the specified collection. Finally, in
the last step, we verify that the data has been successfully added to our collection.
a <- mongo.bson.from.JSON( '{"position":"Goalie", "id":8471306, "weight":220,
"height":"6 1", "imageUrl":"http://1.cdn.nhle.com/photos/mugs/8471306.jpg",
"birthplace":"Fussen, DEU", "age":29, "name":"Thomas Greiss", "birthdate":"January
29, 1986", "number":1 }' )
b <- mongo.bson.from.JSON( '{"position":"Goalie", "id":8471306, "weight":220,
"height":"6 1", "imageUrl":"http://1.cdn.nhle.com/photos/mugs/8471306.jpg",
"birthplace":"Fussen, DEU", "age":29, "name":"Thomas Greiss", "birthdate":"January
29, 1986", "number":1 }' )
icoll <- paste("hockey", "players", sep=".")
mongo.insert.batch(mongo, icoll, list(a,b) )
dbs <- mongo.get.database.collections(mongo, "hockey")
print(dbs)
mongo.find.all(mongo, icoll)
In this case we have added the JSON objects named a and b to our hockey players collection.
To prove the effectiveness of these lines of code, I will provide the screenshots that I got
in my own RStudio interface.
Picture 3.10: Converting to BSON format in R.
Finally, after the command mongo.find.all I got every single object in the collection. To
demonstrate that the data has been added successfully, I will show the information of the
last object in the collection. In the image I have highlighted the information of that
object in red, and we can see that it is the same information that we previously loaded into
the JSON object we created: for example, the position is goalie and the birthplace is
Fussen, DEU.
Picture 3.11: Output result after executing queries on our sample data in R.
APPLYING OUR KNOWLEDGE TO ONE OF OUR DATASETS
For this section we run our Mongo database and connect it with RStudio. This time we will
analyze the data in our student database, making queries in order to extract some useful
information. I am not going to delve very deeply into the commands used, because they were
covered in the previous section; instead, I will directly write down the commands used and
the output results.
QUESTION 1: WHICH GENDER DISTRIBUTION DO WE HAVE IN THE HIGH SCHOOLS?
To answer this I used the following commands, which count the rows where FIELD2 contains
the character "F" for females or "M" for males:
females.count <- mongo.count(mongo, coll, list(FIELD2="F"))
print(females.count)
males.count <- mongo.count(mongo, coll, list(FIELD2="M"))
print(males.count)
counts <-c(females.count,males.count)
barplot(counts, main="Gender Distribution",names.arg=c("Females", "Males"))
The final results showed that of the 649 students, 383 are girls and the other 266 are boys.
This means that the girls are the majority in the high school, with 59.01% of the total
students; the boys make up the other 40.99%.
Picture 3.12: Sample using the »count« function in »rmongodb«.
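The percentages follow directly from the two counts returned by mongo.count; a quick check of the arithmetic in Python:

```python
# Counts taken from the query results above.
females, males = 383, 266
total = females + males

# Share of each gender, rounded to two decimals as reported in the text.
pct_females = round(100 * females / total, 2)
pct_males = round(100 * males / total, 2)
```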
QUESTION 2: DOES THE LEVEL OF EDUCATION OF THEIR PARENTS
INFLUENCE THE MARKS OF THE STUDENTS?
To answer this question there are two attributes in the database (attributes number seven
and eight) that correspond to the level of education of the parents. The first thing I did
was to separate the students into four groups: all the students whose mother or father is at
a certain level of education are grouped together, with possible overlapping when the two
parents have different levels of education. For separating them I used the following
commands:
j12 = '{"$or": [{"FIELD7": "4"}, {"FIELD8": "4"} ] }'
query=mongo.bson.from.JSON(j12)
l1.count <- mongo.count(mongo, coll, query)
print(l1.count)
j12 = '{"$or": [{"FIELD7": "3"}, {"FIELD8": "3"} ] }'
query=mongo.bson.from.JSON(j12)
l2.count <- mongo.count(mongo, coll, query)
j12 = '{"$or": [{"FIELD7": "2"}, {"FIELD8": "2"} ] }'
query=mongo.bson.from.JSON(j12)
l3.count <- mongo.count(mongo, coll, query)
j12 = '{"$or": [{"FIELD7": "1"}, {"FIELD8": "1"} ] }'
query=mongo.bson.from.JSON(j12)
l4.count <- mongo.count(mongo, coll, query)
I would also like to show the output results in my terminal:
Picture 3.13: Executing some experiments in R.
QUESTION 3: HOW MUCH TIME ON AVERAGE DO THE STUDENTS DEDICATE
TO THEIR STUDIES PER DAY?
First I will show the different commands that I used to calculate this data. In this case the
needed functions are quite similar to those used in the first question.
v1 <- mongo.count(mongo, coll, list(FIELD14="1"))
v2 <- mongo.count(mongo, coll, list(FIELD14="2"))
v3 <- mongo.count(mongo, coll, list(FIELD14="3"))
v4 <- mongo.count(mongo, coll, list(FIELD14="4"))
variables <-c(v1,v2,v3,v4)
boxplot(variables)
barplot(variables, main="Amount of study time (h)",names.arg=c("1h ", "2h ","3h ","4h"))
hist(variables, main="Amount of study time (h)")
Below are some screenshots with the output results that I got in the graphs. We can see that
most of the students (305 out of 649) study 2 hours per day, and that the average study time
over all of them is 1.93 hours:
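The 1.93-hour figure is simply a weighted mean of the four category values (1 to 4 hours) using the counts per category. A minimal sketch follows; only the 305 students in the 2-hour group come from the text, the other counts are hypothetical placeholders, so with the real v1..v4 from the commands above the result would be the reported 1.93:

```r
# Weighted mean of the study-time categories. Only the 2-hour count (305)
# is stated in the text; the other counts are hypothetical placeholders
# that sum with it to the 649 students.
counts <- c("1h" = 200, "2h" = 305, "3h" = 100, "4h" = 44)
avg.study <- weighted.mean(1:4, w = counts)
round(avg.study, 2)
```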
Picture 3.14: Graphic showing the output results of the experiment.
Picture 3.15: Graphic with bars showing the output results of the experiment.
4 MASTERING THE HADOOP
4.1 What is big data and Hadoop?
WHAT IS BIG DATA [8]?
The term big data describes a huge volume of data, structured or not, that inundates a
business on an everyday basis. The amount of data itself is not the most important thing;
what really matters is what organizations do with it.
Secondly, I would like to point out that there is no exact size in bytes at which data
becomes big data; three main properties characterize it: the velocity at which the
information is produced and accessed, the volume of data, and its variety. Two additional
dimensions are variability and complexity.
Finally, to end this brief introduction to big data, I would like to explain briefly why it
is important and in which fields it is being used in today's world. Big data is important
because it allows you to save costs, reduce time, optimize your offering, make product
development easier and, when combined with high-powered analytics, make smart decisions.
Big data has applications in today's world in fields like banking, education, government,
health care, manufacturing and retail.
WHAT IS HADOOP [9]?
Hadoop is a software project developed by Apache and released as open source. Its main
purpose is to enable distributed processing of large data sets across clusters of commodity
servers. Its design focuses on scaling from a single server up to thousands of machines.
One of the main points of this technology is its good fault tolerance, which means that the
system as a whole tolerates a high degree of failures and unfortunate circumstances.
This fault tolerance is not achieved by relying on the hardware: the resilience of the
clusters comes from the ability of the software to detect and handle the failures that
occur in the application layer.
After that I would also like to mention that the Hadoop architecture described in [9] is
divided into three main layers: the ODP Core, which consists of a standalone interface; the
IBM Open Platform with Apache Hadoop; and the IBM Hadoop ecosystem.
Finally, I would like to mention the main features that Hadoop offers when operating with
big data. The first is its scalability: it can work with very large amounts of data. The
second is its low-cost architecture. The third, as mentioned before, is its good fault
tolerance. Finally, I would like to point out its flexibility: the tool can manage both
structured and unstructured data, and it is very easy to join and aggregate multiple
sources for a deeper analysis.
WHAT IS HDFS [10]?
HDFS is an acronym for Hadoop Distributed File System, and it follows the distributed file
system design. The main advantages of this distributed system are its low-cost hardware and
its fault tolerance. Other features worth highlighting are the ease of access to very large
amounts of data, as the data is stored across multiple machines. In addition, HDFS also
makes applications available for parallel processing.
Secondly, I am going to sum up the main features of HDFS: as previously mentioned, it is
suitable for distributed storage and processing; Hadoop provides a command interface to
interact with HDFS; and it offers streaming access to file system data.
Finally, to end this section, I am going to talk about two different elements that are part
of the HDFS architecture: the name node and the data node.
Name node: the commodity hardware that contains the GNU/Linux operating system and the name
node software. The system hosting the name node acts as the master server, and among its
tasks we can highlight managing the file system namespace, regulating clients' access to
files, and executing file system operations such as opening and closing files and
directories, renaming, and so on.
Data node: the commodity hardware that has the GNU/Linux operating system and the data node
software; its role is managing the data storage of its system. Among its main tasks we can
highlight performing read-write operations on the file system, and performing block
operations such as creation, deletion and replication according to the instructions of the
name node.
4.2 Installing and configuring Hadoop
The first thing necessary to install Hadoop is to get and unpack the source code. The files
are available in different places, but in order to rely on a more trusted source I decided
to download them from the official site [11]. I would also like to mention that I will use
Hadoop version 2.4.0 under the Windows 10 operating system. Even if the configuration is a
little more difficult, I decided not to use any virtual machine for that.
INSTALLING HADOOP IN STANDALONE MODE
The first step necessary to run Hadoop on our computer is to install Java. In my case it was
already installed, so what I did was make sure that it works correctly and set up the
following environment variables. We have to open System Properties > Environment Variables,
and then we should see something like this:
Picture 4.1: Window showing the environment variables.
In this menu it is necessary to include a new variable named JAVA_HOME with the value
C:/java, referring to the place in our local file system where this library is located.
Finally, we also need to change the system variable named Path, adding the value C:/java/bin
to the array. After that, I make sure that the Java tools are set up correctly on my
computer by running the following command in the cmd window:
Picture 4.2: Command window showing the current java version installed in my
computer.
As you can see, I have Java version 1.7.0_80 installed on my computer. Now let's move to the
next step, where we go back to the environment variables.
In this step it is necessary to set up a new variable named HADOOP_HOME containing the path
of the directory where we placed Hadoop (in my case C:\hadoop-2.4.0\). After that, it is
also necessary to modify the PATH variable that we already had on the computer. Fortunately
it is possible to set up more than one address in the same variable as long as we separate
them with semicolons, so I also added the Hadoop path (C:\hadoop-2.4.0\bin) to this
environment variable.
Picture 4.3: Output result showing the Hadoop version installed.
As you can see, I have Hadoop version 2.4.0 installed in my local file system.
INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:
To accomplish this part, I am going to change the Hadoop configuration by applying changes
to the different files mentioned below. Finally, I will also demonstrate that the tool is
well configured and ready to use.
Inside the directory C:\hadoop-2.4.0\etc\hadoop we can find the following files that we
will change:
core-site.xml:
For this file I will add the following property within the already existing configuration tag:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
In this case we also need to add some information within the configuration tag. I would
also like to mention that we assume that the name node and the data node use the following
paths: C:/hadoop/hadoopinfra/hdfs/namenode and
C:/hadoop/hadoopinfra/hdfs/datanode
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml:
We also need to add the following configuration to this file:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml:
We can also find a file named “mapred-site.xml.template”; we need to rename it to
“mapred-site.xml” and add the following configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
This finishes the configuration for the pseudo-distributed mode; I would also like to
mention that the configuration can be slightly different in older or newer versions.
Finally, I would like to verify that my Hadoop is working correctly. On the one hand it is
useful for me to know that everything works fine so far, and on the other hand it also
serves as a demonstration in this report.
The first verification is the name node setup: for that I navigate in the command prompt to
the folder hadoop-2.4.0/bin and type the command “hdfs namenode -format”.
Picture 4.4: Output results after executing »hdfs namenode -format« command.
Now I want to start the HDFS (Hadoop distributed file system) by running the commands
“hdfs namenode” and “hdfs datanode”.
In order to execute both commands successfully I had to handle different issues in the
configuration of Hadoop. On the one hand, the native libraries for Windows were not
included, so I downloaded [11] and placed them in the bin folder of Hadoop.
On the other hand, to run Hadoop on the Windows operating system it is necessary to solve
two main problems. First, it is necessary to install and correctly configure the following
tools: the Software Development Kit version 10, which is the only one compatible with my
operating system, Windows 10; Maven, whose files must be downloaded and extracted [12] into
the path C:/maven; and Protocol Buffers [13] version 2.5.0, downloaded and extracted into
the path C:/protobuf.
I also needed to add the following environment variables, and the following entry to the
PATH:
Picture 4.5: Environment variables of my computer.
Finally, I had to solve an incompatibility between the Java version of my computer and the
Java version that Hadoop used: they were not the same (1.7 versus 1.8), and one of them was
a 64-bit build whereas the other was 32-bit. There are different solutions to this problem;
the one I chose was to use the 32-bit build of the newer Java version 1.8 for both tools.
Right after that I managed to make HDFS work in a successful and satisfactory way, and in
order to demonstrate that I provide a screenshot of the daemons' activity while they are
running:
Picture 4.6: Output results after running Hadoop.
After that I ran the commands “yarn resourcemanager” and “yarn nodemanager”.
Because of the configuration we made with the environment variables, I do not have to run
the commands from any specific path; the system itself locates the files.
In order to demonstrate that I managed to get them running correctly, I show a screenshot
of these commands running, and I open the URLs http://localhost:50070/ and
http://localhost:8088/, where we can see the basic configuration of Hadoop in our system.
Picture 4.7: Output results after executing »yarn«.
Finally, I am also going to show two more screenshots demonstrating that I have both
services running on my computer:
Picture 4.8: Initial page after running Hadoop.
Picture 4.9: Initial page showing the cluster configuration of Hadoop.
Finally, I would like to mention another problem that we will face on future occasions if
we format the file system again: we can get errors with the cluster ID. To solve this we
need to get the cluster ID from the name node and add it to the format command. For
example, in my case:
hdfs namenode -format -clusterId CID-8bf63244-0510-4db6-a949-8f74b50f2be9.
In this way we will be able to format Hadoop again and run it without any problems.
4.3 Deployment of R algorithms in Hadoop
So far we have managed to install and run R separately and connect it with MongoDB, and in
the previous section we managed to run the Hadoop tools on our computer successfully. The
goal of this section is to configure R in such a way that it is able to connect with our
Hadoop; once we accomplish that, the technical part of this master thesis is complete. To
get there we will need to follow a couple of steps:
STEP 1: CONFIGURATION OF THE ENVIRONMENT
The first thing that we need to do is to open our R (in our case version 3.3) and run the
following commands:
Sys.setenv("HADOOP_CMD"="/hadoop-2.7.1/bin/hadoop.cmd")
Sys.setenv("HADOOP_PREFIX"="/hadoop-2.7.1")
Sys.setenv("HADOOP_STREAMING"="/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-
streaming-2.7.1.jar")
Sys.setenv("HADOOP_HOME"="/hadoop-2.7.1")
These commands set up the variables that refer to the Hadoop binary and the Hadoop
streaming JAR, whose paths can change depending on the version of Hadoop that you have.
Finally, we want to be sure that our variables were set up correctly, so this time we use
the following command:
Sys.getenv("HADOOP_CMD")
After that we should see the path that we previously entered, without any trouble. Here is
a screenshot of the command shell with all the commands that have been mentioned so far.
Picture 4.10: Initial configuration of Hadoop in R.
STEP 2: INSTALLING THE NECESSARY PACKAGES
The first thing we need to do is to download [14] and install Rtools. In our case the
latest version available was 3.3, so that is the one I downloaded and installed.
Once we have the necessary tools, we need to install nine different packages that are going
to be used. For that we use the command install.packages, and inside the parentheses we
create a vector naming all the packages that will be automatically installed.
install.packages(c("rJava","Rcpp","RJSONIO","bitops","digest","functional","stringr",
"plyr","reshape2"))
Finally, we also need to install the three main packages that compose RHadoop: rhdfs, rmr2
and rhbase. For that we need to type the following commands:
library(devtools)
install_github("rmr2", "RevolutionAnalytics", subdir="pkg")
install_github("rhdfs", "RevolutionAnalytics", subdir="pkg")
install_github("rhbase", "RevolutionAnalytics", subdir="pkg")
STEP 3: RUNNING FIRST TESTS TO CHECK THAT EVERYTHING WORKS AS
EXPECTED
First of all, I would like to briefly describe the role of the rmr2 package: its main
function is to perform statistical analysis in R via Hadoop MapReduce functionality on a
Hadoop cluster. Secondly, to show that rmr2 works correctly on my computer, I ran the
following basic commands:
library(rmr2)
from.dfs(to.dfs(1:100))
from.dfs(mapreduce(to.dfs(1:100)))
If everything has been set up correctly, you should not see any errors; instead you should
see an output like this:
Picture 4.11: Executing »mapreduce«.
I would also like to briefly describe the role of the rhdfs package. Its main task is to
provide basic connectivity with HDFS, the Hadoop distributed file system. With this package
you can perform different operations such as reading, writing and modifying files stored in
HDFS, and so on.
Finally, I am going to run a simple test showing whether the rhdfs package works correctly.
In order to verify that, I run the following commands in R:
library(rhdfs)
hdfs.init()
hdfs.ls("/")
These are the output results that I got; because I did not receive any error, I assume that
everything works correctly.
Picture 4.12: Basic usage of »rhdfs« library.
STEP 4: MAKING ADVANCED TESTS
Once we have proved that our RHadoop libraries work correctly, we are going to perform
different tests in order to explain how to implement machine learning algorithms in R on
Hadoop with the data extracted from our Mongo database.
The first thing we are going to do is operate on the Hadoop distributed file system with
rhdfs. For that, the following prerequisites are needed:
To start the following tasks in command mode: hdfs namenode, hdfs datanode,
yarn resourcemanager and yarn nodemanager.
To have imported and initialized all the libraries inside R that are needed to perform
the job correctly.
To initialize the environment variables HADOOP_CMD and
HADOOP_STREAMING.
Once we have all that, we can run the following commands. The first thing we do is write a
file called iris.txt to our HDFS:
library(rhdfs)
hdfs.init()
f = hdfs.file("iris.txt","w")
data(iris)
hdfs.write(iris,f)
hdfs.close(f)
f = hdfs.file("iris.txt", "r")
dfserialized = hdfs.read(f)
df = unserialize(dfserialized)
df
hdfs.close(f)
After that, the following screenshot shows my output results:
Picture 4.13: Unserializing data with »rhdfs«.
I would also like to demonstrate and explain the function of different commands from this
library that we will use in later sections.
Picture 4.14: Executing different »hdfs« commands.
hdfs.ls("./"): lists the files and directories in HDFS
hdfs.copy("name1","name2"): copies a file from one HDFS directory to another
hdfs.move("name1","name2"): moves a file from one HDFS directory to another
hdfs.delete("file"): deletes the file passed as a parameter
hdfs.get("name1","name2"): downloads a file located in HDFS to the local storage of
your computer
hdfs.rename("name1","name2"), hdfs.chmod("name1","name2") and
hdfs.file.info("./") are self-explanatory
Finally, I am also going to run some more tests in order to try out both libraries, rmr2
and rhdfs, at the same time. For that I am going to show two different examples.
In both cases the first thing I need is to import all the libraries. In some cases I am not
completely sure that they are all necessary, but what I am sure of is that it is better to
load them all and that everything works correctly this way.
library(rJava)
library(Rcpp)
library(RJSONIO)
library(bitops)
library(digest)
library(functional)
library(stringr)
library(plyr)
library(reshape2)
library(devtools)
library(methods)
Sys.setenv("HADOOP_CMD"="/hadoop-2.7.1/bin/hadoop.cmd")
Sys.setenv("HADOOP_PREFIX"="/hadoop-2.7.1")
Sys.setenv("HADOOP_STREAMING"="/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-
streaming-2.7.1.jar")
Sys.setenv("HADOOP_HOME"="/hadoop-2.7.1")
Sys.setenv("HADOOP_CONF"="/hadoop-2.7.1/libexec")
After that it is also necessary to import the rmr2 library and the rhdfs library, and to
initialize the rhdfs system in the following way:
library(rmr2)
library(rhdfs)
hdfs.init()
Using MapReduce for the first time: the word count problem:
In this example we use the mapreduce function for the first time, and because of that we
use an example that is as simple as possible. In this case we do not store the output in
any file; we store it in a variable and examine the results in different ways.
In the first line we call the function for the first time, passing as input a txt file that
contains the novel Moby-Dick; or, The Whale. After that we inspect the output of the “a”
variable, and finally we also fetch the contents of this temporary file into another
variable.
Picture 4.15: Using »mapreduce« in Hadoop.
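Since the word-count code is only visible in the screenshot, here is a hedged reconstruction of how such a job looks with rmr2. The file name mobydick.txt is a hypothetical local copy of the novel, and rmr.options(backend = "local") lets the sketch run without the Hadoop daemons:

```r
library(rmr2)
rmr.options(backend = "local")  # drop this line to run on the Hadoop cluster

# Load the text into the DFS abstraction, one element per line.
lines.dfs <- to.dfs(readLines("mobydick.txt"))

# Map: split each line into words and emit (word, 1) pairs.
# Reduce: sum the ones for every word.
a <- mapreduce(
  input  = lines.dfs,
  map    = function(k, v) keyval(unlist(strsplit(v, "\\s+")), 1),
  reduce = function(word, counts) keyval(word, sum(counts))
)
result <- from.dfs(a)  # list with $key (words) and $val (their counts)
head(result$key)
```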
So far we have made the simple example, but now we will make it a bit more complicated. Our
goal now is to process the input and count the length of each single row inside the text
file.
Finally, we show the results in a graph. At the same time, we also need to fetch the
results from the temporary file into different variables in order to be able to represent
them in a graph.
Picture 4.16: Practical example using »mapreduce«.
Picture 4.17: Graphic showing the output result of the previous example.
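The row-length variant can be sketched in the same spirit, again assuming a hypothetical local mobydick.txt and the rmr2 local backend:

```r
library(rmr2)
rmr.options(backend = "local")

# Map each line of the file to its length in characters; no reduce needed.
lens <- mapreduce(
  input = to.dfs(readLines("mobydick.txt")),
  map   = function(k, v) keyval(seq_along(v), nchar(v))
)
row.lengths <- from.dfs(lens)$val

# Represent the distribution of row lengths in a graph.
hist(row.lengths, main = "Length of each row", xlab = "characters per row")
```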
Comparing the performance between a standard R program and an R MapReduce
program:
The first commands implement a standard R program that squares all the numbers.
a.time = proc.time()
small.ints2=1:100000
result.normal = sapply(small.ints2, function(x) x^2)
proc.time() - a.time
In the second part of the exercise we do exactly the same as before, with the difference
that this time we implement it with MapReduce:
b.time = proc.time()
small.ints= to.dfs(1:100000)
result = mapreduce(input = small.ints,
map = function(k,v) cbind(v,v^2))
proc.time() - b.time
Picture 4.18: Set of commands processing data with »rhadoop«.
In the performance comparison we can see that the standard R program outperforms MapReduce
when we are processing small amounts of data. That is normal, because the Hadoop system
needs to spawn daemons, coordinate the job and fetch data from the data nodes; hence the
MapReduce version takes a few seconds more.
Testing and debugging the rmr2 program:
In this example I take a practical approach to some techniques for debugging and testing an
rmr2 program. To achieve that, I made the following steps: first of all, I configured rmr
to run locally; second, I performed the same basic example in order to obtain the squares
of the first million numbers; finally, I printed out the elapsed time and the structure of
the obtained information.
Picture 4.19: Final results after applying »rmr«.
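The debugging steps just described can be sketched as follows; the local backend keeps everything inside the current R session, which makes errors much easier to trace:

```r
library(rmr2)
rmr.options(backend = "local")  # run rmr jobs locally for debugging

t0  <- proc.time()
out <- from.dfs(mapreduce(input = to.dfs(1:1000000),
                          map   = function(k, v) cbind(v, v^2)))
proc.time() - t0  # elapsed time of the local run
str(out)          # structure of the fetched key/value data
```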
5 PERFORMING THE EXPERIMENTS AND ANALYZING
THE RESULTS
So far we have managed to do the following things: first, we installed and used the Mongo
database and stored the ten different datasets that we use as samples; secondly, we
installed our R tool and managed to connect R with Mongo using the library named “rmongodb”
in order to analyze the stored data; finally, we installed Hadoop, a tool that allows us to
manage big data, and in addition we connected our R tool with Hadoop using libraries like
“rmr2” and “rhdfs”.
For each dataset I will explain what you can do with the data, and I will also provide the
results of some experiments:
FIRST DATA SET: ARRHYTHMIA:
WHAT CAN YOU DO WITH THIS DATA SET?
This dataset contains 457 different attributes describing people who had this symptom. It
includes characteristics like age, gender, height and weight, or exactly which kind of
elliptic path the involuntary movements of these people follow. With this data you could
arguably guess whether this disease depends on weight or height, whether it is more likely
at a certain age, or whether it is more usual in males or females, and that could be
helpful for research, for example.
PROVIDED RESULTS:
On the one hand, I will show the age distribution of the people who had arrhythmia
detected. Here are the results:
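The graphs that follow could be produced with a query loop like the following sketch; the collection name and the age field FIELD1 are assumptions (the real field index depends on how the CSV was imported), and an open connection `mongo` is required:

```r
library(rmongodb)

# Fetch every document from the (hypothetical) arrhythmia collection and
# collect the age attribute, assumed here to be stored as FIELD1.
cursor <- mongo.find(mongo, "datasets.arrhythmia")
ages <- numeric(0)
while (mongo.cursor.next(cursor)) {
  doc  <- mongo.bson.to.list(mongo.cursor.value(cursor))
  ages <- c(ages, as.numeric(doc$FIELD1))
}
mongo.cursor.destroy(cursor)

hist(ages, main = "Age distribution", xlab = "age")  # frequency graph
boxplot(ages)  # highlights where the average age falls
```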
Picture 5.1: Results for the experiments of the arrhythmia dataset.
I would like to point out that in this case the dataset does not contain many samples,
which is why the frequency graph shows so few; the size of this dataset lies mostly in the
number of attributes. On the other hand, on the left side we can see that the average age
for having arrhythmia is approximately between 40 and 50 years.
I would also like to provide the gender distribution of the people who have arrhythmia, in
order to know whether one gender is weaker against this symptom than the other.
Picture 5.2: More results about the arrhythmia dataset.
The results show something quite unusual: there is exactly the same number of men as women
who have suffered arrhythmia. Usually one gender would be represented a bit more than the
other, though roughly the same.
SECOND DATASET CMU:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains a lot of different attributes about people; the acronym comes from
“central management unity”, and the data could be helpful for knowing where those people
are from, how many surveys they have completed or how much money they have.
PROVIDED RESULTS:
In this case the performed experiments show the distribution of the accuracy values
(graph 1), the relativity values (graph 2) and the amount of subjectiveness (graph 3).
Picture 5.3: Results for the experiments of cmu dataset.
THIRD DATA SET: DIABETIC DATA:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains 49 different attributes about diabetic people; the gathered features
include race, number of days in the hospital, the clinical specialty that treated the
patient, number of procedures, number of diagnoses and more medical data. It could be
helpful for knowing how the patients respond to the different procedures or how many days
they usually need to stay in the hospital.
PROVIDED RESULTS:
The first experiment that I performed determines how many of them take insulin and what the
gender distribution of those people is.
Picture 5.4: Results for the experiments of diabetic dataset.
In the first case we can see the numbers: 10 of them are female whereas 9 are male, so the
proportion is almost 50% even if it does not look like it at first sight. On the other
hand, in the second graph I would like to point out that in most of the cases there is no
information about this attribute; in the rest of the cases all diabetics take insulin,
which is what one would expect.
FOURTH DATA SET: TUMOR:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset is quite technical; it contains the different features that various tumors
have: which part of the body they affect, their size and their behavior. It can be very
useful to learn from all those features in order to make predictions for future tumors, or
to know which types of tumors are more aggressive and which ones are more likely to
appear.
PROVIDED RESULTS
In this case I performed different experiments in order to examine the average and the
frequency of two of the attributes. The first one is called HG2507-HT2603_at.
Picture 5.5: Results for the experiments of tumor dataset.
We can see that in the first attribute the values lie within a range between -1500 and 400.
Picture 5.6: More results for the tumor dataset.
FIFTH DATA SET: KDDCUP99:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset comes from the data mining and knowledge discovery competition of the year
1999; it contains different features, like which protocol the participants are using or the
number of failed logins. It can be useful, for example, to know which protocols are
becoming more popular among the participants, or to find different error-prone situations
with the contained error data.
PROVIDED RESULTS:
The kddcup_99 dataset has different attributes, and in this case I am going to show the
values that the destination host count and the destination host service count can take.
Picture 5.7: Results for the experiments of kddcup dataset.
SIXTH DATA SET: LETTER:
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains the different features that characters have; we can highlight height,
width, which corners they touch, and so on. It can be useful for some areas of research to
identify the different Roman characters or to compare them with Chinese or Japanese
characters.
PROVIDED RESULTS:
In this case I show the different values that the attributes x-box and y-box can take.
Picture 5.8: Results for the experiments of letter dataset.
SEVENTH DATA SET: NURSERY
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains different information about children who were in a nursery. It can be
useful to analyze different data like health, finances or the number of siblings they have,
in order to know which of them are more likely to go to the nursery and whether it is
related to some of these attributes.
PROVIDED RESULTS:
In this case we look at the different values and frequencies that the database has for the
attributes “parents” and “has_nurs” (whether the child has an auxiliary nurse or not).
Picture 5.9: Results for the experiments of nursery dataset.
EIGHTH DATA SET: SPLICE
WHAT CAN YOU DO WITH THIS DATASET?
The dataset contains 61 different attributes describing DNA splice junctions: which class
of splice junction each sequence belongs to and which features it has. It can be useful for
learning to recognize the boundaries between coding and non-coding regions in gene
sequences.
PROVIDED RESULTS:
In this case I am going to examine two attributes, named “attribute_1” and “attribute_2” in the data set. Both can take four different values: C, A, G and T. We will examine the likelihood of each value for both attributes, and on the other hand we will examine the average, minimum and maximum values.
Picture 5.10: Results for the experiments of splice dataset.
NINTH DATA SET: WAVEFORM
WHAT CAN YOU DO WITH THIS DATASET?
In this case each row of the data set contains forty attributes, each holding one of the points that make up the waveform. It can be useful for knowing which waveforms have been gathered in nature, and also for estimating what a typical waveform graph looks like for different purposes.
PROVIDED RESULTS:
In this case I show the different values that the attributes “x1”, “x2” and “x3” can take.
Picture 5.11: Results for the experiments of waveform dataset.
TENTH DATA SET: STUDENTS DATA
WHAT CAN YOU DO WITH THIS DATASET?
This dataset contains data about high school students. With it you can find out, for example, whether the parents' level of education influences the students' marks, how much the students study, or whether girls get better marks than boys.
PROVIDED RESULTS:
I will show the results that I obtained for the high school dataset. In the first picture we see whether the parents' level of education influenced the students' marks.
Picture 5.12: Results for the experiments of students dataset.
Finally, I also show the gender distribution within the high school students.
Picture 5.13: More results for the experiments of students dataset.
6 APPLYING MACHINE LEARNING ALGORITHMS
Machine learning is a subfield of computer science that evolved from pattern recognition and from computational learning theory in artificial intelligence.
We could define machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed".
This subfield of computer science explores algorithms that are able to learn from data and to make decisions and predictions based on it. I would also like to point out that machine learning is closely related to, and sometimes overlaps with, computational statistics, a discipline that focuses on making predictions with computers.
It also has strong ties to mathematical optimization, which provides methods, theory and application domains to machine learning. Typical applications include spam filtering, computer vision, optical character recognition, search engines and so on.
Finally, the related subfield of data mining focuses more on exploratory data analysis and is known as unsupervised learning.
STEP 1: STARTING WITH MACHINE LEARNING ALGORITHMS
As an introduction to the topic, I am going to apply in R one of the simplest machine learning algorithms, named KNN (k-nearest neighbors); in this case I will apply it to a sample data set named iris. These are the first commands of the procedure, with their explanation:
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
names(iris)
library(ggvis)
iris %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>% layer_points()
First of all, we assign the vector of attribute names. Next, we load a library named “ggvis”, which can draw the more complex graphics that are useful in this case. Finally, I create two plots: they show the clear relationship between the attributes petal length and petal width, while sepal length and sepal width turn out to be not as closely related.
Picture 6.1: Applying machine learning algorithms in R.
Secondly, I typed the following commands:
table(iris$Species)
round(prop.table(table(iris$Species)) * 100, digits = 1)
summary(iris)
The purpose of these commands is to see the distribution of the species attribute, first as absolute counts and then as percentages rounded to one decimal place, and finally to print a summary of the whole data set.
Picture 6.2: Showing main features of iris dataset
Thirdly, we also want to see a summary of the two attributes petal width and sepal width, in order to see the relationship between them and to get a better understanding of the data set we are experimenting with. After that we also prepare the workspace by importing the “class” library. Those two actions are performed by the following commands:
summary(iris[c("Petal.Width", "Sepal.Width")])
library(class)
After that we come to a very important step, named normalization, which makes the data more consistent. Normalization is not always strictly necessary: if the differences between the minimum and maximum values in the data set are small, this step can be skipped, but it is still always advisable. The following commands perform the normalization step:
normalize <- function(x) {
  num <- x - min(x)         # shift so the minimum becomes 0
  denom <- max(x) - min(x)  # range of the attribute
  return (num / denom)      # every value now lies in [0, 1]
}
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
summary(iris_norm)
Picture 6.3: Summary of iris dataset
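As a quick sanity check of the step above, the following self-contained sketch (iris ships with R, so no database connection is needed) verifies that every normalized column ends up exactly in the interval [0, 1]:

```r
# Min-max normalization: rescales each attribute into [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# iris is bundled with R; columns 1 to 4 are the numeric attributes
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))

# Every normalized column now starts at 0 and ends at 1
sapply(iris_norm, min)
sapply(iris_norm, max)
```

Note that if an attribute were constant, max(x) - min(x) would be 0 and the division would produce NaN, so constant columns should be removed before this step.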
The fourth thing we have to do is to prepare the training and the test sets. The first step is to set a random seed, so that the generated random numbers are reproducible. Then the sample function draws a sample whose size equals the number of rows of the iris data set. Finally, we use the vector returned by sample to define our training and our test sets:
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
iris.training <- iris[ind==1, 1:4]
iris.test <- iris[ind==2, 1:4]
Picture 6.4: Applying machine learning algorithms to iris data
Finally, I would like to point out that our training and test sets do not contain all five attributes, only four, because the fifth attribute is the one we actually want to predict. We then apply the “knn” function in order to predict the results. Even if it seems that the work is done, we still need to analyze the results; for that, the first thing we need to do is to import the “gmodels” library.
iris.trainLabels <- iris[ind==1, 5]
iris.testLabels <- iris[ind==2, 5]
iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)
iris_pred
library(gmodels)
Picture 6.5: Splitting the iris dataset into training and test sets.
Finally, we analyze the results and see that the algorithm worked quite well, being right in all cases except one. In order to analyze the results, I typed the following command:
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
Picture 6.6: Final results after applying machine learning on iris dataset.
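The same check can be reproduced without the gmodels library: base R's table() already produces the confusion matrix, and the accuracy follows from its diagonal. This sketch repeats the split and prediction from above; the exact number of errors depends on the chosen seed:

```r
library(class)  # provides the knn() function

set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
iris.training <- iris[ind == 1, 1:4]
iris.test <- iris[ind == 2, 1:4]
iris.trainLabels <- iris[ind == 1, 5]
iris.testLabels <- iris[ind == 2, 5]

iris_pred <- knn(train = iris.training, test = iris.test,
                 cl = iris.trainLabels, k = 3)

# Confusion matrix: rows are the true labels, columns the predictions
conf <- table(iris.testLabels, iris_pred)
accuracy <- sum(diag(conf)) / sum(conf)
accuracy
```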
STEP 2: APPLYING MACHINE LEARNING (REGRESSION TREE) TO OUR
DATASETS
In order to apply machine learning to our datasets we will use a regression tree, because it is one of the most popular and recommended algorithms, and we will apply it to all our datasets with the following command structure:
library(rmongodb)  # the connection object "mongo" is created beforehand with mongo.create()
library(rpart)

coll <- "dataset.collection"
dataset <- mongo.find.all(mongo, coll, data.frame=TRUE)

# keep only the attributes used in the model and drop incomplete rows
raw <- subset(dataset, select=c("x.box", "y.box", "width", "high"))
raw <- na.omit(raw)

frmla <- high ~ x.box + y.box + width
fit <- rpart(frmla, method="class", data=raw)
printcp(fit)  # display the results
plotcp(fit)   # visualize cross-validation results
summary(fit)
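Since the MongoDB connection and the collection name above are specific to my own setup, the same procedure can be sketched in a self-contained way on the bundled iris data; the attribute names naturally differ from those of the letter dataset:

```r
library(rpart)  # recursive partitioning (classification and regression trees)

# Classification tree predicting the species from the four measurements
frmla <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
fit <- rpart(frmla, method = "class", data = iris)

printcp(fit)  # complexity-parameter table with cross-validation results
summary(fit)  # node-by-node description of the fitted tree

# The tree separates the three species far better than chance
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)
```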
DATA SET: LETTER:
After applying the regression tree to this dataset with the previously mentioned commands, I got the following output results:
Picture 6.7: Applying regression tree algorithm to letter dataset
DATA SET: ARRYTHMIA:
After applying the regression tree to the attributes of this dataset, I obtained the following results:
Picture 6.8: Applying regression tree algorithm to arrythmia dataset
DATA SET: DIABETIC:
After applying the regression tree to the attributes gender, weight, race and age, I got the following output results:
Picture 6.9: Applying regression tree algorithm to diabetic dataset
DATA SET: KDDCUP99:
Now I will analyze the different attributes of the kddcup dataset from the year 1999. These are the output results that I got:
Picture 6.10: Applying regression tree algorithm to kddcup dataset
DATA SET: NURSERY:
Now I will apply the regression tree to the attributes of the nursery dataset; these are the output results:
Picture 6.11: Applying regression tree algorithm to nursery dataset
DATA SET: SPLICE:
Now I will apply the regression tree to the attributes of the splice dataset; these are the output results:
Picture 6.12: Applying regression tree algorithm to splice dataset
DATA SET: STUDENT:
Now I will apply the regression tree to the attributes of the student dataset; these are the output results:
Picture 6.13: Applying regression tree algorithm to student dataset
DATA SET: TUMOR:
Now I will apply the regression tree to the attributes of the tumor dataset; these are the output results:
Picture 6.14: Applying regression tree algorithm to tumor dataset
DATA SET: WAVEFORM:
Now I will apply the regression tree to the attributes of the waveform dataset; these are the output results:
Picture 6.15: Applying regression tree algorithm to waveform dataset
7 CONCLUSION
With the introduction of new technologies, devices and different means of communication, such as social networks, the quantity of data being produced is growing very fast year by year. To get a general idea of how much data we create: about 5 billion gigabytes of data were produced from the beginning of time until 2003. The amount of data that applications and technologies have to manage is becoming ever bigger, so there has to be a way of handling this issue. This is where big data comes into play.
Under the name big data we understand a collection of very large datasets that cannot be processed using traditional computing techniques. Furthermore, in recent years big data has ceased to be merely data and has become a subject of its own, involving different tools, techniques and frameworks.
With this master thesis I had the chance of working with big data, which was a good way of appreciating its main benefits first hand. I would like to mention the two main ones:
Using such a big quantity of information allows you to learn about the response to campaigns, promotions and other advertising media.
Using the information gives you more knowledge about your products, which can be useful for planning production or for future decision making.
I would also like to say a few words about Hadoop, the big data technology that I have been using throughout this master thesis. Hadoop is an open-source Apache project that started in 2005, inspired by papers that Google published about its MapReduce and distributed file system technologies. Hadoop is able to run applications using the MapReduce model on different CPU nodes that process the data in parallel. Through the work done here I was able to confirm that Hadoop is a strong solution, with very high fault tolerance and a very scalable approach to big data.
I would also like to give some conclusions and explanations about the other subject of this master thesis, machine learning. Machine learning is a subfield of computer science derived from the study of computational learning and pattern recognition. We could define it as "the field of study that gives computers the ability to learn without being explicitly programmed"; this subfield explores algorithms that can learn from experience and make predictions based on the gathered data.
Furthermore, I would like to point out that machine learning is closely related to the discipline of computational statistics, which also focuses on making predictions, in this case by means of computers.
Once we have a good definition of what machine learning is, I would like to give my own conclusions based on my experience. As I saw throughout the making of this master thesis, machine learning approaches can be applied to many different fields and are in this way able to improve product quality. They are of particular interest given the steadily increasing volume of published results, since keeping existing evidence accessible is a particular challenge of the quality-improvement research field.
At the end, I would also like to talk about the results of analyzing our different datasets. As we saw before, big data analysis helps you to identify the connections between different attributes, to gain a better understanding of already existing data and, at the same time, to make predictions about what future data entries are going to look like. In this way we could identify the different features of a tumor in order to classify it, or find out which features are related to the appearance of arrhythmia. Finally, we could also see other, less medical examples, such as the data about high school students and how its different variables relate to each other.
8 REFERENCES
[1] Official documentation with a general description of MongoDB: http://www.mongodb.org/about/introduction/
[2] Main description and basic features providing a theoretical base for MongoDB, from Wikipedia: http://en.wikipedia.org/wiki/MongoDB
[3] Main MongoDB page, where the necessary tools can be downloaded legally: http://www.mongodb.org
[4] R is a tool for analyzing data; its official web page is http://www.r-project.org/
[5] Basic features, history and information about the different versions of R: https://en.wikipedia.org/wiki/R_(programming_language); a tutorial for getting started with data mining in R: http://bigdatatechworld.blogspot.com/2014/01/video-tutorial-for-data-mining-with.html
[6] A theoretical base with the main information about machine learning algorithms and how they work: http://en.wikipedia.org/wiki/Weka_%28machine_learning%29; a basic guide for getting started with the rmongodb library inside R: https://github.com/selvinsource/mongodb-datamining-shell
[7] Main page with the official documentation for R: http://cran.r-project.org
[8] https://en.wikipedia.org/wiki/Machine_learning, the main source of information for describing the main features of machine learning
[9] The official documentation about Hadoop: http://www.apache.si/hadoop/common/hadoop-2.7.0/
[10] Main tutorial on how to use the Hadoop Distributed File System on Windows: http://wiki.apache.org/hadoop/Hadoop2OnWindows
[11] Main page for downloading the latest Hadoop version: http://hadoop.apache.org/releases.html
[12] The native Windows libraries can be downloaded at the following URL: http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path
[13] The Maven files (version 3.1.1) are needed for the purpose of this master thesis and can be downloaded at the following URL: http://maven.apache.org/download.cgi
[14] Protocol Buffers are needed for the entire system to work correctly and can be downloaded at the following URL: https://github.com/google/protobuf
[15] The R tools can be downloaded at the following URL: https://cran.r-project.org/bin/windows/Rtools/