
Research of Cloud Computing based on the Hadoop platform

ChenHao
Computer Science & Technology Department, Southwest Petroleum University, Chengdu, China
Email: [email protected]

QiaoYing
Computer Science & Technology Department, Southwest Petroleum University, Chengdu, China
Email: [email protected]

Abstract: The application of cloud computing produces massive accumulations of data, so how to manage data and allocate data storage space effectively has recently become a hot topic. Hadoop was developed by the Apache Foundation to process massive data in parallel, and it has become the most widely used distributed platform. This paper presents the Hadoop computing model and the Map/Reduce algorithm, and combines the K-means clustering algorithm with data mining technology to analyze the effectiveness and application of the cloud computing platform.

Key words: Cloud computing; Hadoop; Map/Reduce; Data mining; K-means

I. INTRODUCTION

With the maturation of computer network technology, cloud computing has been widely recognized and applied. IT giants such as Google, IBM, Amazon and Microsoft have launched their own commercial products and have made cloud computing a priority in their future development strategies. This brings a large-scale data problem: online information is growing explosively, and each user may generate a huge amount of data. Meanwhile, transistor circuits are gradually approaching their physical limits, and Moore's law, under which CPU performance doubled roughly every 18 months, is reaching its failure point. Faced with this massive information, how to manage and store data effectively is an important issue.

Hadoop is an open-source framework from the Apache project for building cloud computing platforms. We use this framework to solve these problems and manage data conveniently. It provides two major technologies: HDFS and Map/Reduce. HDFS provides storage and fault tolerance for huge files, while Map/Reduce processes the data through distributed computing.

II. BACKGROUND KNOWLEDGE AND RELEVANT CONCEPTS

A Cloud computing
Cloud computing developed from a variety of network technologies, including parallel computing, distributed computing and grid computing. It carries out tasks virtually in the "cloud", combining many cheap computing nodes into one huge system that provides large computing capacity. In other words, it can be regarded as separating the display terminal from the main engine, as Figure 1 shows:

Figure 1 Cloud Computing Structure

Like a banking system, users can store data and use applications as conveniently as they save and manage money in a bank. The user no longer needs large amounts of hardware as background support; the only requirement is a connection to the cloud.

B Parallel computing

Parallel computing is a method for increasing computing efficiency by solving a problem with multiple resources. The main principle is to split a task into N parts and send them to N computers, so that efficiency is increased roughly N times. However, parallel computing has a serious shortcoming: the parts are often dependent on each other, which is a barrier to its development.

C Distributed computing
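As a minimal illustration of this split-and-combine idea (our own sketch, not part of the original paper), the following Java fragment sums a large range of numbers sequentially and in parallel; the parallel version partitions the range across the available CPU cores and combines the partial sums:

```java
import java.util.stream.LongStream;

// Sketch of the "split a task into N parts" principle using Java parallel streams.
public class ParallelSum {
    public static void main(String[] args) {
        long n = 100_000_000L;

        // Sequential version: one worker processes the whole task.
        long sequential = LongStream.rangeClosed(1, n).sum();

        // Parallel version: the range is split into parts, each part is summed
        // on a different core, and the partial sums are combined.
        long parallel = LongStream.rangeClosed(1, n).parallel().sum();

        System.out.println(sequential == parallel); // true: same result, less wall-clock time
    }
}
```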

The basic principle of distributed computing is consistent with that of parallel computing. Its advantages are strong fault tolerance and the ability to expand computing capacity easily by adding computer nodes. The difference is that the parts are independent of each other, so the failure of a batch of computing nodes does not affect the accuracy of the calculation.

D GFS and HDFS

A GFS cluster consists of a master and many chunkservers, and is accessed by many clients. The structure is shown in Figure 2:

Figure 2 GFS Structure

HDFS is a distributed file system that runs on ordinary hardware; it is an implementation of the GFS architecture described in the Google paper. It adopts a master/slave model: an HDFS cluster consists of one Namenode and many Datanodes, where the Namenode is the central management server, as shown in Figure 3:

Figure 3 HDFS Structure
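To make the Namenode/Datanode interaction concrete, the following sketch (ours, not from the paper) writes and reads a file through the standard Hadoop FileSystem API; the Namenode address and file path are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client sketch: the client asks the Namenode where blocks live,
// then streams data to/from the Datanodes transparently.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical Namenode address; in practice it comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/hello.txt");   // hypothetical path

        // Write: the Namenode allocates blocks, the Datanodes store the replicas.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: block locations come from the Namenode, data from the Datanodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```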

III. THE STRUCTURE OF CLOUD COMPUTING

A cloud computing platform is a powerful "cloud" network that connects a large number of concurrent services and can be extended through virtual servers. Combining all resources through the platform supports huge computing and storage capabilities. The general structure of a cloud computing system is shown in Figure 4:

Figure 4 Cloud Computing Platform

A Hadoop structure
Hadoop, developed by the Apache Foundation, is a basic framework for distributed systems. It allows users to develop distributed software easily, even if they know nothing about the underlying details. HDFS, the base layer, is Hadoop's main storage system and runs on ordinary cluster nodes. It is usually deployed on low-cost hardware and provides high-throughput access to application data, which makes it suitable for programs with large datasets.

B Map/Reduce

Map/Reduce, presented by Jeffrey Dean and Sanjay Ghemawat, is a programming model for massive data computing; it was developed by Google and is a core technology of cloud computing. The model abstracts the common operations on large datasets into Map and Reduce steps, which reduces the difficulty of distributed and parallel programming.

First, the dataset is split into several small parts by Map; these parts are sent to a large number of computers for parallel computation, producing a series of intermediate key-value pairs. Finally, Reduce integrates all the results and outputs them. The structure is shown in Figure 5:

Figure 5 Map/Reduce Process

The Map function is usually written by users according to their different business needs. It processes a key-value pair and produces a new key-value pair as an intermediate result. The Map/Reduce library then groups all intermediate values with the same key together and sends them to the Reduce function.

The Reduce function, like the Map function, is defined by the user. According to the intermediate keys passed to it, it processes the results and merges the values with the same key to produce a smaller set of values.

In summary, the Map/Reduce operation proceeds as follows: the input is split, the Map tasks produce intermediate key-value pairs, the pairs with the same key are grouped together, and the Reduce tasks merge them into the final output.
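As a concrete illustration of these two steps, the classic word-count program counts how often each word appears in a large text collection. The following is our own sketch using the standard Hadoop Java API, not code from the paper:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

    // Reduce step: the framework groups identical words; sum their counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // final (word, count) pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```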

IV. APPLICATION OF CLUSTERING

Clustering algorithms play an important role in data mining, especially in such large-scale data computing systems; clustering is particularly vital in cloud technology, and partitioning data by its characteristics is a key step in the storage and security of cloud computing. There are many clustering algorithms; here we combine K-means with Map/Reduce to discuss the distribution of data on the Hadoop platform.

A K-means algorithm

K-means optimizes an objective function based on the distance between each data point and its cluster center, using Euclidean distance as the similarity measure. The clustering satisfies two rules: objects in the same cluster are highly similar, while objects in different clusters have low similarity, much like the idea of "high cohesion, low coupling" in software engineering. The specific process is as follows:

A) Select K objects as the initial cluster centers;
B) Assign each of the remaining objects to the nearest center according to similarity (distance);
C) Recalculate the center of each cluster to obtain new centers;
D) Repeat steps B) and C) until the criterion function converges.

The mean squared error is usually adopted as the criterion function. For example, we simulate a set of data points and cluster them with K-means using k = 3. The experimental data are shown in Figure 6, and a minimal code sketch of this clustering step follows the figure.

Figure 6 Experimental Data 1 (scatter plot of ClusterOne, ClusterTwo and ClusterThree with their centroids)
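Below is a minimal, self-contained K-means sketch in Java (our illustration, not the authors' code), following steps A)-D) above on simulated two-dimensional points with k = 3:

```java
import java.util.Random;

// Minimal K-means on 2-D points, following steps A)-D) above.
public class KMeans {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int k = 3, iterations = 20;

        // Simulated data set of 300 two-dimensional points.
        double[][] points = new double[300][2];
        for (double[] p : points) { p[0] = rnd.nextDouble() * 5; p[1] = rnd.nextDouble() * 7; }

        // A) Pick k points as the initial centers.
        double[][] centers = { points[0].clone(), points[100].clone(), points[200].clone() };
        int[] assignment = new int[points.length];

        for (int it = 0; it < iterations; it++) {
            // B) Assign every point to its nearest center (squared Euclidean distance).
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centers[c][0];
                    double dy = points[i][1] - centers[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                assignment[i] = best;
            }
            // C) Recompute each center as the mean of its assigned points.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
            // D) In practice, stop when the criterion (mean squared error) converges;
            //    here we simply run a fixed number of iterations.
        }
        for (double[] c : centers) System.out.printf("center: (%.2f, %.2f)%n", c[0], c[1]);
    }
}
```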

On this basis, we can also add disturbance factors to hide the data, giving it higher security. For example, if we apply an F(5, 10) distribution to perturb the original data, the resulting experimental data are shown in Figure 7:

Figure 7 Experimental Data 2 (the same three clusters and centroids after adding the F(5, 10) disturbance)
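To sketch this perturbation step, the fragment below adds disturbance factors drawn from an F(5, 10) distribution to the original coordinates. This is our own illustration, not the authors' procedure; it assumes the Apache Commons Math library is on the classpath, and the data points are a hard-coded sample:

```java
import org.apache.commons.math3.distribution.FDistribution;

// Perturbing the data with F(5, 10)-distributed disturbance factors so that
// the stored values no longer expose the raw data directly (sketch only).
public class Disturbance {
    public static void main(String[] args) {
        FDistribution f = new FDistribution(5, 10);   // F(5, 10) disturbance source

        double[][] points = { {2.1, 3.4}, {3.0, 5.2}, {4.4, 6.1} };  // sample data
        for (double[] p : points) {
            double noise = f.sample();        // draw one disturbance factor
            p[0] += noise;                    // hide the original coordinates
            p[1] += noise;
            System.out.printf("disturbed point: (%.2f, %.2f)%n", p[0], p[1]);
        }
    }
}
```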

B The application of K-means
Consequently, we can use K-means clustering to group data with high mutual similarity. Combined with Map/Reduce in Hadoop, we can then store and use the data in a distributed way in cloud computing. The structure is shown in Figure 8:

Figure 8 Combination Structure

According to the diagram, the process can generally be divided into five steps:

A) In cloud computing, the data are first partitioned into many blocks by the K-means algorithm, ensuring that the data within each block have high similarity.
B) The blocks are distributed again under central control, and a pointer is assigned to each distributed block so that the Map/Reduce operation becomes more convenient.
C) The Master Management node is notified of the location of these small data blocks, which makes it easier for the Master to assign tasks.
D) The Master splits the data and passes it to Map for processing.
E) Map returns the intermediate values to the Master, which then lets Reduce operate. (A sketch of such a Map/Reduce round is given below.)

Compared with the previous algorithm that does not use K-means, this process has more impact on the data distribution: because K-means classifies the data before Map/Reduce runs, the data classes can be managed more effectively.
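To illustrate how one K-means iteration can itself be expressed as a Map/Reduce round, the sketch below (ours, not the authors' implementation) has the Map step assign each point to its nearest current center and the Reduce step recompute that center. The configuration property name "kmeans.centers" and the "x,y" input format are our assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One K-means iteration expressed as a Map/Reduce round (sketch).
public class KMeansRound {

    // Map: assign each input point "x,y" to the nearest current center.
    public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centers;

        @Override
        protected void setup(Context context) {
            // Current centers are broadcast through the job configuration
            // (hypothetical property), e.g. "1.0,2.0;3.5,4.0;6.0,1.5".
            String[] parts = context.getConfiguration().get("kmeans.centers").split(";");
            centers = new double[parts.length][2];
            for (int i = 0; i < parts.length; i++) {
                String[] xy = parts[i].split(",");
                centers[i][0] = Double.parseDouble(xy[0]);
                centers[i][1] = Double.parseDouble(xy[1]);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] xy = value.toString().split(",");
            double x = Double.parseDouble(xy[0]), y = Double.parseDouble(xy[1]);
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centers.length; c++) {
                double dx = x - centers[c][0], dy = y - centers[c][1];
                double d = dx * dx + dy * dy;
                if (d < bestDist) { bestDist = d; best = c; }
            }
            context.write(new IntWritable(best), value);   // (clusterId, point)
        }
    }

    // Reduce: recompute the center of each cluster as the mean of its points.
    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
                throws IOException, InterruptedException {
            double sumX = 0, sumY = 0;
            long n = 0;
            for (Text p : points) {
                String[] xy = p.toString().split(",");
                sumX += Double.parseDouble(xy[0]);
                sumY += Double.parseDouble(xy[1]);
                n++;
            }
            // Emit the new center; a driver would feed it back as "kmeans.centers"
            // for the next round until the centers converge.
            context.write(clusterId, new Text((sumX / n) + "," + (sumY / n)));
        }
    }
}
```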

V. CONCLUSION

Data storage is an important element of cloud computing. This paper discussed HDFS and Map/Reduce, the core technologies of the Hadoop framework. Combining data mining with the K-means clustering algorithm makes data management easier and quicker in the cloud computing model. Even though this technology is still in its infancy, we believe that with continuous improvement, cloud computing will develop in a secure and reliable direction.

