Degree project in Computer Science and Engineering, second cycle, 30 credits
Stockholm, Sweden 2016
A Global Ecosystem for Datasets on Hadoop
JOHAN SVEDLUND NORDSTRÖM
KTH School of Information and Communication Technology
TRITA-ICT-EX-2016:131
Johan Peter Svedlund Nordström
Master of Science Thesis
Software Engineering of Distributed Systems
School of Information and Communication Technology
KTH Royal Institute of Technology
Stockholm, Sweden
11 September 2016
Examiner: Jim Dowling
© Johan Peter Svedlund Nordström, 11 September 2016
Abstract
The immense growth of the web has led to the age of Big Data. Companies like Google, Yahoo and Facebook generate massive amounts of data every day. In order to gain value from this data, it needs to be effectively stored and processed. Hadoop, a Big Data framework, can store and process Big Data in a scalable and performant fashion. Both Yahoo and Facebook, two major IT companies, deploy Hadoop as their solution to the Big Data problem. Many application areas for Big Data would benefit from the ability to share datasets across cluster boundaries. However, Hadoop does not support searching for datasets, either locally within a single Hadoop cluster or across many Hadoop clusters. Similarly, there is only limited support for copying datasets between Hadoop clusters (using DistCp). This project presents a solution to this weakness using the Hadoop distribution Hops and its frontend HopsWorks. Clusters advertise their peer-to-peer and search endpoints to a central server called Hops-Site. The advertised endpoints build a global Hadoop ecosystem and give clusters the ability to participate in public search or peer-to-peer sharing of datasets. HopsWorks users are given the choice to write data into Kafka as it is being downloaded. This opens up new possibilities for data scientists, who can interactively analyse remote datasets without having to download everything in advance. By writing data into Kafka as it is being downloaded, the data can be consumed by systems like Spark Streaming or Flink.
Acknowledgements
I would like to acknowledge my examiner Jim Dowling and my supervisor Alex Ormenisan. Both of them have contributed advice and smart ideas which have helped me throughout the project. I would also like to thank all the coworkers at SICS, who were always glad to offer help.
Contents
1 Introduction
1.1 Problem description
1.2 Problem statement and Purpose
1.3 Goals, Ethics and Sustainability
1.4 Structure of this thesis
2 Background
2.1 Hadoop
2.2 Hops
2.3 HopsWorks
2.4 Hops-Site
2.5 MySQL Cluster
2.6 ElasticSearch
2.7 Epipe
2.8 GVoD
2.9 HDFS
2.10 Kafka
3 Method
3.1 Methodology
3.2 Experiments and evaluation
4 Implementation
4.1 Rules
4.2 Hops-Site interactions with a HopsWorks instance
4.3 Public Search
4.3.1 What is needed?
4.3.2 Producing a public search and handling responses
4.4 GVoD peer-to-peer upload and download
4.4.1 What is needed?
4.4.2 Upload
4.4.3 Download
4.5 Real-time processing
5 Analysis
5.1 Evaluation of implementation and current technologies
5.2 Results of Experiments
5.3 P2P test
5.4 Real-time processing tests
6 Conclusions
6.1 Goals
6.2 Future work
6.3 Conclusion
Bibliography
List of Figures
2.1 HopsWorks and Hops
2.2 HDFS
2.3 Hops HDFS
4.1 Register to Hops-Site
4.2 Ping Hops-Site
4.3 Public Search
5.1 Download with one uploader
5.2 Download with two uploaders
5.3 Download with three uploaders
5.4 Download with four uploaders
List of Acronyms and Abbreviations
HDFS Hadoop Distributed File System
YARN Yet Another Resource Negotiator
Hops Hadoop Open Platform-as-a-Service
JSON JavaScript Object Notation
REST Representational State Transfer
NAT Network Address Translation
FTP File Transfer Protocol
NDB Network Database
API Application Programming Interface
Chapter 1
Introduction
Modern computing is generating massive volumes of data at ever growing speeds. The data generated has different characteristics. Not only are the volumes large, but the data is structured, semi-structured and unstructured [1]. Data originates from different kinds of sources such as web pages, logs, social media, e-mail, documents, sensor devices and many more [2]. The different characteristics, size, complexity and origins of this data make it difficult for traditional storage and processing systems to handle. A commonly recognized term for this kind of data is "Big Data".
Storing, processing and extracting value from Big Data is no trivial matter, and a problem that companies like Google and Yahoo spend lots of resources on. The most notable framework for Big Data storage and processing is called Hadoop [3]. The base of Hadoop was developed at Yahoo together with the creator of the famous search-engine library Apache Lucene [4]. Hadoop is a framework consisting of several different projects such as the "Hadoop Distributed File System" (HDFS) and "Yet Another Resource Negotiator" (YARN). Hadoop is capable of storing and processing large amounts of data in a scalable, efficient and
effective way. Although Hadoop excels at handling Big Data, it is a fairly young framework and lacks some important capabilities that would be beneficial for its progress.
1.1 Problem description
The capabilities that Hadoop offers make it possible for developers and scientists to create new, interesting applications as well as to extract interesting information out of large datasets. However, as in any technical area, in order to make progress there needs to be a way to share knowledge. Inside a Hadoop cluster, stored in HDFS, there may be hundreds of petabytes of data on which users of that cluster can perform processing. This data is bound to the cluster where it is stored. There are limited options for users to obtain information about data that isn't local to their cluster. Neither is there support for scalable transfer of data between clusters. Right now, in order to import data that isn't local to a cluster, a user must first find it via some third-party information source and after that do some kind of copying, typically by using Hadoop's DistCp [5] or, worse, something like the File Transfer Protocol (FTP) [6]. These solutions aren't always going to work either, since the "open internet" might include NAT (Network Address Translation) endpoints and other technology which would further complicate transfers. Also, just because something is advertised to exist on the internet doesn't mean that it actually is available. These limitations make sharing datasets tedious and in many cases very slow, which could discourage users from sharing. Even if transfers were fast and knowledge of remote datasets was easily accessible, datasets might be very big and downloads will take time to complete. Some users aren't concerned with the whole blob of data, but might simply want to run some experiments on some of it. If there is no way to process data as it is being downloaded, then these
users will be forced to wait for everything to finish downloading. This will likely
make them less inclined to share datasets.
1.2 Problem statement and Purpose
Hadoop has no solution for either searching or scalable sharing of datasets across cluster boundaries. Also, sharing datasets between datacenters can be problematic in the face of NAT endpoints. This means that data will likely be bound to one cluster and sharing will not happen. The lack of shared datasets is a hindrance to further progress within application areas that use the utilities of Hadoop. Even if Hadoop had scalable capabilities to share data, downloading large datasets might take too long and therefore be avoided. There needs to be a way to do processing on data as it is being downloaded.
The contribution of this thesis is the implementation and evaluation of a global ecosystem for Hadoop datasets in which datasets are searchable, shareable and processable as they are being downloaded. The implementation uses SICS's own Hadoop distribution Hops and its frontend HopsWorks as the base system. For global cluster registration, a central server is deployed where different Hops clusters can register their peer-to-peer and search endpoints. The peer-to-peer service is an altered version of GVoD [7], a peer-to-peer video streaming application with NAT-traversal capabilities developed at SICS. The public search is accomplished using the extremely popular ElasticSearch [8] search engine, a distributed document store built on top of Apache Lucene [4]. The real-time processing of data is enabled through the choice of writing GVoD-downloaded data into either HDFS alone or both HDFS and Kafka [9]. By writing data into Kafka, users in HopsWorks can read from Kafka during downloads using
systems like Spark Streaming [10], Flink [11] or similar technologies.
1.3 Goals, Ethics and Sustainability
The goal of this thesis is a working, scalable and efficient implementation of a global ecosystem for Hops datasets. HopsWorks users of different Hops clusters will be able to share and search for datasets in this global ecosystem. Also, in the case of downloading datasets, real-time processing will be supported. The explicit goals are listed below.
• Implement search for public datasets. HopsWorks users should be able to find data in their own Hops cluster and in remote Hops clusters.
• Implement peer-to-peer sharing of public datasets. HopsWorks users should be able to upload and download data to and from other Hops clusters.
• Implement support for real-time processing of downloading data. HopsWorks users should not be forced to wait for complete downloads in order to investigate interesting data.
• Demonstrate that peer-to-peer sharing of data is a scalable solution and a better solution than Hadoop's built-in copying mechanism DistCp and similar technologies.
If these goals are met, then people creating, storing and processing interesting data can share it with others from related or unrelated application domains in order to further their products and goals. It can directly benefit those who work in the Big Data community, but could also indirectly benefit those who are only affected by it, for example visitors of enterprise web applications.
Introducing peer-to-peer technology can be controversial from both an ethical
and a sustainability standpoint. Peer-to-peer downloads of large datasets could potentially mean heavy bandwidth usage, which could strain network infrastructure. Also, from an ethical perspective, sharing data can be problematic if there is no sufficient access control over the data being shared. Fortunately, GVoD, the process that conducts the peer-to-peer downloads, uses a special network protocol known as LEDBAT [12]. This protocol differs from protocols such as TCP and UDP in that it adapts its usage to the current network characteristics. Concerning the ethical problems, access control is managed by the HopsWorks web application, where users can choose to make their own data publicly available.
1.4 Structure of this thesis
Chapter 1 describes the problem and its context. Chapter 2 provides the specific knowledge that the reader will need to understand the rest of the thesis. Chapter 3 describes the method used to implement and evaluate the solution. The solution implementation is presented and explained in Chapter 4. The solution is analyzed and evaluated in Chapter 5. Finally, Chapter 6 offers some conclusions and suggests future work.
Chapter 2
Background
This chapter presents the background of the thesis. It introduces the different entities that the reader needs to understand in order to comprehend the remainder of the thesis. First, Hadoop, Hops and HopsWorks are introduced, as they form the base upon which the solution is built. After that, the different parts of the solution architecture are introduced: first the central server (Hops-Site), then relational persistence (MySQL Cluster), search (ElasticSearch and Epipe) and peer-to-peer sharing (GVoD). Lastly, the dataset-store components (HDFS and Kafka) are presented, along with what they offer for this particular solution. This chapter doesn't discuss how these different techniques accomplish the overall solution; that is done in Chapter 4.
2.1 Hadoop
Apache Hadoop is a framework that provides distributed storage and processing of
large datasets [3]. Hadoop was designed to scale from single servers up to massive
clusters of nodes where each node offers both storage and processing power. Today, companies like Yahoo, Facebook and Spotify deploy Hadoop stacks in their datacenters in order to manage the large amounts of data they generate [13]. Rather than relying on centralized and expensive solutions, Hadoop utilizes the power of parallel computing on inexpensive hardware. The main modules of Hadoop are Hadoop Common, HDFS, YARN and MapReduce. Hadoop Common consists of the utilities that the other modules need in order to function properly. HDFS is a distributed filesystem, the default filesystem for Hadoop. YARN is a resource negotiator, with responsibilities similar to those of a traditional operating system. MapReduce is a programming model that is widely supported inside Hadoop and allows for things like distributed processing of data.
2.2 Hops
Hadoop Open Platform-as-a-Service, or "Hops", is a Hadoop distribution developed at SICS [14]. Hops has several improvements over Hadoop, some of which are listed below.
• Hadoop-as-a-Service
• Project-based multi-tenancy
• Secure sharing of datasets across HopsWorks projects
• Extensible metadata that supports free-text search using ElasticSearch
• YARN quotas for projects
The key functionality that enables the improvements listed above is the storage of HDFS and YARN metadata inside MySQL Cluster. This functionality enables things like search for datasets through ElasticSearch, described in sections 2.6
and 2.7. Also, with HDFS metadata stored inside MySQL Cluster instead of on the heap of a NameNode, Hops becomes more scalable than other Hadoop architectures [15]. The secure sharing, multi-tenancy and service improvements are enabled by the HopsWorks web application, which is described further below.
2.3 HopsWorks
HopsWorks is the frontend for Hops. It introduces concepts like Users, Projects and Datasets, which help organize the different services that Hops offers. For example, a user of HopsWorks can create a project, run a Spark job and store the results as a dataset inside the project. Technically speaking, HopsWorks is an AngularJS [16] frontend and a Java Jersey [17] REST (Representational State Transfer) [18] backend. The backend talks with Hops services (like ElasticSearch and GVoD) via REST calls. Most data exchanged over REST in the Hops and HopsWorks architecture is serialized into JSON (JavaScript Object Notation) [19] format. Figure 2.1 exemplifies the architecture of Hops and HopsWorks.

Figure 2.1: HopsWorks and Hops
The top layer represents the HopsWorks REST API (Application Programming Interface); it offers many different API calls that the AngularJS frontend uses to present Hops to its users. Almost all of these API calls are protected by user authentication. However, the REST call to search for public datasets is not, so anyone can call it. This is a necessary evil, as it allows clusters to make REST calls for public datasets without needing any kind of web-application session information. We will come back to the problematic aspects of this choice in the last chapter of the thesis. The next layer represents the different services that Hops provides. The relevant services for this project are
ElasticSearch and GVoD (the latter unfortunately not shown in the figure), both of which are introduced below in sections 2.6 and 2.8. The bottom layer represents the main modules of Hops: the filesystem and the resource negotiator. The filesystem is a modified version of HDFS, which is further described below in section 2.9. The resource negotiator is an altered version of Hadoop's YARN. HDFS handles the storage needs inside the Hops cluster while YARN manages important choices like scheduling and allocation of resources. Together, HDFS and YARN help enable the Hops services layered above them.
2.4 Hops-Site
Searching publicly for datasets means that those datasets need to be globally unique, i.e. unique across different clusters as well as within clusters. A Hops cluster also needs to know the search endpoints of other clusters in order to direct public-search queries to them. Hops-Site serves as a solution to both of these problems. Hops-Site is a Jersey [17] RESTful web service deployed on a GlassFish web server [20] local to a certain Hops cluster. Hops-Site offers a REST API to Hops clusters through which they can advertise their search endpoints, obtain a unique cluster id and find information about other registered Hops clusters. Hops-Site also maintains other types of information about registered clusters, such as how active they are (how often they ping) and what GVoD endpoint they have. Hops-Site uses MySQL Cluster for persistence, same as HopsWorks.
2.5 MySQL Cluster
HopsWorks, Hops and Hops-Site all utilize a relational database to persist important information. HopsWorks and Hops (YARN and HDFS) need to store information about users, projects, datasets and more, while Hops-Site persists information about registered Hops clusters. Some of the reasons for choosing MySQL Cluster as a persistent store for Hops-Site are listed below.
• Integration
• High availability and scalability
• No single point of failure
The first reason for the choice of MySQL Cluster is integration. Because Hops and
HopsWorks already utilize MySQL Cluster for persistence, it became a natural choice for Hops-Site, which also runs inside a Hops cluster. The second and perhaps most important reason is the performance that MySQL Cluster offers. MySQL Cluster has both high availability and scalability [21], which are both critical for up-time and for handling a high load of requests. Also, because MySQL Cluster is a distributed relational database, there is no single point of failure, which further improves potential up-time.
2.6 ElasticSearch
Search inside Hops clusters is powered by ElasticSearch. ElasticSearch is a distributed search engine built on top of Apache Lucene [8, 4]. The distributed nature of ElasticSearch gives it important properties like high availability, scalability and responsiveness. ElasticSearch is a document store, which means that instead of storing data in a traditional rows-and-columns format, data is stored in object format, in so-called documents. Documents themselves are stored inside indexes, which are replicated throughout an ElasticSearch cluster using shards. An index is similar to what a database is in a relational database system. By default, each field inside a document is indexed with a Lucene inverted index, making it available for fast search and retrieval [22]. Other than the performance reasons stated above, ElasticSearch is a good choice of search engine since it speaks REST and JSON; this enables simple integration with the rest of Hops and HopsWorks. The data that ElasticSearch makes available for search comes from the metadata stored in MySQL Cluster and is delivered by Epipe, which is described below.
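To make the REST integration concrete, the following is a minimal sketch of sending a search query to ElasticSearch's HTTP API using a JAX-RS client, the same client style that fits the Jersey stack described in section 2.3. The index name, field and query string are hypothetical placeholders; the actual index layout used by Hops is not shown here.

    import javax.ws.rs.client.Client;
    import javax.ws.rs.client.ClientBuilder;
    import javax.ws.rs.client.Entity;
    import javax.ws.rs.core.MediaType;

    public class ElasticQueryExample {
        public static void main(String[] args) {
            // Hypothetical index ("datasets") and field ("name"); the real
            // index layout in Hops may differ.
            String query = "{\"query\": {\"match\": {\"name\": \"traffic\"}}}";
            Client client = ClientBuilder.newClient();
            String response = client.target("http://localhost:9200")
                    .path("datasets/_search")
                    .request(MediaType.APPLICATION_JSON)
                    .post(Entity.entity(query, MediaType.APPLICATION_JSON), String.class);
            // The response is a JSON document with the matching hits, each
            // carrying the relevance score that ElasticSearch assigns.
            System.out.println(response);
            client.close();
        }
    }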
2.7 Epipe
Inside a Hops cluster, an application called Epipe is responsible for writing data from MySQL Cluster into ElasticSearch. This application employs the NDB (Network Database) event API [23] and listens for events that are generated when something changes inside MySQL Cluster, for example an update to a table. When an event is generated, Epipe looks at the event and writes the changes into ElasticSearch. In this manner, parts of MySQL Cluster are replicated in ElasticSearch, which means that users of HopsWorks can direct search queries to ElasticSearch and search for data that is stored in MySQL Cluster. An example of this would be when a user makes a dataset public; this changes a column inside the dataset table in MySQL Cluster. Epipe would receive an event and write the column change into ElasticSearch. Now users would be able to search for that public dataset by querying the local ElasticSearch instance.
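The pattern Epipe implements is a form of change data capture, and it can be sketched as follows. This is not Epipe's actual code: TableChangeEvent and EventSource are invented stand-ins for the NDB event API (which is a native API in reality), and the document-type name is a placeholder.

    import javax.ws.rs.client.Client;
    import javax.ws.rs.client.ClientBuilder;
    import javax.ws.rs.client.Entity;
    import javax.ws.rs.core.MediaType;

    // Hypothetical stand-ins for the NDB event API; only the overall
    // pattern (listen for a change, index the changed row) is illustrated.
    interface TableChangeEvent {
        String tableName();
        String rowAsJson();
    }

    interface EventSource {
        TableChangeEvent nextEvent() throws InterruptedException; // blocks until a change occurs
    }

    public class EpipeSketch {
        private final EventSource source;
        private final Client es = ClientBuilder.newClient();

        public EpipeSketch(EventSource source) {
            this.source = source;
        }

        public void run() throws InterruptedException {
            while (true) {
                TableChangeEvent event = source.nextEvent();
                // Index the changed row as a document in an index named after
                // the table, so the change becomes searchable almost immediately.
                es.target("http://localhost:9200")
                  .path(event.tableName() + "/row") // "row" is a placeholder type name
                  .request(MediaType.APPLICATION_JSON)
                  .post(Entity.entity(event.rowAsJson(), MediaType.APPLICATION_JSON));
            }
        }
    }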
2.8 GVoD
GVoD is a video-on-demand streaming application with NAT-traversal capabilities, developed at SICS [7]. GVoD has been altered to fit into the Hops ecosystem and serve as the process which handles downloads and uploads of public datasets. GVoD is a single process that runs on a single node inside a Hops cluster. GVoD incorporates peer-to-peer technology in order to upload and download datasets. GVoD downloads and uploads files in pieces, which are small units of data. These pieces are assembled into blocks, which are bigger units of data that will later be verified for correctness and then written into HDFS. The size of a block is configurable, but in the implementation for this project it was set to 10 megabytes.
Downloading datasets with GVoD means transferring a lot of these pieces and building blocks from them. In order for a GVoD instance to know that it has obtained correct blocks, it needs to verify block hash values. GVoD incorporates "on-demand hashing" of blocks in order to support this. GVoD downloads data in order, which differs from other common peer-to-peer applications that usually download data out of order. The fact that GVoD downloads data in order is necessary since HDFS is an append-only filesystem [24]. GVoD has also been altered to write to HDFS and Kafka, where HDFS represents the persistent store and Kafka the temporary store that enables real-time processing. Writing data into HDFS is done by transferring pieces, building blocks, verifying the block hash values and then writing the blocks to the DataNodes of HDFS. Writing data into Kafka is a bit different and is described further in section 2.10.
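The in-order, block-at-a-time write path matches HDFS's create-then-append model. The sketch below shows the general shape of such writes with the standard Hadoop FileSystem API; it is not GVoD's actual code, the NameNode endpoint and file path are placeholders, and appending requires append support to be enabled in the HDFS configuration.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            Path file = new Path("/Projects/myproject/mydataset/part-0"); // placeholder path
            byte[] verifiedBlock = new byte[10 * 1024 * 1024]; // a 10 MB block, as configured here

            try (FSDataOutputStream out = fs.create(file)) {
                out.write(verifiedBlock); // the first verified block creates the file...
            }
            try (FSDataOutputStream out = fs.append(file)) {
                out.write(verifiedBlock); // ...and later blocks are appended, strictly in order
            }
            fs.close();
        }
    }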
2.9 HDFS
HDFS is the default filesystem for both Hadoop and Hops; it is a distributed filesystem designed to run on inexpensive hardware. HDFS was built according to some particular design goals, namely fault-tolerance, streaming data access, large datasets, a simple read-write model and moving applications to the data [24]. Figure 2.2 exemplifies the architecture of HDFS.

Figure 2.2: HDFS
HDFS is a master/slave architecture where the so-called NameNode acts as the master and DataNodes act as slaves. The idea of HDFS is to let one node handle client requests and metadata storage while having a massive number of other nodes that offer simple block storage. The NameNode has the most central role in the system. It stores the HDFS metadata such as directory structure, permissions, etc. [24]. The actual data is stored in blocks on the DataNodes and replicated onto
other DataNodes at the command of the NameNode.
The Hops filesystem is different from HDFS: it migrates the filesystem metadata from the NameNode heap to MySQL Cluster [15]. See figure 2.3 for an example of the Hops-HDFS architecture.
The migration of filesystem metadata to MySQL Cluster makes the Hops filesystem more scalable and also enables a multiple-writer model for mutating HDFS metadata [15].
Figure 2.3: Hops HDFS
2.10 Kafka
Apache Kafka is a distributed publish/subscribe messaging system with high throughput [9]. Kafka acts as the temporary store for public datasets being downloaded into a Hops cluster. Kafka manages a set of storage entities called topics; these entities are similar to queues and let producers and consumers write and read messages to and from them. These messages must follow a certain structure, usually expressed in CSV [25] or Avro [26] format. As with datasets, topics are part of projects in HopsWorks and follow the same type of access control. Messages written to Kafka topics are kept there for a configurable amount of time. When topics become full or the time for keeping messages runs out, the oldest messages are discarded. For each topic, the Kafka cluster maintains a partitioned log, where each partition must reside on a node in the cluster; a topic may have several partitions, making topic storage scalable. Partitions are also replicated throughout the cluster, making Kafka fault-tolerant. Kafka's distributed nature, high throughput and good integration with technologies such as Spark Streaming [10] make it a very capable temporary store for real-time processing of data, and compatible with the Hops ecosystem.
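As a minimal illustration of the consumer side of this publish/subscribe model, the sketch below polls a topic with the standard Kafka Java client; the broker address, group id and topic name are placeholders.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TopicReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("group.id", "example-group");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("example-topic"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n",
                                record.offset(), record.value());
                    }
                }
            }
        }
    }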
Chapter 3
Method
This chapter presents the methodology used to produce the thesis work and results. It discusses the analysis and tests. The actual results and implementation are presented in the upcoming chapters.
3.1 Methodology
For this thesis, a quantitative research method was chosen [27]. First, a system was created that sought to meet the proposed goals of the project. These were, as mentioned before, to implement search and scalable sharing of public datasets, as well as support for real-time processing of downloading datasets. Along with the quantitative research method, a deductive research approach [27] was chosen, and experimental tests and evaluations were made to verify the goals.
3.2 Experiments and evaluation
Due to the nature of the project, as well as the limited time frame, only a couple of experiments were conducted. The main experiment for testing the performance of the implementation was a test that transferred datasets between clusters with an increasing number of participating peers. By downloading datasets and increasing the number of participating peers, the scalability and performance of the peer-to-peer sharing could be verified.
No tests were conducted to establish the performance of public search. The main reason for this was that GVoD, at the time of writing, did not have the ability to build its own overlay. Therefore, in order to make the peer-to-peer sharing optimal, public search had to take a hit in performance; more on this later in chapter 4, section 4.3. Also, because no other Hadoop implementation had the ability to do public search, there was not really anything to benchmark against.
In order to test the ability of real-time processing, a simple test was conducted. This test evaluated how much time it took before downloading data started to appear in a Kafka topic.
As a final evaluation, the peer-to-peer sharing of datasets was compared and evaluated against existing technologies with similar abilities.
Chapter 4
Implementation
This chapter presents the implementation of the search and peer-to-peer downloading, as well as the real-time processing support. First, a couple of rules/assumptions are presented; some of these are temporary limitations and others are the results of logical conclusions. After that, the interactions between Hops-Site and HopsWorks are discussed, as the results of those interactions are essential for public search. Next, both public search and peer-to-peer sharing of datasets are presented in depth. Lastly, the real-time processing support is explained.
4.1 Rules
The first rule says that public datasets are immutable. This means that once a dataset is made public, it cannot be changed, i.e. you cannot add or remove files from a public dataset. This rule is a logical choice, as HDFS was originally designed for immutable data [28], and even though it is now possible to append to files, it is common that large datasets remain static.
The second rule says that public datasets are identified by the cluster id, the project name, the dataset name and a Unix timestamp. The cluster id is obtained through registration with Hops-Site, which is described further in section 4.2. The project name and dataset name are provided by HopsWorks and its structure of users having projects and projects having datasets. The cluster id makes the public dataset unique across different clusters, the project name makes the dataset unique within a cluster, and the dataset name is there for convenience. The Unix timestamp is there because a HopsWorks user might want to remove the public property of the dataset but then make it public again later.
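To make this concrete, the identifier could be assembled as in the sketch below. The four components are the ones named above, but the separator and encoding are assumptions made for illustration; the actual scheme used by Hops is not reproduced here.

    public class PublicDatasetId {
        // Hypothetical scheme: the four components come from the text, but
        // the "_" separator and the exact encoding are assumptions.
        static String build(String clusterId, String projectName,
                            String datasetName, long unixTimestamp) {
            return clusterId + "_" + projectName + "_" + datasetName + "_" + unixTimestamp;
        }

        public static void main(String[] args) {
            // The timestamp lets the same dataset be made public again later
            // under a fresh, still globally unique, id.
            System.out.println(build("42", "myproject", "mydataset",
                    System.currentTimeMillis() / 1000L));
        }
    }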
The third rule is really a result of a temporary version of GVoD. When this thesis was written, GVoD did not have the ability to build its own overlay for peer-to-peer sharing of data. Instead, it needed to know all the peers that it was going to download data from. This limited the ability to optimize public search; we will discuss improvements to this later in chapter 6, section 6.2.
The fourth rule is another one of those temporary assumptions that can be improved upon. At this moment, public datasets are considered to be one-level, in the sense that they don't have any directories, only files.
The last rule was already mentioned in chapter 2, section 2.3. It says that the REST call for public datasets is available for anyone to call. This is necessary since a HopsWorks instance needs to be able to direct search queries for public datasets to other HopsWorks instances without requiring any session-type information. This is also a security issue, which will be further discussed in chapter 6, section 6.2.
4.2 Hops-Site interactions with a HopsWorks instance
Hops-Site is a centralized server that is crucial to the functionality of the global Hops ecosystem. Below are two figures that present the interactions between HopsWorks and Hops-Site. The first, figure 4.1, shows the Register REST call to Hops-Site, and the second, figure 4.2, shows the Ping REST call.
Figure 4.1: Register to Hops-Site
HopsWorks needs a cluster id that uniquely identifies it in the Hops ecosystem so that it can later create public datasets that are unique across clusters. This cluster id is obtained when HopsWorks successfully registers with Hops-Site. HopsWorks registers with Hops-Site as it is being deployed to a GlassFish server. At the time of deployment, HopsWorks checks MySQL Cluster to see if it has a cluster id. If it doesn't, it makes a REST call to Hops-Site with information about itself (email, certificate, public-search endpoint, GVoD endpoint). If Hops-Site accepts the REST call, it will generate a unique cluster id and send it back as a response. HopsWorks will then persist this cluster id in MySQL Cluster, which means that no further registrations will be needed, and it now has the ability to create
public datasets.
Figure 4.2: Ping Hops-Site
In order to perform public search, HopsWorks also needs to know where to direct REST calls; simply sending them to the local ElasticSearch instance will only produce local results. HopsWorks needs to be able to direct queries to non-local ElasticSearch instances. This is done by querying the REST API for public datasets of other HopsWorks instances, which will then forward the queries to their local ElasticSearch instances. HopsWorks obtains these endpoints by continuously pinging Hops-Site. Hops-Site will investigate the Ping call and send back a list where each entry contains information about another registered cluster. This information includes two very important entries: the endpoint for search, and a counter which indicates the activity of the cluster. If the counter is high, it means that the cluster is inactive, and sending search queries to it or trying to share datasets with it probably isn't a good idea.
4.3 Public Search
This section describes public search in detail. First, the information needed to perform a public search is summarized. After that, the different steps of a public search are described, as well as what the results are.
4.3.1 What is needed?
As mentioned in chapter 2, sections 2.6 and 2.7, search inside a Hops cluster involves ElasticSearch, Epipe and MySQL Cluster. A change in a MySQL Cluster table will generate an event, which Epipe will look at and write the corresponding changes into ElasticSearch, making them searchable. For public search, a little more work must be done. First of all, a dataset that is searchable in more than one cluster needs to have a unique id so that the cluster searching for it can see which cluster has which data (it may be that different clusters have the same dataset). This id can be created from the cluster id, the project name, the dataset name and a Unix timestamp, as discussed above. The other thing that public search needs is the public-search endpoints of the different clusters that shall receive the search query. These are, as mentioned above, acquired by HopsWorks pinging Hops-Site and obtaining endpoint information about other clusters. That is all the information that HopsWorks needs in order to perform a public search. The production of the public search and the handling of the responses are described below.
4.3.2 Producing a public search and handling responses
When HopsWorks receives a REST call for a public-search query, it loops through the list of clusters that it obtained from the Ping to Hops-Site. This list contains information about each cluster, most importantly a public-search endpoint. At each iteration of the loop, a check is made to see if the particular cluster is considered active. If a cluster hasn't pinged for some time, a counter for that cluster will have a high value and the corresponding cluster will not be considered for search. However, if a cluster is considered active, the public-search endpoint is extracted and a non-blocking REST call to that endpoint is made. When each iteration has concluded, the main thread (the one iterating through the loop) blocks and starts to wait for all responses to come back. As each response comes back, a handler (another assigned thread) checks the hits of the response (each cluster might have several datasets that match the query). If a hit is unique, it is saved together with its GVoD endpoint in an overall result list. If a hit isn't unique, then only the GVoD endpoint is extracted and appended to the corresponding hit in the overall result list. When a thread has handled all hits, it sets a flag indicating that this cluster has responded; it then checks to see if all clusters have responded. If all have responded, the thread also wakes up the main thread that blocked in the beginning. The overall result list produced from all responses will contain unique public-dataset matches, each with a list of GVoD endpoints that can be used to download the dataset. This is important: as mentioned in section 4.1, GVoD needs to know all of the peers it should do the download with; it cannot (at this moment) build an overlay on its own. Before the result list is sent back, it is sorted according to the score of each entry. This score is a value that ElasticSearch associates with every search hit and which basically reflects its relevance [29]. Figure 4.3 exemplifies the steps of the public-search
implementation described above.
Figure 4.3: Public Search
The steps are as follows. First, a user inputs a query into the frontend. The frontend will then forward this query to the public-search REST endpoint of a HopsWorks instance. The HopsWorks instance then loops through all registered clusters (only 2 are shown in figure 4.3) and sends asynchronous search queries to these clusters' HopsWorks REST endpoints for public datasets. These HopsWorks instances will forward the queries to their local ElasticSearch instances, which will respond with some set of matches. Lastly, as all responses come back, these matches are combined and filtered into the overall result list, which is then sorted
and returned to the original frontend.
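A simplified sketch of this scatter-gather flow, built on standard Java concurrency primitives, is shown below. This is not the actual HopsWorks code: searchOneCluster is a hypothetical helper standing in for the REST call to a remote cluster, and the merging of hits by dataset id follows the description above.

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PublicSearchSketch {

        // Merged results: globally unique dataset id -> GVoD endpoints serving it.
        static final Map<String, List<String>> results = new ConcurrentHashMap<>();

        public static void main(String[] args) throws InterruptedException {
            List<String> activeClusters = Arrays.asList("clusterA", "clusterB"); // placeholders
            ExecutorService pool = Executors.newFixedThreadPool(activeClusters.size());
            CountDownLatch allResponded = new CountDownLatch(activeClusters.size());

            for (String endpoint : activeClusters) {
                pool.submit(() -> {
                    try {
                        for (String[] hit : searchOneCluster(endpoint, "traffic")) {
                            // A duplicate hit only gains another endpoint; a new
                            // hit gets its own entry in the overall result list.
                            results.computeIfAbsent(hit[0], k -> new CopyOnWriteArrayList<>())
                                   .add(hit[1]);
                        }
                    } finally {
                        allResponded.countDown(); // flag this cluster as responded
                    }
                });
            }

            allResponded.await(); // the main thread blocks until every cluster answers
            pool.shutdown();
            // A real implementation would sort entries by ElasticSearch score here.
            System.out.println(results);
        }

        static List<String[]> searchOneCluster(String endpoint, String query) {
            // Placeholder: a real call would hit the remote HopsWorks
            // public-search REST endpoint and parse its JSON response.
            return Collections.singletonList(
                    new String[]{"42_myproject_mydataset_1473000000", endpoint + "/gvod"});
        }
    }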
4.4 GVoD peer-to-peer upload and download
This section describes the peer-to-peer sharing in detail. First, the information needed before a download can be made is presented. After that, the actual upload and download are described.
4.4.1 What is needed?
In chapter 2, section 2.8, we mentioned that GVoD is the application that takes care of the download and upload of datasets. In this chapter, in section 4.1, we also mentioned that GVoD has no ability to build an overlay on demand. This means that in order to produce an optimal download, i.e. a download with the maximum number of participating peers, GVoD needs to get the peers from somewhere. It turns out that public search does just that. Public search returns a list of unique public datasets corresponding to the query it received as input; each of these datasets also comes with a list of GVoD endpoints, which happens to be all of the peers that GVoD can utilize to download the dataset. This means that after a public search is performed, a HopsWorks user has all the information needed to perform an optimal peer-to-peer download of a dataset.
4.4.2 Upload
In order to share a dataset, someone must first make the dataset public so that it
can be searched for and after that downloaded. Making a dataset public inside
HopsWorks and Hops involves several steps. The first thing that happens is that a user of HopsWorks right-clicks on a dataset icon and selects "make public". The next step involves the creation of the so-called Manifest. A Manifest is a JSON file that contains information about the contents of a public dataset. It describes the files and whether they support writing into Kafka. The Manifest also contains other metadata such as creator, creation date and so on. When the Manifest is created, it is written to the dataset folder in HDFS. After that, HopsWorks makes a REST call to GVoD, informing it about the path to the HDFS folder and other information such as the public-dataset id and HDFS endpoint information. GVoD then looks at the path provided and tries to read the Manifest and parse it as JSON. If successful, GVoD knows the structure of the dataset it should upload and also the torrent id it should use (the public-dataset id). GVoD then replies to HopsWorks with a REST call indicating that everything went fine. HopsWorks then persists the fact that this dataset is now public, together with its public-dataset id, into MySQL Cluster. Epipe will then receive an event and write the changes into ElasticSearch, making the dataset available for public search.
4.4.3 Download
When an upload has been conducted, the public dataset will be publicly searchable and at least one GVoD instance will have it ready for upload. When a HopsWorks user searches for a public dataset, the user will receive a list of matches, where each match has a certain public-dataset id and a list of GVoD endpoints that are willing to upload this dataset. In order to download a public dataset, certain steps need to happen. First, a user will want to understand what kind of dataset they are downloading and also whether it can be written into Kafka. This information is present in the Manifest file of each public dataset, and the first step is to ask the
local GVoD instance to download the Manifest and present it to the HopsWorks user. In order to do this, there must first be a location where the local GVoD can write the Manifest. The HopsWorks user must therefore create a destination dataset folder so that the local GVoD instance can write the Manifest into it. After a destination dataset is created, HopsWorks sends the path of this dataset to GVoD in a REST call, together with other important information such as the GVoD endpoints to download from and the torrent id (public-dataset id). GVoD then downloads the Manifest from the peers it was presented with and writes the Manifest into the path that HopsWorks gave it. Then it sends a REST call back indicating that the Manifest is now present in the path that it was given. HopsWorks can now read the Manifest from the destination dataset and present the information to the user. Depending on what kinds of files and schemas are present in the dataset, the user can choose to either write the rest of the dataset into HDFS only, or into both HDFS and Kafka. After that choice is made, HopsWorks sends a REST call to GVoD informing it about what kind of download should be made. When GVoD receives this REST call, it proceeds to download the rest of the data into the desired storage components.
4.5 Real-time processing
If a HopsWorks user chooses to download data into both Kafka and HDFS, then it is possible to process the data that is being written to Kafka while the download is progressing. There are many different ways of doing this, as there are plenty of technologies that have the ability to read from Kafka. The simplest and most boring way to go about it is to create a simple Kafka consumer that reads from a certain offset inside the Kafka topic in question. However, there are more interesting things you can do. For example, both Apache Spark [10] and Apache
Flink [11] provide APIs that have the ability to read from Kafka topics. With these technologies, advanced processing can be done as the dataset is being downloaded into the cluster.
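As a minimal sketch of the simple-consumer approach, the standard Kafka Java client can be pointed at a partition and told to start from a chosen offset; the broker, topic, partition and offset below are placeholders.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Assign one partition explicitly and seek to a chosen offset,
                // instead of joining a consumer group via subscribe().
                TopicPartition partition = new TopicPartition("downloading-dataset", 0);
                consumer.assign(Collections.singletonList(partition));
                consumer.seek(partition, 0L);
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                        System.out.printf("offset=%d value=%s%n",
                                record.offset(), record.value());
                    }
                }
            }
        }
    }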
Chapter 5
Analysis
This chapter presents the results and analysis of the implementation and tests. First, the evaluation of the implementation and existing technologies is presented. After that, the different test results are shown.
5.1 Evaluation of implementation and current technologies
This thesis and project have introduced a peer-to-peer sharing service that enables scalable and efficient sharing of public datasets. Copying data between filesystems, servers and datacenters is no novel idea. There exist countless solutions to this type of problem, but almost none of them fit particularly well in a Hadoop ecosystem. First of all, datasets in a Hadoop cluster like Hops are often very large, hence simple transfer protocols like FTP won't scale well. Technologies like DistCp do perform well when copying large datasets, but not from one datacenter to another. Also, because neither of these solutions uses peer-to-peer technology, they are unlikely to achieve maximum performance. Another major obstacle for
technologies such as DistCp is that they cannot traverse NAT endpoints. This is a major problem, as most of today's internet uses NATs to extend network infrastructure.
The implementation developed throughout this thesis suffers from none of the above-mentioned problems. It is peer-to-peer by nature and has built-in NAT-traversal capabilities.
5.2 Results of Experiments
This section presents the results of the tests that were conducted to validate the implementation. First, the scalability of the peer-to-peer sharing service is presented. Then the performance of the real-time processing is presented and discussed.
5.3 P2P test
To test the peer-to-peer performance and scalability, a test of sharing a public dataset with an increasing number of participating peers was conducted. The clusters in the test were emulated using Vagrant [30] virtual machines. The machines had 2 CPUs and enough memory to deploy a Hops cluster; see http://www.hops.io/?q=content/hopsworks-vagrant for details. Five clusters were deployed, the first one being an initial uploader, the others downloaders that became uploaders after finishing a download. In order to clearly observe the performance of the setup, the upload speed of each cluster's GVoD instance was limited to 300,000 bytes per second. The size of the dataset to share was set to 40 megabytes. The results of these tests are depicted in the figures below, where the
first figure is a download with one uploader, the next with two uploaders, and so on.
Figure 5.1: Download with one uploader

Figure 5.2: Download with two uploaders

Figure 5.3: Download with three uploaders

Figure 5.4: Download with four uploaders

We can observe that each added peer makes for faster download speeds. The last figure (5.4) shows a quite stable speed of around 1,200,000 bytes per second, which is almost four times the speed of a perfect one-to-one download at the 300,000 bytes-per-second cap. These tests confirm the scalability of the implementation.
5.4 Real-time processing tests
To test the performance of real-time processing, a simple test was made that checked how long it took before data appeared in a Kafka topic after a download was started. This test is highly dependent on the speed of the transfer, which itself is dependent on the number of participating peers. In order to test a worst-case scenario, only one uploader participated in the transfer. The size of the dataset was the same as in the tests above, and the clusters were emulated using the same type of Vagrant virtual machines. The time for data to appear varied between 3 and 6 seconds, with an average of 4 seconds; only 10 tests were made, which wasn't enough to produce any nice graphs. There are a couple of things one can say about this test. First, it shows that even with only one uploader, it can take as little as 3 seconds before a Kafka topic has data from the dataset and real-time processing can begin. With more peers, the speed of the transfers will grow, as shown in section 5.3, which means that a highly popular public dataset would be very easy to process.
Chapter 6
Conclusions
This chapter concludes the thesis by presenting the author's reflections on the project. First, an evaluation of the goals is made. Then some reflections about the work and future work are presented. The thesis is summed up with a final conclusion in the last section.
6.1 Goals
The explicit goals of this project can be found in chapter 1, section 1.3. Overall, the goal was to create a scalable and effective solution for sharing datasets between Hadoop clusters, and also to support the ability to do real-time processing on downloading datasets. The tests and evaluation in chapter 5 confirm that the peer-to-peer sharing is scalable and that the real-time processing is effective and very useful. The public-search part wasn't tested, and I therefore can't claim anything about its performance. However, as described in chapter 4, section 4.3, the main thread that performs the search blocks to wait for all clusters to respond.
This is obviously a performance flaw, and improvements to this are discussed below.
6.2 Future work
Even though the implementation fulfilled the goals of the project, there are plenty of improvements that would make the system more complete and performant.
The most obvious problem with the system is the sub-optimal implementation of search. Right now, when HopsWorks produces a public search, it queries a list of clusters and then blocks and waits for all of them to respond. This means that a search will be as slow as the cluster that takes the longest time to respond. This can be quite problematic, as people won't expect a simple search for datasets to take a long time. The reason for this implementation is that GVoD doesn't build an overlay on demand. The search needs to wait for all queries to come back in order to collect all possible GVoD endpoints for the different matching datasets. In order to improve this, GVoD must first incorporate the ability to build an overlay on demand. When GVoD acquires this ability, the implementation of search can be changed into something more sophisticated. For example, instead of waiting for every cluster to respond, a pre-determined wait time could be decided upon. When all queries have been sent to the clusters, HopsWorks won't block and wait for all of them to return; it will only wait the amount of time that was decided upon. When that time has elapsed, all the responses that HopsWorks has received can be returned to the frontend. The rest of the responses could either be discarded as too old or handled in a way that lets a HopsWorks user request them later.
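Relative to the scatter-gather flow described in section 4.3.2, the change is small: the unbounded wait becomes a timed one. A minimal sketch, assuming a CountDownLatch-based design like the one sketched earlier (the two-second budget is an arbitrary example):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    public class TimedGather {
        // Wait at most budgetMillis instead of blocking until every cluster
        // answers; returns false if some clusters were too slow, in which
        // case partial results are returned and late responses handled later.
        static boolean gather(CountDownLatch allResponded, long budgetMillis)
                throws InterruptedException {
            return allResponded.await(budgetMillis, TimeUnit.MILLISECONDS);
        }

        public static void main(String[] args) throws InterruptedException {
            CountDownLatch latch = new CountDownLatch(3); // e.g. three clusters queried
            boolean allIn = gather(latch, 2000);
            System.out.println(allIn ? "all clusters responded" : "returning partial results");
        }
    }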
Other obvious things to improve upon are the rules/assumptions introduced in
chapter 4, section 4.1. The fact that public datasets are immutable could be changed by incorporating some kind of versioning system. Instead of forcing public datasets to be static, a public dataset could have different versions, where added data means a new version of the dataset. The peer-to-peer system could then recognize that two datasets have the same base version and perhaps use that to optimize a download or upload. Another obvious limitation in those rules is the fact that public datasets are one-level. This should be changed so that public datasets can have directories to further structure their data.
In both chapter 2 and chapter 4, it was mentioned that the HopsWorks web application has a REST call that is available for anyone to call. This is an obvious weak point for DDoS attackers to exploit, but it is not trivial to fix. The fix needs to allow different HopsWorks web applications to differentiate themselves from random DDoS attackers. Another solution could be to incorporate some kind of DDoS detection, where spam behavior is detected and dealt with appropriately.
Another problem, which wasn't really made clear by the tests, is the way that GVoD writes data to Kafka. As this thesis is written, this is done with synchronous producers, which basically means that GVoD writes data to a topic, awaits a confirmation that it was written, and then writes again. This is of course not optimal; it would be better if data could be written in an asynchronous way, similar to how search queries to other clusters are handled.
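With the standard Kafka Java producer, the difference between the two styles is simply whether one blocks on the Future returned by send. The sketch below contrasts them; it is not GVoD's actual code, and the broker and topic are placeholders.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SyncVsAsyncProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("downloading-dataset", "piece-0");

                // Synchronous style (what the text describes): block on the
                // returned Future until the broker acknowledges the write.
                producer.send(record).get();

                // Asynchronous style (the suggested improvement): hand the
                // record over and continue; a callback reports the outcome.
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace(); // e.g. retry or log the failed write
                    }
                });
            }
        }
    }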
A final issue with the implementation is the lack of functionality that would increase the incentive for someone to download or upload data. On torrent sites, there is usually some kind of ranking of different torrents, where the most popular torrents have many uploaders and downloaders. There is also often some kind of indication of where the torrent is from, who its creator was and so on. Right now, there is basically none of this in HopsWorks, which is problematic
as presenting data in this way motivates people to share their data. An example of a solution for this would be for Hops-Site to store information about popular datasets. When a user uploads or downloads a dataset, a REST call to Hops-Site could be made that informs it about the action and the dataset. The ping REST call that HopsWorks continuously makes to get endpoint information could be extended to also return information about popular datasets. This could then be displayed somewhere in the HopsWorks frontend.
6.3 Conclusion
This report and project have presented a solution for sharing datasets between Hadoop clusters in a scalable and efficient manner. The implementation also introduced a solution to the problem of downloading very large amounts of data. The major limitations of the solution have been presented as well, and suggested work for removing those limitations is explained above.
Bibliography
[1] S. Kaisler, F. Armour, J. A. Espinosa, and W. Money, "Big data: Issues and challenges moving forward," in System Sciences (HICSS), 2013 46th Hawaii International Conference on, Jan 2013, pp. 995–1004.
[2] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Contemporary Computing (IC3), 2013 Sixth International Conference on, Aug 2013, pp. 404–409.
[3] "Hadoop homepage," http://hadoop.apache.org/, accessed: 2016-09-09.
[4] "Apache Lucene," https://lucene.apache.org/core/, accessed: 2016-09-09.
[5] "Apache DistCp homepage," https://hadoop.apache.org/docs/r1.2.1/distcp2.html, accessed: 2016-09-09.
[6] "FTP RFC," https://www.ietf.org/rfc/rfc959.txt, accessed: 2016-09-09.
[7] "GVoD homepage," http://www.decentrify.io/?q=content/video, accessed: 2016-09-09.
[8] "ElasticSearch guide," https://www.elastic.co/guide/en/elasticsearch/guide/current/getting-started.html, accessed: 2016-09-09.
[9] "Kafka homepage," http://kafka.apache.org/.
[10] "Spark Streaming and Kafka," http://spark.apache.org/docs/latest/streaming-kafka-integration.html.
[11] "Flink and Kafka," https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/connectors/kafka.html, accessed: 2016-09-09.
[12] D. Rossi, C. Testa, S. Valenti, and L. Muscariello, "LEDBAT: The new BitTorrent congestion control protocol," in Computer Communications and Networks (ICCCN), 2010 Proceedings of 19th International Conference on, Aug 2010, pp. 1–6.
[13] "Hadoop usages," http://wiki.apache.org/hadoop/PoweredBy, accessed: 2016-09-09.
[14] "Hadoop Open Platform-as-a-Service," http://www.hops.io/?q=content/docs.
[15] K. Hakimzadeh, H. Peiro Sajjad, and J. Dowling, Scaling HDFS with a Strongly Consistent Relational Model for Metadata. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 38–51. [Online]. Available: http://dx.doi.org/10.1007/978-3-662-43352-2_4
[16] "AngularJS docs," https://angularjs.org/, accessed: 2016-09-09.
[17] "Jersey web services," https://jersey.java.net/.
[18] L. Richardson and S. Ruby, RESTful Web Services. O'Reilly Media, Inc., 2008.
[19] "JSON RFC," https://tools.ietf.org/html/rfc7159, accessed: 2016-09-09.
[20] "GlassFish server," https://glassfish.java.net/.
[21] "MySQL Cluster," http://dev.mysql.com/doc/refman/5.7/en/ha-overview.html.
[22] "Lucene inverted index," https://lucene.apache.org/core/3_0_3/fileformats.html, accessed: 2016-09-09.
[23] "NDB Cluster API," https://dev.mysql.com/doc/ndbapi/en/mysql-cluster-api-overview.html.
[24] "HDFS architecture," http://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
[25] "CSV RFC," https://www.ietf.org/rfc/rfc4180.txt, accessed: 2016-09-09.
[26] "Avro docs," http://avro.apache.org/docs/1.7.5/spec.html, accessed: 2016-09-09.
[27] A. Hakansson, "Portal of research methods and methodologies for research projects and degree projects," in Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering FECS'13. CSREA Press U.S.A., 2013, pp. 67–73.
[28] "Older HDFS version," https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html, accessed: 2016-09-09.
[29] "ElasticSearch score," accessed: 2016-09-09.
[30] "Vagrant docs," https://www.vagrantup.com/docs/, accessed: 2016-09-09.