

Big Data with Apache Spark and Amazon AWS

Lance Parlier, 16 February 2017


An introduction to big data applications using Apache Spark, running on Amazon AWS EC2 clusters.

This is a short introduction to some big data programs using Apache Spark. These programs will be run both locally and on Amazon AWS EC2 clusters, comparing the differences in performance.

Big Data, Apache Spark, Amazon AWS


Definitions

• Big Data: A term for the computational analysis of extremely large data sets to reveal patterns and trends.

• Apache Spark: A fast, in-memory data processing engine. We will be using Spark’s engine for batch processing in the examples.

• Amazon AWS: A secure cloud services platform.

• EC2: Elastic Compute Cloud (EC2). Virtual computers for rent on AWS.

• S3: Simple Storage Service (S3). Storage on AWS.

• Scala: A general-purpose programming language. It is object-oriented (similar to Java) and has full support for functional programming.


Definitions cont.

• Hadoop: An open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

* Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.


Why Do We Need Big Data?


Apache Spark


An Example: Wordcount (local)


An Example: Wordcount (local) cont.

• We will first run this example locally, with only one worker thread.

• The initial input size will be 258 megabytes.

* The local machine this was run on has an Intel i5 processor and 8 gigabytes of RAM.
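For reference, below is a minimal sketch of what a Spark wordcount job in Scala might look like, assuming a local[1] master (one worker thread); the application name and the input/output paths are placeholders, and this is not the exact code used in the presentation.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal wordcount sketch. The app name and paths are placeholders.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountLocal")
      .setMaster("local[1]") // run locally with only one worker thread

    val sc = new SparkContext(conf)

    val counts = sc.textFile("input/")       // ~258 MB of text files (placeholder path)
      .flatMap(line => line.split("\\s+"))   // split each line into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word

    counts.saveAsTextFile("output/")         // write the results (placeholder path)
    sc.stop()
  }
}

Packaged as a jar (for example with sbt), a job like this can be run with spark-submit.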


Wordcount (AWS)


Wordcount (AWS) cont.

What we will do:

• Use the spark-ec2 script to create the clusters.

• Upload the input files to S3.

• Upload the jar to the master.

• SSH into the master, and run the jar.
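As a sketch of what the submitted job might look like, the same wordcount can be adapted to read its input from S3 once the files are uploaded. The bucket name and key prefixes below are placeholders, and the s3n:// URI scheme is an assumption based on the Hadoop versions typical of spark-ec2 clusters at the time.

import org.apache.spark.{SparkConf, SparkContext}

// Wordcount sketch adapted for the EC2 cluster. "example-bucket" and the
// key prefixes are placeholders; AWS credentials are assumed to be
// configured on the cluster.
object WordCountAws {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master URL is supplied when the jar is
    // run on the cluster (e.g. via spark-submit on the master node).
    val conf = new SparkConf().setAppName("WordCountAws")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("s3n://example-bucket/input/")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("s3n://example-bucket/output/")
    sc.stop()
  }
}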


Big Data?

• In these examples, 258MB isn’t really what most would consider big data.

• So, we will ramp the input sizes up to 3.5 gigabytes and try again.


Bigger Data (local)


Bigger Data (AWS)


Ramping it up

• Since Amazon’s t2.micro instances didn’t perform the way we wanted, let’s ramp it up some.

• Now, we will use 6 (1 Master and 5 Slaves) m4.xlarge instances.

• These each have 16 gigabytes of memory, compared to the t2.micro’s 1 gigabyte per instance.
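One way to take advantage of the extra memory is to raise the per-executor memory for the job. The sketch below shows this via SparkConf; the 12g value is an assumption chosen to leave headroom on a 16 GB node, not a setting taken from the presentation, and the same effect can be had with spark-submit's --executor-memory flag.

import org.apache.spark.{SparkConf, SparkContext}

// Wordcount sketch for the larger m4.xlarge cluster. The memory value,
// bucket name, and paths are placeholders/assumptions.
object WordCountBigger {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountBigger")
      .set("spark.executor.memory", "12g") // assumed value; leaves headroom on a 16 GB node

    val sc = new SparkContext(conf)

    val counts = sc.textFile("s3n://example-bucket/big-input/") // ~3.5 GB (placeholder path)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("s3n://example-bucket/big-output/")
    sc.stop()
  }
}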


Summarizing Performance


                          LOCAL         AWS (t2.micro)   AWS (m4.xlarge)
SMALL DATASET (258 MB)    27 seconds    56 seconds       22 seconds
LARGE DATASET (3.5 GB)    254 seconds   243 seconds      150 seconds

Summarizing Performance cont.

• Using the smaller instances on AWS, performance was worse or only slightly better. This could be because each of the smaller instances has only 1 gigabyte of RAM (6 gigabytes total), versus the 8 gigabytes on the local machine. On top of that, the AWS cluster has the overhead of distributing jobs between the slaves.

• We can see the advantages once we begin to use larger instances and clusters on larger datasets. There is some overhead to distributed computing, but with a large enough dataset or more complex code, it becomes clear why we need it.


Sources

• http://www.webopedia.com/TERM/B/big_data.html

• http://spark.apache.org/

• http://www.crackinghadoop.com/apache-spark-101-introduction-for-big-data-newcomers/
