

Big Data with Apache Spark and Amazon AWS

Lance Parlier, 16 February 2017


An introduction to big data applications using Apache Spark, running on Amazon AWS EC2 clusters.

This is a short introduction to some big data programs using Apache Spark. These programs will be run both locally and on Amazon AWS EC2 clusters, comparing the differences in performance.

Big Data, Apache Spark, Amazon AWS


Definitions

• Big Data: A term for the computational analysis of extremely large data sets to reveal patterns and trends.

• Apache Spark: A fast, in-memory data processing engine. We will be using Spark’s engine for batch processing in the examples.

• Amazon AWS: A secure cloud services platform.

• EC2: Elastic Compute Cloud (EC2). Virtual computers for rent on AWS.

• S3: Simple Storage Service (S3). Storage on AWS.

• Scala: A general-purpose programming language. It is object-oriented (similar to Java) and has full support for functional programming.


Definitions cont.

• Hadoop: An open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

* Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.


Why Do We Need Big Data?


Apache Spark


An Example: Wordcount (local)


An Example: Wordcount (local) cont.

• We will first run this example locally, with only one worker thread.

• The initial input size will be 258 megabytes.

* The local machine this was run on has an Intel i5 processor and 8 gigabytes of RAM.
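For reference, below is a minimal sketch of what a Spark wordcount job in Scala might look like, assuming a local[1] master (one worker thread); the application name and the input/output paths are placeholders, and this is not the exact code used in the presentation.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal wordcount sketch. The app name and paths are placeholders.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountLocal")
      .setMaster("local[1]") // run locally with only one worker thread

    val sc = new SparkContext(conf)

    val counts = sc.textFile("input/")       // ~258 MB of text files (placeholder path)
      .flatMap(line => line.split("\\s+"))   // split each line into words
      .map(word => (word, 1))                // pair each word with a count of 1
      .reduceByKey(_ + _)                    // sum the counts per word

    counts.saveAsTextFile("output/")         // write the results (placeholder path)
    sc.stop()
  }
}

Packaged as a jar (for example with sbt), a job like this can be run with spark-submit.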


Wordcount (AWS)


Wordcount (AWS) cont.

What we will do:

• Use the spark-ec2 script to create the clusters.

• Upload the input files to S3.

• Upload the jar to the master.

• SSH into the master, and run the jar.
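As a sketch of what the submitted job might look like, the same wordcount can be adapted to read its input from S3 once the files are uploaded. The bucket name and key prefixes below are placeholders, and the s3n:// URI scheme is an assumption based on the Hadoop versions typical of spark-ec2 clusters at the time.

import org.apache.spark.{SparkConf, SparkContext}

// Wordcount sketch adapted for the EC2 cluster. "example-bucket" and the
// key prefixes are placeholders; AWS credentials are assumed to be
// configured on the cluster.
object WordCountAws {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master URL is supplied when the jar is
    // run on the cluster (e.g. via spark-submit on the master node).
    val conf = new SparkConf().setAppName("WordCountAws")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("s3n://example-bucket/input/")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("s3n://example-bucket/output/")
    sc.stop()
  }
}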


Big Data?

• In these examples, 258MB isn’t really what most would consider big data.

• So, we will ramp the input sizes up to 3.5 gigabytes and try again.


Bigger Data (local)


Bigger Data (AWS)


Ramping it up

• Since Amazon’s t2.micro instances didn’t perform the way we wanted, let’s ramp it up some.

• Now, we will use 6 (1 Master and 5 Slaves) m4.xlarge instances.

• These each have 16 gigabytes of memory, compared to the t2.micro’s 1 gigabyte per instance.
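One way to take advantage of the extra memory is to raise the per-executor memory for the job. The sketch below shows this via SparkConf; the 12g value is an assumption chosen to leave headroom on a 16 GB node, not a setting taken from the presentation, and the same effect can be had with spark-submit's --executor-memory flag.

import org.apache.spark.{SparkConf, SparkContext}

// Wordcount sketch for the larger m4.xlarge cluster. The memory value,
// bucket name, and paths are placeholders/assumptions.
object WordCountBigger {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountBigger")
      .set("spark.executor.memory", "12g") // assumed value; leaves headroom on a 16 GB node

    val sc = new SparkContext(conf)

    val counts = sc.textFile("s3n://example-bucket/big-input/") // ~3.5 GB (placeholder path)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("s3n://example-bucket/big-output/")
    sc.stop()
  }
}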


Summarizing Performance


                          LOCAL         AWS (t2.micro)   AWS (m4.xlarge)
SMALL DATASET (258 MB)    27 seconds    56 seconds       22 seconds
LARGE DATASET (3.5 GB)    254 seconds   243 seconds      150 seconds

Summarizing Performance cont.

• Using the smaller instances on AWS, performance was worse or only slightly better. This could be because each of the smaller instances has only 1 gigabyte of RAM (6 gigabytes total), versus the 8 gigabytes on the local machine. On top of that, the AWS cluster has the overhead of distributing jobs between the slaves.

• We can see the advantages once we begin to use larger instances and clusters on larger datasets. There is some overhead to distributed computing, but with a large enough dataset or more complex code, it becomes clear why we need it.


Sources

• http://www.webopedia.com/TERM/B/big_data.html

• http://spark.apache.org/

• http://www.crackinghadoop.com/apache-spark-101-introduction-for-big-data-newcomers/
