Session 5 Addendum: Data Imbalance & Memory Usage Estimation

Addendum Session 5 - s3.amazonaws.com · Session Addendum Objectives Understand what causes data-imbalance Understand the impact of data-imbalance Be familiar with common strategies


Page 1:

Session 5 Addendum

Data Imbalance / Memory Usage Estimation

Page 2:

Session Addendum Objectives

➢ Understand what causes data-imbalance

➢ Understand the impact of data-imbalance

➢ Be familiar with common strategies for dealing with data-imbalance

➢ Understand and estimate memory usage

Page 3:

What is Imbalanced Data?

● “Unbalanced” or “Imbalanced” or “Lumpy” data typically refers to having “too much of one and not enough of another” when performing some type of operation

● For example, imagine “groupByKey” for businesses in a zip code in NY State.
○ NYC would have a huge proportion of the data!

● Caused by trying to organize around low-cardinality categorical or ordinal data
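The zip-code skew above can be seen even in a toy, plain-Python sketch (no Spark; the business data here is invented for illustration):

```python
# Toy illustration: grouping on a low-cardinality key concentrates
# most records under one "hot" key.
from collections import Counter

businesses = [
    ("10001", "deli"), ("10001", "laundromat"), ("10001", "bar"),
    ("10001", "cafe"), ("12866", "inn"), ("14850", "bookstore"),
]

# Count records per zip code -- the analogue of groupByKey's shuffle keys
per_key = Counter(zip_code for zip_code, _ in businesses)

# "10001" (NYC) holds 4 of the 6 records; a shuffle keyed on zip code
# would send most of the data to a single partition.
```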

Page 4:

What Does Imbalanced Data Do?

● Makes things explode, is the main problem!

● Causes unstable nodes

● Causes laggy nodes (stragglers)

● Causes OOM errors

● Causes Looooong shuffles for wide operations

● Often manifests well-down-the-pipeline so you find out...too late

Page 5:

How Do We Deal With Imbalanced Data? Part 1

● Operations Strategies (judicious use of groupBy & related shuffle-triggers)

● Filter early and often (Optimizer may handle some of this for you)

● Be cognizant of your partitioning mechanism
○ Custom Partitioner if needed/helpful

● Detecting Stragglers
○ Tasks in a stage that take over-long to execute relative to others
○ Sign of uneven partitioning
○ Use the Spark UI
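The "custom partitioner" idea can be sketched as plain routing logic (Python for illustration; in Spark you would extend `org.apache.spark.Partitioner`, and the hot key "10001" is an assumed example):

```python
# Hypothetical routing logic for a skew-aware partitioner: scatter a
# known hot key across all partitions instead of hashing it to one.
import random

def partition_for(key, num_partitions, hot_key="10001"):
    if key == hot_key:
        # spread the hot key's records over every partition
        return random.randrange(num_partitions)
    # normal deterministic hash routing for everything else
    return hash(key) % num_partitions
```

Note the trade-off: downstream aggregation on the scattered key needs a second combine step, since its records no longer co-locate.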

Page 6:

How Do We Deal With Imbalanced Data? Part 2

● Enforcing higher-cardinality
○ Re-keying pre-join (add ‘noise’ to keys)
■ E.g. use the zip+4, or add a business category
■ 10017-1234 or 10017-laundromat

● Broadcast smallish data to avoid a shuffle
○ You can push smaller dataframes up and join to them

businesses.join(broadcast(nyZips).as("z"), $"z.zip" === $"postal")

● Remove duplicates or combine via “mapPartitions” before a join/grouping.

● When all else fails: vertically partition the data, land it, then make a 2nd pass with remapped keys...subset, subset, subset.
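The re-keying ("noise") strategy above can be sketched in plain Python (the bucket count of 4 is an arbitrary assumption; tune it to the observed skew):

```python
# Sketch: salt hot keys before a group/join to raise cardinality, then
# strip the salt when re-aggregating the partial results.
import random

SALT_BUCKETS = 4  # assumed; more buckets = more spread, more combine work

def salt_key(zip_code):
    # "10017" -> e.g. "10017-2"
    return f"{zip_code}-{random.randrange(SALT_BUCKETS)}"

def unsalt_key(salted):
    # "10017-2" -> "10017"
    return salted.rsplit("-", 1)[0]
```

For a join, remember the small side must be expanded: each of its rows is duplicated once per salt bucket so every salted key finds a match.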

Page 7:

Example: Increasing Cardinality for Groupings

//Both work!

//This one may cause more shuffling, which increases network
//chatter, incurring delays
val nyBiz = spark.read.csv("data/imbalance.csv").coalesce(5)
case class Datum(city: String, postal: Int, category: String)
val biz = nyBiz.map(x => Datum(x.getAs[String](0),
  Integer.parseInt(x.getAs[String](1)), x.getAs[String](2)))
val groupedByPostal = biz.groupBy("postal").count.show

//This one may incur more "memory overhead" and therefore
//garbage collection
val nyBiz = spark.read.csv("data/imbalance.csv").coalesce(5)
case class Datum(city: String, postal: String, category: String)
val biz = nyBiz.map(x => Datum(x.getAs[String](0),
  x.getAs[String](1) + x.getAs[String](2), x.getAs[String](2)))
val groupedByPostal = biz.groupBy("postal").count.show

Page 8:

Proactively Managing Memory

● Understand what eats it up

● Understand what gets used when

● Inspect the web UI and review usage

● Do the math!

● Use SizeEstimator.estimate

● Manually configure ratios (in more extreme cases)

Page 9:

Components of Memory Usage

● Objects stored & the ‘meta’ they carry with them
○ Object headers can be > the data itself
○ Linked structures retain “pointers” to siblings
○ “Primitives” might be boxed

● Cost of object access

● Spark “memory overhead minimum”, ~ 384MB

● Serialization style
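To make the "boxed primitives" point concrete, a back-of-envelope calculation (assuming a 64-bit JVM with compressed oops; the ~12-byte header and 8-byte alignment are typical, not guaranteed):

```python
# A boxed java.lang.Integer: ~12-byte object header + 4-byte int
# payload, padded to an 8-byte boundary -> ~16 bytes per value, vs.
# 4 bytes per element in a primitive int[] array.
BOXED_BYTES = 16       # assumed per boxed Integer (before the pointer array)
PRIMITIVE_BYTES = 4    # per element of int[]
n = 1_000_000

boxed_mb = BOXED_BYTES * n / 2**20        # roughly 15.3 MB
primitive_mb = PRIMITIVE_BYTES * n / 2**20  # roughly 3.8 MB
```

So a million boxed integers cost ~4x the raw data, before counting the array of pointers to the boxes.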

Page 10:

Components of Memory Access

● Parallelism: too few partitions can be problematic
○ More partitions decrease each task's memory use (input data)
○ Aim for 2-3 tasks per CPU core

● Garbage Collection (the unseen devil in JVM performance problems)
○ Collect & review statistics
○ Adjust allocation to fit
○ Try different flags, and balance with “fraction” settings
■ spark.memory.fraction
■ spark.memory.storageFraction
○ This is its own science + art form: https://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

● Data Locality

Page 11:

Data Locality & Streaming

● Where is the Partition?
○ RDDs carry information about location
○ Hadoop RDDs know about the location of HDFS data
○ KafkaRDDs indicate each partition should get data from the machine hosting the Kafka topic
○ Spark Streaming - partitions are local to the node the receiver is running on

● What is “local” for a Spark task is based on what the RDD implementer decided would be local

● 4 Kinds of Locality
○ PROCESS_LOCAL - task runs in the same process as the source data
○ NODE_LOCAL - task runs on the same machine as the source data
○ RACK_LOCAL - task runs on the same rack
○ NO_PREF/ANY - no locality preference, or the task cannot run near the source data

● spark.locality.wait - determines how long to wait before changing the locality goal of a task

Page 12:

Here we can review the different metrics in the Spark UI around memory usage and data shuffled about.

Page 13:
Page 14:

Here we can look at the relative cost of tasks that involve shuffles - most of the more expensive stages involve shuffled data.

Page 15:

Specific Actions You Can Take - 1

● Design for size: use arrays and primitives over standard collections and richer types
○ Key off numbers, not strings
○ Use minimized objects to reduce overhead: http://fastutil.di.unimi.it/

● Calculate expected usage & size & parallelize accordingly
○ Pass in initializations (e.g. sc.textFile or spark.read.xx.coalesce(n))
○ Change the default - spark.default.parallelism

● Try dynamic allocation

● Change pointer size (< 32 GB RAM: set -XX:+UseCompressedOops in spark-env.sh)
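The parallelism and pointer-size knobs above typically land in Spark's config files; a hedged sketch (values are illustrative, not recommendations - `extraJavaOptions` is one alternative to the spark-env.sh route):

```properties
# spark-defaults.conf (illustrative values)
spark.default.parallelism        200
# compressed 32-bit object pointers; only helps when executor heaps < ~32 GB
spark.executor.extraJavaOptions  -XX:+UseCompressedOops
```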

Page 16:

Specific Actions You Can Take - 2

● Switch to Kryo for serialization
○ conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
○ Warning: must register your custom classes

● Create facilities for on-demand clusters to isolate “RBJs” (Really Big Jobs)

● Estimate Size

scala> import org.apache.spark.util.SizeEstimator

//The imported text file
scala> println(SizeEstimator.estimate(nyBiz))
62436248

//The resulting dataset post-map-to-case-class
scala> println(SizeEstimator.estimate(biz))
62440912

//The post-groupBy dataframe
scala> println(SizeEstimator.estimate(groupedByPostal))
62440904

Page 17:

Specific Actions You Can Take - 3

● Adjust Garbage Collection Settings
○ Turn on logging:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
○ Try the G1GC: -XX:+UseG1GC
○ Many others…

○ Many others…

● Adjust “Fractions”
○ spark.memory.fraction
○ spark.memory.storageFraction
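For reference, the two fraction settings with their stock defaults (as of Spark 2.x; adjust only after reviewing GC statistics):

```properties
# spark-defaults.conf
# fraction of (heap - 300MB) usable for execution + storage (default 0.6)
spark.memory.fraction         0.6
# portion of the above protected from eviction by execution (default 0.5)
spark.memory.storageFraction  0.5
```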

Page 18:

Session Review

● Definitions of Imbalanced Data

● Solutions for managing Imbalanced Data

● Understanding components of memory consumption

● Actionable steps for improving functionality

Page 19:

EMR: Elastic Map Reduce on AWS

Session 9: Amazon EMR

Page 20:

Session 9 - EMR Session Objectives

➢ Understand EMR Solution

➢ Know how to start and run a cluster

➢ Know how to interact with Spark on EMR

➢ Interfacing with S3 data

➢ Zeppelin with EMR

➢ Cost optimizations for storage and networking

➢ Scaling

➢ Debugging

Page 21:

EMR 101

● Elastic Map Reduce (Hadoop & Friends on AWS)

● Managed Cluster (Less devopsy stuff for you to do)

● Autoscaling

● Many Softwares
○ Spark!
○ Hadoop
○ HBase
○ Presto
○ Hive

Page 22:
Page 23:

YARN Schedulers - CapacityScheduler

● Default scheduler specified in Amazon EMR

● Queues
○ A single queue is set by default
○ Can create additional queues for workloads based on multitenancy requirements

● Capacity guarantees
○ Set minimal resources for each queue
○ Programmatically assign free resources to queues

● Adjust these settings using the capacity-scheduler classification in an EMR configuration object (or bootstrapping)

Page 24:

EMR Hadoop

● Preconfigured software for your convenience
○ AWS-instance-type-based YARN and Hadoop settings

● Contains Hadoop customizations that are uniquely AWS
○ Must build binaries using EMR (in other words, on-cluster)
■ Not the case for Spark
○ Must build binaries with the same Linux version

● Build on EMR > Copy to S3 > Run Step Sequence

Page 25:

EMR Spark: The Sales Pitch

● “Easy” to use/get started

● Cost savings (potential)

● Open Source tools (with mods in some cases)

● Managed

● Secure

● Flexible

Page 26:

EMR Spark

● Fundamentally “still a Yarn-based cluster”

● Supports pretty much all the same features you’d expect running your own

● Ala-carte opportunity to drink more AWS Kool-Aid
○ Data Pipeline
○ Encryption at rest/in transit
○ Aurora-based Hive metastore
○ Spot provisioning for cost measures
■ Can incur delays
○ IAM security measures
○ S3 Data Lake (decouple compute & storage)

Page 27:

Session Review

● What EMR is

● EMR’s purpose

● Spark on EMR Basics

Page 28:

EMR: Elastic Map Reduce on AWS

9.1 S3 Data Lake

Page 29:

EMR Spark: S3 Data Lake - Why?

The S3 Data Lake concept has some advantages

● High durability - S3 promises 11 9’s of durability (plus high availability)

● Security
○ Can constrain by IAM roles
○ VPC-only access
○ In-depth bucket policies
○ Encryption at rest

● Low Cost (dramatically lower than RDBMS or NoSQL storage)

● S3-Select (where viable/appropriate)

Page 30:

EMR Spark: S3 Data Lake

But some disadvantages too!

● Limited read/write speed

● Network latency

● “Ghost files”, “conceived files” (eventual consistency side effects)

● Somewhat confusing protocol addressing s3:// s3n:// s3a://

Page 31:

EMR Spark: S3 Data Lake - Protocol & File Access

● s3:// - Hadoop implementation of a block-based file system backed by S3
○ Also how you might be used to referencing an S3 URI directly in AWS

● s3n:// - “Native file system” access by Hadoop

● s3a:// - “s3n part 2” - the upgrade to s3n
○ Supports files > 5GB
○ Uses/requires the AWS SDK
○ Backwards compatible with s3n
○ NOT SUPPORTED in EMR!

● EMRFS - Wait...we’re back to s3:// - yep. AWS EMR has re-simplified the confusion back to just s3:// if you are on EMR

Page 32:

EMR Spark: S3 Data Lake - Latency Concerns

● Resolve S3 inconsistencies, if present, with “EMRFS consistent View” in cluster setup

● Use compression!
○ CSV/JSON - GZip or Bzip2 (if you wish S3-Select to be an option)

● Use S3-Select for CSV or JSON if filtering out ½ or more of the dataset

● Use other types of file store, e.g. Parquet/ORC

● Chunk your files.
○ Spark handles many small files better than a few big files, up to a point

Page 33:

EMR Spark: S3 Data Lake - Latency Concerns:Sizing

How big should my files be? It depends -

● No S3-Select
○ With GZip, 1-2GB tops. GZip cannot be split.
○ Splittable files: between 2GB and 4GB
■ Allows more than 1 mapper, increasing throughput
■ Goal is to process as many files in parallel as possible

Page 34:

EMR Spark: S3 Data Lake - Latency Concerns:Sizing

How big should my files be? It depends -

● Using S3-Select? Less of a concern
○ Input files must be CSV, JSON, or Parquet
○ Output files must be JSON or CSV
○ Files must be uncompressed, .gz (GZip), or .bz2 (Bzip2) (JSON/CSV only)
○ Max SQL expression length 256KB
○ Max result record length 1MB

Page 35:

EMR Spark: Ingesting and Landing Data

● Ingestion
○ We can often improve overall system performance by using S3DistCp to bring data in from S3 and push it to HDFS - but this is a whole extra step
○ We may ingest from other datastores of course: RDS, Redshift, Streaming (Kafka/Kinesis), Dynamo...etc. We can also ingest from Elasticsearch

● Landing
○ When we land data, we can temporarily land it to a Hive table for interactive exploration
○ We can land it, of course, to any JDBC storage
○ Often we will land it to S3 for further interaction with other systems (Athena, Presto...etc)
○ Fun tip - need it to feed your ES search? You can write your final result directly to Elasticsearch!

Page 36:

Session Review

● Value of S3 Data Lake

● Hindrances (pros/cons)

● Some ideas for how to store/retrieve data

Page 37:

EMR: Elastic Map Reduce on AWS

9.2 Setting up the Cluster

Page 38:

EMR Cluster: Setup

● Methods

○ AWS Console
■ Advantages: Simplicity, clarity

○ AWS CLI
■ Advantages: Completeness, scriptable

○ AWS SDK
■ Java
■ boto3 (Python)
■ ...etc…
■ Advantages: Infrastructure as Code (IaC) friendly, completeness

Page 39:

EMR Cluster: Setup: Console

Don’t be seduced by the one-pager (Create Cluster - “Quick Options”). Typically you want to use “Advanced” mode (if you use the Console at all)!

If you do use quick options, at least take note of your log location, and make sure you select Spark ;)

Instance type is a function of sizing exercises you presumably have already done, or perhaps will do after you run some trial code.

Step execution is good for 1-off job clusters (launch, do, terminate)

Page 40:

EMR Cluster: Setup: Console - Quick

Page 41:

EMR Cluster: Setup: Console - Quick

Get Coffee. This is not a super-quick procedure. The EMR machinery is doing a bit of work, and the more software you selected, the longer it will take.

Once this cluster is launched, it is really not much different, programmatically, from a local or on-prem cluster, except you have to SSH in to do much.

Look up the master node for your cluster in the Console UI, or:

> aws emr list-clusters
> aws emr list-instances --cluster-id j-YOUR-CLUSTER-ID
Or
> aws emr describe-cluster --cluster-id j-YOUR-CLUSTER-ID

Page 42:

EMR Cluster: Setup: Console - Quick

You can SSH in:

> ssh -i ~/.ssh/rf_emr_dev_access.pem [email protected]

--EMR banner shows --

> sudo spark-shell

If you don’t `sudo` you get a bunch of warnings basically saying the logs cannot be written.

Page 43:

EMR Cluster: Setup: Console - Quick

You can connect from Zeppelin:

(It’s already installed on the cluster by default; you just need an SSH tunnel)

Once a tunnel is set up, you can just “click the link” in the cluster-ui.

> ssh -i ~/.ssh/rf_emr_dev_access.pem -ND 8157 [email protected]

Page 44:

MINI-LAB - Launch a Quick Cluster

● Log into AWS

● Launch a Quick Cluster

● Connect from Spark Shell

● Run a few commands to prove it's working
○ Try parallelizing an array of some data and mapping it
○ Review the Spark UI (in the cluster window click “Enable Web Connection” and follow instructions)

● Bonus Credit - launch Zeppelin (in the cluster window click “Enable Web Connection” and follow instructions)

● Terminate the cluster

Page 45:

EMR Cluster: Setup: Console - Advanced

Page 46:

EMR Cluster: Setup: Console - Advanced

● Unsurprisingly, kinda the same, but with options

● Customize your software set
○ Need to ala-carte TensorFlow?
○ Contrast TensorFlow and MXNet?
○ Try out Presto?

● Customize the installed configurations (i.e. adjust hadoop-env, yarn-site..etc)

● Add steps & conditionally set auto-terminate

● Optimize pricing/Instance types

● Customize Security/VPC/Subnets

Page 47:

EMR Cluster: Setup: Console - Advanced

What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.

● Master Node, you are probably familiar with
○ Runs the HDFS NameNode service
○ Runs the YARN ResourceManager service
○ Tracks submitted job statuses and monitors the health of the instance groups
○ Like Highlander, there can be only one (per instance group/fleet)

● Core Nodes
○ Run the DataNode daemon (for HDFS)
○ Run the TaskTracker daemon
○ This is a scaling point

Page 48:

EMR Cluster: Setup: Console - Advanced

What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.

● Task Node
○ Does not run the DataNode daemon (not participating in HDFS)
○ Best for autoscale/spike capacity in your cluster

● Instance Fleets
○ Fully configurable cluster management
○ Able to take advantage of Spot instances (cost optimization)
○ Allows AWS to “mix and match” instance types, optimizing your pricing and their utilization. Can result in sudden-death nodes.
○ Can be used to really optimize, but is complex and requires experimentation
○ Can add a “Task Instance Fleet” to an active cluster

Page 49:

EMR Cluster: Setup: Console - Advanced

What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.

● Uniform Instance Groups
○ Simplified capacity management
○ While allowing flexible autoscaling setup
○ Specify purchasing options to manage cost
○ Don’t run the Master as a Spot Instance on any cluster you care about

Page 50:

EMR Cluster: Setup: Console - Advanced

3: Cluster Settings
There are a few things to note here, but the highlights for now are -

● Logging of course (location spec)

● “EMRFS consistent view” - remember when we talked about S3 “eventual consistency”?

● Bootstrap Actions - (NOT the same as ‘steps’)
○ Definitely an “advanced mode” option here, for:
■ Pre-loading some common data set onto each node
■ Installing additional software (Drill, Impala, Elasticsearch, Saltstack..etc)
■ Max of 16 actions

Page 51:

MINI-LAB - Configure an Advanced Cluster

● Log into AWS

● Configure an Advanced Cluster

● Play with the options

● Click some “i” icons

● Q&A

Page 52:

What Are These “Steps”?

“When you are done starting, do this one thing.” Then maybe shut down too.

● Available in:
○ Quick Launch - auto-terminate when done
○ Advanced - specify steps & termination option

● A “unit of work” submitted to the cluster
○ Stream processing
○ Hive/Pig
○ Spark job
○ Custom Hadoop

● Each has its own unique configuration

One disadvantage is that ephemeral clusters can be hard to troubleshoot.
Shutdown post-step is not required (when using Advanced Config).

Page 53:

EMR Cluster: Setup: Other Ways To Start

● Are you using IaC? Code it in.
○ On-Demand Jobs from Jenkins
○ Other “AWS-SDK”-based solutions

● Script it
○ Generally anything that can be done in the Console can be done in the CLI
○ Usually more options in the CLI
○ Also an IaC option here…
○ https://docs.aws.amazon.com/cli/latest/reference/emr/index.html

For guaranteeing execution insulation, I really like the one-off-cluster mechanism, but it is easily overkill for “many small/mid-sized jobs” environments.

Page 54:

EMR Cluster: Setup: Launch via CLI

Demo Lab 9.2
Follow along if you like/are able. What we’ll do:

1. Push a chunk of our data from earlier to S3

2. Take one of our lab projects as a Jar and push it to S3

3. Launch a cluster using “steps” and the AWS CLI that will

a. Start

b. Run Steps

c. Terminate

4. Validate the creation, output, and termination

Page 55:

EMR Cluster: Setup: CLI Lab 9.2

Asset Placement
We need to get our assets onto S3, where they are accessible to EMR.

● Build the Jar

● Use the AWS SDK to copy the jar

● Use the AWS SDK to copy the data

● Record those paths

Page 56:

Session Review

● Cluster Setup

● Cluster Management (a bit)

● Some decisions & options for clusters

● We set up a cluster! (or 2..)

● We learned EMR is a highly malleable environment, though the stock configuration gives a lot of benefit out of the box

Page 57:

EMR: Elastic Map Reduce on AWS

9.3 Tools on EMR Clusters

Page 58:

Many Tools Available

● Databases
○ Hive
○ HBase/Phoenix
○ Presto

● ML Tools
○ Mahout
○ MXNet
○ TensorFlow

● Data Streaming/Loading
○ Flink/Sqoop
○ Kinesis/Kafka/ES/Cassandra

● Workflow & Monitoring
○ Hue
○ Oozie

Aaand Notebooks!

● Zeppelin

● Jupyter

Page 59:

Notebooks on EMR

● Houston, we have options

● Zeppelin vs. Jupyter: what’s worth fighting for?
○ Well...Zeppelin gives you native Scala support…?
○ Yeah but...Almond is a Scala kernel for Jupyter
○ Jupyter has more better visualization and stuff...python libs man…
○ Yeah but Zeppelin is growing more quickly
○ Dude the data science guys LIKE Jupyter ok?
○ But multi-user. But authentication. But...
○ …scala….python...scala...python...jupyter...zeppelin

● Both are good. Both have merits. Both have some issues on EMR

● “What about Beaker man!” “But I like Databricks!”
- “Religious debate has no place in [data] science”

Page 60:

Zeppelin on EMR

Here’s the deal with Zeppelin, if that’s what you want to use

● Multi-user setup may be less effort

● It can be more secure than Jupyter, if that is a business concern

● Store Zeppelin notebooks on S3 so they don’t go away with the cluster!
○ If we don’t store off-cluster, we are scared to terminate, increasing cost
○ Security options here:
■ Access key/secret
■ IAM/User
■ Secure by-bucket if you require
○ Can be done two ways
■ Shelling in to the running cluster and updating the config
■ Configuring the cluster at startup using the “configurations” block

Page 61:

Zeppelin on EMR

● One-off clusters for analysis can use Spot instances. If you do this, make sure you set up to store the notebook to S3

● You can even set up your own EC2 with Zeppelin to run off the main cluster-master (so node deaths and cluster decommissions have less impact)

● Zeppelin on Amazon EMR does not support the SparkR interpreter

● Zepl is a 3rd-party solution from the Zeppelin folks, offering a product (ZeppelinHub) to ease the burdens here.

● https://www.zepl.com/blog/setting-multi-tenant-environment-zeppelin-amazon-emr/

Page 62:

Zeppelin Performance Notes 1

● Store Zeppelin notebooks on S3 so they don’t go away with the cluster. Ephemeral notebooks on cluster crash or decommission are just no fun
○ You can also store them on EFS if you bootstrap your cluster to do so

● (Potentially) Set your notebooks to use interpreter-per rather than shared
○ Configure “interpreter/spark interpreter” as “The interpreter will be instantiated ‘Per User’ in ‘scoped’ process” and click “Save” (JVM Isolation)
○ Otherwise a single interpreter will be used by all notebooks

● Understanding CPU/VCPU/YARN CPU/Zeppelin allocations
○ Do not expect your cluster settings to be in effect:

“Zeppelin does not use some of the settings defined in your cluster’s spark-defaults.conf configuration file, even though it instructs YARN to allocate executors dynamically if you have set spark.dynamicAllocation.enabled to true. You must set executor settings, such as memory and cores, using the Zeppelin Interpreter tab, and then restart the interpreter for them to be used.”

● Interpreter options override spark-submit options
○ spark.executor.memory > export SPARK_SUBMIT_OPTIONS="--executor-memory 10G ..."

Page 63:

Zeppelin Performance Notes 2

When creating your cluster -

● Use the Script Runner step or “Configurations” to customize the setup
○ Give more memory, e.g. in zeppelin-env.sh

■ export ZEPPELIN_MEM="-Xms4024m -Xmx6024m -XX:MaxPermSize=512m"

■ export ZEPPELIN_INTP_MEM="-Xms4024m -Xmx4024m -XX:MaxPermSize=512m"

● Limit the result sets, e.g. in the interpreter
○ zeppelin.spark.maxResult

● Limit the interpreter output, e.g. zeppelin-site.xml
○ <name>zeppelin.interpreter.output.limit</name>

We’ll do an example of this in the next lab.
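A sketch of what those two limits look like in practice. The property names come from the slide; the values are made up for illustration:

```python
# Illustrative only: the two limit settings named above, with made-up values.
limits = {
    "zeppelin.spark.maxResult": 1000,             # rows returned to the notebook
    "zeppelin.interpreter.output.limit": 102400,  # bytes of interpreter output
}
# zeppelin-site.xml wants <property><name>...</name><value>...</value></property>
xml = "\n".join(
    f"<property>\n  <name>{k}</name>\n  <value>{v}</value>\n</property>"
    for k, v in limits.items()
)
print(xml)
```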

Page 64:

Lab: Set Up Zeppelin on EMR

Lab 9.3-A

What we’ll do

1. Launch a single-node cluster
2. Add Zeppelin & Spark
3. Do some work in Zeppelin
4. Terminate the cluster

Page 65:

Lab: Set Up Zeppelin on EMR with S3 Storage

Lab 9.3-B FIXME/TODO: NEED TO FINISH

What we’ll do

1. Launch a single-node cluster
2. Create a folder for S3 persistence
3. Configure to use S3 for Zeppelin persistence
4. Edit & save the notebook
5. Check S3
6. Terminate the cluster
7. Start a new one
8. Load the notebook & run it
9. Terminate the cluster

Page 66:

Jupyter on EMR

Notice that the EMR documentation only includes instructions for JupyterHub - which we will get to…

● To run a Jupyter Notebook directly against EMR requires a few extra steps
○ We need to configure EMR to have Jupyter available
○ We need to enable jupyter_spark
○ We need to configure security groups
○ We need to create the cluster with the custom bootstrap, and script-runner step
○ We need to manually start pyspark

● That seems like a lot of work.
○ WAY more work than running it locally

● Let’s just use JupyterHub, for goodness’ sake

Page 67:

Running Jupyter Off-Master - JupyterHub, SparkMagic, Spark-Rest/Livy

● Livy: Apache incubated a REST API for Spark called Livy○ Livy enables off-cluster hosting of notebooks for JupyterHub

● SparkMagic: Extra bits for Jupyter with Spark via JupyterHub + Livy○ Automatically installed on EMR with the JupyterHub package

● Spark-Rest is Livy - the server side of the equation
○ We really don’t need to think much about it
○ Abstracted by wrapper APIs under the covers of JupyterHub

● This creates a whole new set of capabilities

● As usual, EMR introduces some hiccups
○ Cluster access from AWS
○ Means an SSH tunnel is usually required
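A minimal sketch of that usual SSH tunnel, assuming JupyterHub’s default EMR port of 9443; the hostname and key path are placeholders:

```python
# Hypothetical SSH tunnel command for JupyterHub on the EMR master.
# 9443 is JupyterHub's default HTTPS port on EMR; host/key are placeholders.
master_dns = "ec2-203-0-113-10.compute-1.amazonaws.com"  # placeholder
key_path = "~/keys/emr-key.pem"                          # placeholder
tunnel = (f"ssh -i {key_path} -N "
          f"-L 9443:localhost:9443 hadoop@{master_dns}")
print(tunnel)  # run this, then browse to https://localhost:9443
```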

Page 68:

Livy/SparkMagic Notes

● SparkMagic
○ Included with JupyterHub
○ Uses Livy
○ Gives extra ‘magics’ - %% capabilities in your notebook
○ Automatic visualization of SQL queries in the PySpark, PySpark3, Spark and SparkR kernels; use an easy visual interface to interactively construct visualizations, no code required
○ Ability to capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib)

● Introduces some limitations, notably
○ “Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice this means that you must use Python for client-side data manipulation in %%local mode”
○ Which can be confusing to readers….and writers…

● You might also want to keep an eye on Toree for more ‘magics’ https://toree.apache.org/

Page 69:

Notebooks: Jupyter Mini-Lab - 9.4

Lab (or Demo) 9.3.2
Follow along if you like/are able. What we’ll do:

1. Push a chunk of our data from earlier to S3 (or reuse it if it’s still there)

2. Launch a cluster using “steps” and the AWS CLI that will use JupyterHub

3. Start up an SSH tunnel

4. Do some work in the notebook

5. Terminate the cluster

Page 70:

Externalizing Zeppelin

It is also possible, and probably preferable, to completely externalize Zeppelin to a secondary EC2 instance near your cluster(s).

● Isolation

● Insulation

● Multi-cluster access

https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/

Page 71:

Notebooks on EMR, Review

● Zeppelin vs Jupyter

● Both are good. Both have merits. Both have some issues on EMR
○ Just know they must run “up there” and you’ll be ok

● For highest stability, externalize your setup from your cluster

● Evaluate your security needs and consider if those requirements drive you toward either solution in particular

● Outside of that, the normal arguments for/against each do not differ for being on EMR

Page 72:

EMR
Elastic Map Reduce on AWS

9.4 EMR Cluster Troubleshooting

Page 73:

Troubleshooting & Debugging

Step 1: Check the logs

Page 74:

Troubleshooting & Debugging

Step 2: Find the logs

Page 75:

Why So Difficult?!

Distributed Systems

● Are inherently challenging to troubleshoot

● Require a distributed mindset

● Require a deeper understanding of the components

● Take time to recalibrate your internal Sherlock Holmes

● Have no golden hammers or silver bullets

● May contain additional tooling for troubleshooting

Page 76:

Regular Spark

Spark UI already comes with some great tools

● Master UI
○ Workers
○ Cluster Details

● Spark Worker UI
○ Jobs
○ Stages
■ DAG Visualization
■ Event Timeline
○ Storage
○ Environment
○ Executors
■ Memory/Cores/IO
○ SQL

What more could you need?

Page 77:

Maximize Information, Minimize Surface Area

All That There Is

Page 78:

EMR + Spark

The AWS EMR solution has taken extra steps to help limit the “surface area” needed to find most problems.

● Enable Debugging

○ With debugging enabled, you get a lot of information retained (contextually), and accessible via the AWS Console without having to sort through S3 log folders and peruse the gz files

● Without it, the data IS still there, you just have to dig a little harder

Quick Demo: Navigate to an S3 log file and open it.

Page 79:

EMR Best Practices

● Do a ‘dry run’ of completed code against limited datasets.
○ Great case for notebooks
○ Great case for a small one-off cluster you can leave running a bit for diagnostics
○ Enable Debugging
○ Configure Logging
○ Logging IS expensive; don’t leave it on for production runs, only test cases and diagnostics

● Consult the “common errors” doc in AWS (link below) - good chance you’ll see your problem & solution listed there.

● Master Node log browsing
○ If you’ve a good idea of the problem and don’t have debug enabled, most of the logs you need will live on the Master node

Page 80:

Debug Logs In Context

(Pages 80-84: AWS Console screenshots of browsing the debug logs in context)

Page 85:

Debug Logs: Lab 9.4

Lab 9.4

What we’ll do

● Navigate the Debug logs of previously launched clusters

Page 86:

Troubleshooting

Understand your problem class

● Sizing/Scaling? OOM, Hung Nodes, Stragglers
○ Review cluster configuration
○ Review Zeppelin interpreter config (if appropriate)
○ Do some math with your data sizes

● Can’t Start Cluster?
○ Bootstrapping failures
○ One-off jobs failing
○ Security

● Processing
○ Data formats - noisy data throwing exceptions?
○ Data access - security preventing file access?

Page 87:

Real-Failure Suppression

Sometimes in Spark, the real problem is hidden by downstream tasks

● Log parsing failures

● Run explain plan

● Construct an alternative pipeline by
○ Checkpointing
○ Landing data & making your job multi-step

● Sometimes the best solution is to not try to make your Spark job a one-shot, and get back to basic ETL methodology
○ Extract (O) - one job - load (O) /convert/filter/land (O’)
○ Transform (O’) - one job - load (O’) /join/filter/land (O’’)
○ Load (O’’) - one job - load (O’’) /load/aggregate/land
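The back-to-basics ETL idea above can be sketched with plain Python standing in for the separate Spark jobs, landing intermediate data between steps (file-based JSON handoff stands in for Parquet-on-S3; all names and fields are illustrative):

```python
# Minimal sketch of breaking a one-shot job into land-between-steps ETL.
import json, os, tempfile

def extract(raw, out_path):
    cleaned = [r for r in raw if r.get("amount") is not None]  # filter noisy rows
    with open(out_path, "w") as f:
        json.dump(cleaned, f)                                  # land O'

def transform(in_path, out_path):
    with open(in_path) as f:
        rows = json.load(f)
    enriched = [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]
    with open(out_path, "w") as f:
        json.dump(enriched, f)                                 # land O''

def load(in_path):
    with open(in_path) as f:
        rows = json.load(f)
    return sum(r["amount_cents"] for r in rows)                # aggregate

tmp = tempfile.mkdtemp()
o1, o2 = os.path.join(tmp, "o1.json"), os.path.join(tmp, "o2.json")
extract([{"amount": 1.5}, {"amount": None}, {"amount": 2.0}], o1)
transform(o1, o2)
print(load(o2))  # → 350
```

Because each step lands its output, a failure in transform no longer hides (or re-runs) the extract work, and each landed file gives you a place to look when something breaks.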

Page 88:

Real-Failure Suppression, Step By Step Testing

1. Launch a cluster with debugging & termination protection

2. Configure a Notebook with enough memory and cores for the interpreter that you are confident it is sufficient for your data population

3. Look for opportunities in each step for errors.
a. Use type-testing (number? string?)
b. Use try/catch/log semantics
c. When you are 100% sure the problem isn’t in this step, move on

4. Check the logs after each step via the Console

5. Correct any errors. If your automatic job still fails after this effort, you should have sufficiently ruled out programming problems, and should look for scale issues, security issues, latency problems, and other “infra” surfaces
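Step 3’s type-testing and try/catch/log idea might look like this in plain Python (the field name and the defaulting policy are invented for illustration):

```python
# Hedged sketch of "type-test, then try/catch/log" for one parsing step.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("etl")

def parse_amount(raw):
    try:
        value = float(raw)          # type-test: must be numeric
    except (TypeError, ValueError):
        log.warning("bad amount %r; defaulting to 0.0", raw)
        return 0.0                  # made-up policy: default rather than crash
    return value

print([parse_amount(x) for x in ["3.5", "oops", None, 7]])  # → [3.5, 0.0, 0.0, 7.0]
```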

Page 89:

The Usual Suspects 1 (how are you coding)

Stop errors before they blow up at cost by -

● Testing your code, ideally locally or on a 1-server “cluster”, with a limited-but-representative dataset

● Unit Testing your code with small data that contains expected variations and corner cases in your raw data

● Try/Catch works in Spark too...error handling

● Defect-based unit testing - when something blows up and it takes you 2 hours or 2 days to find the culprit, and you code a workaround - code a test for that workaround

● Configurable debug logging in your code
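A minimal sketch of defect-based unit testing, using a made-up zipcode-normalization workaround: once the fix exists, pin it down with tests so the two-day hunt never repeats:

```python
# Illustrative defect-based test: suppose a null zipcode blew up the job,
# and this was the workaround (function and policy are invented).
def normalize_zip(z):
    if z is None or not str(z).strip():   # the workaround for the defect
        return "00000"
    return str(z).zfill(5)

# regression tests pinning the workaround
assert normalize_zip(None) == "00000"
assert normalize_zip("123") == "00123"
assert normalize_zip(" ") == "00000"
print("regression tests pass")
```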

Page 90:

The Usual Suspects 2 (how are you running)

Your problem is really only happening post-dev with your shiny unit-tested code. I know we’ve talked about this before, but it’s worth a second look.

● Review your DAG in the Spark UI
○ Look for stragglers
○ Re-evaluate your partitioning schemes

● Do a little math and double check your data size assumptions

● Look for opportunities to:
○ Coalesce or Repartition
○ DISTRIBUTE BY, SORT BY, CLUSTER BY

● If that all fails, try pulling in a good sample of your data, coalesce it to a reasonable number of partitions, and pay attention to the Spark UI

Page 91:

The Unusual Suspects

AWS has its own conditions which may cause cluster instability

● Service Outages http://status.aws.amazon.com/

● Usage Limits
○ EC2 default quota of 20 EC2 servers
■ EC2 QUOTA EXCEEDED error
○ S3 bucket count limit (100 per acct)
■ Consider nesting by env and project

● Networking issues communicating between resources, in or across VPCs
○ Subnets run short of addressable IPs for large clusters

● Last state change before termination may hold clues

Page 92:

Debugging Scala/Spark Applications

● Most bugs can be replicated from a reasonable sample-set of the data

● Running locally in an IDE can help you to perform interactive real-time troubleshooting of the running application
○ Any Scala IDE can do this - I like IntelliJ

● With debug opts enabled you can also runtime-debug a running cluster - locally or on EMR
○ You’ll obviously not do this with Prod
○ It’s best to do this with a small cluster and data-switched breakpoints

● Arguably the best way to solve any issues with data validation, data munging, joins, failing actions.

● NOT the best way to troubleshoot performance issues!

Page 93:

Debugging PySpark Applications

● Some messages are from the JVM & some from Python, which can lead to some confusion

● In Jupyter, the useful error messages are most likely in the console log

● In YARN, YARN logs. Point being, the stuff in your notebook may feel pointless

● .take(1), .count(), and loop/print are your allies

● Lambdas can make troubleshooting even more difficult
○ You can write tests in Python too! https://github.com/holdenk/spark-testing-base

● As with Scala, being able to run your code in a debuggable environment (e.g. Pycharm) can dramatically increase productivity, though it can feel alien (or, “low level”) to the data analyst accustomed to notebooks

Page 94:

EMR
Elastic Map Reduce on AWS

9.5 The Operations Perspective

Page 95:

Observability & Monitoring

We have been talking about “how the application experiences the infrastructure”

Now we will be talking about “how the infrastructure experiences the applications”

Page 96:

Observability & Monitoring

What is the delineation of responsibility in your organization?

Page 97:

Observability & Monitoring

Page 98:

Observability & Monitoring

On a case-by-case basis we may know a particular Spark job we are responsible for didn’t work out, and that’s what we’ve been talking about. At a system level, how do we know when things are going right or wrong?

Let’s take a look at some options

● Cloudwatch - AWS log/log-monitoring solution

● Ganglia - cluster visualization tool

● Third Party Options
○ Influx TICK Stack
○ Prometheus
○ Home-rolled with any number of time-series DBs
○ (Paid solutions) Loggly, Datadog...etc

Page 99:

Observability & Monitoring: Cloudwatch

● More numbers than you can probably stomach!
○ Grows linearly by #metrics * #jobs

● Find the ones that matter for your purposes

● This can be time consuming and requires developing some subject matter expertise

● One limitation is any aggregation of system metrics across clusters must be performed manually

Page 100:

Observability & Monitoring: Cloudwatch

This makes ephemeral clusters a bit harder to deal with.

Page 101:

Observability & Monitoring: Cloudwatch

Tough to group across business-purposes when organized this way

Page 102:

Observability & Monitoring: Cloudwatch

Events vs Metrics

● “Actions vs Data”

● Goals with Events
○ Know “something happened” or “something changed”
■ E.g. Cluster status went isIdle = true
■ Great for creating actions you want to know about

● SNS yourself a message for:
○ Scale out events
○ Idle or Zombie server decommission

● You can generate custom Events from your app as well!

● Metrics: Know the usage of the system (at a very fine-grained level)

Page 103:

Observability & Monitoring: Cloudwatch Rule

Page 104:

Observability & Monitoring: Cloudwatch Explore

9.5.1 Mini lab (group or demo-style)

Let’s review some Cloudwatch metrics from the clusters we’ve run.

● MemoryAvailableMB
● IsIdle
● CoreNodesRunning
● S3BytesRead
● MRUnhealthyNodes
● ...etc...

Page 105:

Observability & Monitoring: Cloudwatch: Custom Dashboard

Page 106:

Observability & Monitoring: Ganglia

Another “box-solution” provided by AWS for EMR is Ganglia.

http://ganglia.info/

“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters.”

Lab 9.5: Let’s launch one cluster with Ganglia, as a group, and poke around a bit, separately.

Page 107:

Observability & Monitoring: Ganglia

(Pages 107-113: Ganglia dashboard screenshots)

Page 114:

Observability & Monitoring: 3rd Party

Since EMR is so flexible in custom bootstrapping, any agent you like can be added at cluster-provisioning time to broadcast to any accessible target.

Whether you decide to use TICK stack, Prometheus, or another system that you hand-roll, you can accomplish the goal of creating graphs and alerts about your infrastructure.

Page 115:

Observability & Monitoring: Tick Stack

Tick stack is an “open source core” for a time-series-data platform built to handle metrics and events.

This open source core consists of the projects —

● Telegraf - A collection and reporting agent

● InfluxDB - A high performance time-series database written in Go with 0 dependencies

● Chronograf - Dashboarding

● Kapacitor - Realtime batch-and-streaming data processing engine for munging data from InfluxDB

-- collectively called the TICK Stack.

Page 116:

EMR
Elastic Map Reduce on AWS

9.6 EMR Cluster Optimization

Page 117:

Scaling: Plenty of Knobs to Turn

● Sensible defaults can get you a long way
○ Dynamic Allocation is “on by default”
■ This requires the Shuffle service
● Spark Shuffle Service is automatically configured by Amazon EMR
■ maxExecutors = infinity

● Autoscaling can be set up for Instance Groups

● Spark parameters can be adjusted with --configurations

● Many defaults are set based on instance types selected

Page 118:

Scaling: Defaults

By default, Spark in EMR picks up its basic settings from instance type selected

spark.dynamicAllocation.enabled true

spark.executor.memory Setting is configured based on the core and task instance types in the cluster.

spark.executor.cores Setting is configured based on the core and task instance types in the cluster.

You can adjust these defaults and others on cluster creation with the --configurations classification: spark-defaults

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
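For example, a `spark-defaults` classification file for that `--configurations` flag might be built like this (the property values are placeholders, not recommendations):

```python
# Hedged sketch of a spark-defaults override for EMR's --configurations.
import json

spark_defaults = [{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.executor.memory": "4g",   # placeholder value
        "spark.executor.cores": "2",     # placeholder value
    },
}]
with open("spark-config.json", "w") as f:
    json.dump(spark_defaults, f, indent=2)

# aws emr create-cluster ... --configurations file://spark-config.json
print("wrote spark-config.json")
```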

Page 119:

Scaling: maximizeResourceAllocation: true

On cluster launch, settings you can supply via --configurations within Classification:Spark that affect how performance is managed. If maximizeResourceAllocation:true, then

spark.default.parallelism 2X number of CPU cores available to YARN containers.

spark.driver.memory Setting is configured based on the instance types in the cluster. This is set based on the smaller of the instance types in the two instance groups (master/core)

spark.executor.memory Setting is configured based on the core and task instance types in the cluster.

spark.executor.cores Setting is configured based on the core and task instance types in the cluster.

spark.executor.instances Setting is configured based on the core and task instance types in the cluster. Set unless spark.dynamicAllocation.enabled explicitly set to true at the same time.

Page 120:

Scaling: maximizeResourceAllocation: true

● Limits your cluster to one-job-at-a-time

● Best for single-use or single-purpose clusters

Page 121:

Scaling: dynamicAllocation Notes

Dynamic Allocation may be “on by default” in EMR, but it still has some of its own knobs to turn.

spark.dynamicAllocation.executorIdleTimeout Default: 60 - seconds of idle time after which an executor is “removable”

spark.dynamicAllocation.cachedExecutorIdleTimeout Default: Infinity - the lifespan of an executor which has cached data blocks.

spark.dynamicAllocation.initialExecutors Default: spark.dynamicAllocation.minExecutors - Default number of executors for DA, only if < --num-executors.

spark.dynamicAllocation.maxExecutors Default: Infinity - upper bound of num executors.

spark.dynamicAllocation.minExecutors Default: 0 - Number of executors ‘by default’.

spark.dynamicAllocation.executorAllocationRatio Default: 1 - Ratio of executors to tasks (1:1 is maximum parallelism)

Page 122:

Scaling: Some Numbers

● YARN: when tuning, be sure to leave at least 1-2GB RAM and 1 vCPU for each instance's O/S and other applications to run too.
○ The default amount of RAM seems to cover this, but this will leave us with (N-1) vCPUs per instance off the top

● Executors: Parallelism is the goal; keep cluster size in mind when setting parameters. If you are not using maximize and dynamic allocation, you might, e.g., have 3 machines with 4 CPUs each. Leaving 1 per node for the system, you have 3 usable each, so --num-executors = 9 (3 executors per node) would be reasonable.
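The arithmetic in that example, spelled out (the numbers are the slide’s hypothetical cluster, not a rule):

```python
# Quick sanity math for the 3-node, 4-vCPU example above.
nodes, vcpus_per_node, reserved_per_node = 3, 4, 1
usable_cores = nodes * (vcpus_per_node - reserved_per_node)  # 3 * 3
executor_cores = 1                       # one core per executor here
num_executors = usable_cores // executor_cores
print(num_executors)  # → 9
```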

● Executor-cores: How many parallel tasks can an executor take on?
○ Think about time spent on IO and determine the ratio
○ Each executor-core will have various operations in which they are waiting on other things (reads/writes), and so increasing executors and reducing cores-per can result in better performance

Page 123:

Scaling: Drive Space

● By default you generally get a 10GB EBS volume
○ Add volumes to increase storage when drive space is a problem
○ Add volumes to offset memory vs cpu vs storage inequities (in other words, you keep running out of drive space, but the RAM/CPU are fine)
○ These are also ephemeral!
○ “EBS-Optimized” = network traffic is dedicated, not shared

● You can add additional volumes, but you must consciously adjust configuration to be aware of and take advantage of them.
○ Check the directories where the logs are stored and change parameters as needed

Page 124: Addendum Session 5

Scaling: Drive Space

● Cross-cluster or intra-cluster storage can be performed using EFS rather than S3, if preferred. This is not recommended practice for log files.

● Change the performance guarantees by selecting custom EBS volume types
○ Provisioned IOPS SSD - high performance (ops/sec)
○ Throughput Optimized - high throughput (MiB/sec)
○ AWS measures IOPS in 256 KiB-or-smaller blocks

Page 125: Addendum Session 5

Scaling: Drive Space

If you decide to create an EBS volume to tolerate more local-logs, you will want to bootstrap your environment to accommodate this. Set the log path in yarn-site, and possibly do a custom mount operation to guarantee the device.

[hadoop@ip-172-31-50-48 /]$ mount
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=7686504k,nr_inodes=1921626,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
/dev/xvda1 on / type ext4 (rw,noatime,data=ordered)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/xvdb1 on /emr type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdb2 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdc on /mnt1 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdd on /mnt2 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvde on /mnt3 type xfs (rw,relatime,attr2,inode64,noquota)
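Setting the log path in yarn-site means overriding yarn.nodemanager.log-dirs to point at the added volume. A minimal sketch of that property is below; the mount point /mnt1/yarn/logs is a hypothetical example, and on EMR this would typically be applied through a configuration classification or a bootstrap action rather than by editing the file directly:

```xml
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/mnt1/yarn/logs</value>
</property>
```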

Page 126: Addendum Session 5

Scaling: Autoscaling

● Autoscaling requires Instance Groups and is not supported with Instance Fleet

● EMR scaling is more complex than EC2 autoscaling
○ Core Nodes vs. Task Nodes
○ Core Node decommission times are longer due to HDFS

● Scale Out != Scale In
○ Scale-out policies can be more flexible
○ Scale-in must be more prudent

Page 127: Addendum Session 5

Scaling: Scale In: Switches

● “Amazon EMR implements a blacklisting mechanism in Spark that is built on top of YARN's decommissioning mechanism. This mechanism helps ensure that no new tasks are scheduled on a node that is decommissioning, while at the same time allowing tasks that are already running to complete.”

spark.blacklist.decommissioning.enabled
Default: true - Spark does not schedule new tasks on executors running on a decommissioning node. Tasks already running are allowed to complete.

spark.blacklist.decommissioning.timeout
Default: 1 hour - After the decommissioning timeout expires, the node transitions to a decommissioned state and EMR can terminate the node's EC2 instance. Any tasks still running after the timeout expires are lost or killed and rescheduled on executors running on other nodes.

spark.decommissioning.timeout.threshold
Default: 20 seconds - Allows Spark to handle Spot instance terminations better, because Spot instances decommission within a 20-second timeout regardless of the value of yarn.resourcemanager.decommissioning.timeout, which may not give other nodes enough time to read shuffle files.

spark.stage.attempt.ignoreOnDecommissionFetchFailure
Default: true - When set to true, helps prevent Spark from failing stages (and eventually failing the job) because of too many failed fetches from decommissioned nodes. Failed fetches of shuffle blocks from a node in the decommissioned state do not count toward the maximum number of consecutive fetch failures.

Page 128: Addendum Session 5

Sizing Suggestions

● Memory at 3X your data size expectations

● Enough cores to reasonably parallelize your data, assuming you’ve also worked through the partitioning scenarios

● Filter, filter, filter. Narrow, narrow, narrow.

● Ephemeral clusters have fewer variables

● Shared clusters have MANY more details to consider

● This is as much art as science, and is invariably “case-by-case”
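The memory rule of thumb above (“3X your data size expectations”) can be sketched as simple arithmetic. This is a back-of-the-envelope planning sketch under that slide's assumption, not a precise Spark memory model; the function names are illustrative:

```python
def cluster_memory_gb(expected_data_gb, multiplier=3):
    """Slide rule of thumb: provision roughly 3x the data size
    you expect to process across the cluster."""
    return expected_data_gb * multiplier

def executors_needed(total_memory_gb, executor_memory_gb):
    """How many executors of a given size cover that footprint
    (ceiling division, since a partial executor doesn't exist)."""
    return -(-total_memory_gb // executor_memory_gb)

total = cluster_memory_gb(100)             # 100 GB of data -> 300 GB RAM
print(total, executors_needed(total, 16))  # prints: 300 19
```

Real sizing also has to account for overhead (spark.executor.memoryOverhead, the O/S reserve per node), which is part of why this stays “case-by-case.”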

Page 129: Addendum Session 5

Sizing/Setup Suggestions

● Use HDFS for intermediate data storage while the cluster is running and Amazon S3 only to input the initial data and output the final results.

● If your clusters will commit 200 or more transactions per second to Amazon S3, contact AWS support to prepare your bucket for a higher request rate, and consider using key-partitioning strategies as described in the AWS documentation

● Set the Hadoop configuration setting io.file.buffer.size to 65536. This causes Hadoop to spend less time seeking through Amazon S3 objects.

● If listing buckets with large numbers of files, pre-cache the results of an Amazon S3 list operation locally on the cluster.
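The io.file.buffer.size suggestion above can be applied on EMR with a configuration classification at cluster-creation time. The JSON below is a sketch of that mechanism (the core-site classification is where this Hadoop property lives):

```json
[
  {
    "Classification": "core-site",
    "Properties": {
      "io.file.buffer.size": "65536"
    }
  }
]
```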

Page 130: Addendum Session 5

Cost Optimization Recommendations

● Ephemeral clusters which auto-terminate, for spark-submit & sizeable jobs
○ Can integrate with Jenkins for a seamless commit/execute CI/CD pipeline

● Dynamic allocation with a minimal primary cluster for active analysis
○ Keep a good pool of Task Nodes available

● Off-cluster notebook connectivity and management (JupyterHub, Zeppelin, Livy)
○ The cluster's Core Node pool should remain relatively fixed to reduce decommission time
○ The primary scale point should be Task Nodes
○ Task Nodes are the best case for Spot Instances (least risk, least cost)

● Off-cluster notebook storage

● Cloudwatch alerts with SNS listeners that proactively act or message

Page 131: Addendum Session 5

Cost Optimization In Dev

● Watch how you work
○ Develop locally - grab a file and start the process
○ Troubleshoot locally - use an IDE or local notebook to evaluate your work and debug

● Go to the cluster when you have a job you are ready to productionize (it no longer fits on your local machine)

● Cluster job fails?
○ Get back to local once you understand the failure mode
○ Emulate, correct, and re-deploy

● Too often we get into the mindset of “just this one tweak will fix it” and waste hours upon hours of cluster-runtime cycles.

Page 132: Addendum Session 5

Cost Source Notes

● EMR costs are on-top-of the underlying infrastructure (EC2) costs.

● S3 costs are around $700/mo for 10 TB with ‘reduced redundancy’. Contrast this with Redshift at $1000/TB/mo at the lowest tier (3-year buy-in).

● You may be charged for use of “SimpleDB” when you enable debugging

● If you add large EBS volumes to your clusters this can add up. Important if you are writing to HDFS, using Hive, or expect a lot of spill-to-disk or disk-cache

● In-Region data transfer should not add cost. Transferring data across regions will. Bear this in mind when establishing any data-landing practices

● You can save a ton of money leveraging reserved & spot instances
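Normalizing the S3 and Redshift figures above to a per-TB monthly rate makes the contrast concrete. Note these prices are the slide's examples, not current AWS pricing:

```python
# Slide figures: 10 TB in S3 'reduced redundancy' at ~$700/mo,
# Redshift lowest tier (3-year buy-in) quoted per TB per month.
s3_monthly_10tb = 700
s3_per_tb_month = s3_monthly_10tb / 10   # -> $70/TB/mo
redshift_per_tb_month = 1000

print(s3_per_tb_month, redshift_per_tb_month / s3_per_tb_month)
```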

Page 134: Addendum Session 5

EMR: Elastic Map Reduce on AWS

9.7 EMR Security Notes

Page 135: Addendum Session 5

What’s at Risk?

● Being based on AWS Technologies means you have “all the basic” AWS tools to help you secure your system.

● You launch your cluster in a VPC

● You leverage completely customizable IAM roles to
○ Interact with other services
○ Allow cleanup
○ Autoscale

● You leverage completely customizable Security Groups

● The typical “risk” is identified as in-org risk

Page 136: Addendum Session 5

What’s at Risk?

● In-Org Risk
○ Some teams cannot see other teams' data
○ The organization wants to split resources by budget across departments
○ Custom roles/groups can be useful for this

● Extra-Org Risk
○ The most common failing is when data gets left lying around
○ S3 public buckets (“Hey, I couldn't get the S3 policy set right, so I just made it public”)
○ Content emitted to email through notifications & SNS that contains sensitive information

Page 137: Addendum Session 5

Risk Mitigation: Mechanisms for Security

● AWS Level
○ IAM Roles
○ CloudTrail
○ S3 Policies
○ Firewall/VPN

● Spark/Hadoop Level
○ LDAP/AD integration
○ Authorized-user access (IAM + LDAP)
○ HDFS permissions/ACLs
○ Kerberos

● Mechanisms
○ Lock down access (IAM Roles, VPCs, S3 Policies, LDAP)
○ Audit access (CloudTrail)
○ Encrypt

Page 138: Addendum Session 5

Risk Mitigation: Encryption Options

There are many options for at-rest and in-transit encryption in the Spark/EMR/S3 ecosystem.

● What matters to your organization?

● What attack vectors concern you?

● Do you consider VPC secure?

● Do you need to protect yourself from internal threats?

● What are your regulated surfaces/relevant responsibilities?

Page 140: Addendum Session 5

Time Remaining?

● Q&A

● Lab Struggles & Assistance

● Group Experiments?

● Active problems in the extant system?

● Thank you and good night!