
Apache Spark Clusters for Everyone | AWS Public Sector Summit 2016


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Khuloud Odeh, Alex C. Engler

June 20, 2016

Apache Spark Clusters for Everyone
Easy Access to Amazon EMR Spark Clusters Using R and Python in the Browser

Khuloud Odeh, VP of IT and CIO | [email protected] | @kodeh

Our mission is to open minds, shape decisions, and offer solutions through economic and social policy research.

Data-driven evaluation informing public policy decision-making
Telling stories through interactive data visualizations
Simulating policies before they happen using microsimulation models

Extend and strengthen research

Raise visibility & influence

Grow revenue & flexible resources

Improve productivity & work experience

High performance computing

Machine learning

Big data

Web and social

Data visualizations

Outreach

Automated process

Business intelligence

CRM

User-centric systems

Post-modern ERP

Mobility

Transformation, not simply modernization

Smarter. Faster. Better.

Our IT challenge

Publication management
Data analytics & visualization

Statistical programming
Microsimulation and modeling

Cloud-first
Scalable & elastic architecture

Our technology strategy

Urban website today

CMS, publications management and DataViz in the cloud

Before → Today

Modeling in the cloud architecture concept

[Architecture diagram: auto-scaled web servers with R CPUs, queues, a networked data repository (shared folder / NoSQL), Hadoop compute clusters, and an RDBMS analysis server, with the slow and fast data paths marked.]

Me:

• Alex, hi.

• Data Scientist @ Urban Institute

• Professor of Data (Viz & Science) + Public Policy @ Georgetown University and Johns Hopkins University

Our problem

1. Big data

2. Small budget

3. Advanced statistics

4. Programming limitations

Our solution

Apache Spark +

R/Python IDEs in the browser +

Amazon Elastic MapReduce

Our solution

Apache Spark

• Distributed, in-memory computing framework made for big data

• Written in Scala/Java, but has R and Python APIs

• Good and improving statistical methods

• Open source

Our solution

R/Python IDEs in the browser

• SparkR + PySpark: researchers can work in relatively familiar languages

• RStudio / Jupyter notebooks offer an interactive development environment

Our solution

Amazon Elastic MapReduce (Amazon EMR)

• Elastic – only pay for clusters that are turned on

• Fast – 12-15 minute cluster spin up

• Free unlimited data transfer from S3

Our solution

New problem – How do researchers use this?

On a Windows computer, this requires:
• Create an AWS account
• Create a PPK file
• Install the AWS CLI
• Use the Linux command line for the EMR bootstrap
• Install/use PuTTY for SSH
• Use FoxyProxy for port forwarding

Linux command line

Our solution

Automate many processes w/ Python script
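The deck's actual automation was a Python script (not reproduced here). As a hedged illustration of the step it automates, the sketch below shells out to the AWS CLI from R, to stay in the R-centric register of the rest of the talk; the cluster name, key pair, release label, and instance sizing are placeholder assumptions, not the presenters' values.

```r
# Illustrative sketch only -- the presenters' automation was a Python script.
# Requires a locally installed and configured AWS CLI; all names below are
# placeholder assumptions.
launch_spark_cluster <- function(name     = "sparkr-demo",
                                 key_name = "my-key-pair",   # assumed key pair
                                 release  = "emr-4.7.1",     # assumed 2016-era release label
                                 type     = "r3.8xlarge",
                                 count    = 4) {
  system2("aws", args = c(
    "emr", "create-cluster",
    "--name", name,
    "--release-label", release,
    "--applications", "Name=Spark",
    "--instance-type", type,
    "--instance-count", as.character(count),
    "--use-default-roles",
    "--ec2-attributes", paste0("KeyName=", key_name)
  ), stdout = TRUE)
}

launch_spark_cluster()   # prints the new ClusterId (JSON by default)
```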

Alternative: AWS CloudFormation

• Specify and launch AWS resources with a JSON template

• Obviates the need for local installations on researcher machines

• Still meets security requirements

AWS CloudFormation – specify parameters

AWS CloudFormation – specify resources

Applying SparkR

HFPC mortgage default data

• 800 million rows, growing to an expected 2–3 billion

• 1 TB of data

• Requires data manipulation and logistic regression (must be deterministic)

EMR cluster

• Four r3.8xlarge EC2 instances:
  • 244 GB memory each (~1 TB total)
  • 32 vCPUs each (128 total)
  • 1,296 GB SSD storage

• $2.66/hour each, or ~$10/hour total

RStudio in the browser

Setting up a Spark context
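The original slide here was an RStudio screenshot. As a minimal sketch (not the presenters' exact code), attaching an in-browser RStudio session to the cluster's Spark installation might look like the following; it uses the Spark 2.x sparkR.session() API, whereas the 2016 deck would have used the older sparkR.init()/sparkRSQL.init() calls, and the app name and executor memory are assumptions.

```r
# Minimal sketch, assuming Spark 2.x SparkR; not the presenters' exact code.
# Load the SparkR package bundled with the cluster's Spark installation,
# then start a session against YARN.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sparkR.session(
  master      = "yarn",
  appName     = "mortgage-default-demo",               # assumed app name
  sparkConfig = list(spark.executor.memory = "32g")    # illustrative sizing
)
```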

Reading in data from S3
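Again the original slide was a screenshot. This sketch assumes a placeholder S3 path and the built-in csv source (Spark 2.x; on Spark 1.6 the spark-csv package played this role). Because EMR reads s3:// URIs natively, nothing has to be copied onto the cluster first.

```r
# Placeholder bucket, path, and schema -- not the real HFPC file layout.
loans <- read.df(
  "s3://my-bucket/mortgage/performance/",   # assumed S3 location
  source      = "csv",
  header      = "true",
  inferSchema = "true"
)

printSchema(loans)
nrow(loans)   # row count executes as a distributed job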

Dplyr–like Syntax
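A sketch of the dplyr-like verbs; the column names (loan_age, state, default_flag) are invented placeholders rather than actual HFPC variables.

```r
# SparkR verbs mirror dplyr but evaluate lazily on the cluster.
# Column names below are placeholders.
active <- filter(loans, loans$loan_age > 12)

by_state <- summarize(
  groupBy(active, active$state),
  n_loans      = n(active$default_flag),
  default_rate = mean(active$default_flag)
)

head(arrange(by_state, desc(by_state$default_rate)))
```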

Data Visualization with ggplot2
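ggplot2 operates on ordinary R data frames, so the pattern is to collect() a small aggregated result back to the driver and plot that locally; this sketch continues the placeholder columns from the previous snippet.

```r
library(ggplot2)

# Bring the (small) aggregate back to the driver as a local data frame.
local_by_state <- collect(by_state)

ggplot(local_by_state, aes(x = reorder(state, default_rate), y = default_rate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "State", y = "Default rate")
```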

Logistic Regression
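A hedged sketch of the distributed logistic regression: the formula and columns are placeholders, and spark.glm() is the Spark 2.x call (on the Spark 1.6 clusters of the talk the equivalent was glm(formula, data = df, family = "binomial")). The GLM is fit by iteratively reweighted least squares rather than a stochastic method, which lines up with the deterministic-results requirement.

```r
# Placeholder formula and columns -- not the actual model specification.
model <- spark.glm(
  loans,
  default_flag ~ loan_age + credit_score + ltv,
  family = "binomial"
)

summary(model)                    # coefficients, standard errors, deviance
preds <- predict(model, loans)    # adds a 'prediction' column
head(select(preds, "default_flag", "prediction"))
```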

Speed Comparisons

SAS on server
• Urban Institute local SAS server
• Single-threaded

HP SAS on EC2 cluster
• High performance SAS
• On a Cloudera Hadoop cluster optimized by SAS professionals

SparkR on EMR
• Spark + R
• On EMR
• Optimized (maybe) by me

Speed Comparisons

All times in minutes.

                                            Data import   Sort   Merge   Simple aggregations   Logistic regression
SAS on a server                                      58     44      26                   125                   306
High performance SAS on a Hadoop cluster             29     13       3                     1                     9
SparkR on EMR                                         6      6       2                     3                     8

Conclusion

Spark + R/Python + Amazon EMR w/ CloudFormation

Solves a very tough problem:
• Scalable to huge datasets
• Low cost
• Very fast
• Completely elastic
• Accessible
• Robust statistics

Thank you!

GitHub page: https://github.com/UrbanInstitute/spark-social-science

E-mail: [email protected]

Twitter: @alexcengler