© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Khuloud Odeh, Alex C. Engler
June 20, 2016
Apache Spark Clusters for Everyone
Easy Access to Amazon EMR Spark Clusters Using R and Python in the Browser
Khuloud Odeh
VP of IT and CIO
[email protected]
@kodeh
Our mission is to open minds, shape decisions, and offer solutions through economic and social policy research.
• Data-driven evaluation informing public policy decision-making
• Telling stories through interactive data visualizations
• Simulating policies before they happen using microsimulation models
Extend and strengthen research
Raise visibility & influence
Grow revenue & flexible resources
Improve productivity & work experience
High performance computing
Machine learning
Big data
Web and social
Data visualizations
Outreach
Automated process
Business intelligence
CRM
User-centric systems
Post-modern ERP
Mobility
not simply modernization
Transformation
Smarter, Faster, Better
Our IT challenge
Publication management
Data analytics & visualization
Statistical programming
Microsimulation and modeling
Cloud-first
Scalable & elastic architecture
Our technology strategy
Modeling in the cloud architecture concept
[Figure: architecture diagram. Labeled components: web servers paired with R CPU nodes, Auto Scaling, queues, a shared folder, NoSQL, an RDBMS, an analysis server, a networked data repository, and Hadoop compute clusters, arranged along a scale from slow to fast.]
Me:
• Alex, hi.
• Data Scientist @ Urban Institute
• Professor of Data (Viz & Science) + Public Policy @
  • Georgetown University
  • Johns Hopkins University
Our solution
Apache Spark
• Distributed memory framework made for big data
• Written in Scala/Java, but has R and Python APIs
• Good and improving statistical methods
• Open source
Our solution
R/Python IDEs in the browser
• SparkR + PySpark: Researchers can work in relatively familiar languages
• RStudio / Jupyter Notebooks offer interactive development environments
Our solution
Amazon Elastic MapReduce (Amazon EMR)
• Elastic – Only pay for clusters that are turned on
• Fast – 12-15 minute cluster spin up
• Free unlimited data transfer from S3
New problem – How do researchers use this?
On a Windows computer, this requires:
• Create an AWS account
• Create a PPK file
• Install AWS CLI
• Use Linux command line for EMR bootstrap
• Install/use PuTTY for SSH
• Use FoxyProxy for port forwarding
Alternative: AWS CloudFormation
• Specify and launch AWS resources with a JSON template
• Obviates the need for local installations on researcher machines
• Still meets security requirements
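To make the template idea concrete, here is a minimal sketch of what such a JSON template might contain, built as a Python dictionary and serialized with the standard library. It is not the template used in this project. The resource type `AWS::EMR::Cluster` and its property names come from CloudFormation, but the cluster name, release label, one-master-plus-three-core split, default role names, and the bootstrap script path are all illustrative assumptions.

```python
import json


def build_emr_template(instance_type="r3.8xlarge", core_count=3):
    """Return a CloudFormation template (as a dict) for one EMR Spark cluster.

    All concrete values below are hypothetical placeholders.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative EMR Spark cluster (placeholder values)",
        # Parameters are filled in by the researcher at launch time.
        "Parameters": {
            "KeyName": {"Type": "AWS::EC2::KeyPair::KeyName"},
            "SubnetId": {"Type": "AWS::EC2::Subnet::Id"},
        },
        "Resources": {
            "SparkCluster": {
                "Type": "AWS::EMR::Cluster",
                "Properties": {
                    "Name": "research-spark-cluster",        # assumed name
                    "ReleaseLabel": "emr-4.7.1",             # assumed release
                    "Applications": [{"Name": "Spark"}],
                    "JobFlowRole": "EMR_EC2_DefaultRole",    # assumed role names
                    "ServiceRole": "EMR_DefaultRole",
                    "Instances": {
                        "Ec2KeyName": {"Ref": "KeyName"},
                        "Ec2SubnetId": {"Ref": "SubnetId"},
                        "MasterInstanceGroup": {
                            "InstanceCount": 1,
                            "InstanceType": instance_type,
                        },
                        "CoreInstanceGroup": {
                            "InstanceCount": core_count,
                            "InstanceType": instance_type,
                        },
                    },
                    # Hypothetical script that would install RStudio or Jupyter.
                    "BootstrapActions": [
                        {
                            "Name": "install-ide",
                            "ScriptBootstrapAction": {
                                "Path": "s3://example-bucket/install-ide.sh"
                            },
                        }
                    ],
                },
            }
        },
    }


template = build_emr_template()
template_json = json.dumps(template, indent=2)

if __name__ == "__main__":
    print(template_json)
```

A researcher would upload the resulting JSON in the CloudFormation console and supply the two parameters there, so nothing beyond a browser is needed on the local machine.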
HFPC mortgage default data
• 800 million rows and growing to an expected 2-3 billion;
• 1 TB of data
• Requires
  • data manipulation
  • logistic regression (must be deterministic)
EMR cluster
• Four r3.8xlarge EC2 instances:
  • 244 GB memory each (~1 TB total);
  • 32 vCPU each (128 total);
  • 1296 GB SSD storage.
• $2.66/hour each, or ~$10/hour total
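The cluster totals follow directly from the per-instance figures on this slide. The short sketch below reproduces that arithmetic; the per-node numbers are taken from the slide, while the `job_cost` helper and its simple pro-rata billing are assumptions made for illustration and ignore any minimum billing increments.

```python
# Per-instance figures as quoted on the slide.
NODES = 4
MEMORY_GB_PER_NODE = 244
VCPU_PER_NODE = 32
PRICE_PER_NODE_HOUR = 2.66  # USD

total_memory_gb = NODES * MEMORY_GB_PER_NODE            # 976 GB, roughly 1 TB
total_vcpu = NODES * VCPU_PER_NODE                      # 128 vCPU
cluster_price_per_hour = NODES * PRICE_PER_NODE_HOUR    # about $10.64/hour


def job_cost(minutes):
    """Hypothetical helper: pro-rata cost in USD of keeping the cluster up."""
    return cluster_price_per_hour * minutes / 60


if __name__ == "__main__":
    print(f"Memory: {total_memory_gb} GB, vCPU: {total_vcpu}")
    print(f"Cluster price: ${cluster_price_per_hour:.2f}/hour")
    print(f"One hour of work: ${job_cost(60):.2f}")
```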
Speed Comparisons
SAS on server
• Urban Institute local SAS server
• Single-threaded
HP SAS on EC2 cluster
• High performance SAS
• On a Cloudera Hadoop cluster optimized by SAS professionals
SparkR on EMR
• Spark + R
• On EMR
• Optimized (maybe) by me
Speed Comparisons
                                          Data import  Sort  Merge  Simple aggregations  Logistic regression
SAS on a server                                    58    44     26                  125                  306
High performance SAS on a Hadoop cluster           29    13      3                    1                    9
SparkR on EMR                                       6     6      2                    3                    8
All times in minutes.
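For a rough sense of scale, the timings above can be summed per platform and compared against the single-threaded SAS server. The sketch below uses only the numbers in the table; treating the five steps as one sequential end-to-end job is an assumption, since the slides do not say whether they were run that way.

```python
STEPS = ["Data import", "Sort", "Merge", "Simple aggregations", "Logistic regression"]

# Minutes per step, copied from the table.
TIMINGS = {
    "SAS on a server": [58, 44, 26, 125, 306],
    "High performance SAS on a Hadoop cluster": [29, 13, 3, 1, 9],
    "SparkR on EMR": [6, 6, 2, 3, 8],
}

# Total minutes per platform, assuming the steps run back to back.
totals = {name: sum(minutes) for name, minutes in TIMINGS.items()}

# Overall speedup relative to the single-threaded SAS server.
baseline = totals["SAS on a server"]
speedups = {name: baseline / total for name, total in totals.items()}

if __name__ == "__main__":
    for name in TIMINGS:
        print(f"{name}: {totals[name]} min total, {speedups[name]:.1f}x vs. SAS on a server")
```

On these figures the totals are 559, 55, and 25 minutes. The per-step picture is less uniform than the totals suggest: high performance SAS is faster than SparkR on simple aggregations (1 minute against 3).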
Conclusion
Spark + R/Python + Amazon EMR w/ CloudFormation
Solves a very tough problem:
• Scalable to huge datasets
• Low cost
• Very fast
• Completely elastic
• Accessible
• Robust statistics
GitHub page: https://github.com/UrbanInstitute/spark-social-science
E-mail: [email protected]
Twitter: @alexcengler