RUNNING SPARK AND MAPREDUCE TOGETHER IN PRODUCTION
David Chaiken, CTO of Altiscale
#HadoopSherpa
AGENDA
• Why run MapReduce and Spark together in production?
• What about H2O, Impala, and other memory-intensive frameworks?
• Batch + Interactive = Challenges
• Specific issues and solutions
• Ongoing Challenges: Keeping Things Running
• Perspective: Hadoop as a Service versus DIY*
* do it yourself
ALTISCALE PERSPECTIVE: INFRASTRUCTURE NERDS
• Experienced Technical Yahoos
  • Raymie Stata, CEO. Former Yahoo! CTO, advocate of the Apache Software Foundation
  • David Chaiken, CTO. Former Yahoo! Chief Architect
  • Charles Wimmer, Head of Operations. Former Yahoo! SRE
• Hadoop as a Service, built and managed by Big Data, SaaS, and enterprise software veterans
  • Yahoo!, Google, LinkedIn, VMware, Oracle, ...
SOLVED: COST-EFFECTIVE DATA SCIENCE AT SCALE
But how do you make it easier for data scientists?
Two bad options:
1. Use Hadoop directly, using unfamiliar and unproductive command-line tools and APIs
2. Use Hadoop indirectly, via a back-and-forth with data engineers who translate needs into Hadoop programs
COMMON HADOOP WORKFLOW
[Diagram: the data scientist’s workflow. Source data (CSV) is cleansed and flattened into Hive, then explored and modeled iteratively, and finally moved into production and serving.]
ENTER SPARK. . . AND IMPALA AND H2O
• Interactive, iterative analysis
• Quick turns
• Memory heavy
DOES THIS MEAN THAT MAPREDUCE DOESN’T MATTER ANYMORE?
HA! (Don’t believe the hype.)
IT MATTERS SO MUCH THAT YOU WANT BOTH ON ONE CLUSTER.
[Diagram: the big data modeling workflow. The same source data (CSV) and Hive tables feed cleansing, flattening, exploration, modeling, production, and serving.]
THE CHALLENGE. . .
“Why is my Spark job not starting?”
“Why is my Spark job consuming so many resources?”
Resource conflicts!
SPECIFIC ISSUES AND SOLUTIONS
INTERACTIVE: INCREASE CONTAINER SIZE
Challenge: memory-intensive systems take as much local DRAM as is available.
Solutions:
• Spark and H2O: increase the YARN container memory size
• Impala: box it in using operating-system containers
• Caution: larger YARN container settings for interactive jobs may not be right for batch systems like Hive
• Container size needs to combine vcores and memory:
  yarn.scheduler.maximum-allocation-vcores
  yarn.nodemanager.resource.cpu-vcores
  ...
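The vcores settings above have memory-side counterparts. A yarn-site.xml sketch, with purely illustrative values that would need tuning to the actual node hardware:

```xml
<!-- yarn-site.xml (illustrative values, not a recommendation) -->
<property>
  <!-- DRAM the NodeManager offers to all containers on this node -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
</property>
<property>
  <!-- largest single container YARN will grant, e.g. one Spark executor -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>
```

Raising the maximum-allocation values cluster-wide is exactly the caution above: it lets a batch framework ask for oversized containers too.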
HIVE + INTERACTIVE: WATCH OUT FOR LARGE CONTAINER SIZE
HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION
• Caution: scheduling interactive systems alongside batch systems like Hive may result in fragmentation
• Interactive systems may require all-or-nothing scheduling
• Batch jobs with many small tasks may starve interactive jobs
HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION
Solutions:
• Reserve interactive nodes before starting batch jobs
• Reduce interactive container size (if the algorithm permits)
• Node labels (YARN-2492) and gang scheduling (YARN-624)
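Short of node labels, one common way to reserve capacity for interactive frameworks is separate Capacity Scheduler queues. A sketch with made-up queue names and percentages:

```xml
<!-- capacity-scheduler.xml (hypothetical queue split, a sketch only) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>batch,interactive</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- cap batch at its share so small batch tasks cannot creep into
       the capacity interactive systems need for all-or-nothing starts -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.interactive.capacity</name>
  <value>40</value>
</property>
```

This trades utilization for predictability: the batch queue can no longer borrow idle interactive capacity.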
ONGOING CHALLENGES
Keeping things running. . .
CHALLENGE: SECURITY
• Challenge: user management is not uniform
  • MapReduce: collaboration requires getting groups right
  • Hive: proxyuser settings have to be right for HiveServer2
  • Spark: application owner versus connected users
  • Impala: “I just gotta be me!”
  • As usual, watch out for cluster administrator accounts!
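For the HiveServer2 proxyuser case, the relevant knobs live in core-site.xml. A sketch with placeholder host and group names:

```xml
<!-- core-site.xml: let the hive service user impersonate connected users
     through HiveServer2 (host and groups below are placeholders) -->
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>hiveserver2.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>analysts,engineering</value>
</property>
```

Scoping hosts and groups narrowly matters: a wildcard here lets the hive account impersonate anyone from anywhere.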
• Challenge: port and protocol management
  • Best security practice: open specific ports for specific protocols
  • Spark: “I just gotta be free!”
  • Spark improved between versions 1.0.2 and 1.1.0, but is still confusing
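Much of the Spark confusion is that it picks random ports by default; pinning them makes narrow firewall rules possible. A spark-defaults.conf sketch with illustrative port numbers:

```
# spark-defaults.conf: pin Spark's normally-random ports so specific
# firewall holes can be opened (port numbers are illustrative)
spark.driver.port        40000
spark.blockManager.port  40001
spark.ui.port            4040
```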
CHALLENGE: WEB SERVING
• How do you provide interactive services to business users?
• Concerns: security, variable resources, latency, availability
• Keep serving infrastructure separate from Hadoop
CHALLENGE: RESOURCE ATTRIBUTION (BILLING)
• Accounting for long-running Spark, H2O, Impala clusters?
• Is reserving resources the same as using the resources?
• Trade-off: availability/response time vs. oversubscription.
CHALLENGE: STABILITY VERSUS AGILITY
• Never-ending story: latest hotness versus SLAs*
• New-system stability curve. Example:
  • SPARK-1476: 2GB limit in Spark for blocks
• Interoperation issues. Examples:
  • IMPALA-1416: queries fail with a metastore exception after upgrade and compute stats
  • HIVE-8627: compute stats on a table from Impala caused the table to be corrupted
• Many issues come down to YARN container size and JVM heap size configuration
* service level agreements
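For the container-size-versus-heap-size interaction, the usual pattern is to keep -Xmx comfortably below the YARN container limit, leaving headroom for off-heap memory. A mapred-site.xml sketch with illustrative values:

```xml
<!-- mapred-site.xml: the heap must fit inside the container,
     or YARN kills the task (values are illustrative) -->
<property>
  <!-- what YARN enforces per map container -->
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <!-- JVM heap at roughly 80% of the container -->
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
```

The same relationship holds for Spark on YARN, where executor memory plus overhead must fit the granted container.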
PERSPECTIVE: HADOOP AS A SERVICE VERSUS DIY (DO IT YOURSELF)
• Data Scientists and Data Engineers: use the right tools for the right job
• Data Scientists and Data Engineers: don’t spend your time on cluster maintenance
• Hadoop as a Service: have your cake and eat it, too
  • Benefit from the experiences of other customers
  • One size does not fit all, but one configuration schema does
  • Leave the maintenance to us infrastructure nerds
QUESTIONS? COMMENTS?