As a company starts dealing with large amounts of data, operations engineers are challenged with managing the influx of information while keeping that data resilient. Hadoop HDFS, Mesos and Spark help reduce these issues, with a scheduler that allows cluster resources to be shared. Together they provide a common ground where data scientists and engineers can meet, develop high-performance data processing applications and deploy their own tools.
Scaling Big Data with Hadoop And Mesos
Bernardo Gomez Palacio Software Engineer at Guavus Inc
Beyond Buzz Words
Mesos and Data Analysis
Yes, you don't need Hadoop to start using Mesos and Spark.
Now, If You...
- Need to store large files? By default each HDFS block is 128 MB.
- Write data mainly as new files or by appending to existing ones?
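To make the 128 MB figure concrete, here is a minimal sketch of how many blocks a file occupies. The object and function names are hypothetical, and the only number taken from the slide is the default block size.

```scala
object HdfsBlocks {
  // Default HDFS block size quoted in the slide: 128 MB.
  val DefaultBlockSizeBytes: Long = 128L * 1024 * 1024

  /** Number of blocks needed for a file of `fileSizeBytes` (ceiling division). */
  def blockCount(fileSizeBytes: Long,
                 blockSizeBytes: Long = DefaultBlockSizeBytes): Long = {
    require(fileSizeBytes >= 0, "file size must be non-negative")
    require(blockSizeBytes > 0, "block size must be positive")
    (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes
  }
}
```

Under these assumptions a 1 GiB file maps to 8 blocks, while a file of a few bytes still needs one block entry of its own, which is one reason HDFS suits large files better than many small ones.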
Convinced you want to jump on the Hadoop bandwagon? Read
Sammer, Eric. "Hadoop Operations." Sebastopol, CA: O'Reilly, 2012.
Welcome to the Jungle
Distributions
Apache Bigtop, CDH, HDP, MapR
Hadoop
HDFS, MRV1, MRV2
Assuming You Already Have Mesos
- Mesosphere Packages
- https://mesosphere.io/downloads/
- From Source.
Hadoop MRV1 in Mesos
https://github.com/mesos/hadoop
Hadoop MRV1 in Mesos
- Requires Hadoop MRV1
- Officially works with CDH5 MRV1
- Apache Hadoop 0.22, 0.23 and 1+
- Apache Hadoop 2+ doesn't come with MRV1!
Hadoop MRV1 in Mesos
- Requires a JobTracker.
- By default uses the org.apache.hadoop.mapred.JobQueueTaskScheduler
- You can change it, e.g. ...mapred.FairScheduler
Hadoop MRV1 in Mesos
- Requires TaskTracker.
- That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.
- And
How Does Hadoop MRV1 Run in Mesos?
How does Hadoop MRV1 in Mesos work?
1. The framework's Mesos scheduler creates the JobTracker as part of the driver.
2. The JobTracker will use org.apache.hadoop.mapred.MesosScheduler to launch
Thoughts
What about Hadoop 2.4? NameNode HA? MRV2 and
Personal Preference
- Use Hadoop 2.4.0 or above.
- NameNode HA through the Quorum Journal Manager.
- Move to Spark if
Example of a Mesos Data Analysis Stack
1. HDFS stores files.
2. Use the Spark CLI to test ideas.
3. Use Spark Submit for jobs.
4. Use Chronos or Oozie to schedule workflows.
Spark On Mesos
Know that Each Spark Application
1. Has its own driver process.
2. Has its own RDDs.
3. Has its own cache.
Spark Schedulers on Mesos
Fine Grained, Coarse Grained
Spark Fine Grained Scheduling
- Enabled by default.
- Each Spark task runs as a separate Mesos task.
- Has an overhead in launching each task.
Spark Coarse Grained Scheduling
- Uses only one long-running Spark task on each Mesos slave.
- Dynamically schedules its own mini-tasks, using Akka.
- Lower startup overhead.
- Reserving the cluster resources for the complete duration of the
Beware of...
- Greedy scheduling (Coarse Grained)
- Overcommitting and deadlocks (Fine Grained)
Using Spark
Understand Parametrization and Usage
- spark.app.name
- spark.executor.memory
- spark.serializer
- spark.local.dir
- ....
Use Spark Submit
Avoid parametrizing the Spark Context in your code as much as possible. Leverage the spark-submit arguments, properties files as well as environment variables to configure your
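As an illustration of keeping configuration out of application code, the sketch below renders a properties file and assembles a spark-submit argument vector. The property names come from the slides. Every value, path, class name and master URL is a hypothetical placeholder, and the helper names are invented for this example.

```scala
object SubmitSketch {
  // Property names are from the slides; the values are illustrative assumptions.
  val props: Seq[(String, String)] = Seq(
    "spark.app.name"        -> "example-job",
    "spark.executor.memory" -> "2g",
    "spark.serializer"      -> "org.apache.spark.serializer.KryoSerializer",
    "spark.local.dir"       -> "/tmp/spark"
  )

  /** Renders one whitespace-separated "key value" line per property. */
  def propertiesFile(ps: Seq[(String, String)]): String =
    ps.map { case (k, v) => s"$k $v" }.mkString("\n")

  /** Builds a spark-submit argument vector from placeholder arguments. */
  def submitCommand(master: String, mainClass: String,
                    propsPath: String, jar: String): Seq[String] =
    Seq("spark-submit",
        "--master", master,
        "--class", mainClass,
        "--properties-file", propsPath,
        jar)
}
```

For example, SubmitSketch.submitCommand("mesos://host:5050", "com.example.Job", "job.conf", "job.jar") yields a command line in which nothing about the cluster is hard-coded in the job itself.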
Using Spark
Accept That Tuning Is a Science & an Art
Understand and Tune Your Applications
- Know your working set.
- Understand Spark partitioning and block management.
- Define your Spark workflow and where to cache/persist.
- If you cache you will serialize; use Kryo.
Example Spark API
PairRDDFunctions
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K,
PairRDDFunctions.combineByKey
- Combines the elements for each key using a custom set of aggregations.
- RDD[(K, V)] to RDD[(K,
PairRDDFunctions.combineByKey
- createCombiner: turns a V into a C
- mergeValue: merges a V into a C
- mergeCombiners: combines two C's into a single one.
- partitioner defaults to
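The roles of the three functions can be sketched without a cluster. The code below is a plain Scala collections model, not Spark's implementation: each inner Seq stands in for one partition, and the object name and sample data are assumptions made for illustration.

```scala
object CombineByKeySketch {
  /** Collections-only model of combineByKey; one inner Seq per partition. */
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: inside each partition, build one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val c = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v) // key already seen here
          case None           => createCombiner(v)       // first value for this key
        }
        acc.updated(k, c)
      }
    }
    // Step 2: merge the per-partition combiners of each key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // Usage: per-key average, with C = (sum, count).
  val data = Seq(Seq("a" -> 1.0, "b" -> 4.0, "a" -> 3.0),
                 Seq("a" -> 5.0, "b" -> 6.0))
  val sumCount: Map[String, (Double, Int)] =
    combineByKey[String, Double, (Double, Int)](data)(
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2))
  val averages: Map[String, Double] =
    sumCount.map { case (k, (s, n)) => k -> s / n }
}
```

The point of the (sum, count) combiner is that C may differ from V, so an average can be computed in one pass without first collecting all the values of a key.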
Example Spark API
PairRDDFunctions
self: RDD[(K, V)]
def aggregateByKey[U: ClassTag](zeroValue: U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U
): RDD[(K, U)]
Uses the default
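A similar collections-only sketch shows how zeroValue, seqOp and combOp interact. Again this models the semantics on local Seqs, with one inner Seq per partition. It is not Spark code, and the names and data are hypothetical.

```scala
object AggregateByKeySketch {
  /** Collections-only model of aggregateByKey; one inner Seq per partition. */
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, each key starts from zeroValue and folds with seqOp.
    val perPartition = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Across partitions, the per-key results are merged with combOp.
    perPartition.foldLeft(Map.empty[K, U]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, u)) =>
        a.updated(k, a.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  // Usage: distinct values per key, with U = Set[Int].
  val data = Seq(Seq("a" -> 1, "a" -> 1, "b" -> 2),
                 Seq("a" -> 3, "b" -> 2))
  val distinctValues: Map[String, Set[Int]] =
    aggregateByKey[String, Int, Set[Int]](data)(Set.empty[Int])(_ + _, _ ++ _)
}
```

Compared with combineByKey, the caller supplies a starting value instead of a createCombiner function, which is often shorter when a natural empty value such as an empty set or zero exists.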
Understand your Data
Tune your Data
- Per data source, understand its optimal block size.
- Leverage Avro as the serialization format.
- Leverage Parquet as the storage format.
- Try to keep your Avro & Parquet schemas flat.
Each Application
- Instrument the code.
- Measure input size in number of records and byte size.
- Measure output size in the same
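One way to measure both quantities is a small helper that counts records and their encoded size in a single pass. This is a minimal sketch for text records, assuming UTF-8 encoding. The helper and its case class are hypothetical and not part of any Spark API.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object IoMetrics {
  /** Record count and total byte size of a data set. */
  final case class Size(records: Long, bytes: Long)

  /** Counts records and their UTF-8 byte size in one pass over the iterator. */
  def measure(records: Iterator[String]): Size =
    records.foldLeft(Size(0L, 0L)) { (s, r) =>
      Size(s.records + 1, s.bytes + r.getBytes(UTF_8).length)
    }
}
```

Applying the same helper to an application's input and output gives comparable numbers for both sides of a job.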
Standardize
- JDK & JRE version across your cluster.
- The Spark version across your cluster.
- The libraries that will be added to the JVM classpath by default.
- A packaging strategy for your application, uber jar.
About YARN and Spark
Some Differences with YARN
- Execution: cluster vs client modes.
- Isolation: process vs cgroups
- Docker support? LXC Templates?
References
1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014.
2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014.