How to build your query engine in spark

How to build your query engine in SparkPengEngineer@anchorbotLove machine learning & algorithmsPart-time Mahout committer

Prior Knowledge Scala – not important, it's always changing, so if you don't know it, congratulation

you don't have to learn it again and re-become the grandmaster you were. Functional Programming – very important! But not too functional – equally important! Will be explained later Working with Amazon EC2/S3 – spot instances are dirt-cheap and unleash the

power of auto-scaling. (assuming you want to finish things in short burst AND your 8-hours sleep)

You DON'T have to know Hadoop, YARN OR HDFS (but highly recommended). You DON'T have to know MapReduce, DAG dependency model OR Apache Akka (I

never did) You DON'T have to know Machine Learning OR Data Science.

Guideline Basic: RDD, Transformations and Actions. Basic: Testing, Packaging and Deployment. Advanced: Partitioning, Distribution and Staging. Expert: Composite Mapping and Accumulator. Example: A query engine for distributed web scraping. Q&A.

Programming BricksEntities: Generic data abstractions• RDD[T] (Resilient Distributed Dataset): a collection of

java objects spread across your cluster, inaccessible from your local computer.

• LAR[T] (Locally Accessible Resources): a data source/sink you can read/write from your local computer.

• Can be many things including but not limited to: JVM memory block on your computer, local files, files on HDFS, files on S3, tables in C* (new!), tables in Hive, Twitter feed(read-only), other web API feed (read-only)

• This list is still growing.

Mappings: Methods that cast one entity to another Parallelization: LAR[T] => RDD[T] Transformation(f: {T} => {K}): RDD[T] => RDD[K],

generalizes map Action(f: {K} => {K}): RDD[K] => LAR[K], generalizes

reduce

RDDs

LARs

Transformation

ParallelizationAction

Plain java code

Programming BricksThese bricks are atomic black boxes, do not attempt to break or reverse-engineer them! If you want to try ------->Instead, try to be agnostic and construct your complex algorithm and framework by wiring them like IC chips. They form a much larger superset of Map/Reduce. They have no problem constructing the most

complex distributed algorithms in ML and Graph analysis.

Developers of Spark has made great effort in abstracting these complex and ugly trivia away from you so you can concentrate on the beauty of your algorithm.

Advantages Probably not optimized to the core. But once you fit into the paradigm…• No more thread unsafety, racing condition, resource pool, consumer starvation,

buffer overflow, deadlock, JVM OutOfHeapSpaceException, or whatever absurdities.

• No more RPC timeout, service unavailable, 500 internal server error, or Chaos Monkey’s miracles.

• No more weird exception that only happens after being deployed to cluster, local multi-thread debugging and test capture 95% of them.

• No dependency on any external database, message queue, or a specific file system (pick one from local, HDFS, S3, CFS and change it later in 5 seconds)

• Your code will be stripped down to its core• 10~20% of your original code in cluster computing!• 30~50% of that in multi-thread computing


TestingThe first thing you should do even before cluster setup because: On a laptop with 8 cores it is still a savage beast that outperforms most other programs with

similar size. Does not require packaging and uploading, both are slow. Read all logs from console output. Is a self-contained multi-threaded process that fits into any debugger and test framework.Supports 3 modes, set by ‘--master’ parameter local[*]: use all local cores, won’t do failover! (better paranoid than sorry) local[n,t]: use n cores (support for * is missing), will retry each task t-1 times. local-cluster[n,c,m]: cluster-simulation mode! Simulate a mini-cluster of size c, each computer has n cores and m megabytes of memory. Technically no longer a single process, it will simulate everything including data distribution

over network. As a result, you have to package first, do not support debugging, and better not using it in unit

test. Will expose 100% of your errors in local run.

MasterSeed nodeResource negotiator3 options: Native: lightweight, well tested, ugly UI,

primary/backup redundancy delegated to ZooKeeper, support auto-scaling (totally abused by DataBricks), recommended for beginners

YARN: scalable, heavyweight, threads run in containers, beautiful UI, swarm redundancy

Mesos: don’t know why its still hereRemember the master URL on its UI after setup, you are going to use it everywhere.

WorkerThe muscle and the real dealReport status to master and shuffle data to each other.Cores are segregated and share nothing in computation, except broadcasted variables.Disposable! Can be added or removed at will, enables fluent scaling3 options: $SPARK_HOME/bin/spark-class

org.apache.spark.deploy.worker.Worker $MASTER_URL: both the easiest and most flexible, support auto-scaling by adding this line into startup script.

…/bin/start-all: launch both master and workers, need to setup password-less ssh login first.

…/ec2/spark-ec2: launch many things on EC2 including an in-memory HDFS, too heavyweight and too many options hardcoded.

DriverNode/JVM that runs your main function.Merged with a random worker in cluster deploy mode (see next page)Distribute dataControl stagingCollect action results and accumulator changes.Technically not part of cluster but still better to be close to all other nodes. (Important for iterative jobs)Must have a public DNS to master! otherwise will cause:

WARNING: Initial job has not accepted any resources…$SPARK_HOME on it has to be identical to that on workers (This is really sloppy but people no longer care)

PackagingGenerate the all inclusive ‘fat/über’ JAR being distributed to nodes.Self-contained, should include everything in your program’s dependency tree This JAR won’t be generated by default, you have to generate it by: Enable maven-shade plugin and run mvn package Enable sbt-assembly plugin and run sbt> assembly

… EXCEPT those who overlap with Spark’s dependencies (and all modules’ dependencies, including but not limited to: SparkSQL, Streaming, Mllib and GrpahX). Excluding them by setting the scope of Spark artifact(s) in your dependency list to

‘provided’ You don’t have to do it but this decrease your JAR size by 90M+. They already exist in $SPARK_HOME/lib/*.jar and will always be loaded BEFORE your

JAR. if your program and Spark have overlapping dependencies but in different versions,

yours will be ignored in runtime (Java’s first-found-first-serve principal), and you go straight into...

JAR hellManifest itself as either one of these errors that only appears after packaging: NoClassDefFoundError ClassNotFoundException NoSuchFieldError NoSuchMethodError Unfortunately many dependencies of Spark are severely not up-to-date. Even more unfortunately the list of these outdated dependencies is still growing, a curse

bestowed by Apache Foundation. Switching to YARN won’t resolve it! It just box threads with containers but won’t change

class loading sequence.Only (ugly but working) solution so far: package relocation! Supported by maven-shade by setting relocation rule, don’t know how to do this in sbt :-

< Probably have third-party plugins that can detect it from dependency, need more testing. Not very compatible with some IDE, if reporting a classpath error please re-import the

project.

Maven vs sbtMaven

• The most extendable and widely-supported build tool.

• Native to Java, but all Scala dependencies are Java bytecode.

• Need maven-scala and maven-shade plugins

• I don’t know why but Spark official repo just switched from sbt to maven after 0.9.0.

• Apparently slightly faster than ivy• A personal tool of choice.

Simple Build Tool (used to be simple)• No abominable xml• Native to Scala• Self-contained executable• Beautiful build report by ivy

backend• Need sbt-assembly plugin (does

NOT support relocation :-<)

Deployment$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER_URL --jar $YOUR_JARS full.path.to.your.main.object.This command do everything including distributing JARS and run main function locally as the driver. (a.k.a. client deploy mode)Alternatively you can move the driver to a random node by overriding ‘--deploy-mode’ to ‘cluster’, but it’s not recommended for beginners, reasons: Don’t know which node until seeing Spark UI Driver takes extra CPU and bandwidth load. Cannot use local JAR –you have to upload it to HDFS or S3 first. spark-submit is dumb, don’t know where to find it in JAR distribution dir. Useless to any Spark-shell! And a few other things. If its part of an SOA, have fun pointing all other clients to it.


Partitioning (a.k.a. shuffling)RDD[T].Partition: a smallest inseparable chunk of T, each cannot spread over 2 cores or threads. -> generating each partition only requires a self-contained single-thread subroutine (called task) that won’t screw up and induces no overhead on scheduling/synchronization.Default number of partitions is the total number of cores in a cluster, works great if workload on each partition is fairly balanced.Otherwise some cores will finish first and fence in your cluster ------>you’d better override this: Many transformations and parallelizations takes an optional Int

parameter to result in RDD with desired number of partitions. RDD[T].repartition(n: Int) returns an RDD[T] with identical content

but different number of partitions, also rebalance sizes of partitions.

RDD[T].coalesce(n: Int) merge closest partitions (ideally on one node) together. This is an incomplete partitioning (no shuffling) which makes it faster.

ResiliencePartition is also the smallest unit to be discarded and regenerated from scratch whenever: The task generating it throws an exception and quit. (regardless you can

customize your mappings to retry locally inside the task thread to avoid discarding the already succeeded part, Matei confirm this last time)

It is lost in a power outage or being disconnected from the cluster (When speculative task is enabled) the task generating it takes too long to

finish comparing to the time to finish most other partitions. (In this case the old one won’t be discarded but race with the new one)

It is being slowly (I mean discouragingly slow) redistributed from another node or loaded from a remote cache (e.g. HDFS&S3) yet all its dependencies (prior Partitions and mappings needed to generate it) are available locally. When this really happens, you and your network admin will have a problem.

Rule No. 1:No. partitions >= No. cores

More partitions/smaller partition =Higher scheduling and distribution overheadHigher workload/bandwidth consumption for driver/master nodeLower cost for retry/regenerationBetter scheduling for unbalanced partitioning and speculative taskEasier monitoring of progressLess partitions/bigger partition =

Lower scheduling and distribution overheadLower workload/bandwidth consumption for driver/master nodeHigher cost for retry/regeneration (again you can retry inside thread first)Longer waiting time for unbalanced partitioning and speculative taskProgress bar will stay at 0% until you lose hope

DistributionBoth RDD[T] and Mappings are (hopefully evenly) spread across cores.Supports 3 modes, from fast to slow:1. (fastest) JAR! Contains all precompiled java bytecode and artifacts, which include but not

limited to: IMMUTABLE static objects or constants (Mutable static objects are strictly forbidden: if you modify any of these locally in runtime no other node will know it, and you go straight into mutable hell), class methods, precompiled anonymous functions EXCLUDING closures (they are just methods of classes that inherit Function interfaces. BTW nice lambda operator, Java 8), manifest, anything you package into jar for whatever reason

Not include: fields of dynamic objects (initialized at runtime), closure of anonymous functions (same old thing), variables.

Only (and always) happens once before execution and reused in the lifespan of a job.

You can find the distributed jar in $SPARK_HOME/work/$JOB_UUID/… of each node

Distribution2. (faster) Broadcast (bonus feature, prioritized due to importance) Basically a cluster-wide singleton initialized in run-time. Happens immediately during singleton’s initialization using an eponymous static

function: val wrapper = spark.broadcast(thing: T <: Serializable)

After which you can define any mapping to read it repeatedly across the entire cluster, by using this wrapper in the mapping’s parameter:

wrapper.value You can’t write it, its IMMUTABLE, will cause racing condition if you can anyway.3. (fast) Shipping (Easiest, also the only way to distribute heterogeneous objects created at runtime including non-singletons and closures) also used to distribute broadcast wrappers (very small/fast) Still much faster than reading from ANY non-local file system. (Giving it a

performance edge over Hadoop MR) Happens automatically in partitioning, triggered on-demand. You can see its time cost in ‘Shuffle Read/Write’ column in job UI.

Serialization HellBroadcast and shipping demands that all objects being distributed are SERIALIZABLE:

...broadcast[T <: Serializable](… RDD[T <: Serializable]

…map(f <: Function[T,K] with Serializable)Otherwise deep copy is no-can-do and program throws NotSerializableError.Easiest (and most effective) solution: Don’t be a too functional! only put simple types and collections in RDDs and closures. Will also makes shipping faster (very important for iterative algorithms, beware R programmers.)If not possible you still have 2 options: Wrap complex objects with a serializable wrapper (recommended, used by many Spark

parallelizations/actions to distribute HDFS/S3 credentials) Switch to Kryo Serializer (shipping is faster in most cases and favored by Mahout

due to extremely iterative ML algorithms, I haven’t tried yet)Happens even at shipping between cores, only becomes useless when broadcasting locally (singleton is not bind to cores). One of the rare cases where you cluster-wise deployment fails yet local test succeeds.

StagingA stage contains several mappings that are concatenated into embarrassingly parallelizable longer tasks. E.g. map->map = 1 stage, map->reduce = 2 stages Technically reduce can start after a partition of its preceding map is

generated, but Spark is not that smart (or unnecessarily complex).Staging can only be triggered by the following mappings: All actions. Wide transformations.CanNOT be triggerd by caching or checkpointing: They are also embarrasingly parallelizable.

Wide Transformations?Narrow (no

partitioning):• Map• FlatMap• MapPartitions• Filter• Sample• Union

Wide (partitioning):

• Intersection• Distinct• ReduceByKey• GroupByKey• Join• Cartesian• Repartition• Coalesce (I know

its incomplete but WTH)


Composite MappingUsed to create additional mappings and DSL that do complex things in one line, also reduces jar and closure size. However… You can’t inherit or extend RDD of which exact type is abstracted away from

you. You can’t break basic mappings as atomic black-boxes.Only solution: use Scala implicit view! Define a wrapper of RDD[T], implement all your methods/operators in it by

wiring the plethora of programming bricks. Create a static object (usually referred as context) with an implicit method that

converts RDD[T] to your wrapper (referred as implicit converter) In your main function, import the implicit converter by:

Import context._ Voila, you can use those methods/operators like mappings on any RDD[T]. Another reason why Scala is the language of choice.

AccumulatorUsed to create counters, progress trackers, and performance metrics of things that are not displayed on UI. Created from eponymous function:

Val acc = spark.accumulator(i) Only readable in main function, but can be updated anywhere in parallel

by using: acc += j

Type of i and j must be identical and inherit AccumulatorParam No constraint on implementation, but order of j should have no impact on final result. Prefer simple and fast implementation. Updated in real time, but requires an extra thread or non-blocking function to read

locally when the main thread is blocked by stage execution.


Background

ImplementationOnly 2 entities: RDD[ActionPlan]: A sequence of “human” actions executable on a

browser. RDD[Page]: A web page or one of its sections.Query syntax is entirely defined by composite mappings. Mappings on RDD[ActionPlan] are infix operators Mappings on RDD[Page] are methods with SQL/LINQ-ish names. Enabled by importing context._, and work with other mapping extensions.Web UI is ported from iScala-notebook, which is ported from iPython-notebook.

RDD[T].

Map(T=>K)Pipe(./bin)

FlatMap (T=>K…)

Distinct() Union (RDD[T])

Filter (T=>y/n?)Sample()

Intersection (RDD[T])

GroupBy (T~T’?)

Cartesian (RDD[V])

RDD [(U,V)].

GroupByKey() (Left/right)Join(RDD[U,K])Lookup(RDD[U])

ReduceByKey(V…=>V)

RDD[T].

Reduce(T…=>T)

Collect()

Count() First() saveAsTextFile(filePath)

fromTextFile (filePath)

Parallelize (T…)

Performance Benchmark

10 cores 20 cores 30 cores 40 cores import.io0

10000

20000

30000

40000

50000

60000

70000

Pages/hour

Amazon.com Google Image Iherb

10 cores 20 cores 30 cores 40 cores import.io (10

cores?)

0200400600800

10001200140016001800

Pages/(hour*core)

Amazon.com Google Image Iherb

Thanks for tough questions!==In Chaos Monkey we [email protected]/tribbloid

mailto:[email protected]

AddendumMore Mutable Hell?Lazy EvaluationAnecdotes

More mutable hell?Many sources have claimed that RDD[T] enforces immutable pattern for T, namely: T contains only immutable (val) fields, which themselves are also immutable types and

collections. Content of T cannot be modified in-place inside any mapping, always create a slightly

different deep copy.However this pattern is perhaps less rigorous by now, reasons: Serialization and shipping happens ubiquitously between stages which always enforce a

deep copy. 2 threads/tasks cannot access one memory block by design to avoid racing. Any partition modified by a mapping then used by another is either regenerated or loaded

from a previous cache/checkpoint created BEFORE the modifying mapping, in both cases those modifications are simply discarded without collateral damage.

Immutable pattern requires great discipline in Java and Python.So, my experience: just make sure your var and shallow copies won’t screw up things INSIDE each single-threaded task, anticipate discarded changes, and theoretically you’ll be safe.Of course there is no problem in using immutable pattern if you feel insecure.Again mutable static object is FORBIDDEN.

Lazy Evaluation RDD is empty at creation, only its type and partition ids are

determined. Can only be redeemed by an action called upon itself or its

downstream RDD. After which a recursive resolve request will be passed along its

partitions’ dependency tree(a directed acyclic graph or DAG), resulting in their tasks being executed in sequence after their respect dependencies are redeemed.

A task with all its dependencies are largely omitted if the partition it generates is already cached, so always cache if you derive 2+ LARs from 1 RDD

Knowing this is trivial to programming until you start using first() action Caching an RDD, call first() on it (triggers several mini-stages that has

1 partition/task), then do anything that triggers full-scale staging will take twice as long as doing it otherwise. Spark’s caching and staging mechanism is not smart enough to know what will be used in later stages.

Anecdotes•Partitioning of an RDD[T] depends on the data structure of T and who creates it, and is rarely random (yeah it’s also called ‘shuffling’ to fool you guys).

• E.g. an RDD[(U,V)] (a.k.a. PairRDD) will use U as partition keys and distribute (U,V) in a C* token ring-ish fashion for obvious reasons.

• More complex T and expected usage case will results in increasingly complex RDD implementations, notably the SchemaRDD in SparkSQL.

• Again don’t try to reverse-engineer them unless you are hardcore.

Software

How to build your query engine in spark