How to build your query engine in Spark

An over-ambitious introduction to Spark programming, testing and deployment. This deck tries to cover most of the core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web. For more information please follow: (A bug in PowerPoint used to cause the transparent background color to render improperly; this has been fixed in a recent upload.)


  • How to build your query engine in Spark. Peng, Engineer @ anchorbot. Loves machine learning & algorithms; part-time Mahout committer.
  • Prior Knowledge Scala: not important — it's always changing, so if you don't know it, congratulations: you don't have to learn it again and re-become the grandmaster you were. Functional programming: very important! But not being too functional is equally important (will be explained later). Working with Amazon EC2/S3: spot instances are dirt-cheap and unleash the power of auto-scaling (assuming you want to finish things in short bursts AND keep your 8 hours of sleep). You DON'T have to know Hadoop, YARN or HDFS (but it's highly recommended). You DON'T have to know MapReduce, the DAG dependency model or Apache Akka (I never did). You DON'T have to know machine learning or data science.
  • Guideline Basic: RDD, Transformations and Actions. Basic: Testing, Packaging and Deployment. Advanced: Partitioning, Distribution and Staging. Expert: Composite Mapping and Accumulator. Example: A query engine for distributed web scraping. Q&A.
  • Programming Bricks Entities: generic data abstractions. RDD[T] (Resilient Distributed Dataset): a collection of Java objects spread across your cluster, inaccessible from your local computer. LAR[T] (Locally Accessible Resource): a data source/sink you can read/write from your local computer. Can be many things, including but not limited to: a JVM memory block on your computer, local files, files on HDFS, files on S3, tables in C* (new!), tables in Hive, a Twitter feed (read-only), other web API feeds (read-only). This list is still growing. Mappings: methods that cast one entity to another. Parallelization: LAR[T] => RDD[T]. Transformation(f: {T} => {K}): RDD[T] => RDD[K], generalizes map. Action(f: {K} => {K}): RDD[K] => LAR[K], generalizes reduce. [Diagram: RDDs and LARs connected by Parallelization, Transformation and Action arrows, next to plain Java code]
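The three kinds of mappings can be sketched in a few lines of script-style Scala (a minimal sketch, assuming spark-core is on the classpath; names and numbers here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run everything on local cores -- no cluster needed
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("bricks"))

// Parallelization: LAR[Int] (a local Scala collection) => RDD[Int]
val rdd = sc.parallelize(1 to 10)

// Transformation: RDD[Int] => RDD[Int], lazy -- nothing is computed yet
val squared = rdd.map(x => x * x)

// Action: RDD[Int] => LAR[Int], triggers the actual computation
val total = squared.reduce(_ + _)

sc.stop()
```

Note that `map` returns instantly: only the `reduce` action forces the cluster (here, your local cores) to do any work.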
  • Programming Bricks These bricks are atomic black boxes; do not attempt to break or reverse-engineer them! Instead, try to be agnostic and construct your complex algorithms and frameworks by wiring them together like IC chips. They form a much larger superset of Map/Reduce. They have no problem constructing the most complex distributed algorithms in ML and graph analysis. The developers of Spark have made a great effort to abstract these complex and ugly trivia away from you, so you can concentrate on the beauty of your algorithm.
  • Advantages Probably not optimized to the core. But once you fit into the paradigm: no more thread-unsafety, race conditions, resource pools, consumer starvation, buffer overflows, deadlocks, JVM OutOfMemoryError, or whatever other absurdities. No more RPC timeouts, service unavailable, 500 internal server errors, or Chaos Monkey miracles. No more weird exceptions that only happen after deployment to the cluster; local multi-threaded debugging and tests capture 95% of them. No dependency on any external database, message queue, or specific file system (pick one from local, HDFS, S3, CFS and change it later in 5 seconds). Your code will be stripped down to its core: 10~20% of its original size in cluster computing, 30~50% of that in multi-threaded computing.
  • Guideline Basic: RDD, Transformations and Actions. Basic: Testing, Packaging and Deployment. Advanced: Partitioning, Distribution and Staging. Expert: Composite Mapping and Accumulator. Example: A query engine for distributed web scraping. Q&A.
  • Testing The first thing you should set up, even before the cluster, because: on a laptop with 8 cores Spark is still a savage beast that outperforms most other programs of similar size. It does not require packaging and uploading, both of which are slow. You read all logs from console output. It is a self-contained multi-threaded process that fits into any debugger and test framework. It supports 3 modes, set by the --master parameter: local[*]: use all local cores, won't do failover! (better paranoid than sorry). local[n,t]: use n cores (support for * is missing), will retry each failed task up to t-1 times. local-cluster[n,c,m]: cluster-simulation mode! Simulates a mini-cluster of n workers, each with c cores and m megabytes of memory. Technically no longer a single process: it simulates everything including data distribution over the network. As a result you have to package first, it does not support debugging, and you'd better not use it in unit tests. It will expose ~100% of your errors in a local run.
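A minimal sketch of a self-contained local-mode test (script-style Scala, assuming spark-core on the classpath; the master URL local[2,2] uses 2 cores and retries each failed task once):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local[2,2]: 2 cores, each failed task retried up to 1 extra time
val conf = new SparkConf().setMaster("local[2,2]").setAppName("unit-test")
val sc = new SparkContext(conf)

// The whole pipeline runs as ordinary threads in this one JVM,
// so any debugger or test framework can step through it.
val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect().toSeq

sc.stop()
```

The same code runs unchanged on a real cluster once you point --master at a cluster URL instead.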
  • Master Seed node and resource negotiator. 3 options: Native (standalone): lightweight, well tested, ugly UI, primary/backup redundancy delegated to ZooKeeper, supports auto-scaling (thoroughly exploited by Databricks); recommended for beginners. YARN: scalable, heavyweight, threads run in containers, beautiful UI, swarm redundancy. Mesos: don't know why it's still here. Remember the master URL shown on its UI after setup; you are going to use it everywhere.
  • Worker The muscle and the real deal. Workers report status to the master and shuffle data to each other. Cores are segregated and share nothing in computation, except broadcast variables. Disposable! They can be added or removed at will, which enables fluent scaling. 3 options: $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL: both the easiest and the most flexible; supports auto-scaling by adding this line to a startup script. $SPARK_HOME/sbin/start-all.sh: launches both master and workers; you need to set up password-less ssh login first. $SPARK_HOME/ec2/spark-ec2: launches many things on EC2, including an in-memory HDFS; too heavyweight, with too many options hardcoded.
  • Driver The node/JVM that runs your main function. Merged with a random worker in cluster deploy mode (see next page). It distributes data, controls staging, and collects action results and accumulator changes. Technically not part of the cluster, but it is still better to keep it close to all other nodes (important for iterative jobs). It must have a public DNS reachable by the master! Otherwise you will see: WARNING: Initial job has not accepted any resources. $SPARK_HOME on it has to be identical to that on the workers (this is really sloppy but people no longer care).
  • Packaging Generate the all-inclusive fat/über JAR to be distributed to the nodes. It is self-contained: it should include everything in your program's dependency tree, EXCEPT those that overlap with Spark's dependencies (and all its modules' dependencies, including but not limited to SparkSQL, Streaming, MLlib and GraphX). This JAR won't be generated by default; you have to generate it by: enabling the maven-shade plugin and running mvn package, or enabling the sbt-assembly plugin and running sbt> assembly. Exclude the overlapping dependencies by setting the scope of the Spark artifact(s) in your dependency list to provided. You don't have to do it, but this decreases your JAR size by 90 MB+. They already exist in $SPARK_HOME/lib/*.jar and will always be loaded BEFORE your JAR. If your program and Spark have overlapping dependencies but in different versions, yours will be ignored at runtime (Java's first-found-first-served principle), and you go straight into...
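In sbt, for example, the provided scope is one line in build.sbt (a sketch; the version number is illustrative, match it to your cluster):

```scala
// build.sbt -- marking Spark "provided" keeps it out of the fat JAR built by sbt-assembly
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
```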
  • JAR hell Manifests itself as one of these errors that only appear after packaging: NoClassDefFoundError, ClassNotFoundException, NoSuchFieldError, NoSuchMethodError. Unfortunately, many dependencies of Spark are severely out of date. Even more unfortunately, the list of these outdated dependencies is still growing — a curse bestowed by the Apache Foundation. Switching to YARN won't resolve it! It just boxes threads into containers but won't change the class-loading sequence. The only (ugly but working) solution so far: package relocation! Supported by maven-shade by setting a relocation rule; don't know how to do this in sbt :-< There are probably third-party plugins that can detect it from the dependency tree; needs more testing. Not very compatible with some IDEs: if a classpath error is reported, please re-import the project.
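A relocation rule in maven-shade looks roughly like this (a sketch; Guava is just an illustrative example of a clashing dependency — adjust the pattern to whichever library actually conflicts in your build):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- rewrite your copy of Guava into a private namespace,
             so Spark's older copy no longer shadows it -->
        <pattern>com.google.common</pattern>
        <shadedPattern>myproject.shaded.com.google.common</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```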
  • Maven vs sbt Maven: the most extendable and widely-supported build tool. Native to Java, but all Scala dependencies are Java bytecode anyway. Needs the maven-scala and maven-shade plugins. I don't know why, but the Spark official repo just switched from sbt to Maven after 0.9.0. Apparently slightly faster than ivy. A personal tool of choice. sbt: Simple Build Tool (used to be simple). No abominable XML. Native to Scala. Self-contained executable. Beautiful build reports from the ivy backend. Needs the sbt-assembly plugin (does NOT support relocation :-<)
  • Partitioning Generating each partition only requires a self-contained single-threaded subroutine (called a task) that won't screw up and induces no overhead on scheduling/synchronization. The default number of partitions is the total number of cores in the cluster, which works great if the workload on each partition is fairly balanced. Otherwise some cores will finish first and fence in your cluster, so you'd better override this: many transformations and parallelizations take an optional Int parameter to yield an RDD with the desired number of partitions. RDD[T].repartition(n: Int) returns an RDD[T] with identical content but a different number of partitions, and also rebalances partition sizes. RDD[T].coalesce(n: Int) merges closest partitions (ideally on one node) together. This is an incomplete partitio
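The partitioning controls above can be sketched like this (script-style Scala, assuming spark-core on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("partitions"))

// Explicit partition count via the optional Int parameter
val rdd = sc.parallelize(1 to 100, 4)
val before = rdd.partitions.length        // 4

// repartition: full shuffle, rebalances partition sizes
val more = rdd.repartition(8)
val rebalanced = more.partitions.length   // 8

// coalesce: merges nearby partitions, avoiding a full shuffle
val fewer = more.coalesce(2)
val merged = fewer.partitions.length      // 2

sc.stop()
```

repartition is the right tool when partitions are skewed; coalesce is cheaper when you only want fewer of them.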

