YARN Ready webinar series helps developers integrate their applications to YARN. Tez is one vehicle to do that. We take a deep dive including code review to help you get started.
1. Apache Tez : Accelerating HadoopData ProcessingBikas Saha@bikassaha Hortonworks Inc. 2013 Page 1
2. Tez Introduction Hortonworks Inc. 2013Page 2 Distributed execution framework targeted towards data-processingapplications. Based on expressing a computation as a dataflow graph. Highly customizable to meet a broad spectrum of use cases. Built on top of YARN the resource management frameworkfor Hadoop. Open source Apache project and Apache licensed. 3. Tez Hadoop 1 ---> Hadoop 2 Monolithic Resource Management MR Execution Engine MRLayered Resource Management YARN Engines Hive, Pig, Cascading, Your App! 4. Tez Empowering Applications Tez solves hard problems of running on a distributed Hadoop environment Apps can focus on solving their domain specific problems This design is important to be a platform for a variety of applications Hortonworks Inc. 2013Page 4AppTez Custom application logic Custom data format Custom data transfer technology Distributed parallel execution Negotiating resources from the Hadoop framework Fault tolerance and recovery Horizontal scalability Resource elasticity Shared library of ready-to-use components Built-in performance optimizations Security 5. Tez End User Benefits Better performance of applications Built-in performance + Application define optimizations Better predictability of results Minimization of overheads and queuing delays Better utilization of compute capacity Efficient use of allocated resources Reduced load on distributed filesystem (HDFS) Reduce unnecessary replicated writes Hortonworks Inc. 2013 Reduced network usage Better locality and data transfer using new data patterns Higher application developer productivity Focus on application business logic rather than Hadoop internalsPage 5 6. Tez Design considerationsDont solve problems that have already been solved. Or elseyou will have to solve them again! Leverage discrete task based compute model for elasticity, scalabilityand fault tolerance Leverage several man years of work in Hadoop Map-Reduce datashuffling operations Leverage proven resource sharing and multi-tenancy model for Hadoopand YARN Leverage built-in security mechanisms in Hadoop for privacy andisolation Hortonworks Inc. 2013Page 6Look to the Future with an eye on the Past 7. Tez Problems that it addresses Expressing the computation Direct and elegant representation of the data processing flow Interfacing with application code and new technologies Hortonworks Inc. 2013 Performance Late Binding : Make decisions as late as possible using real data from atruntime Leverage the resources of the cluster efficiently Just work out of the box! Customizable engine to let applications tailor the job to meet their specificrequirements Operation simplicity Painless to operate, experiment and upgradePage 7 8. Tez Simplifying Operations No deployments to do. No side effects. Easy and safe to try it out! Tez is a completely client side application. Simply upload to any accessible FileSystem and change local Tez configuration topoint to that. Enables running different versions concurrently. Easy to test new functionalitywhile keeping stable versions for production. Leverages YARN local resources.TezClient TezTaskTezTask Hortonworks Inc. 2013Page 8ClientMachineNodeManagerNodeManagerHDFSTez Lib 1 Tez Lib 2TezClientClientMachine 9. Tez Expressing the computationDistributed data processing jobs typically look like DAGs (Directed AcyclicGraph). Vertices in the graph represent data transformations Edges represent data movement from producers to consumers Hortonworks Inc. 2013Page 9Preprocessor StagePartition StageAggregate StageSamplerTask-1 Task-2Task-1 Task-2Task-1 Task-2SamplesRangesDistributed Sort 10. Tez Expressing the computationTez provides the following APIs to define the processing DAG API Defines the structure of the data processing and the relationshipbetween producers and consumers Enable definition of complex data flow pipelines using simple graphconnection APIs. Tez expands the logical DAG at runtime This is how the connection of tasks in the job gets specified Hortonworks Inc. 2013 Runtime API Defines the interfaces using which the framework and app code interactwith each other App code transforms data and moves it between tasks This is how we specify what actually executes in each task on the clusternodesPage 10 11. Hortonworks Inc. 2013Tez DAG API// Define DAGDAG dag = DAG.create();// Define VertexVertex Map1 = Vertex.create(Processor.class);// Define EdgeEdge edge = Edge.create(Map1, Reduce1,SCATTER_GATHER, PERSISTED, SEQUENTIAL,Output.class, Input.class);// Connect themdag.addVertex(Map1).addEdge(edge)Page 11Defines the global processing flowMap1 Map2ScatterGatherReduce1 Reduce2ScatterGatherJoin 12. Hortonworks Inc. 2013Tez DAG APIData movement Defines routing of data betweentasks One-To-One : Data from the ith producer task routes tothe ith consumer task. Broadcast : Data from a producer task routes to allconsumer tasks. Scatter-Gather : Producer tasks scatter data into shardsand consumer tasks gather the data. The ith shard fromall producer tasks routes to the ith consumer task.Page 12Edge properties define the connection between producer andconsumer tasks in the DAG 13. Tez Logical DAG expansion atRuntime Hortonworks Inc. 2013Page 13Reduce1Map2Reduce2JoinMap1 14. Tez Runtime APIFlexible Inputs-Processor-Outputs Model Thin API layer to wrap around arbitrary application code Compose inputs, processor and outputs to execute arbitrary processing Applications decide logical data format and data transfer technology Customize for performance Built-in implementations for Hadoop 2.0 data services HDFS andYARN ShuffleService. Built on the same API. Your impls are as first classas ours! Hortonworks Inc. 2013Page 14 15. Tez Library of Inputs and OutputsSortedOutput Hortonworks Inc. 2013Classical MapPage 15Classical ReduceMapProcessorHDFSInputIntermediate Reduce forMap-Reduce-ReduceReduceProcessorShuffleInputHDFSOutputReduceProcessorShuffleInputSortedOutput What is built in? Hadoop InputFormat/OutputFormat SortedGroupedPartitioned Key-ValueInput/Output UnsortedGroupedPartitioned Key-ValueInput/Output Key-Value Input/Output 16. Tez Composable Task ModelAdopt Evolve OptimizeHDFSInputRemoteFileServerInput Hortonworks Inc. 2013NativeDBInputPage 16HDFSInputRemoteFileServerInputHive ProcessorHDFSOutputLocalDiskOutputYour ProcessorHDFSOutputLocalDiskOutputRDMAInputYour ProcessorKakfaPub-SubOutputAmazonS3Output 17. Tez Performance Benefits of expressing the data processing as a DAG Reducing overheads and queuing effects Gives system the global picture for better planning Efficient use of resources Re-use resources to maximize utilization Pre-launch, pre-warm and cache Locality & resource aware scheduling Support for application defined DAG modifications at runtimefor optimized execution Change task concurrency Change task scheduling Change DAG edges Change DAG vertices (TBD) Hortonworks Inc. 2013Page 17 18. Tez Benefits of DAG executionFaster Execution and Higher Predictability Eliminate replicated write barrier between successive computations. Eliminate job launch overhead of workflow jobs. Eliminate extra stage of map reads in every workflow job. Eliminate queue and resource contention suffered by workflow jobs thatare started after a predecessor job completes. Better locality because the engine has the global picture Hortonworks Inc. 2013Page 18Pig/Hive - MRPig/Hive - Tez 19. Tez Container Re-Use Reuse YARN containers/JVMs to launch new tasks Reduce scheduling and launching delays Shared in-memory data across tasks JVM JIT friendly execution Hortonworks Inc. 2013Page 19TezTask HostTezTask1TezTask2Shared ObjectsYARN Container / JVMTezApplication MasterYARN ContainerStart TaskTask DoneStart Task 20. Hortonworks Inc. 2013Tez SessionsPage 20ClientApplication MasterStartSessionSubmitDAGTask SchedulerContainer PoolSharedObjectRegistryPreWarmedJVMSessions Standard concepts of pre-launchand pre-warm applied Key for interactive queries Represents a connection betweenthe user and the cluster Multiple DAGs executed in thesame session Containers re-used across queries Takes care of data locality andreleasing resources when idle 21. Tez Current status Hortonworks Inc. 2013 Apache ProjectRapid development. Over 1100 jiras opened. Over 800 resolvedGrowing community of contributors and users 0.5 being released. In voting for releaseDeveloper release focused ease of development Stable APIBetter debuggingDocumentation and code samples Support for a vast topology of DAGs Being used by multiple applications such as Apache Hive,Apache Pig, Cascading, ScaldingPage 21 22. Tez Adoption PathPre-requisite : Hadoop 2 with YARNTez has zero deployment pain. No side effects or traces leftbehind on your cluster. Low risk and low effort to try out. Using Hive, Pig, Cascading, Scalding Try them with Tez as execution engine Already have MapReduce based pipeline Use configuration to change MapReduce to run on Tez by settingmapreduce.framework.name to yarn-tez in mapred-site.xml Consolidate MR workflow into MR-DAG on Tez Change MR-DAG to use more efficient Tez constructs Hortonworks Inc. 2013 Have custom pipeline Wrap custom code/scripts into Tez inputs-processor-outputs Translate your custom pipeline topology into Tez DAG Change custom topology to use more efficient Tez constructsPage 22 23. Tez Theory to Practice Hortonworks Inc. 2013 Performance ScalabilityPage 23 24. Tez Hive TPC-DS Scale 200GB latency Hortonworks Inc. 2013 25. Tez Pig performance gains Demonstrates performance gains from a basic translation to aTez DAG160140120100806040200Prod script 125m vs 10m5 MR JobsProd script 234m vs 16m5 MR JobsProd script 31h 46m vs 48m12 MR JobsProd script 42h 22m vs 1h21m15 MR jobsTime in minsMRTez 26. Tez Observations on Performance Hortonworks Inc. 2013 Number of stages in the DAG Higher the nu