
Page 1: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Anatomy of Data Frame API

A deep dive into the Spark Data Frame API

https://github.com/phatak-dev/anatomy_of_spark_dataframe_api

Page 2: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Agenda

● Spark SQL library
● DataFrame abstraction
● Pig/Hive pipeline vs Spark SQL
● Logical plan
● Optimizer
● Different steps in query analysis

Page 4: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Spark SQL library

● Data source API: universal API for loading/saving structured data
● DataFrame API: higher-level representation for structured data
● SQL interpreter and optimizer: express data transformations in SQL
● SQL service: Hive Thrift server
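
As a quick illustration of the data source API, here is a minimal spark-shell sketch (Spark 1.4 style, where sqlContext is already available); the file name sales.json and the output path are assumptions for illustration:

// Load structured data through the data source API.
val sales = sqlContext.read.json("sales.json")

// Save it back in a different format through the same API.
sales.write.format("parquet").save("sales_parquet")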

Page 5: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Architecture of Spark SQL

[Diagram: CSV, JSON and JDBC sources feed the Data Source API, which feeds the DataFrame API; Spark SQL/HQL and the DataFrame DSL sit on top of the DataFrame API.]

Page 6: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

DataFrame API

● Single abstraction for representing structured data in Spark
● DataFrame = RDD + Schema (aka SchemaRDD)
● All data source APIs return a DataFrame
● Introduced in 1.3
● Inspired by R and Python pandas
● .rdd converts a DataFrame back to its RDD representation, an RDD[Row]
● Support for a DataFrame DSL in Spark
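
A minimal spark-shell sketch of the DataFrame/RDD relationship; sales.json is again an assumed input file:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Every data source API call returns a DataFrame: an RDD of rows plus a schema.
val df = sqlContext.read.json("sales.json")
df.printSchema()

// .rdd drops back to the underlying RDD[Row] representation.
val rows: RDD[Row] = df.rdd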

Page 7: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Need for new abstraction

● Single abstraction for structured data
  ○ Ability to combine data from multiple sources
  ○ Uniform access from all the different language APIs
  ○ Ability to support multiple DSLs
● Familiar interface for data scientists
  ○ Same API as R / pandas
  ○ Easy to convert an R local data frame to a Spark DataFrame
  ○ The new SparkR in 1.4 is built around it

Page 8: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Data structure of the structured world

● A DataFrame is a data structure for representing structured data, whereas an RDD is a data structure for unstructured data
● Having a single data structure allows us to build multiple DSLs targeting different developers
● All DSLs use the same optimizer and code generator underneath
● Compare with Hadoop Pig and Hive

Page 9: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Pig and Hive pipeline

[Diagram of two independent, parallel pipelines:
Hive: Hive queries (HiveQL) → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
Pig: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan]

Page 10: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Issues with the Pig and Hive flow

● Pig and Hive share many similar steps, but are independent of each other
● Each project implements its own optimizer and executor, which prevents them from benefiting from each other's work
● There is no common data structure on which we can build both the Pig and Hive dialects
● The optimizer is not flexible enough to accommodate multiple DSLs
● Lots of duplicated effort and poor interoperability

Page 11: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Spark SQL pipeline

[Diagram: Hive queries (HiveQL) go through the Hive parser, Spark SQL queries (SparkQL) go through the SparkSQL parser, and the Dataframe DSL builds DataFrames directly; all three paths produce a DataFrame, which Catalyst turns into Spark RDD code.]

Page 12: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Spark SQL flow

● Multiple DSLs share the same optimizer and executor
● All DSLs ultimately generate DataFrames
● Catalyst is a new optimizer, built from the ground up for Spark, which is a rule-based framework
● Catalyst allows developers to plug in custom rules specific to their DSL
● You can plug in your own DSL too!

Page 13: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

What is a DataFrame?

● A DataFrame is a container for a logical plan
● A logical plan is a tree which represents the data and its schema
● Every transformation is represented as a tree manipulation
● These trees are manipulated and optimized by Catalyst rules
● The logical plan is converted to a physical plan for execution

Page 14: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Explain command

● The explain command on a DataFrame lets us look at these plans
● There are three types of logical plans:
  ○ Parsed logical plan
  ○ Analyzed logical plan
  ○ Optimized logical plan
● Explain also shows the physical plan
● Example: DataFrameExample.scala (see the sketch below)
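
The contents of DataFrameExample.scala are not reproduced here, but a sketch along the same lines only needs explain with the extended flag:

// sales.json is an assumed input file.
val df = sqlContext.read.json("sales.json")

// extended = true prints the parsed, analyzed and optimized logical plans
// in addition to the physical plan.
df.explain(true)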

Page 15: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Filter example

● In the last example, all the plans looked the same since there were no DataFrame operations
● In this example, we apply two filters to the DataFrame
● Observe the generated optimized plan
● Example: FilterExampleTree.scala (see the sketch below)
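
A sketch in the spirit of FilterExampleTree.scala; the column names c1 and c2 match the plan on the next slide, while the input file is an assumption:

val df = sqlContext.read.json("data.json")

// Apply two filters one after the other.
val filtered = df.filter("c1 != 0").filter("c2 != 0")

// The optimized plan combines both filters into a single one.
filtered.explain(true)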

Page 16: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Optimized plan

● The optimized plan is where Spark plugs in its set of optimization rules
● In our example, when multiple filters are added, Spark combines them with a logical AND (&&) for better performance
● Developers can also plug their own rules into the optimizer

Page 17: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Accessing plan trees

● Every DataFrame has a queryExecution object attached to it, which allows us to access these plans individually
● We can access the plans as follows:
  ○ Parsed plan: queryExecution.logical
  ○ Analyzed plan: queryExecution.analyzed
  ○ Optimized plan: queryExecution.optimizedPlan
● numberedTreeString on a plan lets us see its hierarchy
● Example: FilterExampleTree.scala (see the sketch below)
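
A minimal sketch of accessing these plans, assuming `filtered` is the DataFrame from the previous example (queryExecution is a developer API and its details vary between Spark versions):

val qe = filtered.queryExecution
println(qe.logical.numberedTreeString)       // parsed plan
println(qe.analyzed.numberedTreeString)      // analyzed plan
println(qe.optimizedPlan.numberedTreeString) // optimized plan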

Page 18: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Filter tree representation

Analyzed plan (two separate filters):

00 Filter NOT (CAST(c2#1, DoubleType) = CAST(0, DoubleType))
01  Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType))
02   LogicalRDD [c1#0,c2#1,c3#2,c4#3]

Optimized plan (filters combined into one):

Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0))
 LogicalRDD [c1#0,c2#1,c3#2,c4#3]

Page 19: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Manipulating trees

● Every optimization in Spark SQL is implemented as a tree (logical plan) transformation
● A series of these transformations makes for a modular optimizer
● All tree manipulations are done using Scala case classes
● As developers, we can write these manipulations too
● Let's create an OR filter rather than an AND
● Example: OrFilter.scala (see the sketch below)
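
A hypothetical sketch of a rule along the lines of OrFilter.scala: it rewrites an AND of two filter conditions into an OR. The Catalyst classes are real (Spark 1.4-era internals), but the rule itself and the name OrFilterRule are only for illustration:

import org.apache.spark.sql.catalyst.expressions.{And, Or}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object OrFilterRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Every tree node is a case class, so we can pattern match on it and rebuild it.
    case Filter(And(left, right), child) => Filter(Or(left, right), child)
  }
}

// Applying the rule by hand to an optimized plan and printing the result.
val rewritten = OrFilterRule(filtered.queryExecution.optimizedPlan)
println(rewritten.numberedTreeString)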

Page 20: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Understanding the steps in a plan

● A logical plan goes through a series of rules to resolve and optimize the plan
● Each step is a tree manipulation like the ones we have seen before
● We can apply the rules one by one to see how a given plan evolves over time
● This understanding allows us to see how to tweak a given query for better performance
● Example: StepsInQueryPlanning.scala

Page 21: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Query

select a.customerId from
  (select customerId, amountPaid as amount from sales where 1 = '1') a
where amount = 500.0
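
A sketch in the spirit of StepsInQueryPlanning.scala: register the sales data as a temporary table and run the query above; sales.json is an assumed input file:

val sales = sqlContext.read.json("sales.json")
sales.registerTempTable("sales")

val query = sqlContext.sql(
  """select a.customerId from
    |(select customerId, amountPaid as amount from sales where 1 = '1') a
    |where amount = 500.0""".stripMargin)

// Prints all the plans discussed in the following slides.
query.explain(true)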

Page 22: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Parsed plan

● This is the plan generated after parsing the DSL
● Normally these plans are generated by specific parsers such as the HiveQL parser, the DataFrame DSL parser, etc.
● They usually just recognize the different transformations and represent them as tree nodes
● It's a straightforward translation without much tweaking
● This plan is then fed to the analyzer to generate the analyzed plan

Page 23: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Parsed logical plan

'Project [a.customerId]
 'Filter (amount = 500)
  'Subquery a
   'Projection ['customerId, 'amountPaid]
    'Filter (1 = 1)
     UnresolvedRelation Sales

Page 24: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Analyzed plan

● We use sqlContext.analyzer to access the rules used to generate the analyzed plan
● These rules have to be run in sequence to resolve the different entities in the logical plan
● The different entities to be resolved are:
  ○ Relations (aka tables)
  ○ References, e.g. subqueries, aliases, etc.
  ○ Data type casting

Page 25: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

ResolveRelations rule

● This rule resolves all the relations (tables) specified in the plan
● Whenever it finds an unresolved relation, it consults the catalog, i.e. the list populated by registerTempTable
● Once it finds the relation, it replaces the unresolved relation with the actual one

Page 26: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Resolved relation logical plan

Before (parsed plan):

'Project [a.customerId]
 'Filter (amount = 500)
  'Subquery a
   'Projection ['customerId, 'amountPaid]
    'Filter (1 = 1)
     UnresolvedRelation Sales

After ResolveRelations:

'Project [a.customerId]
 'Filter (amount = 500)
  'Subquery a
   'Projection ['customerId, 'amountPaid]
    'Filter (1 = 1)
     Subquery sales
      JsonRelation Sales[amountPaid..]

Page 27: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

ResolveReferences rule

● This rule resolves all the references in the plan
● Every alias and column name gets a unique number, which lets later stages locate it irrespective of its position
● This unique numbering is what allows subqueries to be removed for better optimization

Page 28: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Resolved references plan

Before (after ResolveRelations):

'Project [a.customerId]
 'Filter (amount = 500)
  'Subquery a
   'Projection ['customerId, 'amountPaid]
    'Filter (1 = 1)
     Subquery sales
      JsonRelation Sales[amountPaid..]

After ResolveReferences:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Subquery a
   Projection [customerId#1L, amountPaid#0]
    'Filter (1 = 1)
     Subquery sales
      JsonRelation Sales[amountPaid#0..]

Page 29: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

PromoteStrings rule

● This rule allows the analyzer to promote strings to the right data types
● In our query, Filter(1 = '1') compares a double with a string
● This rule inserts a cast from string to double so the comparison has the right semantics

Page 30: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

PromoteStrings plan

Before:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Subquery a
   Projection [customerId#1L, amountPaid#0]
    'Filter (1 = 1)
     Subquery sales
      JsonRelation Sales[amountPaid#0..]

After PromoteStrings:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Subquery a
   Projection [customerId#1L, amountPaid#0]
    Filter (1 = CAST(1, DoubleType))
     Subquery sales
      JsonRelation Sales[amountPaid#0..]

Page 31: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Optimize

Page 32: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Eliminate subqueries

● This rule lets the optimizer eliminate superfluous subqueries
● This is possible because every reference now has a unique identifier
● Removing subqueries allows more advanced optimizations in subsequent steps

Page 33: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Eliminate subqueries plan

Before:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Subquery a
   Projection [customerId#1L, amountPaid#0]
    Filter (1 = CAST(1, DoubleType))
     Subquery sales
      JsonRelation Sales[amountPaid#0..]

After eliminating subqueries:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
   Filter (1 = CAST(1, DoubleType))
    JsonRelation Sales[amountPaid#0..]

Page 34: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Constant folding

● Simplifies expressions that evaluate to constant values
● In our plan, Filter(1 = 1) always evaluates to true
● So constant folding replaces the condition with true

Page 35: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Constant folding plan

Before:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
   Filter (1 = CAST(1, DoubleType))
    JsonRelation Sales[amountPaid#0..]

After constant folding:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
   Filter true
    JsonRelation Sales[amountPaid#0..]

Page 36: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Simplify filters

● This rule simplifies filters by:
  ○ Removing filters that are always true
  ○ Removing the entire plan subtree if a filter is always false
● In our query, the Filter true node will be removed
● By simplifying filters, we avoid multiple passes over the data

Page 37: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Simplify filters plan

Before:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
   Filter true
    JsonRelation Sales[amountPaid#0..]

After simplifying filters:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
   JsonRelation Sales[amountPaid#0..]

Page 38: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

PushPredicateThroughFilter

● It's always good to have filters close to the data source, as that enables better optimizations
● This rule pushes the filter down next to the JsonRelation
● When we rearrange tree nodes, we have to rewrite the filter condition to match the aliases
● In our example, the filter is rewritten to use the column amountPaid rather than the alias amount

Page 39: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

PushPredicateThroughFilter plan

Before:

Project [customerId#1L]
 Filter (amount#4 = 500)
  Projection [customerId#1L, amountPaid#0]
   JsonRelation Sales[amountPaid#0..]

After pushing the predicate down:

Project [customerId#1L]
 Projection [customerId#1L, amountPaid#0]
  Filter (amountPaid#0 = 500)
   JsonRelation Sales[amountPaid#0..]

Page 40: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Project collapsing

● Removes unnecessary projections from the plan
● In our plan, we don't need the second projection (customerId, amountPaid), since we only require one projection, customerId
● So we can get rid of the second projection
● This gives us the most optimized plan

Page 41: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Project collapsing plan

Before:

Project [customerId#1L]
 Projection [customerId#1L, amountPaid#0]
  Filter (amountPaid#0 = 500)
   JsonRelation Sales[amountPaid#0..]

After project collapsing:

Project [customerId#1L]
 Filter (amountPaid#0 = 500)
  JsonRelation Sales[amountPaid#0..]

Page 42: Anatomy of Data Frame API :  A deep dive into Spark Data Frame API

Generating the physical plan

● Catalyst can take a logical plan and turn it into a physical plan, or Spark plan
● On queryExecution, we have a plan called executedPlan which gives us the physical plan
● On the physical plan, we can call executeCollect or executeTake to start evaluating the plan (see the sketch below)
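
A minimal sketch, assuming `query` is the DataFrame from the earlier example (executedPlan, executeCollect and executeTake are Spark 1.4-era developer APIs):

val physicalPlan = query.queryExecution.executedPlan
println(physicalPlan.numberedTreeString)

// Evaluate the plan directly; both calls return rows.
val allRows  = physicalPlan.executeCollect()
val someRows = physicalPlan.executeTake(10)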