Type Checking Scala Spark Datasets:
Dataset Transforms
John Nestor 47 Degrees
www.47deg.com
Seattle Spark Meetup, September 22, 2016
47deg.com © Copyright 2016 47 Degrees
Outline
• Introduction
• Transforms
• Demos
• Implementation
• Getting the Code
Introduction
Spark Scala APIs
• RDD (pass closures)
  • Functional programming model
  • Types checked at compile time
• DataFrame (pass SQL)
  • SQL programming model (can be optimized)
  • Types checked at run time
• Dataset (pass SQL)
  • Combines best of RDDs and DataFrames
  • Some (not all) types checked at compile time
Run-Time Scala Checking
• Field/column names
  • Names specified as strings
  • Run-time error if no such field
• Field/column types
  • Specified via casting to the expected type
  • Run-time error if value is not of the expected type
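These failure modes can be simulated without Spark at all: an untyped row behaves like a `Map[String, Any]`, so a misspelled name or a wrong cast only surfaces when the code runs. The sketch below is an analogy for illustration, not Spark code.

```scala
import scala.util.Try

// Analogy only (not Spark): a DataFrame row is essentially an untyped
// record, so name and type errors surface at run time, not compile time.
object RunTimeCheckDemo {
  val row: Map[String, Any] = Map("a" -> 3, "b" -> "foo")

  // Looking up a misspelled column name fails only at run time.
  def missingField: Boolean = Try(row("nosuch")).isFailure

  // Casting a column value to the wrong type also fails only at run time.
  def badCast: Boolean = Try(row("b").asInstanceOf[Int]).isFailure
}
```

Both lookups compile cleanly; the errors appear only when the expressions are evaluated, which is exactly the weakness the transforms below address.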
Dataset Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking; but must pass a closure, so Spark can't optimize */
val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

/* Can be query optimized; but run-time type and field name checking */
val ds2 = ds.select($"b" as "c", ($"a" * 2 + $"a") as "a").as[CA]
Transforms
Goal
• Add strong typing to Scala Spark Datasets
• Check field names at compile time
• Check field types at compile time
• Each transform maps one or more Datasets to a new Dataset.
• Dataset rows are compile-time types: Scala case classes
Transform Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking; and can do query optimization */
val smap = SqlMap[ABC, CA]
  .act(cols => (cols.b, cols.a * 2 + cols.a))
val ds3 = smap(ds)
Current Transforms
• Filter
• Map
• Sort
• Join (combines two Datasets)
• Aggregate (sum, count, max)
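To make the idea concrete, here is a Spark-free sketch of what a typed transform is in principle: a function from rows of type A to rows of type B, checked entirely by the Scala compiler. The names `Transform`, `filterT`, and `mapT` are illustrative, not the library's actual API.

```scala
// Spark-free sketch: a transform maps a collection of rows of type A to
// rows of type B, so misuse (wrong field, wrong type) is a compile error.
trait Transform[A, B] {
  def apply(in: Seq[A]): Seq[B]
}

object Transforms {
  // Filter keeps the row type; the predicate is checked against A.
  def filterT[A](p: A => Boolean): Transform[A, A] =
    new Transform[A, A] { def apply(in: Seq[A]): Seq[A] = in.filter(p) }

  // Map changes the row type from A to B via a checked function.
  def mapT[A, B](f: A => B): Transform[A, B] =
    new Transform[A, B] { def apply(in: Seq[A]): Seq[B] = in.map(f) }
}
```

The real library additionally records a SQL expression for each transform so Spark can still optimize the query, rather than executing an opaque closure.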
Demos
Demo
• Dataset example
  • map
  • select
• Transform examples
  • Map
  • Sort
  • Join
  • Filter
  • Aggregate
Implementation
Scala Macros
• Scala code executed at compile time
• Kinds
  • Black box - single result type specified
  • White box - result type computed
Transform Implementation
• case class Person(name: String, age: Int)
  val p = Person("Sam", 30)
• A Scala macro converts
  • from: an arbitrary case class type (here, Person)
  • to: a meta structure that encodes field names and types
• case class PersonM(name: StringCol, age: IntCol)
  val cols = PersonM(name = StringCol("name"), age = IntCol("age"))
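The meta structure can be shown in plain Scala, written by hand here; in the library the white-box macro derives it from the case class at compile time. Column class names follow the slide; the library's actual encoding may differ.

```scala
// Sketch only: hand-written here, but the white-box macro generates this
// meta structure automatically from the case class at compile time.
case class StringCol(name: String)
case class IntCol(name: String)

case class Person(name: String, age: Int)

// The meta structure: one typed column per field, carrying the field name.
case class PersonM(name: StringCol, age: IntCol)

object MetaDemo {
  val cols: PersonM = PersonM(StringCol("name"), IntCol("age"))
}
```

Because `cols.name` is a `StringCol` and `cols.age` is an `IntCol`, a transform body like `cols => (cols.b, cols.a * 2)` is checked by scalac: a misspelled field or a type mismatch fails at compile time.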
Column Operations
• StrCol("A") === StrCol("B") => BoolCol("A === B")
• IntCol("A") + IntCol("B") => IntCol("A + B")
• IntCol("A").max => IntCol("A.max")
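The operations above can be sketched in a few lines: each operator builds the SQL expression text at run time while Scala checks the column types (`StrCol` vs `IntCol` vs `BoolCol`) at compile time. Class names mirror the slide; the library's real encoding may differ.

```scala
// Sketch of typed column operations that build SQL expression strings.
object ColOps {
  case class BoolCol(expr: String)

  case class StrCol(expr: String) {
    // String equality yields a boolean column expression.
    def ===(other: StrCol): BoolCol = BoolCol(s"${expr} === ${other.expr}")
  }

  case class IntCol(expr: String) {
    // Arithmetic stays an integer column expression.
    def +(other: IntCol): IntCol = IntCol(s"${expr} + ${other.expr}")
    // Aggregation is also expression building.
    def max: IntCol = IntCol(s"${expr}.max")
  }
}
```

Passing a `StrCol` where an `IntCol` is expected (e.g. `StrCol("A") + IntCol("B")`) simply does not compile, which is the compile-time guarantee the transforms provide.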
White Box Macro Restrictions
• Works fine in SBT and Eclipse
• Not supported in IntelliJ, but can still be used
  • Reports type errors
  • Does not show available completions
Getting the Code
Transforms Code
• https://github.com/nestorpersist/dataset-transform
  • Code
  • Documentation
  • Examples
• "com.persist" % "dataset-transforms_2.11" % "0.0.5"
Questions