
Type Checking Scala Spark Datasets: Dataset Transforms


Page 1

Type Checking Scala Spark Datasets:
Dataset Transforms

John Nestor, 47 Degrees
www.47deg.com

Seattle Spark Meetup, September 22, 2016


Page 2

Outline

• Introduction
• Transforms
• Demos
• Implementation
• Getting the Code

Page 3

Introduction

Page 4

Spark Scala APIs

• RDD (pass closures)
  • Functional programming model
  • Types checked at compile time
• DataFrame (pass SQL)
  • SQL programming model (queries can be optimized)
  • Types checked at run time
• Dataset (pass closures or SQL)
  • Combines the best of RDDs and DataFrames
  • Some (not all) types checked at compile time
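
A minimal sketch of the contrast (assumes Spark 2.x and a local SparkSession; the names here are illustrative, not from the talk):

import org.apache.spark.sql.SparkSession

case class ABC(a: Int, b: String, c: String)

val spark = SparkSession.builder.master("local[*]").appName("apis").getOrCreate()
import spark.implicits._

val data = Seq(ABC(3, "foo", "test"), ABC(5, "xxx", "alpha"))

// RDD: the closure is compile-time checked, but opaque to the query optimizer
val viaRdd = spark.sparkContext.parallelize(data).map(r => r.a * 2)

// DataFrame: SQL expressions the optimizer can rewrite; the name "a" is only
// checked at run time
val viaDf = data.toDF().select($"a" * 2)

// Dataset: supports both styles over the same typed data
val ds = data.toDS()
val viaClosure = ds.map(r => r.a * 2) // compile-time checked
val viaSql = ds.select($"a" * 2)      // optimizable, run-time checked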

Page 5

Run-Time Scala Checking

• Field/column names
  • Names specified as strings
  • Run-time error if no such field
• Field/column types
  • Specified via casting to the expected type
  • Run-time error if not of the expected type
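
Concretely (a sketch, using the ds and CA definitions from the example on the next page):

// Misspelled column name: compiles, then fails at run time with an
// org.apache.spark.sql.AnalysisException ("cannot resolve 'nosuch' ...")
val bad1 = ds.select($"nosuch")

// Wrong type: compiles, but .as[CA] fails at run time because column "a"
// now holds a String while CA.a is an Int
val bad2 = ds.select($"b" as "a", $"a" as "c").as[CA]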

Page 6

Dataset Example

case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking; but must pass a closure, which Spark can't optimize */
val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

/* Can be query-optimized; but field names and types are checked only at run time */
val ds2 = ds.select($"b" as "c", ($"a" * 2 + $"a") as "a").as[CA]

Page 7

Transforms

Page 8

Goal

• Add strong typing to Scala Spark Datasets
  • Check field names at compile time
  • Check field types at compile time
• Each transform maps one or more Datasets to a new Dataset
• Dataset rows are compile-time types: Scala case classes

Page 9

Transform Example

case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking; and the query can still be optimized */
val smap = SqlMap[ABC, CA].act(cols => (cols.b, cols.a * 2 + cols.a))
val ds3 = smap(ds)
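
The mistakes that the plain select on page 6 defers to run time now fail to compile; a sketch (each commented line is rejected by the compiler):

// Misspelled field: ABC has no member d, so this is a compile-time error
// val bad1 = SqlMap[ABC, CA].act(cols => (cols.d, cols.a))

// Wrong types: CA expects (String, Int), so the swapped tuple does not compile
// val bad2 = SqlMap[ABC, CA].act(cols => (cols.a, cols.b))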

Page 10

Current Transforms

• Filter
• Map
• Sort
• Join (combines two Datasets)
• Aggregate (sum, count, max)

Page 11

Demos

Page 12

Demo

• Dataset example
  • map
  • select
• Transform examples
  • Map
  • Sort
  • Join
  • Filter
  • Aggregate

Page 13

Implementation

Page 14

Scala Macros

• Scala code executed at compile time
• Kinds
  • Black box - single result type specified
  • White box - result type computed (the kind used here)
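
A minimal sketch of the two kinds (Scala 2 def macros; a macro must be compiled before the code that invokes it):

import scala.language.experimental.macros
import scala.reflect.macros.{blackbox, whitebox}

object Macros {
  // Black box: callers see exactly the declared result type (Int)
  def twice(x: Int): Int = macro twiceImpl
  def twiceImpl(c: blackbox.Context)(x: c.Expr[Int]): c.Tree = {
    import c.universe._
    q"$x * 2"
  }

  // White box: declared as Any, but callers see the type the expansion
  // computes - this is what lets a macro return a distinct meta structure
  // for each case class, as on the next slide
  def twiceW(x: Int): Any = macro twiceWImpl
  def twiceWImpl(c: whitebox.Context)(x: c.Expr[Int]): c.Tree = {
    import c.universe._
    q"$x * 2"
  }
}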

Page 15

Transform Implementation

• case class Person(name: String, age: Int)
  val p = Person("Sam", 30)
• Scala macro converts
  • from: an arbitrary case class type (e.g. classOf[Person])
  • to: a meta structure that encodes field names and types
• case class PersonM(name: StringCol, age: IntCol)
  val cols = PersonM(name = StringCol("name"), age = IntCol("age"))

Page 16

Column Operations

• StrCol("A") === StrCol("B") => BoolCol("A === B")
• IntCol("A") + IntCol("B") => IntCol("A + B")
• IntCol("A").max => IntCol("A.max")
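
A sketch of how such column wrappers can carry an expression string while their Scala types drive compile-time checking (illustrative classes, not the library's actual ones):

case class BoolCol(expr: String)
case class StrCol(expr: String) {
  def ===(other: StrCol): BoolCol = BoolCol(s"$expr === ${other.expr}")
}
case class IntCol(expr: String) {
  def +(other: IntCol): IntCol = IntCol(s"$expr + ${other.expr}")
  def *(n: Int): IntCol = IntCol(s"$expr * $n")
  def max: IntCol = IntCol(s"$expr.max")
}

// What the white box macro would generate for Person on the previous slide:
case class PersonM(name: StrCol, age: IntCol)
val cols = PersonM(StrCol("name"), IntCol("age"))

val tripled = cols.age * 2 + cols.age // IntCol("age * 2 + age")
// cols.name === cols.age            // does not compile: === expects a StrCol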

Page 17

White Box Macro Restrictions

• Works fine in SBT and Eclipse
• Not supported in IntelliJ, but still usable
  • IntelliJ reports type errors
  • IntelliJ does not show available completions

Page 18

Getting the Code

Page 19

Transforms Code

• https://github.com/nestorpersist/dataset-transform
  • Code
  • Documentation
  • Examples

• "com.persist" % "dataset-transforms_2.11" % "0.0.5"

Page 20

Questions
