Intro to PySpark Workshop Garren Staubli Sr. Data Engineer @gstaubli Resources: garrens.com/pyspark124 #PySparkWorkshop

Page 2

Do I know what I’m talking about?

Working with Spark since 2015
• Batch analytics in Spark + Hive, Pig and Hadoop MapReduce
• Real-time big data reporting using Spark/Impala/CDH
• Spark Structured Streaming + ML apps for real-time decision making

50+ answers on [logo] for Spark

Page 3

Main Points

• Apache Spark
• Sample App Walkthrough
• Interactive Azure Jupyter Notebook
• Python-specific Spark advice
• Resources to continue learning

Page 4

About Apache Spark

Structured Spark.ML GraphFrame

Lazily Evaluated
• Transforms vs Actions

Immutable
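
The three properties on this slide (lazy evaluation, transforms vs actions, immutability) can be tried out without a cluster. The sketch below is a plain-Python analogy, not Spark code: `LazyDataset` is a hypothetical toy class that records transforms without running them, returns a new object from every transform, and only does work when an action such as `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for a Spark DataFrame/RDD (hypothetical; not a Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source        # underlying data, never modified
        self._steps = tuple(steps)   # recorded transforms, not yet executed

    # Transforms: return a NEW dataset and run nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: execute the recorded plan and return a result.
    def collect(self):
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records when the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


base = LazyDataset([1, 2, 3, 4])
doubled = base.map(double)               # transform: nothing runs yet
big = doubled.filter(lambda x: x > 4)    # another transform, still lazy
calls_before_action = len(calls)         # still zero at this point
result = big.collect()                   # action: the whole plan executes now
```

Because each transform returns a new `LazyDataset`, `base` is unchanged after the chain is built, which mirrors the immutability point on the slide.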

Page 5

About Apache Spark

Page 6

Spark application (Driver)

spark = SparkSession.builder\
    .appName(name="PySpark Intro")\
    .master("local[*]")\
    .getOrCreate()

[Diagram: detailed architecture. Spark application (Driver) with its SparkSession; Master (Cluster Manager); three Slave (Worker) nodes, each with an Executor running two Tasks.]

Page 7

About Apache Spark | Spark SQL

Spark SQL is not about SQL; it is about more than SQL.

Page 8

About Apache Spark | 2 Kinds of Actions

Page 9

About Apache Spark | Modern vs Legacy

Page 10

About Apache Spark | Modern Optimization

Page 11

About Apache Spark | Planning

Page 12

Walkthrough | Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName(name="PySpark Intro")\
    .master("local[*]")\
    .getOrCreate()

Cluster manager options for .master(): local, standalone, YARN, Mesos and Kubernetes
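
Each of the options listed above corresponds to a different string passed to `.master(...)`. As a rough sketch, the patterns below follow the master URL forms given in the Spark documentation; the helper `master_url` and the host names are hypothetical and exist only for illustration.

```python
# Master URL patterns as documented by Spark; host and port are placeholders.
MASTER_URL_PATTERNS = {
    "local": "local[{cores}]",
    "standalone": "spark://{host}:{port}",
    "yarn": "yarn",
    "mesos": "mesos://{host}:{port}",
    "kubernetes": "k8s://https://{host}:{port}",
}


def master_url(manager, host=None, port=None, cores="*"):
    """Build a string for SparkSession.builder.master() (hypothetical helper)."""
    pattern = MASTER_URL_PATTERNS[manager]
    # Unused keyword arguments are ignored by str.format, so "yarn" stays "yarn".
    return pattern.format(host=host, port=port, cores=cores)


# "local[*]" runs Spark in-process with one worker thread per available core.
local_master = master_url("local")

# A standalone cluster; the host name here is made up.
standalone_master = master_url(
    "standalone", host="spark-master.example.com", port=7077
)
```

In the walkthrough only the local form is used, so nothing beyond a Python environment with PySpark is needed to follow along.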

Page 13

Walkthrough | Read CSV into DataFrame

green_trips = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("green_tripdata_2017-06.csv")

inferSchema = true forces eager evaluation, because Spark must make an extra pass over the data to determine column types; the default is false.
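
To see why schema inference cannot be lazy, it helps to write a naive version by hand. The following is a plain-Python illustration, not Spark's actual inference algorithm: the sample rows are made up, and `infer_type`, `infer_schema` and `header_only_schema` are hypothetical helpers. The point is only that choosing a type per column requires reading the rows, whereas treating every column as a string needs nothing beyond the header.

```python
import csv
import io

# Made-up sample rows; the column names are illustrative only.
SAMPLE_CSV = """passenger_count,trip_distance,store_and_fwd_flag
1,2.5,N
2,0.9,N
1,11,Y
"""


def infer_type(values):
    """Pick the narrowest of int, double, string that fits every value."""
    for name, cast in (("int", int), ("double", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(text):
    """Read every row (the extra pass) and infer one type per column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)          # full scan: this is the eager part
    columns = zip(*rows)
    return {name: infer_type(col) for name, col in zip(header, columns)}


def header_only_schema(text):
    """Without inference only the header is read; every column is a string."""
    header = next(csv.reader(io.StringIO(text)))
    return {name: "string" for name in header}


inferred = infer_schema(SAMPLE_CSV)
uninferred = header_only_schema(SAMPLE_CSV)
```

On a large file the full scan in `infer_schema` is the cost being traded for typed columns; supplying an explicit schema up front is the usual way to avoid it.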

Page 14

Walkthrough | Behind the Scenes: UI

Page 15

Walkthrough | Behind the Scenes: UI

Page 16

Walkthrough | DataFrame Schema

green_trips.printSchema()

Eagerly evaluated (inferSchema = true)
Lazily evaluated (inferSchema = false)

Page 17

You guessed it… We’re hiring!

[Logos not recovered; the year and rank lists that accompanied them:]
• 2014 • 2015 #1 • 2016 #1 • 2017 #4
• 2015 #1 • 2016 #1
• 2014 • 2015 • 2016
• 2016
• 2015 #373 • 2016 #166 • 2017 #161