
Page 1: Introduction to PySpark

A One Hour Introduction to Analytics with PySpark

Introduction to PySpark

http://www.slideshare.net/rjurney/introduction-to-pyspark or

http://bit.ly/intro_to_pyspark

Page 2: Introduction to PySpark

Agile Data Science 2.0

Russell Jurney

2

Skills: Data Engineer 85%, Data Scientist 85%, Visualization Software Engineer 85%, Writer 85%, Teacher 50%

Russell Jurney is a veteran data scientist and thought leader. He coined the term Agile Data Science in the book of that name from O'Reilly in 2012, which outlines the first agile development methodology for data science. Russell has constructed numerous full-stack analytics products over the past ten years and now works with clients helping them extract value from their data assets.

Russell Jurney
Principal Consultant at Data Syndrome
Data Syndrome, LLC
Email: [email protected] / datasyndrome.com

Page 3: Introduction to PySpark

Building Full-Stack Data Analytics Applications with Spark

http://bit.ly/agile_data_science

Agile Data Science 2.0

Page 4: Introduction to PySpark

Agile Data Science 2.0 4

Realtime Predictive Analytics

Rapidly learn to build entire predictive systems driven by Kafka, PySpark, Spark Streaming, and Spark MLlib, with a web front-end using Python/Flask and jQuery.

Available for purchase at http://datasyndrome.com/video

Page 5: Introduction to PySpark


Product Consulting

We build analytics products and systems consisting of big data viz, predictions, recommendations, reports and search.

Corporate Training

We offer training courses for data scientists, engineers, and data science teams.

Video Training

We offer video training courses that rapidly acclimate you to a technology and technique.

Page 6: Introduction to PySpark

Agile Data Science 2.0 6

What is Spark? What makes it go?

Concepts

Page 7: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Hadoop / HDFS
HDFS splits large data among many machines

7

Page 8: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Hadoop / MapReduce
In the beginning there was MapReduce

8

Page 9: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Spark / RDD
Spark RDDs are iterable MapReduce relations

9

Page 10: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Spark / DataFrame
Fast SQLish RDD thingies

10
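To make the contrast with RDDs concrete, here is a minimal sketch that is not part of the course code; the tiny dataset and column names are made up for illustration:

# Build a tiny dataset as an RDD, then as a DataFrame (illustrative columns)
rdd = sc.parallelize([("SFO", 10), ("JFK", 25), ("SFO", 5)])
rdd.filter(lambda x: x[1] > 5).collect()   # RDD: raw tuples, positional access

df = spark.createDataFrame(rdd, ["airport", "delay"])
df.filter(df.delay > 5).show()             # DataFrame: named columns, SQL-style operations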

Page 11: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Spark Streaming
Spark on realtime streams in mini-batches

11
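A minimal sketch of the mini-batch model, not taken from the course code; the socket host and port are hypothetical:

from pyspark.streaming import StreamingContext

# Count words arriving on a socket in 5-second mini-batches
ssc = StreamingContext(sc, 5)
lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(lambda line: line.split(" "))\
     .map(lambda word: (word, 1))\
     .reduceByKey(lambda a, b: a + b)\
     .pprint()
ssc.start()
ssc.awaitTermination()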

Page 12: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Spark Ecosystem
Lots of cool stuff working together…

12


Page 13: Introduction to PySpark

Agile Data Science 2.0 13

Setting up our class environment

Setup

Page 14: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0 14

Python 3 > 2.7

While the break in compatibility between Python 2.X and 3.X was unfortunate and unnecessary, Python 3 has increasingly become the platform of choice for analytics work. With a few alterations, all code in this course will execute in a Python 2.7 environment.

Page 15: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0 15

VirtualBox

VirtualBox is a Free and Open Source Software (FOSS) virtualization product for AMD64/Intel64 processors. It supports many operating systems and is under active development.

https://www.virtualbox.org/wiki/Downloads

Page 16: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0 16

Vagrant

Vagrant sits on top of VirtualBox and provides easy-to-use, reproducible development environments.

https://www.vagrantup.com/downloads.html

Page 17: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Vagrant Setup

17

Initializing our Vagrant Environment

# Get the project
git clone https://github.com/rjurney/Agile_Data_Code_2/

# Setup and connect to our virtual machine
vagrant up; vagrant ssh

# Now, from within Vagrant
cd Agile_Data_Code_2
./intro_download.sh

# See Appendix A and install.sh for manual install

Page 18: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0 18

Amazon EC2

Alternatively, Amazon Web Services provides a simple way to launch a prepared image for use in this exercise.

Page 19: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

EC2 Setup for Ubuntu Linux

19

Initializing our EC2 Environment

# See ec2.sh, which uses aws/ec2_bootstrap.sh
# To use, add: --user-data file://aws/ec2_bootstrap.sh

# Get the project
git clone git@github.com:rjurney/Agile_Data_Code_2.git

# Setup AWS CLI tools
pip install awscli

# Edit and run an r3.xlarge instance with your key
./ec2.sh

# ssh to the machine

Page 20: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

EC2 Setup for Ubuntu Linux
Initializing our EC2 Environment

20

# Contents of ec2.sh

# Launch our instance, which ec2_bootstrap.sh will initialize
aws ec2 run-instances \
  --image-id ami-4ae1fb5d \
  --key-name agile_data_science \
  --user-data file://aws/ec2_bootstrap.sh \
  --instance-type r3.xlarge \
  --ebs-optimized \
  --placement "AvailabilityZone=us-east-1d" \
  --block-device-mappings '{"DeviceName":"/dev/sda1","Ebs":{"DeleteOnTermination":false,"VolumeSize":1024}}' \
  --count 1

Page 21: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

EC2 Setup for Ubuntu Linux
Initializing our EC2 Environment

21

# Download the data
cd Agile_Data_Code_2
./intro_download.sh


Page 23: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Documentation Setup
Opening the right web pages to answer your questions

23

http://spark.apache.org/docs/latest/api/python/pyspark.html

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html

http://spark.apache.org/docs/latest/api/python/pyspark.ml.html

Page 24: Introduction to PySpark

Agile Data Science 2.0 24

Learning the basics of PySpark

Basic PySpark

Page 25: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Hello, World!
How to load data and perform an operation on it in Spark

25

# See ch02/spark.py

# Load the text file using the SparkContext
csv_lines = sc.textFile("data/example.csv")

# Map the data to split the lines into a list
data = csv_lines.map(lambda line: line.split(","))

# Collect the dataset into local RAM
data.collect()

Page 26: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Creating Objects from CSV using a Function
How to create objects from CSV using a function instead of a lambda

26

# See ch02/groupby.py

csv_lines = sc.textFile("data/example.csv")

# Turn the CSV lines into objects
def csv_to_record(line):
  parts = line.split(",")
  record = {
    "name": parts[0],
    "company": parts[1],
    "title": parts[2]
  }
  return record

# Apply the function to every record
records = csv_lines.map(csv_to_record)

# Inspect the first item in the dataset
records.first()

Page 27: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Using a GroupBy to Count Jobs
Count things using the groupBy API

27

# Group the records by the name of the person
grouped_records = records.groupBy(lambda x: x["name"])

# Show the first group
grouped_records.first()

# Count the groups
job_counts = grouped_records.map(
  lambda x: {
    "name": x[0],
    "job_count": len(x[1])
  }
)
job_counts.first()
job_counts.collect()
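Note that groupBy materializes every group in memory; when all you need is a count, a key/value reduce is a lighter-weight alternative. This variant is not in the course code:

# Count jobs per person without materializing the groups
job_counts = records\
  .map(lambda record: (record["name"], 1))\
  .reduceByKey(lambda a, b: a + b)
job_counts.collect()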

Page 28: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Map vs FlatMap
Understanding the difference between these two operators

28

# See ch02/flatmap.py

csv_lines = sc.textFile("data/example.csv")

# Compute a relation of words by line
words_by_line = csv_lines\
  .map(lambda line: line.split(","))
words_by_line.collect()

# Compute a relation of words
flattened_words = csv_lines\
  .map(lambda line: line.split(","))\
  .flatMap(lambda x: x)
flattened_words.collect()

Page 29: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Map vs FlatMap
Understanding the difference between these two operators

29

words_by_line.collect()

[['Russell Jurney', 'Relato', 'CEO'], ['Florian Liebert', 'Mesosphere', 'CEO'], ['Don Brown', 'Rocana', 'CIO'], ['Steve Jobs', 'Apple', 'CEO'], ['Donald Trump', 'The Trump Organization', 'CEO'], ['Russell Jurney', 'Data Syndrome', 'Principal Consultant']]

flattened_words.collect()

['Russell Jurney', 'Relato', 'CEO', 'Florian Liebert', 'Mesosphere', 'CEO', 'Don Brown', 'Rocana', 'CIO', 'Steve Jobs', 'Apple', 'CEO', 'Donald Trump', 'The Trump Organization', 'CEO', 'Russell Jurney', 'Data Syndrome', 'Principal Consultant']

Page 30: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL

30

# See ch02/sql.py

csv_lines = sc.textFile("data/example.csv")

from pyspark.sql import Row

# Convert the CSV into a pyspark.sql.Row
def csv_to_row(line):
  parts = line.split(",")
  row = Row(
    name=parts[0],
    company=parts[1],
    title=parts[2]
  )
  return row

# Apply the function to get rows in an RDD
rows = csv_lines.map(csv_to_row)

Page 31: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL

31

# Convert to a pyspark.sql.DataFrame
rows_df = rows.toDF()

# Register the DataFrame for Spark SQL
rows_df.registerTempTable("executives")

# Generate a new DataFrame with SQL using the SparkSession
job_counts = spark.sql("""
  SELECT name, COUNT(*) AS total
  FROM executives
  GROUP BY name
""")
job_counts.show()

# Go back to an RDD
job_counts.rdd.collect()

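In Spark 2.x, registerTempTable is deprecated in favor of createOrReplaceTempView; the query itself is unchanged:

# Spark 2.x equivalent of registerTempTable
rows_df.createOrReplaceTempView("executives")
job_counts = spark.sql("""
  SELECT name, COUNT(*) AS total
  FROM executives
  GROUP BY name
""")
job_counts.show()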
Page 32: Introduction to PySpark

Agile Data Science 2.0 32

Working with a more complex dataset

Exploratory Data Analysis with Airline Data

Page 33: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Loading a Parquet Columnar File
Using the Apache Parquet format to load columnar data

33

# See ch02/load_on_time_performance.py

# Load the parquet file containing flight delay records
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Register the data for Spark SQL
on_time_dataframe.registerTempTable("on_time_performance")

# Check out the columns
on_time_dataframe.columns

# Check out some data
on_time_dataframe\
  .select("FlightDate", "TailNum", "Origin", "Dest", "Carrier", "DepDelay", "ArrDelay")\
  .show()

Page 34: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Sampling a DataFrame
Sampling a DataFrame to get a better view of its data

34

# Trim the fields and keep the result
trimmed_on_time = on_time_dataframe\
  .select(
    "FlightDate",
    "TailNum",
    "Origin",
    "Dest",
    "Carrier",
    "DepDelay",
    "ArrDelay"
  )

# Sample 0.01% of the data and show
trimmed_on_time.sample(False, 0.0001).show()

Page 35: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Calculating a Histogram
Computing the distribution of a column in a dataset

35

# See ch02/histogram.py

# Load the parquet file containing flight delay records
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Register the data for Spark SQL
on_time_dataframe.registerTempTable("on_time_performance")

# Compute a histogram of departure delays
on_time_dataframe\
  .select("DepDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram(10)

Page 36: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Displaying a Histogram
Using pyplot to display a histogram

36

import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

# Function to plot a histogram using pyplot
def create_hist(rdd_histogram_data):
  """Given an RDD.histogram, plot a pyplot histogram"""
  heights = np.array(rdd_histogram_data[1])
  full_bins = rdd_histogram_data[0]
  mid_point_bins = full_bins[:-1]
  widths = [abs(i - j) for i, j in zip(full_bins[:-1], full_bins[1:])]
  bar = plt.bar(mid_point_bins, heights, width=widths, color='b')
  return bar

# Compute a histogram of departure delays with explicit buckets
departure_delay_histogram = on_time_dataframe\
  .select("DepDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram([-60, -30, -15, -10, -5, 0, 5, 10, 15, 30, 60, 90, 120, 180])

create_hist(departure_delay_histogram)

Page 37: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Displaying a Histogram
Using pyplot to display a histogram

37

Page 38: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Counting Airplanes
How many airplanes are in the US fleet in total?

38

# See ch05/assess_airplanes.py

# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
on_time_dataframe.registerTempTable("on_time_performance")

# Dump the unneeded fields
tail_numbers = on_time_dataframe.rdd.map(lambda x: x.TailNum)
tail_numbers = tail_numbers.filter(lambda x: x != '')

# distinct() gets us unique tail numbers
unique_tail_numbers = tail_numbers.distinct()

# now we need a count() of unique tail numbers
airplane_count = unique_tail_numbers.count()
print("Total airplanes: {}".format(airplane_count))
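The same count can also be expressed directly against the DataFrame API; a sketch, not from the course code:

# DataFrame equivalent: distinct, non-empty tail numbers
airplane_count = on_time_dataframe\
  .select("TailNum")\
  .where("TailNum <> ''")\
  .distinct()\
  .count()
print("Total airplanes: {}".format(airplane_count))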

Page 39: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Counting Total Flights by Month
Preparing data for a chart

39

# See ch05/total_flights.py

# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')

# Use SQL to look at the total flights by month across 2015
on_time_dataframe.registerTempTable("on_time_dataframe")
total_flights_by_month = spark.sql(
  """SELECT Month, Year, COUNT(*) AS total_flights
  FROM on_time_dataframe
  GROUP BY Year, Month
  ORDER BY Year, Month"""
)

# This map/asDict trick makes the rows print a little prettier. It is optional.
flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())
flights_chart_data.collect()

Page 40: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Preparing Complex Records for Storage
Getting data ready for storage in a document or key/value store

40

# See ch05/extract_airplanes.py

# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
on_time_dataframe.registerTempTable("on_time_performance")

# Filter down to the fields we need to identify and link to a flight
flights = on_time_dataframe.rdd.map(lambda x:
  (x.Carrier, x.FlightDate, x.FlightNum, x.Origin, x.Dest, x.TailNum)
)

# Group flights by tail number, sorted by date, then flight number, then origin/dest
flights_per_airplane = flights\
  .map(lambda nameTuple: (nameTuple[5], [nameTuple[0:5]]))\
  .reduceByKey(lambda a, b: a + b)\
  .map(lambda tuple:
    {
      'TailNum': tuple[0],
      'Flights': sorted(tuple[1], key=lambda x: (x[1], x[2], x[3], x[4]))
    }
  )
flights_per_airplane.first()

Page 41: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Counting Flight Delays
Analyzing and understanding why flights are late

41

# See ch07/explore_delays.py

# Load the on-time parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
on_time_dataframe.registerTempTable("on_time_performance")

total_flights = on_time_dataframe.count()

# Flights that were late leaving...
late_departures = on_time_dataframe.filter(on_time_dataframe.DepDelayMinutes > 0)
total_late_departures = late_departures.count()

# Flights that were late arriving...
late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)
total_late_arrivals = late_arrivals.count()

# Get the percentage of flights that are late, rounded to 1 decimal place
pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)

Page 42: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Hero Flights
How many flights made up time in the air, departing late but arriving on time?

42

# See ch07/explore_delays.py

# Flights that left late but made up time to arrive on time...
on_time_heros = on_time_dataframe.filter(
  (on_time_dataframe.DepDelayMinutes > 0)
  &
  (on_time_dataframe.ArrDelayMinutes <= 0)
)
total_on_time_heros = on_time_heros.count()
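As a follow-up that is not in the course code, the recovery rate among late departures can be computed the same way pct_late was:

# What fraction of late departures still arrived on time?
pct_recovered = round((total_on_time_heros / (total_late_departures * 1.0)) * 100, 1)
print("Recovery rate: {}%".format(pct_recovered))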

Page 43: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Presenting Results
Displaying in plain text the answers we've just calculated

43

# See ch07/explore_delays.py

print("Total flights: {:,}".format(total_flights))
print("Late departures: {:,}".format(total_late_departures))
print("Late arrivals: {:,}".format(total_late_arrivals))
print("Recoveries: {:,}".format(total_on_time_heros))
print("Percentage Late: {}%".format(pct_late))

Page 44: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0 44

# See ch07/explore_delays.py

# Get the average minutes late departing and arriving
spark.sql("""
SELECT
  ROUND(AVG(DepDelay),1) AS AvgDepDelay,
  ROUND(AVG(ArrDelay),1) AS AvgArrDelay
FROM on_time_performance
""").show()

Average Lateness Departing and Arriving
Drilling down into flights and how late they are…

Page 45: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Sampling Late Flights
Getting to know our data by sampling records of interest

45

# Why are flights late? Let's look at some delayed flights and the delay causes
late_flights = spark.sql("""
SELECT
  ArrDelayMinutes, WeatherDelay, CarrierDelay, NASDelay,
  SecurityDelay, LateAircraftDelay
FROM on_time_performance
WHERE WeatherDelay IS NOT NULL
  OR CarrierDelay IS NOT NULL
  OR NASDelay IS NOT NULL
  OR SecurityDelay IS NOT NULL
  OR LateAircraftDelay IS NOT NULL
ORDER BY FlightDate
""")
late_flights.sample(False, 0.01).show()

Page 46: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Why are Flights Late?
Analyzing and understanding why flights are late

46

# Calculate the percentage contribution to delay for each source
total_delays = spark.sql("""
SELECT
  ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay,
  ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay,
  ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,
  ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay,
  ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay
FROM on_time_performance
""")
total_delays.show()

Page 47: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

How Often are Weather Delayed Flights Late?
Analyzing and understanding why flights are late

47

# Eyeball the first to define our buckets
weather_delay_histogram = on_time_dataframe\
  .select("WeatherDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram([1, 5, 10, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
print(weather_delay_histogram)

create_hist(weather_delay_histogram)

Page 48: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

How Often are Weather Delayed Flights Late?
Analyzing and understanding why flights are late

48

Page 49: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Preparing Histogram Data for d3.js
Analyzing and understanding why flights are late

49

# Transform the data into something easily consumed by d3
def histogram_to_publishable(histogram):
  record = {'key': 1, 'data': []}
  for label, value in zip(histogram[0], histogram[1]):
    record['data'].append(
      {
        'label': label,
        'value': value
      }
    )
  return record

# Recompute the weather histogram with a filter for on-time flights
weather_delay_histogram = on_time_dataframe\
  .filter(
    (on_time_dataframe.WeatherDelay != None)
    &
    (on_time_dataframe.WeatherDelay > 0)
  )\
  .select("WeatherDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram([0, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
print(weather_delay_histogram)

record = histogram_to_publishable(weather_delay_histogram)
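To hand the record to d3.js, one option (not shown in the deck; the output path is hypothetical) is to dump it as JSON for the web front-end to fetch:

import json

# Write the publishable record where the d3.js chart can load it
with open("data/weather_delay_histogram.json", "w") as f:
  json.dump(record, f)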

Page 50: Introduction to PySpark

Agile Data Science 2.0 50

Building a classifier model

Predictive Analytics Machine Learning

Page 51: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Download Prepared Training Data
Saving time by using a prepared dataset

51

# Be in the root directory of the project
cd Agile_Data_Code_2

# Run the download script
ch08/download_data.sh

Page 52: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

String Vectorization
From properties of items to vector format

52
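A minimal sketch of the idea using the Spark ML tools applied later in the deck; the toy rows and column names are made up for illustration:

from pyspark.ml.feature import StringIndexer, VectorAssembler

# Toy data: a string property plus a numeric one
df = spark.createDataFrame(
  [("WN", 14.0), ("AA", 3.0), ("WN", -2.0)],
  ["Carrier", "DepDelay"]
)

# Index the string column, then assemble everything into one vector column
indexed = StringIndexer(inputCol="Carrier", outputCol="Carrier_index")\
  .fit(df).transform(df)
assembled = VectorAssembler(
  inputCols=["Carrier_index", "DepDelay"],
  outputCol="Features_vec"
).transform(indexed)
assembled.show()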

Page 53: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0 53

The same model in scikit-learn was 166 lines. Spark MLlib is very powerful!

ch08/train_spark_mllib_model.py

190 Line Model

#!/usr/bin/env python

import sys, os, re

# Pass date and base path to main() from airflow
def main(base_path):

  # Default to "."
  try:
    base_path
  except NameError:
    base_path = "."
  if not base_path:
    base_path = "."

  APP_NAME = "train_spark_mllib_model.py"

  # If there is no SparkSession, create the environment
  try:
    sc and spark
  except NameError as e:
    import findspark
    findspark.init()
    import pyspark
    import pyspark.sql

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()

  #
  # {
  #   "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
  #   "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
  #   "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
  # }
  #
  from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
  from pyspark.sql.types import StructType, StructField
  from pyspark.sql.functions import udf

  schema = StructType([
    StructField("ArrDelay", DoubleType(), True),        # "ArrDelay":5.0
    StructField("CRSArrTime", TimestampType(), True),   # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
    StructField("CRSDepTime", TimestampType(), True),   # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
    StructField("Carrier", StringType(), True),         # "Carrier":"WN"
    StructField("DayOfMonth", IntegerType(), True),     # "DayOfMonth":31
    StructField("DayOfWeek", IntegerType(), True),      # "DayOfWeek":4
    StructField("DayOfYear", IntegerType(), True),      # "DayOfYear":365
    StructField("DepDelay", DoubleType(), True),        # "DepDelay":14.0
    StructField("Dest", StringType(), True),            # "Dest":"SAN"
    StructField("Distance", DoubleType(), True),        # "Distance":368.0
    StructField("FlightDate", DateType(), True),        # "FlightDate":"2015-12-30T16:00:00.000-08:00"
    StructField("FlightNum", StringType(), True),       # "FlightNum":"6109"
    StructField("Origin", StringType(), True),          # "Origin":"TUS"
  ])

  input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
    base_path
  )
  features = spark.read.json(input_path, schema=schema)
  features.first()

  #
  # Check for nulls in features before using Spark ML
  #
  null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
  cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
  print(list(cols_with_nulls))

  #
  # Add a Route variable to replace FlightNum
  #
  from pyspark.sql.functions import lit, concat
  features_with_route = features.withColumn(
    'Route',
    concat(
      features.Origin,
      lit('-'),
      features.Dest
    )
  )
  features_with_route.show(6)

  #
  # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
  #
  from pyspark.ml.feature import Bucketizer

  # Setup the Bucketizer
  splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
  arrival_bucketizer = Bucketizer(
    splits=splits,
    inputCol="ArrDelay",
    outputCol="ArrDelayBucket"
  )

  # Save the bucketizer
  arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
  arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)

  # Apply the bucketizer
  ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
  ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()

  #
  # Feature extraction tools in pyspark.ml.feature
  #
  from pyspark.ml.feature import StringIndexer, VectorAssembler

  # Turn category fields into indexes
  for column in ["Carrier", "Origin", "Dest", "Route"]:
    string_indexer = StringIndexer(
      inputCol=column,
      outputCol=column + "_index"
    )

    string_indexer_model = string_indexer.fit(ml_bucketized_features)
    ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

    # Drop the original column
    ml_bucketized_features = ml_bucketized_features.drop(column)

    # Save the pipeline model
    string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
      base_path,
      column
    )
    string_indexer_model.write().overwrite().save(string_indexer_output_path)

  # Combine continuous, numeric fields with indexes of nominal ones
  # ...into one feature vector
  numeric_columns = [
    "DepDelay", "Distance",
    "DayOfMonth", "DayOfWeek",
    "DayOfYear"]
  index_columns = ["Carrier_index", "Origin_index",
                   "Dest_index", "Route_index"]
  vector_assembler = VectorAssembler(
    inputCols=numeric_columns + index_columns,
    outputCol="Features_vec"
  )
  final_vectorized_features = vector_assembler.transform(ml_bucketized_features)

  # Save the numeric vector assembler
  vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
  vector_assembler.write().overwrite().save(vector_assembler_path)

  # Drop the index columns
  for column in index_columns:
    final_vectorized_features = final_vectorized_features.drop(column)

  # Inspect the finalized features
  final_vectorized_features.show()

  # Instantiate and fit random forest classifier on all the data
  from pyspark.ml.classification import RandomForestClassifier
  rfc = RandomForestClassifier(
    featuresCol="Features_vec",
    labelCol="ArrDelayBucket",
    predictionCol="Prediction",
    maxBins=4657,
    maxMemoryInMB=1024
  )
  model = rfc.fit(final_vectorized_features)

  # Save the new model over the old one
  model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
    base_path
  )
  model.write().overwrite().save(model_output_path)

  # Evaluate model using test data
  predictions = model.transform(final_vectorized_features)

  from pyspark.ml.evaluation import MulticlassClassificationEvaluator
  evaluator = MulticlassClassificationEvaluator(
    predictionCol="Prediction",
    labelCol="ArrDelayBucket",
    metricName="accuracy"
  )
  accuracy = evaluator.evaluate(predictions)
  print("Accuracy = {}".format(accuracy))

  # Check the distribution of predictions
  predictions.groupBy("Prediction").count().show()

  # Check a sample
  predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)

if __name__ == "__main__":
  main(sys.argv[1])

Page 54: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Loading Our Training Data
Loading our data as a DataFrame to use the Spark ML APIs

54

from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
from pyspark.sql.types import StructType, StructField
from pyspark.sql.functions import udf

schema = StructType([
  StructField("ArrDelay", DoubleType(), True),        # "ArrDelay":5.0
  StructField("CRSArrTime", TimestampType(), True),   # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
  StructField("CRSDepTime", TimestampType(), True),   # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
  StructField("Carrier", StringType(), True),         # "Carrier":"WN"
  StructField("DayOfMonth", IntegerType(), True),     # "DayOfMonth":31
  StructField("DayOfWeek", IntegerType(), True),      # "DayOfWeek":4
  StructField("DayOfYear", IntegerType(), True),      # "DayOfYear":365
  StructField("DepDelay", DoubleType(), True),        # "DepDelay":14.0
  StructField("Dest", StringType(), True),            # "Dest":"SAN"
  StructField("Distance", DoubleType(), True),        # "Distance":368.0
  StructField("FlightDate", DateType(), True),        # "FlightDate":"2015-12-30T16:00:00.000-08:00"
  StructField("FlightNum", StringType(), True),       # "FlightNum":"6109"
  StructField("Origin", StringType(), True),          # "Origin":"TUS"
])

features = spark.read.json(
  "data/simple_flight_delay_features.jsonl.bz2",
  schema=schema
)
features.first()

Page 55: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Checking the Data for Nulls
Nulls will cause problems hereafter, so detect and address them first

55

#
# Check for nulls in features before using Spark ML
#
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
print(list(cols_with_nulls))
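If any columns did report nulls, two standard ways to address them (not part of the course code; the fill value is illustrative):

# Drop any rows that contain a null field...
features = features.dropna()

# ...or impute a value for a specific column
features = features.fillna({"ArrDelay": 0.0})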

Page 56: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Adding a Feature - The Route
Route is defined as origin airport code + "-" + destination airport code

56

#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat

features_with_route = features.withColumn(
  'Route',
  concat(
    features.Origin,
    lit('-'),
    features.Dest
  )
)
features_with_route.select("Origin", "Dest", "Route").show(5)

Page 57: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Bucketizing ArrDelay into ArrDelayBucket
We can't classify a continuous variable, so we must bucketize it to make it nominal/categorical

57

#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay
#
from pyspark.ml.feature import Bucketizer

splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
bucketizer = Bucketizer(
  splits=splits,
  inputCol="ArrDelay",
  outputCol="ArrDelayBucket"
)
ml_bucketized_features = bucketizer.transform(features_with_route)

# Check the buckets out
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()

Page 58: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Indexing String Columns into Numeric Columns
Nominal/categorical/string columns need to be made numeric before we can vectorize them

58

#
# Feature extraction tools in pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Turn category fields into categoric feature vectors, then drop intermediate fields
for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
               "Origin", "Dest", "Route"]:
  string_indexer = StringIndexer(
    inputCol=column,
    outputCol=column + "_index"
  )
  ml_bucketized_features = string_indexer.fit(ml_bucketized_features)\
                                         .transform(ml_bucketized_features)

# Check out the indexes
ml_bucketized_features.show(6)

Page 59: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Combining Numeric and Indexed Fields into One Vector
Our classifier needs a single field, so we combine all our numeric fields into one feature vector

59

# Handle continuous, numeric fields by combining them into one feature vector
numeric_columns = ["DepDelay", "Distance"]
index_columns = ["Carrier_index", "DayOfMonth_index",
                 "DayOfWeek_index", "DayOfYear_index",
                 "Origin_index", "Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
  inputCols=numeric_columns + index_columns,
  outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)

# Drop the index columns
for column in index_columns:
  final_vectorized_features = final_vectorized_features.drop(column)

# Check out the features
final_vectorized_features.show()

Page 60: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Splitting our Data in a Test/Train Split
We need to split our data to evaluate the performance of our classifier

60

#
# Cross validate, train and evaluate classifier
#

# Test/train split
training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3])
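For reproducible experiments it can help to pin the random seed; the seed value here is arbitrary:

# Same split, but reproducible across runs
training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3], seed=1337)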

Page 61: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Training Our Model
This is the magic in machine learning, and it is only a couple of lines of code

61

# Instantiate and fit random forest classifier
from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(
  featuresCol="Features_vec",
  labelCol="ArrDelayBucket",
  maxBins=4657,
  maxMemoryInMB=1024
)
model = rfc.fit(training_data)

Page 62: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Evaluating Our Model
Using the test/train split to evaluate our model for accuracy

62

# Evaluate model using test data
predictions = model.transform(test_data)

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
  labelCol="ArrDelayBucket",
  metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print("Accuracy = {}".format(accuracy))

Page 63: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Sampling Our Predictions
Making sure they pass the sniff test

63

# Check a sample
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)

Page 64: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0

Experiment Setup
Necessary to improve the model

64

Page 65: Introduction to PySpark

Data Syndrome: Agile Data Science 2.0 65

155 additional lines to set up an experiment and add 3 new features to improve the model

ch09/improve_spark_mllib_model.py

345 L.O.C.

#!/usr/bin/env python

import sys, os, re
import json
import datetime, iso8601
from tabulate import tabulate

# Pass date and base path to main() from airflow
def main(base_path):

  APP_NAME = "train_spark_mllib_model.py"

  # If there is no SparkSession, create the environment
  try:
    sc and spark
  except NameError as e:
    import findspark
    findspark.init()
    import pyspark
    import pyspark.sql

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()

  #
  # {
  #   "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
  #   "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
  #   "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
  # }
  #
  from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
  from pyspark.sql.types import StructType, StructField
  from pyspark.sql.functions import udf

  schema = StructType([
    StructField("ArrDelay", DoubleType(), True),        # "ArrDelay":5.0
    StructField("CRSArrTime", TimestampType(), True),   # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
    StructField("CRSDepTime", TimestampType(), True),   # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
    StructField("Carrier", StringType(), True),         # "Carrier":"WN"
    StructField("DayOfMonth", IntegerType(), True),     # "DayOfMonth":31
    StructField("DayOfWeek", IntegerType(), True),      # "DayOfWeek":4
    StructField("DayOfYear", IntegerType(), True),      # "DayOfYear":365
    StructField("DepDelay", DoubleType(), True),        # "DepDelay":14.0
    StructField("Dest", StringType(), True),            # "Dest":"SAN"
    StructField("Distance", DoubleType(), True),        # "Distance":368.0
    StructField("FlightDate", DateType(), True),        # "FlightDate":"2015-12-30T16:00:00.000-08:00"
    StructField("FlightNum", StringType(), True),       # "FlightNum":"6109"
    StructField("Origin", StringType(), True),          # "Origin":"TUS"
  ])

  input_path = "{}/data/simple_flight_delay_features.json".format(
    base_path
  )
  features = spark.read.json(input_path, schema=schema)
  features.first()

  #
  # Add a Route variable to replace FlightNum
  #
  from pyspark.sql.functions import lit, concat
  features_with_route = features.withColumn(
    'Route',
    concat(
      features.Origin,
      lit('-'),
      features.Dest
    )
  )
  features_with_route.show(6)

  #
  # Add the hour of day of scheduled arrival/departure
  #
  from pyspark.sql.functions import hour
  features_with_hour = features_with_route.withColumn(
    "CRSDepHourOfDay",
    hour(features.CRSDepTime)
  )
  features_with_hour = features_with_hour.withColumn(
    "CRSArrHourOfDay",
    hour(features.CRSArrTime)
  )
  features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()

  #
  # Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into on-time, slightly late, very late (0, 1, 2)
  #
  from pyspark.ml.feature import Bucketizer

  # Setup the Bucketizer
  splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
  arrival_bucketizer = Bucketizer(
    splits=splits,
    inputCol="ArrDelay",
    outputCol="ArrDelayBucket"
  )

  # Save the model
  arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
  arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)

  # Apply the model
  ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)
  ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()

  #
  # Feature extraction tools in pyspark.ml.feature
  #
  from pyspark.ml.feature import StringIndexer, VectorAssembler

  # Turn category fields into indexes
  for column in ["Carrier", "Origin", "Dest", "Route"]:
    string_indexer = StringIndexer(
      inputCol=column,
      outputCol=column + "_index"
    )

    string_indexer_model = string_indexer.fit(ml_bucketized_features)
    ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)

    # Save the pipeline model
    string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(
      base_path,
      column
    )
    string_indexer_model.write().overwrite().save(string_indexer_output_path)

  # Combine continuous, numeric fields with indexes of nominal ones
  # ...into one feature vector
  numeric_columns = [
    "DepDelay", "Distance",
    "DayOfMonth", "DayOfWeek",
    "DayOfYear", "CRSDepHourOfDay",
    "CRSArrHourOfDay"]
  index_columns = ["Carrier_index", "Origin_index",
                   "Dest_index", "Route_index"]
  vector_assembler = VectorAssembler(
    inputCols=numeric_columns + index_columns,
    outputCol="Features_vec"
  )
  final_vectorized_features = vector_assembler.transform(ml_bucketized_features)

  # Save the numeric vector assembler
  vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)
  vector_assembler.write().overwrite().save(vector_assembler_path)

  # Drop the index columns
  for column in index_columns:
    final_vectorized_features = final_vectorized_features.drop(column)

  # Inspect the finalized features
  final_vectorized_features.show()

  #
  # Cross validate, train and evaluate classifier: loop 5 times for 4 metrics
  #
  from collections import defaultdict
  scores = defaultdict(list)
  feature_importances = defaultdict(list)
  metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
  split_count = 3

  for i in range(1, split_count + 1):
    print("\nRun {} out of {} of test/train splits in cross validation...".format(
      i,
      split_count,
    )
    )

    # Test/train split
    training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])

    # Instantiate and fit random forest classifier on all the data
    from pyspark.ml.classification import RandomForestClassifier
    rfc = RandomForestClassifier(
      featuresCol="Features_vec",
      labelCol="ArrDelayBucket",
      predictionCol="Prediction",
      maxBins=4657,
    )
    model = rfc.fit(training_data)

    # Save the new model over the old one
    model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
      base_path
    )
    model.write().overwrite().save(model_output_path)

    # Evaluate model using test data
    predictions = model.transform(test_data)

    # Evaluate this split's results for each metric
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    for metric_name in metric_names:
      evaluator = MulticlassClassificationEvaluator(
        labelCol="ArrDelayBucket",
        predictionCol="Prediction",
        metricName=metric_name
      )
      score = evaluator.evaluate(predictions)
      scores[metric_name].append(score)
      print("{} = {}".format(metric_name, score))

    #
    # Collect feature importances
    #
    feature_names = vector_assembler.getInputCols()
    feature_importance_list = model.featureImportances
    for feature_name, feature_importance in zip(feature_names, feature_importance_list):
      feature_importances[feature_name].append(feature_importance)

  #
  # Evaluate average and STD of each metric and print a table
  #
  import numpy as np
  score_averages = defaultdict(float)

  # Compute the table data
  average_stds = []  # ha
  for metric_name in metric_names:
    metric_scores = scores[metric_name]
    average_accuracy = sum(metric_scores) / len(metric_scores)
    score_averages[metric_name] = average_accuracy
    std_accuracy = np.std(metric_scores)
    average_stds.append((metric_name, average_accuracy, std_accuracy))

  # Print the table
  print("\nExperiment Log")
  print("--------------")
  print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))

  #
  # Persist the score to a score log that exists between runs
  #
  import pickle

  # Load the score log or initialize an empty one
  try:
    score_log_filename = "{}/models/score_log.pickle".format(base_path)
    score_log = pickle.load(open(score_log_filename, "rb"))
    if not isinstance(score_log, list):
      score_log = []
  except IOError:
    score_log = []

  # Compute the existing score log entry
  score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}

  # Compute and display the change in score for each metric
  try:
    last_log = score_log[-1]
  except (IndexError, TypeError, AttributeError):
    last_log = score_log_entry

  experiment_report = []
  for metric_name in metric_names:
    run_delta = score_log_entry[metric_name] - last_log[metric_name]
    experiment_report.append((metric_name, run_delta))

  print("\nExperiment Report")
  print("-----------------")
  print(tabulate(experiment_report, headers=["Metric", "Score"]))

  # Append the existing average scores to the log
  score_log.append(score_log_entry)

  # Persist the log for next run
  pickle.dump(score_log, open(score_log_filename, "wb"))

  #
  # Analyze and report feature importance changes
  #

  # Compute averages for each feature
  feature_importance_entry = defaultdict(float)
  for feature_name, value_list in feature_importances.items():
    average_importance = sum(value_list) / len(value_list)
    feature_importance_entry[feature_name] = average_importance

  # Sort the feature importances in descending order and print
  import operator
  sorted_feature_importances = sorted(
    feature_importance_entry.items(),
    key=operator.itemgetter(1),
    reverse=True
  )

  print("\nFeature Importances")
  print("-------------------")
  print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))

  #
  # Compare this run's feature importances with the previous run's
  #

  # Load the feature importance log or initialize an empty one
  try:
    feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
    feature_log = pickle.load(open(feature_log_filename, "rb"))
    if not isinstance(feature_log, list):
      feature_log = []
  except IOError:
    feature_log = []

  # Compute and display the change in score for each feature
  try:
    last_feature_log = feature_log[-1]
  except (IndexError, TypeError, AttributeError):
    last_feature_log = defaultdict(float)
    for feature_name, importance in feature_importance_entry.items():
      last_feature_log[feature_name] = importance

  # Compute the deltas
  feature_deltas = {}
  for feature_name in feature_importances.keys():
    run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
    feature_deltas[feature_name] = run_delta

  # Sort feature deltas, biggest change first
  import operator
  sorted_feature_deltas = sorted(
    feature_deltas.items(),
    key=operator.itemgetter(1),
    reverse=True
  )

  # Display sorted feature deltas
  print("\nFeature Importance Delta Report")
  print("-------------------------------")
  print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))

  # Append the existing average deltas to the log
  feature_log.append(feature_importance_entry)

  # Persist the log for next run
  pickle.dump(feature_log, open(feature_log_filename, "wb"))

if __name__ == "__main__":
  main(sys.argv[1])

Page 66: Introduction to PySpark

Data Syndrome, LLC
Russell Jurney, Principal Consultant
Email: [email protected] / datasyndrome.com