
Page 1: Using PySpark to Process Boat Loads of Data

Using PySpark to Process Boat Loads of Data

Robert Dempsey, CEO Atlantic Dominion Solutions

Page 2: Using PySpark to Process Boat Loads of Data
Page 3: Using PySpark to Process Boat Loads of Data

We’ve mastered three jobs so you can focus on one: growing your business.

Page 4: Using PySpark to Process Boat Loads of Data

The Three Jobs

At Atlantic Dominion Solutions we perform three functions for our customers:

Consulting: we assess and advise in the areas of technology, team and process to determine how machine learning can have the biggest impact on your business.

Implementation: after a strategy session to scope the work you need, we apply our proven methodology and begin delivering smarter applications.

Training: continuous improvement requires continuous learning. We provide both on-premises and online training.

Page 5: Using PySpark to Process Boat Loads of Data

Writing the Book

Co-authoring the book Building Machine Learning Pipelines.

Written for software developers and data scientists, Building Machine Learning Pipelines teaches the skills required to create and use the infrastructure needed to run modern intelligent systems.

machinelearningpipelines.com

Page 6: Using PySpark to Process Boat Loads of Data

Robert Dempsey, CEO

• Professional: Software Engineer
• Author: Books and online courses
• Instructor: Lotus Guides, District Data Labs
• Owner: Atlantic Dominion Solutions, LLC

Page 7: Using PySpark to Process Boat Loads of Data

What You Can Expect Today

Page 8: Using PySpark to Process Boat Loads of Data

MTAC Framework™

• Mindset
• Toolbox
• Application
• Communication

Page 9: Using PySpark to Process Boat Loads of Data

Core Principles

1. When acquiring knowledge, start by going wide instead of deep.

2. Always focus on what's important to people rather than just the technology.

3. Be able to clearly communicate what you know with others.

Page 10: Using PySpark to Process Boat Loads of Data

MTAC Framework™ Applied

• Mindset: use-case-centric example
• Toolbox: Python, PySpark, Docker
• Application: Code & Analysis
• Communication: Q&A

Page 11: Using PySpark to Process Boat Loads of Data

Mindset

Page 12: Using PySpark to Process Boat Loads of Data

Keep It Simple

Image: Jesse van Dijk: http://jessevandijkart.com/the-labyrinth-of-tsan-kamal/

Page 13: Using PySpark to Process Boat Loads of Data

Solve the Problem

Image: Paulo: https://paullus23.deviantart.com/art/Bliss-soccer-field-326563199

Page 14: Using PySpark to Process Boat Loads of Data

Explain It, Simply

Page 15: Using PySpark to Process Boat Loads of Data

Break Through

Page 16: Using PySpark to Process Boat Loads of Data

Use Case

Page 17: Using PySpark to Process Boat Loads of Data

Got Clean Air?

Page 18: Using PySpark to Process Boat Loads of Data

Got Clean Air?

• Clean air is important.

• Toxic pollutants are known or suspected to cause cancer, reproductive effects, birth defects, and adverse environmental effects.

Page 19: Using PySpark to Process Boat Loads of Data

Questions to Answer

1. Which state has the highest level of pollutants?

2. Which county has the highest level of pollutants?

3. What are the top 5 pollutants by unit of measure?

4. What are the trends of pollutants by state over time?

Page 20: Using PySpark to Process Boat Loads of Data

Toolbox

Page 21: Using PySpark to Process Boat Loads of Data

Python

Page 22: Using PySpark to Process Boat Loads of Data
Page 23: Using PySpark to Process Boat Loads of Data

Spark

Page 24: Using PySpark to Process Boat Loads of Data

The Core of Spark

• A computational engine that schedules, distributes and monitors computational tasks running on a cluster

Page 25: Using PySpark to Process Boat Loads of Data

Higher-Level Tools

• Spark SQL: SQL and structured data (sketched below)

• MLlib: machine learning

• GraphX: graph processing

• Spark Streaming: process streaming data
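To make the Spark SQL bullet concrete, here is a minimal sketch; the file name (people.json) and its columns are illustrative placeholders, not from the talk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Load structured data into a DataFrame and register it as a SQL view
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")

# Query it with plain SQL
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()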

Page 26: Using PySpark to Process Boat Loads of Data

Storage

• Local file system
• Amazon S3
• Cassandra
• Hive
• HBase

File formats:

• Text files
• Sequence files
• Avro
• Parquet
• Hadoop InputFormat
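As a hedged sketch of how these plug in: Spark reads the same way from local disk, S3 or HDFS, and only the path scheme changes. All paths below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-example").getOrCreate()

# Local file system
logs = spark.read.text("file:///data/logs.txt")

# Amazon S3 (via the s3a connector), stored as Parquet
events = spark.read.parquet("s3a://my-bucket/events/")

# HDFS, stored as CSV
daily = spark.read.csv("hdfs:///warehouse/daily.csv", header=True, inferSchema=True)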

Page 27: Using PySpark to Process Boat Loads of Data

Hadoop?

• Not necessary, but…

• If you have multiple nodes you need a resource manager like YARN or Mesos

• You'll need access to distributed storage like HDFS, Amazon S3 or Cassandra

Page 28: Using PySpark to Process Boat Loads of Data

PySpark

Page 29: Using PySpark to Process Boat Loads of Data

What Is PySpark?

• An API that exposes the Spark programming model to Python
• Built on top of Spark's Java API
• Data is processed with Python and cached/shuffled in the JVM
• Driver programs

Page 30: Using PySpark to Process Boat Loads of Data

Driver Programs

• Launch parallel operations on a cluster
• Contain application functions
• Define distributed datasets
• Access Spark through a SparkContext
• Use Py4J to launch a JVM and create a JavaSparkContext
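A minimal driver program might look like the sketch below (the classic parallelize-and-aggregate example, not from the talk's repo). Creating the SparkContext is the moment Py4J launches the JVM and the JavaSparkContext behind the scenes:

from pyspark import SparkContext

# Creating the SparkContext launches the JVM via Py4J
sc = SparkContext(appName="driver-example")

# Define a distributed dataset...
rdd = sc.parallelize(range(1_000_000))

# ...and launch a parallel operation on the cluster
total = rdd.map(lambda x: x * 2).sum()
print(total)

sc.stop()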

Page 31: Using PySpark to Process Boat Loads of Data

When to Use It

When you need to…

• Process boat loads of data (TB)

• Perform operations that require all the data to be in memory (machine learning)

• Efficiently process streaming data

• Create an overly complicated use case to present at a meetup

Page 32: Using PySpark to Process Boat Loads of Data

Docker

Page 33: Using PySpark to Process Boat Loads of Data

Docker

• Software container platform

• Containers are application only (no OS)

• Deployed anywhere with same CPU architecture (x86-64, ARM)

• Available for *nix, Mac, Windows

Page 34: Using PySpark to Process Boat Loads of Data

Container Architecture

Page 35: Using PySpark to Process Boat Loads of Data

Application

Page 36: Using PySpark to Process Boat Loads of Data

PySpark in Data Architectures

Page 37: Using PySpark to Process Boat Loads of Data

Architecture #1

Data flow: Agent → File System → Apache Spark → File System → Agent → ES

Page 38: Using PySpark to Process Boat Loads of Data

Architecture #2

Data flow: Agents → S3 → Apache Spark → S3 → Athena

Page 39: Using PySpark to Process Boat Loads of Data

Architecture #3

Data flow: Agents → Apache Kafka → Apache Spark → ES / S3 / HDFS / HBase

Page 40: Using PySpark to Process Boat Loads of Data

What We’ll Build (Simple)

Data flow: Agent → File System → Apache Spark → File System
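In code, that flow reduces to read, transform, write. The sketch below assumes a CSV layout and column names for the pollution data; the actual schema lives in the repo linked later:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pollution-job").getOrCreate()

# 1. An agent has dropped raw CSVs on the file system; Spark picks them up
raw = spark.read.csv("/data/incoming/pollution_us.csv", header=True, inferSchema=True)

# 2. Keep only the columns the analysis needs (names are assumptions)
slim = raw.select("State", "County", "Pollutant", "Units", "Value")

# 3. Write the results back to the file system for the next stage
slim.write.mode("overwrite").parquet("/data/processed/pollution")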

Page 41: Using PySpark to Process Boat Loads of Data

Python

• Analysis

• Visualization

• Code in our Spark jobs

Page 42: Using PySpark to Process Boat Loads of Data

Spark

• By using PySpark

Page 43: Using PySpark to Process Boat Loads of Data

PySpark

• Process all the data!

• Perform aggregations

Page 44: Using PySpark to Process Boat Loads of Data

Docker

• Run Spark in a Docker container.

• So you don’t have to install anything.

Page 45: Using PySpark to Process Boat Loads of Data

Code Time!

Page 46: Using PySpark to Process Boat Loads of Data

README

• https://github.com/rdempsey/pyspark-for-data-processing

• Create a virtual environment (Anaconda)

• Install dependencies

• Run docker-compose to create the Spark containers

• Run a script (or all of them!) per the README

Page 47: Using PySpark to Process Boat Loads of Data

Dive In

• Data explorer notebook

• Q1 - Most polluted state

• Q2 - Most polluted county

• Q3 - Top pollutants by unit of measure

• Q4 - Pollutants over time
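As one example, Q1 could be answered with a groupBy-and-sort like the sketch below; the column names ("State", "Value") and path are assumptions, and the notebooks in the repo hold the real code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("q1-most-polluted-state").getOrCreate()

# Placeholder path; in the repo the data comes from the provided dataset
df = spark.read.parquet("/data/processed/pollution")

# Total pollutant levels per state, highest first
most_polluted = (
    df.groupBy("State")
      .agg(F.sum("Value").alias("total_pollutants"))
      .orderBy(F.desc("total_pollutants"))
)
most_polluted.show(5)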

Page 48: Using PySpark to Process Boat Loads of Data

Communication

Page 49: Using PySpark to Process Boat Loads of Data

Q&A

Page 50: Using PySpark to Process Boat Loads of Data

Early Bird Specials!

Page 51: Using PySpark to Process Boat Loads of Data

Intro to Data Science for Software Engineers

Goes live October 23, 2017

Normally: $97

Pre-Launch: $47

http://lotusguides.com

Page 52: Using PySpark to Process Boat Loads of Data

Where to Find Me

• Website: robertwdempsey.com
• Lotus Guides: lotusguides.com
• LinkedIn: robertwdempsey
• Twitter: rdempsey
• Github: rdempsey

Page 53: Using PySpark to Process Boat Loads of Data

Thank You!