End-to-End Data Pipelines with Apache Spark
Burak Yavuz
December 27, 2015


Who Am I?
• Software Engineer at Databricks
• MS Management Science & Eng. @ Stanford University
• BS Mechanical Eng. @ Bogazici University, Istanbul
• Contributor to Spark Core, MLlib, SQL, and Streaming
• Maintainer of Spark Packages


Outline
• Intro - Spark & Ecosystem
• Build an End-to-End Data Product
  • Step 1: Understand your Data
    • SparkSQL - DataFrames
  • Step 2: Build your Service
    • SparkMLlib - ML Pipelines
  • Step 3: Monitor your Service
    • Spark Streaming
    • Kafka


Timeline of Spark
• 2010: a research paper
• 2010-13: a project under github/mesos
• 2013-14: Apache incubating -> TLP
• 2014: the most active project in the ASF


Apache Spark


Spark Ecosystem
• 770 contributors
• 6000+ forks on GitHub
• 14000+ commits!

https://github.com/apache/spark


http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf


Spark Packages
http://spark-packages.org
• a community index of 3rd-party packages
• helps users find packages
• helps package developers meet users
• users provide feedback through voting and commenting
• index maintained by Databricks

Diagram labels: 3rd Party Packages, Community


Types of Packages Currently Available
• Data Source Connectors
  • spark-avro, spark-redshift, spark-mongodb, spark-sequoiadb, spark-cassandra-connector, …
• Deployment Scripts
  • spark_azure, spark_gce, sbt-spark-ec2
• Machine Learning Algorithms
  • spark-hash, spark-mrmr-feature-selection, streaming-matrix-factorization, generalized-kmeans-clustering
• and many more…


What’s new in Spark 1.6
• Dataset API
• Automatic memory configuration
• Optimized state storage in Spark Streaming
• Pipeline persistence in Spark ML


Demo
Source Code: http://brkyvz.github.io/spark-pipeline

Scenario: As an e-commerce company, we would like to recommend products that users may like in order to increase sales and profit.

Dataset: http://jmcauley.ucsd.edu/data/amazon/ - 18 GB - 82.83 million reviews
We will use a subset with 24 million reviews


Recommendation Engines
• Finding Similar Items
  • Clustering using:
    • Metadata
    • Matrix Factorization
  • Frequent Itemsets
• Ranking
  • Rating Prediction using:
    • Matrix Factorization


Architecture

Diagram labels: Web Service 1, Web Service 2, Web Service 3, Cassandra, Sales Data, Database, Spark, Sales + Ratings, Rating Data, ML Model, Recommendations, Request


Step 1: Understand your Data


Step 2: Build your Service


Solution Proposal
Use Matrix Factorization to understand customers and items.

Then:
1) Predict the rating for a product for a given user
2) Find similar products, and show top k
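
The two steps above can be sketched in plain Python, assuming a factorization has already produced one latent vector per user and per item. The names `user_factors`, `item_factors`, `predict_rating`, and `top_k_similar`, and all the numbers, are made up for illustration and are not taken from the demo code. The sketch scores a (user, item) pair with a dot product and ranks similar items by cosine similarity of their factor vectors, which is one common choice among several.

```python
import math

# Hypothetical latent factors of the kind a matrix factorization model produces.
# The IDs and values are invented for illustration.
user_factors = {
    "alice": [1.0, 0.5],
    "bob": [0.2, 1.5],
}
item_factors = {
    "camera": [4.0, 1.0],
    "tripod": [3.8, 1.1],
    "novel": [0.5, 3.0],
    "cookbook": [0.6, 2.8],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_rating(user, item):
    """Step 1: predicted rating is the dot product of user and item factors."""
    return dot(user_factors[user], item_factors[item])


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def top_k_similar(item, k):
    """Step 2: rank all other items by cosine similarity to the given item."""
    target = item_factors[item]
    scored = [
        (other, cosine(target, vec))
        for other, vec in item_factors.items()
        if other != item
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]
```

With these toy vectors, `predict_rating("alice", "camera")` scores the pair and `top_k_similar("camera", 2)` returns the two items whose factor vectors point in the most similar direction.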


Matrix Factorization

https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
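
As a rough illustration of the idea, here is a minimal matrix factorization sketch in plain Python. It learns a user matrix `P` and an item matrix `Q` so that the dot product of a user row and an item row approximates each observed rating. It uses stochastic gradient descent for brevity, which is an assumption made for this sketch: Spark MLlib's recommender is based on alternating least squares, and the demo may differ. The function names, hyperparameters, and ratings are all hypothetical.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, steps=2000,
              lr=0.01, reg=0.02, seed=0):
    """Learn factors so that dot(P[u], Q[i]) approximates each observed rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(rank))
            err = r - pred
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(p * q for p, q in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


# Made-up (user, item, rating) triples; unobserved pairs are simply absent.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 1.0), (2, 2, 5.0),
    (3, 2, 4.0), (3, 0, 1.0),
]
P, Q = factorize(ratings, n_users=4, n_items=3)
```

Calling `rmse(ratings, P, Q)` before and after training shows how well the learned factors reproduce the observed ratings; the unobserved (user, item) pairs are the ones the model is then used to predict.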


Step 3: Monitor your Service


Apache Kafka
• Distributed messaging system
• High-throughput
• Fast
• Scalable
• Durable
• http://kafka.apache.org/


Architecture

Diagram labels: Web Service 1, Web Service 2, Web Service 3, Kafka, Spark Streaming
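
To make the monitoring step concrete, here is a plain-Python stand-in for the kind of windowed aggregation a streaming job might run over events published by the web services. It does not use the Kafka or Spark Streaming APIs. The event shape (timestamp, service, HTTP status), the service names, the 10-second window, and the function name `error_rate_per_window` are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical log events as (timestamp_seconds, service, http_status).
events = [
    (0, "web-1", 200), (3, "web-2", 500), (7, "web-1", 200),
    (11, "web-3", 200), (14, "web-2", 500), (19, "web-2", 200),
    (21, "web-1", 404),
]


def error_rate_per_window(events, window_seconds=10):
    """Group events into tumbling windows and compute the share of 5xx
    responses for each (window_start, service) pair."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, service, status in events:
        window_start = (ts // window_seconds) * window_seconds
        key = (window_start, service)
        totals[key] += 1
        if status >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in sorted(totals)}


rates = error_rate_per_window(events)
```

Each key in `rates` is a (window start, service) pair, so a sudden jump in one service's error rate shows up in the window where it happens.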
