Spark: Beyond MapReduce

Alin Blidisel - Spark: Big Data Beyond MapReduce

Apache Spark: Introduction, Examples, Data Analysis and Statistics

Blidisel Alin

Alin Blidisel - Big Data: Beyond MapReduce

WHY SPARK?

Hadoop vs. Spark

SPARK - INTRODUCTION

- was created by Matei Zaharia at UC Berkeley

- is now an Apache Software Foundation project (donated in 2013), aimed at speeding up Hadoop-style computation

- is not a modified version of Hadoop

- in-memory cluster computing

- has its own cluster management, so it can run without Hadoop

- designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming

SPARK COMPONENTS

FEATURES OF APACHE SPARK

- Lightning-Fast Processing (claimed to be 10 to 100 times faster than Hadoop MapReduce)

- Ease of Use, as it supports multiple languages (Scala, Java, Python, R)

- Support for Sophisticated Analytics

- Real Time Stream Processing

- Ability to Integrate with Hadoop and Existing Hadoop Data

- Active and Expanding Community (more than 250 developers have contributed to Spark already)

RESILIENT DISTRIBUTED DATASETS (RDDS)

- fault-tolerant collection of elements that can be operated on in parallel (distributed and immutable)

- two ways to create RDDs:

  - parallelizing an existing collection in your driver program

  - referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat

- persistence: an RDD can be cached at a chosen storage level (MEMORY_ONLY*, MEMORY_AND_DISK*, DISK_ONLY, OFF_HEAP)
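
The RDD ideas above (immutable partitions, lazy transformations, recomputation from lineage, optional caching) can be sketched in plain Python. This is a toy model only, not the Spark API: the class `ToyRDD` and all its methods are hypothetical names made up for illustration. In Spark itself the corresponding calls would be along the lines of `sc.parallelize(...)`, `rdd.map(...)`, `rdd.filter(...)` and `rdd.persist(...)`.

```python
class ToyRDD:
    """Hypothetical toy stand-in for an RDD: immutable partitions plus lineage."""

    def __init__(self, partitions, lineage=()):
        # Source data is frozen into tuples so it can never be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Lineage = the recorded transformations, applied only when needed.
        self._lineage = tuple(lineage)
        self._cache = None

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        """Split a local collection into contiguous partitions."""
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        # Transformations return a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute(self, partition):
        # Replaying the lineage is how a "lost" partition could be rebuilt.
        items = iter(partition)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def persist(self):
        # Toy simplification: computes eagerly. Spark only marks the RDD
        # for caching and materializes it on the first action.
        self._cache = [self._compute(p) for p in self._partitions]
        return self

    def collect(self):
        parts = self._cache
        if parts is None:
            parts = [self._compute(p) for p in self._partitions]
        return [x for part in parts for x in part]


prices = ToyRDD.parallelize([1200, 1200, 3600, 1200], num_partitions=2)
expensive = prices.map(lambda p: p * 2).filter(lambda p: p > 3000)
result = expensive.collect()
```

Because every transformation returns a new object, `prices` is unchanged after `expensive` is derived from it, which is the property that makes recomputation safe.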

SPARK CLUSTER MODE OVERVIEW

SPARK USER INTERFACE

EXAMPLE: DATA ANALYSIS

Sample data from a sales transactions CSV file:

Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude

1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.1166667

1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194

1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83

1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.1333333,144.75

1/4/09 12:56,Product2,3600,Visa,Gerd W ,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025

1/4/09 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/08 15:19,1/4/09 13:04,39.79,-75.23806

1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889

1/2/09 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/09 17:43,1/4/09 20:01,36.34333,-88.85028

1/4/09 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/09 13:03,1/4/09 22:10,32.0666667,34.7666667

LOAD ORIGINAL CSV FROM HDFS

Create Spark Context and define input parameters

Create RDD from CSV file
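
The original slide code is not part of this text, so what follows is only a plain-Python sketch of the parsing step, assuming the file content is already available as a string (three of the sample rows are inlined). The function name `parse_csv` is hypothetical. In Spark, the lines would typically come from something like `sc.textFile(path)` and be split inside a `map`.

```python
import csv
import io

# A few rows of the sample data, inlined so the sketch runs without HDFS.
RAW = """Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.1166667
1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
1/4/09 12:56,Product2,3600,Visa,Gerd W ,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025
"""


def parse_csv(text):
    """Split CSV text into a header list and a list of {column: value} dicts."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [
        # Strip stray whitespace such as the trailing space in "Gerd W ".
        dict(zip(header, (field.strip() for field in row)))
        for row in reader
        if row  # skip blank lines
    ]
    return header, rows


header, rows = parse_csv(RAW)
```

All values are still strings at this point; typing them is a separate step.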

GET RANDOM DATA AND CREATE A DATAFRAME

DETERMINE FIELD TYPES
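
One possible way to determine field types, sketched in plain Python: try progressively looser parsers on every value of a column and keep the first one that accepts them all. The function `infer_type`, the type labels, and the timestamp pattern `%m/%d/%y %H:%M` are assumptions chosen to match the sample data, not taken from the original slides.

```python
from datetime import datetime


def infer_type(values):
    """Return the first of 'int', 'float', 'timestamp', 'string' that fits every value."""

    def all_parse(parser):
        # True only if the parser accepts every value in the column.
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%m/%d/%y %H:%M")):
        return "timestamp"
    return "string"


# Column values taken from the sample rows (still raw strings).
columns = {
    "Transaction_date": ["1/2/09 6:17", "1/2/09 4:53", "1/4/09 12:56"],
    "Product": ["Product1", "Product1", "Product2"],
    "Price": ["1200", "1200", "3600"],
    "Latitude": ["51.5", "39.195", "33.52056"],
}
schema = {name: infer_type(vals) for name, vals in columns.items()}
```

On a large dataset this check would usually run on a sample of rows rather than the full file, so an inferred type may need to be widened if later rows disagree.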

CREATE NEW DATAFRAME BASED ON THE NEW DETERMINED FIELD TYPES
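
Once a type is known for each field, every record can be converted accordingly. The sketch below does this in plain Python on two sample rows; `CASTERS`, `cast_row`, and the hard-coded `schema` are hypothetical names and values for illustration, and the timestamp pattern is an assumption based on the sample data.

```python
from datetime import datetime

# One conversion function per (assumed) type label.
CASTERS = {
    "int": int,
    "float": float,
    "timestamp": lambda v: datetime.strptime(v, "%m/%d/%y %H:%M"),
    "string": str,
}

# Assumed result of the type-detection step for four of the columns.
schema = {
    "Transaction_date": "timestamp",
    "Product": "string",
    "Price": "int",
    "Latitude": "float",
}


def cast_row(row, schema):
    """Convert each raw string in a row to the type named in the schema."""
    return {name: CASTERS[schema[name]](value) for name, value in row.items()}


raw_rows = [
    {"Transaction_date": "1/2/09 6:17", "Product": "Product1",
     "Price": "1200", "Latitude": "51.5"},
    {"Transaction_date": "1/4/09 12:56", "Product": "Product2",
     "Price": "3600", "Latitude": "33.52056"},
]
typed_rows = [cast_row(r, schema) for r in raw_rows]
```

The raw rows are left untouched and a new list is produced, mirroring how a new DataFrame is derived rather than the old one being modified.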

SAVE DATA IN PARQUET FORMAT

This is the new updated schema

GENERATE STATISTICS
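
As a rough illustration of the kind of summary this step produces, the sketch below computes count, mean, standard deviation, min and max for the Price column of the nine sample rows, in plain Python. The function name `describe` is chosen here to echo Spark's `DataFrame.describe()`, which reports a similar set of figures; whether the original slides used that call is an assumption.

```python
import statistics


def describe(values):
    """Basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Price column of the nine sample transactions shown earlier.
prices = [1200, 1200, 1200, 1200, 3600, 1200, 1200, 1200, 1200]
summary = describe(prices)
```

With a single Product2 sale at 3600 among eight sales at 1200, the mean is pulled above the typical price, which is the sort of skew a summary table makes visible quickly.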

© 2016 Atigeo, Corporation. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Thank you!
