Spark - Alexis Seigneurin (English)

Alexis Seigneurin, @aseigneurin, @ippontech

● Processing of large volumes of data
● Distributed processing on commodity hardware
● Written in Scala, with bindings for Java and Python

History

● 2009: AMPLab, UC Berkeley
● June 2013: enters the Apache Incubator
● February 2014: "Top-level project" of the Apache Software Foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0

Use cases

● Log analysis
● Processing of text files
● Analytics
● Distributed search (as Google did in the past)
● Fraud detection
● Product recommendation

Proximity to Hadoop

● Same use cases
● Same development model: MapReduce
● Integration with the ecosystem

Simpler than Hadoop

● Simpler API to learn
● “Relaxed” MapReduce
● Spark Shell: interactive processing

Faster than Hadoop

Spark officially sets a new record in large-scale sorting (5th November 2014)

● Sorting 100 TB of data
● Hadoop MR: 72 minutes
  ○ With 2100 nodes (50400 cores)
● Spark: 23 minutes
  ○ With 206 nodes (6592 cores)

Spark ecosystem

● Spark
● Spark Shell
● Spark Streaming
● Spark SQL
● Spark ML
● GraphX

Integration

● YARN, ZooKeeper, Mesos
● HDFS
● Cassandra
● Elasticsearch
● MongoDB

Spark: Operating principle

● Resilient Distributed Dataset (RDD)
● Abstraction of a collection processed in parallel
● Fault tolerant
● Can work with tuples:
  ○ Key - Value
  ○ Tuples must be independent from each other

Sources

● Files on HDFS
● Local files
● Collection in memory
● Amazon S3
● NoSQL database
● ...
● Or a custom implementation of InputFormat

Transformations

● Processes an RDD, returns another RDD
● Lazy!
● Examples:
  ○ map(): one value → another value
  ○ mapToPair(): one value → a tuple
  ○ filter(): filters values/tuples given a condition
  ○ groupByKey(): groups values by key
  ○ reduceByKey(): aggregates values by key
  ○ join(), cogroup()...: joins two RDDs

Actions

● Does not return an RDD
● Examples:
  ○ count(): counts values/tuples
  ○ saveAsHadoopFile(): saves results in Hadoop’s format
  ○ foreach(): applies a function on each item
  ○ collect(): retrieves values in a list (List<T>)
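The difference between lazy transformations and actions can be illustrated without a cluster. The sketch below is only an analogy, written with plain `java.util.stream` rather than the Spark API: intermediate stream operations stand in for transformations, and the terminal operation stands in for an action. The class and method names (`LazyPipelineSketch`, `describe`, `run`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy (java.util.stream, NOT Spark): intermediate operations
// play the role of transformations, the terminal operation that of an action.
final class LazyPipelineSketch {

    // Records every value reaching the map step, to observe when it really runs.
    static final List<Integer> seen = new ArrayList<>();

    // Only describes the pipeline (like chaining filter() and map() on an RDD).
    static Stream<Integer> describe(List<Integer> input) {
        return input.stream()
                .filter(x -> x % 2 == 0)
                .map(x -> { seen.add(x); return x * 10; });
    }

    // Triggers the computation (like calling an action such as collect()).
    static List<Integer> run(Stream<Integer> pipeline) {
        return pipeline.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = describe(Arrays.asList(1, 2, 3, 4));
        System.out.println("seen before the action: " + seen);
        System.out.println("result: " + run(pipeline));
        System.out.println("seen after the action: " + seen);
    }
}
```

Nothing is computed while the pipeline is being described; the work happens only when a result is requested.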

Example

● Trees of Paris: CSV file, Open Data
● Count of trees by species

Spark - Example

geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...

Spark - Example

JavaSparkContext sc = new JavaSparkContext("local", "arbres");

sc.textFile("data/arbresalignementparis2010.csv")
    // skip the CSV header line
    .filter(line -> !line.startsWith("geom"))
    .map(line -> line.split(";"))
    // fields[4] is the species column ("espece")
    .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))
    .reduceByKey((x, y) -> x + y)
    .sortByKey()
    .foreach(t -> System.out.println(t._1 + " : " + t._2));

[Diagram: textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach]

Spark - Example

Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
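To make the steps of the pipeline above concrete, here is a minimal single-machine sketch of the same logic in plain Java, with no Spark involved. It assumes the semicolon-separated layout shown in the CSV extract, with the species in the fifth column; the class and method names (`SpeciesCountSketch`, `countBySpecies`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine illustration of filter / map / mapToPair / reduceByKey /
// sortByKey. This is NOT the Spark API, only the same logic on a small list.
final class SpeciesCountSketch {

    static Map<String, Integer> countBySpecies(List<String> lines) {
        // A TreeMap keeps keys sorted, which stands in for sortByKey().
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (line.startsWith("geom")) {
                continue;                              // filter: skip the header
            }
            String[] fields = line.split(";");         // map: line -> fields
            counts.merge(fields[4], 1, Integer::sum);  // (species, 1), summed by key
        }
        return counts;
    }

    public static void main(String[] args) {
        // Header and rows taken from the CSV extract shown earlier.
        List<String> lines = Arrays.asList(
                "geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta",
                "48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;",
                "48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;",
                "48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;",
                "48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29");
        countBySpecies(lines).forEach((species, n) -> System.out.println(species + " : " + n));
    }
}
```

With Spark, the same counting is spread across partitions, and reduceByKey() combines the partial sums computed on each worker.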

Spark clusters

Topology & Terminology

● One master / several workers
  ○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver

Spark in a cluster

Several options

● YARN
● Mesos
● Standalone
  ○ Workers started manually
  ○ Workers started by the master

Storage & Processing

MapReduce
● Spark (API)
● Distributed processing
● Fault tolerant

Storage
● HDFS, NoSQL database...
● Distributed storage
● Fault tolerant

Data locality

● Process the data where it is stored
● Avoid network I/Os

Data locality

[Diagram: three nodes, each running a Spark Worker and an HDFS Datanode; a Spark Master and an HDFS Namenode, each with a standby]

Demo: Spark in a cluster

Demo

$ $SPARK_HOME/sbin/start-master.sh

$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://MBP-de-Alexis:7077 --cores 2 --memory 2G

$ mvn clean package
$ $SPARK_HOME/bin/spark-submit --master spark://MBP-de-Alexis:7077 --class com.seigneurin.spark.WikipediaMapReduceByKey --deploy-mode cluster target/pres-spark-0.0.1-SNAPSHOT.jar

Spark SQL

● Query an RDD with SQL
● SQL engine: converts SQL statements to low-level instructions

Spark SQL

Prerequisites:

● Use tabular data
● Describe the schema → SchemaRDD

Describing the schema:

● Programmatic description of the data
● Schema inference through reflection (POJO)
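The idea behind inferring a schema from a POJO can be sketched with standard Java reflection. This is only an illustration of the principle, not Spark's implementation: the `Tree` class, the `BeanSchemaSketch` class and the `describe` method are hypothetical, and the field names are borrowed from the CSV columns used in the example.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: derives column names and types from a class by
// reflection. Spark's own inference logic is not reproduced here.
final class BeanSchemaSketch {

    // Hypothetical POJO describing one row of the trees dataset.
    public static class Tree {
        private float hauteurenm;
        private String espece;

        public float getHauteurenm() { return hauteurenm; }
        public void setHauteurenm(float hauteurenm) { this.hauteurenm = hauteurenm; }
        public String getEspece() { return espece; }
        public void setEspece(String espece) { this.espece = espece; }
    }

    // Maps each declared field name to its Java type, sorted by name.
    static Map<String, Class<?>> describe(Class<?> beanClass) {
        Map<String, Class<?>> columns = new TreeMap<>();
        for (Field field : beanClass.getDeclaredFields()) {
            if (!field.isSynthetic()) {   // ignore compiler-generated fields
                columns.put(field.getName(), field.getType());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        describe(Tree.class).forEach(
                (name, type) -> System.out.println(name + " : " + type.getSimpleName()));
    }
}
```

The programmatic description is the alternative when no such class exists, for example when the columns are only known at runtime.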

Spark SQL - Example

● Creating tabular data (type Row)

JavaRDD<Row> rdd = trees.map(fields -> Row.create(
    Float.parseFloat(fields[3]),
    fields[4]));

---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |

Spark SQL - Example

● Describing the schema

List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
fields.add(DataType.createStructField("espece", DataType.StringType, false));

StructType schema = DataType.createStructType(fields);

JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
schemaRDD.registerTempTable("tree");

---------------------------------------
| hauteurenm | espece |
---------------------------------------
| 10.0 | Aesculus hippocastanum |
| 15.0 | Tilia platyphyllos |
| 0.0 | Platanus x hispanica |
| 10.0 | Paulownia tomentosa |
| ... | ... |

Spark SQL - Example

● Counting trees by species

sqlContext.sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' GROUP BY espece ORDER BY espece")
    .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));

Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452

Spark Streaming

Micro-batches

● Slices a continuous flow of data into batches
● Same API as batch processing
● ≠ Apache Storm (which processes records one at a time)

DStream

● Discretized Streams
● Sequence of RDDs
● Initialized with a Duration (the batch interval)
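The micro-batch principle can be sketched in plain Java, outside of Spark: events are grouped by the batch interval they fall into, and each group would then be processed as one small batch. The names `MicroBatchSketch` and `slice` are hypothetical, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration of micro-batching (NOT the Spark Streaming API): a continuous
// flow of timestamped events is sliced into fixed-duration batches.
final class MicroBatchSketch {

    // Groups event timestamps by the start time of the batch they belong to.
    // Assumes non-negative timestamps and a strictly positive batch duration.
    static Map<Long, List<Long>> slice(List<Long> timestampsMillis, long batchMillis) {
        if (batchMillis <= 0) {
            throw new IllegalArgumentException("batchMillis must be positive");
        }
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : timestampsMillis) {
            long batchStart = (t / batchMillis) * batchMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical event times, sliced with a 1-second batch interval.
        Map<Long, List<Long>> batches = slice(Arrays.asList(100L, 900L, 1000L, 2500L), 1000L);
        batches.forEach((start, events) -> System.out.println(start + " ms -> " + events));
    }
}
```

In Spark Streaming, each such batch becomes one RDD of the DStream, which is why the batch API can be reused as-is.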

Window operations

● Sliding window
● Reuses data from other windows
● Initialized with a window length and a slide interval
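A sliding window can be sketched the same way, again in plain Java rather than with the Spark Streaming API. The example assumes one count per batch and expresses both the window length and the slide interval as a number of batches; as a simplification, only complete windows are emitted. The names `SlidingWindowSketch` and `windowedSums` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of a sliding window over per-batch counts (NOT the Spark
// Streaming API). Simplification: only complete windows are emitted.
final class SlidingWindowSketch {

    // windowLength and slideInterval are expressed in number of batches.
    static List<Integer> windowedSums(List<Integer> batchCounts, int windowLength, int slideInterval) {
        if (windowLength <= 0 || slideInterval <= 0) {
            throw new IllegalArgumentException("windowLength and slideInterval must be positive");
        }
        List<Integer> sums = new ArrayList<>();
        for (int end = windowLength; end <= batchCounts.size(); end += slideInterval) {
            int sum = 0;
            for (int i = end - windowLength; i < end; i++) {
                sum += batchCounts.get(i);   // batches may belong to several windows
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Hypothetical counts for five batches; window of 3 batches, sliding by 2.
        System.out.println(windowedSums(Arrays.asList(1, 2, 3, 4, 5), 3, 2));
    }
}
```

When the window length is larger than the slide interval, consecutive windows overlap, which is where data from other windows gets reused.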

Sources

● Socket
● Kafka
● Flume
● HDFS
● MQ (ZeroMQ...)
● Twitter
● ...
● Or a custom implementation of Receiver

Demo: Spark Streaming

Spark Streaming Demo

● Receive Tweets with hashtag #Android
  ○ Twitter4J
● Detection of the language of the Tweet
  ○ Language Detection
● Indexing with Elasticsearch
● Reporting with Kibana 4

● Launch Elasticsearch

$ curl -X DELETE localhost:9200
$ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
    "tweets": {
      "properties": {
        "user": {"type": "string", "index": "not_analyzed"},
        "text": {"type": "string"},
        "createdAt": {"type": "date", "format": "date_time"},
        "language": {"type": "string", "index": "not_analyzed"}
      }
    }
  }'

● Launch Kibana -> http://localhost:5601
● Launch the Spark Streaming process

@aseigneurin

aseigneurin.github.io

@ippontech

blog.ippon.fr
