Alexis Seigneurin (@aseigneurin), @ippontech
Spark
● Processing of large volumes of data
● Distributed processing on commodity hardware
● Written in Scala, with Java and Python bindings
History
● 2009: started at the AMPLab, UC Berkeley
● June 2013: enters the Apache Incubator
● February 2014: "Top-level project" of the Apache Software Foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0
Use cases
● Log analysis
● Processing of text files
● Analytics
● Distributed search (as Google used to do)
● Fraud detection
● Product recommendation
Proximity to Hadoop
● Same use cases
● Same development model: MapReduce
● Integration with the ecosystem
Simpler than Hadoop
● API simpler to learn
● “Relaxed” MapReduce
● Spark Shell: interactive processing
Faster than Hadoop
Spark officially sets a new record in large-scale sorting (5 November 2014)
● Sorting 100 TB of data
● Hadoop MR: 72 minutes
  ○ With 2100 nodes (50400 cores)
● Spark: 23 minutes
  ○ With 206 nodes (6592 cores)
RDD
● Resilient Distributed Dataset
● Abstraction of a collection processed in parallel
● Fault tolerant
● Can work with tuples:
  ○ Key - Value
  ○ Tuples must be independent from each other
Sources
● Files on HDFS
● Local files
● Collection in memory
● Amazon S3
● NoSQL database
● ...
● Or a custom implementation of InputFormat
Transformations
● Process an RDD, return another RDD
● Lazy!
● Examples:
  ○ map(): one value → another value
  ○ mapToPair(): one value → a tuple
  ○ filter(): filters values/tuples given a condition
  ○ groupByKey(): groups values by key
  ○ reduceByKey(): aggregates values by key
  ○ join(), cogroup()...: joins two RDDs
Actions
● Do not return an RDD
● Examples:
  ○ count(): counts values/tuples
  ○ saveAsHadoopFile(): saves results in Hadoop’s format
  ○ foreach(): applies a function on each item
  ○ collect(): retrieves values in a list (List<T>)
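The difference between lazy transformations and actions can be illustrated without Spark. The sketch below is only an analogy and uses Java's own Stream API rather than RDDs: intermediate operations such as map() merely describe the work, and nothing runs until a terminal operation is called, much as a transformation does nothing until an action is invoked. The class LazyDemo and its calls counter are hypothetical names introduced for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy (no Spark): intermediate stream operations are lazy,
// like RDD transformations; a terminal operation plays the role of an action.
class LazyDemo {
    // Counts how many times the mapping function has actually run.
    static final AtomicInteger calls = new AtomicInteger();

    // "Transformation": only describes the work, nothing is executed yet.
    static Stream<Integer> lengths(List<String> words) {
        return words.stream().map(w -> {
            calls.incrementAndGet();
            return w.length();
        });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = lengths(Arrays.asList("spark", "rdd", "hadoop"));
        System.out.println("calls before the action: " + calls.get());
        // "Action": forces the pipeline to run, similar in spirit to collect().
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println(result + ", calls after the action: " + calls.get());
    }
}
```

The counter stays at zero until collect() is called, which is the behaviour the "Lazy!" bullet refers to.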
Spark - Example
● Trees of Paris: CSV file, Open Data
● Count of trees by species
geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...
JavaSparkContext sc = new JavaSparkContext("local", "arbres");

sc.textFile("data/arbresalignementparis2010.csv")
    .filter(line -> !line.startsWith("geom"))                        // skip the header line
    .map(line -> line.split(";"))                                    // split each line into fields
    .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1))  // (species, 1)
    .reduceByKey((x, y) -> x + y)                                    // sum the 1s per species
    .sortByKey()                                                     // sort by species name
    .foreach(t -> System.out.println(t._1 + " : " + t._2));
[Diagram: data flow through textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach. The header line (geom;...) is dropped by filter, each key is paired with 1, the 1s are summed per key, and the keys are then sorted.]
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
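To make explicit what each step of the pipeline computes, here is a minimal single-JVM sketch of the same logic in plain Java streams, with no Spark and no distribution. It assumes the semicolon-separated layout shown in the CSV sample, with the species in the fifth column; the class TreeCount and the method countBySpecies are hypothetical names, not part of the presentation's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java sketch (no Spark, single JVM) of the same steps as the Spark pipeline.
class TreeCount {
    static Map<String, Long> countBySpecies(List<String> lines) {
        return lines.stream()
            .filter(line -> !line.startsWith("geom"))   // drop the header, like filter()
            .map(line -> line.split(";"))               // split into fields, like map()
            // Group by species (5th column) and count, like mapToPair() + reduceByKey();
            // the TreeMap keeps the keys sorted, like sortByKey().
            .collect(Collectors.groupingBy(fields -> fields[4], TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta",
            "48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;",
            "48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;",
            "48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;",
            "48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29");
        // Like foreach(): print one "species : count" line per key.
        countBySpecies(sample).forEach((species, n) -> System.out.println(species + " : " + n));
    }
}
```

The Spark version expresses the same steps, but each one runs in parallel over the partitions of the RDD.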
Spark in a cluster
Topology & Terminology
● One master / several workers
  ○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver
Several options
● YARN
● Mesos
● Standalone
  ○ Workers started manually
  ○ Workers started by the master
Storage & Processing
MapReduce
● Spark (API)
● Distributed processing
● Fault tolerant
Storage
● HDFS, NoSQL databases...
● Distributed storage
● Fault tolerant
Data locality
[Diagram: three nodes, each running a Spark Worker next to an HDFS Datanode; a Spark Master and an HDFS Namenode, each with a standby.]
Demo
$ $SPARK_HOME/sbin/start-master.sh
$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://MBP-de-Alexis:7077 --cores 2 --memory 2G
$ mvn clean package
$ $SPARK_HOME/bin/spark-submit --master spark://MBP-de-Alexis:7077 \
    --class com.seigneurin.spark.WikipediaMapReduceByKey \
    --deploy-mode cluster target/pres-spark-0.0.1-SNAPSHOT.jar
Spark SQL
Prerequisites:
● Use tabular data
● Describe the schema → SchemaRDD
Describing the schema:
● Programmatic description of the data
● Schema inference through reflection (POJO)
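The idea behind schema inference through reflection can be sketched in plain Java: inspect a POJO at runtime and derive a column name and type for each of its members. The example below is not Spark's implementation, which may differ in its details; the Tree class and the describe() helper are hypothetical, and the sketch reads fields directly for brevity.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the idea only: derive (column name -> type) pairs from a POJO by reflection.
// The Tree class and the describe() helper are hypothetical, not part of Spark.
class SchemaSketch {
    static class Tree {
        float hauteurenm;
        String espece;
    }

    static Map<String, String> describe(Class<?> pojo) {
        Map<String, String> schema = new TreeMap<>();  // sorted by column name
        for (Field f : pojo.getDeclaredFields()) {
            if (!f.isSynthetic()) {                    // ignore compiler-generated fields
                schema.put(f.getName(), f.getType().getSimpleName());
            }
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(describe(Tree.class));
    }
}
```

The programmatic alternative spells out the same name and type pairs by hand instead of reading them from a class.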
Spark SQL - Example
● Creating tabular data (type Row)
JavaRDD<Row> rdd = trees.map(fields -> Row.create(
    Float.parseFloat(fields[3]), fields[4]));
---------------------------------------
| 10.0       | Aesculus hippocastanum |
| 15.0       | Tilia platyphyllos     |
| 0.0        | Platanus x hispanica   |
| 10.0       | Paulownia tomentosa    |
| ...        | ...                    |
● Describing the schema
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
fields.add(DataType.createStructField("espece", DataType.StringType, false));
StructType schema = DataType.createStructType(fields);
JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
schemaRDD.registerTempTable("tree");
---------------------------------------
| hauteurenm | espece                 |
---------------------------------------
| 10.0       | Aesculus hippocastanum |
| 15.0       | Tilia platyphyllos     |
| 0.0        | Platanus x hispanica   |
| 10.0       | Paulownia tomentosa    |
| ...        | ...                    |
● Counting trees by species
sqlContext.sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' GROUP BY espece ORDER BY espece")
    .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));
Acacia dealbata : 2
Acer acerifolius : 39
Acer buergerianum : 14
Acer campestre : 452
...
Window operations
● Sliding window
● Reuses data from other windows
● Initialized with a window length and a slide interval
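How the window length and the slide interval interact can be sketched in plain Java. This is not the Spark Streaming API: it assumes, for illustration, that both parameters are counted in batches and that each batch is reduced to a single integer. The class SlidingWindow and the method windowSums are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of a sliding window over per-batch counts (not the Spark Streaming API).
// windowLength and slideInterval are expressed in number of batches.
class SlidingWindow {
    static List<Integer> windowSums(List<Integer> batches, int windowLength, int slideInterval) {
        List<Integer> sums = new ArrayList<>();
        // A window is emitted every slideInterval batches and covers the last windowLength batches.
        for (int end = slideInterval; end <= batches.size(); end += slideInterval) {
            int start = Math.max(0, end - windowLength);
            int sum = 0;
            for (int i = start; i < end; i++) {
                sum += batches.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Window of 3 batches sliding by 2: consecutive windows share one batch.
        System.out.println(windowSums(Arrays.asList(1, 2, 3, 4, 5, 6), 3, 2));
    }
}
```

Because the window length exceeds the slide interval here, each window overlaps the previous one, which is what allows data from other windows to be reused.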
Sources
● Socket
● Kafka
● Flume
● HDFS
● MQ (ZeroMQ...)
● Twitter
● ...
● Or a custom implementation of Receiver
Spark Streaming Demo
● Receive Tweets with hashtag #Android
  ○ Twitter4J
● Detection of the language of the Tweet
  ○ Language Detection
● Indexing with Elasticsearch
● Reporting with Kibana 4
Demo
● Launch Elasticsearch
$ curl -X DELETE localhost:9200
$ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
  "tweets": {
    "properties": {
      "user": {"type": "string", "index": "not_analyzed"},
      "text": {"type": "string"},
      "createdAt": {"type": "date", "format": "date_time"},
      "language": {"type": "string", "index": "not_analyzed"}
    }
  }
}'
● Launch Kibana -> http://localhost:5601
● Launch the Spark Streaming process
@aseigneurin
aseigneurin.github.io
@ippontech
blog.ippon.fr