26

Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

Embed Size (px)

Citation preview

Page 1: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs
Page 2: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 The Apache Software Foundation. Apache, Apache Ignite, the Apache feather and the Apache Ignite logo are trademarks of The Apache Software Foundation.

Better Together: Fast Data with Ignite & SparkChristos Erotocritou - Spark Summit EU 2016

Page 3: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Agenda

• GridGain & Apache Ignite Project

• Ignite In-Memory Data Fabric• Apache Ignite vs. Apache Spark• Hadoop & Spark Integration• Q & A

Page 4: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Apache Ignite Project• 2007: First version of GridGain• Oct. 2014: GridGain contributes Ignite to

ASF• Aug. 2015: Ignite is the second fastest

project to graduate after Spark• Today:

• 82+ contributors and growing rapidly• Huge development momentum -

Estimated 233 years of effort since the first commit in February, 2014 [Openhub]

• Mature codebase: 840k+ SLOC & more than 17k commits

January 2016

Moderador
Notas de la presentación
Time: 9 months incubation to become top-level project Statistics from OpenHub
Page 5: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

• GridGain Enterprise Edition• Is a binary build of Apache Ignite™ created by GridGain• Added enterprise features for enterprise deployments• Earlier features and bug fixes by a few weeks• Heavily tested

Moderador
Notas de la presentación
Enterprise: data-centre replication Added security features
Page 6: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Customer Use Cases

Automated Trading SystemsReal time analysis of trading positions & market risk. High volume transactions, ultra low latencies.

Financial ServicesFraud Detection, Risk Analysis, Insurance rating and modelling.

Online & Mobile AdvertisingReal time decisions, geo-targeting & retail traffic information.

Big Data AnalyticsCustomer 360 view, real-time analysis of KPIs, up-to-the-second operational BI.

Online GamingReal-time back-ends for mobile and massively parallel games.

SaaS Platforms & AppsHigh performance next-generation architectures for Software as a Service Application vendors.

Travel & E-CommerceHigh performance next-generation architectures for online hotel booking.

Moderador
Notas de la presentación
Online/Mobile Advertising: Analytics General: highly scalable and high-performance applications
Page 7: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

What is an IMDF?

High-performance distributed in-memory platform forcomputing and transacting on large-scale data sets innear real-time.

Moderador
Notas de la presentación
Collection of logical in-memory components that solve high performance and scalability challenges
Page 8: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

What is an IMDF?

Moderador
Notas de la presentación
HA – Hadoop Accelerator Multiple components, fully integrated with each other
Page 9: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

In-Memory Computing Platform

Data Grid

BatchData

ComputeGrid

Transactional & AnalyticalWorkloads

Transactional & Analytical workloads

StreamingData

ExternalPersistence

ExternalAPIs

Back-end users, third-party clients and downstream systems

DownstreamSystems

Clients accessing a high-speed distributed multi-facet service

Page 10: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Scalability & Resilience with Ignite

Data Grid

BatchData

ComputeGrid

Transactional & AnalyticalWorkloads

Transactional & Analytical workloads

StreamingData

ExternalPersistence

ExternalAPIs

Back-end users, third-party clients and downstream systems

DownstreamSystems

Clients accessing a high-speed distributed multi-facet service

Page 11: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Fault Tolerance & Horizontal Scalability

Replicated Cache Partitioned Cache

Moderador
Notas de la presentación
Ignite is about using memory as the master data source FULL_SYNC, FULL_ASYNC, PRIMARY_SYNC for rebalancing and backups topology validator for rebalancing / failure. Choose EVENTUALLY CONSISTENT, or FULLY CONSISTENT. FULLY blocks transactions on cluster until rebalancing is complete. client-side NEAR Caching with transactional mode affinity colocation - colocated data with data and compute with data
Page 12: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Local Store & Vertical Scale

• Tiered Memory• On-Heap -> Off-Heap ->

Disk• Persistent On-Disk Store• Fast Recovery• Local Data Reload

• Eliminate Network and Db impacts when reloading in-memory store

Moderador
Notas de la presentación
Off-Heap: avoid large GC pauses
Page 13: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Storage and Caching using Ignite

Data Grid

BatchData

ComputeGrid

Transactional & AnalyticalWorkloads

Transactional & Analytical workloads

StreamingData

ExternalPersistency

ExternalAPIs

Back-end users, third-party clients and downstream systems

DownstreamSystems

Clients accessing a high-speed distributed multi-facet service

Page 14: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

• 100% JCache Compliant (JSR 107)– Basic Cache Operations– Concurrent Map APIs– Collocated Processing (EntryProcessor)– Events and Metrics– Pluggable Persistence

• Ignite Data Grid– Fault Tolerance and Scalability – Distributed Key-Value Store– SQL Queries (ANSI 99)– ACID Transactions– In-Memory Indexes– RDBMS / NoSQL Integration

In-Memory Data Grid

Moderador
Notas de la presentación
Ignite is a 100% compliant implementation of JCache (JSR 107) specification. JCache provides a very simple to use, but yet very powerful API for data caching. RDBMS: external database integration. Read-through and write-through to relational database, NoSQL store or Hadoop cluster. Out-of-box: Cassandra, MongoDB, Hadoop or implement your own.
Page 15: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Distributed Computing with Ignite

Data Grid

BatchData

ComputeGrid

Transactional & AnalyticalWorkloads

Transactional & Analytical workloads

StreamingData

ExternalPersistency

ExternalAPIs

Back-end users, third-party clients and downstream systems

DownstreamSystems

Clients accessing a high-speed distributed multi-facet service

Page 16: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Client-Server vs. Affinity Colocation

12

4

3 Data 1

Job 1

2

3Data 2

Job 2

ProcessingNode 1

ProcessingNode 2

ClientNode

Data Node 1

Data Node 2

ProcessingNode 1

1

3

4

Data 1

Data 2

2

2

1. Initial Request2. Fetch data from remote nodes3. Process entire data-set4. Return to client

1. Initial Request2. Co-locating processing with data3. Return partial result4. Reduce & return to client

Moderador
Notas de la presentación
Tasks – Runnables/Callables Ignite allows us to take the compute to the data, NOT the data to the compute. As little data migration as possible.
Page 17: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

• Direct API for MapReduce• Cron-like Task Scheduling• State Checkpoints• Load Balancing

• Round-robin• Random & weighted

• Automatic Failover• Per-node Shared State• Zero Deployment

• Distributed class loading

In-Memory Compute Grid

Page 18: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Hadoop & Spark Integration

Moderador
Notas de la presentación
Collection of logical in-memory components that solve high performance and scalability challenges
Page 19: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Apache Ignite Apache Spark– Ingests data from HDFS or another

distributed file system– Inclined towards analytics (OLAP) and

focused on MR-specific payloads– Requires the creation of RDD and data and

processing operations are governed by it– Primarily disk-based SQL support– Strong ML libraries– Big community

– Data source agnostic– Fully fledged compute engine and

resilient data storage in-memory for OLAP & OLTP

– Zero-deployment– In-Memory SQL support– Fully ACID transactions across memory

and disk– Broader in-memory system that is less

focused on Hadoop– Off-heap memory to avoid GC pauses– In production since 2007

Moderador
Notas de la presentación
Direct API for Fork/Join Some overlap in functionality, but we do not see as a direct competitor.
Page 20: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

• IgniteRDD – Share RDD across jobs on the host– Share RDD across jobs in the

application– Share RDD globally

• Faster SQL– In-Memory Indexes– SQL on top of Shared RDD

Spark & Ignite Integration

Spark Application

Spark Worker

Spark Job

Spark Job

Ignite Node

Yarn Mesos Docker Cloud

Server

Spark Worker

Spark Job

Spark Job

Ignite Node

Server

Spark Worker

Spark Job

Spark Job

Ignite Node

Server

In-Memory Shared RDDs

Moderador
Notas de la presentación
Affinity run of spark jobs
Page 21: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

• Reading values from Ignite:

• IgniteContext is the main entry point to Spark-Ignite integration:val igniteContext = new IgniteContext[Integer, Integer]

(sparkContext, () => new IgniteConfiguration())

val cache = igniteContext.fromCache("myRdd") val result = cache.filter(_._2.contains("Ignite")).collect()

val cacheRdd = igniteContext.fromCache("myRdd")cacheRdd.savePairs(sparkContext.parallelize(1 to 10000, 10).map(i => (i, i)))

• Saving values to Ignite:

• Running SQL queries against Ignite Cache:val cacheRdd = igniteContext.fromCache("myRdd") val result = cacheRdd.sql

("select _val from Integer where val > ? and val < ?", 10, 100)

Spark & Ignite Integration: IgniteRDD

Moderador
Notas de la presentación
The main difference between native Spark RDD and IgniteRDD is that Ignite RDD provides a shared in-memory view on data across different Spark jobs, workers, or applications, while native Spark RDD cannot be seen by other Spark jobs or applications. IgniteContext is the main entry point to Spark-Ignite integration. To create an instance of Ignite context, user must provide an instance of SparkContext and a closure creating IgniteConfiguration (configuration factory). Ignite context will make sure that server or client Ignite nodes exist in all involved job instances. Alternatively, a path to an XML configuration file can be passed to IgniteContext constructor which will be used to configure nodes being started.
Page 22: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Spark Integration: Using Dataframes from IgniteRDDs

// Create an IgniteRDD

val companyCacheIgnite = new IgniteContext[Int, String](sc, () => new IgniteConfiguration()).fromCache("CompanyCache")

// Create company DataFrame

val dfCompany = sqlContext.createDataFrame(companyCacheIgnite.map(p => Company(p._1, p._2)))

// Register DataFrame as a table

dfCompany.registerTempTable("company")

Moderador
Notas de la presentación
Affinity run of spark jobs
Page 23: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

• Ignite In-Memory File System (IGFS)– Hadoop-compliant– Easy to Install– On-Heap and Off-Heap– Caching Layer for HDFS– Write-through and Read-through

HDFS– Performance Boost

IGFS: In-Memory File System

MR HIVE PIG

In-Memory MapReduce

IGFS

HDFS

IGFS

YARN

Any Hadoop Distro

Moderador
Notas de la presentación
Direct API for Fork/Join
Page 24: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Hadoop Accelerator: Map Reduce

• In-Memory Performance• Zero Code Change

• Use existing MR code• Use existing Hive queries

• No Name Node• No Network Noise• In-Process Data Colocation• Eager Push Scheduling

User Applicatio

n

Hadoop Client

Ignite Client

Hadoop Jobtracke

r

Hadoop Name Node

Hadoop Tasktrack

er

Hadoop Tasktrack

er

Ignite Data Node (IGFS)

Ignite Data Node (IGFS)

Hadoop Data Node

(HDFS)

Hadoop Data Node

(HDFS)

Ignite PathHadoop Path

Page 25: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

• Docker• Amazon AWS• Azure Marketplace• Google Cloud• Apache JClouds• Mesos• YARN• Apache Karaf (OSGi)

Deployment

Moderador
Notas de la presentación
- supports multiple infrastructure including Google Cloud, AWS & Docker
Page 26: Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 GridGain Systems, Inc.

Thank You!

www.gridgain.com

@gridgain @ApacheIgnite#gridgain #ApacheIgnite

Thank you for joining us. Follow the conversation.

Michael [email protected]

github.com/kemiz/SparkIgniteSimpleExample