Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs

© 2016 The Apache Software Foundation. Apache, Apache Ignite, the Apache feather and the Apache Ignite logo are trademarks of The Apache Software Foundation.

Better Together: Fast Data with Ignite & SparkChristos Erotocritou - Spark Summit EU 2016

© 2016 GridGain Systems, Inc.

Agenda

• GridGain & Apache Ignite Project

• Ignite In-Memory Data Fabric• Apache Ignite vs. Apache Spark• Hadoop & Spark Integration• Q & A


Apache Ignite Project• 2007: First version of GridGain• Oct. 2014: GridGain contributes Ignite to

ASF• Aug. 2015: Ignite is the second fastest

project to graduate after Spark• Today:

• 82+ contributors and growing rapidly• Huge development momentum -

Estimated 233 years of effort since the first commit in February, 2014 [Openhub]

• Mature codebase: 840k+ SLOC & more than 17k commits

January 2016

Moderador

Notas de la presentación

Time: 9 months incubation to become top-level project Statistics from OpenHub

https://www.openhub.net/p/apache-ignite/estimated_cost

https://www.openhub.net/p/apache-ignite/

https://www.openhub.net/p/apache-ignite/


• GridGain Enterprise Edition• Is a binary build of Apache Ignite™ created by GridGain• Added enterprise features for enterprise deployments• Earlier features and bug fixes by a few weeks• Heavily tested

Moderador


Enterprise: data-centre replication Added security features


Customer Use Cases

Automated Trading SystemsReal time analysis of trading positions & market risk. High volume transactions, ultra low latencies.

Financial ServicesFraud Detection, Risk Analysis, Insurance rating and modelling.

Online & Mobile AdvertisingReal time decisions, geo-targeting & retail traffic information.

Big Data AnalyticsCustomer 360 view, real-time analysis of KPIs, up-to-the-second operational BI.

Online GamingReal-time back-ends for mobile and massively parallel games.

SaaS Platforms & AppsHigh performance next-generation architectures for Software as a Service Application vendors.

Travel & E-CommerceHigh performance next-generation architectures for online hotel booking.

Moderador


Online/Mobile Advertising: Analytics General: highly scalable and high-performance applications


What is an IMDF?

High-performance distributed in-memory platform forcomputing and transacting on large-scale data sets innear real-time.

Moderador


Collection of logical in-memory components that solve high performance and scalability challenges


What is an IMDF?

Moderador


HA – Hadoop Accelerator Multiple components, fully integrated with each other


In-Memory Computing Platform

Data Grid

BatchData

ComputeGrid

Transactional & AnalyticalWorkloads

Transactional & Analytical workloads

StreamingData

ExternalPersistence

ExternalAPIs

Back-end users, third-party clients and downstream systems

DownstreamSystems

Clients accessing a high-speed distributed multi-facet service


Scalability & Resilience with Ignite

Data Grid

BatchData

ComputeGrid



StreamingData

ExternalPersistence

ExternalAPIs


DownstreamSystems



Fault Tolerance & Horizontal Scalability

Replicated Cache Partitioned Cache

Moderador


Ignite is about using memory as the master data source FULL_SYNC, FULL_ASYNC, PRIMARY_SYNC for rebalancing and backups topology validator for rebalancing / failure. Choose EVENTUALLY CONSISTENT, or FULLY CONSISTENT. FULLY blocks transactions on cluster until rebalancing is complete. client-side NEAR Caching with transactional mode affinity colocation - colocated data with data and compute with data


Local Store & Vertical Scale

• Tiered Memory• On-Heap -> Off-Heap ->

Disk• Persistent On-Disk Store• Fast Recovery• Local Data Reload

• Eliminate Network and Db impacts when reloading in-memory store

Moderador


Off-Heap: avoid large GC pauses


Storage and Caching using Ignite

Data Grid

BatchData

ComputeGrid



StreamingData

ExternalPersistency

ExternalAPIs


DownstreamSystems



• 100% JCache Compliant (JSR 107)– Basic Cache Operations– Concurrent Map APIs– Collocated Processing (EntryProcessor)– Events and Metrics– Pluggable Persistence

• Ignite Data Grid– Fault Tolerance and Scalability – Distributed Key-Value Store– SQL Queries (ANSI 99)– ACID Transactions– In-Memory Indexes– RDBMS / NoSQL Integration

In-Memory Data Grid

Moderador


Ignite is a 100% compliant implementation of JCache (JSR 107) specification. JCache provides a very simple to use, but yet very powerful API for data caching. RDBMS: external database integration. Read-through and write-through to relational database, NoSQL store or Hadoop cluster. Out-of-box: Cassandra, MongoDB, Hadoop or implement your own.


Distributed Computing with Ignite

Data Grid

BatchData

ComputeGrid



StreamingData

ExternalPersistency

ExternalAPIs


DownstreamSystems



Client-Server vs. Affinity Colocation

12

4

3 Data 1

Job 1

2

3Data 2

Job 2

ProcessingNode 1

ProcessingNode 2

ClientNode

Data Node 1

Data Node 2

ProcessingNode 1

1

3

4

Data 1

Data 2

2

2

1. Initial Request2. Fetch data from remote nodes3. Process entire data-set4. Return to client

1. Initial Request2. Co-locating processing with data3. Return partial result4. Reduce & return to client

Moderador


Tasks – Runnables/Callables Ignite allows us to take the compute to the data, NOT the data to the compute. As little data migration as possible.


• Direct API for MapReduce• Cron-like Task Scheduling• State Checkpoints• Load Balancing

• Round-robin• Random & weighted

• Automatic Failover• Per-node Shared State• Zero Deployment

• Distributed class loading

In-Memory Compute Grid


Hadoop & Spark Integration

Moderador


Collection of logical in-memory components that solve high performance and scalability challenges


Apache Ignite Apache Spark– Ingests data from HDFS or another

distributed file system– Inclined towards analytics (OLAP) and

focused on MR-specific payloads– Requires the creation of RDD and data and

processing operations are governed by it– Primarily disk-based SQL support– Strong ML libraries– Big community

– Data source agnostic– Fully fledged compute engine and

resilient data storage in-memory for OLAP & OLTP

– Zero-deployment– In-Memory SQL support– Fully ACID transactions across memory

and disk– Broader in-memory system that is less

focused on Hadoop– Off-heap memory to avoid GC pauses– In production since 2007

Moderador


Direct API for Fork/Join Some overlap in functionality, but we do not see as a direct competitor.


• IgniteRDD – Share RDD across jobs on the host– Share RDD across jobs in the

application– Share RDD globally

• Faster SQL– In-Memory Indexes– SQL on top of Shared RDD

Spark & Ignite Integration

Spark Application

Spark Worker

Spark Job

Spark Job

Ignite Node

Yarn Mesos Docker Cloud

Server

Spark Worker

Spark Job

Spark Job

Ignite Node

Server

Spark Worker

Spark Job

Spark Job

Ignite Node

Server

In-Memory Shared RDDs

Moderador


Affinity run of spark jobs


• Reading values from Ignite:

• IgniteContext is the main entry point to Spark-Ignite integration:val igniteContext = new IgniteContext[Integer, Integer]

(sparkContext, () => new IgniteConfiguration())

val cache = igniteContext.fromCache("myRdd") val result = cache.filter(_._2.contains("Ignite")).collect()

val cacheRdd = igniteContext.fromCache("myRdd")cacheRdd.savePairs(sparkContext.parallelize(1 to 10000, 10).map(i => (i, i)))

• Saving values to Ignite:

• Running SQL queries against Ignite Cache:val cacheRdd = igniteContext.fromCache("myRdd") val result = cacheRdd.sql

("select _val from Integer where val > ? and val < ?", 10, 100)

Spark & Ignite Integration: IgniteRDD

Moderador


The main difference between native Spark RDD and IgniteRDD is that Ignite RDD provides a shared in-memory view on data across different Spark jobs, workers, or applications, while native Spark RDD cannot be seen by other Spark jobs or applications. IgniteContext is the main entry point to Spark-Ignite integration. To create an instance of Ignite context, user must provide an instance of SparkContext and a closure creating IgniteConfiguration (configuration factory). Ignite context will make sure that server or client Ignite nodes exist in all involved job instances. Alternatively, a path to an XML configuration file can be passed to IgniteContext constructor which will be used to configure nodes being started.


Spark Integration: Using Dataframes from IgniteRDDs

// Create an IgniteRDD

val companyCacheIgnite = new IgniteContext[Int, String](sc, () => new IgniteConfiguration()).fromCache("CompanyCache")

// Create company DataFrame

val dfCompany = sqlContext.createDataFrame(companyCacheIgnite.map(p => Company(p._1, p._2)))

// Register DataFrame as a table

dfCompany.registerTempTable("company")

Moderador


Affinity run of spark jobs


• Ignite In-Memory File System (IGFS)– Hadoop-compliant– Easy to Install– On-Heap and Off-Heap– Caching Layer for HDFS– Write-through and Read-through

HDFS– Performance Boost

IGFS: In-Memory File System

MR HIVE PIG

In-Memory MapReduce

IGFS

HDFS

IGFS

YARN

Any Hadoop Distro

Moderador


Direct API for Fork/Join


Hadoop Accelerator: Map Reduce

• In-Memory Performance• Zero Code Change

• Use existing MR code• Use existing Hive queries

• No Name Node• No Network Noise• In-Process Data Colocation• Eager Push Scheduling

User Applicatio

n

Hadoop Client

Ignite Client

Hadoop Jobtracke

r

Hadoop Name Node

Hadoop Tasktrack

er

Hadoop Tasktrack

er

Ignite Data Node (IGFS)

Ignite Data Node (IGFS)

Hadoop Data Node

(HDFS)

Hadoop Data Node

(HDFS)

Ignite PathHadoop Path


• Docker• Amazon AWS• Azure Marketplace• Google Cloud• Apache JClouds• Mesos• YARN• Apache Karaf (OSGi)

Deployment

Moderador


- supports multiple infrastructure including Google Cloud, AWS & Docker


Thank You!

www.gridgain.com

@gridgain @ApacheIgnite#gridgain #ApacheIgnite

Thank you for joining us. Follow the conversation.

Michael [email protected]

github.com/kemiz/SparkIgniteSimpleExample

http://www.gridgain.com

https://github.com/kemiz/SparkIgniteSimpleExample

Technology

Better together: Fast Data with Apache Spark™ and Apache Ignite™ by Mike Griggs