A brief introduction to modern caching technologies, from distributed memcached to modern data grids like Oracle Coherence. Slides were presented during a distributed caching tech talk in Moscow, May 17, 2012.
From distributed caches to in-memory data grids
TechTalk by Max A. Alexejev ([email protected])
Memory Hierarchy
Cost grows toward the top of the hierarchy; storage term grows toward the bottom:
• Registers: <1 ns
• L1 cache: ~4 cycles, ~1 ns
• L2 cache: ~10 cycles, ~3 ns
• L3 cache: ~42 cycles, ~15 ns
• DRAM: >65 ns
• Flash / SSD / USB
• HDD
• Tapes, remote systems, etc.
Software caches
• Improve response times by reducing data access latency
• Offload persistent storages
• Only work for IO-bound applications!
Caches and data location
• Local
• Remote: hierarchical, distributed, or shared
Hierarchical caches rely on a consistency protocol; distributed caches rely on a distribution algorithm.
Ok, so how do we grow beyond one node?
Data replication
Pros and Cons of replication

Pros:
• Best read performance (for local replicated caches)
• Fault-tolerant cache (both local and remote)
• Can be smart: replicate only part of the CRUD cycle

Cons:
• Poor write performance
• Additional network load
• Can scale only vertically: limited by a single machine's size
• Master-master replication requires a complex consistency protocol
Ok, so how do we grow beyond one node?
Data distribution
Pros and Cons of data distribution

Pros:
• Can scale horizontally beyond a single machine's size
• Read and write performance scales horizontally

Cons:
• No fault tolerance for cached data
• Increased read latency (due to network round-trips and serialization expenses)
What do high-load applications need from a cache?
Low latency and linear horizontal scalability: a distributed cache.
Cache access patterns: Cache Aside

For reading data:
1. Application asks for some data for a given key.
2. Check the cache.
3. If the data is in the cache, return it to the user.
4. If the data is not in the cache, fetch it from the DB, put it in the cache, and return it to the user.

For writing data:
1. Application writes some new data or updates existing data.
2. Write it to the cache.
3. Write it to the DB.

Overall:
• Improves read performance
• Offloads DB reads
• Introduces race conditions for writes
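The cache-aside flow above can be sketched in a few lines of Python. This is a minimal in-process sketch: `db` and `cache` are plain dicts standing in for a persistent store and a cache node, and the function names are illustrative, not any product's API.

```python
# Minimal cache-aside sketch: the application drives both the cache and the DB.
db = {"user:1": "Alice"}   # stands in for a persistent store
cache = {}                 # stands in for a cache node

def read(key):
    if key in cache:           # cache hit: serve directly
        return cache[key]
    value = db.get(key)        # cache miss: go to the DB
    if value is not None:
        cache[key] = value     # populate the cache for next time
    return value

def write(key, value):
    cache[key] = value         # no ordering guarantee vs. a concurrent writer:
    db[key] = value            # this is where the write race condition lives
```

Note that two concurrent `write` calls can leave the cache and the DB holding different values, which is exactly the race condition the slide warns about.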
Cache access patterns: Read Through

For reading data:
1. Application asks for some data for a given key.
2. Check the cache.
3. If the data is in the cache, return it to the user.
4. If the data is not in the cache, the cache itself fetches it from the DB, saves the retrieved value, and returns it to the user.

Overall:
• Reduces read latency
• Offloads read load from the underlying storage
• May have blocking behavior, thus helping with the dog-pile effect
• Requires "smarter" cache nodes
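The difference from cache-aside is that the miss handling moves into the cache itself. A hedged sketch, where the cache is given a loader callback it invokes on a miss (the class and parameter names are made up for illustration):

```python
# Read-through sketch: the cache owns the miss handling via a loader callback.
class ReadThroughCache:
    def __init__(self, loader):
        self._data = {}
        self._loader = loader    # invoked by the cache itself on a miss

    def get(self, key):
        if key not in self._data:
            # the cache, not the application, fetches from the backing store
            self._data[key] = self._loader(key)
        return self._data[key]

db = {"cfg:timeout": 30}                 # stands in for a persistent store
cache = ReadThroughCache(loader=db.get)  # cache knows how to load misses
```

A real read-through cache would also block concurrent readers of the same missing key so that only one loader call hits the DB, which is the dog-pile protection the slide mentions.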
Cache access patterns: Write Through

For writing data:
1. Application writes some new data or updates existing data.
2. Write it to the cache.
3. The cache then synchronously writes it to the DB.

Overall:
• Slightly increases write latency
• Provides natural invalidation
• Removes race conditions on writes
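Sketched in the same style as the previous patterns (illustrative names, a dict standing in for the DB), write-through just means every put is synchronously propagated:

```python
# Write-through sketch: every put is synchronously propagated to the DB.
class WriteThroughCache:
    def __init__(self, db):
        self._data = {}
        self._db = db

    def put(self, key, value):
        self._data[key] = value   # update the cache...
        self._db[key] = value     # ...and, synchronously, the DB
                                  # (this DB round-trip is the added latency)

    def get(self, key):
        return self._data.get(key)

db = {}
cache = WriteThroughCache(db)
```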
Cache access patterns: Write Behind

For writing data:
1. Application writes some new data or updates existing data.
2. Write it to the cache.
3. The cache adds the write request to its internal queue.
4. Later, the cache asynchronously flushes the queue to the DB on a periodic basis and/or when the queue size reaches a certain limit.

Overall:
• Dramatically reduces write latency, at the price of an inconsistency window
• Provides write batching
• May provide deduplication of updates
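The queueing, batching and deduplication above can be sketched as follows. A toy sketch with made-up names: the queue is a dict, so repeated updates of one key collapse into the latest value, and the flush happens on a size trigger (a real cache also flushes on a timer):

```python
# Write-behind sketch: puts go to an internal queue; a flush pushes them to
# the DB in one batch, deduplicating repeated updates of the same key.
class WriteBehindCache:
    def __init__(self, db, max_queue=100):
        self._data, self._db = {}, db
        self._queue, self._max = {}, max_queue  # dict queue: dedup by key

    def put(self, key, value):
        self._data[key] = value
        self._queue[key] = value     # only the latest value survives
        if len(self._queue) >= self._max:
            self.flush()             # size-triggered flush

    def flush(self):
        self._db.update(self._queue)  # one batched write to the DB
        self._queue.clear()

db = {}
cache = WriteBehindCache(db)
```

Until `flush` runs, the cache and the DB disagree: that is the inconsistency window the slide describes.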
A variety of products on the market: Memcached, Hazelcast, Terracotta, EhCache, Oracle Coherence, Riak, Redis, MongoDB, Cassandra, GigaSpaces, Infinispan, …
Let's sort them out!

• KV caches: Memcached, Ehcache, …
• NoSQL: Redis, Cassandra, MongoDB, …
• Data Grids: Oracle Coherence, GemFire, GigaSpaces, GridGain, Hazelcast, Infinispan

Some products are really hard to sort, like Terracotta in both DSO and Express modes.
Why don't we have any distributed in-memory RDBMS?

Master – MultiSlaves configuration:
• Is, in fact, an example of replication
• Helps with read distribution, but does not help with writes
• Does not scale beyond a single master

Horizontal partitioning (sharding):
• Helps with reads and writes for datasets with good data affinity
• Does not work nicely with join semantics (i.e., there are no distributed joins)
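The sharding idea can be sketched as routing each row by an affinity key. A toy sketch with made-up names; `zlib.crc32` is used only because it is a stable stdlib hash:

```python
# Sharding sketch: route each row to a shard by hashing its affinity key.
# All reads and writes for one user land on the same shard; a query joining
# rows across users would have to touch several shards, which is the hard part.
import zlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # dicts stand in for databases

def shard_for(user_id):
    # stable hash so the key-to-shard mapping survives process restarts
    return zlib.crc32(user_id.encode()) % NUM_SHARDS

def put(user_id, row):
    shards[shard_for(user_id)][user_id] = row

def get(user_id):
    return shards[shard_for(user_id)].get(user_id)
```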
Key-Value caches
• Memcached and EHCache are good examples to look at
• Keys and values are arbitrary binary (serializable) entities
• Basic operations are put(K,V), get(K), replace(K,V), remove(K)
• May provide group operations like getAll(…) and putAll(…)
• Some operations provide atomicity guarantees (CAS, inc/dec)
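The basic interface and the CAS guarantee can be sketched in-process. A hedged sketch in the spirit of memcached's `gets`/`cas` pair (per-key version counters); the class and method names are illustrative, not any product's API:

```python
# KV-cache interface sketch with an atomic compare-and-swap (CAS).
import threading

class KVCache:
    def __init__(self):
        self._data, self._versions = {}, {}
        self._lock = threading.Lock()

    def put(self, k, v):
        with self._lock:
            self._data[k] = v
            self._versions[k] = self._versions.get(k, 0) + 1

    def gets(self, k):
        # returns the value together with its version token
        with self._lock:
            return self._data.get(k), self._versions.get(k, 0)

    def cas(self, k, v, version):
        # write succeeds only if nobody changed the key since we read it
        with self._lock:
            if self._versions.get(k, 0) != version:
                return False
            self._data[k] = v
            self._versions[k] = version + 1
            return True
```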
Memcached
• Developed for LiveJournal in 2003
• Has client libraries in PHP, Java, Ruby, Python and many others
• Nodes are independent and don't communicate with each other
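Because the nodes are independent, it is the client that decides which node owns a key, classically by hashing the key over the node list. A sketch of the naive modulo scheme (the node addresses are hypothetical):

```python
# Client-side key distribution: the client, not the server, picks the node.
import hashlib

nodes = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]  # hypothetical

def node_for(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    # naive modulo scheme; production clients prefer consistent hashing
    # (e.g. ketama) so that adding a node remaps far fewer keys
    return nodes[h % len(nodes)]
```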
EHCache
• Initially named "Easy Hibernate Cache"
• Java-centric, mature product with open-source and commercial editions
• The open-source version provides only replication capabilities; distributed caching requires a commercial license for both EHCache and Terracotta TSA
NoSQL Systems
A whole bunch of different products with both persistent and non-persistent storage options. Let's call them storages and caches, respectively.
• Built to provide good horizontal scalability
• Try to fill the feature gap between pure KV stores and full-blown RDBMS
Case study: Redis
• Written in C, supported by VMware
• Client libraries for C, C#, Java, Scala, PHP, Erlang, etc.
• Single-threaded async implementation
• Has configurable persistence
• Works with K-V pairs, where K is a string and V may be a number, a string, or an object (JSON)
• Provides 5 interfaces: strings, hashes, lists, sets, sorted sets
• Supports transactions

    hset users:goku powerlevel 9000
    hget users:goku powerlevel
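The hash semantics behind `hset`/`hget` can be mimicked in-process. A toy sketch of the command semantics only, using nested dicts, not the Redis protocol or any client library:

```python
# Toy sketch of Redis hash semantics (HSET / HGET) using nested dicts.
store = {}

def hset(key, field, value):
    created = field not in store.setdefault(key, {})
    store[key][field] = value
    return 1 if created else 0   # Redis reports how many new fields were set

def hget(key, field):
    return store.get(key, {}).get(field)
```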
Use cases: Redis
• Good for fixed lists, tagging, ratings, counters, analytics and queues (pub-sub messaging)
• Has Master – MultiSlave replication support. The master node is currently a SPOF.
• Distributed Redis was named "Redis Cluster" and is currently under development
Case study: Cassandra
• Written in Java, developed at Facebook.
• Inspired by Amazon Dynamo's replication mechanics, but uses a column-based data model.
• Good for logs processing, index storage, voting, jobs storage etc.
• Bad for transactional processing.
• Want to know more? Ask Alexey!
In-Memory Data Grids
A new generation of caching products that tries to combine the benefits of the replicated and distributed schemes.
IMDG: Evolution
Modern IMDGs combine two lineages:
• Data Grids: reliable storage and live data balancing among grid nodes
• Computational Grids: reliable job execution, scheduling and load balancing
IMDG: Caching concepts
• Implements a KV cache interface
• Provides indexed search by values
• Provides a reliable distributed locks interface
• Caching scheme (partitioned or distributed) may be specified per cache or per cache service
• Provides event subscriptions for entries (change notifications)
• Configurable fault tolerance for distributed schemes (HA)
• Equal distribution of data (and read/write load) among grid nodes
• Live data redistribution when nodes go up or down: no data loss, no client termination
• Supports RT, WT, WB caching patterns and hierarchical caches (near caching)
• Supports atomic computations on grid nodes
IMDG: Under the hood
• All data is split into a number of sections, called partitions.
• A partition, rather than an entry, is the atomic unit of data migration when the grid rebalances. The number of partitions is fixed for the cluster's lifetime.
• Indexes are distributed among grid nodes.
• Clients may or may not be part of the grid cluster.
IMDG under the hood: Request routing

For get() and put() requests:
1. The cluster member that makes a request calculates the key's hash code.
2. The partition number is calculated from this hash code.
3. The node is identified by the partition number.
4. The request is routed to the identified node and executed there, and the results are sent back to the client member that initiated the request.

For filter queries:
1. The cluster member initiating the request sends it to all storage-enabled nodes in the cluster.
2. The query is executed on every node using distributed indexes, and partial results are sent to the requesting member.
3. The requesting member merges the partial results locally.
4. The final result set is returned from the filter method.
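The get()/put() routing steps above can be sketched as a two-stage lookup: key hash to partition, partition to owning node. A hedged sketch; the partition count and node names are hypothetical, and a real grid recomputes the partition-to-node assignment on every rebalance:

```python
# Routing sketch for get()/put() in an IMDG: key hash -> partition -> node.
PARTITIONS = 271                         # hypothetical fixed partition count
nodes = ["node-a", "node-b", "node-c"]   # hypothetical grid members

# partition -> owning node; in a real grid this map changes on rebalance,
# while the key -> partition mapping below never does
assignment = {p: nodes[p % len(nodes)] for p in range(PARTITIONS)}

def partition_of(key):
    return hash(key) % PARTITIONS

def owner_of(key):
    return assignment[partition_of(key)]
```

Because migration moves whole partitions, only the `assignment` map has to change when a node joins or leaves; keys never change partition, which is what makes live rebalancing tractable.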
IMDG: Advanced use-cases
• Messaging
• Map-Reduce calculations
• Cluster-wide singleton
• And more…
GC tuning for large grid nodes
• The easy way: rolling restarts of storage-enabled cluster nodes. Cannot be used in every project, though.
• The complex way: fine-tune the CMS collector to ensure it always keeps up cleaning garbage concurrently under normal production workload.
• The expensive way: use the off-heap storages provided by some vendors (Oracle, Terracotta), which use the direct memory buffers available to the JVM.
IMDG: Market players
• Oracle Coherence: commercial, free for evaluation use
• GigaSpaces: commercial
• GridGain: commercial
• Hazelcast: open-source
• Infinispan: open-source
Terracotta
The company behind EHCache, Quartz and the Terracotta Server Array. Acquired by Software AG.
Terracotta Server Array
All data is split into a number of sections, called stripes. Stripes consist of 2 or more Terracotta nodes: one of them is the Active node, the others have Passive status. All data is distributed among stripes and replicated inside stripes.
Open-source limitation: only one stripe. Such a setup supports HA, but will not distribute cache data. I.e., it is not horizontally scalable.
QA Session
And thank you for coming!