35
Virtualized Memory Scale-Out for SAP HANA Dr. Benoit Hudzia SAP Research – Next Business and Technology Cloud and Virtualization Week , April 2013

Hana Memory Scale out using the hecatonchire Project

Embed Size (px)

DESCRIPTION

This session will present a memory scale-out solution that liberates SAP HANA, or similar memory demanding enterprise applications, from the classical limitation of underlying physical servers. The solution relies on key enabling technology developed within SAP. It allows applications or hypervisors to go beyond the boundaries of the underlying hardware, and effectively enables a fluid transformation from commodity sized physical nodes to very large virtual instances, in order to meet the rapidly growing demand of memory intensive applications.

Citation preview

Page 1: Hana Memory Scale out using the hecatonchire Project

Virtualized Memory Scale-Out for SAP HANADr. Benoit Hudzia SAP Research – Next Business and TechnologyCloud and Virtualization Week , April 2013

Page 2: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 2Public

Agenda

• Reducing Memory TCO via Scale Out• Memory As a service• Hana Memory Scale Out• Conclusion & Next Steps

Page 3: Hana Memory Scale out using the hecatonchire Project

Reducing Memory TCO via Scale OutHecatonchire Project – Memory Cloud

Page 4: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 4Public

Achieving the Right-sized Servers

• Eliminates unnecessary components• Increased power efficiency • Optimized performance by workload

Using high-volume components to build high-value systems

Page 5: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 5Public

How do we offer better Memory TCO

• Memory in the nodes of current clusters is Overscaled in order to fit the requirements of “any” application• It remains unused most of the time

• How can we unleash your memory-constrained application by using the memory in the rest of nodes

Memory that grows with your business, not before.

Page 6: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 6Public

Decouple memory from cores aggregation

Many shared-memory parallel applications do not scale beyond a few tens of cores...However, may benefit from large amounts of memory:

• In-memory databases• Datamining• VM• Scientific applications• etc

Eliminate Physical Limitation of Cloud / DC

Page 7: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 7Public

Scaling-Out with Fast Networking

Low latency memory transfers can help reduce the need for overprovisioning while also providing much lower latency than swapping data out to disk.

Author: Chaim Bendalac

Page 8: Hana Memory Scale out using the hecatonchire Project

Memory as a ServiceHecatonchire Project – Memory Cloud

Page 9: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 9Public

The Idea: Turning memory into a distributed memory service

Breaks memory from the bounds of the physical box

Transparent deployment with performance at scale and Reliability

Page 10: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 10Public

Network

High Level Principle

Memory Sponsor A

Memory Sponsor B

Memory Demander

Virtual Memory Address Space

Memory Demanding Process

Page 11: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 11Public

Hard Page Fault Resolution Performance

Time spend over the wire one way

Average (μs)

Resolution timeAverage (μs)- 4KB

Page

Page (4KB) Transfer/Sec

(with prefetch)SoftIwarp (10

GbE)150 + 330 50k/s

Iwarp (10GbE)

4-6 28 250k/s

Infiniband (40 Gbps)

2-4 16 650k/s

Page 12: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 12Public

Average Compounded Page Fault Resolution Time(With Prefetch)

1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads 7 Threads 8 Threads1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000IW 10GE Sequential IB 40 Gbps SequentialIW 10GE- Binary splitIB 40Gbps- Binary splitIW 10GE- Random WalkIB- Random WalkM

icro-seconds

Avg IW

Avg IB

Page 13: Hana Memory Scale out using the hecatonchire Project

Benchmark of Scaled Out HANAHecatonchire Project – Memory Cloud

Page 14: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 14Public

Use-Case: Hana Memory Scale Out - Memory Aggregation

• Aggregate the RAM of a cluster into 1 HUGE HANA DB

Overview:• Given n nodes, each with X GB of RAM:

• 1 node will run a VM with a HANA DB • High Core count / Memory Ratio

• n-1 nodes will serve as memory containers • Low Core Count / Memory Ratio

• Deploy a VM with n*X GB using Hecatonchire• Instead of using swap for freeing up the RAM, push pages to remote hosts• If a page is needed again, request it back

Up to 30% in HW cost Reduction compare to a single node instance

Page 15: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 15Public

Memory Scale out for SAP HANA – Benchmark

Hardware:•Server with Intel Xeon West Mere

• 4 socket • 10Core • 512 GB RAM

•Fabric: • Infiniband QDR 40Gbps Switch + Mellanox ConnectX2 • 10 GbE Ethernet Switch + Chelsio T422 NIC

Virtual Machine:• Small Size: 64 GB Ram - 32 vCPU• Medium Size: 128 GB RAM – 40 vCPU• Hypervisor: KVM

• Application : SAP HANA ( In memory Database)

• Workload : OLAP ( TPC-H Variant)• Data size ~600 GB uncompressed• 18 differents Queries• 15 iteration of each query set

Page 16: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 16Public

1 20 40 600

50

100

150

200

250

300

350

400

450

500

Baseline

Half-Half

Scaling Out Hana(Small Size)

Nb Users HECA Overhead per query set

1 ~3%

20 ~3%

40 ~3%

60 ~3%

Virtual Machine:• 64 GB Ram , 32 vCPU (Small)• Application : HANA ( In memory Database)• Workload : OLAP ( TPC-H Variant)• ~600GB data set uncompressed

Hardware:•Intel Xeon West Mere – 4 socket - 10Core – 512 GB RAM•Fabric: Infiniband QDR 40Gbps Switch + Mellanox ConnectX2

Query Set Completion Time (s)

Users

Page 17: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 17Public

Per Query Overhead

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

-50%

0%

50%

100%

150%

200%

1 User20 Users40 Users

Query

Page 18: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 18Public

Scaling out HANA(Medium Size )

Virtual Machine:• 128 GB Ram , 40 vCPU (Medium)• Application : HANA ( In memory Database )• Workload : OLAP ( TPC-H Variant)- 80 Users• ~600GB data set uncompressed

Memory Ratio

HECA Overhead per query set – 80 users

1:2 1%

1:3 1.6%

2:1:1 0.1%

1:1:1 1.5%

Hardware:•Intel Xeon West Mere – 4 socket - 10Core – 512 GB RAM•Fabric: Infiniband QDR 40Gbps Switch + Mellanox ConnectX2

1:02 1:03 2:01 1:010

0.0020.0040.0060.0080.01

0.0120.0140.0160.018

Memory ratio

Page 19: Hana Memory Scale out using the hecatonchire Project

Production and Cloud readyHecatonchire Project – Memory Cloud

Page 20: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 20Public

High Availability of Memory Node ( RRAIM ) running HANA

Memory Ratio (Medium 128GB - SAP-H Benchmark – 80 users)

RRAIM Overhead vs Heca

1:2 -0.2%

1:3 -0.1%

RAMRAM

RAMRAM

RAMRAM

RAMRAM

RAMRAM

RAMRAM

HA

Mirroring

• Memory region backed by two remote nodes.

• Remote page faults and swap outs initiated simultaneously to all relevant nodes.

• No immediate effect on computation node upon failure of node.

• When we a new remote enters the cluster, it synchronizes with computation node and mirror node.

Page 21: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 21Public

Enterprise Class Feature

Before DedupAfterDedup

Online RAM Deduplication (via KSM) Automatic Memory Tiering

Local Memory

Remote Memory

Compressed Memory

SSD

HDD

Local Node Remote Node

Page 22: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 22Public

Enabling Live Migration of HANA DB (Small Instance)

Baseline Pre-Copy(Standard)

Post-Copy(Heca)

Downtime N/A 7.47 s 675 ms

Performance Degradation

(80 users)

0% Benchmark Failed

(HANA crash- Vm unresponsive)

5%

Page 23: Hana Memory Scale out using the hecatonchire Project

NextHecatonchire Project – Memory Cloud

Page 24: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 24Public

Transitioning to a Memory Cloud Transparent Cloud Integration

Compute VMMemory Demander

Memory Cloud Management Services (OpenStack)

App App

memoryMemoryCloud

Heca-NOVA

VM

RAM

VM VM

RAM

Many Physical NodesHosting a variety of VMs

Combination VMMemory Sponsor & Demander

Memory VMMemory Sponsor

PoC Q3 2013

• Automatically reclaim underutilized or abandoned memory resources

• Automatically and intelligently redeploys memory workloads across the infrastructure

Page 25: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 25Public

BareBone Memory Scale Out and Fine grained Memory Sharing

BareBone Memory Scale Out

• No need for Virtualization and/or Hypervisor

• Similar to Linux Control Group

• Allow barebones memory scale out for HANA or any other applications

Fine grained Distributed Shared Memory :

• Memory coherence Model • Shared Nothing,• Write Invalidate• Read-Write protection

PoC Q2 2013

PoC Q3 2013

Page 26: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 26Public

Virtual Distributed Shared Memory System(Compute Cloud)

Compute aggregation Idea : Virtual Machine compute and memory span

Multiple physical Nodes

Challenges Coherency Protocol Granularity ( False sharing )

Hecatonchire Value Proposition Optimal price / performance by using commodity

hardware Operational flexibility: node downtime without downing

the cluster Seamless deployment within existing cloud

CPUs

Memory

I/O

CPUs

Memory

I/O

CPUs

Memory

I/O

H/W

OS

App

VM

H/W

OS

App

VM

H/W

OS

App

VM

H/W

OS

App

VM

Server #1 Server #2 Server #n

Guests

Fast RDMA Communication

Future Works

Page 27: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 27Public

Hecatonchire: Open-Source Project

Http://www.hecatonchire.com

https://github.com/hecatonchire/

Page 28: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 28Public

Open Source Roadmap

Standardization : • Linux Kernel Upstream• KVM/QEMU Upstream• OpenStack Module (Nova-Heca)

Collaboration, collaboration, collaboration!• Academic and Industrial

Page 29: Hana Memory Scale out using the hecatonchire Project

Thank You!Contact information:

Dr. Benoit Hudzia [email protected]

Hecatonchire Project: WWW: http://www.hecatonchire.comGithub: https://github.com/hecatonchire/

Page 30: Hana Memory Scale out using the hecatonchire Project

Backup Slides

Page 31: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 31Public

How does it work (Simplified Version)

Virtual Address

MMU(+ TLB)

Physical Address

Page Table Entry

Coherency Engine

RDMA Engine RDMA Engine

MMU(+ TLB)

Physical Address

Page Table Entry

Coherency Engine

Miss

Remote PTE

(Custom Swap Entry)

Page request

Page Response

PTE write

Update MMU

Invalidate PTE

Invalidate MMU

Extract Page

Extract Page

Prepare Page for RDMA transfer

Physical Node BPhysical Node A

Network

Fabric

Page 32: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 32Public

Raw Bandwidth usageHW: 4 core i5-2500 CPU @ 3.30GHz- SoftIwarp 10GbE – Iwarp Chelsio T422 10GbE - IB ConnectX2 QDR 40 Gbps

Total Gbit/sec (SIW - Seq)

Total Gbit/sec (IW-Seq)

Total Gbit/sec (IB-Seq)

Total Gbit/sec (SIW- Bin split)

Total Gbit/sec (IW- Bin split)

Total Gbit/sec (IB- Bin split)

Total Gbit/sec (SIW- Random)

Total Gbit/sec (IW- Random)

Total Gbit/sec (IB- Random)

0

5

10

15

20

251 Thread 2 Threads3 Threads4 Threads5 Threads6 Threads7 Threads

Gb/s Sequential Walk over 1GB of shared RAM Bin split Walk over 1GB of shared RAM Random Walk over 1GB of shared RAM

Maxing out Bandwidth

Not enough core to saturate (?)

No degradation under high load

Software RDMA has significant

overhead

Page 33: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 33Public

Redundant Array of Inexpensive RAM: RRAIM

1. Memory region backed by two remote nodes. Remote page faults and swap outs initiated simultaneously to all relevant nodes.

2. No immediate effect on computation node upon failure of node.

3. When we a new remote enters the cluster, it synchronizes with computation node and mirror node.

Page 34: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 34Public

Quicksort Benchmark with Memory Constraint

Memory Ratio (constraint using cgroup)

HECA Overhead RRAIM Overhead

3:4 2.08% 5.21%1:2 2.62% 6.15%1:3 3.35% 9.21%1:4 4.15% 8.68%1:5 4.71% 9.28%

Quicksort Benchmark 512 MB Dataset Quicksort Benchmark 1GB Dataset Quicksort Benchmark 2GB Dataset

3:041:021:031:041:050.00%

2.00%

4.00%

6.00%

8.00%

10.00%

DSM OverheadRRAIM Overhead

Page 35: Hana Memory Scale out using the hecatonchire Project

© 2013 SAP AG. All rights reserved. 37Public

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

-50%

0%

50%

100%

150%

200%

1 User

20 Users

40 Users