The Limits of Java Performance:
Breaking through the Scalability Barriers imposed by the Java Platform.
Ron Kleinman, Lead Product Technologist
2 ©2008 Azul Systems, Inc.
Agenda
• Java Platform Scalability Barriers? What Scalability Barriers?
─ The language is the platform. How big a platform?
• Avoiding the Barriers: What do people do today?
• When the Barriers can't be avoided: Scalability Design Patterns
─ Scaling Out: When performance fades, add some blades
─ Scaling Up: Performance Gains through Virtual Domains
─ Scaling Middleware: What seems local is remote
─ Scaling External: Moving to a bigger house
• Focused Solution: Java Compute Appliance
• Building it out: Leveraging an integrated Appliance Architecture
Scalability Barriers: What Scalability Barriers?
There just seems to be something about Managed Runtime Environments
“Perhaps the most commonly asked questions regarding memory management in .NET are: "How long does a garbage collection take?" and "How can I control when the garbage collector runs?" Apprehensive that "pauses" caused by garbage collections will be perceived by users, application developers often search for ways to control when garbage collections occur.”
- Steven Pratschner, Microsoft Program Manager for the .NET Common Language Runtime
“Ruby’s garbage collector (GC) has become a problem for the Luz user experience. The GC process can cause the entire application to pause for upwards of 200 ms at a time (on a 1.2 GHz P3), which is simply unacceptable for an application doing real-time animation where, to achieve even 24 fps, a new frame must be generated every 42 ms. As a result, we see ‘hiccups’ in the animation.”
- Gnome Coder
What Java Platform Scalability Barriers?
• 1. Resource Limit on the Maximum # of usable GB of Memory
─ Unused memory must be freed and defragmented
─ All "in use" references must be found, flagged and changed
─ GC pauses scale linearly with memory size (~1 GB max)
• 2. Resource Limit on the Maximum # of usable CPUs
─ Synchronized methods are "large grained"
─ The lock suspends all but one thread
─ Lock contention, not data contention
─ 4 CPUs vs. 400
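The second barrier above can be made concrete in a few lines of Java. This is a hypothetical sketch (the class and method names are ours, not from the deck): one coarse-grained synchronized method serializes every caller, even when the callers touch unrelated data.

```java
import java.util.HashMap;
import java.util.Map;

// A single coarse-grained synchronized method takes the object's one
// monitor, so ALL callers queue behind it -- even 400 threads reading
// 400 different symbols. That is lock contention, not data contention.
class CoarseGrainedCache {
    private final Map<String, Double> prices = new HashMap<>();

    // With 400 CPUs, up to 399 threads can be suspended here at once.
    public synchronized Double get(String symbol) {
        return prices.get(symbol);
    }

    public synchronized void put(String symbol, double price) {
        prices.put(symbol, price);
    }
}
```

The method is thread-safe, but its throughput is bounded by the single monitor regardless of how many CPUs are available.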
“Avoiding the Barriers” (Living within 4 CPU / 4 GB constraints)
• Force garbage collection to occur at non-peak times
• Write components in C or C++
─ Use native components rather than Java components (via JNI)
• Limit Java cache sizes
─ Reuse your own memory (keep your own pool)
• Hand-crafted fine tuning: 20+ GC algorithm settings
─ GC algorithm dependency (a la VMS Fortran "File Open")
• Throw more hardware at the problem (may not work)
• Recode the Application
─ Increase CPU concurrency with finer-grained locking (read/write)
─ Attack GC pauses with Real Time Java extensions
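The "finer-grained locking (read/write)" recoding option can be sketched as follows. This is an illustrative refactor (names are ours): replacing a coarse synchronized method with a `java.util.concurrent.locks.ReentrantReadWriteLock` so concurrent readers no longer queue behind one monitor.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Finer-grained locking for a read-mostly cache: many readers may hold
// the read lock simultaneously; only writers are exclusive.
class ReadMostlyCache {
    private final Map<String, Double> prices = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public Double get(String symbol) {
        lock.readLock().lock();   // any number of readers at once
        try {
            return prices.get(symbol);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void put(String symbol, double price) {
        lock.writeLock().lock();  // writers remain exclusive
        try {
            prices.put(symbol, price);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Note this is exactly the manual recoding effort the slides argue against having to do: it helps CPU scalability but does nothing about GC pauses.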
“Confronting the Barriers”
“If something can’t go on forever, it won’t.”
-Herb Stein, Former Chair of the Council of Economic Advisers
(pre-2005)
Dealing with Increasing Peak Loading: The Trade Exchange Program

[Diagram: the Application and its StockDB Cache inside one JVM, running on an Operating System with local Memory and CPUs]
Maintaining Service Level Agreements in the face of massively increasing demand
• # Stock Feeds up
─ More sources of data to correlate
• # Trades up
─ Greater volume of transactions to handle
• # Metrics up
─ More things to monitor for each trade
• Processing / Metric up
─ "Secret Sauce" trading algorithms more complex
• Required maximum response times way down
─ 1-2 msec and lower
─ Significant swings in latency jitter intolerable
─ A GC pause can cost $$$
Java Application Scalability Design Patterns: Adding Computing Capacity

• Multiple Real Application Instances
─ 1. Horizontal (Scale Out - with Commodity Servers)
─ 2. Vertical (Scale Up - with Hypervisor Domains on Enterprise Servers)
• Single Virtual Application Instance
─ 3. Middleware (Scale Virtually - with customized software modules)
• Single Real Application Instance
─ 4. External (Scale Specialized - with Java Compute Appliances)
1. Horizontal Scale Out to host multiple instances: Add more commodity servers to the Data Center

[Diagram: two commodity servers, each with its own Operating System, Memory and CPUs, running its own Application and JVM with a StockDB Cache shard: [A-L] on one server, [M-Z] on the other]
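The [A-L] / [M-Z] split above implies a routing layer in front of the two servers. A minimal sketch, assuming symbols are sharded by first letter (the class name and scheme are illustrative, not from the slides):

```java
// Route each stock symbol to the server holding its shard:
// shard 0 serves [A-L], shard 1 serves [M-Z].
class ShardRouter {
    static final int SHARD_A_TO_L = 0;
    static final int SHARD_M_TO_Z = 1;

    static int shardFor(String symbol) {
        char first = Character.toUpperCase(symbol.charAt(0));
        return (first <= 'L') ? SHARD_A_TO_L : SHARD_M_TO_Z;
    }
}
```

The issues row in the comparison below ("Refactor Data (Shards)") is exactly this: the data model must be partitionable, and cross-shard queries need extra plumbing.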
2. Vertical Scale Up to host multiple instances: Create more virtual servers on a Hypervisor

[Diagram: one physical server whose Memory and CPUs are partitioned by a Hypervisor into two virtual servers, each with its own Operating System, JVM, Application and StockDB Cache shard ([A-L] and [M-Z])]
Breaking through the Java Platform Scalability Barrier: Hardware Servers vs. Virtual Servers

Scale: Pure Horizontal
Strategy: Separate Application Instances on Hardware Servers
Advantages:
• Easy expansion via addition of homogeneous commodity servers
• "Cloud-izable" / Hadoop-ish
Issues:
• Refactor Data (Shards)
• Recode Application
• Peak load swings can exceed resource limits (partial crashes, load management)
• Over-provisioning (server sprawl)

Scale: Pure Vertical
Strategy: Separate Application Instances on Virtual Servers
Advantages:
• Hypervisor provides better resource utilization
• Reduces server sprawl; easier to manage
Issues:
• Same Java Platform limitations within each instance: refactor data, recode application
• Peak load swings can still exceed JVM memory capacity / result in huge pauses
• Cloud "orthogonal"
3. Memory Scale Out to multiple systems: Use Middleware to simulate one huge memory heap

[Diagram: multiple commodity servers, each running an Operating System, JVM and Application with instrumented byte codes, connected through a Virtual Memory Hub to a Federated Global Memory Cache (StockDB [A-Z]) backed by physical memory on a central system]
Breaking through the Java Platform Scalability Barrier: Multiple Local Memory Heaps vs. Single Global Memory Heap

Scale: Pure Horizontal
Strategy: Separate Application Instances on Commodity Servers with separate local memory
Advantages:
• Easy expansion via addition of homogeneous commodity servers
• "Cloud-izable" / Hadoop-ish
Issues:
• Refactor Data
• Recode Application
• Peak load swings can exceed resource limits (partial crashes, load management)
• Over-provisioning (server sprawl)

Scale: + Shared Global Memory
Strategy: Shared global memory supported by Java byte code instrumentation (get/put element)
Advantages:
• Selected object elements shared, dynamically cached, transparently updated from a central source
• Effective JVM memory limits transparently bypassed
Issues:
• Not all data elements can be shared (e.g. hash keys)
• Cache misses can cause widely varying response latencies
• Performance depends on data usage (reads >> writes is good)
• Multiple points of partial failure
• Central hub limits Cloud Computing
• Global thread locks tough to scale
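The "get/put element" instrumentation row above suggests a middleware contract roughly like the following. This is an assumed shape for illustration only, not the API of any actual product: a real implementation would fetch misses from the central hub and propagate writes to peer nodes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Assumed contract: byte code instrumentation rewrites object element
// reads and writes into get/put calls against a federated cache.
interface GlobalMemoryCache {
    Object getElement(String key);             // may incur a remote cache miss
    void putElement(String key, Object value); // transparently replicated
}

// In-process stand-in so the contract can be exercised locally.
class LocalStubCache implements GlobalMemoryCache {
    private final Map<String, Object> local = new ConcurrentHashMap<>();

    public Object getElement(String key) {
        return local.get(key);
    }

    public void putElement(String key, Object value) {
        local.put(key, value);
    }
}
```

The latency issue in the table follows directly from the interface: `getElement` costs a local map lookup on a hit but a network round trip to the hub on a miss.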
4. Externally Scale on a Specialized Java Appliance: Add physical memory and CPUs as needed

[Diagram: the Application and its StockDB Cache run in a JVM on the Java Compute Appliance (kernel, Memory, CPUs), while a JVM Proxy remains on the original deployed host]
Scalability with the Appliance Design Pattern

[Diagram: applications transparently utilize remote hardware (Memory, CPUs, Network, Storage) through an Appliance that externalizes the resource so it can be physically isolated, shared, centrally managed, expanded in capacity, and extended in functionality]
Example #1: Router (Share, Manage, Scale Up, Extend)

Resource externalized: Network

[Diagram: applications on a server (Operating System, Memory, CPUs) reach the network through a Router Appliance that adds guaranteed message delivery, auto-encryption, a protocol gateway, and high-bandwidth WAN connections]
Example #2: Storage Area Network (Share, Manage, Scale Up, Extend)

Resource externalized: Storage

[Diagram: applications on a server (Operating System, Memory, CPUs) store data on a Storage Area Network (SAN) that adds flash as storage, disk mirroring, and need-based allocation]
Example #3: Java Compute Appliance (JCA) (Share, Manage, Scale Up, Extend)
Transparently bring the Application to the Resources

[Diagram: a Proxy JVM on the original deployed system (Operating System, Memory, CPUs) hands the Java Application off to the JCA, which runs it on an Appliance JVM over an optimized kernel with hundreds of GBs of memory and hundreds of CPUs]
Example #3: Java Compute Appliance (JCA): Complete Java Application / Deployed Platform separation

• JVM: Decouples a Java Application from the OS
─ Decoupled from local hardware (& any Hypervisor)
─ Decoupled from connected appliances
─ Decoupled from Middleware
─ Last remaining resource connections are Memory and CPUs
• Move the Java Application to its Computing Resources
─ Decouple from the original deployment platform entirely
─ Transparently redeploy on a Java Compute Appliance (JCA)
─ Use Appliance Memory and CPUs
─ Same appliance advantages apply: Share, Centrally Manage, Expand, Optimize / Extend
─ And some other ones as well (Stability)
An integrated Java Compute Appliance
[Diagram: the Vega 3 appliance (up to 864 CPU cores, 768 GB of memory, on-chip hardware extensions, the Azul Thread Execution Kernel (AzTEK)), hosting multiple Mission-Critical Java Applications, each on its own Azul VM]
#1. The GC Pause Scalability Barrier: Make the problem part of the solution

• Problem
• Solution: Maximum usable Memory Limit removed
─ Scale from 1 to 100 GB heap
─ Constant response latencies of 1-3 msec
─ No change to existing Java code
Impact of Garbage Collection (Actual Financial Service Trading system under load)

[Chart: "Java Pause Time Comparison, Derivatives Trading Application" - pause time in seconds (0 to 30) per GC iteration; the native pause times spike into the tens of seconds, while the Azul pause times trace is a flat line near zero]

Performance Impact / Complexity Impact

Native Configuration:
-Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:TargetSurvivorRatio=80 -XX:CMSInitiatingOccupancyFraction=85
-XX:SurvivorRatio=8 -XX:MaxNewSize=320m -XX:NewSize=320m
-XX:MaxTenuringThreshold=10

Azul Configuration:
-Xms3g -Xmx3g
#2. The Large-Grained Lock Scalability Barrier: A two-tier integrated approach

• Serialized portions of a program severely limit scalability
• Amdahl's Law: Efficiency = 1 / (N * ((1 - P) + P/N))
(N = # of concurrent threads, P = run-time fraction of parallelizable code)
─ At 4 threads: 5% serialized code = 87%+ efficiency
─ At 400 threads: 5% serialized code = <5% efficiency!
• Solution
─ Automated: Optimistic Thread Concurrency (OTC)
─ Manual: Real Time Performance Monitoring (RTPM)
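The two efficiency figures above follow directly from the formula; a quick check in Java:

```java
// Amdahl's Law, efficiency form: Efficiency = 1 / (N * ((1 - P) + P/N)),
// with P the parallelizable fraction and N the number of threads.
class Amdahl {
    static double efficiency(double p, int n) {
        return 1.0 / (n * ((1.0 - p) + p / n));
    }

    public static void main(String[] args) {
        // P = 0.95, i.e. 5% serialized code:
        System.out.printf("N = 4:   %.1f%%%n", 100 * efficiency(0.95, 4));   // ~87%
        System.out.printf("N = 400: %.1f%%%n", 100 * efficiency(0.95, 400)); // ~4.8%
    }
}
```

So on a 400-way machine, 95% of each core's capacity is wasted waiting on the 5% of serialized code, which is what the next two slides attack.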
Optimistic Thread Concurrency (OTC)
• Strategy: Assume no data contention
• How it Works
─ Java synchronized block: similar to a DB transaction
─ The block is transactional around synchronized {…}
─ Transparent roll-back if an object element is impacted
─ JVM dynamic lock levels (Speculative, Thick)
─ Runtime-profile based
• Where it Works
─ Thread instances access different variables
─ Thread instances access the same variables for read
─ Hash table for a product database: 100 readers for every writer
• When it Works
─ Parallel execution of all threads in the same synchronized method
─ Competition for actual data elements, not the lock
─ Amdahl's law: Efficiency reflects actual data (not lock) contention times
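The "where it works" case can be made concrete. The sketch below (names are illustrative) is the classic read-mostly synchronized hash table: under OTC the JVM speculatively executes these critical sections in parallel and rolls back only when a writer actually conflicts, with no source changes required.

```java
import java.util.HashMap;
import java.util.Map;

// Read-mostly product table, ~100 readers per writer. The code keeps its
// ordinary synchronized blocks; OTC treats each block as a transaction.
class ProductTable {
    private final Map<String, Double> products = new HashMap<>();

    public Double lookup(String id) {
        synchronized (products) {    // readers dominate, so speculative
            return products.get(id); // execution almost always commits
        }
    }

    public void update(String id, double price) {
        synchronized (products) {    // the rare writer forces roll-back and
            products.put(id, price); // retry of concurrently speculating readers
        }
    }
}
```

This is the key contrast with the manual ReadWriteLock refactor shown earlier: the source stays coarse-grained, and the runtime supplies the fine-grained behavior.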
+ Real Time Performance Monitoring (RTPM)
• JVM-assisted deep visibility into Application Performance
─ Threads (list / states, trace, lock contention details, CPU usage)
─ GC (cycle phase results, min/max pauses, memory used / freed)
─ Memory (detailed live-objects breakdown, updated every GC cycle)
─ Socket IO (open connections, quantity of data, associated latency)
• Performance Bottleneck and Problem Detection
─ Multi-core processing & concurrency
─ Memory demands and memory leaks
─ Multithreading race conditions (**)
• Zero Overhead
─ Monitoring won't impact the application being monitored
─ No disturbance to the production environment
• Real Time / Always On
─ Allows identification of performance problems as they happen
─ No application restarts
A Java Compute Appliance gives Java Applications Room to Scale
TRADITIONAL → JAVA COMPUTE APPLIANCE
• Garbage collection pauses → Pauseless GC
• 2 GB heaps → Up to 670 GB heaps
• Instabilities due to resource limitations → No resource-related restarts
• Over-provisioning and server sprawl → Server consolidation
• Lock contention → OTC / RTPM
• 2-4 CPUs → 100s of CPUs
JCA Product Proof Point: Winner of the largest single-instance JVM benchmark

[Chart: SPECjbb2005 single-instance JVM scores (0 to 1,600,000) comparing the 7380, E25K, Itanium RX6600, T5220, PowerEdge 2950 and P570, with price annotations of $4.5M vs. $0.75M]
Breaking through the Java Platform Scalability Barrier: Distributed Global Memory Heap vs. Appliance-enhanced JVM

Scale: Shared Global Memory
Strategy: Separate Application Instances on Commodity Servers provided with a local dynamically updated cache of shared global memory
Advantages:
• Usable JVM memory limits transparently extended via a local cache of a larger global memory
• Easy CPU scalability via addition of commodity servers
Issues:
• Not all data elements can be shared (e.g. hash keys)
• Cache misses can cause widely varying response latencies
• Performance depends on data usage (reads >> writes is good)
• Multiple points of partial failure
• Global data locks tough to scale

Scale: External Java Appliance
Strategy: Single Application Instance on an Appliance provided with massive amounts of usable CPUs and memory
Advantages:
• Transparent JVM scalability to full utilization of all memory and CPU resources on the JCA
• No "resource limit" crashes
• Predictably low response latencies
Issues:
• Not applicable to all Java apps: heavy JNI use, "chatty" DB applications, single-threaded
• IT objections to a new hardware configuration
• SaaS, not Cloud
Summary
• Java barriers to scalability are becoming more painful:
─ Memory utilization limited by GC pauses
─ CPU utilization limited by coarse-grained thread locks ("synchronized")
• Obvious workarounds take you only so far
• Standard "scale out" & "scale up" strategies have drawbacks
─ Server sprawl, code modifications, partial failures, ...
• Additional (and transparent) scalability solutions are possible for "Managed Environments"
─ Shared Global Memory
─ External Java Compute Appliance
• No one answer is right in all cases
References
• Azul Engineer to Engineer Technical Site
─ http://www.azulsystems.com/e2e/
• VMS Fortran File Open Options
─ http://www.astro.virginia.edu/class/oconnell/astr511/idl_5.1_html/idl130.htm
• My email address
─ [email protected]
Breaking through the Java Platform Scalability Barrier: Transparent App Redeployment on a Java Compute Appliance

Scale: Horizontal
Strategy: Separate Application Instances on Commodity Servers
Strengths:
• Easy expansion via addition of homogeneous commodity servers
• "Cloud-izable"
Issues:
• Refactor Data
• Recode Application
• Peak load swings can exceed resource limits (partial crashes, load management)
• Over-provisioning (server sprawl)

Scale: Vertical
Strategy: Separate Application Instances on Virtual Servers
Strengths:
• Hypervisor provides better resource utilization
• Reduces server sprawl; easier to administer
Issues:
• Same Java Platform limitations within each instance: refactor data, recode application
• Peak load swings can still exceed JVM memory capacity
• Cloud orthogonal

Scale: External Appliance
Strategy: Redeploy a single instance of the Application on a specialized JCA
Strengths:
• Enough usable memory & CPUs to operate as before
• No "resource limit" crashes
• Scalability is transparent
Issues:
• Not applicable to all Java Applications: heavy JNI use, "chatty" DB applications, single-threaded
• IT objection to a new hardware configuration
• SaaS, not Cloud
Other Advantage of an Integrated Appliance: Low-level hooks into Kernel / Hardware

• Compute Pool Manager
─ Central view of Appliance resources
─ Policy-based management
─ Establish resource guarantees
─ Set application resources (min, max, redundancy, etc.)
• Real Time Performance Monitor
─ Zero-cost Java application probes
─ Extensive memory & thread usage info
─ Isolate problems even in Production
Java Compute Appliance Summary Value Proposition: Share, Manage, Scale Up, Extend

• Large (100s of GB) heap support / no user-visible pauses
─ Reduce maximum response latency / jitter
─ Reduce total application instance count (fewer / larger instances)
─ End crashes due to hitting the memory limit under peak loading
─ Enable new design alternatives (e.g. cache the entire database in memory)
• Hardware-assisted Optimistic Thread Concurrency
─ LHF critical bottlenecks minimized
• JVM-assisted Real Time Performance Monitor
─ Critical bottlenecks discovered
• No new APIs required (ex: Real Time Java)
─ Nor any code changes for tuning / performance
• No changes to application deployment procedures