JIT-Compiler-Assisted Distributed Java Virtual Machine Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau Department of Computer Science and

JIT-Compiler-Assisted Distributed Java Virtual Machine

Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau

Department of Computer Science and Information Systems

The University of Hong Kong

Presented by Cho-Li Wang

TCHPC 2004, Taiwan, Mar, 2004 2

Outline

Distributed Java Virtual MachineDesign TradeoffsRelated workJESSICA2 featuresExperimental resultsConclusion & future workA raytracing demo


Distributed Java Virtual Machine (DJVM)

A distributed Java Virtual Machine (DJVM) consists of a group of extended JVMs running on a distributed environment to support true parallel execution of a multithreaded Java application.

A DJVM provides all the JVM services, that are compliant with the Java language specification, as if running on a single machine – Single System Image (SSI).

Heap

Bytecode Execution Engine

ClassThread

DJVM

(Single System Image)

import java.util.*;class worker extends Thread{ private long n; public worker(long N){ n=N; } public void run(){ long sum=0; for(long i=0; i<n; i++) sum+=i; System.out.println(“N=“+n+” Sum="+sum);}}public class test { static final int N=100; public static void main(String args[]){ worker [] w= new worker[N]; Random r = new Random(); for (int i=0; i<N; i++) w[i] = new worker(r.nextLong()); for (int i=0; i<N; i++) w[i].start(); try{ for (int i=0; i<N; i++) w[i].join();} catch (Exception e){}}}

Java thread

JVM JVM JVM JVM


Design Tradeoffs of a DJVMHow to manage the threads? Distributed thread scheduling Initial placement vs thread migration

How to store the data ? Distributed heap (object store) Java memory model (memory consistency) Can an off-the-shelf DSM be used as the heap?

How to process the bytecode ? Execution Engine : Interpretation, Just-in-Time

(JIT) compilation, Static compilation

ThreadSched

ExecEngine Heap


Related work

cJVM (IBM Haifa Research) Interpreter mode execution built-in object caching

JAVA/DSM (Rice University) Interpreter mode execution Heap built on top of a page-based DSM

JESSICA(HKU) Thread migration Interpreter mode execution Heap built on top of a page-based DSM

Jackal, Hyperion Static compilation Link to object-based DSM

RemoteCreation

IntrEmbedded OO-based

DSM (Proxy)

ManualDistribution

Intr Page-basedDSM

TransparentMigration

Intr Page-basedDSM

RemoteCreation

Static compilation

OO-basedDSM


JESSICA2 (Java-Enabled Single-System-Image Computing Architecture)

Thread Migration

Global Object Space

JESSICA2JVM

A Multithreaded Java Program

JESSICA2JVM

JESSICA2JVM

JESSICA2JVM

JESSICA2JVM

JESSICA2JVM

Master Worker Worker Worker Worker Worker

JIT Compiler ModePortable Java Frame


JESSICA2 Main FeaturesTransparent Java thread migration Runtime capturing and restoring of thread execution context. No source code modification; no bytecode instrumentation

(preprocessing); no new API introduced Enable dynamic load balancing on clusters

JIT compiler-based execution engine (JITEE) Operated in Just-In-Time (JIT) compilation mode cluster-aware

Global Object Space A shared global heap spanning all cluster nodes Provide location-transparent object access Adaptive migrating home protocol for memory consistency, plus

various optimizing schemes. I/O redirection


JESSICA2 thread migration (In a JIT-enabled JVM)

Thread

Frame

(1) Alert

Frames

Method AreaJVM

Frame parsingRestore execution

Frame

Stack analysisStack capturing

Thread Scheduler

Source node

Destination node

Migration Manager

LoadMonitor

Method Area

RTC

RTC

FramesBTC

(2)

(3)

PC

PC

RTC: Raw Thread ContextBTC : Bytecode-oriented Thread Context (thread id, frames, class names, method signature, PC, Operand stack ptr, local vars …)

Transformation of the RTC into the BTC directly inside the JIT compiler


Thread Stack Transformation

Raw Thread Context (RTC)

%esp: 0x00000000%esp+4: 0x082ca809%esp+8: 0x08225400%esp+12: 0x08266bc0

%esp: 0x00000000%esp+4: 0x082ca809%esp+8: 0x08225400%esp+12: 0x08266bc0...%eax = 0x08623200%ebx = 0x08293100

Frames{method CPI::run()V@111local=13;stack=0;var:arg0:CPI, 33, 0x8225400local1: [D; 33, 0x8266bc0@2local2: int, 2;...

Bytecode-oriented Thread Context (BTC)

Raw Thread Context (RTC)

Stack CapturingStack Restoration


DetailsBytecode verifier

Bytecode translation

migration point selection :head of a basic block INVOKESTATIC, INVOKESPECIAL,INVOKEVIRTUAL and INVOKEINTERFACE

Construct control flow

graphinvoke

code generation

Native Code

Linking & Constant Resolution

Intermediate Code

Java frame

C frame

Register rebuild

Register allocation

(Restore)

mov var1->reg1mov var2->reg2...

reg var

Variables

Java frame detectionthread stack

raw stack

Global Object Space

1. Migration checking2. Non-destructive register spilling3. Object checking4. Type spilling for variable type

deducing


Example of native code instrumentation


Optimization on migration points – Pseudo-inlining

Purpose : eliminate the costs of unnecessary inserted migration points

General idea: delete M-points before a small method invocation


Dynamic Register Patching

Stack growth

%ebp

bootstrap frame

trampoline frame

Ret addr

frame 0

reg1 <- value1reg2 <- value2

jmp restore_point0

Ret addr

%ebp

%ebp

frame 1

reg1 <- value1jmp restore_point1

Compiled methods:

Method1(){...retore_point1:}

Method0(){...retore_point10:}

trampoline

bootstrap(){ trampoline();closing handler();}


Advantages of native code instrumentation

Lightweight Re-use JIT compiler internal data structures and control

flow analysis functions No need to include debugging information in Java class

files Instrumented native codes are more efficient than

instrumented bytecode.

Transparent No source code modification. No new API introduced. No preprocessing


Global Object Space (GOS)

Provide global heap abstraction for DJVMHome-based object coherence protocol, compliant with JVM Memory Model OO-based to reduce false sharing

Non-blocking communication Use threaded I/O interface inside JVM for

communication to hide the latency

Adaptive object home migration mechanism Take advantage of JVM runtime information for

optimization


GOS runtime data structureMaster object

object header

cache pointer

object data

Cache object

object header

cache pointer

cache data Master host id

master address

class

cache obj list

thread idstatuscache datanext

Cache data

thread idstatuscache datanext

cache data

Cache header


Experimental environment

HKU Gideon 300 Linux cluster : 300 P4 PCs (2GHz, 512 MB RAM, 40 GB disk)

Network: 312-port Foundry FastIron 1500 Non-blocking switch (100 Mbits/s)


Migration overhead during normal execution

(SPECJVM98 benchmark)

Benchmarks Time (seconds) Space (native code/bytecode)

No migration Migration No migration Migration

compress 11.31 11.39(+0.71%) 6.89 7.58(+10.01%)

jess 30.48 30.96(+1.57%) 6.82 8.34(+22.29%)

raytrace 24.47 24.68(+0.86%) 7.47 8.49(+13.65%)

db 35.49 36.69(+3.38%) 7.01 7.63(+8.84%)

javac 38.66 40.96(+5.95%) 6.74 8.72(+29.38%)

mpegaudio 28.07 29.28(+4.31%) 7.97 8.53(+7.03%)

mtrt 24.91 25.05(+0.56%) 7.47 8.49(+13.65%)

jack 37.78 37.90(+0.32%) 6.95 8.38(+20.58%)

Average (+2.21%) (+15.68%)


Migration overhead analysisProgram (frame #) LT(1) CPI(1) ASP(1) N-Body(8) SOR(2)

Latency (ms) 4.997 2.680 4.678 10.803 8.467

Frame # 1 2 4 6 8 10

Var # 4 15 37 59 81 103

Size (B) 201 417 849 1281 1713 2145

Capture (us) 202 266 410 495 605 730

Parse (us) 235 253 447 526 611 724

Create (us) 360 360 360 360 360 360

Compile (us) 478 575 847 1,169 1,451 1,720

Build (us) 7 11 14 16 21 28

Total (us) 1,282 1,465 2,078 2,566 3,048 3,562

Overall migration latency

Migration time breakdown (LT program)


GOS Optimizations (using 4 PCs)

0%

20%

40%

60%

80%

100%

NO H

HS

HS

P

NO H

HS

HS

P

NO H

HS

HS

P

NO H

HS

HS

P

ASP SOR Nbody TSP

Obj

Syn

Comp

NO = No optimizations HS = Home migration + Synchronized Method ShippingH = Home migration HSP = HS + Object pushing


JESSICA2 vs JESSICA (CPI)

CPI(50,000,000iterations)

050000

100000150000200000250000

2 4 8

Number of nodes

Tim

e(m

s) JESSICA

JESSICA2


Application benchmarkSpeedup

0

2

4

6

8

10

2 4 8

Node number

Spe

edup

Linear speedup

CPI

TSP

Raytracer

nBody


Parallel Ray Tracing (using 64 nodes of Gideon 300 cluster)

Linux 2.4.18-3 kernel (Redhat 7.3)

64 nodes: 108 seconds

1 node: 4402 seconds ( 1.2 hour)

Speedup = 4402/108=40.75


Conclusions

Transparent Java thread migration in JIT compiler enables the high-performance execution of multithreaded Java application on clusters

An embedded GOS layer can take advantage of the JVM runtime information to reduce communication overhead


Future work

Advanced thread migration mechanism without overhead during normal execution (finished)

Incremental Distributed GC

Enhanced Single I/O Space to benefit more real-life applications

Parallel I/O Support


Thanks

JESSICA2 Webpagehttp://www.csis.hku.hk/~clwang/

projects/JESSICA2.html

Documents

JIT-Compiler-Assisted Distributed Java Virtual Machine Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau Department of Computer Science and