Upload
lillian-richardson
View
218
Download
1
Embed Size (px)
Citation preview
JIT-Compiler-Assisted Distributed Java Virtual Machine
Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau
Department of Computer Science and Information Systems
The University of Hong Kong
Presented by Cho-Li Wang
TCHPC 2004, Taiwan, Mar, 2004 2
Outline
Distributed Java Virtual MachineDesign TradeoffsRelated workJESSICA2 featuresExperimental resultsConclusion & future workA raytracing demo
TCHPC 2004, Taiwan, Mar, 2004 3
Distributed Java Virtual Machine (DJVM)
A distributed Java Virtual Machine (DJVM) consists of a group of extended JVMs running on a distributed environment to support true parallel execution of a multithreaded Java application.
A DJVM provides all the JVM services, that are compliant with the Java language specification, as if running on a single machine – Single System Image (SSI).
Heap
Bytecode Execution Engine
ClassThread
DJVM
(Single System Image)
import java.util.*;class worker extends Thread{ private long n; public worker(long N){ n=N; } public void run(){ long sum=0; for(long i=0; i<n; i++) sum+=i; System.out.println(“N=“+n+” Sum="+sum);}}public class test { static final int N=100; public static void main(String args[]){ worker [] w= new worker[N]; Random r = new Random(); for (int i=0; i<N; i++) w[i] = new worker(r.nextLong()); for (int i=0; i<N; i++) w[i].start(); try{ for (int i=0; i<N; i++) w[i].join();} catch (Exception e){}}}
Java thread
JVM JVM JVM JVM
TCHPC 2004, Taiwan, Mar, 2004 4
Design Tradeoffs of a DJVMHow to manage the threads? Distributed thread scheduling Initial placement vs thread migration
How to store the data ? Distributed heap (object store) Java memory model (memory consistency) Can an off-the-shelf DSM be used as the heap?
How to process the bytecode ? Execution Engine : Interpretation, Just-in-Time
(JIT) compilation, Static compilation
ThreadSched
ExecEngine Heap
TCHPC 2004, Taiwan, Mar, 2004 5
Related work
cJVM (IBM Haifa Research) Interpreter mode execution built-in object caching
JAVA/DSM (Rice University) Interpreter mode execution Heap built on top of a page-based DSM
JESSICA(HKU) Thread migration Interpreter mode execution Heap built on top of a page-based DSM
Jackal, Hyperion Static compilation Link to object-based DSM
RemoteCreation
IntrEmbedded OO-based
DSM (Proxy)
ManualDistribution
Intr Page-basedDSM
TransparentMigration
Intr Page-basedDSM
RemoteCreation
Static compilation
OO-basedDSM
TCHPC 2004, Taiwan, Mar, 2004 6
JESSICA2 (Java-Enabled Single-System-Image Computing Architecture)
Thread Migration
Global Object Space
JESSICA2JVM
A Multithreaded Java Program
JESSICA2JVM
JESSICA2JVM
JESSICA2JVM
JESSICA2JVM
JESSICA2JVM
Master Worker Worker Worker Worker Worker
JIT Compiler ModePortable Java Frame
TCHPC 2004, Taiwan, Mar, 2004 7
JESSICA2 Main FeaturesTransparent Java thread migration Runtime capturing and restoring of thread execution context. No source code modification; no bytecode instrumentation
(preprocessing); no new API introduced Enable dynamic load balancing on clusters
JIT compiler-based execution engine (JITEE) Operated in Just-In-Time (JIT) compilation mode cluster-aware
Global Object Space A shared global heap spanning all cluster nodes Provide location-transparent object access Adaptive migrating home protocol for memory consistency, plus
various optimizing schemes. I/O redirection
TCHPC 2004, Taiwan, Mar, 2004 8
JESSICA2 thread migration (In a JIT-enabled JVM)
Thread
Frame
(1) Alert
Frames
Method AreaJVM
Frame parsingRestore execution
Frame
Stack analysisStack capturing
Thread Scheduler
Source node
Destination node
Migration Manager
LoadMonitor
Method Area
RTC
RTC
FramesBTC
(2)
(3)
PC
PC
RTC: Raw Thread ContextBTC : Bytecode-oriented Thread Context (thread id, frames, class names, method signature, PC, Operand stack ptr, local vars …)
Transformation of the RTC into the BTC directly inside the JIT compiler
TCHPC 2004, Taiwan, Mar, 2004 9
Thread Stack Transformation
Raw Thread Context (RTC)
%esp: 0x00000000%esp+4: 0x082ca809%esp+8: 0x08225400%esp+12: 0x08266bc0
%esp: 0x00000000%esp+4: 0x082ca809%esp+8: 0x08225400%esp+12: 0x08266bc0...%eax = 0x08623200%ebx = 0x08293100
Frames{method CPI::run()V@111local=13;stack=0;var:arg0:CPI, 33, 0x8225400local1: [D; 33, 0x8266bc0@2local2: int, 2;...
Bytecode-oriented Thread Context (BTC)
Raw Thread Context (RTC)
Stack CapturingStack Restoration
TCHPC 2004, Taiwan, Mar, 2004 10
DetailsBytecode verifier
Bytecode translation
migration point selection :head of a basic block INVOKESTATIC, INVOKESPECIAL,INVOKEVIRTUAL and INVOKEINTERFACE
Construct control flow
graphinvoke
code generation
Native Code
Linking & Constant Resolution
Intermediate Code
Java frame
C frame
Register rebuild
Register allocation
(Restore)
mov var1->reg1mov var2->reg2...
reg var
Variables
Java frame detectionthread stack
raw stack
Global Object Space
1. Migration checking2. Non-destructive register spilling3. Object checking4. Type spilling for variable type
deducing
TCHPC 2004, Taiwan, Mar, 2004 11
Example of native code instrumentation
TCHPC 2004, Taiwan, Mar, 2004 12
Optimization on migration points – Pseudo-inlining
Purpose : eliminate the costs of unnecessary inserted migration points
General idea: delete M-points before a small method invocation
TCHPC 2004, Taiwan, Mar, 2004 13
Dynamic Register Patching
Stack growth
%ebp
bootstrap frame
trampoline frame
Ret addr
frame 0
reg1 <- value1reg2 <- value2
jmp restore_point0
Ret addr
%ebp
%ebp
frame 1
reg1 <- value1jmp restore_point1
Compiled methods:
Method1(){...retore_point1:}
Method0(){...retore_point10:}
trampoline
bootstrap(){ trampoline();closing handler();}
TCHPC 2004, Taiwan, Mar, 2004 14
Advantages of native code instrumentation
Lightweight Re-use JIT compiler internal data structures and control
flow analysis functions No need to include debugging information in Java class
files Instrumented native codes are more efficient than
instrumented bytecode.
Transparent No source code modification. No new API introduced. No preprocessing
TCHPC 2004, Taiwan, Mar, 2004 15
Global Object Space (GOS)
Provide global heap abstraction for DJVMHome-based object coherence protocol, compliant with JVM Memory Model OO-based to reduce false sharing
Non-blocking communication Use threaded I/O interface inside JVM for
communication to hide the latency
Adaptive object home migration mechanism Take advantage of JVM runtime information for
optimization
TCHPC 2004, Taiwan, Mar, 2004 16
GOS runtime data structureMaster object
object header
cache pointer
object data
Cache object
object header
cache pointer
cache data Master host id
master address
class
cache obj list
thread idstatuscache datanext
Cache data
thread idstatuscache datanext
cache data
Cache header
TCHPC 2004, Taiwan, Mar, 2004 17
Experimental environment
HKU Gideon 300 Linux cluster : 300 P4 PCs (2GHz, 512 MB RAM, 40 GB disk)
Network: 312-port Foundry FastIron 1500 Non-blocking switch (100 Mbits/s)
TCHPC 2004, Taiwan, Mar, 2004 18
Migration overhead during normal execution
(SPECJVM98 benchmark)
Benchmarks Time (seconds) Space (native code/bytecode)
No migration Migration No migration Migration
compress 11.31 11.39(+0.71%) 6.89 7.58(+10.01%)
jess 30.48 30.96(+1.57%) 6.82 8.34(+22.29%)
raytrace 24.47 24.68(+0.86%) 7.47 8.49(+13.65%)
db 35.49 36.69(+3.38%) 7.01 7.63(+8.84%)
javac 38.66 40.96(+5.95%) 6.74 8.72(+29.38%)
mpegaudio 28.07 29.28(+4.31%) 7.97 8.53(+7.03%)
mtrt 24.91 25.05(+0.56%) 7.47 8.49(+13.65%)
jack 37.78 37.90(+0.32%) 6.95 8.38(+20.58%)
Average (+2.21%) (+15.68%)
TCHPC 2004, Taiwan, Mar, 2004 19
Migration overhead analysisProgram (frame #) LT(1) CPI(1) ASP(1) N-Body(8) SOR(2)
Latency (ms) 4.997 2.680 4.678 10.803 8.467
Frame # 1 2 4 6 8 10
Var # 4 15 37 59 81 103
Size (B) 201 417 849 1281 1713 2145
Capture (us) 202 266 410 495 605 730
Parse (us) 235 253 447 526 611 724
Create (us) 360 360 360 360 360 360
Compile (us) 478 575 847 1,169 1,451 1,720
Build (us) 7 11 14 16 21 28
Total (us) 1,282 1,465 2,078 2,566 3,048 3,562
Overall migration latency
Migration time breakdown (LT program)
TCHPC 2004, Taiwan, Mar, 2004 20
GOS Optimizations (using 4 PCs)
0%
20%
40%
60%
80%
100%
NO H
HS
HS
P
NO H
HS
HS
P
NO H
HS
HS
P
NO H
HS
HS
P
ASP SOR Nbody TSP
Obj
Syn
Comp
NO = No optimizations HS = Home migration + Synchronized Method ShippingH = Home migration HSP = HS + Object pushing
TCHPC 2004, Taiwan, Mar, 2004 21
JESSICA2 vs JESSICA (CPI)
CPI(50,000,000iterations)
050000
100000150000200000250000
2 4 8
Number of nodes
Tim
e(m
s) JESSICA
JESSICA2
TCHPC 2004, Taiwan, Mar, 2004 22
Application benchmarkSpeedup
0
2
4
6
8
10
2 4 8
Node number
Spe
edup
Linear speedup
CPI
TSP
Raytracer
nBody
TCHPC 2004, Taiwan, Mar, 2004 23
Parallel Ray Tracing (using 64 nodes of Gideon 300 cluster)
Linux 2.4.18-3 kernel (Redhat 7.3)
64 nodes: 108 seconds
1 node: 4402 seconds ( 1.2 hour)
Speedup = 4402/108=40.75
TCHPC 2004, Taiwan, Mar, 2004 24
Conclusions
Transparent Java thread migration in JIT compiler enables the high-performance execution of multithreaded Java application on clusters
An embedded GOS layer can take advantage of the JVM runtime information to reduce communication overhead
TCHPC 2004, Taiwan, Mar, 2004 25
Future work
Advanced thread migration mechanism without overhead during normal execution (finished)
Incremental Distributed GC
Enhanced Single I/O Space to benefit more real-life applications
Parallel I/O Support
TCHPC 2004, Taiwan, Mar, 2004 26
Thanks
JESSICA2 Webpagehttp://www.csis.hku.hk/~clwang/
projects/JESSICA2.html