Design Rationales in the JRockit JVM Marcus Lagergren Senior Software Architect, Klarna


Agenda
• In the beginning…
• What did we accomplish / Internals
  – Code Generation
  – Memory Management
  – Threads & Synchronization
• Externals
  – The Java Mission Control suite
  – A parenthesis on JRockit VE
• Q&A

About the speaker

@lagergren

• M.Sc. from KTH, Stockholm
  – Narrowly escaped doing a PhD on bit security in cryptographic systems
• Runtime, OS and compiler engineer since 1999, with some startup breaks
• One of the original creators of the JRockit JVM

In the beginning

Appeal Virtual Machines
• Appeal Software Solutions
  – Consulting, almost exclusively Java by 1997
• Still the pre-appserver era
• We saw that Java would be great on the server side
  – Shorter development cycles – money in the bank
    • Buffer overrun protection
    • Automatic memory management
    • Write once run everywhere
• Tremendous scalability problems
• Sun Classic VM was all-encompassing

JavaOne 1997
• Sun Microsystems presents the HotSpot virtual machine
  – “WOW! This is the way to do it! Adaptive runtimes!”

JavaOne 1998
• Sun Microsystems presents the HotSpot virtual machine again
  – “WTF! This is slide-by-slide the exact same presentation as last year!?!”
  – We can’t wait any longer. Let’s build our own VM. How hard can it be?

Creating our own JVM - JRockit

Productize a narrower domain?
• Server-side usage only. Headless.
  – We need to help the early appserver vendors get performance and scalability
• No interpreter
  – “startup time doesn’t matter on the server anyway”
• Green threads or n x m threads.
  – Explicit parallelism was all-pervasive.
• Incremental GC
  – We thought something like [Seligman, Grarup] would suffice.
• Support ourselves on consulting only.
  – Nope – needed venture capital

The Java License
• You can’t call yourself “Java” without a Java license
• You need to pass the TCK test suite
  – Not available without license
• To get a Java License you need a “value add”
• What’s a “value add”?
  – Superior performance!
  – What? You didn’t like that?
  – OK… Let’s see… Err.. “manageability”
• Java License was granted 2001
  – Helped us partner up with BEA Systems and Intel
  – BEA acquired us in 2002
  – Oracle acquired BEA in 2008
  – Oracle acquired Sun in 2010

What did we accomplish?

The real value adds turned out to be:
• Multitiered support for paying customers
  – Part of the WLS stack
• Monitoring and Serviceability
  – JRockit Mission Control (now Java Mission Control)
  – Record and introspect production systems with zero overhead.
• Pioneered “Soft real time” GC
  – Deterministic GC
  – Low latency GC
• Virtualization
  – JRockit Virtual Edition – an operating system for Java
  – Shorter paths between Java and hardware
  – Hypervisor required
  – JRockit VE on virtual hardware outperformed physical Linux!
• The benchmark wars
  – Constantly keeping it going with Sun and IBM, driving Java server-side performance
• JRockit became the default JVM in the Oracle stack in 2008
• ExaLogic

…and then

INTERNALS

@SimmsUpNorth

Code Generation

Code generation – No Interpreter
• Keep test matrix small
• Keep operational complexity down
• Targeting server side apps – warmup a small issue
• “Code caching / AOT can be done later”

Code generation – One JIT
• Keep test matrix small
• Keep operational complexity down
• Run it in different modes, with maximum code reuse
• Same IR throughout
  – With gradual augmentations

But…
• Startup became a problem
  – We removed optimizers and added as a “spine” to the normal JIT pipeline.
• Lazy code generation through trampolines
• Same mechanism for code invalidation
• Bookkeeping to identify a program point down to any individual machine instruction

Code Generation
• Same “spine” used in all tiers of code generation
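The trampoline bullets can be illustrated with a small model. This is only a sketch under stated assumptions: LazyCallSlot and its methods are hypothetical names, a Java lambda stands in for generated machine code, and nothing here is taken from the JRockit sources. The call slot starts out pointing at a trampoline, the first call generates code and back-patches the slot, and invalidation simply points the slot at the trampoline again.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical model of a call slot that initially points at a trampoline.
final class LazyCallSlot {
    private final IntUnaryOperator source;     // stands in for the method's bytecode
    private volatile IntUnaryOperator target;  // what callers actually jump to
    private int compileCount = 0;

    LazyCallSlot(IntUnaryOperator source) {
        this.source = source;
        this.target = this::trampoline;        // nothing is generated up front
    }

    // The first call lands here: generate code, back-patch the slot, run the code.
    private synchronized int trampoline(int arg) {
        IntUnaryOperator compiled = compile();
        target = compiled;                     // later calls skip the trampoline
        return compiled.applyAsInt(arg);
    }

    private IntUnaryOperator compile() {
        compileCount++;
        return source;                         // a real JIT would emit machine code here
    }

    int call(int arg) {
        return target.applyAsInt(arg);
    }

    // Invalidation reuses the same mechanism: route callers to the trampoline again.
    void invalidate() {
        target = this::trampoline;
    }

    synchronized int compileCount() {
        return compileCount;
    }
}
```

Usage: new LazyCallSlot(x -> x * 2).call(21) triggers one code generation and returns 42; further calls reuse the patched target until invalidate() is called.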

Optimizations
• In and out of SSA
• Applied to all levels of IR
  – Loop peeling, value numbering, String append explosion, type check removal, sign extension elimination, copy propagation, bounds check removal, virtual to fixed calls, inlining, if short circuiting, straightening, strength reduction, constant propagation, dead code removal, out of loop hoisting, explode objects and array copies, boxing & unboxing removal, local escape analysis, ASM peephole optimization, redundant memory access removal, etc etc etc…
• Support for regionalized IRs
• Graph Fusion Register Allocator

Optimization Targets
• Thread sampling
• Partly taken over by safepoint based approach in R28
• Some code instrumentation, for example for inlining path
  – Not in the general case, e.g. invocation counters
• Hardware sampling where available
  – Only good thing about IA64?
  – Could also match e.g. L2 misses to program points
• Bugging the processor manufacturers since 2002 about userland PC sample buffer.
• JRockit VE: x1000 more samples – significantly proven shorter warmup

HotSpot style?
• On-stack replacement?
• Deoptimization?
• Never much cared for any of it ;-)

HotSpot Style OSR and Deoptimization
• We’ve never found a practical use case.
  – So we can’t ever swap out the main function with the microbenchmark loop. Who cares?
• An assumption is invalidated
  – Either patch code directly or use a guard when generating it in the first place
• A large assumption
  – Write a trap in the code and schedule lazy regeneration of entire method
• Not strictly true for dynamic languages
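As an illustration of the "use a guard when generating it" option, here is a minimal Java rendering of what a guarded, devirtualized call could look like if written by hand. The class and method names are hypothetical and the shape types are invented for the example; a JIT would emit the equivalent in machine code.

```java
// Hypothetical example types; only the guard pattern itself is the point.
final class GuardedCall {
    interface Shape { double area(); }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    // Code an optimizer might produce for shape.area() while Square is the only
    // receiver type seen so far: a cheap type guard, the inlined body, a fallback.
    static double areaGuarded(Shape shape) {
        if (shape.getClass() == Square.class) {   // guard on the assumption
            Square sq = (Square) shape;
            return sq.side * sq.side;             // inlined Square.area()
        }
        return shape.area();                      // assumption failed: normal virtual call
    }
}
```

When a second receiver type shows up, the guard fails and the call still behaves correctly, so no on-stack replacement is needed for this case.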

“Garbage collecting code”
• Code kept in binary tree of code blocks ~64M
  – More if large pages enabled
• Classloader unloading → garbage collection
• Reference count to active code modified when backpatching
• Specialized usage of code blocks.
  – Trampolines only
  – Optimized code only

Bytecode is bad – kill it quickly
• What’s with the goto:s?
• Why can it express more than Java source code?
  – OK we understand the multilanguage concept, we sorta forgive you.
  – But man, dominators and loop analysis – that’s a lot of compile time
• …and why is it a stack machine AND a register machine with 65535 registers at the same time!?
• Initially tried to reconstruct ASTs
  – Obfuscators etc made this pretty hopeless.
• ~15% of the klocs in JRockit/codegen do flow control analysis on the goto:s
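To give a feel for the dominator analysis that goto-based control flow forces on the compiler, here is a sketch of the textbook iterative dominator computation over a control flow graph. It is not JRockit's implementation; the Dominators class and the predecessor-array encoding are assumptions made for the example.

```java
import java.util.BitSet;

// Textbook iterative dominator computation; block 0 is assumed to be the entry.
final class Dominators {
    // preds[b] lists the predecessor blocks of block b.
    static BitSet[] compute(int[][] preds) {
        int n = preds.length;
        BitSet[] dom = new BitSet[n];
        for (int i = 0; i < n; i++) {
            dom[i] = new BitSet(n);
            if (i == 0) {
                dom[i].set(0);        // the entry is dominated only by itself
            } else {
                dom[i].set(0, n);     // start from "everything" and shrink
            }
        }
        boolean changed = true;
        while (changed) {             // iterate to a fixed point
            changed = false;
            for (int b = 1; b < n; b++) {
                BitSet next = new BitSet(n);
                next.set(0, n);
                for (int p : preds[b]) {
                    next.and(dom[p]); // intersect over all predecessors
                }
                next.set(b);          // every block dominates itself
                if (!next.equals(dom[b])) {
                    dom[b] = next;
                    changed = true;
                }
            }
        }
        return dom;
    }
}
```

For the graph 0 → 1, 1 → 2, 1 → 3, 2 → 1 (a goto back to block 1), the call Dominators.compute(new int[][] {{}, {0, 2}, {1}, {1}}) reports that block 1 dominates block 2, which is what identifies the edge 2 → 1 as a loop back edge.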

The IR
• Use IR everywhere (or Java)
• The IR should ideally reflect any of several pluggable frontends.
  – We played around with CLR a bit.
  – These days – dynamic languages :-)
• No Sea of Nodes
• No HotSpot style “high level IR is low level”
• Simple IR in MIR form (platform independent)

The IR – Design Rationale
• We had some compiler experience – wanted to be on track quickly. Do it the traditional way.
• We are not “wrong”. LLVM is very similar.
• Tiered: highest tier == always high level
  – Hardware agnostic.
  – No architecture specific memory ops
• Tiered: lowest tier == always the native architecture instruction for instruction.
  – A gradual transition.
  – A CPU has no sea of nodes.

The IR
• Highest IR level may have operations as operands
• Intrinsics everywhere
  – arraycopy, membar, cmpuXX, sse4IndexOf, doubleToLongBits, crypto, Math.sin and so on…
• Regret not doing more in SSA form

The IR Info “database”
• Lazily computable information
  – Liveness
  – Dominators
  – Loop information
  – Aliases
  – Type inference
  – Ranges
  – Nullness analysis
  – …
  – Invalidate on modification.
• Not a very stable model.
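The "lazily computable, invalidate on modification" idea can be sketched as a small cache. IrInfoCache and its method names are hypothetical, and a real compiler would likely invalidate more selectively; this sketch drops everything on any change.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of an analysis "database" attached to one method's IR.
final class IrInfoCache {
    private final Map<String, Supplier<?>> analyses = new HashMap<>();
    private final Map<String, Object> cache = new HashMap<>();

    // Register how to compute a piece of information, e.g. "liveness".
    void register(String name, Supplier<?> analysis) {
        analyses.put(name, analysis);
    }

    // Compute on first request, then serve the cached result.
    // The name is assumed to have been registered beforehand.
    Object get(String name) {
        if (!cache.containsKey(name)) {
            cache.put(name, analyses.get(name).get());
        }
        return cache.get(name);
    }

    // Any IR modification throws away everything derived from the old IR.
    void invalidate() {
        cache.clear();
    }
}
```

An optimization pass would call get("dominators") as often as it likes and call invalidate() after rewriting the IR, so the next request recomputes the information.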

Memory management

Transition: object layout, types and livemaps…

Object layout and types
• Object headers should be fixed sized.
• JRockit Object header is 32+32 bits
• All platforms with some content variations.
• [Grove] ramblings on object models
• Type tree similar to [Krall, Vitek, Horspool]

Livemaps (oopmaps)
• Registers and stack slots on the local frame that contain objects.
• Nothing strange here. Required for non-conservative garbage collection of any sort.
• Internal pointer bit
• Forms the root set.
• Rollforwarding vs the safepoint approach

Transition - Livemaps

Memory management
• Garbage collectors
  – Concurrent
  – Parallel
  – Deterministic
• With or without generations
• Concurrent collection
  – Your basic generational concurrent mark and sweep collector [Printezis, Detlefs]
  – Supports multi generation (>1) young spaces.
    • Combats heavy object allocation situations.
    • Adaptively balanced against copy overhead
  – Write barriers before object writes
  – Minimize stopping the world
  – Young collections use a variant of stop & copy
• Can also run with a parallel policy
  – Stop the world and clean up quickly
  – Only throughput oriented
  – No write barriers, as there is no need for a card table

Mark & Sweep
• Backbone of GC based on traditional tri-color mark and sweep
• Adaptive thread usage and additional concurrency
• Two colors – not three.
  – Object is in one of two sets
  – Live objects: grey bits (mix of grey & black objects in traditional tri-coloring)
  – Distinction handled by putting grey objects in thread local queues for each GC thread.
  – Parallel threads can work on thread local data
  – Efficient prefetching is possible due to FIFO order.
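A single-threaded sketch of the two-color scheme: one bit per object says "reached", and the work queue holds the objects that a tri-color collector would call grey. The object graph encoding (an index array) and the class name are assumptions for illustration; in the design described above each GC thread would own such a queue.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

// Hypothetical two-color marker over an object graph given as index arrays.
final class TwoColorMark {
    // refs[i] holds the indices of the objects that object i points to.
    static BitSet mark(int[][] refs, int[] roots) {
        BitSet live = new BitSet(refs.length);           // the only "color" bit
        ArrayDeque<Integer> queue = new ArrayDeque<>();  // would be local to one GC thread
        for (int root : roots) {
            if (!live.get(root)) {
                live.set(root);
                queue.addLast(root);
            }
        }
        while (!queue.isEmpty()) {
            int obj = queue.pollFirst();   // FIFO: the next objects to scan are known ahead of time
            for (int child : refs[obj]) {
                if (!live.get(child)) {    // unmarked: mark it and queue it for scanning
                    live.set(child);
                    queue.addLast(child);
                }
            }
        }
        return live;                       // clear bits are garbage for the sweep
    }
}
```

An object that is marked and still in the queue corresponds to grey, marked and no longer queued corresponds to black, so no third color needs to be stored.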

No permgen ever!

Other nice features
• No permgen!!! Ever!
• Pinned objects.
  – Fast memory buffers.
  – Also enable non-contiguous heaps.
• Compaction
  – “Internal and external”.
  – G1 evacuates regions instead with a stop the world-and-copy policy similar to JRockit YC

Memory management
• Concurrent GC has an additional set: live bits
  – Contains all live objects in the system, including the newly created ones.
  – JRockit can quickly find objects that have been created during a concurrent mark phase.
  – Card tables
    • Not just for generational GC
    • Also to avoid searching the entire live object graph when a concurrent mark phase cleans up.
    • Just look at dirty cards at the end of the mark phase.
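The card table bullets can be made concrete with a minimal sketch. The 512-byte card size, the class name and the method names are assumptions; the point is only that the write barrier dirties one byte per store and that the end of the mark phase visits dirty cards instead of the whole heap.

```java
import java.util.function.IntConsumer;

// Hypothetical card table over a heap addressed by byte offsets.
final class CardTable {
    static final int CARD_SHIFT = 9;   // assumed card size: 512 bytes
    private final byte[] cards;

    CardTable(long heapBytes) {
        long cardSize = 1L << CARD_SHIFT;
        cards = new byte[(int) ((heapBytes + cardSize - 1) >>> CARD_SHIFT)];
    }

    // Write barrier: executed together with every reference store into the heap.
    void recordWrite(long fieldOffset) {
        cards[(int) (fieldOffset >>> CARD_SHIFT)] = 1;
    }

    // End of a concurrent mark phase: rescan only the dirty cards and clean them.
    int drainDirty(IntConsumer rescanCard) {
        int visited = 0;
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] != 0) {
                cards[i] = 0;
                rescanCard.accept(i);
                visited++;
            }
        }
        return visited;
    }
}
```

With a 4096-byte heap there are 8 cards; writes at offsets 10 and 500 dirty the same card, so the cleanup work is bounded by the number of dirty cards rather than the number of writes.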

Young Collections
• A variant of stop and copy is used.
  – All threads are halted and objects are deleted or promoted
  – Hierarchical breadth first copy for cache locality
    • Parallelizes nicely
    • Many threads always harvest a young space
• Young and old collections may occur at same time.
  – All bitsets and data structures can be shared as long as the old collection is guaranteed to see all cards that have become dirty during a concurrent phase. (Extra card table to record this “difference” – “modified union set”)
  – Keep this intact for old collection

Thread Local Allocation
• Thread local allocation
• Thread local areas are roughly L2 cache sized and objects are allocated here before they are forced upon the heap
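Allocation inside a thread local area is typically a pointer bump, which needs no synchronization because only the owning thread touches the area. The sketch below assumes 8-byte object alignment and uses plain long values as stand-ins for addresses; the names are hypothetical.

```java
// Hypothetical bump-pointer allocator for one thread local area.
final class ThreadLocalArea {
    private final long end;   // first address past the area
    private long top;         // next free address (the bump pointer)

    ThreadLocalArea(long start, long size) {
        this.top = start;
        this.end = start + size;
    }

    // Returns the address of the new object, or -1 when the area is exhausted
    // and the thread has to fetch a fresh area from the shared heap.
    long allocate(long size) {
        long aligned = (size + 7) & ~7L;   // round up to the assumed 8-byte alignment
        if (top + aligned > end) {
            return -1;
        }
        long address = top;
        top += aligned;
        return address;
    }
}
```

Only the refill path, taken when allocate returns -1, would need to synchronize with other threads.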

Compressed References
• For less than 4 (or 4*x) GB of maximum heap size
• Use 32 bit pointers (or 32 + log2(x) bits)

CompRef compress(Ref ref) {
    return (uint32_t)ref; // truncate reference to 32 bits
}

Ref decompress(CompRef ref) {
    return globalHeapBase | ref;
}

With aligned objects (the 4*x GB case), the alignment bits are shifted out:

CompRef compress(Ref ref) {
    return (uint32_t)(ref >> log2(objectAlignment));
}

Ref decompress(CompRef ref) {
    // widen before shifting so the high bits are not lost
    return globalHeapBase | ((Ref)ref << log2(objectAlignment));
}
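The same scheme can be checked in runnable form. This Java sketch models references as long values and assumes 8-byte object alignment (a shift of 3, giving a 32 GB range) and a heap base whose low 35 bits are zero, which is what makes the OR in decompress valid. The constants are illustrative and not JRockit's.

```java
// Hypothetical model of shifted compressed references using long "addresses".
final class CompressedRefs {
    static final int ALIGN_SHIFT = 3;                    // assumed 8-byte object alignment
    static final long HEAP_BASE = 0x7F00_0000_0000L;     // assumed base, low 35 bits are zero

    // Drop the alignment bits, then truncate to 32 bits.
    static int compress(long ref) {
        return (int) (ref >>> ALIGN_SHIFT);
    }

    // Widen without sign extension, restore the alignment bits, add the base back.
    static long decompress(int compressed) {
        return HEAP_BASE | ((compressed & 0xFFFF_FFFFL) << ALIGN_SHIFT);
    }
}
```

The mask in decompress matters in Java because int is signed; without it, compressed values with the top bit set would be sign extended and corrupt the reconstructed address.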

Deterministic GC
• QoS level for latencies. “No more than X ms”
• Down to single-digit milliseconds on modern x86 hardware
• Caveat: live data on heap is the main constraint.
  – Up to 50% of heap live data still feasible

Deterministic GC – How?
• Greedy strategy
  – Postpone stopping the world for as long as possible.
  – Maybe the problem goes away and we don’t have to stop the world
• Split up everything into work packets
  – Drop them at any time.
• Efficient parallelization.
  – Mark phase is 90% of GC time
• Efficient heuristics
  – Some more work in e.g. write barriers

Threads and Synchronization
• A java.lang.Thread is a native thread.
  – Interesting, though: thread pooling and pseudo thin-threads are back, for example in Akka.
  – Java 8 – Collection.parallelStream
  – The world is moving towards implicit parallelism in general
• Most of the JRockit thread code and adaptivity logic is written in Java
• Locks are thin or fat
  – Adaptive inflation and deflation
• Lazy locking (biased locking supported)
  – Adaptive heuristics for banning and retrying the lazy approach.

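Threads and Synchronization

import java.util.concurrent.atomic.AtomicInteger;

public class PseudoSpinlock {
    private static final int LOCK_FREE = 0;
    private static final int LOCK_TAKEN = 1;

    // The slide's cmpxchg(value, &lock) is modelled here with AtomicInteger.getAndSet.
    private final AtomicInteger lock = new AtomicInteger(LOCK_FREE);

    public void lock() {
        // burn cycles
        while (lock.getAndSet(LOCK_TAKEN) == LOCK_TAKEN) {
            Thread.onSpinWait(); // optional micropause (Java 9+)
        }
    }

    public void unlock() {
        int old = lock.getAndSet(LOCK_FREE);
        // guard against recursive locks
        assert old == LOCK_TAKEN;
    }
}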

Threads and Synchronization
• Locks are thin when first taken
• Time spent in lock and times taken triggers inflation
• wait or notify immediately inflates a lock
• Fat locks are also deflated when uncontended for too long
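The inflation and deflation rules read like a small state machine, sketched here. Everything concrete in it is an assumption: the thresholds, the class name, and the use of a contended-acquisition counter as the trigger (the slide names time spent in the lock and times taken). It models only the policy, not an actual lock.

```java
// Hypothetical policy model for thin/fat lock transitions.
final class LockPolicy {
    enum State { THIN, FAT }

    static final int INFLATE_AFTER = 3;   // assumed: contended acquisitions before inflating
    static final int DEFLATE_AFTER = 5;   // assumed: uncontended acquisitions before deflating

    private State state = State.THIN;     // locks are thin when first taken
    private int contended = 0;
    private int uncontended = 0;

    // Called on every acquisition, with whether the caller had to compete for the lock.
    State onAcquire(boolean wasContended) {
        if (wasContended) {
            uncontended = 0;
            if (state == State.THIN && ++contended >= INFLATE_AFTER) {
                inflate();
            }
        } else if (state == State.FAT && ++uncontended >= DEFLATE_AFTER) {
            state = State.THIN;           // uncontended for too long: deflate
            contended = 0;
            uncontended = 0;
        }
        return state;
    }

    // wait or notify needs a real monitor, so it inflates immediately.
    State onWaitOrNotify() {
        inflate();
        return state;
    }

    private void inflate() {
        state = State.FAT;
        contended = 0;
        uncontended = 0;
    }
}
```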

[Figure: thin lock lifecycle]

[Figure: thin & fat lock lifecycle]

Lock Pairing
• Bytecode again – no restriction on matching monitorenter with monitorexit
• Not all of them can be analyzed by the JIT
• We can store what we know, and make unlocks quick.
  – Lock tokens (the object OR 3 bits)
    • Thin, fat, recursive, lazily taken, unmatched
  – Livemap system contains nesting order.
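"The object OR 3 bits" works because aligned object addresses have their low bits free. A minimal sketch, assuming 8-byte alignment and an invented numeric encoding for the five kinds named above:

```java
// Hypothetical lock token encoding: object address in the high bits, kind in the low 3.
final class LockToken {
    // Kinds as named on the slide; the numeric values are an assumption.
    static final int THIN = 0, FAT = 1, RECURSIVE = 2, LAZY = 3, UNMATCHED = 4;
    static final long KIND_MASK = 0b111L;

    static long make(long objectAddress, int kind) {
        if ((objectAddress & KIND_MASK) != 0) {
            throw new IllegalArgumentException("object is not 8-byte aligned");
        }
        return objectAddress | kind;
    }

    static long object(long token) {
        return token & ~KIND_MASK;
    }

    static int kind(long token) {
        return (int) (token & KIND_MASK);
    }
}
```

An unlock that holds the token can read the kind from the low bits without inspecting the object header, which is one way the stored knowledge could make unlocks quick.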

Optimizations
• A lot of smallish codegen transforms: e.g. Lock fusion
• “Fat spin”
• Lazy unlocking (biased locking)
  – Start assuming all locks are lazy. Tag thin locks as lazily locked.
  – If object already lazily locked
    • If it’s the same thread: profit
    • Else – stop the lock holder, detect the “real” lock state by stack walk. Convert to thin lock or forcefully unlock it
  – Transfer bits
  – Heuristics: object and class banning. Ageing.

[Figure: thin, fat & lazy lock lifecycle]

Export it all! – JRockit Mission Control (now Java Mission Control)

@javamissionctrl
$JAVA_HOME/bin/jmc

Mission Control
• Use “free” runtime information!
  – JRockit (Java) Mission Control
    • JRockit (Java) flight recorder
    • Memory leak detector (JRockit only)
    • Management console
• $JAVA_HOME/bin/jcmd (used to be jrcmd)
• Everything in the VM abstracted into an event that may or may not have a duration
• Soon: public API

Java Flight Recorder
• Always on
  – Excellent for debugging and analysis of crashes
  – Can be set to record more intrusively for periods in production
    • E.g. extensive lock profiling
• Everything is an event
• Buffered recording – the last n seconds available at any crash or when a command is given.
• Very fine precision.
  – Multimedia timers and system hardware support required for e.g. latencies
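The "last n seconds available" behaviour can be sketched as a time-windowed buffer of events. This is a toy model, not the Flight Recorder implementation or its API: EventRing, Event and the millisecond fields are hypothetical, and events are assumed to arrive in time order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer that keeps only the events from the most recent time window.
final class EventRing {
    static final class Event {
        final String name;
        final long startMillis;
        final long durationMillis;   // 0 for an instant event

        Event(String name, long startMillis, long durationMillis) {
            this.name = name;
            this.startMillis = startMillis;
            this.durationMillis = durationMillis;
        }
    }

    private final ArrayDeque<Event> buffer = new ArrayDeque<>();
    private final long windowMillis;

    EventRing(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Append the event, then evict everything older than the window,
    // measured from the newest event.
    void record(Event e) {
        buffer.addLast(e);
        while (buffer.peekFirst().startMillis < e.startMillis - windowMillis) {
            buffer.removeFirst();
        }
    }

    // What a crash handler or a dump command would write out.
    List<Event> dump() {
        return new ArrayList<>(buffer);
    }
}
```

Because eviction happens on every append, recording cost stays constant and memory use is bounded by the event rate times the window length.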

Latency Analysis

The Management Console
• Peek into the running production JVM
• Add triggers on events
• Interact with the VM: force GC etc.

The Memory Leak Detector
• Introspect the type graph in real time. Look for types that are growing despite GC:s

Studying a recording offline

JRockit Virtual Edition

Is the JVM an OS?
• Add a cooperative aspect to thread switching
• Zero-copy networking code
• Reduce cost of entering OS
• Balloon driver
• Runs only on hypervisor
• Facilitates pauseless GC

Thank you!

Would you like to know more?
Oracle JRockit – the Definitive Guide