Agenda
• In the beginning…
• What did we accomplish / Internals
  – Code Generation
  – Memory Management
  – Threads & Synchronization
• Externals
  – The Java Mission Control suite
  – A parenthesis on JRockit VE
• Q&A
About the speaker
• M.Sc. from KTH, Stockholm
  – Narrowly escaped doing a PhD on bit security in cryptographic systems
• Runtime, OS and compiler engineer since 1999, with some startup breaks
• One of the original creators of the JRockit JVM
Appeal Virtual Machines
• Appeal Software Solutions
  – Consulting, almost exclusively Java by 1997
• Still the pre-appserver era
• We saw that Java would be great on the server side
  – Shorter development cycles – money in the bank
• Buffer overrun protection
• Automatic memory management
• Write once run everywhere
JavaOne 1997
• Sun Microsystems presents the HotSpot virtual machine
  – “WOW! This is the way to do it! Adaptive runtimes!”
JavaOne 1998
• Sun Microsystems presents the HotSpot virtual machine again
  – “WTF! This is slide-by-slide the exact same presentation as last year!?!”
  – We can’t wait any longer. Let’s build our own VM. How hard can it be?
Productize a narrower domain?
• Server-side usage only. Headless.
  – We need to help the early appserver vendors get performance and scalability
• No interpreter
  – “startup time doesn’t matter on the server anyway”
• Green threads or n x m threads.
  – Explicit parallelism was all-pervasive.
• Incremental GC
  – We thought something like [Seligman, Grarup] would suffice.
• Support ourselves on consulting only.
  – Nope – needed venture capital
The Java License
• You can’t call yourself “Java” without a Java license
• You need to pass the TCK test suite
  – Not available without license
• To get a Java License you need a “value add”
• What’s a “value add”?
  – Superior performance!
  – What? You didn’t like that?
  – OK… Let’s see… Err… “manageability”
• Java License was granted 2001
  – Helped us partner up with BEA Systems and Intel
  – BEA acquired us in 2002
  – Oracle acquired BEA in 2008
  – Oracle acquired Sun in 2010
The real value adds turned out to be:
• Monitoring and Serviceability
  – JRockit Mission Control (now Java Mission Control)
  – Record and introspect production systems with zero overhead.
• Virtualization
  – JRockit Virtual Edition – an operating system for Java
  – Shorter paths between Java and hardware
  – Hypervisor required
  – JRockit VE on virtual hardware outperformed physical Linux!
• The benchmark wars
  – Constantly keeping it going with Sun and IBM, driving Java server-side performance
Code generation – No Interpreter
• Keep test matrix small
• Keep operational complexity down
• Targeting server side apps
  – Warmup a small issue
• “Code caching/AOT can be done later”
Code generation – One JIT
• Keep test matrix small
• Keep operational complexity down
• Run it in different modes, with maximum code reuse
• Same IR throughout
  – With gradual augmentations
But…
• Startup became a problem
  – We removed optimizers and added as a “spine” to the normal JIT pipeline.
• Lazy code generation through trampolines
• Same mechanism for code invalidation
• Bookkeeping to identify a program point down to any individual machine instruction
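The trampoline idea on this slide can be illustrated with a minimal Java sketch. It is not JRockit code: the class LazyMethodSlot, its methods, and the use of an IntUnaryOperator as a stand-in for generated machine code are all assumptions made for illustration. A call slot initially points at a trampoline; the first call "compiles" the method and back-patches the slot, and invalidation simply points the slot at the trampoline again.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: a call slot that starts out pointing at a trampoline.
final class LazyMethodSlot {
    // Stands in for the machine code the JIT would eventually produce.
    private final IntUnaryOperator compiled;
    // The trampoline: generates code on first use, then forwards the call.
    private final IntUnaryOperator trampoline = this::compileAndCall;
    // What callers actually jump to; starts as the trampoline.
    private volatile IntUnaryOperator target = trampoline;
    private int compileCount = 0;

    LazyMethodSlot(IntUnaryOperator compiled) {
        this.compiled = compiled;
    }

    private synchronized int compileAndCall(int arg) {
        if (target == trampoline) {
            compileCount++;        // "code generation" happens here
            target = compiled;     // back-patch the slot
        }
        return target.applyAsInt(arg);
    }

    int call(int arg) {
        return target.applyAsInt(arg);
    }

    // Invalidation reuses the same mechanism: route calls through the trampoline again.
    synchronized void invalidate() {
        target = trampoline;
    }

    synchronized int compileCount() {
        return compileCount;
    }

    public static void main(String[] args) {
        LazyMethodSlot slot = new LazyMethodSlot(x -> x * 2);
        System.out.println(slot.call(21) + " after " + slot.compileCount() + " compilation(s)");
    }
}
```

Nothing is generated until the first call, and later calls skip the trampoline entirely, which is the property that makes lazy generation cheap.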
Optimizations
• In and out of SSA
• Applied to all levels of IR
  – Loop peeling, value numbering, String append explosion, type check removal, sign extension elimination, copy propagation, bounds check removal, virtual to fixed calls, inlining, if short circuiting, straightening, strength reduction, constant propagation, dead code removal, out of loop hoisting, explode objects and array copies, boxing & unboxing removal, local escape analysis, ASM peephole optimization, redundant memory access removal, etc.
• Support for regionalized IRs
• Graph Fusion Register Allocator
Optimization Targets
• Thread sampling
• Partly taken over by safepoint based approach in R28
• Some code instrumentation, for example for inlining path
  – Not in the general case, e.g. invocation counters
• Hardware sampling where available
  – Only good thing about IA64?
  – Could also match e.g. L2 misses to program points
• Bugging the processor manufacturers since 2002 about userland PC sample buffer.
• JRockit VE: x1000 more samples
  – Significantly shorter warmup, proven
HotSpot Style OSR and Deoptimization
• We’ve never found a practical use case.
  – So we can’t ever swap out the main function with the microbenchmark loop. Who cares?
• An assumption is invalidated
  – Either patch code directly or use a guard when generating it in the first place
• A large assumption
  – Write a trap in the code and schedule lazy regeneration of entire method
• Not strictly true for dynamic languages
“Garbage collecting code”
• Code kept in binary tree of code blocks ~64M
  – More if large pages enabled
• Classloader unloading → garbage collection
• Reference count to active code modified when backpatching
• Specialized usage of code blocks.
  – Trampolines only
  – Optimized code only
Bytecode is bad – kill it quickly
• What’s with the goto:s?
• Why can it express more than Java source code?
  – OK we understand the multi-language concept, we sorta forgive you.
  – But man, dominators and loop analysis – that’s a lot of compile time
• …and why is it a stack machine AND a register machine with 65535 registers at the same time!?
• Initially tried to reconstruct ASTs
  – Obfuscators etc. made this pretty hopeless.
• ~15% of the klocs in JRockit/codegen do flow control analysis on the goto:s
The IR
• Use IR everywhere (or Java)
• The IR should ideally reflect any of several pluggable frontends.
  – We played around with CLR a bit.
  – These days – dynamic languages :-)
• No Sea of Nodes
• No HotSpot style “high level IR is low level”
The IR – Design Rationale
• We had some compiler experience – wanted to be on track quickly. Do it the traditional way.
• We are not “wrong”. LLVM is very similar.
• Tiered: highest tier == always high level
  – Hardware agnostic.
  – No architecture specific memory ops
• Tiered: lowest tier == always the native architecture, instruction for instruction.
  – A gradual transition.
  – A CPU has no sea of nodes.
The IR
• Highest IR level may have operations as operands
• Intrinsics everywhere
  – arraycopy, membar, cmpuXX, sse4IndexOf, doubleToLongBits, crypto, Math.sin and so on…
• Regret not doing more in SSA form
The IR Info “database”
• Lazily computable information
  – Liveness
  – Dominators
  – Loop information
  – Aliases
  – Type inference
  – Ranges
  – Nullness analysis
  – …
  – Invalidate on modification.
• Not a very stable model.
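The "compute lazily, invalidate on modification" pattern can be sketched in a few lines of Java. Everything here is hypothetical (the class IrInfoCache, string keys for analyses, a single invalidate() that drops every result); the slides do not say how JRockit keyed or invalidated its analyses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a lazily computed, invalidate-on-modification info cache.
final class IrInfoCache {
    private final Map<String, Supplier<?>> analyses = new HashMap<>();
    private final Map<String, Object> results = new HashMap<>();

    // Register how to compute a piece of information, e.g. "liveness" or "dominators".
    void register(String name, Supplier<?> analysis) {
        analyses.put(name, analysis);
    }

    // Run the analysis only when somebody asks for it, then remember the result.
    Object get(String name) {
        if (!results.containsKey(name)) {
            results.put(name, analyses.get(name).get());
        }
        return results.get(name);
    }

    // Any transformation that changes the IR calls this; here all cached results are dropped.
    void invalidate() {
        results.clear();
    }

    public static void main(String[] args) {
        IrInfoCache info = new IrInfoCache();
        info.register("loops", () -> "loop info computed");
        System.out.println(info.get("loops"));
    }
}
```

Dropping everything on any modification is the simplest policy; it also hints at why such a model can be fragile, since a single transform that forgets to invalidate leaves stale analysis results behind.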
Object layout and types
• Object headers should be fixed sized.
• JRockit object header is 32 + 32 bits
• All platforms, with some content variations.
• [Grove] ramblings on object models
• Type tree similar to [Krall, Vitek, Horspool]
Livemaps (oopmaps)
• Registers and stack slots on the local frame that contain objects.
• Nothing strange here. Required for non-conservative garbage collection of any sort.
• Internal pointer bit
• Forms the root set.
• Rollforwarding vs the safepoint approach
Memory management
• Concurrent collection
  – Your basic generational concurrent mark and sweep collector [Printezis, Detlefs]
  – Supports multigeneration (>1) young spaces.
    • Combats heavy object allocation situations.
    • Adaptively balanced against copy overhead
  – Write barriers before object writes
  – Minimize stopping the world
  – Young collections use a variant of stop & copy
• Can also run with a parallel policy
  – Stop the world and clean up quickly
  – Only throughput oriented
  – No write barriers, as there is no need for a card table
Mark & Sweep
• Backbone of GC based on traditional tri-color mark and sweep
• Adaptive thread usage and additional concurrency
• Two colors – not three.
  – Object is in one of two sets
  – Live objects: grey bits (mix of grey & black objects in traditional tri-coloring)
  – Distinction handled by putting grey objects in thread local queues for each GC thread.
  – Parallel threads can work on thread local data
  – Efficient prefetching is possible due to FIFO order.
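A minimal, single-threaded Java sketch of the two-color scheme follows. The heap model (objects as integer ids, an edges array of outgoing references) and the class name TwoColorMark are assumptions made for illustration, and the one FIFO queue here stands in for what the slide describes as one thread-local queue per GC thread.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

// Hypothetical sketch: marking with a single "grey bit" per object plus a FIFO work queue.
final class TwoColorMark {
    // edges[i] lists the ids of the objects referenced by object i.
    static BitSet mark(int[][] edges, int[] roots) {
        BitSet greyBits = new BitSet(edges.length);     // bit set = object is live
        ArrayDeque<Integer> queue = new ArrayDeque<>(); // live objects not yet scanned
        for (int root : roots) {
            if (!greyBits.get(root)) {
                greyBits.set(root);
                queue.add(root);
            }
        }
        // An object in the queue is "grey" in the classic sense; once scanned and
        // dequeued it is effectively "black", but its bit stays the same.
        while (!queue.isEmpty()) {
            int obj = queue.poll();                     // FIFO order
            for (int child : edges[obj]) {
                if (!greyBits.get(child)) {
                    greyBits.set(child);
                    queue.add(child);
                }
            }
        }
        return greyBits;                                // clear bits are garbage
    }

    public static void main(String[] args) {
        int[][] edges = { {1}, {2}, {}, {4}, {3} };
        System.out.println("live objects: " + mark(edges, new int[] {0}));
    }
}
```

Because membership in the queue carries the grey/black distinction, the mark bitmap itself only needs one bit per object.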
Other nice features
• No permgen!!! Ever!
• Pinned objects.
  – Fast memory buffers.
  – Also enable non-contiguous heaps.
• Compaction
  – “Internal and external”.
  – G1 evacuates regions instead with a stop the world-and-copy policy similar to JRockit YC
Memory management
• Concurrent GC has an additional set: live bits
  – Contains all live objects in the system, including the newly created ones.
  – JRockit can quickly find objects that have been created during a concurrent mark phase.
  – Card tables
    • Not just for generational GC
    • Also to avoid searching the entire live object graph when a concurrent mark phase cleans up.
    • Just look at dirty cards at the end of the mark phase.
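The card-table mechanism can be sketched as follows. The 512-byte card size, the class name CardTable and the method names are assumptions for illustration only; the slides give no such details for JRockit. A write barrier dirties the card covering the written location, and the end of the mark phase only revisits dirty cards.

```java
// Hypothetical sketch of a card table with a write barrier.
final class CardTable {
    static final int CARD_SHIFT = 9;              // assumed card size: 2^9 = 512 bytes
    private final byte[] cards;

    CardTable(long heapBytes) {
        long cardSize = 1L << CARD_SHIFT;
        cards = new byte[(int) ((heapBytes + cardSize - 1) >>> CARD_SHIFT)];
    }

    // Called on every reference store; heapOffset is the offset of the written field.
    void writeBarrier(long heapOffset) {
        cards[(int) (heapOffset >>> CARD_SHIFT)] = 1;
    }

    // Returns the indices of all dirty cards and cleans them.
    int[] drainDirty() {
        int count = 0;
        for (byte card : cards) {
            if (card != 0) count++;
        }
        int[] dirty = new int[count];
        int next = 0;
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] != 0) {
                dirty[next++] = i;
                cards[i] = 0;
            }
        }
        return dirty;
    }

    public static void main(String[] args) {
        CardTable table = new CardTable(4096);
        table.writeBarrier(600);
        System.out.println("dirty cards: " + table.drainDirty().length);
    }
}
```

The collector then rescans only the objects on the returned cards instead of walking the whole live object graph again.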
Young Collections
• A variant of stop and copy is used.
  – All threads are halted and objects are deleted or promoted
  – Hierarchical breadth first copy for cache locality
    • Parallelizes nicely
    • Many threads always harvest a young space
• Young and old collections may occur at same time.
  – All bit sets and data structures can be shared as long as the old collection is guaranteed to see all cards that have become dirty during a concurrent phase. (Extra card table to record this “difference” – “modified union set”)
  – Keep this intact for old collection
Thread Local Allocation
• Thread local allocation
• Thread local areas are roughly L2 cache sized and objects are allocated here before they are forced upon the heap
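Allocation inside a thread local area is typically a pointer bump with no locking. The Java sketch below illustrates that under stated assumptions: addresses are modelled as plain long values, objects are 8-byte aligned, and a return value of -1 means the area is exhausted and the thread would have to request a new area from the shared heap. None of these details come from the slides.

```java
// Hypothetical sketch of bump-pointer allocation in a thread local area (TLA).
final class ThreadLocalArea {
    private final long end;   // first address past the area
    private long top;         // next free address

    ThreadLocalArea(long start, long sizeBytes) {
        this.top = start;
        this.end = start + sizeBytes;
    }

    // Only the owning thread allocates here, so no synchronization is needed.
    long allocate(long sizeBytes) {
        long aligned = (sizeBytes + 7) & ~7L;   // round up to assumed 8-byte alignment
        if (end - top < aligned) {
            return -1;                          // area exhausted: fall back to the heap
        }
        long address = top;
        top += aligned;
        return address;
    }

    public static void main(String[] args) {
        ThreadLocalArea tla = new ThreadLocalArea(1000, 32);
        System.out.println("first object at " + tla.allocate(10));
    }
}
```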
Compressed References
• For less than 4 (or 4*x) GB of maximum heap size
• Use 32 bit pointers (or 32 + log2(x) bits)
// Heap below 4 GB: the low 32 bits identify the object.
CompRef compress(Ref ref) {
    return (uint32_t)ref; // truncate reference to 32 bits
}
Ref decompress(CompRef ref) {
    return globalHeapBase | ref;
}
// Heap below 4*x GB, with x taken to be objectAlignment: drop the alignment bits too.
CompRef compress(Ref ref) {
    return (uint32_t)(ref >> log2(objectAlignment));
}
Ref decompress(CompRef ref) {
    // widen to the full reference width before shifting, or the top bits are lost
    return globalHeapBase | ((Ref)ref << log2(objectAlignment));
}
Deterministic GC
• QoS level for latencies. “No more than X ms”
• Down to single digits on modern x86 hardware
• Caveat: live data on heap is the main constraint.
  – Up to 50% of heap live data still feasible
Deterministic GC – How?
• Greedy strategy
  – Postpone stopping the world for as long as possible.
  – Maybe the problem goes away and we don’t have to stop the world
• Split up everything into work packets
  – Drop them at any time.
• Efficient parallelization.
  – Mark phase is 90% of GC time
• Efficient heuristics
  – Some more work in e.g. write barriers
Threads and Synchronization
• A java.lang.Thread is a native thread.
  – Interesting, though: thread pooling and pseudo thin-threads are back, for example in Akka.
  – Java 8 – Collection.parallelStream
  – The world is moving towards implicit parallelism in general
• Most of the JRockit thread code and adaptivity logic is written in Java
• Locks are thin or fat
  – Adaptive inflation and deflation
• Lazy locking (biased locking supported)
  – Adaptive heuristics for banning and retrying the lazy approach.
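Threads and Synchronization
import java.util.concurrent.atomic.AtomicInteger;
public class PseudoSpinlock {
    private static final int LOCK_FREE = 0;
    private static final int LOCK_TAKEN = 1;
    // the lock word; getAndSet plays the role of the atomic cmpxchg instruction
    private final AtomicInteger lock = new AtomicInteger(LOCK_FREE);
    public void lock() {
        // burn cycles until the exchange hands back LOCK_FREE
        while (lock.getAndSet(LOCK_TAKEN) == LOCK_TAKEN) {
            Thread.onSpinWait(); // optional micropause
        }
    }
    public void unlock() {
        int old = lock.getAndSet(LOCK_FREE);
        // guard against recursive locks
        assert old == LOCK_TAKEN;
    }
}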
Threads and Synchronization
• Locks are thin when first taken
• Time spent in lock and times taken trigger inflation
• wait or notify immediately inflates a lock
• Fat locks are also deflated when uncontended for too long
Lock Pairing
• Bytecode again – no restriction on matching monitorenter with monitorexit
• Not all of them can be analyzed by the JIT
• We can store what we know, and make unlocks quick.
  – Lock tokens (the object OR 3 bits)
    • Thin, fat, recursive, lazily taken, unmatched
  – Livemap system contains nesting order.
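"The object OR 3 bits" suggests the familiar trick of storing a small tag in the low bits of an aligned pointer. The Java sketch below shows that trick under assumptions that are not in the slides: object addresses are modelled as 8-byte-aligned long values, and the five lock kinds are encoded as the numbers 0 to 4 (JRockit may well have used the bits as independent flags instead).

```java
// Hypothetical sketch of a lock token: an aligned object address with 3 tag bits.
final class LockToken {
    static final long KIND_MASK = 0b111;
    // Assumed encoding of the kinds listed on the slide.
    static final int THIN = 0, FAT = 1, RECURSIVE = 2, LAZY = 3, UNMATCHED = 4;

    static long make(long objectAddress, int kind) {
        if ((objectAddress & KIND_MASK) != 0) {
            throw new IllegalArgumentException("address must be 8-byte aligned");
        }
        if (kind < 0 || kind > KIND_MASK) {
            throw new IllegalArgumentException("kind must fit in 3 bits");
        }
        return objectAddress | kind;
    }

    // Recover the object: mask the tag bits away.
    static long object(long token) {
        return token & ~KIND_MASK;
    }

    // Recover what was known about the lock when it was taken.
    static int kind(long token) {
        return (int) (token & KIND_MASK);
    }

    public static void main(String[] args) {
        long token = make(0x1000, FAT);
        System.out.println("object=0x" + Long.toHexString(object(token)) + " kind=" + kind(token));
    }
}
```

With the kind stored alongside the object at lock time, the matching unlock can dispatch on the token without re-inspecting the lock state.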
Optimizations
• A lot of smallish codegen transforms: e.g. lock fusion
• “Fat spin”
• Lazy unlocking (biased locking)
  – Start assuming all locks are lazy. Tag thin locks as lazily locked.
  – If object already lazily locked
    • If it’s the same thread: profit
    • Else – stop the lock holder, detect the “real” lock state by stack walk. Convert to thin lock or forcefully unlock it
  – Transfer bits
  – Heuristics: object and class banning. Ageing.
Mission Control
• Use “free” runtime information!
  – JRockit (Java) Mission Control
    • JRockit (Java) flight recorder
    • Memory leak detector (JRockit only)
    • Management console
• $JAVA_HOME/bin/JCMD (used to be JRCMD)
• Everything in the VM abstracted into an event that may or may not have a duration
• Soon: public API
Java Flight Recorder
• Always on
  – Excellent for debugging and analysis of crashes
  – Can be set to record more intrusively for periods in production
    • E.g. extensive lock profiling
• Everything is an event
• Buffered recording – the last n seconds available at any crash or when a command is given.
• Very fine precision.
  – Multimedia timers and system hardware support required for e.g. latencies
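The "last n seconds" buffering can be sketched as a time-windowed queue of events. This is not the Flight Recorder API: the class RecordingWindow, the Event fields and the millisecond timestamps are hypothetical, and the sketch assumes events are recorded in non-decreasing start-time order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a buffered recording that keeps only the most recent window.
final class RecordingWindow {
    // Everything is an event; durationMillis == 0 models an instant event.
    static final class Event {
        final String name;
        final long startMillis;
        final long durationMillis;

        Event(String name, long startMillis, long durationMillis) {
            this.name = name;
            this.startMillis = startMillis;
            this.durationMillis = durationMillis;
        }
    }

    private final ArrayDeque<Event> buffer = new ArrayDeque<>();
    private final long windowMillis;

    RecordingWindow(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Append the event and evict everything older than the window.
    void record(Event event) {
        buffer.addLast(event);
        long cutoff = event.startMillis - windowMillis;
        while (buffer.peekFirst().startMillis < cutoff) {
            buffer.removeFirst();
        }
    }

    // What would be written out at a crash or on an explicit dump command.
    List<Event> dump() {
        return new ArrayList<>(buffer);
    }

    public static void main(String[] args) {
        RecordingWindow window = new RecordingWindow(1000);
        window.record(new Event("gc", 0, 12));
        window.record(new Event("lock", 1500, 0));
        System.out.println("events retained: " + window.dump().size());
    }
}
```

Because old events are discarded as new ones arrive, memory use stays bounded by the window rather than by the uptime of the process.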
The Management Console
• Peek into the running production JVM
• Add triggers on events
• Interact with the VM: force GC etc.
Is the JVM an OS?
• Add a cooperative aspect to thread switching
• Zero-copy networking code
• Reduce cost of entering OS
• Balloon driver
• Runs only on hypervisor
• Facilitates pauseless GC