Upload
others
View
14
Download
0
Embed Size (px)
Citation preview
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
VectorizaAoninHotSpotJVM
VladimirIvanovHotSpotJVMCompilerOracleCorp.April8,2017
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• SIMDISAextensions– packedvectorsonx86
• JVM– auto-vectorizaAon,intrinsics
• Future– JDK9– VectorAPI
2
Agenda
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
0x11529c8c0:mov%eax,-0x16000(%rsp)0x11529c8c7:push%rbp0x11529c8c8:sub$0x20,%rsp0x11529c8cc:mov%rdx,(%rsp)0x11529c8d0:mov%rsi,%rbp0x11529c8d3:movabs$0x7c0013d10,%rsi0x11529c8dd:nop0x11529c8de:nop0x11529c8df:nop0x11529c8e0:vzeroupper0x11529c8e3:callq0x00000001152418a00x11529c8e8:mov%rax,%rbx0x11529c8eb:mov(%rsp),%r100x11529c8ef:vmovdqu0x10(%r10),%ymm10x11529c8f5:vmovdqu0x10(%rbp),%ymm00x11529c8fa:vpaddd%ymm0,%ymm1,%ymm00x11529c8fe:vmovdqu%ymm0,0x10(%rbx)0x11529c903:mov%rbx,%rax0x11529c906:vzeroupper0x11529c909:add$0x20,%rsp0x11529c90d:pop%rbp
0x11529d240:mov%eax,-0x16000(%rsp)0x11529d247:push%rbp0x11529d248:sub$0x30,%rsp0x11529d24c:mov%rcx,%rbp0x11529d24f:vmovdqu0x10(%rsi),%ymm00x11529d254:vmovdqu0x10(%rdx),%ymm10x11529d259:vpaddd%ymm0,%ymm1,%ymm00x11529d25d:vmovdqu%ymm0,(%rsp)0x11529d262:movabs$0x7c0013d10,%rsi0x11529d26c:vzeroupper0x11529d26f:callq0x00000001152418a00x11529d274:mov%rax,%rbx0x11529d277:vmovdqu0x10(%rbp),%ymm10x11529d27c:vmovdqu(%rsp),%ymm00x11529d281:vpaddd%ymm0,%ymm1,%ymm00x11529d285:vmovdqu%ymm0,0x10(%rbx)0x11529d28a:mov%rbx,%rax0x11529d28d:vzeroupper0x11529d290:add$0x30,%rsp0x11529d294:pop%rbp0x11529d295:test%eax,-0xb78729b(%rip)
x86Assembly
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
mov0x10(%src),%dstvs
movdst,[src+10h]
4
AssemblySyntaxAT&T
Intel
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 5
“TheFreeLunchIsOver”,HerbSuZer,2005
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• Machines– Hadoop(Map/Reduce),ApacheSpark
• Cores/hardwarethreads– JavaStreamAPI– Fork/Joinframework
• CPUSIMDextensions– x86:SSE...,AVX,…,AVX-512
6
GoingParallel<10^3-10^6
<10s-100s
<10s
(servers)
(threads)
(elements)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• CPUs– SIMDISAextensions(SingleInstrucAon-MulApleData)– threads(MulApleInstrucAons-MulApleData)
• Co-processors
– GPUs– FPGAs– ASICs
• DataAnalyAcsAccelerator(DAX)onSPARC
7
GoingParallel:CPUsvsCo-processors
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• Machines– upto12cards/server
• IntelXeonPhi
– 4threadsx72cores
• AVX-512
– 2units/core
8
SIMDvsMIMD<10^3-10^612x
<10s-100s
<10s
(servers)
288(threads)
vs
16SP(elements)
4608-way
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• x86:MMX,SSE,AVX– 864-bitregisters(MMX)to32512-bitregisters(AVX-512)
• ARM:NEON– 32128-bitregisters
• SPARC:VIS– 3264-bitregisters
• POWER:VMX/AlAVec– 32128-bitregisters
SIMDtoday
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• Wide(mulA-word)registers– 128-bit(xmm)– 256-bit(ymm)– 512-bit(zmm)
• InstrucAonsonpackedvectors– packedinaregisterormemorylocaAon– shortvectorsofinteger/FPnumbers
• 2xdouble,4xint,8xshort– hard-codedvectorsize
10
x86SIMDExtensions
0128256512
xmm0ymm0zmm0
xmm00326496128
intintintint
double double
short short short short short short short short
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
//LoadA[i:i+3]vmovdqu0x10(%rcx,%rdx,4),%xmm0
//LoadB[i:i+3]
vmovdqu0x10(%r10,%rdx,4),%xmm1
//A[i:i+3]+B[i:i+3]
vpaddd%xmm0,%xmm1,%xmm2
//StoreintoC[i:i+3]
vmovdqu%xmm2,0x10(%r8,%rdx,4)
11
x86SIMDExtensions
A[i+0]A[i+1]A[i+2]A[i+3]
B[i+0]B[i+1]B[i+2]B[i+3]
+ + + +
= = = =
C[i+3]C[i+2]C[i+1]C[i+0]
xmm0
xmm1
xmm2
memory
int[]
A[i+3]A[i+2]A[i+1]A[i+0]
memory
int[]
B[i+3]B[i+2]B[i+1]B[i+0]int[]B[]
A[]
C[]
C[i+0]C[i+1]C[i+2]C[i+3]
3296128
registers
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Year Name Registers1997 MMX 64-bit mm0-71999 SSE 128-bit xmm0-72001 SSE2 128-bit xmm0-152004 SSE3 128-bit xmm0-152006 SSSE3 128-bit xmm0-152006 SSE4.1 128-bit xmm0-152008 SSE4.2 128-bit xmm0-152011 AVX 256-bit ymm0-152013 AVX2 256-bit ymm0-152013 FMA3 256-bit ymm0-152015 AVX-512 512-bit zmm0-31(k0-7)
12
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
HowtouAlizeSIMDinstrucAons?
13
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• AutomaAc– sequenAallanguagesandpracAcesgetsintheway
• Semi-automaAc– Giveyourcompiler/runAmehintsandhopeitvectorizes– e.g.,OpenMP4.0#pragmaompsimd
• Codeexplicitly
– e.g.,SIMDinstrucAonintrinsics
14
VectorizaAontechniques
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
IfthecodeiscompiledforaparJcularinstrucJonsetthenitwillbecompaJblewithallCPUsthatsupportthis
instrucAonsetoranyhigherinstrucAonset,butpossiblynotwithearlierCPUs.
15
Problem
SSE4.2<<AVX-512
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Idea:MakecriAcalpartsofthecodeinmulJpleversionsfordifferentCPUs.
• Forexample,provide:
– AVX2&SSE4.2specializaAons– genericversionthatiscompaAblewitholdmicroprocessors
• TheprogramshouldautomaAcallydetectwhichinstrucAonsetissupportedandchoosetheappropriateversion.
16
CPUDispatching
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
“Itisquiteexpensive-intermsofdevelopment,tesAngandmaintenance-tomakeapieceofcodeinmulApleversions,eachcarefullyopAmizedandfine-tunedfora
parAcularsetofCPUs.“
“OpAmizingsowwareinC++”,AgnerFog
17
CPUDispatching
hZp://www.agner.org/opAmize/opAmizing_cpp.pdf
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• OpAmizingforpresentprocessorsratherthanfutureprocessors• Thinkingintermsofspecificprocessormodelsratherthanprocessorfeatures
• Assumingthatprocessormodelnumbersformalogicalsequence• Failuretohandleunknownprocessorsproperly• UnderesAmaAngthecostofkeepingaCPUdispatcherupdated• Makingtoomanybranches• IgnoringvirtualizaAon
18
CPUdispatching:Commonpiyalls
“OpAmizingsowwareinC++”,AgnerFog
hZp://www.agner.org/opAmize/opAmizing_cpp.pdf
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
JVMisinagoodposiAon:1. Javabytecodeisplayorm-agnosAc
2. CPUprobingatrunAme(atstartup)– knowseverythingaboutthehardwareitexecutesatthemoment
3. DynamiccodegeneraAon– onlyuseinstrucAonswhichareavailableonthehost
JVMandSIMDtoday
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• Hotspotsupportssomeofx86SIMDinstrucAons
• AutomaAcvectorizaAonofJavacode– SuperwordopAmizaAonsinHotSpotC2compilertoderiveSIMDcodefromsequenAalcode
• JVMintrinsics– Arraycopying,filling,andcomparison
JVMandSIMDtoday
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
JVMIntrinsics
21
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
“AmethodisintrinsifiediftheHotSpotVMreplacestheannotatedmethodwithhand-wriOenassemblyand/orhand-wriOencompilerIR--acompilerintrinsic--toimproveperformance.”
@HotSpotIntrinsicCandidateJavaDoc
JVMIntrinsics
publicfinalclassjava.lang.Class<T>implements…{@HotSpotIntrinsicCandidatepublicnativebooleanisInstance(Objectobj);
hZp://hg.openjdk.java.net/jdk9/jdk9/jdk/file/Ap/src/java.base/share/classes/jdk/internal/HotSpotIntrinsicCandidate.java
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• Arraycopy– System.arraycopy(),Arrays.copyOf(),Arrays.equals()
• Arraymismatch(@since9)– Arrays.mismatch(),Arrays.compare()– basedonArraysSupport.vectorizedMismatch()
23
VectorizedJVMIntrinsics
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Auto-vectorizaAonbyJIT-compiler
24
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 25
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 26
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
SuperWordopAmizaAonis:1. implementedonlyinC2JIT-compiler
hotspot/src/share/vm/opto/c2_globals.hpp:product(bool,UseSuperWord,true,"Transformscalaroperationsintosuperwordoperations”)
2. appliedonlytounrolledloops– unrollingisperformedonlyforcountedloops
27
VectorizaAon:Prerequisites
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
“Countedloopsarealltrip-countedloops,withexactly1trip-counterexitpath(andmaybesomeotherexitpaths).Thetrip-counterexitisalwayslastintheloop.Thetrip-counterhavetostridebyaconstant;theexitvalueisalsoloopinvariant.”
hotspot/src/share/vm/opto/loopnode.hpp:136
28
CountedLoopsvsTrip-CountedLoops
hZp://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/b725de5c248a/src/share/vm/opto/loopnode.hpp#l136
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=start;i<limit;i+=stride){//loopbody}inti=start;while(i<limit){//loopbodyi+=stride;}
29
CountedLoop
• limitisloopinvariant• strideisconstant(compile-Ame)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
1. $java…-XX:+PrintCompilation-XX:+TraceLoopOpts…– availableonlyindebugbuilds1291bCountedLoop::test1(28bytes)CountedLoop:N100/N83limit_checkpredicatedcounted[0,100),+1(-1iters)
2. $java…-XX:+PrintAssembly…– andeyeballgeneratedcode
30
Howtodetect?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<100;i++){/*loopbody*/}$java…-XX:+TraceLoopOpts…CountedLoop:N100/N83…counted[0,100),+1(-1iters)
31
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=start;i<100;i++){/*loopbody*/}CountedLoop:N104/N84…counted[int,100),+1(-1iters)
32
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=start;i<end;i++){/*loopbody*/}CountedLoop:N104/N84…counted[int,int),+1(-1iters)
33
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=start;i<end1&&i<end2;i++){…}CountedLoop:N104/N84…counted[int,int),+1(-1iters)
34
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=start;i<end1||i<end2;i++){…}Loop:N101/N93limit_checkpredicatedsfpts={93}PartialPeelLoop:N101/N93limit_checkpredicatedsfpts={93}CountedLoop:N136/N64counted[int,int),+1(-1iters)
35
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=start;i<end;i+=2){/*loopbody*/}CountedLoop:N108/N85…counted[int,int),+2(-1iters)
36
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=start;i<end;i+=d){/*loopbody*/}for(inti=start;i<end;i*=2){/*loopbody*/}for(inti=start;i<end;i++){…if(…){end++;}…}
37
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(longl=0;l<100;l++){…}for(inti=0;i<100;i++){…}for(byteb=0;b<100;b++){…}for(shorts=0;s<100;s++){…}for(charc=0;c<100;c++){…}
38
CountedLoop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<100;i++){…f();//notinlined
}CountedLoop:…counted[0,100),+1(-1iters)has_callhas_sfpt
39
CountedLoop?
Butnounrollinghappens,hencenovectorizaAon.
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
int[]A,B,C;for(inti=0;i<MAX;i++){A[i]=B[i]+C[i];}
40
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<MAX-4;i+=4){//mainloopA[i+0]=B[i+0]+C[i+0];A[i+1]=B[i+1]+C[i+1];A[i+2]=B[i+2]+C[i+2];A[i+3]=B[i+3]+C[i+3];}//post-loop
41
Loopunrolling(4Ames)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<MAX-4;i+=4){A[i+0]=B[i+0]+C[i+0];A[i+1]=B[i+1]+C[i+1];A[i+2]=B[i+2]+C[i+2];A[i+3]=B[i+3]+C[i+3];}
42
Loopunrolling
A[i:i+3]B[i:i+3]C[i:i+3]
isomorphic
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0;B[i:i+3]=>xmm0
0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0;C[i:i+3]+xmm0=>xmm0
0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4);xmm0=>A[i:i+3]
0x10a4249e4:add$0x4,%edx;i+=4
0x10a4249e7:cmp%r9d,%edx;
0x10a4249ea:jl0x10a4249d0;if(i<(MAX-4))repeat
43
8u121,AVX2(Haswell)
Mainloop
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(longl=0;l<MAX-4;l+=4){//mainloopA[l+0]=B[l+0]+C[l+0];A[l+1]=B[l+1]+C[l+1];A[l+2]=B[l+2]+C[l+2];A[l+3]=B[l+3]+C[l+3];}Nope…NounrollingduringcompilaAon,hencenovectorizaAon.
44
Manualunrolling?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
inti=0;for(;i<MAX-4;i+=4){//mainloopA[i:i+3]=B[i:i+3]+C[i:i+3];}for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}
45
Vectorizedloop
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
//main-loop
0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0
0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0
0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4)
0x10a4249e4:add$0x4,%edx
0x10a4249e7:cmp%r9d,%edx
0x10a4249ea:jl0x10a4249d0
...
//post-loop
0x10a4249f4:mov0x10(%r10,%rdx,4),%ebx;A[i]=>ebx
0x10a4249f9:add0x10(%rcx,%rdx,4),%ebx;B[i]+ebx=>ebx
0x10a4249fd:mov%ebx,0x10(%r8,%rdx,4);ebx=>C[i]
0x10a424a02:inc%edx;i++
0x10a424a04:cmp%r11d,%edx;
0x10a424a07:jl0x10a4249f4;if(i<MAX)repeat46
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
//???
0x10a42499d:mov0x10(%r10,%rdx,4),%r9d
0x10a4249a2:add0x10(%rcx,%rdx,4),%r9d
0x10a4249a7:mov%r9d,0x10(%r8,%rdx,4)
0x10a4249ac:inc%edx
0x10a4249ae:cmp%edi,%edx
0x10a4249b0:jl0x10a42499d
…
//main-loop
0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0
0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0
0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4)
0x10a4249e4:add$0x4,%edx
0x10a4249e7:cmp%r9d,%edx
0x10a4249ea:jl0x10a4249d047
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
inti=0,prefix=???;for(;i<prefix;i++){//pre-loopA[i]=B[i]+C[i];}for(;i<MAX-4;i+=4){//mainloopA[i:i+3]=B[i:i+3]+C[i:i+3];}for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}
48
Vectorizedloop
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 49
Alignment
A[i+2]A[i+1]A[i]A[0]int[] A[MAX-4]+160
8 8 64MAX
Pre-loop Mainloop Post-loopHeader
Cacheline
mov mov vmovdqu vmovdqu mov mov
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
T nounrolling notvectorized vectorized
byte 592±6 506±6 159±4short 541±7 495±4 140±3char 537±4 493±4 141±2int 532±5 490±4 154±2long 533±8 492±5 157±2float 530±4 489±7 155±2double 526±5 483±4 172±3
50
<anyT>voidadd(T[]A,T[]B,T[]C){for(inti=0;i<MAX;i++){A[i]=B[i]+C[i];}}
MAX=1000Corei7,1x2x2,Haswell(AVX2)8u121,macos-x64,ns/op
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0;B[i:i+3]=>xmm0
0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0;C[i:i+3]+xmm0=>xmm0
0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4);xmm0=>A[i:i+3]
0x10a4249e4:add$0x4,%edx;i+=4
0x10a4249e7:cmp%r9d,%edx;
0x10a4249ea:jl0x10a4249d0;if(i<(MAX-4))repeat
51
8u121,AVX2(Haswell)
Mainloop
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0;B[i:i+3]=>xmm0
0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0;C[i:i+3]+xmm0=>xmm0
0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4);xmm0=>A[i:i+3]
0x10a4249e4:add$0x4,%edx;i+=4
0x10a4249e7:cmp%r9d,%edx;
0x10a4249ea:jl0x10a4249d0;if(i<(MAX-4))repeat
52
8u121,AVX2(Haswell)
Hmm…Whynot256-bit?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
0x117023512:vmovdqu0x10(%rbx,%rcx,4),%ymm0
0x117023518:vpaddd0x10(%rdi,%rcx,4),%ymm0,%ymm0
0x11702351e:vmovdqu%ymm0,0x10(%r9,%rcx,4)
…
0x11702354f:vmovdqu0x70(%rbx,%r8,4),%ymm0
0x117023556:vpaddd0x70(%rdi,%r8,4),%ymm0,%ymm0
0x11702355d:vmovdqu%ymm0,0x70(%r9,%r8,4)
0x117023564:add$0x20,%ecx
0x117023567:cmp%r10d,%ecx
0x11702356a:jl0x117023512
53
jdk9-b163,AVX2(Haswell)
Allright,compilerproblem.Fixedin9.
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
0x117023512:vmovdqu0x10(%rbx,%rcx,4),%ymm0;iteration#1
0x117023518:vpaddd0x10(%rdi,%rcx,4),%ymm0,%ymm0;A[i:i+7]=B[…]+C[…]
0x11702351e:vmovdqu%ymm0,0x10(%r9,%rcx,4);
…;…
0x11702354f:vmovdqu0x70(%rbx,%r8,4),%ymm0;iteration#4
0x117023556:vpaddd0x70(%rdi,%r8,4),%ymm0,%ymm0;A[i+24:i+31]=…
0x11702355d:vmovdqu%ymm0,0x70(%r9,%r8,4);
0x117023564:add$0x20,%ecx;i+=32;//(4*8)
0x117023567:cmp%r10d,%ecx
0x11702356a:jl0x117023512
54
jdk9-b163,AVX2(Haswell)
Butwait…Whathappenedtomainloop?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 55
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
“…weleverageunrollfactorsfromthebaselineloopwhicharemuchlargertoobtainopJmumthroughputonx86architectures.TheupliwrangeonSpecJvm2008isseenonscimark.lu.{small|large}withupliwnotedat3%and8%respecAvely.Weseeasmuchas1.5xupliwonvectorcentricmicroslikereducAonsondefaultopAmizaAons.”
MichaelBerg,IntelJDK-8129920
56
JDK-8129920:VectorizedLoopUnrolling
hZps://bugs.openjdk.java.net/browse/JDK-8129920
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(;i<a;i++){…}//pre-loopfor(;i<MAX-4;i+=4){//mainloopA[i:i+3]=B[i:i+3]+C[i:i+3];}for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}
57
JDK9:VectorizedLoop:BeforeUnrolling
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(;i<MAX-step;i+=step){//mainloopA[i:i+V]=B[i:i+V]+C[i:i+V];A[i+V:i+2*V]=B[i+V:i+2*V]+C[i+V:i+2*V];A[i+2*V:i+3*V]=B[i+2*V:i+3*V]+C[i+2*V:i+3*V];A[i+3*V:i+4*V]=B[i+3*V:i+4*V]+C[i+3*V:i+4*V];}for(;i<MAX;i++){A[i]=B[i]+C[i];}//post-loopintstep=4/*unroll_factor*/*max_vector_size;intV=max_vector_size–1;
58
JDK9:VectorizedLoop:AwerUnrolling
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
intstep=unroll_factor*max_vector_size;for(;i<MAX-step;i+=step){…}//mainloop//NB!Upto(unroll_factor*max_vector_size)iterationsfor(;i<MAX;i++){A[i]=B[i]+C[i];}//post-loop
59
JDK9:VectorizedMainLoop
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
“theaddiAonofatomicunrolleddrainloopswhichprecedefix-upsegmentsandwhicharesignificantlyfasterthanscalarcode.TherequirementisthatthemainloopissuperunrolledawervectorizaAon.Iseeupto54%upliwonmicrobenchmarksonx86targetsforloopswhichpasssuperwordvectorizaAonandwhichmeettheabovecriteria.”
MichaelBerg,Intelhotspot-compiler-dev@ojn
60
JDK-8149421:VectorizedPostLoops
hZp://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2016-February/021205.html
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(;i<MAX-4;i+=4){//vectorizedpost-loopA[i:i+3]=B[i:i+3]+C[i:i+3];}//trip-countin[0;unroll_factor)for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}//trip-countin[0;max_vector_size)
61
JDK9:VectorizedPost-loop
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
T nounrolling notvectorized vectorized jdk9-b163
byte 592±6 506±6 159±4 69±3short 541±7 495±4 140±3 69±4char 537±4 493±4 141±2 68±2int 532±5 490±4 154±2 74±1long 533±8 492±5 157±2 141±1float 530±4 489±7 155±2 80±3double 526±5 483±4 172±3 167±2
62
<anyT>voidadd(T[]A,T[]B,T[]C){for(inti=0;i<MAX;i++){A[i]=B[i]+C[i];}}
MAX=1000Corei7,1x2x2,Haswell(AVX2)8u121,macos-x64,ns/op
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
MAX 8u121 jdk9-b163
0 2.0±0 2±0
ns/op
1 3.8±0 3.8±010 7.3±2 8.2±1100 22±1 17±110^3 153±4 73±310^4 2058±57 2025±1410^5 36±1 35±1
μs/op10^6 858±34 883±1410^7 8751±145 9144±14
63
Corei7,1x2x2,Haswell(AVX2)macos-x64T=int
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
int[]A;for(inti=0;i<MAX;i++)A[i]++;
64
[Constants]
0x102222b60:0x00000001
…
vmovq0x102222b60,%xmm0
vpunpcklqdq%xmm0,%xmm0,%xmm0
vinserti128$0x1,%xmm0,%ymm0,%ymm0
//Mainloop
vmovdqu0x10(%r10,%rcx,4),%ymm1
vpaddd%ymm0,%ymm1,%ymm1
vmovdqu%ymm1,0x10(%r11,%rcx,4)
add$0x8,%ecx
cmp%r9d,%ecx
jl…
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
int[]A;for(inti=0;i<MAX;i++)A[i]*=10;
65
//Mainloop
vmovdqu0x10(%r10,%r11,4),%ymm0
vpslld$0x1,%ymm0,%ymm1
vpslld$0x3,%ymm0,%ymm0
vpaddd%ymm0,%ymm1,%ymm0
vmovdqu%ymm0,0x10(%r10,%r11,4)
add$0x8,%r11d
cmp%r8d,%r11d
jl...
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
int[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[i]*C[i];
66
//Mainloop
vmovdqu0x10(%rcx,%rdx,4),%xmm0
vpmulld0x10(%r10,%rdx,4),%xmm0,%xmm0
vmovdqu%xmm0,0x10(%r8,%rdx,4)
add$0x4,%edx
cmp%r9d,%edx
jl…
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
float[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[2*i]*C[2*i];
67
StridedAccess
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
float[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[2*i]*C[2*i];
68
StridedAccess //Mainloop
…
vmovss0x18(%rax,%rcx,4),%xmm4
vaddss0x18(%rdx,%rcx,4),%xmm4,%xmm1
vmovss%xmm1,0x14(%r11,%r9,4)
vmovss0x20(%rax,%rcx,4),%xmm4
vaddss0x20(%rdx,%rcx,4),%xmm4,%xmm1
vmovss%xmm1,0x18(%r11,%r9,4)
vmovss0x28(%rdx,%rcx,4),%xmm4
vaddss0x28(%rax,%rcx,4),%xmm4,%xmm1
vmovss%xmm1,0x1c(%r11,%r9,4)
add$0x4,%r10d
...
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
float[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[2*i]*C[2*i];VGATHERD*inAVX2,but:• noscaZeroperaAons• onlyfloaAngpointvariants
69
StridedAccess //Mainloop
…
vmovss0x18(%rax,%rcx,4),%xmm4
vaddss0x18(%rdx,%rcx,4),%xmm4,%xmm1
vmovss%xmm1,0x14(%r11,%r9,4)
vmovss0x20(%rax,%rcx,4),%xmm4
vaddss0x20(%rdx,%rcx,4),%xmm4,%xmm1
vmovss%xmm1,0x18(%r11,%r9,4)
vmovss0x28(%rdx,%rcx,4),%xmm4
vaddss0x28(%rax,%rcx,4),%xmm4,%xmm1
vmovss%xmm1,0x1c(%r11,%r9,4)
add$0x4,%r10d
...
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
WhataboutUnsafe?
70
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
//On-heaplongoff=Unsafe.ARRAY_INT_BASE_OFFSET;for(inti=0;i<MAX;i++){//A[i]=B[i]+C[i]intval=U.getInt(B,off)+U.getInt(C,off);U.putInt(A,off,val);off+=Unsafe.ARRAY_INT_INDEX_SCALE;}
71
WhataboutUnsafe? //Mainloop
...
mov%rdx,%rdi
add%r9,%rdi
mov(%rcx),%r8d
...
add$0x20,%r9
add$0x8,%r10d
cmp%eax,%r10d
jl...
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
//Off-heaplongaddrA=U.allocateMemory(...);longaddrB=U.allocateMemory(...);longaddrC=U.allocateMemory(...);
for(inti=0;i<MAX;i++){longoff=i*4;intval=U.getInt(null,addrB+off)+U.getInt(null,addrC+off);U.putInt(null,addrA+off,val);}
72
WhataboutUnsafe? //Mainloop
…
mov0x0(%rbp,%rax,1),%r11d
add(%rdi,%rax,1),%r11d
mov%r11d,(%rbx,%rax,1)
…
add$0x8,%r10d
cmp%r14d,%r10d
jl…
8u121,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 73
Unsafe==Fast
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 74
Unsafe==Fast
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
ReducAons
75
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 76
HorizontalAddiAon
+ + + +
xmm1 xmm0
xmm2
VPHADDD%xmm0,%xmm1,%xmm2
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 77
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
publicintsum(int[]A){intsum=0;for(inta:A){sum+=a;}returnsum;}
78
//Mainloop
add0x10(%r8,%rcx,4),%eax
add0x14(%r8,%rcx,4),%eax
add0x18(%r8,%rcx,4),%eax
add0x1c(%r8,%rcx,4),%eax
add0x20(%r8,%rcx,4),%eax
add0x24(%r8,%rcx,4),%eax
add0x28(%r8,%rcx,4),%eax
add0x2c(%r8,%rcx,4),%eax
add$0x8,%ecx
cmp%r10d,%ecx
jl…
jdk9-b163,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
“ReducAonvectoropAmizaAoncouldbeexpensiveforsimpleexpressionsbecauseitusesseveraladdiAonalinstrucAonspervector.[…]WeneedtorestrictreducAonopAmizaAononlytocaseswhenitisbeneficial.”
8074981:RestrictreducAonopAmizaAon
79
hZps://jbs.oracle.com/browse/JDK-8074981
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
intdotProduct(int[]A,int[]B){intr=0;for(inti=0;i<MAX;i++){r+=A[i]*B[i];}returnr;}
80
//Vectorizedpost-loop
vmovdqu0x10(%rdi,%r11,4),%ymm0
vmovdqu0x10(%rbx,%r11,4),%ymm1
vpmulld%ymm0,%ymm1,%ymm0
vphaddd%ymm0,%ymm0,%ymm3
vphaddd%ymm1,%ymm3,%ymm3
vextracti128$0x1,%ymm3,%xmm1
vpaddd%xmm1,%xmm3,%xmm3
vmovd%eax,%xmm1
vpaddd%xmm3,%xmm1,%xmm1
vmovd%xmm1,%eax
add$0x8,%r11d
cmp%r8d,%r11d
jl0x117e23668
jdk9-b163,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
intdotProduct(int[]A,int[]B){intr=0;for(inti=0;i<MAX;i++){r+=A[i]*B[i];}returnr;}
81
//Vectorizedpost-loop
vmovdqu0x10(%rdi,%r11,4),%ymm0
vmovdqu0x10(%rbx,%r11,4),%ymm1
vpmulld%ymm0,%ymm1,%ymm0
vphaddd%ymm0,%ymm0,%ymm3
vphaddd%ymm1,%ymm3,%ymm3
vextracti128$0x1,%ymm3,%xmm1
vpaddd%xmm1,%xmm3,%xmm3
vmovd%eax,%xmm1
vpaddd%xmm3,%xmm1,%xmm1
vmovd%xmm1,%eax
add$0x8,%r11d
cmp%r8d,%r11d
jl0x117e23668
jdk9-b163,AVX2(Haswell)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 82
VPHADDD
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
FusedMulAply-Add(FMA)
83
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• SingleInstrucAon–MulApleNestedoperaAons
• Usecases– dotproduct
for(inti=0;i<MAX;i++)r=r+A[i]*B[i];
– matrixmulAplicaAon
for(intk=0;k<MAX;k++)r=r+A[i][k]*B[k][j];
84
FusedOperaAons
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Mnemonic Operands OperaJon
VFMADDPDyymm,ymm,ymm/m256
a=b*c+d
VFMADDPSyVFMADDPDx
xmm,xmm,xmm/m128VFMADDPSxVFMADDSD xmm,xmm,xmm/m64VFMADDSS xmm,xmm,xmm/m32
85
FMA4(onlyAMD)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Mnemonic OperaJon
VFMADD132… a=a*c+bVFMADD213… a=b*a+cVFMADD231… a=b*c+a
86
FMA3(bothIntel&AMD)
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
float[]A,B,C,D=…;
for(inti=0;i<MAX;i++){A[i]=B[i]*C[i]+D[i];}
87
//Vectorizedpost-loop
vmovdqu0x10(%r8,%r11,4),%ymm0
vmulps0x10(%rcx,%r11,4),%ymm0,%ymm0
vaddps0x10(%rax,%r11,4),%ymm0,%ymm0
vmovdqu%ymm0,0x10(%rdx,%r11,4)
add$0x8,%r11d
cmp%r10d,%r11d
jl…
jdk9/hs,AVX2(Haswell)
Vectorized,noFMA.
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
float[]A,B,C,D=…;
for(inti=0;i<MAX;i++){A[i]=Math.fma(B[i],C[i],D[i]);}
88
//Post-loop
vmovss0x10(%r11,%rbx,4),%xmm1
vmovss0x10(%r9,%rbx,4),%xmm0
vmovss0x10(%r8,%rbx,4),%xmm3
vfmadd231ss%xmm1,%xmm0,%xmm3
vmovss%xmm3,0x10(%r10,%rbx,4)
inc%ebx
cmp%edi,%ebx
jl...
jdk9/hs,AVX2(Haswell)Math.fma()//@since9
Notvectorized,scalarFMA.
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
8153998:Maskedvectorpostloops8080325:SuperWordloopunrollinganalysis
8151573:MulAversioningforrangecheckeliminaAon
8135028:supportforvectorizingdoubleprecisionsqrt
8076284:ImprovevectorizaAonofparallelstreams
8139340:SuperWordenhancementtosupportvectorcondiAonalmove(CMovVD)onIntelAVXcpu
…
89
JDK9:OtherEnhancementsinSuperWord
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• MaskedvectoroperaAonsif(B[i]>0)A[i]=B[i];
• ScaZer/GatherA[i]=B[i]+C[D[i]];//indirectaccess
• ConflictDetecAonA[B[i]]++; //histogram
90
WhatelseinAVX-512?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<MAX;i++)if(B[i]>0){A[i]=B[i];}
91
If-conversion
xmm0
B[i+0]B[i+1]B[i+2]B[i+3]3296128 64
B[i:i+3]>0 k1
B[i+3]A[i+2]B[i+1]A[i+0]int[]
0 01 1
VMOVDQU32%xmm0,%k1,…
A[]
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<MAX;i++)A[i]=B[i]+C[D[i]];“Gatherers”:fetchdataelementsusingvector-indexmemoryaddressing.
92
IndirectAccess
C[D[i]]C[D[i+1]]C[D[i+2]]C[D[i+3]]
xmm0
C[D[i+3]]C[D[i+2]]C[D[i+1]]C[D[i]]int[]C[]
3296128
VGATHERQPD
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<MAX;i++)A[B[i]]++;
93
Histogram
A[B[i]]A[B[i+1]]A[B[i+2]]A[B[i+3]]
xmm03296128
VPGATHERDD
A[B[i+3]]A[B[i+2]]A[B[i+1]]A[B[i]]int[]A[]
A[B[i+3]]+1A[B[i+2]]+1A[B[i+1]]+1A[B[i]]+1int[]A[]
VPSCATTERDD
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<MAX;i++)A[B[i]]++;
94
Histogram
A[B[i]]A[B[i+1]]A[B[i+2]]A[B[i+3]]
xmm03296128
VPGATHERDD
A[B[i+3]]A[B[i+2]]A[B[i+1]]A[B[i]]int[]A[]
A[B[i+3]]+1A[B[i+2]]+1A[B[i+1]]+1A[B[i]]+1int[]A[]
VPSCATTERDD
B[i]==B[j]?
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
for(inti=0;i<MAX;i++)A[B[i]]++;
95
Histogram
A[B[i]]A[B[i+1]]A[B[i+2]]A[B[i+3]]
xmm03296128
VPGATHERDD
A[B[i+3]]A[B[i+2]]A[B[i+1]]A[B[i]]int[]A[]
A[B[i+3]]+1A[B[i+2]]+1A[B[i+1]]+1A[B[i]]+1int[]A[]
VPSCATTERDD
ConflictdetecJon!VPCONFLICTD
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• 100sofvectorinstrucAonsonx86• IntelintrinsicinstrucAons
– MMX:~120– SSE:~130– SSE2/3/SSSE3/4.1/4.2:~260– AVX/AVX2:~380
VectorISAExtensions
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• 1000sofvectorinstrucAonsonx86• IntelintrinsicinstrucAons
– MMX:~120– SSE:~130– SSE2/3/SSSE3/4.1/4.2:~260– AVX/AVX2:~380– AVX-512:~3800
VectorISAExtensions
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
UTF-8 UTF-16
ASCII(1byte) 0aaaaaaa 000000000aaaaaaaBasicMulAlingualPlane(2or3bytes)
110bbbbb10aaaaaa 00000bbbbbaaaaaa1110cccc10bbbbbb10aaaaaa ccccbbbbbbaaaaaa
SupplementaryPlanes(4bytes)
11110ddd10ddcccc10bbbbbb10aaaaaauuuu=ddddd-1
110110uuuuccccbb110111bbbbaaaaaa
100
UseCase:UTF-8<=>UTF-16
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 101
UseCase:UTF-8<=>UTF-16
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• SuperwordopAmizaAonscanbeverybriZle– doesn’t(andcan’t)coveralltheusecases
• Intrinsicsarepointfixes,notgeneral– powerful,lightweight,andflexible– highdevelopmentcosts
• JNIishardtodevelopandmaintain– interoperabilityoverheadbetweenJava&naAvecode– CPUdispatchingisrequired
JavaandSIMDtoday
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
VectorAPIEmbraceexplicitvectorizaJon
103
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
ProjectPanama
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
SafeHarborStatementThefollowingisintendedtooutlineourgeneralproductdirecAon.ItisintendedforinformaAonpurposesonly,andmaynotbeincorporatedintoanycontract.Itisnotacommitmenttodeliveranymaterial,code,orfuncAonality,andshouldnotberelieduponinmakingpurchasingdecisions.Thedevelopment,release,andAmingofanyfeaturesorfuncAonalitydescribedforOracle’sproductsremainsatthesolediscreAonofOracle.
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
MoAvaAon
Expose data-parallel operations through a cross-platform API
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Int8Vectorx=...,y=...;//vectorsof8intsInt8Vectorz=x.add(y);//element-wiseaddi4on
107
MoAvaAon
vpaddd%ymm1,%ymm0,%ymm0
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• MaximallyexpressiveandportableAPI– “principleofleastastonishment”– uniformcoverageoperaAonsanddatatypes– type-safe
• Performant– Highqualityofgeneratedcode– CompeAAvewithexisAngfaciliAesforauto-vectorizaAon
• GracefulperformancedegradaAon– fallbackfor"holes"innaAvearchitectures
108
Goals
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• DrawAPIproposedbyJohnRose– ImmutableVectortype– parameterizedbyelementtype&size(Vector<E,S>)
• PrototypeImplementaAoninPanama– int,float,long,anddoubleelementssupported
• Int128Vector,Int256Vector,…– basedonMachineCodeSnippets&“super-longs”(Long2,Long4,Long8)
109
CurrentStatus
Vector<Integer,S256Bit>x=...,y=...;//vectorsof8intsVector<Integer,S256Bit>z=x.add(y);//element-wiseaddition
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• java.lang.Long2/Long4/Long8/…– represent128/256/512-bitvalues
• “well-known“totheJVM– specialtreatmentintheJVM– C2knowshowtomapthevaluestoappropriatevectorregisters
RawVectors
//128-bitvector.public/*value*/finalclassLong2{privatefinallongl1,l2;//FIXMEprivateLong2(){thrownewError();}@HotSpotIntrinsicCandidatepublicstaticnativeLong2make(longlo,longhi);@HotSpotIntrinsicCandidatepublicnativelongextract(inti);@HotSpotIntrinsicCandidatepublicbooleanequals(Long2v){...}...
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
JVMvsHardware:ImpedanceMismatch
size(bits) 8 16 32 64 128 256 512 …
x86regs AL AX EAX RAX XMM0 YMM0 ZMM0 -
JVM B S I J j.l.Long2 j.l.Long4 j.l.Long8 …
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• OpAmizeawayvectorboxesinthecode– requiredformappingVectorinstancestovectorregistersingeneratedcode– Vector<Integer,S256Bits>=>vectorregister(ymm)onx86/AVX– crucialfordecentperformance
• EscapeAnalysisinC2– doesn’tcoverallthecases(e.g.,non-trivialcontrolflow)– briZle(dependsoninliningdecisions;easyforausertoleakaninstance)
112
VectorBoxEliminaAon
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• ValueTypes(ProjectValhalla)fortherescue!– representsuper-longs&typedvectorsasvaluetypes– lettheJIT-compilerdotherest
• MinimalValueTypes,asthefirststep– hZp://cr.openjdk.java.net/~jrose/values/shady-values.html
113
VectorboxeliminaAon
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
ValhallaJVMvsHardware
size(bits) 8 16 32 64 128 256 512 …
x86regs AL AX EAX RAX XMM0 YMM0 ZMM0 -
JVM B S I J j.l.Long2 j.l.Long4 j.l.Long8 …
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
Summary
115
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• SIMDISAextensions– veryirregularonx86– hardtouAlizeincross-playormmanner
• JVM– auto-vectorizaAon
• briZle• can’tcoverallthecases
– intrinsics• pros:powerful,lightweight,andflexible• cons:pointfixes,highdevelopmentcosts
116
Summary
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• JDK9– enhancementsinauto-vectorizaAon
• parAalAVX-512support,codeshapeimprovementsonx86
– newmethodsandintrinsics• Math.fma(),Arrays.vectorizedMismatch()
• VectorAPI– easy&reliablewaytowriteperformantvectorizedcode– workinprogress!
• underacAvedevelopmentinProjectPanama
117
Summary:Future
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
SafeHarborStatementTheprecedingisintendedtooutlineourgeneralproductdirecAon.ItisintendedforinformaAonpurposesonly,andmaynotbeincorporatedintoanycontract.Itisnotacommitmenttodeliveranymaterial,code,orfuncAonality,andshouldnotberelieduponinmakingpurchasingdecisions.Thedevelopment,release,andAmingofanyfeaturesorfuncAonalitydescribedforOracle’sproductsremainsatthesolediscreAonofOracle.
118
Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|
• Vectorinterface:hZp://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java• Prototype:hZp://hg.openjdk.java.net/panama/panama/jdk/file/Ap/test/panama/vector-api-patchable
• MinimalValueTypes:hZp://cr.openjdk.java.net/~jrose/values/shady-values.html
• Super-longs: – hZp://hg.openjdk.java.net/panama/panama/jdk/file/0243d8ef6bd1/src/java.base/share/classes/java/lang/Long2.java– hZp://hg.openjdk.java.net/panama/panama/jdk/file/0243d8ef6bd1/src/java.base/share/classes/java/lang/Long4.java– hZp://hg.openjdk.java.net/panama/panama/jdk/file/0243d8ef6bd1/src/java.base/share/classes/java/lang/Long8.java
• MachineCodeSnippets:hZp://cr.openjdk.java.net/~vlivanov/talks/2016_JVMLS_MachineCodeSnippets.pdf
119
VectorAPI:Materials