Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
AnotheryearofprogressforBLIS:2017-2018FieldG.VanZee
ScienceofHighPerformanceCompu:ngTheUniversityofTexasatAus:n
ScienceofHighPerformanceCompu:ng(SHPC)researchgroup
• LedbyRobertA.vandeGeijn
• ContributestothescienceofDLAandinstan:atesresearchresultsasopensourcesoJware
• LonghistoryofsupportfromNa:onalScienceFounda:on
• Website:hOps://shpc.ices.utexas.edu/
SHPCFunding(BLIS)
• NSF– AwardACI-1148125/1340293:SI2-SSI:ALinearAlgebraSo2wareInfrastructureforSustainedInnova;oninComputa;onalChemistryandotherSciences.(FundedJune1,2012-May31,2015.)
– AwardCCF-1320112:SHF:Small:FromMatrixComputa;onstoTensorComputa;ons.(FundedAugust1,2013-July31,2016.)
– AwardACI-1550493:SI2-SSI:SustainingInnova;onintheLinearAlgebraSo2wareStackforComputa;onalChemistryandotherSciences.(FundedJuly15,2016–June30,2018.)
SHPCFunding(BLIS)
• Industry(grantsandhardware)– MicrosoJ– TexasInstruments– Intel– AMD– HPEnterprise– Oracle– Huawei
Publica:ons• “BLIS:AFrameworkforRapidInstan;a;onofBLASFunc;onality”(TOMS;
inprint)
• “TheBLISFramework:ExperimentsinPortability”(TOMS;inprint)
• “AnatomyofMany-ThreadedMatrixMul;plica;on”(IPDPS;in
proceedings)
• “Analy;calModelsfortheBLISFramework”(TOMS;inprint)
• “Implemen;ngHigh-PerformanceComplexMatrixMul;plica;onviathe
3mand4mMethods”(TOMS;inprint)
• “Implemen;ngHigh-PerformanceComplexMatrixMul;plica;onviathe
1mMethod”(TOMS;acceptedpendingmodifica:ons)
Review
• BLAS:BasicLinearAlgebraSubprograms– Level1:vector-vector[Lawsonetal.1979]– Level2:matrix-vector[Dongarraetal.1988]– Level3:matrix-matrix[Dongarraetal.1990]
• WhyareBLASimportant?– BLAScons:tutethe“boOomofthefoodchain”formostdenselinearalgebraapplica:ons,aswellasotherHPClibraries
– LAPACK,libflame,MATLAB,PETSc,numpy,gsl,etc.
Review
• WhatisBLIS?– Aframeworkforinstan:a:ngBLASlibraries(ie:fullycompa:blewithBLAS)
• WhatelseisBLIS?– Providesalterna:veBLAS-like(Cfriendly)APIthatfixesdeficienciesinoriginalBLAS
– Providesanobject-basedAPI– ProvidesasupersetofBLASfunc:onality– Aproduc:vitymul:plier– Aresearchenvironment
Review:Wherewereweayearago?
• License:3-clauseBSD• Mostrecentversion:0.4.1(August30)• Host:hOps://github.com/flame/blis
– Clonerepositories,opennewissues,submitpullrequests,interactwithothergithubusers,viewmarkdowndocs
• GNU-likebuildsystem– Supportforgcc,clang,icc
• Configure-:mehardwaredetec:on(cpuid)
Review:Wherewereweayearago?
• BLAS/CBLAScompa:bilitylayers• Twona:veAPIs
– Typed(BLAS-like)– Object-based(libflame-like)
• Supportforlevel-3mul:threading– viaOpenMPorPOSIXthreads– Quadra:cpar::oning:herk,syrk,her2k,syr2k,trmm
• Comprehensivetestsuite– Controlopera:ons,parameters,problemsizes,datatypes,storageformats,andmore
Run:mekernelmanagement
• Run:memanagementofconfigura:ons(kernels,blocksizes,etc.)– RewriOen/generalizedconfigura:onsystem– Allowsmul:-configura:onbuilds(“fat”libraries)
• CPUIDusedatrun:metochoosebetweentargets– Examples:
• ./configureintel64• ./configurex86_64• ./configurehaswell#stillworks
– Ordefineyourown!• ./configureskx_knl#with~5mofwork
Self-ini:aliza:on
• Libraryself-ini:aliza:on– Previouslystatusquo
• Useroftyped/objectAPIshadtocallbli_init()priortocallinganyotherfunc:onorpartofBLIS
• BLAS/CBLASwerealreadyself-ini:alizing– Howdoesitworknow?
• Typicalusageoftyped/objectAPIresultsinexactlyonethreadcallingbli_init()automa:cally,exactlyonce
• Librarystaysini:alized;bli_finalize()isop:onal– Whyisthisimportant?
• Applica:ondoesn’thavetoworryanymoreaboutwhetherBLISisini:alized(esp.withconstantsBLIS_ZERO,BLIS_ONE,etc.)
– Implementa:on• pthread_once()
Basic+ExpertInterfaces
• Separate“basic”and“expert”interfaces– appliestobothtypedandobjectAPIs
• Whatisthedifference?
Basic+ExpertInterfaces
//TypedAPI(basic)voidbli_dgemm(trans_ttransa,trans_ttransb,dim_tm,dim_tn,dim_tk,double*alpha,double*a,inc_trsa,inc_tcsa,double*b,inc_trsb,inc_tcsb,double*beta,double*c,inc_trsc,inc_tcsc);
//ObjectAPI(basic)voidbli_gemm(obj_t*alpha,obj_t*a,obj_t*b,obj_t*beta,obj_t*c);
Basic+ExpertInterfaces
//TypedAPI(expert)voidbli_dgemm_ex(trans_ttransa,trans_ttransb,dim_tm,dim_tn,dim_tk,double*alpha,double*a,inc_trsa,inc_tcsa,double*b,inc_trsb,inc_tcsb,double*beta,double*c,inc_trsc,inc_tcsc,cntx_t*cntx,rntm_t*rntm);
//ObjectAPI(expert)voidbli_gemm_ex(obj_t*alpha,obj_t*a,obj_t*b,obj_t*beta,obj_t*c,cntx_t*cntx,rntm_t*rntm);
Basic+ExpertInterfaces
• Whatarecntx_tandrntm_t?– cntx_t:contextencapsulatesallarchitecture-specificinforma:onobtainedfromthebuildsystemabouttheconfigura:on(blocksizes,kerneladdresses,etc.)
– rntm_t:moreonthisinabit– BoOomline:expertscanexertmorecontroloverBLISwithoutimpedingeverydayusers
ControllingMul:threading
• Reminder– Howdoesmul:threadingworkinBLIS?– BLIS’sgemmalgorithmhasfiveloopsoutsidethemicrokernelandoneloopinsidethemicrokernel
• JC• PC(notyetparallelized)• IC• JR• IR• PR(microkernel)
JCloop
PCloop
ICloop
JRloop
IRloop
PRloop
5thlooparoundmicro-kernel
4thlooparoundmicro-kernel
3rdlooparoundmicro-kernel
2ndlooparoundmicro-kernel
1stlooparoundμkernel
micro-kernel
+=
mC
mR
mR
1
+=
+=
+=
+=
+=
nC nC
kC
kC
mC
1
nR
kC
nR
Pack Ai → Ai ~
Pack Bp → Bp ~
nR
A Bj Cj
Ap
Ai
Bp Cj
Ai ~ Bp
~
Bp ~ Ci
Ci
kC
L3cacheL2cacheL1cacheregisters
mainmemory
Update Cij
ControllingMul:threading
• Previously,BLIShadonemethodtocontrolthreading:Globalspecifica:onviaenvironmentvariables– Affectsallapplica:onthreadsequally– Automa:cway
• BLIS_NUM_THREADS– Manualway
• BLIS_JC_NT,BLIS_IC_NT,BLIS_JR_NT,BLIS_IR_NT• BLIS_PC_NT(notyetimplemented)
ControllingMul:threading
#Useeithertheautomaticwayormanualwayofrequesting#parallelism.#Automaticway.$exportBLIS_NUM_THREADS=6#Expertway.$exportBLIS_IC_NT=2;exportBLIS_JR_NT=3//Callalevel-3operation(basicinterfaceisenough).bli_gemm(&alpha,&a,&b,&beta,&c);
• Example:Globalspecifica:onviaenvironmentvariables
ControllingMul:threading
• Wenowhaveasecondmethod:Globalspecifica:onviarun:meAPI– Affectsallapplica:onthreadsequally– Automa:cway
• bli_thread_set_num_threads(dim_tnt);– Manualway
• bli_thread_set_ways(dim_tjc,dim_tpc,dim_tic,dim_tjr,dim_tir);
ControllingMul:threading
//Useeithertheautomaticwayormanualwayofrequesting//parallelism.//Automaticway.bli_thread_set_num_threads(6,&rntm);//Manualway.bli_thread_set_ways(1,1,2,3,1,&rntm);//Callalevel-3operation(basicinterfaceisstillenough).bli_gemm(&alpha,&a,&b,&beta,&c);
• Example:Globalspecifica:onviarun:meAPI
ControllingMul:threading
• Andalsoathirdmethod:Thread-localspecifica:onviarun:meAPI– Affectsonlythecallingthread!– Requiresuseofexpertinterface(typedorobject)
• Userini:alizesandpassesina“run:me”object:rntm_t– Automa:cway
• bli_rntm_set_num_threads(dim_tnt,rntm_t*rntm);
– Manualway• bli_rntm_set_ways(dim_tjc,dim_tpc,dim_tic,dim_tjr,dim_tir,rntm_t*rntm);
ControllingMul:threading
//Declareandinitializearntm_tobject.rntm_trntm=BLIS_RNTM_INITIALIZER;//CallONE(notboth)ofthefollowingtoencodeyour//parallelizationintotherntm_t.bli_rntm_set_num_threads(6,&rntm);//automaticwaybli_rntm_set_ways(1,1,2,3,1,&rntm);//manualway//Callalevel-3operationviaanexpertinterfaceandpass//inyourrntm_t.(NULLbelowrequestsdefaultcontext.)bli_gemm_ex(&alpha,&a,&b,&beta,&c,NULL,&rntm);
• Example:Thread-localspecifica:onviarun:meAPI
ThreadSafety
• Uncondi:onalthreadsafety• Whatdoesthismean?
– BLISalwaysusesmechanismsprovidedbypthreadsAPItoensuresynchronousaccesstoglobally-shareddatastructures
– Independentofmul:threadingop:on--enable-threading={pthreads|openmp}• WorkswithOpenMP• Workswhenmul:threadingisdisableden:rely
Sandboxes
• Mo:va:on:whatifyoucouldprovideyourownimplementa:onofgemm?– YoucoulduseasliOleorasmuchoftheexis:ngimplementa:oncodeasyoulike
– Butyouwanttopreserveeverythingelse:buildsystem,testsuite,u:lityfunc:ons,etc.
• EnterBLISsandbox– Integratedintobuildsystem(noaddi:onalmakefiles)– Requiresonlyoneheaderfile(whichcanbeempty)– Requiresonlyonefunc:on:bli_gemmnat()– UseC(orevenC++)
Sandboxes
• EnablingasandboxinBLIS#Enablesandboxnamed‘ref99’(withautomaticconfiguration#selection).$./configure--enable-sandbox=ref99auto#Shorthand:$./configure-sref99auto
Sandboxes
• Possibleuses– Tryingadifferentalgorithmicpath(notGoto)– Tryingadifferentimplementa:onofpackm(notjustpackmkernels)
– Tryvariousop:miza:ons:avoidingobj_tatahigherlevel,orinliningfunc:ons.
– Createexperimentalimplementa:onsofnewopera:ons
Sandboxes
• NOTfordoinganyofthefollowing:– Defininganewdatatype(half-precision,quad-precision,shortinteger,etc.)
– Changingexis:ngAPIs– Removingsupportforoneormoredatatypes(toreducelibrarysize)
– Changeimplementa:onofotherlevel-3opera:onssuchasherkortrmm
• Thismaybeallowedinthefuture
Kernels
• IntelSkylakeXandKnight’sLanding(AVX-512)– na:ve:s/d(alllevel-3opera:ons)– induced1m:c/z(alllevel-3)
• IntelPenryn,Sandybridge,IvyBridge,Haswell,Broadwell,Skylake,KabyLake,CoffeeLake– na:ve:s/d/c/z(alllevel-3;somelevel-1v,-1f)
• AMDBulldozer,Piledriver,Steamroller,Excavator,Zen– na:ve:s/d/c/z(alllevel-3)
Buildsystem
• Monolithicheadergenera:on– Allheaders(~500)recursivelyinlinedintoblis.h– Fastercompila:on:me– Easiertodistributebuildproducts
• RewriOenconfigure-:mehardwaredetec:on• Configura:onblacklis:ng(assembler/binu:ls)• ARG_MAXhack
– ./configure--enable-arg-max-hack• Compile/linkagainstinstalledcopyofBLIS
– makeBLIS_INSTALL_PATH=/usr/local
Tes:ng
• IntegratednetlibBLAStestdrivers– CarefullytranslatedfromFortran-77toC– Integratedintobuildsystem
• makecheckblas• Simulateapplica:on-levelmul:threadingintestsuite– Executewitharbitrarynumberofthreads
• TravisCInowusesIntelSDEemulatortotestallx86_64kernels– Excep:on:FMA4-basedBulldozer
Documenta:on
• Examplecode– TypedAPI:examples/tapi– ObjectAPI:examples/oapi– Makefilesincluded– Setuplikeatutorial:readcodealongsidetheexecutableoutput
• Documenta:on– typedAPI,objectAPI,buildsystem,configura:ons,hardwaresupport,kernels,mul:threading,sandboxes,testsuite,releasenotes
GitHubStats
• TotalBLIScontributorsto-date:62– non-UTcontributors:52
• Issuesclosed:115– bynon-UTcontributors:86
• Pullrequestsclosed:88– virtuallyallaccepted
• Averageuniqueclonespertwo-weekperiod:~50– totalclonespertwo-weekperiod:~500
• Averageuniquevisitorspertwo-weekperiod:~350– totalvisitorspertwo-weekperiod:~1500
What’snew?(review)
• Fivebroadcategories– Framework:run:meconfigmanagement;libraryself-init;basic+expertAPIs;per-callmul:threadingspecifica:on;uncondi:onalthreadsafety;sandboxes
– Kernels:zensupport;Devin’sassemblymacrolanguage– Buildsystem:monolithicheadergenera:on(fasterbuild:me);rewriOenconfigure-:mehardwaredetec:on;configblacklis:ng;ARG_MAXhack;BLIS_INSTALL_PATH
– Tes:ng:integratednetlibBLAStestdrivers(translatedtoC);simulateapplica:on-levelthreadsintestsuite;TravisCInowusesIntelSDE
– Documenta:on:examplecode(typedandobjectAPIs);APIdocumenta:on(typedandobjectAPIs);movedwikisintosourcedistribu:on
Conclusion
• BLIS…– israpidlymaturing– isfeature-rich– iswell-documented– hasacommunitytosupportitsdevelopers/users– hasbeenembracedbyindustry– providescompe::ve(orsuperior)performancerela:vetootherleadingopen-sourcesolu:ons(andsomevendorlibraries!)
FurtherInforma:on
• Website:– hOp://github.com/flame/blis/
• Discussion:– hOp://groups.google.com/group/blis-devel– hOp://groups.google.com/group/blis-discuss
• Contact:– [email protected]
48