50
Another year of progress for BLIS: 2017-2018 Field G. Van Zee Science of High Performance Compu:ng The University of Texas at Aus:n

Another year of progress for BLIS: 2017-2018

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

AnotheryearofprogressforBLIS:2017-2018FieldG.VanZee

ScienceofHighPerformanceCompu:ngTheUniversityofTexasatAus:n

ScienceofHighPerformanceCompu:ng(SHPC)researchgroup

•  LedbyRobertA.vandeGeijn

•  ContributestothescienceofDLAandinstan:atesresearchresultsasopensourcesoJware

•  LonghistoryofsupportfromNa:onalScienceFounda:on

•  Website:hOps://shpc.ices.utexas.edu/

SHPCFunding(BLIS)

•  NSF–  AwardACI-1148125/1340293:SI2-SSI:ALinearAlgebraSo2wareInfrastructureforSustainedInnova;oninComputa;onalChemistryandotherSciences.(FundedJune1,2012-May31,2015.)

–  AwardCCF-1320112:SHF:Small:FromMatrixComputa;onstoTensorComputa;ons.(FundedAugust1,2013-July31,2016.)

–  AwardACI-1550493:SI2-SSI:SustainingInnova;onintheLinearAlgebraSo2wareStackforComputa;onalChemistryandotherSciences.(FundedJuly15,2016–June30,2018.)

SHPCFunding(BLIS)

•  Industry(grantsandhardware)–  MicrosoJ–  TexasInstruments–  Intel–  AMD–  HPEnterprise–  Oracle–  Huawei

Publica:ons•  “BLIS:AFrameworkforRapidInstan;a;onofBLASFunc;onality”(TOMS;

inprint)

•  “TheBLISFramework:ExperimentsinPortability”(TOMS;inprint)

•  “AnatomyofMany-ThreadedMatrixMul;plica;on”(IPDPS;in

proceedings)

•  “Analy;calModelsfortheBLISFramework”(TOMS;inprint)

•  “Implemen;ngHigh-PerformanceComplexMatrixMul;plica;onviathe

3mand4mMethods”(TOMS;inprint)

•  “Implemen;ngHigh-PerformanceComplexMatrixMul;plica;onviathe

1mMethod”(TOMS;acceptedpendingmodifica:ons)

Review

•  BLAS:BasicLinearAlgebraSubprograms–  Level1:vector-vector[Lawsonetal.1979]–  Level2:matrix-vector[Dongarraetal.1988]–  Level3:matrix-matrix[Dongarraetal.1990]

•  WhyareBLASimportant?–  BLAScons:tutethe“boOomofthefoodchain”formostdenselinearalgebraapplica:ons,aswellasotherHPClibraries

–  LAPACK,libflame,MATLAB,PETSc,numpy,gsl,etc.

Review

•  WhatisBLIS?– Aframeworkforinstan:a:ngBLASlibraries(ie:fullycompa:blewithBLAS)

•  WhatelseisBLIS?–  Providesalterna:veBLAS-like(Cfriendly)APIthatfixesdeficienciesinoriginalBLAS

–  Providesanobject-basedAPI–  ProvidesasupersetofBLASfunc:onality– Aproduc:vitymul:plier– Aresearchenvironment

Review:Wherewereweayearago?

•  License:3-clauseBSD•  Mostrecentversion:0.4.1(August30)•  Host:hOps://github.com/flame/blis

–  Clonerepositories,opennewissues,submitpullrequests,interactwithothergithubusers,viewmarkdowndocs

•  GNU-likebuildsystem–  Supportforgcc,clang,icc

•  Configure-:mehardwaredetec:on(cpuid)

Review:Wherewereweayearago?

•  BLAS/CBLAScompa:bilitylayers•  Twona:veAPIs

–  Typed(BLAS-like)– Object-based(libflame-like)

•  Supportforlevel-3mul:threading–  viaOpenMPorPOSIXthreads– Quadra:cpar::oning:herk,syrk,her2k,syr2k,trmm

•  Comprehensivetestsuite–  Controlopera:ons,parameters,problemsizes,datatypes,storageformats,andmore

SoWhat’sNew?

•  Fivebroadcategories– Framework– Kernels– Buildsystem– Tes:ng– Documenta:on

SoWhat’sNew?

•  Fivebroadcategories– Framework– Kernels– Buildsystem– Tes:ng– Documenta:on

Run:mekernelmanagement

•  Run:memanagementofconfigura:ons(kernels,blocksizes,etc.)– RewriOen/generalizedconfigura:onsystem– Allowsmul:-configura:onbuilds(“fat”libraries)

•  CPUIDusedatrun:metochoosebetweentargets– Examples:

•  ./configureintel64•  ./configurex86_64•  ./configurehaswell#stillworks

– Ordefineyourown!•  ./configureskx_knl#with~5mofwork

Run:mekernelmanagement

•  Formoredetails:–  docs/ConfigurationHowTo.md

Self-ini:aliza:on

•  Libraryself-ini:aliza:on–  Previouslystatusquo

•  Useroftyped/objectAPIshadtocallbli_init()priortocallinganyotherfunc:onorpartofBLIS

•  BLAS/CBLASwerealreadyself-ini:alizing–  Howdoesitworknow?

•  Typicalusageoftyped/objectAPIresultsinexactlyonethreadcallingbli_init()automa:cally,exactlyonce

•  Librarystaysini:alized;bli_finalize()isop:onal– Whyisthisimportant?

•  Applica:ondoesn’thavetoworryanymoreaboutwhetherBLISisini:alized(esp.withconstantsBLIS_ZERO,BLIS_ONE,etc.)

–  Implementa:on•  pthread_once()

Basic+ExpertInterfaces

•  Separate“basic”and“expert”interfaces– appliestobothtypedandobjectAPIs

•  Whatisthedifference?

Basic+ExpertInterfaces

//TypedAPI(basic)voidbli_dgemm(trans_ttransa,trans_ttransb,dim_tm,dim_tn,dim_tk,double*alpha,double*a,inc_trsa,inc_tcsa,double*b,inc_trsb,inc_tcsb,double*beta,double*c,inc_trsc,inc_tcsc);

//ObjectAPI(basic)voidbli_gemm(obj_t*alpha,obj_t*a,obj_t*b,obj_t*beta,obj_t*c);

Basic+ExpertInterfaces

//TypedAPI(expert)voidbli_dgemm_ex(trans_ttransa,trans_ttransb,dim_tm,dim_tn,dim_tk,double*alpha,double*a,inc_trsa,inc_tcsa,double*b,inc_trsb,inc_tcsb,double*beta,double*c,inc_trsc,inc_tcsc,cntx_t*cntx,rntm_t*rntm);

//ObjectAPI(expert)voidbli_gemm_ex(obj_t*alpha,obj_t*a,obj_t*b,obj_t*beta,obj_t*c,cntx_t*cntx,rntm_t*rntm);

Basic+ExpertInterfaces

•  Whatarecntx_tandrntm_t?– cntx_t:contextencapsulatesallarchitecture-specificinforma:onobtainedfromthebuildsystemabouttheconfigura:on(blocksizes,kerneladdresses,etc.)

– rntm_t:moreonthisinabit– BoOomline:expertscanexertmorecontroloverBLISwithoutimpedingeverydayusers

Basic+ExpertInterfaces

•  Formoredetails:–  docs/BLISTypedAPI.md–  docs/BLISObjectAPI.md

ControllingMul:threading

•  Reminder– Howdoesmul:threadingworkinBLIS?– BLIS’sgemmalgorithmhasfiveloopsoutsidethemicrokernelandoneloopinsidethemicrokernel

•  JC•  PC(notyetparallelized)•  IC•  JR•  IR•  PR(microkernel)

JCloop

PCloop

ICloop

JRloop

IRloop

PRloop

5thlooparoundmicro-kernel

4thlooparoundmicro-kernel

3rdlooparoundmicro-kernel

2ndlooparoundmicro-kernel

1stlooparoundμkernel

micro-kernel

+=

mC

mR

mR

1

+=

+=

+=

+=

+=

nC nC

kC

kC

mC

1

nR

kC

nR

Pack Ai → Ai ~

Pack Bp → Bp ~

nR

A Bj Cj

Ap

Ai

Bp Cj

Ai ~ Bp

~

Bp ~ Ci

Ci

kC

L3cacheL2cacheL1cacheregisters

mainmemory

Update Cij

ControllingMul:threading

•  Previously,BLIShadonemethodtocontrolthreading:Globalspecifica:onviaenvironmentvariables– Affectsallapplica:onthreadsequally– Automa:cway

•  BLIS_NUM_THREADS– Manualway

•  BLIS_JC_NT,BLIS_IC_NT,BLIS_JR_NT,BLIS_IR_NT•  BLIS_PC_NT(notyetimplemented)

ControllingMul:threading

#Useeithertheautomaticwayormanualwayofrequesting#parallelism.#Automaticway.$exportBLIS_NUM_THREADS=6#Expertway.$exportBLIS_IC_NT=2;exportBLIS_JR_NT=3//Callalevel-3operation(basicinterfaceisenough).bli_gemm(&alpha,&a,&b,&beta,&c);

•  Example:Globalspecifica:onviaenvironmentvariables

ControllingMul:threading

•  Wenowhaveasecondmethod:Globalspecifica:onviarun:meAPI– Affectsallapplica:onthreadsequally– Automa:cway

•  bli_thread_set_num_threads(dim_tnt);– Manualway

•  bli_thread_set_ways(dim_tjc,dim_tpc,dim_tic,dim_tjr,dim_tir);

ControllingMul:threading

//Useeithertheautomaticwayormanualwayofrequesting//parallelism.//Automaticway.bli_thread_set_num_threads(6,&rntm);//Manualway.bli_thread_set_ways(1,1,2,3,1,&rntm);//Callalevel-3operation(basicinterfaceisstillenough).bli_gemm(&alpha,&a,&b,&beta,&c);

•  Example:Globalspecifica:onviarun:meAPI

ControllingMul:threading

•  Andalsoathirdmethod:Thread-localspecifica:onviarun:meAPI– Affectsonlythecallingthread!–  Requiresuseofexpertinterface(typedorobject)

•  Userini:alizesandpassesina“run:me”object:rntm_t– Automa:cway

•  bli_rntm_set_num_threads(dim_tnt,rntm_t*rntm);

– Manualway•  bli_rntm_set_ways(dim_tjc,dim_tpc,dim_tic,dim_tjr,dim_tir,rntm_t*rntm);

ControllingMul:threading

//Declareandinitializearntm_tobject.rntm_trntm=BLIS_RNTM_INITIALIZER;//CallONE(notboth)ofthefollowingtoencodeyour//parallelizationintotherntm_t.bli_rntm_set_num_threads(6,&rntm);//automaticwaybli_rntm_set_ways(1,1,2,3,1,&rntm);//manualway//Callalevel-3operationviaanexpertinterfaceandpass//inyourrntm_t.(NULLbelowrequestsdefaultcontext.)bli_gemm_ex(&alpha,&a,&b,&beta,&c,NULL,&rntm);

•  Example:Thread-localspecifica:onviarun:meAPI

ControllingMul:threading

•  Formoredetails:–  docs/Multithreading.md

ThreadSafety

•  Uncondi:onalthreadsafety•  Whatdoesthismean?

– BLISalwaysusesmechanismsprovidedbypthreadsAPItoensuresynchronousaccesstoglobally-shareddatastructures

–  Independentofmul:threadingop:on--enable-threading={pthreads|openmp}• WorkswithOpenMP• Workswhenmul:threadingisdisableden:rely

Sandboxes

•  Mo:va:on:whatifyoucouldprovideyourownimplementa:onofgemm?–  YoucoulduseasliOleorasmuchoftheexis:ngimplementa:oncodeasyoulike

–  Butyouwanttopreserveeverythingelse:buildsystem,testsuite,u:lityfunc:ons,etc.

•  EnterBLISsandbox–  Integratedintobuildsystem(noaddi:onalmakefiles)–  Requiresonlyoneheaderfile(whichcanbeempty)–  Requiresonlyonefunc:on:bli_gemmnat()– UseC(orevenC++)

Sandboxes

•  EnablingasandboxinBLIS#Enablesandboxnamed‘ref99’(withautomaticconfiguration#selection).$./configure--enable-sandbox=ref99auto#Shorthand:$./configure-sref99auto

Sandboxes

•  Possibleuses– Tryingadifferentalgorithmicpath(notGoto)– Tryingadifferentimplementa:onofpackm(notjustpackmkernels)

– Tryvariousop:miza:ons:avoidingobj_tatahigherlevel,orinliningfunc:ons.

– Createexperimentalimplementa:onsofnewopera:ons

Sandboxes

•  NOTfordoinganyofthefollowing:– Defininganewdatatype(half-precision,quad-precision,shortinteger,etc.)

– Changingexis:ngAPIs– Removingsupportforoneormoredatatypes(toreducelibrarysize)

– Changeimplementa:onofotherlevel-3opera:onssuchasherkortrmm

•  Thismaybeallowedinthefuture

Sandboxes

•  Formoredetails:–  docs/Sandboxes.md

SoWhat’sNew?

•  Fivebroadcategories– Framework– Kernels– Buildsystem– Tes:ng– Documenta:on

Kernels

•  IntelSkylakeXandKnight’sLanding(AVX-512)– na:ve:s/d(alllevel-3opera:ons)–  induced1m:c/z(alllevel-3)

•  IntelPenryn,Sandybridge,IvyBridge,Haswell,Broadwell,Skylake,KabyLake,CoffeeLake– na:ve:s/d/c/z(alllevel-3;somelevel-1v,-1f)

•  AMDBulldozer,Piledriver,Steamroller,Excavator,Zen– na:ve:s/d/c/z(alllevel-3)

SoWhat’sNew?

•  Fivebroadcategories– Framework– Kernels– Buildsystem– Tes:ng– Documenta:on

Buildsystem

•  Monolithicheadergenera:on– Allheaders(~500)recursivelyinlinedintoblis.h–  Fastercompila:on:me–  Easiertodistributebuildproducts

•  RewriOenconfigure-:mehardwaredetec:on•  Configura:onblacklis:ng(assembler/binu:ls)•  ARG_MAXhack

–  ./configure--enable-arg-max-hack•  Compile/linkagainstinstalledcopyofBLIS

–  makeBLIS_INSTALL_PATH=/usr/local

SoWhat’sNew?

•  Fivebroadcategories– Framework– Kernels– Buildsystem– Tes:ng– Documenta:on

Tes:ng

•  IntegratednetlibBLAStestdrivers– CarefullytranslatedfromFortran-77toC–  Integratedintobuildsystem

•  makecheckblas•  Simulateapplica:on-levelmul:threadingintestsuite– Executewitharbitrarynumberofthreads

•  TravisCInowusesIntelSDEemulatortotestallx86_64kernels– Excep:on:FMA4-basedBulldozer

SoWhat’sNew?

•  Fivebroadcategories– Framework– Kernels– Buildsystem– Tes:ng– Documenta:on

Documenta:on

•  Examplecode– TypedAPI:examples/tapi– ObjectAPI:examples/oapi– Makefilesincluded– Setuplikeatutorial:readcodealongsidetheexecutableoutput

•  Documenta:on–  typedAPI,objectAPI,buildsystem,configura:ons,hardwaresupport,kernels,mul:threading,sandboxes,testsuite,releasenotes

Performance

Performance

GitHubStats

•  TotalBLIScontributorsto-date:62–  non-UTcontributors:52

•  Issuesclosed:115–  bynon-UTcontributors:86

•  Pullrequestsclosed:88–  virtuallyallaccepted

•  Averageuniqueclonespertwo-weekperiod:~50–  totalclonespertwo-weekperiod:~500

•  Averageuniquevisitorspertwo-weekperiod:~350–  totalvisitorspertwo-weekperiod:~1500

What’snew?(review)

•  Fivebroadcategories–  Framework:run:meconfigmanagement;libraryself-init;basic+expertAPIs;per-callmul:threadingspecifica:on;uncondi:onalthreadsafety;sandboxes

–  Kernels:zensupport;Devin’sassemblymacrolanguage–  Buildsystem:monolithicheadergenera:on(fasterbuild:me);rewriOenconfigure-:mehardwaredetec:on;configblacklis:ng;ARG_MAXhack;BLIS_INSTALL_PATH

–  Tes:ng:integratednetlibBLAStestdrivers(translatedtoC);simulateapplica:on-levelthreadsintestsuite;TravisCInowusesIntelSDE

–  Documenta:on:examplecode(typedandobjectAPIs);APIdocumenta:on(typedandobjectAPIs);movedwikisintosourcedistribu:on

Conclusion

•  BLIS…–  israpidlymaturing–  isfeature-rich–  iswell-documented– hasacommunitytosupportitsdevelopers/users– hasbeenembracedbyindustry– providescompe::ve(orsuperior)performancerela:vetootherleadingopen-sourcesolu:ons(andsomevendorlibraries!)

FurtherInforma:on

•  Website:– hOp://github.com/flame/blis/

•  Discussion:– hOp://groups.google.com/group/blis-devel– hOp://groups.google.com/group/blis-discuss

•  Contact:– [email protected]

48

Thankyou!