Sangmin Seo
Assistant Computer Scientist, Argonne National Laboratory
November 7, 2016

User-Level Threads and OpenMP
Top500 Supercomputers Today

1. Sunway TaihuLight (Sunway MPP): 10,646,600 cores
2. Tianhe-2 (Intel Xeon + Xeon Phi): 3,120,000 cores
3. Titan (Cray XK7 + Nvidia K20x): 560,640 cores
4. Sequoia (Blue Gene/Q): 1,572,864 cores
5. K computer (SPARC64): 705,024 cores
6. Mira (Blue Gene/Q): 786,432 cores

[Chart: number of cores per socket in the Top500 and total number of cores in the Top500 rank #1 system]

4th XMP Workshop (11/7/2016)
Massive On-node Parallelism

• To address massive on-node parallelism, the number of work units (e.g., threads) must increase by ~100x
• MPI+OpenMP is sufficient for many applications, but current implementations are poor
  – Today, MPI+OpenMP == MPI+Pthreads
• The Pthread abstraction is too generic and not well suited for HPC
  – It lacks fine-grained scheduling, memory management, network management, signaling, etc.
• A better runtime can significantly improve MPI+OpenMP performance and support other emerging programming models

[Figure: an MPI process with many OpenMP threads per core]
Outline

• Motivation
• User-Level Threads (ULTs)
• Argobots
• BOLT: OpenMP over Lightweight Threads
• Summary
User-Level Threads (ULTs)

• What is a user-level thread (ULT)?
  – Provides thread semantics in user space
  – Execution model: cooperative time sharing
• More than one ULT can be mapped to a single kernel thread
• ULTs on the same OS thread do not execute in parallel
  – Can be implemented with coroutines
    • Enable explicit suspend and resume of progress by preserving execution state
    • Some languages, such as Python and Go, use coroutines for asynchronous I/O

[Figure: two ULTs context-switching on one kernel thread over a timeline; many ULTs mapped to a few kernel threads across cores]
User-Level Threads (ULTs)

• Why ULTs?
  – Conventional threads (e.g., Pthreads) are too expensive to express massive parallelism
  – If we create more Pthreads than the number of cores (i.e., oversubscription):
    • Context-switch and synchronization overhead increases dramatically
  – ULTs can mitigate the high overhead of Pthreads, but need explicit control
• Where to use them?
  – To better overlap computation with communication/IO
    • The low context-switching overhead of ULTs gives more opportunities to hide communication/IO latency
  – To exploit fine-grained task parallelism
    • Lightweight ULTs are more suitable for expressing massive task parallelism

[Figure: multiple ULTs mapped to one OS thread]
Pthreads vs. ULTs

• Average time for creating and joining one thread
  • Pthread: 6.6 us - 21.2 us (avg. 34,953 cycles)
  • ULT (Argobots): 78 ns - 130 ns (avg. 191 cycles)
  • A ULT is 64x - 233x faster than a Pthread
• How fast is that?
  • L1 cache access: 1.112 ns, L2 cache access: 5.648 ns, memory access: 18.4 ns
  • Context switch (2 processes): 1.64 us (*measured using LMbench3)

[Chart: average create & join time per thread (ns) vs. number of threads, Pthread vs. ULT (Argobots)]
Growing Interest in ULTs

• ULT and task libraries
  – Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GNU Pth, StackThreads/MP, Protothreads, Capriccio, State Threads, TiNy-threads, etc.
• OS support
  – Windows fibers, Solaris threads
• Languages and programming models
  – Cilk, OpenMP tasks, C++11 tasks, C++17 coroutine proposal, Stackless Python, Go goroutines, etc.
• Pros
  – Easy to use with a Pthreads-like interface
• Cons
  – The runtime tries to do something smart (e.g., work stealing), which may conflict with the characteristics and demands of applications
Argobots

A lightweight low-level threading and tasking framework (http://www.mcs.anl.gov/argobots/)

Overview
• Separation of mechanisms and policies
• Massive parallelism
  – Execution Streams guarantee progress
  – Work Units execute to completion
• User-level threads (ULTs) vs. tasklets
• Clearly defined memory semantics
  – Consistency domains
    • Provide eventual consistency
  – Software can manage consistency

Argobots Innovations
• Enabling technology, but not a policy maker
  – High-level languages/libraries such as OpenMP or Charm++ have more information about the user application (data locality, dependencies)
• Explicit model:
  – Enables dynamism, but always managed by high-level systems

[Figure: programming models (MPI, OpenMP, Charm++, PaRSEC, ...) on top of Argobots; execution streams mapped to processor cores, each with private pools and a shared pool of lightweight work units — U: user-level thread, T: tasklet]

*Current team members: Pavan Balaji, Sangmin Seo, Halim Amer (ANL), L. Kale, Nitin Bhat (UIUC)
Argobots Execution Model

• Execution Streams (ES)
  – Sequential instruction stream
    • Can consist of one or more work units
  – Mapped efficiently to a hardware resource
  – Implicitly managed progress semantics
    • One blocked ES cannot block other ESs
• User-level Threads (ULTs)
  – Independent execution units in user space
  – Associated with an ES when running
  – Yieldable and migratable
  – Can make blocking calls
• Tasklets
  – Atomic units of work
  – Asynchronous completion via notifications
  – Not yieldable; migratable before execution
  – Cannot make blocking calls
• Scheduler
  – Stackable scheduler with pluggable strategies
• Synchronization primitives
  – Mutex, condition variable, barrier, future
• Events
  – Communication triggers

[Figure: execution streams ES1 ... ESn, each with a scheduler (S) and a pool of ULTs (U), tasklets (T), and events (E)]
Explicit Mapping of ULTs/Tasklets to ESs

• The user needs to map work units to ESs
• No smart scheduling, no work stealing unless the user wants to use it
• Benefits
  – Allows locality optimization
    • Execute work units on the same ES
  – No expensive lock is needed between ULTs on the same ES
    • They do not run in parallel
    • A flag is enough

[Figure: ULTs U0-U3 and tasklets T1-T2 mapped to ES1; ULTs U4-U5 mapped to ES2]
Stackable Scheduler with Pluggable Strategies

• Associated with an ES
• Can handle ULTs and tasklets
• Can handle schedulers
  – Allows schedulers to be stacked hierarchically
• Can handle asynchronous events
• Users can write their own schedulers
  – Argobots provides mechanisms, not policies
  – Replace the default scheduler
    • E.g., FIFO, LIFO, priority queue, etc.
• A ULT can explicitly yield to another ULT: yield() or yield_to(target)
  – yield_to() avoids scheduler overhead

[Figure: a scheduler handling ULTs, tasklets, events, and stacked schedulers; yield() goes through the scheduler, while yield_to(target) switches directly to the target ULT]
Performance: Create/Join Time

• Ideal scalability
  – If the ULT runtime is perfectly scalable, the time should stay the same regardless of the number of ESs

[Chart: create/join time per ULT (cycles) vs. number of execution streams (workers), for Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), and Argobots (Tasklet)]
Argobots' Position

Argobots is a low-level threading/tasking runtime!

[Figure: applications and domain-specific languages (DSLs) on top of high-level programming models/libraries, which run on Argobots and communication libraries over the node OS]
Argobots Ecosystem

[Figure: the Argobots ecosystem — the Argobots runtime (execution streams with schedulers, ULTs, tasklets, and events) underneath MPI+Argobots (ULTs on ESs over MPI and communication libraries), Charm++ (each Cilk "worker" is an Argobots ES running a RWS ULT and fused ULTs), PaRSEC task graphs (PO/GE/TR/SY kernels), OpenMP, Mercury RPC (origin/target RPC processing), and OmpSs]

More connections: TASCEL, XMP, ROSE, GridFTP, Kokkos, RAJA, etc.
OpenMP

• Directive-based programming model
• Commonly used for shared-memory programming within a node
• Many different implementations
  – Typically on top of the Pthreads library
  – Intel, GCC, Clang, IBM, etc.

Sequential code:
for (i = 0; i < N; i++) {
    do_something();
}

OpenMP code:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    do_something();
}
Nested Parallel Loop: Microbenchmark

int in[1000][1000], out[1000][1000];

#pragma omp parallel for                 /* a thread for each CPU is created */
for (i = 0; i < 1000; i++) {             /* by default; each thread executes */
    lib_compute(i);                      /* a portion of the outer loop      */
}

void lib_compute(int x)
{
    #pragma omp parallel for             /* each thread creates more threads */
    for (j = 0; j < 1000; j++)           /* for the second loop; each inner  */
        out[x][j] = compute(in[x][j]);   /* thread executes a portion        */
}

Contribution: Adrián Castelló (Universitat Jaume I)
Nested Parallel Loop: Implementations

• GCC
  – Does not reuse the idle threads in nested parallel constructs
  – All thread teams inside a parallel region need to be created
• ICC
  – Reuses idle threads
    • If there are not enough threads available, new threads are created
    • All created threads are OS threads and add overhead
• Implementation using Argobots
  – Creates ULTs or tasklets for both the outer loop and the inner loop
[Figure: one ES for each core (ES0 ... ESN-1); one work unit per outer-loop iteration, each spawning work units that execute portions of the inner loop, with synchronization points for the outer and inner loops]
Nested Parallel Loop: Performance

• Execution time for 36 threads in the outer loop (lower is better)
• The GCC OpenMP implementation does not reuse idle threads in nested parallel regions; all the teams of threads need to be created in each iteration
• Some overhead is added by creating ULTs instead of tasks

[Charts: time (s) vs. number of OMP threads | Argobots ULTs/tasks (inner loop) — ICC/Pthreads vs. ICC/Argobots ULTs vs. ICC/Argobots tasks, and GCC/Pthreads vs. GCC/Argobots ULTs vs. GCC/Argobots tasks]
Nested Parallel Loop: Analysis

• How does each implementation manage the threads in nested parallel regions?

Impl.          | # Created Threads                        | # Reused Threads     | # Created ULTs | # Created Tasks
GCC            | C+N*(T-1) = 35,036 (sequential creation) | 0                    | ---            | ---
ICC            | C+C*(T-1) = 1,296                        | (N-C)*(T-1) = 33,740 | ---            | ---
Argobots tasks | C = 36                                   | 0                    | C = 36         | N*T = 36,000 (parallel creation)

• Parameters (example: C = 36, N = 1,000, M = 1,000, T = 36):
  • C: number of cores (or threads created by the user at the beginning)
  • N: number of iterations of the outer loop
  • M: number of iterations of the inner loop
  • T: number of threads for the inner loop
BOLT: A Lightning-Fast OpenMP Implementation

• About BOLT
  – BOLT is a recursive acronym that stands for "BOLT is OpenMP over Lightweight Threads"
  – https://www.mcs.anl.gov/bolt/
• Objective
  – An OpenMP framework that exploits lightweight threads and tasks
    • Improved nested massive parallelism
    • Enhanced fine-grained task parallelism
    • Better interoperability with MPI and other internode programming models
Approach & Development

• Basic approach
  – The compiler simply generates runtime API calls, while the runtime creates ULTs/tasklets and manages them over a fixed set of computational resources
  – Use Argobots as the underlying threading and tasking mechanism
  – ABI compatibility with the Intel OpenMP compilers, LLVM/Clang, and GCC (i.e., BOLT can be used with these compilers)
• Development
  – Runtime
    • Based on the Intel OpenMP runtime API
    • Generates Argobots work units from OpenMP pragmas
    • Can generate ULTs or tasklets depending on code characteristics
  – Compiler (planned)
    • LLVM/Clang
    • Passes characteristics of a parallel region or task (e.g., the existence of blocking calls) to the runtime
    • Extends pragmas with the option "nonblocking"
BOLT Execution Model

• OpenMP threads and tasks are translated into Argobots work units (i.e., ULTs and tasklets)
• Shared pools are utilized to handle nested parallelism
• A customized Argobots scheduler manages the scheduling of work units across execution streams

[Figure: #pragma omp parallel maps OpenMP threads to ULTs (U) and #pragma omp task maps OpenMP tasks to tasklets (T); work units flow through private and shared pools to execution streams, one per CPU core]
Prototype Implementation of the BOLT Runtime

• Based on Intel's open-source OpenMP runtime
  – http://openmp.llvm.org/
• Kept the original runtime API for ABI compatibility
• Designed and implemented the threading layer using Argobots and modified the runtime internal layer

[Figure: runtime API layer on top of the runtime internal layer, with the threading layer switched from Pthreads to Argobots]
OpenMP Pragma Translation

1. A set of threads is created at runtime
   – If they have not been created yet
   – Commonly as many as the number of CPU cores
2. The number of iterations is divided between all the threads
3. A synchronization point is added after the for loop
   – Implicit barrier at the end of parallel for

#pragma omp parallel for   // (1), (2)
for (i = 0; i < N; i++) {
    do_something();
}                          // (3)
OpenMP Compiler & BOLT Runtime

• Clang and the Intel compiler translate #pragma omp parallel into Intel OpenMP runtime API calls
  – __kmpc_fork_call(...) invokes __kmp_fork_call(...) and __kmp_join_call(...) internally
• The BOLT runtime implements these calls:
  • Creates execution streams (if needed)
  • Adds a ULT or tasklet to each ES
  • Launches the work
  • Joins the work units created
parallel for

#pragma omp parallel for   // creates threads; divides all
for (i = 0; i < N; i++) {  // iterations among the threads
    do_something();
}                          // synchronization point

Implementation using Argobots:
• One execution stream for each CPU core (or hardware thread)
• Each work unit executes a portion of the for loop
• A synchronization point is added

[Figure: work units distributed across ES0 ... ESK, each with its own scheduler]
OpenUH OpenMP Validation Suite 3.1

                               GCC 6.1 | ICC 17.0.0 + Intel OpenMP | ICC 17.0.0 + BOLT runtime (Argobots) | BOLT (clang + Argobots)
# of tested OpenMP constructs       62 |  62 |  62 |  62
# of used tests                    123 | 123 | 123 | 123
# of successful tests              118 | 118 | 122 | 122
# of failed tests                    5 |   5 |   1 |   1
Pass rate (%)                     95.9 | 95.9| 99.2| 99.2

• The BOLT prototype works well functionally!

4th XMP Workshop (11/7/2016)
Nested Parallel Loop Microbenchmark

• The number of threads for the outer loop was fixed at 36

[Chart: execution time (us) vs. number of threads for the inner loop — gcc 6.1.0, icc 17.0.0, BOLT (ULT), BOLT (ULT+tasklet), Argobots (ULT)]
Application Study: KIFMM

• Kernel-Independent Fast Multipole Method (KIFMM)
  – Offloads dgemv operations to Intel MKL
• Evaluated the efficiency of the nested parallelism support in Intel OpenMP and BOLT during the Downward stage
  – 9 threads for the application (outer parallel region)

[Chart: execution time (s) vs. number of threads for Intel MKL — Intel OMP core-close, Intel OMP core-true, Intel OMP no-binding, BOLT]
Application Study: ACME Mini-App

• ACME (Accelerated Climate Modeling for Energy)
  – Implementing additional levels of parallelism through OpenMP nested parallel loops for upcoming many-core machines
• Preliminary results of testing the transport_se mini-app version of HOMME (ACME's CAM-SE dycore)

[Chart: normalized execution time (%) of the ACME mini-app (transport_se), ICC + Intel OpenMP (15.0.0) vs. ICC + BOLT (Argobots), across thread configurations H=16,V=1 through H=8,V=4; lower is better — BOLT is up to 3.16x faster under oversubscription]
Summary

• Massive on-node parallelism is inevitable
  – We need runtime systems that utilize such parallelism
• User-level threads (ULTs)
  – Lightweight threads better suited for fine-grained dynamic parallelism and computation-communication overlap
• Argobots
  – A lightweight low-level threading/tasking framework
  – Provides efficient mechanisms, not policies, to users (library developers or compilers)
    • They can build their own solutions
• BOLT: OpenMP over Lightweight Threads
  – More efficient support of nested parallelism with Argobots ULTs and tasklets
  – Preliminary results show that BOLT is promising
Argo Concurrency Team

• Argonne National Laboratory (ANL)
  – Pavan Balaji (co-lead)
  – Sangmin Seo
  – Abdelhalim Amer
  – Pete Beckman (PI)
• University of Illinois at Urbana-Champaign (UIUC)
  – Laxmikant Kale (co-lead)
  – Marc Snir
  – Nitin Kundapur Bhat
• University of Tennessee, Knoxville (UTK)
  – George Bosilca
  – Thomas Herault
  – Damien Genet
• Pacific Northwest National Laboratory (PNNL)
  – Sriram Krishnamoorthy

Past team members: Cyril Bordage (UIUC), Prateek Jindal (UIUC), Jonathan Lifflander (UIUC), Esteban Meneses (University of Pittsburgh), Huiwei Lu (ANL), Yanhua Sun (UIUC)
BOLT Collaborations

• Maintainers
  – Argonne National Laboratory
    • Sangmin Seo, Abdelhalim Amer, Pavan Balaji
• Contributors
  – Universitat Jaume I de Castelló
    • Adrián Castelló, Rafael Mayo, Enrique S. Quintana-Ortí
  – Barcelona Supercomputing Center (BSC)
    • Antonio J. Peña, Jesus Labarta
  – RIKEN
    • Jinpil Lee, Mitsuhisa Sato
Try Argobots & BOLT

• Argobots
  – http://www.mcs.anl.gov/argobots/
  – git repository: https://github.com/pmodels/argobots
  – Wiki: https://github.com/pmodels/argobots/wiki
  – Doxygen: http://www.mcs.anl.gov/~sseo/public/argobots/
• BOLT
  – http://www.mcs.anl.gov/bolt/
  – git repository: https://github.com/pmodels/bolt-runtime
Funding Acknowledgments

• Funding grant providers
• Infrastructure providers
Programming Models and Runtime Systems Group

Group lead
– Pavan Balaji (computer scientist and group lead)

Current staff members
– Abdelhalim Amer (postdoc), Yanfei Guo (postdoc), Rob Latham (developer), Lena Oden (postdoc), Ken Raffenetti (developer), Sangmin Seo (assistant computer scientist), Min Si (postdoc)

Past staff members
– Antonio Pena (postdoc), Wesley Bland (postdoc), Darius T. Buntinas (developer), James S. Dinan (postdoc), David J. Goodell (developer), Huiwei Lu (postdoc), Min Tian (visiting scholar), Yanjie Wei (visiting scholar), Yuqing Xiong (visiting scholar), Jian Yu (visiting scholar), Junchao Zhang (postdoc), Xiaomin Zhu (visiting scholar)

Current and recent students
– Ashwin Aji (Ph.D.), Abdelhalim Amer (Ph.D.), Md. Humayun Arafat (Ph.D.), Alex Brooks (Ph.D.), Adrian Castello (Ph.D.), Dazhao Cheng (Ph.D.), Hoang-Vu Dang (Ph.D.), James S. Dinan (Ph.D.), Piotr Fidkowski (Ph.D.), Priyanka Ghosh (Ph.D.), Sayan Ghosh (Ph.D.), Ralf Gunter (B.S.), Jichi Guo (Ph.D.), Yanfei Guo (Ph.D.), Marius Horga (M.S.), John Jenkins (Ph.D.), Feng Ji (Ph.D.), Ping Lai (Ph.D.), Palden Lama (Ph.D.), Yan Li (Ph.D.), Huiwei Lu (Ph.D.), Jintao Meng (Ph.D.), Ganesh Narayanaswamy (M.S.), Qingpeng Niu (Ph.D.), Ziaul Haque Olive (Ph.D.), David Ozog (Ph.D.), Renbo Pang (Ph.D.), Nikela Papadopoulou (Ph.D.), Sreeram Potluri (Ph.D.), Sarunya Pumma (Ph.D.), Li Rao (M.S.), Gopal Santhanaraman (Ph.D.), Thomas Scogland (Ph.D.), Min Si (Ph.D.), Brian Skjerven (Ph.D.), Rajesh Sudarsan (Ph.D.), Lukasz Wesolowski (Ph.D.), Shucai Xiao (Ph.D.), Chaoran Yang (Ph.D.), Boyu Zhang (Ph.D.), Xiuxia Zhang (Ph.D.), Xin Zhao (Ph.D.)

Advisory board
– Pete Beckman (senior scientist), Rusty Lusk (retired, STA), Marc Snir (division director), Rajeev Thakur (deputy director)
Q&A

• Thank you for your attention! Questions?