Sangmin Seo
Assistant Computer Scientist, Argonne National Laboratory
November 7, 2016

User-Level Threads and OpenMP
Top500 Supercomputers Today

1. Sunway TaihuLight (Sunway MPP): 10,646,600 cores
2. Tianhe-2 (Intel Xeon + Xeon Phi): 3,120,000 cores
3. Titan (Cray XK7 + Nvidia K20x): 560,640 cores
4. Sequoia (Blue Gene/Q): 1,572,864 cores
5. K computer (SPARC64): 705,024 cores
6. Mira (Blue Gene/Q): 786,432 cores

[Chart: number of cores per socket in the Top500 and total number of cores in the Top500 rank #1 system]

4th XMP Workshop (11/7/2016)
Massive On-node Parallelism

• To address massive on-node parallelism, the number of work units (e.g., threads) must increase by ~100x
• MPI+OpenMP is sufficient for many applications, but current implementations are poor
  – Today, MPI+OpenMP == MPI+Pthreads
• The Pthread abstraction is too generic and not well suited for HPC
  – It lacks fine-grained scheduling, memory management, network management, signaling, etc.
• A better runtime can significantly improve MPI+OpenMP performance and support other emerging programming models

[Figure: an MPI process with many OpenMP threads per core]
Outline

• Motivation
• User-Level Threads (ULTs)
• Argobots
• BOLT: OpenMP over Lightweight Threads
• Summary
User-Level Threads (ULTs)

• What is a user-level thread (ULT)?
  – Provides thread semantics in user space
  – Execution model: cooperative time sharing
• More than one ULT can be mapped to a single kernel thread
• ULTs on the same OS thread do not execute in parallel
  – Can be implemented with coroutines
    • Enable explicit suspend and resume of progress by preserving execution state
    • Some languages, such as Python and Go, use coroutines for asynchronous I/O

[Figure: two ULTs context-switching on one kernel thread over a timeline; many ULTs mapped to a few kernel threads across cores]
User-Level Threads (ULTs)

• Why ULTs?
  – Conventional threads (e.g., Pthreads) are too expensive to express massive parallelism
  – If we create more Pthreads than the number of cores (i.e., oversubscription):
    • Context-switch and synchronization overhead increases dramatically
  – ULTs can mitigate the high overhead of Pthreads, but need explicit control
• Where to use them?
  – To better overlap computation with communication/IO
    • The low context-switching overhead of ULTs gives more opportunities to hide communication/IO latency
  – To exploit fine-grained task parallelism
    • Lightweight ULTs are more suitable for expressing massive task parallelism

[Figure: multiple ULTs mapped to one OS thread]
Pthreads vs. ULTs

• Average time for creating and joining one thread
  • Pthread: 6.6 us - 21.2 us (avg. 34,953 cycles)
  • ULT (Argobots): 78 ns - 130 ns (avg. 191 cycles)
  • A ULT is 64x - 233x faster than a Pthread
• How fast is that?
  • L1 cache access: 1.112 ns, L2 cache access: 5.648 ns, memory access: 18.4 ns
  • Context switch (2 processes): 1.64 us (*measured using LMbench3)

[Chart: average create & join time per thread (ns) vs. number of threads, Pthread vs. ULT (Argobots)]
Growing Interest in ULTs

• ULT and task libraries
  – Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GNU Pth, StackThreads/MP, Protothreads, Capriccio, State Threads, TiNy-threads, etc.
• OS support
  – Windows fibers, Solaris threads
• Languages and programming models
  – Cilk, OpenMP tasks, C++11 tasks, C++17 coroutine proposal, Stackless Python, Go goroutines, etc.
• Pros
  – Easy to use with a Pthreads-like interface
• Cons
  – The runtime tries to do something smart (e.g., work stealing), which may conflict with the characteristics and demands of applications
Argobots

A lightweight low-level threading and tasking framework (http://www.mcs.anl.gov/argobots/)

Overview
• Separation of mechanisms and policies
• Massive parallelism
  – Execution Streams guarantee progress
  – Work Units execute to completion
• User-level threads (ULTs) vs. tasklets
• Clearly defined memory semantics
  – Consistency domains
    • Provide eventual consistency
  – Software can manage consistency

Argobots Innovations
• Enabling technology, but not a policy maker
  – High-level languages/libraries such as OpenMP or Charm++ have more information about the user application (data locality, dependencies)
• Explicit model:
  – Enables dynamism, but always managed by high-level systems

[Figure: programming models (MPI, OpenMP, Charm++, PaRSEC, ...) on top of Argobots; execution streams mapped to processor cores, each with private pools and a shared pool of lightweight work units — U: user-level thread, T: tasklet]

*Current team members: Pavan Balaji, Sangmin Seo, Halim Amer (ANL), L. Kale, Nitin Bhat (UIUC)
Argobots Execution Model

• Execution Streams (ES)
  – Sequential instruction stream
    • Can consist of one or more work units
  – Mapped efficiently to a hardware resource
  – Implicitly managed progress semantics
    • One blocked ES cannot block other ESs
• User-level Threads (ULTs)
  – Independent execution units in user space
  – Associated with an ES when running
  – Yieldable and migratable
  – Can make blocking calls
• Tasklets
  – Atomic units of work
  – Asynchronous completion via notifications
  – Not yieldable; migratable before execution
  – Cannot make blocking calls
• Scheduler
  – Stackable scheduler with pluggable strategies
• Synchronization primitives
  – Mutex, condition variable, barrier, future
• Events
  – Communication triggers

[Figure: execution streams ES1 ... ESn, each with a scheduler (S) and a pool of ULTs (U), tasklets (T), and events (E)]
Explicit Mapping of ULTs/Tasklets to ESs

• The user needs to map work units to ESs
• No smart scheduling, no work stealing unless the user wants to use it
• Benefits
  – Allows locality optimization
    • Execute work units on the same ES
  – No expensive lock is needed between ULTs on the same ES
    • They do not run in parallel
    • A flag is enough

[Figure: ULTs U0-U3 and tasklets T1-T2 mapped to ES1; ULTs U4-U5 mapped to ES2]
Stackable Scheduler with Pluggable Strategies

• Associated with an ES
• Can handle ULTs and tasklets
• Can handle schedulers
  – Allows schedulers to be stacked hierarchically
• Can handle asynchronous events
• Users can write their own schedulers
  – Argobots provides mechanisms, not policies
  – Replace the default scheduler
    • E.g., FIFO, LIFO, priority queue, etc.
• A ULT can explicitly yield to another ULT: yield() or yield_to(target)
  – yield_to() avoids scheduler overhead

[Figure: a scheduler handling ULTs, tasklets, events, and stacked schedulers; yield() goes through the scheduler, while yield_to(target) switches directly to the target ULT]
Performance: Create/Join Time

• Ideal scalability
  – If the ULT runtime is perfectly scalable, the time should stay the same regardless of the number of ESs

[Chart: create/join time per ULT (cycles) vs. number of execution streams (workers), for Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), and Argobots (Tasklet)]
Argobots' Position

Argobots is a low-level threading/tasking runtime!

[Figure: applications and domain-specific languages (DSLs) on top of high-level programming models/libraries, which run on Argobots and communication libraries over the node OS]
Argobots Ecosystem

[Figure: the Argobots ecosystem — the Argobots runtime (execution streams with schedulers, ULTs, tasklets, and events) underneath MPI+Argobots (ULTs on ESs over MPI and communication libraries), Charm++ (each Cilk "worker" is an Argobots ES running a RWS ULT and fused ULTs), PaRSEC task graphs (PO/GE/TR/SY kernels), OpenMP, Mercury RPC (origin/target RPC processing), and OmpSs]

More connections: TASCEL, XMP, ROSE, GridFTP, Kokkos, RAJA, etc.
OpenMP

• Directive-based programming model
• Commonly used for shared-memory programming within a node
• Many different implementations
  – Typically on top of the Pthreads library
  – Intel, GCC, Clang, IBM, etc.

Sequential code:
for (i = 0; i < N; i++) {
    do_something();
}

OpenMP code:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    do_something();
}
Nested Parallel Loop: Microbenchmark

int in[1000][1000], out[1000][1000];

#pragma omp parallel for                 /* a thread for each CPU is created */
for (i = 0; i < 1000; i++) {             /* by default; each thread executes */
    lib_compute(i);                      /* a portion of the outer loop      */
}

void lib_compute(int x)
{
    #pragma omp parallel for             /* each thread creates more threads */
    for (j = 0; j < 1000; j++)           /* for the second loop; each inner  */
        out[x][j] = compute(in[x][j]);   /* thread executes a portion        */
}

Contribution: Adrián Castelló (Universitat Jaume I)
Nested Parallel Loop: Implementations

• GCC
  – Does not reuse the idle threads in nested parallel constructs
  – All thread teams inside a parallel region need to be created
• ICC
  – Reuses idle threads
    • If there are not enough threads available, new threads are created
    • All created threads are OS threads and add overhead
• Implementation using Argobots
  – Creates ULTs or tasklets for both the outer loop and the inner loop
[Figure: one ES for each core (ES0 ... ESN-1); one work unit per outer-loop iteration, each spawning work units that execute portions of the inner loop, with synchronization points for the outer and inner loops]
Nested Parallel Loop: Performance

• Execution time for 36 threads in the outer loop (lower is better)
• The GCC OpenMP implementation does not reuse idle threads in nested parallel regions; all the teams of threads need to be created in each iteration
• Some overhead is added by creating ULTs instead of tasks

[Charts: time (s) vs. number of OMP threads | Argobots ULTs/tasks (inner loop) — ICC/Pthreads vs. ICC/Argobots ULTs vs. ICC/Argobots tasks, and GCC/Pthreads vs. GCC/Argobots ULTs vs. GCC/Argobots tasks]
Nested Parallel Loop: Analysis

• How does each implementation manage the threads in nested parallel regions?

Impl.          | # Created Threads                        | # Reused Threads     | # Created ULTs | # Created Tasks
GCC            | C+N*(T-1) = 35,036 (sequential creation) | 0                    | ---            | ---
ICC            | C+C*(T-1) = 1,296                        | (N-C)*(T-1) = 33,740 | ---            | ---
Argobots tasks | C = 36                                   | 0                    | C = 36         | N*T = 36,000 (parallel creation)

• Parameters (example: C = 36, N = 1,000, M = 1,000, T = 36):
  • C: number of cores (or threads created by the user at the beginning)
  • N: number of iterations of the outer loop
  • M: number of iterations of the inner loop
  • T: number of threads for the inner loop
BOLT: A Lightning-Fast OpenMP Implementation

• About BOLT
  – BOLT is a recursive acronym that stands for "BOLT is OpenMP over Lightweight Threads"
  – https://www.mcs.anl.gov/bolt/
• Objective
  – An OpenMP framework that exploits lightweight threads and tasks
    • Improved nested massive parallelism
    • Enhanced fine-grained task parallelism
    • Better interoperability with MPI and other internode programming models
Approach & Development

• Basic approach
  – The compiler simply generates runtime API calls, while the runtime creates ULTs/tasklets and manages them over a fixed set of computational resources
  – Use Argobots as the underlying threading and tasking mechanism
  – ABI compatibility with the Intel OpenMP compilers, LLVM/Clang, and GCC (i.e., BOLT can be used with these compilers)
• Development
  – Runtime
    • Based on the Intel OpenMP runtime API
    • Generates Argobots work units from OpenMP pragmas
    • Can generate ULTs or tasklets depending on code characteristics
  – Compiler (planned)
    • LLVM/Clang
    • Passes characteristics of a parallel region or task (e.g., the existence of blocking calls) to the runtime
    • Extends pragmas with the option "nonblocking"
BOLT Execution Model

• OpenMP threads and tasks are translated into Argobots work units (i.e., ULTs and tasklets)
• Shared pools are utilized to handle nested parallelism
• A customized Argobots scheduler manages the scheduling of work units across execution streams

[Figure: #pragma omp parallel maps OpenMP threads to ULTs (U) and #pragma omp task maps OpenMP tasks to tasklets (T); work units flow through private and shared pools to execution streams, one per CPU core]
Prototype Implementation of the BOLT Runtime

• Based on Intel's open-source OpenMP runtime
  – http://openmp.llvm.org/
• Kept the original runtime API for ABI compatibility
• Designed and implemented the threading layer using Argobots and modified the runtime internal layer

[Figure: runtime API layer on top of the runtime internal layer, with the threading layer switched from Pthreads to Argobots]
OpenMP Pragma Translation

1. A set of threads is created at runtime
   – If they have not been created yet
   – Commonly as many as the number of CPU cores
2. The number of iterations is divided between all the threads
3. A synchronization point is added after the for loop
   – Implicit barrier at the end of parallel for

#pragma omp parallel for   // (1), (2)
for (i = 0; i < N; i++) {
    do_something();
}                          // (3)
OpenMP Compiler & BOLT Runtime

• Clang and the Intel compiler translate #pragma omp parallel into Intel OpenMP runtime API calls
  – __kmpc_fork_call(...) invokes __kmp_fork_call(...) and __kmp_join_call(...) internally
• The BOLT runtime implements these calls:
  • Creates execution streams (if needed)
  • Adds a ULT or tasklet to each ES
  • Launches the work
  • Joins the work units created
parallel for

#pragma omp parallel for   // creates threads; divides all
for (i = 0; i < N; i++) {  // iterations among the threads
    do_something();
}                          // synchronization point

Implementation using Argobots:
• One execution stream for each CPU core (or hardware thread)
• Each work unit executes a portion of the for loop
• A synchronization point is added

[Figure: work units distributed across ES0 ... ESK, each with its own scheduler]
OpenUH OpenMP Validation Suite 3.1

                               GCC 6.1 | ICC 17.0.0 + Intel OpenMP | ICC 17.0.0 + BOLT runtime (Argobots) | BOLT (clang + Argobots)
# of tested OpenMP constructs       62 |  62 |  62 |  62
# of used tests                    123 | 123 | 123 | 123
# of successful tests              118 | 118 | 122 | 122
# of failed tests                    5 |   5 |   1 |   1
Pass rate (%)                     95.9 | 95.9| 99.2| 99.2

• The BOLT prototype works well functionally!

4th XMP Workshop (11/7/2016)
Nested Parallel Loop Microbenchmark

• The number of threads for the outer loop was fixed at 36

[Chart: execution time (us) vs. number of threads for the inner loop — gcc 6.1.0, icc 17.0.0, BOLT (ULT), BOLT (ULT+tasklet), Argobots (ULT)]
Application Study: KIFMM

• Kernel-Independent Fast Multipole Method (KIFMM)
  – Offloads dgemv operations to Intel MKL
• Evaluated the efficiency of the nested parallelism support in Intel OpenMP and BOLT during the Downward stage
  – 9 threads for the application (outer parallel region)

[Chart: execution time (s) vs. number of threads for Intel MKL — Intel OMP core-close, Intel OMP core-true, Intel OMP no-binding, BOLT]
Application Study: ACME Mini-App

• ACME (Accelerated Climate Modeling for Energy)
  – Implementing additional levels of parallelism through OpenMP nested parallel loops for upcoming many-core machines
• Preliminary results of testing the transport_se mini-app version of HOMME (ACME's CAM-SE dycore)

[Chart: normalized execution time (%) of the ACME mini-app (transport_se), ICC + Intel OpenMP (15.0.0) vs. ICC + BOLT (Argobots), across thread configurations H=16,V=1 through H=8,V=4; lower is better — BOLT is up to 3.16x faster under oversubscription]
Summary

• Massive on-node parallelism is inevitable
  – We need runtime systems that utilize such parallelism
• User-level threads (ULTs)
  – Lightweight threads better suited for fine-grained dynamic parallelism and computation-communication overlap
• Argobots
  – A lightweight low-level threading/tasking framework
  – Provides efficient mechanisms, not policies, to users (library developers or compilers)
    • They can build their own solutions
• BOLT: OpenMP over Lightweight Threads
  – More efficient support of nested parallelism with Argobots ULTs and tasklets
  – Preliminary results show that BOLT is promising
Argo Concurrency Team

• Argonne National Laboratory (ANL)
  – Pavan Balaji (co-lead)
  – Sangmin Seo
  – Abdelhalim Amer
  – Pete Beckman (PI)
• University of Illinois at Urbana-Champaign (UIUC)
  – Laxmikant Kale (co-lead)
  – Marc Snir
  – Nitin Kundapur Bhat
• University of Tennessee, Knoxville (UTK)
  – George Bosilca
  – Thomas Herault
  – Damien Genet
• Pacific Northwest National Laboratory (PNNL)
  – Sriram Krishnamoorthy

Past team members: Cyril Bordage (UIUC), Prateek Jindal (UIUC), Jonathan Lifflander (UIUC), Esteban Meneses (University of Pittsburgh), Huiwei Lu (ANL), Yanhua Sun (UIUC)
BOLT Collaborations

• Maintainers
  – Argonne National Laboratory
    • Sangmin Seo, Abdelhalim Amer, Pavan Balaji
• Contributors
  – Universitat Jaume I de Castelló
    • Adrián Castelló, Rafael Mayo, Enrique S. Quintana-Ortí
  – Barcelona Supercomputing Center (BSC)
    • Antonio J. Peña, Jesus Labarta
  – RIKEN
    • Jinpil Lee, Mitsuhisa Sato
Try Argobots & BOLT

• Argobots
  – http://www.mcs.anl.gov/argobots/
  – git repository: https://github.com/pmodels/argobots
  – Wiki: https://github.com/pmodels/argobots/wiki
  – Doxygen: http://www.mcs.anl.gov/~sseo/public/argobots/
• BOLT
  – http://www.mcs.anl.gov/bolt/
  – git repository: https://github.com/pmodels/bolt-runtime
Funding Acknowledgments

• Funding grant providers
• Infrastructure providers
Programming Models and Runtime Systems Group

Group lead
– Pavan Balaji (computer scientist and group lead)

Current staff members
– Abdelhalim Amer (postdoc), Yanfei Guo (postdoc), Rob Latham (developer), Lena Oden (postdoc), Ken Raffenetti (developer), Sangmin Seo (assistant computer scientist), Min Si (postdoc)

Past staff members
– Antonio Pena (postdoc), Wesley Bland (postdoc), Darius T. Buntinas (developer), James S. Dinan (postdoc), David J. Goodell (developer), Huiwei Lu (postdoc), Min Tian (visiting scholar), Yanjie Wei (visiting scholar), Yuqing Xiong (visiting scholar), Jian Yu (visiting scholar), Junchao Zhang (postdoc), Xiaomin Zhu (visiting scholar)

Current and recent students
– Ashwin Aji (Ph.D.), Abdelhalim Amer (Ph.D.), Md. Humayun Arafat (Ph.D.), Alex Brooks (Ph.D.), Adrian Castello (Ph.D.), Dazhao Cheng (Ph.D.), Hoang-Vu Dang (Ph.D.), James S. Dinan (Ph.D.), Piotr Fidkowski (Ph.D.), Priyanka Ghosh (Ph.D.), Sayan Ghosh (Ph.D.), Ralf Gunter (B.S.), Jichi Guo (Ph.D.), Yanfei Guo (Ph.D.), Marius Horga (M.S.), John Jenkins (Ph.D.), Feng Ji (Ph.D.), Ping Lai (Ph.D.), Palden Lama (Ph.D.), Yan Li (Ph.D.), Huiwei Lu (Ph.D.), Jintao Meng (Ph.D.), Ganesh Narayanaswamy (M.S.), Qingpeng Niu (Ph.D.), Ziaul Haque Olive (Ph.D.), David Ozog (Ph.D.), Renbo Pang (Ph.D.), Nikela Papadopoulou (Ph.D.), Sreeram Potluri (Ph.D.), Sarunya Pumma (Ph.D.), Li Rao (M.S.), Gopal Santhanaraman (Ph.D.), Thomas Scogland (Ph.D.), Min Si (Ph.D.), Brian Skjerven (Ph.D.), Rajesh Sudarsan (Ph.D.), Lukasz Wesolowski (Ph.D.), Shucai Xiao (Ph.D.), Chaoran Yang (Ph.D.), Boyu Zhang (Ph.D.), Xiuxia Zhang (Ph.D.), Xin Zhao (Ph.D.)

Advisory board
– Pete Beckman (senior scientist), Rusty Lusk (retired, STA), Marc Snir (division director), Rajeev Thakur (deputy director)
Q&A

• Thank you for your attention! Questions?