Porting NANOS on SDSM

GOAL: Porting a shared memory environment to distributed memory. What is missing in current SDSM systems?

Christian Perez

Who am I?

• December 1999: PhD at LIP, ENS Lyon, France
  – Data parallel languages, distributed memory, load balancing, preemptive thread migration
• Winter 1999/2000: TMR at UPC
  – OpenMP, Nanos, SDSM
• October 2000: INRIA researcher
  – Distributed programs, code coupling

Contents

• Motivation
• Related work
• Nanos execution model (NthLib)
• Nanos on top of two SDSMs (JIAJIA & DSM-PM2)
• Missing SDSM functionality
• Conclusion

Motivation

• OpenMP: emerging standard
  – simplicity (no data distribution)
• Clusters of machines (mono- or multiprocessors)
  – excellent performance/price ratio

OpenMP on top of a cluster!

OpenMP / Cluster: how?

• OpenMP paradigm: shared memory
• Cluster paradigm: message passing
→ Use a software DSM system!

Hardware DSM system: SCI (write: 2 µs)
  – specific hardware
  – not yet stable

Related work

• Several OpenMP/DSM implementations
  – OpenMP NOW!, Omni
• But:
  – modification of OpenMP semantics
  – one level of parallelism
  – no exploitation of high-performance networks

OpenMP on classical DSM

• Compiler extracts shared data from the stack
  – expensive local variable creation
    • shared memory allocation
• Modification of the OpenMP standard:
  – variables should be private by default instead of shared
  – new synchronization primitives:
    • condition variables & semaphores

OpenMP on classical DSM

• One level of parallelism (SPMD)

    !$omp parallel do
    do i = 1,4
       x(i) = x(i) + x(i+1)
    end do

  is translated into:

    call schedule(lb, ub, …)
    do i = lb, ub
       x(i) = x(i) + x(i+1)
    end do
    call dsm_barrier()

Omni compilation approach

[Figure: Omni compilation approach, taken from pdplab.trc.rwcp.or.jp/pdperf/Omni/wgcc2k/]

Our goals

• Support the OpenMP standard
• High performance
• Allow exploitation of
  – multithreading (SMP)
  – high-performance networks

Nanos OpenMP compiler

• Converts an OpenMP program to a task graph
• Communications via shared memory

    !$omp parallel do
    do i = 1,4
       x(i) = x(i) + x(i+1)
    end do

[Figure: resulting task graph with two tasks, i=1,2 and i=3,4]

NthLib runtime support

• The Nanos compiler generates intermediate code
• Communications still via shared memory

    call nthf_depadd(…)
    do nth_p = 1, proc
       nth = nthf_create_1s(…, f, …)
    end do
    call nth_block()

    subroutine f(…)
       x(i) = x(i) + x(i+1)

NthLib details

• Assumes it runs on top of kernel threads
• Provides user-level threads (QT)
  – stack management (allocation)
  – stack initialization (arguments)
  – explicit context switch

NthLib queues

• Global/local
• Thread descriptor
  – rich functionality
• Work descriptor
  – high performance

NthLib: memory management

• mmap allocation under mutual exclusion
• SLOT_SIZE stack alignment

[Figure: layout of a stack slot: nano-thread descriptor, successors, stack, guard zone]

Porting NthLib to SDSM

• Data consistency
• Shared memory management
• Nanos threads
• JIAJIA implementation
• DSM-PM2 implementation
• Summary of DSM requirements

Data consistency

• Mutual exclusion for defined data structures → acquire/release
• User-level shared memory data → barrier

Shared memory management

• Asynchronous shared memory allocation
• Alignment parameter (> PAGE_SIZE)
• Global variables/COMMON declarations

→ Not yet supported

Nano-threads

• Run-to-block execution model
• Shared stacks (father/son relationship)
• Implicit thread migration (scheduler)

JIAJIA

• Developed in China by W. Hu, W. Shi & Z. Tang
• Public-domain DSM
• User-level DSM
• DSM: lock/unlock, barrier, condition variables
• MP: send/receive, broadcast, reduce
• Solaris, AIX, Irix, Linux, NT (not distributed)

JIAJIA: memory allocation

• No control of memory alignment (x2)
• Synchronous memory allocation primitive

→ Development of an RPC version
  – based on the send/receive primitive
  – addition of a user-level message handler
→ Problems
  – global lock
  – interference with JIAJIA blocking functions

JIAJIA: discussion

• Global barrier for data synchronization
  → no multiple levels of parallelism
• Not thread-aware
  → no efficient use of SMP nodes

DSM/PM2

• Developed at LIP by G. Antoniu (PhD student)
• Public domain
• User-level, module of PM2
• Generic and multi-protocol DSM
• DSM: lock/unlock
• MP: LRPC
• Linux, Solaris, Irix (32 bits)

PM2 organization

[Figure: PM2 software stack. Modules: PM2, TBX, NTBX, DSM; MAD1 (TCP, PVM, MPI, SCI, VIA, SBP); MAD2 (TCP, MPI, SCI, VIA, BIP); MARCEL (MONO, SMP, ACTIVATION)]

http://www.pm2.org

DSM/PM2: memory allocation

• Only static memory allocation

→ Build a dynamic memory allocation primitive
  – centralized memory allocation
  – LRPC to node 0
→ Integration of an alignment parameter

Summer 2000: dynamic memory allocation ready!

DSM/PM2: marcel descriptor

[Figure: the marcel_t descriptor is located at (sp&MASK)+SLOT_SIZE, next to a page boundary]

NthLib requirement: one kernel thread → many nano-threads

DSM/PM2: marcel descriptor

[Figure: stack slot with page boundaries; a marcel_t* pointer is stored at (sp&MASK)+SLOT_SIZE, and the marcel_t descriptor is reached through *((sp&MASK)+SLOT_SIZE)]

DSM/PM2: discussion

• Using page-level sequential consistency
  + no need for barriers (multiple levels of parallelism)
  – false sharing
  → dedicated stack layout

[Figure: dedicated stack layout: padding up to a page boundary, then the marcel_t* pointer]

DSM/PM2: discussion (cont.)

• No alternate stack for the signal handler
  → prefetch pages before a context switch: O(n)
  → pad to the next page before opening parallelism

[Figure: padding up to a page boundary so that shared data starts on a new page]

DSM/PM2 improvements

• Availability of an asynchronous DSM malloc
• Lazy data consistency protocols under evaluation
  – eager consistency, multiple writer
  – scope consistency
• Support for stacks in shared memory (Linux)

DSM/PM2 shared stack support

[Figure: marcel_t descriptor at (sp&MASK)+SLOT_SIZE and a separate SEGV stack]

DSM requirement

• Support of static global shared variables
  – efficient code
    • removes one level of indirection
  – enables the use of a classical compiler
    • support for COMMON blocks
→ "Sharedization" of already allocated memory:

    dsm_to_shared(void* p, size_t size);

DSM requirement

• Support for multiple levels of parallelism
  – partial barrier
    • group management
  – dependency support
    • like acquire/release, but without a lock

[Figure: dependency support, example sequence: start(1), stop(1), update(1,2), start(2), stop(2)]

Summary of DSM requirements

• Support of static global shared variables
  → "sharedization" of already allocated memory
• Acquire/release primitive
• Partial barrier
  → group management
• Asynchronous shared memory allocation
• Alignment parameter for memory allocation
• Threads (SMP nodes)
• Optimized stack management

Conclusion

• Successfully ported Nanos to two DSMs
  → JIAJIA & DSM-PM2
• DSM requirements to obtain performance
  → support for the MIMD model
  → automatic thread migration
• Performance?