
An Efficient Abortable-locking Protocol for Multi-level NUMA Systems

Milind Chabbi, Hewlett Packard Labs, [email protected]

Abdelhalim Amer, Argonne National Laboratory, [email protected]

Shasha Wen, Xu Liu, College of William and Mary, {swen, xl10}@cs.wm.edu

Abstract

The popularity of Non-Uniform Memory Access (NUMA) architectures has led to numerous locality-preserving hierarchical lock designs, such as HCLH, HMCS, and cohort locks. Locality-preserving locks trade fairness for higher throughput. Hence, some instances of acquisitions can incur long latencies, which may be intolerable for certain applications. Few locks allow a waiting thread to abandon its protocol on a timeout. State-of-the-art abortable locks are not fully locality aware, introduce high overheads, and are unsuitable for frequent aborts. Enhancing locality-aware locks with a lightweight timeout capability is critical for their adoption. In this paper, we design and evaluate the HMCS-T lock, a Hierarchical MCS (HMCS) lock variant that admits a timeout. HMCS-T maintains the locality benefits of HMCS while ensuring that aborts are lightweight. HMCS-T offers the progress guarantee missing in most abortable queuing locks. Our evaluations show that HMCS-T offers the timeout feature at a moderate overhead over its HMCS analog. HMCS-T, used in an MPI runtime lock, mitigated the poor scalability of an MPI+OpenMP BFS code and resulted in 4.3× superior scaling.

Keywords: Synchronization, Spin lock, Queuing lock, MCS lock, Hierarchical lock, Abortable lock, Timeout lock.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

PPoPP '17, February 04-08, 2017, Austin, TX, USA. Copyright © 2017 held by owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-4493-7/17/02...$15.00. DOI: http://dx.doi.org/10.1145/3018743.3018768

1. Introduction

Multi-socket systems have become a norm rather than an exception in modern server and HPC establishments. State-of-the-art shared memory machines such as SGI UV [22] and HP Integrity Superdome X [10] are not only multi-socket but also multi-node. Multi-node, multi-socket systems along with the deep processor cache hierarchy lead to Non-Uniform Memory Access (NUMA) latencies. Maintaining locality of reference is critical to achieving high performance on NUMA architectures.

The emergence of NUMA architectures has propelled researchers to develop various "locality-aware" locking algorithms. Locality-aware locks prefer to hand over a lock from the lock holder to the "nearest" lock requester instead of the "earliest" lock requester. This kind of locality bias reduces the indiscriminate migration of shared cache lines (accessed in the critical section and the lock's internal data structures) among different locality domains.

Hierarchical locks sacrifice fairness [5] to exploit locality. In a hierarchical lock, on a highly contended system with a deep NUMA hierarchy, a lock requester may have to wait for an unusually long time when a distant NUMA domain holds the lock. While spinning for a short time to acquire the lock from a neighbor is desirable to enhance locality and throughput, idle waiting for a long time hurts parallelism. The following use cases motivate the need for allowing a waiting thread to timeout and abandon the protocol: (1) A thread in an application with sufficient parallel slackness [9, 24] may desire to discontinue a work chunk blocked on long lock waiting and instead take up a different work chunk not subjected to serialization, (2) A low-priority process may want to temporarily abandon (or permanently abort) its lock request to expedite the lock handoff to a high-priority process, and (3) Database systems assume a deadlock when no progress is observed for a long time, and they require the timeout capability in their locks to unroll a transaction.

State-of-the-art abortable locks suffer from various limitations. They introduce blocking wait (e.g., [20]), have unbounded space and time complexities (e.g., [19] and [8]), lack full locality awareness (e.g., [20], [19], [8]), do not ensure starvation freedom (e.g., [8]), and incur high memory management costs (e.g., [19], [8]). Hierarchical queue-based spin locks with a timeout would offer greater flexibility for application programmers to exploit locality of shared data when available and abandon a long wait to exploit parallelism that exists elsewhere in the system. However, state-of-the-art hierarchical queue-based locks (e.g., [5]) lack the timeout capability. Few, if any, verify their correctness.


We have designed the HMCS-T lock, which allows a waiting thread to abandon on timeout. The HMCS-T lock addresses the aforementioned challenges while being mindful of the principles of the HMCS lock—exploiting locality. It is challenging to incorporate the timeout capability in the HMCS lock because of the dependency of successors at each level on their predecessors to hand off the lock through the hierarchical tree. A timeout may happen when a thread is waiting to acquire a lock at any level in the tree. Abandoning from an interior node in the tree is particularly challenging since it involves releasing already acquired lower-level locks. Wait loops at various places in the protocol pose daunting challenges in making the protocol non-blocking on timeout. Furthermore, attention needs to be paid to maintain locality in every step of the design. A design that addresses these complications needs rigorous correctness (mutual exclusion and livelock and deadlock freedom) and progress (starvation freedom) guarantees.

The key idea in the HMCS-T lock is that an abandoning thread leaves its queue node record [5] (aka QNode) intact in the MCS queue structure with a special status flag indicating that the owner has abandoned. When a lock holder notices an abandoned successor, it passes the lock to the next unabandoned successor. If no unabandoned successor is found at a level, the release proceeds to find a waiting thread at an ancestral level, if any, in the hierarchical tree. A disciplined, locality-preserving abandonment policy handles timeout while retaining the high performance of HMCS.

In HMCS-T, when an abandoned thread requests the same lock again, instead of bringing a new QNode record, it reuses its abandoned QNode. This design ensures a space complexity linear in the number of participants. This "reuse" design allows an abandoned thread to re-secure its original position in the queue where it had previously abandoned. Such a thread can resume its wait notwithstanding its prior abandonment if the lock was not already passed beyond its position. A consequence of this design is that a thread can opportunistically express its "locking intent" well before the lock may be actually needed, possibly reducing the waiting time when the need for the lock becomes imminent.

The HMCS-T lock ensures starvation freedom for threads that can wait and ensures a bounded number of steps to abandon (on timeout) or release the lock. Having all these traits is uncommon in lock designs that admit a timeout [8, 19, 20]. For threads that do not admit a timeout, HMCS-T retains the same starvation freedom and fairness guarantees as the HMCS lock, even when intermixed with threads that admit timeouts. We have proved the correctness of HMCS-T via formal verification using the Spin [12] model checker.

The contributions of this paper are the following:

1. Design of a high-throughput queue-based hierarchical spin lock with a timeout capability, which exploits locality and ensures low overheads for abort and reentry.

2. Formal verification of correctness and progress guarantees of the intricate algorithm.

3. Demonstration of the superiority of HMCS-T over other abortable locks, and a demonstrated 34% speedup and 4.3× superior scaling in an MPI+OpenMP Graph500 BFS kernel by using the HMCS-T lock.

2. Background and Related Work

MCS [17] and CLH [15] queue-based locks do not allow abandoning once a thread has enqueued itself. Scott and Scherer [20] devised MCS and CLH locks with a timeout, which were not non-blocking. Scott designed the non-blocking variants (MCS NB and CLH NB locks [19]), which, unfortunately, introduced an unbounded worst-case space and time complexity for a given number of threads. The designs in [19] need explicit memory management before each acquisition with potential remote memory accesses, which becomes a bottleneck as we show later in our experiments (cf. § 6.2). In contrast, HMCS-T ensures starvation freedom—the threads that do not timeout eventually acquire the lock, and the threads that timeout terminate their protocol in a bounded number of steps. HMCS-T has no loop that cannot admit a timeout; it has a bounded space complexity, and it does not require continuous memory management. The improvements come with an intricate design and slightly reduced performance compared to the baseline HMCS.

Jayanti [14] proposed a token-based abortable lock with linear space bounds and logarithmic time bounds. Pareek et al. [18] proposed a randomized algorithm for abortable locks that can achieve sub-logarithmic remote memory references. Marathe et al. [16] combined a fixed-length queue lock with a back-off lock, relieving the design from expensive memory management. None of these locks are NUMA aware.

Cohort locks [8] employ two levels of locks, treating a system as a two-level NUMA hierarchy. The authors present two timeout-capable cohort locks—a 2-level backoff lock and a global backoff lock coupled with a local CLH lock (A C BO CLH). Aborting a 2-level backoff lock is straightforward since there is no per-thread state. To abort an A C BO CLH lock, the authors use Scott's CLH NB lock [19] at the local level, which also suffers from repeated memory management but isolates the problem to a socket. These two abortable locks neither ensure starvation freedom (for threads that do not abort) nor bound the steps after timeout (for threads that abort), unlike HMCS-T.

HMCS lock [5]. Since we are concerned with the timeout capability in the HMCS lock [5], we provide the details of HMCS first. HMCS employs a tree of MCS locks to leverage multiple levels of locality in a NUMA hierarchy (Figure 1). The tree structure mirrors the underlying NUMA hierarchy; each locality domain has an MCS lock of its own, which is contested by its subdomains. For example, all SMT threads on a core may compete for an MCS lock designated for that CPU core. Not all NUMA levels need be mirrored in the tree.


Figure 1: HMCS hierarchical tree. Threads arrive with their context nodes (empty circles) and compete for their designated level-1 MCS locks. Higher-level context nodes (solid circles) are pre-allocated in the nearest NUMA domains.

Figure 2: HMCS data structures (HNodes with parent, tail, threshold, and QNode fields; QNodes with next and status fields; client threads arrive with their own QNodes for the level-1 MCS queues).

Lock acquisition for each thread begins with competing for a designated leaf-level lock that corresponds to the thread's innermost NUMA domain. When a thread acquires an MCS lock in the tree in an uncontended manner (no predecessor), it proceeds to contend for its parent lock (if any) in the tree. A thread that acquires the lock at the root of the tree immediately enters the critical section. Any thread that fails in immediately acquiring a lock in the tree waits and receives that lock from its predecessor. The release protocol also starts at a leaf of the tree, implicitly passing all locks held by the owner to its local successor, if any. A waiting thread at the deepest node on a thread's path to the root, typically, inherits all locks held by its predecessor. As a result, when a successor is granted a lock, it immediately enters the critical section without explicitly contending for other locks along its path to the root. To prevent starvation, threads may pass an HMCS lock within a locality domain for a bounded number of times. HMCS does not admit a timeout.

Each interior node of the HMCS tree (HNode) encapsulates an MCS lock, a queue node (QNode), and a pointer to its parent HNode (Figure 2). Each HNode is pre-allocated in the corresponding NUMA domain. Threads arrive with their own QNodes and a reference to their leaf-level HNode. When competing for a non-leaf-level lock, a thread uses a QNode present inside the HNode dedicated for that domain.
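
For concreteness, the following C++ sketch shows one plausible layout of these two records, inferred from Figure 2 and the description above; the field names and types are assumptions on our part, and cache-line padding is omitted.

#include <atomic>
#include <cstdint>

// Per-thread (or per-domain) queue node. In HMCS/HMCS-T the status word also
// carries the local passing count and the special flags described later.
struct QNode {
  std::atomic<QNode*>   next{nullptr};
  std::atomic<uint64_t> status{0};
};

// One HNode per locality domain, pre-allocated in that NUMA domain.
struct HNode {
  HNode*              parent;     // static link to the enclosing domain
  std::atomic<QNode*> tail;       // tail pointer (lock word) of the MCS lock this HNode encapsulates
  uint64_t            threshold;  // local passing threshold (theta) for this level
  QNode               node;       // QNode used when this domain competes at the parent level
};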

Terminology. We consider a leaf lock of HMCS to be at level 1 and the root lock to be at level n. We represent an arbitrary level with the letter l. The tail pointer of an MCS lock refers to the lock word [5, 17]. The tail pointer is null when the lock is free and non-null otherwise. Although a level-l MCS lock's tail pointer is physically placed in a level-(l+1) HNode data structure, for the ease of prose, level-l lock, level-l queue, and level-l QNode refer to the lock at level l, the MCS queue used for competing for the level-l lock, and the constituent QNodes in that queue, respectively. For example, a level-1 QNode is inside a level-1 queue, which is used to compete for a level-1 lock. An HMCS lock that has n lock levels is represented as HMCS〈n〉 and the analogous HMCS-T lock is represented as HMCS-T〈n〉. MCS ≈ HMCS〈1〉.

Two or more QNodes (and hence their owner threads) are peers at level l, if their lowest common ancestral lock is at level l. Two or more QNodes belong to the same domain at level l, if they share a common lock at level ≤ l. In a sequence of locks (l)(l + 1) · · · (m), where 1 ≤ l < m ≤ n, from level-1 to the root, a lock prefix refers to the sequence (l)(l + 1) · · · (k), where l ≤ k ≤ m, and a lock suffix refers to the sequence (k + 1) · · · (m). If a lock holder hands off the lock to a waiting thread, we refer to it as "lock passing". A thread "relinquishes" the lock if no thread is waiting during the lock-passing attempt. A "release" refers to either lock passing or relinquishing. SWAP refers to an atomic exchange operation. CAS refers to an atomic compare exchange operation. On timeout a thread "aborts" or "abandons".

3. Design of One-level HMCS-T

We first describe the HMCS-T〈1〉 protocol, which allows abandoning in an MCS lock (i.e., HMCS〈1〉). In §4, we generalize the design to an n-level HMCS-T lock.

Overview. The key idea of HMCS-T is that a thread can abandon by updating its already enqueued QNode by SWAPing a special flag, leaving the QNode behind in the MCS queue. On noticing an abandoned successor, a lock-releasing predecessor takes the additional responsibility of passing the lock to an unabandoned successor in the queue. The protocol provisions readmission of a previously abandoned requestor into its earlier secured position, if possible. Figure 3 depicts some important steps in the protocol.

Details. The HMCS-T〈1〉 encodes special values—waiting (W), unlocked (U), abandoned (A), and recycled (R)—in the status field of a QNode. Every QNode is initialized with the R value for its status field and null for its next field on creation. Typically, a thread waiting to acquire the lock has a W status in its QNode; a lock holder passes the lock to its successor by SWAPing a U into its successor's status field. Symmetrically, a waiting thread on observing its QNode's status change from W to U infers the lock ownership; a thread updates its QNode status to A when it aborts on a timeout. A lock releaser that observes an A in its successor's status while passing the lock infers the successor's abandonment and performs the additional work of releasing the lock to an unabandoned successor, if any.
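
A minimal sketch of the timed wait that these status values enable is shown below; it assumes the QNode layout sketched earlier, and the rdtsc-based deadline and helper name are illustrative rather than the paper's API.

#include <atomic>
#include <cstdint>
#include <x86intrin.h>  // __rdtsc

enum Status : uint64_t { R = 0, W = 1, U = 2, A = 3 };  // recycled, waiting, unlocked, abandoned

// Spin until a predecessor grants the lock (W -> U) or the deadline passes.
// On timeout, a single CAS publishes the abandonment; a failed CAS means the
// grant arrived concurrently, so the acquisition actually succeeded.
static bool WaitWithTimeout(std::atomic<uint64_t>& status, uint64_t deadline) {
  while (__rdtsc() < deadline) {
    if (status.load(std::memory_order_acquire) == U) return true;
  }
  uint64_t expected = W;
  return !status.compare_exchange_strong(expected, A);  // CAS failure => lock acquired
}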


Figure 3: (a) The lock is free, and the thread t1 arrives with a virgin QNode. (b) t1 swaps its status to W. (c) t1 swaps the tail pointer to itself. (d) Since t1 has no predecessor, it updates its status flag to unlocked (U) and becomes the lock owner. (e) Threads t2 and t3 follow a similar protocol and get enqueued. (f) t2 times out and CASes A in its status flag. (g) t1 begins releasing the lock by swapping t2's status and notices that t2 has abandoned. (h) t1 proceeds to t2's successor and passes the lock to t3. Also, t1 updates the next pointers to point to their predecessors in steps (g) and (h). (i) t1 updates the status of t2 and t1 to R indicating their readiness for the next round of acquisition. After (f), t2 may re-enter the queue and transition the system back to (e). Also, t2 may re-enter after (g) by swapping its status to W in (j), but it cannot acquire the lock or reuse the QNode until t1 has passed the lock to t3 as shown in (k) and updated t2's status to R as shown in (l).

QNode maintenance. In HMCS-T, the next and status fields of a QNode will be accessed by a predecessor even after the owner thread has aborted its acquisition. Hence, an owner thread may not deallocate its QNode until the node is marked for recycling. We use the C++ destructor to ease the QNode maintenance. For convenience, each participating thread allocates an object (QNodeObject) that encapsulates a QNode along with a reference to the parent lock. The object's destructor waits until the object is marked for recycling. The object's constructor initializes the next and status fields and sets the parent pointer. The acquire and release protocols use a reference to this per-thread object. A thread can reuse the same object across multiple episodes of lock requests without having to reinitialize; for high performance, the reuse is necessary. Listing 1 shows an example usage model, which is applicable to any n-level HMCS-T. The object can reside anywhere—stack, heap, or thread-local storage. Heap allocation allows returning from a callee stack frame without waiting (in the destructor) for the object to be recycled. When used in C, the user must explicitly call the constructor and destructor routines after allocation and before deallocation, respectively.
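
A sketch of such a wrapper object appears below; it assumes the recycled state is the status value R from the sketches above, and the class and lock type names are illustrative.

struct HMCSTLock;  // forward declaration; stands in for the lock type (e.g., HMCST<3> in Listing 1)

class QNodeObject {
 public:
  explicit QNodeObject(HMCSTLock& lock) : parent(&lock) {
    qnode.next.store(nullptr, std::memory_order_relaxed);
    qnode.status.store(R, std::memory_order_relaxed);
  }
  // The destructor blocks until a releaser has marked the node recycled (R), so the
  // QNode is never deallocated while a predecessor may still dereference it.
  ~QNodeObject() {
    while (qnode.status.load(std::memory_order_acquire) != R) { /* spin */ }
  }
  QNode      qnode;   // node enqueued into the level-1 MCS queue
  HMCSTLock* parent;  // reference to the owning lock
};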

Acquisition. A thread t begins its lock entry protocol by SWAPing the value W to the status field of the thread's QNode q. The previous value of the status field distinctly indicates the state of the QNode, which governs the next action taken by t. Depending on whether the QNode is ready to use, abandoned previously, or in an intermediate state of being updated by a predecessor, the following three scenarios arise:

// Initialize the lock giving it the machine topology
HMCST<3> hmcstLock(machineTopology);
#pragma omp parallel
{
  // Assume threads are pinned to cores.
  // Create and initialize the per-thread QNode object.
  QNodeObject ctxt(hmcstLock);
  for ( ... ) { // ctxt is reused many times
    if (hmcstLock.Acquire(&ctxt, timeout) == SUCCESS) {
      CriticalSection();
      hmcstLock.Release(&ctxt, timeout);
    } else {
      NonCriticalSection();
    }
  } // end for-loop
} // The destructor ensures ctxt is ready to free.

Listing 1: An example 3-level HMCS-T lock in an OpenMP C++ code.

Recycled node (R): The QNode q is ready for use. In this case, t, by and large, follows the MCS lock's enqueue protocol: it swaps the tail pointer to point to q, and if t has no predecessor, it becomes the lock owner, overwriting the status flag to U (not done in the original MCS); otherwise, t spins either until the lock is granted (some predecessor changes q.status to U) or a timeout occurs. On timeout, t CASes the value A to q.status and abandons the protocol if the CAS succeeds. CAS failure implies lock acquisition.

Previously abandoned node (A): It implies t is attempting to re-acquire a lock that it had previously abandoned. In this case, the lock has not yet been passed beyond t by any of its predecessors. Hence, t resumes its wait by directly jumping to the wait loop in the MCS acquisition protocol. Note that swapping the tail pointer and updating the predecessor's next field are elided. While waiting for the lock, if t times out once again, it CASes an A into its q.status and exits if the CAS succeeds. CAS failure implies lock acquisition.

Node unlocked after abandonment (U): It implies t must have abandoned previously, and some predecessor p must have tried to grant the lock to t by SWAPing a U value. Since t had already abandoned the node, p is now in the process of passing the lock to a waiting successor in-line past t. Once p has passed or relinquished the lock, it will change the status of t's QNode q to R. In this case, t waits for q.status to change to R. After q.status becomes R, t goes through the full enqueue process following the steps listed for a "recycled node". While waiting for the status to change to R, however, if t times out, it CASes a U to revert q.status and abandons the wait on CAS success. Failure to CAS implies q is recycled, and t may choose one of the following options: (1) [optimistic] follow the steps listed in the "recycled node" case and attempt to acquire the lock hoping to get it immediately without waiting, or (2) [pessimistic] return as if the lock waiting timed out. We have implemented the optimistic strategy.

The status flag shall never be W when SWAPing at the entry.
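
Putting the three cases together, a condensed sketch of the one-level acquire path might look as follows. WaitWithTimeout is the helper sketched earlier; WaitForRecycle (wait for R, or revert W back to U on timeout) is a hypothetical helper, and the function itself is our reading of the protocol rather than the paper's code.

bool WaitForRecycle(std::atomic<uint64_t>& status, uint64_t deadline);  // hypothetical helper

// Returns true if the lock was acquired, false if the request was abandoned.
bool AcquireT1(std::atomic<QNode*>& tail, QNode& q, uint64_t deadline) {
  uint64_t prev = q.status.exchange(W, std::memory_order_acq_rel);
  if (prev == R) {                               // fresh or recycled node: normal MCS enqueue
    QNode* pred = tail.exchange(&q, std::memory_order_acq_rel);
    if (pred == nullptr) {                       // no predecessor: we own the lock
      q.status.store(U, std::memory_order_release);
      return true;
    }
    // (The non-blocking release handshake later in this section uses a SWAP here.)
    pred->next.store(&q, std::memory_order_release);
    return WaitWithTimeout(q.status, deadline);  // spin for U; abandon with CAS(W -> A) on timeout
  }
  if (prev == A)                                 // we abandoned earlier; the lock has not passed us
    return WaitWithTimeout(q.status, deadline);  // resume waiting in the previously secured position
  // prev == U: a predecessor stepped over our abandoned node and will recycle it.
  if (!WaitForRecycle(q.status, deadline))       // wait for R; revert W -> U and give up on timeout
    return false;
  return AcquireT1(tail, q, deadline);           // optimistic re-entry with the recycled node
}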

Release. A lock releasing thread performs actions that allow abandoned QNodes in the queue to become reusable by their owner threads. When a thread, r, is releasing the lock by updating the status field of its successor s with a value U, the successor could be simultaneously updating its status field with an A in an effort to abandon. HMCS-T's release protocol is modified to use a SWAP on the successor's status field, unlike the unconditional store used in the original MCS lock. The releaser r may have one of the following cases based on the previous value observed in the status field of the successor:

Waiting successor (W): It implies r successfully granted the lock to s, which was still waiting.

Abandoned successor (A): It implies s abandoned the lock before r performed its release. In this case, r "impersonates" as if s were releasing the lock to its successor, and r repeats one of these two steps, if s has a successor.

In case the releaser r finds no waiting successor in the MCS queue, it CASes the tail pointer to null, similar to the MCS lock. In the "abandoned successor" case above, once a releaser takes the role of its successor, it unconditionally overwrites its successor's next field to point to its predecessor, which aids in the reverse traversal. Thus, the next pointer is reused as a predecessor pointer. It is legal to trash the next pointer and reuse it for another purpose since no one will be using it beyond this point. Once r has successfully passed or relinquished the lock, it follows the chain of predecessor pointers (now stored in the next fields of abandoned QNodes), unconditionally setting the status in each abandoned QNode to R. This step essentially marks abandoned (and potentially re-entered) QNodes as ready for reuse. Any attempt to reuse a U-marked QNode by its owning thread for an acquisition during this intermediate state will cause the acquire protocol to land in the "node unlocked after abandonment (U)" case.

The order of recycling the abandoned QNodes is LIFO. One can remember the first node and the last node and follow FIFO ordering; we chose LIFO with the intention of allowing newcomers a better opportunity since they will have more time left to reenqueue and likely succeed. The releaser r does not eagerly flip the status of abandoned QNodes to R during its left-to-right forward journey because doing so allows abandoned threads to reenqueue and abandon repeatedly, making r's time complexity unbounded. Marking the QNodes after the lock release bounds the release protocol's time complexity to linear in the number of participants.
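
The release path, including the step over abandoned successors and the LIFO recycling pass, could be sketched as follows; this simplification keeps the blocking wait of the original MCS relinquish, which the non-blocking variant below replaces with the impatient-release handshake.

void ReleaseT1(std::atomic<QNode*>& tail, QNode* mine) {
  QNode* q = mine;                       // node the releaser currently impersonates
  QNode* lastStepped = nullptr;          // most recently stepped-over abandoned node
  for (;;) {
    QNode* succ = q->next.load(std::memory_order_acquire);
    if (succ == nullptr) {
      QNode* expected = q;
      if (tail.compare_exchange_strong(expected, nullptr)) break;  // relinquished
      while ((succ = q->next.load(std::memory_order_acquire)) == nullptr) { /* spin */ }
    }
    uint64_t prev = succ->status.exchange(U, std::memory_order_acq_rel);
    if (prev == W) break;                // lock passed to a waiting successor
    // prev == A: step over the abandoned node and reuse next as a predecessor pointer.
    succ->next.store(q, std::memory_order_release);
    lastStepped = succ;
    q = succ;
  }
  // Reverse (LIFO) pass: mark every stepped-over node ready for reuse, then our own node.
  for (QNode* p = lastStepped; p != nullptr && p != mine; ) {
    QNode* pred = p->next.load(std::memory_order_relaxed);  // read before releasing the node
    p->status.store(R, std::memory_order_release);
    p = pred;
  }
  mine->status.store(R, std::memory_order_release);
}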

Non-blocking release. The MCS lock has an unbounded wait in its release protocol. In MCS, if a lock releaser, r, notices no successor, it CASes the tail pointer to null. However, in the meantime, a new thread, s, might have swapped the tail pointer, and s may be in the process of enqueueing. On CAS failure, while the original MCS protocol waits indefinitely until the successor advertises itself by updating the next field of its predecessor, HMCS-T deviates from this behavior and makes the release protocol non-blocking. To accomplish the non-blocking release, HMCS-T obeys the following protocol (see Figure 4 for an example):

Release: On failure to CAS the tail pointer to null, the releaser r attempts to CAS a special value M (iMpatient) onto the next field of its QNode qr. Note that qr may not be owned by r; if successors of r had abandoned, r would be impersonating the last one in the chain of abandoners. If the CAS succeeds, r's forward journey is over. However, r does not toggle qr.status to R on its return journey through the predecessors because it is unsafe to recycle qr until s has published itself by updating qr.next. If the CAS fails, because s published itself by updating qr.next in the meantime, then r follows the next pointer to s and SWAPs qs.status to U. The procedure may recur if s aborts before r updates qs.status.

Figure 4: (a) Thread r owning the QNode qr is releasing the lock when another thread s arrives, but s is slow in updating qr's next pointer to itself. (b) r times out and CASes a special value M in qr.next and leaves. (c) Later, s SWAPs a pointer to itself into qr.next. (d) Having observed the special value M, s updates qr.status to R indicating that qr can now be reused. Also, s updates its status to unlocked (U) and becomes the lock owner.

Acquire: To perform a symmetric handshake with the release protocol, the acquire protocol of a successor s performs a SWAP on the next field of its predecessor. If s sees the special value M in its predecessor's next field (i.e., qr.next), it infers that the predecessor abandoned having become impatient waiting for its successor. Consequently, s updates its predecessor QNode's status to R indicating that the predecessor node (i.e., qr) can now be recycled. Also, since s had no "legal" predecessor, it implies lock acquisition. s updates its status to U and enters the critical section.
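
The two sides of this handshake can be sketched as follows; the sentinel value and the function names are illustrative.

static QNode* const M_IMPATIENT = reinterpret_cast<QNode*>(0x1);    // the special value M

// Releaser side: give up waiting for a late successor to publish itself.
bool LeaveImpatiently(QNode* qr) {
  QNode* expected = nullptr;
  return qr->next.compare_exchange_strong(expected, M_IMPATIENT);   // success: the successor will recycle qr
}

// Acquirer side: publish ourselves in the predecessor's next field with a SWAP.
bool PublishSelf(QNode* pred, QNode* self) {
  QNode* prev = pred->next.exchange(self, std::memory_order_acq_rel);
  if (prev == M_IMPATIENT) {             // predecessor already left: no legal predecessor remains
    pred->status.store(R, std::memory_order_release);  // the predecessor's node may now be recycled
    self->status.store(U, std::memory_order_release);  // we own the lock
    return true;
  }
  return false;                          // normal case: wait for the predecessor's grant
}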

4. Design of N-level HMCS-T

In principle, if a thread wants to abort in HMCS-T〈n〉, it updates the status flag of the QNode it is waiting on to A and releases all of its lower-level locks. Symmetrically, if a lock releaser encounters an abandoned successor, it continues looking for a waiting thread until it has found one at the current level; otherwise, it continues to search at the parent level. The details are more involved. Figure 5 presents the key steps of the HMCS-T protocol; we provide the detailed description below.

Details. If a thread (say α) times out while waiting for a level-l lock, it updates the status of its QNode (the one it used for competing for the level-l lock) to A. A level-l predecessor of α will ensure passing the level-l (and above) locks to a level-l waiting successor past α, similar to HMCS-T〈1〉.

If α times out while competing for a level l > 1 lock, it must have already acquired the locks at levels [1, l − 1]. By now, there could be other threads waiting at these lower levels (possibly some already abandoned) relying on α to pass the global lock, which α failed to acquire. Hence, after abandoning at level l, α should release the locks at levels [1, l − 1] (see Figure 5(a)). Naively, α could release all locks from level l − 1 to 1, in that order, to a first waiting successor, forcing all of its successors at each level to compete at their parent levels. This naive design, however, suffers from the following problems: 1) releasing all l − 1 locks is extra work for α compared to simply passing a prefix of locks (suffix from α's viewpoint) to the lowest common descendent, 2) top-down traversal makes high-latency releases compulsory, and 3) every successor at each level has the additional work of competing at the next level.

Figure 5: (a) The thread α abandons at level 4 having acquired the level 1-3 locks. Every QNode in the left frontier has the C status. There are several abandoned QNodes in the same domain as α under level 4. (b) α attempts to pass the level 1-3 locks to a successor at level 1 but finds none. α ascends to level 2 and passes the level 2 and 3 locks to the waiting successor β. α marks the status of all QNodes along the path to P, temporarily disallowing their reuse. (c) β updates its status from P to C. β recognizes (not shown) that it owns the level 3 lock by observing the QNode status of C on ascending to level 3. (d) Concurrently, α traverses the predecessors in level 2 and marks them ready for reuse. By the time α descends to level 1, γ and δ enqueue at level 1 with γ already abandoned. (e) α passes the level-1 lock to the first waiting successor δ. (f) α traverses the predecessors in level 1 and marks them ready for reuse. Concurrently, δ updates its status to C and proceeds to compete at level 2.

Our solution is a disciplined, two-pass lock-passing arrangement that exploits locality and minimizes work. In a forward pass, starting from the leaf to level l − 1, α finds the lowest common waiting descendent β and passes all locks it has held starting from that level till level l − 1 to β. For example, in Figure 5, α abandons at level 4 and passes the already acquired level 2 and 3 locks to β, which was waiting at level 2. β now skips acquisitions at levels that it inherited from α by noticing the special value (C) left by α in the status fields of the QNodes at levels that are already acquired. If α found β at a level i < l − 1, α's forward journey stops at level i. In a reverse pass from level i − 1 down to level 1, α releases the locks it had not passed during its upward journey. During the forward pass, α would have marked every abandoned successor along the way as temporarily not reusable; it marks them as reusable during its reverse pass.

Deviation in internals from HMCS. The HMCS protocol uses its status field to encode the passing count—the number of times the lock has been circulated locally [5]. The passing count is incremented on each successive passing to a peer, and the global lock is passed to a successor s by signaling it with a symbolic value V ∈ [2, θ], where θ is the threshold of local passing. A releaser signals its successor s with a special value ACQUIRE PARENT (P), either when the threshold (θ) of local passing is reached or when s arrives after the releaser already released the global lock at an enclosing level. A waiting thread s that is signaled with the special value (P) immediately resets its status to COHORT BEGIN (C, numerically equal to 1), which indicates: (1) s's ownership of the local lock alone and hence the need to compete at the parent level, and (2) the start of a new round of local passing.

In HMCS-T, the meaning of P is generalized to signal a successor s that it now inherits a prefix of locks (instead of the local lock alone) starting at the current level on s's path to the root. Symmetrically, the status C is generalized to indicate that the thread owns a prefix of locks starting at the current level but not all locks on the path to the root. The value P can be used either when a thread aborted at an ancestor level (and hence is passing a lock prefix) or when it reached the local passing threshold. In either case, the successor does not inherit the global lock; the successor turns its status to C and proceeds to compete at ancestral levels, similar to HMCS.
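
One possible encoding of the per-level status values discussed here is sketched below; the paper fixes only that C equals 1 and that legal passing values V lie in [2, θ], so the remaining numeric choices are our assumptions.

#include <cstdint>

namespace hmcst_status {
  constexpr uint64_t R = 0;           // recycled: the QNode may be reused
  constexpr uint64_t C = 1;           // COHORT BEGIN: owns a prefix of locks starting at this level
  // values 2..theta are legal passing values V: all locks inherited, V is the local passing count
  constexpr uint64_t P = ~0ull - 2;   // ACQUIRE PARENT: inherited a prefix, compete at the parent level
  constexpr uint64_t W = ~0ull - 1;   // waiting
  constexpr uint64_t A = ~0ull;       // abandoned
}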

Acquisition. Acquisition begins with competing for a designated level-1 lock. The acquisition protocol of any level-l lock starts by SWAPing the value W to the status field of the QNode q; remember that the leaf-level QNodes are provided by the client threads and the interior QNodes at each level are pre-allocated within HNodes. The previous value of the status field of the QNode governs the subsequent actions taken by the acquire protocol for a thread t. HMCS-T's non-root-level lock has more status flags to handle. The following four cases arise based on the previous value of the status flag:

Recycled node (R): This implies q is reusable. HMCS-T enqueues q by SWAPing the tail pointer of the corresponding level-l lock. Subsequently, one of the following ensues:

1. Uncontended acquisition (tail was null): t updates q.status to C (the beginning of a new cohort) and proceeds to compete for the parent level lock,

2. Inheritance of all locks (status becomes a legal passing value, V): acquisition ends successfully,

3. Acquisition of a prefix of locks (status becomes P): t updates q.status to C and proceeds to the next level,

4. Timeout: t CASes an A into q.status and returns failure if the CAS succeeds. CAS failure means successful acquisition.

Previously abandoned node (A): This implies the QNode q was previously abandoned either by t itself or by a thread belonging to the same domain. HMCS-T, in this case, skips enqueuing q and resumes the spin wait on q.status.

Inherited ancestral lock (C): This happens when the current level (l) lock was already acquired by a thread p belonging to t's domain at level l or below. Since q.status was C, p must have started a new round of local passing at level l, which implies p did not inherit all locks on its path to the root, and instead it must have proceeded to compete at higher levels (see "Recycled node" cases 1 and 3). Also, p must have abandoned at some level j > l and subsequently, p found t as its lowest-common descendent at a level k < l and passed a prefix of locks to it. Hence, t already owns the lock at level l. Since t swapped a W during its entry, it reverts q.status to C and proceeds to level l + 1. Reverting is necessary to allow a subsequent thread to acquire this level's lock if t also were to abort at a higher level.

Node unlocked after abandonment (V and P): V ∈ [2, θ] symbolizes that a predecessor already stepped upon an abandoned node. This case is analogous to the U case of HMCS-T〈1〉: either t itself or some other thread belonging to the same domain as t at level l must have previously abandoned at level l; t's predecessor p (a level-l peer) must have tried to pass the lock to t; p, having noticed that t had abandoned, is now in the process of passing the lock to a waiting successor past t. p will, eventually, revert q.status to R once it releases the lock. Hence, t waits for q.status to change to R and then re-enqueues q afresh into the MCS queue at level l obeying the protocol already listed under "recycled node". While waiting for q.status to become R, if t times out, it CASes the value P into q.status and aborts if the CAS succeeds. CAS failure happens iff q.status becomes R, allowing t to reenqueue q optimistically hoping to acquire the lock without waiting.

The status flag shall never be W when SWAPing at the entry. Also, if a thread ends up acquiring the lock at a level when it is about to abandon but does not inherit all locks, it continues to attempt the next-level lock, optimistically hoping to acquire the next-level lock in an uncontended manner. If, however, it fails to acquire the next-level lock, it begins the abandonment process at that level.

Abandonment. To abandon, while waiting for a lock at level l, the abandoner α obeys the following protocol (a condensed code sketch follows the list):

1. α, first, SWAPs the value A into the status flag of its QNode used for competing at level l.

2. If at a non-leaf level (that is, l > 1), α attempts to pass its already acquired locks (levels [1, l−1]) to the lowest common waiting descendent thread. That is, to retain maximum locality, α starts looking for a waiting successor starting at the leaf level (curLevel=1). Lock passing is similar to HMCS-T〈1〉, looking for the first unabandoned successor. α SWAPs the special value P into its successor s's status field, which signals s (if waiting) to acquire the necessary suffix locks on its path to the root starting at curLevel+1.

3. If α succeeds in passing the lock prefix to a waiting peer at curLevel, it goes to Step #4; otherwise (the next pointer of the right-most peer is null), α ascends to the next level (if any) and recursively performs Step #3 with curLevel=curLevel+1.

4. By now, α either passed its lock to a waiting successor or relinquished the lock by CASing the tail pointer to null. α traverses the QNodes at curLevel from right to left, flipping the flags of each QNode to R, indicating that those QNodes can now be recycled.

5. If curLevel>1, α descends to its children level (curLevel=curLevel-1). Having already passed its acquired locks at levels > curLevel, α now needs to release the lock at curLevel. Let m be the QNode that had its next pointer null when α ascended to the next level in Step #3. It is possible for more threads to have enqueued past m when α was busy passing the upper-level locks. Starting at m.next, if any, α attempts to pass the curLevel lock to a waiting curLevel peer. Since passing the lock at curLevel in this phase does not correspond to passing all upper-level locks (which were already released by now), α SWAPs the special status flag P, indicating the successor to compete at the next level. If no peer is found, α relinquishes the curLevel lock by CASing the tail pointer to null. Having released the curLevel lock, α performs a right-to-left traversal of all QNodes whose flags it had SWAPed to P and flips them to R, indicating that they can now be recycled. Figure 5(d)-(f) captures this case. α now repeats Step #5.

At any point, if a lock releasing thread times out while waiting for a successor to update its next pointer, it uses the aforementioned non-blocking technique. Also, at each level, once a releasing thread steps past its abandoned successor node, it uses the next pointer to serve as a predecessor pointer for its return journey.
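
The abandonment steps above can be condensed into the sketch below; path[i] denotes the HNode this thread competes at on level i, and the helper names (MarkAbandoned, PassPrefixToWaitingPeer, RelinquishLock, PassOrRelinquishLocal, RecycleSteppedOverNodes) are illustrative rather than the paper's API.

void MarkAbandoned(HNode*);            // SWAP A into the QNode used at this level
bool PassPrefixToWaitingPeer(HNode*);  // pass the held prefix with P; false if no waiter is found
void RelinquishLock(HNode*);           // CAS the tail pointer to null
void PassOrRelinquishLocal(HNode*);    // pass only this level's lock with P, or relinquish it
void RecycleSteppedOverNodes(HNode*);  // right-to-left pass flipping stepped-over flags to R

// Called after timing out while waiting for the level-l lock (l >= 1).
void Abandon(HNode* path[], int l) {
  MarkAbandoned(path[l]);                            // Step 1
  if (l == 1) return;                                // a leaf-level abandoner holds no other locks
  int stop = 1;
  for (;; ++stop) {                                  // Steps 2-3: forward pass, leaf upward
    if (PassPrefixToWaitingPeer(path[stop])) break;  // the peer inherits levels [stop, l-1]
    if (stop == l - 1) {                             // no waiter anywhere below level l
      RelinquishLock(path[stop]);
      break;
    }
  }
  RecycleSteppedOverNodes(path[stop]);               // Step 4
  for (int i = stop - 1; i >= 1; --i) {              // Step 5: reverse pass over the remaining levels
    PassOrRelinquishLocal(path[i]);
    RecycleSteppedOverNodes(path[i]);
  }
}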

Release. Lock release is analogous to the steps followed in the abandonment process, except:

1. The ascendance continues till the tree root if no waiting successor is found, and

2. If the value V of the releaser r's representative QNode's status flag is less than the local passing threshold at level l, then r signals its successor with V+1, similar to HMCS. If the passing threshold has been reached, then r releases the parent lock followed by signaling the successor with P to indicate that the successor should compete at the next level (sketched below).
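
The local-passing decision in item 2 can be sketched as follows; it reuses the status encoding and HNode layout sketched earlier, and ReleaseFromLevel is a hypothetical helper for releasing the enclosing locks.

void ReleaseFromLevel(HNode*);  // hypothetical: release the locks at this level and above

void ReleaseSignal(HNode* hnode, QNode* succ, uint64_t V) {
  if (V < hnode->threshold) {
    // Below the threshold: the successor inherits all locks with passing count V + 1.
    succ->status.exchange(V + 1, std::memory_order_acq_rel);
  } else {
    // Threshold reached: release the ancestral locks first, then tell the successor it
    // owns only the local prefix and must compete at the parent level.
    ReleaseFromLevel(hnode->parent);
    succ->status.exchange(hmcst_status::P, std::memory_order_acq_rel);
  }
}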

5. Discussion

Fast-path optimization. To avoid the overhead of multiple levels of lock acquisitions under no contention, we have adapted the fast-path strategy described in [6] into HMCS-T. We omit the details of the adaptation for brevity.

Atomic operations. Our use of SWAP and CAS operations may look heavy-handed, but they are applied on nearer neighbors, which do not generate remote coherence traffic. Recent studies [7] show that the atomic operations introduce no more than 1.15× overhead compared to loads and stores. Not all atomic operations are on the critical path. Uncontended acquisition in HMCS-T〈1〉 is the same as the uncontended acquisition in MCS with an additional SWAP on the status field. Uncontended release in HMCS-T〈1〉 is the same as the uncontended release in MCS. Uncontended acquisitions in deeper HMCS-T locks cost no more than HMCS-T〈1〉 due to the fast path optimization.


Memory overhead and lock abuse. HMCS-T, like other hierarchical locks, trades memory for speed. A prior work [19] on abortable locks traded speed for memory. Our evaluation shows that the node recycling is so time consuming on NUMA machines that the locks described in [19] offer little competition to HMCS-T. Preallocated QNodes achieve a lower memory footprint compared to [19] under high abort rates. However, one must use precaution when using a deep HMCS-T. HMCS-T must be curated for a given architecture. HMCS-T is not a universal replacement for all locks in a program. One should opt into HMCS-T when contention is expected. A contended HMCS-T's memory consumption is no worse than that of a contended MCS lock since in the worst case HMCS-T would require 2× more space; much less in practice. Modern many-core processors with 10s of MBs of cache can accommodate thousands of HMCS-T locks in less than 1% of their last-level cache.

Design options. On finding an abandoned successor, instead of traversing subsequent successors, one may choose to ascend to ancestral levels to increase the chance of finding a successor and reduce the critical path length. However, the cost of remote access is often much higher than several local accesses. The naive design of releasing ancestral locks instead of releasing the peers first causes up to 3.2× performance loss compared to our optimized strategy. It is straightforward to enforce a bound on how many peers to inspect before ascending to the parent level based on an analytical machine model similar to the one discussed in [5].

Hierarchical locks are best served when threads are pinned to cores. Thread migration makes HMCS-T start its new acquisition in its new domain iff the leaf-level QNode is already recycled.

IsLocked() Interface. One can expose an IsLocked() API over HMCS, which returns the boolean LOCKED or NOT LOCKED based on whether or not any thread already holds the lock. A "try lock" can be built using such an interface. If IsLocked() returns NOT LOCKED, the user still has to decide whether to acquire the lock with a timeout (HMCS-T) or wait until the lock is granted (HMCS). In summary, the timeout capability is indispensable.
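
Under this observation, a try-lock could be layered on the timeout interface roughly as follows, reusing the names from Listing 1; IsLocked() then serves only as a cheap pre-check, and the zero-timeout acquire does the real work.

// Returns true iff the lock was acquired without waiting.
bool TryLock(HMCST<3>& lock, QNodeObject* ctxt) {
  if (lock.IsLocked()) return false;                    // read-only fast reject
  return lock.Acquire(ctxt, /*timeout=*/0) == SUCCESS;  // zero-timeout acquire acts as a try-lock
}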

Correctness. HMCS-T is a very involved protocol; the system is stateful. HMCS-T guarantees mutual exclusion, livelock and deadlock freedom under any situation, and starvation freedom for non-aborting threads. We follow a multi-step approach to prove these guarantees. We provide a detailed proof in a separate document: http://github.com/HMCST/hmcst, which relies on the Spin [12] model checker.

Complexity of HMCS-T. Let us assume the system has n lock levels and each lock at a level i is contested by k_i descendants. Hence, the total number of participants (maximum threads) in the system is N = ∏_{i=1}^{n} k_i.

Space complexity: The space complexity depends on the pre-allocated HNodes in the tree. A deeper tree has a larger space complexity. For a given number of participants, the tree depth depends on the machine topology. A loose bound assumes a binary tree. Hence, for a system with N participating threads, there can be at most N pre-allocated HNodes and the protocol does not allocate any more HNodes. Hence, the space complexity per lock is O(N).

Time complexity: Discussion about time complexity is useful only in light of timeout. Once a timeout occurs, we would like to know how many steps are needed to either abandon or release the lock. Due to its optimistic design, a thread that times out waiting for a leaf-level lock may acquire all locks in n steps followed by releasing all n locks. Releasing n locks, in the worst case, can take k_i steps in each ith level from 1 to n in a bottom-to-top sweep and k_i more steps in a top-to-bottom sweep. Hence, the worst case acquire followed by a release after timeout is bounded by n + ∑_{i=1}^{n} 2k_i steps. The worst case abandonment happens when a thread that times out waiting for a leaf-level lock acquires n − 1 locks but fails to acquire the root-level lock and subsequently releases all n − 1 locks. Hence, the worst case abandonment is bounded by (n − 1) + ∑_{i=1}^{n−1} 2k_i steps.
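
As a concrete illustration, assume the 4-level configuration used later in the evaluation is mapped as 2 SMTs per core, 18 cores per socket, 2 sockets per blade, and 8 blades (this mapping is our assumption). Then:

N = ∏_{i=1}^{4} k_i = 2 × 18 × 2 × 8 = 576,
n + ∑_{i=1}^{4} 2k_i = 4 + 2(2 + 18 + 2 + 8) = 64 steps (worst-case timed-out acquire plus release),
(n − 1) + ∑_{i=1}^{3} 2k_i = 3 + 2(2 + 18 + 2) = 47 steps (worst-case abandonment).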

The asymptotic cost on the critical path when all threads, except the lock holder, have aborted is ∑_{i=1}^{n} k_i. The bound can be tightened to (n − 1) + k_n via the previously proposed alternative of ascending to the parent level on finding an aborted successor.

6. Evaluation of HMCS-T

In this section, we evaluate HMCS-T by comparing it with various locks. We use an 8-blade, 16-socket HP Integrity Superdome X [10] shared-memory machine. Each socket is an 18-core, 2-way SMT Intel Xeon E7-8890V3 processor clocked at 2.5GHz. Blades are interconnected with a custom ASIC. The system has a total of 576 hardware threads. L1, L2, and L3 cache sizes are respectively 32KB, 256KB, and 45MB. A pair of sockets on the same blade are connected over QPI [13]. The system has two on-chip, 2-channel DDR4 memory for a total of 12TB. We used the g++ v4.8.3 compiler and parallelized code via OpenMP. Unless stated otherwise, all experiments use a compact binding of threads to cores; a thread will not be bound to another NUMA domain until the current domain is fully populated; the next chosen NUMA domain will be the nearest possible one.

Locks: We use the HMCS-T locks of 1, 2, 3, and 4 levels of hierarchy in our studies. A shallower lock, HMCS-T〈1〉, does not respect locality. An HMCS-T〈2〉 exploits the locality within a node shared by two sockets but ignores the locality within a socket. An HMCS-T〈3〉 exploits the locality within a socket shared by 18 cores, in addition to the behavior of HMCS-T〈2〉, but ignores the locality of SMTs sharing a core. An HMCS-T〈4〉 exploits the locality within a core shared by two SMTs, in addition to the behavior of HMCS-T〈3〉. An HMCS〈n〉 has a configuration symmetric to an HMCS-T〈n〉 but lacks the timeout capability.


We also compare with a Test-And-Test-And-Set lock with timeout (TATAS-T), CLH NB and MCS NB locks [19, 21], and the A C BO CLH lock [8]. Our A C BO CLH lock mimics the implementation described in [8]; we form a 2-level hierarchy where cores sharing the same socket form a cohort. For the locks with a cohort property (HMCS, HMCS-T, and A C BO CLH), we set the passing threshold to 64 [8].

§6.1 compares HMCS-T with HMCS using a microbenchmark. §6.2 compares HMCS-T with other abortable locks via a splay tree case study. §6.4 demonstrates the virtue of HMCS-T by exploiting parallel slackness in an MPI+OpenMP Graph500 code and improving its communication-computation overlap.

6.1 HMCS vs. HMCS-T

An HMCS〈n〉 serves as an upper (lower) bound for the throughput (latency) of HMCS-T〈n〉 when the timeout is infinite. To assess the overhead that HMCS-T adds atop HMCS, we compare them via a microbenchmark: a tight loop of lock acquire and conditional release. We precisely measure the CPU cycles for acquire and release via the x86 rdtsc instruction. The benchmark, while synthetic, is appropriate for the intended comparison.
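
A minimal sketch of such a timing loop is shown below, reusing the interface from Listing 1; the aggregation of per-thread counters into throughput is omitted, and the helper is ours rather than the paper's harness.

#include <cstdint>
#include <x86intrin.h>  // __rdtsc

void TimedLoop(HMCST<3>& lock, QNodeObject& ctxt, uint64_t timeout, long iters) {
  uint64_t acquireCycles = 0, releaseCycles = 0, successes = 0;
  for (long i = 0; i < iters; ++i) {
    uint64_t t0 = __rdtsc();
    if (lock.Acquire(&ctxt, timeout) == SUCCESS) {  // may time out and report failure
      uint64_t t1 = __rdtsc();
      lock.Release(&ctxt, timeout);                 // conditional release: only after success
      uint64_t t2 = __rdtsc();
      acquireCycles += t1 - t0;
      releaseCycles += t2 - t1;
      ++successes;
    }
  }
  // successes per unit wall time gives throughput; the cycle sums give mean latencies.
}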

Figures 6 and 7, respectively, show the throughput and round trip latency (successful acquisition plus release) when the timeout is set to 1K, 10K, and infinite CPU cycles. It is evident from Figure 6 that every HMCS-T instantiation closely follows the throughput of its corresponding HMCS counterpart. The gap narrows as the timeout increases. Figure 7 shows that the latency for successfully acquiring a lock is strictly bounded by the chosen timeout. If lock acquisition is not possible within a given timeout, the thread aborts in HMCS-T, and thus the latency of successful acquisitions reaches a plateau, unlike HMCS. At infinite timeout, HMCS-T〈n〉's latency is very close to that of its HMCS〈n〉 counterpart. In general, the throughput ordering is HMCS〈4〉 > HMCS-T〈4〉 > HMCS〈3〉 > HMCS-T〈3〉 > HMCS〈2〉 > HMCS-T〈2〉 > HMCS〈1〉 > HMCS-T〈1〉. Not all latency appears on the critical path. The wait time during an acquire is not on the critical path. Once the global lock is handed over to a legal successor, the rest of the release protocol is not on the critical path. Hence, the throughput is unaffected beyond a 10K CPU cycles timeout.

Table 1 shows the throughput degradation in HMCS-T〈n〉 over the corresponding HMCS〈n〉 at timeout=∞. The data shows the pure protocol overhead that an HMCS-T incurs to support the abort feature. Under no contention (1 thread), the overhead is fairly low (about 1.25×) in any HMCS-T lock; the degradation does not increase noticeably with the increase in the depth of the lock, which is due to the aforementioned fast path. The worst case degradation is 3.85×, observed for HMCS-T〈4〉 at four threads. This is because the opportunity to pass to an SMT peer is often missed, and the presence of subsequent lock levels leads to a lengthy path of relinquishing parent locks, preventing local passing benefits. HMCS-T〈4〉 is overkill at such a low thread count.

Table 1: Throughput degradation in HMCS-T〈n〉 wrt HMCS〈n〉 at timeout=∞.

Threads:      1      2      4      8      18     36     72     144    288    576    GeoMean
HMCS-T〈1〉     1.22x  1.32x  1.12x  0.97x  0.99x  0.97x  1.04x  0.85x  1.11x  1.09x  1.06x
HMCS-T〈2〉     1.27x  1.79x  1.1x   0.99x  0.97x  0.97x  1.21x  1.19x  1.19x  1.19x  1.17x
HMCS-T〈3〉     1.26x  1.92x  1.21x  1.02x  1.00x  0.95x  0.98x  1.00x  1.00x  1.00x  1.11x
HMCS-T〈4〉     1.27x  2.47x  3.85x  1.68x  1.73x  1.75x  1.74x  1.75x  1.76x  1.79x  1.89x

All other locks incur no more than 1.92× overhead. The average overhead is under 1.11×, except for HMCS-T〈4〉.

To further understand the overheads involved, Figure 8 and Figure 9 decompose the round-trip latency into the latency for successful acquisition and release. We make the following observations:

1. Comparing Figures 8 and 9, acquisition accounts for the bulk of the latency (up to 10^6 cycles) and outweighs the latency for release (under 10^3 cycles) by orders of magnitude between the same pairs of HMCS and HMCS-T locks.

2. Figure 9 shows that the difference in release latency between an HMCS and the corresponding HMCS-T can be high. While HMCS stabilizes at about 25 cycles, HMCS-T〈3〉 and HMCS-T〈4〉 stabilize at about 200 cycles (8× higher). The worst case happens for HMCS-T〈1〉 at 576 threads, which is up to 28× more than HMCS〈1〉. Most importantly, the latency grows neither with the number of threads nor with the timeout. We note that the 200-cycle release latency of HMCS-T〈3〉 and HMCS-T〈4〉, despite their longer critical paths, is faster than that of MCS_NB and CLH_NB, which respectively take about 2300 and 1100 CPU cycles (not shown).

3. Figure 9 shows that the release latency of a shallower HMCS-T is higher than that of a deeper HMCS-T, despite the additional cost of releasing parent levels in a deeper HMCS-T. There are two reasons: (1) it is faster to release the lock in a deeper HMCS-T due to the locality of successors, and (2) ascending to ancestral levels in the tree during lock release exponentially increases the chance of finding a waiting successor and avoids inspecting aborted threads in a peer domain.

With zero timeout (not shown), the time to abandon the protocol from the start of acquisition is under 140 cycles for HMCS-T〈1〉, HMCS-T〈2〉, and HMCS-T〈3〉, and under 1K cycles for HMCS-T〈4〉. Beyond a 1K-cycle timeout, all HMCS-T locks accurately honor the client-chosen timeout. Overall, the HMCS-T locks remain competitive even after adding the abort feature and offer high timing fidelity.

6.2 HMCS-T vs. Other Abortable Locks on Splay Trees

Splay trees [23] are self-adjusting binary search trees with a caching behavior—the last searched item is brought to the tree root and recently searched items appear near the root. Any operation on a splay tree, including a lookup, may involve tree rotations; hence locks are necessary.


[Figure 6 (plot): lock throughput (acquisitions/s) vs. number of threads for HMCS〈1〉–〈4〉 and HMCS-T〈1〉–〈4〉; panels (a) timeout = 1K CPU cycles, (b) timeout = 10K CPU cycles, (c) timeout = ∞.]

Figure 6: Lock throughput of HMCS-T vs. HMCS. Each HMCS-T closely follows its HMCS counterpart.

[Figure 7 (plot): round-trip latency (CPU cycles) vs. number of threads; panels (a) timeout = 1K CPU cycles, (b) timeout = 10K CPU cycles, (c) timeout = ∞.]

Figure 7: Round trip latency of HMCS-T vs. HMCS. Each HMCS-T closely follows its HMCS counterpart at ∞ timeout and offers bounded latency at other timeout values.

[Figure 8 (plot): acquisition-only latency (CPU cycles) vs. number of threads; panels (a) timeout = 1K CPU cycles, (b) timeout = 10K CPU cycles, (c) timeout = ∞.]

Figure 8: Acquisition only latency of HMCS-T vs. HMCS. Acquisition latency governs the total latency.

[Figure 9 (plot): release-only latency (CPU cycles) vs. number of threads; panels (a) timeout = 1K CPU cycles, (b) timeout = 10K CPU cycles, (c) timeout = ∞.]

Figure 9: Release only latency of HMCS-T vs. HMCS. The lock release in HMCS-T is slower than HMCS but its contribution to the overall latency is insignificant.

We create a setup where all threads attempt to perform a lookup operation on a global splay tree while holding a lock (critical work). On lock timeout, a thread performs some “other” work (non-critical work). The non-critical work is a lookup on a thread-local splay tree and hence does not require any locking. Using splay tree lookups both inside and outside the critical section allows us to compare the critical and non-critical work accomplished by various locks in a uniform way. Also, instead of idle spinning outside the critical section, switching to a different work chunk pollutes the cache, which represents real-world situations.
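The per-thread driver can be pictured as follows. The try_acquire/release interface, the splay_lookup helper, and the key generator are illustrative placeholders that stand in for whichever abortable lock is under test; the sketch only conveys the critical/non-critical split.

    #include <atomic>
    #include <cstdint>

    struct AbortableLock;   // any abortable lock under study (placeholder)
    struct SplayTree;       // splay tree whose lookups may rotate the tree (placeholder)

    bool try_acquire(AbortableLock* l, uint64_t timeout_cycles);  // hypothetical interface
    void release(AbortableLock* l);
    void splay_lookup(SplayTree* t, uint64_t key);
    uint64_t next_key();    // 90% hot key, 10% among 64 adjacent keys

    std::atomic<bool> done{false};

    // One benchmark thread: global-tree lookups under the lock (critical work);
    // on timeout, fall back to the thread-local tree (non-critical work).
    void worker(AbortableLock* lock, SplayTree* global_tree, SplayTree* local_tree,
                uint64_t timeout_cycles, uint64_t* global_ops, uint64_t* local_ops) {
      while (!done.load(std::memory_order_relaxed)) {
        if (try_acquire(lock, timeout_cycles)) {
          splay_lookup(global_tree, next_key());   // may rotate the shared tree
          release(lock);
          ++*global_ops;
        } else {
          splay_lookup(local_tree, next_key());    // thread-private; no locking needed
          ++*local_ops;
        }
      }
    }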

Our global and thread-local splay trees each have 8K nodes. For both kinds of trees, 90% of the lookups are served by the same key, and the remaining 10% are for the 64 numerically adjacent keys (not necessarily nearby in the tree organization). Note, however, that not all of the 90% of global-tree lookups hit at the root, since a previous lookup, either by the same thread or by another thread, could have rotated the tree. Furthermore, we change the “hot node” every second to a randomly chosen node from the tree.

A superior lock should deliver high lock throughput and abort swiftly on a timeout; that is, it should show high throughput on both global and local tree operations. Figure 10(a) compares the throughput of various abortable locks at a 10K CPU-cycle timeout. Throughput is measured as the total tree lookup operations performed per second, which includes both local and global tree operations. Figure 10(b) and Figure 10(c), respectively, decompose the throughput into global (critical) and local (non-critical) work components. We have clipped the lines when the throughput fell below 10K acquisitions per second for the global tree.

We notice that a deeper HMCS-T lock offers higher throughput than a shallower HMCS-T on the global tree. The global-tree throughput drops with increasing concurrency (Figure 10(b)) because the working set increases linearly with the number of threads, which causes the threads to compete for the shared cache. For locks that can abort swiftly on timeout (HMCS-T, TATAS-T, and A_C_BO_CLH), the throughput of local-tree operations increases linearly as the number of threads grows beyond 18. All locks except MCS_NB and CLH_NB show comparable throughput on their local-tree operations.

CLH_NB and MCS_NB [19] locks have dramatically lower throughput than the HMCS-T locks on both the global tree and the local trees. This is because their memory management overhead is excessive; frequent aborts severely affect node recycling. In the CLH_NB lock, for example, when a thread aborts, it leaves its QNode in the CLH queue to be freed by its successor. However, if the successor also aborts, the QNode is not freed, and the task is delegated to some other successor. When aborts are common, many QNodes are left unreclaimed in the CLH queue. Each thread maintains a list of QNodes it allocated (local list). Every new acquisition sweeps its local list looking for a recycled node. When successors keep aborting, none of the previously allocated QNodes becomes available for reuse, and hence almost every new acquisition allocates a new QNode (via malloc) and appends it to the local list. The local lists keep growing, increasing both the memory footprint and the time spent unsuccessfully searching for a free node on each acquisition. Since the local-list search time keeps growing, the timeout expires immediately after allocation, leaving no time to recycle an aborted predecessor. Frequent aborts and lengthy allocation times feed back into each other, further aggravating the problem.

The average time to sweep the list and allocate a new QNode took ∼900K CPU cycles in CLH_NB at the 10K and 100K cycle timeout thresholds. The code spent more than 90% of its time trying to recycle QNodes in the alloc_local_qnode procedure, incurring more than 97% of its L2 cache misses in the same procedure. As a result, the CLH_NB and MCS_NB locks are unsuccessful in acquiring the lock within 10K CPU cycles. The allocation time reduced to 700 cycles only after the timeout was raised above 10⁶ CPU cycles. Since the lock overhead consumes most of the execution time, it also reduces the throughput of non-critical work. In contrast, HMCS-T's preallocated QNodes and their reuse keep the protocol overhead small.
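This failure mode can be seen from a simplified sketch of the allocation path described above (modeled loosely on the non-blocking-timeout pseudocode in [21]; the types and the reclaimable flag are our simplifications, not the reference implementation):

    #include <atomic>
    #include <vector>

    struct QNode {
      std::atomic<bool> reclaimable{false};  // set by a successor once the node is safe to reuse
      // ... queue-link and status fields elided ...
    };

    struct ThreadCtx {
      std::vector<QNode*> local_list;        // every node this thread ever allocated
    };

    // Sweep the local list for a reusable node; when successors keep aborting,
    // nothing is reclaimable, so both the list and the sweep time grow on every attempt.
    QNode* alloc_local_qnode(ThreadCtx* t) {
      for (QNode* q : t->local_list) {
        if (q->reclaimable.load(std::memory_order_acquire)) {
          q->reclaimable.store(false, std::memory_order_relaxed);
          return q;                          // found a recycled node
        }
      }
      QNode* fresh = new QNode();            // nothing reusable: allocate yet another node
      t->local_list.push_back(fresh);        // footprint grows; so does the next sweep
      return fresh;
    }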

The memory usage and its overhead w.r.t. TATAS-T at 10K timeout with 576 threads are shown in the left half of Table 2. The CLH_NB and MCS_NB locks have very high memory usage due to repeated allocations when the nodes cannot be recycled. The numbers remain the same at 100K timeout (not shown). We measured that TATAS-T has a huge skew in the lock distribution—on 70% of occasions, the lock holder is the same as the previous one, causing thread starvation [19, 20]. A_C_BO_CLH also suffers from starvation.

To quantify readmission into a previously abandoned QNode, we profiled how often a thread resumes from its abandoned node. HMCS-T〈1〉, HMCS-T〈2〉, HMCS-T〈3〉, and HMCS-T〈4〉 resumed their wait from a previously abandoned node on 99%, 97%, 92%, and 47% of occasions, respectively, at 10K timeout. A shallower lock is slow and has a longer queue, which provides more chances of readmitting an abandoned thread. The readmission fraction progressively decreases with increased depth. HMCS-T〈4〉 offers the least opportunity since the leaf-level queue length is at most two in our setting.

Figures 11(a) and 11(b), respectively, show the throughput of global and local tree operations at abort thresholds ranging from 10¹ to 10⁸ CPU cycles for a 576-threaded experiment. HMCS-T〈4〉 has the highest global-tree throughput, followed by the other, shallower hierarchical locks. Throughput on the global tree can drop by about 2× between 10¹- and 10⁶-cycle timeouts. MCS_NB and CLH_NB become competitive for global-tree operations only beyond 10⁶ CPU cycles, which is a very high threshold. The local-tree throughput drops as the timeout value increases because few acquisitions time out and hence few local-tree operations are performed.

6.3 Lock Efficiency

To gain a unified view of the total work accomplished by a lock among a set of locks under study, we compute the following relative metrics for each lock k. G(k): the ratio of lock k's global-tree throughput (critical work) to the maximum global-tree throughput of any lock under study with the same settings, and L(k): the ratio of lock k's local-tree throughput (non-critical work) to the maximum local-tree throughput of any lock under study with the same settings. The sum of these two metrics can reach a theoretical maximum of 200%. We normalize it to 100%, which represents the efficiency E(k) of lock k. E quantifies the ability of a lock to deliver high throughput of successful lock acquisitions and also to expeditiously exit on timeout.
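In symbols, writing T_g(k) and T_l(k) for lock k's global-tree and local-tree throughputs (the symbol names are ours, chosen only to restate the definition above):

    G(k) = \frac{T_g(k)}{\max_j T_g(j)}, \qquad
    L(k) = \frac{T_l(k)}{\max_j T_l(j)}, \qquad
    E(k) = \frac{G(k) + L(k)}{2},

so G, L, and E each lie between 0 and 100%.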


[Figure 10 (plot): lookup throughput (operations/s) vs. number of threads for HMCS-T〈1〉–〈4〉, TATAS-T, MCS_NB, CLH_NB, and A_C_BO_CLH; panels (a) total splay operations, (b) global splay operations, (c) local splay operations.]

Figure 10: Lock throughput of different abortable locks. Timeout=10K CPU cycles.

[Figure 11 (plot): lookup throughput (operations/s) vs. timeout (CPU cycles, log scale) at 576 threads; panels (a) global tree operations, (b) local tree operations.]

Figure 11: Impact of timeout on the throughput of local and global tree operations. The experiment has 576 threads.

Table 2: Left: resident memory and memory overhead w.r.t. a TATAS-T lock. Right: efficiency (E) of locks—a unified metric of critical and non-critical work accomplished. In both experiments, the timeout is set to 10K CPU cycles, and both experiments use 576 threads.

                Resident memory        Efficiency (% of peak throughput)
Lock            KB          Overhead   Global (G)   Local (L)   E=(G+L)/2
HMCS-T<1>       271,060     1x         29%          100%        64%
HMCS-T<2>       271,160     1x         31%          96%         64%
HMCS-T<3>       271,192     1x         62%          88%         75%
HMCS-T<4>       271,188     1x         100%         84%         92%
TATAS-T         271,020     1x         25%          60%         43%
MCS_NB          3,746,764   14x        1.62E-06%    0.83%       0.41%
CLH_NB          1,783,500   6.6x       1.62E-06%    1%          0.50%
A_C_BO_CLH      271,170     1x         11%          90%         51%


We computed the G, L, and E metrics at 576 threads among eight locks at a 10K-cycle timeout (right side of Table 2). HMCS-T〈4〉 has the highest critical-path throughput (G is 100%). A deep hierarchy helps achieve higher lock throughput, and the throughput falls off for shallower locks. CLH_NB and MCS_NB achieve only a small fraction of the peak throughput. TATAS-T and A_C_BO_CLH, respectively, achieve 25% and 11% of HMCS-T〈4〉's global-tree throughput.

HMCS-T〈1〉 has the highest throughput on the non-critical work (L is 100%). This is because HMCS-T〈1〉 is the quickest to abandon on timeout and switch to performing local-tree operations. Abandoning in HMCS-T〈1〉 is a simple local swap operation at one level, whereas abandoning in an interior node of a deeper HMCS-T may involve more operations. L decreases for a deeper HMCS-T lock; HMCS-T〈4〉 achieves 84% of HMCS-T〈1〉's non-critical work. Surprisingly, TATAS-T has a lower L despite the fact that an abort takes no memory operation. This is because the CAS operation in TATAS-T has a very high latency. If a thread issues a CAS close to its timeout, the CAS may complete thousands of cycles after the timeout, which leads to a high protocol overhead.

In the combined metric (E), the deeper hierarchical locks—HMCS-T〈3〉 and HMCS-T〈4〉—deliver higher efficiency due to their large advantage in critical work over shallower locks. TATAS-T and A_C_BO_CLH have respectable efficiency, but both are unfair locks. CLH_NB and MCS_NB have the lowest efficiency.

6.4 HMCS-T in an MPI+Threads BFS

Parallel breadth-first search (BFS) is an irregular, dynamic, sparse data exchange (DSDE) algorithm. A process communicates with a small subset of processes (neighbors), while the neighborhoods change dynamically over iterations. One of the best-known algorithms for solving DSDE problems is NBX [11]. In NBX, most of the data exchanges rely on nonblocking point-to-point primitives, and a process has to poll manually for communication progress via APIs such as MPI_Iprobe or MPI_Test in the context of MPI. We use Amer et al. [2]'s multi-threaded NBX implementation of the Graph500 BFS kernel, where threads in a process perform both computation and communication.
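A condensed view of the per-thread progress polling this implies is sketched below. MPI_Test and MPI_STATUS_IGNORE are standard MPI, but the surrounding loop and the helper functions are our simplification of an NBX-style implementation, not Amer et al.'s code.

    #include <mpi.h>

    void do_some_local_bfs_work();            // computation chunk (placeholder)
    void check_termination(int* all_done);    // NBX termination detection (placeholder)

    // Each thread alternates between local BFS computation and polling its
    // outstanding nonblocking operations for progress.
    void progress_loop(MPI_Request* reqs, int nreqs, int* all_done) {
      while (!*all_done) {
        do_some_local_bfs_work();
        for (int i = 0; i < nreqs; ++i) {
          int flag = 0;
          // MPI_Test polls one request; in a thread-safe MPICH build this enters
          // the runtime's global critical section, which is where contention arises.
          MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);
        }
        check_termination(all_done);
      }
    }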

We use the MPICH-3.2 [1] MPI implementation, which has a single lock to protect its core functionality [3, 4]. In a highly threaded BFS run, the MPICH global lock becomes a point of contention because of repeated calls to MPI_Test by different threads. Since MPI_Test is only a progress-checking API, it can be aborted in favor of computation or non-polling APIs, such as MPI_Isend. With this intuition, we substituted the MPICH-3.2 Pthread mutex lock with various abortable locks, allowing the MPI_Test implementation to abort; the rest of the APIs do not abort. On abort, MPI_Test behaves as if the communication has not yet completed. While the progress semantics of MPI require a continuously checking thread to succeed if a matching operation has been posted, such a progress guarantee is achievable in HMCS-T by bounding the number of successive aborts by a thread.
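The substitution can be pictured as follows. The names below (the abortable-lock calls, the global-lock handle, the timeout constant, and the inner completion check) are hypothetical stand-ins, not MPICH's actual internals; the point is only that MPI_Test reports “not yet complete” when the lock wait aborts, while other entry points keep waiting unconditionally.

    #include <mpi.h>
    #include <cstdint>

    struct HMCSTLock; struct QNode;
    bool hmcs_t_acquire(HMCSTLock*, QNode*, uint64_t timeout_cycles);  // hypothetical
    void hmcs_t_release(HMCSTLock*, QNode*);
    extern HMCSTLock mpich_global_lock;            // stand-in for the runtime's global lock
    QNode* my_qnode();                             // per-thread queue node (placeholder)
    int test_completion_impl(MPI_Request*, int*, MPI_Status*);  // guarded work (placeholder)
    constexpr uint64_t kTestTimeoutCycles = 100000;              // e.g., 100K CPU cycles

    // Abortable MPI_Test path: abandon the lock wait on timeout and report
    // the operation as not yet complete.
    int abortable_mpi_test(MPI_Request* request, int* flag, MPI_Status* status) {
      *flag = 0;
      if (!hmcs_t_acquire(&mpich_global_lock, my_qnode(), kTestTimeoutCycles))
        return MPI_SUCCESS;                        // aborted: behave as "not yet completed"
      int rc = test_completion_impl(request, flag, status);
      hmcs_t_release(&mpich_global_lock, my_qnode());
      return rc;
    }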


[Figure 12 (plot): Graph Traversal Rate (Million Edges/s) vs. Number of Threads per Rank (2, 4, 9, 17, 34, 68) under Timeout=infinity and Timeout=100K, for Pthread mutex, HMCS-T〈1〉, HMCS-T〈3〉, TATAS-T, CLH_NB, and A_C_BO_CLH.]

Figure 12: Scalability of BFS with 2²⁵ vertices using 4 MPI processes and a varying number of threads per MPI process. Pthread mutex, used by the original MPICH, does not support timeout and is shown only for comparison. HMCS-T〈3〉 has the best absolute performance and superior scaling at the 100K timeout.

We partition the 8-node shared-memory machine into four MPI processes, each with 72 cores; we do not use SMTs. Further, we do not load all sockets fully, to leave some cores to run MPI daemons and scripts. We measured the BFS graph traversal rate from 16 different root vertices and report the average rate per root vertex in Figure 12 for various locks (including the baseline Pthread mutex) at two timeout values: 100K CPU cycles and infinity (for comparison with the traditional lock-waiting regime). Our analysis showed that the overall performance is strongly correlated with the time spent in MPI_Test and that changing the lock implementation has little effect on the other parts of the algorithm, e.g., the computation or the amount of data transferred. The degree of computation-communication overlap dictates the performance variation across lock implementations.

At an infinite timeout, the performance peaks at nine threads for all locks. Beyond 17 threads, the performance collapses since more threads cause more waiting in MPI_Test, hence more polling and less real communication. The only lock that proves more robust is HMCS-T〈3〉, which exploits locality; however, it still suffers a 17% performance loss at full concurrency compared to its peak.

With a timeout of 100K, every abortable lock performs better than its timeout=∞ counterpart (the base Pthread mutex is shown only for reference). At the 100K timeout, the locks reach their peaks at 17 threads, but this time the peaks are higher than those at timeout=∞. This indicates that bounding the lock wait time improves the ability to overlap communication with computation. All locks except HMCS-T〈3〉 collapse in performance beyond 17 threads since: 1) the cost of aborts is expensive in the other locks due to inter-socket and inter-node communication, and 2) when the lock is passed to a remote requester, locality is lost. HMCS-T〈3〉, by virtue of its fast abort and locality-aware lock handoff, scales to a higher throughput (93.2 M edges/s) at 34 and 68 threads and alleviates the performance collapse seen in the other locks. In particular, HMCS-T〈3〉 with a 100K timeout achieves 30% higher peak performance than its unbounded counterpart. Furthermore, HMCS-T〈3〉 with the 100K timeout retains the same 30% speedup even when compared against HMCS〈3〉 (not shown), although HMCS〈3〉 is expected to have slightly less protocol overhead than HMCS-T〈3〉 with an infinite timeout.

The top performance of HMCS-T〈3〉 (at 34–68 threads) is 34% higher than the top performance of the original Pthread mutex (at 9 threads). At full concurrency, HMCS-T〈3〉 delivers 88% and 21% higher throughput than the next-best-performing abortable lock (HMCS-T〈1〉) with infinite and 100K timeout values, respectively. This justifies the virtue of both locality awareness and abortability. HMCS-T〈3〉 at a 100K timeout (with 93.2×10⁶ edges/s) shows an impressive 330% (4.3×) speedup over the original Pthread mutex-based MPICH (with 21.5×10⁶ edges/s) at full concurrency. We also experimented with a CAS-based try-lock (not shown) and observed very poor scalability and a very low peak performance, since it is locality agnostic and generates indiscriminate cache-coherence traffic.

We experimented with the previously mentioned IsLocked() primitive using both HMCS-T〈3〉 (at several timeout values) and HMCS〈3〉. We invoked the underlying lock only if IsLocked() returned NOT_LOCKED. We observed that the graph traversal rate initially increased, up to 140 million edges/sec, but once the number of threads crossed a socket, the throughput collapsed dramatically to below 90 million edges/sec because of the lost data locality. IsLocked()+HMCS-T〈3〉 showed slightly worse performance than merely using HMCS-T〈3〉 for the same timeout values. IsLocked()+HMCS〈3〉 performed worse than HMCS-T〈3〉 with a ≤ 10K-cycle timeout regardless of the thread count.
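The gating we tried can be summarized as follows; IsLocked() and NOT_LOCKED refer to the primitive described earlier, while the wrapper name, the enum, and the acquire interface are our illustrative assumptions.

    struct HMCSTLock; struct QNode;
    enum LockQueryResult { NOT_LOCKED, LOCKED };       // simplified result states (assumption)
    LockQueryResult IsLocked(HMCSTLock* lock);         // cheap query of the lock state
    bool hmcs_t_acquire(HMCSTLock*, QNode*, unsigned long timeout_cycles);  // hypothetical
    void hmcs_t_release(HMCSTLock*, QNode*);
    QNode* my_qnode();

    // Poll only when the lock appears free; otherwise skip this polling round.
    bool gated_poll(HMCSTLock* lock, unsigned long timeout_cycles, void (*poll)()) {
      if (IsLocked(lock) != NOT_LOCKED)
        return false;                                  // someone else holds it: skip polling
      if (!hmcs_t_acquire(lock, my_qnode(), timeout_cycles))
        return false;                                  // raced and timed out: skip polling
      poll();                                          // e.g., the guarded MPI_Test path
      hmcs_t_release(lock, my_qnode());
      return true;
    }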

These results clearly indicate that bounding lock waiting times through an abort feature can help scale an otherwise non-scalable code by exploiting parallel slackness.

7. Conclusions

In this paper, we designed the HMCS-T lock, a hierarchical queuing lock that can abandon the lock wait on a timeout. Hierarchical queue-based locks combined with the timeout feature offer two key advantages on many-core NUMA systems—data locality and enhanced concurrency. HMCS-T has bounded space and time complexity together with starvation freedom, which are rare in other abortable locks. HMCS-T has high timing fidelity and outperforms several state-of-the-art abortable locks in latency and throughput.

8. Acknowledgments

John Mellor-Crummey and Karthik Murthy shared their model-checking experience, which inspired us to develop our solution. Ram Swaminathan made valuable editorial suggestions. This research was supported by the National Science Foundation under grant number 1464157 and by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357.


References

[1] A. Amer, P. Balaji, W. Bland, W. Gropp, R. Latham, H. Lu, L. Oden, A. Pena, K. Raffenetti, S. Seo, T. Rajeev, and Z. Junchao. MPICH User's Guide, Version 3.2. http://www.mpich.org/static/downloads/3.2/mpich-3.2-userguide.pdf, 2015.

[2] A. Amer, H. Lu, P. Balaji, and S. Matsuoka. Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, pages 1075–1083. IEEE, 2015.

[3] A. Amer, H. Lu, Y. Wei, P. Balaji, and S. Matsuoka. MPI+Threads: Runtime Contention and Remedies. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 239–248, New York, NY, USA, 2015. ACM.

[4] A. Amer, H. Lu, Y. Wei, H. Jeff, S. Matsuoka, and P. Balaji. Locking Aspects in Multithreaded MPI Implementations. Technical Report ANL/MCS-P6005-0516, 2016.

[5] M. Chabbi, M. Fagan, and J. Mellor-Crummey. High Performance Locks for Multi-level NUMA Systems. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 215–226, 2015.

[6] M. Chabbi and J. Mellor-Crummey. Contention-conscious, Locality-preserving Locks. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16, pages 22:1–22:14, New York, NY, USA, 2016. ACM.

[7] T. David, R. Guerraoui, and V. Trigonakis. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 33–48, New York, NY, USA, 2013. ACM.

[8] D. Dice, V. J. Marathe, and N. Shavit. Lock Cohorting: A General Technique for Designing NUMA Locks. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 247–256, 2012.

[9] M. Frigo, C. E. Leiserson, and K. H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pages 212–223, New York, NY, USA, 1998. ACM.

[10] Hewlett Packard Enterprise. HP Integrity Superdome X. http://www8.hp.com/h20195/v2/GetPDF.aspx/c04383189.pdf.

[11] T. Hoefler, C. Siebert, and A. Lumsdaine. Scalable Communication Protocols for Dynamic Sparse Data Exchange. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, pages 159–168, New York, NY, USA, 2010. ACM.

[12] G. J. Holzmann. The Model Checker SPIN. IEEE Transactions on Software Engineering — Special issue on formal methods in software practice, 23(5):279–295, May 1997.

[13] Intel Corp. An Introduction to the Intel® QuickPath Interconnect. http://www.intel.com/content/www/us/en/io/quickpath-technology/quick-path-interconnect-introduction-paper.html, 2009.

[14] P. Jayanti. Adaptive and Efficient Abortable Mutual Exclusion. In Proceedings of the Twenty-second Annual Symposium on Principles of Distributed Computing, PODC '03, pages 295–304, New York, NY, USA, 2003. ACM.

[15] P. S. Magnusson, A. Landin, and E. Hagersten. Queue Locks on Cache Coherent Multiprocessors. In Proceedings of the 8th International Symposium on Parallel Processing, pages 165–171, 1994.

[16] V. Marathe, M. Moir, and N. Shavit. Composite Abortable Locks. In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International, pages 1–10, April 2006.

[17] J. M. Mellor-Crummey and M. L. Scott. Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, February 1991.

[18] A. Pareek and P. Woelfel. RMR-efficient Randomized Abortable Mutual Exclusion. In Proceedings of the 26th International Conference on Distributed Computing, DISC '12, pages 267–281, Berlin, Heidelberg, 2012. Springer-Verlag.

[19] M. L. Scott. Non-blocking Timeout in Scalable Queue-based Spin Locks. In Proceedings of the Twenty-first Annual Symposium on Principles of Distributed Computing, PODC '02, pages 31–40, New York, NY, USA, 2002. ACM.

[20] M. L. Scott and W. N. Scherer. Scalable Queue-based Spin Locks with Timeout. In Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, PPoPP '01, pages 44–52, New York, NY, USA, 2001. ACM.

[21] Scott, Michael. Non-Blocking Timeout in Scalable Queue-Based Spin Locks. https://www.cs.rochester.edu/research/synchronization/pseudocode/nb_timeout.html.

[22] SGI. SGI UV: The World's Most Powerful In-Memory Supercomputers. https://www.sgi.com/products/servers/uv/.

[23] D. D. Sleator and R. E. Tarjan. Self-adjusting Binary Search Trees. Journal of the ACM, 32(3):652–686, July 1985.

[24] L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103–111, Aug. 1990.