NIX: A Case for a Manycore System for Cloud Computing

Francisco J. Ballesteros, Noah Evans, Charles Forsyth, Gorka Guardiola, Jim McKie, Ron Minnich, and Enrique Soriano-Salvador

Virtualized cloud computing provides lower costs, easier system management, and lower energy consumption. However, high performance tasks are ill-suited for traditional clouds since commodity operating systems and hypervisors add cascading jitter to computation, slowing applications to a crawl. This paper presents an overview of NIX, an operating system for manycore central processing units (CPUs) which addresses the challenges of running high performance applications on commodity cloud systems. NIX features a heterogeneous CPU model and uses a shared address space. NIX has been influenced by our work on Blue Gene and more traditional clusters. NIX partitions cores by function: timesharing cores (TCs); application cores (ACs); and kernel cores (KCs). One or more TCs run traditional applications. KCs are optional, running kernel functions on demand. ACs are also optional, devoted to running an application with no interrupts; not even clock interrupts. Unlike traditional kernels, functions are not static: the number of TCs, KCs, and ACs can change as needed. Also unlike traditional systems, applications can communicate by sending messages to the TC kernel, instead of system call traps. These messages are active, taking advantage of the shared-memory nature of manycore CPUs to pass pointers to data and code to coordinate cores. © 2012 Alcatel-Lucent.

Bell Labs Technical Journal 17(2), 41–54 (2012) © 2012 Alcatel-Lucent. Published by Wiley Periodicals, Inc. Published online in Wiley Online Library. DOI: 10.1002/bltj.21543

Introduction

Cloud computing uses virtualization on shared hardware to provide each user with the appearance of having a private system of their choosing running on dedicated hardware. Virtualization enables the dynamic provisioning of resources according to demand, and the possibilities of entire system backup, migration, and persistence; for many tasks, the flexibility of this approach, coupled with savings in cost, management, and energy, is compelling.

However, certain types of tasks, such as compute-intensive parallel computations, are not trivial to implement as cloud applications. For instance, high performance computing (HPC) tasks consist of highly interrelated subproblems where synchronization and work allocation occur at fixed intervals. Interrupts from the hypervisor and the operating system add "noise" (random latency) to tasks, which then impacts all other tasks in the queue by making them wait for the tasks that have been slow to finish. This slowing effect may cascade and seriously disrupt the regular flow of a computation.

HPC tasks are performed on customized hardware that provides bounded latency. Unlike clouds, HPC systems are typically not timeshared. Instead of the illusion of exclusivity, individual users are given fully private allocations for their tasks, with the allocation running as a single-user system. However, programming for single-user systems is difficult; users prefer programming environments that mimic the timesharing environment of their desktop. This desire leads to conflicting constraints between the need for a convenient and productive programming environment and the goal of maximum performance. This conflict has led to an evolution of HPC operating systems towards providing the full capabilities of a commodity timesharing operating system. For example, the IBM Compute Node Kernel, a non-Linux* HPC kernel, has changed in the last ten years to become more and more like Linux.
On the other hand, Linux systems for HPC are increasingly pared down to a minimal subset of capabilities in order to avoid timesharing degradation, at the cost of compatibility with the current Linux source tree. This convergent evolution has led to an environment where HPC kernels sacrifice performance for compatibility with commodity systems, while commodity systems sacrifice compatibility for performance, leaving both issues fundamentally unresolved.

We describe a solution to this tradeoff, bridging the gap between performance and expressivity, providing bounded latency and maximum computing power on one hand and the rich programming environment of a commodity operating system (OS) on the other. Based on the reality of emerging manycore processors, we provide an environment in which users can be allocated at run-time on dedicated, non-preemptable cores, while at the same time, all the services of a full-fledged operating system are available. This allows us to leverage bounded latency and exclusive use (lessons learned from HPC) and apply them to cloud computing.

As an example of this approach we present NIX, a prototype operating system for future manycore central processing units (CPUs). This paper is an overview of the system and the approach.

Influenced by our work in HPC, both on Blue Gene* and more traditional clusters, NIX features a heterogeneous CPU model and a change from the traditional UNIX* memory model of separate virtual address spaces. NIX partitions cores by function: timesharing cores (TCs); application cores (ACs); and kernel cores (KCs). There is always at least one TC, and it runs applications in the traditional model. KCs are cores created to run kernel functions on demand. ACs are entirely devoted to running an application, with no interrupts; not even clock interrupts. Unlike traditional HPC lightweight kernels, the number of TCs, KCs, and ACs can change with the needs of the application. Unlike traditional operating systems, applications can access OS services by sending a message to the TC kernel, rather than by a system call trap. Control of ACs is managed by means of inter-core calls. NIX takes advantage of the shared-memory nature of manycore CPUs, and passes pointers to both data and code to coordinate among cores.

In the sections following, we provide a brief description of NIX and its capabilities. We then evaluate and benchmark NIX under standard HPC workloads. Finally, we summarize our work and discuss the applicability of the NIX design to traditional non-HPC applications.

Panel 1. Abbreviations, Acronyms, and Terms
  AC    Application core
  AP    Application processor
  APIC  Advanced programmable interrupt controller
  BSP   Bootstrap processor
  CPU   Central processing unit
  FTQ   Fixed time quantum
  HPC   High performance computing
  IaaS  Infrastructure as a service
  ICC   Inter-core-call
  I/O   Input/output
  IPC   Interprocess communication
  IPI   Inter-processor interrupt
  KC    Kernel core
  OS    Operating system
  SMP   Symmetric multiprocessing
  TC    Timesharing core
  TLB   Translation look aside buffer

NIX

High performance applications are increasingly providing fast paths to avoid the higher latency of traditional operating system interfaces, or eliminating the operating system entirely and running on bare hardware. Systems like Streamline [6] and Exokernel [7] either provide a minimal kernel or a derived fast path to bypass the operating system functions. This approach is particularly useful for database systems, fine-tuned servers, and HPC applications.

Systems like the Multikernel [3] and, to some extent, Helios [13] handle different cores (or groups of cores) as discrete systems, with their own operating system kernels. Sadly, this does not shield HPC applications from the latency caused by interference from the system.

We took a different path. As machines move towards hundreds of cores on a single chip [2], we believe that it is possible to avoid using the operating system entirely on many cores of the system.
In NIX, applications are assigned to cores where they run without OS interference; that is, without an operating system kernel. In some sense, this is the opposite of the multikernel approach. Instead of using more kernels for more cores, we try to use no kernel for (some of) them.

NIX is a new operating system based on this approach, evolving from a traditional operating system in a way that preserves binary compatibility but also enables a sharp break with the past practice of treating manycore systems as traditional symmetric multiprocessing (SMP) systems. NIX is strongly influenced by our experiences over the past five years in modifying and optimizing Plan 9 to run on HPC systems such as IBM's Blue Gene, and in applying our experience with large scale systems to general purpose computing.

NIX is designed for heterogeneous manycore processors. Discussions with processor manufacturers revealed the following trends. First, the processor manufacturers would prefer that not all cores run an OS at all times. Second, there is a very real possibility that on some future manycore systems, not all cores will be able to support an OS. This is similar to the approach that IBM is taking with the Cell BE [5], and the approach taken with the wire-speed processor [10], where primary processors interact with satellite special-purpose processors, which are optimized for various applications (floating point computation for the Cell BE, and network packet processing for the wire-speed processor).

NIX achieves this objective by assigning specific roles and executables to satellite processors and using a messaging protocol between cores based on shared memory active messages. These active messages not only send references to data, but also send references to code. The messaging protocol communicates using a shared memory data structure, containing cache-aligned fields.
In particular, these messages contain arguments, a function pointer (to be run by the peer core), and an indication whether or not to flush the translation look aside buffer (TLB) (when required and if applicable to the target architecture).

The NIX Approach: Core Roles

There are a few common attributes to current manycore systems. The first is a non-uniform topology: sockets contain CPUs, and CPUs contain cores. Each of these sockets is connected to a local memory, and can reach other memories only via other sockets, via one or more on-board networking hops using an interconnection technology like HyperTransport* [8] or QuickPath [9]. This topology leads to a situation where memory access methods are different according to socket locality. The result is a set of non-uniform memory access methods, which means that memory access times are also not uniform and are unpredictable without knowledge of the underlying interconnection topology.

Although the methods used to access memory are potentially different, the programmatic interface to memory remains unchanged. While the varying topology may affect performance, the structure of the topology does not change the correctness of programs. Backwards-compatibility makes it possible to start from a preexisting code base and gradually optimize the code to take better advantage of the new structure of manycore systems. Starting from this foundation allows us to concentrate on building applications that solve our particular problems instead of writing an entirely new set of systems software and tools, which, like in any system that starts from a clean-slate design, would consume much of the effort.
NIX has had a working kernel and a full set of user programs and libraries from the start.

Starting with a standard unmodified timesharing kernel and gradually adding functions to exploit other cores exclusively for either user- or kernel-intensive processes means that, in the worst case, the timesharing kernel will behave as a traditional operating system. In the best case, applications will be able to run as fast as permitted by the raw hardware. For our kernel we used Plan 9 from Bell Labs [14], a distributed research operating system amenable to modification.

In addition to a traditional timesharing kernel running on some cores, NIX partitions cores by function, implementing heterogeneous cores instead of a traditional SMP system where every core is equal and running the same copy of the operating system. However, the idea of partitioning cores by function is not new. The basic x86 architecture has, since the advent of SMP, divided CPUs into two basic types: BSP, or bootstrap processor; and AP, or application processor. However, it is actually a bit more complex than that, due to the advent of multicore. On a multicore socket, only one core is the BSP. It is really a bootstrap core, but the processor manufacturers have chosen to maintain the older name. This differentiation means that, in essence, multiprocessor x86 systems have been heterogeneous from the beginning, although this was not visible from the user level to preserve memory compatibility. Many services, like the memory access methods described earlier, maintain the same interface even if the underlying implementation of these methods is fundamentally different. NIX preserves and extends this distinction between cores to handle heterogeneous applications, creating three new classes of cores as described below:

1. The timesharing core (TC) acts as a traditional timesharing system, handling interrupts, system calls, and scheduling. The BSP is always a TC. There can be more than one TC, and a system can consist of nothing but TCs, as determined by the needs of the user. An SMP system can be viewed as a special case, a NIX with only timesharing cores.

2. The application core (AC) runs applications. Only APs can be application cores. AC applications are run in a non-preemptive mode, and never field interrupts. On ACs, applications run as if they had no operating system, but they can still make system calls and rely on OS services as provided by other cores. Except for performance, running on ACs is transparent to applications, unless they want to explicitly utilize the capabilities of the core.

3. The kernel core (KC) runs OS tasks for the TC, which are created under the control of the TC. A KC might, for example, run a file system call. KCs never run user mode code. Typical KC tasks include servicing interrupts from device drivers, and performing system calls directed to otherwise overloaded TCs.

This separation of cores into specific roles was influenced by discussions with processor manufacturers. While current systems tend to be SMP with homogeneous cores (with a few exceptions, such as the systems developed by IBM mentioned earlier), there is no guarantee of such homogeneity in future systems. Systems with 1024 cores will not need to have all 1024 cores running the kernel, especially given that designers see potential die space and power savings if, for example, a system with N cores has only √N cores complex enough to run an operating system and manage interrupts and input/output (I/O).

Core Start-Up

The BSP behaves as a TC and coordinates the activity of other cores in the system. After start-up, ACs sit in a simple command loop, acsched, waiting for commands, and executing them as instructed. KCs do the same, but they are never requested to execute user code. TCs execute standard timesharing kernels, similar to a conventional SMP system.

Different cores have different roles, as illustrated in Figure 1.
There are two different types of kernel in the system. TCs and KCs execute the standard timesharing kernel. ACs execute almost without a kernel, but for a few exception handlers and their main scheduling loop. This has a significant impact on the interference caused by the system to application code, which is negligible on ACs.

Cores can change roles, again under the control of the TC. A core might be needed for applications, in which case NIX can direct the core to enter the AC command loop. Later, the TC might instruct the core to exit the AC loop and reenter the group of TCs. At present, only one core, the BSP, boots to become a TC; all other cores boot and enter the AC command loop. The BSP coordinates the roles for other cores, which may change over time.

Inter-Core Communication

In other systems addressing heterogeneous manycore architectures, message passing is the basis for coordination. Some of them handle the different cores as a distributed system [7]. While future manycore systems may have heterogeneous CPUs, it appears that one aspect will remain unchanged: there will still be some shared memory addressable from all cores. NIX takes advantage of this property. NIX-specific communications for management are performed via active messages, called inter-core-calls, or ICCs.

An ICC consists of a structure containing a pointer to a function, a set of arguments, and some maintenance indications to be interpreted in an architecture-dependent manner, e.g., to flush the TLB. Each core has a unique ICC structure associated with it, and polls for a new message while idle. The core, upon receiving the message, calls the supplied function with the arguments given. Note that the arguments, and not just the function, can be pointers, because memory is shared.

The BSP reaches other cores mostly by means of ICCs. Inter-processor interrupts (IPIs) are seldom used.
IPIs are relegated to cases where asynchronous communication cannot be avoided; for example, when a user wants to interrupt a program running in an AC.

Because of this design, it could be said that ACs do not actually have a kernel. Their kernel looks more like a single loop, executing messages as they arrive.

User Interface

In many cases, user programs may ignore that they are running on NIX and behave as they would in a standard Plan 9 system. The kernel may assign an AC to a process, for example, because it is consuming full scheduling quanta and is considered to be CPU bound by the scheduler. In the same way, an AC process that issues frequent system calls might be moved to a TC, or a KC might be assigned to execute its system calls. In all cases, this is transparent to the process.

For users who wish to maximize the performance of their programs in the NIX environment, manual control is also possible. There are two primary interfaces to the new system functions:

  • A new system call: execac(int core, char *path, char *args[]);
  • Two new flags for the rfork system call: rfork(RFCORE); rfork(RFCCORE);

Execac is similar to exec, but includes a core number as its first argument. If this number is 0, the standard exec system call is performed, except that all pages are faulted-in before the process starts. If the number is negative, the process is moved to an AC chosen by the kernel. If the number is positive, the process moves to the AC with that core number, if available; otherwise the system call fails.

Figure 1. Different cores have different roles. (AC: application core; BSP: bootstrap processor; KC: kernel core; TC: timesharing core.)

RFCORE is a new flag for the standard Plan 9 rfork system call, which controls resources for the process. This flag asks the kernel to move the process to an AC, chosen by the kernel. A counterpart, RFCCORE, may be used to ask the kernel to move the process back to a TC.

Thus, processes can also change their mode.
Most processes are started on the TC and, depending on the type of rfork or exec being performed, can optionally transition to an AC. If the process encounters a fault, makes a system call, or has some other problem, it will transition back to the TC for service. Once serviced, the process may resume execution in the AC.

Figure 2 shows a traditional Plan 9 process, which has its user space in the same context used for its kernel space. After moving to an AC, the user context is found on a different core, but the kernel part remains in the TC as a handler for system calls and traps.

Binary compatibility for processes is hence rather easy: old (legacy) processes only run on a TC, unless the kernel decides otherwise. If a user makes a mistake and starts an old binary on an AC, it will simply move back to a TC at the first system call and stay there until directed to move. If the binary is spending most of its time on computation, it can still move back to an AC for the duration of the time spent in heavy computation. No checkpointing is needed: this movement is possible because the cores share memory, which means that all cores, whether TC, KC, or AC, can access any process on any other core as if it were running locally.

Semaphores and Tubes

In NIX, because applications may run in ACs undisturbed by other kernel activities, it is important to be able to perform interprocess communication (IPC) without resorting to kernel calls if that is feasible. Besides the standard Plan 9 IPC toolkit, NIX includes two additional mechanisms for this purpose:

  • Optimistic user-level semaphores, and
  • Tubes, a new user shared-memory communications mechanism.

NIX semaphores use atomic increment and decrement operations, as found on AMD64* and other architectures, to update a semaphore value in order to synchronize.
If the operation performed on the semaphore may proceed without blocking (and without needing to wake a peer process from a sleep state), it is performed by the user library without entering the kernel. Otherwise, the kernel is called upon to either block or wake another process. The implementation is simple, with just 124 lines of C in the user library and 310 lines of code in the kernel.

The interface provided for semaphores contains the two standard operations as well as another one, altsems.

    void upsem(int *sem);
    void downsem(int *sem);
    int altsems(int *sems[], int nsems);

Upsem and downsem are traditional semaphore operations, other than their optimism and their ability to run at user-level when feasible. In the worst case, they call two new system calls (semsleep and semwakeup) to block or wake another process, if required. Optionally, before blocking, downsem may spin, while busy waiting for a chance to perform a down operation on the semaphore without resorting to blocking.

Figure 2. A traditional Plan 9 process. (AC: application core; TC: timesharing core.)

Altsems is a novel operation, which tries to perform a downsem in one of the given semaphores. (Using downsem is not equivalent because it is not known in advance which semaphore will be the target of a down operation.) If several semaphores are ready for a down without blocking, one of them is selected and the down is performed; the function returns an index value indicating which one is selected. If none of the downs may proceed, the operation calls the kernel and blocks.

Therefore, in the best case, altsems performs a down operation in user space in a non-deterministic way, without entering the kernel. In the worst case, the kernel is used to await a chance to down one of the semaphores.
Before calling the kernel, altsems may be configured to spin and busy wait for a period while waiting for its turn.

Optimistic semaphores, as described, are used in NIX to implement shared memory communication channels called tubes. A tube is a buffered unidirectional communications channel. Fixed-size messages can be sent and received from it (but different tubes may have different message sizes). The interface is similar to that for channels in the Plan 9 thread library [1]:

    Tube* newtube(ulong msz, ulong n);
    void freetube(Tube *t);
    int nbtrecv(Tube *t, void *p);
    int nbtsend(Tube *t, void *p);
    void trecv(Tube *t, void *p);
    void tsend(Tube *t, void *p);
    int talt(Talt a[], int na);

Newtube creates a tube for the given message size and number of messages in the tube buffer. Tsend and trecv can be used to send and receive. There are non-blocking variants, which fail instead of blocking if the operation cannot proceed. And there is a talt request to perform alternative sends and/or receives on multiple tubes.

The implementation is a simple producer-consumer written with 141 lines of C, but, because of the semaphores used, it is able to run in user space without entering the kernel so that sends and receives may proceed:

    struct Tube
    {
        int msz;    /* message size */
        int tsz;    /* tube size (# of messages) */
        int nmsg;   /* semaphore: # of messages in tube */
        int nhole;  /* semaphore: # of free slots in tube */
        int hd;
        int tl;
    };

It is feasible to try to perform multiple sends and receives on different tubes at the same time, waiting for a chance to execute one of them. This feature exploits altsems to operate at user-level if possible, or calls the kernel otherwise.
To perform such an alternative operation, it suffices to fill an array of semaphores with, for each tube, either the semaphore representing the messages in the tube or the semaphore representing the empty slots in the tube, depending on whether a receive or a send operation is selected. Then, calling altsems guarantees that upon return, the operation may proceed.

Implementation

The initial implementation has been both developed and tested on the AMD64 architecture. We have changed a surprisingly small amount of code at this point. There are about 400 lines of new assembler source, about 80 lines of platform-independent C source, and about 350 lines of AMD64 C source code (not counting the code for NIX semaphores). To this, we have to add a few extra source lines in the start-up code, system call, and trap handlers.

As of today, the kernel is operational, although not in production. More work is needed on system interfaces, role changing, and memory management; but the kernel is active enough to be used, at least for testing.

To dispatch a process for execution at an AC, we use the ICC mechanism. ACs, as part of their start-up sequence, run the acsched function, which waits in a fast loop, polling for an active message function pointer. To ask the AC to execute a function, the TC sets the arguments for the AC, as well as an indication of whether or not to flush the TLB, and a function pointer, all within the ICC structure. It then changes the process state to Exotic, which would block the process for the moment. When the pointer becomes non-nil, the AC calls the function. The function pointer uses one cache line while all other arguments use a different cache line, in order to minimize the number of bus transactions when polling. Once the function is complete, the AC sets the function pointer to nil and calls ready on the process that scheduled the function for execution.
The ICC mechanism can be thought of as a software-only IPI.

When an AP encounters a fault, the AP transfers control of the process back to the TC, by finishing the ICC and then waiting for directions. That is, the AP spins on the ICC structure waiting for a new call. The handling process, running runacore, deals with the fault, and issues a new ICC to make the AP return from the trap, so the process continues execution in its core. The trap might kill the process, in which case the AC is released and becomes idle.

When an AP makes a system call, the kernel handler in the AP returns control back to the TC in the same manner as page faults, by completing the ICC. The handler process in the TC serves the system call and then transfers control back to the AC, by issuing a new ICC to let the process continue its execution after returning to user mode. As in traps, the user context is maintained in the handler process's kernel stack, as is done for all other processes.

The handler process, that is, the original timesharing process when executing runacore, behaves like the red line separating the user code (now in the AC) and the kernel code (which is run in the TC).

Hardware interrupts are all routed to the BSP. ACs should not accept any interrupts, since they cause jitter. We changed the round-robin allocation code to find the first core able to take interrupts and route all interrupts to the available core. We currently assume that the first core is the BSP. (Note that it is still feasible to route interrupts to other TCs or KCs, and we actually plan to do so in the future.) In addition, no advanced programmable interrupt controller (APIC) timer interrupts are enabled on ACs. User code in ACs runs undisturbed, until it faults or makes a system call.

The AC requires a few new assembler routines to transfer control to/from user space, while using the standard Ureg space on the bottom of the process kernel stack for saving and restoring process context.
The process kernel stack is not used (except for maintaining the Ureg) while in the ACs; instead, the per-core stack, known as the Mach stack, is used for the few kernel routines executed in the AC.

Because the instructions used to enter the kernel, and the sequence, are exactly the same in both the TC and the AC, no change is needed in the C library (other than a new system call for using the new service). All system calls may proceed, transparently for the user, in both kinds of cores.

Queue-Based System Calls?

As an experiment, we implemented a small thread library supporting queue-based system calls similar to those in [17]. Threads are cooperatively scheduled within the process and not known by the kernel.

Each process has been provided with two queues: one to record system call requests, and another to record system call replies. When a thread issues a system call, it fills up a slot in the system call queue, instead of making an actual system call. At that point, the thread library marks the thread as blocked and proceeds to execute other threads. When all the threads are blocked, the process waits for replies in the reply queue.

Before using the queue mechanism, the process issues a real system call to let the kernel know. In response to this call, the kernel creates a (kernel) process sharing all segments with the caller. This process is responsible for executing the queued system calls and placing replies for them in the reply queue.

We made several performance measurements with this implementation. In particular, we measured how long it takes for a program with 50 threads to execute 500 system calls in each thread. For the experiment, the system call used does not block and does nothing. Table I shows the results.

It takes this program 0.06 seconds (of elapsed real time) to complete when run on the TC using the queue-based mechanism. However, it takes only 0.02 seconds to complete when using the standard system call mechanism. Therefore, at least for this program, the mechanism is more of an overhead than a benefit in the TC. It is likely that the total number of system calls per second that could be performed in the machine might increase due to the smaller number of domain crossings. However, for a single program, that does not seem to be the case.

As another experiment, running the same program on the AC takes 0.20 seconds when using the standard system call mechanism, and 0.04 seconds when using queue-based system calls. The AC is not meant to perform system calls. A system call made while running on it implies a trip to the TC and another trip back to the AC. As a result, issuing system calls from the AC is expensive. The elapsed time for the queue-based calls is similar to that for calls running in the TC, but more expensive. Therefore, we may conclude that queue-based system calls may make system calls affordable even for ACs.

However, a simpler mechanism is to keep in the TC those processes that do not consume their entire quantum at the user level, and move only the processes that do to the ACs. As a result, we have decided not to include queue-based system calls (although tubes can still be used for IPC).

Current Status

Future work includes the following: adding more statistics to the /proc interface to reflect the state of the system, considering the different system cores; deciding if the current interface for the new service is the right one, and to what extent the mechanism has to be its own policy; implementing KCs (which should not require any code, because they must execute the standard kernel); testing and debugging note handling for the AC processes; and more testing and fine tuning.

We are currently working on implementing new physical memory management and modifying virtual memory management to be able to exploit multiple memory page sizes, as found on modern architectures. Also, a new zero-copy I/O framework is being developed for NIX.
We intend to use the features described above, and the ones in progress, to build a high performance file service along the lines of Venti [15], but capable of exploiting what a manycore machine has to offer, to demonstrate NIX.

Evaluation

To show that ACs have inherently less noise than TCs we used the FTQ, or fixed time quantum, benchmark, a test designed to provide quantitative characterization of OS noise [18]. FTQ performs work for a fixed amount of time and then measures how much work was done. Variance in the amount of work performed is a result of OS noise.

In the following graphs we use FTQ to measure the amount of OS noise on ACs and TCs. Our goal is to maximize the amount of work done while placing special emphasis on avoiding any sort of aperiodic slowdowns in the work performed. Aperiodicity is a consequence of non-deterministic behaviors in the system that have the potential to introduce a cascading lack of performance as cores slow down at random.

We measure close to 20 billion cycles in 50 thousand cycle increments. The amount of work is measured between zero and just over 2000 arbitrary units.

The results of the first run of the FTQ benchmark on a TC are shown in Figure 3. Upon closer examination, these results revealed underlying aperiodic noise, as shown in Figure 4. This noise pattern suggests that the TC is unsuitable for any activities that require deterministically bounded latency.

Figure 3. FTQ benchmark on a timesharing core showing work versus cycles. (FTQ: fixed time quantum)

Figure 4. Underlying aperiodic noise seen in an FTQ on a timesharing core. (FTQ: fixed time quantum)

Figure 5 shows results from our initial FTQ test on an AC. The AC is doing the maximum amount of work every quantum but is interrupted by large tasks every 256 quanta. We determined that these quanta coincide with the FTQ task crossing a page boundary. The system is stalling as more data is paged in.

Figure 5. FTQ test on an application core. (FTQ: fixed time quantum)

We confirm the origin of this periodic noise by zeroing all of the memory used for FTQ's work buffers. This forces the data pages into memory, eliminating the periodic noise and constantly attaining the theoretical maximum for the FTQ benchmark. This shows the effectiveness of the AC model for computationally intensive HPC tasks.

Table I. Times in seconds, of elapsed real time, for a series of standard system calls and queue-based system calls from the TC and the AC.

Core    System call    Queue call
TC      0.02 s         0.06 s
AC      0.20 s         0.04 s

AC: Application core
TC: Timesharing core

The final result is shown in Figure 6. This result is not only very good, it is close to the theoretical ideal. In fact the variance is almost entirely confined to the low-order bit, within the measurement error of the counter.

Related Work

The Multikernel [3] architecture presents a novel approach for multicore systems. It handles different cores as different systems. Each one is given its own operating system. NIX takes the opposite approach: we try to remove the kernel from the application's core, avoiding unwanted OS interference.

Helios [13] introduces satellite kernels that run on different cores but still provide the same set of abstractions to user applications. It relies on a traditional message passing scheme, similar to microkernels. NIX also relies on message passing, but it uses active messages and its application cores are more similar to a single loop than they are to a full kernel. Also, NIX does not interfere with applications while in application cores, which Helios does.

NIX does not focus on using architecturally heterogeneous cores like other systems do, for example Barrelfish [16].
In Barrelfish, the authors envision using a system knowledge base to perform queries in order to coordinate usage of heterogeneous resources. However, the system does not seem to be implemented, and it is unclear whether it will be able to execute actual applications in the near future, although it is an interesting approach. NIX took a rather different direction, by leveraging a working system, so that it could run user programs from day zero of its development.

Tessellation* [12] partitions the machine, performing both space and time multiplexing. An application is given a partition and may communicate with the rest of the world using messages. In some sense, the Tessellation kernel is like an exokernel for partitions, multiplexing partition resources. In contrast, NIX is a full-featured operating system kernel. In NIX, some cores may be given to applications. However, applications still have to use standard system services to perform their work.

In their "Analysis of Linux Scalability to Many Cores" [4], the authors conclude that there is no scalability reason to give up on traditional operating system organizations just yet. This supports the idea behind the NIX approach about leveraging a working SMP system and adapting it to better exploit manycore machines. However, the NIX approach differs from the performance adjustments made to Linux in said analysis.

Figure 6. Final result: FTQ on an application core. (FTQ: fixed time quantum)

Fos [19] is a single system image operating system across both multicore and infrastructure-as-a-service (IaaS) cloud systems. It is a microkernel system designed to run cloud services. It builds on Xen* to run virtual machines for cloud computing. NIX is not focused on IaaS facilities, but on kernel organization for manycore systems.

ROS [11] modifies the process abstraction to become manycore.
That is, a process may be scheduled for execution into multiple cores, and the system does not dictate how concurrency is to be handled within the process, which is now a multi-process. In NIX, an application can achieve the same effect by requesting multiple cores (creating one process for each core). Within such processes, user-level threading facilities may be used to architect intra-application scheduling. It is unclear if an implementation of ROS ideas exists, since it is not indicated by the paper.

Conclusions

NIX presents a new model for operating systems on manycore systems. It follows a heterogeneous multi-core model, which is the anticipated model for future systems. It uses active messages for inter-core control. It also preserves backwards compatibility, so that existing programs will always work. Its architecture prevents interference from the OS on applications running on dedicated cores, while at the same time providing a full-fledged general purpose computing system. The approach has been applied to Plan 9 from Bell Labs but, in principle, it can be applied to any other operating system.

Acknowledgements

This material is based upon work supported in part by the U.S. Department of Energy under award numbers DE-FC02-10ER25597 and DE-FG02-08ER25847; by the Sandia National Laboratory, a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000 SAND-2011-3681P; and by Spanish Government grant TIN2010-17344 and Comunidad de Madrid grant S2009/TIC-1692.

The authors would also like to thank Fabio Pianese, who provided extensive feedback and clarifications.
His help is greatly appreciated.

*Trademarks

AMD 64 is a registered trademark of Advanced Micro Devices.
Blue Gene is a registered trademark of International Business Machines Corporation.
HyperTransport is a registered trademark of HyperTransport Technology Consortium.
Linux is a trademark of Linus Torvalds.
Tessellation is a registered trademark of Tessellation Software LLC.
UNIX is a registered trademark of The Open Group.
Xen is a registered trademark of Citrix Systems, Inc.

References

[1] Alcatel-Lucent, Bell Labs, "Thread(3), Plan 9 Thread Library,"
[2] M. Baron, "The Single-Chip Cloud Computer," Microprocessor Report, Apr. 26, 2010.
[3] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, "The Multikernel: A New OS Architecture for Scalable Multicore Systems," Proc. 22nd ACM Symp. on Operating Syst. Principles (SOSP '09) (Big Sky, MT, 2009), pp. 29-44.
[4] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich, "An Analysis of Linux Scalability to Many Cores," Proc. 9th USENIX Symp. on Operating Syst. Design and Implementation (OSDI '10) (Vancouver, BC, Can., 2010).
[5] T. Chen, R. Raghavan, J. N. Dale, and E. Iwata, "Cell Broadband Engine Architecture and Its First Implementation: A Performance View," IBM J. Res. Dev., 51:5 (2007), 559-572.
[6] W. de Bruijn, H. Bos, and H. Bal, "Application-Tailored I/O with Streamline," ACM Trans. Comput. Syst., 29:2 (2011), article 6.
[7] D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr., "Exokernel: An Operating System Architecture for Application-Level Resource Management," ACM SIGOPS Operating Syst. Rev., 29:5 (1995), 251-266.
[8] HyperTransport Consortium, "HyperTransport,"
[9] Intel, "Intel QuickPath Interconnect Maximizes Multi-Core Performance," .html.
[10] C. Johnson, D. H. Allen, J. Brown, S. Vanderwiel, R. Hoover, H. Achilles, C.-Y. Cher, G. A. May, H. Franke, J. Xenedis, and C. Basso, "A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads," Proc. IEEE Internat. Solid-State Circuits Conf. Digest of Tech. Papers (ISSCC '10) (San Francisco, CA, 2010), pp. 104-105.
[11] K. Klues, B. Rhoden, A. Waterman, D. Zhu, and E. Brewer, "Processes and Resource Management in a Scalable Many-Core OS," Proc. 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar '10) (Berkeley, CA, 2010).
[12] R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz, "Tessellation: Space-Time Partitioning in a Manycore Client OS," Proc. 1st USENIX Workshop on Hot Topics in Parallelism (HotPar '09) (Berkeley, CA, 2009).
[13] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt, "Helios: Heterogeneous Multiprocessing with Satellite Kernels," Proc. 22nd ACM Symp. on Operating Syst. Principles (SOSP '09) (Big Sky, MT, 2009), pp. 221-234.
[14] R. Pike, D. Presotto, K. Thompson, and H. Trickey, "Plan 9 from Bell Labs," Proc. UK's Unix and Open Syst. User Group Summer Conf. (UKUUG '90 Summer) (London, Eng., 1990).
[15] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Data Storage," Proc. 1st USENIX Conf. on File and Storage Technol. (FAST '02) (Monterey, CA, 2002), article 7.
[16] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs, "Embracing Diversity in the Barrelfish Manycore Operating System," Proc. Workshop on Managed Many-Core Syst. (MMCS '08) (Boston, MA, 2008).
[17] L. Soares and M. Stumm, "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls," Proc. 9th USENIX Conf. on Operating Syst. Design and Implementation (OSDI '10) (Vancouver, BC, Can., 2010).
[18] M. Sottile and R. Minnich, "Analysis of Microbenchmarks for Performance Tuning of Clusters," Proc. IEEE Internat. Conf. on Cluster Comput. (Cluster '04) (San Diego, CA, 2004), pp. 371-377.
[19] D. Wentzlaff, C. Gruenwald III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. Miller, and A. Agarwal, "An Operating System for Multicore and Clouds: Mechanisms and Implementation," Proc. 1st ACM Symp. on Cloud Comput. (SoCC '10) (Indianapolis, IN, 2010), pp. 3-14.

(Manuscript approved March 2012)

FRANCISCO J. BALLESTEROS leads the Laboratorio de Sistemas at University Rey Juan Carlos (URJC) in Madrid, Spain. He received his Ph.D. in computer science from the Universidad Politecnica of Madrid. He is a lecturer on operating systems and computer programming, and a researcher on operating systems and system software in general. He has a strong background in system software design and development, and has published numerous papers in related fields. He developed Off++, a kernel for the University of Illinois at Urbana-Champaign Systems Research Group's 2K operating system. He designed and implemented the core components of the Plan B and Octopus operating systems. He has also developed different components of the Plan 9 from Bell Labs and the Linux kernels. His interests include operating systems, adaptable systems, window systems, and file system protocols.

NOAH EVANS is a member of technical staff in the Bell Labs Networks and Networking research domain in Antwerp, Belgium. His work at Bell Labs includes shells for large scale computing, mobility protocols, and operating systems development. He is interested in scalable systems, distributed computing and language design.

CHARLES FORSYTH is a founder and technical director at Vita Nuova, in York, England, which specializes in systems software. He is interested in compilers, operating systems, networking (protocols and services), security, and distributed systems and algorithms. He specializes in the design and implementation of systems software, from low-level drivers through compilers to whole operating systems. He has published papers on operating systems, Ada compilation, worst-case execution analyzers for safety-critical applications, resources as files, and the development of computational grids.
He has a computer science from the University of York, United Kingdom.

GORKA GUARDIOLA is an associate professor at the Universidad Rey Juan Carlos and member of its Operating Systems Lab (Laboratorio de Sistemas) in Madrid, Spain. He received his M.S. in telecommunication engineering at the Carlos III University, and his Ph.D. in computing science from the Universidad Rey Juan Carlos. He has participated on the teams developing both the Plan B and the Octopus operating systems, and is an active developer on the Plan 9 operating system. His research interests include operating systems, concurrent programming, embedded computing, and pure mathematics.

JIM MCKIE is a member of technical staff in the Networking research domain at Bell Labs, in Murray Hill, New Jersey. His research interests include the continued evolution of novel operating systems onto new generations of networking and processor architectures ranging from embedded systems to supercomputers. Mr. McKie holds a B.Sc. in electrical and electronic engineering and an M.Sc. in digital techniques, both from Heriot-Watt University, Scotland.

RON MINNICH is a distinguished member of technical staff at Sandia National Laboratories in Livermore, California. He has worked in high performance computing for 25 years. His work for the U.S. Department of Energy includes LinuxBIOS and ClusterMatic, both developed at Los Alamos National Laboratory (LANL); porting Plan 9 to the Blue Gene/L and /P systems; and booting Linux virtual machines for cybersecurity research.

ENRIQUE SORIANO-SALVADOR is an associate professor at the Universidad Rey Juan Carlos in Madrid, Spain and a member of its Laboratorio de Sistemas. He earned his computer engineering, and received a predoctoral grant (FPI) from the Spanish government to obtain his Ph.D. from the University of Rey Juan Carlos. He has been part of the team for the Plan B and Octopus operating systems. He is also author of several research papers on system software, security, and pervasive computing. His research interests include operating systems, distributed systems, security, and concurrent programming.

