SOLARIS™ OPERATING SYSTEM HARDWARE VIRTUALIZATION PRODUCT ARCHITECTURE

Chien-Hua Yen, ISV Engineering ([email protected])
Sun BluePrints™ On-Line, November 2007
Part No. 820-3703-10, Revision 1.0, 11/27/07
Edition: November 2007


Table of Contents

Introduction  1
Hardware Level Virtualization  2
Scope  4

Section 1: Background Information  7
Virtual Machine Monitor Basics  9
  VMM Requirements  9
  VMM Architecture  11
The x86 Processor Architecture  21
SPARC Processor Architecture  29

Section 2: Hardware Virtualization Implementations  37
Sun xVM Server  39
  Sun xVM Server Architecture Overview  40
  Sun xVM Server CPU Virtualization  45
  Sun xVM Server Memory Virtualization  52
  Sun xVM Server I/O Virtualization  56
Sun xVM Server with Hardware VM (HVM)  63
  HVM Operations and Data Structure  64
  Sun xVM Server with HVM Architecture Overview  68
Logical Domains  79
  Logical Domains (LDoms) Architecture Overview  80
  CPU Virtualization in LDoms  84
  Memory Virtualization in LDoms  88


  I/O Virtualization in LDoms  91
VMware  97
  VMware Infrastructure Overview  98
  VMware CPU Virtualization  98
  VMware Memory Virtualization  103
  VMware I/O Virtualization  103

Section 3: Additional Information  107
VMM Comparison  109
References  111
Terms and Definitions  113
Author Biography  117

Chapter 1 Introduction

In the IT industry, virtualization is a mechanism for presenting a set of logical computing resources on top of a fixed hardware configuration so that these logical resources can be accessed in the same manner as the original hardware. The concept of virtualization is not new. First introduced in the late 1960s on mainframe computers, virtualization has recently become popular as a means to consolidate servers and reduce the costs of hardware acquisition, energy consumption, and space utilization. The hardware resources that can be virtualized include computer systems, storage, and the network.

Server virtualization can be implemented at different levels of the computing stack, including the application level, operating system level, and hardware level.

An example of application level virtualization is the Virtual Machine for the Java platform (Java Virtual Machine, or JVM)¹. The JVM implementation provides an application execution environment as a layer between the application and the OS, removing application dependencies on OS-specific APIs and hardware-specific characteristics.

¹ The terms "Java Virtual Machine" and "JVM" mean a Virtual Machine for the Java(TM) platform.


OS level virtualization abstracts OS services such as file systems, devices, networking, and security, and provides a virtualized operating environment to applications. Typically, OS level virtualization is implemented by the OS kernel: only one instance of the kernel runs on the system, and it provides multiple virtualized operating environments to applications. Examples of OS level virtualization include Solaris Containers technology, Linux VServers, and FreeBSD Jails. OS level virtualization has less performance overhead and better system resource utilization than hardware level virtualization. However, because one OS kernel is shared among all virtual operating environments, isolation among those environments is only as strong as the isolation the OS itself provides.

Hardware level virtualization, discussed in detail in this paper, has become popular recently because of increasing CPU power and low utilization of CPU resources in the IT data center. Hardware level virtualization allows a system to run multiple OS instances. With less sharing of system resources than OS level virtualization, hardware virtualization provides stronger isolation of operating environments.

The Solaris OS includes bundled support for application and OS level virtualization with its JVM software and Solaris Containers offerings. Sun first added support for hardware virtualization in the Solaris 10 11/06 release with Sun Logical Domains (LDoms) technology, supported on Sun servers that use UltraSPARC T1 or UltraSPARC T2


processors. VMware also supports the Solaris OS as a guest OS in its VMware Server and Virtual Infrastructure products, starting with the Solaris 10 1/06 release. In October 2007, Sun announced the Sun xVM family of products, which includes the Sun xVM Server and the Sun xVM Ops Center management system:

- Sun xVM Server includes support for the Xen open source community work [6] on the x86 platform and support for LDoms on the UltraSPARC T1/T2 platform.
- Sun xVM Ops Center is a management suite for the Sun xVM Server.

Note: To distinguish the discussion of x86 and UltraSPARC T1/T2 processors in this paper, Sun xVM Server refers specifically to the Sun hardware virtualization product for the x86 platform, and LDoms refers to the Sun hardware virtualization product for the UltraSPARC T1 and T2 platforms.

Hardware virtualization technology, and the new products built around it, have expanded the options and opportunities for deploying servers with better utilization, more flexibility, and enhanced functionality. In reaping the benefits of hardware virtualization, IT professionals also face the challenge of operating within the limitations of a virtualized environment while delivering the same level of service as a physical operating environment. Meeting this requirement takes a good understanding of virtualization technologies, CPU architectures, and software implementations, as well as an awareness of their strengths and limitations.

Hardware Level Virtualization

Hardware level virtualization is a mechanism for virtualizing system hardware resources such as CPU, memory, and I/O, creating multiple execution environments on a single system. Each of these execution environments runs an instance of the operating system. A hardware level virtualization implementation typically consists of several virtual machines (VMs), as shown in Figure 1. A layer of software, the virtual machine monitor (VMM), manages system hardware resources and presents an abstraction of these resources to each VM. The VMM runs in privileged mode and has full control of system hardware. A guest operating system (GOS) runs in each VM. The relationship of a GOS to its VM is analogous to that of a program to a process, with the VMM playing the role that the OS plays for processes.


Figure 1. In hardware level virtualization, the VMM software manages hardware resources and presents an abstraction of these resources to one or more virtual machines.


Hardware resource virtualization can take the form of sharing, partitioning, or delegating:

- Sharing: Resources are shared among VMs, and the VMM coordinates their use. For example, the VMM may include a CPU scheduler that runs the threads of VMs based on a pre-determined scheduling policy and VM priority.
- Partitioning: Resources are partitioned so that each VM gets the portion of resources allocated to it. Partitioning can be adjusted dynamically by the VMM based on the utilization of each VM. Examples of resource partitioning include the ballooning memory technique employed in Sun xVM Server and VMware, and the allocation of CPU resources in Logical Domains technology.
- Delegating: Resources are not directly accessible by a VM. Instead, all resource accesses are made through a control VM that has direct access to the resource. I/O device virtualization is normally implemented through delegation.

The distinctions and boundaries between these virtualization methods are often not clear cut. For example, sharing may be used for one component and partitioning for others, and together they make up an integral functional module.

Benefits of Hardware Level Virtualization

Hardware level virtualization allows multiple operating systems to run on a single server system. This ability offers many benefits that are not available on a single-OS server. These benefits can be summarized in three functional categories:

Workload Consolidation. According to Gartner [17], "Intel servers running at 10 percent to 15 percent utilization are common." Many IT organizations buy a new server every time they deploy a new application. With virtualization, computers no longer have to be dedicated to a particular task. Applications and users can share computing resources, remaining unaware that they are doing so. Companies can shift computing resources around to meet demand at a given time, and get by with less infrastructure overall. When used for consolidation, virtualization can also save hardware and maintenance expenses, floor space, cooling costs, and power consumption.

Workload Migration. Hardware level virtualization decouples the OS from the underlying physical platform resources. A guest OS state, along with the user applications running on top of it, can be encapsulated into an entity and moved to another system. This capability is useful for migrating a legacy OS from an old, under-powered server to a more powerful server while preserving the investment in software.


When a server needs to be maintained, a VM can be dynamically migrated to a new server with no downtime, further enhancing availability. Changes in workload intensity can be addressed by dynamically shifting underlying resources to the starving VMs. Legacy applications that ran natively on a server continue to run on the same OS inside a VM, leveraging the existing investment in applications and tools.

Workload Isolation. Workload isolation includes both fault and security isolation. Multiple guest OSes run independently, so a software failure in one VM does not affect other VMs. However, the VMM layer introduces a single point of failure that can bring down all VMs on the system. A VMM failure, although potentially catastrophic, is less probable than a failure in the OS because the VMM is much less complex than an OS. Multiple VMs also provide strong security isolation among themselves, with each VM running an independent OS. Security intrusions are confined to the VM in which they occur. The boundary around each VM is enforced by the VMM, and inter-domain communication, if provided by the VMM, is restricted to specific kernel modules only.

One distinct feature of hardware level virtualization is the ability to run multiple instances of heterogeneous operating systems on a single hardware platform. This feature is important for the following reasons:

- Better security and fault containment among application services can be achieved through OS isolation.
- Applications written for one OS can run on a system that supports a different OS.
- Better management of system resource utilization is possible among the virtualized environments.

Scope

This paper explores the underlying hardware architecture and software implementation for enabling hardware virtualization. Great emphasis has been placed on the CPU hardware architecture limitations for virtualizing CPU services and their software workarounds. In addition, this paper discusses in detail the software architecture for implementing the following types of virtualization:

- CPU virtualization, which uses processor privileged modes to control resource usage by the VM and relays hardware traps and interrupts to VMs
- Memory virtualization, which partitions physical memory among multiple VMs and handles page translations for each VM
- I/O virtualization, which uses a dedicated VM with direct access to I/O devices to provide device services


The paper is organized into three sections. Section I, Background Information, contains information on VMMs and provides details on the x86 and SPARC processors:

- "Virtual Machine Monitor Basics" on page 9 discusses the core of hardware virtualization, the VMM, as well as requirements for the VMM and several types of VMM implementations.
- "The x86 Processor Architecture" on page 21 describes features of the x86 processor architecture that are pertinent to virtualization.
- "SPARC Processor Architecture" on page 29 describes features of the SPARC processor that affect virtualization implementations.

Section II, Hardware Virtualization Implementations, provides details on the Sun xVM Server, Logical Domains, and VMware implementations:

- "Sun xVM Server" on page 39 discusses a paravirtualized Solaris OS that is based on an open source VMM implementation for x86 processors [6] and is planned for inclusion in a future Solaris release.
- "Sun xVM Server with Hardware VM (HVM)" on page 63 continues the discussion of Sun xVM Server for the x86 processors that support hardware virtual machines: Intel-VT and AMD-V.
- "Logical Domains" on page 79 discusses Logical Domains (LDoms), supported on Sun servers that use UltraSPARC T1 or T2 processors, and describes Solaris OS support for this feature.
- "VMware" on page 97 discusses the VMware implementation of the VMM.

Section III, Additional Information, contains a concluding comparison, references, and appendices:

- "VMM Comparison" on page 109 presents a summary of the VMM implementations discussed in this paper.
- "References" on page 111 provides a comprehensive listing of related references.
- "Terms and Definitions" on page 113 contains a glossary of terms.
- "Author Biography" on page 117 provides information on the author.


Section I Background Information

Chapter 2: Virtual Machine Monitor Basics (page 9)
Chapter 3: The x86 Processor Architecture (page 21)
Chapter 4: SPARC Processor Architecture (page 29)


Chapter 2 Virtual Machine Monitor Basics

At the heart of hardware level virtualization is the VMM. The VMM is a software layer that abstracts computer hardware resources so that multiple OS instances can run on a physical system. Hardware resources are normally controlled and managed by the OS; in a virtualized environment the VMM takes over this role, managing and coordinating hardware resources. There is no clear-cut boundary between the responsibilities of an OS and those of the VMM. The division of functions between the OS and the VMM can be influenced by factors such as processor architecture, performance, the OS itself, and non-technical requirements such as ease of installation and migration.

Certain VMM requirements exist for running multiple OS instances on a system. These requirements, discussed in detail in the next section, stem primarily from processor architecture designs that are inherently an impediment to hardware virtualization. Based on these requirements, two types of VMMs have emerged, each with distinct characteristics in defining the relationship between the VMM and an OS. This relationship determines the privilege level of the VMM and an OS, and the control and sharing of hardware resources.

VMM Requirements

A software program communicates with the computer hardware through instructions. Instructions, in turn, operate on registers and memory. If any of the instructions, registers, or memory involved in an action is privileged, that instruction results in a privileged action. Sometimes an action that is not necessarily privileged attempts to change the configuration of resources in the system, and thereby affects other actions whose behavior or result depends on that configuration. The instructions that result in such operations are called sensitive instructions. In the context of this virtualization discussion, a processor's instructions can be classified into three groups:

- Privileged instructions are those that trap if the processor is in non-privileged mode and do not trap if it is in privileged mode.
- Sensitive instructions are those that change or reference the configuration of resources (memory), affect the processor mode without going through the memory trap sequence (page fault), or reference sensitive registers whose contents change when the processor switches to run another VM.
- Non-privileged and non-sensitive instructions are those that do not fall into either of the categories above.

Sensitive instructions have "a major bearing on the virtualizability of a machine" [1] because of their system-wide impact.


In a virtualized environment, a GOS should contain only non-privileged and non-sensitive instructions. If sensitive instructions are a subset of privileged instructions, it is relatively easy to build a VM because all sensitive instructions will result in a trap. In this case a VMM can be constructed to catch all traps that result from the execution of sensitive instructions by a GOS. All privileged and sensitive actions from VMs would be caught by the VMM, and resources could be allocated and managed accordingly (a technique called trap-and-emulate). A GOS's trap handler could then be called by the VMM trap handler to perform the GOS-specific actions for the trap.

If a sensitive instruction is not a privileged instruction, its execution by one VM goes unnoticed by the VMM. Robin and Irvine [3] identified several x86 instructions in this category. These instructions cannot be safely executed by a GOS because they can impact the operations of other VMs or adversely affect the operation of the GOS itself. Instead, these instructions must be substituted by a VMM service. The substitution can take the form of an API for the GOS to call, or a dynamic conversion of these instructions into explicit processor traps.

Types of VMM

In a virtualized environment, the VMM controls the hardware resources. VMMs can be categorized into two types, based on this control of resources:

- Type I maintains exclusive control of hardware resources.
- Type II leverages the host OS by running inside the OS kernel.

The Type I VMM [3] has several distinct characteristics: it is the first software to run (besides the BIOS and the boot loader), it has full and exclusive control of system hardware, and it runs in privileged mode directly on the physical processor. The GOS in a Type I VMM implementation runs in a less privileged mode than the VMM to avoid conflicts in managing the hardware resources. An example of a Type I VMM is Sun xVM Server. Sun xVM Server includes a bundled VMM, the Sun xVM Hypervisor for x86. The Sun xVM Hypervisor for x86 is the first software, besides the BIOS and boot loader, to run during boot, as shown in the GRUB menu.lst file:

title Sun xVM Server
   kernel$ /boot/$ISADIR/xen.gz
   module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix
   module$ /platform/i86pc/$ISADIR/boot_archive


The GRUB bootloader first loads the Sun xVM Hypervisor for x86, xen.gz. After the VMM gains control of the hardware, it loads the Solaris kernel, /platform/i86xpv/kernel/$ISADIR/unix, to run as a GOS. Sun's Logical Domains and VMware's Virtual Infrastructure 3 [4] (formerly known as VMware ESX Server), described in detail in Chapter 7, "Logical Domains," on page 79 and Chapter 8, "VMware," on page 97, are also Type I VMMs.

A Type II VMM typically runs inside a host OS kernel as an add-on module, and the host OS maintains control of the hardware resources. The GOS in a Type II VMM is a process of the host OS. A Type II VMM leverages the kernel services of the host OS to access hardware, and it intercepts a GOS's privileged operations and performs those operations in the context of the host OS. Type II VMMs have the advantage of preserving an existing installation by allowing a new GOS to be added to a running OS. An example of a Type II VMM is VMware's VMware Server (formerly known as VMware GSX Server). Figure 2 illustrates the relationships among hardware, VMM, GOS, host OS, and user applications in virtualized environments.


Figure 2. Virtual machine monitors vary in how they support guest OS, host OS, and user applications in virtualized environments.

VMM Architecture

As discussed in "VMM Requirements" on page 9, the VMM performs some of the functions that an OS normally does: namely, it controls and arbitrates CPU and memory resources, and it provides services to upper layer software for sensitive and privileged operations. These functions require the VMM to run in privileged mode and the OS to relinquish the privileged and sensitive operations to the VMM. In addition to processor and memory operation, I/O device support also has a large impact on VMM architecture.

VMM in Privileged Mode

A processor typically has two or more privilege modes. The operating system kernel runs in the privileged mode; user applications run in a non-privileged mode and trap into the kernel when they need to access system resources or kernel services. A GOS normally assumes it runs in the most privileged mode of the processor. Running a VMM in a privileged mode can be accomplished with one of the following three methods:

- Deprivileging the GOS. This method usually requires a modification to the OS so that it runs at a lower privilege level. For x86 systems, the OS normally runs at protection ring 0, the most privileged level. In Sun xVM Server, ring 0 is reserved for the VMM. This requires the GOS to be modified, or paravirtualized, to run outside of ring 0 at a lower privilege level.
- Hyperprivileging the VMM. Instead of changing the GOS to run at lower privilege, another approach taken by the chip vendors is to create a hyperprivileged processor mode for the VMM. The Sun UltraSPARC T1 and T2 processors' hyperprivileged mode [2], Intel-VT's VMX-root operation (see [7], Volume 3B, Chapter 19), and AMD-V's VMRUN-Exit state (see [9], Chapter 15) are examples of hyperprivileged processor modes for VMM operations.
- Running both the VMM and the GOS in the same privileged mode. It is possible to have both the VMM and the GOS run in the same privileged mode.


In this case, the VMM intercepts all privileged and sensitive operations of a GOS before passing them to the processor. For example, VMware allows both the GOS and the VMM to run in privileged mode. VMware dynamically examines each instruction to decide whether the processor state and the segment reversibility (see "Segmented Architecture" on page 23) allow the instruction to be executed directly without the involvement of the VMM. If the GOS is in privileged mode or the code segment is non-reversible, the VMM performs the necessary conversions of the core execution path.

Removing Sensitive Instructions in the GOS

Privileged and sensitive operations are normally executed by the OS kernel. In a virtualized environment, the GOS has to relinquish privileged and sensitive operations to the VMM. This is accomplished by one of the following approaches:

- Modifying the GOS source code to use the VMM services for handling sensitive operations (paravirtualization). This method is used by Sun xVM Server and Sun's Logical Domains (LDoms). Sun xVM Server and LDoms provide a set of hypercalls for an OS to request VMM services, and the VMM-aware Solaris OS uses these hypercalls to replace its sensitive instructions.
- Dynamically translating the GOS sensitive instructions in software. As described in a previous section, VMware uses binary translation to replace the GOS sensitive instructions with VMM instructions.
- Dynamically translating the GOS sensitive instructions in hardware. This method requires the processor to provide a special mode of operation that is entered when a sensitive instruction is executed in a reduced privilege mode.

The first approach, which involves modifying the GOS source code, is called paravirtualization because the VMM provides only partial virtualization of the processor: the GOS must replace its sensitive and privileged operations with calls to VMM services. The remaining two approaches provide full virtualization to the VM, enabling the GOS to run without modification.
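To make the paravirtualization approach concrete, the sketch below shows how a guest kernel might hand one sensitive operation, masking interrupts, to the VMM instead of executing it directly. The hypercall number, function names, and calling convention are hypothetical placeholders for illustration; they are not the actual Sun xVM or LDoms hypercall interfaces.

    /*
     * Sketch of a paravirtualized guest replacing a sensitive operation
     * with a VMM service.  All names and numbers here are hypothetical.
     */
    #include <stdint.h>

    #define HCALL_SET_INTR_MASK  0x10      /* hypothetical operation code */

    /* Trap into the VMM; a real port would use the processor's hypercall
     * or software trap instruction rather than an external function. */
    extern long hypercall(int op, uint64_t arg);

    /* Native kernel: executes the instruction directly.  On x86 this
     * raises #GP if the CPL is greater than the IOPL. */
    static inline void intr_disable_native(void)
    {
        __asm__ volatile ("cli");
    }

    /* Paravirtualized kernel: asks the VMM for the same effect.  The VMM
     * updates a per-VM virtual interrupt mask instead of the hardware flag. */
    static inline void intr_disable_paravirt(void)
    {
        (void) hypercall(HCALL_SET_INTR_MASK, 1 /* mask interrupts */);
    }

With full virtualization, by contrast, the unmodified guest still executes the original instruction, and it is the VMM, through binary translation or a hardware-assisted exit, that intercepts and emulates it.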


In addition to OS modification, performance requirements, processor architecture design, tolerance of a single point of failure, and support for legacy OS installations also have an impact on the design of a VMM architecture.

Physical Memory Virtualization

Memory management by the VMM involves two tasks: partitioning physical memory among VMs, and supporting page translations in a VM. Each OS assumes that physical memory starts from page frame number (PFN) 0 and is contiguous up to the size configured for that VM. An OS uses physical addresses in operations such as page table updates and Direct Memory Access (DMA). In reality, the memory exported to a VM may not start from PFN 0 and may not be contiguous. The VMM virtualizes physical addresses by creating another layer in the addressing scheme, the machine address (MA). Within a GOS, a virtual address (VA) is used by applications, and a physical address (PA) is used by the OS in DMA and page tables. The VMM maps a PA from a VM to an MA, which is the address used on the hardware. The VMM maintains translation tables, one for each VM, for mapping PAs to MAs. Figure 3 depicts the scheme that partitions machine memory into physical memory for each VM.



Figure 3. Example physical-to-machine memory mapping.

A ballooning technique [5] has been used in some virtualization products to achieve better utilization of physical memory among VMs. The idea behind the ballooning technique is simple: the VMM controls a balloon module in the GOS. When the VMM wants to reclaim memory, it inflates the balloon to increase pressure on memory, forcing the GOS to page out memory to disk. If the demand for physical memory decreases, the VMM deflates the balloon in a VM, enabling the GOS to claim more memory.
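The balloon mechanism can be sketched as a simple worker loop inside the GOS. The helper functions below are hypothetical and greatly simplified; Sun xVM Server and VMware each define their own balloon interfaces.

    /*
     * Sketch of a memory balloon driver inside a GOS.  All helper
     * functions are hypothetical placeholders.
     */
    #include <stddef.h>
    #include <stdint.h>

    extern size_t   balloon_target_pages(void);    /* target set by the VMM     */
    extern uint64_t guest_page_alloc(void);        /* take a page from the GOS  */
    extern void     guest_page_free(uint64_t pfn); /* return a page to the GOS  */
    extern void     vmm_give_page(uint64_t pfn);   /* hand the page to the VMM  */
    extern uint64_t vmm_take_page(void);           /* get a page back from VMM  */

    static size_t balloon_current_pages;           /* pages currently held      */

    static void balloon_worker(void)
    {
        size_t target = balloon_target_pages();

        /* Inflate: allocate guest pages and give them to the VMM, which can
         * then assign the underlying machine memory to another VM.  The
         * allocation pressure may force the GOS to page out to disk. */
        while (balloon_current_pages < target) {
            vmm_give_page(guest_page_alloc());
            balloon_current_pages++;
        }

        /* Deflate: reclaim machine memory from the VMM and release the
         * corresponding pages back to the guest's free list. */
        while (balloon_current_pages > target) {
            guest_page_free(vmm_take_page());
            balloon_current_pages--;
        }
    }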


Page Translations Virtualization

Access to the processor's page translation hardware is a privileged operation, and this operation is performed by the privileged VMM. Exactly what the VMM needs to do depends on the processor architecture. For example, x86 hardware automatically loads translations from the page table into the Translation Lookaside Buffer (TLB); software has no control over the loading of page translations into the TLB, so the VMM is responsible for updating the page table that is seen by the hardware. The SPARC processor uses software, through traps, to load page translations into the TLB; a GOS maintains its page tables in its own memory, and the VMM gets page translations from the VM and loads them into the TLB. VMMs typically use one of the following two methods to support page translations:

- Hypervisor calls. The GOS makes a call to the VMM for page translation operations. This method is commonly used by paravirtualized OSes, as it provides better performance.
- Shadow page table. The VMM maintains an independent copy of the page tables, called shadow page tables, separate from the guest page tables. When a page fault occurs, the VMM propagates changes made to the GOS's page table into the shadow page table. This method is commonly used by VMMs that support full virtualization, as the GOS continues to update its own page table and the VMM handles the synchronization of the guest page table and the shadow page table when page faults occur.

Figure 4 shows three different page translation implementations in the Solaris OS on x86 and SPARC platforms:

1. The paravirtualized Sun xVM Server uses the following approach on x86 platforms: [1] The GOS uses the hypervisor call method to update the page tables maintained by the VMM.
2. Sun xVM Server with HVM and VMware use the following approach: [2a] The GOS maintains its own guest page table, and the VMM synchronizes the guest page table with the hardware page table (the shadow page table) when page faults occur. [2b] The x86 CPU loads the page translation from the hardware page table into the TLB.
3. On SPARC systems, the Solaris OS uses the following approach for Logical Domains: [3a] The GOS maintains its own page table and passes an entry from the page table as an argument to the hypervisor call that loads translations into the TLB. [3b] The VMM gets the page translation from the GOS and loads the translation into the TLB.

Figure 4. Page translation paths on the x86 and SPARC architectures.
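The hypervisor call path can be sketched from the VMM's side as a translation of guest physical frame numbers to machine frame numbers before a mapping is installed. The structure layout and function names below are hypothetical simplifications, not an actual xVM or LDoms interface.

    /*
     * Sketch of the VMM side of a paravirtualized page-table update.
     * The guest passes a page-table entry built with guest physical frame
     * numbers (PFNs); the VMM rewrites it with machine frame numbers (MFNs)
     * before installing it.  All names here are hypothetical.
     */
    #include <stdint.h>

    #define PTE_PFN_SHIFT  12
    #define PTE_FLAGS_MASK 0xfffULL

    struct vm {
        uint64_t *p2m;        /* per-VM physical-to-machine frame table   */
        uint64_t  nr_pages;   /* number of physical pages given to the VM */
    };

    /* Translate a guest physical frame number to a machine frame number. */
    static int64_t p2m_lookup(const struct vm *vm, uint64_t pfn)
    {
        if (pfn >= vm->nr_pages)
            return -1;                     /* PFN outside this VM's memory */
        return (int64_t)vm->p2m[pfn];
    }

    /* Handle a hypothetical "update PTE" hypercall from a paravirtualized
     * guest: validate the mapping and install the machine address. */
    static int hcall_update_pte(struct vm *vm, uint64_t *hw_pte,
                                uint64_t guest_pte)
    {
        uint64_t pfn   = guest_pte >> PTE_PFN_SHIFT;
        uint64_t flags = guest_pte & PTE_FLAGS_MASK;
        int64_t  mfn   = p2m_lookup(vm, pfn);

        if (mfn < 0)
            return -1;                     /* reject: not this VM's memory */

        *hw_pte = ((uint64_t)mfn << PTE_PFN_SHIFT) | flags;
        return 0;
    }

A shadow page table implementation performs essentially the same frame-number substitution, but does so lazily, when a page fault reveals that the guest's table and the hardware table have diverged.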


The memory management implementations for Sun xVM Server, Sun xVM Server with HVM, VMware, and Logical Domains using these mechanisms are discussed in detail in later sections of this paper.


I/O Virtualization

I/O devices are typically managed by a special software module, the device driver, running in the kernel context. Because of the wide variety of device types and device drivers, the VMM either includes few device drivers or leaves device management entirely to the GOS. In the latter case, because of existing device architecture limitations (discussed later in this section), a device can be managed exclusively by only one VM. This constraint creates challenges for I/O access by a VM and limits the following:

- What devices are exported to a VM
- How devices are exported to a VM
- How each I/O transaction is handled by a VM and the VMM

Consequently, I/O presents the greatest challenges in the areas of compatibility and performance for virtual machines. In order to explain what devices are exported and how they are exported, it is first necessary to understand the options available for handling I/O transactions in a VM. There are, in general, three approaches to I/O virtualization, as illustrated in Figure 5:

- Direct I/O (VM1 and VM3)
- Virtual I/O using I/O transaction emulation (VM2)
- Virtual I/O using device emulation (VM4)


Figure 5. Different I/O virtualization techniques used by virtual machine monitors.

For direct I/O, the VMM exports all or a portion of the physical devices attached to the system to a VM and relies on the VMs to manage those devices.


The VM that has direct I/O access uses the existing driver in the GOS to communicate directly with the device. VM1 and VM3 in Figure 5 have direct I/O access to devices. VM1 is also a special I/O VM that provides virtual I/O for other VMs, such as VM2, to access devices.

Virtual I/O is made possible by controlling the device types exported to a VM. There are two different methods of implementing virtual I/O: I/O transaction emulation (shown in VM2 in Figure 5) and device emulation (shown in VM4).

I/O transaction emulation requires virtual drivers on both ends for each type of I/O transaction (data and control functions). As shown in Figure 5, the virtual driver on the client side (VM2) receives I/O requests from applications and forwards the requests through the VMM to the virtual driver on the server side (VM1); the virtual driver on the server side then sends the request to the device. I/O transaction emulation is typically used in paravirtualization, because the OS on the client side needs to include special drivers that communicate with their corresponding drivers in the OS on the server side, and it needs to add kernel interfaces for inter-domain communication using the VMM services. However, it is possible to have PV drivers in an unparavirtualized OS (full virtualization) for better I/O performance. For example, Solaris 10, which is not paravirtualized, can include PV drivers on an HVM-capable system to get better performance than that achieved using device emulation drivers such as QEMU. (See "Sun xVM Server with HVM I/O Virtualization (QEMU)" on page 71.) I/O transaction emulation may cause application compatibility issues if the virtual driver does not provide all of the data and control functions (for example, ioctl(2)) that the existing driver does.

Device emulation provides an emulation of a device type, enabling the existing driver for the emulated device in a GOS to be used. The VMM exports emulated device nodes to a VM so that the existing drivers for the emulated devices in a GOS are used. By doing this, the VMM controls the driver used by a GOS for a particular device type (for example, using the e1000g driver for all network devices), and the VMM can focus on the emulation of the underlying hardware using one driver interface. Driver accesses to I/O registers and ports in a GOS, which result in traps due to invalid addresses, are caught and converted into accesses to the real device hardware. VM4 in Figure 5 uses native OS drivers to access emulated devices exported by the VMM. Device emulation is, in general, less efficient and more limited in the platforms it supports than I/O transaction emulation. Device emulation does not require changes in the GOS and, therefore, is typically used to provide full virtualization to a VM.
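The I/O transaction emulation path described above, with a virtual driver on each side of the I/O VM, can be sketched as a pair of cooperating drivers sharing a request ring. The ring layout, request format, and notification call below are hypothetical and much simpler than a real front-end/back-end driver protocol (which also needs memory barriers, response handling, and grant-style buffer sharing).

    /*
     * Sketch of I/O transaction emulation with a shared request ring.
     * All structures and helpers here are hypothetical simplifications.
     */
    #include <stdint.h>

    #define RING_SIZE 64

    struct io_req {
        uint64_t guest_buf_pfn;   /* guest page holding the data buffer */
        uint32_t length;          /* transfer length in bytes           */
        uint32_t write;           /* 1 = write to device, 0 = read      */
    };

    struct io_ring {              /* memory shared between the two VMs  */
        volatile uint32_t prod;   /* advanced by the front-end (client) */
        volatile uint32_t cons;   /* advanced by the back-end (I/O VM)  */
        struct io_req     req[RING_SIZE];
    };

    extern void notify_backend(void);                   /* e.g., a VMM event channel  */
    extern void device_submit(const struct io_req *r);  /* real driver in the I/O VM  */

    /* Front-end (in the client VM): queue a request and notify the I/O VM. */
    static int frontend_submit(struct io_ring *ring, const struct io_req *r)
    {
        if (ring->prod - ring->cons >= RING_SIZE)
            return -1;                                  /* ring is full */
        ring->req[ring->prod % RING_SIZE] = *r;
        ring->prod++;
        notify_backend();
        return 0;
    }

    /* Back-end (in the I/O VM): drain the ring and issue real device I/O. */
    static void backend_poll(struct io_ring *ring)
    {
        while (ring->cons != ring->prod) {
            device_submit(&ring->req[ring->cons % RING_SIZE]);
            ring->cons++;
        }
    }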


Virtual I/O, unlike direct I/O, requires additional drivers in either the I/O VM or the VMM to provide I/O virtualization. This constraint:

- Limits the types of devices that are made available to a VM
- Limits device functionality
- Causes significant I/O performance overhead

While virtualization provides full application binary compatibility, I/O becomes a trouble area in terms of application compatibility and performance in a VM. One solution to the I/O virtualization issues is to allow VMs to access I/O directly, as shown by VM3 in Figure 5. Direct I/O access by VMs requires additional hardware support to ensure that device accesses by a VM are isolated and restricted to the resources owned by the assigned VM.

In order to understand the industry effort to allow an I/O device to be shared among VMs, it is necessary to examine device operations from an OS point of view. The interactions between an OS and a device consist, in general, of three operations:

1. Programmed I/O (PIO): host-initiated data transfer. In PIO, a host OS maps a virtual address to a piece of device memory and accesses the device memory using CPU load/store instructions.
2. Direct Memory Access (DMA): device-initiated data transfer without CPU involvement. In DMA, a host OS writes an address of its memory and the transfer size to a device's DMA descriptor. After receiving an enable-DMA instruction from the host driver, the device performs the data transfer at a time it chooses and uses interrupts to notify the host OS of DMA completion.
3. Interrupts: device-generated asynchronous event notifications. Interrupts are already virtualized by all VMM implementations, as shown in the later discussions of Sun xVM Server, Logical Domains, and VMware.

The challenge of I/O sharing among VMs therefore lies in the device handling for PIO and DMA. To meet these challenges, the PCI SIG has released a suite of IOV specifications for PCI Express (PCIe) devices, in particular the "Single Root I/O Virtualization and Sharing Specification" (SRIOV) [35] for device sharing and PIO operation, and the "Address Translation Services" (ATS) specification [30] for DMA operation.

Device Configuration and PIO

A PCI device exports its memory to the host through Base Address Registers (BARs) in its configuration space. A device's configuration space is identified in the PCI configuration address space as shown in Figure 6.



Figure 6. PCI configuration address space.

A PCI device can have up to 8 physical functions (PFs). Each PF has its own 256-byte configuration header. The BARs of a PCI function, which are 32 bits wide, are located at offsets 0x10-0x24 in the configuration header. The host gets the size of the memory region mapped by a BAR by writing a value of all 1s to the BAR and then reading the value back. The address written to a BAR is the assigned starting address of the memory region mapped by the BAR.
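The BAR sizing sequence just described can be sketched as follows. The config_read32() and config_write32() helpers stand in for whatever configuration-space access mechanism the platform provides, and the sketch assumes a 32-bit memory BAR; both are assumptions of this illustration.

    /*
     * Sketch of PCI BAR sizing: write all 1s to the BAR, read it back,
     * and derive the region size from the bits that remain clear.
     */
    #include <stdint.h>

    extern uint32_t config_read32(int bus, int dev, int fn, int off);
    extern void     config_write32(int bus, int dev, int fn, int off, uint32_t val);

    #define BAR_MEM_MASK 0xfffffff0u     /* low 4 bits of a memory BAR are flags */

    static uint32_t bar_size(int bus, int dev, int fn, int bar_off)
    {
        uint32_t orig = config_read32(bus, dev, fn, bar_off);

        config_write32(bus, dev, fn, bar_off, 0xffffffffu);   /* write all 1s    */
        uint32_t probe = config_read32(bus, dev, fn, bar_off);
        config_write32(bus, dev, fn, bar_off, orig);          /* restore the BAR */

        /* Bits the device refuses to set indicate the size of the region. */
        return ~(probe & BAR_MEM_MASK) + 1;
    }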


To allow multiple VMs to share a PF, the SRIOV specification introduces the notion of a Virtual Function (VF). Each VF shares some common configuration header fields with the PF and the other VFs. The VF BARs are defined in the PCIe SRIOV extended capabilities structure. A VF contains a set of non-shared physical resources, such as a work queue and data buffers, which are required to deliver function-specific services. These resources are exported through the VF BARs and are directly accessible by a VM. The starting address of a VF's memory space is derived from the first VF's memory space address and the size of the VF BAR. For any given VFx, the starting address of the memory space mapped by BARa is calculated according to the following formula:

addr(VFx, BARa) = addr(VF1, BARa) + (x - 1) * (VF BARa aperture size)

where addr(VF1, BARa) is the starting address of BARa for the first VF, and the VF BARa aperture size is determined by writing a value of all 1s to BARa and reading the value back. Using this mechanism, a GOS in a VM is able to share the device with other VMs while performing device operations that pertain only to that VM.

DMA

In many current implementations (especially on most x86 platforms), physical addresses are used in DMA. Since a VM shares the same physical address space on the system with other VMs, a VM might read or write another VM's memory through DMA. For example, a device driver in a VM might write memory contents that belong to other VMs to a disk and read the data back into the VM's memory. This creates a potential breach in security and fault isolation among VMs. To provide isolation during DMA operations, the ATS specification defines a scheme for a VM to use addresses mapped to its own physical memory for DMA operations. (Similar designs include the IOMMU Specification [31] and DMA Remapping [28].) This DMA address translation enables DMA memory to be partitioned into multiple domains and keeps DMA transactions in one domain isolated from other domains. Figure 7 shows device DMA with and without ATS. With DMA ATS, the DMA address is like a virtual address that is associated with a context (a VM): DMA transactions initiated by a VM can only be associated with the memory owned by that VM. DMA ATS is a chipset function that resides outside of the processor.
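Conceptually, the protection that this address translation provides can be modeled as a per-domain lookup from device-visible addresses to host physical addresses, as sketched below. The table layout and function names are hypothetical and far simpler than a real IOMMU; they are meant only to illustrate why one VM's DMA cannot reach another VM's memory.

    /*
     * Sketch of per-domain DMA address translation (IOMMU-style).
     * Each VM (domain) has its own table mapping device-visible page
     * numbers (DVA/GPA) to host physical page numbers (HPA).
     */
    #include <stdint.h>

    #define DMA_NO_MAPPING ((uint64_t)-1)

    struct dma_domain {
        uint64_t *dva_to_hpa;   /* indexed by device-visible page number    */
        uint64_t  nr_pages;     /* size of this domain's DMA address space  */
    };

    /* Consulted (conceptually) for every inbound DMA access. */
    static uint64_t iommu_translate(const struct dma_domain *dom,
                                    uint64_t dva_pfn)
    {
        if (dva_pfn >= dom->nr_pages)
            return DMA_NO_MAPPING;            /* fault: outside the domain  */
        return dom->dva_to_hpa[dva_pfn];      /* host physical page number  */
    }

    /* Called by the GOS driver path when it sets up a DMA buffer. */
    static void iommu_map(struct dma_domain *dom,
                          uint64_t dva_pfn, uint64_t hpa_pfn)
    {
        if (dva_pfn < dom->nr_pages)
            dom->dva_to_hpa[dva_pfn] = hpa_pfn;
    }

Because each domain's table contains mappings only for that VM's memory, a DMA issued with another VM's addresses simply has no valid translation.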


Figure 7. DMA with and without address translation service (ATS). PA: physical address; HPA: host physical address; DVA: device virtual address; GPA: guest physical address.

As shown in Figure 7, the physical address (PA) is used on a hardware platform without hardware support for ATS. For platforms with hardware support for ATS, a GOS in a VM writes either a device virtual address (DVA) or a guest physical address (GPA) to the device's DMA engine. The device driver in the GOS loads the mappings of the DVA or GPA to the host physical address (HPA) into the hardware IOMMU. The HPA is the address understood by the memory controller.

Note: The distinction between the HPA and GPA is described in detail in later sections for Sun xVM Server (see "Physical Memory Management" on page 52), for UltraSPARC LDoms (see "Physical Memory Allocation" on page 88), and for VMware (see "Physical Memory Management" on page 103).

When the device performs a DMA operation, a DVA/GPA address appears on the PCI bus and is intercepted by the hardware IOMMU. The hardware IOMMU looks up the mapping for the DVA/GPA, finds the corresponding HPA, and moves the PCI data to the system memory pointed to by the HPA. Since the DVA or GPA of each VM is in its own address space, ATS allows system memory for DMA to be partitioned and thus prevents a VM from accessing another VM's DMA buffer.

Chapter 3 The x86 Processor Architecture

This chapter provides background information on the x86 processor architecture that is relevant to later discussions of Sun xVM Server (Chapter 5 on page 39), Sun xVM Server with HVM (Chapter 6 on page 63), and VMware (Chapter 8 on page 97).


The x86 processor was not designed to run in a virtualized environment, and the x86 architecture presents some challenges for CPU and memory virtualization. This chapter discusses the following x86 architecture features that are pertinent to virtualization:

- Protected Mode. The protected mode of the x86 processor uses two mechanisms, segmentation and paging, to prevent a program from accessing a segment or a page with a higher privilege level. Privilege level controls how the VMM and a GOS work together to provide CPU virtualization.
- Segmented Architecture. The x86 segmented architecture converts a program's virtual addresses into linear addresses that are used by the paging mechanism to map into physical memory. During the conversion, the processor's privilege level is checked against the privilege level of the segment for the address. Because of the segment cache technique employed by the x86 processor, the VMM must keep the segment cache consistent with the VM's descriptor table updates. This x86 feature results in a significant amount of work for the VMM in full virtualization products such as VMware.
- Paging Architecture. The x86 paging architecture provides page translations through the TLB and page tables. Because the loading of page translations from the page table into the TLB is done automatically by hardware on the x86 platform, page table updates have to be performed by the privileged VMM. Several mechanisms are available for updating this "hardware" page table on behalf of a VM.
- I/O and Interrupts. A device interacts with a host processor through PIO, DMA, and interrupts. PIO in the x86 processor can be performed either through I/O ports, using special I/O instructions, or through memory-mapped addresses, using general purpose MOV and string instructions. DMA on most x86 platforms is performed with physical addresses. This can cause a security and isolation breach in a virtualized environment, because a VM may read or write other VMs' memory contents. Interrupts and exceptions are handled through the Interrupt Descriptor Table (IDT). There is only one IDT on the system and access to the IDT is privileged; therefore, interrupts have to be handled by the VMM and virtualized to be delivered to a VM.
- Timer Devices. The x86 platform includes several timer devices for timekeeping purposes. Knowledge of the characteristics of these devices is important to fully understand timekeeping in a VM: some timer devices are interrupt driven (and interrupt delivery is virtualized and delayed), and some require privileged access to update the device counter.

Protected Mode

The x86 architecture's protected mode provides a protection mechanism to limit access to certain segments or pages and prevent unprivileged access. The processor's segment-protection mechanism recognizes four privilege levels, numbered from 0 to 3 (Figure 8). The greater the level number, the lesser the privileges.


The page-level protection mechanism restricts access to pages based on two privilege levels: supervisor mode and user mode. If the processor is operating at current privilege level (CPL) 0, 1, or 2, it is in supervisor mode and can access all pages. If the processor is operating at CPL 3, it is in user mode and can access only user-level pages.

Figure 8. Privilege levels in the x86 architecture (ring 0 holds the OS kernel; ring 3 holds applications).

When the processor detects a privilege level violation, it generates a general-protection exception (#GP). The x86 has more than 20 privileged instructions. These instructions can be executed only when the current privilege level (CPL) is 0 (most privileged). In addition to the CPL, the x86 has an I/O privilege level (IOPL) field in the EFLAGS register that indicates the I/O privilege level of the currently running program. Some instructions, while allowed to execute when the CPL is not 0, generate a #GP exception if the CPL value is higher than the IOPL. These instructions include CLI (clear interrupt flag), STI (set interrupt flag), IN/INS (input from port), and OUT/OUTS (output to port).

In addition to the above instructions, there are many instructions [3] that, while not privileged, reference registers or memory locations that would allow a VM to access a memory region not assigned to that VM. These sensitive instructions do not cause a #GP exception, so the trap-and-emulate method for virtualizing a GOS, described in "VMM Requirements" on page 9, does not apply to them. However, these instructions may impact other VMs.

Segmented Architecture

In protected mode, all memory accesses go through a logical address → linear address (LA) → physical address (PA) translation scheme. The logical address to LA translation is managed by the x86 segmentation architecture, which divides a process's address space into multiple protected segments. A logical address, which is used as the address of an operand or of an instruction, consists of a 16-bit segment selector and a 32-bit offset. A segment selector points to a segment descriptor that defines the segment (see Figure 11 on page 24). The segment base address is contained in the segment descriptor, and the sum of the offset in a logical address and the segment base address gives the LA. The Solaris OS directly maps an LA to a process's virtual address (VA) by setting the segment base address to NULL:

Segmentation: VA + segment base address (always 0 in Solaris) → linear address
Paging: linear address → physical address

For each memory reference, a VA and a segment selector are provided to the processor (Figure 9). The segment selector,


which is loaded to the segment register, is used to identify a segment descriptor for the address.


Figure 9. Segment selector. Index: up to 8K descriptors (bits 3-15); TI: table indicator (0 = GDT, 1 = LDT); RPL: requested privilege level.

Every segment register has a visible part and a hidden part, as illustrated in Figure 10 (see also [7], Volume 3A, Section 3.4.3). The visible part is the segment selector, an index that points into either the global descriptor table (GDT) or the local descriptor table (LDT) to identify the descriptor from which the hidden part of the segment register is to be loaded. The hidden part holds the segment descriptor information loaded from the descriptor table.

Figure 10. Each segment register has a visible and a hidden part.

The hidden fields of a segment register are loaded into the processor from a descriptor table and are stored in the descriptor cache registers. The descriptor cache registers, like the TLB, allow the processor to refer to the contents of the segment register's hidden part without further reference to the descriptor table. Each time a segment register is loaded, the descriptor cache register is fully loaded from the descriptor table. Since each VM has its own descriptor tables (for example, its own GDT), the VMM has to maintain a shadow copy of each VM's descriptor tables. A context switch to a VM causes the VM's shadow descriptor table to be loaded into the hardware descriptor table. If the content of the descriptor table is changed by the VMM because of a context switch to another VM, the segment is non-reversible, which means the segment cannot be restored if an event such as a trap causes the segment to be saved and replaced.

The current privilege level (CPL) is stored in the hidden portion of the segment register. The CPL is initially equal to the privilege level of the code segment from which it is loaded; the processor changes the CPL when program control is transferred to a code segment with a different privilege level.

The segment descriptor contains the size, location, access control, and status information of the segment and is stored in either the LDT or the GDT. The OS sets segment descriptors in the descriptor table and controls which descriptor entry to use for a segment (Figure 11). See "CPU Privilege Mode" on page 45 for a discussion of setting the segment descriptor in the Solaris OS.


Figure 11. Segment descriptor fields, including the segment base address, segment limit, type, descriptor type (S), descriptor privilege level (DPL), segment present (P), AVL, L, D/B, and granularity (G) flags.

The privilege check performed by the processor recognizes three types of privilege levels: the requested privilege level (RPL), the current privilege level (CPL), and the descriptor privilege level (DPL). A segment can be loaded if the DPL of the segment is numerically greater than or equal to both the CPL and the RPL; in other words, a segment can be accessed only by code that has an equal or higher privilege level. Otherwise, a general-protection exception (#GP) is generated and the segment register is not loaded.

On 64-bit systems, a flat linear address space is used to create a continuous, unsegmented address space for both the kernel and application programs. Segmentation is disabled in the sense that privilege checking cannot be applied to VA to LA translations, because no such translation exists. The only protection left to prevent a user application from accessing kernel memory is the page protection mechanism. This is why the kernel of a GOS has to run in ring 3 (user mode for page-level protection) on a 64-bit system.
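The descriptor privilege check described above reduces to a small comparison. The sketch below is illustrative only; real hardware applies additional rules for conforming code segments, stack loads, and call gates that are omitted here.

    /*
     * Sketch of the x86 segment privilege check: a segment may be loaded
     * only if its DPL is numerically greater than or equal to both the
     * CPL and the RPL (larger number = less privilege; ring 3 is least
     * privileged).
     */
    static int segment_load_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
    {
        return (dpl >= cpl) && (dpl >= rpl);   /* 1 = allowed, 0 = #GP */
    }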


Paging Architecture

When operating in protected mode, the LA → PA translation is performed by the paging hardware of the x86 processor. To access data in memory, the processor requires the presence of a VA → PA translation in the TLB (in Solaris, the LA is equal to the VA), a page table backing up the TLB entry, and a page of physical memory. On the x86 processor, loading a VA → PA page translation from the page table into the TLB is performed automatically by the processor; the OS is responsible for allocating physical memory and loading the VA → PA translation into the page table. When the processor cannot load a translation from the page table, it generates a page fault exception (#PF). A #PF exception on x86 processors usually means a physical page has not been allocated, because the loading of the translation from the page table into the TLB is handled by the processor (Figure 12).

Figure 12. Translations through the TLB are accomplished in the processor itself, while translations through page tables are performed by the OS.

The x86 processor uses a control register, %cr3, to manage the loading of address translations from the page table into the TLB. The base address of a process's page table is kept by the OS and loaded into %cr3 when the process is context-switched in to run. On the Solaris OS, the %cr3 value is kept in the kernel hat structure; each address space (as) has one hat structure. The mdb(1) command can be used to find the value of the %cr3 register for a process:

% mdb -k
> ::ps
S    PID   PPID  PGID   SID   UID      FLAGS             ADDR NAME
R   9352   9351  9352  9352 28155 0x4a014000 fffffffec2ae78c0 bash
> fffffffec2ae78c0::print -t 'struct proc' ! grep p_as
    struct as *p_as = 0xfffffffed15ba7e0
> 0xfffffffed15ba7e0::print -t 'struct as' ! grep a_hat
    struct hat *a_hat = 0xfffffffed1718e98
> 0xfffffffed1718e98::print -t 'struct hat' ! grep hat_htable
    htable_t *hat_htable = 0xfffffffed0f67678
> 0xfffffffed0f67678::print -t 'struct htable' ! grep ht_pfn
    pfn_t ht_pfn = 0x16d37      // page frame number loaded into %cr3

When multiple VMs are running, the automatic loading of page translations from the page table into the TLB actually makes virtualization more difficult, because all page tables have to be accessible by the processor. As a result, page table updates can only be performed by the VMM in order to enforce consistent memory usage on the system. "Page Translations Virtualization" on page 14 discusses two mechanisms for managing page tables by the VMM.

Another issue with the x86 paging architecture is related to the flushing of TLB entries. Unlike many RISC processors, which support a tagged TLB, the x86 TLB is not tagged. A TLB miss results in a walk of the page table by the processor to find and load the translation into the TLB. Since the TLB is not tagged, a change in the %cr3 register due to a virtual memory context switch invalidates all TLB entries. This adversely affects performance if the VMM and the VM are not in the same address space. A typical solution to the performance impact of TLB flushing is to reserve a region of the VM's address space for the VMM. With this solution, the VMM and the VM run in the same address space and thus avoid a TLB flush when a VM memory operation traps to the VMM. The latest CPUs from Intel and AMD with hardware virtualization support include tagged TLBs, so translations for different address spaces can coexist in the TLB.

I/O and Interrupts

In general, x86 support for exceptions and I/O interrupts does not impose any particular challenge on the implementation of a VMM.


I/O and Interrupts
In general, x86 support for exceptions and I/O interrupts does not impose any particular challenge to the implementation of a VMM. The x86 processor uses the interrupt descriptor table (IDT) to provide a handler for a particular interrupt or exception. Access to the IDT functions is privileged and, therefore, can only be performed by the VMM. The Sun xVM Hypervisor for x86 provides a mechanism to relay hardware interrupts to a VM through its event channel hypervisor calls (see "Event Channels" on page 43).
The x86 processor allows device memory and registers to be accessed through either an I/O address space or memory-mapped I/O. An I/O address space access is performed using special I/O instructions such as IN and OUT. These instructions, while allowed to execute when the CPL is not 0, result in a #GP exception if the processor's CPL value is numerically higher than the I/O privilege level (IOPL). The Sun xVM Hypervisor for x86 provides a hypervisor call to set the IOPL, enabling a GOS to directly access I/O ports by setting the IOPL to its privilege level. When using memory-mapped I/O, any of the processor's instructions that reference memory can be used to access an I/O location, with protection provided through segmentation and paging. PIO, whether it uses the I/O address space or memory-mapped I/O, is normally uncacheable, as device registers are usually accessed in precise program order. PIO uses addresses in a VM's address space and does not cause any security or isolation issues.
The x86 processor uses physical addresses for DMA. DMA in a virtualized x86 system has certain issues:
A 32-bit, non-dual-address-cycle (DAC) PCI device cannot address beyond 4 GB of memory.
It is possible for one domain's DMA to intrude into another domain's physical memory, creating the risk of a security violation.
The solution to the above issues is to have an I/O memory management unit (IOMMU) as part of an I/O bridge or north bridge that performs a translation of I/O addresses (for example, an address that appears on the PCI bus) to machine memory addresses. The I/O address can be any address that is recognized by the IOMMU. An IOMMU can also improve the performance of large data transfers by mapping a contiguous I/O address range to multiple physical pages in one DMA transaction. However, the IOMMU may hurt I/O performance for small data transfers, because the DMA setup cost is higher than that of DMA without an IOMMU. For more details on the IOMMU, also known as hardware address translation service (hardware ATS), see "I/O Virtualization" on page 16.
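Before moving on to timer devices, the following minimal sketch illustrates the port-mapped I/O path described above, using the IN and OUT instructions from C via inline assembly. The port number is hypothetical and the helpers are not taken from any Solaris or xVM source; they only show the kind of access that is gated by the CPL/IOPL check.

    /*
     * Sketch of programmed I/O through the x86 I/O address space.  If the
     * executing code's CPL is numerically greater than the IOPL, these
     * instructions raise #GP -- which is what allows a VMM (or the xVM
     * hypervisor's IOPL hypercall) to control direct port access.
     */
    #include <stdint.h>

    static inline void outb(uint16_t port, uint8_t value)
    {
        __asm__ volatile("outb %0, %1" : : "a"(value), "Nd"(port));
    }

    static inline uint8_t inb(uint16_t port)
    {
        uint8_t value;
        __asm__ volatile("inb %1, %0" : "=a"(value) : "Nd"(port));
        return value;
    }

    void poke_example_device(void)
    {
        outb(0x1234, 0x0A);    /* hypothetical device register write */
        (void)inb(0x1234);     /* hypothetical device register read  */
    }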


Timer Devices
An OS typically uses several timer devices for different purposes. Timer devices are characterized by their frequency granularity, frequency reliability, and ability to generate interrupts and receive counter input. Understanding the characteristics of timer devices is important for the discussion of timekeeping in a virtualized environment, as the VMM provides virtualized timekeeping of some timers to the VMs above it. Virtualized timekeeping has a significant impact on the accuracy of time-related functions in the GOS and, thus, on the performance and results of time-sensitive applications.


An x86 system typically includes the following timer devices:
Programmable Interrupt Timer (PIT)
The PIT uses a 1.193182 MHz crystal oscillator and has a 16-bit counter and counter input register. The PIT contains three timers. Timer 0 can generate interrupts and is used by the Solaris OS as the system timer. Timer 1 was historically used for RAM refresh and Timer 2 for the PC speaker.
Time Stamp Counter (TSC)
The TSC is a feature of the x86 architecture that is accessed via the RDTSC instruction. The TSC, a 64-bit counter, advances at a rate tied to the processor speed. The TSC cannot generate interrupts and has no counter input register. The TSC is the finest grained of all timers and is used in the Solaris OS as the high-resolution timer. For example, the gethrtime(3C) function uses the TSC to return the current high-resolution real time.
Real Time Clock (RTC)
The RTC is used as the time-of-day (TOD) clock in the Solaris OS. The RTC uses a battery as an alternate power source, enabling it to continue to keep time while the primary source of power is not available. The RTC can generate interrupts and has a counter input register. It is the lowest grained timer on the system.
Local Advanced Programmable Interrupt Controller (APIC) Timer
The local APIC timer, which is a part of the local APIC, has a 32-bit counter and counter input register. It can generate interrupts and runs at the frequency of the front-side bus. The Solaris OS supports the use of the local APIC timer as one of the cyclic timers.
High Precision Event Timer (HPET)
The HPET is a relatively new timer available in some newer x86 systems, intended to replace the PIT and the RTC for generating periodic interrupts. The HPET can generate interrupts, is 64 bits wide, and has a counter input register. The Solaris OS currently does not use the HPET.
Advanced Configuration and Power Interface (ACPI) Timer
The ACPI timer has a 24-bit counter, can generate interrupts, and has no counter input register. The Solaris OS does not use the ACPI timer.
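As a concrete illustration of how the TSC described above is read, the following sketch uses the RDTSC instruction directly. It is illustrative user-level code, not the gethrtime(3C) implementation; converting tick counts to wall-clock time requires knowing the TSC frequency, which is platform specific and is one of the quantities that makes virtualized timekeeping delicate.

    /*
     * Minimal sketch of reading the Time Stamp Counter with RDTSC.
     */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        /* RDTSC returns the 64-bit counter in EDX:EAX. */
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t start = rdtsc();
        /* ... work being timed ... */
        uint64_t end = rdtsc();
        printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
        return 0;
    }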


Chapter 4 SPARC Processor Architecture
This chapter provides background information on the SPARC processor architecture that is relevant to later discussions on Logical Domains (Chapter 7 on page 79). The SPARC (Scalable Processor Architecture) processor, first introduced in 1987, is a big-endian RISC processor ISA. SPARC International (SI), an industry organization, was established in 1989 to promote the open SPARC architecture. In 1994, SI introduced a 64-bit version of the SPARC processor as SPARC V9. The UltraSPARC processor, which is a Sun-specific implementation of SPARC V9, was introduced in 1996 and has been incorporated into all Sun SPARC platforms shipping today.
In 2005, Sun's UltraSPARC architecture was open sourced as the UltraSPARC Architecture 2005 Specification [2]. Included in this enhanced UltraSPARC 2005 specification is support for Chip-level Multithreading (CMT) for a highly threaded processor architecture, and a hyperprivileged mode that allows the hypervisor to virtualize the processor to run multiple domains. The design of the UltraSPARC T1 processor, which is the first implementation of the UltraSPARC Architecture 2005 Specification, is also open sourced. The UltraSPARC T1 processor includes 8 cores with 4 strands in each core, providing a total of 32 strands per processor. In August 2007 Sun announced the UltraSPARC T2 processor, the follow-on CMT processor to the UltraSPARC T1, and the OpenSPARC T2 architecture [33], which is the open source version of the UltraSPARC T2 processor. Sun also released the UltraSPARC Architecture 2007 specification [34], which adds a section on error handling and expands the discussion of memory management. The UltraSPARC T2 processor has several enhancements over the UltraSPARC T1 processor, including 64 strands, per-core floating-point and graphics units, and integrated PCIe and 10 Gb Ethernet (for more details see "Processor Components" on page 31).
The remainder of this chapter discusses the following features of the UltraSPARC T1/T2 processor architecture, and describes their effect on virtualization implementations:
Processor privilege mode
The UltraSPARC Architecture 2005 specification defines a hyperprivileged mode for hypervisor operations.
Sun4v Chip Multithreaded architecture
This feature enables the creation of up to 32 domains, each with its own dedicated strands, on an UltraSPARC T1 processor, and up to 64 domains on an UltraSPARC T2 processor.


Address Space Identifier (ASI)
The ASI provides functionality to control access to a range of address spaces, similar to the segmentation used by x86 processors.
Memory Management Unit (MMU)
The software-controlled MMU allows an efficient redirection of page faults to the intended domain for loading translations.
Trap and interrupt handling
Each strand (virtual processor) has its own trap and interrupt priority registers. This functionality allows the hypervisor to redirect traps to the target CPU and enables the trap to be taken by the GOS's trap handler.
Note - The terms strand, hardware thread, logical processor, virtual CPU, and virtual processor are used by various documents to refer to the same concept. For consistency, the term strand is used in this chapter.
Processor Mode of Operation
The UltraSPARC Architecture 2005 specification defines three privilege modes: non-privileged, privileged, and hyperprivileged. In hyperprivileged mode, the processor can access all registers and address spaces, and can execute all instructions. Instructions, registers, and address spaces for privileged and non-privileged modes are restricted. The processor operates in privileged mode when PSTATE.priv is set to 1 and HPSTATE.hpriv is set to 0. The processor operates in hyperprivileged mode when HPSTATE.hpriv is set to 1 (PSTATE.priv is ignored).
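As a minimal sketch of how these two bits determine the operating mode, the following C fragment encodes the selection rule just described. The enum and struct are illustrative only; on real hardware PSTATE.priv and HPSTATE.hpriv are fields of privileged and hyperprivileged state registers, not members of a C structure.

    typedef enum {
        MODE_NONPRIVILEGED,
        MODE_PRIVILEGED,
        MODE_HYPERPRIVILEGED
    } sparc_mode_t;

    struct strand_state {
        unsigned pstate_priv   : 1;   /* PSTATE.priv   */
        unsigned hpstate_hpriv : 1;   /* HPSTATE.hpriv */
    };

    sparc_mode_t current_mode(const struct strand_state *s)
    {
        if (s->hpstate_hpriv)         /* hyperprivileged: PSTATE.priv ignored */
            return MODE_HYPERPRIVILEGED;
        if (s->pstate_priv)           /* privileged (supervisor) mode */
            return MODE_PRIVILEGED;
        return MODE_NONPRIVILEGED;    /* user mode */
    }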

Table 1 lists the availability of instructions, registers, and address spaces for each of the privilege modes, and includes information on where further details can be found in the UltraSPARC Architecture 2005 Specification [2].
Table 1. Documentation describing the components in the UltraSPARC processor (locations refer to [2]).
Instructions (Table 7-2): All instructions except SIR, RDHPR, and WRHPR (which require hyperprivilege to execute) can be executed from privileged mode.
Registers (Chapter 5): There are seven hyperprivileged registers: HPSTATE, HTSTATE, HINTP, HTBA, HVER, HSTICK_CMPR, and STRAND_STS. These registers are used by the hypervisor in hyperprivileged mode.
Address space (Tables 9-1 and 10-1): ASIs 0x30-0x7F are for hyperprivileged access only. These ASIs are mainly for CMT control, MMU, TLB, and hyperprivileged scratch registers.
Based on the availability of instructions, registers, and the ASI in hyperprivileged mode, the following functions of the hypervisor can be deduced:


Reset the processor: SIR instruction
Control hyperprivileged traps and interrupts: HTSTATE, HTBA, and HINTP registers
Control strand operation: ASI 0x41, and the HSTICK_CMPR and STRAND_STS registers
Manage the MMU: ASIs 0x50-0x5F
Processor Components
The UltraSPARC T1 processor [10] contains eight cores, and each core has hardware support for four strands. One FPU and one L2 cache are shared among all cores in the processor. Each core has its own Level 1 instruction and data cache (L1 Icache and Dcache) and TLB, which are shared among all strands in the core. In addition, each strand contains the following:
A full register file with eight register windows and four sets of global registers (a total of 160 registers: 8 windows x 16 registers per window, plus 4 x 8 global registers)
Most of the ASIs
Ancillary and privileged registers
A trap queue with up to 16 entries
This hardware support in each strand allows the hypervisor to partition the processor into 32 domains, with one strand for each domain. Each strand can execute instructions separately without requiring a software scheduler in the hypervisor to coordinate the processor resources. Table 2 summarizes the association of processor components to their location in the processor, core, and strand.
Table 2. Location of key processor components in the UltraSPARC T1 processor.
Processor: Floating Point Unit; L2 cache crossbar; L2 cache
Core: 6-stage instruction pipeline; L1 Icache and Dcache; TLB
Strand: Register file with 160 registers; Most of ASIs; Ancillary state register (ASR); Trap registers; Privileged registers
The UltraSPARC T2 processor [33] is built upon the UltraSPARC T1 architecture. It has the following enhancements over the UltraSPARC T1 processor:
Eight strands per core (for a total of 64 strands)
Two integer pipelines per core, with each integer pipeline supporting 4 strands
Eight banks of 4 MB L2 cache
A floating-point and graphics unit (FGU) per core
The UltraSPARC T2 has a total of 64 strands in 8 cores, and each core has its own floating-point and graphics unit (FGU). This allows up to 64 domains to be created on the UltraSPARC T2 processor. This design also adds integrated support for industry-standard I/O interfaces such as PCI Express and 10 Gb Ethernet. Table 3 summarizes the association of processor components to the physical processor, core, and strand.


Table 3. Location of key processor components in the UltraSPARC T2 processor.
Processor: 8 banks of 4 MB L2 cache; L2 cache crossbar; Memory controller; PCI-E; 10 Gb/Gb Ethernet
Core: 2 instruction pipelines (8 stages); L1 Icache and Dcache; TLB; FGU (12 stages)
Strand: Full register file with 8 windows; Most of ASI; Ancillary state register (ASR); Privileged registers

Address Space Identifier
Unlike x86 processors in 32-bit mode, which use segmentation to divide a process's address space into several segments of protected address spaces, the SPARC V9 processor has a flat 64-bit address space. An address in the SPARC V9 processor is a tuple consisting of an 8-bit address space identifier (ASI) and a 64-bit byte-address offset within the specified address space. The ASI provides attributes of an address space, including the following:
Privileged or non-privileged
Register or memory
Endianness (for example, little-endian or big-endian)
Physical or virtual address
Cacheable or non-cacheable

The SPARC processor's ASI allows different types of address spaces (user virtual address space, kernel virtual address space, processor control and status registers, and so on) to coexist as separate and independent address spaces for a given context. Unlike x86 processors, in which user processes and the kernel share the same address space, user processes and the kernel have their own address spaces on SPARC processors. Access to these address spaces is protected by the ASI associated with each address space. ASIs in the range 0x00-0x2F may be accessed only by software running in privileged or hyperprivileged mode; ASIs in the range 0x30-0x7F may be accessed only by software running in hyperprivileged mode. An access to a restricted (privileged or hyperprivileged) ASI (0x00-0x7F) by non-privileged software results in a privileged_action trap. Table 9-1 and Table 10-1 of [2] provide a summary and description for each ASI.
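These ASI ranges can be summarized as a simple check. The following sketch is an illustration of the rule only, not hypervisor source; in hardware the check is applied when a load or store alternate instruction supplies an ASI, and a violation results in a privileged_action trap.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum {
        ASI_MODE_NONPRIVILEGED,
        ASI_MODE_PRIVILEGED,
        ASI_MODE_HYPERPRIVILEGED
    } asi_mode_t;

    /* Returns true if the access is allowed, false if it would trap. */
    bool asi_access_allowed(uint8_t asi, asi_mode_t mode)
    {
        if (asi >= 0x80)                      /* unrestricted ASIs           */
            return true;
        if (asi >= 0x30)                      /* 0x30-0x7F: hyperprivileged  */
            return mode == ASI_MODE_HYPERPRIVILEGED;
        /* 0x00-0x2F: privileged or hyperprivileged */
        return mode != ASI_MODE_NONPRIVILEGED;
    }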


Memory Management Unit
The traditional UltraSPARC architecture supports two types of memory addressing:
Virtual addresses (VA), managed by the GOS and used by user programs
Physical addresses (PA), passed by the processor to the system bus when accessing physical memory
The Memory Management Unit (MMU) of the UltraSPARC processor provides the translation of VAs to PAs. This translation enables user programs to use a VA to locate data in physical memory. The SpitFire Memory Management Unit (sfmmu) is Sun's implementation of the UltraSPARC MMU. The sfmmu hardware consists of Translation Lookaside Buffers (TLBs) and a number of MMU registers:
Translation Lookaside Buffer (TLB)
The TLB provides virtual-to-physical address translations. Each entry of the TLB is a Translation Table Entry (TTE) that holds information for a single page mapping of virtual to physical addresses. The format of the TTE is shown in Figure 13. The TTE consists of two 64-bit words, representing the tag and data of the translation. The privileged field, P, controls whether or not the page can be accessed by non-privileged software.
MMU registers
A number of MMU registers are used for accessing TLB entries, removing TLB entries (demap), context management, handling TLB misses, and supporting Translation Storage Buffer (TSB) access. The TSB, an array of TTE entries, is a cache of translation tables used to quickly reload the TLB. The TSB resides in system memory and is managed entirely by the OS. The UltraSPARC processors include MMU hardware registers for speeding up TSB access. On a TLB miss, the handler first searches the TSB for the translation. If the translation is not found in the TSB, a more sophisticated (and slower) TSB miss handler is called to load the translation into the TSB.
[Figure 13 diagram: TTE format, showing the tag word (context ID and virtual address) and the data word (target address and attribute bits, including the privileged bit P).]

Figure 13. The translation lookaside buffer (TLB) is an array of translation table entries containing tag and data portions. A TLB hit occurs if both the context and virtual address match an entry in the TLB. Address aliasing (multiple TLB entries with the same physical address) is permitted.
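For readers who find a data-structure view helpful, the following simplified sketch models a TTE as the two 64-bit words described above. Field names and the placement of bits are illustrative only; the authoritative layout is given in the UltraSPARC Architecture specification [2].

    #include <stdint.h>

    typedef struct tte {
        uint64_t tag;    /* context ID and virtual address of the mapping    */
        uint64_t data;   /* valid bit, page size, target (physical) address,
                            and attribute bits such as the privileged bit P  */
    } tte_t;

    /* A TLB or TSB lookup conceptually compares context and VA against the
     * tag; only on a match is the data word used to form the PA. */
    static inline int tte_matches(const tte_t *tte, uint64_t lookup_tag)
    {
        return tte->tag == lookup_tag;
    }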


Unlike on the x86 processor, the loading of page translations into the TLB is managed by software through traps. In the event of a TLB miss, a trap is generated, and the handler first tries to obtain the translation from the Translation Storage Buffer (TSB) (Figure 14). The TSB, an in-memory array of translations, acts like a direct-mapped cache for the TLB. If the translation is not present in the TSB, a TSB miss trap is generated. The TSB miss trap handler uses a software lookup mechanism based on the hash memory entry block structure, hme_blk, to obtain the TTE.


If a translation is still not found in the hme_blk hash, the kernel's generic trap handler is invoked, which calls the kernel function pagefault() to allocate physical memory for the virtual address and load the translation into the hme_blk hash structure.
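This cascade can be summarized in pseudocode. The helper functions below are hypothetical placeholders rather than the Solaris sfmmu routines, and the sketch ignores trap levels and locking; it only captures the order in which the TSB, the hme_blk hash, and pagefault() are consulted.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct tte { uint64_t tag, data; } tte_t;

    extern bool tsb_lookup(uint64_t ctx, uint64_t va, tte_t *tte);      /* fast, in-memory TSB        */
    extern bool hme_blk_lookup(uint64_t ctx, uint64_t va, tte_t *tte);  /* slower hash lookup         */
    extern void pagefault_alloc(uint64_t ctx, uint64_t va, tte_t *tte); /* allocate memory, build TTE */
    extern void tsb_load(uint64_t ctx, uint64_t va, const tte_t *tte);
    extern void tlb_load(const tte_t *tte);                             /* privileged TLB fill        */

    void handle_tlb_miss(uint64_t ctx, uint64_t va)
    {
        tte_t tte;

        if (!tsb_lookup(ctx, va, &tte)) {           /* TSB miss trap path       */
            if (!hme_blk_lookup(ctx, va, &tte))     /* not in the hme_blk hash  */
                pagefault_alloc(ctx, va, &tte);     /* pagefault(): new mapping */
            tsb_load(ctx, va, &tte);                /* reload the TSB           */
        }
        tlb_load(&tte);                             /* finally refill the TLB   */
    }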

Figure 14 depicts the mechanism for handling TLB misses in an unvirtualized domain.
[Figure 14 diagram: on a TLB miss the processor consults the TSB; on a TSB miss the handler consults the hme_blk hash, allocating memory through the OS pagefault() function if necessary; the resulting TTE is cached in memory by hat_memload() and loaded back into the TSB and the TLB.]
Figure 14. Handling a TLB miss in an unvirtualized domain, UltraSPARC T1/T2 processor architecture.
Similarly, Figure 15 depicts how TLB misses are handled in a virtualized domain. In a virtualized environment, the UltraSPARC T1/T2 processor adds a real address (RA) type to the VA and PA addressing types (Figure 15). Real addresses, which are equivalent to the physical memory presented to a guest by Sun xVM Server (see "Physical Memory Management" on page 52), are provided to the GOS as the underlying physical memory allocated to it. The GOS-maintained TSBs are used to translate VAs into RAs. The hypervisor manages the translation from RA to PA.
[Figure 15 diagram (partial): TLB miss handling in a virtualized domain, with the processor MMU and the RA-to-PA translation managed by the hypervisor.]