
Networks and Operating Systems
Chapter 21: Virtual Machine Monitors

(252-0062-00)

Donald Kossmann & Torsten Hoefler
Spring Semester 2013

© Systems Group | Department of Computer Science | ETH Zürich

Last time: I/O

•  Network stack implementation
•  Network devices and network I/O
•  Memory management in the I/O subsystem
•  Performance issues
   – Buffering
   – Multiple queues and receive-side scaling

This time: Virtual Machine Monitors

•  Basic definitions
•  Why would you want one?
•  Structure
•  How does it work?
   – CPU
   – MMU
   – Memory
   – Devices
   – Network
•  Acknowledgement: Thanks to Steve Hand for some of the slides!

What is a Virtual Machine Monitor?

•  Virtualizes an entire (hardware) machine
   – Contrast with OS processes
   – Interface provided is the “illusion of real hardware”
   – Applications are therefore complete Operating Systems themselves
   – Terminology: Guest Operating Systems
•  Old idea: IBM VM/CMS (1960s)
   – Recently revived: VMware, Xen, Hyper-V, kvm, etc.

VMMs and Hypervisors

[Diagram: real hardware at the bottom, the hypervisor above it, and two guest operating systems on top, each running two apps over its own VMM. The VMM creates the illusion of hardware.]

Some folks distinguish the Virtual Machine Monitor from the Hypervisor (we won’t)

Why would you want one?

•  Diagrams:
•  Server consolidation (program assumes own machine)
•  Performance isolation
•  Backward compatibility
•  Cloud computing (unit of selling cycles)
•  Something under the OS: replay, auditing, trusted computing, rootkits

Running multiple OSes on one machine

•  Application compatibility
   – I use Ubuntu for almost everything, but I edit slides in PowerPoint
   – Some people compile Barrelfish in a Debian VM over Windows 7 with Hyper-V
•  Backward compatibility
   – Nothing beats a Windows 98 virtual machine for playing old computer games

[Diagram: real hardware, a hypervisor, and six apps on top]

Server consolidation

•  Many applications assume they have the machine to themselves
•  Each machine is mostly idle
⇒ Consolidate servers onto a single physical machine

[Diagram: real hardware, a hypervisor, and three applications on top]

Resource isolation

•  Surprisingly, modern OSes do not have an abstraction for a single application
•  Performance isolation can be critical in some enterprises
•  Use virtual machines as resource containers

[Diagram: real hardware, a hypervisor, and three applications on top]

Cloud computing

•  Selling computing capacity on demand
   – E.g. Amazon EC2, GoGrid, etc.
•  Hypervisors decouple allocation of resources (VMs) from provisioning of infrastructure (physical machines)

[Diagram: six physical machines, each with real hardware, a hypervisor, and two applications]

Operating System development

•  Building and testing a new OS without needing to reboot real hardware
•  The VMM often gives you more information about faults than real hardware anyway

[Diagram: real hardware, a hypervisor, and a compiler, an editor, and Visual Studio on top]

Other cool applications…

•  Tracing
•  Debugging
•  Execution replay
•  Lock-step execution
•  Live migration
•  Rollback
•  Speculation
•  Etc.

[Diagram: real hardware, a hypervisor, and a tracer alongside two applications]

How does it all work?

•  Note: a hypervisor is basically an OS
   – With an “unusual API”
•  Many functions quite similar:
   – Multiplexing resources
   – Scheduling, virtual memory, device drivers
•  Different:
   – Creating the illusion of hardware to “applications”
   – Guest OSes are less flexible in resource requirements

Hosted VMMs

[Diagram: real hardware, a host operating system, ordinary applications running on the host, and next to them a VMM running a guest operating system with two apps]

Examples:
•  VMware Workstation
•  Linux KVM
•  Microsoft Hyper-V

Hypervisor-based VMMs

[Diagram: real hardware, the hypervisor, a console (management) operating system running the management console, and guest operating systems with two apps each; every operating system runs over its own VMM]

Examples:
•  VMware ESX
•  IBM VM/CMS
•  Xen

How to virtualize…

•  The CPU(s)?
•  The MMU?
•  Physical memory?
•  Devices (disks, etc.)?
•  The network?

Virtualizing the CPU

•  A CPU architecture is strictly virtualizable if it can be perfectly emulated over itself, with all non-privileged instructions executed natively
•  Privileged instructions ⇒ trap
   – Kernel mode (i.e. the VMM) emulates the instruction
   – Guest’s kernel mode is actually user mode
      •  Or another, extra privilege level (such as ring 1)
•  Examples: IBM S/390, Alpha, PowerPC
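The trap-and-emulate loop described above can be sketched as a toy interpreter. This is only an illustration under loud assumptions: the instruction names (`add`, `cli`, `sti`), the `VCpu` class, and the helper functions are all hypothetical and do not correspond to any real VMM.

```python
from dataclasses import dataclass

PRIVILEGED = {"cli", "sti"}  # hypothetical privileged instructions


class Trap(Exception):
    """Raised by the 'hardware' when user-mode code runs a privileged instruction."""


@dataclass
class VCpu:
    acc: int = 0                      # guest-visible accumulator
    interrupts_enabled: bool = True   # virtual interrupt flag, kept by the VMM
    traps: int = 0                    # how often the VMM had to step in


def hardware_execute(vcpu, instr, arg):
    """Run one instruction natively in user mode; privileged ones trap."""
    if instr in PRIVILEGED:
        raise Trap(instr)
    if instr == "add":
        vcpu.acc += arg
    else:
        raise ValueError(f"unknown instruction {instr!r}")


def vmm_emulate(vcpu, instr):
    """VMM trap handler: apply the instruction's effect to virtual state."""
    vcpu.traps += 1
    vcpu.interrupts_enabled = (instr == "sti")


def run_guest(vcpu, program):
    for instr, arg in program:
        try:
            hardware_execute(vcpu, instr, arg)   # non-privileged: runs natively
        except Trap:
            vmm_emulate(vcpu, instr)             # privileged: trap and emulate
    return vcpu


cpu = run_guest(VCpu(), [("add", 2), ("cli", None), ("add", 3), ("sti", None)])
print(cpu)
```

Only the two privileged instructions reach the VMM; the arithmetic runs "natively", which is what makes the scheme fast when privileged instructions are rare.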

Virtualizing the CPU

•  A strictly virtualizable processor can execute a complete native Guest OS
   – Guest applications run in user mode as before
   – Guest kernel works exactly as before
•  Problem: the x86 architecture is not strictly virtualizable
   – About 20 instructions are sensitive but not privileged
   – Mostly segment loads and processor flag manipulation

Non-virtualizable x86: example

•  PUSHF/POPF instructions
   – Push/pop condition code register
   – Includes interrupt enable flag (IF)
•  Unprivileged instructions: fine in user space!
   – IF is ignored by POPF in user mode, not in kernel mode
⇒ VMM can’t determine if Guest OS wants interrupts disabled!
   – A POPF issued by the (deprivileged) guest kernel does not trap; its change to IF is silently dropped
   – Prevents correct functioning of the Guest OS
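The POPF problem can be made concrete with a small model. This is a simplified sketch, not a faithful x86 emulation: it assumes IOPL is 0, so "kernel mode" stands for ring 0, and it tracks only two flag bits. The function name `popf` and its parameters are illustrative.

```python
IF_MASK = 1 << 9   # interrupt enable flag is bit 9 of EFLAGS
CF_MASK = 1 << 0   # carry flag, an ordinary condition-code bit


def popf(current_flags, popped_value, kernel_mode):
    """Simplified POPF: user mode may change CF but not IF, and never traps."""
    writable = CF_MASK | (IF_MASK if kernel_mode else 0)
    return (current_flags & ~writable) | (popped_value & writable)


flags = IF_MASK                     # interrupts currently enabled
wanted = CF_MASK                    # guest kernel wants IF=0, CF=1

native = popf(flags, wanted, kernel_mode=True)
deprivileged = popf(flags, wanted, kernel_mode=False)

print(bool(native & IF_MASK))        # False: IF really cleared
print(bool(deprivileged & IF_MASK))  # True: IF change silently dropped
```

In the deprivileged case nothing signals the VMM that the guest kernel tried to disable interrupts, which is exactly why trap-and-emulate alone is not enough on this architecture.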

Solutions

1.  Emulation: emulate all kernel-mode code in software
   – Very slow, particularly for I/O-intensive workloads
   – Used by, e.g., SoftPC
2.  Paravirtualization: modify the Guest OS kernel
   – Replace the sensitive instruction with an explicit trap instruction to the VMM
   – Also called a “HyperCall” (used for all kinds of things)
   – Used by, e.g., Xen
3.  Binary rewriting:
   – Protect kernel instruction pages, trap to VMM on first IFetch
   – Scan page for POPF instructions and replace
   – Restart instruction in Guest OS and continue
   – Used by, e.g., VMware
4.  Hardware support: Intel VT-x, AMD-V
   – Extra processor mode causes POPF to trap
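The binary rewriting steps (option 3) can be sketched as follows. This is a deliberately naive illustration, not VMware's actual translator: it scans bytes without decoding instruction boundaries, and it uses INT3 (`0xCC`) as a stand-in for whatever trapping sequence a real VMM would insert. The class and function names are hypothetical.

```python
POPF = 0x9D   # x86 opcode for POPF
TRAP = 0xCC   # x86 INT3, used here as a stand-in one-byte trap to the VMM


class GuestCodePage:
    def __init__(self, code):
        self.code = bytearray(code)
        self.scanned = False          # "protected" until the VMM has rewritten it


def vmm_on_first_fetch(page):
    """Scan the page, replace every POPF with a trap, then unprotect the page."""
    patched = []
    for offset, byte in enumerate(page.code):
        if byte == POPF:              # naive: ignores instruction boundaries
            page.code[offset] = TRAP
            patched.append(offset)
    page.scanned = True
    return patched


def fetch(page, offset):
    """Guest instruction fetch: the first touch of a page traps to the VMM."""
    if not page.scanned:
        vmm_on_first_fetch(page)
    return page.code[offset]          # restart the fetch on the rewritten page


page = GuestCodePage(bytes([0x90, POPF, 0x90, POPF]))  # NOP, POPF, NOP, POPF
first = fetch(page, 0)
```

The scan cost is paid once per code page; afterwards the rewritten code runs at native speed until it hits one of the inserted traps.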

Virtualizing the MMU

•  Hypervisor allocates memory to VMs
   – Guest assumes control over all physical memory
   – VMM can’t let the Guest OS install arbitrary mappings
•  Definitions needed:
   – Virtual address: a virtual address in the guest
   – Physical address: as seen by the guest
   – Machine address: real physical address
      •  As seen by the Hypervisor

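Virtual/Physical/Machine

[Diagram: for Guest 1 and Guest 2, pages in the guest virtual address space map to frames in the guest physical address space, which in turn map to frames of machine memory. Page numbers shown: 5, 5, 9, 2, 6, 17.]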
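The three kinds of address compose into a two-step translation. The sketch below uses single-level dictionaries instead of real page tables, and the specific page numbers (5 → 9 → 17) are made up for illustration rather than read off the diagram.

```python
PAGE_SIZE = 4096

# Hypothetical mappings, keyed by page/frame number.
guest_page_table = {5: 9}     # guest virtual page -> guest physical frame (V→P)
vmm_p2m = {9: 17}             # guest physical frame -> machine frame (P→M)


def translate(vaddr):
    """Translate a guest virtual address to a machine address in two steps."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    pfn = guest_page_table[vpn]       # step 1: maintained by the Guest OS
    mfn = vmm_p2m[pfn]                # step 2: maintained by the VMM
    return mfn * PAGE_SIZE + offset


machine_addr = translate(5 * PAGE_SIZE + 0x10)
print(hex(machine_addr))
```

The page offset passes through unchanged; only the frame number is rewritten at each step.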

MMU Virtualization

•  Critical for performance, challenging to make fast, especially on SMP
   – Hot-unplug unnecessary virtual CPUs
   – Use multicast TLB flush paravirtualizations etc.
•  Xen supports 3 MMU virtualization modes
   1. Direct (“Writable”) page tables
   2. Shadow page tables
   3. Hardware Assisted Paging
•  OS paravirtualization compulsory for #1, optional (and very beneficial) for #2 & 3

Paravirtualization approach

•  Guest OS creates the page tables the hardware uses
   – VMM must validate all updates to page tables
   – Requires modifications to the Guest OS
   – Not quite enough…
•  VMM must check all writes to PTEs
   – Write-protect all PTEs to the Guest kernel
   – Add a HyperCall to update PTEs
   – Batch updates to avoid trap overhead
   – OS is now aware of machine addresses
   – Significant overhead!

Para-Virtualizing the MMU

•  Guest OSes allocate and manage their own PTs
   – Hypercall to change PT base
•  VMM must validate PT updates before use
   – Allows incremental updates, avoids revalidation
•  Validation rules applied to each PTE:
   – 1. Guest may only map pages it owns*
   – 2. Page table pages may only be mapped RO
•  VMM traps PTE updates and emulates, or ‘unhooks’ PTE page for bulk updates
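The two validation rules, combined with a batched update hypercall, might look like the sketch below. It is not Xen's interface: the `Vmm` class, `hypercall_update_ptes`, and the flat `page_table` dictionary are hypothetical simplifications (a single address space, one ownership table).

```python
from dataclasses import dataclass, field


@dataclass
class Vmm:
    owner: dict                                      # machine frame -> owning guest
    pt_frames: set = field(default_factory=set)      # frames that hold page tables
    page_table: dict = field(default_factory=dict)   # vpn -> (mfn, writable)

    def validate(self, guest, mfn, writable):
        if self.owner.get(mfn) != guest:
            return False                          # rule 1: only map owned pages
        if mfn in self.pt_frames and writable:
            return False                          # rule 2: page tables are RO
        return True

    def hypercall_update_ptes(self, guest, updates):
        """Apply a batch of (vpn, mfn, writable) updates in one hypercall."""
        rejected = []
        for vpn, mfn, writable in updates:
            if self.validate(guest, mfn, writable):
                self.page_table[vpn] = (mfn, writable)
            else:
                rejected.append(vpn)
        return rejected


vmm = Vmm(owner={17: "guest1", 18: "guest1", 30: "guest2"}, pt_frames={18})
rejected = vmm.hypercall_update_ptes("guest1", [
    (5, 17, True),    # own data page, writable: accepted
    (6, 18, False),   # own page-table page, read-only: accepted
    (7, 18, True),    # own page-table page, writable: rejected
    (8, 30, True),    # another guest's page: rejected
])
print(rejected)
```

Passing the whole list in one call is the batching idea: the guest pays for one trap into the VMM rather than one per PTE.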

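Writeable Page Tables: 1 – Write fault

[Diagram: Guest OS, VMM, and hardware MMU. Guest reads go directly to the Virtual → Machine page table; the first guest write causes a page fault into the VMM.]

Writeable Page Tables: 2 – Emulate?

[Diagram: on the first guest write the VMM decides whether to emulate the update; if yes, it applies the write to the Virtual → Machine page table itself.]

Writeable Page Tables: 3 - Unhook

[Diagram: otherwise the page-table page is unhooked (marked X) from the Virtual → Machine table used by the MMU; the guest can now read and write it freely.]

Writeable Page Tables: 4 - First Use

[Diagram: the first use of the unhooked page (marked X) raises a page fault into the VMM.]

Writeable Page Tables: 5 – Re-hook

[Diagram: the VMM validates the guest’s writes and re-hooks the page into the Virtual → Machine table.]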

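Writeable page tables require paravirtualization

[Diagram: the guest virtual address spaces of Guest 1 and Guest 2 map directly onto machine memory, with no guest physical address space in between. Page numbers shown: 5, 5, 9, 2, 6, 17.]

Guests directly share Machine Memory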

Shadow Page Tables

•  Guest OS sets up its own page tables
   – Not used by the hardware!
•  VMM maintains shadow page tables
   – Map directly from Guest VAs to Machine Addresses
   – Hardware switched whenever Guest reloads PTPR
•  VMM must keep V→M table consistent with Guest V→P table and its own P→M table
   – VMM write-protects all guest page tables
   – Write ⇒ trap: apply write to shadow table as well
   – Significant overhead!
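The consistency requirement can be stated as an invariant: the shadow V→M table must always equal the guest's V→P table composed with the VMM's P→M table. Here is a minimal sketch, assuming flat dictionaries in place of multi-level page tables; the `ShadowPager` class and its method names are hypothetical.

```python
class ShadowPager:
    def __init__(self, p2m):
        self.p2m = p2m            # VMM's P→M table: guest frame -> machine frame
        self.guest_pt = {}        # guest's V→P table (write-protected by the VMM)
        self.shadow_pt = {}       # V→M table actually loaded into the hardware
        self.traps = 0

    def guest_write_pte(self, vpn, pfn):
        """A guest write to its page table traps; the VMM updates both tables."""
        self.traps += 1
        self.guest_pt[vpn] = pfn                  # what the guest believes
        self.shadow_pt[vpn] = self.p2m[pfn]       # what the MMU really uses

    def guest_unmap(self, vpn):
        self.traps += 1
        self.guest_pt.pop(vpn, None)
        self.shadow_pt.pop(vpn, None)

    def consistent(self):
        """Invariant: shadow V→M equals guest V→P composed with P→M."""
        return self.shadow_pt == {v: self.p2m[p] for v, p in self.guest_pt.items()}


pager = ShadowPager(p2m={9: 17, 2: 6})
pager.guest_write_pte(5, 9)
pager.guest_write_pte(7, 2)
pager.guest_unmap(7)
print(pager.shadow_pt, pager.traps)
```

Every page-table write costs a trap in this model, which is where the "significant overhead" on the slide comes from.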

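Shadow Page Tables

[Diagram: for Guest 1 and Guest 2, the guest virtual address space, guest physical address space, and machine memory, with the shadow page table mappings going straight from guest virtual pages to machine frames. Page numbers shown: 5, 5, 9, 2, 6, 17.]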

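Shadow page tables

[Diagram: the Guest OS reads and writes its Virtual → Guest-Physical tables; the VMM propagates updates to the Virtual → Machine tables used by the hardware MMU, and accessed and dirty bits flow back to the guest’s tables.]

•  Guest changes optional, but help with batching, knowing when to unshadow
•  Latest algorithms work remarkably well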

Hardware support

•  “Nested page tables”
   – Relatively new in AMD (NPT) and Intel (EPT) hardware
•  Two-level translation of addresses in the MMU
   – Hardware knows about:
      •  V→P tables (in the Guest)
      •  P→M tables (in the Hypervisor)
   – Tagged TLBs to avoid expensive flush on a VM entry/exit
•  Very nice and easy to code to
   – One reason kvm is so small
•  Significant performance overhead…
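One source of that overhead is the longer page-table walk on a TLB miss. The sketch below counts worst-case memory references, assuming multi-level tables in both guest and hypervisor and no paging-structure caches; the function name and the choice of 4 levels are illustrative assumptions, not figures from the slides.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references for one TLB miss under nested paging.

    Each guest page-table level holds a guest-physical address that must itself
    be translated through the host tables (host_levels references) before the
    guest entry can be read (1 reference). The final guest-physical address of
    the data needs one more host walk.
    """
    per_guest_level = host_levels + 1
    return guest_levels * per_guest_level + host_levels


native = 4                              # plain 4-level walk, no virtualization
nested = nested_walk_refs(4, 4)
print(native, nested)
```

Under these assumptions a miss costs 24 references instead of 4, which is why tagged TLBs and large pages matter so much with nested paging.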

Memory allocation

•  Guest OS is not expecting physical memory to change in size!
•  Two problems:
   – Hypervisor wants to overcommit RAM
   – How to reallocate (machine) memory between VMs
•  Phenomenon: Double Paging
   – Hypervisor pages out memory
   – Guest OS decides to page out physical frame
   – (Unwittingly) faults it in via the Hypervisor, only to write it out again

Ballooning

•  Technique to reclaim memory from a Guest
•  Install a “balloon driver” in the Guest kernel
   – Can allocate and free kernel physical memory
      •  Just like any other part of the kernel
   – Uses HyperCalls to return frames to the Hypervisor, and have them returned
      •  Guest OS is unaware, simply allocates physical memory

Ballooning: taking RAM away from a VM

1.  VMM asks balloon driver for memory
2.  Balloon driver asks Guest OS kernel for more frames
   – “inflates the balloon”
3.  Balloon driver sends physical frame numbers to VMM
4.  VMM translates into machine addresses and claims the frames

[Diagram: the guest physical address space with the balloon driver; as the balloon inflates, part of the space becomes physical memory claimed by the balloon driver.]

Returning RAM to a VM

1.  VMM converts machine address into a physical address previously allocated by the balloon driver
2.  VMM hands PFN to balloon driver
3.  Balloon driver frees physical frame back to Guest OS kernel
   – “deflates the balloon”

[Diagram: the guest physical address space, the balloon, and the balloon driver.]
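Inflating and deflating the balloon can be traced with a toy model. Everything here is a simplifying assumption: the `Guest` and `Hypervisor` classes are hypothetical, the guest kernel's allocator is a plain list of free frame numbers, and the hypercalls are ordinary method calls.

```python
class Guest:
    def __init__(self, p2m):
        self.p2m = dict(p2m)                # guest frame (PFN) -> machine frame
        self.free_pfns = sorted(p2m)        # frames the guest kernel can hand out
        self.balloon = []                   # PFNs held by the balloon driver


class Hypervisor:
    def __init__(self):
        self.free_machine_frames = []

    def inflate(self, guest, nframes):
        """Take nframes away from the guest via its balloon driver."""
        for _ in range(nframes):
            pfn = guest.free_pfns.pop()     # driver allocates from guest kernel
            guest.balloon.append(pfn)       # ...and reports the PFN (hypercall)
            mfn = guest.p2m.pop(pfn)        # VMM translates PFN -> machine frame
            self.free_machine_frames.append(mfn)

    def deflate(self, guest, nframes):
        """Give nframes back: back ballooned PFNs with machine frames again."""
        for _ in range(nframes):
            mfn = self.free_machine_frames.pop()
            pfn = guest.balloon.pop()       # VMM hands a ballooned PFN back
            guest.p2m[pfn] = mfn
            guest.free_pfns.append(pfn)     # driver frees it to the guest kernel


hv = Hypervisor()
guest = Guest({0: 100, 1: 101, 2: 102, 3: 103})
hv.inflate(guest, 2)
after_inflate = (len(guest.p2m), sorted(hv.free_machine_frames))
hv.deflate(guest, 1)
print(after_inflate, guest.balloon, guest.free_pfns)
```

From the guest kernel's point of view the balloon driver is just another consumer of memory, so its own paging policy decides what to give up.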

Virtualizing Devices

•  Familiar by now: trap-and-emulate
   – I/O space traps
   – Protect memory and trap
   – “Device model”: software model of device in VMM
•  Interrupts → upcalls to Guest OS
   – Emulate interrupt controller (APIC) in Guest
   – Emulate DMA with copy into Guest PAS
•  Significant performance overhead!
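A device model is ordinary software that reacts to trapped I/O accesses. The sketch below shows the dispatch path for port writes only; the `SerialPortModel` and `Vmm` classes are hypothetical, and the port number is the conventional COM1 base, used purely as an example.

```python
class SerialPortModel:
    """Hypothetical device model: a write-only character port."""

    DATA_PORT = 0x3F8      # conventional COM1 base port, used only as an example

    def __init__(self):
        self.output = []

    def io_write(self, port, value):
        if port == self.DATA_PORT:
            self.output.append(chr(value & 0xFF))


class Vmm:
    def __init__(self):
        self.models = {}       # I/O port -> device model handling it
        self.io_traps = 0

    def attach(self, port, model):
        self.models[port] = model

    def on_io_trap(self, port, value):
        """Called when a guest OUT instruction traps; dispatch to the model."""
        self.io_traps += 1
        model = self.models.get(port)
        if model is not None:
            model.io_write(port, value)


vmm = Vmm()
serial = SerialPortModel()
vmm.attach(SerialPortModel.DATA_PORT, serial)
for ch in "hi":                               # guest driver writes byte by byte
    vmm.on_io_trap(SerialPortModel.DATA_PORT, ord(ch))
print("".join(serial.output), vmm.io_traps)
```

Each byte written costs one trap into the VMM, which illustrates why fully emulated devices carry significant overhead.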

Paravirtualized devices

•  “Fake” device drivers which communicate efficiently with VMM via hypercalls
   – Used for block devices like disk controllers
   – Network interfaces
   – “VMware tools” is mostly about these
•  Dramatically better performance!
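One common way such drivers gain efficiency is to queue requests in memory shared with the VMM and issue a single hypercall for the whole batch. The slides do not spell out the mechanism, so treat this as an assumed design: `FrontendDriver`, `Backend`, and the `deque` standing in for a shared ring are all hypothetical.

```python
from collections import deque


class Backend:
    """VMM side of a hypothetical paravirtual block device."""

    def __init__(self):
        self.hypercalls = 0
        self.completed = []

    def notify(self, ring):
        self.hypercalls += 1              # one trap into the VMM...
        while ring:                       # ...drains every queued request
            self.completed.append(ring.popleft())


class FrontendDriver:
    """Guest side: queues requests in shared memory, then notifies once."""

    def __init__(self, backend):
        self.ring = deque()               # stands in for a shared-memory ring
        self.backend = backend

    def submit(self, request):
        self.ring.append(request)         # plain memory write, no trap

    def kick(self):
        self.backend.notify(self.ring)    # single hypercall for the whole batch


backend = Backend()
driver = FrontendDriver(backend)
for sector in range(8):
    driver.submit(("read", sector))
driver.kick()
print(backend.hypercalls, len(backend.completed))
```

Eight requests cost one transition into the VMM here, compared with at least one trap per register access in a fully emulated device.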

Networking

•  Virtual network device in the Guest VM
•  Hypervisor implements a “soft switch”
   – Entire virtual IP/Ethernet network on a machine
•  Many different addressing options
   – Separate IP addresses
   – Separate MAC addresses
   – NAT
•  Etc.
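A soft switch with separate MAC addresses per VM can be sketched as a lookup table from MAC address to a virtual NIC's receive queue. This is an illustrative model only: the `SoftSwitch` class, the MAC addresses, and the tuple-based frames are assumptions, and forwarding to the physical NIC is left out.

```python
class SoftSwitch:
    def __init__(self):
        self.ports = {}        # MAC address -> receive queue of a virtual NIC

    def attach(self, mac):
        self.ports[mac] = []
        return self.ports[mac]

    def send(self, src_mac, dst_mac, payload):
        """Deliver a frame to the matching virtual NIC, or flood a broadcast."""
        frame = (src_mac, dst_mac, payload)
        if dst_mac == "ff:ff:ff:ff:ff:ff":
            for mac, queue in self.ports.items():
                if mac != src_mac:
                    queue.append(frame)
        elif dst_mac in self.ports:
            self.ports[dst_mac].append(frame)
        # Unknown destinations would go to the physical NIC; omitted here.


switch = SoftSwitch()
vm1 = switch.attach("02:00:00:00:00:01")
vm2 = switch.attach("02:00:00:00:00:02")
switch.send("02:00:00:00:00:01", "02:00:00:00:00:02", b"ping")
switch.send("02:00:00:00:00:02", "ff:ff:ff:ff:ff:ff", b"arp")
print(vm1, vm2)
```

Traffic between two VMs on the same machine never touches the physical network in this model; it is just a queue append inside the hypervisor.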

Where are the real drivers?

1.  In the Hypervisor
   – E.g. VMware ESX
   – Problem: need to rewrite device drivers (new OS)
2.  In the console OS
   – Export virtual devices to other VMs
3.  In “driver domains”
   – Map hardware directly into a “trusted” VM
      •  Device Passthrough
   – Run your favorite OS just for the device driver
   – Use IOMMU hardware to protect other memory from driver VM
4.  Use “self-virtualizing devices”

Xen 3.x Architecture

[Diagram, bottom to top:
 – Hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE)
 – Xen Virtual Machine Monitor: Control IF, Safe HW IF, Event Channel, Virtual CPU, Virtual MMU
 – VM0: Guest OS (XenLinux) with Device Manager & Control s/w, Native Device Drivers, and a virtual switch
 – VM1: Guest OS (XenLinux) with unmodified user software and Front-End Device Drivers
 – VM2: SMP Guest OS (XenLinux) with unmodified user software and Front-End Device Drivers
 – VM3: Unmodified Guest OS (WinXP) with unmodified user software and Front-End Device Drivers]

Thanks to Steve Hand for some of these diagrams

Remember this card?

SR-IOV

•  Single-Root I/O Virtualization
•  Key idea: dynamically create new “PCIe devices”
   – Physical Function (PF): original device, full functionality
   – Virtual Function (VF): extra “device”, limited functionality
   – VFs created/destroyed via PF registers
•  For networking:
   – Partitions a network card’s resources
   – With direct assignment can implement passthrough

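SR-IOV in action

[Diagram: an SR-IOV NIC attached to the LAN contains a virtual Ethernet bridge/switch with packet classifier, three virtual functions, and one physical function. Across PCIe and the IOMMU, three VMs each run a VF driver; another VM runs the PF driver and a VSwitch, and one VM uses a VNIC driver. The VMM sits beneath the VMs.]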

Self-virtualizing devices

•  Can dynamically create up to 2048 distinct PCI devices on demand!
   – Hypervisor can create a virtual NIC for each VM
   – Soft switch driver programs “master” NIC to demux packets to each virtual NIC
   – PCI bus is virtualized in each VM
   – Each Guest OS appears to have “real” NIC, talks direct to the real hardware

Next Week

Reliable storage

OS Research/Future™
