Operating Systems Principles & Practice Volume III

Operating Systems

Principles & Practice

Volume III: Memory Management
Second Edition

Thomas Anderson
University of Washington

Mike Dahlin
University of Texas and Google

Recursive Books

recursivebooks.com

Operating Systems: Principles and Practice (Second Edition) Volume III: Memory Management by Thomas Anderson and Michael Dahlin
Copyright © Thomas Anderson and Michael Dahlin, 2011-2015.

ISBN 978-0-9856735-5-0
Publisher: Recursive Books, Ltd., http://recursivebooks.com/
Cover: Reflection Lake, Mt. Rainier
Cover design: Cameron Neat
Illustrations: Cameron Neat
Copy editors: Sandy Kaplan, Whitney Schmidt
Ebook design: Robin Briggs
Web design: Adam Anderson

SUGGESTIONS, COMMENTS, and ERRORS. We welcome suggestions, comments and error reports, [email protected]

Notice of rights. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the publisher. For information on getting permissions for reprints and excerpts, [email protected]

Notice of liability. The information in this book is distributed on an “As Is” basis, without warranty. Neither the authors nor Recursive Books shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information or instructions contained in this book or by the computer software and hardware products described in it.

Trademarks: Throughout this book trademarked names are used. Rather than put a trademark symbol in every occurrence of a trademarked name, we state we are using the names only in an editorial fashion and to the benefit of the trademark owner with no intention of infringement of the trademark. All trademarks or service marks are the property of their respective owners.

To Robin, Sandra, Katya, and Adam
Tom Anderson

To Marla, Kelly, and Keith
Mike Dahlin

Contents

Preface

I: Kernels and Processes

1. Introduction

2. The Kernel Abstraction

3. The Programming Interface

II: Concurrency

4. Concurrency and Threads

5. Synchronizing Access to Shared Objects

6. Multi-Object Synchronization

7. Scheduling

III Memory Management

8 Address Translation

8.1 Address Translation Concept

8.2 Towards Flexible Address Translation

8.2.1 Segmented Memory
8.2.2 Paged Memory
8.2.3 Multi-Level Translation
8.2.4 Portability

8.3 Towards Efficient Address Translation

8.3.1 Translation Lookaside Buffers
8.3.2 Superpages
8.3.3 TLB Consistency
8.3.4 Virtually Addressed Caches
8.3.5 Physically Addressed Caches

8.4 Software Protection

8.4.1 Single Language Operating Systems
8.4.2 Language-Independent Software Fault Isolation
8.4.3 Sandboxes Via Intermediate Code

8.5 Summary and Future Directions

Exercises

9 Caching and Virtual Memory

9.1 Cache Concept

9.2 Memory Hierarchy

9.3 When Caches Work and When They Do Not

9.3.1 Working Set Model
9.3.2 Zipf Model

9.4 Memory Cache Lookup

9.5 Replacement Policies

9.5.1 Random
9.5.2 First-In-First-Out (FIFO)
9.5.3 Optimal Cache Replacement (MIN)
9.5.4 Least Recently Used (LRU)
9.5.5 Least Frequently Used (LFU)
9.5.6 Belady’s Anomaly

9.6 Case Study: Memory-Mapped Files

9.6.1 Advantages
9.6.2 Implementation
9.6.3 Approximating LRU

9.7 Case Study: Virtual Memory

9.7.1 Self-Paging
9.7.2 Swapping

9.8 Summary and Future Directions

Exercises

10 Advanced Memory Management

10.1 Zero-Copy I/O

10.2 Virtual Machines

10.2.1 Virtual Machine Page Tables
10.2.2 Transparent Memory Compression

10.3 Fault Tolerance

10.3.1 Checkpoint and Restart
10.3.2 Recoverable Virtual Memory
10.3.3 Deterministic Debugging

10.4 Security

10.5 User-Level Memory Management

10.6 Summary and Future Directions

Exercises

IV: Persistent Storage

11. File Systems: Introduction and Overview

12. Storage Devices

13. Files and Directories

14. Reliable Storage

References

Glossary

About the Authors

Preface

Preface to the eBook Edition

Operating Systems: Principles and Practice is a textbook for a first course in undergraduate operating systems. In use at over 50 colleges and universities worldwide, this textbook provides:

A path for students to understand high level concepts all the way down to working code.

Extensive worked examples integrated throughout the text provide students concrete guidance for completing homework assignments.

A focus on up-to-date industry technologies and practice.

The eBook edition is split into four volumes that together contain exactly the same material as the (2nd) print edition of Operating Systems: Principles and Practice, reformatted for various screen sizes. Each volume is self-contained and can be used as a standalone text, e.g., at schools that teach operating systems topics across multiple courses.

Volume 1: Kernels and Processes. This volume contains Chapters 1-3 of the print edition. We describe the essential steps needed to isolate programs to prevent buggy applications and computer viruses from crashing or taking control of your system.

Volume 2: Concurrency. This volume contains Chapters 4-7 of the print edition. We provide a concrete methodology for writing correct concurrent programs that is in widespread use in industry, and we explain the mechanisms for context switching and synchronization from fundamental concepts down to assembly code.

Volume 3: Memory Management. This volume contains Chapters 8-10 of the print edition. We explain both the theory and mechanisms behind 64-bit address space translation, demand paging, and virtual machines.

Volume 4: Persistent Storage. This volume contains Chapters 11-14 of the print edition. We explain the technologies underlying modern extent-based, journaling, and versioning file systems.

A more detailed description of each chapter is given in the preface to the print edition.

Preface to the Print Edition

Why We Wrote This Book

Many of our students tell us that operating systems was the best course they took as an undergraduate and also the most important for their careers. We are not alone; many of our colleagues report receiving similar feedback from their students.

Part of the excitement is that the core ideas in a modern operating system (protection, concurrency, virtualization, resource allocation, and reliable storage) have become widely applied throughout computer science, not just operating system kernels. Whether you get a job at Facebook, Google, Microsoft, or any other leading-edge technology company, it is impossible to build resilient, secure, and flexible computer systems without the ability to apply operating systems concepts in a variety of settings. In a modern world, nearly everything a user does is distributed, nearly every computer is multi-core, security threats abound, and many applications such as web browsers have become mini-operating systems in their own right.

It should be no surprise that for many computer science students, an undergraduate operating systems class has become a de facto requirement: a ticket to an internship and eventually to a full-time position.

Unfortunately, many operating systems textbooks are still stuck in the past, failing to keep pace with rapid technological change. Several widely-used books were initially written in the mid-1980’s, and they often act as if technology stopped at that point. Even when new topics are added, they are treated as an afterthought, without pruning material that has become less important. The result is textbooks that are very long, very expensive, and yet fail to provide students more than a superficial understanding of the material.

Our view is that operating systems have changed dramatically over the past twenty years, and that justifies a fresh look at both how the material is taught and what is taught. The pace of innovation in operating systems has, if anything, increased over the past few years, with the introduction of the iOS and Android operating systems for smartphones, the shift to multicore computers, and the advent of cloud computing.

To prepare students for this new world, we believe students need three things to succeed at understanding operating systems at a deep level:

Concepts and code. We believe it is important to teach students both principles and practice, concepts and implementation, rather than either alone. This textbook takes concepts all the way down to the level of working code, e.g., how a context switch works in assembly code. In our experience, this is the only way students will really understand and master the material. All of the code in this book is available from the authors’ web site, ospp.washington.edu.

Extensive worked examples. In our view, students need to be able to apply concepts in practice. To that end, we have integrated a large number of example exercises, along with solutions, throughout the text. We use these exercises extensively in our own lectures, and we have found them essential to challenging students to go beyond a superficial understanding.

Industry practice. To show students how to apply operating systems concepts in a variety of settings, we use detailed, concrete examples from Facebook, Google, Microsoft, Apple, and other leading-edge technology companies throughout the textbook. Because operating systems concepts are important in a wide range of computer systems, we take these examples not only from traditional operating systems like Linux, Windows, and OS X but also from other systems that need to solve problems of protection, concurrency, virtualization, resource allocation, and reliable storage like databases, web browsers, web servers, mobile applications, and search engines.

Taking a fresh perspective on what students need to know to apply operating systems concepts in practice has led us to innovate in every major topic covered in an undergraduate-level course:

Kernels and Processes. The safe execution of untrusted code has become central to many types of computer systems, from web browsers to virtual machines to operating systems. Yet existing textbooks treat protection as a side effect of UNIX processes, as if they are synonyms. Instead, we start from first principles: what are the minimum requirements for process isolation, how can systems implement process isolation efficiently, and what do students need to know to implement functions correctly when the caller is potentially malicious?

Concurrency. With the advent of multi-core architectures, most students today will spend much of their careers writing concurrent code. Existing textbooks provide a blizzard of concurrency alternatives, most of which were abandoned decades ago as impractical. Instead, we focus on providing students a single methodology based on Mesa monitors that will enable students to write correct concurrent programs, a methodology that is by far the dominant approach used in industry.

Memory Management. Even as demand-paging has become less important, virtualization has become even more important to modern computer systems. We provide a deep treatment of address translation hardware, sparse address spaces, TLBs, and on-chip caches. We then use those concepts as a springboard for describing virtual machines and related concepts such as checkpointing and copy-on-write.

Persistent Storage. Reliable storage in the presence of failures is central to the design of most computer systems. Existing textbooks survey the history of file systems, spending most of their time on ad hoc approaches to failure recovery and de-fragmentation. Yet no modern file systems still use those ad hoc approaches. Instead, our focus is on how file systems use extents, journaling, copy-on-write, and RAID to achieve both high performance and high reliability.

Intended Audience

Operating Systems: Principles and Practice is a textbook for a first course in undergraduate operating systems. We believe operating systems should be taken as early as possible in an undergraduate’s course of study; many students use the course as a springboard to an internship and a career. To that end, we have designed the textbook to assume minimal pre-requisites: specifically, students should have taken a data structures course and one on computer organization. The code examples are written in a combination of x86 assembly, C, and C++. In particular, we have designed the book to interface well with the Bryant and O’Hallaron textbook. We review and cover in much more depth the material from the second half of that book.

We should note what this textbook is not: it is not intended to teach the API or internals of any specific operating system, such as Linux, Android, Windows 8, OS X, or iOS. We use many concrete examples from these systems, but our focus is on the shared problems these systems face and the technologies these systems use to solve those problems.

A Guide to Instructors

One of our goals is to enable instructors to choose an appropriate level of depth for each course topic. Each chapter begins at a conceptual level, with implementation details and the more advanced material towards the end. The more advanced material can be omitted without compromising the ability of students to follow later material. No single-quarter or single-semester course is likely to be able to cover every topic we have included, but we think it is a good thing for students to come away from an operating systems course with an appreciation that there is always more to learn.

For each topic, we attempt to convey it at three levels:

How to reason about systems. We describe core systems concepts, such as protection, concurrency, resource scheduling, virtualization, and storage, and we provide practice applying these concepts in various situations. In our view, this provides the biggest long-term payoff to students, as they are likely to need to apply these concepts in their work throughout their career, almost regardless of what project they end up working on.

Power tools. We introduce students to a number of abstractions that they can apply in their work in industry immediately after graduation, and that we expect will continue to be useful for decades, such as sandboxing, protected procedure calls, threads, locks, condition variables, caching, checkpointing, and transactions.

Details of specific operating systems. We include numerous examples of how different operating systems work in practice. However, this material changes rapidly, and there is an order of magnitude more material than can be covered in a single semester-length course. The purpose of these examples is to illustrate how to use the operating systems principles and power tools to solve concrete problems. We do not attempt to provide a comprehensive description of Linux, OS X, or any other particular operating system.

The book is divided into five parts: an introduction (Chapter 1), kernels and processes (Chapters 2-3), concurrency, synchronization, and scheduling (Chapters 4-7), memory management (Chapters 8-10), and persistent storage (Chapters 11-14).

Introduction. The goal of Chapter 1 is to introduce the recurring themes found in the later chapters. We define some common terms, and we provide a bit of the history of the development of operating systems.

The Kernel Abstraction. Chapter 2 covers kernel-based process protection: the concept and implementation of executing a user program with restricted privileges. Given the increasing importance of computer security issues, we believe protected execution and safe transfer across privilege levels are worth treating in depth. We have broken the description into sections, to allow instructors to choose either a quick introduction to the concepts (up through Section 2.3), or a full treatment of the kernel implementation details down to the level of interrupt handlers. Some instructors start with concurrency, and cover kernels and kernel protection afterwards. While our textbook can be used that way, we have found that students benefit from a basic understanding of the role of operating systems in executing user programs, before introducing concurrency.

The Programming Interface. Chapter 3 is intended as an impedance match for students of differing backgrounds. Depending on student background, it can be skipped or covered in depth. The chapter covers the operating system from a programmer’s perspective: process creation and management, device-independent input/output, interprocess communication, and network sockets. Our goal is that students should understand at a detailed level what happens when a user clicks a link in a web browser, as the request is transferred through operating system kernels and user space processes at the client, server, and back again. This chapter also covers the organization of the operating system itself: how device drivers and the hardware abstraction layer work in a modern operating system; the difference between a monolithic and a microkernel operating system; and how policy and mechanism are separated in modern operating systems.

Concurrency and Threads. Chapter 4 motivates and explains the concept of threads. Because of the increasing importance of concurrent programming, and its integration with modern programming languages like Java, many students have been introduced to multi-threaded programming in an earlier class. This is a bit dangerous, as students at this stage are prone to writing programs with race conditions, problems that may or may not be discovered with testing. Thus, the goal of this chapter is to provide a solid conceptual framework for understanding the semantics of concurrency, as well as how concurrent threads are implemented in both the operating system kernel and in user-level libraries. Instructors needing to go more quickly can omit these implementation details.

Synchronization. Chapter 5 discusses the synchronization of multi-threaded programs, a central part of all operating systems and increasingly important in many other contexts. Our approach is to describe one effective method for structuring concurrent programs (based on Mesa monitors), rather than to attempt to cover several different approaches. In our view, it is more important for students to master one methodology. Monitors are a particularly robust and simple one, capable of implementing most concurrent programs efficiently. The implementation of synchronization primitives should be included if there is time, so students see that there is no magic.

Multi-Object Synchronization. Chapter 6 discusses advanced topics in concurrency: specifically, the twin challenges of multiprocessor lock contention and deadlock. This material is increasingly important for students working on multicore systems, but some courses may not have time to cover it in detail.

Scheduling. This chapter covers the concepts of resource allocation in the specific context of processor scheduling. With the advent of data center computing and multicore architectures, the principles and practice of resource allocation have renewed importance. After a quick tour through the tradeoffs between response time and throughput for uniprocessor scheduling, the chapter covers a set of more advanced topics in affinity and multiprocessor scheduling, power-aware and deadline scheduling, as well as basic queueing theory and overload management. We conclude these topics by walking students through a case study of server-side load management.

Address Translation. Chapter 8 explains mechanisms for hardware and software address translation. The first part of the chapter covers how hardware and operating systems cooperate to provide flexible, sparse address spaces through multi-level segmentation and paging. We then describe how to make memory management efficient with translation lookaside buffers (TLBs) and virtually addressed caches. We consider how to keep TLBs consistent when the operating system makes changes to its page tables. We conclude with a discussion of modern software-based protection mechanisms such as those found in the Microsoft Common Language Runtime and Google’s Native Client.

Caching and Virtual Memory. Caches are central to many different types of computer systems. Most students will have seen the concept of a cache in an earlier class on machine structures. Thus, our goal is to cover the theory and implementation of caches: when they work and when they do not, as well as how they are implemented in hardware and software. We then show how these ideas are applied in the context of memory-mapped files and demand-paged virtual memory.

Advanced Memory Management. Address translation is a powerful tool in system design, and we show how it can be used for zero copy I/O, virtual machines, process checkpointing, and recoverable virtual memory. As this is more advanced material, it can be skipped by those classes pressed for time.

File Systems: Introduction and Overview. Chapter 11 frames the file system portion of the book, starting top down with the challenges of providing a useful file abstraction to users. We then discuss the UNIX file system interface, the major internal elements inside a file system, and how disk device drivers are structured.

Storage Devices. Chapter 12 surveys block storage hardware, specifically magnetic disks and flash memory. The last two decades have seen rapid change in storage technology affecting both application programmers and operating systems designers; this chapter provides a snapshot for students, as a building block for the next two chapters. If students have previously seen this material, this chapter can be skipped.

Files and Directories. Chapter 13 discusses file system layout on disk. Rather than survey all possible file layouts (something that changes rapidly over time), we use file systems as a concrete example of mapping complex data structures onto block storage devices.

Reliable Storage. Chapter 14 explains the concept and implementation of reliable storage, using file systems as a concrete example. Starting with the ad hoc techniques used in early file systems, the chapter explains checkpointing and write ahead logging as alternate implementation strategies for building reliable storage, and it discusses how redundancy such as checksums and replication are used to improve reliability and availability.

We welcome and encourage suggestions for how to improve the presentation of the material; please send any comments to the publisher’s website, [email protected].

Acknowledgements

We have been incredibly fortunate to have the help of a large number of people in the conception, writing, editing, and production of this book.

We started on the journey of writing this book over dinner at the USENIX NSDI conference in 2010. At the time, we thought perhaps it would take us the summer to complete the first version and perhaps a year before we could declare ourselves done. We were very wrong! It is no exaggeration to say that it would have taken us a lot longer without the help we have received from the people we mention below.

Perhaps most important have been our early adopters, who have given us enormously useful feedback as we have put together this edition:

Carnegie-Mellon: David Eckhardt and Garth Gibson
Clarkson: Jeanna Matthews
Cornell: Gun Sirer
ETH Zurich: Mothy Roscoe
New York University: Lakshmi Subramanian
Princeton University: Kai Li
Saarland University: Peter Druschel
Stanford University: John Ousterhout
University of California Riverside: Harsha Madhyastha
University of California Santa Barbara: Ben Zhao
University of Maryland: Neil Spring
University of Michigan: Pete Chen
University of Southern California: Ramesh Govindan
University of Texas-Austin: Lorenzo Alvisi
University of Toronto: Ding Yuan
University of Washington: Gary Kimura and Ed Lazowska

In developing our approach to teaching operating systems, both before we started writing and afterwards as we tried to put our thoughts to paper, we made extensive use of lecture notes and slides developed by other faculty. Of particular help were the materials created by Pete Chen, Peter Druschel, Steve Gribble, Eddie Kohler, John Ousterhout, Mothy Roscoe, and Geoff Voelker. We thank them all.

Our illustrator for the second edition, Cameron Neat, has been a joy to work with. We would also like to thank Simon Peter for running the multiprocessor experiments introducing Chapter 6.

We are also grateful to Lorenzo Alvisi, Adam Anderson, Pete Chen, Steve Gribble, Sam Hopkins, Ed Lazowska, Harsha Madhyastha, John Ousterhout, Mark Rich, Mothy Roscoe, Will Scott, Gun Sirer, Ion Stoica, Lakshmi Subramanian, and John Zahorjan for their helpful comments and suggestions as to how to improve the book.

We thank Josh Berlin, Marla Dahlin, Rasit Eskicioglu, Sandy Kaplan, John Ousterhout, Whitney Schmidt, and Mike Walfish for helping us identify and correct grammatical or technical bugs in the text.

We thank Jeff Dean, Garth Gibson, Mark Oskin, Simon Peter, Dave Probert, Amin Vahdat, and Mark Zbikowski for their help in explaining the internal workings of some of the commercial systems mentioned in this book.

We would like to thank Dave Wetherall, Dan Weld, Mike Walfish, Dave Patterson, Olav Kvern, Dan Halperin, Armando Fox, Robin Briggs, Katya Anderson, Sandra Anderson, Lorenzo Alvisi, and William Adams for their help and advice on textbook economics and production.

The Helen Riaboff Whiteley Center as well as Don and Jeanne Dahlin were kind enough to lend us a place to escape when we needed to get chapters written.

Finally, we thank our families, our colleagues, and our students for supporting us in this larger-than-expected effort.

III Memory Management

8. Address Translation

There is nothing wrong with your television set. Do not attempt to adjust the picture. We are controlling transmission. If we wish to make it louder, we will bring up the volume. If we wish to make it softer, we will tune it to a whisper. We will control the horizontal. We will control the vertical. We can roll the image, make it flutter. We can change the focus to a soft blur or sharpen it to crystal clarity. For the next hour, sit quietly and we will control all that you see and hear. We repeat: there is nothing wrong with your television set.
Opening narration, The Outer Limits

The promise of virtual reality is compelling. Who wouldn’t want the ability to travel anywhere without leaving the holodeck? Of course, the promise is far from becoming a reality. In theory, by adjusting the inputs to all of your senses in response to your actions, a virtual reality system could perfectly set the scene. However, your senses are not so easily controlled. We might soon be able to provide an immersive environment for vision, but balance, hearing, taste, and smell will take a lot longer. Touch, proprioception (the sense of being near something else), and g-forces are even farther off. Get a single one of these wrong and the illusion disappears.

Can we create a virtual reality environment for computer programs? We have already seen an example of this with the UNIX I/O interface, where the program does not need to know, and sometimes cannot tell, if its inputs and outputs are files, devices, or other processes.

In the next three chapters, we take this idea a giant step further. An amazing number of advanced system features are enabled by putting the operating system in control of address translation, the conversion from the memory address the program thinks it is referencing to the physical location of that memory cell. From the programmer’s perspective, address translation occurs transparently: the program behaves correctly despite the fact that its memory is stored somewhere completely different from where it thinks it is stored.

You were probably taught in some early programming class that a memory address is just an address. A pointer in a linked list contains the actual memory address of what it is pointing to. A jump instruction contains the actual memory address of the next instruction to be executed. This is a useful fiction! The programmer is often better off not thinking about how each memory reference is converted into the data or instruction being referenced. In practice, there is quite a lot of activity happening beneath the covers.

Address translation is a simple concept, but it turns out to be incredibly powerful. What can an operating system do with address translation? This is only a partial list:

Process isolation. As we discussed in Chapter 2, protecting the operating system kernel and other applications against buggy or malicious code requires the ability to limit memory references by applications. Likewise, address translation can be used by applications to construct safe execution sandboxes for third party extensions.

Interprocess communication. Often processes need to coordinate with each other, and an efficient way to do that is to have the processes share a common memory region.

Shared code segments. Instances of the same program can share the program’s instructions, reducing their memory footprint and making the processor cache more efficient. Likewise, different programs can share common libraries.

Program initialization. Using address translation, we can start a program running before all of its code is loaded into memory from disk.

Efficient dynamic memory allocation. As a process grows its heap, or as a thread grows its stack, we can use address translation to trap to the kernel to allocate memory for those purposes only as needed.

Cache management. As we will explain in the next chapter, the operating system can arrange how programs are positioned in physical memory to improve cache efficiency, through a system called page coloring.

Program debugging. The operating system can use memory translation to prevent a buggy program from overwriting its own code region; by catching pointer errors earlier, it makes them much easier to debug. Debuggers also use address translation to install data breakpoints, to stop a program when it references a particular memory location.

Efficient I/O. Server operating systems are often limited by the rate at which they can transfer data to and from the disk and the network. Address translation enables data to be safely transferred directly between user-mode applications and I/O devices.

Memory mapped files. A convenient and efficient abstraction for many applications is to map files into the address space, so that the contents of the file can be directly referenced with program instructions.

Virtual memory. The operating system can provide applications the abstraction of more memory than is physically present on a given computer.

Checkpointing and restart. The state of a long-running program can be periodically checkpointed so that if the program or system crashes, it can be restarted from the saved state. The key challenge is to be able to perform an internally consistent checkpoint of the program’s data while the program continues to run.

Persistent data structures. The operating system can provide the abstraction of a persistent region of memory, where changes to the data structures in that region survive program and system crashes.

Process migration. An executing program can be transparently moved from one server to another, for example, for load balancing.

Information flow control. An extra layer of security is to verify that a program is not sending your private data to a third party; e.g., a smartphone application may need access to your phone list, but it shouldn’t be allowed to transmit that data. Address translation can be the basis for managing the flow of information into and out of a system.

Distributed shared memory. We can transparently turn a network of servers into a large-scale shared-memory parallel computer using address translation.
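Of the services in the list above, memory mapped files are the easiest to try from user level. The following is a minimal sketch using Python's standard mmap module; the scratch file and its contents are made up for illustration, and the sketch shows only the programmer's view of the abstraction, not how the kernel implements it.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (the generated file name is not significant).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello, address translation")

# Map the whole file; a length of 0 means "the current size of the file".
with mmap.mmap(fd, 0) as view:
    first = bytes(view[0:5])     # read file contents with ordinary indexing
    view[0:5] = b"HELLO"         # update the file by storing into memory
    view.flush()                 # ask for the change to be written back

os.close(fd)
with open(path, "rb") as f:
    contents = f.read()          # re-read through the normal file interface
os.remove(path)
```

After the mapping is closed, reading the file back through the ordinary file interface shows the bytes that were stored through memory.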

In this chapter, we focus on the mechanisms needed to implement address translation, as that is the foundation of all of these services. We discuss how the operating system and applications use the mechanisms to provide these services in the following two chapters.

For runtime efficiency, most systems have specialized hardware to do address translation; this hardware is managed by the operating system kernel. In some systems, however, the translation is provided by a trusted compiler, linker or byte-code interpreter. In other systems, the application does the pointer translation as a way of managing the state of its own data structures. In still other systems, a hybrid model is used where addresses are translated both in software and hardware. The choice is often an engineering tradeoff between performance, flexibility, and cost. However, the functionality provided is often the same regardless of the mechanism used to implement the translation. In this chapter, we will cover a range of hardware and software mechanisms.

Chapter roadmap:

Address Translation Concept. We start by providing a conceptual framework for understanding both hardware and software address translation. (Section 8.1)

Flexible Address Translation. We focus first on hardware address translation; we ask how can we design the hardware to provide maximum flexibility to the operating system kernel? (Section 8.2)

Efficient Address Translation. The solutions we present will seem flexible but terribly slow. We next discuss mechanisms that make address translation much more efficient, without sacrificing flexibility. (Section 8.3)

Software Protection. Increasingly, software compilers and runtime interpreters are using address translation techniques to implement operating system functionality. What changes when the translation is in software rather than in hardware? (Section 8.4)

8.1 Address Translation Concept

Figure 8.1: Address translation in the abstract. The translator converts (virtual) memory addresses generated by the program into physical memory addresses.

Considered as a black box, address translation is a simple function, illustrated in Figure 8.1. The translator takes each instruction and data memory reference generated by a process, checks whether the address is legal, and converts it to a physical memory address that can be used to fetch or store instructions or data. The data itself (whatever is stored in memory) is returned as is; it is not transformed in any way. The translation is usually implemented in hardware, and the operating system kernel configures the hardware to accomplish its aims.

The task of this chapter is to fill in the details about how that black box works. If we asked you right now how you might implement it, your first several guesses would probably be on the mark. If you said we could use an array, a tree, or a hash table, you would be right: all of those approaches have been taken by real systems.
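To make the black box concrete, here is a minimal sketch of the hash table guess, written in Python. It is an illustration under stated assumptions rather than a description of any real system: the names translate, TranslationFault, and PAGE_SIZE are hypothetical, and we assume memory is translated in fixed-size 4096-byte chunks.

```python
PAGE_SIZE = 4096  # assumed granularity of translation, chosen for illustration


class TranslationFault(Exception):
    """Raised when a process references an address it has no mapping for."""


def translate(table, vaddr):
    """Map a virtual address to a physical address, or raise if it is illegal."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    if vpage not in table:                    # the legality check
        raise TranslationFault(hex(vaddr))
    return table[vpage] * PAGE_SIZE + offset  # same offset, different chunk


# Hypothetical process: virtual chunks 0 and 1 live in physical chunks 7 and 3.
table = {0: 7, 1: 3}
physical = translate(table, 0x1010)           # virtual chunk 1, offset 0x10
```

Note that the function only converts the address; whatever is stored at the resulting physical location is fetched or stored unchanged.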

Given that a number of different implementations are possible, how should we evaluate the alternatives? Here are some goals we might want out of a translation box; the design we end up with will depend on how we balance among these various goals.

Memory protection. We need the ability to limit the access of a process to certain regions of memory, e.g., to prevent it from accessing memory not owned by the process. Often, however, we may want to limit access of a program to its own memory, e.g., to prevent a pointer error from overwriting the code region or to cause a trap to the debugger when the program references a specific data location.

Memory sharing. We want to allow multiple processes to share selected regions of memory. These shared regions can be large (e.g., if we are sharing a program’s code segment among multiple processes executing the same program) or relatively small (e.g., if we are sharing a common library, a file, or a shared data structure).

Flexible memory placement. We want to allow the operating system the flexibility to place a process (and each part of a process) anywhere in physical memory; this will allow us to pack physical memory more efficiently. As we will see in the next chapter, flexibility in assigning process data to physical memory locations will also enable us to make more effective use of on-chip caches.

Sparse addresses. Many programs have multiple dynamic memory regions that can change in size over the course of the execution of the program: the heap for data objects, a stack for each thread, and memory mapped files. Modern processors have 64-bit address spaces, allowing each dynamic object ample room to grow as needed, but making the translation function more complex.

Runtime lookup efficiency. Hardware address translation occurs on every instruction fetch and every data load and store. It would be impractical if a lookup took, on average, much longer to execute than the instruction itself. At first, many of the schemes we discuss will seem wildly impractical! We will discuss ways to make even the most convoluted translation systems efficient.

Compact translation tables. We also want the space overhead of translation to be minimal; any data structures we need should be small compared to the amount of physical memory being managed.

Portability. Different hardware architectures make different choices as to how they implement translation; if an operating system kernel is to be easily portable across multiple processor architectures, it needs to be able to map from its (hardware-independent) data structures to the specific capabilities of each architecture.
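The first two goals in the list above, protection and sharing, can be sketched together. The Python below is a hypothetical model, not any particular hardware design: the class AddressSpace, the region numbers, and the permission letters 'r', 'w', and 'x' are all assumptions made for illustration.

```python
class ProtectionFault(Exception):
    """Raised on an access that the mapping does not permit."""


class AddressSpace:
    """Per-process map: virtual region number -> (physical region, permissions)."""

    def __init__(self):
        self.regions = {}

    def map(self, vregion, pregion, perms):
        self.regions[vregion] = (pregion, set(perms))

    def lookup(self, vregion, access):
        """Return the physical region if `access` ('r', 'w' or 'x') is allowed."""
        if vregion not in self.regions:
            raise ProtectionFault(f"region {vregion} is not mapped")
        pregion, perms = self.regions[vregion]
        if access not in perms:
            raise ProtectionFault(f"'{access}' not permitted on region {vregion}")
        return pregion


# Two processes running the same program share physical region 5 for code,
# but each gets its own private, writable data region.
a, b = AddressSpace(), AddressSpace()
for space, data_region in ((a, 8), (b, 9)):
    space.map(0, 5, "rx")              # shared code: read and execute only
    space.map(1, data_region, "rw")    # private data
```

In this sketch a stray store into region 0 raises a fault instead of silently overwriting the code, which is the pointer-error case the memory protection goal describes.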

We will end up with a fairly complex address translation mechanism, and so our discussion will start with the simplest possible mechanisms and add functionality only as needed. It will be helpful during the discussion for you to keep in mind the two views of memory: the process sees its own memory, using its own addresses. We will call these virtual addresses, because they do not necessarily correspond to any physical reality. By contrast, to the memory system, there are only physical addresses: real locations in memory. From the memory system perspective, it is given physical addresses and it does lookups and stores values. The translation mechanism converts between the two views: from a virtual address to a physical memory address.

Address translation in linkers and loaders

Even without the kernel-user boundary, multiprogramming requires some form of address translation. On a multiprogramming system, when a program is compiled, the compiler does not know which regions of physical memory will be in use by other applications; it cannot control where in physical memory the program will land. The machine instructions for a program contain both relative and absolute addresses; relative addresses, such as to branch forward or backward a certain number of instructions, continue to work regardless of where in memory the program is located. However, some instructions contain absolute addresses, such as to load a global variable or to jump to the start of a procedure. These will stop working unless the program is loaded into memory exactly where the compiler expects it to go.

Before hardware translation became commonplace, early operating systems dealt with this issue by using a relocating loader for copying programs into memory. Once the operating system picked an empty region of physical memory for the program, the loader would modify any instructions in the program that used an absolute address. To simplify the implementation, there was a table at the beginning of the executable image that listed all of the absolute addresses used in the program. In modern systems, this is called a symbol table.

Today, we still have something similar. Complex programs often have multiple files, each of which can be compiled independently and then linked together to form the executable image. When the compiler generates the machine instructions for a single file, it cannot know where in the executable this particular file will go. Instead, the compiler generates a symbol table at the beginning of each compiled file, indicating which values will need to be modified when the individual files are assembled together.

Most commercial operating systems today support the option of dynamic linking, taking the notion of a relocating loader one step further. With a dynamically linked library (DLL), a library is linked into a running program on demand, when the program first calls into the library. We will explain in a bit how the code for a DLL can be shared between multiple different processes, but the linking procedure is straightforward. A table of valid entry points into the DLL is kept by the compiler; the calling program indirects through this table to reach the library routine.

8.2 Towards Flexible Address Translation

Our discussion of hardware address translation is divided into two steps. First, we put the issue of lookup efficiency aside, and instead consider how best to achieve the other goals listed above: flexible memory assignment, space efficiency, fine-grained protection and sharing, and so forth. Once we have the features we want, we will then add mechanisms to gain back lookup efficiency.

Figure 8.2: Address translation with base and bounds registers. The virtual address is added to the base to generate the physical address; the bound register is checked against the virtual address to prevent a process from reading or writing outside of its allocated memory region.

In Chapter 2, we illustrated the notion of hardware memory protection using the simplest hardware imaginable: base and bounds. The translation box consists of two extra registers per process. The base register specifies the start of the process’s region of physical memory; the bound register specifies the extent of that region. If the base register is added to every address generated by the program, then we no longer need a relocating loader: the virtual addresses of the program start from 0 and go to bound, and the physical addresses start from base and go to base + bound. Figure 8.2 shows an example of base and bounds translation. Since physical memory can contain several processes, the kernel resets the contents of the base and bounds registers on each process context switch to the appropriate values for that process.
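As a rough illustration, base and bounds translation can be sketched in a few lines of Python. The function name, the exception class, and the register values below are assumptions made for this sketch; real hardware performs the check and the addition in parallel on every memory reference.

```python
class ProtectionFault(Exception):
    """Stands in for the hardware exception raised on an out-of-bounds access."""


def translate_base_bounds(vaddr: int, base: int, bound: int) -> int:
    # The bound register is checked against the virtual address.
    if not 0 <= vaddr < bound:
        raise ProtectionFault(f"virtual address {vaddr:#x} is outside [0, {bound:#x})")
    # Relocation is a single addition of the base register.
    return base + vaddr


# Hypothetical process loaded at physical address 0x4000 with a 0x1000-byte region.
physical = translate_base_bounds(0x10, base=0x4000, bound=0x1000)
```

On a context switch, the kernel would simply call the same function with the next process's base and bound values.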

Base and bounds translation is both simple and fast, but it lacks many of the features needed to support modern programs. Base and bounds translation supports only coarse-grained protection at the level of the entire process; it is not possible to prevent a program from overwriting its own code, for example. It is also difficult to share regions of memory between two processes. Since the memory for a process needs to be contiguous, supporting dynamic memory regions, such as for heaps, thread stacks, or memory mapped files, becomes difficult to impossible.

8.2.1 Segmented Memory

Figure 8.3: Address translation with a segment table. The virtual address has two components: a segment number and a segment offset. The segment number indexes into the segment table to locate the start of the segment in physical memory. The bound register is checked against the segment offset to prevent a process from reading or writing outside of its allocated memory region. Processes can have restricted rights to certain segments, e.g., to prevent writes to the code segment.

Many of the limitations of base and bounds translation can be remedied with a small change: instead of keeping only a single pair of base and bounds registers per process, the hardware can support an array of pairs of base and bounds registers, for each process. This is called segmentation. Each entry in the array controls a portion, or segment, of the virtual address space. The physical memory for each segment is stored contiguously, but different segments can be stored at different locations. Figure 8.3 shows segment translation in action. The high order bits of the virtual address are used to index into the array; the rest of the address is then treated as above: added to the base and checked against the bound stored at that index. In addition, the operating system can assign different segments different permissions, e.g., to allow execute-only access to code and read-write access to data. Although four segments are shown in the figure, in general the number of segments is determined by the number of bits for the segment number that are set aside in the virtual address.
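The following Python sketch shows one way the segment lookup could work. The 2-bit segment number and 30-bit offset split, the class names, and the example table contents are assumptions chosen for illustration, not the layout of any particular processor.

```python
from dataclasses import dataclass

OFFSET_BITS = 30  # assumed split of a 32-bit address: 2 segment bits + 30 offset bits


class SegmentationFault(Exception):
    """Stands in for the hardware exception on an illegal reference."""


@dataclass
class Segment:
    base: int         # start of the segment in physical memory
    bound: int        # length of the segment in bytes
    perms: frozenset  # subset of {"r", "w", "x"}


def translate_segmented(vaddr: int, table: list, access: str) -> int:
    seg_num = vaddr >> OFFSET_BITS              # high-order bits index the table
    offset = vaddr & ((1 << OFFSET_BITS) - 1)   # the rest is the segment offset
    seg = table[seg_num] if seg_num < len(table) else None
    if seg is None or offset >= seg.bound:
        raise SegmentationFault(f"address {vaddr:#x} is outside any segment")
    if access not in seg.perms:
        raise SegmentationFault(f"{access!r} access denied at {vaddr:#x}")
    return seg.base + offset


# Hypothetical process: code in segment 0, nothing in segment 1, data in segment 2.
table = [
    Segment(base=0x4000, bound=0x700, perms=frozenset("rx")),
    None,
    Segment(base=0x0, bound=0x500, perms=frozenset("rw")),
]
data_addr = translate_segmented((2 << OFFSET_BITS) | 0x10, table, "r")
```

The per-segment permission set is what allows a write to the code segment to be rejected even though the address itself is in bounds.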

It should seem odd to you that segmented memory has gaps; program memory is no longer a single contiguous region, but instead it is a set of regions. Each different segment starts at a new segment boundary. For example, code and data are not immediately adjacent to each other in either the virtual or physical address space.

What happens if a program branches into or tries to load data from one of these gaps? The hardware will generate an exception, trapping into the operating system kernel. On UNIX systems, this is still called a segmentation fault, that is, a reference outside of a legal segment of memory. How does a program keep from wandering into one of these gaps?

Correct programs will not generate references outside of valid memory. Put another way, trying to execute code or reading data that does not exist is probably an indication that the program has a bug in it.

Figure 8.4: Two processes sharing a code segment, but with separate data and stack segments. In this case, each process uses the same virtual addresses, but these virtual addresses map to either the same region of physical memory (if code) or different regions of physical memory (if data).

Although simple to implement and manage, segmented memory is both remarkably powerful and widely used. For example, the x86 architecture is segmented (with some enhancements that we will describe later). With segments, the operating system can allow processes to share some regions of memory while keeping other regions protected. For example, two processes can share a code segment by setting up an entry in their segment tables to point to the same region of physical memory, that is, to use the same base and bounds. The processes can share the same code while working off different data, by setting up the segment table to point to different regions of physical memory for the data segment. We illustrate this in Figure 8.4.

Likewise, shared library routines, such as a graphics library, can be placed into a segment and shared between processes. As before, the library data would be in a separate, non-shared segment. This is frequently done in modern operating systems with dynamically linked libraries. A practical issue is that different processes may load different numbers of libraries, and so may assign the same library a different segment number. Depending on the processor architecture, sharing can still work, if the library code uses segment-local addresses, addresses that are relative to the current segment.

UNIX fork and copy-on-write

In Chapter 3, we described the UNIX fork system call. UNIX creates a new process by making a complete copy of the parent process; the parent process and the child process are identical except for the return value from fork. The child process can then set up its I/O and eventually use the UNIX exec system call to run a new program. We promised at the time we would explain how this can be done efficiently.

With segments, this is now possible. To fork a process, we can simply make a copy of the parent’s segment table; we do not need to copy any of its physical memory. Of course, we want the child to be a copy of the parent, and not just point to the same memory as the parent. If the child changes some data, it should change only its copy, and not its parent’s data. On the other hand, most of the time, the child process in UNIX fork simply calls UNIX exec; the shared data is there as a programming convenience.

We can make this work efficiently by using an idea called copy-on-write. During the fork, all of the segments shared between the parent and child process are marked “read-only” in both segment tables. If either side modifies data in a segment, an exception is raised and a full memory copy of that segment is made at that time. In the common case, the child process modifies only its stack before calling UNIX exec, and if so, only the stack needs to be physically copied.
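A minimal sketch of copy-on-write over a segment table follows, written in Python. The names SegEntry, fork, and store, and the extra cow flag, are assumptions for this sketch; a Python bytearray stands in for the physical memory behind each segment, and the exception path is modeled as an ordinary branch.

```python
from dataclasses import dataclass


@dataclass
class SegEntry:
    memory: bytearray   # stands in for the physical memory backing the segment
    writable: bool      # the permission the hardware would enforce
    cow: bool = False   # kernel bookkeeping: read-only only because of copy-on-write


def fork(parent_table: list) -> list:
    """Copy the segment table, not the memory; share every segment read-only."""
    child_table = []
    for entry in parent_table:
        if entry.writable:
            entry.writable, entry.cow = False, True
        child_table.append(SegEntry(entry.memory, entry.writable, entry.cow))
    return child_table


def store(table: list, seg_num: int, offset: int, value: int) -> None:
    entry = table[seg_num]
    if not entry.writable:
        if not entry.cow:
            raise PermissionError("write to a genuinely read-only segment")
        # Copy-on-write fault: make a private copy of the whole segment.
        # (This sketch always copies; a kernel could skip the copy when
        # no other process still shares the segment.)
        entry.memory = bytearray(entry.memory)
        entry.writable, entry.cow = True, False
    entry.memory[offset] = value


parent = [SegEntry(bytearray(b"code"), writable=False),
          SegEntry(bytearray(b"data"), writable=True)]
child = fork(parent)
store(child, 1, 0, ord("D"))   # only now is the data segment copied
```

After the store, the child sees its own modified data segment while the parent's copy is unchanged, and the code segment is still shared.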

We can also use segments for interprocess communication, if processes are given read and write permission to the same segment. Multics, an operating system from the 1960s that contained many of the ideas we now find in Microsoft’s Windows 7, Apple’s Mac OS X, and Linux, made extensive use of segmented memory for interprocess sharing. In Multics, a segment was allocated for every data structure, allowing fine-grained protection and sharing between processes. Of course, this made the segment table pretty large! More modern systems tend to use segments only for coarser-grained memory regions, such as the code and data for an entire shared library, rather than for each of the data structures within the library.

As a final example of the power of segments, they enable the efficient management of dynamically allocated memory. When an operating system reuses memory or disk space that had previously been used, it must first zero out the contents of the memory or disk. Otherwise, private data from one application could inadvertently leak into another, potentially malicious, application. For example, you could enter a password into one web site, say for a bank, and then exit the browser. However, if the underlying physical memory used by the browser is then re-assigned to a new process, then the password could be leaked to a malicious web site.

Of course, we only want to pay the overhead of zeroing memory if it will be used. This is particularly an issue for dynamically allocated memory on the heap and stack. It is not clear when the program starts how much memory it will use; the heap could be anywhere from a few kilobytes to several gigabytes, depending on the program. The operating system can address this using zero-on-reference. With zero-on-reference, the operating system allocates a memory region for the heap, but zeroes only the first few kilobytes. Rather than zeroing the rest up front, it sets the bound register in the segment table to limit the program to just the zeroed part of memory. If the program expands its heap, it will take an exception, and the operating system kernel can zero out additional memory before resuming execution.
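Zero-on-reference can be sketched as follows. The class name HeapSegment, the 4096-byte chunk size, and the use of a bytearray full of stale 0xFF bytes to model reused memory are all assumptions for this illustration; the hardware exception is modeled as a loop inside load.

```python
ZERO_CHUNK = 4096  # assumed number of bytes the kernel zeroes per fault


class HeapSegment:
    def __init__(self, region: bytearray):
        self.region = region   # reused physical memory; may hold stale data
        self.bound = 0         # bytes the program is currently allowed to touch
        self._zero_more()      # zero only the first chunk up front

    def _zero_more(self) -> None:
        end = min(self.bound + ZERO_CHUNK, len(self.region))
        self.region[self.bound:end] = bytes(end - self.bound)
        self.bound = end       # raise the bound to cover the newly zeroed part

    def load(self, offset: int) -> int:
        if not 0 <= offset < len(self.region):
            raise IndexError("address is outside the heap segment")
        while offset >= self.bound:
            # Models the exception: the kernel zeroes more memory, then resumes.
            self._zero_more()
        return self.region[offset]


stale = bytearray(b"\xff" * 16384)   # leftover contents from a previous owner
heap = HeapSegment(stale)
first_bound = heap.bound
value = heap.load(5000)              # beyond the bound: triggers zeroing
```

The program never observes the stale 0xFF bytes, yet memory past the highest address it has touched is never zeroed.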

Given all these advantages, why not stop here? The principal downside of segmentation is the overhead of managing a large number of variable size and dynamically growing memory segments. Over time, as processes are created and finish, physical memory will be divided into regions that are in use and regions that are not, that is, available to be allocated to a new process. These free regions will be of varying sizes. When we create a new segment, we will need to find a free spot for it. Should we put it in the smallest open region where it will fit? The largest open region?

However we choose to place new segments, as more memory becomes allocated, the operating system may reach a point where there is enough free space for a new segment, but the free space is not contiguous. This is called external fragmentation. The operating system is free to compact memory to make room without affecting applications, because virtual addresses are unchanged when we relocate a segment in physical memory. Even so, compaction can be costly in terms of processor overhead: a typical server configuration would take roughly a second to compact its memory.

All this becomes even more complex when memory segments can grow. How much memory should we set aside for a program’s heap? If we put the heap segment in a part of physical memory with lots of room, then we will have wasted memory if that program turns out to need only a small heap. If we do the opposite, putting the heap segment in a small chunk of physical memory, then we will need to copy it somewhere else if it changes size.

Figure 8.5: Logical view of page table address translation. Physical memory is split into page frames, with a page-size aligned block of virtual addresses assigned to each frame. Unused addresses are not assigned page frames in physical memory.

Figure 8.6: Address translation with a page table. The virtual address has two components: a virtual page number and an offset within the page. The virtual page number indexes into the page table to yield a page frame in physical memory. The physical address is the physical page frame from the page table, concatenated with the page offset from the virtual address. The operating system can restrict process access to certain pages, e.g., to prevent writes to pages containing instructions.

8.2.2 Paged Memory

An alternative to segmented memory is paged memory. With paging, memory is allocated in fixed-sized chunks called page frames. Address translation is similar to how it works with segmentation. Instead of a segment table whose entries contain pointers to variable-sized segments, there is a page table for each process whose entries contain pointers to page frames. Because page frames are fixed-sized and a power of two, the page table entries only need to provide the upper bits of the page frame address, so they are more compact. There is no need for a “bound” on the offset; the entire page in physical memory is allocated as a unit. Figure 8.6 illustrates address translation with paged memory.
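The lookup can be sketched in Python as shown here. The 4 KB page size, the list-based page table, and the use of None for an invalid entry are assumptions for this sketch. Note that the frame number and the offset are concatenated with a shift and an OR rather than added, and that no bound check on the offset is needed.

```python
OFFSET_BITS = 12               # assumed 4 KB pages
PAGE_SIZE = 1 << OFFSET_BITS


class PageFault(Exception):
    """Stands in for the hardware exception on an invalid page table entry."""


def translate_paged(vaddr: int, page_table: list) -> int:
    vpn = vaddr >> OFFSET_BITS         # virtual page number indexes the table
    offset = vaddr & (PAGE_SIZE - 1)   # offset is carried over unchanged
    frame = page_table[vpn] if vpn < len(page_table) else None
    if frame is None:
        raise PageFault(f"no page frame mapped for virtual page {vpn}")
    # The entry supplies only the upper bits of the physical address.
    return (frame << OFFSET_BITS) | offset


# Hypothetical process: virtual pages 0 and 2 are mapped, page 1 is unused.
page_table = [5, None, 2]
physical = translate_paged(0x0ABC, page_table)
```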

What will seem odd, and perhaps cool, about paging is that while a program thinks of its memory as linear, in fact its memory can be, and usually is, scattered throughout physical memory in a kind of abstract mosaic. The processor will execute one instruction after another using virtual addresses; its virtual addresses are still linear. However, the instruction located at the end of a page will be located in a completely different region of physical memory from the next instruction at the start of the next page. Data structures will appear to be contiguous using virtual addresses, but a large matrix may be scattered across many physical page frames.

An apt analogy is what happens when you shuffle several decks of cards together. A single process in its virtual address space sees the cards of a single deck in order. A different process sees a completely different deck, but it will also be in order. In physical memory, however, the decks of all the processes currently running will be shuffled together, apparently at random. The page tables are the magician’s assistant: able to instantly find the queen of hearts from among the shuffled decks.

Paging addresses the principal limitation of segmentation: free-space allocation is very straightforward. The operating system can represent physical memory as a bitmap, with each bit representing a physical page frame that is either free or in use. Finding a free frame is just a matter of finding an empty bit.
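A bitmap allocator of this kind might look like the following Python sketch. The class and method names are assumptions, and a single Python integer is used as the bit vector; a kernel would more likely scan an array of machine words.

```python
class FrameBitmap:
    """One bit per physical page frame: 1 means in use, 0 means free."""

    def __init__(self, num_frames: int):
        self.num_frames = num_frames
        self.bits = 0   # all frames start out free

    def allocate(self) -> int:
        # Finding a free frame is just a matter of finding an empty bit.
        for frame in range(self.num_frames):
            if not (self.bits >> frame) & 1:
                self.bits |= 1 << frame
                return frame
        raise MemoryError("no free page frames")

    def free(self, frame: int) -> None:
        self.bits &= ~(1 << frame)


bitmap = FrameBitmap(3)
frames = [bitmap.allocate() for _ in range(3)]
bitmap.free(1)
reused = bitmap.allocate()
```

Because every frame is the same size, any free frame satisfies any request, which is why no best-fit or worst-fit policy is needed.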

Sharing memory between processes is also convenient: we need to set the page table entry for each process sharing a page to point to the same physical page frame. For a large shared region that spans multiple page frames, such as a shared library, this may require setting up a number of page table entries. Since we need to know when to release memory when a process finishes, shared memory requires some extra bookkeeping to keep track of whether the shared page is still in use. The data structure for this is called a core map; it records information about each physical page frame such as which page table entries point to it.

Many of the optimizations we discussed under segmentation can also be done with paging. For copy-on-write, we need to copy the page table entries and set them to read-only; on a store to one of these pages, we can make a real copy of the underlying page frame before resuming the process. Likewise, for zero-on-reference, we can set the page table entry at the top of the stack to be invalid, causing a trap into the kernel. This allows us to extend the stack only as needed.

Page tables allow other features to be added. For example, we can start a program running before all of its code and data are loaded into memory. Initially, the operating system marks all of the page table entries for a new process as invalid; as pages are brought in from disk, it marks those pages as read-only (for code pages) or read-write (for data pages). Once the first few pages are in memory, however, the operating system can start execution of the program in user-mode, while the kernel continues to transfer the rest of the program’s code in the background. As the program starts up, if it happens to jump to a location that has not been loaded yet, the hardware will cause an exception, and the kernel can stall the program until that page is available. Further, the compiler can reorganize the program executable for more efficient startup, by coalescing the initialization pages into a few pages at the start of the program, thus overlapping initialization and loading the program from disk.

As another example, a data breakpoint is a request to stop the execution of a program when it references or modifies a particular memory location. It is helpful during debugging to know when a data structure has been changed, particularly when tracking down pointer errors. Data breakpoints are sometimes implemented with special hardware support, but they can also be implemented with page tables. For this, the page table entry containing the location is marked read-only. This causes the process to trap to the operating system on every change to the page; the operating system can then check if the instruction causing the exception affected the specific location or not.

A downside of paging is that while the management of physical memory becomes simpler, the management of the virtual address space becomes more challenging. Compilers typically expect the execution stack to be contiguous (in virtual addresses) and of arbitrary size; each new procedure call assumes the memory for the stack is available. Likewise, the runtime library for dynamic memory allocation typically expects a contiguous heap. In a single-threaded process, we can place the stack and heap at opposite ends of the virtual address space, and have them grow towards each other, as shown in Figure 8.5. However, with multiple threads per process, we need multiple thread stacks, each with room to grow.

This becomes even more of an issue with 64-bit virtual address spaces. The size of the page table is proportional to the size of the virtual address space, not to the size of physical memory. The more sparse the virtual address space, the more overhead is needed for the page table. Most of the entries will be invalid, representing parts of the virtual address space that are not in use, but physical memory is still needed for all of those page table entries.

We can reduce the space taken up by the page table by choosing a larger page frame. How big should a page frame be? A larger page frame can waste space if a process does not use all of the memory inside the frame. This is called internal fragmentation. Fixed-size chunks are easier to allocate, but waste space if the entire chunk is not used. Unfortunately, this means that with paging, either pages are very large (wasting space due to internal fragmentation), or the page table is very large (wasting space), or both. For example, with 16 KB pages and a 64-bit virtual address space, the page offset takes 14 bits, leaving 50 bits for the virtual page number, so we might need 2^50 page table entries!
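The arithmetic behind that example is worth spelling out: a single-level page table needs one entry per virtual page, so the entry count is 2 raised to the number of address bits left over after the page offset. The helper name below is an assumption for this sketch.

```python
def flat_page_table_entries(virtual_address_bits: int, page_size_bytes: int) -> int:
    # For a power-of-two page size, bit_length() - 1 is log2(page size),
    # i.e., the number of bits consumed by the page offset.
    offset_bits = page_size_bytes.bit_length() - 1
    return 2 ** (virtual_address_bits - offset_bits)


# 16 KB pages (14 offset bits) in a 64-bit address space: 2**50 entries.
entries_64 = flat_page_table_entries(64, 16 * 1024)

# For comparison, 4 KB pages in a 32-bit address space: 2**20 entries.
entries_32 = flat_page_table_entries(32, 4 * 1024)
```

Quadrupling the page size removes only two bits from the exponent, which is why larger pages alone cannot make a flat table practical for a sparse 64-bit address space.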

8.2.3 Multi-Level Translation

If you were to design an efficient system for doing a lookup on a sparse keyspace, you probably would not pick a simple array. A tree or a hash table is more appropriate, and indeed, modern systems use both. We focus in this subsection on trees; we discuss hash tables afterwards.

Many systems use tree-based address translation, although the details vary from system to system, and the terminology can be a bit confusing. Despite the differences, the systems we are about to describe have similar properties. They support coarse and fine-grained memory protection and memory sharing, flexible memory placement, efficient memory allocation, and efficient lookup for sparse address spaces, even for 64-bit machines.

Almost all multi-level address translation systems use paging as the lowest level of the tree. The main differences between systems are in how they reach the page table at the leaf of the tree: whether using segments plus paging, or multiple levels of paging, or segments plus multiple levels of paging. There are several reasons for this:

Efficient memory allocation. By allocating physical memory in fixed-size page frames, management of free space can use a simple bitmap.

Efficient disk transfers. Hardware disks are partitioned into fixed-sized regions called sectors; disk sectors must be read or written in their entirety. By making the page size a multiple of the disk sector, we simplify transfers to and from memory, for loading programs into memory, reading and writing files, and in using the disk to simulate a larger memory than is physically present on the machine.

Efficient lookup. We will describe in the next section how we can use a cache called a translation lookaside buffer to make lookups fast in the common case; the translation buffer caches lookups on a page by page basis. Paging also allows the lookup tables to be more compact, especially important at the lowest level of the tree.

Efficient reverse lookup. Using fixed-sized page frames also makes it easy to implement the core map, to go from a physical page frame to the set of virtual addresses that share the same frame. This will be crucial for implementing the illusion of an infinite virtual memory in the next chapter.

Page-granularity protection and sharing. Typically, every table entry at every level of the tree will have its own access permissions, enabling both coarse-grained and fine-grained sharing, down to the level of the individual page frame.

Figure 8.7: Address translation with paged segmentation. The virtual address has three components: a segment number, a virtual page number within the segment, and an offset within the page. The segment number indexes into a segment table that yields the page table for that segment. The page number from the virtual address indexes into the page table (from the segment table) to yield a page frame in physical memory. The physical address is the physical page frame from the page table, concatenated with the page offset from the virtual address. The operating system can restrict access to an entire segment, e.g., to prevent writes to the code segment, or to an individual page, e.g., to implement copy-on-write.

Paged Segmentation

Let us start with a system with only two levels of a tree. With paged segmentation, memory is segmented, but instead of each segment table entry pointing directly to a contiguous region of physical memory, each segment table entry points to a page table, which in turn points to the memory backing that segment. The segment table entry “bound” describes the page table length, that is, the length of the segment in pages. Because paging is used at the lowest level, all segment lengths are some multiple of the page size. Figure 8.7 illustrates translation with paged segmentation.

Although segment tables are sometimes stored in special hardware registers, the page tables for each segment are quite a bit larger in aggregate, and so they are normally stored in physical memory. To keep the memory allocator simple, the maximum segment size is usually chosen to allow the page table for each segment to be a small multiple of the page size.

For example, with 32-bit virtual addresses and 4 KB pages, we might set aside the upper ten bits for the segment number, the next ten bits for the page number, and twelve bits for the page offset. In this case, if each page table entry is four bytes, the page table for each segment would exactly fit into one physical page frame.
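The 10/10/12 split in this example can be sketched in Python as follows. The function names, the dictionary used as the segment table, and the sample frame numbers are assumptions for illustration; the length of each per-segment page table plays the role of the segment's bound.

```python
SEG_BITS, PAGE_BITS, OFFSET_BITS = 10, 10, 12   # the split from the example
PTE_SIZE = 4                                    # bytes per page table entry

# 2**10 entries of 4 bytes each fill exactly one 4 KB page frame.
page_table_bytes = (1 << PAGE_BITS) * PTE_SIZE


def split_vaddr(vaddr: int) -> tuple:
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    seg = vaddr >> (OFFSET_BITS + PAGE_BITS)
    return seg, page, offset


def translate_paged_segment(vaddr: int, segment_table: dict) -> int:
    seg, page, offset = split_vaddr(vaddr)
    page_table = segment_table.get(seg)         # segment entry -> page table
    if page_table is None or page >= len(page_table):
        raise LookupError(f"address {vaddr:#x} is outside any segment")
    return (page_table[page] << OFFSET_BITS) | offset


# Hypothetical process with one four-page segment, number 3.
segment_table = {3: [7, 8, 9, 42]}
parts = split_vaddr(0x00C03ABC)
physical = translate_paged_segment(0x00C03ABC, segment_table)
```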

Multi-Level Paging

Figure 8.8: Address translation with three levels of page tables. The virtual address has four components: an index into each level of the page table and an offset within the physical page frame.

A nearly equivalent approach to paged segmentation is to use multiple levels of page tables. On the Sun Microsystems SPARC processor for example, there are three levels of page table. As shown in Figure 8.8, the top-level page table contains entries, each of which points to a second-level page table whose entries are pointers to third-level page tables. On the SPARC, as with most other systems that use multiple levels of page tables, each level of page table is designed to fit in a physical page frame. Only the top-level page table must be filled in; the lower levels of the tree are allocated only if those portions of the virtual address space are in use by a particular process. Access permissions can be specified at each level, and so sharing between processes is possible at each level.

Multi-Level Paged Segmentation

We can combine these two approaches by using a segmented memory where each segment is managed by a multi-level page table. This is the approach taken by the x86, for both its 32-bit and 64-bit addressing modes.

We describe the 32-bit case first. The x86 terminology differs slightly from what we have used here. The x86 has a per-process Global Descriptor Table (GDT), equivalent to a segment table. The GDT is stored in memory; each entry (descriptor) points to the (multi-level) page table for that segment along with the segment length and segment access permissions. To start a process, the operating system sets up the GDT and initializes a register, the Global Descriptor Table Register (GDTR), that contains the address and length of the GDT.

Because of its history, the x86 uses separate processor registers to specify the segment number (that is, the index into the GDT) and the virtual address for use by each instruction. For example, on the “32-bit” x86, there is both a segment number and 32 bits of virtual address within each segment. On the 64-bit x86, the virtual address within each segment is extended to 64 bits. Most applications only use a few segments, however, so the per-process segment table is usually short. The operating system kernel has its own segment table; this is set up to enable the kernel to access, with virtual addresses, all of the per-process and shared segments on the system.

For encoding efficiency, the segment register is often implicit as part of the instruction. For example, the x86 stack instructions such as push and pop assume the stack segment (the index stored in the stack segment register), branch instructions assume the code segment (the index stored in the code segment register), and so forth. As an optimization, whenever the x86 initializes a code, stack, or data segment register it also reads the GDT entry (that is, the top-level page table pointer and access permissions) into the processor, so the processor can go directly to the page table on each reference.

Many instructions also have an option to specify the segment index explicitly. For example, the ljmp, or long jump, instruction changes the program counter to a new segment number and offset within that segment.

For the 32-bit x86, the virtual address space within a segment has a two-level page table. The first 10 bits of the virtual address index the top level page table, called the page directory, the next 10 bits index the second level page table, and the final 12 bits are the offset within a page. Each page table entry takes four bytes and the page size is 4 KB, so the top-level page table and each second-level page table fit in a single physical page. The number of second-level page tables needed depends on the length of the segment; they are not needed to map empty regions of virtual address space. Both the top-level and second-level page table entries have permissions, so fine-grained protection and sharing is possible within a segment.

Today, the amount of memory per computer is often well beyond what 32 bits can address; for example, a high-end server could have two terabytes of physical memory. For the 64-bit x86, virtual addresses within a segment can be up to 64 bits. However, to simplify address translation, current processors only allow 48 bits of the virtual address to be used; this is sufficient to map 128 terabytes, using four levels of page tables. The lower levels of the page table tree are only filled in if that portion of the virtual address space is in use.

As an optimization, the 64-bit x86 has the option to eliminate one or two levels of the page table. Each physical page frame on the x86 is 4 KB. Each page of the fourth level page table maps 2 MB of data, and each page of the third level page table maps 1 GB of data. If the operating system places data such that the entire 2 MB covered by the fourth level page table is allocated contiguously in physical memory, then the page table entry one layer up can be marked to point directly to this region instead of to a page table. Likewise, a page of the third level page table can be omitted if the operating system allocates the process a 1 GB chunk of physical memory. In addition to saving space needed for page table mappings, this improves translation buffer efficiency, a point we will discuss in more detail in the next section.
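The 2 MB and 1 GB figures follow from simple arithmetic, sketched below in Python. The calculation assumes 8-byte page table entries, so that one 4 KB page of page table holds 512 entries; the text does not state the entry size, so treat that value as an assumption of this sketch.

```python
PAGE_SIZE = 4 * 1024
PTE_SIZE = 8                                 # assumed bytes per page table entry
ENTRIES_PER_TABLE = PAGE_SIZE // PTE_SIZE    # 512 entries, i.e., 9 index bits


def bytes_mapped_by_table(levels_above_data: int) -> int:
    """Memory covered by one page of page table, counted from the leaf upward."""
    return PAGE_SIZE * ENTRIES_PER_TABLE ** levels_above_data


fourth_level = bytes_mapped_by_table(1)   # 512 x 4 KB
third_level = bytes_mapped_by_table(2)    # 512 x 2 MB

# Four levels of 9 index bits plus a 12-bit page offset.
index_bits = ENTRIES_PER_TABLE.bit_length() - 1
virtual_bits = 4 * index_bits + 12
```

Under that assumption the four 9-bit indexes and the 12-bit offset account for exactly the 48 usable virtual address bits.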

8.2.4 Portability

The diversity of different translation mechanisms poses a challenge to the operating system designer. To be widely used, we want our operating system to be easily portable to a wide variety of different processor architectures. Even within a given processor family, such as an x86, there are a number of different variants that an operating system may need to support. Main memory density is increasing both the physical and virtual address space by almost a bit per year. In other words, for a multi-level page table to be able to map all of memory, an extra level of the page table is needed every decade just to keep up with the increasing size of main memory.

A further challenge is that the operating system often needs to keep two sets of books with respect to address translation. One set of books is the hardware view: the processor consults a set of segment and multi-level page tables to be able to correctly and securely execute instructions and load and store data. A different set of books is the operating system view of the virtual address space. To support features such as copy-on-write, zero-on-reference, and fill-on-reference, as well as other applications we will describe in later chapters, the operating system must keep track of additional information about each virtual page beyond what is stored in the hardware page table.

These software memory management data structures mirror, but are not identical to, the hardware structures, consisting of three parts:

List of memory objects. Memory objects are logical segments. Whether or not the underlying hardware is segmented, the kernel memory manager needs to keep track of which memory regions represent which underlying data, such as program code, library code, shared data between two or more processes, a copy-on-write region, or a memory-mapped file. For example, when a process starts up, the kernel can check the object list to see if the code is already in memory; likewise, when a process opens a library, it can check if it has already been linked by some other process. Similarly, the kernel can keep reference counts to determine which memory regions to reclaim on process exit.

Virtual to physical translation. On an exception, and during system call parameter copying, the kernel needs to be able to translate from a process’s virtual addresses to its physical locations. While the kernel could use the hardware page tables for this, the kernel also needs to keep track of whether an invalid page is truly invalid, or simply not loaded yet (in the case of fill-on-reference) or if a read-only page is truly read-only or just simulating a data breakpoint or a copy-on-write page.

Physical to virtual translation. We referred to this above as the core map. The operating system needs to keep track of the processes that map to a specific physical memory location, to ensure that when the kernel updates a page’s status, it can also update every page table entry that refers to that physical page.

The most interesting of these are the data structures used for the virtual to physical translation. For the software page table, we have all of the same options as before with respect to segmentation and multiple levels of paging, as well as some others. The software page table need not use the same structure as the underlying hardware page table; indeed, if the operating system is to be easily portable, the software data structures may be quite different from the underlying hardware.

Linux models the operating system’s internal address translation data structures after the x86 architecture of segments plus multi-level page tables. This has made porting Linux to new x86 architectures relatively easy, but porting Linux to other architectures somewhat more difficult.

A different approach, taken first in a research system called Mach and later in Apple OS X, is to use a hash table, rather than a tree, for the software translation data. For historical reasons, the use of a hash table for paged address translation is called an inverted page table. Particularly as we move to deeper multi-level page tables, using a hash table for translation can speed up translation.

With an inverted page table, the virtual page number is hashed into a table of size proportional to the number of physical page frames. Each entry in the hash table contains tuples of the form (in the figure, the physical page is implicit):

Figure 8.9: Address translation with a software hash table. The hardware page tables are omitted from the picture. The virtual page number is hashed; this yields a position in the hash table that indicates the physical page frame. The virtual page number must be checked against the contents of the hash entry to handle collisions and to check page access permissions.

As shown in Figure 8.9, if there is a match on both the virtual page number and the process ID, then the translation is valid. Some systems do a two stage lookup: they first map the virtual address to a memory object ID, and then do the hash table lookup on the relative virtual address within the memory object. If memory is mostly shared, this can save space in the hash table without unduly slowing the translation.

An inverted page table does need some way to handle hash collisions, when two virtual addresses map to the same hash table entry. Standard techniques, such as chaining or rehashing, can be used to handle collisions.
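As one possible illustration, the Python sketch below implements a software inverted page table that resolves collisions by chaining. The class name, the entry layout of (process ID, virtual page number, frame, permissions), and the use of Python's built-in hash are all assumptions for this sketch rather than the design of any particular kernel.

```python
class InvertedPageTable:
    def __init__(self, num_buckets: int):
        # In practice the table size would be proportional to the number
        # of physical page frames.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, pid: int, vpn: int) -> list:
        return self.buckets[hash((pid, vpn)) % len(self.buckets)]

    def insert(self, pid: int, vpn: int, frame: int, perms: str) -> None:
        # Chaining: colliding entries simply share a bucket.
        self._bucket(pid, vpn).append((pid, vpn, frame, perms))

    def lookup(self, pid: int, vpn: int, access: str):
        for e_pid, e_vpn, frame, perms in self._bucket(pid, vpn):
            # A translation is valid only if both the process ID and the
            # virtual page number match the stored entry.
            if (e_pid, e_vpn) == (pid, vpn):
                if access not in perms:
                    raise PermissionError(f"{access!r} access denied")
                return frame
        return None   # no translation: the kernel must handle the fault


# A single bucket forces every entry to collide, exercising the chain.
table = InvertedPageTable(num_buckets=1)
table.insert(pid=1, vpn=5, frame=10, perms="rw")
table.insert(pid=2, vpn=5, frame=11, perms="r")
frame_for_pid1 = table.lookup(1, 5, "w")
frame_for_pid2 = table.lookup(2, 5, "r")
```

Comparing the stored process ID and virtual page number on every probe is what keeps two processes that use the same virtual page number from seeing each other's frames.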

A particularly useful consequence of having a portability layer for memory management is that the contents of the hardware multi-level translation table can be treated as a hint. A hint is a result of some computation whose results may no longer be valid, but where using an invalid hint will trigger an exception.

With a portability layer, the software page table is the ground truth, while the hardware page table is a hint. The hardware page table can be safely used, provided that the translations and permissions are a subset of the translations in the software page table.

Is an inverted page table enough?

The concept of an inverted page table raises an intriguing question: do we need to have a multi-level page table in hardware? Suppose, in hardware, we hash the virtual address. But instead of using the hash value to look up in a table where to find the physical page frame, suppose we just use the hash value as the physical page. For this to work, we need the hash table size to have exactly as many entries as physical memory page frames, so that there is a one-to-one correspondence between the hash table entry and the page frame.

We still need a table to store permissions and to indicate which virtual page is stored in each entry; if the process does not have permission to access the page, or if two virtual pages hash to the same physical page, we need to be able to detect this and trap to the operating system kernel to handle the problem. This is why a hash table for managing memory is often called an inverted page table: the entries in the table are virtual page numbers, not physical page numbers. The physical page number is just the position of that virtual page in the table.

The drawback to this approach? Handling hash collisions becomes much harder. If two pages hash to the same table entry, only one can be stored in the physical page frame. The other has to be elsewhere, either in a secondary hash table entry or possibly stored on disk. Copying in the new page can take time, and if the program is unlucky enough to need to simultaneously access two virtual pages that both hash to the same physical page, the system will slow down even further. As a result, on modern systems, inverted page tables are typically used in software to improve portability, rather than in hardware, to eliminate the need for multi-level page tables.

8.3 Towards Efficient Address Translation

At this point, you should be getting a bit antsy. After all, most of the hardware mechanisms we have described involve at least two and possibly as many as four extra memory references, on each instruction, before we even reach the intended physical memory location! It should seem completely impractical for a processor to do several memory lookups on every instruction fetch, and even more than that for every instruction that loads or stores data.

In this section, we will discuss how to improve address translation performance without changing its logical behavior. In other words, despite the optimization, every virtual address is translated to exactly the same physical memory location, and every permission exception causes a trap, exactly as would have occurred without the performance optimization.

For this, we will use a cache, a copy of some data that can be accessed more quickly than the original. This section concerns how we might use caches to improve translation performance. Caches are widely used in computer architecture, operating systems, distributed systems, and many other systems; in the next chapter, we discuss more generally when caches work and when they do not. For now, however, our focus is just on the use of caches for reducing the overhead of address translation. There is a reason for this: the very first hardware caches were used to improve translation performance.

8.3.1 Translation Lookaside Buffers

If you think about how a processor executes instructions with address translation, there are some obvious ways to improve performance. After all, the processor normally executes instructions in a sequence, for example an add instruction followed by a mult instruction.

The hardware will first translate the program counter for the add instruction, walking the multi-level translation table to find the physical memory where the add instruction is stored. When the program counter is incremented, the processor must walk the multiple levels again to find the physical memory where the mult instruction is stored. If the two instructions are on the same page in the virtual address space, then they will be on the same page in physical memory. The processor will just repeat the same work: the table walk will be exactly the same, and again for the next instruction, and the next after that.

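A translation lookaside buffer (TLB) is a small hardware table containing the results of recent address translations. Each entry in the TLB maps a virtual page to a physical page, and holds the virtual page number, the physical page frame number, and the access permissions.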

Figure 8.10: Operation of a translation lookaside buffer. In the diagram, each virtual page number is checked against all of the entries in the TLB at the same time; if there is a match, the matching table entry contains the physical page frame and permissions. If not, the hardware multi-level page table lookup is invoked; note the hardware page tables are omitted from the picture.

Figure 8.11: Combined operation of a translation lookaside buffer and hardware page tables.

Instead of finding the relevant entry by a multi-level lookup or by hashing, the TLB hardware (typically) checks all of the entries simultaneously against the virtual page. If there is a match, the processor uses that entry to form the physical address, skipping the rest of the steps of address translation. This is called a TLB hit. On a TLB hit, the hardware still needs to check permissions, in case, for example, the program attempts to write to a code-only page or the operating system needs to trap on a store instruction to a copy-on-write page.

A TLB miss occurs if none of the entries in the TLB match. In this case, the hardware does the full address translation in the way we described above. When the address translation completes, the physical page is used to form the physical address, and the translation is installed in an entry in the TLB, replacing one of the existing entries. Typically, the replaced entry will be one that has not been used recently.
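
The hit and miss paths can be modeled in software. The sketch below is only an illustration under stated assumptions: the class name `TLB`, the `walk_page_table` callback standing in for the full translation, the permission strings, and the strict least-recently-used replacement policy are all hypothetical. Real TLBs compare every entry in parallel and usually only approximate LRU.

```python
from collections import OrderedDict


class TLB:
    def __init__(self, capacity, walk_page_table):
        self.capacity = capacity
        self.walk_page_table = walk_page_table  # slow path: vpn -> (frame, perms)
        self.entries = OrderedDict()            # vpn -> (frame, perms), oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vpn, want_write=False):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # mark as most recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
            self.entries[vpn] = self.walk_page_table(vpn)
        frame, perms = self.entries[vpn]
        # Permissions are checked on every access, including TLB hits.
        if want_write and "w" not in perms:
            raise PermissionError(f"write to read-only page {vpn:#x}")
        return frame


page_table = {0x10: (7, "rx"), 0x11: (8, "rw"), 0x12: (9, "rw")}
tlb = TLB(capacity=2, walk_page_table=page_table.__getitem__)
frames = [tlb.translate(v) for v in (0x10, 0x10, 0x11, 0x12, 0x10)]
```

With only two entries, the final reference to page 0x10 misses again because it was evicted to make room for page 0x12; the result of the translation is the same either way, only the cost differs.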

The TLB lookup is illustrated in Figure 8.10, and Figure 8.11 shows how a TLB fits into the overall address translation system.

Although the hardware cost of a TLB might seem large, it is modest compared to the potential gain in processor performance. To be useful, the TLB lookup needs to be much more rapid than doing a full address translation; thus, the TLB table entries are implemented in very fast, on-chip static memory, situated near the processor. In fact, to keep lookups rapid, many systems now include multiple levels of TLB. In general, the smaller the memory, the faster the lookup. So, the first level TLB is small and close to the processor (and often split for engineering reasons into one for instruction lookups and a separate one for data lookups). If the first level TLB does not contain the translation, a larger second level TLB is consulted, and the full translation is only invoked if the translation misses both levels. For simplicity, our discussion will assume a single-level TLB.

A TLB also requires an address comparator for each entry to check in parallel if there is a match. To reduce this cost, some TLBs are set associative. Compared to fully associative TLBs, set associative ones need fewer comparators, but they may have a higher miss rate. We will discuss set associativity, and its implications for operating system design, in the next chapter.

What is the cost of address translation with a TLB? There are two factors. We pay the cost of the TLB lookup regardless of whether the address is in the TLB or not; in the case of an unsuccessful TLB lookup, we also pay the cost of the full translation. If P(hit) is the likelihood that the TLB has the entry cached:

Cost(address translation) = Cost(TLB lookup) + Cost(full translation) × (1 - P(hit))

In other words, the processor designer needs to include a sufficiently large TLB that most addresses generated by a program will hit in the TLB, so that doing the full translation is the rare event. Even so, TLB misses are a significant cost for many applications.
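
To get a feel for the formula, it helps to plug in numbers. The cycle counts below are hypothetical, chosen only for illustration and not measured on any machine; the function name `translation_cost` is likewise an assumption.

```python
def translation_cost(tlb_lookup, full_translation, p_hit):
    """Expected cost per reference: the TLB lookup is always paid,
    the full translation only on the fraction of references that miss."""
    return tlb_lookup + full_translation * (1.0 - p_hit)


# Hypothetical costs in cycles: 1 for the TLB lookup, 100 for a full translation.
for p_hit in (0.90, 0.99, 0.999):
    cost = translation_cost(tlb_lookup=1, full_translation=100, p_hit=p_hit)
    print(f"P(hit) = {p_hit}: {cost:.2f} cycles per reference")
```

Under these assumed costs, even a hit rate of 99% leaves the full translation contributing about as much to the average as the TLB lookup itself, which is why misses remain significant for many applications.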

Software-loaded TLB

If the TLB is effective at amortizing the cost of doing a full address translation across many memory references, we can ask a radical question: do we need hardware multi-level page table lookup on a TLB miss? This is the concept behind a software-loaded TLB. A TLB hit works as before, as a fast path. On a TLB miss, instead of doing hardware address translation, the processor traps to the operating system kernel. In the trap handler, the kernel is responsible for doing the address lookup, loading the TLB with the new translation, and restarting the application.

This approach dramatically simplifies the design of the operating system, because it no longer needs to keep two sets of page tables, one for the hardware and one for itself. On a TLB miss, the operating system can consult its own portable data structures to determine what data should be loaded into the TLB.

Although convenient for the operating system, a software-loaded TLB is somewhat slower for executing applications, as the cost of trapping to the kernel is significantly more than the cost of doing hardware address translation. As we will see in the next chapter, the contents of page table entries can be stored in on-chip hardware caches; this means that even on a TLB miss, the hardware can often find every level of the multi-level page table already stored in an on-chip cache, but not in the TLB. For example, a TLB miss on a modern generation x86 can be completed in the best case in the equivalent of 17 instructions. By contrast, a trap to the operating system kernel will take several hundred to a few thousand instructions to process, even in the best case.

Figure 8.12: Operation of a translation lookaside buffer with superpages. In the diagram, some entries in the TLB can be superpages; these match if the virtual page is in the superpage. The superpage in the diagram covers an entire memory segment, but this need not always be the case.

8.3.2 Superpages

One way to improve the TLB hit rate is using a concept called superpages. A superpage is a set of contiguous pages in physical memory that map a contiguous region of virtual memory, where the pages are aligned so that they share the same high-order (superpage) address. For example, an 8KB superpage would consist of two adjacent 4KB pages that lie on an 8KB boundary in both virtual and physical memory. Superpages are at the discretion of the operating system: small programs or memory segments that benefit from a smaller page size can still operate with the standard, smaller page size.

Superpages complicate operating system memory allocation by requiring the system to allocate chunks of memory in different sizes. However, the upside is that a superpage can drastically reduce the number of TLB entries needed to map large, contiguous regions of memory. Each entry in the TLB has a flag, signifying whether the entry is a page or a superpage. For superpages, the TLB matches the superpage number; that is, it ignores the portion of the virtual address that is the page number within the superpage. This is illustrated in Figure 8.12.

To make this concrete, the x86 skips one or two levels of the page table when there is a 2MB or 1GB region of physical memory that is mapped as a unit. When the processor references one of these regions, only a single entry is loaded into the TLB. When looking for a match against a superpage, the TLB only considers the most significant bits of the address, ignoring the offset within the superpage. For a 2MB superpage, the offset is the lowest 21 bits of the virtual address. For a 1GB superpage it is the lowest 30 bits.
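
The bit manipulation involved is small enough to write out. In this sketch the function names `tlb_match` and `physical_address`, the example addresses, and the representation of a TLB entry as a tag plus an offset width are assumptions made for illustration; only the 12-, 21-, and 30-bit offset widths come from the page sizes above.

```python
PAGE_SHIFT = 12           # 4KB page: offset is the lowest 12 bits
SUPERPAGE_2MB_SHIFT = 21  # 2MB superpage: offset is the lowest 21 bits
SUPERPAGE_1GB_SHIFT = 30  # 1GB superpage: offset is the lowest 30 bits


def tlb_match(entry_tag, entry_shift, vaddr):
    """An entry matches if the address bits above its offset field equal its tag."""
    return (vaddr >> entry_shift) == entry_tag


def physical_address(frame_base, entry_shift, vaddr):
    """Combine the (aligned) physical base with the offset inside the page or superpage."""
    offset = vaddr & ((1 << entry_shift) - 1)
    return frame_base | offset


vaddr = 0x4030_1ABC
tag_4kb = vaddr >> PAGE_SHIFT
tag_2mb = vaddr >> SUPERPAGE_2MB_SHIFT

# The next 4KB page is a different page, but still inside the same 2MB superpage.
neighbor = vaddr + 0x1000
same_small_page = tlb_match(tag_4kb, PAGE_SHIFT, neighbor)
same_superpage = tlb_match(tag_2mb, SUPERPAGE_2MB_SHIFT, neighbor)
```

A single superpage entry therefore serves references that would otherwise need 512 separate 4KB entries.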

Figure 8.13: Layout of a high-resolution frame buffer in physical memory. Each line of the pixel display can take up an entire page, so that adjacent pixels in the vertical dimension lie on different pages.

A common use of superpages is to map the frame buffer for the computer display. When redrawing the screen, the processor may touch every pixel; with a high-resolution display, this can involve stepping through many megabytes of memory. If each TLB entry maps a 4KB page, even a large on-chip TLB with 256 entries would only be able to contain mappings for 1MB of the frame buffer at the same time. Thus, the TLB would need to repeatedly do page table lookups to pull in new TLB entries as it steps through memory. An even worse case occurs when drawing a vertical line. The frame buffer is a two-dimensional array in row-major order, so that each horizontal line of pixels is on a separate page. Thus, modifying each separate pixel in a vertical line would require loading a separate TLB entry! With superpages, the entire frame buffer can be mapped with a single TLB entry, leaving more room for the other pages needed by the application.

Similar issues occur with large matrices in scientific code.

8.3.3 TLB Consistency

Whenever we introduce a cache into a system, we need to consider how to ensure consistency of the cache with the original data when the entries are modified. A TLB is no exception. For secure and correct program execution, the operating system must ensure that each program sees its memory and no one else’s. Any inconsistency between the TLB, the hardware multi-level translation table, and the portable operating system layer is a potential correctness and security flaw.

There are three issues to consider:

Figure 8.14: Operation of a translation lookaside buffer with process IDs. The TLB contains entries for multiple processes; only the entries for the current process are valid. The operating system kernel must change the current process ID when performing a context switch between processes.

Process context switch. What happens on a process context switch? The virtual addresses of the old process are no longer valid, and should no longer be valid, for the new process. Otherwise, the new process will be able to read the old process’s data structures, either causing the new process to crash, or potentially allowing it to scavenge sensitive information such as passwords stored in memory.

On a context switch, we need to change the hardware page table register to point to the new process’s page table. However, the TLB also contains copies of the old process’s page translations and permissions. One approach is to flush the TLB (discard its contents) on every context switch. Since emptying the cache carries a performance penalty, modern processors have a tagged TLB, shown in Figure 8.14. Entries in a tagged TLB contain the process ID that produced each translation, alongside the virtual page number, the physical page frame number, and the access permissions.

With a tagged TLB, the operating system stores the current process ID in a hardware register on each context switch. When performing a lookup, the hardware ignores TLB entries from other processes, but it can reuse any TLB entries that remain from the last time the current process executed.
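
A tagged TLB can be modeled by making the process ID part of the lookup key. The following is a minimal sketch, assuming a dictionary in place of parallel hardware comparators; the class `TaggedTLB` and its method names are hypothetical.

```python
class TaggedTLB:
    def __init__(self):
        self.entries = {}        # (pid, vpn) -> (frame, perms)
        self.current_pid = None  # models the hardware "current process ID" register

    def context_switch(self, pid):
        # No flush is needed: only the register changes.
        self.current_pid = pid

    def install(self, vpn, frame, perms):
        self.entries[(self.current_pid, vpn)] = (frame, perms)

    def lookup(self, vpn):
        # Entries tagged with another process ID can never match.
        return self.entries.get((self.current_pid, vpn))


tlb = TaggedTLB()
tlb.context_switch(1)
tlb.install(0x53, 0x3, "rw")

tlb.context_switch(2)
miss_for_other = tlb.lookup(0x53)    # process 2 cannot see process 1's entry

tlb.context_switch(1)
hit_after_return = tlb.lookup(0x53)  # the entry survived the context switches
```

The entry installed by process 1 is invisible while process 2 runs, yet it is still available when process 1 is scheduled again.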

Permission reduction. What happens when the operating system modifies an entry in a page table? For the processor’s regular data cache of main memory, special-purpose hardware keeps cached data consistent with the data stored in memory. However, hardware consistency is not usually provided for the TLB; keeping the TLB consistent with the page table is the responsibility of the operating system kernel.

Software involvement is needed for several reasons. First, page table entries can be shared between processes, so a single modification can affect multiple TLB entries (e.g., one for each process sharing the page). Second, the TLB contains only the virtual to physical page mapping; it does not record the address where the mapping came from, so it cannot tell if a write to memory would affect a TLB entry. Even if it did track this information, most stores to memory do not affect the page table, so repeatedly checking each memory store to see if it affects any TLB entry would involve a large amount of overhead that would rarely be needed.

Instead, whenever the operating system changes the page table, it ensures that the TLB does not contain an incorrect mapping.

Nothing needs to be done when the operating system adds permissions to a portion of the virtual address space. For example, the operating system might dynamically extend the heap or the stack by allocating physical memory and changing invalid page table entries to point to the new memory, or the operating system might change a page from read-only to read-write. In these cases, the TLB can be left alone because any references that require the new permissions will either cause the hardware to load the new entries or cause an exception, allowing the operating system to load the new entries.

However, if the operating system needs to reduce permissions to a page, then the kernel needs to ensure the TLB does not have a copy of the old translation before resuming the process. If the page was shared, the kernel needs to ensure that the TLB does not have the copy for any of the process IDs that might have referenced the page. For example, to mark a region of memory as copy-on-write, the operating system must reduce permissions to the region to read-only, and it must remove any entries for that region from the TLB, since the old TLB entries would still be read-write.

Early computers discarded the entire contents of the TLB whenever there was a change to a page table, but more modern architectures, including the x86 and the ARM, support the removal of individual TLB entries.

Figure 8.15: Illustration of the need for TLB shootdown to preserve correct translation behavior. In order for processor 1 to change the translation for page 0x53 in process 0 to read-only, it must remove the entry from its TLB, and it must ensure that no other processor has the old translation in its TLB. To do this, it sends an interprocessor interrupt to each processor, requesting it to remove the old translation. The operating system does not know if a particular TLB contains an entry (e.g., processor 3’s TLB does not contain page 0x53), so it must remove it from all TLBs. The shootdown is complete only when all processors have verified that the old translation has been removed.

TLB shootdown. On a multiprocessor, there is a further complication. Any processor in the system may have a cached copy of a translation in its TLB. Thus, to be safe and correct, whenever a page table entry is modified, the corresponding entry in every processor’s TLB has to be discarded before the change will take effect. Typically, only the current processor can invalidate its own TLB, so removing the entry from all processors on the system requires that the operating system interrupt each processor and request that it remove the entry from its TLB.

This heavyweight operation has its own name: it is a TLB shootdown, illustrated in Figure 8.15. The operating system first modifies the page table, then sends a TLB shootdown request to all of the other processors. Once another processor has ensured that its TLB has been cleaned of any old entries, that processor can resume. The original processor can continue only when all of the processors have acknowledged removing the old entry from their TLB. Since the overhead of a TLB shootdown increases linearly with the number of processors on the system, many operating systems batch TLB shootdown requests, to reduce the frequency of interprocessor interrupts at some increased cost in latency to complete the shootdown.
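
The ordering of the shootdown steps can be sketched as a sequential simulation. This is only a model under strong assumptions: real shootdowns use asynchronous interprocessor interrupts and the initiator spins or blocks until every acknowledgment arrives, whereas here a method call stands in for the interrupt. The names `Processor`, `handle_shootdown`, and `reduce_permission` are hypothetical.

```python
class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}  # (pid, vpn) -> perms

    def handle_shootdown(self, pid, vpn):
        # Interrupt handler: drop the entry if present, then acknowledge.
        self.tlb.pop((pid, vpn), None)
        return self.cpu_id


def reduce_permission(page_table, processors, initiator, pid, vpn, new_perms):
    page_table[(pid, vpn)] = new_perms   # 1. modify the page table
    initiator.tlb.pop((pid, vpn), None)  # 2. clean the initiator's own TLB
    others = [p for p in processors if p is not initiator]
    pending = {p.cpu_id for p in others}
    for p in others:                     # 3. "interrupt" every other processor
        pending.discard(p.handle_shootdown(pid, vpn))
    if pending:                          # 4. continue only after every acknowledgment
        raise RuntimeError(f"no acknowledgment from {sorted(pending)}")
    return len(others)                   # number of interrupts sent


cpus = [Processor(i) for i in range(3)]
page_table = {(0, 0x53): "rw"}
cpus[0].tlb[(0, 0x53)] = "rw"
cpus[1].tlb[(0, 0x53)] = "rw"  # processor 2 never cached the page
interrupts = reduce_permission(page_table, cpus, cpus[0], pid=0, vpn=0x53, new_perms="r")
```

Note that processor 2 is interrupted even though its TLB never held the entry, since the kernel has no way to know that; the number of interrupts grows with the number of processors, which is the motivation for batching.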

8.3.4 Virtually Addressed Caches

Figure 8.16: Combined operation of a virtually addressed cache, translation lookaside buffer, and hardware page table.

Another step to improving the performance of address translation is to include a virtually addressed cache before the TLB is consulted, as shown in Figure 8.16. A virtually addressed cache stores a copy of the contents of physical memory, indexed by the virtual address. When there is a match, the processor can use the data immediately, without waiting for a TLB lookup or page table translation to generate a physical address, and without waiting to retrieve the data from main memory. Almost all modern multicore chips include a small, virtually addressed on-chip cache near each processor core. Often, like the TLB, the virtually addressed cache will be split in half, one for instruction lookups and one for data.

The same consistency issues that apply to TLBs also apply to virtually addressed caches:

Process context switch. Entries in the virtually addressed cache must either be tagged with the process ID or be invalidated on a context switch to prevent the new process from accessing the old process’s data.

Permission reduction and shootdown. When the operating system changes the permission for a page in the page table, the virtual cache will not reflect that change. Invalidating the affected cache entries would require either flushing the entire cache or finding all memory locations stored in the cache on the affected page, both relatively heavyweight operations.

Instead, most systems with virtually addressed caches use them in tandem with the TLB. Each virtual address is looked up in both the cache and the TLB at the same time; the TLB specifies the permissions to use, while the cache provides the data if the access is permitted. This way, only the TLB’s permissions need to be kept up to date. The TLB and virtual cache are co-designed to take the same amount of time to perform a lookup, so the processor does not stall waiting for the TLB.

A further issue is aliasing. Many operating systems allow processes sharing memory to use different virtual addresses to refer to the same memory location. This is called a memory address alias. Each process will have its own TLB entry for that memory, and the virtual cache may store a copy of the memory for each process. The problem occurs when one process modifies its copy; how does the system know to update the other copy?

The most common solution to this issue is to store the physical address along with the virtual address in the virtual cache. In parallel with the virtual cache lookup, the TLB is consulted to generate the physical address and page permissions. On a store instruction modifying data in the virtual cache, the system can do a reverse lookup to find all the entries that match the same physical address, to allow it to update those entries.

8.3.5 Physically Addressed Caches

Figure 8.17: Combined operation of a virtually addressed cache, translation lookaside buffer, hardware page table, and physically addressed cache.

Many processor architectures include a physically addressed cache that is consulted as a second-level cache after the virtually addressed cache and TLB, but before main memory.

This is illustrated in Figure 8.17. Once the physical address of the memory location is formed from the TLB lookup, the second-level cache is consulted. If there is a match, the value stored at that location can be returned directly to the processor without the need to go to main memory.

With today’s chip densities, an on-chip physically addressed cache can be quite large. In fact, many systems include both a second-level and a third-level physically addressed cache. Typically, the second-level cache is per-core and is optimized for latency; a typical size is 256KB. The third-level cache is shared among all of the cores on the same chip and will be optimized for size; it can be as large as 2MB on a modern chip. In other words, the entire UNIX operating system from the 1970s, and all of its applications, would fit on a single modern chip, with no need to ever go to main memory.

Together, these physically addressed caches serve a dual purpose:

Faster memory references. An on-chip physically addressed cache will have a lookup latency that is ten times (2nd level) or three times (3rd level) faster than main memory.

Faster TLB misses. In the event of a TLB miss, the hardware will generate a sequence of lookups through its multiple levels of page tables. Because the page tables are stored in physical memory, they can be cached. Thus, even a TLB miss and page table lookup may be handled entirely on chip.

8.4 Software Protection

An increasing number of systems complement hardware-based address translation with software-based protection mechanisms. Obviously, software-only protection is possible. A machine code interpreter, implemented in software, can simulate the exact behavior of hardware protection. The interpreter could fetch each instruction, interpret it, look each address up in a page table to determine if the instruction is permitted, and if so, execute the instruction. Of course, that would be very slow!

In this section, we ask: are there practical software techniques to execute code within a restricted domain, without relying on hardware address translation? The focus of our discussion will be on using software for providing an efficient protection boundary, as a way of improving computer security. However, the techniques we describe can also be used to provide other operating system services, such as copy-on-write, stack extensibility, recoverable memory, and user-level virtual machines. Once you have the infrastructure to reinterpret references to code and data locations, whether in software or hardware, a number of services become possible.

Hardware protection is nearly universal on modern computers, so it is reasonable to ask, why do we need to implement protection in software?

Simplify hardware. One goal is simple curiosity. Do we really need hardware address translation, or is it just an engineering tradeoff? If software can provide efficient protection, we could eliminate a large amount of hardware complexity and runtime overhead from computers, with a substantial increase in flexibility.

Application-level protection. Even if we need hardware address translation to protect the operating system from misbehaving applications, we often want to run untrusted code within an application. An example is inside a web browser; web pages can contain code to configure the display for a web site, but the browser needs to protect itself against malicious or buggy code provided by web sites.

Protection inside the kernel. We also sometimes need to run untrusted, or at least less trusted, code inside the kernel. Examples include third-party device drivers and code to customize the behavior of the operating system on behalf of applications. Because the kernel runs with the full capability of the entire machine, any user code run inside the kernel must be protected in software rather than in hardware.

Portable security. The proliferation of consumer devices poses a challenge to application portability. No single operating system runs on every embedded sensor, smartphone, tablet, netbook, laptop, desktop, and server machine. Applications that want to run across a wide range of devices need a common runtime environment that isolates the application from the specifics of the underlying operating system and hardware device. Providing protection as part of the runtime system means that users can download and run applications without concern that the application will corrupt the underlying operating system.

Figure 8.18: Execution of untrusted code inside a region of trusted code. The trusted region can be a process, such as a browser, executing untrusted JavaScript, or the trusted region can be the operating system kernel, executing untrusted packet filters or device drivers.

The need for software protection is widespread enough that it has its own term: how do we provide a software sandbox for executing untrusted code so that it can do its work without causing harm to the rest of the system?

8.4.1 Single Language Operating Systems

A very simple approach to software protection is to restrict all applications to be written in a single, carefully designed programming language. If the language and its environment permit only safe programs to be expressed, and the compiler and runtime system are trustworthy, then no hardware protection is needed.

Figure 8.19: Execution of a packet filter inside the kernel. A packet filter can be installed by a network debugger to trace packets for a particular user or application. Packet headers matching the filter are copied to the debugger, while normal packet processing continues unaffected.

A practical example of this approach that is still in wide use is UNIX packet filters, shown in Figure 8.19. UNIX packet filters allow users to download code into the operating system kernel to customize kernel network processing. For example, a packet filter can be installed in the kernel to make a copy of packet headers arriving for a particular connection and to send those to a user-level debugger.

A UNIX packet filter is typically only a small amount of code, but because it needs to run in kernel-mode, the system cannot rely on hardware protection to prevent a misbehaving packet filter from causing havoc to unrelated applications. Instead, the system restricts the packet filter language to permit only safe packet filters. For example, filters may only branch on the contents of packets and no loops are allowed. Since the filters are typically short, the overhead of using an interpreted language is not prohibitive.

Figure 8.20: Execution of a JavaScript program inside a modern web browser. The JavaScript interpreter is responsible for containing effects of the JavaScript program to its specific page. JavaScript programs can call out to a broad set of routines in the browser, so these routines must also be protected against malicious JavaScript programs.

Another example of the same approach is the use of JavaScript in modern web browsers, illustrated in Figure 8.20. A JavaScript program customizes the user interface and presentation of a web site; it is provided by the web site, but it executes on the client machine inside the browser. As a result, the browser execution environment for JavaScript must prevent malicious JavaScript programs from taking control over the browser and possibly the rest of the client machine. Since JavaScript programs tend to be relatively short, they are often interpreted; JavaScript can also call into a predefined set of library routines. If a JavaScript program attempts to call a procedure that does not exist or reference arbitrary memory locations, the interpreter will cause a runtime exception and stop the program before any harm can be done.

Several early personal computers were single language systems with protection implemented in software rather than hardware. Most famously, the Xerox Alto research prototype used software and not hardware protection; the Alto inspired the Apple Macintosh, and the language it used, Mesa, was a forerunner of Java. Other systems included the Lisp Machine, a computer that executed only programs written in Lisp, and computers that executed only Smalltalk (a precursor to Python).

Language protection and garbage collection

JavaScript, Lisp, and Smalltalk all provide memory-compacting garbage collection for dynamically created data structures. One motivation for this is programmer convenience and to reduce avoidable programmer error. However, there is a close relationship between software protection and garbage collection. Garbage collection requires the runtime system to keep track of all valid pointers visible to the program, so that data structures can be relocated without affecting program behavior. Programs expressible in the language cannot point to or jump to arbitrary memory locations, as then the behavior of the program would be altered by the garbage collector. Every address generated by the program is necessarily within the region of the application’s code, and every load and store instruction is to the program’s data, and no one else’s. In other words, this is exactly what is needed for software protection!

Unfortunately, language-based software protection has some practical limitations, so that on modern systems, it is often used in tandem with, rather than as a replacement for, hardware protection. Using an interpreted language seems like a safe option, but it requires trust in both the interpreter and its runtime libraries. An interpreter is a complex piece of software, and any flaw in the interpreter could provide a way for a malicious program to gain control over the process, that is, to escape its protection boundary. Such attacks are common for browsers running JavaScript, although over time JavaScript interpreters have become more robust to these types of attacks.

Worse, because running interpreted code is often slow, many interpreted systems put most of their functionality into system libraries that can be compiled into machine code and run directly on the processor. For example, commercial web browsers provide JavaScript programs a huge number of user interface objects, so that the interpreted code is just a small amount of glue. Unfortunately, this increases the attack surface: any library routine that does not completely protect itself against malicious use can be a vector for the program to escape its protection. For example, a JavaScript program could attempt to cause a library routine to overwrite the end of a buffer, and depending on what was stored in memory, that might provide a way for the JavaScript program to gain control of the system. These types of attacks against JavaScript runtime libraries are widespread.

This leads most systems to use both hardware and software protection. For example, Microsoft Windows runs its web browser in a special process with restricted permissions. This way, if a system administrator visits a web site containing a malicious JavaScript program, even if the program takes over the browser, it cannot store files or do other operations that would normally be available to the system administrator. We know a computer security expert who runs each new web page in a separate virtual machine; even if the web page contains a virus that takes over the browser, and the browser is able to take over the operating system, the original, uninfected, operating system can be automatically restored by resetting the virtual machine.

Cross-site scripting

Another JavaScript attack makes use of the storage interface provided to JavaScript programs. To allow JavaScript programs to communicate with each other, they can store data in cookies in the browser. For some web sites, these cookies can contain sensitive information such as the user’s login authentication. A JavaScript program that can gain access to a user’s cookies can potentially pretend to be the user, and therefore access the user’s sensitive data stored at the server. If a web site is compromised, it can be modified to serve pages containing a JavaScript program that gathers and exploits the user’s sensitive data. These are called cross-site scripting attacks, and they are widespread.

Figure 8.21: Design of the Xerox Alto operating system. Application programs and most of the operating system were implemented in a type-safe programming language called Mesa; Mesa isolated most errors to the module that caused the error.

A related approach is to write all the software on a system in a single, safe language, and then to compile the code into machine instructions that execute directly on the processor. Unlike interpreted languages, the libraries themselves can be written in the safe language. The Xerox Alto took this approach: both applications and the entire operating system were written in the same language, Mesa. Like Java, Mesa had support for thread synchronization built directly into the language. Even with this, however, there are practical issues. You still need to do defensive programming at the trust boundary, between untrusted application code (written in the safe language) and trusted operating system code (written in the safe language). You also need to be able to trust the compiler to generate correct code that enforces protection; any weakness in the compiler could allow a buggy program to crash the system. The designers of the Alto built a successor system, called the Digital Equipment Firefly, which used a successor language to Mesa, called Modula-2, for implementing both applications and the operating system. However, for an extra level of protection, the Firefly also used hardware protection to isolate applications from the operating system kernel.

8.4.2 Language-Independent Software Fault Isolation

A limitation of trusting a language and its interpreter or compiler to provide safety is that many programmers value the flexibility to choose their own programming language. For example, some might use Ruby for configuring web servers, Matlab or Python for writing scientific code, or C++ for large software engineering efforts.

Since it would be impractical for the operating system to trust every compiler for every possible language, can we efficiently isolate application code, in software without hardware support, in a programming-language-independent fashion?

One reason for considering this is that there are many cases where systems need an extra level of protection within a process. We saw an example of this with web browsers needing to safely execute JavaScript programs, but there are many other examples. With software protection, we could give users the ability to customize the operating system by downloading code into the kernel, as with packet filters, but on a more widespread basis. Kernel device drivers have been shown to be the primary cause of operating system crashes; providing a way for the kernel to execute device drivers in a restricted environment could potentially cut down on the severity of these faults. Likewise, many complex software packages such as databases, spreadsheets, desktop publishing systems, and systems for computer-aided design, provide their users a way to download code into the system to customize and configure the system’s behavior to meet the user’s specific needs. If this downloaded code causes the system to crash, the user will not be able to tell who is really at fault and is likely to end up blaming the vendor.

Of course, one way to do this is to rely on the JavaScript interpreter. Tools exist to compile code written in one language, like C or C++, into JavaScript. This lets applications written in those languages run on any browser that supports JavaScript. If executing JavaScript were safe and fast enough, then we could declare ourselves done.

In this section, we discuss an alternate approach: can we take any chunk of machine instructions and modify it to ensure that the code does not touch any memory outside of its own region of data? That way, the code could be written in any language, compiled by any compiler, and directly execute at the full speed of the processor.

Both Google and Microsoft have products that accomplish this: a sandbox that can run code written in any programming language, executed safely inside a process. Google’s product is called Native Client; Microsoft’s is called Application Domains. These implementations are efficient: Google reports that the runtime overhead of executing code safely inside a sandbox is less than 10%.

Forsimplicityofourdiscussion,wewillassumethatthememoryregionforthesandboxiscontiguous,thatis,thesandboxhasabaseandboundthatneedstobeenforcedinsoftware.Becausewecandisallowtheexecutionofobviouslymaliciouscode,wecanstartbycheckingthatthecodeinthesandboxdoesnotuseself-modifyinginstructionsorprivilegedinstructions.

We proceed in two steps. First, we insert machine instructions into the executable to do what hardware protection would have done, that is, to check that each address is legally within the region specified by the base and bounds, and to raise an exception if not. Second, we use control and data flow analysis to remove checks that are not strictly necessary for the sandbox to be correct. This mirrors what we did for hardware translation: first, we designed a general-purpose and flexible mechanism, and then we showed how to optimize it using TLBs so that the full translation mechanism was not needed on every instruction.

The added instructions for every load and store instruction are simple: just add a check that the address to be used by each load or store instruction is within the correct region of data. In the code, r1 is a machine register.

Note that the store instructions must be limited to just the data region of the sandbox; otherwise a store could modify the instruction sequence, e.g., to cause a jump out of the protected region.
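The load and store checks described above can be sketched as a small Python simulation. This is only an illustration under stated assumptions, not the instruction sequence a real sandbox inserts: the region boundaries, the `SandboxFault` exception, and the function names are all hypothetical, and `r1` simply stands for the register holding the address.

```python
class SandboxFault(Exception):
    """Raised when sandboxed code touches memory outside its region."""

# Hypothetical layout: one contiguous sandbox, code first, then data.
CODE_BASE, CODE_BOUND = 0x1000, 0x2000   # [base, bound) holds instructions
DATA_BASE, DATA_BOUND = 0x2000, 0x4000   # [base, bound) holds data

memory = bytearray(0x4000)               # simulated address space

def checked_load(r1: int) -> int:
    # Loads may read anywhere inside the sandbox (code or data).
    if not (CODE_BASE <= r1 < DATA_BOUND):
        raise SandboxFault(f"load from {r1:#x}")
    return memory[r1]

def checked_store(r1: int, value: int) -> None:
    # Stores are confined to the data region, so the sandboxed code
    # cannot rewrite its own (already checked) instructions.
    if not (DATA_BASE <= r1 < DATA_BOUND):
        raise SandboxFault(f"store to {r1:#x}")
    memory[r1] = value & 0xFF
```

A store to `0x2000` succeeds, while a store to `0x1000` (inside the code region) or a load from `0x5000` (outside the sandbox) raises `SandboxFault`, playing the role of the exception that hardware base-and-bounds protection would have raised.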

We also need to check indirect branch instructions. We need to make sure the program cannot branch outside of the sandbox except for predefined entry and exit points. Relative branches and named procedure calls can be directly verified. Indirect branches and procedure returns jump to a location stored in a register or in memory; the address must be checked before use.

As a final detail, the above code verifies that indirect branch instructions stay within the code region. This turns out to be insufficient for protection, for two reasons. First, x86 code is byte addressable, and if you allow a jump to the middle of an instruction, you cannot be guaranteed as to what the code will do. In particular, an erroneous or malicious program might jump to the middle of an instruction, whose bytes would cause the processor to jump outside of the protected region. Although this may seem unlikely, remember that the attacker has the advantage; the attacker can try various code sequences to see if that causes an escape from the sandbox. A second issue is that an indirect branch might jump past the protection checks for a load or store instruction. We can prevent both of these by doing all indirect jumps through a table that only contains valid entry points into the code; of course, the table must also be protected from being modified by the code in the sandbox.
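The jump-table idea can be illustrated with a short Python sketch. It assumes, purely for illustration, that the sandboxed code names a branch target by table index rather than by raw address; `make_jump_table`, `checked_indirect_jump`, and the example addresses are hypothetical names and values, not part of any real sandbox.

```python
class SandboxFault(Exception):
    """Raised when sandboxed code attempts an illegal control transfer."""

def make_jump_table(entry_points):
    # A tuple is immutable, standing in for a table that the
    # sandboxed code is not permitted to modify.
    return tuple(entry_points)

def checked_indirect_jump(table, index):
    # The sandboxed code supplies an index, never a raw address, so it
    # cannot land mid-instruction or just past a load/store check.
    if not (0 <= index < len(table)):
        raise SandboxFault(f"bad jump table index {index}")
    return table[index]

# Example: two valid entry points chosen by the (trusted) rewriter.
table = make_jump_table([0x1000, 0x1040])
target = checked_indirect_jump(table, 1)
```

Because every indirect transfer resolves to an address the rewriter placed in the table, both problems named above are ruled out by construction rather than by inspecting each target address.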

Now that we have logical correctness, we can run control and data flow analysis to eliminate many of the extra inserted instructions, if it can be proven that they are not needed. Examples of possible optimizations include:

Loop invariants. If a loop strides through memory, the sandbox may be able to prove with a simple test at the beginning of the loop that all memory accesses in the loop will be within the protected region.

Return values. If static code analysis of a procedure can prove that the procedure does not modify the return program counter stored on the stack, the return can be made safely without further checks.

Cross-procedure checks. If the code analysis can prove that a parameter is always checked before it is passed as an argument to a subroutine, it need not be checked when it is used inside the procedure.
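The first of these optimizations can be illustrated by comparing a loop that checks every access with one that checks the whole range once before the loop starts. The sketch below is a Python simulation with hypothetical function names; it counts checks instead of measuring time, and it assumes the loop walks a contiguous range with unit stride.

```python
def sum_checked_each(memory, base, bound, start, count):
    """Bounds-check every access inside the loop."""
    checks = 0
    total = 0
    for addr in range(start, start + count):
        checks += 1
        if not (base <= addr < bound):
            raise IndexError(f"address {addr} outside sandbox")
        total += memory[addr]
    return total, checks

def sum_checked_once(memory, base, bound, start, count):
    """Hoist the check: one test covers the loop's whole address range."""
    checks = 1
    if count > 0 and not (base <= start and start + count <= bound):
        raise IndexError("loop range outside sandbox")
    total = 0
    for addr in range(start, start + count):
        total += memory[addr]   # no per-access check needed
    return total, checks

memory = list(range(100))       # simulated sandbox data region [0, 100)
```

Both versions compute the same sum and reject the same out-of-range loops, but the hoisted version performs one check regardless of the trip count.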

Virtual machines without kernel support

Modifying machine code to transparently change the behavior of a program, while still enforcing protection, can be used for other purposes. One application is transparently executing a guest operating system inside a user-level process without kernel support.

Normally, when we run a guest operating system in a virtual machine, the hardware catches any privileged instructions executed by the guest kernel and traps into the host kernel. The host kernel emulates the instructions and returns control back to the guest kernel at the instruction immediately after the hardware exception. This allows the host kernel to emulate privilege levels, interrupts, exceptions, and kernel management of hardware page tables.

What happens if we are running on top of an operating system that does not support a virtual machine? We can still emulate a virtual machine by modifying the machine code of the guest operating system kernel. For example, we can convert instructions to enable and disable interrupts to a no-op. We can convert an instruction to start executing a user program to take the contents of the application memory, re-write those contents into a user-level sandbox, and start it executing. From the perspective of the guest kernel, the application program execution looks normal; it is the sandbox that keeps the application program from corrupting the guest kernel’s data structures and passes control to the guest kernel when the application makes a system call.

Because of the widespread use of virtual machines, some hardware architectures have begun to add support for directly executing guest operating systems in user-mode without kernel support. We will return to this issue in a later chapter, as it is closely related to the topic of stackable virtual machines: how do we manipulate page tables to handle the case where the guest operating system is itself a virtual machine monitor running a virtual machine?

8.4.3 Sandboxes Via Intermediate Code

To improve portability, both Microsoft and Google can construct their sandboxes from intermediate code generated by the compiler. This makes it easier for the system to do the code modification and data flow analysis to enforce the sandbox. Instead of generating x86 or ARM code directly, the various compilers generate their code in the intermediate language, and the sandbox runtime converts that into sandboxed code on the specific processor architecture.

The intermediate representation can be thought of as a virtual machine, with a simpler instruction set. From the compiler perspective, it is as easy to generate code for the virtual machine as it would be to go directly to x86 or ARM instructions. From the sandbox perspective though, using a virtual machine as the intermediate representation is much simpler. The intermediate code can include annotations as to which pointers can be proven to be safe and which must be checked before use. For example, pointers in a C program would require runtime checks while the memory references in a Java program may be able to be statically proven as safe from the structure of the code.

Microsoft has compilers for virtually every commercially important programming language. To avoid trusting all of these compilers with the safety of the system, the runtime is responsible for validating any of the type information needed for efficient code generation for the sandbox. Typically, verifying the correctness of static analysis is much simpler than generating it in the first place.

The Java virtual machine (JVM) is also a kind of sandbox; Java code is translated into intermediate byte code instructions that can be verified at runtime as being safely contained in the sandbox. Several languages have been compiled into Java byte code, such as Python, Ruby, and JavaScript. Thus, a JVM can also be considered a language-independent sandbox. However, because of the structure of the intermediate representation in Java, it is more difficult to generate correct Java byte code for languages such as C or Fortran.

8.5 Summary and Future Directions

Address translation is a powerful abstraction enabling a wide variety of operating system services. It was originally designed to provide isolation between processes and to protect the operating system kernel from misbehaving applications, but it is more widely applicable. It is now used to simplify memory management, to speed interprocess communication, to provide for efficient shared libraries, to map files directly into memory, and a host of other uses.

A huge challenge to effective hardware address translation is the cumulative effect of decades of Moore’s Law: both servers and desktop computers today contain vast amounts of memory. Processes are now able to map their code, data, heap, shared libraries, and files directly into memory. Each of these segments can be dynamic; they can be shared across processes or private to a single process. To handle these demands, hardware systems have converged on a two-tier structure: a multi-level segment and page table to provide very flexible but space-efficient lookup, along with a TLB to provide time-efficient lookup for repeated translations of the same page.

Much of what we can do in hardware we can also do in software; a combination of hardware and software protection has proven attractive in a number of contexts. Modern web browsers execute code embedded in web pages in a software sandbox that prevents the code from infecting the browser; the operating system uses hardware protection to provide an extra level of defense in case the browser itself is compromised.

The future trends are clear:

Very large memory systems. The cost of a gigabyte of memory is likely to continue to plummet, making ever larger memory systems practical. Over the past few decades, the amount of memory per system has almost doubled each year. We are likely to look back at today’s computers and wonder how we could have gotten by with as little as a gigabyte of DRAM! These massive memories will require ever deeper multi-level page tables. Fortunately, the same trends that make it possible to build gigantic memories also make it possible to design very large TLBs to hide the increasing depth of the lookup trees.

Multiprocessors. On the other hand, multiprocessors will mean that maintaining TLB consistency will become increasingly expensive. A key assumption for using page table protection hardware for implementing copy-on-write and fill-on-demand is that the cost of modifying page table entries is modest. One possibility is that hardware will be added to systems to make TLB shootdown a much cheaper operation, e.g., by making TLBs cache coherent. Another possibility is to follow the trend towards software sandboxes. If TLB shootdown remains expensive, we may start to see copy-on-write and other features implemented in software rather than hardware.

User-level sandboxes. Applications like browsers that run untrusted code are becoming increasingly prevalent. Operating systems have only recently begun to recognize the need to support these types of applications. Software protection has become common, both at the language level with JavaScript, and in the runtime system with Native Client and Application Domains. As these technologies become more widely used, it seems likely we may see direct hardware support for application-level protection: support that allows each application to set up its own protected execution environment, but enforced in hardware. If so, we may come to think of many applications as having their own embedded operating system, and the underlying operating system kernel as mediating between these operating systems.

Exercises

1. True or false. A virtual memory system that uses paging is vulnerable to external fragmentation. Why or why not?

2. For systems that use paged segmentation, what translation state does the kernel need to change on a process context switch?

3. For the three-level SPARC page table, what translation state does the kernel need to change on a process context switch?

4. Describe the advantages of an architecture that incorporates segmentation and paging over ones that are either pure paging or pure segmentation. Present your answer as separate lists of advantages over each of the pure schemes.

5. For a computer architecture with multi-level paging, a page size of 4 KB, and 64-bit physical and virtual addresses:

a. List the required and optional fields of its page table entry, along with the number of bits per field.

b. Assuming a compact encoding, what is the smallest possible size for a page table entry in bytes, rounded up to an even number?

c. Assuming a requirement that each page table fits into a single page, and given your answer above, how many levels of page tables would be required to completely map the 64-bit virtual address space?

6. Consider the following piece of code which multiplies two matrices:

Assume that the binary for executing this function fits in one page and that the stack also fits in one page. Assume that storing a floating point number takes 4 bytes of memory. If the page size is 4 KB, the TLB has 8 entries, and the TLB always keeps the most recently used pages, compute the number of TLB misses assuming the TLB is initially empty.

7. Of the following items, which are stored in the thread control block, which are stored in the process control block, and which in neither?

a. Page table pointer
b. Page table
c. Stack pointer
d. Segment table
e. Ready list
f. CPU registers
g. Program counter

8. Draw the segment and page table for the 32-bit Intel architecture.

9. Draw the segment and page table for the 64-bit Intel architecture.

10. For a computer architecture with multi-level paging, a page size of 4 KB, and 64-bit physical and virtual addresses:

a. What is the smallest possible size for a page table entry, rounded up to a power of two?

b. Using your result above, and assuming a requirement that each page table fits into a single page, how many levels of page tables would be required to completely map the 64-bit virtual address space?

11. Suppose you are designing a system with paged segmentation, and you anticipate the memory segment size will be uniformly distributed between 0 and 4 GB. The overhead of the design is the sum of the internal fragmentation and the space taken up by the page tables. If each page table entry uses four bytes per page, what page size minimizes overhead?

12. In an architecture with paged segmentation, the 32-bit virtual address is divided into fields as follows:

| 4 bit segment number | 12 bit page number | 16 bit offset |

The segment and page tables are as follows (all values in hexadecimal):

| Segment Table | Page Table A | Page Table B |
| 0  Page Table A | 0  CAFE | 0  F000 |
| 1  Page Table B | 1  DEAD | 1  D8BF |
| x  (rest invalid) | 2  BEEF | 2  3333 |
| | 3  BA11 | x  (rest invalid) |
| | x  (rest invalid) | |

Find the physical address corresponding to each of the following virtual addresses (answer “invalid virtual address” if the virtual address is invalid):

a. 00000000

b. 20022002

c. 10015555

13. Suppose a machine with 32-bit virtual addresses and 40-bit physical addresses is designed with a two-level page table, subdividing the virtual address into three pieces as follows:

| 10 bit page table number | 10 bit page number | 12 bit offset |

The first 10 bits are the index into the top-level page table, the second 10 bits are the index into the second-level page table, and the last 12 bits are the offset into the page. There are 4 protection bits per page, so each page table entry takes 4 bytes.

a. What is the page size in this system?

b. How much memory is consumed by the first and second level page tables and wasted by internal fragmentation for a process that has 64K of memory starting at address 0?

c. How much memory is consumed by the first and second level page tables and wasted by internal fragmentation for a process that has a code segment of 48K starting at address 0x1000000, a data segment of 600K starting at address 0x80000000 and a stack segment of 64K starting at address 0xf0000000 and growing upward (towards higher addresses)?

14. Write pseudo-code to convert a 32-bit virtual address to a 32-bit physical address for a two-level address translation scheme using segmentation at the first level of translation and paging at the second level. Explicitly define whatever constants and data structures you need (e.g., the format of the page table entry, the page size, and so forth).

9. Caching and Virtual Memory

Cash is king. - Per Gyllenhammar

Some may argue that we no longer need a chapter on caching and virtual memory in an operating systems textbook. After all, most students will have seen caches in an earlier machine structures class, and most desktops and laptops are configured so that they only very rarely, if ever, run out of memory. Maybe caching is no longer an operating systems topic?

We could not disagree more. Caches are central to the design of a huge number of hardware and software systems, including operating systems, Internet naming, web clients, and web servers. In particular, smartphone operating systems are often memory constrained and must manage memory carefully. Server operating systems make extensive use of remote memory and remote disk across the data center, using the local server memory as a cache. Even desktop operating systems use caching extensively in the implementation of the file system. Most importantly, understanding when caches work and when they do not is essential to every computer systems designer.

Consider a typical Facebook page. It contains information about you, your interests and privacy settings, your posts, and your photos, plus your list of friends, their interests and privacy settings, their posts, and their photos. In turn, your friends’ pages contain an overlapping view of much of the same data, and in turn, their friends’ pages are constructed the same way.

Now consider how Facebook organizes its data to make all of this work. How does Facebook assemble the data needed to display a page? One option would be to keep all of the data for a particular user’s page in one place. However, the information that I need to draw my page overlaps with the information that my friends’ friends need to draw their pages. My friends’ friends’ friends’ friends include pretty much the entire planet. We can either store everyone’s data in one place or spread the data around. Either way, performance will suffer! If we store all the data in California, Facebook will be slow for everyone from Europe, and vice versa. Equally, integrating data from many different locations is also likely to be slow, especially for Facebook’s more cosmopolitan users.

To resolve this dilemma, Facebook makes heavy use of caches; it would not be practical without them. A cache is a copy of a computation or data that can be accessed more quickly than the original. While any object on my page might change from moment to moment, it seldom does. In the common case, Facebook relies on a local, cached copy of the data for my page; it only goes back to the original source if the data is not stored locally or becomes out of date.

Caches work because both users and programs are predictable. You (probably!) do not change your friend list every nanosecond; if you did, Facebook could still cache your friend list, but it would be out of date before it could be used again, and so it would not help. If everyone changed their friends every nanosecond, Facebook would be out of luck! In most cases, however, what users do now is predictive of what they are likely to do soon, and what programs do now is predictive of what they will do next. This provides an opportunity for a cache to save work through reuse.

Facebook is not alone in making extensive use of caches. Almost all large computer systems rely on caches. In fact, it is hard to think of any widely used, complex hardware or software system that does not include a cache of some sort.

We saw three examples of hardware caches in the previous chapter:

TLBs. Modern processors use a translation lookaside buffer, or TLB, to cache the recent results of multi-level page table address translation. Provided programs reference the same pages repeatedly, translating an address is as fast as a single table lookup in the common case. The full multi-level lookup is needed only in the case where the TLB does not contain the relevant address translation.

Virtually addressed caches. Most modern processor designs take this idea a step farther by including a virtually addressed cache close to the processor. Each entry in the cache stores the memory value associated with a virtual address, allowing that value to be returned more quickly to the processor when needed. For example, the repeated instruction fetches inside a loop are well handled by a virtually addressed cache.

Physically addressed caches. Most modern processors complement the virtually addressed cache with a second- (and sometimes third-) level physically addressed cache. Each entry in a physically addressed cache stores the memory value associated with a physical memory location. In the common case, this allows the memory value to be returned directly to the processor without the need to go to main memory.

There are many more examples of caches:

Internet naming. Whenever you type in a web request or click on a link, the client computer needs to translate the name in the link (e.g., amazon.com) to an IP network address of where to send each packet. The client gets this information from a network service, called the Domain Name System (DNS), and then caches the translation so that the client can go directly to the web server in the common case.

Web content. Web clients cache copies of HTML, images, JavaScript programs, and other data so that web pages can be refreshed more quickly, using less bandwidth. Web servers also keep copies of frequently requested pages in memory so that they can be transmitted more quickly.

Web search. Both Google and Bing keep a cached copy of every web page they index. This allows them to provide the copy of the web page if the original is unavailable for some reason. The cached copy may be out of date; the search engines do not guarantee that the copy instantaneously reflects any change in the original web page.

Email clients. Many email clients store a copy of mail messages on the client computer to improve the client performance and to allow disconnected operation. In the background, the client communicates with the server to keep the two copies in sync.

Incremental compilation. If you have ever built a program from multiple source files, you have used caching. The build manager saves and reuses the individual object files instead of recompiling everything from scratch each time.

Just in time translation. Some memory-constrained devices such as smartphones do not contain enough memory to store the entire executable image for some programs. Instead, systems such as the Google Android operating system and the ARM runtime store programs in a more compact intermediate representation, and convert parts of the program to machine code as needed. Repeated use of the same code is fast because of caching; if the system runs out of memory, less frequently used code may be converted each time it is needed.

Virtual memory. Operating systems can run programs that do not fit in physical memory by using main memory as a cache for disk. Application pages that fit in memory have their page table entries set to valid; these pages can be accessed directly by the processor. Those pages that do not fit have their permissions set to invalid, triggering a trap to the operating system kernel. The kernel will then fetch the required page from disk and resume the application at the instruction that caused the trap.

File systems. File systems also treat memory as a cache for disk. They store copies in memory of frequently used directories and files, reducing the need for disk accesses.

Conditional branch prediction. Another use of caches is in predicting whether a conditional branch will be taken or not. In the common case of a correct prediction, the processor can start decoding the next instruction before the result of the branch is known for sure; if the prediction turns out to be wrong, the decoding is restarted with the correct next instruction.

In other words, caches are a central design technique for making computer systems faster. However, caches are not without their downsides. Caches can make understanding the performance of a system much harder. Something that seems like it should be fast, and even something that usually is fast, can end up being very slow if most of the data is not in the cache. Because the details of the cache are often hidden behind a level of abstraction, the user or the programmer may have little idea as to what is causing the poor performance. In other words, the abstraction of fast access to data can cause problems if the abstraction does not live up to its promise. One of our aims is to help you understand when caches do and do not work well.

In this chapter, we will focus on the caching of memory values, but the principles we discuss apply much more widely. Memory caching is common in both hardware (by the processor to improve memory latency) and in software (by the operating system to hide disk and network latency). Further, the structure and organization of processor caches require special care by the operating system in setting up page tables; otherwise, much of the advantage of processor caches can evaporate.

Regardless of the context, all caches face three design challenges:

Locating the cached copy. Because caches are designed to improve performance, a key question is often how to quickly determine whether the cache contains the needed data or not. Because the processor consults at least one hardware cache on every instruction, hardware caches in particular are organized for efficient lookup.

Replacement policy. Most caches have physical limits on how many items they can store; when new data arrives in the cache, the system must decide which data is most valuable to keep in the cache and which can be replaced. Because of the high relative latency of fetching data from disk, operating systems and applications have focused more attention on the choice of replacement policy.

Coherence. How do we detect, and repair, when a cached copy becomes out of date? This question, cache coherence, is central to the design of multiprocessor and distributed systems. Despite being very important, cache coherence is beyond the scope of this version of the textbook. Instead, we focus on the first two of these issues.

Chapter roadmap:

Cache Concept. What operations does a cache do and how can we evaluate its performance? (Section 9.1)

Memory Hierarchy. What hardware building blocks do we have in constructing a cache in an application or operating system? (Section 9.2)

When Caches Work and When They Do Not. Can we predict how effective a cache will be in a system we are designing? Can we know in advance when caching will not work? (Section 9.3)

Memory Cache Lookup. What options do we have for locating whether an item is cached? How can we organize hardware caches to allow for rapid lookup, and what are the implications of cache organization for operating systems and applications? (Section 9.4)

Replacement Policies. What options do we have for choosing which item to replace when there is no more room? (Section 9.5)

Case Study: Memory-Mapped Files. How does the operating system provide the abstraction of file access without first reading the entire file into memory? (Section 9.6)

Case Study: Virtual Memory. How does the operating system provide the illusion of a near-infinite memory that can be shared between applications? What happens if both applications and the operating system want to manage memory at the same time? (Section 9.7)

9.1 Cache Concept

Figure 9.1: Abstract operation of a memory cache on a read request. Memory read requests are sent to the cache; the cache either returns the value stored at that memory location, or it forwards the request onward to the next level of cache.

We start by defining some terms. The simplest kind of a cache is a memory cache. It stores (address, value) pairs. As shown in Figure 9.1, when we need to read the value of a certain memory location, we first consult the cache, and it either replies with the value (if the cache knows it) or forwards the request onward. If the cache has the value, that is called a cache hit. If the cache does not, that is called a cache miss.
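The read path just described can be sketched in a few lines of Python. This is a minimal illustration, not a model of any particular hardware: the class name `MemoryCache`, its unbounded dictionary of entries, and the use of a plain dictionary as the "next level" are all assumptions made for the example.

```python
class MemoryCache:
    """A cache of (address, value) pairs in front of a slower next level."""

    def __init__(self, next_level):
        self.next_level = next_level   # anything indexable by address
        self.entries = {}              # cached address -> value
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.entries:    # cache hit: answer locally
            self.hits += 1
            return self.entries[address]
        self.misses += 1               # cache miss: forward the request
        value = self.next_level[address]
        self.entries[address] = value  # keep a copy for next time
        return value

main_memory = {0: "a", 4: "b"}
cache = MemoryCache(main_memory)
values = [cache.read(0), cache.read(0), cache.read(4)]
```

The first read of each address is a miss that is forwarded onward; the repeated read of address 0 is a hit served from the cached copy.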

For a memory cache to be useful, two properties need to hold. First, the cost of retrieving data out of the cache must be significantly less than fetching the data from memory. In other words, the cost of a cache hit must be less than a cache miss, or we would just skip using the cache.

Second, the likelihood of a cache hit must be high enough to make it worth the effort. One source of predictability is temporal locality: programs tend to reference the same instructions and data that they had recently accessed. Examples include the instructions inside a loop, or a data structure that is repeatedly accessed. By caching these memory values, we can improve performance.

Another source of predictability is spatial locality. Programs tend to reference data near other data that has been recently referenced. For example, the next instruction to execute is usually near to the previous one, and different fields in the same data structure tend to be referenced at nearly the same time. To exploit this, caches are often designed to load a block of data at the same time, instead of only a single location. Hardware memory caches often store 4-64 memory words as a unit; file caches often store data in powers of two of the hardware page size.

A related design technique that also takes advantage of spatial locality is to prefetch data into the cache before it is needed. For example, if the file system observes the application reading a sequence of blocks into memory, it will read the subsequent blocks ahead of time, without waiting to be asked.
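To see why loading a whole block pays off when accesses have spatial locality, the following Python sketch counts misses for a cache that fetches `block_size` consecutive addresses at a time. It is a simplified model under stated assumptions: the cache is unbounded (nothing is ever evicted), and the function name and access patterns are hypothetical.

```python
def count_misses(addresses, block_size):
    """Count misses for an unbounded cache that loads whole blocks."""
    cached_blocks = set()
    misses = 0
    for addr in addresses:
        block = addr // block_size     # block containing this address
        if block not in cached_blocks:
            misses += 1                # fetch the entire block on a miss
            cached_blocks.add(block)
    return misses

sequential = range(64)                 # strong spatial locality
strided = range(0, 512, 8)             # 64 accesses, one per 8-word block

seq_single = count_misses(sequential, block_size=1)
seq_blocked = count_misses(sequential, block_size=8)
strided_blocked = count_misses(strided, block_size=8)
```

For the sequential scan, 8-word blocks cut the misses from 64 to 8. For the strided pattern, every access lands in a different block, so the larger block brings in data that is never used and all 64 accesses still miss.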

Putting these together, the latency of a read request is as follows:

Latency(read request) = Prob(cache hit) × Latency(cache hit) + Prob(cache miss) × Latency(cache miss)
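A short worked example may make the formula concrete. The Python sketch below evaluates it directly; the function name and the sample latencies (1 ns for a hit, 100 ns for a miss) are illustrative assumptions, and the miss latency is taken to be the total time to satisfy a request that misses.

```python
def read_latency(hit_rate, hit_latency, miss_latency):
    """Expected read latency: P(hit) * L(hit) + P(miss) * L(miss)."""
    return hit_rate * hit_latency + (1.0 - hit_rate) * miss_latency

# Assumed latencies in nanoseconds.
fast = read_latency(0.99, 1.0, 100.0)   # about 1.99 ns
slow = read_latency(0.90, 1.0, 100.0)   # about 10.9 ns
```

Dropping the hit rate from 99% to 90% raises the average latency by more than a factor of five, because the miss term dominates as soon as misses are not rare.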

Figure 9.2: Abstract operation of a memory cache write. Memory requests are buffered and then sent to the cache in the background. Typically, the cache stores a block of data, so each write ensures that the rest of the block is in the cache before updating the cache. If the cache is write through, the data is then sent onward to the next level of cache or memory.

The behavior of a cache on a write operation is shown in Figure 9.2. The operation is a bit more complex, but the latency of a write operation is easier to understand. Most systems buffer writes. As long as there is room in the buffer, the computation can continue immediately while the data is transferred into the cache and to memory in the background. (There are certain restrictions on the use of write buffers in a multiprocessor system, so for this chapter, we are simplifying matters to some degree.) Subsequent read requests must check both the write buffer and the cache, returning data from the write buffer if it is the latest copy.

In the background, the system checks if the address is in the cache. If not, the rest of the cache block must be fetched from memory and then updated with the changed value. Finally, if the cache is write-through, all updates are sent immediately onward to memory. If the cache is write-back, updates can be stored in the cache, and only sent to memory when the cache runs out of space and needs to evict a block to make room for a new memory block.

Since write buffers allow write requests to appear to complete immediately, the rest of our discussion focuses on using caches to improve memory reads.

We first discuss the part of the equation that deals with the latency of a cache hit and a cache miss: how long does it take to access different types of memory? We caution, however, that the issues that affect the likelihood of a cache hit or miss are just as important to the overall memory latency. In particular, we will show that application characteristics are often the limiting factor to good cache performance.
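The difference between write-through and write-back can be illustrated by counting how many writes reach memory under each policy. The Python sketch below is a deliberately simplified model: each cache block holds a single word, there is no write buffer, eviction is triggered explicitly by the caller, and the class and method names are hypothetical.

```python
class WriteCache:
    """Toy cache that counts writes sent onward to memory."""

    def __init__(self, memory, write_through):
        self.memory = memory           # dict: address -> value
        self.write_through = write_through
        self.entries = {}              # cached address -> value
        self.dirty = set()             # addresses newer than memory
        self.memory_writes = 0

    def write(self, address, value):
        self.entries[address] = value
        if self.write_through:
            self._to_memory(address)   # every update goes onward at once
        else:
            self.dirty.add(address)    # write-back: defer until eviction

    def evict(self, address):
        if address in self.dirty:      # flush deferred update, if any
            self._to_memory(address)
            self.dirty.discard(address)
        self.entries.pop(address, None)

    def _to_memory(self, address):
        self.memory[address] = self.entries[address]
        self.memory_writes += 1

through = WriteCache({}, write_through=True)
back = WriteCache({}, write_through=False)
for value in (1, 2, 3):                # three updates to the same address
    through.write(0, value)
    back.write(0, value)
back.evict(0)
```

Three updates to the same address cost three memory writes with write-through but only one with write-back, at the price of memory holding a stale value until the block is evicted.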

9.2 Memory Hierarchy

When we are deciding whether to use a cache in the operating system or some new application, it is helpful to start with an understanding of the cost and performance of various levels of memory and disk storage.

| Cache | Hit Cost | Size |
| 1st level cache / 1st level TLB | 1 ns | 64 KB |
| 2nd level cache / 2nd level TLB | 4 ns | 256 KB |
| 3rd level cache | 12 ns | 2 MB |
| Memory (DRAM) | 100 ns | 10 GB |
| Data center memory (DRAM) | 100 μs | 100 TB |
| Local non-volatile memory | 100 μs | 100 GB |
| Local disk | 10 ms | 1 TB |
| Data center disk | 10 ms | 100 PB |
| Remote data center disk | 200 ms | 1 EB |

Figure 9.3: Memory hierarchy, from on-chip processor caches to disk storage at a remote data center. On-chip cache size and latency are typical of a high-end processor. The entries for data center DRAM and disk latency assume the access is from one server to another in the same data center; remote data center disk latency is for access to a geographically distant data center.

From a hardware perspective, there is a fundamental tradeoff between the speed, size, and cost of storage. The smaller memory is, the faster it can be; the slower memory is, the cheaper it can be.

This motivates systems to have not just one cache, but a whole hierarchy of caches, from the nanosecond memory possible inside a chip to the multiple exabytes of worldwide data center storage. This hierarchy is illustrated by the table in Figure 9.3. We should caution that this list is just a snapshot; additional layers keep being added over time.

First-level cache. Most modern processor architectures contain a small first-level, virtually addressed, cache very close to the processor, designed to keep the processor fed with instructions and data at the clock rate of the processor.

Second-level cache. Because it is impossible to build a large cache as fast as a small one, the processor will often contain a second-level, physically addressed cache to handle cache misses from the first-level cache.

Third-level cache. Likewise, many processors include an even larger, slower third-level cache to catch second-level cache misses. This cache is often shared across all of the on-chip processor cores.

First- and second-level TLB. The translation lookaside buffer (TLB) will also be organized with multiple levels: a small, fast first-level TLB designed to keep up with the processor, backed up by a larger, slightly slower, second-level TLB to catch first-level TLB misses.

Main memory (DRAM). From a hardware perspective, the first-, second-, and third-level caches provide faster access to main memory; from a software perspective, however, main memory itself can be viewed as a cache.

Data center memory (DRAM). With a high-speed local area network such as in a data center, the latency to fetch a page of data from the memory of a nearby computer is much lower than fetching it from disk. In aggregate, the memory of nearby nodes will often be larger than that of the local disk. Using the memory of nearby nodes to avoid the latency of going to disk is called cooperative caching, as it requires the cooperative management of the nodes in the data center. Many large scale data center services, such as Google and Facebook, make extensive use of cooperative caching.

Local disk or non-volatile memory. For client machines, local disk or non-volatile flash memory can serve as backing store when the system runs out of memory. In turn, the local disk serves as a cache for remote disk storage. For example, web browsers store recently fetched web pages in the client file system to avoid the cost of transferring the data again the next time it is used; once cached, the browser only needs to validate with the server whether the page has changed before rendering the web page for the user.

Data center disk. The aggregate disks inside a data center provide enormous storage capacity compared to a computer’s local disk, and even relative to the aggregate memory of the data center.

Remote data center disk. Geographically remote disks in a data center are much slower because of wide-area network latencies, but they provide access to even larger storage capacity in aggregate. Many data centers also store a copy of their data on a remote robotic tape system, but since these systems have very high latency (measured in the tens of seconds), they are typically accessed only in the event of a failure.

If caching always worked perfectly, we could provide the illusion of instantaneous access to all the world’s data, with the latency (on average) of a first level cache and the size and the cost (on average) of disk storage.

However, there are reasons to be skeptical. Even with temporal and spatial locality, there are thirteen orders of magnitude difference in storage capacity from the first level cache to the stored data of a typical data center; this is the equivalent of the smallest visible dot on this page versus those dots scattered across the pages of a million textbooks just like this one. How can a cache be effective if it can store only a tiny amount of the data that could be stored?

The cost of a cache miss can also be high. There are eight orders of magnitude difference between the latency of the first-level cache and a remote data center disk; that is equivalent to the difference between the shortest latency a human can perceive (roughly one hundred milliseconds) versus one year. How can a cache be effective if the cost of a cache miss is enormous compared to a cache hit?
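The orders of magnitude quoted above can be checked with a little arithmetic. The Python sketch below uses the first and last rows of Figure 9.3, reading the largest capacity as one exabyte and using decimal units (both are assumptions of this example); the variable names are illustrative.

```python
import math

KB = 10**3
EB = 10**18

# Capacity: 64 KB first-level cache versus 1 EB of remote disk.
capacity_orders = math.log10((1 * EB) / (64 * KB))

# Latency: 1 ns first-level cache hit versus 200 ms remote disk access.
latency_orders = math.log10(200e-3 / 1e-9)

# Human-scale analogy: 100 ms versus one year (in seconds).
seconds_per_year = 365.25 * 24 * 3600
human_orders = math.log10(seconds_per_year / 0.1)
```

The capacity ratio comes out a little above thirteen orders of magnitude, and both the latency ratio and the perception-versus-year analogy come out between eight and nine, which agrees with the figures in the text.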

9.3 When Caches Work and When They Do Not

How do we know whether a cache will be effective for a given workload? Even the same program will have different cache behavior depending on how it is used.

Suppose you write a program that reads and writes items into a hash table. How well does that interact with caching? It depends on the size of the hash table. If the hash table fits in the first-level cache, once the table is loaded into the cache, each access will be very rapid. If on the other hand, the hash table is too large to store in memory, each lookup may require a disk access.

Thus, neither the cache size nor the program behavior alone governs the effectiveness of caching. Rather, the interaction between the two determines cache effectiveness.

Figure 9.4: Cache hit rate as a function of cache size for a million instruction run of a C compiler. The hit rate vs. cache size graph has a similar shape for many programs. The knee of the curve is called the working set of the program.

9.3.1 Working Set Model

A useful graph to consider is the cache hit rate versus the size of the cache. We give an example in Figure 9.4; of course, the precise shape of the graph will vary from program to program.

Regardless of the program, a sufficiently large cache will have a high cache hit rate. In the limit, if the cache can fit all of the program’s memory and data, the miss rate will be zero once the data is loaded into the cache. At the other extreme, a sufficiently small cache will have a very low cache hit rate. Anything other than a trivial program will have multiple procedures and multiple data structures; if the cache is sufficiently small, each new instruction and data reference will push out something from the cache that will be used in the near future. For the hash table example, if the size of the cache is much smaller than the size of the hash table, each time we do a lookup, the hash bucket we need will no longer be in the cache.

Most programs will have an inflection point, or knee of the curve, where a critical mass of program data can just barely fit in the cache. This critical mass is called the program’s working set. As long as the working set can fit in the cache, most references will be a cache hit, and application performance will be good.

Thrashing

A closely related concept to the working set is thrashing. A program thrashes if the cache is too small to hold its working set, so that most references are cache misses. Each time there is a cache miss, we need to evict a cache block to make room for the new reference. However, the new cache block may in turn be evicted before it is reused.

The word “thrash” dates from the 1960s, when disk drives were as large as washing machines. If a program’s working set did not fit in memory, the system would need to shuffle memory pages back and forth to disk. This burst of activity would literally make the disk drive shake violently, making it very obvious to everyone nearby why the system was not performing well.

The notion of a working set can also apply to user behavior. Consider what happens when you are developing code for a homework assignment. If the files you need fit in memory, compilation will be rapid; if not, compilation will be slow as each file is brought in from disk as it is used.

Different programs, and different users, will have working sets of different sizes. Even within the same program, different phases of the program may have different size working sets. For example, the parser for a compiler needs different data in cache than the code generator. In a text editor, the working set shifts when we switch from one page to the next. Users also change their focus from time to time, as when you shift from a programming assignment to a history assignment.

Figure 9.5: Example cache hit rate over time. At a phase change within a process, or due to a context switch between processes, there will be a spike of cache misses before the system settles into a new equilibrium.

The result of this phase change behavior is that caches will often have bursty miss rates: periods of low cache misses interspersed with periods of high cache misses, as shown in Figure 9.5. Process context switches will also cause bursty cache misses, as the cache discards the working set from the old process and brings in the working set of the new process.

We can combine the graph in Figure 9.4 with the table in Figure 9.3 to see the impact of the size of the working set on computer system performance. A program whose working set fits in the first level cache will run four times faster than one whose working set fits in the second level cache. A program whose working set does not fit in main memory will run a thousand times slower than one whose working set does, assuming it has access to data center memory. It will run a hundred thousand times slower if it needs to go to disk.

Because of the increasing depth and complexity of the memory hierarchy, an important area of work is the design of algorithms that adapt their working set to the memory hierarchy. One focus has been on algorithms that manage the gap between main memory and disk, but the same principles apply at other levels of the memory hierarchy.

Figure 9.6: Algorithm to sort a large array that does not fit into main memory, by breaking the problem into pieces that do fit into memory.

A simple example is how to efficiently sort an array that does not fit in main memory. (Equivalently, we could consider how to sort an array that does not fit in the first level cache.) As shown in Figure 9.6, we can break the problem up into chunks each of which does fit in memory. Once we sort each chunk, we can merge the sorted chunks together efficiently. To sort a chunk that fits in main memory, we can in turn break the problem into sub-chunks that fit in the on-chip cache.
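The chunk-then-merge idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `chunked_sort` and the `chunk_size` parameter are hypothetical, `chunk_size` stands in for "how many items fit in memory", and a real external sort would write each sorted chunk to disk rather than keep it in a list.

```python
import heapq


def chunked_sort(items, chunk_size):
    """Sort by sorting fixed-size chunks, then merging the sorted chunks.

    chunk_size models the number of items that fit in the faster level of
    the memory hierarchy (an assumption of this sketch).
    """
    # Phase 1: sort each chunk independently; each sort touches only
    # chunk_size items, so its working set fits in the faster memory.
    chunks = [sorted(items[i:i + chunk_size])
              for i in range(0, len(items), chunk_size)]
    # Phase 2: merge. heapq.merge scans every chunk sequentially and
    # keeps only one candidate item per chunk in its heap.
    return list(heapq.merge(*chunks))


if __name__ == "__main__":
    print(chunked_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))
```

Both phases access memory sequentially within a chunk, which is the property that lets the same structure be applied again one level down, with sub-chunks sized to the on-chip cache.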

We will discuss later in this chapter what the operating system needs to do when managing memory between programs that in turn adapt their behavior to manage memory.

9.3.2 Zipf Model

Although the working set model often describes program and user behavior quite well, it is not always a good fit. For example, consider a web proxy cache. A web proxy cache stores frequently accessed web pages to speed web access and reduce network traffic. Web access patterns cause two challenges to a cache designer:

New data. New pages are being added to the web at a rapid rate, and page contents also change. Every time a user accesses a page, the system needs to check whether the page has changed in the meantime.

No working set. Although some web pages are much more popular than others, there is no small subset of web pages that, if cached, give you the bulk of the benefit. Unlike with a working set, even very small caches have some value. Conversely, increasing cache size yields diminishing returns: even very large caches tend to have only modest cache hit rates, as there is an enormous group of pages that are visited from time to time.

A useful model for understanding the cache behavior of web access is the Zipf distribution. Zipf developed the model to describe the frequency of individual words in a text, but it also applies in a number of other settings.

Figure 9.7: Zipf distribution

Suppose we have a set of web pages (or words), and we rank them in order of popularity. Then the frequency users visit a particular web page is (approximately) inversely proportional to its rank:

Frequency of visits to the $k$th most popular page $\propto 1/k^{\alpha}$

where $\alpha$ is a value between 1 and 2. A Zipf probability distribution is illustrated in Figure 9.7.

The Zipf distribution fits a surprising number of disparate phenomena: the popularity of library books, the population of cities, the distribution of salaries, the size of friend lists in social networks, and the distribution of references in scientific papers. The exact cause of the Zipf distribution in many of these cases is unknown, but they share a theme of popularity in human social networks.

Figure 9.8: Cache hit rate as a function of the percentage of total items that can fit in the cache, on a log scale, for a Zipf distribution.

A characteristic of a Zipf curve is a heavy-tailed distribution. Although a significant number of references will be to the most popular items, a substantial portion of references will be to less popular ones. If we redraw Figure 9.4 of the relationship between cache hit rate and cache size, but for a Zipf distribution, we get Figure 9.8. Note that we have rescaled the x-axis to be log scale. Rather than a threshold as we see in the working set model, increasing the cache size continues to improve cache hit rates, but with diminishing returns.
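The shape of such a curve can be explored numerically. The sketch below assumes an idealized cache that always holds the most popular items, so the hit rate is simply the share of all references that go to those items. The function name `zipf_hit_rate`, the item count, and the choice α = 1 are assumptions made for illustration, not values taken from the figure.

```python
def zipf_hit_rate(num_items, cache_items, alpha=1.0):
    """Hit rate when the cache holds the cache_items most popular items.

    Item k (1-based rank) is referenced with weight 1 / k**alpha.
    """
    weights = [1.0 / (k ** alpha) for k in range(1, num_items + 1)]
    return sum(weights[:cache_items]) / sum(weights)


if __name__ == "__main__":
    n = 100_000  # hypothetical number of distinct pages
    for fraction in (0.001, 0.01, 0.1, 1.0):
        size = int(n * fraction)
        print(f"cache holds {fraction:.1%} of items: "
              f"hit rate {zipf_hit_rate(n, size):.2f}")
```

Under these assumptions, each tenfold increase in cache size buys a similar increment in hit rate, so the gain per added item keeps shrinking; no single cache size behaves like a working-set threshold.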

9.4 Memory Cache Lookup

Now that we have outlined the available technologies for constructing caches, and the usage patterns that lend (or do not lend) themselves to effective caching, we turn to cache design. How do we find whether an item is in the cache, and what do we do when we run out of room in the cache? We answer the first question here, and we defer the second question to the next section.

A memory cache maps a sparse set of addresses to the data values stored at those addresses. You can think of a cache as a giant table with two columns: one for the address and one for the data stored at that address. To exploit spatial locality, each entry in the table will store the values for a block of memory, not just the value for a single memory word. Modern Intel processors cache data in 64 byte chunks. For operating systems, the block size is typically the hardware page size, or 4 KB on an Intel processor.

We need to be able to rapidly convert an address to find the corresponding data, while minimizing storage overhead. The options we have for cache lookup are all of the same ones we explored in the previous chapter for address lookup: we can use a linked list, a multi-level tree, or a hash table. Operating systems use each of those techniques in different settings, depending on the size of the cache, its access pattern, and how important it is to have very rapid lookup.

For hardware caches, the design choices are more limited. The latency gap between cache levels is very small, so any added overhead in the lookup procedure can swamp the benefit of the cache. To make lookup faster, hardware caches often constrain where in the table we might find any specific address. This constraint means that there could be room in one part of the table, but not in another, raising the cache miss rate. There is a tradeoff here: a faster cache lookup needs to be balanced against the cost of increased cache misses.

Three common mechanisms for cache lookup are:

Figure 9.9: Fully associative cache lookup. The cache checks the address against every entry and returns the matching value, if any.

Fully associative. With a fully associative cache, the address can be stored anywhere in the table, and so on a lookup, the system must check the address against all of the entries in the table as illustrated in Figure 9.9. There is a cache hit if any of the table entries match. Because any address can be stored anywhere, this provides the system maximal flexibility when it needs to choose an entry to discard when it runs out of space.

We saw two examples of fully associative caches in the previous chapter. Until very recently, TLBs were often fully associative: the TLB would check the virtual page against every entry in the TLB in parallel. Likewise, physical memory is a fully associative cache. Any page frame can hold any virtual page, and we can find where each virtual page is stored using a multi-level tree lookup. The set of page tables defines whether there is a match.

A problem with fully associative lookup is the cumulative impact of Moore’s Law. As more memory can be packed on chip, caches become larger. We can use some of the added memory to make each table entry larger, but this has a limit depending on the amount of spatial locality in typical applications. Alternately, we can add more table entries, but this means more lookup hardware and comparators. As an example, a 2 MB on-chip cache with 64 byte blocks has 32K cache table entries! Checking each address against every table entry in parallel is not practical.

Figure 9.10: Direct mapped cache lookup. The cache hashes the address to determine which location in the table to check. The cache returns the value stored in the entry if it matches the address.

Direct mapped. With a direct mapped cache, each address can only be stored in one location in the table. Lookup is easy: we hash the address to its entry, as shown in Figure 9.10. There is a cache hit if the address matches that entry and a cache miss otherwise.

A direct mapped cache allows efficient lookup, but it loses much of that advantage in decreased flexibility. If a program happens to need two different addresses that both hash to the same entry, such as the program counter and the stack pointer, the system will thrash. We will first get the instruction; then, oops, we need the stack. Then, oops, we need the instruction again. Then oops, we need the stack again. The programmer will see the program running slowly, with no clue why, as it will depend on which addresses are assigned to which instructions and data. If the programmer inserts a print statement to try to figure out what is going wrong, that might shift the instructions to a different cache block, making the problem disappear!

Set associative. A set associative cache melds the two approaches, allowing a tradeoff of slightly slower lookup than a direct mapped cache in exchange for most of the flexibility of a fully associative cache. With a set associative cache, we replicate the direct mapped table and look up in each replica in parallel. A k-way set associative cache has k replicas; a particular address block can be in any of the k replicas. (This is equivalent to a hash table with a bucket size of k.) There is a cache hit if the address matches any of the replicas.

A set associative cache avoids the problem of thrashing with a direct mapped cache, provided the working set for a given bucket is no larger than k. Almost all hardware caches and TLBs today use set associative matching; an 8-way set associative cache structure is common.
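The difference between direct mapped and set associative lookup can be seen in a small software model. The class below is a sketch, not a description of any real processor: the name `SetAssociativeCache`, the use of the low-order block bits as the hash, and LRU replacement within a set are all assumptions. Setting `ways=1` gives a direct mapped cache.

```python
class SetAssociativeCache:
    """Toy model of a k-way set associative cache (ways=1 is direct mapped)."""

    def __init__(self, num_sets, ways, block_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # Each set is a short list of block numbers, least recently used first.
        self.sets = [[] for _ in range(num_sets)]
        self.misses = 0

    def access(self, address):
        """Return True on a cache hit, False on a miss."""
        block = address // self.block_size
        entries = self.sets[block % self.num_sets]  # hash: low-order block bits
        if block in entries:        # hardware would compare the k ways in parallel
            entries.remove(block)
            entries.append(block)   # now the most recently used in its set
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.pop(0)          # evict within the set (LRU is an assumption)
        entries.append(block)
        return False


if __name__ == "__main__":
    direct = SetAssociativeCache(num_sets=1024, ways=1)
    two_way = SetAssociativeCache(num_sets=512, ways=2)   # same total capacity
    a, b = 0, 1024 * 64   # two addresses that hash to the same entry in both
    for _ in range(10):
        for address in (a, b):
            direct.access(address)
            two_way.access(address)
    print("direct mapped misses:", direct.misses)
    print("2-way misses:", two_way.misses)
```

With the same total capacity, the direct mapped configuration misses on every one of the alternating references, while the 2-way configuration misses only on the first touch of each block.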

Figure 9.11: Set associative cache lookup. The cache hashes the address to determine which location to check. The cache checks the entry in each table in parallel. It returns the value if any of the entries match the address.

Direct mapped and set associative caches pose a design challenge for the operating system. These caches are much more efficient if the working set of the program is spread across the different buckets in the cache. This is easy with a TLB or a virtually addressed cache, as each successive virtual page or cache block will be assigned to a cache bucket. A data structure that straddles a page or cache block boundary will be automatically assigned to two different buckets.

However, the assignment of physical page frames is up to the operating system, and this choice can have a large impact on the performance of a physically addressed cache. To make this concrete, suppose we have a 2 MB physically addressed cache with 8-way set associativity and 4 KB pages; this is typical for a high performance processor. Now suppose the operating system happens to assign page frames in a somewhat odd way, so that an application is given physical page frames that are separated by exactly 256 KB. Perhaps those were the only page frames that were free. What happens?

Figure 9.12: When caches are larger than the page size, multiple page frames can map to the same slice of the cache. A process assigned page frames that are separated by exactly the cache size will only use a small portion of the cache. This applies to both set associative and direct mapped caches; the figure assumes a direct mapped cache to simplify the illustration.

If the hardware uses the low order bits of the page frame to index the cache, then every page of the current process will map to the same buckets in the cache. We show this in Figure 9.12. Instead of the cache having 2 MB of useful space, the application will only be able to use 32 KB (4 KB pages times the 8-way set associativity). This makes it a lot more likely for the application to thrash.

Even worse, the application would have no way to know this had happened. If by random chance an application ended up with page frames that map to the same cache buckets, its performance will be poor. Then, when the user re-runs the application, the operating system might assign the application a completely different set of page frames, and performance returns to normal.

To make cache behavior more predictable and more effective, operating systems use a concept called page coloring. With page coloring, physical page frames are partitioned into sets based on which cache buckets they will use. For example, with a 2 MB 8-way set associative cache and 4 KB pages, there will be 64 separate sets, or colors. The operating system can then assign page frames to spread each application’s data across the various colors.
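The arithmetic behind the 64 colors, and the reason frames 256 KB apart collide, can be checked with a short sketch. The helper names `num_colors` and `page_color` are hypothetical, and the sketch assumes the cache is indexed by the low-order bits of the physical address, as in the example above.

```python
KB = 1024
MB = 1024 * KB


def num_colors(cache_bytes, ways, page_bytes):
    """Number of page colors: how many pages fit in one way of the cache."""
    return cache_bytes // ways // page_bytes


def page_color(frame_number, colors):
    """Color of a physical page frame, assuming low-order-bit cache indexing."""
    return frame_number % colors


if __name__ == "__main__":
    colors = num_colors(2 * MB, ways=8, page_bytes=4 * KB)
    print("colors:", colors)                    # 64, as in the text

    # Frames separated by exactly 256 KB are 64 frame numbers apart,
    # so they all share one color and compete for the same cache buckets.
    stride = (256 * KB) // (4 * KB)
    unlucky = [page_color(f, colors) for f in range(0, 10 * stride, stride)]
    print("colors used by the unlucky allocation:", set(unlucky))
```

An allocator that practices page coloring would instead hand out frames whose colors differ, for example by cycling through the colors as it assigns successive virtual pages.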

9.5 Replacement Policies

Once we have looked up an address in the cache and found a cache miss, we have a new problem. Which memory block do we choose to replace? Assuming the reference pattern exhibits temporal locality, the new block is likely to be needed in the near future, so we need to choose some block of memory to evict from the cache to make room for the new data. Of course, with a direct mapped cache we do not have a choice: there is only one block that can be replaced. In general, however, we will have a choice, and this choice can have a significant impact on the cache hit rate.

As with processor scheduling, there are a number of options for the replacement policy. We caution that there is no single right answer! Many replacement policies are optimal for some workloads and pessimal for others, in terms of the cache hit rate; policies that are good for a working set model will not be good for Zipf workloads.

Policies also vary depending on the setting: hardware caches use a different replacement policy than the operating system does in managing main memory as a cache for disk. A hardware cache will often have a limited number of replacement choices, constrained by the set associativity of the cache, and it must make its decisions very rapidly. In the operating system, there is often both more time to make a choice and a much larger number of cached items to consider; e.g., with 4 GB of memory, a system will have a million separate 4 KB pages to choose from when deciding which to replace. Even within the operating system, the replacement policy for the file buffer cache is often different than the one used for demand paged virtual memory, depending on what information is easily available about the access pattern.

We first discuss several different replacement policies in the abstract, and then in the next two sections we consider how these concepts are applied to the setting of demand paging memory from disk.

9.5.1 Random

Although it may seem arbitrary, a practical replacement policy is to choose a random block to replace. Particularly for a first-level hardware cache, the system may not have the time to make a more complex decision, and the cost of making the wrong choice can be small if the item is in the next level cache. The bookkeeping cost for more complex policies can be non-trivial: keeping more information about each block requires space that may be better spent on increasing the cache size.

Random’s biggest weakness is also its biggest strength. Whatever the access pattern is, Random will not be pessimal: it will not make the worst possible choice, at least, not on average. However, it is also unpredictable, and so it might foil an application that was designed to carefully manage its use of different levels of the cache.

9.5.2 First-In-First-Out (FIFO)

A less arbitrary policy is to evict the cache block or page that has been in memory the longest, that is, First In First Out, or FIFO. Particularly for using memory as a cache for disk, this can seem fair: each program’s pages spend a roughly equal amount of time in memory before being evicted.

Unfortunately, FIFO can be the worst possible replacement policy for workloads that happen quite often in practice. Consider a program that cycles through a memory array repeatedly, but where the array is too large to fit in the cache. Many scientific applications do an operation on every element in an array, and then repeat that operation until the data reaches a fixed point. Google’s PageRank algorithm for determining which search results to display uses a similar approach. PageRank iterates repeatedly through all pages, estimating the popularity of a page based on the popularity of the pages that refer to it as computed in the previous iteration.

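FIFO

| Ref. | A | B | C | D | E | A | B | C | D | E | A | B | C | D | E |
|------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   |   |   | E |   |   |   | D |   |   |   | C |   |   |
| 2    |   | B |   |   |   | A |   |   |   | E |   |   |   | D |   |
| 3    |   |   | C |   |   |   | B |   |   |   | A |   |   |   | E |
| 4    |   |   |   | D |   |   |   | C |   |   |   | B |   |   |   |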

Figure 9.13: Cache behavior for FIFO for a repeated scan through memory, where the scan is slightly larger than the cache size. Each row represents the contents of a page frame or cache block; each new reference triggers a cache miss.

On a repeated scan through memory, FIFO does exactly the wrong thing: it always evicts the block or page that will be needed next. Figure 9.13 illustrates this effect. Note that in this figure, and other similar figures in this chapter, we show only a small number of cache slots; note that these policies also apply to systems with a very large number of slots.
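The behavior in Figure 9.13 can be replayed with a small simulation. The sketch below uses a hypothetical helper, `fifo_trace`, that reuses slots in round-robin order; for a cache where blocks leave only by eviction, that is equivalent to evicting the longest-resident block.

```python
def fifo_trace(refs, slots):
    """Simulate FIFO replacement.

    Returns (misses, frames), where frames[i] is the block held in slot i
    after the last reference.
    """
    frames = [None] * slots
    next_victim = 0                      # index of the longest-resident slot
    misses = 0
    for block in refs:
        if block in frames:
            continue                     # a hit does not change FIFO order
        frames[next_victim] = block      # replace the oldest block
        next_victim = (next_victim + 1) % slots
        misses += 1
    return misses, frames


if __name__ == "__main__":
    scan = list("ABCDE") * 3             # the reference string of Figure 9.13
    misses, frames = fifo_trace(scan, slots=4)
    print(f"{misses} misses out of {len(scan)} references; final slots {frames}")
```

With four slots and a five-block scan, every one of the fifteen references misses; giving the same scan five slots leaves only the five compulsory misses.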

9.5.3 Optimal Cache Replacement (MIN)

If FIFO can be pessimal for some workloads, that raises the question: what replacement policy is optimal for minimizing cache misses? The optimal policy, called MIN, is to replace whichever block is used farthest in the future. Equivalently, the worst possible strategy is to replace the block that is used soonest.

Optimality of MIN

The proof that MIN is optimal is a bit involved. If MIN is not optimal, there must be some alternative optimal replacement policy, which we will call ALT, that has fewer cache misses than MIN on some specific sequence of references. There may be many such alternate policies, so let us focus on the one that differs from MIN at the latest possible point. Consider the first cache replacement where ALT differs from MIN; by definition, ALT must choose a block to replace that is used sooner than the block chosen by MIN.

We construct a new policy, ALT′, that is at least as good as ALT, but differs from MIN at a later point and so contradicts the assumption. We construct ALT′ to differ from ALT in only one respect: at the first point where ALT differs from MIN, ALT′ chooses to evict the block that MIN would have chosen. From that point, the contents of the cache differ between ALT and ALT′ only for that one block. ALT contains y, the block referenced farther in the future; ALT′ is the same, except it contains x, the block referenced sooner. On subsequent cache misses to other blocks, ALT′ mimics ALT, evicting exactly the same blocks that ALT would have evicted.

It is possible that ALT chooses to evict y before the next reference to x or y; in this case, if ALT′ chooses to evict x, the contents of the cache for ALT and ALT′ are identical. Further, ALT′ has the same number of cache misses as ALT, but it differs from MIN at a later point than ALT. This contradicts our assumption above, so we can exclude this case.

Eventually, the system will reference x, the block that ALT chose to evict; by construction, this occurs before the reference to y, the block that ALT′ chose to evict. Thus, ALT will have a cache miss, but ALT′ will not. ALT will evict some block, q, to make room for x; now ALT and ALT′ differ only in that ALT contains y and ALT′ contains q. (If ALT evicts y instead, then ALT and ALT′ have the same cache contents, but ALT′ has fewer misses than ALT, a contradiction.) Finally, when we reach the reference to y, ALT′ will take a cache miss. If ALT′ evicts q, then it will have the same number of cache misses as ALT, but it will differ from MIN at a point later than ALT, a contradiction.

As with Shortest Job First, MIN requires knowledge of the future, and so we cannot implement it directly. Rather, we can use it as a goal: we want to come up with mechanisms which are effective at predicting which blocks will be used in the near future, so that we can keep those in the cache.
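Although MIN cannot run online, it can be computed offline for a recorded reference string, which makes it useful as a yardstick for other policies. The sketch below is a deliberately simple (quadratic-time) illustration; the function name `min_misses` is hypothetical.

```python
def min_misses(refs, slots):
    """Count cache misses for MIN on a known reference string.

    On a miss with a full cache, evict the cached block whose next
    reference is farthest in the future.
    """
    cache = set()
    misses = 0
    for i, block in enumerate(refs):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= slots:
            future = refs[i + 1:]

            def next_use(b):
                # Blocks never referenced again count as infinitely far away.
                return future.index(b) if b in future else float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


if __name__ == "__main__":
    scan = list("ABCDE") * 3     # a repeated scan one block larger than the cache
    print(min_misses(scan, slots=4), "misses out of", len(scan))
```

On this repeated scan with four slots, MIN takes 7 misses, compared with a miss on every reference for a policy that always evicts the block needed next.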

If we were able to predict the future, we could do even better than MIN by prefetching blocks so that they arrive “just in time”, exactly when they are needed. In the best case, this can reduce the number of cache misses to zero. For example, if we observe a program scanning through a file, we can prefetch the blocks of the file into memory. Provided we can read the file into memory fast enough to keep up with the program, the program will always find its data in memory and never have a cache miss.

9.5.4 Least Recently Used (LRU)

One way to predict the future is to look at the past. If programs exhibit temporal locality, the locations they reference in the future are likely to be the same as the ones they have referenced in the recent past.

A replacement policy that captures this effect is to evict the block that has not been used for the longest period of time, or the least recently used (LRU) block. In software, LRU is simple to implement: on every cache hit, you move the block to the front of the list, and on a cache miss, you evict the block at the end of the list. In hardware, keeping a linked list of cached blocks is too complex to implement at high speed; instead, we need to approximate LRU, and we will discuss exactly how in a bit.
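The software version of LRU described above can be sketched with Python's `collections.OrderedDict`, which plays the role of the list: `move_to_end` is the "move to the front" step and `popitem(last=False)` removes the entry at the other end. The function name `lru_misses` is hypothetical.

```python
from collections import OrderedDict


def lru_misses(refs, slots):
    """Count cache misses for LRU replacement with the given number of slots."""
    cache = OrderedDict()                  # least recently used entry first
    misses = 0
    for block in refs:
        if block in cache:
            cache.move_to_end(block)       # hit: now the most recently used
        else:
            misses += 1
            if len(cache) >= slots:
                cache.popitem(last=False)  # miss: evict the least recently used
            cache[block] = None
    return misses


if __name__ == "__main__":
    local = list("ABACBDADEDAEBAC")        # a string with temporal locality
    scan = list("ABCDE") * 3               # a repeated scan
    print("locality:", lru_misses(local, 4), "misses")
    print("scan:    ", lru_misses(scan, 4), "misses")
```

With four slots, the string with temporal locality incurs 6 misses in 15 references, while the repeated scan misses on all 15; the same policy can look very good or very bad depending on the workload.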

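LRU

| Ref. | A | B | A | C | B | D | A | D | E | D | A | E | B | A | C |
|------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   | + |   |   |   | + |   |   |   | + |   |   | + |   |
| 2    |   | B |   |   | + |   |   |   |   |   |   |   | + |   |   |
| 3    |   |   |   | C |   |   |   |   | E |   |   | + |   |   |   |
| 4    |   |   |   |   |   | D |   | + |   | + |   |   |   |   | C |

FIFO

| Ref. | A | B | A | C | B | D | A | D | E | D | A | E | B | A | C |
|------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   | + |   |   |   | + |   | E |   |   | + |   |   |   |
| 2    |   | B |   |   | + |   |   |   |   |   | A |   |   | + |   |
| 3    |   |   |   | C |   |   |   |   |   |   |   |   | B |   |   |
| 4    |   |   |   |   |   | D |   | + |   | + |   |   |   |   | C |

MIN

| Ref. | A | B | A | C | B | D | A | D | E | D | A | E | B | A | C |
|------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   | + |   |   |   | + |   |   |   | + |   |   | + |   |
| 2    |   | B |   |   | + |   |   |   |   |   |   |   | + |   | C |
| 3    |   |   |   | C |   |   |   |   | E |   |   | + |   |   |   |
| 4    |   |   |   |   |   | D |   | + |   | + |   |   |   |   |   |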

Figure 9.14: Cache behavior for LRU (top), FIFO (middle), and MIN (bottom) for a reference pattern that exhibits temporal locality. Each row represents the contents of a page frame or cache block; + indicates a cache hit. On this reference pattern, LRU is the same as MIN up to the final reference, where MIN can choose to replace any block.

In some cases, LRU can be optimal, as in the example in Figure 9.14. The table illustrates a reference pattern that exhibits a high degree of temporal locality; when recent references are more likely to be referenced in the near future, LRU can outperform FIFO.

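LRU

| Ref. | A | B | C | D | E | A | B | C | D | E | A | B | C | D | E |
|------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   |   |   | E |   |   |   | D |   |   |   | C |   |   |
| 2    |   | B |   |   |   | A |   |   |   | E |   |   |   | D |   |
| 3    |   |   | C |   |   |   | B |   |   |   | A |   |   |   | E |
| 4    |   |   |   | D |   |   |   | C |   |   |   | B |   |   |   |

MIN

| Ref. | A | B | C | D | E | A | B | C | D | E | A | B | C | D | E |
|------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   |   |   |   | + |   |   |   |   | + |   |   |   |   |
| 2    |   | B |   |   |   |   | + |   |   |   |   | + | C |   |   |
| 3    |   |   | C |   |   |   |   | + | D |   |   |   |   | + |   |
| 4    |   |   |   | D | E |   |   |   |   | + |   |   |   |   | + |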

Figure 9.15: Cache behavior for LRU (top) and MIN (bottom) for a reference pattern that repeatedly scans through memory. Each row represents the contents of a page frame or cache block; + indicates a cache hit. On this reference pattern, LRU is the same as FIFO, with a cache miss on every reference; the optimal strategy is to replace the most recently used page, as that will be referenced farthest into the future.

On this particular sequence of references, LRU behaves similarly to the optimal strategy MIN, but that will not always be the case. In fact, LRU can sometimes be the worst possible cache replacement policy. This occurs whenever the least recently used block is the next one to be referenced. A common situation where LRU is pessimal is when the program makes repeated scans through memory, illustrated in Figure 9.15; we saw earlier that FIFO is also pessimal for this reference pattern. The best possible strategy is to replace the most recently referenced block, as this block will be used farthest into the future.

9.5.5 Least Frequently Used (LFU)

Consider again the case of a web proxy cache. Whenever a user accesses a page, it is more likely for that user to access other nearby pages (spatial locality); sometimes, as with a flash crowd, it can be more likely for other users to access the same page (temporal locality). On the surface, Least Recently Used seems like a good fit for this workload.

However, when a user visits a rarely used page, LRU will treat the page as important, even though it is probably just a one-off. When I do a Google search for a mountain hut for a stay in Western Iceland, the web pages I visit will not suddenly become more popular than the latest Facebook update from Katy Perry.

A better strategy for references that follow a Zipf distribution is Least Frequently Used (LFU). LFU discards the block that has been used least often; it therefore keeps popular pages, even when less popular pages have been touched more recently.
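A minimal LFU sketch follows. Several details are assumptions of this sketch rather than part of the policy as stated: reference counts are remembered even after a block is evicted, ties are broken by block name so the result is deterministic, and the function name `lfu_misses` is hypothetical.

```python
from collections import Counter


def lfu_misses(refs, slots):
    """Count cache misses for LFU: evict the cached block referenced least often."""
    counts = Counter()   # references seen so far, kept across evictions (assumption)
    cache = set()
    misses = 0
    for block in refs:
        counts[block] += 1
        if block in cache:
            continue
        misses += 1
        if len(cache) >= slots:
            # Least frequently used; ties broken by name for determinism.
            victim = min(cache, key=lambda b: (counts[b], b))
            cache.remove(victim)
        cache.add(block)
    return misses


if __name__ == "__main__":
    # Two popular pages (P, Q) interleaved with one-off visits (X, Y, Z).
    refs = list("PPPQQQXPYQZPQ")
    print(lfu_misses(refs, slots=3), "misses out of", len(refs))
```

In this example the one-off pages X, Y, and Z displace each other rather than the popular pages P and Q, so the only misses are the five first-time references.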

LRU and LFU both attempt to predict future behavior, and they have complementary strengths. Many systems meld the two approaches to gain the benefits of each. LRU is better at keeping the current working set in memory; once the working set is taken care of, however, LRU will yield diminishing returns. Instead, LFU may be better at predicting what files or memory blocks will be needed in the more distant future, e.g., after the next working set phase change.

Replacement policy and file size

Our discussion up to now has assumed that all cached items are equal, both in size and in cost to replace. When these assumptions do not hold, however, we may sometimes want to vary the policy from LRU or LFU, that is, to keep some items that are less frequently or less recently used ahead of others that are more frequently or more recently used.

For example, consider a web proxy that caches files to improve web responsiveness. These files may have vastly different sizes. When making room for a new file, we have a choice between evicting one very large web page object or a much larger number of smaller objects. Even if each small file is less frequently used than the large file, it may still make sense to keep the small files. In aggregate they may be more frequently used, and therefore they may have a larger benefit to overall system performance. Likewise, if a cached item is expensive to regenerate, it is more important to keep cached than one that is more easily replaced.

Parallel computing makes the calculus even more complex. The performance of a parallel program depends on its critical path: the minimum sequence of steps for the program to produce its result. Cache misses that occur on the critical path affect the response time while those that occur off the critical path do not. For example, a parallel MapReduce job forks a set of tasks onto processors; each task reads in a file and produces an output. Because MapReduce must wait until all tasks are complete before moving on to the next step, if any file is not cached it is as bad as if all of the needed files were not cached.

9.5.6 Belady’s Anomaly

Intuitively, it seems like it should always help to add space to a memory cache; being able to store more blocks should always either improve the cache hit rate, or at least, not make the cache hit rate any worse. For many cache replacement strategies, this intuition is true. However, in some cases, adding space to a cache can actually hurt the cache hit rate. This is called Belady’s anomaly, after the person who discovered it.

First, we note that many of the schemes we have defined can be proven to yield no worse cache behavior with larger cache sizes. For example, with the optimal strategy MIN, if we have a cache of size k blocks, we will keep the next k blocks that will be referenced. If we have a cache of size k + 1 blocks, we will keep all of the same blocks as with a k sized cache, plus the additional block that will be the k + 1 next reference.

We can make a similar argument for LRU and LFU. For LRU, a cache of size k + 1 keeps all of the same blocks as a k sized cache, plus the block that is referenced farthest in the past. Even if LRU is a lousy replacement policy (if it rarely keeps the blocks that will be used in the near future), it will always do at least as well as a slightly smaller cache also using the same replacement policy. An equivalent argument can be used for LFU.

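FIFO (3 slots)

| Ref. | A | B | C | D | A | B | E | A | B | C | D | E |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   |   | D |   |   | E |   |   |   |   | + |
| 2    |   | B |   |   | A |   |   | + |   | C |   |   |
| 3    |   |   | C |   |   | B |   |   | + |   | D |   |

FIFO (4 slots)

| Ref. | A | B | C | D | A | B | E | A | B | C | D | E |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
| 1    | A |   |   |   | + |   | E |   |   |   | D |   |
| 2    |   | B |   |   |   | + |   | A |   |   |   | E |
| 3    |   |   | C |   |   |   |   |   | B |   |   |   |
| 4    |   |   |   | D |   |   |   |   |   | C |   |   |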

Figure 9.16: Cache behavior for FIFO with two different cache sizes, illustrating Belady’s anomaly. For this sequence of references, the larger cache suffers ten cache misses, while the smaller cache has one fewer.

Some replacement policies, however, do not have this behavior. Instead, the contents of a cache with k + 1 blocks may be completely different than the contents of a cache with k blocks. As a result, their cache hit rates may diverge. Among the policies we have discussed, FIFO suffers from Belady’s anomaly, and we illustrate that in Figure 9.16.
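The anomaly is easy to reproduce. The sketch below counts FIFO misses on the reference string of Figure 9.16 for three and four slots. It relies on the documented behavior of `collections.deque` with `maxlen`: appending to a full deque discards the item at the opposite end, which here is the longest-resident block. The function name `fifo_misses` is hypothetical.

```python
from collections import deque


def fifo_misses(refs, slots):
    """Count cache misses for FIFO replacement with the given number of slots."""
    cache = deque(maxlen=slots)    # a full deque drops its oldest entry on append
    misses = 0
    for block in refs:
        if block not in cache:
            misses += 1
            cache.append(block)    # evicts the longest-resident block if full
    return misses


refs = list("ABCDABEABCDE")        # the reference string of Figure 9.16
for slots in (3, 4):
    print(slots, "slots:", fifo_misses(refs, slots), "misses")
```

This reports 9 misses with three slots and 10 with four, matching the figure: the larger FIFO cache does worse on this particular sequence.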

9.6 Case Study: Memory-Mapped Files

To illustrate the concepts presented in this chapter, we consider in detail how an operating system can implement demand paging. With demand paging, applications can access more memory than is physically present on the machine, by using memory pages as a cache for disk blocks. When the application accesses a missing memory page, it is transparently brought in from disk. We start with the simpler case of demand paging for a single, memory-mapped file and then extend the discussion to managing multiple processes competing for space in main memory.

As we discussed in Chapter 3, most programs use explicit read/write system calls to perform file I/O. Read/write system calls allow the program to work on a copy of file data. The program opens a file and then invokes the system call read to copy chunks of file data into buffers in the program’s address space. The program can then use and modify those chunks, without affecting the underlying file. For example, it can convert the file from the disk format into a more convenient in-memory format. To write changes back to the file, the program invokes the system call write to copy the data from the program buffers out to disk. Reading and writing files via system calls is simple to understand and reasonably efficient for small files.

An alternative model for file I/O is to map the file contents into the program’s virtual address space. For a memory-mapped file, the operating system provides the illusion that the file is a program segment; like any memory segment, the program can directly issue instructions to load and store values to the memory. Unlike file read/write, the load and store instructions do not operate on a copy; they directly access and modify the contents of the file, treating memory as a write-back cache for disk.

We saw an example of a memory-mapped file in the previous chapter: the program executable image. To start a process, the operating system brings the executable image into memory, and creates page table entries to point to the page frames allocated to the executable. The operating system can start the program executing as soon as the first page frame is initialized, without waiting for the other pages to be brought in from disk. For this, the other page table entries are set to invalid: if the process accesses a page that has not reached memory yet, the hardware traps to the operating system, and the process waits until the page is available before it continues to execute. From the program’s perspective, there is no difference (except for performance) between whether the executable image is entirely in memory or still mostly on disk.

We can generalize this concept to any file stored on disk, allowing applications to treat any file as part of their virtual address space. File blocks are brought in by the operating system when they are referenced, and modified blocks are copied back to disk, with the bookkeeping done entirely by the operating system.
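From the application's side, using a memory-mapped file looks like ordinary memory access. The sketch below uses Python's standard `mmap` module on a temporary file; the helper name `patch_file_in_place` and the file contents are made up for illustration.

```python
import mmap
import os
import tempfile


def patch_file_in_place(path, offset, new_bytes):
    """Overwrite part of a file through a memory mapping instead of write()."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:    # length 0 maps the whole file
            # A plain store into the mapped region; no explicit write() call.
            mapped[offset:offset + len(new_bytes)] = new_bytes
            mapped.flush()                          # ask the OS to write back dirty pages


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello, world\n")
    patch_file_in_place(tmp.name, 7, b"mmap!")
    with open(tmp.name, "rb") as f:
        print(f.read())
    os.remove(tmp.name)
```

The slice assignment modifies the file's contents directly; which pages are resident, and when dirty pages reach the disk, is left to the operating system (the `flush` call only requests that write-back happen now).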

9.6.1 Advantages

Memory-mapped files offer a number of advantages:

Transparency. The program can operate on the bytes in the file as if they are part of memory; specifically, the program can use a pointer into the file without needing to check if that portion of the file is in memory or not.

Zero copy I/O. The operating system does not need to copy file data from kernel buffers into user memory and back; rather, it just changes the program’s page table entry to point to the physical page frame containing that portion of the file. The kernel is responsible for copying data back and forth to disk. We should note that it is possible to implement zero copy I/O for explicit read/write file system calls in certain restricted cases; we will explain how in the next chapter.

Pipelining. The program can start operating on the data in the file as soon as the page tables have been set up; it does not need to wait for the entire file to be read into memory. With multiple threads, a program can use explicit read/write calls to pipeline disk I/O, but it needs to manage the pipeline itself.

Interprocess communication. Two or more processes can share information instantaneously through a memory-mapped file without needing to shuffle data back and forth to the kernel or to disk. If the hardware architecture supports it, the page table for the shared segment can also be shared.

Large files. As long as the page table for the file can fit in physical memory, the only limit on the size of a memory-mapped file is the size of the virtual address space. For example, an application may have a giant multi-level tree indexing data spread across a number of disks in a data center. With read/write system calls, the application needs to explicitly manage which parts of the tree are kept in memory and which are on disk; alternatively, with memory-mapped files, the application can leave that bookkeeping to the operating system.

9.6.2 Implementation

To implement memory-mapped files, the operating system provides a system call to map the file into a portion of the virtual address space. In the system call, the kernel initializes a set of page table entries for that region of the virtual address space, setting each entry to invalid. The kernel then returns to the user process.

Figure 9.17: Before a page fault, the page table has an invalid entry for the referenced page and the data for the page is stored on disk.

Figure 9.18: After the page fault, the page table has a valid entry for the referenced page with the page frame containing the data that had been stored on disk. The old contents of the page frame are stored on disk and the page table entry that previously pointed to the page frame is set to invalid.

When the process issues an instruction that touches an invalid mapped address, a sequence of events occurs, illustrated in Figures 9.17 and 9.18:

TLB miss. The hardware looks the virtual page up in the TLB, and finds that there is not a valid entry. This triggers a full page table lookup in hardware.

Page table exception. The hardware walks the multi-level page table and finds the page table entry is invalid. This causes a hardware page fault exception trap into the operating system kernel.

Convert virtual address to file offset. In the exception handler, the kernel looks up in its segment table to find the file corresponding to the faulting virtual address and converts the address to a file offset.

Disk block read. The kernel allocates an empty page frame and issues a disk operation to read the required file block into the allocated page frame. While the disk operation is in progress, the processor can be used for running other threads or processes.

Disk interrupt. The disk interrupts the processor when the disk read finishes, and the scheduler resumes the kernel thread handling the page fault exception.

Page table update. The kernel updates the page table entry to point to the page frame allocated for the block and sets the entry to valid.

Resume process. The operating system resumes execution of the process at the instruction that caused the exception.

TLB miss. The TLB still does not contain a valid entry for the page, triggering a full page table lookup.

Page table fetch. The hardware walks the multi-level page table, finds the page table entry valid, and returns the page frame to the processor. The processor loads the TLB with the new translation, evicting a previous TLB entry, and then uses the translation to construct a physical address for the instruction.

Tomakethiswork,weneedanemptypageframetoholdtheincomingpagefromdisk.Tocreateanemptypageframe,theoperatingsystemmust:

Selectapagetoevict.Assumingthereisnotanemptypageofmemoryalreadyavailable,theoperatingsystemneedstoselectsomepagetobereplaced.WediscusshowtoimplementthisselectioninSection9.6.3below.

Findpagetableentriesthatpointtotheevictedpage.Theoperatingsystemthenlocatesthesetofpagetableentriesthatpointtothepagetobereplaced.Itcandothiswithacoremap—anarrayofinformationabouteachphysicalpageframe,includingwhichpagetableentriescontainpointerstothatparticularpageframe.

Set each page table entry to invalid. The operating system needs to prevent anyone from using the evicted page while the new page is being brought into memory. Because the processor can continue to execute while the disk read is in progress, the page frame may temporarily contain a mixture of bytes from the old and the new page. In addition, because the TLB may cache a copy of the old page table entry, a TLB shootdown is needed to evict the old translation from the TLB.

Copy back any changes to the evicted page. If the evicted page was modified, the contents of the page must be copied back to disk before the new page can be brought into memory. Likewise, the contents of modified pages must also be copied back when the application closes the memory-mapped file.
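The eviction steps above can be sketched as a small simulation. This is an illustrative model under stated assumptions, not kernel code: the PTE and Frame records, the dictionary standing in for the disk, and the tlb_shootdown callback are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PTE:
    """One page table entry (fields are illustrative)."""
    frame: Optional[int] = None
    valid: bool = False
    dirty: bool = False


@dataclass
class Frame:
    """Core map record for one physical page frame."""
    file_block: Optional[int] = None
    mappers: List[PTE] = field(default_factory=list)  # PTEs pointing here


def evict(frame_no, core_map, disk, tlb_shootdown):
    """Empty one page frame; return True if its contents were written back."""
    frame = core_map[frame_no]
    # Use the core map to find every page table entry for this frame,
    # mark each invalid, and remove any cached translation.
    for pte in frame.mappers:
        pte.valid = False
        tlb_shootdown(pte)
    # The page is dirty if any entry that mapped it has its dirty bit set.
    was_dirty = any(pte.dirty for pte in frame.mappers)
    if was_dirty:
        disk[frame.file_block] = f"contents of frame {frame_no}"
    # The frame is now free to hold the incoming page.
    for pte in frame.mappers:
        pte.frame = None
        pte.dirty = False
    frame.mappers.clear()
    frame.file_block = None
    return was_dirty
```

The sketch checks the dirty bits only after the entries are invalid and the shootdown has been requested, so that no later store can slip in unrecorded; a real kernel also has to wait for the shootdown and the disk write to complete.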

Figure 9.19: When a page is clean, its dirty bit is set to zero in both the TLB and the page table, and the data in memory is the same as the data stored on disk.

Figure 9.20: On the first store instruction to a clean page, the hardware sets the dirty bit for that page in the TLB and the page table. The contents of the page will differ from what is stored on disk.

How does the operating system know which pages have been modified? A correct, but inefficient, solution is to simply assume that every page in a memory-mapped file has been modified; if the data has not been changed, the operating system will have wasted some work, but the contents of the file will not be affected.

A more efficient solution is for the hardware to keep track of which pages have been modified. Most processor architectures reserve a bit in each page table entry to record whether the page has been modified. This is called a dirty bit. The operating system initializes the bit to zero, and the hardware sets the bit automatically when it executes a store instruction for that virtual page. Since the TLB can contain a copy of the page table entry, the TLB also needs a dirty bit per entry. The hardware can ignore the dirty bit if it is set in the TLB, but whenever it goes from zero to one, the hardware needs to copy the bit back to the corresponding page table entry. Figures 9.19 and 9.20 show the state of the TLB, page table, memory and disk before and after the first store instruction to a page.

If there are multiple page table entries pointing at the same physical page frame, the page is dirty (and must be copied back to disk) if any of the page tables have the dirty bit set. Normally, of course, a memory-mapped file will have a single page table shared between all of the processes mapping the file.

Because evicting a dirty page takes more time than evicting a clean page, the operating system can proactively clean pages in the background. A thread runs in the background, looking for pages that are likely candidates for being evicted if they were clean. If the hardware dirty bit is set in the page table entry, the kernel resets the bit in the page table entry and does a TLB shootdown to remove the entry from the TLB (with the old value of the dirty bit). It then copies the page to disk. Of course, the on-chip processor memory cache and write buffers can contain modifications to the page that have not reached main memory; the hardware ensures that the new data reaches main memory before those bytes are copied to the disk interface.

The kernel can then restart the application; it need not wait for the block to reach disk. If the process modifies the page again, the hardware will simply set the dirty bit again, signaling that the block cannot be reclaimed without saving the new set of changes to disk.

Emulating a hardware dirty bit in software

Interestingly, hardware support for a dirty bit is not strictly required. The operating system can emulate a hardware dirty bit using page table access permissions. An unmodified page is set to allow only read-only access, even though the program is logically allowed to write the page. The program can then execute normally. On a store instruction to the page, the hardware will trigger a memory exception. The operating system can then record the fact that the page is dirty, upgrade the page protection to read-write, and restart the process.

To clean a page in the background, the kernel resets the page protection to read-only and does a TLB shootdown. The shootdown removes any translation that allows for read-write access to the page, forcing subsequent store instructions to cause another memory exception.
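The emulation just described can be sketched as a toy model in which a Python exception stands in for the hardware memory exception. The class and function names (Page, ProtectionFault, hardware_store, store, clean) are assumptions introduced for illustration.

```python
class ProtectionFault(Exception):
    """Stands in for the hardware memory exception on a disallowed store."""


class Page:
    def __init__(self):
        self.writable = False   # permission the hardware checks (read-only at first)
        self.sw_dirty = False   # dirty bit kept by the kernel in software


def hardware_store(page):
    """What the hardware does on a store: check the permission bit."""
    if not page.writable:
        raise ProtectionFault()


def store(page):
    """Execute a store, with the kernel handling any resulting fault."""
    try:
        hardware_store(page)
    except ProtectionFault:
        page.sw_dirty = True     # record that the page is now dirty
        page.writable = True     # upgrade protection to read-write
        hardware_store(page)     # restart the faulting instruction


def clean(page, write_to_disk):
    """Background cleaning: write the page out and re-arm the trap."""
    if page.sw_dirty:
        page.writable = False    # a real kernel follows this with a TLB shootdown
        page.sw_dirty = False
        write_to_disk(page)
```

Only the first store after each cleaning takes the fault; later stores to the same page run at full speed until the page is cleaned again.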

9.6.3 Approximating LRU

A further challenge to implementing demand paged memory-mapped files is that the hardware does not keep track of which pages are least recently or least frequently used. Doing so would require the hardware to keep a linked list of every page in memory, and to modify that list on every load and store instruction (and for memory-mapped executable images, every instruction fetch as well). This would be prohibitively expensive. Instead, the hardware maintains a minimal amount of access information per page to allow the operating system to approximate LRU or LFU if it wants to do so.

We should note that explicit read/write file system calls do not have this problem. Each time a process reads or writes a file block, the operating system can keep track of which blocks are used. The kernel can use this information to prioritize its cache of file blocks when the system needs to find space for a new block.

Most processor architectures keep a use bit in each page table entry, next to the hardware dirty bit we discussed above. The operating system clears the use bit when the page table entry is initialized; the bit is set in hardware whenever the page table entry is brought into the TLB. As with the dirty bit, a physical page is used if any of the page table entries have their use bit set.

Figure 9.21: The clock algorithm sweeps through each page frame, collecting the current value of the use bit for that page and resetting the use bit to zero. The clock algorithm stops when it has reclaimed a sufficient number of unused page frames.

The operating system can leverage the use bit in various ways, but a commonly used approach is the clock algorithm, illustrated in Figure 9.21. Periodically, the operating system scans through the core map of physical memory pages. For each page frame, it records the value of the use bit in the page table entries that point to that frame, and then clears their use bits. Because the TLB can have a cached copy of the translation, the operating system also does a shootdown for any page table entry where the use bit is cleared. Note that if the use bit is already zero, the translation cannot be in the TLB. While it is scanning, the kernel can also look for dirty and recently unused pages and flush these out to disk.
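One sweep of the clock hand can be sketched as follows. This is a simplified model that assumes one use bit per frame, held in a Python list; the function name clock_sweep and the limit of one revolution per call are choices made for this sketch.

```python
def clock_sweep(use_bits, hand, needed):
    """Advance the clock hand until `needed` unused frames are found,
    or until the hand has made one full revolution.

    use_bits: list of booleans, one per page frame (modified in place).
    Returns (list of reclaimable frame numbers, new hand position).
    """
    victims = []
    n = len(use_bits)
    for _ in range(n):
        if len(victims) == needed:
            break
        if use_bits[hand]:
            # Recently used: clear the bit so the frame must be touched
            # again to survive the next sweep. A real kernel would also
            # shoot down the cached TLB entry here.
            use_bits[hand] = False
        else:
            # Not used since the last sweep: a candidate for reclaiming.
            victims.append(hand)
        hand = (hand + 1) % n
    return victims, hand
```

For example, starting at frame 0 with use bits [True, False, True, False] and needed=1, the sweep clears frame 0's bit, reclaims frame 1, and leaves the hand at frame 2.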

Each sweep of the clock algorithm through memory collects one bit of information about page usage; by adjusting the frequency of the clock algorithm, we can collect increasingly fine-grained information about usage, at the cost of increased software overhead. On modern systems with hundreds of thousands and sometimes millions of physical page frames, the overhead of the clock algorithm can be substantial.

The policy for what to do with the usage information is up to the operating system kernel. A common policy is called not recently used, or k’th chance. If the operating system needs to evict a page, the kernel picks one that has not been used (has not had its use bit set) for the last k sweeps of the clock algorithm. The clock algorithm partitions pages based on how recently they have been used; among page frames in the same k’th chance partition, the operating system can evict pages in FIFO order.
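The k'th chance policy can be sketched by keeping, for each frame, a count of consecutive sweeps in which its use bit was found clear. The names sweep, pick_victim, and idle_sweeps, and the representation of FIFO order as a list of frame numbers, are assumptions of this sketch.

```python
def sweep(use_bits, idle_sweeps):
    """One clock sweep: fold each frame's use bit into its idle count,
    then clear the bit for the next interval."""
    for frame, used in enumerate(use_bits):
        idle_sweeps[frame] = 0 if used else idle_sweeps[frame] + 1
        use_bits[frame] = False


def pick_victim(idle_sweeps, k, fifo_order):
    """Return the first frame, in FIFO order, that has gone unused for at
    least k sweeps; return None if every frame was used more recently."""
    for frame in fifo_order:
        if idle_sweeps[frame] >= k:
            return frame
    return None
```

Larger values of k give a frame more chances to be touched before it becomes eligible, at the cost of needing more sweeps before any frame can be reclaimed.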

Some systems trigger the clock algorithm only when a page is needed, rather than periodically in the background. Provided some pages have not been accessed since the last sweep, an on-demand clock algorithm will find a page to reclaim. If all pages have been accessed, e.g., if there is a storm of page faults due to phase change behavior, then the system will default to FIFO.

Emulating a hardware use bit in software

Hardware support for a use bit is also not strictly required. The operating system kernel can emulate a use bit with page table permissions, in the same way that the kernel can emulate a hardware dirty bit. To collect usage information about a page, the kernel sets the page table entry to be invalid even though the page is in memory and the application has permission to access the page. When the page is accessed, the hardware will trigger an exception and the operating system can record the use of the page. The kernel then changes the permission on the page to allow access, before restarting the process. To collect usage information over time, the operating system can periodically reset the page table entry to invalid and shoot down any cached translations in the TLB.

Many systems use a hybrid approach. In addition to active pages where the hardware collects the use bit, the operating system maintains a pool of unused, clean page frames that are unmapped in any virtual address space, but still contain their old data. When a new page frame is needed, pages in this pool can be used without any further work. However, if the old data is referenced before the page frame is reused, the page can be pulled out of the pool and mapped back into the virtual address space.

Systems with a software-managed TLB have an even simpler time. Each time there is a TLB miss with a software-managed TLB, there is a trap to the kernel to look up the translation. During the trap, the kernel can update its list of frequently used pages.

9.7 Case Study: Virtual Memory

We can generalize on the concept of memory-mapped files, by backing every memory segment with a file on disk. This is called virtual memory. Program executables, individual libraries, data, stack and heap segments can all be demand paged to disk. Unlike memory-mapped files, though, process memory is ephemeral: when the process exits, there is no need to write modified data back to disk, and we can reclaim the disk space.

The advantage of virtual memory is flexibility. The system can continue to function even though the user has started more processes than can fit in main memory at the same time. The operating system simply makes room for the new processes by paging the memory of idle applications to disk. Without virtual memory, the user has to do memory management by hand, closing some applications to make room for others.

All of the mechanisms we have described for memory-mapped files apply when we generalize to virtual memory, with one additional twist. We need to balance the allocation of physical page frames between processes. Unfortunately, this balancing is quite tricky. If we add a few extra page faults to a system, no one will notice. However, a modern disk can handle at most 100 page faults per second, while a modern multi-core processor can execute 10 billion instructions per second. Thus, if page faults are anything but extremely rare, performance will suffer.
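A short calculation shows how rare page faults need to be. The sketch below uses the two figures from the text and assumes, for simplicity, that faults are serviced one at a time and that the processor does no other useful work while a fault is outstanding; the function name effective_rate is introduced here for illustration.

```python
INSTRUCTIONS_PER_SEC = 10e9   # processor rate quoted in the text
FAULTS_PER_SEC = 100          # disk page fault rate quoted in the text


def effective_rate(fault_fraction):
    """Instructions completed per second when the given fraction of
    instructions triggers a page fault (simplified serial model)."""
    time_per_instruction = (
        (1 - fault_fraction) / INSTRUCTIONS_PER_SEC
        + fault_fraction / FAULTS_PER_SEC
    )
    return 1 / time_per_instruction
```

Under these assumptions, effective_rate(1e-8) is about 5 billion instructions per second: a single page fault per hundred million instructions is enough to halve throughput.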

9.7.1 Self-Paging

One consideration is that the behavior of one process can significantly hurt the performance of other programs running at the same time. For example, suppose we have two processes. One is a normal program, with a working set equal to say, a quarter of physical memory. The other program is greedy; while it can run fine with less memory, it will run faster if it is given more memory. We gave an example of this earlier with the sort program.

Can you design a program to take advantage of the clock algorithm to acquire more than its fair share of memory pages?

Figure 9.22: The “pig” program to greedily acquire memory pages. The implementation assumes we are running on a multicore computer. When the pig triggers a page fault by touching a new memory page (soFar), the operating system will find all of the pig’s pages up to soFar recently used. The operating system will keep these in memory and it will choose to evict a page from some other application.

We give an example in Figure 9.22, which we will dub “pig” for obvious reasons. It allocates an array in virtual memory equal in size to physical memory; it then uses multiple threads to cycle through memory, causing each page to be brought in while the other pages remain very recently used.

A normal program sharing memory with the pig will eventually be frozen out of memory and stop making progress. When the pig touches a new page, it triggers a page fault, but all of its pages are recently used because of the background thread. Meanwhile, the normal program will have recently touched many of its pages but there will be some that are less recently used. The clock algorithm will choose those for replacement.

As time goes on, more and more of the pages will be allocated to the pig. As the number of pages assigned to the normal program drops, it starts experiencing page faults at an increasing frequency. Eventually, the number of pages drops below the working set, at which point the program stops making much progress. Its pages are even less frequently used, making them easier to evict.

Of course, a normal user would probably never run (or write!) a program like this, but a malicious attacker launching a computer virus might use this approach to freeze out the system administrator. Likewise, in a data center setting, a single server can be shared between multiple applications from different users, for example, running in different virtual machines. It is in the interest of any single application to acquire as many physical resources as possible, even if that hurts performance for other users.

A widely adopted solution is self-paging. With self-paging, each process or user is assigned its fair share of page frames, using the max-min scheduling algorithm we described in Chapter 7. If all of the active processes can fit in memory at the same time, the system does not need to page. As the system starts to page, it evicts the page from whichever process has the most allocated to it. Thus, the pig would only be able to allocate its fair share of page frames, and beyond that any page faults it triggers would evict its own pages.
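The victim selection rule for self-paging can be sketched in a few lines. The function name choose_victim_process, the dictionary of per-process frame counts, and the tie-breaking rule (a faulting process that is tied for the largest allocation pages against itself) are assumptions of this sketch rather than details given in the text.

```python
def choose_victim_process(frames_held, faulting_pid):
    """Pick the process that must give up a page frame.

    frames_held: dict mapping process id -> number of frames it holds.
    The process with the largest allocation loses a frame. If the faulting
    process is tied for the largest, it is chosen (assumed tie-break).
    """
    most = max(frames_held.values())
    if frames_held.get(faulting_pid, 0) == most:
        return faulting_pid
    return max(frames_held, key=frames_held.get)
```

With this rule a process that has grown past the others can only evict its own pages when it faults, while a smaller process that faults takes a frame from the largest one.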

Unfortunately, self-paging comes at a cost in reduced resource utilization. Suppose we have two processes, both of which allocate large amounts of virtual address space. However, the working sets of the two programs can fit in memory at the same time, for example, if one working set takes up 2/3rds of memory and the other takes up 1/3rd. If they cooperate, both can run efficiently because the system has room for both working sets. However, if we need to bulletproof the operating system against malicious programs by self-paging, then each will be assigned half of memory and the larger program will thrash.

9.7.2 Swapping

Another issue is what happens as we increase the workload for a system with virtual memory. If we are running a data center, for example, we can share physical machines among a much larger number of applications each running in a separate virtual machine.

To reduce costs, the data center needs to support the maximum number of applications on each server, within some performance constraint.

If the working sets of the applications easily fit in memory, then as page faults occur, the clock algorithm will find lightly used pages (that is, those outside of the working set of any process) to evict to make room for new pages. So far so good. As we keep adding active processes, however, their working sets may no longer fit, even if each process is given their fair share of memory. In this case, the performance of the system will degrade dramatically.

This can be illustrated by considering how system throughput is affected by the number of processes. As we add work to the system, throughput increases as long as there is enough processing capacity and I/O bandwidth. When we reach the point where there are too many tasks to fit entirely in memory, the system starts demand paging. Throughput can continue to improve if there are enough lightly used pages to make room for new tasks, but eventually throughput levels off and then falls off a cliff. In the limit, every instruction will trigger a page fault, meaning that the processor executes at 100 instructions per second, rather than 10 billion instructions per second. Needless to say, the user will think the system is dead even if it is in fact inching forward very slowly.

As we explained in the Chapter 7 discussion on overload control, the only way to achieve good performance in this case is to prevent the overload condition from occurring. Both response time and throughput will be better if we prevent additional tasks from starting or if we remove some existing tasks. It is better to completely starve some tasks of their resources, if the alternative, assigning each task their fair share, will drag the system to a halt.

Evicting an entire process from memory is called swapping. When there is too much paging activity, the operating system can prevent a catastrophic degradation in performance by moving all of the page frames of a particular process to disk, preventing it from running at all. Although this may seem terribly unfair, the alternative is that every process, not just the swapped process, will run much more slowly. By distributing the swapped process’s pages to other processes, we can reduce the number of page faults, allowing system performance to recover. Eventually the other tasks will finish, and we can bring the swapped process back into memory.

9.8 Summary and Future Directions

Caching is central to many areas of computer science: caches are used in processor design, file systems, web browsers, web servers, compilers, and kernel memory management, to name a few. To understand these systems, it is important to understand how caches work, and even more importantly, when they fail.

The management of memory in operating systems is a particularly useful case study. Every major commercial operating system includes support for demand paging of memory, using memory as a cache for disk. Often, application memory pages and blocks in the file buffer are allocated from a common pool of memory, where the operating system attempts to keep blocks that are likely to be used in memory and evicting those blocks that are less likely to be used. However, on modern systems, the difference between finding a block in memory and needing to bring it in from disk can be as much as a factor of 100,000. This makes virtual memory paging fragile, acceptable only when used in small doses.

Moving forward, several trends are in progress:

Low latency backing store. Due to the weight and power drain of magnetic disks, many portable devices have moved to solid state persistent storage, such as non-volatile RAM. Current solid state storage devices have significantly lower latency than disk, and even faster devices are likely in the future. Similarly, the move towards data center computing has added a new option to memory management: using DRAM on other nodes in the data center as a low-latency, very high capacity backing store for local memory. Both of these trends reduce the cost of paging, making it relatively more attractive.

Variable page sizes. Many systems use a standard 4 KB page size, but there is nothing fundamental about that choice; it is a tradeoff chosen to balance internal fragmentation, page table overhead, disk latency, the overhead of collecting dirty and usage bits, and application spatial locality. On modern disks, it only takes twice as long to transfer 256 contiguous pages as it does to transfer one, so internally, most operating systems arrange disk transfers to include many blocks at a time. With new technologies such as low latency solid state storage and cluster memory, this balance may shift back towards smaller effective page sizes.

Memory aware applications. The increasing depth and complexity of the memory hierarchy is both a boon and a curse. For many applications, the memory hierarchy delivers reasonable performance without any special effort. However, the wide gulf in performance between the first level cache and main memory, and between main memory and disk, implies that there is a significant performance benefit to tuning applications to the available memory. This poses a particular challenge for operating systems to adapt to applications that are adapting to their physical resources.

Exercises

1. A computer system has a 1 KB page size and keeps the page table for each process in main memory. Because the page table entries are usually cached on chip, the average overhead for doing a full page table lookup is 40 ns. To reduce this overhead, the computer has a 32-entry TLB. A TLB lookup requires 1 ns. What TLB hit rate is required to ensure an average virtual address translation time of 2 ns?

2. Most modern computer systems choose a page size of 4 KB.

a. Give a set of reasons why doubling the page size might increase performance.

b. Give a set of reasons why doubling the page size might decrease performance.

3. For each of the following statements, indicate whether the statement is true or false, and explain why.

a. A direct mapped cache can sometimes have a higher hit rate than a fully associative cache (on the same reference pattern).

b. Adding a cache never hurts performance.

4. Suppose an application is assigned 4 pages of physical memory and the memory is initially empty. It then references pages in the following sequence:

A C B D B A E F B F A G E F A

a. Show how the system would fault pages into the four frames of physical memory, using the LRU replacement policy.

b. Show how the system would fault pages into the four frames of physical memory, using the MIN replacement policy.

c. Show how the system would fault pages into the four frames of physical memory, using the clock replacement policy.

5. Is least recently used a good cache replacement algorithm to use for a workload following a Zipf distribution? Briefly explain why or why not.

6. Briefly explain how to simulate a modify bit per page for the page replacement algorithm if the hardware does not provide one.

7. Suppose we have four programs:

a. One exhibits both spatial and temporal locality.

b. One touches each page sequentially, and then repeats the scan in a loop.

c. One references pages according to a Zipf distribution (e.g., it is a web server and its memory consists of cached web pages).

d. One generates memory references completely at random using a uniform random number generator.

All four programs use the same total amount of virtual memory; that is, they each touch N distinct virtual pages, amongst a much larger number of total references.

For each program, sketch a graph showing the rate of progress (instructions per unit time) of each program as a function of the physical memory available to the program, from 0 to N, assuming the page replacement algorithm approximates least recently used.

8. Suppose a program repeatedly scans linearly through a large array in virtual memory. In other words, if the array is four pages long, its page reference pattern is ABCDABCDABCD…

For each of the following page replacement algorithms, sketch a graph showing the rate of progress (instructions per unit time) of each program as a function of the physical memory available to the program, from 0 to N, where N is sufficient to hold the entire array.

a. FIFO

b. Least recently used

c. Clock algorithm

d. Nth chance algorithm

e. MIN

9. Consider a computer system running a general-purpose workload with demand paging. The system has two disks, one for demand paging and one for file system operations. Measured utilizations (in terms of time, not space) are given in Figure 9.23.

Resource                 Utilization
Processor utilization    20.0%
Paging disk              99.7%
File disk                10.0%
Network                   5.0%

Figure 9.23: Measured utilizations for sample system under consideration.

For each of the following changes, say what its likely impact will be on processor utilization, and explain why. Is it likely to significantly increase, marginally increase, significantly decrease, marginally decrease, or have no effect on the processor utilization?

a. Get a faster CPU

b. Get a faster paging disk

c. Increase the degree of multiprogramming

10. An operating system with a physically addressed cache uses page coloring to more fully utilize the cache.

a. How many page colors are needed to fully utilize a physically addressed cache, with 1 TB of main memory, an 8 MB cache with 4-way set associativity, and a 4 KB page size?

b. Develop an algebraic formula to compute the number of page colors needed for an arbitrary configuration of cache size, set associativity, and page size.

11. The sequence of virtual pages referenced by a program has length p with n distinct page numbers occurring in it. Let m be the number of page frames that are allocated to the process (all the page frames are initially empty). Let n > m.

a. What is the lower bound on the number of page faults?

b. What is the upper bound on the number of page faults?

The lower/upper bound should be for any page replacement policy.

12. You have decided to splurge on a low end netbook for doing your operating systems homework during lectures in your non-computer science classes. The netbook has a single-level TLB and a single-level, physically addressed cache. It also has two levels of page tables, and the operating system does demand paging to disk.

The netbook comes in various configurations, and you want to make sure the configuration you purchase is fast enough to run your applications. To get a handle on this, you decide to measure its cache, TLB and paging performance running your applications in a virtual machine. Figure 9.24 lists what you discover for your workload.

Measurement                                                         Value
P_CacheMiss = probability of a cache miss                           0.01
P_TLBmiss = probability of a TLB miss                               0.01
P_fault = probability of a page fault, given a TLB miss occurs      0.00002
T_cache = time to access cache                                      1 ns = 0.001 μs
T_TLB = time to access TLB                                          1 ns = 0.001 μs
T_DRAM = time to access main memory                                 100 ns = 0.1 μs
T_disk = time to transfer a page to/from disk                       10^7 ns = 10 ms

Figure 9.24: Sample measurements of cache behavior on the low-end netbook described in the exercises.

The TLB is refilled automatically by the hardware on a TLB miss. The page tables are kept in physical memory and are not cached, so looking up a page table entry incurs two memory accesses (one for each level of the page table). You may assume the operating system keeps a pool of clean pages, so pages do not need to be written back to disk on a page fault.

a. What is the average memory access time (the time for an application program to do one memory reference) on the netbook? Express your answer algebraically and compute the result to two significant digits.

b. The netbook has a few optional performance enhancements:

Item                   Specs                                          Price
Faster disk drive      Transfers a page in 7 ms                       $100
500 MB more DRAM       Makes probability of a page fault 0.00001      $100
Faster network card    Allows paging to remote memory.                $100

With the faster network card, the time to access remote memory is 500 ms, and the probability of a remote memory miss (need to go to disk), given there is a page fault is 0.5.

Suppose you have $200. What options should you buy to maximize the performance of the netbook for this workload?

13. On a computer with virtual memory, suppose a program repeatedly scans through a very large array. In other words, if the array is four pages long, its page reference pattern is ABCDABCDABCD…

Sketch a graph showing the paging behavior, for each of the following page replacement algorithms. The y-axis of the graph is the number of page faults per referenced page, varying from 0 to 1; the x-axis is the size of the array being scanned, varying from smaller than physical memory to much larger than physical memory. Label any interesting points on the graph on both the x and y axes.

a. FIFO

b. LRU

c. Clock

d. MIN

14. Consider two programs, one that exhibits spatial and temporal locality, and the other that exhibits neither. To make the comparison fair, they both use the same total amount of virtual memory; that is, they both touch N distinct virtual pages, among a much larger number of total references.

Sketch graphs showing the rate of progress (instructions per unit time) of each program as a function of the physical memory available to the program, from 0 to N, assuming the clock algorithm is used for page replacement.

a. Program exhibiting locality, running by itself

b. Program exhibiting no locality, running by itself

c. Program exhibiting locality, running with the program exhibiting no locality (assume both have the same value for N).

d. Program exhibiting no locality, running with the program exhibiting locality (assume both have the same N).

15. Suppose we are using the clock algorithm to decide page replacement, in its simplest form (“first-chance” replacement, where the clock is only advanced on a page fault and not in the background).

A crucial issue in the clock algorithm is how many page frames must be considered in order to find a page to replace. Assuming we have a sequence of F page faults in a system with P page frames, let C(F,P) be the number of pages considered for replacement in handling the F page faults (if the clock hand sweeps by a page frame multiple times, it is counted each time).

a. Give an algebraic formula for the minimum possible value of C(F,P).

b. Give an algebraic formula for the maximum possible value of C(F,P).

10. Advanced Memory Management

All problems in computer science can be solved by another level of indirection. -- David Wheeler

At an abstract level, an operating system provides an execution context for application processes, consisting of limits on privileged instructions, the process’s memory regions, a set of system calls, and some way for the operating system to periodically regain control of the processor. By interposing on that interface (most commonly, by catching and transforming system calls or memory references), the operating system can transparently insert new functionality to improve system performance, reliability, and security.

Interposing on system calls is straightforward. The kernel uses a table lookup to determine which routine to call for each system call invoked by the application program. The kernel can redirect a system call to a new enhanced routine by simply changing the table entry.
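The table-based redirection can be sketched with a dictionary standing in for the kernel's system call table. Everything named here (syscall_table, dispatch, interpose, sys_write, logging_write) is hypothetical and exists only to illustrate the idea of changing a table entry.

```python
def sys_write(data):
    """Stand-in for an existing system call routine."""
    return len(data)


# The kernel's table of routines, indexed by system call name.
syscall_table = {"write": sys_write}


def dispatch(name, *args):
    """Look up and invoke the routine for a system call."""
    return syscall_table[name](*args)


def interpose(name, wrapper):
    """Redirect a system call to an enhanced routine by replacing its
    table entry; the wrapper receives the original routine as its first
    argument so it can still call through to it."""
    original = syscall_table[name]
    syscall_table[name] = lambda *args: wrapper(original, *args)


log = []


def logging_write(original, data):
    """Example enhancement: record each write size, then do the write."""
    log.append(len(data))
    return original(data)
```

After interpose("write", logging_write), a call to dispatch("write", b"abc") still returns 3, but the size is also recorded in log; the caller is unaware of the change.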

A more interesting case is the memory system. Address translation hardware provides an efficient way for the operating system to monitor and gain control on every memory reference to a specific region of memory, while allowing other memory references to continue unaffected. (Equivalently, software-based fault isolation provides many of the same hooks, with different tradeoffs between interposition and execution speed.) This makes address translation a powerful tool for operating systems to introduce new, advanced services to applications. We have already shown how to use address translation for:

Protection. Operating systems use address translation hardware, along with segment and page table permissions, to restrict access by applications to privileged memory locations such as those in the kernel.

Fill-on-demand/zero-on-demand. By setting some page table permissions to invalid, the kernel can start executing a process before all of its code and data has been loaded into memory; the hardware will trap to the kernel if the process references data before it is ready. Similarly, the kernel can zero data and heap pages in the background, relying on page reference faults to catch the first time an application uses an empty page. The kernel can also allocate memory for kernel and user stacks only as needed. By marking unused stack pages as invalid, the kernel needs to allocate those pages only if the program executes a deep procedure call chain.

Copy-on-write. Copy-on-write allows multiple processes to have logically separate copies of the same memory region, backed by a single physical copy in memory. Each page in the region is mapped read-only in each process; the operating system makes a physical copy only when (and if) a page is modified.

Memory-mapped files. Disk files can be made part of a process’s virtual address space, allowing the process to access the data in the file using normal processor instructions. When a page from a memory-mapped file is first accessed, a protection fault traps to the operating system so that it can bring the page into memory from disk. The first write to a file block can also be caught, marking the block as needing to be written back to disk.

Demand paged virtual memory. The operating system can run programs that use more memory than is physically present on the computer, by catching references to pages that are not physically present and filling them from disk or cluster memory.
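Of the uses listed above, copy-on-write is perhaps the easiest to model in a few lines. The sketch below is a simplified simulation: the Memory class, the dictionary-based page table entries, the reference counts, and the share and store functions are all assumptions introduced for illustration, with a permission check standing in for the hardware protection fault.

```python
class Memory:
    """Physical memory: a list of frames plus a count of mappings per frame."""

    def __init__(self):
        self.frames = []
        self.refs = []

    def new_frame(self, data):
        self.frames.append(bytearray(data))   # bytearray() copies its input
        self.refs.append(1)
        return len(self.frames) - 1


def share(mem, page_table):
    """Give a second process a logically separate copy of a region:
    both page tables point at the same frames, mapped read-only."""
    for entry in page_table:
        entry["writable"] = False
        mem.refs[entry["frame"]] += 1
    return [dict(entry) for entry in page_table]


def store(mem, page_table, page, offset, value):
    """Store one byte, making a physical copy only if the page is shared."""
    entry = page_table[page]
    if not entry["writable"]:                  # protection fault on real hardware
        if mem.refs[entry["frame"]] > 1:       # still shared: copy before writing
            mem.refs[entry["frame"]] -= 1
            entry["frame"] = mem.new_frame(mem.frames[entry["frame"]])
        entry["writable"] = True
    mem.frames[entry["frame"]][offset] = value
```

Only the first writer to a shared page pays for a copy; once a page has a single remaining mapper, that process regains write access without copying.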

Inthischapter,weexplorehowtoconstructanumberofotheradvancedoperatingsystemservicesbycatchingandre-interpretingmemoryreferencesandsystemcalls.

Chapterroadmap:

Zero-CopyI/O.Howdoweimprovetheperformanceoftransferringblocksofdatabetweenuser-levelprogramsandhardwaredevices?(Section10.1)

VirtualMachines.Howdoweexecuteanoperatingsystemontopofanotheroperatingsystem,andhowcanweusethatabstractiontointroducenewoperatingsystemservices?(Section10.2)

FaultTolerance.Howcanwemakeapplicationsresilienttomachinecrashes?(Section10.3)

Security.Howcanwecontainmaliciousapplicationsthatcanexploitunknownfaultsinsidetheoperatingsystem?(Section10.4)

User-LevelMemoryManagement.Howdowegiveapplicationscontroloverhowtheirmemoryismanaged?(Section10.5)

10.1Zero-CopyI/O

Figure10.1:Awebservergetsarequestfromthenetwork.Theserverfirstasksthekerneltocopytherequestedfilefromdiskoritsfilebufferintotheserver’saddressspace.Theserverthenasksthekerneltocopythecontentsofthefilebackouttothenetwork.

Acommontaskforoperatingsystemsistostreamdatabetweenuser-levelprogramsandphysicaldevicessuchasdisksandnetworkhardware.However,thisstreamingcanbeexpensiveinprocessingtimeifthedataiscopiedasitmovesacrossprotectionboundaries.Anetworkpacketneedstogofromthenetworkinterfacehardware,intokernelmemory,andthentouser-level;theresponseneedstogofromuser-levelbackintokernelmemoryandthenfromkernelmemorytothenetworkhardware.

Considertheoperationofthewebserver,aspicturedinFigure10.1.Almostallwebserversareimplementedasuser-levelprograms.Thisway,itiseasytoreconfigureserverbehavior,andbugsintheserverimplementationdonotnecessarilycompromisesystemsecurity.

Anumberofstepsneedtohappenforawebservertorespondtoawebrequest.Forthisexample,assumethattheconnectionbetweentheclientandserverisalreadyestablished,thereisaserverthreadallocatedtoeachclientconnection,andweuseexplicitread/writesystemcallsratherthanmemorymappedfiles.

Server reads from network. The server thread calls into the kernel to wait for an arriving request.

Packet arrival. The web request arrives from the network; the network hardware uses DMA to copy the packet data into a kernel buffer.

Copy packet data to user-level. The operating system parses the packet header to determine which user process is to receive the web request. The kernel copies the data into the user-level buffer provided by the server thread and returns to user-level.

Server reads file. The server parses the data in the web request to determine which file is requested. It issues a file read system call back to the kernel, providing a user-level buffer to hold the file contents.

Data arrival. The kernel issues the disk request, and the disk controller copies the data from the disk into a kernel buffer. If the file data is already in the file buffer cache, as will often be the case for popular web requests, this step is skipped.

Copy file data to user-level. The kernel copies the data into the buffer provided by the user process and returns to user-level.

Server write to network. The server turns around and hands the buffer containing the file data back to the kernel to send out to the network.

Copy data to kernel. The kernel copies the data from the user-level buffer into a kernel buffer, formats the packet, and issues the request to the network hardware.

Data send. The hardware uses DMA to copy the data from the kernel buffer out to the network.
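The steps above can be sketched from the server’s point of view. This is a minimal illustration in Python rather than a real web server: the one-line request format (a file path followed by a newline), the helper name serve_one_request, and the buffer size are all assumptions made for the example. The comments mark where data crosses the kernel-user boundary.

```python
import os
import socket

BUF_SIZE = 4096  # size of the user-level buffer (an arbitrary choice)


def serve_one_request(conn: socket.socket) -> int:
    """Handle one request on an established connection; return bytes sent."""
    # "Server reads from network" through "Copy packet data to user-level":
    # the thread blocks in the kernel, and the kernel then copies the
    # packet data into a user-level buffer.
    request = conn.recv(BUF_SIZE)
    path = request.decode().strip()  # hypothetical request format: "<path>\n"

    sent = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            # "Server reads file" through "Copy file data to user-level":
            # each read copies data from a kernel buffer (disk or file
            # buffer cache) into user-level memory.
            chunk = os.read(fd, BUF_SIZE)
            if not chunk:
                break
            # "Server write to network" through "Data send": the same bytes
            # are copied back into a kernel buffer before the hardware
            # transmits them.
            conn.sendall(chunk)
            sent += len(chunk)
    finally:
        os.close(fd)
    return sent
```

Every byte of the file crosses the kernel-user boundary twice in this loop, even though the server never inspects the file contents.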

Although we have illustrated this with a web server, a similar process occurs for any application that streams data in or out of a computer. Examples include a web client, online video or music service, BitTorrent, network file systems, and even a software download. For each of these, data is copied from hardware into the kernel and then into user-space, or vice versa.

We could eliminate the extra copy across the kernel-user boundary by moving each of these applications into the kernel. However, that would be impractical as it would require trusting the applications with the full power of the operating system. Alternatively, we could modify the system call interface to allow applications to directly manipulate data stored in a kernel buffer, without first copying it to user memory. However, this is not a general-purpose solution; it would not work if the application needed to do any work on the buffer as opposed to only transferring it from one hardware device to another.

Instead, two solutions to zero-copy I/O are used in practice. Both eliminate the copy across the kernel-user boundary for large blocks of data; for small chunks of data, the extra copy does not hurt performance.

The more widely used approach manipulates the process page table to simulate a copy. For this to work, the application must first align its user-level buffer to a page boundary. The user-level buffer is provided to the kernel on a read or write system call, and its alignment and size are up to the application.

The key idea is that a page-to-page copy from user to kernel space or vice versa can be simulated by changing page table pointers instead of physically copying memory.

For a copy from user-space to the kernel (e.g., on a network or file system write), the kernel changes the permissions on the page table entry for the user-level buffer to prevent it from being modified. The kernel must also pin the page to prevent it from being evicted by the virtual memory manager. In the common case, this is enough: the page will not normally be modified while the I/O request is in progress. If the user program does try to modify the page, the program will trap to the kernel and the kernel can make an explicit copy at that point.

Figure 10.2: The contents of the page table before and after the kernel “copies” data to user-level by swapping the page table entry to point to the kernel buffer.

In the other direction, once the data is in the kernel buffer, the operating system can simulate a copy up to user-space by switching the pointer in the page table, as shown in Figure 10.2. The process page table originally pointed to the page frame containing the (empty) user buffer; now it points to the page frame containing the (full) kernel buffer. To the user program, the data appears exactly where it was expected! The kernel can reclaim any physical memory behind the empty buffer.
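The pointer swap can be illustrated with a toy model. This sketch assumes a single-level page table represented as a Python dictionary from virtual page number to page frame number; the class Machine and the function remap_to_user are hypothetical names for this example, not part of any real kernel interface.

```python
PAGE_SIZE = 4096


class Machine:
    """Toy physical memory: a list of page frames plus a free list."""

    def __init__(self, num_frames: int):
        self.frames = [bytearray(PAGE_SIZE) for _ in range(num_frames)]
        self.free = list(range(num_frames))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, frame: int) -> None:
        self.free.append(frame)


def remap_to_user(machine: Machine, page_table: dict, vpn: int,
                  kernel_frame: int) -> None:
    """'Copy' a full kernel page to user space by changing one pointer."""
    old_frame = page_table[vpn]
    page_table[vpn] = kernel_frame  # user page now maps the full kernel buffer
    machine.release(old_frame)      # reclaim the frame behind the empty buffer


def user_read(machine: Machine, page_table: dict, vpn: int, n: int) -> bytes:
    """Read the first n bytes of a user page through the page table."""
    return bytes(machine.frames[page_table[vpn]][:n])
```

No bytes move in remap_to_user; the cost is one page table update regardless of page size, which is why the technique pays off only for page-aligned, page-sized transfers.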

More recently, some hardware I/O devices have been designed to be able to transfer data to and from virtual addresses, rather than only to and from physical addresses. The kernel hands the virtual address of the user-level buffer to the hardware device. The hardware device, rather than the kernel, walks the multi-level page table to determine which physical page frame to use for the device transfer. When the transfer completes, the data is automatically where it belongs, with no extra work by the kernel. This procedure is a bit more complicated for incoming network packets, as the decision as to which process should receive which packet is determined by the contents of the packet header. The network interface hardware therefore has to parse the incoming packet to deliver the data to the appropriate process.

10.2 Virtual Machines

A virtual machine is a way for a host operating system to run a guest operating system as an application process. The host simulates the behavior of a physical machine so that the guest system behaves as if it were running on real hardware. Virtual machines are widely used on client machines to run applications that are not native to the current version of the operating system. They are also widely used in data centers to allow a single physical machine to be shared between multiple independent uses, each of which can be written as if it has system administrator control over the entire (virtual) machine. For example, multiple web servers, representing different web sites, can be hosted on the same physical machine if they each run inside a separate virtual machine.

Address translation throws a wrinkle into the challenge of implementing a virtual machine, but it also opens up opportunities for efficiencies and new services.

Figure 10.3: A virtual machine typically has two page tables: one to translate from guest process addresses to the guest physical memory, and one to translate from guest physical memory addresses to host physical memory addresses.

10.2.1 Virtual Machine Page Tables

With virtual machines, we have two sets of page tables, instead of one, as shown in Figure 10.3:

Guest physical memory to host physical memory. The host operating system provides a set of page tables to constrain the execution of the guest operating system kernel. The guest kernel thinks it is running on real, physical memory, but in fact its addresses are virtual. The hardware page table translates each guest operating system memory reference into a physical memory location, after checking that the guest has permission to read or write each location. This way the host operating system can prevent bugs in the guest operating system from overwriting memory in the host, exactly as if the guest were a normal user-level process.

Guest user memory to guest physical memory. In turn, the guest operating system manages page tables for its guest processes, exactly as if the guest kernel were running on real hardware. Since the guest kernel does not know anything about the physical page frames it has been assigned by the host kernel, these page tables translate from the guest process addresses to the guest operating system kernel addresses.

First, consider what happens when the host operating system transfers control to the guest kernel. Everything works as expected. The guest operating system can read and write its memory, and the hardware page tables provide the illusion that the guest kernel is running directly on physical memory.

Now consider what happens when the guest operating system transfers control to the guest process. The guest kernel is running at user-level, so its attempt to transfer control uses a privileged instruction. Thus, the hardware processor will first trap back to the host. The host kernel can then simulate the transfer instruction, handing control to the user process.

However, what page table should we use in this case? We cannot use the page table as set up by the guest operating system, as the guest operating system thinks it is running in physical memory, but it is actually using virtual addresses. Nor can we use the page table as set up by the host operating system, as that would provide permission to the guest process to access and modify the guest kernel data structures. If we grant access to the guest kernel memory to the guest process, then the behavior of the virtual machine will be compromised.

Figure 10.4: To run a guest process, the host operating system constructs a shadow page table consisting of the composition of the contents of the two page tables.

Instead, we need to construct a composite page table, called a shadow page table, that represents the composition of the guest page table and the host page table, as shown in Figure 10.4. When the guest kernel transfers control to a guest process, the host kernel gains control and changes the page table to the shadow version.

To keep the shadow page table up to date, the host operating system needs to keep track of changes that either the guest or the host operating systems make to their page tables. This is easy in the case of the host OS: it can check to see if any shadow page tables need to be updated before it changes a page table entry.

To keep track of changes that the guest operating system makes to its page tables, however, we need to do a bit more work. The host operating system sets the memory of the guest page tables as read-only. This ensures that the guest OS traps to the host every time it attempts to change a page table entry. The host uses this trap to change both the guest page table and the corresponding shadow page table, before resuming the guest operating system (with the page table still read-only).
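The composition can be sketched with single-level page tables modeled as dictionaries. This is a simplified illustration under stated assumptions: real page tables are multi-level and carry more permission bits, and the names PTE, build_shadow, and guest_pte_write are hypothetical.

```python
from typing import Dict, NamedTuple


class PTE(NamedTuple):
    frame: int      # page frame number this entry points to
    writable: bool  # write permission


def build_shadow(guest_pt: Dict[int, PTE],
                 host_pt: Dict[int, PTE]) -> Dict[int, PTE]:
    """Compose guest-virtual -> guest-physical with guest-physical -> host-physical."""
    shadow = {}
    for vpn, g in guest_pt.items():
        h = host_pt.get(g.frame)
        if h is None:
            continue  # no host mapping: leave invalid so the access traps to the host
        # A store is allowed only if both the guest and the host allow it.
        shadow[vpn] = PTE(h.frame, g.writable and h.writable)
    return shadow


def guest_pte_write(guest_pt: Dict[int, PTE], host_pt: Dict[int, PTE],
                    shadow: Dict[int, PTE], vpn: int, new_pte: PTE) -> None:
    """Host trap handler for a guest store to its read-only page table."""
    guest_pt[vpn] = new_pte              # perform the update the guest attempted
    h = host_pt.get(new_pte.frame)
    if h is None:
        shadow.pop(vpn, None)
    else:
        shadow[vpn] = PTE(h.frame, new_pte.writable and h.writable)
```

For example, a guest entry mapping virtual page 0 to guest frame 5, combined with a host entry mapping guest frame 5 to host frame 100, yields a shadow entry mapping virtual page 0 directly to host frame 100.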

Paravirtualization

One way to enable virtual machines to run faster is to assume that the guest operating system is ported to the virtual machine. The hardware dependent layer, specific to the underlying hardware, is replaced with code that is aware of the virtual machine. This is called paravirtualization, because the resulting guest operating system is almost, but not precisely, the same as if it were running on real, physical hardware.

Paravirtualization is helpful in several ways. Perhaps the most important is handling the idle loop. What should happen when the guest operating system has no threads to run? If the guest believes it is running on physical hardware, then nothing: the guest spins waiting for more work to do, perhaps putting itself in low power mode. Eventually the hardware will cause a timer interrupt, transferring control to the host operating system. The host can then decide whether to resume the virtual machine or run some other thread (or even some other virtual machine).

With paravirtualization, however, the idle loop can be more efficient. The hardware dependent software implementing the idle loop can trap into the host kernel, yielding the processor immediately to some other use.

Likewise, with paravirtualization, the hardware dependent code inside the guest operating system can make explicit calls to the host kernel to change its page tables, removing the need for the host to simulate guest page table management.

The Intel architecture has recently added direct hardware support for the composition of page tables in virtual machines. Instead of a single page table, the hardware can be set up with two page tables, one for the host and one for the guest operating system. When running a guest process, on a TLB miss, the hardware translates the virtual address to a guest physical page frame using the guest page table, and the hardware then translates the guest physical page frame to the host physical page frame using the host page table. In other words, the TLB contains the composition of the two page tables, exactly as if the host maintained an explicit shadow page table. Of course, if the guest operating system itself hosts a virtual machine as a guest user process, then the guest kernel must construct a shadow page table.

Although this hardware support simplifies the construction of virtual machines, it is not clear if it improves performance. The handling of a TLB miss is slower since the hardware must consult two page tables instead of one; changes to the guest page table are faster because the host does not need to maintain the shadow page table. It remains to be seen if this tradeoff is useful in practice.

10.2.2 Transparent Memory Compression

A theme running throughout this book is the difficulty of multiplexing multiplexors. With virtual machines, both the host operating system and the guest operating system are attempting to do the same task: to efficiently multiplex a set of tasks onto a limited amount of memory. Decisions the guest operating system takes to manage its memory may work at cross-purposes to the decisions that the host operating system takes to manage its memory.

Efficient use of memory can become especially important in data centers. Often, a single physical machine in a data center is configured to run many virtual machines at the same time. For example, one machine can host many different web sites, each of which is too small to merit a dedicated machine on its own.

To make this work, the system needs enough memory to be able to run many different operating systems at the same time. The host operating system can help by sharing memory between guest kernels, e.g., if it is running two guest kernels with the same executable kernel image. Likewise, the guest operating system can help by sharing memory between guest applications, e.g., if it is running two copies of the same program. However, if different guest kernels both run a copy of the same user process (e.g., both run the Apache web server), or use the same library, the host kernel has no (direct) way to share pages between those two instances.

Another example occurs when a guest process exits. The guest operating system places the page frames for the exiting process on the free list for reallocation to other processes. The contents of any data pages will never be used again; in fact, the guest kernel will need to zero those pages before they are reassigned. However, the host operating system has no (direct) way to know this. Eventually those pages will be evicted by the host, e.g., when they become least recently used. In the meantime, however, the host operating system might have evicted pages from the guest that are still active.

One solution is to more tightly coordinate the guest and host memory managers so that each knows what the other is doing. We discuss this in more detail later in this chapter.

Commercial virtual machine implementations take a different approach, exploiting hardware address protection to manage the sharing of common pages between virtual machines. These systems run a scavenger in the background that looks for pages that can be shared across virtual machines. Once a common page is identified, the host kernel manipulates the page table pointers to provide the illusion that each guest has its own copy of the page, even though the physical representation is more compact.

Figure 10.5: When a host kernel runs multiple virtual machines, it can save space by storing a delta to an existing page (page A) and by using the same physical page frame for multiple copies of the same page (page B).

There are two cases to consider, shown in Figure 10.5:

Multiple copies of the same page. Two different virtual machines will often have pages with the same contents. An obvious case is zeroed pages: each kernel keeps a pool of pages that have been zeroed, ready to be allocated to a new process. If each guest operating system were running on its own machine, there would be little cost to keeping this pool at the ready; no one else but the kernel can use that memory. However, when the physical machine is shared between virtual machines, having each guest keep its own pool of zero pages is wasteful.

Instead, the host can allocate a single zero page in physical memory for all of these instances. All pointers to the page will be set read-only, so that any attempt to modify the page will cause a trap to the host kernel; the kernel can then allocate a new (zeroed) physical page for that use, exactly as in copy-on-write. Of course, the guest kernels do not need to tell anyone when they create a zero page, so in the background, the host kernel runs a scavenger to look for zero pages in guest memory. When it finds one, it reclaims the physical page and changes the page table pointers to point at the shared zero page, with read-only permission.

The scavenger can do the same for other shared page frames. The code and data segments for both applications and shared libraries will often be the same or quite similar, even across different operating systems. An application like the Apache web server will not be re-written from scratch for every separate operating system; rather, some OS-specific glue code will be added to match the portable portion of the application to its specific environment.

Compression of unused pages. Even if a page is different, it may be close to some other page in a different virtual machine. For example, different versions of the operating system may differ in only some small respects. This provides an opportunity for the host kernel to introduce a new layer in the memory hierarchy to save space.

Instead of evicting a relatively unused page, the operating system can compress it. If the page is a delta of an existing page, the compressed version may be quite small. The kernel manipulates page table permissions to maintain the illusion that the delta is a real page. The full copy of the page is marked read-only; the delta is marked invalid. If the delta is referenced, it can be re-constituted as a full page more quickly than if it were stored on disk. If the original page is modified, the delta can be re-compressed or evicted, as necessary.
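The first of the two cases, sharing identical pages, can be sketched as follows. This is a toy model under stated assumptions: the host page tables are dictionaries, the scavenger finds candidates by hashing page contents (one plausible way to find matches; the text does not say how commercial systems do it), and the names Host, map_page, scavenge, and write are hypothetical. Delta compression is not modeled.

```python
import hashlib

PAGE_SIZE = 4096


class Host:
    def __init__(self):
        self.frames = {}       # frame number -> page contents (bytes)
        self.refcount = {}     # frame number -> number of mappings
        self.page_tables = {}  # vm -> {guest page: [frame, writable]}
        self.next_frame = 0

    def map_page(self, vm, gpn, contents):
        frame, self.next_frame = self.next_frame, self.next_frame + 1
        self.frames[frame] = bytes(contents)
        self.refcount[frame] = 1
        self.page_tables.setdefault(vm, {})[gpn] = [frame, True]

    def scavenge(self):
        """Merge identical pages; return the number of frames reclaimed."""
        seen, reclaimed = {}, 0  # seen: content digest -> canonical frame
        for table in self.page_tables.values():
            for entry in table.values():
                frame = entry[0]
                digest = hashlib.sha256(self.frames[frame]).digest()
                canon = seen.setdefault(digest, frame)
                # Skip the canonical copy itself; compare bytes to rule out
                # a hash collision.
                if canon == frame or self.frames[canon] != self.frames[frame]:
                    continue
                entry[0] = canon
                self.refcount[canon] += 1
                self.refcount[frame] -= 1
                if self.refcount[frame] == 0:
                    del self.frames[frame], self.refcount[frame]
                    reclaimed += 1
        # Every mapping of a shared frame becomes read-only.
        for table in self.page_tables.values():
            for entry in table.values():
                if self.refcount[entry[0]] > 1:
                    entry[1] = False
        return reclaimed

    def write(self, vm, gpn, offset, data):
        entry = self.page_tables[vm][gpn]
        frame = entry[0]
        if not entry[1]:
            # Protection trap: give this guest a private copy (copy-on-write).
            if self.refcount[frame] > 1:
                self.refcount[frame] -= 1
                frame, self.next_frame = self.next_frame, self.next_frame + 1
                self.frames[frame] = self.frames[entry[0]]
                self.refcount[frame] = 1
                entry[0] = frame
            entry[1] = True
        page = bytearray(self.frames[frame])
        page[offset:offset + len(data)] = data
        self.frames[frame] = bytes(page)
```

Two guests that each hold a zeroed page end up sharing one frame after scavenge; the first store by either guest traps and restores a private copy, so neither guest can observe the sharing.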

10.3 Fault Tolerance

All systems break. Despite our best efforts, application code can have bugs that cause the process to exit abruptly. Operating system code can have bugs that cause the machine to halt and reboot. Power failures and hardware errors can also cause a system to stop without warning.

Most applications are structured to periodically save user data to disk for just these types of events. When the operating system or application restarts, the program can read the saved data off disk to allow the user to resume their work.

In this section, we take this a step further, to see if we can manage memory to recover application data structures after a failure, and not just user file data.

10.3.1 Checkpoint and Restart

One reason we might want to recover application data is when a program takes a long time to run. If a simulation of the future global climate takes a week to compute, we do not want to have to start again from scratch every time there is a power glitch. If enough machines are involved and the computation takes long enough, it is likely that at least one of the machines will encounter a failure sometime during the computation.

Of course, the program could be written to treat its internal data as precious, periodically saving its partial results to a file. To make sure the data is internally consistent, the program would need some natural stopping point; for example, the program can save the predicted climate for 2050 before it moves on to computing the climate in 2051.

A more general approach is to have the operating system use the virtual memory system to provide application recovery as a service. If we can save the state of a process, we can transparently restart it whenever the power fails, exactly where it left off, with the user none the wiser.

Figure 10.6: By checkpointing the state of a process, we can recover the saved state of the process after a failure by restoring the saved copy.

To make this work, we first need to suspend each thread executing in the process and save its state (the program counter, stack pointer, and registers) to application memory. Once all threads are suspended, we can then store a copy of the contents of the application memory on disk. This is called a checkpoint or snapshot, illustrated in Figure 10.6. After a failure, we can resume the execution by restoring the contents of memory from the checkpoint and resuming each of the threads from exactly the point we stopped them. This is called an application restart.

What would happen if we allow threads to continue to run while we are saving the contents of memory to disk? During the copy, we have a race condition: some pages could be saved before being modified by some thread, while others could be saved after being modified by that same thread. When we try to restart the application, its data structures could appear to be corrupted. The behavior of the program might be different from what would have happened if the failure had not occurred.

Fortunately, we can use address translation to minimize the amount of time we need to have the system stalled during a checkpoint. Instead of copying the contents of memory to disk, we can mark the application’s pages as copy-on-write. At this point, we can restart the program’s threads. As each page reaches disk, we can reset the protection on that page to read-write. When the program tries to modify a page before it reaches disk, the hardware will take an exception, and the kernel can make a copy of the page: one to be saved to disk and one to be used by the running program.
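The copy-on-write checkpoint can be sketched with a toy process whose memory is a dictionary of pages. The class and method names are hypothetical, and the trap on a store is modeled as an ordinary check in store rather than a hardware exception.

```python
class Process:
    def __init__(self, pages):
        self.pages = dict(pages)  # page number -> current contents
        self.cow = set()          # pages marked copy-on-write, not yet on disk
        self.pending = {}         # page number -> copy preserved for the checkpoint

    def begin_checkpoint(self):
        """Runs while threads are briefly suspended: mark every page copy-on-write."""
        self.cow = set(self.pages)
        self.pending = {}

    def store(self, page, data):
        """A thread's store; 'traps' if the page has not reached disk yet."""
        if page in self.cow:
            self.pending[page] = self.pages[page]  # keep old contents for the checkpoint
            self.cow.discard(page)                 # page is read-write again
        self.pages[page] = data

    def save_next_page(self, disk):
        """Background writer: save one page; return False when the checkpoint is complete."""
        if self.pending:
            page, data = self.pending.popitem()
        elif self.cow:
            page = self.cow.pop()
            data = self.pages[page]
        else:
            return False
        disk[page] = data
        return True
```

Even if the program overwrites pages while the writer is still running, the data on disk matches memory as it was when begin_checkpoint was called, so the checkpoint is consistent.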

We can take checkpoints of the operating system itself in the same way. It is easiest to do this if the operating system is running in a virtual machine. The host can take a checkpoint by stopping the virtual machine, saving the processor state, and changing the page table protections (in the host page table) to read-only. The virtual machine is then safe to restart while the host writes the checkpoint to disk in the background.

Checkpoints and system calls

An implementation challenge for checkpoint/restart is to correctly handle any system calls that are in progress. The state of a program is not only its user-level memory; it also includes the state of any threads that are executing in the kernel and any per-process state maintained by the kernel, such as its open file descriptors. While some operating systems have been designed to allow the kernel state of a process to be captured as part of the checkpoint, it is more common for checkpointing to be supported only at the virtual machine layer. A virtual machine has no state in the kernel except for the contents of its memory and processor registers. If we need to take a checkpoint while a trap handler is in progress, the handler can simply be restarted.

Process migration is the ability to take a running program on one system, stop its execution, and resume it on a different machine. Checkpoint and restart provide a basis for transparent process migration. For example, it is now common practice to checkpoint and migrate entire virtual machines inside a data center, as one way to balance load. If one system is hosting two web servers, each of which becomes heavily loaded, we can stop one and move it to a different machine so that each can get better performance.

10.3.2 Recoverable Virtual Memory

Taking a complete checkpoint of a process or a virtual machine is a heavyweight operation, and so it is only practical to do relatively rarely. We can use copy-on-write page protection to resume the process after starting the checkpoint, but completing the checkpoint will still take considerable time while we copy the contents of memory out to disk.

Can we provide an application the illusion of persistent memory, so that the contents of memory are restored to a point not long before the failure? The ability to do this is called recoverable virtual memory. An example where we might like recoverable virtual memory is in an email client; as you read, reply, and delete email, you do not want your work to be lost if the system crashes.

If we put efficiency aside, recoverable virtual memory is possible. First, we take a checkpoint so that some consistent version of the application’s data is on disk. Next, we record an ordered sequence, or log, of every update that the application makes to memory. Once the log is written to disk, we can recover after a failure by reading the checkpoint and applying the changes from the log.
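Recovery from a checkpoint plus a log can be sketched in a few lines. Memory is modeled as a dictionary from address to value and the log as a list of (address, value) updates in the order they occurred; both representations and the function name replay_log are assumptions for this example.

```python
def replay_log(checkpoint, log):
    """Rebuild memory by applying logged updates, in order, to the checkpoint."""
    memory = dict(checkpoint)    # start from the consistent on-disk version
    for address, value in log:   # order matters: later updates overwrite earlier ones
        memory[address] = value
    return memory
```

Because the log is applied in order, an address that was updated several times ends up with its last logged value, which is the value it held at the time of the crash.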

This is exactly how most text editors save their backups, to allow them to recover uncommitted user edits after a machine or application failure. A text editor could repeatedly write an entire copy of the file to a backup, but this would be slow, particularly for a large file. Instead, a text editor will write a version of the file, and then it will append a sequence of every change the user makes to that version. To avoid having to separately write every typed character to disk, the editor will batch changes, e.g., all of the changes the user made in the past 100 milliseconds, and write those to disk as a unit. Even if the very latest batch has not been written to disk, the user can usually recover the state of the file as it was almost immediately before the machine crash.

A downside of this algorithm for text editors is that it can cause information to be leaked without it being visible in the current version of the file. Text editors sometimes use this same method when the user hits “save”: they just append any changes from the previous version, rather than writing a fresh copy of the entire file. This means that the old version of a file can potentially still be recovered from the saved file. So if you write a memo insulting your boss, and then edit it to tone it down, it is best to save a completely new version of your file before you send it off!

Will this method work for persistent memory? Keeping a log of every change to every memory location in the process would be too slow. We would need to trap on every store instruction and save the value to disk. In other words, we would run at the speed of the trap handler, rather than the speed of the processor.

However, we can come close. When we take a checkpoint, we mark all pages as read-only to ensure that the checkpoint includes a consistent snapshot of the state of the process’s memory. Then we trap to the kernel on the first store instruction to each page, to allow the kernel to make a copy-on-write. The kernel resets the page to be read-write so that successive store instructions to the same page can go at full speed, but it can also record the page as having been modified.

Figure 10.7: The operating system can recover the state of a memory segment after a crash by saving a sequence of incremental checkpoints.

We can take an incremental checkpoint by stopping the program and saving a copy of any pages that have been modified since the previous checkpoint. Once we change those pages back to read-only, we can restart the program, wait a bit, and take another incremental checkpoint. After a crash, we can recover the most recent memory by reading in the first checkpoint and then applying each of the incremental checkpoints in turn, as shown in Figure 10.7.
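Incremental checkpoints can be sketched with a toy segment that tracks which pages have been modified. The names Segment and recover_segment are hypothetical, and the trap on the first store to a read-only page is modeled by adding the page to a dirty set.

```python
class Segment:
    def __init__(self, pages):
        self.pages = dict(pages)  # page number -> contents
        self.dirty = set()        # pages modified since the last checkpoint

    def store(self, page, data):
        # The first store to a read-only page traps; the kernel records the
        # page as modified and makes it read-write.
        self.dirty.add(page)
        self.pages[page] = data

    def full_checkpoint(self):
        self.dirty.clear()
        return dict(self.pages)

    def incremental_checkpoint(self):
        """Save only the pages modified since the previous checkpoint."""
        delta = {page: self.pages[page] for page in self.dirty}
        self.dirty.clear()        # those pages go back to read-only
        return delta


def recover_segment(full, increments):
    """Apply each incremental checkpoint, in order, on top of the full one."""
    pages = dict(full)
    for delta in increments:
        pages.update(delta)
    return pages
```

Stores made after the last incremental checkpoint are lost on a crash, which is why the amount of lost work depends on how often incremental checkpoints reach disk.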

How much work we lose during a machine crash is a function of how quickly we can completely write an incremental checkpoint to disk. This is governed by the rate at which the application creates new data. To reduce the cost of an incremental checkpoint, applications needing recoverable virtual memory will designate a specific memory segment as persistent. After a crash, that memory will be restored to the latest incremental checkpoint, allowing the program to quickly resume its work.

10.3.3 Deterministic Debugging

A key to building reliable systems software is the ability to locate and fix problems when they do occur. Debugging a sequential program is comparatively easy: if you give it the same input, it will execute the same code in the same order, and produce the same output.

Debugging a concurrent program is much harder: the behavior of the program may change depending on the precise scheduling order chosen by the operating system. If the program is correct, the same output should be produced on the same input. If we are debugging a program, however, it is probably not correct. Instead, the precise behavior of the program may vary from run to run depending on which threads are scheduled first.

Debugging an operating system is even harder: not only does the operating system make widespread use of concurrency, but it is sometimes hard to tell what its “input” and “output” are.

It turns out, however, that we can use a virtual machine abstraction to provide a repeatable debugging environment for an operating system, and we can in turn use that to provide a repeatable debugging environment for concurrent applications.

It is easiest to see this on a uniprocessor. The execution of an operating system running in a virtual machine can only be affected by three factors: its initial state, the input data provided by its I/O devices, and the precise timing of interrupts.

Because the host kernel mediates each of these for the virtual machine, it can record them and play them back during debugging. As long as the host exactly mimics what it did the first time, the behavior of the guest operating system will be the same and the behavior of all applications running on top of the guest operating system will be the same.

Replaying the input is easy, but how do we replay the precise timing of interrupts? Most modern computer architectures have a counter on the processor to measure the number of instructions executed. The host operating system can use this to measure how many instructions the guest operating system (or guest application) executed between the point where the host gave up control of the processor to the guest, and when control returned to the host kernel due to an interrupt or trap.

To replay the precise timing of an asynchronous interrupt, the host kernel records the guest program counter and the instruction count at the point when the interrupt was delivered to the guest. On replay, the host kernel can set a trap on the page containing the program counter where the next interrupt will be taken. Since the guest might visit the same program counter multiple times, the host kernel uses the instruction count to determine which visit corresponds to the one where the interrupt was delivered. (Some systems make this even easier, by allowing the kernel to request a trap whenever the instruction count reaches a certain value.)
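Recording and replaying interrupt timing can be illustrated with a toy guest. Everything here is a hypothetical model: the guest is a loop over a short program, each instruction updates an integer state, and an interrupt handler folds an input value into that state. The point is that the program counter alone is ambiguous because the guest revisits it, while the pair (program counter, instruction count) identifies one delivery point.

```python
def run_guest(num_instructions, interrupts, program_len=4):
    """Run a toy guest. `interrupts` maps instruction count -> input value.

    Returns (final_state, log), where log holds what the host records:
    (program counter, instruction count, input value) per interrupt.
    """
    state, log = 0, []
    for count in range(num_instructions):
        pc = count % program_len  # the guest loops, so each pc repeats
        if count in interrupts:
            value = interrupts[count]
            log.append((pc, count, value))
            state ^= value        # guest interrupt handler runs
        state = (state * 31 + pc) % 1_000_003  # one guest instruction
    return state, log


def replay(num_instructions, log, program_len=4):
    """Re-run the guest, delivering each interrupt at its recorded count."""
    schedule = {}
    for pc, count, value in log:
        # The recorded pc must agree with the count on replay.
        assert count % program_len == pc
        schedule[count] = value
    return run_guest(num_instructions, schedule, program_len)[0]
```

Delivering the same input at the same program counter but on a different visit changes the result: with program_len=1, run_guest(2, {0: 1}) ends in state 961, while run_guest(2, {1: 1}) ends in state 31.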

Moreover, if we want to skip ahead to some known good intermediate point, we can take a checkpoint, and play forward the sequence of interrupts and input data from there. This is important as sometimes bugs in operating systems can take weeks to manifest themselves; if we needed to replay everything from boot the debugging process would be much more cumbersome.

Matters are more complex on a multicore system, as the precise behavior of both the guest operating system and the guest applications will depend on the precise ordering of instructions across the different processors. It is an ongoing area of research how best to provide deterministic execution in this setting. Provided that the program being debugged has no race conditions (that is, no access to shared memory outside of a critical section), then its behavior will be deterministic with one more piece of information. In addition to the initial state, inputs, and asynchronous interrupts, we also need to record which thread acquires each critical section in which order. If we replay the threads in that order and deliver interrupts precisely and provide the same device input, the behavior will be the same. Whether this is a practical solution is still an open question.

10.4 Security

Hardware or software address translation provides a basis for executing untrusted application code, to allow the operating system kernel to protect itself and other applications from malicious or buggy implementations.

A modern smartphone or tablet computer, however, has literally hundreds of thousands of applications that could be installed. Many or most are completely trustworthy, but others are specifically designed to steal or corrupt local data by exploiting weaknesses in the underlying operating system or the natural human tendency to trust technology. How is a user to know which is which? A similar situation exists for the web: even if most web sites are innocuous, some embed code that exploits known vulnerabilities in the browser defenses.

If we cannot limit our exposure to potentially malicious applications, what can we do? One important step is to keep your system software up to date. The malicious code authors recognize this: a recent survey showed that the most likely web sites to contain viruses are those targeted at the most novice users, e.g., screensavers and children’s games.

In this section, we discuss whether there are additional ways to use virtual machines to limit the scope of malicious applications.

Suppose you want to download a new application, or visit a new web site. There is some chance it will work as advertised, and there is some chance it will contain a virus. Is there any way to limit the potential of the new software to exploit some unknown vulnerability in your operating system or browser?

One interesting approach is to clone your operating system into a new virtual machine, and run the application in the clone rather than on the native operating system. A virtual machine constructed for the purpose of executing suspect code is called a virtual machine honeypot. By using a virtual machine, if the code turns out to be malicious, we can delete the virtual machine and leave the underlying operating system as it was before we attempted to run the application.

Creating a virtual machine to execute a new application might seem extravagant. However, earlier in this chapter, we discussed various ways to make this more efficient: shadow page tables, memory compression, efficient checkpoint and restart, and copy-on-write. And of course, reinstalling your system after it has become infected with a virus is even slower!

Both researchers and vendors of commercial anti-virus software make extensive use of virtual machine honeypots to detect and understand viruses. For example, a frequent technique is to create an array of virtual machines, each with a different version of the operating system. By loading a potential virus into each one, and then simulating user behavior, we can more easily determine which versions of software are vulnerable and which are not.

A limitation is that we need to be able to tell if the browser or operating system running in the virtual machine honeypot has been corrupted. Often, viruses operate instantly, by attempting to install logging software or scanning the disk for sensitive information such as credit card numbers. However, there is nothing to keep a virus from lying in wait; this has become more common recently, particularly among viruses designed for military or business espionage.

Another limitation is that the virus might be designed to infect both the guest operating system running in the clone and the host kernel implementing the virtual machine. (In the case of the web, the virus must infect the browser, the guest operating system, and the host.) As long as the system software is kept up to date, the system is vulnerable only if the virus is able to exploit some unknown weakness in the guest operating system and a separate unknown weakness in the host implementation of the virtual machine. This provides defense in depth, improving security through multiple layers of protection.

10.5 User-Level Memory Management

With the increasing sophistication of applications and their runtime systems, most widely used operating systems have introduced hooks for applications to manage their own memory. While the details of the interface differ from system to system, the hooks preserve the role of the kernel in allocating resources between processes and in preventing access to privileged memory. Once a page frame has been assigned to a process, however, the kernel can leave it up to the process to determine what to do with that resource.

Operating systems can provide applications the flexibility to decide:

Where to get missing pages. As we noted in the previous chapter, a modern memory hierarchy is deep and complex: local disk, local non-volatile memory, remote memory inside a data center, or remote disk. By giving applications control, the kernel can keep its own memory hierarchy simple and local, while still allowing sophisticated applications to take advantage of network resources when they are available, even when those resources are on machines running completely different operating systems.

Which pages can be accessed. Many applications such as browsers and databases need to set up their own application-level sandboxes for executing untrusted code. Today this is done with a combination of hardware and software techniques, as we described in Chapter 8. Finer-grained control over page fault handling allows more sophisticated models for managing sharing between regions of untrusted code.

Which pages should be evicted. Often, an application will have better information than the operating system about which pages it will reference in the near future.

Many applications can adapt the size of their working set to the resources provided by the kernel, but they will have worse performance whenever there is a mismatch.

Garbagecollectedprograms.Consideraprogramthatdoesitsowngarbagecollection.Whenitstartsup,itallocatesablockofmemoryinitsvirtualaddressspacetoserveastheheap.Periodically,theprogramscansthroughtheheaptocompactitsdatastructures,freeinguproomforadditionaldatastructures.Thiscausesallpagestoappeartoberecentlyused,confoundingthekernel’smemorymanager.Bycontrast,theapplicationknowsthatthebestpagetoreplaceisonethatwasrecentlycleanedofapplicationdata.

Itisequallyconfoundingtotheapplication.Howdoesthegarbagecollectorknowhowmuchmemoryitshouldallocatefortheheap?Ideally,thegarbagecollectorshoulduseexactlyasmuchmemoryasthekernelisabletoprovide,andnomore.Iftheruntimeheapistoosmall,theprogrammustgarbagecollect,eventhoughmorepageframesavailable.Iftheheapistoolarge,thekernelwillpagepartsoftheheaptodiskinsteadofaskingtheapplicationtopaytheloweroverheadofcompactingitsmemory.

Databases.Databasesandotherdataprocessingsystemsoftenmanipulatehugedatasetsthatmustbestreamedfromdiskintomemory.AswenotedinChapter9,algorithmsforlargedatasetswillbemoreefficientiftheyarecustomizedtotheamountofavailablephysicalmemory.Iftheoperatingsystemevictsapagethatthedatabaseexpectstobeinmemory,thesealgorithmswillrunmuchmoreslowly.

Virtualmachines.Asimilarissueariseswithvirtualmachines.Theguestoperatingsystemrunninginsideofavirtualmachinethinksithasasetofphysicalpageframes,whichitcanassigntothevirtualpagesofapplicationsrunninginthevirtualmachine.Inreality,however,thepageframesintheguestoperatingsystemarevirtualandcanbepagedtodiskbythehostoperatingsystem.Ifthehostoperatingsystemcouldtelltheguestoperatingsystemwhenitneededtostealapageframe(ordonateapageframe),thentheguestwouldknowexactlyhowmanypageframeswereavailabletobeallocatedtoitsapplications.

In each of these cases, the performance of a resource manager can be compromised if it runs on top of a virtualized, rather than a physical, resource. What is needed is for the operating system kernel to communicate how much memory is assigned to a process or virtual machine so that the application can do its own memory management. As processes start and complete, the amount of available physical memory will change, and therefore the assignment to each application will change.

To handle these needs, most operating systems provide some level of application control over memory. Two models have emerged:

Pinned pages. A simple and widely available model is to allow applications to pin virtual memory pages to physical page frames, preventing those pages from being evicted unless absolutely necessary. Once its pages are pinned, the application can manage its memory however it sees fit, for example, by explicitly shuffling data back and forth to disk.
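As a rough illustration of pinning, the sketch below asks the kernel to lock one anonymous page into physical memory using the POSIX `mlock` call, reached from Python through `ctypes`. It assumes a POSIX system where `mlock` is visible in the process's own symbol table; the helper name `pin_one_page` is made up for this example. The request can fail (for instance, if the process's locked-memory limit is exceeded), so the function reports success rather than assuming it.

```python
import ctypes
import mmap


def pin_one_page():
    """Try to pin one anonymous page; return True if the kernel accepted it."""
    page = mmap.PAGESIZE
    buf = mmap.mmap(-1, page)                  # anonymous, page-aligned mapping
    libc = ctypes.CDLL(None, use_errno=True)   # assumption: POSIX libc symbols
    for fn in (libc.mlock, libc.munlock):
        fn.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        fn.restype = ctypes.c_int

    view = ctypes.c_char.from_buffer(buf)      # exposes the mapping's address
    addr = ctypes.addressof(view)
    try:
        pinned = libc.mlock(addr, page) == 0   # 0 means the page is now locked
        if pinned:
            buf[:5] = b"hello"                 # touching it cannot page-fault to disk
            libc.munlock(addr, page)           # release the pin when done
        return pinned
    finally:
        del view                               # drop the export so close() succeeds
        buf.close()


pinned = pin_one_page()
print("page pinned:", pinned)
```

A real application would pin a larger region once at startup and then manage the contents of that region itself.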

Figure 10.8: The operation of a user-level page handler. On a page fault, the hardware traps to the kernel; if the fault is for a segment with a user-level pager, the kernel passes the fault to the user-level handler to manage. The user-level handler is pinned in memory to avoid recursive faults.

User-level pagers. A more general solution is for applications to specify a user-level page handler for a memory segment. On a page fault or protection violation, the kernel trap handler is invoked. Instead of handling the fault itself, the kernel passes control to the user-level handler, as in a UNIX signal handler. The user-level handler can then decide how to manage the trap: where to fetch the missing page, what action to take if the application is a sandbox, and which page to replace. To avoid infinite recursion, the user-level page handler must itself be stored in pinned memory.
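The control flow of a user-level pager can be sketched with a toy simulation. Nothing here is a real kernel interface: `ToyKernel`, `register_pager`, and `heap_pager` are hypothetical names, and the default policy (evict the oldest resident page) is chosen only to contrast with the application's choice. The example mirrors the garbage collection case: the application knows which heap page was recently cleaned and evicts that one.

```python
from collections import OrderedDict


class ToyKernel:
    """Toy model: a fixed pool of page frames plus per-segment fault upcalls."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.resident = OrderedDict()   # (segment, page) -> True, oldest first
        self.pagers = {}                # segment name -> user-level handler

    def register_pager(self, segment, handler):
        self.pagers[segment] = handler

    def access(self, segment, page):
        key = (segment, page)
        if key in self.resident:
            return "hit"
        # Page fault: the hardware traps to the kernel.
        if len(self.resident) >= self.num_frames:
            handler = self.pagers.get(segment)
            candidates = [k for k in self.resident if k[0] == segment]
            if handler is not None and candidates:
                # Upcall: the user-level handler picks one of its own pages.
                victim = handler(candidates, key)
            else:
                # Default policy in this toy: evict the oldest resident page.
                victim = next(iter(self.resident))
            del self.resident[victim]
        self.resident[key] = True
        return "fault"


cleaned = {("heap", 1)}   # pages the application knows hold no live data


def heap_pager(candidates, faulting):
    """Prefer a recently cleaned page; otherwise fall back to the oldest."""
    for key in candidates:
        if key in cleaned:
            return key
    return candidates[0]


kernel = ToyKernel(num_frames=3)
kernel.register_pager("heap", heap_pager)
trace = [kernel.access("heap", p) for p in (0, 1, 2, 3, 0)]
print(trace)
```

With the handler registered, the fault on page 3 evicts the cleaned page 1, so the final access to page 0 hits. Under the toy's default policy, page 0 would have been evicted instead and the final access would fault.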

10.6 Summary and Future Directions

In this chapter, we have argued that address translation provides a powerful tool for operating systems to provide a set of advanced services to applications to improve system performance, reliability, and security. Services such as checkpointing, recoverable memory, deterministic debugging, and honeypots are now widely supported at the virtual machine layer, and we believe that they will come to be standard in most operating systems as well.

Moving forward, it is clear that the demands on the memory management system for advanced services will increase. Not only are memory hierarchies becoming increasingly complex, but the diversity of services provided by the memory management system has added even more complexity.

Operating systems often go through cycles of gradually increasing complexity followed by rapid shifts back towards simplicity. The recent commercial interest in virtual machines may yield a shift back towards simpler memory management, by reducing the need for the kernel to provide every service that any application might need. Processor architectures now directly support user-level page tables. This potentially opens up an entire realm for more sophisticated runtime systems, for those applications that are themselves miniature operating systems, and a concurrent simplification of the kernel. With the right operating system support, applications will be able to set up and manage their own page tables directly, implement their own user-level process abstractions, and provide their own transparent checkpointing and recovery on memory segments.

Exercises

1. This question concerns the operation of shadow page tables for virtual machines, where a guest process is running on top of a guest operating system on top of a host operating system. The architecture uses paged segmentation, with a 32-bit virtual address divided into fields as follows:

| 4-bit segment number | 12-bit page number | 16-bit offset |

The guest operating system creates and manages segment and page tables to map the guest virtual addresses to guest physical memory. These tables are as follows (all values in hexadecimal):

Segment Table        Page Table A         Page Table B
0  Page Table A      0  0002              0  0001
1  Page Table B      1  0006              1  0004
x  (rest invalid)    2  0000              2  0003
                     3  0005              x  (rest invalid)
                     x  (rest invalid)

The host operating system creates and manages segment and page tables to map the guest physical memory to host physical memory. These tables are as follows:

Segment Table        Page Table K
0  Page Table K      0  BEEF
x  (rest invalid)    1  F000
                     2  CAFE
                     3  3333
                     4  (invalid)
                     5  BA11
                     6  DEAD
                     7  5555
                     x  (rest invalid)

a. Find the host physical address corresponding to each of the following guest virtual addresses. Answer “invalid guest virtual address” if the guest virtual address is invalid; answer “invalid guest physical address” if the guest virtual address maps to a valid guest physical page frame, but the guest physical page has no valid mapping in the host tables.

i. 00000000
ii. 20021111
iii. 10012222
iv. 00023333
v. 10024444

b. Using the information in the tables above, fill in the contents of the shadow segment and page tables for direct execution of the guest process.

c. Assuming that the guest physical memory is contiguous, list three reasons why the host page table might have an invalid entry for a guest physical page frame, with valid entries on either side.
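As an aid to reading the address format in exercise 1, the following sketch splits a 32-bit address into its segment, page, and offset fields and recombines a frame number with an offset. The function names and the sample address are made up for illustration, and the sample is deliberately not one of the addresses in the exercise. It assumes that a page table entry holds a 16-bit frame number that replaces the upper 16 bits of the address.

```python
def split_address(vaddr):
    """Split a 32-bit address into (segment, page, offset) fields."""
    segment = (vaddr >> 28) & 0xF      # top 4 bits
    page = (vaddr >> 16) & 0xFFF       # next 12 bits
    offset = vaddr & 0xFFFF            # low 16 bits
    return segment, page, offset


def frame_to_address(frame, offset):
    """Combine a 16-bit frame number with a 16-bit offset (assumed layout)."""
    return (frame << 16) | offset


seg, page, off = split_address(0x3ABC1234)
print(hex(seg), hex(page), hex(off))
print(hex(frame_to_address(0x00AB, off)))
```

A full translation applies this split twice: once against the guest tables to obtain a guest physical address, and again against the host tables to obtain a host physical address.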

2. Suppose we are doing incremental checkpoints on a system with 4 KB pages and a disk capable of transferring data at 10 MB/s.

a. What is the maximum rate of updates to new pages if every modified page is sent in its entirety to disk on every checkpoint and we require that each checkpoint reach disk before we start the next checkpoint?

b. Suppose that most pages saved during an incremental checkpoint are only partially modified. Describe how you would design a system to save only the modified portions of each page as part of the checkpoint.


Glossary

absolute path
A file pathname interpreted relative to the root directory.

abstract virtual machine
The interface provided by an operating system to its applications, including the system call interface, the memory abstraction, exceptions, and signals.

ACID properties
A mnemonic for the properties of a transaction: atomicity, consistency, isolation, and durability.

acquire-all/release-all
A design pattern to provide atomicity of a request consisting of multiple operations. A thread acquires all of the locks it might need before starting to process a request; it releases the locks once the request is done.

address translation
The conversion from the memory address the program thinks it is referencing to the physical location of the memory.

affinity scheduling
A scheduling policy where tasks are preferentially scheduled onto the same processor they had previously been assigned, to improve cache reuse.

annualdiskfailurerateThefractionofdisksexpectedtofailureeachyear.

APISee:applicationprogramminginterface.

applicationprogramminginterfaceThesystemcallinterfaceprovidedbyanoperatingsystemtoapplications.

armAnattachmentallowingthemotionofthediskheadacrossadisksurface.

armassemblyAmotorplusthesetofdiskarmsneededtopositionadiskheadtoreadorwriteeachsurfaceofthedisk.

arrivalrateTherateatwhichtasksarriveforservice.

asynchronousI/OAdesignpatternforsystemcallstoallowasingle-threadedprocesstomakemultipleconcurrentI/Orequests.WhentheprocessissuesanI/Orequest,thesystemcallreturnsimmediately.TheprocesslateronreceivesanotificationwhentheI/Ocompletes.

asynchronous procedure call A procedure call where the caller starts the function, continues execution concurrently with the called function, and later waits for the function to complete.

atomic commit The moment when a transaction commits to apply all of its updates.

atomic memory The value stored in memory is the last value stored by one of the processors, not a mixture of the updates of different processors.

atomic operations Indivisible operations that cannot be interleaved with or split by other operations.

atomic read-modify-write instruction A processor-specific instruction that lets one thread temporarily have exclusive and atomic access to a memory location while the instruction executes. Typically, the instruction (atomically) reads a memory location, does some simple arithmetic operation to the value, and stores the result.

attribute record In NTFS, a variable-size data structure containing either file data or file metadata.

availability The percentage of time that a system is usable.

average seek time The average time across seeks between each possible pair of tracks on a disk.

AVM See: abstract virtual machine.

backup A logically or physically separate copy of a system’s main storage.

base and bound memory protection An early system for memory protection where each process is limited to a specific range of physical memory.

batch operating system An early type of operating system that efficiently ran a queue of tasks. While one program was running, another was being loaded into memory.

bathtub model A model of disk device failure combining device infant mortality and wear out.

Belady’s anomaly For some cache replacement policies and some reference patterns, adding space to a cache can hurt the cache hit rate.
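
Belady’s anomaly is easiest to believe after counting the misses by hand. The sketch below is illustrative only: it assumes FIFO replacement and a well-known twelve-reference string, and the function name `fifo_faults` is ours rather than anything defined in the text. It counts misses with three and then four page frames.

```python
from collections import deque

def fifo_faults(refs, frames):
    """Count misses under FIFO replacement with a fixed number of frames."""
    resident = set()
    order = deque()          # oldest resident page sits at the left
    faults = 0
    for page in refs:
        if page in resident:
            continue         # hit: FIFO does not reorder pages on a hit
        faults += 1
        if len(resident) == frames:
            victim = order.popleft()   # evict the page brought in earliest
            resident.remove(victim)
        resident.add(page)
        order.append(page)
    return faults

# Hypothetical reference pattern chosen to trigger the anomaly.
refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
faults_3 = fifo_faults(refs, 3)   # 9 misses
faults_4 = fifo_faults(refs, 4)   # 10 misses: the larger cache does worse
```

With this pattern the larger cache takes one more miss than the smaller one, which is exactly the behavior the definition describes.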

best fit A storage allocation policy that attempts to place a newly allocated file in the smallest free region that is large enough to hold it.

BIOS The initial code run when an Intel x86 computer boots; acronym for Basic Input/Output System. See also: Boot ROM.

bit error rate The non-recoverable read error rate.

bitmap A data structure for block allocation where each block is represented by one bit.

block device An I/O device that allows data to be read or written in fixed-sized blocks.

block group A set of nearby disk tracks.

block integrity metadata Additional data stored with a block to allow the software to validate that the block has not been corrupted.

blocking bounded queue A bounded queue where a thread trying to remove an item from an empty queue will wait until an item is available, and a thread trying to put an item into a full queue will wait until there is room.
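
As a minimal sketch of the blocking bounded queue just defined, the Python below uses one lock and two condition variables from the standard `threading` module. The class and method names (`BlockingBoundedQueue`, `insert`, `remove`) are illustrative assumptions, not an interface taken from the text.

```python
import threading
from collections import deque

class BlockingBoundedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.lock = threading.Lock()
        # Both condition variables share the single lock protecting the queue.
        self.not_empty = threading.Condition(self.lock)
        self.not_full = threading.Condition(self.lock)

    def insert(self, item):
        with self.lock:
            while len(self.items) == self.capacity:
                self.not_full.wait()      # queue full: wait for room
            self.items.append(item)
            self.not_empty.notify()       # wake one waiting remover

    def remove(self):
        with self.lock:
            while not self.items:
                self.not_empty.wait()     # queue empty: wait for an item
            item = self.items.popleft()
            self.not_full.notify()        # wake one waiting inserter
            return item

def producer(queue, count):
    for i in range(count):
        queue.insert(i)

q = BlockingBoundedQueue(capacity=2)
worker = threading.Thread(target=producer, args=(q, 5))
worker.start()
received = [q.remove() for _ in range(5)]
worker.join()
```

The waits sit inside `while` loops so that a thread rechecks the queue state each time it wakes; the producer here blocks whenever two items are already queued.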

Bohr bugs Bugs that are deterministic and reproducible, given the same program input. See also: Heisenbugs.

Boot ROM Special read-only memory containing the initial instructions for booting a computer.

bootloader Program stored at a fixed position on disk (or flash RAM) to load the operating system into memory and start it executing.

bounded queue A queue with a fixed size limit on the number of items stored in the queue.

bounded resources A necessary condition for deadlock: there are a finite number of resources that threads can simultaneously use.

buffer overflow attack An attack that exploits a bug where input can overflow the buffer allocated to hold it, overwriting other important program data structures with data provided by the attacker. One common variation overflows a buffer allocated on the stack (e.g., a local, automatic variable) and replaces the function’s return address with a return address specified by the attacker, possibly to code “pushed” onto the stack with the overflowing input.

bulk synchronous A type of parallel application where work is split into independent tasks and where each task completes before the results of any of the tasks can be used.

bulk synchronous parallel programming See: data parallel programming.

bursty distribution A probability distribution that is less evenly distributed around the mean value than an exponential distribution. See: exponential distribution. Compare: heavy-tailed distribution.

busy-waiting A thread spins in a loop waiting for a concurrent event to occur, consuming CPU cycles while it is waiting.

cache A copy of data that can be accessed more quickly than the original.

cache hit The cache contains the requested item.

cache miss The cache does not contain the requested item.

checkpoint A consistent snapshot of the entire state of a process, including the contents of memory and processor registers.

child process A process created by another process. See also: parent process.

Circular SCAN See: CSCAN.

circular waiting A necessary condition for deadlock to occur: there is a set of threads such that each thread is waiting for a resource held by another.

client-server communication Two-way communication between processes, where the client sends a request to the server to do some task, and when the operation is complete, the server replies back to the client.

clock algorithm A method for identifying a not recently used page to evict. The algorithm sweeps through each page frame: if the page use bit is set, it is cleared; if the use bit is not set, the page is reclaimed.
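
The sweep in the clock algorithm can be sketched in a few lines. This is a simplified model under stated assumptions: use bits are kept in a plain Python list rather than in page table entries, and the function name `clock_evict` and its return convention are hypothetical.

```python
def clock_evict(use_bits, hand):
    """Sweep from `hand`; return (victim_frame, new_hand).

    A frame whose use bit is set gets the bit cleared and is skipped;
    the first frame found with a clear use bit is reclaimed.
    """
    n = len(use_bits)
    while True:
        if use_bits[hand]:
            use_bits[hand] = False        # clear the bit and move on
            hand = (hand + 1) % n
        else:
            return hand, (hand + 1) % n   # reclaim this frame

use_bits = [True, True, False, True]
victim, hand = clock_evict(use_bits, 0)
```

Here frames 0 and 1 have their use bits cleared, frame 2 is reclaimed, and the hand stops at frame 3. Even if every use bit starts out set, the loop terminates within two sweeps because the first sweep clears them all.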

cloud computing A model of computing where large-scale applications run on shared computing and storage infrastructure in data centers instead of on the user’s own computer.

commit The outcome of a transaction where all of its updates occur.

compare-and-swap An atomic read-modify-write instruction that first tests the value of a memory location, and if the value has not been changed, sets it to a new value.

compute-bound task A task that primarily uses the processor and does little I/O.

computer virus A computer program that modifies an operating system or application to copy itself from computer to computer without the computer owner’s permission or knowledge. Once installed on a computer, a virus often provides the attacker control over the system’s resources or data.

concurrency Multiple activities that can happen at the same time.

condition variable A synchronization variable that enables a thread to efficiently wait for a change to shared state protected by a lock.

continuation A data structure used in event-driven programming that keeps track of a task’s current state and its next step.

cooperating threads Threads that read and write shared state.

cooperative caching Using the memory of nearby nodes over a network as a cache to avoid the latency of going to disk.

cooperative multi-threading Each thread runs without interruption until it explicitly relinquishes control of the processor, e.g., by exiting or calling thread_yield.

copy-on-write A method of sharing physical memory between two logically distinct copies (e.g., in different processes). Each shared page is marked as read-only so that the operating system kernel is invoked and can make a copy of the page if either process tries to write it. The process can then modify the copy and resume normal execution.

copy-on-write file system A file system where an update to the file system is made by writing new versions of modified data and metadata blocks to free disk blocks. The new blocks can point to unchanged blocks in the previous version of the file system. See also: COW file system.

core map A data structure used by the memory management system to keep track of the state of physical page frames, such as which processes reference the page frame.

COW file system See: copy-on-write file system.

critical path The minimum sequence of steps for a parallel application to compute its result, even with infinite resources.

critical section A sequence of code that operates on shared state.

cross-site scripting An attack against a client computer that works by compromising a server visited by the client. The compromised server then provides scripting code to the client that accesses and downloads the client’s sensitive data.

cryptographic signature A specially designed function of a data block and a private cryptographic key that allows someone with the corresponding public key to verify that an authorized entity produced the data block. It is computationally intractable for an attacker without the private key to create a different data block with a valid signature.

CSCAN A variation of the SCAN disk scheduling policy in which the disk only services requests when the head is traveling in one direction. See also: Circular SCAN.

current working directory The current directory of the process, used for interpreting relative pathnames.

data breakpoint A request to stop the execution of a program when it references or modifies a particular memory location.

data parallel programming A programming model where the computation is performed in parallel across all items in a data set.

deadlock A cycle of waiting among a set of threads, where each thread waits for some other thread in the cycle to take some action.

deadlocked state The system has at least one deadlock.

declustering A technique for reducing the recovery time after a disk failure in a RAID system by spreading redundant disk blocks across many disks.

defense in depth Improving security through multiple layers of protection.

defragment Coalesce scattered disk blocks to improve spatial locality, by reading data from its present storage location and rewriting it to a new, more compact, location.

demand paging Using address translation hardware to run a process without all of its memory physically present. When the process references a missing page, the hardware traps to the kernel, which brings the page into memory from disk.

deterministic debugging The ability to re-execute a concurrent process with the same schedule and sequence of internal and external events.

device driver Operating system code to initialize and manage a particular I/O device.

direct mapped cache Only one entry in the cache can hold a specific memory location, so on a lookup, the system must check the address against only that entry to determine if there is a cache hit.

direct memory access Hardware I/O devices transfer data directly into/out of main memory at a location specified by the operating system. See also: DMA.

dirty bit A status bit in a page table entry recording whether the contents of the page have been modified relative to what is stored on disk.

disk buffer memory Memory in the disk controller to buffer data being read or written to the disk.

disk infant mortality The device failure rate is higher than normal during the first few weeks of use.

disk wear out The device failure rate rises after the device has been in operation for several years.

DMA See: direct memory access.

dnode In ZFS, a file is represented by a variable-depth tree whose root is a dnode and whose leaves are its data blocks.

double indirect block A storage block containing pointers to indirect blocks.

double-checked locking A pitfall in concurrent code where a data structure is lazily initialized by first checking, without a lock, whether it has been set, and if not, acquiring a lock and checking again before calling the initialization function. With instruction re-ordering, double-checked locking can fail unexpectedly.

dual redundancy array A RAID storage algorithm using two redundant disk blocks per array to tolerate two disk failures. See also: RAID 6.

dual-mode operation Hardware processor that has (at least) two privilege levels: one for executing the kernel with complete access to the capabilities of the hardware and a second for executing user code with restricted rights. See also: kernel-mode operation. See also: user-mode operation.

dynamically loadable device driver Software to manage a specific device, interface, or chipset, added to the operating system kernel after the kernel starts running.

earliest deadline first A scheduling policy that performs the task that needs to be completed first, but only if it can be finished in time.

EDF See: earliest deadline first.

efficiency The lack of overhead in implementing an abstraction.

erasure block The unit of erasure in a flash memory device. Before any portion of an erasure block can be over-written, every cell in the entire erasure block must be set to a logical “1.”

error correcting code A technique for storing data redundantly to allow for the original data to be recovered even though some bits in a disk sector or flash memory page are corrupted.

event-driven programming A coding design pattern where a thread spins in a loop; each iteration gets and processes the next I/O event.

exception See: processor exception.

executable image File containing a sequence of machine instructions and initial data values for a program.

execution stack Space to store the state of local variables during procedure calls.

exponential distribution A convenient probability distribution for use in queueing theory because it has the property of being memoryless. For a continuous random variable with a mean of 1⁄λ, the probability density function is f(x) = λ times e raised to the power -λx.

extent A variable-sized region of a file that is stored in a contiguous region on the storage device.

external fragmentation In a system that allocates memory in contiguous regions, the unusable memory between valid contiguous allocations. A new request for memory may find no single free region that is both contiguous and large enough, even though there is enough free memory in aggregate.

fairness Partitioning of shared resources between users or applications either equally or balanced according to some desired priorities.

false sharing Extra inter-processor communication required because a single cache entry contains portions of two different data structures with different sharing patterns.

fate sharing When a crash in one module implies a crash in another. For example, a library shares fate with the application it is linked with; if either crashes, the process exits.

fault isolation An error in one application should not disrupt other applications, or even the operating system itself.

file A named collection of data in a file system.

file allocation table An array of entries in the FAT file system stored in a reserved area of the volume, where each entry corresponds to one file data block, and points to the next block in the file.

file data Contents of a file.

file descriptor A handle to an open file, device, or channel. See also: file handle. See also: file stream.

file directory A list of human-readable names plus a mapping from each name to a specific file or sub-directory.

file handle See: file descriptor.

file index structure A persistently stored data structure used to locate the blocks of the file.

file metadata Information about a file that is managed by the operating system, but not including the file contents.

file stream See: file descriptor.

file system An operating system abstraction that provides persistent, named data.

file system fingerprint A checksum across the entire file system.

fill-on-demand A method for starting a process before all of its memory is brought in from disk. If the first access to the missing memory triggers a trap to the kernel, the kernel can fill the memory and then resume.

fine-grained locking A way to increase concurrency by partitioning an object’s state into different subsets each protected by a different lock.

finished list The set of threads that are complete but not yet de-allocated, e.g., because a join may read the return value from the thread control block.

first-in-first-out A scheduling policy that performs each task in the order in which it arrives.

flash page failure A flash memory device failure where the data stored on one or more individual pages of flash are lost, but the rest of the flash continues to operate correctly.

flash translation layer A layer that maps logical flash pages to different physical pages on the flash device. See also: FTL.

flash wear out After some number of program-erase cycles, a given flash storage cell may no longer be able to reliably store information.

fork-join parallelism A type of parallel programming where threads can be created (forked) to do work in parallel with a parent thread; a parent may asynchronously wait for a child thread to finish (join).

free space map A file system data structure used to track which storage blocks are free and which are in use.

FTL See: flash translation layer.

full disk failure When a disk device stops being able to service reads or writes to all sectors.

full flash drive failure When a flash device stops being able to service reads or writes to all memory pages.

fully associative cache Any entry in the cache can hold any memory location, so on a lookup, the system must check the address against all of the entries in the cache to determine if there is a cache hit.

gang scheduling A scheduling policy for multiprocessors that performs all of the runnable tasks for a particular process at the same time.

Global Descriptor Table The x86 terminology for a segment table for shared segments. A Local Descriptor Table is used for segments that are private to the process.

grace period For a shared object protected by a read-copy-update lock, the time from when a new version of a shared object is published until the last reader of the old version is guaranteed to be finished.

green threads A thread system implemented entirely at user-level without any reliance on operating system kernel services, other than those designed for single-threaded processes.

group commit A technique that batches multiple transaction commits into a single disk operation.

guest operating system An operating system running in a virtual machine.

hard link The mapping between a file name and the underlying file, typically when there are multiple pathnames for the same underlying file.

hardware abstraction layer A module in the operating system that hides the specifics of different hardware implementations. Above this layer, the operating system is portable.

hardware timer A hardware device that can cause a processor interrupt after some delay, either in time or in instructions executed.

head The component that writes the data to or reads the data from a spinning disk surface.

head crash An error where the disk head physically scrapes the magnetic surface of a spinning disk.

head switch time The time it takes to re-position the disk arm over the corresponding track on a different surface, before a read or write can begin.

heap Space to store dynamically allocated data structures.

heavy-tailed distribution A probability distribution such that events far from the mean value (in aggregate) occur with significant probability. When used for the distribution of time between events, the remaining time to the next event is positively related to the time already spent waiting: you expect to wait longer the longer you have already waited.

Heisenbugs Bugs in concurrent programs that disappear or change behavior when you try to examine them. See also: Bohr bugs.

hint A result of some computation whose results may no longer be valid, but where using an invalid hint will trigger an exception.

home directory The sub-directory containing a user’s files.

host operating system An operating system that provides the abstraction of a virtual machine, to run another operating system as an application.

host transfer time The time to transfer data between the host’s memory and the disk’s buffer.

hyperthreading See: simultaneous multi-threading.

I/O-bound task A task that primarily does I/O, and does little processing.

idempotent An operation that has the same effect whether executed once or many times.

incremental checkpoint A consistent snapshot of the portion of process memory that has been modified since the previous checkpoint.

independent threads Threads that operate on completely separate subsets of process memory.

indirect block A storage block containing pointers to file data blocks.

inode In the Unix Fast File System (FFS) and related file systems, an inode stores a file’s metadata, including an array of pointers that can be used to find all of the file’s blocks. The term inode is sometimes used more generally to refer to any file system’s per-file metadata data structure.

inode array The fixed location on disk containing all of the file system’s inodes. See also: inumber.

intentions The set of writes that a transaction will perform if the transaction commits.

internal fragmentation With paged allocation of memory, the unusable memory at the end of a page because a process can only be allocated memory in page-sized chunks.

interrupt An asynchronous signal to the processor that some external event has occurred that may require its attention.

interrupt disable A privileged hardware instruction to temporarily defer any hardware interrupts, to allow the kernel to complete a critical task.

interrupt enable A privileged hardware instruction to resume hardware interrupts, after a non-interruptible task is completed.

interrupt handler A kernel procedure invoked when an interrupt occurs.

interrupt stack A region of memory for holding the stack of the kernel’s interrupt handler. When an interrupt, processor exception, or system call trap causes a context switch into the kernel, the hardware changes the stack pointer to point to the base of the kernel’s interrupt stack.

interrupt vector table A table of pointers in the operating system kernel, indexed by the type of interrupt, with each entry pointing to the first instruction of a handler procedure for that interrupt.

inumber The index into the inode array for a particular file.

inverted page table A hash table used for translation between virtual page numbers and physical page frames.

kernel thread A thread that is implemented inside the operating system kernel.

kernel-mode operation The processor executes in an unrestricted mode that gives the operating system full control over the hardware. Compare: user-mode operation.

LBA See: logical block address.

least frequently used A cache replacement policy that evicts whichever block has been used the least often, over some period of time. See also: LFU.

least recently used A cache replacement policy that evicts whichever block has not been used for the longest period of time. See also: LRU.
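
A small software model of least recently used replacement can be built on Python’s `collections.OrderedDict`, which keeps keys in order and can move one to the end on each access. This is a sketch for illustration: the class name `LRUCache` and its `get`/`put` methods are assumptions, and hardware caches typically only approximate LRU rather than track exact recency this way.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # least recently used entry comes first

    def get(self, key):
        if key not in self.blocks:
            return None               # cache miss
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, value):
        if key in self.blocks:
            self.blocks.move_to_end(key)
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now more recently used than "b"
cache.put("c", 3)     # cache is full, so "b" is evicted
```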

LFU See: least frequently used.

Little’s Law In a stable system where the arrival rate matches the departure rate, the number of tasks in the system equals the system’s throughput multiplied by the average time a task spends in the system: N = XR.
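
As a worked instance of N = XR, the sketch below uses made-up numbers: a server completing 200 requests per second, with each request spending half a second in the system on average. The function names are hypothetical.

```python
def tasks_in_system(throughput, avg_time_in_system):
    """Little's Law: N = X * R."""
    return throughput * avg_time_in_system

def avg_time_from(tasks, throughput):
    """Little's Law rearranged: R = N / X."""
    return tasks / throughput

n = tasks_in_system(throughput=200.0, avg_time_in_system=0.5)   # 100 tasks
r = avg_time_from(tasks=100.0, throughput=200.0)                # 0.5 seconds
```

The same relation can be read in either direction: given any two of the three quantities for a stable system, the third follows.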

liveness property A constraint on program behavior such that it always produces a result. Compare: safety property.

locality heuristic A file system block allocation policy that places files in nearby disk sectors if they are likely to be read or written at the same time.

lock A type of synchronization variable used for enforcing atomic, mutually exclusive access to shared data.

lock ordering A widely used approach to prevent deadlock, where locks are acquired in a pre-determined order.
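
One way to picture lock ordering is a transfer between two accounts, each with its own lock. In the sketch below the order is fixed by a numeric identifier; the `Account` class, its `ident` field, and the `transfer` function are illustrative assumptions, and the code assumes the two accounts are distinct.

```python
import threading

class Account:
    def __init__(self, ident, balance):
        self.ident = ident            # assumed unique; defines the lock order
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always acquire the lock of the lower-numbered account first, no matter
    # which direction the money moves. Assumes src is not dst.
    first, second = sorted((src, dst), key=lambda acct: acct.ident)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a = Account(1, 100)
b = Account(2, 100)
threads = [threading.Thread(target=transfer, args=(a, b, 1)) for _ in range(50)]
threads += [threading.Thread(target=transfer, args=(b, a, 1)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Had each thread locked `src` before `dst`, a thread moving money from `a` to `b` could hold `a` while waiting for `b`, with another thread doing the reverse. The fixed order rules out that cycle.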

lock-free data structures Concurrent data structure that guarantees progress for some thread: some method will finish in a finite number of steps, regardless of the state of other threads executing in the data structure.

log An ordered sequence of steps saved to persistent storage.

logical block address A unique identifier for each disk sector or flash memory block, typically numbered from 1 to the size of the disk/flash device. The disk interface converts this identifier to the physical location of the sector/block. See also: LBA.

logical separation A backup storage policy where the backup is stored at the same location as the primary storage, but with restricted access, e.g., to prevent updates.

LRU See: least recently used.

master file table In NTFS, an array of records storing metadata about each file. See also: MFT.

maximum seek time The time it takes to move the disk arm from the innermost track to the outermost one or vice versa.

max-min fairness A scheduling objective to maximize the minimum resource allocation given to each task.
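
One common way to compute a max-min fair allocation of a single divisible resource is progressive filling: satisfy the smallest demands in full, then split what is left evenly among the tasks that want more. The sketch below assumes each task has a known demand; the function name `max_min_allocation` and the example numbers are hypothetical.

```python
def max_min_allocation(capacity, demands):
    """Return a max-min fair allocation of `capacity` across `demands`."""
    alloc = [0.0] * len(demands)
    remaining = float(capacity)
    # Consider tasks from smallest demand to largest.
    unsatisfied = sorted(range(len(demands)), key=lambda i: demands[i])
    while unsatisfied:
        share = remaining / len(unsatisfied)   # equal split of what is left
        i = unsatisfied[0]
        if demands[i] <= share:
            alloc[i] = float(demands[i])       # small demand fully satisfied
            remaining -= demands[i]
            unsatisfied.pop(0)
        else:
            for j in unsatisfied:              # everyone left gets an equal share
                alloc[j] = share
            break
    return alloc

alloc = max_min_allocation(12, [2, 4, 10])     # [2.0, 4.0, 6.0]
```

In the example, the two small tasks receive everything they ask for, and the task demanding 10 receives the remaining 6. No task can be given more without taking from a task that already has less.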

MCS lock An efficient spinlock implementation where each waiting thread spins on a separate memory location.

mean time to data loss The expected time until a RAID system suffers an unrecoverable error. See also: MTTDL.

mean time to failure The average time that a system runs without failing. See also: MTTF.

mean time to repair The average time that it takes to repair a system once it has failed. See also: MTTR.

memory address alias Two or more virtual addresses that refer to the same physical memory location.

memory barrier An instruction that prevents the compiler and hardware from reordering memory accesses across the barrier: no accesses before the barrier are moved after the barrier and no accesses after the barrier are moved before the barrier.

memory protection Hardware or software-enforced limits so that each application process can read and write only its own memory and not the memory of the operating system or any other process.

memoryless property For a probability distribution for the time between events, the remaining time to the next event does not depend on the amount of time already spent waiting. See also: exponential distribution.
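
For the exponential distribution, the memoryless property follows in one step from the tail probability. The derivation below is a sketch in our own notation: T is the time to the next event, λ is the rate (so the mean is 1/λ), and s and t are non-negative waiting times.

```latex
P(T > t) = \int_t^{\infty} \lambda e^{-\lambda x}\,dx = e^{-\lambda t}

P(T > s + t \mid T > s)
  = \frac{P(T > s + t)}{P(T > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(T > t)
```

Having already waited s changes nothing about the distribution of the remaining wait.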

memory-mapped file A file whose contents appear to be a memory segment in a process’s virtual address space.

memory-mapped I/O Each I/O device’s control registers are mapped to a range of physical addresses on the memory bus.

memristor A type of solid-state persistent storage using a circuit element whose resistance depends on the amounts and directions of currents that have flowed through it in the past.

MFQ See: multi-level feedback queue.

MFT See: master file table.

microkernel An operating system design where the kernel itself is kept small, and instead most of the functionality of a traditional operating system kernel is put into a set of user-level processes, or servers, accessed from user applications via interprocess communication.

MIN cache replacement See: optimal cache replacement.

minimum seek time The time to move the disk arm to the next adjacent track.

MIPS An early measure of processor performance: millions of instructions per second.

mirroring A system for redundantly storing data on disk where each block of data is stored on two disks and can be read from either. See also: RAID 1.

model A simplification that tries to capture the most important aspects of a more complex system’s behavior.

monolithic kernel An operating system design where most of the operating system functionality is linked together inside the kernel.

Moore’s Law Transistor density increases exponentially over time. Similar exponential improvements have occurred in many other component technologies; in the popular press, these often go by the same term.

mount A mapping of a path in the existing file system to the root directory of another file system volume.

MTTDL See: mean time to data loss.

MTTF See: mean time to failure.

MTTR See: mean time to repair.

multi-level feedback queue A scheduling algorithm with multiple priority levels managed using round robin queues, where a task is moved between priority levels based on how much processing time it has used. See also: MFQ.

multi-level index A tree data structure to keep track of the disk location of each data block in a file.

multi-level paged segmentation A virtual memory mechanism where physical memory is allocated in page frames, virtual addresses are segmented, and each segment is translated to physical addresses through multiple levels of page tables.

multi-level paging A virtual memory mechanism where physical memory is allocated in page frames, and virtual addresses are translated to physical addresses through multiple levels of page tables.

multiple independent requests A necessary condition for deadlock to occur: a thread first acquires one resource and then tries to acquire another.

multiprocessor scheduling policy A policy to determine how many processors to assign each process.

multiprogramming See: multitasking.

multitasking The ability of an operating system to run multiple applications at the same time, also called multiprogramming.

multi-threaded process A process with multiple threads.

multi-threaded program A generalization of a single-threaded program. Instead of only one logical sequence of steps, the program has multiple sequences, or threads, executing at the same time.

mutual exclusion When one thread uses a lock to prevent concurrent access to a shared data structure.

mutually recursive locking A deadlock condition where two shared objects call into each other while still holding their locks. Deadlock occurs if one thread holds the lock on the first object and calls into the second, while the other thread holds the lock on the second object and calls into the first.

named data Data that can be accessed by a human-readable identifier, such as a file name.

native command queueing See: tagged command queueing.

NCQ See: native command queueing.

nested waiting A deadlock condition where one shared object calls into another shared object while holding the first object’s lock, and then waits on a condition variable. Deadlock results if the thread that can signal the condition variable needs the first lock to make progress.

network effect The increase in value of a product or service based on the number of other people who have adopted that technology and not just its intrinsic capabilities.

no preemption A necessary condition for deadlock to occur: once a thread acquires a resource, its ownership cannot be revoked until the thread acts to release it.

non-blocking data structure Concurrent data structure where a thread is never required to wait for another thread to complete its operation.

non-recoverable read error When sufficient bit errors occur within a disk sector or flash memory page, such that the original data cannot be recovered even after error correction.

non-resident attribute In NTFS, an attribute record whose contents are addressed indirectly, through extent pointers in the master file table that point to the contents in those extents.

non-volatile storage Unlike DRAM, memory that is durable and retains its state across crashes and power outages. See also: persistent storage. See also: stable storage.

not recently used A cache replacement policy that evicts some block that has not been referenced recently, rather than the least recently used block.

oblivious scheduling A scheduling policy where the operating system assigns threads to processors without knowledge of the intent of the parallel application.

open system A system whose source code is available to the public for modification and reuse, or a system whose interfaces are defined by a public standards process.

operating system A layer of software that manages a computer’s resources for its users and their applications.

operating system kernel The kernel is the lowest level of software running on the system, with full access to all of the capabilities of the hardware.

optimal cache replacement Replace whichever block is used farthest in the future.

overhead The added resource cost of implementing an abstraction versus using the underlying hardware resources directly.

ownership design pattern A technique for managing concurrent access to shared objects in which at most one thread owns an object at any time, and therefore the thread can access the shared data without a lock.

page coloring The assignment of physical page frames to virtual addresses by partitioning frames based on which portions of the cache they will use.

page fault A hardware trap to the operating system kernel when a process references a virtual address with an invalid page table entry.

page frame An aligned, fixed-size chunk of physical memory that can hold a virtual page.

paged memory A hardware address translation mechanism where memory is allocated in aligned, fixed-sized chunks, called pages. Any virtual page can be assigned to any physical page frame.

paged segmentation A hardware mechanism where physical memory is allocated in page frames, but virtual addresses are segmented.

pair of stubs A pair of short procedures that mediate between two execution contexts.

paravirtualization A virtual machine abstraction that allows the guest operating system to make system calls into the host operating system to perform hardware-specific operations, such as changing a page table entry.

parent process A process that creates another process. See also: child process.

path The string that identifies a file or directory.

PCB See: process control block.

PCM See: phase change memory.

performance predictability Whether a system’s response time or other performance metric is consistent over time.

persistent data Data that is stored until it is explicitly deleted, even if the computer storing it crashes or loses power.

persistent storage See: non-volatile storage.

phase change behavior Abrupt changes in a program’s working set, causing bursty cache miss rates: periods of low cache misses interspersed with periods of high cache misses.

phase change memory A type of non-volatile memory that uses the phase of a material to represent a data bit. See also: PCM.

physical address An address in physical memory.

physical separation A backup storage policy where the backup is stored at a different location than the primary storage.

physically addressed cache A processor cache that is accessed using physical memory addresses.

pin To bind a virtual resource to a physical resource, such as a thread to a processor or a virtual page to a physical page.

platter A single thin round plate that stores information in a magnetic disk, often on both surfaces.

policy-mechanism separation A system design principle where the implementation of an abstraction is independent of the resource allocation policy of how the abstraction is used.

polling An alternative to hardware interrupts, where the processor waits for an asynchronous event to occur, by looping, or busy-waiting, until the event occurs.

portability The ability of software to work across multiple hardware platforms.

precise interrupts All instructions that occur before the interrupt or exception, according to the program execution, are completed by the hardware before the interrupt handler is invoked.

preemption When a scheduler takes the processor away from one task and gives it to another.

preemptive multi-threading The operating system scheduler may switch out a running thread, e.g., on a timer interrupt, without any explicit action by the thread to relinquish control at that point.

prefetch To bring data into a cache before it is needed.

principle of least privilege System security and reliability are enhanced if each part of the system has exactly the privileges it needs to do its job and no more.

priority donation A solution to priority inversion: when a thread waits for a lock held by a lower priority thread, the lock holder is temporarily increased to the waiter’s priority until the lock is released.

priority inversion A scheduling anomaly that occurs when a high priority task waits indefinitely for a resource (such as a lock) held by a low priority task, because the low priority task is waiting in turn for a resource (such as the processor) held by a medium priority task.

privacyDatastoredonacomputerisonlyaccessibletoauthorizedusers.

privilegedinstructionInstructionavailableinkernelmodebutnotinusermode.

processTheexecutionofanapplicationprogramwithrestrictedrights—theabstractionforprotectionprovidedbytheoperatingsystemkernel.

processcontrolblockAdatastructurethatstoresalltheinformationtheoperatingsystemneedsaboutaparticularprocess:e.g.,whereitisstoredinmemory,whereitsexecutableimageisondisk,whichuseraskedittostartexecuting,andwhatprivilegestheprocesshas.Seealso:PCB.

processmigrationTheabilitytotakearunningprogramononesystem,stopitsexecution,andresumeitonadifferentmachine.

processorexceptionAhardwareeventcausedbyuserprogrambehaviorthatcausesatransferofcontroltoakernelhandler.Forexample,attemptingtodividebyzerocausesaprocessorexceptioninmanyarchitectures.

processorschedulingpolicyWhentherearemorerunnablethreadsthanprocessors,thepolicythatdetermineswhichthreadstorunfirst.

processorstatusregisterAhardwareregistercontainingflagsthatcontroltheoperationoftheprocessor,includingtheprivilegelevel.

producer-consumercommunicationInterprocesscommunicationwheretheoutputofoneprocessistheinputofanother.

proprietary system A system that is under the control of a single company; it can be changed at any time by its provider to meet the needs of its customers.

protection The isolation of potentially misbehaving applications and users so that they do not corrupt other applications or the operating system itself.

publish For a read-copy-update lock, a single, atomic memory write that updates a shared object protected by the lock. The write allows new reader threads to observe the new version of the object.

queueing delay The time a task waits in line without receiving service.

quiescent For a read-copy-update lock, no reader thread that was active at the time of the last modification is still active.

race condition When the behavior of a program relies on the interleaving of operations of different threads.

RAID A Redundant Array of Inexpensive Disks (RAID) is a system that spreads data redundantly across multiple disks in order to tolerate individual disk failures.

RAID 1 See: mirroring.

RAID 5 See: rotating parity.

RAID 6 See: dual redundancy array.

RAID strip A set of several sequential blocks placed on one disk by a RAID block placement algorithm.

RAID stripe A set of RAID strips and their parity strip.

R-CSCAN A variation of the CSCAN disk scheduling policy in which the disk takes into account rotation time.

RCU See: read-copy-update.

read disturb error Reading a flash memory cell a large number of times can cause the data in surrounding cells to become corrupted.

read-copy-update A synchronization abstraction that allows concurrent access to a data structure by multiple readers and a single writer at a time. See also: RCU.

readers/writers lock A lock which allows multiple “reader” threads to access shared data concurrently provided they never modify the shared data, but still provides mutual exclusion whenever a “writer” thread is reading or modifying the shared data.

ready list The set of threads that are ready to be run but which are not currently running.

real-time constraint The computation must be completed by a deadline if it is to have value.

recoverable virtual memory The abstraction of persistent memory, so that the contents of a memory segment can be restored after a failure.

redo logging A way of implementing a transaction by recording in a log the set of writes to be executed when the transaction commits.

relative path A file path name interpreted as beginning with the process’s current working directory.

reliability A property of a system that does exactly what it is designed to do.

request parallelism Parallel execution on a server that arises from multiple concurrent requests.

resident attribute In NTFS, an attribute record whose contents are stored directly in the master file table.

response time The time for a task to complete, from when it starts until it is done.

restart The resumption of a process from a checkpoint, e.g., after a failure or for debugging.

rollback The outcome of a transaction where none of its updates occur.

root directory The top-level directory in a file system.

root inode In a copy-on-write file system, the inode table’s inode: the disk block containing the metadata needed to find the inode table.

rotating parity A system for redundantly storing data on disk where the system writes several blocks of data across several disks, protecting those blocks with one redundant block stored on yet another disk. See also: RAID 5.
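
To make the redundant block concrete, the following is a minimal sketch that assumes the redundant block is the bytewise XOR of the data blocks, a common choice for RAID 5. The function names parity_block and reconstruct are hypothetical, and the sketch does not model how the parity block's position rotates from disk to disk.

```python
from functools import reduce

def parity_block(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors plus parity."""
    return parity_block(surviving_blocks + [parity])

# Three data blocks, as if stored on three different disks (assumed layout).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = parity_block(data)

# Pretend the disk holding data[1] failed and rebuild its block.
lost = data[1]
rebuilt = reconstruct([data[0], data[2]], parity)
```

Because XOR is its own inverse, XORing the surviving blocks with the parity block cancels everything except the lost block.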

rotational latency Once the disk head has settled on the right track, it must wait for the target sector to rotate under it.

round robin A scheduling policy that takes turns running each ready task for a limited period before switching to the next task.
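
A small simulation can illustrate round robin. This is a sketch under simplifying assumptions: all tasks are ready at time zero, context switches are free, and the function name round_robin and the task names are hypothetical.

```python
from collections import deque

def round_robin(tasks, quantum):
    """tasks maps a task name to its remaining run time.
    Returns the sequence of (name, time_run) slices the scheduler grants."""
    ready = deque(tasks.items())
    schedule = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(quantum, remaining)       # run for at most one time quantum
        schedule.append((name, ran))
        if remaining > ran:                 # unfinished: back of the ready list
            ready.append((name, remaining - ran))
    return schedule

slices = round_robin({"A": 3, "B": 5, "C": 2}, quantum=2)
# slices == [("A", 2), ("B", 2), ("C", 2), ("A", 1), ("B", 2), ("B", 1)]
```

Each task gets a turn before any task gets a second one, so a short task such as C finishes without waiting for the longer task B to complete.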

R-SCAN A variation of the SCAN disk scheduling policy in which the disk takes into account rotation time.

safe state In the context of deadlock, a state of an execution such that regardless of the sequence of future resource requests, there is at least one safe sequence of decisions as to when to satisfy requests such that all pending and future requests are met.

safety property A constraint on program behavior such that it never computes the wrong result. Compare: liveness property.

sample bias A measurement error that occurs when some members of a group are less likely to be included than others, and where those members differ in the property being measured.

sandbox A context for executing untrusted code, where protection for the rest of the system is provided in software.

SCAN A disk scheduling policy where the disk arm repeatedly sweeps from the inner to the outer tracks and back again, servicing each pending request whenever the disk head passes that track.
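
The service order that SCAN produces can be sketched as follows. The sketch assumes a fixed set of pending requests (no new arrivals during the sweep) and represents each request by its track number; scan_order and its parameters are hypothetical names.

```python
def scan_order(pending, head, increasing=True):
    """Return pending track numbers in the order one SCAN sweep and its
    return sweep would service them, starting from track `head`."""
    if increasing:
        first = sorted(t for t in pending if t >= head)
        second = sorted((t for t in pending if t < head), reverse=True)
    else:
        first = sorted((t for t in pending if t <= head), reverse=True)
        second = sorted(t for t in pending if t > head)
    return first + second

order = scan_order([98, 183, 37, 122, 14, 124, 65, 67], head=53)
# order == [65, 67, 98, 122, 124, 183, 37, 14]
```

Requests ahead of the arm are serviced in track order during the current sweep; the rest wait for the arm to come back in the other direction.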

scheduler activations A multiprocessor scheduling policy where each application is informed of how many processors it has been assigned and whenever the assignment changes.

scrubbing A technique for reducing non-recoverable RAID errors by periodically scanning for corrupted disk blocks and reconstructing them from the parity block.

secondary bottleneck A resource with relatively low contention, due to a large amount of queueing at the primary bottleneck. If the primary bottleneck is improved, the secondary bottleneck will have much higher queueing delay.

sector The minimum amount of a disk that can be independently read or written.

sector failure A magnetic disk error where data on one or more individual sectors of a disk are lost, but the rest of the disk continues to operate correctly.

sector sparing Transparently hiding a faulty disk sector by remapping it to a nearby spare sector.

security A computer’s operation cannot be compromised by a malicious attacker.

security enforcement The mechanism the operating system uses to ensure that only permitted actions are allowed.

security policy What operations are permitted: who is allowed to access what data, and who can perform what operations.

seek The movement of the disk arm to re-position it over a specific track to prepare for a read or write.

segmentation A virtual memory mechanism where addresses are translated by table lookup, where each entry in the table is to a variable-size memory region.

segmentation fault An error caused when a process attempts to access memory outside of one of its valid memory regions.
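
The two entries above can be illustrated together. This sketch assumes a segment table that maps a segment number to a (base, bound) pair; the table layout, the translate function, and the example addresses are hypothetical and omit details such as per-segment access permissions.

```python
class SegmentationFault(Exception):
    """Raised when an access falls outside every valid region."""

def translate(segment_table, segment, offset):
    """Translate (segment, offset) to a physical address by table lookup."""
    if segment not in segment_table:
        raise SegmentationFault(f"no such segment: {segment}")
    base, bound = segment_table[segment]    # each region has its own size
    if not 0 <= offset < bound:
        raise SegmentationFault(f"offset {offset:#x} exceeds bound {bound:#x}")
    return base + offset

# Assumed example table: segment 0 is 0x700 bytes long, segment 1 is 0x500.
table = {0: (0x4000, 0x700), 1: (0x0, 0x500)}
physical = translate(table, 0, 0x100)       # 0x4100
```

An offset at or beyond the segment's bound, such as translate(table, 0, 0x700), raises SegmentationFault rather than returning an address in some other region.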

segment-local address An address that is relative to the current memory segment.

self-paging A resource allocation policy for allocating page frames among processes; each page replacement is taken from a page frame already assigned to the process causing the page fault.

semaphore A type of synchronization variable with only two atomic operations, P() and V(). P waits for the value of the semaphore to be positive, and then atomically decrements it. V atomically increments the value, and if any threads are waiting in P, triggers the completion of the P operation.
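
One way to realize P() and V() is with a lock and a condition variable. The sketch below uses Python's threading.Condition for that purpose; the class and attribute names are assumptions for illustration, not the book's implementation.

```python
import threading

class Semaphore:
    def __init__(self, value=0):
        self._value = value
        self._cond = threading.Condition()  # lock plus wait queue

    def P(self):
        """Wait until the value is positive, then decrement it."""
        with self._cond:
            while self._value <= 0:
                self._cond.wait()
            self._value -= 1

    def V(self):
        """Increment the value and wake one thread waiting in P, if any."""
        with self._cond:
            self._value += 1
            self._cond.notify()
```

The check and the decrement in P happen while holding the same lock, which is what makes the pair atomic with respect to other P and V calls.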

serializability The result of any program execution is equivalent to an execution in which requests are processed one at a time in some sequential order.

service time The time it takes to complete a task at a resource, assuming no waiting.

set associative cache The cache is partitioned into sets of entries. Each memory location can only be stored in its assigned set, but it can be stored in any cache entry in that set. On a lookup, the system needs to check the address against all the entries in its set to determine if there is a cache hit.
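
A lookup in a set associative cache can be sketched in software. The sketch assumes the set is chosen by address modulo the number of sets and that the least recently used entry in a full set is evicted; both choices, and all names, are illustrative assumptions rather than a description of any particular hardware.

```python
class SetAssociativeCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways                     # entries per set
        # Each set is a list of (address, value), least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def lookup(self, address):
        """Check the address against every entry in its assigned set."""
        entries = self.sets[address % self.num_sets]
        for i, (tag, value) in enumerate(entries):
            if tag == address:
                entries.append(entries.pop(i))   # mark most recently used
                return value                     # cache hit
        return None                              # cache miss

    def insert(self, address, value):
        entries = self.sets[address % self.num_sets]
        entries[:] = [e for e in entries if e[0] != address]
        if len(entries) >= self.ways:
            entries.pop(0)                   # evict within this set only
        entries.append((address, value))
```

With 4 sets and 2 ways, addresses 0, 4, and 8 all compete for set 0, so inserting all three evicts address 0 even if the other sets are empty.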

settle The fine-grained re-positioning of a disk head after moving to a new track before the disk head is ready to read or write a sector of the new track.

shadow page table A page table for a process inside a virtual machine, formed by constructing the composition of the page table maintained by the guest operating system and the page table maintained by the host operating system.

shared object An object (a data structure and its associated code) that can be accessed safely by multiple concurrent threads.

shell A job control system implemented as a user-level process. When a user types a command to the shell, it creates a process to run the command.

shortest job first A scheduling policy that performs the task with the least remaining time left to finish.

shortest positioning time first A disk scheduling policy that services whichever pending request can be handled in the minimum amount of time. See also: SPTF.

shortest seek time first A disk scheduling policy that services whichever pending request is on the nearest track. Equivalent to shortest positioning time first if rotational positioning is not considered. See also: SSTF.

SIMD (single instruction multiple data) programming See: data parallel programming.

simultaneous multi-threading A hardware technique where each processor simulates two (or more) virtual processors, alternating between them on a cycle-by-cycle basis. See also: hyperthreading.

single-threaded program A program written in a traditional way, with one logical sequence of steps as each instruction follows the previous one. Compare: multi-threaded program.

slip sparing When remapping a faulty disk sector, remapping the entire sequence of disk sectors between the faulty sector and the spare sector by one slot to preserve sequential access performance.

soft link A directory entry that maps one file or directory name to another. See also: symbolic link.

software transactional memory (STM) A system for general-purpose transactions for in-memory data structures.

software-loaded TLB A hardware TLB whose entries are installed by software, rather than hardware, on a TLB miss.

solid state storage A persistent storage device with no moving parts; it stores data using electrical circuits.

space sharing A multiprocessor allocation policy that assigns different processors to different tasks.

spatial locality Programs tend to reference instructions and data near those that have been recently accessed.

spindle The axle of rotation of the spinning disk platters making up a disk.

spinlock A lock where a thread waiting for a BUSY lock “spins” in a tight loop until some other thread makes it FREE.

SPTF See: shortest positioning time first.

SSTF See: shortest seek time first.

stable property A property of a program, such that once the property becomes true in some execution of the program, it will stay true for the remainder of the execution.

stable storage See: non-volatile storage.

stable system A queueing system where the arrival rate matches the departure rate.

stack frame A data structure stored on the stack with storage for one invocation of a procedure: the local variables used by the procedure, the parameters the procedure was called with, and the return address to jump to when the procedure completes.

staged architecture A staged architecture divides a system into multiple subsystems or stages, where each stage includes some state private to the stage and a set of one or more worker threads that operate on that state.

starvation The lack of progress for one task, due to resources given to higher priority tasks.

state variable Member variable of a shared object.

STM See: software transactional memory (STM).

structured synchronization A design pattern for writing correct concurrent programs, where concurrent code uses a set of standard synchronization primitives to control access to shared state, and where all routines to access the same shared state are localized to the same logical module.

superpage A set of contiguous pages in physical memory that map a contiguous region of virtual memory, where the pages are aligned so that they share the same high-order (superpage) address.

surface One side of a disk platter.

surface transfer time The time to transfer one or more sequential sectors from (or to) a surface once the disk head begins reading (or writing) the first sector.

swapping Evicting an entire process from physical memory.

symbolic link See: soft link.

synchronization barrier A synchronization primitive where n threads operating in parallel check in to the barrier when their work is completed. No thread returns from the barrier until all n check in.
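
A reusable barrier can be sketched with a condition variable and a generation counter. The class below is illustrative only; the names Barrier and check_in are assumptions, and Python's standard library already offers threading.Barrier for production use.

```python
import threading

class Barrier:
    def __init__(self, n):
        self._n = n                  # number of threads that must check in
        self._count = 0              # threads checked in so far this round
        self._generation = 0         # incremented each time the barrier opens
        self._cond = threading.Condition()

    def check_in(self):
        """Block until all n threads have called check_in for this round."""
        with self._cond:
            generation = self._generation
            self._count += 1
            if self._count == self._n:
                # Last arrival: open the barrier and reset it for reuse.
                self._generation += 1
                self._count = 0
                self._cond.notify_all()
            else:
                while generation == self._generation:
                    self._cond.wait()
```

Waiting on the generation number, rather than on the count, keeps a fast thread that re-enters the barrier for the next round from confusing threads still leaving the previous one.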

synchronization variable A data structure used for coordinating concurrent access to shared state.

system availability The probability that a system will be available at any given time.

system call A procedure provided by the kernel that can be called from user level.

system reliability The probability that a system will continue to be reliable for some specified period of time.

tagged command queueing A disk interface that allows the operating system to issue multiple concurrent requests to the disk. Requests are processed and acknowledged out of order. See also: native command queueing. See also: NCQ.

tagged TLB A translation lookaside buffer whose entries contain a process ID; only entries for the currently running process are used during translation. This allows TLB entries for a process to remain in the TLB when the process is switched out.
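
The effect of tagging can be modeled in a few lines. This is a software sketch of a hardware structure: the dictionary keyed by (process ID, virtual page), the unbounded capacity, and the method names are all assumptions made for illustration.

```python
class TaggedTLB:
    def __init__(self):
        self.entries = {}            # (pid, virtual_page) -> page_frame
        self.current_pid = None

    def context_switch(self, pid):
        """Switch processes without flushing: old entries stay resident."""
        self.current_pid = pid

    def insert(self, virtual_page, page_frame):
        self.entries[(self.current_pid, virtual_page)] = page_frame

    def lookup(self, virtual_page):
        """Only entries tagged with the running process's ID can hit."""
        return self.entries.get((self.current_pid, virtual_page))

tlb = TaggedTLB()
tlb.context_switch(1)
tlb.insert(0x10, 0x80)               # process 1 maps page 0x10
tlb.context_switch(2)
miss = tlb.lookup(0x10)              # None: process 2 cannot use the entry
tlb.context_switch(1)
hit = tlb.lookup(0x10)               # 0x80: the entry survived the switch
```

Without the tag, the entry for process 1 would have to be flushed on the switch to process 2 and reloaded on the switch back.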

task A user request.

TCB See: thread control block.

TCQ See: tagged command queueing.

temporal locality Programs tend to reference the same instructions and data that they had recently accessed.

test and test-and-set An implementation of a spinlock where the waiting processor waits until the lock is FREE before attempting to acquire it.

thrashing When a cache is too small to hold its working set. In this case, most references are cache misses, yet those misses evict data that will be used in the near future.

thread A single execution sequence that represents a separately schedulable task.

thread context switch Suspend execution of a currently running thread and resume execution of some other thread.

thread control block The operating system data structure containing the current state of a thread. See also: TCB.

thread scheduler Software that maps threads to processors by switching between running threads and threads that are ready but not running.

thread-safe bounded queue A bounded queue that is safe to call from multiple concurrent threads.

throughput The rate at which a group of tasks are completed.

time of check vs. time of use attack A security vulnerability arising when an application can modify the user memory holding a system call parameter (such as a file name), after the kernel checks the validity of the parameter, but before the parameter is used in the actual implementation of the routine. Often abbreviated TOCTOU.

time quantum The length of time that a task is scheduled before being preempted.

timer interrupt A hardware processor interrupt that signifies a period of elapsed real time.

time-sharing operating system An operating system designed to support interactive use of the computer.

TLB See: translation lookaside buffer.

TLB flush An operation to remove invalid entries from a TLB, e.g., after a process context switch.

TLB hit A TLB lookup that succeeds at finding a valid address translation.

TLB miss A TLB lookup that fails because the TLB does not contain a valid translation for that virtual address.

TLB shootdown A request to another processor to remove a newly invalid TLB entry.

TOCTOU See: time of check vs. time of use attack.

track A circle of sectors on a disk surface.

track buffer Memory in the disk controller to buffer the contents of the current track even though those sectors have not yet been requested by the operating system.

track skewing A staggered alignment of disk sectors to allow sequential reading of sectors on adjacent tracks.

transaction A group of operations that are applied persistently, atomically as a group or not at all, and independently of other transactions.

translation lookaside buffer A small hardware table containing the results of recent address translations. See also: TLB.

trap A synchronous transfer of control from a user-level process to a kernel-mode handler. Traps can be caused by processor exceptions, memory protection errors, or system calls.

triple indirect block A storage block containing pointers to double indirect blocks.

two-phase locking A strategy for acquiring locks needed by a multi-operation request, where no lock can be released before all required locks have been acquired.

uberblock In ZFS, the root of the ZFS storage system.

UNIX exec A system call on UNIX that causes the current process to bring a new executable image into memory and start it running.

UNIX fork A system call on UNIX that creates a new process as a complete copy of the parent process.

UNIX pipe A two-way byte stream communication channel between UNIX processes.

UNIX signal An asynchronous notification to a running process.

UNIX stdin A file descriptor set up automatically for a new process to use as its input.

UNIX stdout A file descriptor set up automatically for a new process to use as its output.

UNIX wait A system call that pauses until a child process finishes.

unsafe state In the context of deadlock, a state of an execution such that there is at least one sequence of future resource requests that leads to deadlock no matter what processing order is tried.

upcall An event, interrupt, or exception delivered by the kernel to a user-level process.

use bit A status bit in a page table entry recording whether the page has been recently referenced.

user-level memory management The kernel assigns each process a set of page frames, but how the process uses its assigned memory is left up to the application.

user-level page handler An application-specific upcall routine invoked by the kernel on a page fault.

user-level thread A type of application thread where the thread is created, runs, and finishes without calls into the operating system kernel.

user-mode operation The processor operates in a restricted mode that limits the capabilities of the executing process. Compare: kernel-mode operation.

utilization The fraction of time a resource is busy.

virtual address An address that must be translated to produce an address in physical memory.

virtual machine An execution context provided by an operating system that mimics a physical machine, e.g., to run an operating system as an application on top of another operating system.

virtual machine honeypot A virtual machine constructed for the purpose of executing suspect code in a safe environment.

virtual machine monitor See: host operating system.

virtual memory The illusion of a nearly infinite amount of physical memory, provided by demand paging of virtual addresses.

virtualization Provide an application with the illusion of resources that are not physically present.

virtually addressed cache A processor cache which is accessed using virtual, rather than physical, memory addresses.

volume A collection of physical storage blocks that form a logical storage device (e.g., a logical disk).

wait while holding A necessary condition for deadlock to occur: a thread holds one resource while waiting for another.

wait-free data structures Concurrent data structure that guarantees progress for every thread: every method finishes in a finite number of steps, regardless of the state of other threads executing in the data structure.

waiting list The set of threads that are waiting for a synchronization event or timer expiration to occur before becoming eligible to be run.

wear leveling A flash memory management policy that moves logical pages around the device to ensure that each physical page is written/erased approximately the same number of times.

web proxy cache A cache of frequently accessed web pages to speed web access and reduce network traffic.

work-conserving scheduling policy A policy that never leaves the processor idle if there is work to do.

working set The set of memory locations that a program has referenced in the recent past.

workload A set of tasks for some system to perform, along with when each task arrives and how long each task takes to complete.

wound wait An approach to deadlock recovery that ensures progress by aborting the most recent transaction in any deadlock.

write acceleration Data to be stored on disk is first written to the disk’s buffer memory. The write is then acknowledged and completed in the background.

write-back cache A cache where updates can be stored in the cache and only sent to memory when the cache runs out of space.

write-through cache A cache where updates are sent immediately to memory.

zero-copy I/O A technique for transferring data across the kernel-user boundary without a memory-to-memory copy, e.g., by manipulating page table entries.

zero-on-reference A method for clearing memory only if the memory is used, rather than in advance. If the first access to memory triggers a trap to the kernel, the kernel can zero the memory and then resume.

Zipf distribution The relative frequency of an event is inversely proportional to its position in a rank order of popularity.
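
The Zipf relationship can be computed directly. This sketch normalizes weights of the form 1 / rank^alpha so that they sum to one; the function name and the alpha parameter are assumptions, and the default alpha of 1 corresponds to the plain inverse proportionality in the definition above.

```python
def zipf_frequencies(n, alpha=1.0):
    """Relative frequency of each of n events, ordered from most popular
    (rank 1) to least popular (rank n)."""
    weights = [1.0 / rank ** alpha for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

frequencies = zipf_frequencies(4)
# With alpha = 1, the top-ranked event is twice as frequent as the second
# and four times as frequent as the fourth.
```

Larger values of alpha concentrate more of the total frequency in the most popular events.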

About the Authors

Thomas Anderson holds the Warren Francis and Wilma Kolm Bradley Chair of Computer Science and Engineering at the University of Washington, where he has been teaching computer science since 1997.

Professor Anderson has been widely recognized for his work, receiving the Diane S. McEntyre Award for Excellence in Teaching, the USENIX Lifetime Achievement Award, the IEEE Koji Kobayashi Computers and Communications Award, the ACM SIGOPS Mark Weiser Award, the USENIX Software Tools User Group Award, the IEEE Communications Society William R. Bennett Prize, the NSF Presidential Faculty Fellowship, and the Alfred P. Sloan Research Fellowship. He is an ACM Fellow. He has served as program co-chair of the ACM SIGCOMM Conference and program chair of the ACM Symposium on Operating Systems Principles (SOSP). In 2003, he helped co-found the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI).

Professor Anderson’s research interests span all aspects of building practical, robust, and efficient computer systems, including operating systems, distributed systems, computer networks, multiprocessors, and computer security. Over his career, he has authored or co-authored over one hundred peer-reviewed papers; nineteen of his papers have won best paper awards.

Michael Dahlin is a Principal Engineer at Google. Prior to that, from 1996 to 2014, he was a Professor of Computer Science at the University of Texas in Austin, where he taught operating systems and other subjects and where he was awarded the College of Natural Sciences Teaching Excellence Award.

Professor Dahlin’s research interests include Internet- and large-scale services, fault tolerance, security, operating systems, distributed systems, and storage systems.

Professor Dahlin’s work has been widely recognized. Over his career, he has authored over seventy peer-reviewed papers, ten of which have won best paper awards. He is both an ACM Fellow and an IEEE Fellow, and he has received an Alfred P. Sloan Research Fellowship and an NSF CAREER award. He has served as the program chair of the ACM Symposium on Operating Systems Principles (SOSP), co-chair of the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI), and co-chair of the International World Wide Web conference (WWW).