Operating Systems
Principles & Practice
Volume III: Memory Management
Second Edition
Thomas Anderson, University of Washington
Mike Dahlin, University of Texas and Google
Recursive Books
recursivebooks.com
Operating Systems: Principles and Practice (Second Edition), Volume III: Memory Management, by Thomas Anderson and Michael Dahlin. Copyright © Thomas Anderson and Michael Dahlin, 2011-2015.
ISBN 978-0-9856735-5-0
Publisher: Recursive Books, Ltd., http://recursivebooks.com/
Cover: Reflection Lake, Mt. Rainier
Cover design: Cameron Neat
Illustrations: Cameron Neat
Copy editors: Sandy Kaplan, Whitney Schmidt
Ebook design: Robin Briggs
Web design: Adam Anderson
SUGGESTIONS, COMMENTS, and ERRORS. We welcome suggestions, comments and error reports, [email protected]
Notice of rights. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the publisher. For information on getting permissions for reprints and excerpts, [email protected]
Notice of liability. The information in this book is distributed on an “As Is” basis, without warranty. Neither the authors nor Recursive Books shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information or instructions contained in this book or by the computer software and hardware products described in it.
Trademarks: Throughout this book trademarked names are used. Rather than put a trademark symbol in every occurrence of a trademarked name, we state we are using the names only in an editorial fashion and to the benefit of the trademark owner with no intention of infringement of the trademark. All trademarks or service marks are the property of their respective owners.
Contents
Preface
I: Kernels and Processes
1. Introduction
2. The Kernel Abstraction
3. The Programming Interface
II: Concurrency
4. Concurrency and Threads
5. Synchronizing Access to Shared Objects
6. Multi-Object Synchronization
7. Scheduling
III: Memory Management
8. Address Translation
8.1 Address Translation Concept
8.2 Towards Flexible Address Translation
8.2.1 Segmented Memory
8.2.2 Paged Memory
8.2.3 Multi-Level Translation
8.2.4 Portability
8.3 Towards Efficient Address Translation
8.3.1 Translation Lookaside Buffers
8.3.2 Superpages
8.3.3 TLB Consistency
8.3.4 Virtually Addressed Caches
8.3.5 Physically Addressed Caches
8.4 Software Protection
8.4.1 Single Language Operating Systems
8.4.2 Language-Independent Software Fault Isolation
8.4.3 Sandboxes Via Intermediate Code
8.5 Summary and Future Directions
Exercises
9. Caching and Virtual Memory
9.1 Cache Concept
9.2 Memory Hierarchy
9.3 When Caches Work and When They Do Not
9.3.1 Working Set Model
9.3.2 Zipf Model
9.4 Memory Cache Lookup
9.5 Replacement Policies
9.5.1 Random
9.5.2 First-In-First-Out (FIFO)
9.5.3 Optimal Cache Replacement (MIN)
9.5.4 Least Recently Used (LRU)
9.5.5 Least Frequently Used (LFU)
9.5.6 Belady’s Anomaly
9.6 Case Study: Memory-Mapped Files
9.6.1 Advantages
9.6.2 Implementation
9.6.3 Approximating LRU
9.7 Case Study: Virtual Memory
9.7.1 Self-Paging
9.7.2 Swapping
9.8 Summary and Future Directions
Exercises
10. Advanced Memory Management
10.1 Zero-Copy I/O
10.2 Virtual Machines
10.2.1 Virtual Machine Page Tables
10.2.2 Transparent Memory Compression
10.3 Fault Tolerance
10.3.1 Checkpoint and Restart
10.3.2 Recoverable Virtual Memory
10.3.3 Deterministic Debugging
10.4 Security
10.5 User-Level Memory Management
10.6 Summary and Future Directions
Exercises
IV: Persistent Storage
11. File Systems: Introduction and Overview
12. Storage Devices
13. Files and Directories
14. Reliable Storage
References
Glossary
About the Authors
Preface
Preface to the eBook Edition
Operating Systems: Principles and Practice is a textbook for a first course in undergraduate operating systems. In use at over 50 colleges and universities worldwide, this textbook provides:
A path for students to understand high level concepts all the way down to working code.
Extensive worked examples integrated throughout the text provide students concrete guidance for completing homework assignments.
A focus on up-to-date industry technologies and practice.
The eBook edition is split into four volumes that together contain exactly the same material as the (2nd) print edition of Operating Systems: Principles and Practice, reformatted for various screen sizes. Each volume is self-contained and can be used as a standalone text, e.g., at schools that teach operating systems topics across multiple courses.
Volume 1: Kernels and Processes. This volume contains Chapters 1-3 of the print edition. We describe the essential steps needed to isolate programs to prevent buggy applications and computer viruses from crashing or taking control of your system.
Volume 2: Concurrency. This volume contains Chapters 4-7 of the print edition. We provide a concrete methodology for writing correct concurrent programs that is in widespread use in industry, and we explain the mechanisms for context switching and synchronization from fundamental concepts down to assembly code.
Volume 3: Memory Management. This volume contains Chapters 8-10 of the print edition. We explain both the theory and mechanisms behind 64-bit address space translation, demand paging, and virtual machines.
Volume 4: Persistent Storage. This volume contains Chapters 11-14 of the print edition. We explain the technologies underlying modern extent-based, journaling, and versioning file systems.
A more detailed description of each chapter is given in the preface to the print edition.
Preface to the Print Edition
Why We Wrote This Book
Many of our students tell us that operating systems was the best course they took as an undergraduate and also the most important for their careers. We are not alone: many of our colleagues report receiving similar feedback from their students.
Part of the excitement is that the core ideas in a modern operating system (protection, concurrency, virtualization, resource allocation, and reliable storage) have become widely applied throughout computer science, not just operating system kernels. Whether you get a job at Facebook, Google, Microsoft, or any other leading-edge technology company, it is impossible to build resilient, secure, and flexible computer systems without the ability to apply operating systems concepts in a variety of settings. In a modern world, nearly everything a user does is distributed, nearly every computer is multi-core, security threats abound, and many applications such as web browsers have become mini-operating systems in their own right.
It should be no surprise that for many computer science students, an undergraduate operating systems class has become a de facto requirement: a ticket to an internship and eventually to a full-time position.
Unfortunately, many operating systems textbooks are still stuck in the past, failing to keep pace with rapid technological change. Several widely-used books were initially written in the mid-1980s, and they often act as if technology stopped at that point. Even when new topics are added, they are treated as an afterthought, without pruning material that has become less important. The result is textbooks that are very long, very expensive, and yet fail to provide students more than a superficial understanding of the material.
Our view is that operating systems have changed dramatically over the past twenty years, and that justifies a fresh look at both how the material is taught and what is taught. The pace of innovation in operating systems has, if anything, increased over the past few years, with the introduction of the iOS and Android operating systems for smartphones, the shift to multicore computers, and the advent of cloud computing.
To prepare students for this new world, we believe students need three things to succeed at understanding operating systems at a deep level:
Concepts and code. We believe it is important to teach students both principles and practice, concepts and implementation, rather than either alone. This textbook takes concepts all the way down to the level of working code, e.g., how a context switch works in assembly code. In our experience, this is the only way students will really understand and master the material. All of the code in this book is available from the authors’ web site, ospp.washington.edu.
Extensive worked examples. In our view, students need to be able to apply concepts in practice. To that end, we have integrated a large number of example exercises, along with solutions, throughout the text. We use these exercises extensively in our own lectures, and we have found them essential to challenging students to go beyond a superficial understanding.
Industry practice. To show students how to apply operating systems concepts in a variety of settings, we use detailed, concrete examples from Facebook, Google, Microsoft, Apple, and other leading-edge technology companies throughout the textbook. Because operating systems concepts are important in a wide range of computer systems, we take these examples not only from traditional operating systems like Linux, Windows, and OS X but also from other systems that need to solve problems of protection, concurrency, virtualization, resource allocation, and reliable storage, like databases, web browsers, web servers, mobile applications, and search engines.
Taking a fresh perspective on what students need to know to apply operating systems concepts in practice has led us to innovate in every major topic covered in an undergraduate-level course:
Kernels and Processes. The safe execution of untrusted code has become central to many types of computer systems, from web browsers to virtual machines to operating systems. Yet existing textbooks treat protection as a side effect of UNIX processes, as if they are synonyms. Instead, we start from first principles: what are the minimum requirements for process isolation, how can systems implement process isolation efficiently, and what do students need to know to implement functions correctly when the caller is potentially malicious?
Concurrency. With the advent of multi-core architectures, most students today will spend much of their careers writing concurrent code. Existing textbooks provide a blizzard of concurrency alternatives, most of which were abandoned decades ago as impractical. Instead, we focus on providing students a single methodology based on Mesa monitors that will enable students to write correct concurrent programs, a methodology that is by far the dominant approach used in industry.
Memory Management. Even as demand-paging has become less important, virtualization has become even more important to modern computer systems. We provide a deep treatment of address translation hardware, sparse address spaces, TLBs, and on-chip caches. We then use those concepts as a springboard for describing virtual machines and related concepts such as checkpointing and copy-on-write.
Persistent Storage. Reliable storage in the presence of failures is central to the design of most computer systems. Existing textbooks survey the history of file systems, spending most of their time on ad hoc approaches to failure recovery and de-fragmentation. Yet no modern file systems still use those ad hoc approaches. Instead, our focus is on how file systems use extents, journaling, copy-on-write, and RAID to achieve both high performance and high reliability.
Intended Audience
Operating Systems: Principles and Practice is a textbook for a first course in undergraduate operating systems. We believe operating systems should be taken as early as possible in an undergraduate’s course of study; many students use the course as a springboard to an internship and a career. To that end, we have designed the textbook to assume minimal pre-requisites: specifically, students should have taken a data structures course and one on computer organization. The code examples are written in a combination of x86 assembly, C, and C++. In particular, we have designed the book to interface well with the Bryant and O’Hallaron textbook. We review and cover in much more depth the material from the second half of that book.
We should note what this textbook is not: it is not intended to teach the API or internals of any specific operating system, such as Linux, Android, Windows 8, OS X, or iOS. We use many concrete examples from these systems, but our focus is on the shared problems these systems face and the technologies these systems use to solve those problems.
A Guide to Instructors
One of our goals is to enable instructors to choose an appropriate level of depth for each course topic. Each chapter begins at a conceptual level, with implementation details and the more advanced material towards the end. The more advanced material can be omitted without compromising the ability of students to follow later material. No single-quarter or single-semester course is likely to be able to cover every topic we have included, but we think it is a good thing for students to come away from an operating systems course with an appreciation that there is always more to learn.
For each topic, we attempt to convey it at three levels:
How to reason about systems. We describe core systems concepts, such as protection, concurrency, resource scheduling, virtualization, and storage, and we provide practice applying these concepts in various situations. In our view, this provides the biggest long-term payoff to students, as they are likely to need to apply these concepts in their work throughout their career, almost regardless of what project they end up working on.
Power tools. We introduce students to a number of abstractions that they can apply in their work in industry immediately after graduation, and that we expect will continue to be useful for decades, such as sandboxing, protected procedure calls, threads, locks, condition variables, caching, checkpointing, and transactions.
Details of specific operating systems. We include numerous examples of how different operating systems work in practice. However, this material changes rapidly, and there is an order of magnitude more material than can be covered in a single semester-length course. The purpose of these examples is to illustrate how to use the operating systems principles and power tools to solve concrete problems. We do not attempt to provide a comprehensive description of Linux, OS X, or any other particular operating system.
The book is divided into five parts: an introduction (Chapter 1), kernels and processes (Chapters 2-3), concurrency, synchronization, and scheduling (Chapters 4-7), memory management (Chapters 8-10), and persistent storage (Chapters 11-14).
Introduction. The goal of Chapter 1 is to introduce the recurring themes found in the later chapters. We define some common terms, and we provide a bit of the history of the development of operating systems.
The Kernel Abstraction. Chapter 2 covers kernel-based process protection: the concept and implementation of executing a user program with restricted privileges. Given the increasing importance of computer security issues, we believe protected execution and safe transfer across privilege levels are worth treating in depth. We have broken the description into sections, to allow instructors to choose either a quick introduction to the concepts (up through Section 2.3), or a full treatment of the kernel implementation details down to the level of interrupt handlers. Some instructors start with concurrency, and cover kernels and kernel protection afterwards. While our textbook can be used that way, we have found that students benefit from a basic understanding of the role of operating systems in executing user programs, before introducing concurrency.
The Programming Interface. Chapter 3 is intended as an impedance match for students of differing backgrounds. Depending on student background, it can be skipped or covered in depth. The chapter covers the operating system from a programmer’s perspective: process creation and management, device-independent input/output, interprocess communication, and network sockets. Our goal is that students should understand at a detailed level what happens when a user clicks a link in a web browser, as the request is transferred through operating system kernels and user space processes at the client, server, and back again. This chapter also covers the organization of the operating system itself: how device drivers and the hardware abstraction layer work in a modern operating system; the difference between a monolithic and a microkernel operating system; and how policy and mechanism are separated in modern operating systems.
Concurrency and Threads. Chapter 4 motivates and explains the concept of threads. Because of the increasing importance of concurrent programming, and its integration with modern programming languages like Java, many students have been introduced to multi-threaded programming in an earlier class. This is a bit dangerous, as students at this stage are prone to writing programs with race conditions, problems that may or may not be discovered with testing. Thus, the goal of this chapter is to provide a solid conceptual framework for understanding the semantics of concurrency, as well as how concurrent threads are implemented in both the operating system kernel and in user-level libraries. Instructors needing to go more quickly can omit these implementation details.
Synchronization. Chapter 5 discusses the synchronization of multi-threaded programs, a central part of all operating systems and increasingly important in many other contexts. Our approach is to describe one effective method for structuring concurrent programs (based on Mesa monitors), rather than to attempt to cover several different approaches. In our view, it is more important for students to master one methodology. Monitors are a particularly robust and simple one, capable of implementing most concurrent programs efficiently. The implementation of synchronization primitives should be included if there is time, so students see that there is no magic.
Multi-Object Synchronization. Chapter 6 discusses advanced topics in concurrency, specifically the twin challenges of multiprocessor lock contention and deadlock. This material is increasingly important for students working on multicore systems, but some courses may not have time to cover it in detail.
Scheduling. Chapter 7 covers the concepts of resource allocation in the specific context of processor scheduling. With the advent of data center computing and multicore architectures, the principles and practice of resource allocation have renewed importance. After a quick tour through the tradeoffs between response time and throughput for uniprocessor scheduling, the chapter covers a set of more advanced topics in affinity and multiprocessor scheduling, power-aware and deadline scheduling, as well as basic queueing theory and overload management. We conclude these topics by walking students through a case study of server-side load management.
Address Translation. Chapter 8 explains mechanisms for hardware and software address translation. The first part of the chapter covers how hardware and operating systems cooperate to provide flexible, sparse address spaces through multi-level segmentation and paging. We then describe how to make memory management efficient with translation lookaside buffers (TLBs) and virtually addressed caches. We consider how to keep TLBs consistent when the operating system makes changes to its page tables. We conclude with a discussion of modern software-based protection mechanisms such as those found in the Microsoft Common Language Runtime and Google’s Native Client.
Caching and Virtual Memory. Caches are central to many different types of computer systems. Most students will have seen the concept of a cache in an earlier class on machine structures. Thus, our goal is to cover the theory and implementation of caches: when they work and when they do not, as well as how they are implemented in hardware and software. We then show how these ideas are applied in the context of memory-mapped files and demand-paged virtual memory.
Advanced Memory Management. Address translation is a powerful tool in system design, and we show how it can be used for zero-copy I/O, virtual machines, process checkpointing, and recoverable virtual memory. As this is more advanced material, it can be skipped by those classes pressed for time.
File Systems: Introduction and Overview. Chapter 11 frames the file system portion of the book, starting top down with the challenges of providing a useful file abstraction to users. We then discuss the UNIX file system interface, the major internal elements inside a file system, and how disk device drivers are structured.
Storage Devices. Chapter 12 surveys block storage hardware, specifically magnetic disks and flash memory. The last two decades have seen rapid change in storage technology affecting both application programmers and operating systems designers; this chapter provides a snapshot for students, as a building block for the next two chapters. If students have previously seen this material, this chapter can be skipped.
Files and Directories. Chapter 13 discusses file system layout on disk. Rather than survey all possible file layouts (something that changes rapidly over time), we use file systems as a concrete example of mapping complex data structures onto block storage devices.
Reliable Storage. Chapter 14 explains the concept and implementation of reliable storage, using file systems as a concrete example. Starting with the ad hoc techniques used in early file systems, the chapter explains checkpointing and write-ahead logging as alternate implementation strategies for building reliable storage, and it discusses how redundancy such as checksums and replication are used to improve reliability and availability.
We welcome and encourage suggestions for how to improve the presentation of the material; please send any comments to the publisher’s website, [email protected].
Acknowledgements
We have been incredibly fortunate to have the help of a large number of people in the conception, writing, editing, and production of this book.
We started on the journey of writing this book over dinner at the USENIX NSDI conference in 2010. At the time, we thought perhaps it would take us the summer to complete the first version and perhaps a year before we could declare ourselves done. We were very wrong! It is no exaggeration to say that it would have taken us a lot longer without the help we have received from the people we mention below.
Perhaps most important have been our early adopters, who have given us enormously useful feedback as we have put together this edition:
Carnegie-Mellon    David Eckhardt and Garth Gibson
Clarkson    Jeanna Matthews
Cornell    Gun Sirer
ETH Zurich    Mothy Roscoe
New York University    Lakshmi Subramanian
Princeton University    Kai Li
Saarland University    Peter Druschel
Stanford University    John Ousterhout
University of California Riverside    Harsha Madhyastha
University of California Santa Barbara    Ben Zhao
University of Maryland    Neil Spring
University of Michigan    Pete Chen
University of Southern California    Ramesh Govindan
University of Texas-Austin    Lorenzo Alvisi
University of Toronto    Ding Yuan
University of Washington    Gary Kimura and Ed Lazowska
In developing our approach to teaching operating systems, both before we started writing and afterwards as we tried to put our thoughts to paper, we made extensive use of lecture notes and slides developed by other faculty. Of particular help were the materials created by Pete Chen, Peter Druschel, Steve Gribble, Eddie Kohler, John Ousterhout, Mothy Roscoe, and Geoff Voelker. We thank them all.
Our illustrator for the second edition, Cameron Neat, has been a joy to work with. We would also like to thank Simon Peter for running the multiprocessor experiments introducing Chapter 6.
We are also grateful to Lorenzo Alvisi, Adam Anderson, Pete Chen, Steve Gribble, Sam Hopkins, Ed Lazowska, Harsha Madhyastha, John Ousterhout, Mark Rich, Mothy Roscoe, Will Scott, Gun Sirer, Ion Stoica, Lakshmi Subramanian, and John Zahorjan for their helpful comments and suggestions as to how to improve the book.
We thank Josh Berlin, Marla Dahlin, Rasit Eskicioglu, Sandy Kaplan, John Ousterhout, Whitney Schmidt, and Mike Walfish for helping us identify and correct grammatical or technical bugs in the text.
We thank Jeff Dean, Garth Gibson, Mark Oskin, Simon Peter, Dave Probert, Amin Vahdat, and Mark Zbikowski for their help in explaining the internal workings of some of the commercial systems mentioned in this book.
We would like to thank Dave Wetherall, Dan Weld, Mike Walfish, Dave Patterson, Olav Kvern, Dan Halperin, Armando Fox, Robin Briggs, Katya Anderson, Sandra Anderson, Lorenzo Alvisi, and William Adams for their help and advice on textbook economics and production.
The Helen Riaboff Whiteley Center as well as Don and Jeanne Dahlin were kind enough to lend us a place to escape when we needed to get chapters written.
Finally, we thank our families, our colleagues, and our students for supporting us in this larger-than-expected effort.
8. Address Translation
There is nothing wrong with your television set. Do not attempt to adjust the picture. We are controlling transmission. If we wish to make it louder, we will bring up the volume. If we wish to make it softer, we will tune it to a whisper. We will control the horizontal. We will control the vertical. We can roll the image, make it flutter. We can change the focus to a soft blur or sharpen it to crystal clarity. For the next hour, sit quietly and we will control all that you see and hear. We repeat: there is nothing wrong with your television set. (Opening narration, The Outer Limits)
The promise of virtual reality is compelling. Who wouldn’t want the ability to travel anywhere without leaving the holodeck? Of course, the promise is far from becoming a reality. In theory, by adjusting the inputs to all of your senses in response to your actions, a virtual reality system could perfectly set the scene. However, your senses are not so easily controlled. We might soon be able to provide an immersive environment for vision, but balance, hearing, taste, and smell will take a lot longer. Touch, proprioception (the sense of being near something else), and g-forces are even farther off. Get a single one of these wrong and the illusion disappears.
Can we create a virtual reality environment for computer programs? We have already seen an example of this with the UNIX I/O interface, where the program does not need to know, and sometimes cannot tell, if its inputs and outputs are files, devices, or other processes.
In the next three chapters, we take this idea a giant step further. An amazing number of advanced system features are enabled by putting the operating system in control of address translation, the conversion from the memory address the program thinks it is referencing to the physical location of that memory cell. From the programmer’s perspective, address translation occurs transparently: the program behaves correctly despite the fact that its memory is stored somewhere completely different from where it thinks it is stored.
You were probably taught in some early programming class that a memory address is just an address. A pointer in a linked list contains the actual memory address of what it is pointing to. A jump instruction contains the actual memory address of the next instruction to be executed. This is a useful fiction! The programmer is often better off not thinking about how each memory reference is converted into the data or instruction being referenced. In practice, there is quite a lot of activity happening beneath the covers.
Address translation is a simple concept, but it turns out to be incredibly powerful. What can an operating system do with address translation? This is only a partial list:
Process isolation. As we discussed in Chapter 2, protecting the operating system kernel and other applications against buggy or malicious code requires the ability to limit memory references by applications. Likewise, address translation can be used by applications to construct safe execution sandboxes for third party extensions.
Interprocess communication. Often processes need to coordinate with each other, and an efficient way to do that is to have the processes share a common memory region.
Shared code segments. Instances of the same program can share the program’s instructions, reducing their memory footprint and making the processor cache more efficient. Likewise, different programs can share common libraries.
Program initialization. Using address translation, we can start a program running before all of its code is loaded into memory from disk.
Efficient dynamic memory allocation. As a process grows its heap, or as a thread grows its stack, we can use address translation to trap to the kernel to allocate memory for those purposes only as needed.
Cache management. As we will explain in the next chapter, the operating system can arrange how programs are positioned in physical memory to improve cache efficiency, through a system called page coloring.
Program debugging. The operating system can use memory translation to prevent a buggy program from overwriting its own code region; by catching pointer errors earlier, it makes them much easier to debug. Debuggers also use address translation to install data breakpoints, to stop a program when it references a particular memory location.
Efficient I/O. Server operating systems are often limited by the rate at which they can transfer data to and from the disk and the network. Address translation enables data to be safely transferred directly between user-mode applications and I/O devices.
Memory mapped files. A convenient and efficient abstraction for many applications is to map files into the address space, so that the contents of the file can be directly referenced with program instructions.
Virtual memory. The operating system can provide applications the abstraction of more memory than is physically present on a given computer.
Checkpointing and restart. The state of a long-running program can be periodically checkpointed so that if the program or system crashes, it can be restarted from the saved state. The key challenge is to be able to perform an internally consistent checkpoint of the program’s data while the program continues to run.
Persistent data structures. The operating system can provide the abstraction of a persistent region of memory, where changes to the data structures in that region survive program and system crashes.
Process migration. An executing program can be transparently moved from one server to another, for example, for load balancing.
Information flow control. An extra layer of security is to verify that a program is not sending your private data to a third party; e.g., a smartphone application may need access to your phone list, but it shouldn’t be allowed to transmit that data. Address translation can be the basis for managing the flow of information into and out of a system.
Distributed shared memory. We can transparently turn a network of servers into a large-scale shared-memory parallel computer using address translation.
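One item in the list above, memory mapped files, can be tried directly from user code. The following is a minimal sketch using Python's standard `mmap` module; the scratch file and its contents are made up for illustration, and the default mapping mode is assumed to be shared and writable, as it is on common platforms.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (contents are purely illustrative).
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello, address translation")

    # Length 0 asks for the whole file to be mapped into the address space.
    with mmap.mmap(fd, 0) as view:
        first = bytes(view[0:5])   # read file contents like ordinary memory
        view[0:5] = b"HELLO"       # store through the mapping
        view.flush()               # push the change back to the file

    # Re-read the file through the normal I/O interface.
    os.lseek(fd, 0, os.SEEK_SET)
    on_disk = os.read(fd, 5)
finally:
    os.close(fd)
    os.remove(path)

print(first, on_disk)
```

The program never issues an explicit read or write for the bytes it touches inside the `with` block; loads and stores on `view` reach the file through the mapping.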
In this chapter, we focus on the mechanisms needed to implement address translation, as that is the foundation of all of these services. We discuss how the operating system and applications use the mechanisms to provide these services in the following two chapters.
For runtime efficiency, most systems have specialized hardware to do address translation; this hardware is managed by the operating system kernel. In some systems, however, the translation is provided by a trusted compiler, linker or byte-code interpreter. In other systems, the application does the pointer translation as a way of managing the state of its own data structures. In still other systems, a hybrid model is used where addresses are translated both in software and hardware. The choice is often an engineering tradeoff between performance, flexibility, and cost. However, the functionality provided is often the same regardless of the mechanism used to implement the translation. In this chapter, we will cover a range of hardware and software mechanisms.
Chapter roadmap:
Address Translation Concept. We start by providing a conceptual framework for understanding both hardware and software address translation. (Section 8.1)
Flexible Address Translation. We focus first on hardware address translation; we ask how we can design the hardware to provide maximum flexibility to the operating system kernel. (Section 8.2)
Efficient Address Translation. The solutions we present will seem flexible but terribly slow. We next discuss mechanisms that make address translation much more efficient, without sacrificing flexibility. (Section 8.3)
Software Protection. Increasingly, software compilers and runtime interpreters are using address translation techniques to implement operating system functionality. What changes when the translation is in software rather than in hardware? (Section 8.4)
8.1 Address Translation Concept
Figure 8.1: Address translation in the abstract. The translator converts (virtual) memory addresses generated by the program into physical memory addresses.
Considered as a black box, address translation is a simple function, illustrated in Figure 8.1. The translator takes each instruction and data memory reference generated by a process, checks whether the address is legal, and converts it to a physical memory address that can be used to fetch or store instructions or data. The data itself (whatever is stored in memory) is returned as is; it is not transformed in any way. The translation is usually implemented in hardware, and the operating system kernel configures the hardware to accomplish its aims.
The task of this chapter is to fill in the details about how that black box works. If we asked you right now how you might implement it, your first several guesses would probably be on the mark. If you said we could use an array, a tree, or a hash table, you would be right: all of those approaches have been taken by real systems.
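To make the black box concrete, here is a minimal sketch of the hash table guess, written in Python for brevity. It is an illustration under stated assumptions rather than any real system's design: the names `Translator`, `TranslationFault`, and `BLOCK_SIZE` are hypothetical, and translating in fixed-size blocks is an assumption made only to keep the table small.

```python
BLOCK_SIZE = 4096  # assumed granularity of translation, in bytes


class TranslationFault(Exception):
    """Raised when a process references an address that is not legal for it."""


class Translator:
    def __init__(self):
        # Hash table: virtual block number -> physical block number.
        self.table = {}

    def map(self, virtual_block, physical_block):
        self.table[virtual_block] = physical_block

    def translate(self, virtual_address):
        block, offset = divmod(virtual_address, BLOCK_SIZE)
        if block not in self.table:          # the legality check
            raise TranslationFault(hex(virtual_address))
        # Only the address changes; the data stored there is untouched.
        return self.table[block] * BLOCK_SIZE + offset


# Two processes use the same virtual address but reach different memory.
proc_a = Translator()
proc_b = Translator()
proc_a.map(0, 7)
proc_b.map(0, 3)

phys_a = proc_a.translate(0x10)
phys_b = proc_b.translate(0x10)
print(hex(phys_a), hex(phys_b))
```

Swapping the dictionary for an array or a tree would change the cost of `translate` and the size of the table, but not the interface.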
Given that a number of different implementations are possible, how should we evaluate the alternatives? Here are some goals we might want out of a translation box; the design we end up with will depend on how we balance among these various goals.
Memory protection. We need the ability to limit the access of a process to certain regions of memory, e.g., to prevent it from accessing memory not owned by the process. Often, however, we may want to limit access of a program to its own memory, e.g., to prevent a pointer error from overwriting the code region or to cause a trap to the debugger when the program references a specific data location.
Memory sharing. We want to allow multiple processes to share selected regions of memory. These shared regions can be large (e.g., if we are sharing a program’s code segment among multiple processes executing the same program) or relatively small (e.g., if we are sharing a common library, a file, or a shared data structure).
Flexible memory placement. We want to allow the operating system the flexibility to place a process (and each part of a process) anywhere in physical memory; this will allow us to pack physical memory more efficiently. As we will see in the next chapter, flexibility in assigning process data to physical memory locations will also enable us to make more effective use of on-chip caches.
Sparse addresses. Many programs have multiple dynamic memory regions that can change in size over the course of the execution of the program: the heap for data objects, a stack for each thread, and memory mapped files. Modern processors have 64-bit address spaces, allowing each dynamic object ample room to grow as needed, but making the translation function more complex.
Runtime lookup efficiency. Hardware address translation occurs on every instruction fetch and every data load and store. It would be impractical if a lookup took, on average, much longer to execute than the instruction itself. At first, many of the schemes we discuss will seem wildly impractical! We will discuss ways to make even the most convoluted translation systems efficient.
Compact translation tables. We also want the space overhead of translation to be minimal; any data structures we need should be small compared to the amount of physical memory being managed.
Portability. Different hardware architectures make different choices as to how they implement translation; if an operating system kernel is to be easily portable across multiple processor architectures, it needs to be able to map from its (hardware-independent) data structures to the specific capabilities of each architecture.
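The first two goals in the list above, protection and sharing, can be sketched together. The following Python illustration is a sketch under assumptions, not a description of any particular hardware: `AddressSpace`, `Perm`, `ProtectionFault`, the block size, and the block numbers are all hypothetical choices.

```python
from enum import Flag, auto

BLOCK_SIZE = 4096  # assumed translation granularity, in bytes


class Perm(Flag):
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()


class ProtectionFault(Exception):
    """Stands in for the trap taken on an illegal reference."""


class AddressSpace:
    def __init__(self):
        # virtual block number -> (physical block number, allowed accesses)
        self.table = {}

    def map(self, virtual_block, physical_block, perm):
        self.table[virtual_block] = (physical_block, perm)

    def translate(self, virtual_address, access):
        block, offset = divmod(virtual_address, BLOCK_SIZE)
        if block not in self.table:
            raise ProtectionFault("unmapped address")
        physical_block, perm = self.table[block]
        if access not in perm:
            raise ProtectionFault("access not permitted")
        return physical_block * BLOCK_SIZE + offset


# Two processes running the same program share one physical copy of the code
# (block 5, never writable) but have private, writable data blocks.
first, second = AddressSpace(), AddressSpace()
for space, data_block in ((first, 8), (second, 9)):
    space.map(0, 5, Perm.READ | Perm.EXECUTE)
    space.map(1, data_block, Perm.READ | Perm.WRITE)

shared = first.translate(0x20, Perm.EXECUTE) == second.translate(0x20, Perm.EXECUTE)
private = first.translate(BLOCK_SIZE, Perm.WRITE) != second.translate(BLOCK_SIZE, Perm.WRITE)

try:
    first.translate(0x20, Perm.WRITE)   # pointer error aimed at the code region
    code_overwritten = True
except ProtectionFault:
    code_overwritten = False

print(shared, private, code_overwritten)
```

Sharing falls out of the table contents: nothing in `translate` knows that two address spaces point at the same physical block.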
We will end up with a fairly complex address translation mechanism, and so our discussion will start with the simplest possible mechanisms and add functionality only as needed. It will be helpful during the discussion for you to keep in mind the two views of memory: the process sees its own memory, using its own addresses. We will call these virtual addresses, because they do not necessarily correspond to any physical reality. By contrast, to the memory system, there are only physical addresses: real locations in memory. From the memory system perspective, it is given physical addresses and it does lookups and stores values. The translation mechanism converts between the two views: from a virtual address to a physical memory address.
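One reason the mechanism ends up complex is the tension between the sparse-address and compact-table goals. A rough back-of-the-envelope calculation shows it; the block size, entry size, and region sizes below are assumptions chosen only for illustration.

```python
VIRTUAL_ADDRESS_BITS = 64   # from the text: modern 64-bit address spaces
BLOCK_SIZE = 4096           # assumption: bytes covered by one table entry
ENTRY_SIZE = 8              # assumption: bytes needed to store one entry

# A flat array needs one entry for every block of the virtual address space,
# whether or not the process ever touches it.
flat_entries = 2**VIRTUAL_ADDRESS_BITS // BLOCK_SIZE
flat_bytes = flat_entries * ENTRY_SIZE

# A sparse process: assumed sizes (in blocks) for a heap, a stack, and a mapped file.
regions_in_blocks = {"heap": 256, "stack": 16, "mapped file": 1024}
used_entries = sum(regions_in_blocks.values())
used_bytes = used_entries * ENTRY_SIZE

print(f"flat table: {flat_bytes // 2**50} PiB")
print(f"entries actually needed: {used_entries} ({used_bytes} bytes)")
```

Under these assumptions a flat array would be far larger than any physical memory, while the entries the process actually needs fit in a few kilobytes, so a practical structure has to avoid paying for the unused parts of the address space.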
Address translation in linkers and loaders
Even without the kernel-user boundary, multiprogramming requires some form of address translation. On a multiprogramming system, when a program is compiled, the compiler does not know which regions of physical memory will be in use by other applications; it cannot control where in physical memory the program will land. The machine instructions for a program contain both relative and absolute addresses; relative addresses, such as to branch forward or backward a certain number of instructions, continue to work regardless of where in memory the program is located. However, some instructions contain absolute addresses, such as to load a global variable or to jump to the start of a procedure. These will stop working unless the program is loaded into memory exactly where the compiler expects it to go.
Before hardware translation became commonplace, early operating systems dealt with this issue by using a relocating loader for copying programs into memory. Once the operating system picked an empty region of physical memory for the program, the loader would modify any instructions in the program that used an absolute address. To simplify the implementation, there was a table at the beginning of the executable image that listed all of the absolute addresses used in the program. In modern systems, this is called a symbol table.
Today, we still have something similar. Complex programs often have multiple files, each of which can be compiled independently and then linked together to form the executable image. When the compiler generates the machine instructions for a single file, it cannot know where in the executable this particular file will go. Instead, the compiler generates a symbol table at the beginning of each compiled file, indicating which values will need to be modified when the individual files are assembled together.
Most commercial operating systems today support the option of dynamic linking, taking the notion of a relocating loader one step further. With a dynamically linked library (DLL), a library is linked into a running program on demand, when the program first calls into the library. We will explain in a bit how the code for a DLL can be shared between multiple different processes, but the linking procedure is straightforward. A table of valid entry points into the DLL is kept by the compiler; the calling program indirects through this table to reach the library routine.
8.2 Towards Flexible Address Translation
Our discussion of hardware address translation is divided into two steps. First, we put the issue of lookup efficiency aside, and instead consider how best to achieve the other goals listed above: flexible memory assignment, space efficiency, fine-grained protection and sharing, and so forth. Once we have the features we want, we will then add mechanisms to gain back lookup efficiency.
Figure 8.2: Address translation with base and bounds registers. The virtual address is added to the base to generate the physical address; the bound register is checked against the virtual address to prevent a process from reading or writing outside of its allocated memory region.
In Chapter 2, we illustrated the notion of hardware memory protection using the simplest hardware imaginable: base and bounds. The translation box consists of two extra registers per process. The base register specifies the start of the process's region of physical memory; the bound register specifies the extent of that region. If the base register is added to every address generated by the program, then we no longer need a relocating loader: the virtual addresses of the program start from 0 and go to bound, and the physical addresses start from base and go to base + bound. Figure 8.2 shows an example of base and bounds translation. Since physical memory can contain several processes, the kernel resets the contents of the base and bounds registers on each process context switch to the appropriate values for that process.
Base and bounds translation is both simple and fast, but it lacks many of the features needed to support modern programs. Base and bounds translation supports only coarse-grained protection at the level of the entire process; it is not possible to prevent a program from overwriting its own code, for example. It is also difficult to share regions of memory between two processes. Since the memory for a process needs to be contiguous, supporting dynamic memory regions, such as for heaps, thread stacks, or memory-mapped files, becomes difficult to impossible.
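As a minimal sketch of the base and bounds check described above, the following models the two registers as plain function arguments. The names translate and BoundsError, and the example register values, are illustrative assumptions rather than anything defined in the text.

```python
class BoundsError(Exception):
    """Raised when a virtual address falls outside the process's region."""


def translate(virtual_addr, base, bound):
    # The bound is compared against the virtual address, before relocation.
    if not 0 <= virtual_addr < bound:
        raise BoundsError(hex(virtual_addr))
    # The physical address is the virtual address plus the base register.
    return base + virtual_addr


# Hypothetical process: loaded at physical 0x4000, region of 0x1000 bytes.
print(hex(translate(0x0123, base=0x4000, bound=0x1000)))  # 0x4123
```

On a context switch, the kernel would simply call the same function with a different base and bound; nothing in the program's own addresses changes.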
8.2.1 Segmented Memory
Figure 8.3: Address translation with a segment table. The virtual address has two components: a segment number and a segment offset. The segment number indexes into the segment table to locate the start of the segment in physical memory. The bound register is checked against the segment offset to prevent a process from reading or writing outside of its allocated memory region. Processes can have restricted rights to certain segments, e.g., to prevent writes to the code segment.
Many of the limitations of base and bounds translation can be remedied with a small change: instead of keeping only a single pair of base and bounds registers per process, the hardware can support an array of pairs of base and bounds registers, for each process. This is called segmentation. Each entry in the array controls a portion, or segment, of the virtual address space. The physical memory for each segment is stored contiguously, but different segments can be stored at different locations. Figure 8.3 shows segment translation in action. The high order bits of the virtual address are used to index into the array; the rest of the address is then treated as above: added to the base and checked against the bound stored at that index. In addition, the operating system can assign different segments different permissions, e.g., to allow execute-only access to code and read-write access to data. Although four segments are shown in the figure, in general the number of segments is determined by the number of bits for the segment number that are set aside in the virtual address.
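The segment lookup can be sketched in a few lines. The address layout here (2 segment bits and 14 offset bits), the table contents, and the names Segment, translate, and SegmentationFault are all assumptions chosen for illustration; they are not taken from Figure 8.3.

```python
from collections import namedtuple

# One base/bound pair plus permissions per segment (illustrative layout).
Segment = namedtuple("Segment", "base bound perms")

OFFSET_BITS = 14  # assumed: 16-bit virtual address = 2 segment bits + 14 offset bits


class SegmentationFault(Exception):
    pass


def translate(table, vaddr, access):
    seg_num = vaddr >> OFFSET_BITS               # high-order bits pick the entry
    offset = vaddr & ((1 << OFFSET_BITS) - 1)    # the rest is the offset
    seg = table.get(seg_num)
    # A missing segment, an offset past the bound, or a disallowed access
    # all trap to the kernel.
    if seg is None or offset >= seg.bound or access not in seg.perms:
        raise SegmentationFault(hex(vaddr))
    return seg.base + offset


table = {
    0: Segment(base=0x4000, bound=0x0700, perms="rx"),  # code
    1: Segment(base=0x0000, bound=0x0500, perms="rw"),  # data
    3: Segment(base=0x2000, bound=0x1000, perms="rw"),  # stack
}

print(hex(translate(table, 0x0240, "x")))  # instruction fetch from the code segment
```

Segment 2 is deliberately absent from the table, so any address whose top two bits are 10 falls into a gap in the virtual address space.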
It should seem odd to you that segmented memory has gaps; program memory is no longer a single contiguous region, but instead it is a set of regions. Each different segment starts at a new segment boundary. For example, code and data are not immediately adjacent to each other in either the virtual or physical address space.
What happens if a program branches into or tries to load data from one of these gaps? The hardware will generate an exception, trapping into the operating system kernel. On UNIX systems, this is still called a segmentation fault, that is, a reference outside of a legal segment of memory. How does a program keep from wandering into one of these gaps?
Correct programs will not generate references outside of valid memory. Put another way, trying to execute code or read data that does not exist is probably an indication that the program has a bug in it.
Figure 8.4: Two processes sharing a code segment, but with separate data and stack segments. In this case, each process uses the same virtual addresses, but these virtual addresses map to either the same region of physical memory (if code) or different regions of physical memory (if data).
Although simple to implement and manage, segmented memory is both remarkably powerful and widely used. For example, the x86 architecture is segmented (with some enhancements that we will describe later). With segments, the operating system can allow processes to share some regions of memory while keeping other regions protected. For example, two processes can share a code segment by setting up an entry in their segment tables to point to the same region of physical memory, that is, to use the same base and bounds. The processes can share the same code while working off different data, by setting up the segment table to point to different regions of physical memory for the data segment. We illustrate this in Figure 8.4.
Likewise, shared library routines, such as a graphics library, can be placed into a segment and shared between processes. As before, the library data would be in a separate, non-shared segment. This is frequently done in modern operating systems with dynamically linked libraries. A practical issue is that different processes may load different numbers of libraries, and so may assign the same library a different segment number. Depending on the processor architecture, sharing can still work, if the library code uses segment-local addresses, addresses that are relative to the current segment.
UNIX fork and copy-on-write
In Chapter 3, we described the UNIX fork system call. UNIX creates a new process by making a complete copy of the parent process; the parent process and the child process are identical except for the return value from fork. The child process can then set up its I/O and eventually use the UNIX exec system call to run a new program. We promised at the time we would explain how this can be done efficiently.
With segments, this is now possible. To fork a process, we can simply make a copy of the parent's segment table; we do not need to copy any of its physical memory. Of course, we want the child to be a copy of the parent, and not just point to the same memory as the parent. If the child changes some data, it should change only its copy, and not its parent's data. On the other hand, most of the time, the child process in UNIX fork simply calls UNIX exec; the shared data is there as a programming convenience.
We can make this work efficiently by using an idea called copy-on-write. During the fork, all of the segments shared between the parent and child process are marked "read-only" in both segment tables. If either side modifies data in a segment, an exception is raised and a full memory copy of that segment is made at that time. In the common case, the child process modifies only its stack before calling UNIX exec, and if so, only the stack needs to be physically copied.
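The copy-on-write idea can be sketched with a toy model in which a bytearray stands in for a region of physical memory. The Segment class, the cow flag (kernel bookkeeping that distinguishes a copy-on-write segment from a truly read-only one), and the fork and store helpers are hypothetical names introduced only for this illustration.

```python
class Segment:
    def __init__(self, data, writable=True, cow=False):
        self.data = data          # stands in for a region of physical memory
        self.writable = writable  # what the "hardware" segment table would allow
        self.cow = cow            # read-only only because of a fork


def fork(parent_table):
    """Copy the segment table, not the memory; mark writable segments read-only."""
    child_table = {}
    for name, seg in parent_table.items():
        if seg.writable:
            seg.writable, seg.cow = False, True
        child_table[name] = Segment(seg.data, seg.writable, seg.cow)
    return child_table


def store(table, name, offset, value):
    seg = table[name]
    if not seg.writable:
        if not seg.cow:
            raise PermissionError(name)   # a genuine protection violation
        # Exception handler: make a private copy of the segment, then retry.
        seg.data = bytearray(seg.data)
        seg.writable, seg.cow = True, False
    seg.data[offset] = value


parent = {
    "code": Segment(bytearray(b"\x90\x90"), writable=False),
    "heap": Segment(bytearray(4)),
    "stack": Segment(bytearray(4)),
}
child = fork(parent)
store(child, "stack", 0, 7)   # only the child's stack is physically copied
```

This sketch copies a segment on the first write from either side, even if the other side has already made its own copy; a real kernel would typically keep a reference count to avoid that extra copy.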
We can also use segments for interprocess communication, if processes are given read and write permission to the same segment. Multics, an operating system from the 1960s that contained many of the ideas we now find in Microsoft's Windows 7, Apple's Mac OS X, and Linux, made extensive use of segmented memory for interprocess sharing. In Multics, a segment was allocated for every data structure, allowing fine-grained protection and sharing between processes. Of course, this made the segment table pretty large! More modern systems tend to use segments only for coarser-grained memory regions, such as the code and data for an entire shared library, rather than for each of the data structures within the library.
As a final example of the power of segments, they enable the efficient management of dynamically allocated memory. When an operating system reuses memory or disk space that had previously been used, it must first zero out the contents of the memory or disk. Otherwise, private data from one application could inadvertently leak into another, potentially malicious, application. For example, you could enter a password into one web site, say for a bank, and then exit the browser. However, if the underlying physical memory used by the browser is then re-assigned to a new process, then the password could be leaked to a malicious web site.
Of course, we only want to pay the overhead of zeroing memory if it will be used. This is particularly an issue for dynamically allocated memory on the heap and stack. It is not clear when the program starts how much memory it will use; the heap could be anywhere from a few kilobytes to several gigabytes, depending on the program. The operating system can address this using zero-on-reference. With zero-on-reference, the operating system allocates a memory region for the heap, but only zeroes the first few kilobytes. Rather than zeroing the rest up front, it sets the bound register in the segment table to limit the program to just the zeroed part of memory. If the program expands its heap, it will take an exception, and the operating system kernel can zero out additional memory before resuming execution.
Given all these advantages, why not stop here? The principal downside of segmentation is the overhead of managing a large number of variable size and dynamically growing memory segments. Over time, as processes are created and finish, physical memory will be divided into regions that are in use and regions that are not, that is, available to be allocated to a new process. These free regions will be of varying sizes. When we create a new segment, we will need to find a free spot for it. Should we put it in the smallest open region where it will fit? The largest open region?
However we choose to place new segments, as more memory becomes allocated, the operating system may reach a point where there is enough free space for a new segment, but the free space is not contiguous. This is called external fragmentation. The operating system is free to compact memory to make room without affecting applications, because virtual addresses are unchanged when we relocate a segment in physical memory. Even so, compaction can be costly in terms of processor overhead: a typical server configuration would take roughly a second to compact its memory.
All this becomes even more complex when memory segments can grow. How much memory should we set aside for a program's heap? If we put the heap segment in a part of physical memory with lots of room, then we will have wasted memory if that program turns out to need only a small heap. If we do the opposite (put the heap segment in a small chunk of physical memory), then we will need to copy it somewhere else if it changes size.
Figure 8.5: Logical view of page table address translation. Physical memory is split into page frames, with a page-size aligned block of virtual addresses assigned to each frame. Unused addresses are not assigned page frames in physical memory.
Figure 8.6: Address translation with a page table. The virtual address has two components: a virtual page number and an offset within the page. The virtual page number indexes into the page table to yield a page frame in physical memory. The physical address is the physical page frame from the page table, concatenated with the page offset from the virtual address. The operating system can restrict process access to certain pages, e.g., to prevent writes to pages containing instructions.
8.2.2 Paged Memory
An alternative to segmented memory is paged memory. With paging, memory is allocated in fixed-sized chunks called page frames. Address translation is similar to how it works with segmentation. Instead of a segment table whose entries contain pointers to variable-sized segments, there is a page table for each process whose entries contain pointers to page frames. Because page frames are fixed-sized and a power of two, the page table entries only need to provide the upper bits of the page frame address, so they are more compact. There is no need for a "bound" on the offset; the entire page in physical memory is allocated as a unit. Figure 8.6 illustrates address translation with paged memory.
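A single-level page table lookup can be sketched as follows. The 4 KB page size, the list-based table, and the names translate and PageFault are assumptions made for the example; None marks a virtual page with no frame assigned.

```python
OFFSET_BITS = 12                 # assumed 4 KB pages
PAGE_SIZE = 1 << OFFSET_BITS


class PageFault(Exception):
    pass


def translate(page_table, vaddr):
    vpn = vaddr >> OFFSET_BITS           # virtual page number indexes the table
    offset = vaddr & (PAGE_SIZE - 1)     # offset passes through unchanged
    frame = page_table[vpn] if vpn < len(page_table) else None
    if frame is None:
        raise PageFault(hex(vaddr))
    # Concatenate the frame number with the offset; no bound check is needed.
    return (frame << OFFSET_BITS) | offset


# Hypothetical process with four virtual pages; page 1 is unused.
page_table = [5, None, 2, 9]
print(hex(translate(page_table, 0x0ABC)))  # 0x5abc
```

Because the frame number and the offset are concatenated rather than added, the table entry needs to hold only the upper bits of the physical address.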
What will seem odd, and perhaps cool, about paging is that while a program thinks of its memory as linear, in fact its memory can be, and usually is, scattered throughout physical memory in a kind of abstract mosaic. The processor will execute one instruction after another using virtual addresses; its virtual addresses are still linear. However, the instruction located at the end of a page will be located in a completely different region of physical memory from the next instruction at the start of the next page. Data structures will appear to be contiguous using virtual addresses, but a large matrix may be scattered across many physical page frames.
An apt analogy is what happens when you shuffle several decks of cards together. A single process in its virtual address space sees the cards of a single deck in order. A different process sees a completely different deck, but it will also be in order. In physical memory, however, the decks of all the processes currently running will be shuffled together, apparently at random. The page tables are the magician's assistant: able to instantly find the queen of hearts from among the shuffled decks.
Paging addresses the principal limitation of segmentation: free-space allocation is very straightforward. The operating system can represent physical memory as a bitmap, with each bit representing a physical page frame that is either free or in use. Finding a free frame is just a matter of finding an empty bit.
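A bitmap allocator of the kind described above might look like the sketch below. The class name FrameBitmap and its methods are illustrative; a Python integer stands in for the bit array, and the linear scan is the simplest possible search rather than a tuned one.

```python
class FrameBitmap:
    """One bit per physical page frame: 1 means in use, 0 means free."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.bits = 0

    def alloc(self):
        # Finding a free frame is just finding a zero bit.
        for frame in range(self.num_frames):
            if not (self.bits >> frame) & 1:
                self.bits |= 1 << frame
                return frame
        raise MemoryError("no free page frames")

    def free(self, frame):
        self.bits &= ~(1 << frame)


frames = FrameBitmap(4)
first, second, third = frames.alloc(), frames.alloc(), frames.alloc()
frames.free(second)
reused = frames.alloc()   # any free frame will do; there is no fit to worry about
```

Since every frame is the same size, any free frame satisfies any request, which is why external fragmentation does not arise.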
Sharing memory between processes is also convenient: we need to set the page table entry for each process sharing a page to point to the same physical page frame. For a large shared region that spans multiple page frames, such as a shared library, this may require setting up a number of page table entries. Since we need to know when to release memory when a process finishes, shared memory requires some extra bookkeeping to keep track of whether the shared page is still in use. The data structure for this is called a core map; it records information about each physical page frame such as which page table entries point to it.
Many of the optimizations we discussed under segmentation can also be done with paging. For copy-on-write, we need to copy the page table entries and set them to read-only; on a store to one of these pages, we can make a real copy of the underlying page frame before resuming the process. Likewise, for zero-on-reference, we can set the page table entry at the top of the stack to be invalid, causing a trap into the kernel. This allows us to extend the stack only as needed.
Page tables allow other features to be added. For example, we can start a program running before all of its code and data are loaded into memory. Initially, the operating system marks all of the page table entries for a new process as invalid; as pages are brought in from disk, it marks those pages as read-only (for code pages) or read-write (for data pages). Once the first few pages are in memory, however, the operating system can start execution of the program in user-mode, while the kernel continues to transfer the rest of the program's code in the background. As the program starts up, if it happens to jump to a location that has not been loaded yet, the hardware will cause an exception, and the kernel can stall the program until that page is available. Further, the compiler can reorganize the program executable for more efficient startup, by coalescing the initialization pages into a few pages at the start of the program, thus overlapping initialization and loading the program from disk.
As another example, a data breakpoint is a request to stop the execution of a program when it references or modifies a particular memory location. It is helpful during debugging to know when a data structure has been changed, particularly when tracking down pointer errors. Data breakpoints are sometimes implemented with special hardware support, but they can also be implemented with page tables. For this, the page table entry containing the location is marked read-only. This causes the process to trap to the operating system on every change to the page; the operating system can then check if the instruction causing the exception affected the specific location or not.
A downside of paging is that while the management of physical memory becomes simpler, the management of the virtual address space becomes more challenging. Compilers typically expect the execution stack to be contiguous (in virtual addresses) and of arbitrary size; each new procedure call assumes the memory for the stack is available. Likewise, the runtime library for dynamic memory allocation typically expects a contiguous heap. In a single-threaded process, we can place the stack and heap at opposite ends of the virtual address space, and have them grow towards each other, as shown in Figure 8.5. However, with multiple threads per process, we need multiple thread stacks, each with room to grow.
This becomes even more of an issue with 64-bit virtual address spaces. The size of the page table is proportional to the size of the virtual address space, not to the size of physical memory. The more sparse the virtual address space, the more overhead is needed for the page table. Most of the entries will be invalid, representing parts of the virtual address space that are not in use, but physical memory is still needed for all of those page table entries.
We can reduce the space taken up by the page table by choosing a larger page frame. How big should a page frame be? A larger page frame can waste space if a process does not use all of the memory inside the frame. This is called internal fragmentation. Fixed-size chunks are easier to allocate, but waste space if the entire chunk is not used. Unfortunately, this means that with paging, either pages are very large (wasting space due to internal fragmentation), or the page table is very large (wasting space), or both. For example, with 16 KB pages and a 64-bit virtual address space, we might need 2^50 page table entries!
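The entry count in the example above follows directly from the address width and the page size: a 16 KB page uses 14 offset bits, leaving 64 - 14 = 50 bits of virtual page number. The short check below reproduces that arithmetic; the function name and the 8-byte entry size used for the size estimate are assumptions, not figures from the text.

```python
def flat_page_table_entries(vaddr_bits, page_size):
    """Entries in a single-level page table covering the whole address space."""
    offset_bits = page_size.bit_length() - 1   # log2 of a power-of-two page size
    return 1 << (vaddr_bits - offset_bits)


entries_64 = flat_page_table_entries(64, 16 * 1024)
entries_32 = flat_page_table_entries(32, 4 * 1024)

ASSUMED_ENTRY_BYTES = 8   # hypothetical entry size, for scale only
print(entries_64 == 2**50, entries_64 * ASSUMED_ENTRY_BYTES == 2**53)
print(entries_32 == 2**20)
```

For comparison, the same calculation for a 32-bit address space with 4 KB pages gives 2^20 entries, which is large but still practical to store.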
8.2.3 Multi-Level Translation
If you were to design an efficient system for doing a lookup on a sparse keyspace, you probably would not pick a simple array. A tree or a hash table are more appropriate, and indeed, modern systems use both. We focus in this subsection on trees; we discuss hash tables afterwards.
Many systems use tree-based address translation, although the details vary from system to system, and the terminology can be a bit confusing. Despite the differences, the systems we are about to describe have similar properties. They support coarse and fine-grained memory protection and memory sharing, flexible memory placement, efficient memory allocation, and efficient lookup for sparse address spaces, even for 64-bit machines.
Almost all multi-level address translation systems use paging as the lowest level of the tree. The main differences between systems are in how they reach the page table at the leaf of the tree: whether using segments plus paging, or multiple levels of paging, or segments plus multiple levels of paging. There are several reasons for using paging at the lowest level:
Efficient memory allocation. By allocating physical memory in fixed-size page frames, management of free space can use a simple bitmap.
Efficient disk transfers. Hardware disks are partitioned into fixed-sized regions called sectors; disk sectors must be read or written in their entirety. By making the page size a multiple of the disk sector, we simplify transfers to and from memory, for loading programs into memory, reading and writing files, and in using the disk to simulate a larger memory than is physically present on the machine.
Efficient lookup. We will describe in the next section how we can use a cache called a translation lookaside buffer to make lookups fast in the common case; the translation buffer caches lookups on a page by page basis. Paging also allows the lookup tables to be more compact, especially important at the lowest level of the tree.
Efficient reverse lookup. Using fixed-sized page frames also makes it easy to implement the core map, to go from a physical page frame to the set of virtual addresses that share the same frame. This will be crucial for implementing the illusion of an infinite virtual memory in the next chapter.
Page-granularity protection and sharing. Typically, every table entry at every level of the tree will have its own access permissions, enabling both coarse-grained and fine-grained sharing, down to the level of the individual page frame.
Figure 8.7: Address translation with paged segmentation. The virtual address has three components: a segment number, a virtual page number within the segment, and an offset within the page. The segment number indexes into a segment table that yields the page table for that segment. The page number from the virtual address indexes into the page table (from the segment table) to yield a page frame in physical memory. The physical address is the physical page frame from the page table, concatenated with the page offset from the virtual address. The operating system can restrict access to an entire segment, e.g., to prevent writes to the code segment, or to an individual page, e.g., to implement copy-on-write.
Paged Segmentation
Let us start with a system with only two levels of a tree. With paged segmentation, memory is segmented, but instead of each segment table entry pointing directly to a contiguous region of physical memory, each segment table entry points to a page table, which in turn points to the memory backing that segment. The segment table entry "bound" describes the page table length, that is, the length of the segment in pages. Because paging is used at the lowest level, all segment lengths are some multiple of the page size. Figure 8.7 illustrates translation with paged segmentation.
Although segment tables are sometimes stored in special hardware registers, the page tables for each segment are quite a bit larger in aggregate, and so they are normally stored in physical memory. To keep the memory allocator simple, the maximum segment size is usually chosen to allow the page table for each segment to be a small multiple of the page size.
For example, with 32-bit virtual addresses and 4 KB pages, we might set aside the upper ten bits for the segment number, the next ten bits for the page number, and twelve bits for the page offset. In this case, if each page table entry is four bytes, the page table for each segment would exactly fit into one physical page frame.
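The 10/10/12 split in this example can be sketched directly. In the code below, each segment table entry is modeled as a Python list of frame numbers whose length plays the role of the segment bound; the helper names and the example table contents are hypothetical.

```python
SEG_BITS, PAGE_BITS, OFFSET_BITS = 10, 10, 12   # the 32-bit split from the text
PTE_BYTES = 4

# 2^10 entries of four bytes each fill exactly one 4 KB page frame.
assert (1 << PAGE_BITS) * PTE_BYTES == 1 << OFFSET_BITS


class TranslationFault(Exception):
    pass


def split(vaddr):
    seg = (vaddr >> (PAGE_BITS + OFFSET_BITS)) & ((1 << SEG_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    return seg, page, offset


def translate(segment_table, vaddr):
    seg, page, offset = split(vaddr)
    page_table = segment_table.get(seg)   # segment table yields a page table
    # The segment "bound" is the page table length, in pages.
    if page_table is None or page >= len(page_table):
        raise TranslationFault(hex(vaddr))
    return (page_table[page] << OFFSET_BITS) | offset


# Hypothetical process: segment 0 has two pages, segment 1 has three.
segment_table = {0: [7, 8], 1: [3, 4, 6]}
vaddr = (1 << 22) | (2 << 12) | 0x034     # segment 1, page 2, offset 0x34
print(hex(translate(segment_table, vaddr)))
```

Per-segment and per-page permission bits are omitted here to keep the two-level structure visible.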
Multi-Level Paging
Figure 8.8: Address translation with three levels of page tables. The virtual address has four components: an index into each level of the page table and an offset within the physical page frame.
A nearly equivalent approach to paged segmentation is to use multiple levels of page tables. On the Sun Microsystems SPARC processor, for example, there are three levels of page table. As shown in Figure 8.8, the top-level page table contains entries, each of which points to a second-level page table whose entries are pointers to page tables. On the SPARC, as with most other systems that use multiple levels of page tables, each level of page table is designed to fit in a physical page frame. Only the top-level page table must be filled in; the lower levels of the tree are allocated only if those portions of the virtual address space are in use by a particular process. Access permissions can be specified at each level, and so sharing between processes is possible at each level.
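The key property of a multi-level page table, that lower levels exist only for regions in use, can be sketched with nested dictionaries. The three levels of 9 index bits and the 12 offset bits are an illustrative split, not the SPARC layout, and map_page, translate, and count_tables are hypothetical helpers.

```python
LEVELS, INDEX_BITS, OFFSET_BITS = 3, 9, 12   # illustrative split only


class PageFault(Exception):
    pass


def indices(vaddr):
    """Split the virtual page number into one index per level, top level first."""
    vpn = vaddr >> OFFSET_BITS
    mask = (1 << INDEX_BITS) - 1
    return [(vpn >> (INDEX_BITS * i)) & mask for i in reversed(range(LEVELS))]


def map_page(root, vaddr, frame):
    *upper, leaf = indices(vaddr)
    node = root
    for i in upper:
        node = node.setdefault(i, {})   # allocate a lower-level table on demand
    node[leaf] = frame


def translate(root, vaddr):
    node = root
    for i in indices(vaddr):
        node = node.get(i)              # one memory reference per level
        if node is None:
            raise PageFault(hex(vaddr))
    return (node << OFFSET_BITS) | (vaddr & ((1 << OFFSET_BITS) - 1))


def count_tables(node, depth=LEVELS):
    if depth == 1:
        return 1
    return 1 + sum(count_tables(child, depth - 1) for child in node.values())


root = {}
map_page(root, 0x0000_1000, 42)          # near the bottom of the address space
map_page(root, 0x70_0000_0000, 99)       # far away, e.g. a stack
print(count_tables(root))                # tables allocated for two mapped pages
```

Two widely separated pages cost five small tables here (one root, two second-level, two third-level), instead of one flat table with 2^27 entries.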
Multi-Level Paged Segmentation
We can combine these two approaches by using a segmented memory where each segment is managed by a multi-level page table. This is the approach taken by the x86, for both its 32-bit and 64-bit addressing modes.
We describe the 32-bit case first. The x86 terminology differs slightly from what we have used here. The x86 has a per-process Global Descriptor Table (GDT), equivalent to a segment table. The GDT is stored in memory; each entry (descriptor) points to the (multi-level) page table for that segment along with the segment length and segment access permissions. To start a process, the operating system sets up the GDT and initializes a register, the Global Descriptor Table Register (GDTR), that contains the address and length of the GDT.
Because of its history, the x86 uses separate processor registers to specify the segment number (that is, the index into the GDT) and the virtual address for use by each instruction. For example, on the "32-bit" x86, there is both a segment number and 32 bits of virtual address within each segment. On the 64-bit x86, the virtual address within each segment is extended to 64 bits. Most applications only use a few segments, however, so the per-process segment table is usually short. The operating system kernel has its own segment table; this is set up to enable the kernel to access, with virtual addresses, all of the per-process and shared segments on the system.
For encoding efficiency, the segment register is often implicit as part of the instruction. For example, the x86 stack instructions such as push and pop assume the stack segment (the index stored in the stack segment register), branch instructions assume the code segment (the index stored in the code segment register), and so forth. As an optimization, whenever the x86 initializes a code, stack, or data segment register it also reads the GDT entry (that is, the top-level page table pointer and access permissions) into the processor, so the processor can go directly to the page table on each reference.
Many instructions also have an option to specify the segment index explicitly. For example, the ljmp, or long jump, instruction changes the program counter to a new segment number and offset within that segment.
For the 32-bit x86, the virtual address space within a segment has a two-level page table. The first 10 bits of the virtual address index the top-level page table, called the page directory, the next 10 bits index the second-level page table, and the final 12 bits are the offset within a page. Each page table entry takes four bytes and the page size is 4 KB, so the top-level page table and each second-level page table each fit in a single physical page. The number of second-level page tables needed depends on the length of the segment; they are not needed to map empty regions of virtual address space. Both the top-level and second-level page table entries have permissions, so fine-grained protection and sharing is possible within a segment.
Today, the amount of memory per computer is often well beyond what 32 bits can address; for example, a high-end server could have two terabytes of physical memory. For the 64-bit x86, virtual addresses within a segment can be up to 64 bits. However, to simplify address translation, current processors only allow 48 bits of the virtual address to be used; this is sufficient to map 128 terabytes, using four levels of page tables. The lower levels of the page table tree are only filled in if that portion of the virtual address space is in use.
As an optimization, the 64-bit x86 has the option to eliminate one or two levels of the page table. Each physical page frame on the x86 is 4 KB. Each page of the fourth-level page table maps 2 MB of data, and each page of the third-level page table maps 1 GB of data. If the operating system places data such that the entire 2 MB covered by the fourth-level page table is allocated contiguously in physical memory, then the page table entry one layer up can be marked to point directly to this region instead of to a page table. Likewise, a page of the third-level page table can be omitted if the operating system allocates the process a 1 GB chunk of physical memory. In addition to saving space needed for page table mappings, this improves translation buffer efficiency, a point we will discuss in more detail in the next section.
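The 2 MB and 1 GB figures above can be checked with a few lines of arithmetic. The one input not stated in the text is the size of a page table entry; the sketch assumes 8 bytes per entry on the 64-bit x86, so that one 4 KB table page holds 512 entries.

```python
PAGE_BYTES = 4 * 1024
ENTRY_BYTES = 8                                   # assumption: 8-byte entries
ENTRIES_PER_TABLE = PAGE_BYTES // ENTRY_BYTES     # 512 entries, 9 index bits

fourth_level_covers = ENTRIES_PER_TABLE * PAGE_BYTES          # one leaf table page
third_level_covers = ENTRIES_PER_TABLE * fourth_level_covers  # one table page above it

index_bits = ENTRIES_PER_TABLE.bit_length() - 1
offset_bits = PAGE_BYTES.bit_length() - 1
translated_bits = 4 * index_bits + offset_bits    # four levels plus page offset

print(fourth_level_covers == 2 * 1024**2)   # 2 MB
print(third_level_covers == 1024**3)        # 1 GB
print(translated_bits)                      # 48
```

The same numbers explain why four levels are enough for a 48-bit virtual address: four 9-bit indices plus a 12-bit offset.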
8.2.4 Portability
The diversity of different translation mechanisms poses a challenge to the operating system designer. To be widely used, we want our operating system to be easily portable to a wide variety of different processor architectures. Even within a given processor family, such as an x86, there are a number of different variants that an operating system may need to support. Main memory density is increasing both the physical and virtual address space by almost a bit per year. In other words, for a multi-level page table to be able to map all of memory, an extra level of the page table is needed every decade just to keep up with the increasing size of main memory.
A further challenge is that the operating system often needs to keep two sets of books with respect to address translation. One set of books is the hardware view: the processor consults a set of segment and multi-level page tables to be able to correctly and securely execute instructions and load and store data. A different set of books is the operating system view of the virtual address space. To support features such as copy-on-write, zero-on-reference, and fill-on-reference, as well as other applications we will describe in later chapters, the operating system must keep track of additional information about each virtual page beyond what is stored in the hardware page table.
These software memory management data structures mirror, but are not identical to, the hardware structures. They consist of three parts:
List of memory objects. Memory objects are logical segments. Whether or not the underlying hardware is segmented, the kernel memory manager needs to keep track of which memory regions represent which underlying data, such as program code, library code, shared data between two or more processes, a copy-on-write region, or a memory-mapped file. For example, when a process starts up, the kernel can check the object list to see if the code is already in memory; likewise, when a process opens a library, it can check if it has already been linked by some other process. Similarly, the kernel can keep reference counts to determine which memory regions to reclaim on process exit.
Virtual to physical translation. On an exception, and during system call parameter copying, the kernel needs to be able to translate from a process's virtual addresses to its physical locations. While the kernel could use the hardware page tables for this, the kernel also needs to keep track of whether an invalid page is truly invalid or simply not loaded yet (in the case of fill-on-reference), and whether a read-only page is truly read-only or just simulating a data breakpoint or a copy-on-write page.
Physical to virtual translation. We referred to this above as the core map. The operating system needs to keep track of the processes that map to a specific physical memory location, to ensure that when the kernel updates a page's status, it can also update every page table entry that refers to that physical page.
The most interesting of these are the data structures used for the virtual to physical translation. For the software page table, we have all of the same options as before with respect to segmentation and multiple levels of paging, as well as some others. The software page table need not use the same structure as the underlying hardware page table; indeed, if the operating system is to be easily portable, the software data structures may be quite different from the underlying hardware.
Linux models the operating system's internal address translation data structures after the x86 architecture of segments plus multi-level page tables. This has made porting Linux to new x86 architectures relatively easy, but porting Linux to other architectures somewhat more difficult.
A different approach, taken first in a research system called Mach and later in Apple OS X, is to use a hash table, rather than a tree, for the software translation data. For historical reasons, the use of a hash table for paged address translation is called an inverted page table. Particularly as we move to deeper multi-level page tables, using a hash table for translation can speed up translation.
With an inverted page table, the virtual page number is hashed into a table of size proportional to the number of physical page frames. Each entry in the hash table contains tuples of the form (in the figure, the physical page is implicit):
Figure 8.9: Address translation with a software hash table. The hardware page tables are omitted from the picture. The virtual page number is hashed; this yields a position in the hash table that indicates the physical page frame. The virtual page number must be checked against the contents of the hash entry to handle collisions and to check page access permissions.
As shown in Figure 8.9, if there is a match on both the virtual page number and the process ID, then the translation is valid. Some systems do a two stage lookup: they first map the virtual address to a memory object ID, and then do the hash table lookup on the relative virtual address within the memory object. If memory is mostly shared, this can save space in the hash table without unduly slowing the translation.
An inverted page table does need some way to handle hash collisions, when two virtual addresses map to the same hash table entry. Standard techniques, such as chaining or rehashing, can be used to handle collisions.
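A software hash table of this kind, with chaining for collisions, might be sketched as follows. The entry layout used here (process ID, virtual page number, frame number, permissions) is an assumption inferred from the surrounding discussion, and the class and method names are hypothetical.

```python
class InvertedPageTable:
    """Hash table keyed on (process ID, virtual page number), with chaining."""

    def __init__(self, num_frames):
        # Table size is proportional to physical memory, not to the
        # size of the virtual address space.
        self.buckets = [[] for _ in range(num_frames)]

    def _bucket(self, pid, vpn):
        return self.buckets[hash((pid, vpn)) % len(self.buckets)]

    def insert(self, pid, vpn, frame, perms):
        self._bucket(pid, vpn).append((pid, vpn, frame, perms))

    def lookup(self, pid, vpn, access):
        # Walk the chain: a hit requires a match on both the process ID
        # and the virtual page number.
        for e_pid, e_vpn, frame, perms in self._bucket(pid, vpn):
            if e_pid == pid and e_vpn == vpn:
                if access not in perms:
                    raise PermissionError((pid, vpn, access))
                return frame
        return None   # no translation: the kernel must handle the fault


# A one-bucket table forces every entry to collide, exercising the chain.
ipt = InvertedPageTable(num_frames=1)
ipt.insert(pid=1, vpn=5, frame=3, perms="rw")
ipt.insert(pid=2, vpn=5, frame=8, perms="r")
print(ipt.lookup(1, 5, "w"), ipt.lookup(2, 5, "r"))
```

Storing the frame number explicitly in each entry, rather than leaving it implicit in the entry's position, is what makes chaining straightforward in this sketch.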
A particularly useful consequence of having a portability layer for memory management is that the contents of the hardware multi-level translation table can be treated as a hint. A hint is a result of some computation whose results may no longer be valid, but where using an invalid hint will trigger an exception.
With a portability layer, the software page table is the ground truth, while the hardware page table is a hint. The hardware page table can be safely used, provided that the translations and permissions are a subset of the translations in the software page table.
Is an inverted page table enough?
The concept of an inverted page table raises an intriguing question: do we need to have a multi-level page table in hardware? Suppose, in hardware, we hash the virtual address. But instead of using the hash value to look up in a table where to find the physical page frame, suppose we just use the hash value as the physical page. For this to work, we need the hash table size to have exactly as many entries as physical memory page frames, so that there is a one-to-one correspondence between the hash table entry and the page frame.
We still need a table to store permissions and to indicate which virtual page is stored in each entry; if the process does not have permission to access the page, or if two virtual pages hash to the same physical page, we need to be able to detect this and trap to the operating system kernel to handle the problem. This is why a hash table for managing memory is often called an inverted page table: the entries in the table are virtual page numbers, not physical page numbers. The physical page number is just the position of that virtual page in the table.
The drawback to this approach? Handling hash collisions becomes much harder. If two pages hash to the same table entry, only one can be stored in the physical page frame. The other has to be elsewhere, either in a secondary hash table entry or possibly stored on disk. Copying in the new page can take time, and if the program is unlucky enough to need to simultaneously access two virtual pages that both hash to the same physical page, the system will slow down even further. As a result, on modern systems, inverted page tables are typically used in software to improve portability, rather than in hardware, to eliminate the need for multi-level page tables.
8.3 Towards Efficient Address Translation
At this point, you should be getting a bit antsy. After all, most of the hardware mechanisms we have described involve at least two and possibly as many as four extra memory references, on each instruction, before we even reach the intended physical memory location! It should seem completely impractical for a processor to do several memory lookups on every instruction fetch, and even more so for every instruction that loads or stores data.
In this section, we will discuss how to improve address translation performance without changing its logical behavior. In other words, despite the optimization, every virtual address is translated to exactly the same physical memory location, and every permission exception causes a trap, exactly as would have occurred without the performance optimization.
For this, we will use a cache, a copy of some data that can be accessed more quickly than the original. This section concerns how we might use caches to improve translation performance. Caches are widely used in computer architecture, operating systems, distributed systems, and many other systems; in the next chapter, we discuss more generally when caches work and when they do not. For now, however, our focus is just on the use of caches for reducing the overhead of address translation. There is a reason for this: the very first hardware caches were used to improve translation performance.
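8.3.1 Translation Lookaside Buffers
If you think about how a processor executes instructions with address translation, there are some obvious ways to improve performance. After all, the processor normally executes instructions in a sequence; consider, for example, an add instruction immediately followed by a mult instruction.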
The hardware will first translate the program counter for the add instruction, walking the multi-level translation table to find the physical memory where the add instruction is stored. When the program counter is incremented, the processor must walk the multiple levels again to find the physical memory where the mult instruction is stored. If the two instructions are on the same page in the virtual address space, then they will be on the same page in physical memory. The processor will just repeat the same work: the table walk will be exactly the same, and again for the next instruction, and the next after that.
A translation lookaside buffer (TLB) is a small hardware table containing the results of recent address translations. Each entry in the TLB maps a virtual page to a physical page frame and records the permissions for that page.
Figure 8.10: Operation of a translation lookaside buffer. In the diagram, each virtual page number is checked against all of the entries in the TLB at the same time; if there is a match, the matching table entry contains the physical page frame and permissions. If not, the hardware multi-level page table lookup is invoked; note the hardware page tables are omitted from the picture.
Figure 8.11: Combined operation of a translation lookaside buffer and hardware page tables.
Instead of finding the relevant entry by a multi-level lookup or by hashing, the TLB hardware (typically) checks all of the entries simultaneously against the virtual page. If there is a match, the processor uses that entry to form the physical address, skipping the rest of the steps of address translation. This is called a TLB hit. On a TLB hit, the hardware still needs to check permissions, in case, for example, the program attempts to write to a code-only page or the operating system needs to trap on a store instruction to a copy-on-write page.
A TLB miss occurs if none of the entries in the TLB match. In this case, the hardware does the full address translation in the way we described above. When the address translation completes, the physical page is used to form the physical address, and the translation is installed in an entry in the TLB, replacing one of the existing entries. Typically, the replaced entry will be one that has not been used recently.
The TLB lookup is illustrated in Figure 8.10, and Figure 8.11 shows how a TLB fits into the overall address translation system.
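The hit and miss paths described above can be sketched in software. This is only a model under stated assumptions: the class and exception names are invented for illustration, a plain dictionary stands in for the full multi-level page table walk, a 4 KB page size is assumed, and least-recently-used replacement is one plausible reading of "not used recently".

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size for this sketch


class PageFault(Exception):
    """Raised when the (simulated) full translation finds no mapping."""


class TLB:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # vpn -> (frame, perms); stands in for the table walk
        self.entries = OrderedDict()   # vpn -> (frame, perms), least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, access):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)          # mark as recently used
        else:
            self.misses += 1
            if vpn not in self.page_table:
                raise PageFault(hex(vaddr))
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)   # evict the least recently used entry
            self.entries[vpn] = self.page_table[vpn]
        frame, perms = self.entries[vpn]
        if access not in perms:                    # permissions are checked even on a hit
            raise PermissionError(hex(vaddr))
        return frame * PAGE_SIZE + offset


page_table = {0: (7, "rx"), 1: (3, "rw")}
tlb = TLB(capacity=1, page_table=page_table)
```

Two consecutive instruction fetches from page 0 would cost one miss followed by one hit, which is the saving the text describes.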
Although the hardware cost of a TLB might seem large, it is modest compared to the potential gain in processor performance. To be useful, the TLB lookup needs to be much more rapid than doing a full address translation; thus, the TLB table entries are implemented in very fast, on-chip static memory, situated near the processor. In fact, to keep lookups rapid, many systems now include multiple levels of TLB. In general, the smaller the memory, the faster the lookup. So, the first level TLB is small and close to the processor (and often split for engineering reasons into one for instruction lookups and a separate one for data lookups). If the first level TLB does not contain the translation, a larger second level TLB is consulted, and the full translation is only invoked if the translation misses both levels. For simplicity, our discussion will assume a single-level TLB.
A TLB also requires an address comparator for each entry to check in parallel if there is a match. To reduce this cost, some TLBs are set associative. Compared to fully associative TLBs, set associative ones need fewer comparators, but they may have a higher miss rate. We will discuss set associativity, and its implications for operating system design, in the next chapter.
What is the cost of address translation with a TLB? There are two factors. We pay the cost of the TLB lookup regardless of whether the address is in the TLB or not; in the case of an unsuccessful TLB lookup, we also pay the cost of the full translation. If P(hit) is the likelihood that the TLB has the entry cached:
Cost(address translation) = Cost(TLB lookup) + Cost(full translation) × (1 - P(hit))
In other words, the processor designer needs to include a sufficiently large TLB that most addresses generated by a program will hit in the TLB, so that doing the full translation is the rare event. Even so, TLB misses are a significant cost for many applications.
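To get a feel for the formula, the short sketch below evaluates it for a couple of hit rates. The cycle counts (1 cycle for a TLB lookup, 30 cycles for a full translation) are made-up values chosen for illustration, not measurements.

```python
def translation_cost(tlb_lookup, full_translation, p_hit):
    """Expected cost of one address translation, per the formula in the text."""
    return tlb_lookup + full_translation * (1.0 - p_hit)


# Assumed, illustrative costs in cycles.
cost_at_99 = translation_cost(tlb_lookup=1, full_translation=30, p_hit=0.99)
cost_at_90 = translation_cost(tlb_lookup=1, full_translation=30, p_hit=0.90)
```

Under these assumptions the expected cost is about 1.3 cycles at a 99% hit rate but about 4 cycles at 90%, which shows how quickly misses come to dominate.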
Software-loaded TLB
If the TLB is effective at amortizing the cost of doing a full address translation across many memory references, we can ask a radical question: do we need hardware multi-level page table lookup on a TLB miss? This is the concept behind a software-loaded TLB. A TLB hit works as before, as a fast path. On a TLB miss, instead of doing hardware address translation, the processor traps to the operating system kernel. In the trap handler, the kernel is responsible for doing the address lookup, loading the TLB with the new translation, and restarting the application.
This approach dramatically simplifies the design of the operating system, because it no longer needs to keep two sets of page tables, one for the hardware and one for itself. On a TLB miss, the operating system can consult its own portable data structures to determine what data should be loaded into the TLB.
Although convenient for the operating system, a software-loaded TLB is somewhat slower for executing applications, as the cost of trapping to the kernel is significantly more than the cost of doing hardware address translation. As we will see in the next chapter, the contents of page table entries can be stored in on-chip hardware caches; this means that even on a TLB miss, the hardware can often find every level of the multi-level page table already stored in an on-chip cache, but not in the TLB. For example, a TLB miss on a modern generation x86 can be completed in the best case in the equivalent of 17 instructions. By contrast, a trap to the operating system kernel will take several hundred to a few thousand instructions to process, even in the best case.
Figure 8.12: Operation of a translation lookaside buffer with superpages. In the diagram, some entries in the TLB can be superpages; these match if the virtual page is in the superpage. The superpage in the diagram covers an entire memory segment, but this need not always be the case.
8.3.2 Superpages
One way to improve the TLB hit rate is using a concept called superpages. A superpage is a set of contiguous pages in physical memory that map a contiguous region of virtual memory, where the pages are aligned so that they share the same high-order (superpage) address. For example, an 8 KB superpage would consist of two adjacent 4 KB pages that lie on an 8 KB boundary in both virtual and physical memory. Superpages are at the discretion of the operating system: small programs or memory segments that benefit from a smaller page size can still operate with the standard, smaller page size.
Superpages complicate operating system memory allocation by requiring the system to allocate chunks of memory in different sizes. However, the upside is that a superpage can drastically reduce the number of TLB entries needed to map large, contiguous regions of memory. Each entry in the TLB has a flag, signifying whether the entry is a page or a superpage. For superpages, the TLB matches the superpage number; that is, it ignores the portion of the virtual address that is the page number within the superpage. This is illustrated in Figure 8.12.
To make this concrete, the x86 skips one or two levels of the page table when there is a 2 MB or 1 GB region of physical memory that is mapped as a unit. When the processor references one of these regions, only a single entry is loaded into the TLB. When looking for a match against a superpage, the TLB only considers the most significant bits of the address, ignoring the offset within the superpage. For a 2 MB superpage, the offset is the lowest 21 bits of the virtual address. For a 1 GB superpage it is the lowest 30 bits.
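The matching rule can be modeled with a few shifts and masks. In the sketch below, each TLB entry carries the number of offset bits it ignores (12 for an ordinary 4 KB page, 21 for a 2 MB superpage); the tuple layout, function name, and example addresses are assumptions for illustration and do not reflect how any real TLB encodes its entries.

```python
def tlb_lookup(entries, vaddr):
    """Return the physical address for vaddr, or None on a TLB miss.

    Each entry is (virtual_base, physical_base, offset_bits); offset_bits
    plays the role of the page/superpage flag described in the text.
    """
    for virt_base, phys_base, offset_bits in entries:
        # Compare only the high-order bits; the low offset_bits are ignored.
        if (vaddr >> offset_bits) == (virt_base >> offset_bits):
            offset = vaddr & ((1 << offset_bits) - 1)
            return phys_base + offset
    return None


entries = [
    (0x0000_1000, 0x0007_3000, 12),   # ordinary 4 KB page
    (0x4000_0000, 0x8000_0000, 21),   # 2 MB superpage
]
```

Any address within the 2 MB region starting at 0x4000_0000 matches the single superpage entry, while the address just past the region misses.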
Figure 8.13: Layout of a high-resolution frame buffer in physical memory. Each line of the pixel display can take up an entire page, so that adjacent pixels in the vertical dimension lie on different pages.
A common use of superpages is to map the frame buffer for the computer display. When redrawing the screen, the processor may touch every pixel; with a high-resolution display, this can involve stepping through many megabytes of memory. If each TLB entry maps a 4 KB page, even a large on-chip TLB with 256 entries would only be able to contain mappings for 1 MB of the frame buffer at the same time. Thus, the TLB would need to repeatedly do page table lookups to pull in new TLB entries as it steps through memory. An even worse case occurs when drawing a vertical line. The frame buffer is a two-dimensional array in row-major order, so that each horizontal line of pixels is on a separate page. Thus, modifying each separate pixel in a vertical line would require loading a separate TLB entry! With superpages, the entire frame buffer can be mapped with a single TLB entry, leaving more room for the other pages needed by the application.
Similar issues occur with large matrices in scientific code.
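The vertical-line case is easy to quantify. The sketch below counts how many distinct pages a vertical line touches for a given page size; the display geometry (768 rows of 4096 bytes each, frame buffer starting at address 0) is an assumed example, not a figure from the text.

```python
def pages_touched_by_vertical_line(base, row_bytes, rows, page_size):
    """Count distinct pages touched when writing one pixel in each row."""
    return len({(base + y * row_bytes) // page_size for y in range(rows)})


KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

# Assumed geometry: each row of pixels occupies exactly one 4 KB page.
with_4kb_pages = pages_touched_by_vertical_line(0, 4 * KB, 768, 4 * KB)
with_2mb_pages = pages_touched_by_vertical_line(0, 4 * KB, 768, 2 * MB)
with_1gb_pages = pages_touched_by_vertical_line(0, 4 * KB, 768, 1 * GB)
```

With these assumptions, the line needs 768 TLB entries using 4 KB pages, two entries using 2 MB superpages, and a single entry using a 1 GB superpage.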
8.3.3 TLB Consistency
Whenever we introduce a cache into a system, we need to consider how to ensure consistency of the cache with the original data when the entries are modified. A TLB is no exception. For secure and correct program execution, the operating system must ensure that each program sees its memory and no one else’s. Any inconsistency between the TLB, the hardware multi-level translation table, and the portable operating system layer is a potential correctness and security flaw.
There are three issues to consider:
Figure 8.14: Operation of a translation lookaside buffer with process IDs. The TLB contains entries for multiple processes; only the entries for the current process are valid. The operating system kernel must change the current process ID when performing a context switch between processes.
Process context switch. What happens on a process context switch? The virtual addresses of the old process are no longer valid, and should no longer be valid, for the new process. Otherwise, the new process will be able to read the old process’s data structures, either causing the new process to crash, or potentially allowing it to scavenge sensitive information such as passwords stored in memory.
On a context switch, we need to change the hardware page table register to point to the new process’s page table. However, the TLB also contains copies of the old process’s page translations and permissions. One approach is to flush the TLB (discard its contents) on every context switch. Since emptying the cache carries a performance penalty, modern processors have a tagged TLB, shown in Figure 8.14. Each entry in a tagged TLB contains the process ID that produced the translation.
With a tagged TLB, the operating system stores the current process ID in a hardware register on each context switch. When performing a lookup, the hardware ignores TLB entries from other processes, but it can reuse any TLB entries that remain from the last time the current process executed.
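A tagged TLB can be modeled as a table keyed by the pair (process ID, virtual page number). The class below is a hypothetical sketch: `current_pid` stands in for the hardware register, and replacement and permissions are omitted to keep the focus on tagging.

```python
class TaggedTLB:
    def __init__(self):
        self.entries = {}         # (pid, vpn) -> physical page frame
        self.current_pid = None   # models the hardware process-ID register

    def context_switch(self, pid):
        # Only the register changes; no entries are flushed.
        self.current_pid = pid

    def install(self, vpn, frame):
        self.entries[(self.current_pid, vpn)] = frame

    def lookup(self, vpn):
        # Entries tagged with another process ID never match.
        return self.entries.get((self.current_pid, vpn))


tlb = TaggedTLB()
tlb.context_switch(1)
tlb.install(vpn=5, frame=42)
tlb.context_switch(2)
seen_by_process_2 = tlb.lookup(5)      # miss: the entry belongs to process 1
tlb.context_switch(1)
seen_again_by_process_1 = tlb.lookup(5)
```

Process 2 cannot use process 1's translation, yet process 1 finds its entry still cached when it runs again.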
Permission reduction. What happens when the operating system modifies an entry in a page table? For the processor’s regular data cache of main memory, special-purpose hardware keeps cached data consistent with the data stored in memory. However, hardware consistency is not usually provided for the TLB; keeping the TLB consistent with the page table is the responsibility of the operating system kernel.
Software involvement is needed for several reasons. First, page table entries can be shared between processes, so a single modification can affect multiple TLB entries (e.g., one for each process sharing the page). Second, the TLB contains only the virtual to physical page mapping; it does not record the address where the mapping came from, so it cannot tell if a write to memory would affect a TLB entry. Even if it did track this information, most stores to memory do not affect the page table, so repeatedly checking each memory store to see if it affects any TLB entry would involve a large amount of overhead that would rarely be needed.
Instead, whenever the operating system changes the page table, it ensures that the TLB does not contain an incorrect mapping.
Nothing needs to be done when the operating system adds permissions to a portion of the virtual address space. For example, the operating system might dynamically extend the heap or the stack by allocating physical memory and changing invalid page table entries to point to the new memory, or the operating system might change a page from read-only to read-write. In these cases, the TLB can be left alone because any references that require the new permissions will either cause the hardware to load the new entries or cause an exception, allowing the operating system to load the new entries.
However, if the operating system needs to reduce permissions to a page, then the kernel needs to ensure the TLB does not have a copy of the old translation before resuming the process. If the page was shared, the kernel needs to ensure that the TLB does not have the copy for any of the process IDs that might have referenced the page. For example, to mark a region of memory as copy-on-write, the operating system must reduce permissions to the region to read-only, and it must remove any entries for that region from the TLB, since the old TLB entries would still be read-write.
Early computers discarded the entire contents of the TLB whenever there was a change to a page table, but more modern architectures, including the x86 and the ARM, support the removal of individual TLB entries.
Figure 8.15: Illustration of the need for TLB shootdown to preserve correct translation behavior. In order for processor 1 to change the translation for page 0x53 in process 0 to read-only, it must remove the entry from its TLB, and it must ensure that no other processor has the old translation in its TLB. To do this, it sends an interprocessor interrupt to each processor, requesting it to remove the old translation. The operating system does not know if a particular TLB contains an entry (e.g., processor 3’s TLB does not contain page 0x53), so it must remove it from all TLBs. The shootdown is complete only when all processors have verified that the old translation has been removed.
TLB shootdown. On a multiprocessor, there is a further complication. Any processor in the system may have a cached copy of a translation in its TLB. Thus, to be safe and correct, whenever a page table entry is modified, the corresponding entry in every processor’s TLB has to be discarded before the change will take effect. Typically, only the current processor can invalidate its own TLB, so removing the entry from all processors on the system requires that the operating system interrupt each processor and request that it remove the entry from its TLB.
This heavyweight operation has its own name: it is a TLB shootdown, illustrated in Figure 8.15. The operating system first modifies the page table, then sends a TLB shootdown request to all of the other processors. Once another processor has ensured that its TLB has been cleaned of any old entries, that processor can resume. The original processor can continue only when all of the processors have acknowledged removing the old entry from their TLB. Since the overhead of a TLB shootdown increases linearly with the number of processors on the system, many operating systems batch TLB shootdown requests, to reduce the frequency of interprocessor interrupts at some increased cost in latency to complete the shootdown.
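The ordering of the shootdown steps can be sketched as a sequential simulation. Everything here is an assumption for illustration: interprocessor interrupts are modeled as direct method calls, acknowledgments as return values, and the names `Processor` and `tlb_shootdown` are invented. The scenario mirrors Figure 8.15, where processor 3 never cached the page.

```python
class Processor:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tlb = {}                        # (pid, vpn) -> permissions

    def invalidate(self, pid, vpn):
        self.tlb.pop((pid, vpn), None)       # harmless if the entry is absent
        return self.cpu_id                   # acknowledgment


def tlb_shootdown(page_table, cpus, initiator, pid, vpn, new_perms):
    page_table[(pid, vpn)] = new_perms       # 1. modify the page table first
    initiator.invalidate(pid, vpn)           # 2. remove the local TLB entry
    pending = {cpu.cpu_id for cpu in cpus if cpu is not initiator}
    for cpu in cpus:                         # 3. "interrupt" every other processor
        if cpu is not initiator:
            pending.discard(cpu.invalidate(pid, vpn))
    return not pending                       # 4. complete once all have acknowledged


cpus = [Processor(1), Processor(2), Processor(3)]
page_table = {(0, 0x53): "rw"}
cpus[0].tlb[(0, 0x53)] = "rw"
cpus[1].tlb[(0, 0x53)] = "rw"                # processor 3 holds no copy
done = tlb_shootdown(page_table, cpus, cpus[0], pid=0, vpn=0x53, new_perms="r")
```

The request goes to processor 3 even though its TLB holds no copy, because the initiator cannot know which TLBs contain the entry.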
8.3.4 Virtually Addressed Caches
Figure 8.16: Combined operation of a virtually addressed cache, translation lookaside buffer, and hardware page table.
Another step to improving the performance of address translation is to include a virtually addressed cache before the TLB is consulted, as shown in Figure 8.16. A virtually addressed cache stores a copy of the contents of physical memory, indexed by the virtual address. When there is a match, the processor can use the data immediately, without waiting for a TLB lookup or page table translation to generate a physical address, and without waiting to retrieve the data from main memory. Almost all modern multicore chips include a small, virtually addressed on-chip cache near each processor core. Often, like the TLB, the virtually addressed cache will be split in half, one for instruction lookups and one for data.
The same consistency issues that apply to TLBs also apply to virtually addressed caches:
Process context switch. Entries in the virtually addressed cache must either be tagged with the process ID or they must be invalidated on a context switch to prevent the new process from accessing the old process’s data.
Permission reduction and shootdown. When the operating system changes the permission for a page in the page table, the virtual cache will not reflect that change. Invalidating the affected cache entries would require either flushing the entire cache or finding all memory locations stored in the cache on the affected page, both relatively heavyweight operations.
Instead, most systems with virtually addressed caches use them in tandem with the TLB. Each virtual address is looked up in both the cache and the TLB at the same time; the TLB specifies the permissions to use, while the cache provides the data if the access is permitted. This way, only the TLB’s permissions need to be kept up to date. The TLB and virtual cache are co-designed to take the same amount of time to perform a lookup, so the processor does not stall waiting for the TLB.
A further issue is aliasing. Many operating systems allow processes sharing memory to use different virtual addresses to refer to the same memory location. This is called a memory address alias. Each process will have its own TLB entry for that memory, and the virtual cache may store a copy of the memory for each process. The problem occurs when one process modifies its copy; how does the system know to update the other copy?
The most common solution to this issue is to store the physical address along with the virtual address in the virtual cache. In parallel with the virtual cache lookup, the TLB is consulted to generate the physical address and page permissions. On a store instruction modifying data in the virtual cache, the system can do a reverse lookup to find all the entries that match the same physical address, to allow it to update those entries.
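The reverse lookup can be pictured as a second index over the cache, keyed by physical address. The model below is a simplified assumption: cache lines hold a single value, the physical address passed to `store` is taken to come from the parallel TLB lookup, and the class and method names are invented for this sketch.

```python
class VirtualCache:
    def __init__(self):
        self.lines = {}      # (pid, vaddr) -> [paddr, value]
        self.by_paddr = {}   # paddr -> set of (pid, vaddr) keys, for reverse lookup

    def fill(self, pid, vaddr, paddr, value):
        self.lines[(pid, vaddr)] = [paddr, value]
        self.by_paddr.setdefault(paddr, set()).add((pid, vaddr))

    def load(self, pid, vaddr):
        line = self.lines.get((pid, vaddr))
        return None if line is None else line[1]

    def store(self, pid, vaddr, paddr, value):
        # paddr is assumed to be supplied by the TLB lookup done in parallel.
        self.fill(pid, vaddr, paddr, value)
        for alias in self.by_paddr[paddr]:   # update every alias of this location
            self.lines[alias][1] = value


cache = VirtualCache()
cache.fill(pid=1, vaddr=0x1000, paddr=0x9000, value=5)
cache.fill(pid=2, vaddr=0x7000, paddr=0x9000, value=5)   # alias of the same memory
cache.store(pid=1, vaddr=0x1000, paddr=0x9000, value=6)
```

After the store by process 1, a load by process 2 through its own virtual address sees the new value.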
8.3.5 Physically Addressed Caches
Figure 8.17: Combined operation of a virtually addressed cache, translation lookaside buffer, hardware page table, and physically addressed cache.
Many processor architectures include a physically addressed cache that is consulted as a second-level cache after the virtually addressed cache and TLB, but before main memory. This is illustrated in Figure 8.17. Once the physical address of the memory location is formed from the TLB lookup, the second-level cache is consulted. If there is a match, the value stored at that location can be returned directly to the processor without the need to go to main memory.
With today’s chip densities, an on-chip physically addressed cache can be quite large. In fact, many systems include both a second-level and a third-level physically addressed cache. Typically, the second-level cache is per-core and is optimized for latency; a typical size is 256 KB. The third-level cache is shared among all of the cores on the same chip and will be optimized for size; it can be as large as 2 MB on a modern chip. In other words, the entire UNIX operating system from the 70’s, and all of its applications, would fit on a single modern chip, with no need to ever go to main memory.
Together, these physically addressed caches serve a dual purpose:
Faster memory references. An on-chip physically addressed cache will have a lookup latency that is ten times (2nd level) or three times (3rd level) faster than main memory.
Faster TLB misses. In the event of a TLB miss, the hardware will generate a sequence of lookups through its multiple levels of page tables. Because the page tables are stored in physical memory, they can be cached. Thus, even a TLB miss and page table lookup may be handled entirely on chip.
8.4 Software Protection
An increasing number of systems complement hardware-based address translation with software-based protection mechanisms. Obviously, software-only protection is possible. A machine code interpreter, implemented in software, can simulate the exact behavior of hardware protection. The interpreter could fetch each instruction, interpret it, look each address up in a page table to determine if the instruction is permitted, and if so, execute the instruction. Of course, that would be very slow!
In this section, we ask: are there practical software techniques to execute code within a restricted domain, without relying on hardware address translation? The focus of our discussion will be on using software for providing an efficient protection boundary, as a way of improving computer security. However, the techniques we describe can also be used to provide other operating system services, such as copy-on-write, stack extensibility, recoverable memory, and user-level virtual machines. Once you have the infrastructure to reinterpret references to code and data locations, whether in software or hardware, a number of services become possible.
Hardware protection is nearly universal on modern computers, so it is reasonable to ask, why do we need to implement protection in software?
Simplify hardware. One goal is simple curiosity. Do we really need hardware address translation, or is it just an engineering tradeoff? If software can provide efficient protection, we could eliminate a large amount of hardware complexity and runtime overhead from computers, with a substantial increase in flexibility.
Application-level protection. Even if we need hardware address translation to protect the operating system from misbehaving applications, we often want to run untrusted code within an application. An example is inside a web browser; web pages can contain code to configure the display for a website, but the browser needs to protect itself against malicious or buggy code provided by websites.
Protection inside the kernel. We also sometimes need to run untrusted, or at least less trusted, code inside the kernel. Examples include third-party device drivers and code to customize the behavior of the operating system on behalf of applications. Because the kernel runs with the full capability of the entire machine, any user code run inside the kernel must be protected in software rather than in hardware.
Portable security. The proliferation of consumer devices poses a challenge to application portability. No single operating system runs on every embedded sensor, smartphone, tablet, netbook, laptop, desktop, and server machine. Applications that want to run across a wide range of devices need a common runtime environment that isolates the application from the specifics of the underlying operating system and hardware device. Providing protection as part of the runtime system means that users can download and run applications without concern that the application will corrupt the underlying operating system.
Figure 8.18: Execution of untrusted code inside a region of trusted code. The trusted region can be a process, such as a browser, executing untrusted JavaScript, or the trusted region can be the operating system kernel, executing untrusted packet filters or device drivers.
The need for software protection is widespread enough that it has its own term: how do we provide a software sandbox for executing untrusted code so that it can do its work without causing harm to the rest of the system?
8.4.1 Single Language Operating Systems
A very simple approach to software protection is to restrict all applications to be written in a single, carefully designed programming language. If the language and its environment permit only safe programs to be expressed, and the compiler and runtime system are trustworthy, then no hardware protection is needed.
Figure 8.19: Execution of a packet filter inside the kernel. A packet filter can be installed by a network debugger to trace packets for a particular user or application. Packet headers matching the filter are copied to the debugger, while normal packet processing continues unaffected.
A practical example of this approach that is still in wide use is UNIX packet filters, shown in Figure 8.19. UNIX packet filters allow users to download code into the operating system kernel to customize kernel network processing. For example, a packet filter can be installed in the kernel to make a copy of packet headers arriving for a particular connection and to send those to a user-level debugger.
A UNIX packet filter is typically only a small amount of code, but because it needs to run in kernel-mode, the system cannot rely on hardware protection to prevent a misbehaving packet filter from causing havoc to unrelated applications. Instead, the system restricts the packet filter language to permit only safe packet filters. For example, filters may only branch on the contents of packets and no loops are allowed. Since the filters are typically short, the overhead of using an interpreted language is not prohibitive.
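One way to enforce a "no loops" rule is to accept a filter only if every branch jumps strictly forward to an in-range instruction. The checker below is a sketch under that assumption; the tiny instruction set (`ld`, `jeq`, `jmp`, `ret`) and the tuple encoding are hypothetical and are not the actual UNIX packet filter format, whose validator may check more than this.

```python
JUMP_OPS = {"jmp", "jeq"}    # the last operand is a forward skip count
OTHER_OPS = {"ld", "ret"}


def is_safe_filter(program):
    """Accept only programs whose branches all go forward and stay in range."""
    if not program or program[-1][0] != "ret":
        return False                         # execution must end at a return
    for pc, instr in enumerate(program):
        op = instr[0]
        if op in JUMP_OPS:
            skip = instr[-1]
            target = pc + 1 + skip
            if skip < 0 or target >= len(program):
                return False                 # backward (loop) or out-of-range branch
        elif op not in OTHER_OPS:
            return False                     # unknown instruction
    return True


accept_some_packets = [("ld", 12), ("jeq", 0x0800, 1), ("ret", 0), ("ret", 64)]
looping_filter = [("ld", 0), ("jmp", -2), ("ret", 0)]
```

Because the program counter can only increase, an accepted filter runs for at most as many steps as it has instructions, so it cannot loop forever inside the kernel.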
Figure 8.20: Execution of a JavaScript program inside a modern web browser. The JavaScript interpreter is responsible for containing effects of the JavaScript program to its specific page. JavaScript programs can call out to a broad set of routines in the browser, so these routines must also be protected against malicious JavaScript programs.
Another example of the same approach is the use of JavaScript in modern web browsers, illustrated in Figure 8.20. A JavaScript program customizes the user interface and presentation of a website; it is provided by the website, but it executes on the client machine inside the browser. As a result, the browser execution environment for JavaScript must prevent malicious JavaScript programs from taking control over the browser and possibly the rest of the client machine. Since JavaScript programs tend to be relatively short, they are often interpreted; JavaScript can also call into a predefined set of library routines. If a JavaScript program attempts to call a procedure that does not exist or reference arbitrary memory locations, the interpreter will cause a runtime exception and stop the program before any harm can be done.
Several early personal computers were single language systems with protection implemented in software rather than hardware. Most famously, the Xerox Alto research prototype used software and not hardware protection; the Alto inspired the Apple Macintosh, and the language it used, Mesa, was a forerunner of Java. Other systems included the Lisp Machine, a computer that executed only programs written in Lisp, and computers that executed only Smalltalk (a precursor to Python).
Language protection and garbage collection
JavaScript, Lisp, and Smalltalk all provide memory-compacting garbage collection for dynamically created data structures. One motivation for this is programmer convenience and to reduce avoidable programmer error. However, there is a close relationship between software protection and garbage collection. Garbage collection requires the runtime system to keep track of all valid pointers visible to the program, so that data structures can be relocated without affecting program behavior. Programs expressible in the language cannot point to or jump to arbitrary memory locations, as then the behavior of the program would be altered by the garbage collector. Every address generated by the program is necessarily within the region of the application’s code, and every load and store instruction is to the program’s data, and no one else’s. In other words, this is exactly what is needed for software protection!
Unfortunately, language-based software protection has some practical limitations, so that on modern systems, it is often used in tandem with, rather than as a replacement for, hardware protection. Using an interpreted language seems like a safe option, but it requires trust in both the interpreter and its runtime libraries. An interpreter is a complex piece of software, and any flaw in the interpreter could provide a way for a malicious program to gain control over the process, that is, to escape its protection boundary. Such attacks are common for browsers running JavaScript, although over time JavaScript interpreters have become more robust to these types of attacks.
Worse, because running interpreted code is often slow, many interpreted systems put most of their functionality into system libraries that can be compiled into machine code and run directly on the processor. For example, commercial web browsers provide JavaScript programs a huge number of user interface objects, so that the interpreted code is just a small amount of glue. Unfortunately, this increases the attack surface: any library routine that does not completely protect itself against malicious use can be a vector for the program to escape its protection. For example, a JavaScript program could attempt to cause a library routine to overwrite the end of a buffer, and depending on what was stored in memory, that might provide a way for the JavaScript program to gain control of the system. These types of attacks against JavaScript runtime libraries are widespread.
This leads most systems to use both hardware and software protection. For example, Microsoft Windows runs its web browser in a special process with restricted permissions. This way, if a system administrator visits a website containing a malicious JavaScript program, even if the program takes over the browser, it cannot store files or do other operations that would normally be available to the system administrator. We know a computer security expert who runs each new web page in a separate virtual machine; even if the web page contains a virus that takes over the browser, and the browser is able to take over the operating system, the original, uninfected, operating system can be automatically restored by resetting the virtual machine.
Cross-site scripting
Another JavaScript attack makes use of the storage interface provided to JavaScript programs. To allow JavaScript programs to communicate with each other, they can store data in cookies in the browser. For some websites, these cookies can contain sensitive information such as the user’s login authentication. A JavaScript program that can gain access to a user’s cookies can potentially pretend to be the user, and therefore access the user’s sensitive data stored at the server. If a website is compromised, it can be modified to serve pages containing a JavaScript program that gathers and exploits the user’s sensitive data. These are called cross-site scripting attacks, and they are widespread.
Figure 8.21: Design of the Xerox Alto operating system. Application programs and most of the operating system were implemented in a type-safe programming language called Mesa; Mesa isolated most errors to the module that caused the error.
A related approach is to write all the software on a system in a single, safe language, and then to compile the code into machine instructions that execute directly on the processor. Unlike interpreted languages, the libraries themselves can be written in the safe language. The Xerox Alto took this approach: both applications and the entire operating system were written in the same language, Mesa. Like Java, Mesa had support for thread synchronization built directly into the language. Even with this, however, there are practical issues. You still need to do defensive programming at the trust boundary, between untrusted application code (written in the safe language) and trusted operating system code (written in the safe language). You also need to be able to trust the compiler to generate correct code that enforces protection; any weakness in the compiler could allow a buggy program to crash the system. The designers of the Alto built a successor system, called the Digital Equipment Firefly, which used a successor language to Mesa, called Modula-2, for implementing both applications and the operating system. However, for an extra level of protection, the Firefly also used hardware protection to isolate applications from the operating system kernel.
8.4.2 Language-Independent Software Fault Isolation
A limitation of trusting a language and its interpreter or compiler to provide safety is that many programmers value the flexibility to choose their own programming language. For example, some might use Ruby for configuring web servers, Matlab or Python for writing scientific code, or C++ for large software engineering efforts.
Since it would be impractical for the operating system to trust every compiler for every possible language, can we efficiently isolate application code, in software without hardware support, in a programming language independent fashion?
One reason for considering this is that there are many cases where systems need an extra level of protection within a process. We saw an example of this with web browsers needing to safely execute JavaScript programs, but there are many other examples. With software protection, we could give users the ability to customize the operating system by downloading code into the kernel, as with packet filters, but on a more widespread basis. Kernel device drivers have been shown to be the primary cause of operating system crashes; providing a way for the kernel to execute device drivers in a restricted environment could potentially cut down on the severity of these faults. Likewise, many complex software packages such as databases, spreadsheets, desktop publishing systems, and systems for computer-aided design, provide their users a way to download code into the system to customize and configure the system’s behavior to meet the user’s specific needs. If this downloaded code causes the system to crash, the user will not be able to tell who is really at fault and is likely to end up blaming the vendor.
Of course, one way to do this is to rely on the JavaScript interpreter. Tools exist to compile code written in one language, like C or C++, into JavaScript. This lets applications written in those languages run on any browser that supports JavaScript. If executing JavaScript were safe and fast enough, then we could declare ourselves done.
In this section, we discuss an alternate approach: can we take any chunk of machine instructions and modify it to ensure that the code does not touch any memory outside of its own region of data? That way, the code could be written in any language, compiled by any compiler, and directly execute at the full speed of the processor.
Both Google and Microsoft have products that accomplish this: a sandbox that can run code written in any programming language, executed safely inside a process. Google’s product is called Native Client; Microsoft’s is called Application Domains. These implementations are efficient: Google reports that the runtime overhead of executing code safely inside a sandbox is less than 10%.
For simplicity of our discussion, we will assume that the memory region for the sandbox is contiguous, that is, the sandbox has a base and bound that needs to be enforced in software. Because we can disallow the execution of obviously malicious code, we can start by checking that the code in the sandbox does not use self-modifying instructions or privileged instructions.
We proceed in two steps. First, we insert machine instructions into the executable to do what hardware protection would have done, that is, to check that each address is legally within the region specified by the base and bounds, and to raise an exception if not. Second, we use control and data flow analysis to remove checks that are not strictly necessary for the sandbox to be correct. This mirrors what we did for hardware translation: first, we designed a general-purpose and flexible mechanism, and then we showed how to optimize it using TLBs so that the full translation mechanism was not needed on every instruction.
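The added instructions for every load and store instruction are simple: just add a check that the address to be used by each load or store instruction is within the correct region of data. For example, if a machine register r1 holds the address, the inserted instructions compare r1 against the base and the bound of the data region and raise an exception if it falls outside.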
Notethatthestoreinstructionsmustbelimitedtojustthedataregionofthesandbox;otherwiseastorecouldmodifytheinstructionsequence,e.g.,tocauseajumpoutoftheprotectedregion.
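A minimal sketch of such a check, written in Python rather than as machine instructions. The names `SandboxViolation`, `checked_load`, and `checked_store`, and the region constants, are hypothetical; `r1` stands for the register holding the address. A real sandbox would insert the equivalent compare-and-branch instructions before each load or store.

```python
class SandboxViolation(Exception):
    """Raised when sandboxed code uses an address outside its data region."""


# Hypothetical base and bound of the sandbox's contiguous data region.
DATA_BASE, DATA_BOUND = 0x2000, 0x4000


def check_address(r1):
    """The inserted check: r1 must satisfy base <= r1 < bound."""
    if not (DATA_BASE <= r1 < DATA_BOUND):
        raise SandboxViolation(f"address {r1:#x} is outside the data region")


def checked_load(memory, r1):
    """Load from address r1 after verifying it."""
    check_address(r1)
    return memory.get(r1, 0)


def checked_store(memory, r1, value):
    """Store to address r1 after verifying it; the code region is never writable."""
    check_address(r1)
    memory[r1] = value
```

Here `memory` is an ordinary dictionary standing in for the sandbox's memory. A store aimed at the code region fails the same check as any other out-of-region address, which is what keeps the instruction sequence from being modified.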
Wealsoneedtocheckindirectbranchinstructions.Weneedtomakesuretheprogramcannotbranchoutsideofthesandboxexceptforpredefinedentryandexitpoints.Relativebranchesandnamedprocedurecallscanbedirectlyverified.Indirectbranchesandprocedurereturnsjumptoalocationstoredinaregisterorinmemory;theaddressmustbecheckedbeforeuse.
As a final detail, the above code verifies that indirect branch instructions stay within the code region. This turns out to be insufficient for protection, for two reasons. First, x86 code is byte addressable, and if you allow a jump to the middle of an instruction, you cannot be guaranteed as to what the code will do. In particular, an erroneous or malicious program might jump to the middle of an instruction, whose bytes would cause the processor to jump outside of the protected region. Although this may seem unlikely, remember that the attacker has the advantage; the attacker can try various code sequences to see if that causes an escape from the sandbox. A second issue is that an indirect branch might jump past the protection checks for a load or store instruction. We can prevent both of these by doing all indirect jumps through a table that only contains valid entry points into the code; of course, the table must also be protected from being modified by the code in the sandbox.
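The table-based approach can be sketched as follows. This is an illustration under assumptions, not the mechanism of any particular product: the table contents and the name `checked_indirect_jump` are hypothetical, and the sandboxed code is assumed to supply a table index rather than a raw address.

```python
class SandboxViolation(Exception):
    """Raised when sandboxed code attempts an illegal control transfer."""


# Hypothetical table built by the sandbox when it rewrites the code. It is a
# tuple kept outside the sandbox's writable data, so sandboxed stores cannot
# change it. Every entry is the first byte of an instruction, and no entry
# points just past a protection check.
VALID_ENTRY_POINTS = (0x1000, 0x1040, 0x10A8)


def checked_indirect_jump(table_index):
    """Translate a table index into a jump target, rejecting bad indices."""
    if not (0 <= table_index < len(VALID_ENTRY_POINTS)):
        raise SandboxViolation(f"bad jump table index {table_index}")
    return VALID_ENTRY_POINTS[table_index]
```

Because the only reachable targets are those listed in the table, a jump into the middle of an instruction or past a load/store check cannot be expressed at all.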
Nowthatwehavelogicalcorrectness,wecanruncontrolanddataflowanalysistoeliminatemanyoftheextrainsertedinstructions,ifitcanbeproventhattheyarenotneeded.Examplesofpossibleoptimizationsinclude:
Loopinvariants.Ifaloopstridesthroughmemory,thesandboxmaybeabletoprovewithasimpletestatthebeginningoftheloopthatallmemoryaccessesintheloopwillbewithintheprotectedregion.
Returnvalues.Ifstaticcodeanalysisofaprocedurecanprovethattheproceduredoesnotmodifythereturnprogramcounterstoredonthestack,thereturncanbemadesafelywithoutfurtherchecks.
Cross-procedurechecks.Ifthecodeanalysiscanprovethataparameterisalwayscheckedbeforeitispassedasanargumenttoasubroutine,itneednotbecheckedwhenitisusedinsidetheprocedure.
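As an illustration of the loop-invariant case, the sketch below compares a loop that checks every access with one that performs a single test before the loop. The function names and region constants are hypothetical, and the optimized version assumes a positive stride, so that checking the first and last addresses covers every address in between.

```python
class SandboxViolation(Exception):
    """Raised when sandboxed code would touch memory outside its region."""


DATA_BASE, DATA_BOUND = 0x2000, 0x4000  # hypothetical sandbox data region


def in_region(addr):
    return DATA_BASE <= addr < DATA_BOUND


def sum_checked_every_access(memory, start, stride, count):
    """Unoptimized: one bounds check per load inside the loop."""
    total = 0
    for i in range(count):
        addr = start + i * stride
        if not in_region(addr):
            raise SandboxViolation(f"address {addr:#x} outside sandbox")
        total += memory.get(addr, 0)
    return total


def sum_checked_once(memory, start, stride, count):
    """Optimized: a single test before the loop covers all accesses."""
    if stride <= 0:
        raise ValueError("this sketch handles positive strides only")
    if count > 0:
        last = start + (count - 1) * stride
        if not (in_region(start) and in_region(last)):
            raise SandboxViolation("loop would leave the sandbox")
    # No per-access checks are needed below this point.
    return sum(memory.get(start + i * stride, 0) for i in range(count))
```

The two versions differ in one visible way: the hoisted check rejects an out-of-bounds loop before any load executes, while the per-access version fails partway through.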
Virtualmachineswithoutkernelsupport
Modifyingmachinecodetotransparentlychangethebehaviorofaprogram,whilestillenforcingprotection,canbeusedforotherpurposes.Oneapplicationistransparentlyexecutingaguestoperatingsysteminsideauser-levelprocesswithoutkernelsupport.
Normally,whenwerunaguestoperatingsysteminavirtualmachine,thehardwarecatchesanyprivilegedinstructionsexecutedbytheguestkernelandtrapsintothehostkernel.Thehostkernelemulatestheinstructionsandreturnscontrolbacktotheguestkernelattheinstructionimmediatelyafterthehardwareexception.Thisallowsthehostkerneltoemulateprivilegelevels,interrupts,exceptions,andkernelmanagementofhardwarepagetables.
What happens if we are running on top of an operating system that does not support a virtual machine? We can still emulate a virtual machine by modifying the machine code of the guest operating system kernel. For example, we can convert instructions to enable and disable interrupts into no-ops. We can replace an instruction that starts executing a user program with code that takes the contents of the application memory, rewrites those contents into a user-level sandbox, and starts it executing. From the perspective of the guest kernel, the application program execution looks normal; it is the sandbox that keeps the application program from corrupting the guest kernel’s data structures and passes control to the guest kernel when the application makes a system call.
Because of the widespread use of virtual machines, some hardware architectures have begun to add support for directly executing guest operating systems in user-mode without kernel support. We will return to this issue in a later chapter, as it is closely related to the topic of stackable virtual machines: how do we manipulate page tables to handle the case where the guest operating system is itself a virtual machine monitor running a virtual machine?
8.4.3SandboxesViaIntermediateCode
Toimproveportability,bothMicrosoftandGooglecanconstructtheirsandboxesfromintermediatecodegeneratedbythecompiler.Thismakesiteasierforthesystemtodothecodemodificationanddataflowanalysistoenforcethesandbox.Insteadofgeneratingx86orARMcodedirectly,thevariouscompilersgeneratetheircodeintheintermediatelanguage,andthesandboxruntimeconvertsthatintosandboxedcodeonthespecificprocessorarchitecture.
Theintermediaterepresentationcanbethoughtofasavirtualmachine,withasimplerinstructionset.Fromthecompilerperspective,itisaseasytogeneratecodeforthevirtualmachineasitwouldbetogodirectlytox86orARMinstructions.Fromthesandboxperspectivethough,usingavirtualmachineastheintermediaterepresentationismuchsimpler.Theintermediatecodecanincludeannotationsastowhichpointerscanbeproventobesafeandwhichmustbecheckedbeforeuse.Forexample,pointersinaCprogramwouldrequireruntimecheckswhilethememoryreferencesinaJavaprogrammaybeabletobestaticallyprovenassafefromthestructureofthecode.
Microsofthascompilersforvirtuallyeverycommerciallyimportantprogramminglanguage.Toavoidtrustingallofthesecompilerswiththesafetyofthesystem,theruntimeisresponsibleforvalidatinganyofthetypeinformationneededforefficientcodegenerationforthesandbox.Typically,verifyingthecorrectnessofstaticanalysisismuchsimplerthangeneratingitinthefirstplace.
TheJavavirtualmachine(JVM)isalsoakindofsandbox;Javacodeistranslatedintointermediatebytecodeinstructionsthatcanbeverifiedatruntimeasbeingsafelycontainedinthesandbox.SeverallanguageshavebeencompiledintoJavabytecode,suchasPython,Ruby,andJavaScript.Thus,aJVMcanalsobeconsideredalanguage-independentsandbox.However,becauseofthestructureoftheintermediaterepresentationinJava,itismoredifficulttogeneratecorrectJavabytecodeforlanguagessuchasCorFortran.
8.5SummaryandFutureDirections
Addresstranslationisapowerfulabstractionenablingawidevarietyofoperatingsystemservices.Itwasoriginallydesignedtoprovideisolationbetweenprocessesandtoprotecttheoperatingsystemkernelfrommisbehavingapplications,butitismorewidelyapplicable.Itisnowusedtosimplifymemorymanagement,tospeedinterprocesscommunication,toprovideforefficientsharedlibraries,tomapfilesdirectlyintomemory,andahostofotheruses.
A huge challenge to effective hardware address translation is the cumulative effect of decades of Moore’s Law: both servers and desktop computers today contain vast amounts of memory. Processes are now able to map their code, data, heap, shared libraries, and files directly into memory. Each of these segments can be dynamic; they can be shared across processes or private to a single process. To handle these demands, hardware systems have converged on a two-tier structure: a multi-level segment and page table to provide very flexible but space-efficient lookup, along with a TLB to provide time-efficient lookup for repeated translations of the same page.
Muchofwhatwecandoinhardwarewecanalsodoinsoftware;acombinationofhardwareandsoftwareprotectionhasprovenattractiveinanumberofcontexts.Modernwebbrowsersexecutecodeembeddedinwebpagesinasoftwaresandboxthatpreventsthecodefrominfectingthebrowser;theoperatingsystemuseshardwareprotectiontoprovideanextralevelofdefenseincasethebrowseritselfiscompromised.
Thefuturetrendsareclear:
Verylargememorysystems.Thecostofagigabyteofmemoryislikelytocontinuetoplummet,makingeverlargermemorysystemspractical.Overthepastfewdecades,theamountofmemorypersystemhasalmostdoubledeachyear.Wearelikelytolookbackattoday’scomputersandwonderhowwecouldhavegottenbywithaslittleasagigabyteofDRAM!Thesemassivememorieswillrequireeverdeepermulti-levelpagetables.Fortunately,thesametrendsthatmakeitpossibletobuildgiganticmemoriesalsomakeitpossibletodesignverylargeTLBstohidetheincreasingdepthofthelookuptrees.
Multiprocessors.Ontheotherhand,multiprocessorswillmeanthatmaintainingTLBconsistencywillbecomeincreasinglyexpensive.Akeyassumptionforusingpagetableprotectionhardwareforimplementingcopy-on-writeandfill-on-demandisthatthecostofmodifyingpagetableentriesismodest.OnepossibilityisthathardwarewillbeaddedtosystemstomakeTLBshootdownamuchcheaperoperation,e.g.,bymakingTLBscachecoherent.Anotherpossibilityistofollowthetrendtowardssoftwaresandboxes.IfTLBshootdownremainsexpensive,wemaystarttoseecopy-on-writeandotherfeaturesimplementedinsoftwareratherthanhardware.
User-level sandboxes. Applications like browsers that run untrusted code are becoming increasingly prevalent. Operating systems have only recently begun to recognize the need to support these types of applications. Software protection has become common, both at the language level with JavaScript, and in the runtime system with Native Client and Application Domains. As these technologies become more widely used, it seems likely we may see direct hardware support for application-level protection, allowing each application to set up its own protected execution environment, but enforced in hardware. If so, we may come to think of many applications as having their own embedded operating system, and the underlying operating system kernel as mediating between these operating systems.
Exercises
1. Trueorfalse.Avirtualmemorysystemthatusespagingisvulnerabletoexternalfragmentation.Whyorwhynot?
2. Forsystemsthatusepagedsegmentation,whattranslationstatedoesthekernelneedtochangeonaprocesscontextswitch?
3. Forthethree-levelSPARCpagetable,whattranslationstatedoesthekernelneedtochangeonaprocesscontextswitch?
4. Describetheadvantagesofanarchitecturethatincorporatessegmentationandpagingoveronesthatareeitherpurepagingorpuresegmentation.Presentyouranswerasseparatelistsofadvantagesovereachofthepureschemes.
5. Foracomputerarchitecturewithmulti-levelpaging,apagesizeof4KB,and64-bitphysicalandvirtualaddresses:
a. Listtherequiredandoptionalfieldsofitspagetableentry,alongwiththenumberofbitsperfield.
b. Assuming a compact encoding, what is the smallest possible size for a page table entry in bytes, rounded up to an even number?
c. Assumingarequirementthateachpagetablefitsintoasinglepage,andgivenyouranswerabove,howmanylevelsofpagetableswouldberequiredtocompletelymapthe64-bitvirtualaddressspace?
6. Considerthefollowingpieceofcodewhichmultipliestwomatrices:
Assumethatthebinaryforexecutingthisfunctionfitsinonepageandthatthestackalsofitsinonepage.Assumethatstoringafloatingpointnumbertakes4bytesofmemory.Ifthepagesizeis4KB,theTLBhas8entries,andtheTLBalwayskeepsthemostrecentlyusedpages,computethenumberofTLBmissesassumingtheTLBisinitiallyempty.
7. Ofthefollowingitems,whicharestoredinthethreadcontrolblock,whicharestoredintheprocesscontrolblock,andwhichinneither?
a. Page table pointer
b. Page table
c. Stack pointer
d. Segment table
e. Ready list
f. CPU registers
g. Program counter
8. Draw the segment and page table for the 32-bit Intel architecture.
9. Draw the segment and page table for the 64-bit Intel architecture.
10. Foracomputerarchitecturewithmulti-levelpaging,apagesizeof4KB,and64-bitphysicalandvirtualaddresses:
a. Whatisthesmallestpossiblesizeforapagetableentry,roundeduptoapoweroftwo?
b. Usingyourresultabove,andassumingarequirementthateachpagetablefitsintoasinglepage,howmanylevelsofpagetableswouldberequiredtocompletelymapthe64-bitvirtualaddressspace?
11. Supposeyouaredesigningasystemwithpagedsegmentation,andyouanticipatethememorysegmentsizewillbeuniformlydistributedbetween0and4GB.Theoverheadofthedesignisthesumoftheinternalfragmentationandthespacetakenupbythepagetables.Ifeachpagetableentryusesfourbytesperpage,whatpagesizeminimizesoverhead?
12. Inanarchitecturewithpagedsegmentation,the32-bitvirtualaddressisdividedintofieldsasfollows:
| 4 bit segment number | 12 bit page number | 16 bit offset |
The segment and page tables are as follows (all values in hexadecimal):
Segment Table          Page Table A           Page Table B
0  Page Table A        0  CAFE                0  F000
1  Page Table B        1  DEAD                1  D8BF
x  (rest invalid)      2  BEEF                2  3333
                       3  BA11                x  (rest invalid)
                       x  (rest invalid)
Findthephysicaladdresscorrespondingtoeachofthefollowingvirtualaddresses(answer“invalidvirtualaddress”ifthevirtualaddressisinvalid):
a. 00000000
b. 20022002
c. 10015555
13. Supposeamachinewith32-bitvirtualaddressesand40-bitphysicaladdressesisdesignedwithatwo-levelpagetable,subdividingthevirtualaddressintothreepiecesasfollows:
| 10bitpagetablenumber | 10bitpagenumber | 12bitoffset |
Thefirst10bitsaretheindexintothetop-levelpagetable,thesecond10bitsaretheindexintothesecond-levelpagetable,andthelast12bitsaretheoffsetintothepage.Thereare4protectionbitsperpage,soeachpagetableentrytakes4bytes.
a. What is the page size in this system?
b. How much memory is consumed by the first and second level page tables and wasted by internal fragmentation for a process that has 64K of memory starting at address 0?
c. Howmuchmemoryisconsumedbythefirstandsecondlevelpagetablesandwastedbyinternalfragmentationforaprocessthathasacodesegmentof48Kstartingataddress0x1000000,adatasegmentof600Kstartingataddress0x80000000andastacksegmentof64Kstartingataddress0xf0000000andgrowingupward(towardshigheraddresses)?
14. Writepseudo-codetoconverta32-bitvirtualaddresstoa32-bitphysicaladdressforatwo-leveladdresstranslationschemeusingsegmentationatthefirstleveloftranslationandpagingatthesecondlevel.Explicitlydefinewhateverconstantsanddatastructuresyouneed(e.g.,theformatofthepagetableentry,thepagesize,andsoforth).
9.CachingandVirtualMemory
Cashisking.—PerGyllenhammar
Somemayarguethatwenolongerneedachapteroncachingandvirtualmemoryinanoperatingsystemstextbook.Afterall,moststudentswillhaveseencachesinanearliermachinestructuresclass,andmostdesktopsandlaptopsareconfiguredsothattheyonlyveryrarely,ifever,runoutofmemory.Maybecachingisnolongeranoperatingsystemstopic?
Wecouldnotdisagreemore.Cachesarecentraltothedesignofahugenumberofhardwareandsoftwaresystems,includingoperatingsystems,Internetnaming,webclients,andwebservers.Inparticular,smartphoneoperatingsystemsareoftenmemoryconstrainedandmustmanagememorycarefully.Serveroperatingsystemsmakeextensiveuseofremotememoryandremotediskacrossthedatacenter,usingthelocalservermemoryasacache.Evendesktopoperatingsystemsusecachingextensivelyintheimplementationofthefilesystem.Mostimportantly,understandingwhencachesworkandwhentheydonotisessentialtoeverycomputersystemsdesigner.
ConsideratypicalFacebookpage.Itcontainsinformationaboutyou,yourinterestsandprivacysettings,yourposts,andyourphotos,plusyourlistoffriends,theirinterestsandprivacysettings,theirposts,andtheirphotos.Inturn,yourfriends’pagescontainanoverlappingviewofmuchofthesamedata,andinturn,theirfriends’pagesareconstructedthesameway.
NowconsiderhowFacebookorganizesitsdatatomakeallofthiswork.HowdoesFacebookassemblethedataneededtodisplayapage?Oneoptionwouldbetokeepallofthedataforaparticularuser’spageinoneplace.However,theinformationthatIneedtodrawmypageoverlapswiththeinformationthatmyfriends’friendsneedtodrawtheirpages.Myfriends’friends’friends’friendsincludeprettymuchtheentireplanet.Wecaneitherstoreeveryone’sdatainoneplaceorspreadthedataaround.Eitherway,performancewillsuffer!IfwestoreallthedatainCalifornia,FacebookwillbeslowforeveryonefromEurope,andviceversa.Equally,integratingdatafrommanydifferentlocationsisalsolikelytobeslow,especiallyforFacebook’smorecosmopolitanusers.
Toresolvethisdilemma,Facebookmakesheavyuseofcaches;itwouldnotbepracticalwithoutthem.Acacheisacopyofacomputationordatathatcanbeaccessedmorequicklythantheoriginal.Whileanyobjectonmypagemightchangefrommomenttomoment,itseldomdoes.Inthecommoncase,Facebookreliesonalocal,cachedcopyofthedataformypage;itonlygoesbacktotheoriginalsourceifthedataisnotstoredlocallyorbecomesoutofdate.
Caches work because both users and programs are predictable. You (probably!) do not change your friend list every nanosecond; if you did, Facebook could still cache your friend list, but it would be out of date before it could be used again, and so it would not help. If everyone changed their friends every nanosecond, Facebook would be out of luck! In most cases, however, what users do now is predictive of what they are likely to do soon, and what programs do now is predictive of what they will do next. This provides an opportunity for a cache to save work through reuse.
Facebookisnotaloneinmakingextensiveuseofcaches.Almostalllargecomputersystemsrelyoncaches.Infact,itishardtothinkofanywidelyused,complexhardwareorsoftwaresystemthatdoesnotincludeacacheofsomesort.
Wesawthreeexamplesofhardwarecachesinthepreviouschapter:
TLBs.Modernprocessorsuseatranslationlookasidebuffer,orTLB,tocachetherecentresultsofmulti-levelpagetableaddresstranslation.Providedprogramsreferencethesamepagesrepeatedly,translatinganaddressisasfastasasingletablelookupinthecommoncase.Thefullmulti-levellookupisneededonlyinthecasewheretheTLBdoesnotcontaintherelevantaddresstranslation.
Virtuallyaddressedcaches.Mostmodernprocessordesignstakethisideaastepfartherbyincludingavirtuallyaddressedcacheclosetotheprocessor.Eachentryinthecachestoresthememoryvalueassociatedwithavirtualaddress,allowingthatvaluetobereturnedmorequicklytotheprocessorwhenneeded.Forexample,therepeatedinstructionfetchesinsidealooparewellhandledbyavirtuallyaddressedcache.
Physicallyaddressedcaches.Mostmodernprocessorscomplementthevirtuallyaddressedcachewithasecond-(andsometimesthird-)levelphysicallyaddressedcache.Eachentryinaphysicallyaddressedcachestoresthememoryvalueassociatedwithaphysicalmemorylocation.Inthecommoncase,thisallowsthememoryvaluetobereturneddirectlytotheprocessorwithouttheneedtogotomainmemory.
Therearemanymoreexamplesofcaches:
Internetnaming.Wheneveryoutypeinawebrequestorclickonalink,theclientcomputerneedstotranslatethenameinthelink(e.g.,amazon.com)toanIPnetworkaddressofwheretosendeachpacket.Theclientgetsthisinformationfromanetworkservice,calledtheDomainNameSystem(DNS),andthencachesthetranslationsothattheclientcangodirectlytothewebserverinthecommoncase.
Webcontent.WebclientscachecopiesofHTML,images,JavaScriptprograms,andotherdatasothatwebpagescanberefreshedmorequickly,usinglessbandwidth.Webserversalsokeepcopiesoffrequentlyrequestedpagesinmemorysothattheycanbetransmittedmorequickly.
Websearch.BothGoogleandBingkeepacachedcopyofeverywebpagetheyindex.Thisallowsthemtoprovidethecopyofthewebpageiftheoriginalisunavailableforsomereason.Thecachedcopymaybeoutofdate—thesearchenginesdonotguaranteethatthecopyinstantaneouslyreflectsanychangeintheoriginalwebpage.
Email clients. Many email clients store a copy of mail messages on the client computer to improve the client performance and to allow disconnected operation. In the background, the client communicates with the server to keep the two copies in sync.
Incrementalcompilation.Ifyouhaveeverbuiltaprogramfrommultiplesourcefiles,youhaveusedcaching.Thebuildmanagersavesandreusestheindividualobjectfilesinsteadofrecompilingeverythingfromscratcheachtime.
Justintimetranslation.Somememory-constraineddevicessuchassmartphonesdonotcontainenoughmemorytostoretheentireexecutableimageforsomeprograms.Instead,systemssuchastheGoogleAndroidoperatingsystemandtheARMruntimestoreprogramsinamorecompactintermediaterepresentation,andconvertpartsoftheprogramtomachinecodeasneeded.Repeateduseofthesamecodeisfastbecauseofcaching;ifthesystemrunsoutofmemory,lessfrequentlyusedcodemaybeconvertedeachtimeitisneeded.
Virtualmemory.Operatingsystemscanrunprogramsthatdonotfitinphysicalmemorybyusingmainmemoryasacachefordisk.Applicationpagesthatfitinmemoryhavetheirpagetableentriessettovalid;thesepagescanbeaccesseddirectlybytheprocessor.Thosepagesthatdonotfithavetheirpermissionssettoinvalid,triggeringatraptotheoperatingsystemkernel.Thekernelwillthenfetchtherequiredpagefromdiskandresumetheapplicationattheinstructionthatcausedthetrap.
Filesystems.Filesystemsalsotreatmemoryasacachefordisk.Theystorecopiesinmemoryoffrequentlyuseddirectoriesandfiles,reducingtheneedfordiskaccesses.
Conditionalbranchprediction.Anotheruseofcachesisinpredictingwhetheraconditionalbranchwillbetakenornot.Inthecommoncaseofacorrectprediction,theprocessorcanstartdecodingthenextinstructionbeforetheresultofthebranchisknownforsure;ifthepredictionturnsouttobewrong,thedecodingisrestartedwiththecorrectnextinstruction.
In other words, caches are a central design technique for making computer systems faster. However, caches are not without their downsides. Caches can make understanding the performance of a system much harder. Something that seems like it should be fast, and even something that usually is fast, can end up being very slow if most of the data is not in the cache. Because the details of the cache are often hidden behind a level of abstraction, the user or the programmer may have little idea as to what is causing the poor performance. Put differently, the abstraction of fast access to data can cause problems if the abstraction does not live up to its promise. One of our aims is to help you understand when caches do and do not work well.
Inthischapter,wewillfocusonthecachingofmemoryvalues,buttheprincipleswediscussapplymuchmorewidely.Memorycachingiscommoninbothhardware(bytheprocessortoimprovememorylatency)andinsoftware(bytheoperatingsystemtohidediskandnetworklatency).Further,thestructureandorganizationofprocessorcachesrequiresspecialcarebytheoperatingsysteminsettinguppagetables;otherwise,muchoftheadvantageofprocessorcachescanevaporate.
Regardlessofthecontext,allcachesfacethreedesignchallenges:
Locatingthecachedcopy.Becausecachesaredesignedtoimproveperformance,akeyquestionisoftenhowtoquicklydeterminewhetherthecachecontainstheneededdataornot.Becausetheprocessorconsultsatleastonehardwarecacheoneveryinstruction,hardwarecachesinparticularareorganizedforefficientlookup.
Replacementpolicy.Mostcacheshavephysicallimitsonhowmanyitemstheycanstore;whennewdataarrivesinthecache,thesystemmustdecidewhichdataismostvaluabletokeepinthecacheandwhichcanbereplaced.Becauseofthehighrelativelatencyoffetchingdatafromdisk,operatingsystemsandapplicationshavefocusedmoreattentiononthechoiceofreplacementpolicy.
Coherence. How do we detect, and repair, when a cached copy becomes out of date? This question, cache coherence, is central to the design of multiprocessor and distributed systems. Despite being very important, cache coherence is beyond the scope of this version of the textbook. Instead, we focus on the first two of these issues.
Chapterroadmap:
CacheConcept.Whatoperationsdoesacachedoandhowcanweevaluateitsperformance?(Section9.1)
MemoryHierarchy.Whathardwarebuildingblocksdowehaveinconstructingacacheinanapplicationoroperatingsystem?(Section9.2)
WhenCachesWorkandWhenTheyDoNot.Canwepredicthoweffectiveacachewillbeinasystemwearedesigning?Canweknowinadvancewhencachingwillnotwork?(Section9.3)
MemoryCacheLookup.Whatoptionsdowehaveforlocatingwhetheranitemiscached?Howcanweorganizehardwarecachestoallowforrapidlookup,andwhataretheimplicationsofcacheorganizationforoperatingsystemsandapplications?(Section9.4)
ReplacementPolicies.Whatoptionsdowehaveforchoosingwhichitemtoreplacewhenthereisnomoreroom?(Section9.5)
CaseStudy:Memory-MappedFiles.Howdoestheoperatingsystemprovidetheabstractionoffileaccesswithoutfirstreadingtheentirefileintomemory?(Section9.6)
CaseStudy:VirtualMemory.Howdoestheoperatingsystemprovidetheillusionofanear-infinitememorythatcanbesharedbetweenapplications?Whathappensifbothapplicationsandtheoperatingsystemwanttomanagememoryatthesametime?(Section9.7)
9.1CacheConcept
Figure9.1:Abstractoperationofamemorycacheonareadrequest.Memoryreadrequestsaresenttothecache;thecacheeitherreturnsthevaluestoredatthatmemorylocation,oritforwardstherequestonwardtothenextlevelofcache.
We start by defining some terms. The simplest kind of a cache is a memory cache. It stores (address, value) pairs. As shown in Figure 9.1, when we need to read the value of a certain memory location, we first consult the cache; it either replies with the value (if the cache knows it) or forwards the request onward. If the cache has the value, that is called a cache hit. If the cache does not, that is called a cache miss.
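The read path just described can be sketched in a few lines. This is a simplified model under assumptions: the class name `MemoryCache` is hypothetical, the next level is any mapping from address to value, and the cache is unbounded, so no replacement policy is needed.

```python
class MemoryCache:
    """A read cache of (address, value) pairs in front of a slower store."""

    def __init__(self, next_level):
        self.next_level = next_level  # mapping from address to value
        self.lines = {}               # cached (address, value) pairs
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            self.hits += 1            # cache hit: answer directly
        else:
            self.misses += 1          # cache miss: forward the request onward
            self.lines[address] = self.next_level[address]
        return self.lines[address]
```

Reading the same address twice produces one miss followed by one hit; the second read never touches `next_level`.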
Foramemorycachetobeuseful,twopropertiesneedtohold.First,thecostofretrievingdataoutofthecachemustbesignificantlylessthanfetchingthedatafrommemory.Inotherwords,thecostofacachehitmustbelessthanacachemiss,orwewouldjustskipusingthecache.
Second,thelikelihoodofacachehitmustbehighenoughtomakeitworththeeffort.Onesourceofpredictabilityistemporallocality:programstendtoreferencethesameinstructionsanddatathattheyhadrecentlyaccessed.Examplesincludetheinstructionsinsidealoop,oradatastructurethatisrepeatedlyaccessed.Bycachingthesememoryvalues,wecanimproveperformance.
Anothersourceofpredictabilityisspatiallocality.Programstendtoreferencedatanearotherdatathathasbeenrecentlyreferenced.Forexample,thenextinstructiontoexecuteisusuallyneartothepreviousone,anddifferentfieldsinthesamedatastructuretendtobereferencedatnearlythesametime.Toexploitthis,cachesareoftendesignedtoloadablockofdataatthesametime,insteadofonlyasinglelocation.Hardwarememorycachesoftenstore4-64memorywordsasaunit;filecachesoftenstoredatainpowersoftwoofthehardwarepagesize.
A related design technique that also takes advantage of spatial locality is to prefetch data into the cache before it is needed. For example, if the file system observes the application reading a sequence of blocks into memory, it will read the subsequent blocks ahead of time, without waiting to be asked.
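To see why loading a block at a time pays off, the sketch below counts misses for an unbounded cache that fetches whole blocks. The function name `count_misses` and the block size of 8 words are assumptions chosen for illustration.

```python
BLOCK_WORDS = 8  # hypothetical block size, in words


def count_misses(addresses, block_words=BLOCK_WORDS):
    """Count misses for an unbounded cache that loads whole blocks."""
    cached_blocks = set()
    misses = 0
    for address in addresses:
        block = address // block_words  # the block containing this word
        if block not in cached_blocks:
            misses += 1                 # fetch the entire block on a miss
            cached_blocks.add(block)
    return misses
```

A sequential scan of 64 words, `count_misses(range(64))`, misses 8 times, once per block. Touching 64 words spaced 8 apart, `count_misses(range(0, 512, 8))`, misses 64 times, because each access lands in a new block and the rest of each block is fetched but never used.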
Puttingthesetogether,thelatencyofareadrequestisasfollows:
Latency(read request) = Prob(cache hit) × Latency(cache hit) + Prob(cache miss) × Latency(cache miss)
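The formula is easy to evaluate directly. In the sketch below the function name and the hit rates are hypothetical, and the miss latency is taken to be the total time for a read that misses.

```python
def average_read_latency(hit_rate, hit_latency, miss_latency):
    """Expected read latency, in the same unit as the two latencies."""
    miss_rate = 1 - hit_rate
    return hit_rate * hit_latency + miss_rate * miss_latency
```

With a 1 ns hit and a 100 ns miss, a 99% hit rate gives an average of about 1.99 ns, while a 90% hit rate gives about 10.9 ns. A nine-point drop in hit rate makes the average read more than five times slower, which is why the likelihood of a hit matters as much as the raw latencies.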
Figure9.2:Abstractoperationofamemorycachewrite.Memoryrequestsarebufferedandthensenttothecacheinthebackground.Typically,thecachestoresablockofdata,soeachwriteensuresthattherestoftheblockisinthecachebeforeupdatingthecache.Ifthecacheiswritethrough,thedataisthensentonwardtothenextlevelofcacheormemory.
ThebehaviorofacacheonawriteoperationisshowninFigure9.2.Theoperationisabitmorecomplex,butthelatencyofawriteoperationiseasiertounderstand.Mostsystemsbufferwrites.Aslongasthereisroominthebuffer,thecomputationcancontinueimmediatelywhilethedataistransferredintothecacheandtomemoryinthebackground.(Therearecertainrestrictionsontheuseofwritebuffersinamultiprocessorsystem,soforthischapter,wearesimplifyingmatterstosomedegree.)Subsequentreadrequestsmustcheckboththewritebufferandthecache—returningdatafromthewritebufferifitisthelatestcopy.
In the background, the system checks if the address is in the cache. If not, the rest of the cache block must be fetched from memory and then updated with the changed value. Finally, if the cache is write-through, all updates are sent immediately onward to memory. If the cache is write-back, updates can be stored in the cache, and only sent to memory when the cache runs out of space and needs to evict a block to make room for a new memory block.
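The difference between the two write policies can be sketched as follows. This simplified model is an assumption-laden illustration: the class name `WriteCache` is hypothetical, each cache entry is a single word rather than a block, and the write buffer is omitted.

```python
class WriteCache:
    """A word-granularity cache with a selectable write policy."""

    def __init__(self, memory, write_through):
        self.memory = memory              # dict standing in for the next level
        self.write_through = write_through
        self.lines = {}                   # address -> cached value
        self.dirty = set()                # addresses newer in cache than in memory

    def write(self, address, value):
        self.lines[address] = value
        if self.write_through:
            self.memory[address] = value  # send the update onward immediately
        else:
            self.dirty.add(address)       # write-back: defer until eviction

    def evict(self, address):
        if address in self.dirty:
            self.memory[address] = self.lines[address]  # flush the deferred update
            self.dirty.discard(address)
        self.lines.pop(address, None)
```

With `write_through=False`, memory does not see a new value until `evict` is called for that address; with `write_through=True`, memory is always current and eviction has nothing to flush.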
Sincewritebuffersallowwriterequeststoappeartocompleteimmediately,therestofourdiscussionfocusesonusingcachestoimprovememoryreads.
Wefirstdiscussthepartoftheequationthatdealswiththelatencyofacachehitandacachemiss:howlongdoesittaketoaccessdifferenttypesofmemory?Wecaution,however,thattheissuesthataffectthelikelihoodofacachehitormissarejustasimportanttotheoverallmemorylatency.Inparticular,wewillshowthatapplicationcharacteristicsareoftenthelimitingfactortogoodcacheperformance.
9.2MemoryHierarchy
Whenwearedecidingwhethertouseacacheintheoperatingsystemorsomenewapplication,itishelpfultostartwithanunderstandingofthecostandperformanceofvariouslevelsofmemoryanddiskstorage.
Cache                                Hit Cost    Size
1st level cache / 1st level TLB      1 ns        64 KB
2nd level cache / 2nd level TLB      4 ns        256 KB
3rd level cache                      12 ns       2 MB
Memory (DRAM)                        100 ns      10 GB
Data center memory (DRAM)            100 μs      100 TB
Local non-volatile memory            100 μs      100 GB
Local disk                           10 ms       1 TB
Data center disk                     10 ms       100 PB
Remote data center disk              200 ms      1 EB
Figure 9.3: Memory hierarchy, from on-chip processor caches to disk storage at a remote data center. On-chip cache size and latency are typical of a high-end processor. The entries for data center DRAM and disk latency assume the access is from one server to another in the same data center; remote data center disk latency is for access to a geographically distant data center.
Fromahardwareperspective,thereisafundamentaltradeoffbetweenthespeed,size,andcostofstorage.Thesmallermemoryis,thefasteritcanbe;theslowermemoryis,thecheaperitcanbe.
Thismotivatessystemstohavenotjustonecache,butawholehierarchyofcaches,fromthenanosecondmemorypossibleinsideachiptothemultipleexabytesofworldwidedatacenterstorage.ThishierarchyisillustratedbythetableinFigure9.3.Weshouldcautionthatthislistisjustasnapshot;additionallayerskeepbeingaddedovertime.
First-levelcache.Mostmodernprocessorarchitecturescontainasmallfirst-level,virtuallyaddressed,cacheveryclosetotheprocessor,designedtokeeptheprocessorfedwithinstructionsanddataattheclockrateoftheprocessor.
Second-levelcache.Becauseitisimpossibletobuildalargecacheasfastasasmallone,theprocessorwilloftencontainasecond-level,physicallyaddressedcachetohandlecachemissesfromthefirst-levelcache.
Third-levelcache.Likewise,manyprocessorsincludeanevenlarger,slowerthird-levelcachetocatchsecond-levelcachemisses.Thiscacheisoftensharedacrossalloftheon-chipprocessorcores.
First-andsecond-levelTLB.Thetranslationlookasidebuffer(TLB)willalsobeorganizedwithmultiplelevels:asmall,fastfirst-levelTLBdesignedtokeepupwiththeprocessor,backedupbyalarger,slightlyslower,second-levelTLBtocatchfirst-levelTLBmisses.
Mainmemory(DRAM).Fromahardwareperspective,thefirst-,second-,andthird-levelcachesprovidefasteraccesstomainmemory;fromasoftwareperspective,however,mainmemoryitselfcanbeviewedasacache.
Data center memory (DRAM). With a high-speed local area network such as the one inside a data center, fetching a page of data from the memory of a nearby computer is much faster than fetching it from disk. In aggregate, the memory of nearby nodes will often be larger than that of the local disk. Using the memory of nearby nodes to avoid the latency of going to disk is called cooperative caching, as it requires the cooperative management of the nodes in the data center. Many large scale data center services, such as Google and Facebook, make extensive use of cooperative caching.
Localdiskornon-volatilememory.Forclientmachines,localdiskornon-volatileflashmemorycanserveasbackingstorewhenthesystemrunsoutofmemory.Inturn,thelocaldiskservesasacacheforremotediskstorage.Forexample,webbrowsersstorerecentlyfetchedwebpagesintheclientfilesystemtoavoidthecostoftransferringthedataagainthenexttimeitisused;oncecached,thebrowseronlyneedstovalidatewiththeserverwhetherthepagehaschangedbeforerenderingthewebpagefortheuser.
Datacenterdisk.Theaggregatedisksinsideadatacenterprovideenormousstoragecapacitycomparedtoacomputer’slocaldisk,andevenrelativetotheaggregatememoryofthedatacenter.
Remotedatacenterdisk.Geographicallyremotedisksinadatacenteraremuchslowerbecauseofwide-areanetworklatencies,buttheyprovideaccesstoevenlargerstoragecapacityinaggregate.Manydatacentersalsostoreacopyoftheirdataonaremoterobotictapesystem,butsincethesesystemshaveveryhighlatency(measuredinthetensofseconds),theyaretypicallyaccessedonlyintheeventofafailure.
Ifcachingalwaysworkedperfectly,wecouldprovidetheillusionofinstantaneousaccesstoalltheworld’sdata,withthelatency(onaverage)ofafirstlevelcacheandthesizeandthecost(onaverage)ofdiskstorage.
However,therearereasonstobeskeptical.Evenwithtemporalandspatiallocality,therearethirteenordersofmagnitudedifferenceinstoragecapacityfromthefirstlevelcachetothestoreddataofatypicaldatacenter;thisistheequivalentofthesmallestvisibledotonthispageversusthosedotsscatteredacrossthepagesofamilliontextbooksjustlikethisone.Howcanacachebeeffectiveifitcanstoreonlyatinyamountofthedatathatcouldbestored?
Thecostofacachemisscanalsobehigh.Thereareeightordersofmagnitudedifferencebetweenthelatencyofthefirst-levelcacheandaremotedatacenterdisk;thatisequivalenttothedifferencebetweentheshortestlatencyahumancanperceive—roughlyonehundredmilliseconds—versusoneyear.Howcanacachebeeffectiveifthecostofacachemissisenormouscomparedtoacachehit?
9.3WhenCachesWorkandWhenTheyDoNot
Howdoweknowwhetheracachewillbeeffectiveforagivenworkload?Eventhesameprogramwillhavedifferentcachebehaviordependingonhowitisused.
Supposeyouwriteaprogramthatreadsandwritesitemsintoahashtable.Howwelldoesthatinteractwithcaching?Itdependsonthesizeofthehashtable.Ifthehashtablefitsinthefirst-levelcache,oncethetableisloadedintothecache,eachaccesswillbeveryrapid.Ifontheotherhand,thehashtableistoolargetostoreinmemory,eachlookupmayrequireadiskaccess.
Thus,neitherthecachesizenortheprogrambehavioralonegovernstheeffectivenessofcaching.Rather,theinteractionbetweenthetwodeterminescacheeffectiveness.
Figure9.4:CachehitrateasafunctionofcachesizeforamillioninstructionrunofaCcompiler.Thehitratevs.cachesizegraphhasasimilarshapeformanyprograms.Thekneeofthecurveiscalledtheworkingsetoftheprogram.
9.3.1WorkingSetModel
Ausefulgraphtoconsideristhecachehitrateversusthesizeofthecache.WegiveanexampleinFigure9.4;ofcourse,thepreciseshapeofthegraphwillvaryfromprogramtoprogram.
Regardlessoftheprogram,asufficientlylargecachewillhaveahighcachehitrate.Inthelimit,ifthecachecanfitalloftheprogram’smemoryanddata,themissratewillbezerooncethedataisloadedintothecache.Attheotherextreme,asufficientlysmallcachewillhaveaverylowcachehitrate.Anythingotherthanatrivialprogramwillhavemultipleproceduresandmultipledatastructures;ifthecacheissufficientlysmall,eachnewinstructionanddatareferencewillpushoutsomethingfromthecachethatwillbeusedinthenearfuture.Forthehashtableexample,ifthesizeofthecacheismuchsmallerthanthesizeofthehashtable,eachtimewedoalookup,thehashbucketweneedwillnolongerbeinthecache.
Mostprogramswillhaveaninflectionpoint,orkneeofthecurve,whereacriticalmassofprogramdatacanjustbarelyfitinthecache.Thiscriticalmassiscalledtheprogram’sworkingset.Aslongastheworkingsetcanfitinthecache,mostreferenceswillbeacachehit,andapplicationperformancewillbegood.
Thrashing
A closely related concept to the working set is thrashing. A program thrashes if the cache is too small to hold its working set, so that most references are cache misses. Each time there is a cache miss, we need to evict a cache block to make room for the new reference. However, the new cache block may in turn be evicted before it is reused.
The word “thrash” dates from the 1960s, when disk drives were as large as washing machines. If a program’s working set did not fit in memory, the system would need to shuffle memory pages back and forth to disk. This burst of activity would literally make the disk drive shake violently, making it very obvious to everyone nearby why the system was not performing well.
The notion of a working set can also apply to user behavior. Consider what happens when you are developing code for a homework assignment. If the files you need fit in memory, compilation will be rapid; if not, compilation will be slow as each file is brought in from disk as it is used.
Different programs, and different users, will have working sets of different sizes. Even within the same program, different phases of the program may have different size working sets. For example, the parser for a compiler needs different data in cache than the code generator. In a text editor, the working set shifts when we switch from one page to the next. Users also change their focus from time to time, as when you shift from a programming assignment to a history assignment.
Figure 9.5: Example cache hit rate over time. At a phase change within a process, or due to a context switch between processes, there will be a spike of cache misses before the system settles into a new equilibrium.
The result of this phase change behavior is that caches will often have bursty miss rates: periods of low cache misses interspersed with periods of high cache misses, as shown in Figure 9.5. Process context switches will also cause bursty cache misses, as the cache discards the working set from the old process and brings in the working set of the new process.
We can combine the graph in Figure 9.4 with the table in Figure 9.3 to see the impact of the size of the working set on computer system performance. A program whose working set fits in the first level cache will run four times faster than one whose working set fits in the second level cache. A program whose working set does not fit in main memory will run a thousand times slower than one whose working set does, assuming it has access to data center memory. It will run a hundred thousand times slower if it needs to go to disk.
Because of the increasing depth and complexity of the memory hierarchy, an important area of work is the design of algorithms that adapt their working set to the memory hierarchy. One focus has been on algorithms that manage the gap between main memory and disk, but the same principles apply at other levels of the memory hierarchy.
Figure 9.6: Algorithm to sort a large array that does not fit into main memory, by breaking the problem into pieces that do fit into memory.
A simple example is how to efficiently sort an array that does not fit in main memory. (Equivalently, we could consider how to sort an array that does not fit in the first level cache.) As shown in Figure 9.6, we can break the problem up into chunks each of which does fit in memory. Once we sort each chunk, we can merge the sorted chunks together efficiently. To sort a chunk that fits in main memory, we can in turn break the problem into sub-chunks that fit in the on-chip cache.
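The chunk-and-merge idea can be sketched in a few lines. This is a minimal in-memory illustration, not an external sort: a real implementation would write each sorted chunk to disk and stream the merge, and the name `chunked_sort` and the `chunk_size` parameter are hypothetical.

```python
import heapq

def chunked_sort(items, chunk_size):
    """Sort `items` by sorting chunks of at most `chunk_size` items, then merging.

    `chunk_size` stands in for the amount of data that fits in the faster
    level of the memory hierarchy; it must be a positive integer.
    """
    # Phase 1: sort each chunk independently (each chunk "fits in memory").
    chunks = [sorted(items[i:i + chunk_size])
              for i in range(0, len(items), chunk_size)]
    # Phase 2: merge the sorted chunks, reading each one sequentially.
    return list(heapq.merge(*chunks))
```

The merge phase touches each chunk in order from front to back, so it only needs a small window of each chunk in the faster level at any time.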
We will discuss later in this chapter what the operating system needs to do when managing memory between programs that in turn adapt their behavior to manage memory.
9.3.2 Zipf Model
Although the working set model often describes program and user behavior quite well, it is not always a good fit. For example, consider a web proxy cache. A web proxy cache stores frequently accessed web pages to speed web access and reduce network traffic. Web access patterns cause two challenges to a cache designer:
New data. New pages are being added to the web at a rapid rate, and page contents also change. Every time a user accesses a page, the system needs to check whether the page has changed in the meantime.
No working set. Although some web pages are much more popular than others, there is no small subset of web pages that, if cached, give you the bulk of the benefit. Unlike with a working set, even very small caches have some value. Conversely, increasing cache size yields diminishing returns: even very large caches tend to have only modest cache hit rates, as there is an enormous group of pages that are visited from time to time.
A useful model for understanding the cache behavior of web access is the Zipf distribution. Zipf developed the model to describe the frequency of individual words in a text, but it also applies in a number of other settings.
Figure 9.7: Zipf distribution
Suppose we have a set of web pages (or words), and we rank them in order of popularity. Then the frequency with which users visit a particular web page is (approximately) inversely proportional to its rank:
$$\text{Frequency of visits to the } k\text{th most popular page} \;\propto\; \frac{1}{k^{\alpha}}$$
where $\alpha$ is a value between 1 and 2. A Zipf probability distribution is illustrated in Figure 9.7.
The Zipf distribution fits a surprising number of disparate phenomena: the popularity of library books, the population of cities, the distribution of salaries, the size of friend lists in social networks, and the distribution of references in scientific papers. The exact cause of the Zipf distribution in many of these cases is unknown, but they share a theme of popularity in human social networks.
Figure 9.8: Cache hit rate as a function of the percentage of total items that can fit in the cache, on a log scale, for a Zipf distribution.
A characteristic of a Zipf curve is a heavy-tailed distribution. Although a significant number of references will be to the most popular items, a substantial portion of references will be to less popular ones. If we redraw Figure 9.4 of the relationship between cache hit rate and cache size, but for a Zipf distribution, we get Figure 9.8. Note that we have rescaled the x-axis to be log scale. Rather than a threshold as we see in the working set model, increasing the cache size continues to improve cache hit rates, but with diminishing returns.
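The diminishing returns can be checked numerically. The sketch below assumes an idealized cache that always holds the most popular items, a catalog of 100,000 items, and an exponent of 1; all three are illustrative assumptions, and `zipf_hit_rate` is a hypothetical helper.

```python
def zipf_hit_rate(num_items, cache_items, alpha=1.0):
    """Hit rate when the cache holds the `cache_items` most popular of `num_items`
    items and item k is referenced with probability proportional to 1 / k**alpha."""
    weights = [1.0 / (k ** alpha) for k in range(1, num_items + 1)]
    return sum(weights[:cache_items]) / sum(weights)

# Grow the cache by a factor of ten at each step.
sizes = (10, 100, 1_000, 10_000)
rates = [zipf_hit_rate(100_000, n) for n in sizes]
```

With these parameters, each tenfold increase in cache size buys roughly the same additive improvement in hit rate, which is why the curve looks like a straight line only when the x-axis is drawn on a log scale.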
9.4 Memory Cache Lookup
Now that we have outlined the available technologies for constructing caches, and the usage patterns that lend (or do not lend) themselves to effective caching, we turn to cache design. How do we find whether an item is in the cache, and what do we do when we run out of room in the cache? We answer the first question here, and we defer the second question to the next section.
A memory cache maps a sparse set of addresses to the data values stored at those addresses. You can think of a cache as a giant table with two columns: one for the address and one for the data stored at that address. To exploit spatial locality, each entry in the table will store the values for a block of memory, not just the value for a single memory word. Modern Intel processors cache data in 64 byte chunks. For operating systems, the block size is typically the hardware page size, or 4 KB on an Intel processor.
We need to be able to rapidly convert an address to find the corresponding data, while minimizing storage overhead. The options we have for cache lookup are all of the same ones we explored in the previous chapter for address lookup: we can use a linked list, a multi-level tree, or a hash table. Operating systems use each of those techniques in different settings, depending on the size of the cache, its access pattern, and how important it is to have very rapid lookup.
For hardware caches, the design choices are more limited. The latency gap between cache levels is very small, so any added overhead in the lookup procedure can swamp the benefit of the cache. To make lookup faster, hardware caches often constrain where in the table we might find any specific address. This constraint means that there could be room in one part of the table, but not in another, raising the cache miss rate. There is a tradeoff here: a faster cache lookup needs to be balanced against the cost of increased cache misses.
Three common mechanisms for cache lookup are:
Figure 9.9: Fully associative cache lookup. The cache checks the address against every entry and returns the matching value, if any.
Fully associative. With a fully associative cache, the address can be stored anywhere in the table, and so on a lookup, the system must check the address against all of the entries in the table as illustrated in Figure 9.9. There is a cache hit if any of the table entries match. Because any address can be stored anywhere, this provides the system maximal flexibility when it needs to choose an entry to discard when it runs out of space.
We saw two examples of fully associative caches in the previous chapter. Until very recently, TLBs were often fully associative: the TLB would check the virtual page against every entry in the TLB in parallel. Likewise, physical memory is a fully associative cache. Any page frame can hold any virtual page, and we can find where each virtual page is stored using a multi-level tree lookup. The set of page tables defines whether there is a match.
A problem with fully associative lookup is the cumulative impact of Moore’s Law. As more memory can be packed on chip, caches become larger. We can use some of the added memory to make each table entry larger, but this has a limit depending on the amount of spatial locality in typical applications. Alternately, we can add more table entries, but this means more lookup hardware and comparators. As an example, a 2 MB on-chip cache with 64 byte blocks has 32K cache table entries! Checking each address against every table entry in parallel is not practical.
Figure 9.10: Direct mapped cache lookup. The cache hashes the address to determine which location in the table to check. The cache returns the value stored in the entry if it matches the address.
Direct mapped. With a direct mapped cache, each address can only be stored in one location in the table. Lookup is easy: we hash the address to its entry, as shown in Figure 9.10. There is a cache hit if the address matches that entry and a cache miss otherwise.
A direct mapped cache allows efficient lookup, but it loses much of that advantage in decreased flexibility. If a program happens to need two different addresses that both hash to the same entry, such as the program counter and the stack pointer, the system will thrash. We will first get the instruction; then, oops, we need the stack. Then, oops, we need the instruction again. Then oops, we need the stack again. The programmer will see the program running slowly, with no clue why, as it will depend on which addresses are assigned to which instructions and data. If the programmer inserts a print statement to try to figure out what is going wrong, that might shift the instructions to a different cache block, making the problem disappear!
Set associative. A set associative cache melds the two approaches, allowing a tradeoff of slightly slower lookup than a direct mapped cache in exchange for most of the flexibility of a fully associative cache. With a set associative cache, we replicate the direct mapped table and look up in each replica in parallel. A k-way set associative cache has k replicas; a particular address block can be in any of the k replicas. (This is equivalent to a hash table with a bucket size of k.) There is a cache hit if the address matches any of the replicas.
A set associative cache avoids the problem of thrashing with a direct mapped cache, provided the working set for a given bucket is no larger than k. Almost all hardware caches and TLBs today use set associative matching; an 8-way set associative cache structure is common.
Figure 9.11: Set associative cache lookup. The cache hashes the address to determine which location to check. The cache checks the entry in each table in parallel. It returns the value if any of the entries match the address.
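The three lookup schemes differ only in how many entries share a bucket, which a small software model can show. The class below is a toy sketch under several assumptions: the "hash" is simply the block number modulo the number of sets, each set stores whole block numbers rather than tags, and replacement within a set is FIFO. Real hardware compares the entries of a set in parallel, which software can only imitate with a sequential search.

```python
class SetAssociativeCache:
    """Toy k-way set associative cache (hypothetical model, not real hardware).

    ways=1 behaves as a direct mapped cache; num_sets=1 behaves as a fully
    associative cache.
    """
    def __init__(self, num_sets, ways, block_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # Each set holds up to `ways` block numbers, oldest first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, address):
        """Return True on a cache hit; on a miss, load the block and return False."""
        block = address // self.block_size
        entries = self.sets[block % self.num_sets]   # the only place the block can live
        if block in entries:
            return True
        if len(entries) == self.ways:
            entries.pop(0)                           # set is full: evict its oldest block
        entries.append(block)
        return False
```

For example, with four sets and 64 byte blocks, addresses 0 and 256 fall in the same set. A direct mapped cache (`ways=1`) misses on every access when a program alternates between them, while a 2-way cache misses only on the first access to each.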
Direct mapped and set associative caches pose a design challenge for the operating system. These caches are much more efficient if the working set of the program is spread across the different buckets in the cache. This is easy with a TLB or a virtually addressed cache, as each successive virtual page or cache block will be assigned to a cache bucket. A data structure that straddles a page or cache block boundary will be automatically assigned to two different buckets.
However, the assignment of physical page frames is up to the operating system, and this choice can have a large impact on the performance of a physically addressed cache. To make this concrete, suppose we have a 2 MB physically addressed cache with 8-way set associativity and 4 KB pages; this is typical for a high performance processor. Now suppose the operating system happens to assign page frames in a somewhat odd way, so that an application is given physical page frames that are separated by exactly 256 KB. Perhaps those were the only page frames that were free. What happens?
Figure 9.12: When caches are larger than the page size, multiple page frames can map to the same slice of the cache. A process assigned page frames that are separated by exactly the cache size will only use a small portion of the cache. This applies to both set associative and direct mapped caches; the figure assumes a direct mapped cache to simplify the illustration.
If the hardware uses the low order bits of the page frame to index the cache, then every page of the current process will map to the same buckets in the cache. We show this in Figure 9.12. Instead of the cache having 2 MB of useful space, the application will only be able to use 32 KB (4 KB pages times the 8-way set associativity). This makes it a lot more likely for the application to thrash.
Even worse, the application would have no way to know this had happened. If by random chance an application ended up with page frames that map to the same cache buckets, its performance will be poor. Then, when the user re-runs the application, the operating system might assign the application a completely different set of page frames, and performance returns to normal.
To make cache behavior more predictable and more effective, operating systems use a concept called page coloring. With page coloring, physical page frames are partitioned into sets based on which cache buckets they will use. For example, with a 2 MB 8-way set associative cache and 4 KB pages, there will be 64 separate sets, or colors. The operating system can then assign page frames to spread each application’s data across the various colors.
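The arithmetic behind the 64 colors can be written out directly. This sketch assumes, as in the example above, that the cache is indexed by the low-order bits of the physical address, so a frame's color is its frame number modulo the number of colors; the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB

def num_colors(cache_bytes, ways, page_bytes):
    """Number of page-sized slices in one way of the cache."""
    return cache_bytes // (ways * page_bytes)

def page_color(frame_number, colors):
    """Color of a physical page frame, assuming low-order-bit cache indexing."""
    return frame_number % colors

colors = num_colors(2 * MB, ways=8, page_bytes=4 * KB)

# Page frames spaced exactly 256 KB apart, as in the unlucky allocation above.
stride = (256 * KB) // (4 * KB)
unlucky_frames = [i * stride for i in range(10)]
unlucky_colors = {page_color(f, colors) for f in unlucky_frames}

# Consecutive page frames, by contrast, cycle through every color.
spread_colors = {page_color(f, colors) for f in range(colors)}
```

Under these assumptions the unlucky frames all share one color, so they compete for the same 32 KB of cache, while 64 consecutive frames cover all 64 colors.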
9.5 Replacement Policies
Once we have looked up an address in the cache and found a cache miss, we have a new problem. Which memory block do we choose to replace? Assuming the reference pattern exhibits temporal locality, the new block is likely to be needed in the near future, so we need to choose some block of memory to evict from the cache to make room for the new data. Of course, with a direct mapped cache we do not have a choice: there is only one block that can be replaced. In general, however, we will have a choice, and this choice can have a significant impact on the cache hit rate.
As with processor scheduling, there are a number of options for the replacement policy. We caution that there is no single right answer! Many replacement policies are optimal for some workloads and pessimal for others, in terms of the cache hit rate; policies that are good for a working set model will not be good for Zipf workloads.
Policies also vary depending on the setting: hardware caches use a different replacement policy than the operating system does in managing main memory as a cache for disk. A hardware cache will often have a limited number of replacement choices, constrained by the set associativity of the cache, and it must make its decisions very rapidly. In the operating system, there is often both more time to make a choice and a much larger number of cached items to consider; e.g., with 4 GB of memory, a system will have a million separate 4 KB pages to choose from when deciding which to replace. Even within the operating system, the replacement policy for the file buffer cache is often different than the one used for demand paged virtual memory, depending on what information is easily available about the access pattern.
We first discuss several different replacement policies in the abstract, and then in the next two sections we consider how these concepts are applied to the setting of demand paging memory from disk.
9.5.1 Random
Although it may seem arbitrary, a practical replacement policy is to choose a random block to replace. Particularly for a first-level hardware cache, the system may not have the time to make a more complex decision, and the cost of making the wrong choice can be small if the item is in the next level cache. The bookkeeping cost for more complex policies can be non-trivial: keeping more information about each block requires space that may be better spent on increasing the cache size.
Random’s biggest weakness is also its biggest strength. Whatever the access pattern is, Random will not be pessimal: it will not make the worst possible choice, at least, not on average. However, it is also unpredictable, and so it might foil an application that was designed to carefully manage its use of different levels of the cache.
9.5.2 First-In-First-Out (FIFO)
A less arbitrary policy is to evict the cache block or page that has been in memory the longest, that is, First In First Out, or FIFO. Particularly for using memory as a cache for disk, this can seem fair: each program’s pages spend a roughly equal amount of time in memory before being evicted.
Unfortunately, FIFO can be the worst possible replacement policy for workloads that happen quite often in practice. Consider a program that cycles through a memory array repeatedly, but where the array is too large to fit in the cache. Many scientific applications do an operation on every element in an array, and then repeat that operation until the data reaches a fixed point. Google’s PageRank algorithm for determining which search results to display uses a similar approach. PageRank iterates repeatedly through all pages, estimating the popularity of a page based on the popularity of the pages that refer to it as computed in the previous iteration.
FIFO
Ref.  A  B  C  D  E  A  B  C  D  E  A  B  C  D  E
1     A           E           D           C
2        B           A           E           D
3           C           B           A           E
4              D           C           B
Figure 9.13: Cache behavior for FIFO for a repeated scan through memory, where the scan is slightly larger than the cache size. Each row represents the contents of a page frame or cache block; each new reference triggers a cache miss.
On a repeated scan through memory, FIFO does exactly the wrong thing: it always evicts the block or page that will be needed next. Figure 9.13 illustrates this effect. In this figure, and other similar figures in this chapter, we show only a small number of cache slots; the same policies also apply to systems with a very large number of slots.
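The behavior in Figure 9.13 can be reproduced with a short simulation. This sketch relies on the observation that, when frames are filled in order and never invalidated, FIFO replacement is the same as replacing frames in round-robin order; `fifo_trace` is a hypothetical helper name.

```python
def fifo_trace(refs, slots):
    """Return a list of (reference, hit, frame contents) for a FIFO cache.

    Frames are filled in order and then replaced in round-robin order, which
    is equivalent to evicting whichever block has been resident the longest.
    """
    frames = [None] * slots
    next_victim = 0
    history = []
    for block in refs:
        hit = block in frames
        if not hit:
            frames[next_victim] = block
            next_victim = (next_victim + 1) % slots
        history.append((block, hit, tuple(frames)))
    return history

# Three passes over five blocks with only four frames, as in Figure 9.13.
history = fifo_trace("ABCDE" * 3, slots=4)
misses = sum(1 for _, hit, _ in history if not hit)
```

Each tuple of frame contents corresponds to one column of the figure; on this reference string every reference is a miss.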
9.5.3 Optimal Cache Replacement (MIN)
If FIFO can be pessimal for some workloads, that raises the question: what replacement policy is optimal for minimizing cache misses? The optimal policy, called MIN, is to replace whichever block is used farthest in the future. Equivalently, the worst possible strategy is to replace the block that is used soonest.
Optimality of MIN
The proof that MIN is optimal is a bit involved. If MIN is not optimal, there must be some alternative optimal replacement policy, which we will call ALT, that has fewer cache misses than MIN on some specific sequence of references. There may be many such alternate policies, so let us focus on the one that differs from MIN at the latest possible point. Consider the first cache replacement where ALT differs from MIN; by definition, ALT must choose a block to replace that is used sooner than the block chosen by MIN.
We construct a new policy, ALT′, that is at least as good as ALT, but differs from MIN at a later point and so contradicts the assumption. We construct ALT′ to differ from ALT in only one respect: at the first point where ALT differs from MIN, ALT′ chooses to evict the block that MIN would have chosen. From that point, the contents of the cache differ between ALT and ALT′ only for that one block. ALT contains y, the block referenced farther in the future; ALT′ is the same, except it contains x, the block referenced sooner. On subsequent cache misses to other blocks, ALT′ mimics ALT, evicting exactly the same blocks that ALT would have evicted.
It is possible that ALT chooses to evict y before the next reference to x or y; in this case, if ALT′ chooses to evict x, the contents of the cache for ALT and ALT′ are identical. Further, ALT′ has the same number of cache misses as ALT, but it differs from MIN at a later point than ALT. This contradicts our assumption above, so we can exclude this case.
Eventually, the system will reference x, the block that ALT chose to evict; by construction, this occurs before the reference to y, the block that ALT′ chose to evict. Thus, ALT will have a cache miss, but ALT′ will not. ALT will evict some block, q, to make room for x; now ALT and ALT′ differ only in that ALT contains y and ALT′ contains q. (If ALT evicts y instead, then ALT and ALT′ have the same cache contents, but ALT′ has fewer misses than ALT, a contradiction.) Finally, when we reach the reference to y, ALT′ will take a cache miss. If ALT′ evicts q, then it will have the same number of cache misses as ALT, but it will differ from MIN at a point later than ALT, a contradiction.
As with Shortest Job First, MIN requires knowledge of the future, and so we cannot implement it directly. Rather, we can use it as a goal: we want to come up with mechanisms which are effective at predicting which blocks will be used in the near future, so that we can keep those in the cache.
If we were able to predict the future, we could do even better than MIN by prefetching blocks so that they arrive “just in time,” exactly when they are needed. In the best case, this can reduce the number of cache misses to zero. For example, if we observe a program scanning through a file, we can prefetch the blocks of the file into memory. Provided we can read the file into memory fast enough to keep up with the program, the program will always find its data in memory and never have a cache miss.
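Although MIN cannot be implemented online, it is easy to simulate offline when the whole reference string is known in advance, which makes it useful as a yardstick. The sketch below is one such simulation; `min_misses` is a hypothetical name, and ties between blocks that are never referenced again are broken arbitrarily, which does not affect the miss count.

```python
def min_misses(refs, slots):
    """Count cache misses for the offline-optimal MIN policy."""
    refs = list(refs)
    cache = set()
    misses = 0
    for i, block in enumerate(refs):
        if block in cache:
            continue
        misses += 1
        if len(cache) == slots:
            def next_use(b):
                # Position of the next reference to b, or infinity if none remains.
                try:
                    return refs.index(b, i + 1)
                except ValueError:
                    return float("inf")
            # Evict the block whose next use is farthest in the future.
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses
```

On three passes over five blocks with four cache slots, this gives 7 misses, compared with 15 for FIFO on the same reference string.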
9.5.4 Least Recently Used (LRU)
One way to predict the future is to look at the past. If programs exhibit temporal locality, the locations they reference in the future are likely to be the same as the ones they have referenced in the recent past.
A replacement policy that captures this effect is to evict the block that has not been used for the longest period of time, or the least recently used (LRU) block. In software, LRU is simple to implement: on every cache hit, you move the block to the front of the list, and on a cache miss, you evict the block at the end of the list. In hardware, keeping a linked list of cached blocks is too complex to implement at high speed; instead, we need to approximate LRU, and we will discuss exactly how in a bit.
LRU
Ref.  A  B  A  C  B  D  A  D  E  D  A  E  B  A  C
1     A     +           +           +        +
2        B        +                       +
3              C              E        +
4                    D     +     +              C
FIFO
1     A     +           +     E        +
2        B        +                 A        +
3              C                          B
4                    D     +     +              C
MIN
1     A     +           +           +        +
2        B        +                       +     C
3              C              E        +
4                    D     +     +
Figure 9.14: Cache behavior for LRU (top), FIFO (middle), and MIN (bottom) for a reference pattern that exhibits temporal locality. Each row represents the contents of a page frame or cache block; + indicates a cache hit. On this reference pattern, LRU is the same as MIN up to the final reference, where MIN can choose to replace any block.
In some cases, LRU can be optimal, as in the example in Figure 9.14. The table illustrates a reference pattern that exhibits a high degree of temporal locality; when recent references are more likely to be referenced in the near future, LRU can outperform FIFO.
LRU
Ref.  A  B  C  D  E  A  B  C  D  E  A  B  C  D  E
1     A           E           D           C
2        B           A           E           D
3           C           B           A           E
4              D           C           B
MIN
1     A              +              +
2        B              +              +  C
3           C              +  D              +
4              D  E              +              +
Figure 9.15: Cache behavior for LRU (top) and MIN (bottom) for a reference pattern that repeatedly scans through memory. Each row represents the contents of a page frame or cache block; + indicates a cache hit. On this reference pattern, LRU is the same as FIFO, with a cache miss on every reference; the optimal strategy is to replace the most recently used page, as that will be referenced farthest into the future.
On the sequence of references in Figure 9.14, LRU behaves similarly to the optimal strategy MIN, but that will not always be the case. In fact, LRU can sometimes be the worst possible cache replacement policy. This occurs whenever the least recently used block is the next one to be referenced. A common situation where LRU is pessimal is when the program makes repeated scans through memory, illustrated in Figure 9.15; we saw earlier that FIFO is also pessimal for this reference pattern. The best possible strategy is to replace the most recently referenced block, as this block will be used farthest into the future.
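A software LRU cache of the kind described above can be sketched with an ordered dictionary standing in for the linked list. In this sketch the most recently used block sits at the end of the ordering rather than the front; that is an arbitrary choice, and `lru_misses` is a hypothetical helper name.

```python
from collections import OrderedDict

def lru_misses(refs, slots):
    """Count cache misses for LRU replacement with `slots` cache blocks."""
    cache = OrderedDict()     # ordered from least to most recently used
    misses = 0
    for block in refs:
        if block in cache:
            cache.move_to_end(block)       # hit: now the most recently used
            continue
        misses += 1
        if len(cache) == slots:
            cache.popitem(last=False)      # evict the least recently used block
        cache[block] = True
    return misses

locality_misses = lru_misses("ABACBDADEDAEBAC", 4)   # reference string of Figure 9.14
scan_misses = lru_misses("ABCDE" * 3, 4)             # reference string of Figure 9.15
```

On the reference string with temporal locality LRU takes 6 misses out of 15 references, while on the repeated scan every one of the 15 references misses.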
9.5.5 Least Frequently Used (LFU)
Consider again the case of a web proxy cache. Whenever a user accesses a page, it is more likely for that user to access other nearby pages (spatial locality); sometimes, as with a flash crowd, it can be more likely for other users to access the same page (temporal locality). On the surface, Least Recently Used seems like a good fit for this workload.
However, when a user visits a rarely used page, LRU will treat the page as important, even though it is probably just a one-off. When I do a Google search for a mountain hut for a stay in Western Iceland, the web pages I visit will not suddenly become more popular than the latest Facebook update from Katy Perry.
A better strategy for references that follow a Zipf distribution is Least Frequently Used (LFU). LFU discards the block that has been used least often; it therefore keeps popular pages, even when less popular pages have been touched more recently.
LRU and LFU both attempt to predict future behavior, and they have complementary strengths. Many systems meld the two approaches to gain the benefits of each. LRU is better at keeping the current working set in memory; once the working set is taken care of, however, LRU will yield diminishing returns. Instead, LFU may be better at predicting what files or memory blocks will be needed in the more distant future, e.g., after the next working set phase change.
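A minimal LFU simulation shows how popular blocks survive a burst of one-off references. Two choices in this sketch are assumptions rather than anything the text specifies: reference counts are remembered even for blocks that have been evicted, and ties between equally infrequent blocks are broken arbitrarily.

```python
from collections import Counter

def lfu_misses(refs, slots):
    """Count cache misses for Least Frequently Used replacement."""
    cache = set()
    counts = Counter()        # references seen so far, including evicted blocks
    misses = 0
    for block in refs:
        counts[block] += 1
        if block in cache:
            continue
        misses += 1
        if len(cache) == slots:
            # Evict the cached block that has been referenced least often.
            cache.remove(min(cache, key=lambda b: counts[b]))
        cache.add(block)
    return misses

# Two popular pages (A, B), a burst of one-off pages (X, Y, Z), then A and B again.
refs = "ABABABXYZAB"
one_off_misses = lfu_misses(refs, slots=3)
```

Here LFU keeps A and B resident through the burst, so only the five first-time references miss; an LRU cache of the same size would evict A and B during the burst and miss on them again afterward.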
Replacement policy and file size
Our discussion up to now has assumed that all cached items are equal, both in size and in cost to replace. When these assumptions do not hold, however, we may sometimes want to vary the policy from LRU or LFU, that is, to keep some items that are less frequently or less recently used ahead of others that are more frequently or more recently used.
For example, consider a web proxy that caches files to improve web responsiveness. These files may have vastly different sizes. When making room for a new file, we have a choice between evicting one very large web page object or a much larger number of smaller objects. Even if each small file is less frequently used than the large file, it may still make sense to keep the small files. In aggregate they may be more frequently used, and therefore they may have a larger benefit to overall system performance. Likewise, if a cached item is expensive to regenerate, it is more important to keep cached than one that is more easily replaced.
Parallel computing makes the calculus even more complex. The performance of a parallel program depends on its critical path, the minimum sequence of steps for the program to produce its result. Cache misses that occur on the critical path affect the response time while those that occur off the critical path do not. For example, a parallel MapReduce job forks a set of tasks onto processors; each task reads in a file and produces an output. Because MapReduce must wait until all tasks are complete before moving on to the next step, if any file is not cached it is as bad as if all of the needed files were not cached.
9.5.6 Belady’s Anomaly
Intuitively, it seems like it should always help to add space to a memory cache; being able to store more blocks should always either improve the cache hit rate, or at least, not make the cache hit rate any worse. For many cache replacement strategies, this intuition is true. However, in some cases, adding space to a cache can actually hurt the cache hit rate. This is called Belady’s anomaly, after the person who discovered it.
First, we note that many of the schemes we have defined can be proven to yield no worse cache behavior with larger cache sizes. For example, with the optimal strategy MIN, if we have a cache of size k blocks, we will keep the next k blocks that will be referenced. If we have a cache of size k + 1 blocks, we will keep all of the same blocks as with a k sized cache, plus the additional block that will be the k + 1 next reference.
We can make a similar argument for LRU and LFU. For LRU, a cache of size k + 1 keeps all of the same blocks as a k sized cache, plus the block that is referenced farthest in the past. Even if LRU is a lousy replacement policy (if it rarely keeps the blocks that will be used in the near future), it will always do at least as well as a slightly smaller cache also using the same replacement policy. An equivalent argument can be used for LFU.
FIFO (3 slots)
Ref.  A  B  C  D  A  B  E  A  B  C  D  E
1     A        D        E              +
2        B        A        +     C
3           C        B        +     D
FIFO (4 slots)
1     A           +     E           D
2        B           +     A           E
3           C                 B
4              D                 C
Figure 9.16: Cache behavior for FIFO with two different cache sizes, illustrating Belady’s anomaly. For this sequence of references, the larger cache suffers ten cache misses, while the smaller cache has one fewer.
Some replacement policies, however, do not have this behavior. Instead, the contents of a cache with k + 1 blocks may be completely different than the contents of a cache with k blocks. As a result, their cache hit rates may diverge. Among the policies we have discussed, FIFO suffers from Belady’s anomaly, and we illustrate that in Figure 9.16.
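The anomaly in Figure 9.16 is easy to confirm by counting misses directly. The following sketch simulates FIFO with a queue recording the order in which blocks were loaded; `fifo_misses` is a hypothetical helper name.

```python
from collections import deque

def fifo_misses(refs, slots):
    """Count cache misses for FIFO replacement with `slots` cache blocks."""
    cache = set()
    load_order = deque()      # blocks in the order they were brought in
    misses = 0
    for block in refs:
        if block in cache:
            continue          # a hit does not change FIFO order
        misses += 1
        if len(cache) == slots:
            cache.discard(load_order.popleft())   # evict the oldest resident block
        cache.add(block)
        load_order.append(block)
    return misses

refs = "ABCDABEABCDE"         # reference string of Figure 9.16
misses_by_size = {slots: fifo_misses(refs, slots) for slots in (3, 4)}
```

On this reference string the 3-slot cache takes 9 misses and the 4-slot cache takes 10, so adding a slot makes FIFO strictly worse.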
9.6 Case Study: Memory-Mapped Files
To illustrate the concepts presented in this chapter, we consider in detail how an operating system can implement demand paging. With demand paging, applications can access more memory than is physically present on the machine, by using memory pages as a cache for disk blocks. When the application accesses a missing memory page, it is transparently brought in from disk. We start with the simpler case of demand paging for a single, memory-mapped file and then extend the discussion to managing multiple processes competing for space in main memory.
As we discussed in Chapter 3, most programs use explicit read/write system calls to perform file I/O. Read/write system calls allow the program to work on a copy of file data. The program opens a file and then invokes the system call read to copy chunks of file data into buffers in the program’s address space. The program can then use and modify those chunks, without affecting the underlying file. For example, it can convert the file from the disk format into a more convenient in-memory format. To write changes back to the file, the program invokes the system call write to copy the data from the program buffers out to disk. Reading and writing files via system calls is simple to understand and reasonably efficient for small files.
An alternative model for file I/O is to map the file contents into the program’s virtual address space. For a memory-mapped file, the operating system provides the illusion that the file is a program segment; like any memory segment, the program can directly issue instructions to load and store values to the memory. Unlike file read/write, the load and store instructions do not operate on a copy; they directly access and modify the contents of the file, treating memory as a write-back cache for disk.
We saw an example of a memory-mapped file in the previous chapter: the program executable image. To start a process, the operating system brings the executable image into memory, and creates page table entries to point to the page frames allocated to the executable. The operating system can start the program executing as soon as the first page frame is initialized, without waiting for the other pages to be brought in from disk. For this, the other page table entries are set to invalid; if the process accesses a page that has not reached memory yet, the hardware traps to the operating system and then waits until the page is available so it can continue to execute. From the program’s perspective, there is no difference (except for performance) between whether the executable image is entirely in memory or still mostly on disk.
We can generalize this concept to any file stored on disk, allowing applications to treat any file as part of their virtual address space. File blocks are brought in by the operating system when they are referenced, and modified blocks are copied back to disk, with the bookkeeping done entirely by the operating system.
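From an application's point of view, a memory-mapped file looks like an ordinary byte array. As an illustration, the sketch below uses Python's standard `mmap` module to modify a file in place through a mapping instead of through read/write calls. The helper names and the temporary file are hypothetical, and the mapped file must be non-empty.

```python
import mmap
import os
import tempfile

def uppercase_in_place(path):
    """Modify a file through a memory mapping rather than read/write calls."""
    with open(path, "r+b") as f:
        # A length of 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0) as mapped:
            # Stores into the mapping modify the file's pages directly.
            mapped[:] = bytes(mapped[:]).upper()
            # Ask the operating system to write modified pages back to the file.
            mapped.flush()

def demo():
    """Create a scratch file, modify it via the mapping, and return its contents."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hello, mapped file")
        uppercase_in_place(path)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

Calling `demo()` reads the file back with an ordinary `read` and should observe the change made through the mapping.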
9.6.1 Advantages
Memory-mapped files offer a number of advantages:
Transparency. The program can operate on the bytes in the file as if they are part of memory; specifically, the program can use a pointer into the file without needing to check if that portion of the file is in memory or not.
Zero copy I/O. The operating system does not need to copy file data from kernel buffers into user memory and back; rather, it just changes the program’s page table entry to point to the physical page frame containing that portion of the file. The kernel is responsible for copying data back and forth to disk. We should note that it is possible to implement zero copy I/O for explicit read/write file system calls in certain restricted cases; we will explain how in the next chapter.
Pipelining. The program can start operating on the data in the file as soon as the page tables have been set up; it does not need to wait for the entire file to be read into memory. With multiple threads, a program can use explicit read/write calls to pipeline disk I/O, but it needs to manage the pipeline itself.
Interprocess communication. Two or more processes can share information instantaneously through a memory-mapped file without needing to shuffle data back and forth to the kernel or to disk. If the hardware architecture supports it, the page table for the shared segment can also be shared.
Large files. As long as the page table for the file can fit in physical memory, the only limit on the size of a memory-mapped file is the size of the virtual address space. For example, an application may have a giant multi-level tree indexing data spread across a number of disks in a data center. With read/write system calls, the application needs to explicitly manage which parts of the tree are kept in memory and which are on disk; alternatively, with memory-mapped files, the application can leave that bookkeeping to the operating system.
9.6.2 Implementation
To implement memory-mapped files, the operating system provides a system call to map the file into a portion of the virtual address space. In the system call, the kernel initializes a set of page table entries for that region of the virtual address space, setting each entry to invalid. The kernel then returns to the user process.
Figure 9.17: Before a page fault, the page table has an invalid entry for the referenced page and the data for the page is stored on disk.
Figure 9.18: After the page fault, the page table has a valid entry for the referenced page with the page frame containing the data that had been stored on disk. The old contents of the page frame are stored on disk and the page table entry that previously pointed to the page frame is set to invalid.
When the process issues an instruction that touches an invalid mapped address, a sequence of events occurs, illustrated in Figures 9.17 and 9.18:
TLB miss. The hardware looks the virtual page up in the TLB, and finds that there is not a valid entry. This triggers a full page table lookup in hardware.
Page table exception. The hardware walks the multi-level page table and finds the page table entry is invalid. This causes a hardware page fault exception trap into the operating system kernel.
Convert virtual address to file offset. In the exception handler, the kernel looks up in its segment table to find the file corresponding to the faulting virtual address and converts the address to a file offset.
Disk block read. The kernel allocates an empty page frame and issues a disk operation to read the required file block into the allocated page frame. While the disk operation is in progress, the processor can be used for running other threads or processes.
Disk interrupt. The disk interrupts the processor when the disk read finishes, and the scheduler resumes the kernel thread handling the page fault exception.
Page table update. The kernel updates the page table entry to point to the page frame allocated for the block and sets the entry to valid.
Resume process. The operating system resumes execution of the process at the instruction that caused the exception.
TLB miss. The TLB still does not contain a valid entry for the page, triggering a full page table lookup.
Page table fetch. The hardware walks the multi-level page table, finds the page table entry valid, and returns the page frame to the processor. The processor loads the TLB with the new translation, evicting a previous TLB entry, and then uses the translation to construct a physical address for the instruction.
To make this work, we need an empty page frame to hold the incoming page from disk. To create an empty page frame, the operating system must:
Select a page to evict. Assuming there is not an empty page of memory already available, the operating system needs to select some page to be replaced. We discuss how to implement this selection in Section 9.6.3 below.
Find page table entries that point to the evicted page. The operating system then locates the set of page table entries that point to the page to be replaced. It can do this with a core map: an array of information about each physical page frame, including which page table entries contain pointers to that particular page frame.
Set each page table entry to invalid. The operating system needs to prevent anyone from using the evicted page while the new page is being brought into memory. Because the processor can continue to execute while the disk read is in progress, the page frame may temporarily contain a mixture of bytes from the old and the new page. In addition, because the TLB may cache a copy of the old page table entry, a TLB shootdown is needed to evict the old translation from the TLB.
Copy back any changes to the evicted page. If the evicted page was modified, the contents of the page must be copied back to disk before the new page can be brought into memory. Likewise, the contents of modified pages must also be copied back when the application closes the memory-mapped file.
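As a rough illustration of how a core map ties these steps together, the sketch below models each page frame as an object that remembers which page table entries point to it. The names (Frame, mappers, evict) and the representation of the TLB as a Python set are assumptions made for this example only; victim selection is assumed to have happened already.

```python
class Frame:
    def __init__(self, data):
        self.data = data
        self.mappers = []        # core map entry: (page_table, vpage) pairs


def evict(frame, tlb, write_back):
    """Empty `frame`, assuming the replacement policy already selected it."""
    dirty = False
    for page_table, vpage in frame.mappers:       # found through the core map
        pte = page_table[vpage]
        dirty = dirty or pte["dirty"]             # dirty if any mapping is dirty
        pte["valid"] = False                      # block further use of the page
        pte["dirty"] = False
        tlb.discard((id(page_table), vpage))      # stands in for a TLB shootdown
    if dirty:
        write_back(frame.data)                    # copy changes back to disk
    frame.mappers.clear()
    frame.data = None
```

Because the frame is written back if any of its mappings is dirty, a page shared by several page tables is saved even when only one of them recorded the modification.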
Figure 9.19: When a page is clean, its dirty bit is set to zero in both the TLB and the page table, and the data in memory is the same as the data stored on disk.
Figure 9.20: On the first store instruction to a clean page, the hardware sets the dirty bit for that page in the TLB and the page table. The contents of the page will differ from what is stored on disk.
How does the operating system know which pages have been modified? A correct, but inefficient, solution is to simply assume that every page in a memory-mapped file has been modified; if the data has not been changed, the operating system will have wasted some work, but the contents of the file will not be affected.
A more efficient solution is for the hardware to keep track of which pages have been modified. Most processor architectures reserve a bit in each page table entry to record whether the page has been modified. This is called a dirty bit. The operating system initializes the bit to zero, and the hardware sets the bit automatically when it executes a store instruction for that virtual page. Since the TLB can contain a copy of the page table entry, the TLB also needs a dirty bit per entry. If the dirty bit is already set in the TLB, the hardware has nothing more to do on a store, but whenever the bit goes from zero to one, the hardware needs to copy the bit back to the corresponding page table entry. Figures 9.19 and 9.20 show the state of the TLB, page table, memory and disk before and after the first store instruction to a page.
If there are multiple page table entries pointing at the same physical page frame, the page is dirty (and must be copied back to disk) if any of the page table entries has the dirty bit set. Normally, of course, a memory-mapped file will have a single page table shared between all of the processes mapping the file.
Because evicting a dirty page takes more time than evicting a clean page, the operating system can proactively clean pages in the background. A thread runs in the background, looking for pages that would be likely candidates for eviction if they were clean. If the hardware dirty bit is set in the page table entry, the kernel resets the bit in the page table entry and does a TLB shootdown to remove the entry from the TLB (with the old value of the dirty bit). It then copies the page to disk. Of course, the on-chip processor memory cache and write buffers can contain modifications to the page that have not reached main memory; the hardware ensures that the new data reaches main memory before those bytes are copied to the disk interface.
The kernel can then restart the application; it need not wait for the block to reach disk. If the process modifies the page again, the hardware will simply set the dirty bit again, signaling that the block cannot be reclaimed without saving the new set of changes to disk.
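The interplay between the TLB copy of the dirty bit, the page table entry, and the background cleaner can be sketched as follows. This is a simplified model, not real hardware behavior in detail: the Page class and the field names pte_dirty and tlb_dirty are hypothetical, and a dictionary stands in for the disk.

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.pte_dirty = False   # dirty bit in the page table entry
        self.tlb_dirty = None    # None: not in the TLB; otherwise the cached bit


def store(page, value):
    """Model of the hardware executing a store to the page."""
    if page.tlb_dirty is None:            # TLB miss: load the entry and its dirty bit
        page.tlb_dirty = page.pte_dirty
    if not page.tlb_dirty:                # zero-to-one transition is copied back
        page.tlb_dirty = True
        page.pte_dirty = True
    page.data = value


def clean(page, disk, block):
    """Model of the background cleaner thread examining one page."""
    if page.pte_dirty:
        page.pte_dirty = False            # reset the bit in the page table entry
        page.tlb_dirty = None             # shootdown removes the stale TLB entry
        disk[block] = page.data           # then copy the page to disk
```

After clean runs, a later store finds no TLB entry, reloads it with a zero dirty bit, and sets the bit again in both places, so the new changes are not lost.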
Emulating a hardware dirty bit in software
Interestingly, hardware support for a dirty bit is not strictly required. The operating system can emulate a hardware dirty bit using page table access permissions. An unmodified page is set to allow only read-only access, even though the program is logically allowed to write the page. The program can then execute normally. On a store instruction to the page, the hardware will trigger a memory exception. The operating system can then record the fact that the page is dirty, upgrade the page protection to read-write, and restart the process.
To clean a page in the background, the kernel resets the page protection to read-only and does a TLB shootdown. The shootdown removes any translation that allows for read-write access to the page, forcing subsequent store instructions to cause another memory exception.
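A minimal sketch of this emulation, assuming a page the program is logically allowed to write, might look like the following. The names SoftPage, hw_writable, and faults are invented for the example, and the memory exception is modeled as an ordinary branch rather than a trap.

```python
class SoftPage:
    def __init__(self):
        self.hw_writable = False   # permission stored in the page table entry
        self.dirty = False         # the kernel's software record
        self.faults = 0            # memory exceptions taken so far


def store(page):
    """A store instruction; the program is logically allowed to write."""
    if not page.hw_writable:
        # Hardware raises a memory exception; the kernel handler runs.
        page.faults += 1
        page.dirty = True          # record that the page is now dirty
        page.hw_writable = True    # upgrade protection to read-write
    # The restarted store now proceeds without kernel involvement.


def clean(page):
    """Background cleaning: write-protect again (plus a TLB shootdown)."""
    page.dirty = False
    page.hw_writable = False
```

Only the first store after each cleaning pass pays for an exception; later stores run at full speed until the page is cleaned again.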
9.6.3 Approximating LRU
A further challenge to implementing demand paged memory-mapped files is that the hardware does not keep track of which pages are least recently or least frequently used. Doing so would require the hardware to keep a linked list of every page in memory, and to modify that list on every load and store instruction (and for memory-mapped executable images, every instruction fetch as well). This would be prohibitively expensive. Instead, the hardware maintains a minimal amount of access information per page to allow the operating system to approximate LRU or LFU if it wants to do so.
We should note that explicit read/write file system calls do not have this problem. Each time a process reads or writes a file block, the operating system can keep track of which blocks are used. The kernel can use this information to prioritize its cache of file blocks when the system needs to find space for a new block.
Most processor architectures keep a use bit in each page table entry, next to the hardware dirty bit we discussed above. The operating system clears the use bit when the page table entry is initialized; the bit is set in hardware whenever the page table entry is brought into the TLB. As with the dirty bit, a physical page is used if any of the page table entries have their use bit set.
Figure 9.21: The clock algorithm sweeps through each page frame, collecting the current value of the use bit for that page and resetting the use bit to zero. The clock algorithm stops when it has reclaimed a sufficient number of unused page frames.
The operating system can leverage the use bit in various ways, but a commonly used approach is the clock algorithm, illustrated in Figure 9.21. Periodically, the operating system scans through the core map of physical memory pages. For each page frame, it records the value of the use bit in the page table entries that point to that frame, and then clears their use bits. Because the TLB can have a cached copy of the translation, the operating system also does a shootdown for any page table entry where the use bit is cleared. Note that if the use bit is already zero, the translation cannot be in the TLB. While it is scanning, the kernel can also look for dirty and recently unused pages and flush these out to disk.
Each sweep of the clock algorithm through memory collects one bit of information about page usage; by adjusting the frequency of the clock algorithm, we can collect increasingly fine-grained information about usage, at the cost of increased software overhead. On modern systems with hundreds of thousands and sometimes millions of physical page frames, the overhead of the clock algorithm can be substantial.
The policy for what to do with the usage information is up to the operating system kernel. A common policy is called not recently used, or k’th chance. If the operating system needs to evict a page, the kernel picks one that has not been used (has not had its use bit set) for the last k sweeps of the clock algorithm. The clock algorithm partitions pages based on how recently they have been used; among page frames in the same k’th chance partition, the operating system can evict pages in FIFO order.
Some systems trigger the clock algorithm only when a page is needed, rather than periodically in the background. Provided some pages have not been accessed since the last sweep, an on-demand clock algorithm will find a page to reclaim. If all pages have been accessed, e.g., if there is a storm of page faults due to phase change behavior, then the system will default to FIFO.
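The clock sweep and the k’th chance policy can be sketched in a few lines. This is an illustrative model only: FrameInfo, idle_sweeps, sweep, and pick_victim are names chosen for the example, the list order of frames stands in for FIFO order, and the TLB shootdown that a real kernel performs when clearing a use bit is reduced to a comment.

```python
class FrameInfo:
    def __init__(self, name):
        self.name = name
        self.use = False       # use bit, set by "hardware" when the page is touched
        self.idle_sweeps = 0   # consecutive sweeps that found the use bit clear


def sweep(frames):
    """One pass of the clock hand over every page frame."""
    for f in frames:
        if f.use:
            f.idle_sweeps = 0
            f.use = False      # a real kernel would also shoot down the TLB entry
        else:
            f.idle_sweeps += 1


def pick_victim(frames, k):
    """Return the first frame, in list order, unused for at least k sweeps."""
    for f in frames:
        if f.idle_sweeps >= k:
            return f
    return None                # nothing qualifies; the caller falls back to FIFO
```

Raising k makes the policy more conservative: a page must sit untouched through more sweeps before it becomes eligible, at the cost of finding fewer candidates when memory is tight.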
Emulating a hardware use bit in software
Hardware support for a use bit is also not strictly required. The operating system kernel can emulate a use bit with page table permissions, in the same way that the kernel can emulate a hardware dirty bit. To collect usage information about a page, the kernel sets the page table entry to be invalid even though the page is in memory and the application has permission to access the page. When the page is accessed, the hardware will trigger an exception and the operating system can record the use of the page. The kernel then changes the permission on the page to allow access, before restarting the process. To collect usage information over time, the operating system can periodically reset the page table entry to invalid and shoot down any cached translations in the TLB.
Many systems use a hybrid approach. In addition to active pages where the hardware collects the use bit, the operating system maintains a pool of unused, clean page frames that are unmapped in any virtual address space, but still contain their old data. When a new page frame is needed, pages in this pool can be used without any further work. However, if the old data is referenced before the page frame is reused, the page can be pulled out of the pool and mapped back into the virtual address space.
Systems with a software-managed TLB have an even simpler time. Each time there is a TLB miss with a software-managed TLB, there is a trap to the kernel to look up the translation. During the trap, the kernel can update its list of frequently used pages.
9.7 Case Study: Virtual Memory
We can generalize on the concept of memory-mapped files, by backing every memory segment with a file on disk. This is called virtual memory. Program executables, individual libraries, data, stack and heap segments can all be demand paged to disk. Unlike memory-mapped files, though, process memory is ephemeral: when the process exits, there is no need to write modified data back to disk, and we can reclaim the disk space.
The advantage of virtual memory is flexibility. The system can continue to function even though the user has started more processes than can fit in main memory at the same time. The operating system simply makes room for the new processes by paging the memory of idle applications to disk. Without virtual memory, the user has to do memory management by hand, closing some applications to make room for others.
All of the mechanisms we have described for memory-mapped files apply when we generalize to virtual memory, with one additional twist. We need to balance the allocation of physical page frames between processes. Unfortunately, this balancing is quite tricky. If we add a few extra page faults to a system, no one will notice. However, a modern disk can handle at most 100 page faults per second, while a modern multi-core processor can execute 10 billion instructions per second. Thus, if page faults are anything but extremely rare, performance will suffer.
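To see how rare page faults must be, a back-of-the-envelope calculation helps. The sketch below uses the two figures quoted above and assumes, as a simplification, that the processor stalls for the full duration of every fault with no overlap between computation and paging; the function name effective_rate is ours.

```python
CPU_RATE = 10e9     # instructions per second, from the text
FAULT_RATE = 100    # page faults the disk can service per second, from the text


def effective_rate(p_fault):
    """Instructions per second when a fraction p_fault of instructions fault.

    Assumes each fault stalls the processor for 1 / FAULT_RATE seconds and
    that nothing else runs in the meantime.
    """
    time_per_instruction = 1.0 / CPU_RATE + p_fault / FAULT_RATE
    return 1.0 / time_per_instruction
```

Under these assumptions, a single page fault per 100 million instructions (p_fault = 1e-8) already halves throughput to about 5 billion instructions per second, and if every instruction faults the rate drops to roughly 100 instructions per second.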
9.7.1 Self-Paging
One consideration is that the behavior of one process can significantly hurt the performance of other programs running at the same time. For example, suppose we have two processes. One is a normal program, with a working set equal to, say, a quarter of physical memory. The other program is greedy; while it can run fine with less memory, it will run faster if it is given more memory. We gave an example of this earlier with the sort program.
Can you design a program to take advantage of the clock algorithm to acquire more than its fair share of memory pages?
Figure 9.22: The “pig” program to greedily acquire memory pages. The implementation assumes we are running on a multicore computer. When the pig triggers a page fault by touching a new memory page (soFar), the operating system will find all of the pig’s pages up to soFar recently used. The operating system will keep these in memory and it will choose to evict a page from some other application.
We give an example in Figure 9.22, which we will dub “pig” for obvious reasons. It allocates an array in virtual memory equal in size to physical memory; it then uses multiple threads to cycle through memory, causing each page to be brought in while the other pages remain very recently used.
A normal program sharing memory with the pig will eventually be frozen out of memory and stop making progress. When the pig touches a new page, it triggers a page fault, but all of its pages are recently used because of the background thread. Meanwhile, the normal program will have recently touched many of its pages but there will be some that are less recently used. The clock algorithm will choose those for replacement.
As time goes on, more and more of the pages will be allocated to the pig. As the number of pages assigned to the normal program drops, it starts experiencing page faults at an increasing frequency. Eventually, the number of pages drops below the working set, at which point the program stops making much progress. Its pages are even less frequently used, making them easier to evict.
Of course, a normal user would probably never run (or write!) a program like this, but a malicious attacker launching a computer virus might use this approach to freeze out the system administrator. Likewise, in a data center setting, a single server can be shared between multiple applications from different users, for example, running in different virtual machines. It is in the interest of any single application to acquire as many physical resources as possible, even if that hurts performance for other users.
A widely adopted solution is self-paging. With self-paging, each process or user is assigned its fair share of page frames, using the max-min scheduling algorithm we described in Chapter 7. If all of the active processes can fit in memory at the same time, the system does not need to page. As the system starts to page, it evicts a page from whichever process has the most page frames allocated to it. Thus, the pig would only be able to allocate its fair share of page frames, and beyond that any page faults it triggers would evict its own pages.
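The eviction rule at the heart of self-paging can be sketched as a few lines of bookkeeping. This is a deliberately simplified model rather than a full max-min allocator: it tracks only how many frames each process holds, the function name handle_fault is invented for the example, and the tie-breaking rule (the faulting process pays when it is tied for largest) is an assumption chosen so that a greedy process cannot push past an equal split.

```python
def handle_fault(frames_owned, faulting, total_frames):
    """Give `faulting` one more frame, evicting from the largest holder if full.

    frames_owned maps a process name to the number of frames it holds.
    """
    if sum(frames_owned.values()) >= total_frames:
        # The largest holder loses a page; on a tie, the faulting process pays.
        victim = max(frames_owned,
                     key=lambda p: (frames_owned[p], p == faulting))
        frames_owned[victim] -= 1
    frames_owned[faulting] += 1
```

With 8 frames, a normal program holding 4, and a pig that faults repeatedly, the pig grows to 4 frames and then only ever evicts its own pages, so the normal program keeps its working set.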
Unfortunately, self-paging comes at a cost in reduced resource utilization. Suppose we have two processes, both of which allocate large amounts of virtual address space. However, the working sets of the two programs can fit in memory at the same time, for example, if one working set takes up 2/3rds of memory and the other takes up 1/3rd. If they cooperate, both can run efficiently because the system has room for both working sets. However, if we need to bulletproof the operating system against malicious programs by self-paging, then each will be assigned half of memory and the larger program will thrash.
9.7.2 Swapping
Another issue is what happens as we increase the workload for a system with virtual memory. If we are running a data center, for example, we can share physical machines among a much larger number of applications each running in a separate virtual machine. To reduce costs, the data center needs to support the maximum number of applications on each server, within some performance constraint.
If the working sets of the applications easily fit in memory, then as page faults occur, the clock algorithm will find lightly used pages (that is, those outside of the working set of any process) to evict to make room for new pages. So far so good. As we keep adding active processes, however, their working sets may no longer fit, even if each process is given its fair share of memory. In this case, the performance of the system will degrade dramatically.
This can be illustrated by considering how system throughput is affected by the number of processes. As we add work to the system, throughput increases as long as there is enough processing capacity and I/O bandwidth. When we reach the point where there are too many tasks to fit entirely in memory, the system starts demand paging. Throughput can continue to improve if there are enough lightly used pages to make room for new tasks, but eventually throughput levels off and then falls off a cliff. In the limit, every instruction will trigger a page fault, meaning that the processor executes at 100 instructions per second, rather than 10 billion instructions per second. Needless to say, the user will think the system is dead even if it is in fact inching forward very slowly.
As we explained in the Chapter 7 discussion on overload control, the only way to achieve good performance in this case is to prevent the overload condition from occurring. Both response time and throughput will be better if we prevent additional tasks from starting or if we remove some existing tasks. It is better to completely starve some tasks of their resources, if the alternative, assigning each task its fair share, will drag the system to a halt.
Evicting an entire process from memory is called swapping. When there is too much paging activity, the operating system can prevent a catastrophic degradation in performance by moving all of the page frames of a particular process to disk, preventing it from running at all. Although this may seem terribly unfair, the alternative is that every process, not just the swapped process, will run much more slowly. By distributing the swapped process’s pages to other processes, we can reduce the number of page faults, allowing system performance to recover. Eventually the other tasks will finish, and we can bring the swapped process back into memory.
9.8 Summary and Future Directions
Caching is central to many areas of computer science: caches are used in processor design, file systems, web browsers, web servers, compilers, and kernel memory management, to name a few. To understand these systems, it is important to understand how caches work, and even more importantly, when they fail.
The management of memory in operating systems is a particularly useful case study. Every major commercial operating system includes support for demand paging of memory, using memory as a cache for disk. Often, application memory pages and blocks in the file buffer are allocated from a common pool of memory, where the operating system attempts to keep blocks that are likely to be used in memory and evict those blocks that are less likely to be used. However, on modern systems, the difference between finding a block in memory and needing to bring it in from disk can be as much as a factor of 100,000. This makes virtual memory paging fragile, acceptable only when used in small doses.
Moving forward, several trends are in progress:
Low latency backing store. Due to the weight and power drain of magnetic disks, many portable devices have moved to solid state persistent storage, such as non-volatile RAM. Current solid state storage devices have significantly lower latency than disk, and even faster devices are likely in the future. Similarly, the move towards data center computing has added a new option to memory management: using DRAM on other nodes in the data center as a low-latency, very high capacity backing store for local memory. Both of these trends reduce the cost of paging, making it relatively more attractive.
Variable page sizes. Many systems use a standard 4 KB page size, but there is nothing fundamental about that choice: it is a tradeoff chosen to balance internal fragmentation, page table overhead, disk latency, the overhead of collecting dirty and usage bits, and application spatial locality. On modern disks, it only takes twice as long to transfer 256 contiguous pages as it does to transfer one, so internally, most operating systems arrange disk transfers to include many blocks at a time. With new technologies such as low latency solid state storage and cluster memory, this balance may shift back towards smaller effective page sizes.
Memory aware applications. The increasing depth and complexity of the memory hierarchy is both a boon and a curse. For many applications, the memory hierarchy delivers reasonable performance without any special effort. However, the wide gulf in performance between the first level cache and main memory, and between main memory and disk, implies that there is a significant performance benefit to tuning applications to the available memory. This poses a particular challenge for operating systems to adapt to applications that are adapting to their physical resources.
Exercises
1. A computer system has a 1 KB page size and keeps the page table for each process in main memory. Because the page table entries are usually cached on chip, the average overhead for doing a full page table lookup is 40 ns. To reduce this overhead, the computer has a 32-entry TLB. A TLB lookup requires 1 ns. What TLB hit rate is required to ensure an average virtual address translation time of 2 ns?
2. Most modern computer systems choose a page size of 4 KB.
a. Give a set of reasons why doubling the page size might increase performance.
b. Give a set of reasons why doubling the page size might decrease performance.
3. For each of the following statements, indicate whether the statement is true or false, and explain why.
a. A direct mapped cache can sometimes have a higher hit rate than a fully associative cache (on the same reference pattern).
b. Adding a cache never hurts performance.
4. Suppose an application is assigned 4 pages of physical memory and the memory is initially empty. It then references pages in the following sequence:
A C B D B A E F B F A G E F A
a. Show how the system would fault pages into the four frames of physical memory, using the LRU replacement policy.
b. Show how the system would fault pages into the four frames of physical memory, using the MIN replacement policy.
c. Show how the system would fault pages into the four frames of physical memory, using the clock replacement policy.
5. Is least recently used a good cache replacement algorithm to use for a workload following a Zipf distribution? Briefly explain why or why not.
6. Briefly explain how to simulate a modify bit per page for the page replacement algorithm if the hardware does not provide one.
7. Suppose we have four programs:
a. One exhibits both spatial and temporal locality.
b. One touches each page sequentially, and then repeats the scan in a loop.
c. One references pages according to a Zipf distribution (e.g., it is a web server and its memory consists of cached web pages).
d. One generates memory references completely at random using a uniform random number generator.
All four programs use the same total amount of virtual memory; that is, they all touch N distinct virtual pages, amongst a much larger number of total references.
For each program, sketch a graph showing the rate of progress (instructions per unit time) of each program as a function of the physical memory available to the program, from 0 to N, assuming the page replacement algorithm approximates least recently used.
8. Suppose a program repeatedly scans linearly through a large array in virtual memory. In other words, if the array is four pages long, its page reference pattern is ABCDABCDABCD…
For each of the following page replacement algorithms, sketch a graph showing the rate of progress (instructions per unit time) of the program as a function of the physical memory available to the program, from 0 to N, where N is sufficient to hold the entire array.
a. FIFO
b. Least recently used
c. Clock algorithm
d. Nth chance algorithm
e. MIN
9. Consider a computer system running a general-purpose workload with demand paging. The system has two disks, one for demand paging and one for file system operations. Measured utilizations (in terms of time, not space) are given in Figure 9.23.
Resource                 Utilization
Processor utilization    20.0%
Paging Disk              99.7%
File Disk                10.0%
Network                  5.0%
Figure 9.23: Measured utilizations for sample system under consideration.
For each of the following changes, say what its likely impact will be on processor utilization, and explain why. Is it likely to significantly increase, marginally increase, significantly decrease, marginally decrease, or have no effect on the processor utilization?
a. Get a faster CPU
b. Get a faster paging disk
c. Increase the degree of multiprogramming
10. An operating system with a physically addressed cache uses page coloring to more fully utilize the cache.
a. How many page colors are needed to fully utilize a physically addressed cache, with 1 TB of main memory, an 8 MB cache with 4-way set associativity, and a 4 KB page size?
b. Develop an algebraic formula to compute the number of page colors needed for an arbitrary configuration of cache size, set associativity, and page size.
11. The sequence of virtual pages referenced by a program has length p with n distinct page numbers occurring in it. Let m be the number of page frames that are allocated to the process (all the page frames are initially empty). Let n > m.
a. What is the lower bound on the number of page faults?
b. What is the upper bound on the number of page faults?
The lower/upper bound should be for any page replacement policy.
12. You have decided to splurge on a low end netbook for doing your operating systems homework during lectures in your non-computer science classes. The netbook has a single-level TLB and a single-level, physically addressed cache. It also has two levels of page tables, and the operating system does demand paging to disk.
The netbook comes in various configurations, and you want to make sure the configuration you purchase is fast enough to run your applications. To get a handle on this, you decide to measure its cache, TLB and paging performance running your applications in a virtual machine. Figure 9.24 lists what you discover for your workload.
Measurement                                                         Value
P_CacheMiss = probability of a cache miss                           0.01
P_TLBmiss = probability of a TLB miss                               0.01
P_fault = probability of a page fault, given a TLB miss occurs      0.00002
T_cache = time to access cache                                      1 ns = 0.001 μs
T_TLB = time to access TLB                                          1 ns = 0.001 μs
T_DRAM = time to access main memory                                 100 ns = 0.1 μs
T_disk = time to transfer a page to/from disk                       10^7 ns = 10 ms
Figure 9.24: Sample measurements of cache behavior on the low-end netbook described in the exercises.
The TLB is refilled automatically by the hardware on a TLB miss. The page tables are kept in physical memory and are not cached, so looking up a page table entry incurs two memory accesses (one for each level of the page table). You may assume the operating system keeps a pool of clean pages, so pages do not need to be written back to disk on a page fault.
a. What is the average memory access time (the time for an application program to do one memory reference) on the netbook? Express your answer algebraically and compute the result to two significant digits.
b. The netbook has a few optional performance enhancements:
Item                    Specs                                          Price
Faster disk drive       Transfers a page in 7 ms                       $100
500 MB more DRAM        Makes probability of a page fault 0.00001      $100
Faster network card     Allows paging to remote memory.                $100
With the faster network card, the time to access remote memory is 500 ms, and the probability of a remote memory miss (need to go to disk), given there is a page fault is 0.5.
Suppose you have $200. What options should you buy to maximize the performance of the netbook for this workload?
13. On a computer with virtual memory, suppose a program repeatedly scans through a very large array. In other words, if the array is four pages long, its page reference pattern is ABCDABCDABCD…
Sketch a graph showing the paging behavior, for each of the following page replacement algorithms. The y-axis of the graph is the number of page faults per referenced page, varying from 0 to 1; the x-axis is the size of the array being scanned, varying from smaller than physical memory to much larger than physical memory. Label any interesting points on the graph on both the x and y axes.
a. FIFO
b. LRU
c. Clock
d. MIN
14. Consider two programs, one that exhibits spatial and temporal locality, and the other that exhibits neither. To make the comparison fair, they both use the same total amount of virtual memory; that is, they both touch N distinct virtual pages, among a much larger number of total references.
Sketch graphs showing the rate of progress (instructions per unit time) of each program as a function of the physical memory available to the program, from 0 to N, assuming the clock algorithm is used for page replacement.
a. Program exhibiting locality, running by itself
b. Program exhibiting no locality, running by itself
c. Program exhibiting locality, running with the program exhibiting no locality (assume both have the same value for N).
d. Program exhibiting no locality, running with the program exhibiting locality (assume both have the same N).
15. Suppose we are using the clock algorithm to decide page replacement, in its simplest form (“first-chance” replacement, where the clock is only advanced on a page fault and not in the background).
A crucial issue in the clock algorithm is how many page frames must be considered in order to find a page to replace. Assuming we have a sequence of F page faults in a system with P page frames, let C(F,P) be the number of pages considered for replacement in handling the F page faults (if the clock hand sweeps by a page frame multiple times, it is counted each time).
a. Give an algebraic formula for the minimum possible value of C(F,P).
b. Give an algebraic formula for the maximum possible value of C(F,P).
10. Advanced Memory Management
All problems in computer science can be solved by another level of indirection.
-- David Wheeler
At an abstract level, an operating system provides an execution context for application processes, consisting of limits on privileged instructions, the process’s memory regions, a set of system calls, and some way for the operating system to periodically regain control of the processor. By interposing on that interface (most commonly, by catching and transforming system calls or memory references), the operating system can transparently insert new functionality to improve system performance, reliability, and security.
Interposing on system calls is straightforward. The kernel uses a table lookup to determine which routine to call for each system call invoked by the application program. The kernel can redirect a system call to a new enhanced routine by simply changing the table entry.
A more interesting case is the memory system. Address translation hardware provides an efficient way for the operating system to monitor and gain control on every memory reference to a specific region of memory, while allowing other memory references to continue unaffected. (Equivalently, software-based fault isolation provides many of the same hooks, with different tradeoffs between interposition and execution speed.) This makes address translation a powerful tool for operating systems to introduce new, advanced services to applications. We have already shown how to use address translation for:
Protection. Operating systems use address translation hardware, along with segment and page table permissions, to restrict access by applications to privileged memory locations such as those in the kernel.
Fill-on-demand/zero-on-demand. By setting some page table permissions to invalid, the kernel can start executing a process before all of its code and data have been loaded into memory; the hardware will trap to the kernel if the process references data before it is ready. Similarly, the kernel can zero data and heap pages in the background, relying on page reference faults to catch the first time an application uses an empty page. The kernel can also allocate memory for kernel and user stacks only as needed. By marking unused stack pages as invalid, the kernel needs to allocate those pages only if the program executes a deep procedure call chain.
Copy-on-write. Copy-on-write allows multiple processes to have logically separate copies of the same memory region, backed by a single physical copy in memory. Each page in the region is mapped read-only in each process; the operating system makes a physical copy only when (and if) a page is modified.
Memory-mapped files. Disk files can be made part of a process’s virtual address space, allowing the process to access the data in the file using normal processor instructions. When a page from a memory-mapped file is first accessed, a protection fault traps to the operating system so that it can bring the page into memory from disk. The first write to a file block can also be caught, marking the block as needing to be written back to disk.
Demand paged virtual memory. The operating system can run programs that use more memory than is physically present on the computer, by catching references to pages that are not physically present and filling them from disk or cluster memory.
In this chapter, we explore how to construct a number of other advanced operating system services by catching and re-interpreting memory references and system calls.
Chapter roadmap:
Zero-Copy I/O. How do we improve the performance of transferring blocks of data between user-level programs and hardware devices? (Section 10.1)
Virtual Machines. How do we execute an operating system on top of another operating system, and how can we use that abstraction to introduce new operating system services? (Section 10.2)
Fault Tolerance. How can we make applications resilient to machine crashes? (Section 10.3)
Security. How can we contain malicious applications that can exploit unknown faults inside the operating system? (Section 10.4)
User-Level Memory Management. How do we give applications control over how their memory is managed? (Section 10.5)
10.1 Zero-Copy I/O
Figure 10.1: A web server gets a request from the network. The server first asks the kernel to copy the requested file from disk or its file buffer into the server’s address space. The server then asks the kernel to copy the contents of the file back out to the network.
A common task for operating systems is to stream data between user-level programs and physical devices such as disks and network hardware. However, this streaming can be expensive in processing time if the data is copied as it moves across protection boundaries. A network packet needs to go from the network interface hardware, into kernel memory, and then to user-level; the response needs to go from user-level back into kernel memory and then from kernel memory to the network hardware.
Consider the operation of the web server, as pictured in Figure 10.1. Almost all web servers are implemented as user-level programs. This way, it is easy to reconfigure server behavior, and bugs in the server implementation do not necessarily compromise system security.
A number of steps need to happen for a web server to respond to a web request. For this example, assume that the connection between the client and server is already established, there is a server thread allocated to each client connection, and we use explicit read/write system calls rather than memory mapped files.
Server reads from network. The server thread calls into the kernel to wait for an arriving request.
Packet arrival. The web request arrives from the network; the network hardware uses DMA to copy the packet data into a kernel buffer.
Copy packet data to user-level. The operating system parses the packet header to determine which user process is to receive the web request. The kernel copies the data into the user-level buffer provided by the server thread and returns to user-level.
Server reads file. The server parses the data in the web request to determine which file is requested. It issues a file read system call back to the kernel, providing a user-level buffer to hold the file contents.
Data arrival. The kernel issues the disk request, and the disk controller copies the data from the disk into a kernel buffer. If the file data is already in the file buffer cache, as will often be the case for popular web requests, this step is skipped.
Copy file data to user-level. The kernel copies the data into the buffer provided by the user process and returns to user-level.
Server write to network. The server turns around and hands the buffer containing the file data back to the kernel to send out to the network.
Copy data to kernel. The kernel copies the data from the user-level buffer into a kernel buffer, formats the packet, and issues the request to the network hardware.
Data send. The hardware uses DMA to copy the data from the kernel buffer out to the network.
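From the server's point of view, the file-serving half of these steps reduces to a read loop followed by a write. The sketch below shows that conventional copying path (it is not a zero-copy implementation); serve_file and demo are names invented for the example, the request parsing is omitted, and a local socket pair stands in for the network connection.

```python
import os
import socket
import tempfile


def serve_file(conn, path, bufsize=65536):
    """Send the file at `path` over `conn` with explicit read/write calls."""
    sent = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(bufsize)   # kernel copies file data into a user buffer
            if not chunk:
                break
            conn.sendall(chunk)       # kernel copies the user buffer back into
            sent += len(chunk)        # a kernel buffer before the device sends it
    return sent


def demo():
    """Send a small temporary file across a local socket pair."""
    payload = b"hello, world\n" * 10
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    server_end, client_end = socket.socketpair()
    try:
        sent = serve_file(server_end, tmp.name)
        server_end.close()
        received = b""
        while True:
            data = client_end.recv(4096)
            if not data:
                break
            received += data
        return sent, received == payload
    finally:
        client_end.close()
        os.unlink(tmp.name)
```

Every byte of the file crosses the kernel-user boundary twice in this loop, once on the read and once on the send, which is exactly the overhead that zero-copy I/O aims to remove.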
Although we have illustrated this with a web server, a similar process occurs for any application that streams data in or out of a computer. Examples include a web client, online video or music service, BitTorrent, network file systems, and even a software download. For each of these, data is copied from hardware into the kernel and then into user-space, or vice versa.
We could eliminate the extra copy across the kernel-user boundary by moving each of these applications into the kernel. However, that would be impractical as it would require trusting the applications with the full power of the operating system. Alternately, we could modify the system call interface to allow applications to directly manipulate data stored in a kernel buffer, without first copying it to user memory. However, this is not a general-purpose solution; it would not work if the application needed to do any work on the buffer as opposed to only transferring it from one hardware device to another.
Instead, two solutions to zero-copy I/O are used in practice. Both eliminate the copy across the kernel-user boundary for large blocks of data; for small chunks of data, the extra copy does not hurt performance.
The more widely used approach manipulates the process page table to simulate a copy. For this to work, the application must first align its user-level buffer to a page boundary. The user-level buffer is provided to the kernel on a read or write system call, and its alignment and size are up to the application.
The key idea is that a page-to-page copy from user to kernel space or vice versa can be simulated by changing page table pointers instead of physically copying memory.
For a copy from user-space to the kernel (e.g., on a network or file system write), the kernel changes the permissions on the page table entry for the user-level buffer to prevent it from being modified. The kernel must also pin the page to prevent it from being evicted by the virtual memory manager. In the common case, this is enough: the page will not normally be modified while the I/O request is in progress. If the user program does try to modify the page, the program will trap to the kernel and the kernel can make an explicit copy at that point.
Figure 10.2: The contents of the page table before and after the kernel “copies” data to user-level by swapping the page table entry to point to the kernel buffer.
In the other direction, once the data is in the kernel buffer, the operating system can simulate a copy up to user-space by switching the pointer in the page table, as shown in Figure 10.2. The process page table originally pointed to the page frame containing the (empty) user buffer; now it points to the page frame containing the (full) kernel buffer. To the user program, the data appears exactly where it was expected! The kernel can reclaim any physical memory behind the empty buffer.
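The pointer swap can be illustrated with a toy model. The following is a minimal sketch, not a real kernel interface: the PhysicalMemory class, the deliver_by_remap function, and the dictionary used as a page table are all hypothetical names chosen for illustration.

```python
PAGE_SIZE = 4096


class PhysicalMemory:
    """Toy physical memory: frame number -> page contents (hypothetical model)."""

    def __init__(self):
        self.frames = {}
        self.free_list = []
        self.next_frame = 0

    def alloc(self):
        # Reuse a reclaimed frame if one exists, otherwise take a fresh one.
        if self.free_list:
            frame = self.free_list.pop()
        else:
            frame = self.next_frame
            self.next_frame += 1
        self.frames[frame] = bytearray(PAGE_SIZE)
        return frame

    def reclaim(self, frame):
        del self.frames[frame]
        self.free_list.append(frame)


def deliver_by_remap(memory, page_table, user_vpn, kernel_frame):
    """Simulate a kernel-to-user copy by repointing one page table entry."""
    empty_frame = page_table[user_vpn]
    page_table[user_vpn] = kernel_frame   # user page now maps the full buffer
    memory.reclaim(empty_frame)           # frame behind the empty buffer is freed
    return empty_frame


memory = PhysicalMemory()
user_frame = memory.alloc()               # backs the (empty) user buffer
kernel_frame = memory.alloc()             # kernel buffer filled by the device
memory.frames[kernel_frame][:5] = b"hello"

process_page_table = {0x10: user_frame}   # virtual page 0x10 is the user buffer
deliver_by_remap(memory, process_page_table, 0x10, kernel_frame)

seen_by_user = bytes(memory.frames[process_page_table[0x10]][:5])
print(seen_by_user)
```

No page contents are moved in this sketch; only the page table entry changes, which is why the approach depends on page-aligned buffers.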
More recently, some hardware I/O devices have been designed to be able to transfer data to and from virtual addresses, rather than only to and from physical addresses. The kernel hands the virtual address of the user-level buffer to the hardware device. The hardware device, rather than the kernel, walks the multi-level page table to determine which physical page frame to use for the device transfer. When the transfer completes, the data is automatically where it belongs, with no extra work by the kernel. This procedure is a bit more complicated for incoming network packets, as the decision as to which process should receive which packet is determined by the contents of the packet header. The network interface hardware therefore has to parse the incoming packet to deliver the data to the appropriate process.
10.2 Virtual Machines
A virtual machine is a way for a host operating system to run a guest operating system as an application process. The host simulates the behavior of a physical machine so that the guest system behaves as if it were running on real hardware. Virtual machines are widely used on client machines to run applications that are not native to the current version of the operating system. They are also widely used in data centers to allow a single physical machine to be shared between multiple independent uses, each of which can be written as if it has system administrator control over the entire (virtual) machine. For example, multiple web servers, representing different web sites, can be hosted on the same physical machine if they each run inside a separate virtual machine.
Address translation throws a wrinkle into the challenge of implementing a virtual machine, but it also opens up opportunities for efficiencies and new services.
Figure 10.3: A virtual machine typically has two page tables: one to translate from guest process addresses to the guest physical memory, and one to translate from guest physical memory addresses to host physical memory addresses.
10.2.1 Virtual Machine Page Tables
With virtual machines, we have two sets of page tables, instead of one, as shown in Figure 10.3:
Guest physical memory to host physical memory. The host operating system provides a set of page tables to constrain the execution of the guest operating system kernel. The guest kernel thinks it is running on real, physical memory, but in fact its addresses are virtual. The hardware page table translates each guest operating system memory reference into a physical memory location, after checking that the guest has permission to read or write each location. This way the host operating system can prevent bugs in the guest operating system from overwriting memory in the host, exactly as if the guest were a normal user-level process.
Guest user memory to guest physical memory. In turn, the guest operating system manages page tables for its guest processes, exactly as if the guest kernel were running on real hardware. Since the guest kernel does not know anything about the physical page frames it has been assigned by the host kernel, these page tables translate from the guest process addresses to the guest operating system kernel addresses.
First, consider what happens when the host operating system transfers control to the guest kernel. Everything works as expected. The guest operating system can read and write its memory, and the hardware page tables provide the illusion that the guest kernel is running directly on physical memory.
Now consider what happens when the guest operating system transfers control to the guest process. The guest kernel is running at user-level, so its attempt to transfer control executes a privileged instruction. Thus, the hardware processor will first trap back to the host. The host kernel can then simulate the transfer instruction, handing control to the user process.
However, what page table should we use in this case? We cannot use the page table as set up by the guest operating system, as the guest operating system thinks it is running in physical memory, but it is actually using virtual addresses. Nor can we use the page table as set up by the host operating system, as that would provide permission to the guest process to access and modify the guest kernel data structures. If we grant access to the guest kernel memory to the guest process, then the behavior of the virtual machine will be compromised.
Figure 10.4: To run a guest process, the host operating system constructs a shadow page table consisting of the composition of the contents of the two page tables.
Instead, we need to construct a composite page table, called a shadow page table, that represents the composition of the guest page table and the host page table, as shown in Figure 10.4. When the guest kernel transfers control to a guest process, the host kernel gains control and changes the page table to the shadow version.
To keep the shadow page table up to date, the host operating system needs to keep track of changes that either the guest or the host operating systems make to their page tables. This is easy in the case of the host OS: it can check to see if any shadow page tables need to be updated before it changes a page table entry.
To keep track of changes that the guest operating system makes to its page tables, however, we need to do a bit more work. The host operating system sets the memory of the guest page tables as read-only. This ensures that the guest OS traps to the host every time it attempts to change a page table entry. The host uses this trap to change both the guest page table and the corresponding shadow page table, before resuming the guest operating system (with the page table still read-only).
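The composition and the two update paths can be sketched with single-level tables represented as dictionaries. This is an illustrative model only; the ShadowingHost class and its method names are assumptions, and real page tables are multi-level and carry permission bits.

```python
def build_shadow(guest_pt, host_pt):
    """Compose guest-virtual -> guest-physical with guest-physical -> host-physical."""
    shadow = {}
    for vpn, gpfn in guest_pt.items():
        hpfn = host_pt.get(gpfn)
        if hpfn is not None:          # skip pages the host has not mapped
            shadow[vpn] = hpfn
    return shadow


class ShadowingHost:
    """Hypothetical host that keeps a shadow table in sync with both inputs."""

    def __init__(self, guest_pt, host_pt):
        self.guest_pt = dict(guest_pt)
        self.host_pt = dict(host_pt)
        self.shadow = build_shadow(self.guest_pt, self.host_pt)

    def on_guest_pte_write(self, vpn, gpfn):
        # Reached via a trap: the guest page table memory is read-only.
        self.guest_pt[vpn] = gpfn
        hpfn = self.host_pt.get(gpfn)
        if hpfn is None:
            self.shadow.pop(vpn, None)
        else:
            self.shadow[vpn] = hpfn

    def on_host_pte_change(self, gpfn, hpfn):
        # The host updates any shadow entries that depend on this mapping.
        self.host_pt[gpfn] = hpfn
        for vpn, mapped in self.guest_pt.items():
            if mapped == gpfn:
                self.shadow[vpn] = hpfn


host = ShadowingHost(guest_pt={0x0: 0x2, 0x1: 0x5},
                     host_pt={0x2: 0x40, 0x5: 0x41, 0x7: 0x42})
initial_shadow = dict(host.shadow)

host.on_guest_pte_write(0x1, 0x7)     # guest remaps virtual page 1
host.on_guest_pte_write(0x2, 0x9)     # guest frame 9 has no host mapping
host.on_host_pte_change(0x2, 0x50)    # host moves guest frame 2
print(initial_shadow, host.shadow)
```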
Paravirtualization
One way to enable virtual machines to run faster is to assume that the guest operating system is ported to the virtual machine. The hardware-dependent layer, specific to the underlying hardware, is replaced with code that understands the virtual machine. This is called paravirtualization, because the resulting guest operating system is almost, but not precisely, the same as if it were running on real, physical hardware.
Paravirtualization is helpful in several ways. Perhaps the most important is handling the idle loop. What should happen when the guest operating system has no threads to run? If the guest believes it is running on physical hardware, then nothing: the guest spins waiting for more work to do, perhaps putting itself in low power mode. Eventually the hardware will cause a timer interrupt, transferring control to the host operating system. The host can then decide whether to resume the virtual machine or run some other thread (or even some other virtual machine).
With paravirtualization, however, the idle loop can be more efficient. The hardware-dependent software implementing the idle loop can trap into the host kernel, yielding the processor immediately to some other use.
Likewise, with paravirtualization, the hardware-dependent code inside the guest operating system can make explicit calls to the host kernel to change its page tables, removing the need for the host to simulate guest page table management.
The Intel architecture has recently added direct hardware support for the composition of page tables in virtual machines. Instead of a single page table, the hardware can be set up with two page tables, one for the host and one for the guest operating system. When running a guest process, on a TLB miss, the hardware translates the virtual address to a guest physical page frame using the guest page table, and the hardware then translates the guest physical page frame to the host physical page frame using the host page table. In other words, the TLB contains the composition of the two page tables, exactly as if the host maintained an explicit shadow page table. Of course, if the guest operating system itself hosts a virtual machine as a guest user process, then the guest kernel must construct a shadow page table.
Although this hardware support simplifies the construction of virtual machines, it is not clear if it improves performance. The handling of a TLB miss is slower since the hardware must consult two page tables instead of one; changes to the guest page table are faster because the host does not need to maintain the shadow page table. It remains to be seen if this tradeoff is useful in practice.
10.2.2 Transparent Memory Compression
A theme running throughout this book is the difficulty of multiplexing multiplexors. With virtual machines, both the host operating system and the guest operating system are attempting to do the same task: to efficiently multiplex a set of tasks onto a limited amount of memory. Decisions the guest operating system takes to manage its memory may work at cross-purposes to the decisions that the host operating system takes to manage its memory.
Efficient use of memory can become especially important in data centers. Often, a single physical machine in a data center is configured to run many virtual machines at the same time. For example, one machine can host many different web sites, each of which is too small to merit a dedicated machine on its own.
To make this work, the system needs enough memory to be able to run many different operating systems at the same time. The host operating system can help by sharing memory between guest kernels, e.g., if it is running two guest kernels with the same executable kernel image. Likewise, the guest operating system can help by sharing memory between guest applications, e.g., if it is running two copies of the same program. However, if different guest kernels both run a copy of the same user process (e.g., both run the Apache web server), or use the same library, the host kernel has no (direct) way to share pages between those two instances.
Another example occurs when a guest process exits. The guest operating system places the page frames for the exiting process on the free list for reallocation to other processes. The contents of any data pages will never be used again; in fact, the guest kernel will need to zero those pages before they are reassigned. However, the host operating system has no (direct) way to know this. Eventually those pages will be evicted by the host, e.g., when they become least recently used. In the meantime, however, the host operating system might have evicted pages from the guest that are still active.
One solution is to more tightly coordinate the guest and host memory managers so that each knows what the other is doing. We discuss this in more detail later in this chapter.
Commercial virtual machine implementations take a different approach, exploiting hardware address protection to manage the sharing of common pages between virtual machines. These systems run a scavenger in the background that looks for pages that can be shared across virtual machines. Once a common page is identified, the host kernel manipulates the page table pointers to provide the illusion that each guest has its own copy of the page, even though the physical representation is more compact.
Figure 10.5: When a host kernel runs multiple virtual machines, it can save space by storing a delta to an existing page (page A) and by using the same physical page frame for multiple copies of the same page (page B).
There are two cases to consider, shown in Figure 10.5:
Multiple copies of the same page. Two different virtual machines will often have pages with the same contents. An obvious case is zeroed pages: each kernel keeps a pool of pages that have been zeroed, ready to be allocated to a new process. If each guest operating system were running on its own machine, there would be little cost to keeping this pool at the ready; no one else but the kernel can use that memory. However, when the physical machine is shared between virtual machines, having each guest keep its own pool of zero pages is wasteful.
Instead, the host can allocate a single zero page in physical memory for all of these instances. All pointers to the page will be set read-only, so that any attempt to modify the page will cause a trap to the host kernel; the kernel can then allocate a new (zeroed) physical page for that use, exactly as in copy-on-write. Of course, the guest kernels do not need to tell anyone when they create a zero page, so in the background, the host kernel runs a scavenger to look for zero pages in guest memory. When it finds one, it reclaims the physical page and changes the page table pointers to point at the shared zero page, with read-only permission.
The scavenger can do the same for other shared page frames. The code and data segments for both applications and shared libraries will often be the same or quite similar, even across different operating systems. An application like the Apache web server will not be re-written from scratch for every separate operating system; rather, some OS-specific glue code will be added to match the portable portion of the application to its specific environment.
Compression of unused pages. Even if a page is different, it may be close to some other page in a different virtual machine. For example, different versions of the operating system may differ in only some small respects. This provides an opportunity for the host kernel to introduce a new layer in the memory hierarchy to save space.
Instead of evicting a relatively unused page, the operating system can compress it. If the page is a delta of an existing page, the compressed version may be quite small. The kernel manipulates page table permissions to maintain the illusion that the delta is a real page. The full copy of the page is marked read-only; the delta is marked invalid. If the delta is referenced, it can be re-constituted as a full page more quickly than if it were stored on disk. If the original page is modified, the delta can be re-compressed or evicted, as necessary.
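The first of these cases, sharing identical pages with copy-on-write, can be sketched as follows. This is a simplified model under stated assumptions: the SharingHost class and its methods are hypothetical, pages are compared by a SHA-256 digest followed by a full comparison, and delta compression is not modeled.

```python
import hashlib

PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)


class SharingHost:
    """Hypothetical host that merges identical guest pages."""

    def __init__(self):
        self.frames = {}      # frame number -> page contents
        self.refcount = {}    # frame number -> number of mappings
        self.mappings = {}    # (vm, guest page) -> [frame, writable]
        self.next_frame = 0

    def _new_frame(self, contents):
        frame = self.next_frame
        self.next_frame += 1
        self.frames[frame] = bytes(contents)
        self.refcount[frame] = 1
        return frame

    def map_page(self, vm, gpn, contents):
        self.mappings[(vm, gpn)] = [self._new_frame(contents), True]

    def scavenge(self):
        """Background pass: point identical pages at one read-only frame."""
        canonical = {}        # content digest -> frame to keep
        for entry in self.mappings.values():
            frame = entry[0]
            digest = hashlib.sha256(self.frames[frame]).digest()
            keep = canonical.setdefault(digest, frame)
            if keep != frame and self.frames[keep] == self.frames[frame]:
                self.refcount[keep] += 1
                self.refcount[frame] -= 1
                if self.refcount[frame] == 0:
                    del self.frames[frame], self.refcount[frame]
                entry[0] = keep
        for entry in self.mappings.values():
            if self.refcount[entry[0]] > 1:
                entry[1] = False          # shared frames become read-only

    def read(self, vm, gpn):
        return self.frames[self.mappings[(vm, gpn)][0]]

    def write(self, vm, gpn, offset, data):
        entry = self.mappings[(vm, gpn)]
        if not entry[1]:                  # protection trap to the host
            if self.refcount[entry[0]] > 1:
                self.refcount[entry[0]] -= 1
                entry[0] = self._new_frame(self.frames[entry[0]])  # private copy
            entry[1] = True
        page = bytearray(self.frames[entry[0]])
        page[offset:offset + len(data)] = data
        self.frames[entry[0]] = bytes(page)


host = SharingHost()
for vm in ("vm1", "vm2", "vm3"):
    host.map_page(vm, 0, ZERO_PAGE)       # each guest has its own zero page
host.map_page("vm1", 1, b"x" * PAGE_SIZE)

frames_before = len(host.frames)
host.scavenge()
frames_after = len(host.frames)
host.write("vm2", 0, 0, b"hi")            # triggers copy-on-write
frames_after_write = len(host.frames)
print(frames_before, frames_after, frames_after_write)
```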
10.3 Fault Tolerance
All systems break. Despite our best efforts, application code can have bugs that cause the process to exit abruptly. Operating system code can have bugs that cause the machine to halt and reboot. Power failures and hardware errors can also cause a system to stop without warning.
Most applications are structured to periodically save user data to disk for just these types of events. When the operating system or application restarts, the program can read the saved data off disk to allow the user to resume their work.
In this section, we take this a step further, to see if we can manage memory to recover application data structures after a failure, and not just user file data.
10.3.1 Checkpoint and Restart
One reason we might want to recover application data is when a program takes a long time to run. If a simulation of the future global climate takes a week to compute, we do not want to have to start again from scratch every time there is a power glitch. If enough machines are involved and the computation takes long enough, it is likely that at least one of the machines will encounter a failure sometime during the computation.
Of course, the program could be written to treat its internal data as precious, periodically saving its partial results to a file. To make sure the data is internally consistent, the program would need some natural stopping point; for example, the program can save the predicted climate for 2050 before it moves on to computing the climate in 2051.
A more general approach is to have the operating system use the virtual memory system to provide application recovery as a service. If we can save the state of a process, we can transparently restart it whenever the power fails, exactly where it left off, with the user none the wiser.
Figure 10.6: By checkpointing the state of a process, we can recover the saved state of the process after a failure by restoring the saved copy.
To make this work, we first need to suspend each thread executing in the process and save its state (the program counter, stack pointer, and registers) to application memory. Once all threads are suspended, we can then store a copy of the contents of the application memory on disk. This is called a checkpoint or snapshot, illustrated in Figure 10.6. After a failure, we can resume the execution by restoring the contents of memory from the checkpoint and resuming each of the threads from exactly the point we stopped them. This is called an application restart.
What would happen if we allow threads to continue to run while we are saving the contents of memory to disk? During the copy, we have a race condition: some pages could be saved before being modified by some thread, while others could be saved after being modified by that same thread. When we try to restart the application, its data structures could appear to be corrupted. The behavior of the program might be different from what would have happened if the failure had not occurred.
Fortunately, we can use address translation to minimize the amount of time we need to have the system stalled during a checkpoint. Instead of copying the contents of memory to disk, we can mark the application’s pages as copy-on-write. At this point, we can restart the program’s threads. As each page reaches disk, we can reset the protection on that page to read-write. When the program tries to modify a page before it reaches disk, the hardware will take an exception, and the kernel can make a copy of the page: one to be saved to disk and one to be used by the running program.
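The copy-on-write checkpoint can be sketched with a small simulation. The CowCheckpoint class and its method names are hypothetical; a dictionary stands in for the disk, and "protection" is modeled as membership in a set rather than real page table bits.

```python
class CowCheckpoint:
    """Toy model: write a consistent snapshot while the program keeps running."""

    def __init__(self, memory):
        self.memory = memory              # live pages: page number -> bytes
        self.to_write = sorted(memory)    # pages not yet on disk
        self.protected = set(memory)      # pages currently marked copy-on-write
        self.preserved = {}               # snapshot copies of pages modified early
        self.disk = {}

    def store(self, page, data):
        """The running program modifies a page."""
        if page in self.protected:
            # Exception path: keep the old contents for the checkpoint.
            self.preserved[page] = self.memory[page]
            self.protected.discard(page)
        self.memory[page] = data

    def write_next_page(self):
        """Background writer: save one page; return True if more remain."""
        page = self.to_write.pop(0)
        self.disk[page] = self.preserved.pop(page, self.memory[page])
        self.protected.discard(page)      # page is back to read-write
        return bool(self.to_write)


live = {0: b"A", 1: b"B", 2: b"C"}
checkpoint = CowCheckpoint(live)

checkpoint.write_next_page()              # page 0 reaches disk
checkpoint.store(0, b"a2")                # already saved: no copy needed
checkpoint.store(2, b"c2")                # not yet saved: old contents preserved
while checkpoint.write_next_page():
    pass

print(checkpoint.disk, live)
```

The snapshot on disk reflects memory at the instant the checkpoint began, even though the program modified pages while the write was in progress.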
We can take checkpoints of the operating system itself in the same way. It is easiest to do this if the operating system is running in a virtual machine. The host can take a checkpoint by stopping the virtual machine, saving the processor state, and changing the page table protections (in the host page table) to read-only. The virtual machine is then safe to restart while the host writes the checkpoint to disk in the background.
Checkpoints and system calls
An implementation challenge for checkpoint/restart is to correctly handle any system calls that are in progress. The state of a program is not only its user-level memory; it also includes the state of any threads that are executing in the kernel and any per-process state maintained by the kernel, such as its open file descriptors. While some operating systems have been designed to allow the kernel state of a process to be captured as part of the checkpoint, it is more common for checkpointing to be supported only at the virtual machine layer. A virtual machine has no state in the kernel except for the contents of its memory and processor registers. If we need to take a checkpoint while a trap handler is in progress, the handler can simply be restarted.
Process migration is the ability to take a running program on one system, stop its execution, and resume it on a different machine. Checkpoint and restart provide a basis for transparent process migration. For example, it is now common practice to checkpoint and migrate entire virtual machines inside a data center, as one way to balance load. If one system is hosting two web servers, each of which becomes heavily loaded, we can stop one and move it to a different machine so that each can get better performance.
10.3.2 Recoverable Virtual Memory
Taking a complete checkpoint of a process or a virtual machine is a heavyweight operation, and so it is only practical to do relatively rarely. We can use copy-on-write page protection to resume the process after starting the checkpoint, but completing the checkpoint will still take considerable time while we copy the contents of memory out to disk.
Can we provide an application the illusion of persistent memory, so that the contents of memory are restored to a point not long before the failure? The ability to do this is called recoverable virtual memory. An example where we might like recoverable virtual memory is in an email client; as you read, reply, and delete email, you do not want your work to be lost if the system crashes.
If we put efficiency aside, recoverable virtual memory is possible. First, we take a checkpoint so that some consistent version of the application’s data is on disk. Next, we record an ordered sequence, or log, of every update that the application makes to memory. Once the log is written to disk, we can recover after a failure by reading the checkpoint and applying the changes from the log.
This is exactly how most text editors save their backups, to allow them to recover uncommitted user edits after a machine or application failure. A text editor could repeatedly write an entire copy of the file to a backup, but this would be slow, particularly for a large file. Instead, a text editor will write a version of the file, and then it will append a sequence of every change the user makes to that version. To avoid having to separately write every typed character to disk, the editor will batch changes, e.g., all of the changes the user made in the past 100 milliseconds, and write those to disk as a unit. Even if the very latest batch has not been written to disk, the user can usually recover the state of the file at almost the instant immediately before the machine crash.
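The checkpoint-plus-log scheme with batched appends can be sketched as follows. The BackupLog class, the edit tuple format, and the in-memory lists standing in for disk are all assumptions made for illustration; no particular editor is claimed to work this way in detail.

```python
def apply_edit(text, edit):
    """Apply one (position, deleted_length, inserted_text) edit to a string."""
    position, deleted_length, inserted = edit
    return text[:position] + inserted + text[position + deleted_length:]


class BackupLog:
    """Toy editor backup: one full version plus an append-only log of edits."""

    def __init__(self, base_text):
        self.checkpoint = base_text   # full version, assumed to be on disk
        self.durable_log = []         # batches of edits already on disk
        self.batch = []               # recent edits still only in memory
        self.current = base_text      # what the user sees

    def edit(self, position, deleted_length, inserted):
        change = (position, deleted_length, inserted)
        self.current = apply_edit(self.current, change)
        self.batch.append(change)

    def flush(self):
        """Write the pending batch to disk as a unit."""
        if self.batch:
            self.durable_log.append(list(self.batch))
            self.batch.clear()

    def recover(self):
        """After a crash: start from the checkpoint and replay durable batches."""
        text = self.checkpoint
        for batch in self.durable_log:
            for change in batch:
                text = apply_edit(text, change)
        return text


backup = BackupLog("hello world")
backup.edit(0, 5, "Hello")            # replace "hello" with "Hello"
backup.flush()                        # this batch reaches disk
backup.edit(11, 0, "!")               # still in memory when the crash occurs

recovered = backup.recover()
print(repr(backup.current), repr(recovered))
```

Only the edit in the unflushed batch is lost, which matches the observation that the recovered state is close to, but not always exactly, the state at the moment of the crash.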
A downside of this algorithm for text editors is that it can cause information to be leaked without it being visible in the current version of the file. Text editors sometimes use this same method when the user hits “save”: they just append any changes from the previous version, rather than writing a fresh copy of the entire file. This means that the old version of a file can potentially still be recovered from the saved file. So if you write a memo insulting your boss, and then edit it to tone it down, it is best to save a completely new version of your file before you send it off!
Will this method work for persistent memory? Keeping a log of every change to every memory location in the process would be too slow. We would need to trap on every store instruction and save the value to disk. In other words, we would run at the speed of the trap handler, rather than the speed of the processor.
However, we can come close. When we take a checkpoint, we mark all pages as read-only to ensure that the checkpoint includes a consistent snapshot of the state of the process’s memory. Then we trap to the kernel on the first store instruction to each page, to allow the kernel to make a copy-on-write. The kernel resets the page to be read-write so that successive store instructions to the same page can go at full speed, but it can also record the page as having been modified.
Figure 10.7: The operating system can recover the state of a memory segment after a crash by saving a sequence of incremental checkpoints.
We can take an incremental checkpoint by stopping the program and saving a copy of any pages that have been modified since the previous checkpoint. Once we change those pages back to read-only, we can restart the program, wait a bit, and take another incremental checkpoint. After a crash, we can recover the most recent memory by reading in the first checkpoint and then applying each of the incremental checkpoints in turn, as shown in Figure 10.7.
How much work we lose during a machine crash is a function of how quickly we can completely write an incremental checkpoint to disk. This is governed by the rate at which the application creates new data. To reduce the cost of an incremental checkpoint, applications needing recoverable virtual memory will designate a specific memory segment as persistent. After a crash, that memory will be restored to the latest incremental checkpoint, allowing the program to quickly resume its work.
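Incremental checkpointing of a persistent segment can be sketched in a few lines. The RecoverableSegment class and the recover function are hypothetical names; read-only protection is modeled as a set of page numbers, and a list of dictionaries stands in for the checkpoints on disk.

```python
class RecoverableSegment:
    """Toy persistent segment with dirty-page tracking."""

    def __init__(self, pages):
        self.pages = dict(pages)
        self.read_only = set(pages)        # all pages protected after a checkpoint
        self.dirty = set()
        self.traps = 0
        self.checkpoints = [dict(pages)]   # first entry is the full checkpoint

    def store(self, page, data):
        if page in self.read_only:
            # First store since the last checkpoint traps to the kernel,
            # which records the page as modified and makes it read-write.
            self.traps += 1
            self.read_only.discard(page)
            self.dirty.add(page)
        self.pages[page] = data            # later stores run at full speed

    def incremental_checkpoint(self):
        self.checkpoints.append({p: self.pages[p] for p in self.dirty})
        self.read_only |= self.dirty       # re-protect the saved pages
        self.dirty.clear()


def recover(checkpoints):
    """Apply the full checkpoint, then each incremental checkpoint in order."""
    memory = {}
    for checkpoint in checkpoints:
        memory.update(checkpoint)
    return memory


segment = RecoverableSegment({0: b"a", 1: b"b", 2: b"c"})
segment.store(1, b"B")
segment.store(1, b"BB")                    # second store to page 1: no trap
segment.incremental_checkpoint()
segment.store(2, b"C")
segment.incremental_checkpoint()
segment.store(0, b"lost")                  # modified after the last checkpoint

recovered = recover(segment.checkpoints)
print(recovered, segment.traps)
```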
10.3.3 Deterministic Debugging
A key to building reliable systems software is the ability to locate and fix problems when they do occur. Debugging a sequential program is comparatively easy: if you give it the same input, it will execute the same code in the same order, and produce the same output.
Debugging a concurrent program is much harder: the behavior of the program may change depending on the precise scheduling order chosen by the operating system. If the program is correct, the same output should be produced on the same input. If we are debugging a program, however, it is probably not correct. Instead, the precise behavior of the program may vary from run to run depending on which threads are scheduled first.
Debugging an operating system is even harder: not only does the operating system make widespread use of concurrency, but it is sometimes hard to tell what its “input” and “output” are.
It turns out, however, that we can use a virtual machine abstraction to provide a repeatable debugging environment for an operating system, and we can in turn use that to provide a repeatable debugging environment for concurrent applications.
It is easiest to see this on a uniprocessor. The execution of an operating system running in a virtual machine can only be affected by three factors: its initial state, the input data provided by its I/O devices, and the precise timing of interrupts.
Because the host kernel mediates each of these for the virtual machine, it can record them and play them back during debugging. As long as the host exactly mimics what it did the first time, the behavior of the guest operating system will be the same and the behavior of all applications running on top of the guest operating system will be the same.
Replaying the input is easy, but how do we replay the precise timing of interrupts? Most modern computer architectures have a counter on the processor to measure the number of instructions executed. The host operating system can use this to measure how many instructions the guest operating system (or guest application) executed between the point where the host gave up control of the processor to the guest, and when control returned to the kernel due to an interrupt or trap.
To replay the precise timing of an asynchronous interrupt, the host kernel records the guest program counter and the instruction count at the point when the interrupt was delivered to the guest. On replay, the host kernel can set a trap on the page containing the program counter where the next interrupt will be taken. Since the guest might visit the same program counter multiple times, the host kernel uses the instruction count to determine which visit corresponds to the one where the interrupt was delivered. (Some systems make this even easier, by allowing the kernel to request a trap whenever the instruction count reaches a certain value.)
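Record and replay of interrupt timing can be illustrated with a toy deterministic guest. Everything here is an assumption for illustration: the guest is a four-instruction loop, an "interrupt" simply mixes a device value into the guest state, and the names run_guest and replay are hypothetical.

```python
import random

LOOP_LENGTH = 4   # toy guest: a 4-instruction loop, so each PC value recurs


def run_guest(total_instructions, deliveries, log=None):
    """Run the guest; deliveries maps instruction count -> device value."""
    state = 0
    for count in range(total_instructions):
        pc = count % LOOP_LENGTH
        if count in deliveries:
            if log is not None:
                # The host records PC, instruction count, and input data.
                log.append((pc, count, deliveries[count]))
            state ^= deliveries[count]            # interrupt handler runs
        state = (state * 31 + pc + 7) % 1_000_003  # one guest "instruction"
    return state


def replay(total_instructions, log):
    """Redeliver each interrupt at the recorded instruction count."""
    deliveries = {}
    for pc, count, value in log:
        # The PC alone is ambiguous (it recurs every loop iteration);
        # the instruction count selects the right visit.
        assert count % LOOP_LENGTH == pc
        deliveries[count] = value
    return run_guest(total_instructions, deliveries)


# Original run: interrupt times and data are not known in advance.
rng = random.Random()
times = rng.sample(range(1000), 3)
arrivals = {t: rng.randrange(1, 256) for t in times}

recorded_log = []
original_state = run_guest(1000, arrivals, log=recorded_log)
replayed_state = replay(1000, recorded_log)
print(original_state == replayed_state)
```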
Moreover, if we want to skip ahead to some known good intermediate point, we can take a checkpoint, and play forward the sequence of interrupts and input data from there. This is important as sometimes bugs in operating systems can take weeks to manifest themselves; if we needed to replay everything from boot, the debugging process would be much more cumbersome.
Matters are more complex on a multicore system, as the precise behavior of both the guest operating system and the guest applications will depend on the precise ordering of instructions across the different processors. It is an ongoing area of research how best to provide deterministic execution in this setting. Provided that the program being debugged has no race conditions (that is, no access to shared memory outside of a critical section), then its behavior will be deterministic with one more piece of information. In addition to the initial state, inputs, and asynchronous interrupts, we also need to record which thread acquires each critical section in which order. If we replay the threads in that order and deliver interrupts precisely and provide the same device input, the behavior will be the same. Whether this is a practical solution is still an open question.
10.4 Security
Hardware or software address translation provides a basis for executing untrusted application code, to allow the operating system kernel to protect itself and other applications from malicious or buggy implementations.
A modern smartphone or tablet computer, however, has literally hundreds of thousands of applications that could be installed. Many or most are completely trustworthy, but others are specifically designed to steal or corrupt local data by exploiting weaknesses in the underlying operating system or the natural human tendency to trust technology. How is a user to know which is which? A similar situation exists for the web: even if most web sites are innocuous, some embed code that exploits known vulnerabilities in the browser defenses.
If we cannot limit our exposure to potentially malicious applications, what can we do? One important step is to keep your system software up to date. The malicious code authors recognize this: a recent survey showed that the most likely web sites to contain viruses are those targeted at the most novice users, e.g., screensavers and children’s games.
In this section, we discuss whether there are additional ways to use virtual machines to limit the scope of malicious applications.
Suppose you want to download a new application, or visit a new web site. There is some chance it will work as advertised, and there is some chance it will contain a virus. Is there any way to limit the potential of the new software to exploit some unknown vulnerability in your operating system or browser?
One interesting approach is to clone your operating system into a new virtual machine, and run the application in the clone rather than on the native operating system. A virtual machine constructed for the purpose of executing suspect code is called a virtual machine honeypot. By using a virtual machine, if the code turns out to be malicious, we can delete the virtual machine and leave the underlying operating system as it was before we attempted to run the application.
Creating a virtual machine to execute a new application might seem extravagant. However, earlier in this chapter, we discussed various ways to make this more efficient: shadow page tables, memory compression, efficient checkpoint and restart, and copy-on-write. And of course, reinstalling your system after it has become infected with a virus is even slower!
Both researchers and vendors of commercial anti-virus software make extensive use of virtual machine honeypots to detect and understand viruses. For example, a frequent technique is to create an array of virtual machines, each with a different version of the operating system. By loading a potential virus into each one, and then simulating user behavior, we can more easily determine which versions of software are vulnerable and which are not.
A limitation is that we need to be able to tell if the browser or operating system running in the virtual machine honeypot has been corrupted. Often, viruses operate instantly, by attempting to install logging software or scanning the disk for sensitive information such as credit card numbers. However, there is nothing to keep a virus from lying in wait; this has become more common recently, particularly among viruses designed for military or business espionage.
Another limitation is that the virus might be designed to infect both the guest operating system running in the clone and the host kernel implementing the virtual machine. (In the case of the web, the virus must infect the browser, the guest operating system, and the host.) As long as the system software is kept up to date, the system is vulnerable only if the virus is able to exploit some unknown weakness in the guest operating system and a separate unknown weakness in the host implementation of the virtual machine. This provides defense in depth, improving security through multiple layers of protection.
10.5 User-Level Memory Management
With the increasing sophistication of applications and their runtime systems, most widely used operating systems have introduced hooks for applications to manage their own memory. While the details of the interface differ from system to system, the hooks preserve the role of the kernel in allocating resources between processes and in preventing access to privileged memory. Once a page frame has been assigned to a process, however, the kernel can leave it up to the process to determine what to do with that resource.
Operating systems can provide applications the flexibility to decide:
Where to get missing pages. As we noted in the previous chapter, a modern memory hierarchy is deep and complex: local disk, local non-volatile memory, remote memory inside a data center, or remote disk. By giving applications control, the kernel can keep its own memory hierarchy simple and local, while still allowing sophisticated applications to take advantage of network resources when they are available, even when those resources are on machines running completely different operating systems.
Which pages can be accessed. Many applications such as browsers and databases need to set up their own application-level sandboxes for executing untrusted code. Today this is done with a combination of hardware and software techniques, as we described in Chapter 8. Finer-grained control over page fault handling allows more sophisticated models for managing sharing between regions of untrusted code.
Which pages should be evicted. Often, an application will have better information than the operating system over which pages it will reference in the near future.
Many applications can adapt the size of their working set to the resources provided by the kernel, but they will have worse performance whenever there is a mismatch.
Garbage collected programs. Consider a program that does its own garbage collection. When it starts up, it allocates a block of memory in its virtual address space to serve as the heap. Periodically, the program scans through the heap to compact its data structures, freeing up room for additional data structures. This causes all pages to appear to be recently used, confounding the kernel’s memory manager. By contrast, the application knows that the best page to replace is one that was recently cleaned of application data.
It is equally confounding to the application. How does the garbage collector know how much memory it should allocate for the heap? Ideally, the garbage collector should use exactly as much memory as the kernel is able to provide, and no more. If the runtime heap is too small, the program must garbage collect, even though more page frames are available. If the heap is too large, the kernel will page parts of the heap to disk instead of asking the application to pay the lower overhead of compacting its memory.
Databases. Databases and other data processing systems often manipulate huge data sets that must be streamed from disk into memory. As we noted in Chapter 9, algorithms for large data sets will be more efficient if they are customized to the amount of available physical memory. If the operating system evicts a page that the database expects to be in memory, these algorithms will run much more slowly.
Virtual machines. A similar issue arises with virtual machines. The guest operating system running inside of a virtual machine thinks it has a set of physical page frames, which it can assign to the virtual pages of applications running in the virtual machine. In reality, however, the page frames in the guest operating system are virtual and can be paged to disk by the host operating system. If the host operating system could tell the guest operating system when it needed to steal a page frame (or donate a page frame), then the guest would know exactly how many page frames were available to be allocated to its applications.
In each of these cases, the performance of a resource manager can be compromised if it runs on top of a virtualized, rather than a physical, resource. What is needed is for the operating system kernel to communicate how much memory is assigned to a process or virtual machine so that the application can do its own memory management. As processes start and complete, the amount of available physical memory will change, and therefore the assignment to each application will change.
To handle these needs, most operating systems provide some level of application control over memory. Two models have emerged:
Pinned pages. A simple and widely available model is to allow applications to pin virtual memory pages to physical page frames, preventing those pages from being evicted unless absolutely necessary. Once pinned, the application can manage its memory however it sees fit, for example, by explicitly shuffling data back and forth to disk.
Figure 10.8: The operation of a user-level page handler. On a page fault, the hardware traps to the kernel; if the fault is for a segment with a user-level pager, the kernel passes the fault to the user-level handler to manage. The user-level handler is pinned in memory to avoid recursive faults.
User-level pagers. A more general solution is for applications to specify a user-level page handler for a memory segment. On a page fault or protection violation, the kernel trap handler is invoked. Instead of handling the fault itself, the kernel passes control to the user-level handler, as in a UNIX signal handler. The user-level handler can then decide how to manage the trap: where to fetch the missing page, what action to take if the application is a sandbox, and which page to replace. To avoid infinite recursion, the user-level page handler must itself be stored in pinned memory.
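The division of labor between the kernel and a user-level pager can be sketched with a simulation. The Kernel and ScanPager classes are hypothetical, as is the policy shown (evict the most recently fetched page, which suits repeated sequential scans); the sketch does not model pinning of the handler itself.

```python
class Kernel:
    """Toy kernel: owns the frames but delegates paging decisions."""

    def __init__(self, num_frames, pager):
        self.num_frames = num_frames
        self.pager = pager            # user-level handler (assumed pinned)
        self.resident = {}            # page number -> contents

    def access(self, page):
        if page not in self.resident:
            self.page_fault(page)
        return self.resident[page]

    def page_fault(self, page):
        # Instead of choosing a victim itself, the kernel asks the pager.
        if len(self.resident) >= self.num_frames:
            victim = self.pager.choose_victim(list(self.resident))
            self.pager.write_back(victim, self.resident.pop(victim))
        self.resident[page] = self.pager.fetch(page)


class ScanPager:
    """Hypothetical application policy tuned for repeated sequential scans."""

    def __init__(self, backing_store):
        self.backing_store = backing_store   # could be disk or remote memory
        self.faults = 0
        self.last_fetched = None

    def fetch(self, page):
        self.faults += 1
        self.last_fetched = page
        return self.backing_store[page]

    def write_back(self, page, contents):
        self.backing_store[page] = contents

    def choose_victim(self, resident_pages):
        # Evict the most recently fetched page; it is always still resident.
        return self.last_fetched


backing = {p: f"page{p}" for p in range(4)}
pager = ScanPager(backing)
kernel = Kernel(num_frames=3, pager=pager)

for _ in range(2):                    # scan four pages twice with three frames
    for p in range(4):
        kernel.access(p)

print(pager.faults, sorted(kernel.resident))
```

In this trace the application-chosen policy takes six faults, whereas least-recently-used replacement would fault on all eight accesses for the same scan.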
10.6 Summary and Future Directions
In this chapter, we have argued that address translation provides a powerful tool for operating systems to provide a set of advanced services to applications to improve system performance, reliability, and security. Services such as checkpointing, recoverable memory, deterministic debugging, and honeypots are now widely supported at the virtual machine layer, and we believe that they will come to be standard in most operating systems as well.
Moving forward, it is clear that the demands on the memory management system for advanced services will increase. Not only are memory hierarchies becoming increasingly complex, but the diversity of services provided by the memory management system has added even more complexity.
Operating systems often go through cycles of gradually increasing complexity followed by rapid shifts back towards simplicity. The recent commercial interest in virtual machines may yield a shift back towards simpler memory management, by reducing the need for the kernel to provide every service that any application might need. Processor architectures now directly support user-level page tables. This potentially opens up an entire realm for more sophisticated runtime systems, for those applications that are themselves miniature operating systems, and a concurrent simplification of the kernel. With the right operating system support, applications will be able to set up and manage their own page tables directly, implement their own user-level process abstractions, and provide their own transparent checkpointing and recovery on memory segments.
Exercises
1. This question concerns the operation of shadow page tables for virtual machines, where a guest process is running on top of a guest operating system on top of a host operating system. The architecture uses paged segmentation, with a 32-bit virtual address divided into fields as follows:
| 4-bit segment number | 12-bit page number | 16-bit offset |
The guest operating system creates and manages segment and page tables to map the guest virtual addresses to guest physical memory. These tables are as follows (all values in hexadecimal):
Segment Table
| 0 | Page Table A |
| 1 | Page Table B |
| x | (rest invalid) |
Page Table A
| 0 | 0002 |
| 1 | 0006 |
| 2 | 0000 |
| 3 | 0005 |
| x | (rest invalid) |
Page Table B
| 0 | 0001 |
| 1 | 0004 |
| 2 | 0003 |
| x | (rest invalid) |
The host operating system creates and manages segment and page tables to map the guest physical memory to host physical memory. These tables are as follows:
Segment Table
| 0 | Page Table K |
| x | (rest invalid) |
Page Table K
| 0 | BEEF |
| 1 | F000 |
| 2 | CAFE |
| 3 | 3333 |
| 4 | (invalid) |
| 5 | BA11 |
| 6 | DEAD |
| 7 | 5555 |
| x | (rest invalid) |
a. Find the host physical address corresponding to each of the following guest virtual addresses. Answer “invalid guest virtual address” if the guest virtual address is invalid; answer “invalid guest physical address” if the guest virtual address maps to a valid guest physical page frame, but the guest physical page has an invalid virtual address.
i. 00000000
ii. 20021111
iii. 10012222
iv. 00023333
v. 10024444
b. Using the information in the tables above, fill in the contents of the shadow segment and page tables for direct execution of the guest process.
c. Assuming that the guest physical memory is contiguous, list three reasons why the host page table might have an invalid entry for a guest physical page frame, with valid entries on either side.
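The two-stage lookup that Exercise 1 relies on can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function and variable names are hypothetical, and it assumes that a guest physical address is split into the same 4/12/16-bit fields when it is looked up in the host tables. The sample address below is deliberately not one of the addresses listed in part (a).

```python
from typing import Dict, Optional

PAGE_BITS, OFFSET_BITS = 12, 16   # field widths from the exercise

# Guest tables: guest virtual page -> guest physical page frame.
PAGE_TABLE_A = {0x0: 0x0002, 0x1: 0x0006, 0x2: 0x0000, 0x3: 0x0005}
PAGE_TABLE_B = {0x0: 0x0001, 0x1: 0x0004, 0x2: 0x0003}
GUEST_SEGMENTS = {0x0: PAGE_TABLE_A, 0x1: PAGE_TABLE_B}

# Host tables: guest physical page -> host physical page frame.
# Entry 4 is invalid, so it is simply absent from the dictionary.
PAGE_TABLE_K = {0x0: 0xBEEF, 0x1: 0xF000, 0x2: 0xCAFE, 0x3: 0x3333,
                0x5: 0xBA11, 0x6: 0xDEAD, 0x7: 0x5555}
HOST_SEGMENTS = {0x0: PAGE_TABLE_K}

SegmentTable = Dict[int, Dict[int, int]]


def translate(addr: int, segments: SegmentTable) -> Optional[int]:
    """One paged-segmentation lookup; returns None if any level is invalid."""
    segment = addr >> (PAGE_BITS + OFFSET_BITS)
    page = (addr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    offset = addr & ((1 << OFFSET_BITS) - 1)
    page_table = segments.get(segment)
    if page_table is None or page not in page_table:
        return None
    return (page_table[page] << OFFSET_BITS) | offset


def guest_virtual_to_host_physical(gva: int) -> str:
    # Stage 1: guest tables map guest virtual to guest physical.
    gpa = translate(gva, GUEST_SEGMENTS)
    if gpa is None:
        return "invalid guest virtual address"
    # Stage 2: host tables map guest physical to host physical.
    hpa = translate(gpa, HOST_SEGMENTS)
    if hpa is None:
        return "invalid guest physical address"
    return f"{hpa:08X}"


print(guest_virtual_to_host_physical(0x0001ABCD))
```

A shadow page table, as asked for in part (b), stores the composition of the two stages so that the hardware can perform the translation in a single lookup.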
2. Suppose we are doing incremental checkpoints on a system with 4 KB pages and a disk capable of transferring data at 10 MB/s.
a. What is the maximum rate of updates to new pages if every modified page is sent in its entirety to disk on every checkpoint and we require that each checkpoint reach disk before we start the next checkpoint?
b. Suppose that most pages saved during an incremental checkpoint are only partially modified. Describe how you would design a system to save only the modified portions of each page as part of the checkpoint.
Glossary
absolutepathAfilepathnameinterpretedrelativetotherootdirectory.
abstractvirtualmachineTheinterfaceprovidedbyanoperatingsystemtoitsapplications,includingthesystemcallinterface,thememoryabstraction,exceptions,andsignals.
ACIDpropertiesAmnemonicforthepropertiesofatransaction:atomicity,consistency,isolation,anddurability.
acquire-all/release-allAdesignpatterntoprovideatomicityofarequestconsistingofmultipleoperations.Athreadacquiresallofthelocksitmightneedbeforestartingtoprocessarequest;itreleasesthelocksoncetherequestisdone.
addresstranslationTheconversionfromthememoryaddresstheprogramthinksitisreferencingtothephysicallocationofthememory.
affinityschedulingAschedulingpolicywheretasksarepreferentiallyscheduledontothesameprocessortheyhadpreviouslybeenassigned,toimprovecachereuse.
annualdiskfailurerateThefractionofdisksexpectedtofailureeachyear.
APISee:applicationprogramminginterface.
applicationprogramminginterfaceThesystemcallinterfaceprovidedbyanoperatingsystemtoapplications.
armAnattachmentallowingthemotionofthediskheadacrossadisksurface.
armassemblyAmotorplusthesetofdiskarmsneededtopositionadiskheadtoreadorwriteeachsurfaceofthedisk.
arrivalrateTherateatwhichtasksarriveforservice.
asynchronousI/OAdesignpatternforsystemcallstoallowasingle-threadedprocesstomakemultipleconcurrentI/Orequests.WhentheprocessissuesanI/Orequest,thesystemcallreturnsimmediately.TheprocesslateronreceivesanotificationwhentheI/Ocompletes.
asynchronousprocedurecallAprocedurecallwherethecallerstartsthefunction,continuesexecutionconcurrentlywiththecalledfunction,andlaterwaitsforthefunctiontocomplete.
atomiccommitThemomentwhenatransactioncommitstoapplyallofitsupdates.
atomicmemoryThevaluestoredinmemoryisthelastvaluestoredbyoneoftheprocessors,notamixtureoftheupdatesofdifferentprocessors.
atomicoperationsIndivisibleoperationsthatcannotbeinterleavedwithorsplitbyotheroperations.
atomicread-modify-writeinstructionAprocessor-specificinstructionthatletsonethreadtemporarilyhaveexclusiveandatomicaccesstoamemorylocationwhiletheinstructionexecutes.Typically,theinstruction(atomically)readsamemorylocation,doessomesimplearithmeticoperationtothevalue,andstorestheresult.
attributerecordInNTFS,avariable-sizedatastructurecontainingeitherfiledataorfilemetadata.
availabilityThepercentageoftimethatasystemisusable.
averageseektimeTheaveragetimeacrossseeksbetweeneachpossiblepairoftracksonadisk.
AVMSee:abstractvirtualmachine.
backupAlogicallyorphysicallyseparatecopyofasystem’smainstorage.
baseandboundmemoryprotectionAnearlysystemformemoryprotectionwhereeachprocessislimitedtoaspecificrangeofphysicalmemory.
batchoperatingsystemAnearlytypeofoperatingsystemthatefficientlyranaqueueoftasks.Whileoneprogramwasrunning,anotherwasbeingloadedintomemory.
bathtubmodelAmodelofdiskdevicefailurecombiningdeviceinfantmortalityandwearout.
Belady’sanomalyForsomecachereplacementpoliciesandsomereferencepatterns,addingspacetoacachecanhurtthecachehitrate.
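The anomaly can be reproduced with a short simulation. The sketch below assumes FIFO replacement and a commonly used twelve-reference string; the function name `fifo_faults` and the constant `REFS` are illustrative and do not come from the text.

```python
from collections import deque

def fifo_faults(refs, frames):
    """Count cache misses for a FIFO-replaced cache holding `frames` pages."""
    resident = set()
    order = deque()              # arrival order; the oldest page is on the left
    faults = 0
    for page in refs:
        if page in resident:
            continue             # hit: FIFO does not reorder pages on a hit
        faults += 1
        if len(resident) == frames:
            victim = order.popleft()
            resident.remove(victim)
        resident.add(page)
        order.append(page)
    return faults

# Illustrative reference pattern (an assumption, not taken from the glossary).
REFS = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]

print(fifo_faults(REFS, 3))      # misses with 3 frames
print(fifo_faults(REFS, 4))      # misses with 4 frames
```

With this pattern, the larger cache takes more misses than the smaller one, which is the behavior the entry describes.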
bestfitAstorageallocationpolicythatattemptstoplaceanewlyallocatedfileinthesmallestfreeregionthatislargeenoughtoholdit.
BIOSTheinitialcoderunwhenanIntelx86computerboots;acronymforBasicInput/OutputSystem.Seealso:BootROM.
biterrorrateThenon-recoverablereaderrorrate.
bitmapAdatastructureforblockallocationwhereeachblockisrepresentedbyonebit.
blockdeviceAnI/Odevicethatallowsdatatobereadorwritteninfixed-sizedblocks.
blockgroupAsetofnearbydisktracks.
blockintegritymetadataAdditionaldatastoredwithablocktoallowthesoftwaretovalidatethattheblockhasnotbeencorrupted.
blockingboundedqueue
Aboundedqueuewhereathreadtryingtoremoveanitemfromanemptyqueuewillwaituntilanitemisavailable,andathreadtryingtoputanitemintoafullqueuewillwaituntilthereisroom.
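A minimal sketch of such a queue, assuming one lock and two condition variables; the class and method names are illustrative only.

```python
import threading
from collections import deque

class BlockingBoundedQueue:
    """Illustrative blocking bounded queue built from a lock and two conditions."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        # Both condition variables share the same lock.
        self._not_empty = threading.Condition(self._lock)
        self._not_full = threading.Condition(self._lock)

    def put(self, item):
        with self._lock:
            while len(self._items) == self._capacity:
                self._not_full.wait()      # wait until there is room
            self._items.append(item)
            self._not_empty.notify()

    def get(self):
        with self._lock:
            while not self._items:
                self._not_empty.wait()     # wait until an item is available
            item = self._items.popleft()
            self._not_full.notify()
            return item
```

Each wait sits inside a `while` loop so that the thread rechecks the queue state after it wakes, rather than assuming the condition still holds.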
BohrbugsBugsthataredeterministicandreproducible,giventhesameprograminput.Seealso:Heisenbugs.
BootROMSpecialread-onlymemorycontainingtheinitialinstructionsforbootingacomputer.
bootloaderProgramstoredatafixedpositionondisk(orflashRAM)toloadtheoperatingsystemintomemoryandstartitexecuting.
boundedqueueAqueuewithafixedsizelimitonthenumberofitemsstoredinthequeue.
boundedresourcesAnecessaryconditionfordeadlock:thereareafinitenumberofresourcesthatthreadscansimultaneouslyuse.
bufferoverflowattackAnattackthatexploitsabugwhereinputcanoverflowthebufferallocatedtoholdit,overwritingotherimportantprogramdatastructureswithdataprovidedbytheattacker.Onecommonvariationoverflowsabufferallocatedonthestack(e.g.,alocal,automaticvariable)andreplacesthefunction’sreturnaddresswithareturnaddressspecifiedbytheattacker,possiblytocode“pushed”ontothestackwiththeoverflowinginput.
bulksynchronousAtypeofparallelapplicationwhereworkissplitintoindependenttasksandwhereeachtaskcompletesbeforetheresultsofanyofthetaskscanbeused.
bulksynchronousparallelprogrammingSee:dataparallelprogramming.
burstydistributionAprobabilitydistributionthatislessevenlydistributedaroundthemeanvaluethananexponentialdistribution.See:exponentialdistribution.Compare:heavy-taileddistribution.
busy-waitingAthreadspinsinaloopwaitingforaconcurrenteventtooccur,consumingCPUcycleswhileitiswaiting.
cacheAcopyofdatathatcanbeaccessedmorequicklythantheoriginal.
cachehitThecachecontainstherequesteditem.
cachemissThecachedoesnotcontaintherequesteditem.
checkpointAconsistentsnapshotoftheentirestateofaprocess,includingthecontentsofmemoryandprocessorregisters.
childprocessAprocesscreatedbyanotherprocess.Seealso:parentprocess.
CircularSCANSee:CSCAN.
circularwaitingAnecessaryconditionfordeadlocktooccur:thereisasetofthreadssuchthateachthreadiswaitingforaresourceheldbyanother.
client-servercommunicationTwo-waycommunicationbetweenprocesses,wheretheclientsendsarequesttotheservertodosometask,andwhentheoperationiscomplete,theserverrepliesbacktotheclient.
clockalgorithmAmethodforidentifyinganotrecentlyusedpagetoevict.Thealgorithmsweepsthrougheachpageframe:ifthepageusebitisset,itiscleared;iftheusebitisnotset,thepageisreclaimed.
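The sweep can be sketched as follows. This is a simplified model under stated assumptions: the use bits live in a Python list, and `touch` stands in for the hardware setting the bit on a reference. The names are hypothetical.

```python
class ClockReplacer:
    """Illustrative clock algorithm over a fixed set of page frames."""

    def __init__(self, num_frames):
        self.use_bit = [False] * num_frames
        self.hand = 0                      # next frame the sweep will examine

    def touch(self, frame):
        """Model the hardware setting the use bit when a page is referenced."""
        self.use_bit[frame] = True

    def pick_victim(self):
        """Advance the hand until a frame with a clear use bit is found."""
        n = len(self.use_bit)
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % n
            if self.use_bit[frame]:
                self.use_bit[frame] = False   # recently used: clear and move on
            else:
                return frame                  # not recently used: reclaim
```

If every use bit is set, the first sweep clears them all and the second pass reclaims the frame where the hand started, so the loop always terminates.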
cloudcomputingAmodelofcomputingwherelarge-scaleapplicationsrunonsharedcomputingandstorageinfrastructureindatacentersinsteadofontheuser’sowncomputer.
commitTheoutcomeofatransactionwhereallofitsupdatesoccur.
compare-and-swapAnatomicread-modify-writeinstructionthatfirstteststhevalueofamemorylocation,andifthevaluehasnotbeenchanged,setsittoanewvalue.
compute-boundtaskAtaskthatprimarilyusestheprocessoranddoeslittleI/O.
computervirusAcomputerprogramthatmodifiesanoperatingsystemorapplicationtocopyitselffromcomputertocomputerwithoutthecomputerowner’spermissionorknowledge.Onceinstalledonacomputer,avirusoftenprovidestheattackercontroloverthesystem’sresourcesordata.
concurrencyMultipleactivitiesthatcanhappenatthesametime.
conditionvariableAsynchronizationvariablethatenablesathreadtoefficientlywaitforachangetosharedstateprotectedbyalock.
continuationAdatastructureusedinevent-drivenprogrammingthatkeepstrackofatask’scurrentstateanditsnextstep.
cooperatingthreadsThreadsthatreadandwritesharedstate.
cooperativecachingUsingthememoryofnearbynodesoveranetworkasacachetoavoidthelatencyofgoingtodisk.
cooperativemulti-threadingEachthreadrunswithoutinterruptionuntilitexplicitlyrelinquishescontroloftheprocessor,e.g.,byexitingorcallingthread_yield.
copy-on-write
A method of sharing physical memory between two logically distinct copies (e.g., in different processes). Each shared page is marked as read-only so that the operating system kernel is invoked and can make a copy of the page if either process tries to write it. The process can then modify the copy and resume normal execution.
copy-on-writefilesystemAfilesystemwhereanupdatetothefilesystemismadebywritingnewversionsofmodifieddataandmetadatablockstofreediskblocks.Thenewblockscanpointtounchangedblocksinthepreviousversionofthefilesystem.Seealso:COWfilesystem.
coremapAdatastructureusedbythememorymanagementsystemtokeeptrackofthestateofphysicalpageframes,suchaswhichprocessesreferencethepageframe.
COWfilesystemSee:copy-on-writefilesystem.
criticalpathTheminimumsequenceofstepsforaparallelapplicationtocomputeitsresult,evenwithinfiniteresources.
criticalsectionAsequenceofcodethatoperatesonsharedstate.
cross-sitescriptingAnattackagainstaclientcomputerthatworksbycompromisingaservervisitedbytheclient.Thecompromisedserverthenprovidesscriptingcodetotheclientthataccessesanddownloadstheclient’ssensitivedata.
cryptographicsignatureAspeciallydesignedfunctionofadatablockandaprivatecryptographickeythatallowssomeonewiththecorrespondingpublickeytoverifythatanauthorizedentityproducedthedatablock.Itiscomputationallyintractableforanattackerwithouttheprivatekeytocreateadifferentdatablockwithavalidsignature.
CSCANAvariationoftheSCANdiskschedulingpolicyinwhichthediskonlyservicesrequestswhentheheadistravelinginonedirection.Seealso:CircularSCAN.
currentworkingdirectoryThecurrentdirectoryoftheprocess,usedforinterpretingrelativepathnames.
databreakpointArequesttostoptheexecutionofaprogramwhenitreferencesormodifiesaparticularmemorylocation.
dataparallelprogrammingAprogrammingmodelwherethecomputationisperformedinparallelacrossallitemsinadataset.
deadlockAcycleofwaitingamongasetofthreads,whereeachthreadwaitsforsomeotherthreadinthecycletotakesomeaction.
deadlockedstateThesystemhasatleastonedeadlock.
declusteringAtechniqueforreducingtherecoverytimeafteradiskfailureinaRAIDsystembyspreadingredundantdiskblocksacrossmanydisks.
defenseindepthImprovingsecuritythroughmultiplelayersofprotection.
defragmentCoalescescattereddiskblockstoimprovespatiallocality,byreadingdatafromitspresentstoragelocationandrewritingittoanew,morecompact,location.
demandpagingUsingaddresstranslationhardwaretorunaprocesswithoutallofitsmemoryphysicallypresent.Whentheprocessreferencesamissingpage,thehardwaretrapstothekernel,whichbringsthepageintomemoryfromdisk.
deterministicdebuggingTheabilitytore-executeaconcurrentprocesswiththesamescheduleandsequenceofinternalandexternalevents.
devicedriverOperatingsystemcodetoinitializeandmanageaparticularI/Odevice.
directmappedcacheOnlyoneentryinthecachecanholdaspecificmemorylocation,soonalookup,thesystemmustchecktheaddressagainstonlythatentrytodetermineifthereisacachehit.
directmemoryaccessHardwareI/Odevicestransferdatadirectlyinto/outofmainmemoryatalocationspecifiedbytheoperatingsystem.Seealso:DMA.
dirtybitAstatusbitinapagetableentryrecordingwhetherthecontentsofthepagehavebeenmodifiedrelativetowhatisstoredondisk.
diskbuffermemoryMemoryinthediskcontrollertobufferdatabeingreadorwrittentothedisk.
diskinfantmortalityThedevicefailurerateishigherthannormalduringthefirstfewweeksofuse.
diskwearoutThedevicefailureraterisesafterthedevicehasbeeninoperationforseveralyears.
DMASee:directmemoryaccess.
dnode
In ZFS, a file is represented by a variable-depth tree whose root is a dnode and whose leaves are its data blocks.
doubleindirectblockAstorageblockcontainingpointerstoindirectblocks.
double-checked locking
A pitfall in concurrent code where a data structure is lazily initialized by first checking without a lock whether it has been set and, if not, acquiring a lock and checking again before calling the initialization function. With instruction re-ordering, double-checked locking can fail unexpectedly.
dualredundancyarrayARAIDstoragealgorithmusingtworedundantdiskblocksperarraytotoleratetwodiskfailures.Seealso:RAID6.
dual-modeoperation
Hardwareprocessorthathas(atleast)twoprivilegelevels:oneforexecutingthekernelwithcompleteaccesstothecapabilitiesofthehardwareandasecondforexecutingusercodewithrestrictedrights.Seealso:kernel-modeoperation.Seealso:user-modeoperation.
dynamicallyloadabledevicedriverSoftwaretomanageaspecificdevice,interface,orchipset,addedtotheoperatingsystemkernelafterthekernelstartsrunning.
earliestdeadlinefirstAschedulingpolicythatperformsthetaskthatneedstobecompletedfirst,butonlyifitcanbefinishedintime.
EDFSee:earliestdeadlinefirst.
efficiencyThelackofoverheadinimplementinganabstraction.
erasureblockTheunitoferasureinaflashmemorydevice.Beforeanyportionofanerasureblockcanbeover-written,everycellintheentireerasureblockmustbesettoalogical“1.”
errorcorrectingcodeAtechniqueforstoringdataredundantlytoallowfortheoriginaldatatoberecoveredeventhoughsomebitsinadisksectororflashmemorypagearecorrupted.
event-drivenprogrammingAcodingdesignpatternwhereathreadspinsinaloop;eachiterationgetsandprocessesthenextI/Oevent.
exceptionSee:processorexception.
executableimageFilecontainingasequenceofmachineinstructionsandinitialdatavaluesforaprogram.
executionstackSpacetostorethestateoflocalvariablesduringprocedurecalls.
exponential distribution
A convenient probability distribution for use in queueing theory because it has the property of being memoryless. For a continuous random variable with a mean of $1/\lambda$, the probability density function is $f(x) = \lambda e^{-\lambda x}$.
extentAvariable-sizedregionofafilethatisstoredinacontiguousregiononthestoragedevice.
externalfragmentationInasystemthatallocatesmemoryincontiguousregions,theunusablememorybetweenvalidcontiguousallocations.Anewrequestformemorymayfindnosinglefreeregionthatisbothcontiguousandlargeenough,eventhoughthereisenoughfreememoryinaggregate.
fairnessPartitioningofsharedresourcesbetweenusersorapplicationseitherequallyorbalancedaccordingtosomedesiredpriorities.
falsesharing
Extrainter-processorcommunicationrequiredbecauseasinglecacheentrycontainsportionsoftwodifferentdatastructureswithdifferentsharingpatterns.
fatesharingWhenacrashinonemoduleimpliesacrashinanother.Forexample,alibrarysharesfatewiththeapplicationitislinkedwith;ifeithercrashes,theprocessexits.
faultisolationAnerrorinoneapplicationshouldnotdisruptotherapplications,oreventheoperatingsystemitself.
fileAnamedcollectionofdatainafilesystem.
fileallocationtableAnarrayofentriesintheFATfilesystemstoredinareservedareaofthevolume,whereeachentrycorrespondstoonefiledatablock,andpointstothenextblockinthefile.
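Following a file's chain through the table can be sketched as below. The sentinel `END_OF_FILE` and the list-based table are simplifying assumptions; a real FAT volume uses reserved entry values and an on-disk layout not modeled here.

```python
END_OF_FILE = -1   # hypothetical end-of-chain marker for this sketch

def file_blocks(fat, first_block):
    """Return the block numbers of a file by following next-block pointers."""
    blocks = []
    block = first_block
    while block != END_OF_FILE:
        blocks.append(block)
        block = fat[block]      # each entry points to the file's next block
    return blocks

# Illustrative table: a file starting at block 1 occupies blocks 1 -> 4 -> 2.
fat = [END_OF_FILE, 4, END_OF_FILE, END_OF_FILE, 2]
print(file_blocks(fat, 1))
```

Because each lookup depends on the previous one, reaching the n-th block of a file requires walking n entries of the table.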
filedataContentsofafile.
filedescriptorAhandletoanopenfile,device,orchannel.Seealso:filehandle.Seealso:filestream.
filedirectoryAlistofhuman-readablenamesplusamappingfromeachnametoaspecificfileorsub-directory.
filehandleSee:filedescriptor.
fileindexstructureApersistentlystoreddatastructureusedtolocatetheblocksofthefile.
filemetadataInformationaboutafilethatismanagedbytheoperatingsystem,butnotincludingthefilecontents.
filestreamSee:filedescriptor.
filesystemAnoperatingsystemabstractionthatprovidespersistent,nameddata.
filesystemfingerprintAchecksumacrosstheentirefilesystem.
fill-on-demandAmethodforstartingaprocessbeforeallofitsmemoryisbroughtinfromdisk.Ifthefirstaccesstothemissingmemorytriggersatraptothekernel,thekernelcanfillthememoryandthenresume.
fine-grainedlockingAwaytoincreaseconcurrencybypartitioninganobject’sstateintodifferentsubsetseachprotectedbyadifferentlock.
finishedlistThesetofthreadsthatarecompletebutnotyetde-allocated,e.g.,becauseajoinmayreadthereturnvaluefromthethreadcontrolblock.
first-in-first-out
A scheduling policy that performs each task in the order in which it arrives.
flash page failure
A flash memory device failure where the data stored on one or more individual pages of flash are lost, but the rest of the flash continues to operate correctly.
flashtranslationlayerAlayerthatmapslogicalflashpagestodifferentphysicalpagesontheflashdevice.Seealso:FTL.
flashwearoutAftersomenumberofprogram-erasecycles,agivenflashstoragecellmaynolongerbeabletoreliablystoreinformation.
fork-joinparallelismAtypeofparallelprogrammingwherethreadscanbecreated(forked)todoworkinparallelwithaparentthread;aparentmayasynchronouslywaitforachildthreadtofinish(join).
freespacemapAfilesystemdatastructureusedtotrackwhichstorageblocksarefreeandwhichareinuse.
FTLSee:flashtranslationlayer.
fulldiskfailureWhenadiskdevicestopsbeingabletoservicereadsorwritestoallsectors.
fullflashdrivefailureWhenaflashdevicestopsbeingabletoservicereadsorwritestoallmemorypages.
fullyassociativecacheAnyentryinthecachecanholdanymemorylocation,soonalookup,thesystemmustchecktheaddressagainstalloftheentriesinthecachetodetermineifthereisacachehit.
gangschedulingAschedulingpolicyformultiprocessorsthatperformsalloftherunnabletasksforaparticularprocessatthesametime.
GlobalDescriptorTableThex86terminologyforasegmenttableforsharedsegments.ALocalDescriptorTableisusedforsegmentsthatareprivatetotheprocess.
graceperiodForasharedobjectprotectedbyaread-copy-updatelock,thetimefromwhenanewversionofasharedobjectispublisheduntilthelastreaderoftheoldversionisguaranteedtobefinished.
greenthreadsAthreadsystemimplementedentirelyatuser-levelwithoutanyrelianceonoperatingsystemkernelservices,otherthanthosedesignedforsingle-threadedprocesses.
groupcommitAtechniquethatbatchesmultipletransactioncommitsintoasinglediskoperation.
guestoperatingsystemAnoperatingsystemrunninginavirtualmachine.
hard link
The mapping between a file name and the underlying file, typically when there are multiple path names for the same underlying file.
hardware abstraction layer
A module in the operating system that hides the specifics of different hardware implementations. Above this layer, the operating system is portable.
hardwaretimerAhardwaredevicethatcancauseaprocessorinterruptaftersomedelay,eitherintimeorininstructionsexecuted.
headThecomponentthatwritesthedatatoorreadsthedatafromaspinningdisksurface.
head crash
An error where the disk head physically scrapes the magnetic surface of a spinning disk.
headswitchtimeThetimeittakestore-positionthediskarmoverthecorrespondingtrackonadifferentsurface,beforeareadorwritecanbegin.
heapSpacetostoredynamicallyallocateddatastructures.
heavy-taileddistributionAprobabilitydistributionsuchthateventsfarfromthemeanvalue(inaggregate)occurwithsignificantprobability.Whenusedforthedistributionoftimebetweenevents,theremainingtimetothenexteventispositivelyrelatedtothetimealreadyspentwaiting—youexpecttowaitlongerthelongeryouhavealreadywaited.
HeisenbugsBugsinconcurrentprogramsthatdisappearorchangebehaviorwhenyoutrytoexaminethem.Seealso:Bohrbugs.
hintAresultofsomecomputationwhoseresultsmaynolongerbevalid,butwhereusinganinvalidhintwilltriggeranexception.
homedirectoryThesub-directorycontainingauser’sfiles.
hostoperatingsystemAnoperatingsystemthatprovidestheabstractionofavirtualmachine,torunanotheroperatingsystemasanapplication.
hosttransfertimeThetimetotransferdatabetweenthehost’smemoryandthedisk’sbuffer.
hyperthreadingSee:simultaneousmulti-threading.
I/O-boundtaskAtaskthatprimarilydoesI/O,anddoeslittleprocessing.
idempotentAnoperationthathasthesameeffectwhetherexecutedonceormanytimes.
incrementalcheckpointAconsistentsnapshotoftheportionofprocessmemorythathasbeenmodifiedsincethepreviouscheckpoint.
independentthreadsThreadsthatoperateoncompletelyseparatesubsetsofprocessmemory.
indirectblockAstorageblockcontainingpointerstofiledatablocks.
inodeIntheUnixFastFileSystem(FFS)andrelatedfilesystems,aninodestoresafile’smetadata,includinganarrayofpointersthatcanbeusedtofindallofthefile’sblocks.Theterminodeissometimesusedmoregenerallytorefertoanyfilesystem’sper-filemetadatadatastructure.
inodearrayThefixedlocationondiskcontainingallofthefilesystem’sinodes.Seealso:inumber.
intentionsThesetofwritesthatatransactionwillperformifthetransactioncommits.
internalfragmentationWithpagedallocationofmemory,theunusablememoryattheendofapagebecauseaprocesscanonlybeallocatedmemoryinpage-sizedchunks.
interruptAnasynchronoussignaltotheprocessorthatsomeexternaleventhasoccurredthatmayrequireitsattention.
interruptdisableAprivilegedhardwareinstructiontotemporarilydeferanyhardwareinterrupts,toallowthekerneltocompleteacriticaltask.
interruptenableAprivilegedhardwareinstructiontoresumehardwareinterrupts,afteranon-interruptibletaskiscompleted.
interrupthandlerAkernelprocedureinvokedwhenaninterruptoccurs.
interruptstackAregionofmemoryforholdingthestackofthekernel’sinterrupthandler.Whenaninterrupt,processorexception,orsystemcalltrapcausesacontextswitchintothekernel,thehardwarechangesthestackpointertopointtothebaseofthekernel’sinterruptstack.
interruptvectortableAtableofpointersintheoperatingsystemkernel,indexedbythetypeofinterrupt,witheachentrypointingtothefirstinstructionofahandlerprocedureforthatinterrupt.
inumberTheindexintotheinodearrayforaparticularfile.
invertedpagetableAhashtableusedfortranslationbetweenvirtualpagenumbersandphysicalpageframes.
kernelthreadAthreadthatisimplementedinsidetheoperatingsystemkernel.
kernel-modeoperationTheprocessorexecutesinanunrestrictedmodethatgivestheoperatingsystemfullcontroloverthehardware.Compare:user-modeoperation.
LBA
See: logical block address.
least frequently used
A cache replacement policy that evicts whichever block has been used the least often, over some period of time. See also: LFU.
leastrecentlyusedAcachereplacementpolicythatevictswhicheverblockhasnotbeenusedforthelongestperiodoftime.Seealso:LRU.
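One way to sketch this policy in software is with an ordered dictionary that keeps blocks in order of last use. The class below is an illustration under that assumption; hardware and kernel implementations typically approximate LRU rather than track it exactly.

```python
from collections import OrderedDict

class LRUCache:
    """Illustrative LRU cache: the least recently used block is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()    # least recently used first, most recent last

    def get(self, key):
        if key not in self.blocks:
            return None                # cache miss
        self.blocks.move_to_end(key)   # a hit makes this block the most recent
        return self.blocks[key]

    def put(self, key, value):
        if key in self.blocks:
            self.blocks.move_to_end(key)
        elif len(self.blocks) == self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[key] = value
```

For example, with a capacity of two, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `b` has gone unused for the longest time.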
LFUSee:leastfrequentlyused.
Little’sLawInastablesystemwherethearrivalratematchesthedeparturerate,thenumberoftasksinthesystemequalsthesystem’sthroughputmultipliedbytheaveragetimeataskspendsinthesystem:N=XR.
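As a worked instance of N = XR, the sketch below uses made-up numbers for a hypothetical server; the function names are illustrative.

```python
def tasks_in_system(throughput, response_time):
    """Little's Law: N = X * R."""
    return throughput * response_time

def average_response_time(tasks, throughput):
    """Little's Law rearranged to solve for R: R = N / X."""
    return tasks / throughput

# Hypothetical server: 200 tasks per second, each spending 0.25 seconds in the system.
n = tasks_in_system(200, 0.25)
print(n)                                  # average number of tasks in the system
print(average_response_time(n, 200))      # recovers the 0.25-second response time
```

The relationship holds only for a stable system, as the entry states; it says nothing about how the tasks are scheduled.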
livenesspropertyAconstraintonprogrambehaviorsuchthatitalwaysproducesaresult.Compare:safetyproperty.
localityheuristicAfilesystemblockallocationpolicythatplacesfilesinnearbydisksectorsiftheyarelikelytobereadorwrittenatthesametime.
lockAtypeofsynchronizationvariableusedforenforcingatomic,mutuallyexclusiveaccesstoshareddata.
lockorderingAwidelyusedapproachtopreventdeadlock,wherelocksareacquiredinapre-determinedorder.
lock-freedatastructuresConcurrentdatastructurethatguaranteesprogressforsomethread:somemethodwillfinishinafinitenumberofsteps,regardlessofthestateofotherthreadsexecutinginthedatastructure.
logAnorderedsequenceofstepssavedtopersistentstorage.
logicalblockaddressAuniqueidentifierforeachdisksectororflashmemoryblock,typicallynumberedfrom1tothesizeofthedisk/flashdevice.Thediskinterfaceconvertsthisidentifiertothephysicallocationofthesector/block.Seealso:LBA.
logicalseparationAbackupstoragepolicywherethebackupisstoredatthesamelocationastheprimarystorage,butwithrestrictedaccess,e.g.,topreventupdates.
LRUSee:leastrecentlyused.
masterfiletableInNTFS,anarrayofrecordsstoringmetadataabouteachfile.Seealso:MFT.
maximumseektimeThetimeittakestomovethediskarmfromtheinnermosttracktotheoutermostoneorviceversa.
max-minfairness
Aschedulingobjectivetomaximizetheminimumresourceallocationgiventoeachtask.
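One common way to compute a max-min fair allocation is to repeatedly divide the remaining capacity equally among the tasks that still want more. The sketch below assumes each task has a known demand; the function name and parameters are hypothetical.

```python
def max_min_allocate(capacity, demands):
    """Split `capacity` so that the smallest allocation is as large as possible."""
    alloc = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 1e-12:
        share = remaining / len(active)
        # Tasks whose unmet demand fits within an equal share are fully satisfied.
        satisfied = [i for i in active if demands[i] - alloc[i] <= share]
        if not satisfied:
            for i in active:           # nobody can be satisfied: split evenly
                alloc[i] += share
            break
        for i in satisfied:
            remaining -= demands[i] - alloc[i]
            alloc[i] = float(demands[i])
        active = [i for i in active if i not in satisfied]
    return alloc

print(max_min_allocate(10, [1, 2, 8, 8]))
```

In this example the two small demands are met in full, and the leftover capacity is shared equally by the two large ones.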
MCSlockAnefficientspinlockimplementationwhereeachwaitingthreadspinsonaseparatememorylocation.
meantimetodatalossTheexpectedtimeuntilaRAIDsystemsuffersanunrecoverableerror.Seealso:MTTDL.
meantimetofailureTheaveragetimethatasystemrunswithoutfailing.Seealso:MTTF.
meantimetorepairTheaveragetimethatittakestorepairasystemonceithasfailed.Seealso:MTTR.
memoryaddressaliasTwoormorevirtualaddressesthatrefertothesamephysicalmemorylocation.
memorybarrierAninstructionthatpreventsthecompilerandhardwarefromreorderingmemoryaccessesacrossthebarrier—noaccessesbeforethebarrieraremovedafterthebarrierandnoaccessesafterthebarrieraremovedbeforethebarrier.
memoryprotectionHardwareorsoftware-enforcedlimitssothateachapplicationprocesscanreadandwriteonlyitsownmemoryandnotthememoryoftheoperatingsystemoranyotherprocess.
memorylesspropertyForaprobabilitydistributionforthetimebetweenevents,theremainingtimetothenexteventdoesnotdependontheamountoftimealreadyspentwaiting.Seealso:exponentialdistribution.
memory-mappedfileAfilewhosecontentsappeartobeamemorysegmentinaprocess’svirtualaddressspace.
memory-mappedI/OEachI/Odevice’scontrolregistersaremappedtoarangeofphysicaladdressesonthememorybus.
memristorAtypeofsolid-statepersistentstorageusingacircuitelementwhoseresistancedependsontheamountsanddirectionsofcurrentsthathaveflowedthroughitinthepast.
MFQSee:multi-levelfeedbackqueue.
MFTSee:masterfiletable.
microkernelAnoperatingsystemdesignwherethekernelitselfiskeptsmall,andinsteadmostofthefunctionalityofatraditionaloperatingsystemkernelisputintoasetofuser-levelprocesses,orservers,accessedfromuserapplicationsviainterprocesscommunication.
MIN cache replacement
See: optimal cache replacement.
minimum seek time
The time to move the disk arm to the next adjacent track.
MIPS
An early measure of processor performance: millions of instructions per second.
mirroring
A system for redundantly storing data on disk where each block of data is stored on two disks and can be read from either. See also: RAID 1.
modelAsimplificationthattriestocapturethemostimportantaspectsofamorecomplexsystem’sbehavior.
monolithickernelAnoperatingsystemdesignwheremostoftheoperatingsystemfunctionalityislinkedtogetherinsidethekernel.
Moore’sLawTransistordensityincreasesexponentiallyovertime.Similarexponentialimprovementshaveoccurredinmanyothercomponenttechnologies;inthepopularpress,theseoftengobythesameterm.
mountAmappingofapathintheexistingfilesystemtotherootdirectoryofanotherfilesystemvolume.
MTTDLSee:meantimetodataloss.
MTTFSee:meantimetofailure.
MTTRSee:meantimetorepair.
multi-levelfeedbackqueueAschedulingalgorithmwithmultipleprioritylevelsmanagedusingroundrobinqueues,whereataskismovedbetweenprioritylevelsbasedonhowmuchprocessingtimeithasused.Seealso:MFQ.
multi-levelindexAtreedatastructuretokeeptrackofthedisklocationofeachdatablockinafile.
multi-levelpagedsegmentationAvirtualmemorymechanismwherephysicalmemoryisallocatedinpageframes,virtualaddressesaresegmented,andeachsegmentistranslatedtophysicaladdressesthroughmultiplelevelsofpagetables.
multi-levelpagingAvirtualmemorymechanismwherephysicalmemoryisallocatedinpageframes,andvirtualaddressesaretranslatedtophysicaladdressesthroughmultiplelevelsofpagetables.
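A two-level lookup can be sketched as follows. The field widths (10-bit indices, 12-bit offset) and the dictionary-based tables are assumptions chosen for illustration; actual layouts are architecture-specific.

```python
PAGE_OFFSET_BITS = 12    # assumed 4 KB pages
INDEX_BITS = 10          # assumed 10-bit index at each of two levels

def translate(root_table, vaddr):
    """Translate a virtual address through two levels of page tables."""
    index_mask = (1 << INDEX_BITS) - 1
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    low = (vaddr >> PAGE_OFFSET_BITS) & index_mask
    high = (vaddr >> (PAGE_OFFSET_BITS + INDEX_BITS)) & index_mask
    second_level = root_table.get(high)
    if second_level is None or low not in second_level:
        raise LookupError("page fault")       # no valid page table entry
    frame = second_level[low]
    return (frame << PAGE_OFFSET_BITS) | offset

# Illustrative mapping: top-level index 1, second-level index 2 -> page frame 7.
root = {1: {2: 7}}
vaddr = (1 << 22) | (2 << 12) | 0x34
print(hex(translate(root, vaddr)))
```

Only second-level tables for regions that are actually in use need to exist, which is the main space advantage over a single flat page table.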
multipleindependentrequestsAnecessaryconditionfordeadlocktooccur:athreadfirstacquiresoneresourceandthentriestoacquireanother.
multiprocessorschedulingpolicyApolicytodeterminehowmanyprocessorstoassigneachprocess.
multiprogrammingSee:multitasking.
multitaskingTheabilityofanoperatingsystemtorunmultipleapplicationsatthesametime,alsocalledmultiprogramming.
multi-threadedprocessAprocesswithmultiplethreads.
multi-threadedprogramAgeneralizationofasingle-threadedprogram.Insteadofonlyonelogicalsequenceofsteps,theprogramhasmultiplesequences,orthreads,executingatthesametime.
mutualexclusionWhenonethreadusesalocktopreventconcurrentaccesstoashareddatastructure.
mutuallyrecursivelockingAdeadlockconditionwheretwosharedobjectscallintoeachotherwhilestillholdingtheirlocks.Deadlockoccursifonethreadholdsthelockonthefirstobjectandcallsintothesecond,whiletheotherthreadholdsthelockonthesecondobjectandcallsintothefirst.
nameddataDatathatcanbeaccessedbyahuman-readableidentifier,suchasafilename.
nativecommandqueueingSee:taggedcommandqueueing.
NCQSee:nativecommandqueueing.
nestedwaitingAdeadlockconditionwhereonesharedobjectcallsintoanothersharedobjectwhileholdingthefirstobject’slock,andthenwaitsonaconditionvariable.Deadlockresultsifthethreadthatcansignaltheconditionvariableneedsthefirstlocktomakeprogress.
networkeffectTheincreaseinvalueofaproductorservicebasedonthenumberofotherpeoplewhohaveadoptedthattechnologyandnotjustitsintrinsiccapabilities.
nopreemptionAnecessaryconditionfordeadlocktooccur:onceathreadacquiresaresource,itsownershipcannotberevokeduntilthethreadactstoreleaseit.
non-blockingdatastructureConcurrentdatastructurewhereathreadisneverrequiredtowaitforanotherthreadtocompleteitsoperation.
non-recoverablereaderrorWhensufficientbiterrorsoccurwithinadisksectororflashmemorypage,suchthattheoriginaldatacannotberecoveredevenaftererrorcorrection.
non-residentattributeInNTFS,anattributerecordwhosecontentsareaddressedindirectly,throughextentpointersinthemasterfiletablethatpointtothecontentsinthoseextents.
non-volatilestorageUnlikeDRAM,memorythatisdurableandretainsitsstateacrosscrashesandpoweroutages.Seealso:persistentstorage.Seealso:stablestorage.
notrecentlyusedAcachereplacementpolicythatevictssomeblockthathasnotbeenreferencedrecently,ratherthantheleastrecentlyusedblock.
obliviousschedulingAschedulingpolicywheretheoperatingsystemassignsthreadstoprocessorswithoutknowledgeoftheintentoftheparallelapplication.
opensystemAsystemwhosesourcecodeisavailabletothepublicformodificationandreuse,orasystemwhoseinterfacesaredefinedbyapublicstandardsprocess.
operatingsystemAlayerofsoftwarethatmanagesacomputer’sresourcesforitsusersandtheirapplications.
operatingsystemkernelThekernelisthelowestlevelofsoftwarerunningonthesystem,withfullaccesstoallofthecapabilitiesofthehardware.
optimalcachereplacementReplacewhicheverblockisusedfarthestinthefuture.
overheadTheaddedresourcecostofimplementinganabstractionversususingtheunderlyinghardwareresourcesdirectly.
ownershipdesignpatternAtechniqueformanagingconcurrentaccesstosharedobjectsinwhichatmostonethreadownsanobjectatanytime,andthereforethethreadcanaccesstheshareddatawithoutalock.
pagecoloringTheassignmentofphysicalpageframestovirtualaddressesbypartitioningframesbasedonwhichportionsofthecachetheywilluse.
pagefaultAhardwaretraptotheoperatingsystemkernelwhenaprocessreferencesavirtualaddresswithaninvalidpagetableentry.
pageframeAnaligned,fixed-sizechunkofphysicalmemorythatcanholdavirtualpage.
pagedmemoryAhardwareaddresstranslationmechanismwherememoryisallocatedinaligned,fixed-sizedchunks,calledpages.Anyvirtualpagecanbeassignedtoanyphysicalpageframe.
pagedsegmentationAhardwaremechanismwherephysicalmemoryisallocatedinpageframes,butvirtualaddressesaresegmented.
pairofstubsApairofshortproceduresthatmediatebetweentwoexecutioncontexts.
paravirtualizationAvirtualmachineabstractionthatallowstheguestoperatingsystemtomakesystemcallsintothehostoperatingsystemtoperformhardware-specificoperations,suchaschangingapagetableentry.
parent process
A process that creates another process. See also: child process.
path
The string that identifies a file or directory.
PCB
See: process control block.
PCM
See: phase change memory.
performance predictability
Whether a system’s response time or other performance metric is consistent over time.
persistentdataDatathatisstoreduntilitisexplicitlydeleted,evenifthecomputerstoringitcrashesorlosespower.
persistentstorageSee:non-volatilestorage.
phasechangebehaviorAbruptchangesinaprogram’sworkingset,causingburstycachemissrates:periodsoflowcachemissesinterspersedwithperiodsofhighcachemisses.
phasechangememoryAtypeofnon-volatilememorythatusesthephaseofamaterialtorepresentadatabit.Seealso:PCM.
physicaladdressAnaddressinphysicalmemory.
physicalseparationAbackupstoragepolicywherethebackupisstoredatadifferentlocationthantheprimarystorage.
physicallyaddressedcacheAprocessorcachethatisaccessedusingphysicalmemoryaddresses.
pinTobindavirtualresourcetoaphysicalresource,suchasathreadtoaprocessororavirtualpagetoaphysicalpage.
platterAsinglethinroundplatethatstoresinformationinamagneticdisk,oftenonbothsurfaces.
policy-mechanismseparationAsystemdesignprinciplewheretheimplementationofanabstractionisindependentoftheresourceallocationpolicyofhowtheabstractionisused.
pollingAnalternativetohardwareinterrupts,wheretheprocessorwaitsforanasynchronouseventtooccur,bylooping,orbusy-waiting,untiltheeventoccurs.
portabilityTheabilityofsoftwaretoworkacrossmultiplehardwareplatforms.
preciseinterruptsAllinstructionsthatoccurbeforetheinterruptorexception,accordingtotheprogramexecution,arecompletedbythehardwarebeforetheinterrupthandlerisinvoked.
preemption
When a scheduler takes the processor away from one task and gives it to another.
preemptive multi-threading
The operating system scheduler may switch out a running thread, e.g., on a timer interrupt, without any explicit action by the thread to relinquish control at that point.
prefetchTobringdataintoacachebeforeitisneeded.
principleofleastprivilegeSystemsecurityandreliabilityareenhancedifeachpartofthesystemhasexactlytheprivilegesitneedstodoitsjobandnomore.
prioritydonationAsolutiontopriorityinversion:whenathreadwaitsforalockheldbyalowerprioritythread,thelockholderistemporarilyincreasedtothewaiter’spriorityuntilthelockisreleased.
priorityinversionAschedulinganomalythatoccurswhenahighprioritytaskwaitsindefinitelyforaresource(suchasalock)heldbyalowprioritytask,becausethelowprioritytaskiswaitinginturnforaresource(suchastheprocessor)heldbyamediumprioritytask.
privacyDatastoredonacomputerisonlyaccessibletoauthorizedusers.
privilegedinstructionInstructionavailableinkernelmodebutnotinusermode.
processTheexecutionofanapplicationprogramwithrestrictedrights—theabstractionforprotectionprovidedbytheoperatingsystemkernel.
processcontrolblockAdatastructurethatstoresalltheinformationtheoperatingsystemneedsaboutaparticularprocess:e.g.,whereitisstoredinmemory,whereitsexecutableimageisondisk,whichuseraskedittostartexecuting,andwhatprivilegestheprocesshas.Seealso:PCB.
processmigrationTheabilitytotakearunningprogramononesystem,stopitsexecution,andresumeitonadifferentmachine.
processorexceptionAhardwareeventcausedbyuserprogrambehaviorthatcausesatransferofcontroltoakernelhandler.Forexample,attemptingtodividebyzerocausesaprocessorexceptioninmanyarchitectures.
processorschedulingpolicyWhentherearemorerunnablethreadsthanprocessors,thepolicythatdetermineswhichthreadstorunfirst.
processorstatusregisterAhardwareregistercontainingflagsthatcontroltheoperationoftheprocessor,includingtheprivilegelevel.
producer-consumercommunicationInterprocesscommunicationwheretheoutputofoneprocessistheinputofanother.
proprietary system
A system that is under the control of a single company; it can be changed at any time by its provider to meet the needs of its customers.
protection
The isolation of potentially misbehaving applications and users so that they do not corrupt other applications or the operating system itself.
publishForaread-copy-updatelock,asingle,atomicmemorywritethatupdatesasharedobjectprotectedbythelock.Thewriteallowsnewreaderthreadstoobservethenewversionoftheobject.
queueingdelayThetimeataskwaitsinlinewithoutreceivingservice.
quiescentForaread-copy-updatelock,noreaderthreadthatwasactiveatthetimeofthelastmodificationisstillactive.
raceconditionWhenthebehaviorofaprogramreliesontheinterleavingofoperationsofdifferentthreads.
RAIDARedundantArrayofInexpensiveDisks(RAID)isasystemthatspreadsdataredundantlyacrossmultipledisksinordertotolerateindividualdiskfailures.
RAID1See:mirroring.
RAID5See:rotatingparity.
RAID6See:dualredundancyarray.
RAIDstripAsetofseveralsequentialblocksplacedononediskbyaRAIDblockplacementalgorithm.
RAIDstripeAsetofRAIDstripsandtheirparitystrip.
R-CSCANAvariationoftheCSCANdiskschedulingpolicyinwhichthedisktakesintoaccountrotationtime.
RCUSee:read-copy-update.
readdisturberrorReadingaflashmemorycellalargenumberoftimescancausethedatainsurroundingcellstobecomecorrupted.
read-copy-updateAsynchronizationabstractionthatallowsconcurrentaccesstoadatastructurebymultiplereadersandasinglewriteratatime.Seealso:RCU.
readers/writerslockAlockwhichallowsmultiple“reader”threadstoaccessshareddataconcurrentlyprovidedtheynevermodifytheshareddata,butstillprovidesmutualexclusionwhenevera“writer”threadisreadingormodifyingtheshareddata.
ready list
The set of threads that are ready to be run but which are not currently running.
real-time constraint
The computation must be completed by a deadline if it is to have value.
recoverable virtual memory
The abstraction of persistent memory, so that the contents of a memory segment can be restored after a failure.
redologgingAwayofimplementingatransactionbyrecordinginalogthesetofwritestobeexecutedwhenthetransactioncommits.
relativepathAfilepathnameinterpretedasbeginningwiththeprocess’scurrentworkingdirectory.
reliabilityApropertyofasystemthatdoesexactlywhatitisdesignedtodo.
requestparallelismParallelexecutiononaserverthatarisesfrommultipleconcurrentrequests.
resident attribute
In NTFS, an attribute record whose contents are stored directly in the master file table.
response time
The time for a task to complete, from when it starts until it is done.
restart
The resumption of a process from a checkpoint, e.g., after a failure or for debugging.
rollback
The outcome of a transaction where none of its updates occur.
root directory
The top-level directory in a file system.
root inode
In a copy-on-write file system, the inode table’s inode: the disk block containing the metadata needed to find the inode table.
rotating parity
A system for redundantly storing data on disk where the system writes several blocks of data across several disks, protecting those blocks with one redundant block stored on yet another disk. See also: RAID 5.
rotational latency
Once the disk head has settled on the right track, it must wait for the target sector to rotate under it.
round robin
A scheduling policy that takes turns running each ready task for a limited period before switching to the next task.
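As an illustration of round robin, the following is a minimal Python sketch, not an implementation from the text. It assumes that all tasks are ready at time 0 and that context switches are free; the function name `round_robin` and its parameters are hypothetical.

```python
from collections import deque


def round_robin(tasks, quantum):
    """Simulate round robin and return (name, completion_time) pairs.

    tasks:   list of (name, service_time) pairs, all assumed ready at time 0.
    quantum: how long a task may run before it is preempted.
    """
    ready = deque(tasks)          # the ready list, in FIFO order
    clock = 0
    finished = []
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)
        clock += run
        if remaining > run:
            # Quantum expired: preempt and move the task to the back of the line.
            ready.append((name, remaining - run))
        else:
            finished.append((name, clock))
    return finished


if __name__ == "__main__":
    print(round_robin([("A", 3), ("B", 1), ("C", 2)], quantum=1))
```

With a quantum of 1, the short task B finishes first even though it was queued behind A, which is the behavior the policy is meant to provide.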
R-SCAN
A variation of the SCAN disk scheduling policy in which the disk takes into account rotation time.
safe state
In the context of deadlock, a state of an execution such that regardless of the sequence of future resource requests, there is at least one safe sequence of decisions as to when to satisfy requests such that all pending and future requests are met.
safety property
A constraint on program behavior such that it never computes the wrong result. Compare: liveness property.
sample bias
A measurement error that occurs when some members of a group are less likely to be included than others, and where those members differ in the property being measured.
sandbox
A context for executing untrusted code, where protection for the rest of the system is provided in software.
SCAN
A disk scheduling policy where the disk arm repeatedly sweeps from the inner to the outer tracks and back again, servicing each pending request whenever the disk head passes that track.
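The service order that SCAN produces can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a fixed set of pending requests (no arrivals during the sweep), identifies tracks by integer number, and uses the hypothetical name `scan_order`.

```python
def scan_order(pending, head, direction=1):
    """Return the order in which SCAN services the pending track requests.

    pending:   iterable of track numbers with outstanding requests.
    head:      track number where the disk head currently sits.
    direction: +1 to sweep toward higher track numbers first, -1 for lower.
    """
    # Requests the head will pass on the current sweep (including its own track).
    ahead = sorted(t for t in pending if (t - head) * direction >= 0)
    # Requests that must wait for the return sweep.
    behind = sorted(t for t in pending if (t - head) * direction < 0)
    if direction > 0:
        return ahead + behind[::-1]   # sweep up, then back down
    return ahead[::-1] + behind       # sweep down, then back up


if __name__ == "__main__":
    print(scan_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
```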
scheduler activations
A multiprocessor scheduling policy where each application is informed of how many processors it has been assigned and whenever the assignment changes.
scrubbing
A technique for reducing non-recoverable RAID errors by periodically scanning for corrupted disk blocks and reconstructing them from the parity block.
secondary bottleneck
A resource with relatively low contention, due to a large amount of queueing at the primary bottleneck. If the primary bottleneck is improved, the secondary bottleneck will have much higher queueing delay.
sector
The minimum amount of a disk that can be independently read or written.
sector failure
A magnetic disk error where data on one or more individual sectors of a disk are lost, but the rest of the disk continues to operate correctly.
sector sparing
Transparently hiding a faulty disk sector by remapping it to a nearby spare sector.
security
A computer’s operation cannot be compromised by a malicious attacker.
security enforcement
The mechanism the operating system uses to ensure that only permitted actions are allowed.
security policy
What operations are permitted: who is allowed to access what data, and who can perform what operations.
seek
The movement of the disk arm to re-position it over a specific track to prepare for a read or write.
segmentation
A virtual memory mechanism where addresses are translated by table lookup, where each entry in the table is to a variable-size memory region.
segmentation fault
An error caused when a process attempts to access memory outside of one of its valid memory regions.
segment-local address
An address that is relative to the current memory segment.
self-paging
A resource allocation policy for allocating page frames among processes; each page replacement is taken from a page frame already assigned to the process causing the page fault.
semaphore
A type of synchronization variable with only two atomic operations, P() and V(). P waits for the value of the semaphore to be positive, and then atomically decrements it. V atomically increments the value, and if any threads are waiting in P, triggers the completion of the P operation.
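The P and V operations can be sketched in Python on top of a lock and a condition variable. This is a minimal sketch under that assumption, not the implementation used in the text; the class layout and the `_value` field are illustrative choices.

```python
import threading


class Semaphore:
    """A counting semaphore with the two operations P() and V()."""

    def __init__(self, value=0):
        self._value = value
        self._cond = threading.Condition()   # a lock plus a wait queue

    def P(self):
        # Wait until the value is positive, then decrement it.
        # The check and the decrement happen atomically under the lock.
        with self._cond:
            while self._value <= 0:
                self._cond.wait()
            self._value -= 1

    def V(self):
        # Increment the value and wake one thread waiting in P, if any.
        with self._cond:
            self._value += 1
            self._cond.notify()


if __name__ == "__main__":
    sem = Semaphore(0)
    worker = threading.Thread(target=lambda: (sem.P(), print("P completed")))
    worker.start()
    sem.V()          # allows the worker's P to complete
    worker.join()
```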
serializability
The result of any program execution is equivalent to an execution in which requests are processed one at a time in some sequential order.
service time
The time it takes to complete a task at a resource, assuming no waiting.
set associative cache
The cache is partitioned into sets of entries. Each memory location can only be stored in its assigned set, but it can be stored in any cache entry in that set. On a lookup, the system needs to check the address against all the entries in its set to determine if there is a cache hit.
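The lookup rule for a set associative cache can be sketched as follows. This Python sketch makes several assumptions that the definition does not fix: the set is chosen as the address modulo the number of sets, the full address is stored as the tag, and the oldest entry in a full set is evicted (FIFO). The class name is hypothetical.

```python
class SetAssociativeCache:
    """A toy set associative cache mapping addresses to values."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways                      # entries per set
        # Each set is a list of (address, value) pairs, oldest first.
        self.sets = [[] for _ in range(num_sets)]

    def lookup(self, address):
        """Return the cached value on a hit, or None on a miss."""
        # Only the assigned set is searched, but every entry in it is checked.
        for tag, value in self.sets[address % self.num_sets]:
            if tag == address:
                return value
        return None

    def insert(self, address, value):
        entries = self.sets[address % self.num_sets]
        # Drop any stale copy of this address before re-inserting it.
        entries[:] = [(t, v) for t, v in entries if t != address]
        if len(entries) >= self.ways:
            entries.pop(0)                    # evict the oldest entry (assumed policy)
        entries.append((address, value))


if __name__ == "__main__":
    cache = SetAssociativeCache(num_sets=2, ways=2)
    for addr, val in [(0, "a"), (2, "b"), (4, "c")]:   # all map to set 0
        cache.insert(addr, val)
    print(cache.lookup(0), cache.lookup(2), cache.lookup(4))
```

In the usage example, three addresses compete for a two-entry set, so the first one inserted is evicted while the other two remain.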
settle
The fine-grained re-positioning of a disk head after moving to a new track before the disk head is ready to read or write a sector of the new track.
shadow page table
A page table for a process inside a virtual machine, formed by constructing the composition of the page table maintained by the guest operating system and the page table maintained by the host operating system.
shared object
An object (a data structure and its associated code) that can be accessed safely by multiple concurrent threads.
shell
A job control system implemented as a user-level process. When a user types a command to the shell, it creates a process to run the command.
shortest job first
A scheduling policy that performs the task with the least remaining time left to finish.
shortest positioning time first
A disk scheduling policy that services whichever pending request can be handled in the minimum amount of time. See also: SPTF.
shortest seek time first
A disk scheduling policy that services whichever pending request is on the nearest track. Equivalent to shortest positioning time first if rotational positioning is not considered. See also: SSTF.
SIMD (single instruction multiple data) programming
See: data parallel programming.
simultaneous multi-threading
A hardware technique where each processor simulates two (or more) virtual processors, alternating between them on a cycle-by-cycle basis. See also: hyperthreading.
single-threaded program
A program written in a traditional way, with one logical sequence of steps as each instruction follows the previous one. Compare: multi-threaded program.
slip sparing
When remapping a faulty disk sector, remapping the entire sequence of disk sectors between the faulty sector and the spare sector by one slot to preserve sequential access performance.
soft link
A directory entry that maps one file or directory name to another. See also: symbolic link.
software transactional memory (STM)
A system for general-purpose transactions for in-memory data structures.
software-loaded TLB
A hardware TLB whose entries are installed by software, rather than hardware, on a TLB miss.
solid state storage
A persistent storage device with no moving parts; it stores data using electrical circuits.
space sharing
A multiprocessor allocation policy that assigns different processors to different tasks.
spatial locality
Programs tend to reference instructions and data near those that have been recently accessed.
spindle
The axle of rotation of the spinning disk platters making up a disk.
spinlock
A lock where a thread waiting for a BUSY lock “spins” in a tight loop until some other thread makes it FREE.
SPTF
See: shortest positioning time first.
SSTF
See: shortest seek time first.
stable property
A property of a program, such that once the property becomes true in some execution of the program, it will stay true for the remainder of the execution.
stable storage
See: non-volatile storage.
stable system
A queueing system where the arrival rate matches the departure rate.
stack frame
A data structure stored on the stack with storage for one invocation of a procedure: the local variables used by the procedure, the parameters the procedure was called with, and the return address to jump to when the procedure completes.
staged architecture
A staged architecture divides a system into multiple subsystems or stages, where each stage includes some state private to the stage and a set of one or more worker threads that operate on that state.
starvation
The lack of progress for one task, due to resources given to higher priority tasks.
state variable
Member variable of a shared object.
STM
See: software transactional memory (STM).
structured synchronization
A design pattern for writing correct concurrent programs, where concurrent code uses a set of standard synchronization primitives to control access to shared state, and where all routines to access the same shared state are localized to the same logical module.
superpage
A set of contiguous pages in physical memory that map a contiguous region of virtual memory, where the pages are aligned so that they share the same high-order (superpage) address.
surface
One side of a disk platter.
surface transfer time
The time to transfer one or more sequential sectors from (or to) a surface once the disk head begins reading (or writing) the first sector.
swapping
Evicting an entire process from physical memory.
symbolic link
See: soft link.
synchronization barrier
A synchronization primitive where n threads operating in parallel check in to the barrier when their work is completed. No thread returns from the barrier until all n check in.
synchronization variable
A data structure used for coordinating concurrent access to shared state.
system availability
The probability that a system will be available at any given time.
system call
A procedure provided by the kernel that can be called from user level.
system reliability
The probability that a system will continue to be reliable for some specified period of time.
tagged command queueing
A disk interface that allows the operating system to issue multiple concurrent requests to the disk. Requests are processed and acknowledged out of order. See also: native command queueing. See also: NCQ.
tagged TLB
A translation lookaside buffer whose entries contain a process ID; only entries for the currently running process are used during translation. This allows TLB entries for a process to remain in the TLB when the process is switched out.
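The effect of tagging can be sketched with a small Python model. This is a software illustration only, assuming an unbounded table keyed by (process ID, virtual page); real TLBs are small hardware tables with a replacement policy, and all names here are hypothetical.

```python
class TaggedTLB:
    """A toy model of a TLB whose entries are tagged with a process ID."""

    def __init__(self):
        self.entries = {}        # (pid, virtual_page) -> physical_frame
        self.current_pid = None

    def context_switch(self, pid):
        # No flush is needed: entries of other processes stay in the table
        # but no longer match during lookup.
        self.current_pid = pid

    def install(self, virtual_page, frame):
        self.entries[(self.current_pid, virtual_page)] = frame

    def lookup(self, virtual_page):
        """Return the frame on a TLB hit, or None on a TLB miss."""
        return self.entries.get((self.current_pid, virtual_page))


if __name__ == "__main__":
    tlb = TaggedTLB()
    tlb.context_switch(1)
    tlb.install(5, 100)
    tlb.context_switch(2)
    print(tlb.lookup(5))     # miss: the entry belongs to process 1
    tlb.context_switch(1)
    print(tlb.lookup(5))     # hit: the entry survived the switch
```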
task
A user request.
TCB
See: thread control block.
TCQ
See: tagged command queueing.
temporal locality
Programs tend to reference the same instructions and data that they had recently accessed.
test and test-and-set
An implementation of a spinlock where the waiting processor waits until the lock is FREE before attempting to acquire it.
thrashing
When a cache is too small to hold its working set. In this case, most references are cache misses, yet those misses evict data that will be used in the near future.
thread
A single execution sequence that represents a separately schedulable task.
thread context switch
Suspend execution of a currently running thread and resume execution of some other thread.
thread control block
The operating system data structure containing the current state of a thread. See also: TCB.
thread scheduler
Software that maps threads to processors by switching between running threads and threads that are ready but not running.
thread-safe bounded queue
A bounded queue that is safe to call from multiple concurrent threads.
throughput
The rate at which a group of tasks are completed.
time of check vs. time of use attack
A security vulnerability arising when an application can modify the user memory holding a system call parameter (such as a file name), after the kernel checks the validity of the parameter, but before the parameter is used in the actual implementation of the routine. Often abbreviated TOCTOU.
time quantum
The length of time that a task is scheduled before being preempted.
timer interrupt
A hardware processor interrupt that signifies a period of elapsed real time.
time-sharing operating system
An operating system designed to support interactive use of the computer.
TLB
See: translation lookaside buffer.
TLB flush
An operation to remove invalid entries from a TLB, e.g., after a process context switch.
TLB hit
A TLB lookup that succeeds at finding a valid address translation.
TLB miss
A TLB lookup that fails because the TLB does not contain a valid translation for that virtual address.
TLB shootdown
A request to another processor to remove a newly invalid TLB entry.
TOCTOU
See: time of check vs. time of use attack.
track
A circle of sectors on a disk surface.
track buffer
Memory in the disk controller to buffer the contents of the current track even though those sectors have not yet been requested by the operating system.
track skewing
A staggered alignment of disk sectors to allow sequential reading of sectors on adjacent tracks.
transaction
A group of operations that are applied persistently, atomically as a group or not at all, and independently of other transactions.
translation lookaside buffer
A small hardware table containing the results of recent address translations. See also: TLB.
trap
A synchronous transfer of control from a user-level process to a kernel-mode handler. Traps can be caused by processor exceptions, memory protection errors, or system calls.
triple indirect block
A storage block containing pointers to double indirect blocks.
two-phase locking
A strategy for acquiring locks needed by a multi-operation request, where no lock can be released before all required locks have been acquired.
uberblock
In ZFS, the root of the ZFS storage system.
UNIX exec
A system call on UNIX that causes the current process to bring a new executable image into memory and start it running.
UNIX fork
A system call on UNIX that creates a new process as a complete copy of the parent process.
UNIX pipe
A two-way byte stream communication channel between UNIX processes.
UNIX signal
An asynchronous notification to a running process.
UNIX stdin
A file descriptor set up automatically for a new process to use as its input.
UNIX stdout
A file descriptor set up automatically for a new process to use as its output.
UNIX wait
A system call that pauses until a child process finishes.
unsafe state
In the context of deadlock, a state of an execution such that there is at least one sequence of future resource requests that leads to deadlock no matter what processing order is tried.
upcall
An event, interrupt, or exception delivered by the kernel to a user-level process.
use bit
A status bit in a page table entry recording whether the page has been recently referenced.
user-level memory management
The kernel assigns each process a set of page frames, but how the process uses its assigned memory is left up to the application.
user-level page handler
An application-specific upcall routine invoked by the kernel on a page fault.
user-level thread
A type of application thread where the thread is created, runs, and finishes without calls into the operating system kernel.
user-mode operation
The processor operates in a restricted mode that limits the capabilities of the executing process. Compare: kernel-mode operation.
utilization
The fraction of time a resource is busy.
virtual address
An address that must be translated to produce an address in physical memory.
virtual machine
An execution context provided by an operating system that mimics a physical machine, e.g., to run an operating system as an application on top of another operating system.
virtual machine honeypot
A virtual machine constructed for the purpose of executing suspect code in a safe environment.
virtual machine monitor
See: host operating system.
virtual memory
The illusion of a nearly infinite amount of physical memory, provided by demand paging of virtual addresses.
virtualization
Provide an application with the illusion of resources that are not physically present.
virtually addressed cache
A processor cache which is accessed using virtual, rather than physical, memory addresses.
volume
A collection of physical storage blocks that form a logical storage device (e.g., a logical disk).
wait while holding
A necessary condition for deadlock to occur: a thread holds one resource while waiting for another.
wait-free data structures
Concurrent data structure that guarantees progress for every thread: every method finishes in a finite number of steps, regardless of the state of other threads executing in the data structure.
waiting list
The set of threads that are waiting for a synchronization event or timer expiration to occur before becoming eligible to be run.
wear leveling
A flash memory management policy that moves logical pages around the device to ensure that each physical page is written/erased approximately the same number of times.
web proxy cache
A cache of frequently accessed web pages to speed web access and reduce network traffic.
work-conserving scheduling policy
A policy that never leaves the processor idle if there is work to do.
working set
The set of memory locations that a program has referenced in the recent past.
workload
A set of tasks for some system to perform, along with when each task arrives and how long each task takes to complete.
wound wait
An approach to deadlock recovery that ensures progress by aborting the most recent transaction in any deadlock.
write acceleration
Data to be stored on disk is first written to the disk’s buffer memory. The write is then acknowledged and completed in the background.
write-back cache
A cache where updates can be stored in the cache and only sent to memory when the cache runs out of space.
write-through cache
A cache where updates are sent immediately to memory.
zero-copy I/O
A technique for transferring data across the kernel-user boundary without a memory-to-memory copy, e.g., by manipulating page table entries.
zero-on-reference
A method for clearing memory only if the memory is used, rather than in advance. If the first access to memory triggers a trap to the kernel, the kernel can zero the memory and then resume.
Zipf distribution
The relative frequency of an event is inversely proportional to its position in a rank order of popularity.
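The Zipf relationship can be made concrete with a short Python sketch. The function name and the `alpha` exponent are assumptions added for illustration; `alpha = 1.0` gives the plain inverse-rank form described above.

```python
def zipf_frequencies(n, alpha=1.0):
    """Return the relative frequency of each of n items, most popular first.

    The item at rank k (1-based) gets weight 1 / k**alpha; the weights are
    then normalized so that the frequencies sum to 1.
    """
    weights = [1.0 / rank ** alpha for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for rank, freq in enumerate(zipf_frequencies(5), start=1):
        print(rank, round(freq, 3))
```

Under this model the most popular item is referenced twice as often as the second and four times as often as the fourth.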
About the Authors
Thomas Anderson holds the Warren Francis and Wilma Kolm Bradley Chair of Computer Science and Engineering at the University of Washington, where he has been teaching computer science since 1997.
Professor Anderson has been widely recognized for his work, receiving the Diane S. McEntyre Award for Excellence in Teaching, the USENIX Lifetime Achievement Award, the IEEE Koji Kobayashi Computers and Communications Award, the ACM SIGOPS Mark Weiser Award, the USENIX Software Tools User Group Award, the IEEE Communications Society William R. Bennett Prize, the NSF Presidential Faculty Fellowship, and the Alfred P. Sloan Research Fellowship. He is an ACM Fellow. He has served as program co-chair of the ACM SIGCOMM Conference and program chair of the ACM Symposium on Operating Systems Principles (SOSP). In 2003, he helped co-found the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI).
Professor Anderson’s research interests span all aspects of building practical, robust, and efficient computer systems, including operating systems, distributed systems, computer networks, multiprocessors, and computer security. Over his career, he has authored or co-authored over one hundred peer-reviewed papers; nineteen of his papers have won best paper awards.
Michael Dahlin is a Principal Engineer at Google. Prior to that, from 1996 to 2014, he was a Professor of Computer Science at the University of Texas in Austin, where he taught operating systems and other subjects and where he was awarded the College of Natural Sciences Teaching Excellence Award.
Professor Dahlin’s research interests include Internet- and large-scale services, fault tolerance, security, operating systems, distributed systems, and storage systems.
Professor Dahlin’s work has been widely recognized. Over his career, he has authored over seventy peer-reviewed papers, ten of which have won best paper awards. He is both an ACM Fellow and an IEEE Fellow, and he has received an Alfred P. Sloan Research Fellowship and an NSF CAREER award. He has served as the program chair of the ACM Symposium on Operating Systems Principles (SOSP), co-chair of the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI), and co-chair of the International World Wide Web conference (WWW).