HathiTrust is a Solution
The Foundations of a Disaster Recovery Plan for the Shared Digital Repository

This report serves as recommendations made by Michael J. Shallcross, 2009 Digital Preservation Intern, University of Michigan School of Information


Executive Summary

This report seeks to establish the framework of a Disaster Recovery Plan for the HathiTrust Digital Library. While professional best practices and institutional needs have provided a clear mandate for HathiTrust’s Disaster Recovery Program, common parlance has often obscured two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ is actually comprised of a suite of documents which detail a range of issues, from crisis communications and the continuity of administrative activities to the restoration of hardware and data. Second, there is no conclusion to the planning process; it is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance.

The primary goal of the present document is to provide a foundation on which future planning efforts may build. To that end, it examines the strategies by which HathiTrust has anticipated and mitigated the risks posed by ten common scenarios which could precipitate a disaster:

o Hardware failure and data loss
o Network configuration errors
o External attacks
o Format obsolescence
o Core utility or building failure
o Software failure
o Operator error
o Physical security breach
o Media degradation
o Manmade as well as natural disasters

As this list reveals, a disaster within the digital repository refers not merely to data loss, the destruction of equipment, or damage to its environment, but to any event which has the potential to cause an extended service outage. For each scenario, the report discusses possible threats, summarizes the potential severity of related events, and then details solutions HathiTrust has enacted through direct quotations from the HathiTrust Web site and TRAC self-assessment, Service Level Agreements, and literature from service providers and vendors. Attached appendices provide relevant information and include contacts for important HathiTrust resources, an annotated guide to Disaster Recovery Planning references, and an overview of key steps in the Disaster Recovery Planning process.

The concluding section of the report provides recommendations and action items for HathiTrust as it proceeds with its Disaster Recovery Initiative. These are divided into Short (0-6 mos.), Intermediate (6-12 mos.) and Long-Term (12+ mos.) objectives and are arranged in a suggested order of accomplishment.

o Short-term goals include:
    - Describing the nature and extent of HathiTrust’s insurance coverage
    - Testing and validation of current tape backup procedures
    - Improved physical and intellectual control over system hardware
    - Establishment, distribution, and maintenance of phone trees
    - Increased documentation of institutional knowledge
    - Identification of Disaster Recovery measures in place at the Indianapolis site
o Intermediate-term objectives focus on:
    - Creation of a Disaster Recovery Planning Committee
    - Initiation of the data collection and analysis essential to the creation of recovery strategies (This section provides a high-level breakdown of various tasks and includes the coordination of activities between the Ann Arbor and Indianapolis sites as well as with service providers and vendors.)
o Long-term action items deal with:
    - Completion and implementation of the suite of Disaster Recovery documents
    - Initiation of staff training and tests of organizational compliance
    - Storage of an additional copy of backup tapes at a remote third location
    - Investigation of an alternate hot site in Ann Arbor in the event a disaster renders the MACC unusable
    - Consideration of a third instance of the repository
    - Avoidance of vendor lock-in if a key supplier should go out of business

This report demonstrates that various risk management strategies, design elements, operating procedures, and support contracts have endowed HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a disaster. The establishment of the Indianapolis mirror site, the performance of nightly tape backups to a remote location, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. Unfortunately, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.


Acknowledgements

The author would like to thank Shannon Zachary for her encouragement and guidance; Cory Snavely and Jeremy York for their generous expenditure of time, energy, and knowledge; and Nancy McGovern and Lance Stuchell for access to their outstanding Disaster Recovery Planning resources. The following individuals have also been invaluable sources of advice, support, and information: John Wilkin, Bob Campe, Cyndi Mesa, Ann Thomas, John Weise, Larry Wentzel, Lara Unger-Syrigos, Bill Hall, Emily Campbell, Sebastien Korner, Jessica Feeman, Phil Farber, Chris Powell, Cameron Hanover, Stephen Hipkiss, Tim Prettyman, Rene Gobeyn, and Krystal Hall. Thanks also to Dr. Elizabeth Yakel, Magia Krause, and Veronica and Cora Fambrough. The work in this report was made possible by an IMLS Grant.


Table of Contents

• Executive Summary
• Acknowledgements
• Introduction
    o Goals for HathiTrust’s Disaster Recovery Program
    o The Mandate for Disaster Recovery Planning in Digital Preservation
    o Disaster Preparedness in the Design and Operation of HathiTrust
    o Essential HathiTrust Business Functions
• HathiTrust’s Disaster Recovery Strategies
    o Basic Requirements for Disaster Recovery
    o Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites
    o Disaster Recovery Strategy #2: Nightly Automated Tape Backups
• Scenario 1: Hardware Failure or Obsolescence and Data Loss
    o Review: Risks Involving Hardware Failure or Obsolescence and Data Loss
    o HathiTrust’s Solutions for Hardware Failure and Data Loss
    o Redundant Components and Single Points of Failure in the HathiTrust Infrastructure
    o Key Features of HathiTrust’s Isilon IQ Clustered Storage
    o Hardware Support and Service
    o Equipment Tracking
    o Hardware Replacement Schedule
    o Timeline for Emergency Replacement of HathiTrust Infrastructure
    o HathiTrust and Insurance Coverage at the University of Michigan
• Scenario 2: Network Configuration Errors
    o Review: Risks Involving Network Configuration Errors
    o HathiTrust’s Solutions for Network Configuration Errors
    o Extent of ITCom Support
    o ITCom Responsibilities
    o ITCom Services in Response to Outages or Degradation Impacting the Network
    o HathiTrust Responsibilities
• Scenario 3: Network Security and External Attacks
    o Review: Risks Involving Network Security and External Attacks
    o HathiTrust’s Solutions for Network Security
• Scenario 4: Format Obsolescence
    o Review: Risks Involving Format Obsolescence
    o HathiTrust’s Solutions for Format Obsolescence
    o Selection of File Formats
    o Format Migration Policies and Activities
• Scenario 5: Core Utility and/or Building Failure
    o Review: Risks Involving Core Utility or Building Failure
    o HathiTrust’s Solutions for Utility or Building Failure
    o General Maintenance and Repairs in University of Michigan Facilities
    o The Michigan Academic Computing Center (MACC)
    o Arbor Lakes Data Facility (ALDF)
• Scenario 6: Software Failure or Obsolescence
    o Review: Risks Involving Software Failure or Obsolescence
    o HathiTrust’s Solutions for Software Issues
• Scenario 7: Operator Error
    o Review: Risks Involving Operator Error
    o HathiTrust’s Solutions for Operator Error
    o Ingest
    o Archival Storage
    o Dissemination
    o Data Management
• Scenario 8: Physical Security Breach
    o Review: Risks Involving a Physical Security Breach
    o HathiTrust’s Solutions for Physical Security
    o Security at the MACC
    o Security at the ALDF
• Scenario 9: Natural or Manmade Disaster
    o Review: Risks Involving a Natural or Manmade Disaster
    o HathiTrust’s Solutions for Natural or Manmade Catastrophic Events
    o Basic Disaster Recovery Strategies
• Scenario 10: Media Failure or Obsolescence
    o Review: Risks Involving Media Failure or Obsolescence
    o HathiTrust’s Solutions for Media Failure
    o Remaining Vulnerabilities
• Conclusions and Action Items
    o Conclusions
    o Short-Term Action Items
    o Intermediate-Term Action Items
    o Long-Term Action Items
• APPENDIX A: Contact Information for Important HathiTrust Resources
• APPENDIX B: HathiTrust Outages from March 2008 through April 2009
• APPENDIX C: Washtenaw County Hazard Ranking List
• APPENDIX D: Annotated Guide to Disaster Recovery Planning References
• APPENDIX E: Overview of the Disaster Recovery Planning Process
• APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
• APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard Service Agreement (2006)
• APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)
• APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)

**Appendices F–I are embedded PDF files.**

2009‐08‐24 1

Introduction

In the realm of print libraries, a disaster is a fairly unambiguous event: it is a fire, a broken pipe, an infestation of pests; in short, anything which threatens the continued use and existence of texts or the environment in which they are stored. This basic definition may also be applied to the digital library, in which a disaster refers not merely to the loss of content or corruption of data, the destruction of equipment or damage to its environment, but to any event which has the potential to cause an extended service outage. This last part proves to be the greatest difference between the print and digital worlds because there are a great many threats which can leave data intact but incapacitate the primary functions of a digital library. The daily operation of an institution such as HathiTrust involves the anticipation and resolution of a variety of problems (crashed servers, software bugs, networking errors, etc.) which only rise to the level of a ‘disaster’ when they exceed the capacity of normal operating procedures and/or the maximum allowable outage periods. Disaster Recovery Planning thus prompts us to develop robust strategies to mitigate and limit the effects of common problems and at the same time forces us to think the unthinkable. Nevertheless, confronting worst-case scenarios is a vital activity; the belief that an event will never happen simply because it has never happened is an invitation to the very disaster we seek to avoid. Herein lies a conundrum, in that the creation of detailed plans for every eventuality is nearly impossible and also impractical, since the results of such an endeavor would be needlessly complex as well as expensive. At its basis, then, Disaster Recovery Planning demands an astute assessment of risk so that we may weigh the costs of preparations and solutions against the costs of a potential event.

So where to begin? When the subject of Disaster Recovery Planning arises, common parlance often obscures two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ is actually comprised of a suite of documents which detail a variety of related issues, from crisis communications and the continuity of administrative activities to the recovery of hardware and data and the restoration of core functions. Second, there is no conclusion to the planning process or a point at which a plan is ‘done’; there is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance. The essential first step is therefore a thorough knowledge of the organization, its goals, and its mandate for a Disaster Recovery Program so that later efforts can focus on the articulation of policies and the development of solutions. As a preliminary step in this effort, this report looks to establish a basic foundation from which future planning efforts may grow.

• Goals for HathiTrust’s Disaster Recovery Program

While a more formal statement of HathiTrust’s goals and requirements for its Disaster Recovery Program must be elucidated, the repository’s mission statement provides a good indication of its main objective in the formation of a Disaster Recovery Plan. As part of its aim to “contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge,” HathiTrust seeks “to help preserve these important human records by creating reliable and accessible electronic representations.”1 This statement clearly joins the twin imperatives of preservation and access with an additional requirement: reliability. The development and implementation of a Disaster Recovery Plan will ensure that digital objects will retain their authenticity and integrity over the long term and that partner libraries and designated users may rely on HathiTrust services (or their timely resumption) and content in the face of catastrophic events.

1 HathiTrust. “Mission & Goals” (2009) retrieved from http://www.hathitrust.org/mission_goals on 8 July 2009.


• The Mandate for Disaster Recovery Planning in Digital Preservation

HathiTrust’s mandate for a comprehensive and proactive Disaster Recovery Plan stems from a number of significant sources, among which we may include its mission and goals. The “Institutional Data Resource Management Policy” (2008) of the University of Michigan’s Standard Practice Guide also provides an impetus for the creation of a Disaster Recovery Program. While not necessarily inclusive of the Michigan Digitization Project materials stored in HathiTrust, this document underscores how important it is that data resources “be safeguarded [and] protected” and “contingency plans […] be developed and implemented.”2 In its discussion of the latter point, the policy specifies that:

    Disaster Recovery/Business Continuity plans and other methods of responding to an emergency or other occurrences of damage to systems containing institutional data […] will be developed, implemented, and maintained. These contingency plans shall include, but are not limited to, data backup, Disaster Recovery, and emergency mode operations procedures. These plans will also address testing of and revision to disaster recovery/business continuity procedures and a criticality analysis.3

While data backup procedures and a host of risk management practices are already an integral part of HathiTrust’s operation, the repository now looks to formalize the other strategies suggested by the “Institutional Data Resource Management Policy.” Beyond the example laid out by this document, HathiTrust’s mandate for Disaster Recovery derives from the professional literature detailing best practices in the field of digital preservation. The Reference Model for an Open Archival Information System (OAIS) identifies Disaster Recovery as an essential component of its “Archival Storage” function and highlights the importance of such plans in achieving the goal of long-term preservation of a digital archive’s holdings. As outlined in the OAIS document, “the Disaster Recovery function provides a mechanism for duplicating the digital contents of the archive collection and storing the duplicate in a physically separate facility.”4 HathiTrust has successfully met this requirement by performing nightly tape backups and establishing a mirror site at Indiana University in Indianapolis. The Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) is even more explicit in its requirement that repositories document their policies and procedures with “suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).”5 Professional best practices as well as internal needs and goals thus provide the mandate which underlies HathiTrust’s development of a formal Disaster Recovery Plan.

• Disaster Preparedness in the Design and Operation of HathiTrust

One of the primary goals of HathiTrust is to provide “transparency in all of its operations, including its work to comply with digital preservation standards and review processes.”6 Nowhere is this commitment more clear than in its efforts to anticipate and mitigate risks which could threaten the contents and functions of the Shared Digital Repository. As a first step in addressing the disaster preparedness requirement in section C3.4 of the TRAC Criteria and Checklist,7 this document serves two purposes. First, it provides an overview of the policies, procedures, resources and contracts that enable HathiTrust to address the challenges and threats endemic to the field of digital preservation. Material is therefore cited directly from the HathiTrust Web site (http://www.hathitrust.org), the most recent version of HathiTrust’s review of its compliance with the minimum required elements of the TRAC Criteria and Checklist,8 and relevant literature provided by key vendors and service providers.9 Second, this report examines HathiTrust’s current level of disaster preparedness and defines current and forthcoming efforts in its development of a dynamic and proactive Disaster Recovery Program. Per the recommendations of the TRAC Criteria and Checklist, this document records the measures and precautions already in place with regard to “specific types of disasters” that could befall HathiTrust. These events include hardware failure, data loss, network configuration errors, external attacks, core utility failure, format obsolescence, software failure, physical security breach, and manmade as well as natural disasters. While a formal, written plan detailing individual roles and responsibilities in the repository’s response to each of these scenarios is still forthcoming, the evidence gathered in this report reveals that crucial elements of a Disaster Recovery Plan are already in place within HathiTrust.10

2 University of Michigan. “Institutional Data Resource Management Policy” (2008) Standard Practice Guide, retrieved from http://spg.umich.edu/ on 8 July 2009.
3 Ibid.
4 Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (2002) p. 4-8.
5 OCLC and CRL. “Section C3.4” Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
6 HathiTrust. “Accountability” (2009) retrieved from http://www.hathitrust.org/accountability on 25 June 2009.

• Essential HathiTrust Business Functions

As the development of the Disaster Recovery Plan proceeds, it is important to bear in mind that its goal is not merely the restoration of hardware and data but also the recovery and continuity of essential repository functions. The following list represents core functions that need to be addressed by HathiTrust’s Disaster Recovery Plan and as such should not be considered a comprehensive representation of the repository’s functions. By directing planning efforts toward specific functions (rather than the organization’s activities as a whole), HathiTrust may prioritize and focus its recovery responses and resources to ensure that the most essential functions go back online first. Subsequent discussion of Disaster Recovery strategies and risk management solutions in this report is presented under the assumption that the continuity of these functions is a primary objective. The prioritization of these functions remains to be determined by an appropriate authority.11

7 “Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s). The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan and the specific risks addressed need to be appropriate to the repository’s location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building.” OCLC and CRL. Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
8 HathiTrust Digital Library Review of Compliance with Trustworthy Repositories Audit & Certification: Criteria and Checklist Minimum Required Elements, revised May 20, 2009. Available at http://hathitrust.org/documents/trac.pdf
9 Contact information for relevant University of Michigan departments and service providers as well as for external vendors may be found in Appendix A.
10 A list of resources related to disaster recovery and the planning process may be found in Appendix D (Annotated Guide to Disaster Recovery Planning References).
11 This list of essential HathiTrust business functions was developed in conjunction with Jeremy York.


o Ingest
    - Ingest digital objects (SIPs) via GRIN, the Google Return Interface (or a modified ingest portal for local content)
    - Validate ingested content with GROOVE, the Google Return Object-Oriented Validation Environment (or a modified version for localized ingest)
o Archival Storage
    - Preserve indefinitely digital objects and metadata (AIPs) in the Shared Digital Repository (includes ensuring the integrity and authenticity of materials). This function addresses the needs of partner libraries as well as individual users.
    - Record changes to and actions on items while they are in the repository
    - Maintain a persistent object address for items within the repository
o Dissemination
    - Provide access to digital objects for users
    - Allow for text searches through a variety of fields
    - Enable large-scale full-text searches
    - Permit the creation of public and private content collections
    - Disseminate digital objects (DIPs) to users (via the page-turner access system and data API)
    - Distribute datasets and HathiTrust APIs to developers
    - Research and develop additional applications and resources for HathiTrust
o Administration
    - Provide transparent and up-to-date information to users and the general public via http://www.hathitrust.org/
    - Communicate information and coordinate activities amongst partner libraries and HathiTrust boards and committees
o Data Management
    - Update and manage the Rights and GeoIP databases
    - Build and maintain Collection Builder and Large Scale Search Solr indexes
    - Determine appropriate user access to texts via database queries
    - Sync content with the Indianapolis site and back up content to tape


HathiTrust’s Disaster Recovery Strategies

• Basic Requirements for Disaster Recovery

Roy Tennant has identified three requisite components of a digital Disaster Recovery Plan: (1) the use of an effective data protection system (e.g., RAID), (2) redundant power and environmental systems, and (3) regular backup of information to tape and, ideally, to a remote mirrored site.12 HathiTrust has incorporated all these elements into its design and operation. Its Isilon IQ storage cluster provides a high degree of data redundancy with its N+3 parity protection; the Michigan Academic Computing Center provides fully redundant power and environmental systems for HathiTrust infrastructure; and nightly tape backups, together with the replication of data to a fully operational mirror site at Indiana University in Indianapolis (which has the same levels of power and environmental conditioning), provide multiple copies as well as geographic distribution of content.

o “HathiTrust is intended to provide persistent and high availability storage for deposited files. In order to facilitate this, the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage will be located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate Ann Arbor facility). Each of these storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel. Each separate storage system is also equipped with mechanisms to provide mirrored management and access functionality, and employ 100% data redundancy in an effort to prevent data loss.”13

Details on parity protection and the HathiTrust server environment are available below (see Scenario 1 and Scenario 5, respectively).

• Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites

HathiTrust’s first line of defense in the event of a disaster is its hot mirror site in Indianapolis. While ingest of material is restricted to the Ann Arbor location, each site possesses two web servers, a MySQL database server, and an Isilon IQ storage cluster (currently composed of 21 ‘nodes,’ servers composed of Central Processing Units as well as storage). During normal operations, this arrangement allows HathiTrust to balance a high volume of web traffic across both sites such that individual user requests may be handled by either site in a transparent manner. Should the tolerances for failure be exceeded at a site (as in a disaster situation), the failover capability built into the HathiTrust architecture enables the remaining site to provide access to the designated community without noticeable service disruptions. As noted in the May 2009 HathiTrust Update, with the full operation of both locations, “We are now ensuring that users do not feel the effects of single-site outages, such as routine maintenance, by taking advantage of site redundancy.”14 However, because ingest takes place only in Ann Arbor, the loss of key components there would inhibit the repository’s ability to acquire new content.

12 Tennant, Roy. “Digital Libraries: Coping with Disasters.” Library Journal, 15 November 2001. Retrieved from http://www.libraryjournal.com/article/CA180529.html on 13 July 2009.
13 HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.

HathiTrust utilizes Isilon Systems’ SyncIQ Application Software to synchronize data at the Indianapolis site with newly ingested or updated material from the Ann Arbor site. The sync to Indianapolis is divided into 24 separate subsets of the data, and a different subset starts every 2 hours, with the exception of Sundays. In other words, subset 1 runs at midnight on Monday, subset 2 runs at 2 a.m., and so on. The maximum time for data to be replicated from Ann Arbor to Indianapolis would therefore be three days plus the run time of the sync process (which tends to take less than three hours).15
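The three-day bound can be checked with a short sketch. It rests on an interpretation of the schedule described above, not on any documented SyncIQ configuration: one subset starts every two hours from Monday 00:00 through Saturday 22:00, the 24 subsets are taken in order, and the cycle restarts each Monday. The function names are illustrative only.

```python
from datetime import datetime, timedelta

SUBSETS = 24                 # number of data subsets named in the report
SLOT = timedelta(hours=2)    # a new subset starts every two hours


def weekly_start_times(first_monday, weeks=2):
    """Return {subset: [start times]} under the assumed Mon-Sat schedule."""
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must fall on a Monday")
    runs = {subset: [] for subset in range(1, SUBSETS + 1)}
    for week in range(weeks):
        t = first_monday + timedelta(weeks=week)
        slot = 0
        while t.weekday() != 6:          # stop when Sunday begins
            runs[slot % SUBSETS + 1].append(t)
            slot += 1
            t += SLOT
    return runs


def worst_gap(runs):
    """Longest interval between two consecutive starts of the same subset."""
    return max(later - earlier
               for times in runs.values()
               for earlier, later in zip(times, times[1:]))


if __name__ == "__main__":
    runs = weekly_start_times(datetime(2009, 7, 13))  # a Monday
    print(worst_gap(runs))  # 3 days, 0:00:00
```

Under these assumptions each subset runs every 48 hours during the week, and the skipped Sunday stretches the longest gap to 72 hours, which agrees with the "three days plus run time" figure.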

o “SyncIQ is an asynchronous replication application that fully leverages the unique architecture of Isilon IQ storage to efficiently copy data from a primary cluster to one located at a secondary location.”16

o “All nodes [… in both the source and target Isilon IQ clusters] concurrently send and receive data during replication jobs in real time, without impacting users reading and writing to the system.”17

o “A robust wizard-driven web-based interface is fully integrated into [… Isilon’s proprietary] OneFS management tool to control all the functionality, including scheduling, policy settings, monitoring and logging of data transferred and bandwidth utilization.”18

o “Only files that have changed will be replicated to the target clusters. This will optimize transfer times and minimize bandwidth used.”19

o “In the event the secondary system is not available due to a system or network interruption, the replication job will be able to roll back and restart at the last successful copy operation.”20

o “Upon a critical failure or loss of network connection, an alert will be sent to all recipients configured to receive critical alerts.”21
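The "only files that have changed" behavior quoted above can be illustrated with a minimal sketch. It does not describe how SyncIQ actually detects changes, which the vendor literature quoted here does not specify; it simply compares two hypothetical manifests that map a file path to a fingerprint (here an assumed size and modification-time pair) and selects the paths that are new or different.

```python
def changed_files(previous, current):
    """Return paths that are new, or whose fingerprint differs from the last run."""
    return sorted(path for path, fingerprint in current.items()
                  if previous.get(path) != fingerprint)


if __name__ == "__main__":
    # Hypothetical manifests: path -> (size in bytes, modification time).
    last_sync = {"vol1/00000001.tif": (1024, 1245000000),
                 "vol1/00000002.tif": (2048, 1245000100)}
    now = {"vol1/00000001.tif": (1024, 1245000000),   # unchanged
           "vol1/00000002.tif": (2050, 1247000000),   # updated
           "vol2/00000001.tif": (4096, 1247000500)}   # newly ingested
    print(changed_files(last_sync, now))
```

Only the updated and newly ingested paths are selected, so the amount transferred scales with the volume of change rather than with the size of the repository.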

• Disaster Recovery Strategy #2: Nightly Automated Tape Backups

HathiTrust’s ability to recover from a disaster is also ensured by the nightly automated tape backups performed by the Tivoli Storage Manager (TSM) client application installed on the ingest servers connected to the HathiTrust storage cluster and managed by Michigan’s ITCS TSM Group. The TSM Backup Service Standard Service Level Agreement22 outlines the obligations and responsibilities of both the service provider and HathiTrust:

14 HathiTrust. “Update on May 2009 Activities” (2009) retrieved from http://www.hathitrust.org/updates_may2009 on 2 July 2009.
15 Snavely, Cory (Head, UM Library IT Core Services). Personal email on 13 July 2009.
16 “Backup and Recovery With Isilon IQ Clustered Storage,” 2007 p. 11.
17 Ibid.
18 Ibid.
19 Ibid.
20 Ibid.
21 Ibid.
22 Please refer to Appendix F (TSM Backup Service Standard Service Level Agreement).


o “The progressive incremental methodology used by Tivoli Storage Manager only backs up new or changed versions of files, thereby greatly reducing data redundancy, network bandwidth and storage pool consumption as compared to traditional methodologies based on periodic full backups.”23

o “ITCS is responsible for all of the central server hardware, tape hardware, networking hardware, and related components. ITCS is also responsible for hardware maintenance as well as software maintenance, administration, and security audits on the central (non-client) TSM servers.” (TSM Backup Service SLA, sec. 4.1)

o “ITCS provides 7x24 on-call monitoring and support, and strives to keep the servers up in production at all times. The target up-time is 99.9% of the time. The TSM hardware design is modular and should allow us to take pieces out of service without affecting customers. Whenever possible, system maintenance will be performed during standard weekend maintenance windows as defined by ITCS.” (sec. 4.2)

o “In an emergency, [email protected] (this will go to the on-call staff’s pager in real time).” (sec. 4.6)

o “ITCS is responsible for physical security. Machine access audits, OS security, and network security on the TSM server end are also the responsibility of ITCS.” (sec. 4.9)

o “The service […] includes data compression, data encryptions, and data replication.” (sec. 1.0)

o “ITCS will maintain at least two TSM sites and will mirror data between the sites to provide redundancy in the event of a disaster. Currently those sites are the Arbor Lakes Data Facility (ALDF) at 4251 Plymouth Rd. and the Michigan Academic Computing Center (MACC) located at 1000 Oakbrook Dr.” (sec. 4.10)

o “Both facilities are secure, climate controlled sites designed and built for high available production services.”24

o “In the event of a customer disaster with large-scale (a full server or more) data loss, ITCS will work with the customer to optimize the restore time to best of our ability. We will only be able to devote resources to the extent that other customers are not affected. Restoring large file servers (multiple Terabytes) can take several days. If customers want to minimize this amount of time to restore, we can purchase additional resources for this purpose. Contact us directly, and we’ll work out a scenario with costing information. In the event of a MAJOR campus outage affecting a large number of customers, ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11)

o “Disaster Recovery planning is the responsibility of the customer unit.” (sec. 5.8)

Having established the main Disaster Recovery strategies employed by HathiTrust, we may now proceed to investigate the means by which it anticipates and mitigates the most common threats facing digital repositories.
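As a worked illustration of the 99.9% up-time target quoted from sec. 4.2 of the SLA, the sketch below converts an availability target into the downtime it permits. The 365-day year and 30-day month are simplifying assumptions, and the function name is illustrative; the SLA itself does not state how up-time is measured.

```python
HOURS_PER_YEAR = 365 * 24    # simplifying assumption: non-leap year
HOURS_PER_MONTH = 30 * 24    # simplifying assumption: 30-day month


def allowed_downtime_hours(target_uptime, period_hours=HOURS_PER_YEAR):
    """Hours of outage permitted by an availability target over a period."""
    if not 0.0 <= target_uptime <= 1.0:
        raise ValueError("target_uptime must be a fraction between 0 and 1")
    return (1.0 - target_uptime) * period_hours


if __name__ == "__main__":
    per_year = allowed_downtime_hours(0.999)
    per_month = allowed_downtime_hours(0.999, HOURS_PER_MONTH)
    print(f"per year:  {per_year:.2f} hours")
    print(f"per month: {per_month * 60:.0f} minutes")
```

A 99.9% target therefore leaves room for roughly 8.76 hours of backup-service outage per year, which is useful context when setting maximum allowable outage periods for functions that depend on tape backup.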

23 IBM. “IBM Tivoli Storage Manager: Features and Benefits” (2009) retrieved from http://www-01.ibm.com/software/tivoli/products/storage-mgr/features.html?S_CMP=rnav on 16 June 2009.
24 Information Technology Central Services at the University of Michigan. “Frequently Asked Questions about the TSM Backup Service” (2009) retrieved from http://www.itcs.umich.edu/tsm/questions.php on 16 June 2009.


Scenario 1: Hardware Failure or Obsolescence and Data Loss

• Review: Risks Involving Hardware Failure or Obsolescence and Data Loss

The following table highlights the various events which pose a risk to the hardware and data of HathiTrust. These threats may stem from flaws or malfunctions in the equipment itself or arise from external events that include physical security breaches and natural or manmade disasters. The arrangement of these potential risks reflects the relative severity of their respective consequences.

Severity / Event

High Impact: Loss at a single point of failure
    • An additional failure past tolerances when only one site is operational
    • Service is unavailable and cannot be restored until component is repaired/restored

Moderate Impact: Failure of a component past redundancy tolerance
    • System no longer has redundancy: additional loss or failure of components will result in loss of system. This is a particular problem if one site is already down.
    • Loss of db server (home of Rights db) or of both Web servers at a site will render that location inaccessible
    • Loss of four drives or nodes in either Isilon storage cluster will result in the loss of that instance. The cluster will be offline and unable to handle read or write requests; all traffic would have to be handled by the remaining site.
    • Loss of UM Arbor Lakes site would prevent performance of tape backups
    • Loss of UM MACC site would deprive IU site of data redundancy
    • Loss of ingest servers would prevent new content from entering repository

Low Impact: Failure of redundant system components
    • Includes redundant components within each site as well as general redundancy between the IU and UM sites
        o HT infrastructure has been designed to avoid single points of failure and to ensure data and equipment redundancy
        o Service continues in an uninterrupted and transparent manner

• HathiTrust’s Solutions for Hardware Failure and Data Loss

The threats faced by HathiTrust’s hardware (and associated applications as well as the data stored therein) comprise the failure of redundant features, failure that exceeds components’ tolerance for redundancy, and single points of failure. While the failure of redundant components may happen more frequently (e.g., the loss of an individual drive within the Isilon IQ cluster), such losses do not have a large impact on the repository; events which compromise single points of failure will have much greater consequences for the continuity of HathiTrust operations. At the same time, while a component may have redundancy on one level (for example, there are five servers dedicated to ingest), that component simultaneously may be considered at a higher level to be a single point of failure (for instance, because the ingest servers are housed in a single chassis, the entire unit is vulnerable to an event such as a fire). This duality highlights the need for vigilance and foresight in managing the repository’s infrastructure.

Because HathiTrust relies heavily upon hardware to fulfill its mission and deliver services to its designated community of users, the selection of equipment and development of system architecture has aimed at minimizing the dangers posed by single points of failure through the introduction of strategic redundancies. The basic means for avoiding the disastrous effects of hardware failure or data loss have been the establishment of the Indianapolis mirror site and the nightly backup of content to tape. (For more detail, please refer to the preceding section.) While these strategies account for extraordinary events, HathiTrust’s server replacement schedule allows the repository to anticipate the results of normal equipment use and depreciation. Steps to safeguard the long-term functionality of HathiTrust have therefore been complemented by a consideration of best practices for disaster preparedness.
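The relationship between redundancy tolerance and single points of failure described above can be sketched in a few lines. This is an illustrative model, not a HathiTrust tool: the class and its labels are hypothetical, and the tolerances are read from the report (one of two web servers per site may fail, three Isilon drives or nodes may fail under N+3 parity, and the single database server has no spare).

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    tolerance: int   # failures a site can absorb before the function is lost
    failed: int = 0  # failures currently outstanding

    def status(self):
        margin = self.tolerance - self.failed
        if margin < 0:
            return "down"                     # tolerance exceeded
        if margin == 0:
            return "single point of failure"  # one more loss takes it down
        return "redundant"


if __name__ == "__main__":
    site = [
        Component("web servers", tolerance=1, failed=1),
        Component("Isilon IQ nodes", tolerance=3, failed=2),
        Component("MySQL database server", tolerance=0),
    ]
    for component in site:
        print(f"{component.name}: {component.status()}")
```

The sketch makes the report's point concrete: a component with redundant peers behaves like a single point of failure once its remaining margin reaches zero, so the margin, not the nominal design, is what a periodic review should track.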

• Redundant Components and Single Points of Failure in the HathiTrust Infrastructure

The following sections provide a general outline of HathiTrust’s redundant components and single points of failure. Given the complexity of the repository’s infrastructure, unknown or unanticipated scenarios may exist; future Disaster Recovery Planning will thus involve a periodic review of key features and vulnerabilities.

o Site Redundancy: The establishment of the mirror site in Indiana provides HathiTrust with a fully redundant operation. Because both instances provide full access to content in addition to other repository functions, users will not experience a loss or degradation of service in the event that service is lost from one site. Key exceptions to HathiTrust’s site redundancy are noted below.

o Redundant Components at Each Site: The following components provide each site with a tolerance under which limited failures will not disrupt major HathiTrust functions and user services.

  ▪ Web servers: each site has two servers so that if one fails, the other may continue to handle traffic. These also host the GeoIP database.

  ▪ Isilon IQ clusters: the current configuration of 21 nodes features N+3 parity protection; this data redundancy permits the simultaneous failure of 3 drives on separate nodes or the loss of three entire nodes without service degradation.

  ▪ Ingest servers: the Ann Arbor site possesses five servers so that ingest may continue (albeit at a slower rate) in the event of any failures.

  ▪ Large Scale Search (LSS) Solr index: currently housed on the web servers, but will soon be maintained on five new servers in Ann Arbor.

o Single Points of Failure:25 These are components of a system which, if lost, will prevent the entire system from functioning. Even those components with wholly redundant peer devices (such as the web or ingest servers) may be considered single points of failure if they have exceeded their capacity to sustain losses (i.e., if one web server at a site has already been lost).

  ▪ Single Points of Failure at the Component Level: Because only one of these components exists at each HathiTrust site, a loss will result in system failure.

    • MySQL database server: houses the rights database, ingest tracking database, and the Collection Builder Solr index
    • Server network switches
    • Outbound network switches

  ▪ Single Points of Failure at the System Level: While any given component may have various degrees of internal redundancy (such as multiple power supplies or multiple drives) it might still fail as a whole and thus result in the loss of a particular instance of HathiTrust. The following are components located at each site which, while possessed of internal redundancies, are still subject to complete loss (as in the event of a fire) and may thus render a site inoperable.

25 Content in this section is courtesy of Cory Snavely (personal email from 13 July 2009).

    • Isilon IQ storage cluster: the entire cluster could be lost in a large-scale event. Additionally, the loss of a fourth drive or node will exceed the cluster’s failure tolerance and result in a service disruption.

    • Web servers: should one fail, the remaining server will be a single point of failure.

    • Blade server chassis: since web, ingest, and database servers are housed in one chassis, the entire unit could potentially fail.

    • LSS index: in the near future, the servers in Ann Arbor will be the sole instance of the Large Scale Search index.

    • Mirlyn database and Mirlyn2 Solr index26: these are currently key components of the UM Library infrastructure; should these be unavailable, access to and use of HathiTrust will be compromised.
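The distinction drawn above between redundant components and single points of failure can be sketched as a small model. This is an illustrative sketch only, not a tool used by HathiTrust: the `Component` class, the `classify` function, and the `required` values are assumptions introduced here, while the installed counts (two web servers, one database server, 21 Isilon nodes with a three-unit tolerance, five ingest servers) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One class of equipment at a single site (hypothetical model)."""
    name: str
    installed: int   # units present at the site
    required: int    # assumed minimum units needed to stay in service
    failed: int = 0  # units currently lost

    def spare(self) -> int:
        """Units that can still be lost before service is disrupted."""
        return self.installed - self.failed - self.required


def classify(component: Component) -> str:
    """Label a component by how much failure tolerance remains."""
    spare = component.spare()
    if spare < 0:
        return "site down"
    if spare == 0:
        return "single point of failure"
    return "redundant"


# Installed counts are taken from the text; "required" values are assumptions.
site = [
    Component("web servers", installed=2, required=1),
    Component("MySQL database server", installed=1, required=1),
    Component("Isilon IQ nodes (N+3)", installed=21, required=18),
    Component("ingest servers", installed=5, required=1),
]

for c in site:
    print(f"{c.name}: {classify(c)}")
```

Setting `failed=1` on the web servers, or `failed=3` on the Isilon nodes, moves each from "redundant" to "single point of failure", which mirrors the observation that a redundant component becomes a single point of failure once its tolerance is used up.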

• Key Features of HathiTrust’s Isilon IQ Clustered Storage

The Isilon IQ storage cluster stores and provides digital objects for HathiTrust’s partner libraries and members of its designated community. The cluster provides a high degree of inherent redundancy, which gives both HathiTrust sites a considerable degree of tolerance in regards to the failure of various aspects of the storage units. As one example, Isilon’s proprietary OneFS operating system permits the individual storage nodes (the individual servers that are the building blocks of the cluster) to function as ‘coherent peers’ so that any one node ‘knows’ everything contained on the other units in the cluster.

o “Isilon's OneFS operating system […] intelligently stripes data across all nodes in a cluster to create a single, shared pool of storage.”27

o “Because all files are striped across multiple nodes within a cluster, no single node stores 100% of a file; if a node fails, all other nodes in the cluster can deliver 100% of the files within that cluster.”28

o “A distributed clustered architecture by definition is highly available since each node is a coherent peer to the other. If any node or component fails, the data is still accessible through any other node, and there is no single point of failure as the file system state is maintained across the entire cluster.”29

26 Mirlyn is the name of the University of Michigan’s current Online Public Access Catalog, which is supported by the Aleph integrated library system. Mirlyn2 is a beta version of UM’s recently implemented next generation catalog, based on the VuFind platform, which will become the main library catalog on August 3, 2009.
27 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 17 June 2009.
28 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7. “In computer data storage, data striping is the technique of segmenting logically sequential data, such as a single file, so that segments can be assigned to multiple physical devices. […] if one drive fails and the system crashes, the data can be restored by using the other drives in the array.” (http://en.wikipedia.org/wiki/Data_striping, retrieved on 16 August, 2009).
29 Isilon Systems. “Breaking the Bottleneck: Solving the Storage Challenges of Next Generation Data Centers” (2008) p. 8

HathiTrust’s Isilon IQ clusters ensure a high degree of data redundancy with their N+3 parity protection. N+3 provides triple simultaneous failure protection so that up to three drives on separate Isilon IQ nodes, or three entire nodes, can fail at the same time and all data will still be fully available.

o “Traditional RAID-5 parity protection results in data loss if multiple components fail prior to the completion of a rebuild. FlexProtect, in contrast, automatically distributes all data and error correction information across the entire Isilon cluster and with its robust error correction techniques efficiently and reliably ensures that all data remains intact and fully accessible even in the unlikely event of simultaneous component failures.”30

o “Each file is striped across multiple nodes within a cluster, with [three] parity stripes for each data block.”31
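The general idea of rebuilding lost data from parity can be illustrated with a deliberately simplified sketch. The code below uses a single XOR parity block, which tolerates the loss of only one block per stripe; it is not Isilon's FlexProtect algorithm, and N+3 protection relies on more capable error correction that survives three simultaneous losses. The function names are hypothetical and all blocks are assumed to have equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recompute the block at lost_index from all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Three data blocks, as if striped across three nodes, plus one parity block.
stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])

# Pretend the node holding the second block failed and rebuild its contents.
print(rebuild(stripe, 1))
```

Because XOR is its own inverse, combining every surviving block (data and parity) reproduces the missing one. Losing two blocks in the same stripe would be unrecoverable under this toy scheme, which is the weakness the RAID-5 comparison quoted above refers to.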

The file system may also perform a Dynamic Sector Repair (DSR) at the time of any file writing. If it encounters a bad disk sector, the file system will use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block elsewhere on the drive. The bad sector will be remapped by the drive so that it is never used again and the write operation will be completed.

The Isilon “restriper” is a meta-process/infrastructure that has four primary phases to help manage and protect data in the event that components of the cluster sustain a partial failure or malfunction. The processes run as background operations and do not require system downtime.32 33

o FlexProtect repairs data (i.e., in the event of a drive loss) using parity.

  ▪ “Isilon OneFS with FlexProtect can boast the industry leading Mean Time to Data Loss (MTTDL) for petabyte clusters.”34

  ▪ “FlexProtect introduces state-of-the-art functionality, which rebuilds failed disks in a fraction of the time, harnesses free storage space across the entire cluster to further insure against data loss, and proactively monitors and preemptively migrates data off of at-risk components.”35

o AutoBalance “rebalances the data in a cluster according to business rules, in real time, non-disruptively.”36

  ▪ “As soon as the [new or repaired] node is turned on and network cables are connected, AutoBalance immediately begins to migrate content from the existing storage nodes to the newly added node across the cluster interconnect back-end switch, re-balancing all of the content across all nodes in the cluster and maximizing utilization.”37

30 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 30 June 2009.
31 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7
32 Isilon X-Series Specifications (product brochure)
33 Information on the Isilon restriper comes from a personal email sent by Kip Cranford of Isilon Systems, Inc. on 1 June 2009.
34 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 4
35 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 15 June 2009.
36 McFarland, Anne. “Isilon Accelerates Delivery of Digital Content” The Clipper Group Navigator (2003).
37 Isilon Systems. “The Clustered Storage Revolution” (2008) p. 13


o Collect cleans up orphaned nodes and data blocks to prevent fragmentation of data.

o MediaScan verifies disk sectors.

  ▪ The function of MediaScan is to scan every block in the file system looking for bad disk sectors. If it encounters a bad sector, it will perform a Dynamic Sector Repair (DSR) and use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block somewhere else on the drive.

  ▪ MediaScan periodically reviews data blocks and disk sectors that may not have been accessed, from a file level, in months or years and thereby helps to keep the drives as healthy as possible.

o As of the OneFS 5.0 release, all file system metadata can be checked by the IntegrityScan restriper phase. This process will allow HathiTrust to completely check file data and metadata via associated checksums.

Other instances of inherent redundancy include non-volatile RAM, a fully journaled file system, and software applications that manage client connections in the event of a node’s failure.

o “OneFS is a fully-journaled file system with large amounts of battery-backed non-volatile random access memory (NVRAM) within each node, which ensures the integrity of the file system in the event of unexpected failures during any write operation.”38

o “The Isilon SmartConnect module […ensures] that when a node failure occurs, all in-flight reads and writes are handed off to another node in the cluster to finish its operation without any user or application interruption. […] If a node is brought down for any reason, including a failure, the virtual IP addresses on the clients will seamlessly fail over across all other nodes in the cluster. When the offline node is brought back online, SmartConnect automatically fails back and rebalances the NFS clients across the entire cluster to ensure maximum storage and performance utilization.”39

• Hardware Support and Service

HathiTrust equipment is covered by support and service agreements with its various vendors (Sun Microsystems, Dell, CDW-G, etc.). A good example of one such agreement is found in the “Platinum” support provided by Isilon Systems, which includes:

o Extended 24x7x365 Telephone & Online Hardware and Software Support
o 24x7 Proactive Monitoring & Alerts – Email Home (for Hardware and Software)
o Return Parts to Factory for Repair and 4-hour Replacement Parts Delivery
o SupportIQ (Enhanced Serviceability Diagnostics) and System Event Tracking
o On-site Troubleshooting
o Isilon Hardware Installation
o Software Product Documentation, Release Notes, and access to Product Technical Notes
o Remote Diagnosis (Provided User Grants Access)
o Maintenance & Patch Releases
o Minor and Major Upgrade Releases (Includes Performance Improvements, New Features, Serviceability Improvements).40

38 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 9
39 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 6

• Equipment Tracking

LIT Core Services (CS) maintains an inventory of servers on a wiki page accessible to its staff. Details include each server’s name, location, online and retire dates, upgrades, notes on storage, and its primary service. Additional information is provided related to specifications, support contracts, and key contact information. The CS server inventory is currently out of date.

• Hardware Replacement Schedule

o “HathiTrust replaces storage regularly, approximately every 3-4 years or as the usable life of storage equipment dictates” (HT TRAC C1.7)

o “HathiTrust staff upgrade hardware on a regular basis (i.e., every three or four years), and to help detect more rapid growth in demands, the web server and storage infrastructures have their own performance monitoring that indicate overload conditions.” (HT TRAC C1.10)

• Timeline for Emergency Replacement of HathiTrust Infrastructure

Should a serious event require the replacement of part (or all) of the HathiTrust technical infrastructure, the following timeline provides a general estimate of the time required to order, ship, and install new equipment. A cursory review of the time necessary for HathiTrust to recover from a major disaster at the main Ann Arbor or Indianapolis data centers suggests that a large event could idle an instance of the repository for at least a month and a half. In addition to the servers and switches mentioned above, critical components include four 30A power distribution units (PDUs) per rack and four racks per data center as of this writing.

o Submission of Purchase Orders:

  ▪ For orders under $5,000, the M-Pathways application allows the University Library’s business manager to send purchase orders directly to vendors.

  ▪ For orders over $5,000, Procurement Services normally takes one to two business days to approve the purchase, but the process may take up to a week if questions arise or additional purchase information is needed.

o Delivery of Equipment:

  ▪ Products the vendor has in stock and available for immediate shipment take 1-3 days to be delivered.

  ▪ Items that need to be configured (such as servers) usually take 1-2 weeks.

  ▪ Isilon storage will take 3 weeks to be delivered in a worst case scenario.

o Installation:

  ▪ 3 days FTE for Isilon IQ cluster in addition to the time required for other servers, switches, PDUs and rack units.

40 Isilon Systems. “Support Advantage Offerings” (2009) retrieved from http://www.isilon.com/support/?page=plans on 30 June 2009.


o Data Restoration: about .5 TB/hour (15 days, as of June 2009)41

  ▪ While HT has about 110 TB of data in its storage, the backup tapes maintained by the TSM Group contain roughly 176 TB of information due to the data encryption used to protect the intellectual rights of the material (as of 06/2009).

  ▪ The length of time required for a ‘bare-metal restoration’ will be influenced by tape mounts, network speed, restoring to the NFS shares, decryption, et cetera.

  ▪ If the library/HT were to purchase an additional tape drive (at roughly $20,000), the process could be sped up, perhaps to about 1 TB/hour.

  ▪ In the event of a large-scale disaster in which multiple campus units require extensive data restoration, the TSM Backup Service SLA states that “ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11) This determination will reflect the University of Michigan’s organizational priorities42:

    • Priority 1: Health and safety of faculty, staff, students, hospital patients, contractors, renters, and any other people on University premises.
    • Priority 2: Delivery of health care and hospital patient services
    • Priority 3: Continuation and maintenance of research specimens, animals, biomedical specimens, research archives.
    • Priority 4: Delivery of teaching/learning processes and services
    • Priority 5: Security and preservation of University facilities/equipment.
    • Priority 6: Maintenance of community/University partnerships.

o Fractional restores would, for the most part, run at comparable speeds unless there was a need to restore a large number of random files, in which case there would be a decrease in speed due to tape seek and mount times.

o Delays in recovery could be increased dramatically if the MACC data center or its infrastructure has sustained damage and needs repair.
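The figures in the timeline above can be checked with a short calculation. The sketch below is a rough estimate under stated assumptions, not an official recovery model: it assumes the steps run strictly one after another, counts "up to a week" as 7 days and "3 weeks" as 21 days, counts only the 3 days of Isilon installation, and restores the roughly 176 TB held on tape. The function names are introduced here for illustration.

```python
HOURS_PER_DAY = 24


def restore_days(tape_tb: float, rate_tb_per_hour: float) -> float:
    """Days needed to restore a given volume at a constant rate."""
    return tape_tb / rate_tb_per_hour / HOURS_PER_DAY


def worst_case_days(approval: float, delivery: float,
                    installation: float, restoration: float) -> float:
    """Total elapsed days, assuming each step waits for the previous one."""
    return approval + delivery + installation + restoration


current = restore_days(176, 0.5)   # one tape drive, about .5 TB/hour
faster = restore_days(176, 1.0)    # with an additional tape drive

total = worst_case_days(approval=7, delivery=21,
                        installation=3, restoration=current)

print(f"Restore at 0.5 TB/hour: {current:.1f} days")
print(f"Restore at 1.0 TB/hour: {faster:.1f} days")
print(f"Worst-case total: {total:.0f} days")
```

Under these assumptions the restore takes a little under 15 days and the whole sequence about 46 days, which is consistent with the report's estimates of 15 days for data restoration and at least a month and a half for full recovery. Time for installing other servers, switches, and PDUs, and any repairs to the data center itself, would add to this total.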

• HathiTrust and Insurance Coverage at the University of Michigan

The Office of Financial Operations reviews and adds financial assets greater than $5,000 to the asset management system of the University of Michigan. The Property Control Office is then responsible for tagging financial assets with unique University of Michigan identifiers and tracking them. Risk Management Services administers the University’s property insurance and will provide the reimbursement of replacement costs for items self-insured by Michigan. As of July 2009, the nature and extent of the University of Michigan’s insurance coverage for HathiTrust hardware remained under review. The main contact with Risk Management Services in this matter has been Cyndi Mesa, Head of UM Library Finance.

41 Hanover, Cameron (ITCS TSM Group Storage Engineer). Personal email on 23 June 2009.
42 University of Michigan Administrative Information Services. “Emergency Management, Business Continuity, and Disaster Recovery Planning” (2007) retrieved from http://www.mais.umich.edu/projects/drbc_methodology.html on 6 July 2009.


Scenario 2: Network Configuration Errors

• Review: Risks Involving Network Configuration Errors

The following table summarizes the risks facing HathiTrust as the result of network configuration errors. Consideration is given to network connections within UM data centers as well as at UM’s Hatcher Graduate Library (site of key administrative and development activities). The arrangement of these events reflects the relative severity of their respective consequences.

Severity: Event

High impact:
• Loss of server network switch or outbound network switch
• Loss of access to UMnet Backbone

Moderate impact:
• Extended loss of power at Hatcher Library could lead to loss of local servers and disruption of administrative and operational activities.

Low impact:
• Loss of power that threatens ability to connect to Local Area Network (LAN)/Backbone
  o The library remains (for now) a priority recipient of electricity from the UM power plant
  o Campus data centers have UPSs and redundant backup power
• Failure of local/server-side connections
  o Should problems arise with connections to individual nodes, the clustered architecture of the Isilon system will allow read/write requests to be handled by alternate nodes.
  o If connections fail at one HT site, traffic can be handled by remaining site.

• HathiTrust’s Solutions for Network Configuration Errors

HathiTrust’s continued access to the Internet via the UMnet Backbone is essential for its continued provision of service. The repository receives network infrastructure maintenance through UM’s ITCS/ITCom; with its robust disaster planning in addition to the lessons learned from the Midwest blackout of 2003, ITCom guarantees continued network access in all but the most catastrophic scenarios. In the event of a widespread power outage, HathiTrust would be able to maintain access to the UMnet Backbone since data centers are equipped with redundant power supplies and the Hatcher Graduate Library is currently categorized as a priority recipient of power from the university. ITCS also has 17 generators which can be used to maintain power to network switches in the event of a blackout. The responsibilities and obligations of both parties are outlined in the Customer Network Infrastructure Maintenance Service Agreement.43

• Extent of ITCom Support

o “ITCom agrees to provide the Unit Network Infrastructure Maintenance to include data switches, routers, access points, hubs, uninterruptible power supplies (UPS’s), firewalls, and other identified and agreed upon components.” (ITCS sec. 1.0)

43 Please refer to Appendix G (ITCS/ITCom Customer Network Infrastructure Maintenance Service Agreement).


• ITCom Responsibilities

o “Provide and maintain the necessary materials and electronic components to operate the Unit Network Infrastructure.” (sec. 5.2)

o “Provide configuration and Network Infrastructure Administration support necessary to repair and maintain the Unit Network Infrastructure hardware and software covered by this agreement.” (sec. 5.3)

o “Monitor 24 hours/day and 365 days/year (24x365), supported protocols to the backbone interface of the Units network up to and including the extension to the first hub or switch.” (sec. 5.6)

o “Monitor 24 hours/day and 365 days/year (24x365), network interfaces on uninterruptible power supplies (UPS) that support the Unit network switches. Provide notification in the event that a UPS is activated, (input power is lost or degraded and system switches to battery power), deactivated, (input power is restored), or unreachable. Provide notification to the Unit Network Administrator when batteries degrade to the point of needing replacement.” (sec. 5.7)

o “Provide maintenance on the station cabling as installed by ITCom, or an approved U-M vendor which met ITCom installation specifications.” (sec. 5.8)

o “Provide Preventative Maintenance (clean & vacuum) on each Customer Unit switch covered in this agreement yearly.” (sec. 5.9)

• ITCom Services in Response to Outages or Degradation Impacting the Network

o “A response within 30 minutes of the ITCom NOC notification or the Unit’s call, to provide information to the Unit on specific steps that have been/will be taken to resolve the problem.” (sec. 7.2.1)

o “An on-site visit, if necessary, within two (2) hours of the response (i.e., the maximum on-site response time will be two and a half (2 1/2) hours). An update will be provided to the Unit Network Administrator if on site and a best guess ETR will be provided based on available facts. ITCom will continue to provide the Unit with updates every two hours during an outage.” (sec. 7.2.1)

o “If an outage is identified within the agreement service hours ITCom will resolve the outage even if the repair time extends beyond the service agreement hours.” (sec. 7.2.1) (Repairs outside of the agreement hours result in additional labor expenses.)

o Conduct monitoring via SNMP POLLING at one minute intervals. (sec. 7.2.1)
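The response commitments quoted above reduce to simple date arithmetic. The following sketch computes the latest acceptable response and on-site times for a given notification; the constant and function names are hypothetical, and the example timestamp is invented for illustration. Only the 30-minute and two-hour intervals come from the agreement text.

```python
from datetime import datetime, timedelta

# Intervals quoted from sec. 7.2.1 of the service agreement.
INITIAL_RESPONSE = timedelta(minutes=30)
ONSITE_AFTER_RESPONSE = timedelta(hours=2)


def response_deadlines(notified_at: datetime):
    """Return (respond_by, onsite_by) for an outage reported at notified_at."""
    respond_by = notified_at + INITIAL_RESPONSE
    onsite_by = respond_by + ONSITE_AFTER_RESPONSE
    return respond_by, onsite_by


# Example: an outage reported at 09:00 (illustrative timestamp).
notified = datetime(2009, 6, 1, 9, 0)
respond_by, onsite_by = response_deadlines(notified)

print("Respond by:", respond_by.time())
print("On site by:", onsite_by.time())
print("Maximum on-site delay:", onsite_by - notified)
```

The last line shows the two and a half hour maximum stated in the agreement, since the two-hour on-site window is measured from the initial response rather than from the original notification.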

• HathiTrust Responsibilities

ITCom’s responsibilities end at the first network switch; from there to its servers, HathiTrust is responsible for maintaining network connectivity and security. The repository uses Internet2 for communication and synchronization between the Ann Arbor and Indianapolis sites. Each Isilon node has dual 10GB Infiniband ports for internal (i.e., intra-cluster) communication and dual 1GB Ethernet for external communication.

Scenario 3: Network Security and External Attacks


• Review: Risks Involving Network Security and External Attacks

The following table gives a general overview of the basic threat an external attack or network security breach poses to HathiTrust; entries are arranged by severity. The list, however, is not exhaustive and no attempt has been made to publicize potential vulnerabilities.

Severity: Events

High impact:
• Unauthorized access to HathiTrust content leads to the infringement of copyrights.
• Loss of data or functionality for an extended period of time as a result of malicious activity.

Moderate impact:
• HathiTrust services are temporarily unavailable as a result of malicious activity.

Low impact:
• The delivery of HathiTrust services slows as the result of malicious activity.
• A security weakness exists within the system but remains unexploited.

• HathiTrust’s Solutions for Network Security

Malicious activity against HathiTrust could involve unauthorized access to a system or data, denial of service, or unauthorized changes to the system, software, or data. As an academic entity, the repository is seen as less of a target for such actions than commercial or governmental targets; despite this perceived lower risk, HathiTrust has not been lulled into a false sense of security. The repository takes seriously the potential for violations of its network and operating system security and therefore has instituted a program of periodic software updates in addition to the maintenance of an ITCom-supported firewall, authentication-required access, and other measures (such as throttling software to deter denial of service attacks). Because content is currently accepted from trusted sources (namely, Google and legacy digital collections from HathiTrust partners) the GROOVE process does not include a virus detection phase. As digital objects are ingested from a greater number of sources, additional security measures should be considered.

o “HathiTrust staff apply security updates to the operating system and to networking devices as soon as they become available in order to minimize system vulnerability. As with new software releases, security updates are tested in a development environment before being released to production. Software packages that present a lower security risk and that have a greater potential to affect application behavior (web servers, language interpreters, etc.) are generally installed, configured and tested manually to allow for greater control in managing updates. Software updates are not applied automatically; moreover, updates that present a potential for having an impact on system behavior are applied and tested first in the development environment. If no impacts are seen, HathiTrust staff apply these updates in production after a testing period of at least one week.” (HT TRAC C1.10)


Scenario 4: Format Obsolescence

• Review: Risks Involving Format Obsolescence

The following table outlines the threats posed by format obsolescence and arranges them according to their potential severity.

Severity: Events

High impact:
• Applications and hardware are no longer able to read or display digital objects.
• Errors in translating and reading files are not understood or acknowledged by repository users.

Moderate impact:
• Problems with the translation of file formats result in DIPs that do not faithfully reflect the original digital objects.

Low impact:
• Formats and associated applications change but retain compatibility with older versions of the file formats.

• HathiTrust’s Solutions for Format Obsolescence

An awareness and acknowledgement of the dangers of format obsolescence has led HathiTrust to implement proactive policies and procedures to ensure long-term access to the repository’s content. The repository only accepts specific formats that meet rigorous specifications and, through the prior experience of University of Michigan personnel, has developed protocols for the successful migration of content from one format to another. In addressing the threat of format obsolescence, the preservation of the integrity and authenticity of deposited content has been an overarching concern.

• Selection of File Formats

o “HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance and layout of materials digitized for deposit. HathiTrust stores and preserves metadata detailing the sequence of files for the digital object. HathiTrust has extensive specifications on file formats, preservation metadata, and quality control methods, included in the University of Michigan digitization specifications, dated May 1, 2007.”44 (HT TRAC B1.1)

o “HathiTrust currently ingests only documented acceptable preservation formats, including TIFF ITU G4 files stored at 600dpi, JPEG or JPEG2000 files stored at several resolutions ranging from 200dpi to 400dpi, and XML files with an accompanying DTD (typically METS). HathiTrust supports these formats because of their broad acceptance as preservation formats and because the formats are documented, open and standards-based, giving HathiTrust an effective means to migrate its contents to successive preservation formats over time, as necessary. The Repository Administrators have undertaken such transformations in the past; moreover, HathiTrust offers end-user services that routinely transform digital objects stored in HathiTrust to “presentation” formats using many of the widely available software tools associated with HathiTrust’s preservation formats. HathiTrust gives attention to data integrity (e.g., through checksum validation) as part of format choice and migration.”45

44 Specifications are available at http://www.lib.umich.edu/lit/dlps/dcs/UMichDigitizationSpecifications20070501.pdf

o “Each format conforms to a well-documented and registered standard (e.g., ITU TIFF and JPEG2000) and, where possible, is also non-proprietary (e.g., XML).” (HT TRAC B4.2)

• Format Migration Policies and Activities

o “HathiTrust is committed to migrating the formats of materials created according to [its] specifications as technology, standards, and best practices in the digital library community change.” (HT TRAC B1.1)

o “HathiTrust staff members conduct migrations from one storage medium to another using tools that validate checksums internally. (Digital objects are stored both online and on tape, and the online storage system conducts regular scans to detect and correct data integrity problems.) A total file count is done following a large data transfer, and regularly scheduled integrity checks follow.” (HT TRAC C1.7)

o “[HathiTrust] has migrated large SGML-encoded collections to XML, and Latin-1 character encodings to UTF-8 Unicode. Our success in migrating from older formats to newer formats demonstrates our commitment to our collections and our ability to keep materials in our repository viable. All migrations are documented in change logs.” (HT TRAC B4.2)
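The combination of checksum validation and a total file count after a transfer, as described in the quotations above, can be sketched in a few lines. This is a minimal illustration and not HathiTrust's actual tooling: the function names are hypothetical, and SHA-256 is an assumption, since the source does not state which checksum algorithm is used.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large objects need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_transfer(source, target):
    """Compare file counts and checksums between two directory trees."""
    src, dst = manifest(source), manifest(target)
    return {
        "count_matches": len(src) == len(dst),
        "missing": sorted(set(src) - set(dst)),
        "mismatched": sorted(k for k in src if k in dst and src[k] != dst[k]),
    }
```

Calling `verify_transfer("/path/to/source", "/path/to/copy")` (placeholder paths) returns a small report; a clean migration would show a matching count and empty `missing` and `mismatched` lists. Recording that report alongside the migration's change log entry would give later integrity checks a baseline to compare against.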

45 HathiTrust. “Preservation” (2009) retrieved from http://www.hathitrust.org/preservation on 16 June 2009.


Scenario 5: Core Utility and/or Building Failure

• Review: Risks Involving Core Utility or Building Failure

The following table summarizes the dangers a utility or building failure poses to HathiTrust and ranks events by their potential severity.

Severity: Events

High impact:
• Extensive structural damage renders the MACC (or key elements of its infrastructure) unusable and necessitates the establishment of a hot site to recover and continue operations.
• Additional failure past tolerance in backup cooling or power infrastructure

Moderate impact:
• Failure of backup power past redundancy tolerance (failure of 2 generators)
  o Data center coordinator may initiate load shed and shut down half of the MACC (but library racks will remain operational)
• Structural damage renders facility temporarily unsafe and/or unusable.

Low impact:
• Loss of power
• Loss of environmental control units within redundancy

• HathiTrust’s Solutions for Utility or Building Failure

The continued delivery of HathiTrust’s services depends upon the maintenance of power, environmental control, and security in its server environment at the Michigan Academic Computing Center (MACC) and other locations that host components of the repository. In this respect, HathiTrust is heavily reliant upon the infrastructure of the MACC as well as that of the Arbor Lakes Data Facility, home to one instance of the TSM Group’s backup tape library. Both locations provide closely monitored and highly redundant environments that help ensure that HathiTrust’s infrastructure remains secure and operable. At the same time, administrative and data management functions critical to the development and maintenance of the repository take place in the University of Michigan’s Hatcher Graduate Library. The service and cooperation of Michigan’s Plant Operations Division are therefore critical for the continued access to and use of this structure in the operation of HathiTrust.

• General Maintenance and Repairs in University of Michigan Facilities

Facilities and maintenance issues on the University of Michigan campus are reported to the Plant Operations Division, the Department of Public Safety (DPS), and Occupational Safety and Environmental Health (OSEH) in addition to the impacted facility’s manager. Repair work is coordinated by the University Library facilities manager in conjunction with administrators and workers from Plant Operations.

• The Michigan Academic Computing Center (MACC)

The MACC hosts many of the key components of the University of Michigan Library system as well as the technical infrastructure of HathiTrust. The University of Michigan does not own the building in which the data center is located but instead operates the MACC in conjunction with the Michigan Information Technology Center (MITC) Foundation and other partners. The MACC Server Hosting Service Level Agreement46 lists the responsibilities of the data center as well as the repository; of particular significance are the MACC’s agreements to:

o “Provide a controlled physical environment to support servers [with] room average temperature of between 65 and 75 degrees and 35-50% relative humidity [and] monitored environmentals (temperature, humidity, smoke, water, electrical.” (sec. 4.1)

o “Provide adequate, conditioned, 60-cycle electrical service with adequate backup electrical capacity to support circuits, service, and outlets [and also to] provide Uninterruptible Power Supply (UPS) and generator backup” (sec. 4.2)

o “Provide 7x24 telephone contact for emergencies and for emergency access to facility.” (sec. 4.4)
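The environmental commitments quoted above amount to a pair of threshold checks. The sketch below shows how a reading might be tested against them; it is illustrative only and does not describe the MACC's monitoring system. The names are hypothetical, and the temperature unit is assumed to be degrees Fahrenheit, which the agreement text does not state explicitly.

```python
# Ranges quoted from sec. 4.1; Fahrenheit is an assumption.
TEMPERATURE_RANGE_F = (65.0, 75.0)
HUMIDITY_RANGE_PCT = (35.0, 50.0)


def out_of_range(temperature_f: float, humidity_pct: float) -> list:
    """Return the names of readings that fall outside the agreed ranges."""
    alerts = []
    low, high = TEMPERATURE_RANGE_F
    if not low <= temperature_f <= high:
        alerts.append("temperature")
    low, high = HUMIDITY_RANGE_PCT
    if not low <= humidity_pct <= high:
        alerts.append("humidity")
    return alerts


print(out_of_range(70.0, 40.0))  # both readings within range
print(out_of_range(80.0, 30.0))  # both readings outside range
```

The agreement refers to the room average temperature, so a real check would presumably average several sensors before comparing; that step is omitted here.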

In addition to features such as redundant electrical and environmental systems, the MACC maintains a full-time coordinator and staff who provide 24x7 responses to failures or malfunctions in the server environment. Alerts prompted by issues with the environmental systems or power are sent to the University of Michigan Network Operations Center (NOC) during non-business hours.

o Overview: “TheMACC'sredundancyisdesignedtoensurethesafetyandsecurityofthe

datahousedwithin.Itconsistsof:• Adualpowerpathfromthepropertylinetothepowerdistribution

units• Dieselpoweredgeneratorsforelectricalbackup• Flywheels(notbatteries)toprovidepowerwhilethegeneratorscome

on• State‐of‐the‐artgeneratorsandflywheelsforbackuppower• Threeextracomputerroomairconditioners• Twoextradrycoolers• Glycolloopforcoolingwithtwoparallelpathwayswithcrossovervalves

atregularintervals.”47 “Astate‐of‐the‐artmonitoringsystemkeepstrackof1,700differentparameters

andautomaticallynotifiesstaffofanyirregularity.”48o EnvironmentalControlsandMonitoring

“TheMACChas18ComputerRoomAirConditioningunits(CRACs).Atanygiventime,only15arenecessarytomaintaintherequiredtemperatureandhumidity.[Thus,thecomputerroomhasN5+1redundancyinitscoolingability.]Italsoisequippedwithanumberofportablecoolerstoaddressspecificcoolingneeds.Theheatfromtheroomistransferredtoanunder‐floorglycolloopthatreleasestheheattotheoutdoors.”49

46PleaserefertoAppendixH(MACCServerHostingServiceLevelAgreement).47MichiganAcademicComputingCenter.“VitalStatistics”(2009)retrievedfromhttp://macc.umich.edu/about/vital‐statistics.phpon16June2009.48‐‐.“MichiganAcademicComputingCenter”(2009)retrievedfromhttp://macc.umich.edu/index.phpon16June2009.49‐‐.“VitalStatistics”(2009)retrievedfromhttp://macc.umich.edu/about/vital‐statistics.phpon16June2009.

2009‐08‐24 22

“The layout of the facility allows the front of the computer racks to be facing the cold aisles. These aisles have perforated floor tiles through which the cool air is pumped directly to the computers located there. Heat is discharged from the backs of the computers, which creates the hot aisles. This alternating arrangement facilitates the cooling process, as the hot air produced by the computers can be siphoned off before it mingles too much with the cooler air of the facility.”50

“Two separate smoke detection and fire alarm systems protect the MACC. One is for the building; the other is for the MACC itself. The two systems work together to activate alarm systems and notify the fire department and key personnel. In the event of an actual fire, the fire-suppression system pipes will not fill with water unless there is a pressure drop caused by melting of one or more of the sprinkler heads.”51

o Backup Power

“Three generators, each roughly the size of a rail car, provide backup power. Only two of the three are required to run the facility in the event of a power outage.”52

“The MACC uses environmentally responsible flywheels instead of batteries for power backup while the generators come online. The combination of generators and flywheels provides the facility with a fully redundant uninterruptible power system (UPS).”53

The MACC has a contract with the UM Plant Operations Division for the delivery of diesel fuel for its generators in the event of an extended blackout.54

In the event that a backup generator is disabled, the MACC coordinator will initiate load shed, in which one half of the MACC will be shut down so that the other half (and requisite environmental systems) may continue to operate. The HathiTrust and UM Library racks are among those which will retain power should this response prove necessary.55
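The redundancy figures quoted above (18 CRAC units with 15 required; three generators with two required) can be expressed as a simple spare-capacity check. The following is a minimal sketch for illustration only: the class and method names are hypothetical, and it does not model the load-shed procedure, which remains a decision of the MACC coordinator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundantSystem:
    """A pool of identical units of which only some are needed at once."""
    name: str
    installed: int
    required: int

    @property
    def spares(self) -> int:
        # Units that can fail before capacity drops below the requirement.
        return self.installed - self.required

    def survives(self, failures: int) -> bool:
        """True if the units left after `failures` still meet the requirement."""
        return self.installed - failures >= self.required


# Counts taken from the MACC statements quoted above.
cooling = RedundantSystem("CRAC units", installed=18, required=15)
generators = RedundantSystem("backup generators", installed=3, required=2)

for system in (cooling, generators):
    print(f"{system.name}: {system.spares} spare unit(s)")
```

A check of this kind could be repeated whenever equipment is added or retired, so that the stated margins stay accurate.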

• Arbor Lakes Data Facility (ALDF)

The ALDF houses the TSM Group’s infrastructure and one instance of the backup tape library that forms an integral part of HathiTrust’s Disaster Recovery strategy. As the home of critical components of the UMnet Backbone, the ALDF provides a safe and secure location for one set of the repository’s backup tapes. In the interest of security, this report will omit further information on the exact nature of the facility’s power and environmental systems.

50 Ibid.
51 Ibid.
52 --. “Michigan Academic Computing Center” (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
53 Ibid.
54 Gobeyn, Rene (MACC Data Center Coordinator). Personal interview on 23 June 2009.
55 Ibid.


Scenario 6: Software Failure or Obsolescence

• Review: Risks Involving Software Failure or Obsolescence

The following table details various risks inherent to software failure or obsolescence and ranks them according to their severity.

• HathiTrust’s Solutions for Software Issues

The development and use of HathiTrust’s tools and resources depends on highly functional software applications. Repository policies have therefore been crafted to ensure that these applications are thoroughly tested and regularly updated to minimize the threat of service outages as a result of software failure or obsolescence. HathiTrust furthermore employs open source applications that are well-supported and enjoy widespread use and development within the digital library community.

o “Changes in software releases of all components of the system (from ingest to access) are developed and tested in an isolated “development” environment to prepare for release to production. When ready for release, developers record the changes made and increment version numbers of system components as appropriate using a version control system. New versions of software are released using automated mechanisms (in order to prevent manual errors). Major changes and upgrades in hardware architecture are recorded in monthly reports of unit activity, and thus are traceable to that level of detail.” (HT TRAC C1.8).

o “Additionally, subsets of production data are available in the development environment to allow developers to ensure proper system behavior before releasing changes to production.” (HT TRAC C1.9)

o “In order to design, build and modify software for the designated end-user community, HathiTrust conducts an active usability program and seeks input from the Strategic Advisory Board of HathiTrust. Similarly, with regard to software development in support of the archiving needs of the Participating Libraries, HathiTrust focuses on the development of highly functional ingest and validation mechanisms. HathiTrust also seeks and responds to guidance from the Strategic Advisory Board with regard to archiving services.” (HT TRAC C2.2)

Severity / Events

High impact
• Software bug escapes detection in development environment and results in crash of application.

Moderate impact
• Software bug escapes detection in development environment and prevents full access to digital objects.
• Improper version of software is introduced to system (could have a greater or lesser impact depending on results of error and repository’s ability to detect it).

Low impact
• Software bug escapes detection in development environment and prevents full use of system capabilities (i.e., rotation of images or additional functionality).


Scenario 7: Operator Error

• Review: Risks Involving Operator Error

The following table summarizes risks to HathiTrust posed by operator error; events are ranked according to their potential severity.

• HathiTrust’s Solutions for Operator Error

In any human enterprise, occasional operator error is unavoidable; HathiTrust strives to ensure that any such events are detected and resolved in a timely fashion.56 To help avoid occurrences and mitigate their potential impact, HathiTrust has automated many procedures and also relies upon application assertions, which can notify administrators when processes are not operating correctly. Even if an error is introduced to the file system and then backed up, the TSM client saves up to seven versions of a file for up to six months so that an earlier version can be retrieved.
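The retention rule described above (up to seven versions, kept for up to six months) determines how far back an operator error can be undone. The sketch below is a simplified model of that rule, not the TSM client's actual algorithm: the function names are hypothetical, and "six months" is approximated as 183 days.

```python
from datetime import datetime, timedelta

MAX_VERSIONS = 7                # "up to seven versions of a file"
MAX_AGE = timedelta(days=183)   # assumption: "six months" taken as 183 days


def retained_versions(backup_times, now):
    """Return the backup timestamps still retrievable, newest first."""
    recent = [t for t in backup_times if now - t <= MAX_AGE]
    return sorted(recent, reverse=True)[:MAX_VERSIONS]


def last_good_version(backup_times, error_introduced, now):
    """Newest retained version made before the error, or None if none remains."""
    for timestamp in retained_versions(backup_times, now):
        if timestamp < error_introduced:
            return timestamp
    return None
```

Under this model, an error that goes unnoticed until more than seven newer versions of the file have been saved can no longer be reversed from tape, which is one reason prompt detection matters.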

• Ingest: The Google Return (Object-Oriented) Validation Environment (GROOVE) process is entirely automated to avoid the introduction of operator error to the process; steps include:
o Identification of material for ingest
o Decryption and unzipping of files
o Format verification and validation with JHOVE
o Lun Barcode and MD5 checksum validation
o Creation of HathiTrust METS documents
o Establishment of HathiTrust handles (persistent URLs)
o Extension of the pairtree file directory (as new material enters the system)

• Archival Storage: Files stored within the repository are not accessed directly or manipulated by staff so that neither the zipped image and OCR files nor the METS document may be accidentally altered or deleted.

• Dissemination: The page-turner application references the stored image and then creates a .png (for TIFFs) or .jpg (for JPEG2000s) file for display to the viewer.

• Data Management: “New versions of software are released using automated mechanisms (in order to prevent manual errors).” (HT TRAC C1.8)
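The MD5 checksum validation step in the ingest workflow above can be illustrated with Python's standard `hashlib` module. This is a generic sketch of the technique, not the GROOVE implementation; the function names and the way the expected checksum is supplied are assumptions.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large image files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """True if the file's MD5 digest equals the checksum recorded for it."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file whose digest does not match the recorded value would be held back from ingest and reported rather than stored.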

56 Please refer to Appendix B (HathiTrust Outages from March 2008 through April 2009).

Severity / Events

High impact
• Operator error results in the irreparable loss of data or damage to equipment.
• Operator error results in loss of key repository functions (ingest, storage, dissemination, etc.) for an extended period of time.

Moderate impact
• Operator error remains undetected and causes persistent problems in the system but has no long term consequences.

Low impact
• Operator error is detected by normal procedures or via an activity log and can be readily corrected.


Scenario 8: Physical Security Breach

• Review: Risks Involving a Physical Security Breach

Maintaining the physical security of the HathiTrust infrastructure is yet another crucial element in the repository’s efforts to manage risks and thereby lessen the chance that a disaster-type event occurs. Risks involve the damage and destruction of equipment and could even extend to unauthorized system access. Multiple levels of security exist at both the Michigan Academic Computing Center (MACC) and the Arbor Lakes Data Facility (ALDF) to protect HathiTrust from acts of vandalism, destruction, or malicious tampering. Details on the potential impacts of a physical security breach are covered in “Scenario 1: Hardware Failure” and “Scenario 3: Network Security.”

• HathiTrust’s Solutions for Physical Security

o “Each of [the HathiTrust] storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel.”57

• Security at the MACC

The MACC Server Hosting SLA states the data center staff will:

o “Provide services necessary to maintain a safe, secure, and orderly environment for all tenants of the MACC.” (sec. 4.7)

o “Provide access control via HiD card and biometric readers for those listed on the Tenant Staff Authorized for Access list.” (sec. 4.5)

The MACC Web site and the Michigan Academic Computing Center Operating Agreement58 provide additional details concerning the resources and procedures that help protect HathiTrust’s equipment at the MACC. The MACC Data Center Coordinator personally oversees the enforcement of security protocols and conducts regular audits of security logs and, when necessary, reviews surveillance video footage.

o Security Systems

“State-of-the-art security devices such as iris scanners, cameras, closed circuit television and on-call staff keep the data and machines housed in the MACC safe.”59

“Access to the data center will be by two-factor authentication (access card and iris scan) or escorted, supervised access. Access to the building will be by access card.” (MACC OA, sec. 5.3.1)

“Cameras throughout the corridor, security trap, and facility will be monitored and maintained by the Data Center Coordinator.” (sec. 5.2.1)

o Security Procedures

57 HathiTrust. “Technology” (2009) retrieved from http://www.hathitrust.org/technology on 15 June 2009.
58 Please refer to Appendix I (Michigan Academic Computing Center Operating Agreement).
59 Michigan Academic Computing Center. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 17 June 2009.


“The Operations Advisory Committee will establish procedures for granting access cards to the facility to those whose jobs require hands-on access to systems. All requests for access cards will be vetted and approved by the Operations Advisory Committee at their next meeting.” (sec. 5.3.2)

“Everyone on the access list for the data center will be required to attend a training session before working in the data center and sign an access agreement stating policies they must observe while in the data center.” (sec. 5.3.8)

• Security at the ALDF

As noted in the TSM Backup Service SLA, the University of Michigan’s ITCS “is responsible for physical security” at the ALDF. (sec. 4.9) While this document will not detail specific features of the ALDF’s operation, multiple levels of security and oversight are employed.


Scenario 9: Natural or Manmade Disaster

• Review: Risks Involving a Natural or Manmade Disaster

The following table details the risks to HathiTrust posed by a natural or manmade disaster; events are ranked by order of their severity. Due to possible overlap between this scenario and Scenario 1 (Hardware Failure), readers are encouraged to consult that earlier section.

• HathiTrust’s Solutions for Natural or Manmade Catastrophic Events

The University of Michigan Ann Arbor Campus Emergency Procedures (revised January 2008) has set procedures to address building evacuations (in the event of fire), tornadoes, severe weather, flooding, chemical/biological/radioactive spills, as well as bomb threats, civil disturbances, and acts of violence or terrorism.60 In all cases, staff will follow the directions of Public Safety and not re-enter buildings or resume work “until advised to do so by DPS or OSEH or someone from on-site incident command.”

In the event of a severe natural or manmade disaster, the repair and restoration of the physical locations of HathiTrust infrastructure would need to be coordinated between the repository and the appropriate facility managers. Such activity would rely upon the disaster recovery plans in place at the MITC Building (home of the MACC) and University of Michigan (which includes the Hatcher Graduate Library and the ALDF). It must be noted that an event which causes significant damage to an important structure or to a building’s infrastructure could result in the loss of an instance of the repository for an extended period of time. In such a case, HathiTrust would need to set up an alternate hot site until structural restoration is complete (or a new facility has been found).

60 Please see Appendix C (Washtenaw County Hazard Ranking List).

Severity / Events

High impact
• Widespread damage to a data center and/or its infrastructure that forces an instance of the repository to find a new hot site with sufficient power supply, environmental controls, and security.
• Damage to work areas forces staff to relocate to a new center of operations.
• Extensive loss or damage to hardware requires large-scale replacement.
• With the extended loss of one site, HathiTrust loses redundancy (and possibly some functionality: i.e., the ability to ingest new material in Ann Arbor) and thus a central component of its disaster recovery and backup plans.
• An act of violence or terrorism occurs at or near HathiTrust facilities.

Moderate impact
• An event results in an extended outage at one site that exceeds the recovery time objective.
• Hardware sustains some damage and site is able to continue operation in a reduced capacity.
• An actual or threatened act of violence or terrorism forces the temporary evacuation or quarantine of HathiTrust facilities.

Low impact
• Local conditions result in a temporary outage at a HathiTrust site.


• Basic Disaster Recovery Strategies

In the aftermath of a large-scale manmade or natural disaster, the repository’s immediate recovery will be enabled by its basic system architecture:

o “the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage are located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate facility outside of Ann Arbor).”61

The establishment of the mirror site in Indianapolis and the retention of multiple backup tapes at two locations in Ann Arbor ensure that a serious event at either location will not impede the continued functioning of the repository at the other. Consideration must be given as to how data at the Indianapolis site will be backed up and how key repository functions (such as ingest) will proceed if the Ann Arbor instance is off-line for an extended period of time. Likewise, a long-term outage at the IU location would require HathiTrust to establish a third site for data backup (i.e., a location where additional copies of backup tapes could be stored).

61 HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.


Scenario 10: Media Failure or Obsolescence

• Review: Risks Involving Media Failure or Obsolescence

The following table summarizes risks to HathiTrust posed by the failure of the media used for its data backups. While the risks from this are limited (both copies of the tape backups would have to be impacted for data to be unavailable), the issue should nonetheless be addressed with regular test restorations and/or inspections of the media.

• HathiTrust’s Solutions for Media Failure

Given the nature of HathiTrust’s storage system, this scenario is only a concern with regard to the digital magnetic tapes used by the TSM Group for backups.

o Two tape copies of all backup data are made and these are stored in separate climate-controlled conditions in tape libraries at the MACC and the ALDF.

o Content is transferred to new tape during data defragmentation (which occurs when existing tapes are 80% full).

o If a degraded or otherwise ‘bad’ section of tape is detected during a backup procedure, that tape is immediately marked as “read only.”

Data is thenceforth written to a different tape; existing data on the bad tape will be copied to properly functioning media.

If data cannot be reclaimed from bad tape, the TSM Group would contact HathiTrust so that the backup of content can be properly completed.
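The tape-handling rules above (migration once a tape is 80% full, and an immediate "read only" flag when a bad section is detected) can be summarized in a small model. This is a hedged sketch of the policy as described in this report, not the TSM Group's software; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass

MIGRATION_THRESHOLD = 0.80  # defragmentation occurs when tapes are 80% full


@dataclass
class Tape:
    label: str
    capacity_gb: float
    used_gb: float = 0.0
    read_only: bool = False

    @property
    def fill_ratio(self) -> float:
        return self.used_gb / self.capacity_gb

    def needs_migration(self) -> bool:
        """Content moves to new tape if the tape is full enough or flagged bad."""
        return self.read_only or self.fill_ratio >= MIGRATION_THRESHOLD


def record_write_error(tape: Tape) -> None:
    """A bad section was detected during backup: stop writing to this tape."""
    tape.read_only = True


def writable_tapes(tapes):
    """Tapes that may still receive new backup data."""
    return [tape for tape in tapes if not tape.read_only]
```

Note that the model only reacts to errors found while writing; it has no step that inspects tapes at rest, which is the gap discussed under Remaining Vulnerabilities.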

• Remaining Vulnerabilities

There is some reason for concern in this area because the TSM Group does not have a regular program to monitor its media for physical degradation or impairment after data defragmentation. While the tapes are reported to be highly dependable, problems such as “sticky shed” (the hydrolysis of the tape’s binder) could become an issue with older tapes. A regular program of tape validation or test restorations would provide an opportunity to check on the physical condition and data integrity of the tapes. Likewise, the creation of a schedule for the replacement of older tapes could avoid future problems with media degradation.

Severity / Events

High impact
• Physical degradation (i.e., in tape binder, substrate, or magnetic content) affects both copies of older backup tapes.

Moderate impact
• Because backup tapes are not regularly tested or audited, the physical substrate of tapes may degrade over time.

Low impact
• Bad tape is detected during a tape backup.


Conclusions and Action Items

• Conclusions

As this report demonstrates, a variety of risk management strategies in addition to design elements, operating procedures, and service and support contracts endow HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a range of disasters. The establishment of the Indianapolis mirror site, the performance of nightly tape backups, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. As it is, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.

In the effort to secure HathiTrust’s long-term continuity, the present document stands merely as a preliminary step in the establishment of a legitimate Disaster Recovery Plan. The data on HathiTrust’s policies, procedures, and contracts consolidated herein should facilitate the data collection requisite to the initial phases of the planning process, but the core activities of formulating technical and administrative response strategies and delegating roles and responsibilities remain to be undertaken. The following section outlines recommendations and action items derived from research into the repository as well as from discussions with Cory Snavely and other HathiTrust staff members. Items have been separated into an approximate timeline of activity ranging from Short Term through Long Term and the arrangement within each category represents a suggested (but by no means definitive) order of accomplishment. For a more detailed explanation of action items related explicitly to Disaster Recovery Planning, please refer to the overview of the planning process in Appendix E or consult Appendix D for a list of more comprehensive guides and resources. (NB: * = Denotes an ongoing activity.)

• Short Term Action Items (0-6 months)

a. Resolve the nature and extent of the insurance coverage for HathiTrust equipment.

b. Arrange with TSM Group administrators to periodically perform a volume audit of backup tapes to ensure data integrity.

c. Institute periodic test restores with TSM Group to ensure that the process will run smoothly in the event of a disaster.

d. Discuss the creation of a long-term replacement schedule for backup tapes with the TSM Group to avoid the possibility of media degradation.

e. Improve control over system components
   i. Update the hardware inventory to include all important system components; document models, serial numbers, UM IDs, associated software and version number, date of purchase, original cost, as well as vendor contact information and product support contracts.*
   ii. Establish a software inventory to document necessary applications in the event of hardware loss; should include purpose, acquisition date, cost, license number, and version number.*
   iii. Create a map identifying where components are in the MACC and within individual racks.*
   iv. Review and assess points of failure as well as the adequacy of redundant components.*

f. Establish phone trees
   i. Include key contacts for different types of disaster
   ii. Prioritize phone trees to target individuals who
      1. Make decisions
      2. Have vital information
      3. Can offer assistance in resolving situations
   iii. Distribute information and explain protocols to all relevant staff*
   iv. Develop a regular maintenance/update schedule (once every 4-6 months)*

g. Thoroughly document and make available (as needed) important institutional knowledge so that HathiTrust may continue to function in the event of the extended absence or loss of key staff.*

h. Identify disaster preparedness and disaster recovery measures in place at Indianapolis.
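The hardware inventory called for in item e.i above lists the attributes each record should capture. One possible shape for such a record is sketched below. The field names follow that list, but the class itself, the completeness check, and all sample values are hypothetical and shown only for illustration.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import Optional


@dataclass
class HardwareItem:
    model: str
    serial_number: str
    um_id: str
    purchase_date: date
    original_cost_usd: float
    vendor_contact: str
    support_contract: Optional[str] = None
    software: dict = field(default_factory=dict)  # application name -> version
    location: str = ""                            # e.g. rack position in the MACC


def missing_fields(item: HardwareItem) -> list:
    """Names of fields that are still empty, in declaration order."""
    return [name for name, value in asdict(item).items() if value in ("", None, {})]


# Hypothetical sample record; none of these values come from the report.
example = HardwareItem(
    model="Example Server X1",
    serial_number="SN-0001",
    um_id="UM-000000",
    purchase_date=date(2009, 1, 15),
    original_cost_usd=5000.0,
    vendor_contact="vendor@example.com",
)
print(missing_fields(example))
```

Running a completeness check of this kind during each scheduled inventory review would show which records still lack support contract, software, or location details.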

• Intermediate Term (6-12 months)

a. Form a Disaster Recovery Planning Committee to research and develop plans and to oversee their implementation.

b. Communicate and coordinate planning activities between Ann Arbor and Indianapolis.*
   i. Consider the formation of sub-committees for localized research and development of plans and an executive committee to oversee the implementation and management of plans.

c. Draft a Disaster Recovery Planning policy statement to define the mandate, responsibilities, and objectives for the plan.

d. Initiate the data collection and analysis phase of the planning process.
   i. Identify core repository functions and associated hardware and infrastructure elements.
   ii. Determine the potential impact from the loss of those functions.
   iii. Define the levels of functionality required for partial as well as full recovery. Establish what level is needed for HT to fulfill its mission and the needs of its users.
   iv. Define HathiTrust’s Recovery Time Objective (RTO: the maximum allowable outage period for services) and Recovery Point Objective (RPO: the point in time to which data stores must be returned following a disaster).
   v. Determine the availability of resources in the event of a disaster and establish the repository’s prioritization with major service providers and vendors (i.e., TSM Group, ITCom, Isilon, etc.).

e. Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.*

f. Develop recovery strategies to bring core functions back online as soon as possible within a set cost range.
   i. Establish a logical progression in the restoration of services and associated components.
   ii. Identify the resources required for these efforts.
   iii. Consider alternative solutions, including partial (vs. full) recovery.

g. Communicate planning goals and efforts to key contacts from service providers and vendors to better coordinate recovery efforts.*

h. Initiate the production of core Disaster Recovery documents (see Appendix E for more information). The following list is not exhaustive; data collection and analysis will help determine if all or other plans (i.e., a web continuity plan) are needed.
   i. Business Continuity Plan: details HathiTrust’s core functions and the priorities for re-establishing each in the event of a disruption.
   ii. Continuity of Operations Plan: focuses on restoring an organization’s (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.
   iii. IT Contingency Plan: addresses explicitly the disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
   iv. Crisis Communications Plan: establishes procedures for internal and external communications during and after an emergency.
   v. Cyber-Incident Response Plan: defines the procedures for responding to cyber attacks against the HathiTrust IT system.
   vi. Occupant Emergency Plan: defines response procedures for staff in the event of a situation that poses a potential threat to the health and safety of HathiTrust personnel or their environment. (This requirement is addressed by University of Michigan Building Emergency Action Plans.)
   vii. Disaster Recovery Plan: brings together guidance and procedures from the other plans to enable the restoration of core information systems, applications, and services. This plan defines roles and responsibilities within Disaster Response Teams.
   viii. Disaster Recovery Training Plan: establishes the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.
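Once the Recovery Time Objective and Recovery Point Objective described in item d.iv above have been defined, each test restore or real incident can be measured against them. The sketch below assumes the RPO is expressed as a maximum acceptable gap between the restored data and the moment of the outage, which is one common reading of the definition; the function name and all values are hypothetical.

```python
from datetime import datetime, timedelta


def assess_recovery(outage_start, service_restored, restored_to, rto, rpo):
    """Compare one recovery against the RTO and RPO.

    restored_to is the point in time that the recovered data reflects,
    for example the timestamp of the backup that was restored.
    """
    downtime = service_restored - outage_start
    data_gap = outage_start - restored_to
    return {
        "downtime": downtime,
        "data_gap": data_gap,
        "rto_met": downtime <= rto,
        "rpo_met": data_gap <= rpo,
    }


# Hypothetical objectives and incident, for illustration only.
result = assess_recovery(
    outage_start=datetime(2009, 8, 24, 9, 0),
    service_restored=datetime(2009, 8, 24, 15, 0),
    restored_to=datetime(2009, 8, 24, 1, 0),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
)
print(result["rto_met"], result["rpo_met"])
```

In this made-up case the six-hour outage exceeds the four-hour RTO while the eight-hour data gap stays within the RPO, which shows that the two objectives have to be tracked separately.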

• Long Term (12+ months)

a. Complete and implement Disaster Recovery Plans.
   i. Distribute physical copies of the plans as needed and include at least one copy in an off-site location.
   ii. Integrate elements of response strategies into system architecture to facilitate their deployment in the event of a disaster.*

b. Disaster Recovery Committee should monitor changes in best practices and technology, update plans, and oversee organizational readiness.*
   i. Initiate staff training so that individuals are familiar with Disaster Recovery procedures and communication protocols.*
   ii. Institute regular tests of disaster preparedness with simulated disasters involving different components of HathiTrust operations.*
   iii. Establish a schedule for maintenance and revisions to the Disaster Recovery documents.*
   iv. Coordinate Disaster Recovery Plan implementation, training, and review with Indianapolis.*

c. Store an additional copy of backup tapes at a third site to increase geographic dispersion and limit the chance that a widespread event in Ann Arbor could impact both local copies.

d. Explore the possibility of establishing a third site for HathiTrust’s digital objects to increase geographic dispersion and address concerns over the relative geographical proximity of Indianapolis and Ann Arbor.

e. Determine the feasibility of moving operations to a “hot” site in Ann Arbor should a disaster render the MACC unusable.
   i. Identify suitable sites and consider making preliminary arrangements.
   ii. Identify and price out equipment/infrastructure necessary to continue operations.

f. Plan for integration of new system components should the sudden collapse of Isilon leave HathiTrust without service/support.

g. Consider an increase to system security measures as content becomes accepted from a wider range of sources and as HathiTrust becomes a higher-profile organization.


APPENDIX A: Contact Information for Important HathiTrust Resources

Indiana University Mirror Site

• Andrew Poland (Staff, Information Technology Services)
o [email protected]
o (317) 274-0746

• Troy Dean Williams (Vice President for Information Technology, IU at Bloomington)
o [email protected]
o (812) 856-5323

University of Michigan

Michigan Academic Computing Center (MACC): Houses much of the technical infrastructure of the University Library’s digital resources.

• Rene Gobeyn (MACC Data Center Coordinator)
o [email protected]
o (734) 936-2654

• ITCom UMNOC (Network Operations Center)
o [email protected]
o (734) 647-8888

ITCS-ITCom: Responsible for maintaining network connections to the UMnet Backbone and Internet; ITCom provides maintenance and support services for hardware and software.

• Mike Brower (Senior Project Manager, UM Libraries)
o [email protected]
o (734) 936-9736

• Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
o [email protected]
o (734) 647-3214

• ITCom UMNOC (Network Operations Center)
o [email protected]
o (734) 647-8888

Tivoli Storage Manager Group: Responsible for nightly automated tape backups of storage servers.

• Andrew Inman (Service Manager)
o [email protected]
o (734) 615-6286

• Cameron Hanover (Storage Engineer)
o [email protected]
o (734) 764-7019

• General Support: [email protected]

• Emergency contact: [email protected]
o Message will go to on-call staff’s pager in real time

• [email protected]

Arbor Lakes Data Facility: Houses one instance of the TSM backup tape library.

• ITCom UMNOC (Network Operations Center)
o [email protected]
o (734) 615-4209

• Ken Pritchard (ALDF facility manager)
o [email protected]
o (734) 615-2812

Procurement Services: Approves departmental purchases over $5,000; buyers also work as intermediaries with vendors.

• Steve Worden (UM Hardware Purchasing Specialist)
o [email protected]
o (734) 645-8972

• Shelly Eauclaire (Senior Buyer, Purchasing Services)
o [email protected]
o (734) 615-8767

• Ian Pepper (UM Dell Computers Contract Administrator)
o [email protected]
o (734) 647-4981

• Jeff Rabbitt (Alternate Dell Contract Administrator)
o [email protected]
o (734) 644-9232

Property Control: Responsible for tracking and tagging the university’s assets.

• Mary Ellen Lyon (Business Operation Manager)
o [email protected]
o (734) 647-3351 (t, th)
o (734) 763-1197 (m, w, f)

Office of Financial Analysis:

• David Storey (Inventory Coordinator): Delivers UM property tags to equipment at the MACC.
o [email protected]
o (734) 647-4264

Risk Management Services: Provides insurance coverage of university assets.

• Kathleen Rychlinski (Assistant Director, Risk Management Services)
o [email protected]
o (734) 763-1587

Non-University Contact Information

Isilon Systems

• Jim Ramberg (Regional Territory Manager)
o [email protected]
o Desk: (847) 330-6399
o Cell: (630) 561-2463

Sun Microsystems

• Christine Sluman (Service Sales Rep, Education)
o [email protected]
o (303) 557-3660, ext. 60519
o (303) 949-1567 (Cell)

• Larry Zimmerman (Michigan Account Manager, Sales)
o [email protected]
o (248) 880-3756

CDW-G

• University of Michigan Account Team
o [email protected]

• Hansen Chennikkra (Account Manager)
o [email protected]
o (866) 339-3639

• Adam Sullivan (Account Manager)
o [email protected]
o (866) 339-4118

Dell Computers

• Brian Ullestad (Higher Education Account Manager)
o [email protected]
o 1-800-274-7799 ext. 7249522


APPENDIX B: HathiTrust Outages from March 2008 through April 2009 62

• April 2009: HathiTrust experienced reduced performance from 11:00 pm EDT on Thursday, April 23 to 8:22 am EDT on Friday, April 24 due to a database problem at one of the sites and from 5:30 pm to 9:00 pm EDT on Thursday, April 30 due to unintended consequences from a networking configuration change.

• March 2009: HathiTrust was unavailable on Tuesday, March 3 from 7:00-8:00 am EST and on Thursday, March 5 from 7:00-7:45 am EST for operating system and database software upgrades.

• February 2009: On Sunday, February 22 at 8:40 am EST, a power surge resulting from electrical system maintenance caused HathiTrust database and web servers to go offline. Staff learned of the problem at approximately 6:00 pm EST, and service was restored by 6:30 pm EST.

• January 2009: A brief outage is scheduled in January for a storage system software upgrade.

• December 2008: On Friday, December 19 at 7:30 am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:40 am EST.

• November 2008: On Tuesday, November 4 at 7:30 am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:45 am EST.

• October 2008: No outages reported.

• September 2008: On Thursday, September 18 at approximately 9:30 am EDT, HathiTrust became inaccessible due to a software problem on a storage system; the problem was related to our work with data synchronization. Support was contacted and the problem was resolved at 10:45 am EDT.

• August 2008: On Tuesday, August 26 at approximately 9:00 am EDT, a database server was brought down to move to Indianapolis. Prior to shutting this server down, we did not update a manual failover configuration, causing volumes to be inaccessible to some users. The problem was resolved at 11:15 am EDT.

• July 2008: Service was unavailable on Thursday, July 31 from 7:00-7:30 am EDT for a storage system software upgrade.

• June 2008: No outages reported.

• May 2008: No outages reported.

• April 2008: No outages reported.

• March 2008: No outages reported.

62 HathiTrust. “Updates” from http://www.hathitrust.org/updates retrieved on 16 June 2009.
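The start and end times reported above make it possible to total the duration of incidents that were not scheduled maintenance. The sketch below transcribes those entries from the list; treating these five events as unscheduled is a reading of the list rather than a classification made by HathiTrust, two of the start times are reported only as approximate, and the April 2009 entries describe reduced performance rather than a full outage.

```python
from datetime import datetime

# (description, start, end) transcribed from the outage list above.
# Start and end of each entry share a time zone, so naive datetimes suffice.
INCIDENTS = [
    ("database problem (reduced performance)",
     datetime(2009, 4, 23, 23, 0), datetime(2009, 4, 24, 8, 22)),
    ("network configuration change (reduced performance)",
     datetime(2009, 4, 30, 17, 30), datetime(2009, 4, 30, 21, 0)),
    ("power surge",
     datetime(2009, 2, 22, 8, 40), datetime(2009, 2, 22, 18, 30)),
    ("storage system software problem",
     datetime(2008, 9, 18, 9, 30), datetime(2008, 9, 18, 10, 45)),
    ("failover configuration not updated",
     datetime(2008, 8, 26, 9, 0), datetime(2008, 8, 26, 11, 15)),
]


def minutes(start, end):
    """Length of an incident in whole minutes."""
    return int((end - start).total_seconds() // 60)


durations = {name: minutes(start, end) for name, start, end in INCIDENTS}
longest = max(durations, key=durations.get)
total_minutes = sum(durations.values())
print(longest, durations[longest], total_minutes)
```

Figures like these could serve as a baseline when HathiTrust later defines its Recovery Time Objective.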


APPENDIX C: Washtenaw County Hazard Ranking List

The following list ranks a variety of natural and manmade events within Washtenaw County, Michigan, based upon their frequency of occurrence and the extent of their potential impact (on the county’s population).

Rank | Hazard | Frequency | Population Impacted
1 | Convective weather (severe winds, lightning, tornados, hailstorms) | Once or more/yr. | 250,000
2 | Hazardous materials incidents: transportation | Once or more/yr. | 2,000
3 | Hazardous materials incidents: fixed site | Once or more/yr. | 10,000
4 | Severe winter weather hazards (ice/sleet/snow storms) | Once or more/yr. | 250,000
5 | Infrastructure failures | Once every 5 yrs. | 30,000
6 | Transportation accidents: air and land | Once or more/yr. | 100
7 | Extreme temperatures | Once every 5 yrs. | 10,000
8 | Flood hazards: riverine/urban flooding | Once every 10 yrs. | 2,000
9 | Nuclear attack | Has not occurred | 250,000
10 | Petroleum and natural gas pipeline accidents | Once every 10 yrs. | 1,000
11 | Fire hazards: wildfires | Once or more/yr. | 0

Source: Washtenaw County Hazard Mitigation Plan (available online at http://www.ewashtenaw.org/government/departments/planning_environment/planning/planning/hazard_html)


APPENDIX D: Annotated Guide to Disaster Recovery Planning References

The topic of disaster recovery planning for the print and analog resources of libraries has been widely dealt with in professional literature, but comparatively little information exists concerning the development and implementation of plans for the digital content of cultural institutions. The following bibliography details resources which provide guidance, examples, and explanations of the objectives and strategies for digital Disaster Recovery Plans. It consists primarily of material compiled by Lance Stuchell (ICPSR Intern) and Nancy McGovern (ICPSR Digital Preservation Officer) and is included here with their permission.

University of Michigan Resources

• University of Michigan Administrative Information Services (MAIS): Emergency Management, Business Continuity, and Disaster Recovery Planning.
o http://www.mais.umich.edu/projects/drbc_methodology.html
o This site broadly outlines the need for and functions of Emergency Management, Business Continuity, and Disaster Recovery Planning at UM. It also contains templates designed to help units plan, test, and audit disaster and continuity programs.

• Provost and Executive Vice President for Academic Affairs: Standard Practice Guide: Institutional Data Resource Management Policy
o http://spg.umich.edu/
o This policy defines institutional data resources as University assets and makes recommendations on identifying, preserving, and providing access to these assets. The digital resources of the library may be identified as such, based upon their use by departments across the university.

• ICPSR Disaster Planning Resources:
o Digital Preservation Officer Nancy McGovern is part of a Disaster Recovery initiative at ICPSR and over the past several years her team (including Lance Stuchell) has produced a variety of documents and templates to help other institutions work through the planning process.
o Documents are available upon request and should be posted in the near future (as of July 2009) to the ICPSR Web site (http://icpsr.umich.edu/).

• Disaster Recovery Experts:
o Rene Gobeyn (MACC Data Center Coordinator)
   Managed and coordinated Disaster Recovery for U.S. military data centers
   [email protected]
o Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
   Helped develop current ITCS Disaster Recovery plans
   [email protected]


ExternalResources

• GeneralGuidetoDisasterPlanningo ContingencyPlanningGuideforInformationTechnologySystems:Recommendationsof

theNationalInstituteofStandardsandTechnology,NISTSpecialPublication800‐34,June2002.

http://csrc.nist.gov/publications/nistpubs/800‐34/sp800‐34.pdf AnindispensableresourcewhichwasusedheavilybyICPSRinitsDisaster

Recoveryplanning.Itcoverseverythingfrominitialdatacollectionandpolicyformationtothestructureofdisasterresponseteamsandthearticulationofrecoverystrategies.

• Examples and Tools for the Documentation Outlined by NIST Guide:
  o Full Disaster Recovery Plan:
    ▪ United States Department of Agriculture Disaster Recovery and Business Resumption Plans
    ▪ http://www.ocio.usda.gov/directives/doc/DM3570-001.htm
  o Business Continuity Plan (BCP):
    ▪ MAIS: Emergency Management, Business Continuity, and Disaster Recovery Planning
    ▪ http://www.mais.umich.edu/projects/drbc_templates.html
    ▪ This site provides several resources that deal with continuity planning.
  o Continuity of Operations Programs (COOP):
    ▪ FEMA: Continuity of Operations (COOP) Programs
      • http://www.fema.gov/government/coop/index.shtm
      • Contains a lot of useful information on government policy, templates, and training resources to assist in the creation of a COOP.
    ▪ Ready.gov: Continuity of Operations Planning
      • http://www.ready.gov/business/plan/planning.html
      • Guidelines for composing a business COOP, including what outside actors should be involved in the planning process.
    ▪ The Florida Department of Health: Continuity of Operations Plan for Information Technology
      • http://www.naphit.org/global/library/basement_docs/FL_DisasterRecovery_template.doc
      • Lengthy (40 pages) and detailed COOP template written for an IT environment.
    ▪ Florida Atlantic University Libraries: Continuity of Operations Plan
      • http://www.staff.library.fau.edu/policies/coop-2007.pdf
      • A detailed working COOP, which includes reactions to specific disaster scenarios.
  o IT Contingency Plan:
    ▪ See the USDA Disaster Recovery Plan for an example of an IT Contingency Plan.
  o Cyber Incident Response Plan:
    ▪ Multi-State Information Sharing and Analysis Center Cyber Incident Response Guide
      • http://www.msisac.org/localgov/documents/FINALIncidentResponseGuide.pdf
      • The guide provides a step-by-step process for responding to incidents and developing an incident response team. It may also serve as a template for drafting a Cyber-Incident Response Policy and Plan.

  o Crisis Communication Plan:
    ▪ Ready.gov: Write a Crisis Communication Plan
      • http://www.ready.gov/business/talk/crisisplan.html
      • This site provides guidelines for composing a business disaster communication plan and includes suggestions for the plan’s Web presence.
    ▪ NC State University: Crisis Communication Plan
      • http://www.ncsu.edu/emergency-information/crisisplan.php
      • This is the policy and plan for the University as a whole. While much of this policy deals with communication at a high level, useful sections detail vital contacts within the organization (including who to contact first), and how to manage external communications.
    ▪ Other thorough university policies and plans include the LSU: Crisis Communication Plan and the Missouri S&T: Crisis Communication Plan.
    ▪ Heritage Microfilm Flood Update Email
      • This email was sent in response to the June 2008 flooding that occurred in the Midwest.
      • It updates clients on the outage of NewspaperArchive.com which resulted from a flood-induced widespread power failure. It is an excellent example of an external crisis communication to users.

  o Disaster Recovery Plans (DRP):
    ▪ The University of Iowa: IT Services Disaster Recovery Plan
      • http://cio.uiowa.edu/ITplanning/Plans/ITSdisasterPrep.shtml
      • This policy details the data collection and assessment which inform the UI plan and also includes emergency procedures, response strategies, and a crisis communication plan.
    ▪ University of Arkansas: Computing Services Disaster Recovery Plan
      • http://www.uark.edu/staff/drp/
      • A complete and thorough plan that outlines the initiation of emergency and recovery procedures, and addresses how the plan will be maintained.
    ▪ Adams State College (CO): Information Technology Disaster Recovery Plan
      • http://www.adams.edu/administration/computing/dr-plan100206.pdf
      • This plan has a thorough section on risk assessment.
    ▪ Digital Preservation Europe Repository Planning Checklist and Guidance
      • http://www.digitalpreservationeurope.eu/platter.pdf
      • Designed for use with the Planning Tool for Trusted Electronic Repositories (PLATTER), this document outlines considerations for a Disaster Recovery Strategic Objective Plan (SOP) and places them in context with other repository plans.

  o Occupant Emergency Plan (OEP):
    ▪ This requirement is addressed by University of Michigan Building Emergency Action Plans (EAP).
      • http://www.umich.edu/~oseh/guideep.pdf
  o Disaster Recovery Training Guides:
    ▪ dPlan.org
      • Provides useful information on training and an online form that would be useful in assigning trainers and monitoring the training process.
    ▪ CalPreservation.org: Disaster Plan Exercise
      • http://calpreservation.org/disasters/exercise.html
      • Provides roles and teaching points for a role-play training exercise that focuses on a disaster in a library.

• Policy Planning Tools:
  o Association of Public Treasurers of the United States and Canada: Disaster Policy Certification Guidelines
    ▪ www.aptusc.org/includes/getpdf.php?f=Disaster_Policy.pdf
    ▪ This planning document and template for disaster management policies provides outlines and example language on several facets of a strong policy, including the possible loss of a building, the replacement of computer resources, and testing and training for the disaster plan. It also outlines the need to identify possible threats to assets.

• Examples of Disaster Planning Policies:
  o Arkansas Secretary of State: Disaster Planning Policy
    ▪ http://www.sos.arkansas.gov/elections/elections_pdfs/register/oct_reg/016.14.01-020.pdf
    ▪ This policy outlines areas of responsibility between departments and units, and includes training, communication, and recovery plan updates.
  o Washington State Department of Information Services: Disaster Recovery and Business Resumption Planning Policy
    ▪ http://isb.wa.gov/policies/portfolio/500p.doc
    ▪ This document illustrates policy formation for an IT Disaster Recovery Plan. It provides guidelines for Disaster Recovery Planning as well as maintenance, testing, and training involved with the recovery plan.
  o Florida State University: Information Technology Disaster Recovery and Data Backup Policy
    ▪ http://oti.fsu.edu/oti_pdf/Information%20Technology%20Disaster%20Recovery%20and%20Data%20Backup%20Policy.pdf
    ▪ This document includes policy for data backup as well as Disaster Recovery. Part of the policy includes a definition of Best Practice Disaster Recovery Procedures, as well as an outline of the university’s own IT recovery planning and implementation procedures.

• Example of a Relevant Disaster Planning Program:
  o OCLC Digital Archive Preservation Policy and Supporting Documentation
    ▪ http://www.oclc.org/support/documentation/digitalarchive/preservationpolicy.pdf
    ▪ This document has a clear articulation of OCLC's disaster policy, along with an outline of disaster prevention and recovery procedures and a time-frame for the restoration of services in the event of a disaster.
    ▪ The policy includes a good definition of a disaster prevention and recovery plan: “A set of responses based on sound principles and endorsed by senior management, which can be activated by trained staff with the goal of preventing or reducing the severity of the impact of disasters and incidents.”
    ▪ OCLC embeds its disaster plan within its overall preservation policy, stating: “The goal of disaster prevention is to safeguard the data (content and metadata) in the Digital Archive and to safeguard the Digital Archive’s software and systems. For disaster prevention and recovery, all data (content and metadata) is considered of equal value.”

• Designing a Disaster Planning Program:
  o Michigan State University: Step by Step Guide to Disaster Recovery Planning
    ▪ http://www.drp.msu.edu/Documentation/StepbyStepGuide.htm
    ▪ This program breaks down the disaster planning process into steps, and provides information relevant to individual units within a university setting. The MSU Disaster Recovery Planning Homepage (http://www.drp.msu.edu/) also offers a variety of resources.
  o Minnesota State Archives: Disaster Preparedness
    ▪ http://www.mnhs.org/preserve/records/docs_pdfs/disaster_000.pdf
    ▪ This document is a detailed guide to the disaster planning process. While mostly dealing with paper records, the document clearly identifies different roles and responsibilities for members of the planning and recovery team.
  o Cisco Systems: Disaster Recovery Best Practices White Paper
    ▪ http://www.cisco.com/warp/public/63/disrec.pdf
    ▪ The paper outlines Disaster Recovery using the framework of the above resources, but tailors it to an IT point of view. It has useful information on how to prepare and recover both hardware and software assets.
  o AT&T: Key Elements to an Effective Business Continuity Plan
    ▪ http://www.business.att.com/content/article/Key_to_Effective_BC_Plan.pdf
    ▪ A short paper that summarizes business continuity planning in the private sector.

• General Information
  o Federal Emergency Management Agency: Emergency Management Guide for Business & Industry
    ▪ http://www.fema.gov/business/guide/index.shtm
    ▪ A practical guide with step-by-step advice on creating a Disaster Recovery program. Includes information on the formation of a planning committee, organizational analysis, and details on specific hazards.
  o Special Libraries Association Information Portal: Disaster Planning and Recovery
    ▪ http://www.sla.org/content/resources/infoportals/disaster.cfm
    ▪ An exhaustive list of resources, this page includes articles on digital disaster recovery strategies as well as information on planning, examples of plans, and links to a wide range of resources in the public and private sector.

Written Resources:

• Wellheiser, Johanna and Jude Scott. An Ounce of Prevention: Integrated Disaster Planning for Archives, Libraries, and Record Centres. Lanham, MD: Scarecrow Press, 2002.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004233950&local_base=AA_PUB

• Cox, Richard J. Flowers After the Funeral: Reflections on the Post-9/11 Digital Age. Lanham, MD: Scarecrow Press, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004341258&local_base=AA_PUB

• Matthews, Graham and John Feather, eds. Disaster Management for Libraries and Archives. Burlington, VT: Ashgate, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004354795&local_base=AA_PUB


APPENDIX E: Overview of the Disaster Recovery Planning Process

Various resources agree that there is no one way to go about initiating a Disaster Recovery program or drafting a DR plan. An organization must proceed according to its functions and resources as well as the needs of its designated community of users. The following discussion draws heavily upon the ICPSR Disaster Planning Policy Framework (written by Nancy McGovern and Lance Stuchell) and the Contingency Planning Guide for Information Technology Systems published by NIST (2002). As such, it represents a consolidation and simplification of information presented in more depth elsewhere. A list of planning resources (with link information to full texts) is available in Appendix D.

• Basic Precepts of Disaster Recovery Planning

  1) Disaster Recovery Planning is a continuous activity that involves monitoring internal conditions as well as evolutions in technology and threats; responding to new developments that arise; revising plans so that they remain relevant and effective; training staff according to plans; and testing organizational readiness.
    a. There is no single document which contains “the plan”; rather, a Disaster Recovery Plan consists of a suite of documents that require a regular schedule of testing and revision to be effective.
    b. There is no point at which a Disaster Recovery Plan is “finished.”

  2) Disaster Recovery Planning needs to be an organization-wide activity.
    a. Disaster recovery must be one of the basic functions of HathiTrust.
    b. An effective plan needs full administrative support.
    c. Policies and procedures must complement and conform to disaster response plans established by the university, city, and Department of Homeland Security.

  3) Disaster recovery cannot be limited to the hardware and software components or data collections of HathiTrust; planning must also account for the impact of human emergencies on the repository’s operations.

• Essential Steps in Disaster Recovery Planning

  1) Establish a Disaster Recovery Planning Committee.
    a. This group will research and develop the plan and help with its implementation as well as monitor the training, testing, and revising of plans to ensure organizational compliance and readiness.
    b. The committee should involve individuals representing the various mission-critical units within the library (from administration to Core Services to the Digital Preservation Librarian) who will participate in the development of policy and recovery planning.
    c. It is essential that the committee involve individuals with the authority to support and enforce recommendations.
    d. The committee’s activities should initiate the formation of a Disaster Response Program.

  2) Draft a Disaster Recovery Planning Policy Statement.
    a. Enables the organization, and others, to understand the scope and nature of the Disaster Recovery Plan.
    b. Establishes the organizational framework and responsibilities for the planning process.
    c. Key policy elements (as detailed in the NIST report):
      i. Roles and responsibilities within the organization in regard to planning
      ii. Mandate for Disaster Recovery as well as any statutory or regulatory requirements
      iii. Scope as it applies to the type(s) of platform(s) and organizational functions subject to Disaster Recovery Planning
      iv. Resource requirements for the Disaster Recovery program
      v. Training requirements
      vi. Exercise and testing schedules (at least one major annual test)
      vii. Plan maintenance schedule (elements should be reviewed annually)
      viii. Frequency of backups and storage of backup media.
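The scheduling elements of the policy statement (at least one major test per year and an annual review of plan elements) can be tracked with a very small reminder routine. The following is a minimal sketch, not a HathiTrust tool: the activity names `major_test` and `plan_review`, the 365-day intervals, and the function name are assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical intervals: the policy elements above call for at least one
# major test per year and an annual review of plan elements.
INTERVALS = {
    "major_test": timedelta(days=365),
    "plan_review": timedelta(days=365),
}


def overdue_activities(last_done, today):
    """Return the sorted names of activities whose interval has lapsed.

    `last_done` maps an activity name to the date it was last performed.
    """
    overdue = []
    for name, interval in INTERVALS.items():
        last = last_done.get(name)
        # An activity that has never been performed counts as overdue.
        if last is None or today - last > interval:
            overdue.append(name)
    return sorted(overdue)
```

The Disaster Recovery Planning Committee could run such a check on a fixed schedule; the dates themselves would have to come from whatever records the committee keeps of tests and reviews.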

  3) Conduct Data Collection and Analysis (i.e., a “Business Impact Analysis”).
    a. Determine critical functions and identify specific system resources required to perform them. Minimum requirements for functionality should be established.
    b. Determine risks and vulnerabilities facing the repository’s systems and infrastructure.
    c. Identify and coordinate with internal and external points of contact to determine how they depend on or support the repository and its functions; consider how one failure might cascade into others.
      i. Identify resources that are crucial to HathiTrust (e.g., Mirlyn)
      ii. Determine the allowable outage/disruption time for these resources
    d. Develop recovery priorities; balance the cost of inoperability against the cost of recovery.
      i. Determine HathiTrust’s position within the priorities of the university as well as with its major service providers and vendors (e.g., TSM Group, ITCom, Isilon, etc.) to better understand how that prioritization will impact recovery efforts.
      ii. Establish the most crucial functions which must be restored first.
      iii. Determine HathiTrust’s Recovery Time Objective (RTO, i.e., the maximum allowable outage period) and Recovery Point Objective (RPO, i.e., the point in time to which data files must be restored after a disaster).
      iv. Review potential resources (financial, personnel, etc.) within HathiTrust as well as those available via contracts, service providers, and product support. This step should involve the clarification of HathiTrust’s position within the university’s as well as key service providers’ and vendors’ priorities.
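Once an RTO and RPO have been set, any real or simulated incident can be checked against them. The sketch below is illustrative only. It assumes the RPO is expressed as the maximum tolerable gap between the last good backup and the start of the outage, which is one common way of turning "the point in time to which data files must be restored" into a measurable quantity. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum allowable outage period
    rpo: timedelta  # assumed: maximum tolerable age of the restored data


def meets_objectives(obj, outage_start, service_restored, last_good_backup):
    """Check one (real or simulated) incident against the RTO and RPO."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_backup
    return {
        "rto_met": downtime <= obj.rto,
        "rpo_met": data_loss_window <= obj.rpo,
    }


# Hypothetical figures for illustration; actual objectives would come out
# of the data collection and analysis described above.
objectives = RecoveryObjectives(rto=timedelta(hours=48), rpo=timedelta(hours=24))
start = datetime(2009, 8, 24, 9, 0)
result = meets_objectives(
    objectives,
    outage_start=start,
    service_restored=start + timedelta(hours=30),
    last_good_backup=start - timedelta(hours=26),
)
```

In this invented incident the service returns within the 48-hour RTO, but the last backup is 26 hours old, so the 24-hour RPO is missed. A result of that kind would point toward revisiting backup frequency.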

  4) Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.

  5) Develop recovery strategies that respond to the potential impacts and maximum allowable outage times established in the data collection phase. Efforts should focus on solutions that are cost-effective and technically viable.
    a. Strategies should be designed to bring core functions back online as soon as possible within an established cost range.
    b. Recovery efforts must be prioritized according to the nature of core functions as well as logical order of procedures.
    c. Alternative solutions should be considered based upon cost, availability of resources, outage times, levels of functionality (partial vs. full), and ability to integrate methods with existing infrastructure.
    d. Determine the practicality of partial (vs. full) recovery in order to bring services back online in a timely and cost-effective manner.
    e. Recovery strategies and resources should be incorporated (as possible) into the repository’s system architecture so that in the event of a disaster, the response may proceed in an efficient and straightforward manner.

  6) Formalize and record collected data and recovery strategies in Disaster Recovery Documents. In the process of producing this wide range of documents, an organization is forced to consider and document policies and procedures related to a variety of key administrative and technical issues. The decision of which plans to include (and which to exclude) must be determined based upon a review of HathiTrust’s needs and objectives. Additional documents (a Web continuity plan, for instance) may be necessary based upon data collection and analysis.
    a. Business Continuity Plan
      i. Business continuity is the ability of a business to continue its operations with minimal disruption or downtime in the event of natural or manmade disasters.
      ii. Such planning allows an organization to ensure its survival by considering potential business interruptions and establishing appropriate, cost-effective responses.
      iii. The Business Continuity Plan details HathiTrust’s core functions and the priorities for re-establishing each in the event of a disruption. It should address key administrative and support functions as well as those which directly involve the repository’s designated community.
      iv. The plan should thoroughly document the nature of key functions, interdependences, the impact of their loss, and alternative means to ensure their continuation in the event of a disaster. MAIS offers a useful Business Continuity planning template at http://www.mais.umich.edu/projects/drbc_templates.html.
    b. Continuity of Operations Plan (COOP)
      i. The COOP focuses on restoring an organization’s (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.
      ii. This plan may include the Business Continuity Plan and Disaster Recovery Plan as appendices.

    c. IT Contingency Plan
      i. The IT Contingency Plan addresses disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
      ii. It should account for the following:
        1. Document hardware and software
        2. Develop an emergency contact list
        3. Back up and store all data files off-site
        4. Proactively monitor equipment and data
        5. Install and update antivirus software on both computers and servers
        6. Develop recovery scenarios
        7. Communicate and monitor the plan
      iii. The plan allows HathiTrust to formalize and document procedures and policies already in place and details the repository’s adherence to these goals.

    d. Crisis Communications Plan
      i. Communication is a vitally important aspect of Disaster Recovery Planning and an organization’s actual response in a disaster.
      ii. The Crisis Communications Plan establishes procedures for internal and external communications during and after an emergency.
      iii. The different phases of crisis communication encompass the initial notification of an event, damage assessment, and plan activation as well as status reports (as needed) and the eventual completion of recovery efforts.
      iv. Activation of the communications plan must be the responsibility of a specific individual.
      v. The Disaster Response Team coordinates with the Crisis Communication Team to ensure that information provided about an emergency is clear, concise, and consistent.
    e. Cyber-Incident Response Plan
      i. This plan defines the procedures for responding to cyber attacks against the HathiTrust IT system.
      ii. It provides a formal framework for the identification and mitigation of, and recovery from, malicious computer incidents, such as unauthorized access to a system or data, denial of service, or unauthorized changes to system hardware, software, or data.
    f. Occupant Emergency Plan
      i. The Occupant Emergency Plan defines response procedures for library staff in the event of a situation that poses a potential threat to the health and safety of personnel, the environment, or HathiTrust property.
      ii. HathiTrust may utilize the framework provided by UM Building Emergency Action Plans for this element.

    g. Disaster Recovery Plan
      i. The primary focus of the Disaster Recovery Plan is the restoration of core information systems, applications, and services.
      ii. The plan brings together guidance and procedures from the other plans (e.g., Business Continuity Plan, IT Contingency Plan, Crisis Communications Plan, etc.) pertaining to emergencies that result in interruptions of service that exceed acceptable downtimes, as defined in the BCP.
      iii. The plan should detail established recovery strategies for specific disaster situations as well as the teams involved in their execution.
      iv. Personnel should be chosen to staff disaster response teams based on their skills and knowledge. Ideally, teams would be staffed with the personnel responsible for the same or similar operation under normal conditions. It is also important that team members be familiar with the goals and procedures of other teams to facilitate inter-team coordination. Each team is led by a team leader (with a suitable alternate) who directs overall team operations and acts as the team’s representative to management and liaison to other team leaders. Disaster Response cannot be individual-specific or overly reliant on specific people. Teams must assign each role at least one alternate in the event that core people are unavailable at the time of a disaster.
      v. NIST suggests that a capable strategy will require some or all of the following functional groups. For HathiTrust, many of these are already in place in the form of University of Michigan units and service providers.
        1. An authoritative role for overall decision-making responsibility
        2. Senior Management Official
        3. Management Team
        4. Damage Assessment Team
        5. Operating System Administration Team
        6. Systems Software Team
        7. Server Recovery Team (e.g., client server, Web server)
        8. LAN/WAN Recovery Team
        9. Database Recovery Team
        10. Network Operations Recovery Team
        11. Application Recovery Team(s)
        12. Telecommunications Team
        13. Hardware Salvage Team
        14. Alternate Site Recovery Coordination Team
        15. Original Site Restoration/Salvage Coordination Team
        16. Test Team
        17. Administrative Support Team
        18. Transportation and Relocation Team
        19. Media Relations Team
        20. Legal Affairs Team
        21. Physical/Personnel Security Team
        22. Procurement Team (equipment and supplies)
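The requirement that every role on a response team have at least one alternate is easy to verify mechanically once a roster exists. The following is a minimal sketch under assumed conventions: the roster layout (team name, then role name, then a list of people with the primary listed first), the function name, and the sample names are all hypothetical.

```python
def roles_without_alternate(roster):
    """Return the sorted (team, role) pairs that have no usable alternate.

    `roster` maps team name -> role name -> list of people, primary first.
    """
    gaps = []
    for team, roles in roster.items():
        for role, people in roles.items():
            primary = people[0] if people else None
            # An alternate must be a different person from the primary.
            alternates = {p for p in people[1:] if p != primary}
            if primary is None or not alternates:
                gaps.append((team, role))
    return sorted(gaps)


# Hypothetical roster using two of the functional groups listed above.
sample_roster = {
    "Damage Assessment Team": {
        "team leader": ["A. Primary", "B. Alternate"],
        "assessor": ["C. Primary"],
    },
    "Test Team": {
        "team leader": ["D. Primary", "D. Primary"],
    },
}
gaps = roles_without_alternate(sample_roster)
```

Here the assessor role and the Test Team leader role would be flagged, the latter because the listed alternate is the same person as the primary. A check of this kind covers only the roster itself; whether alternates are actually trained and reachable still has to be confirmed through drills.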

    h. Disaster Recovery Training Plan
      i. This plan will establish the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.
      ii. The contents of the plan should reflect the range of responsibilities divided among administrators, department heads, and staff within HathiTrust.
      iii. The plan should accommodate Disaster Recovery Planning Committee members as well as those of the Disaster Response Team. For the latter, it should identify key roles and responsibilities in recovery efforts.
      iv. The plan should allow in-house training to be supplemented by external opportunities.
      v. Regularly scheduled emergency drills should also be included to test the readiness of staff and the appropriateness of response procedures.

  7) Implement elements developed in the planning process. Procedures and policies related to communication, technological solutions, etc. must be incorporated into HathiTrust’s overall design and operation so that Disaster Recovery becomes a critical organizational function.

  8) Institute a regular program of training and testing to be sure that staff understand and accept policies and procedures and to ensure that HathiTrust is prepared for a disaster.

  9) Conduct regular review and maintenance of Disaster Recovery documents to respond to changes in personnel, organizational structure or functions, and evolutions in technology and/or threats.

• Main Phases in a Disaster Response:

  1) Notification/Activation: This phase covers the initial actions once a situation has been detected or is threatened. It includes damage assessment and the implementation of an appropriate response strategy.
    a. Proper diagnosis and communication (both internal and external) of a disaster is essential.
    b. The nature of individual events will determine who needs to be involved (e.g., facilities management, core services, etc.).

  2) Recovery: This phase focuses on the return to a pre-established level of functionality (plans should detail partial as well as full recoveries).
    a. Response teams implement recovery strategies and adhere to procedures and protocols outlined in Disaster Recovery Documents.

  3) Reconstitution: After recovery efforts are complete, normal operations must be restored. This may involve the reconstruction of facilities and/or infrastructure as well as the testing of restored elements to ensure their full functionality.

APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008) (Right click to open the Adobe Document Object located below)

APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard SA (2006) (Right click to open the Adobe Document Object located below)

APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009) (Right click to open the Adobe Document Object located below)

APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006) (Right click to open the Adobe Document Object located below)