7/27/2019 Rapport d'HathiTrust sur un plan de sauvegarde des donnes informatiques en cas de sinistre.
HathiTrust is a Solution
The Foundations of a Disaster Recovery Plan for the Shared Digital Repository

This report serves as
recommendations made by
Michael J. Shallcross,
2009 Digital Preservation Intern
University of Michigan
School of Information
Executive Summary

This report seeks to establish the framework of a Disaster Recovery Plan for the HathiTrust Digital Library. While professional best practices and institutional needs have provided a clear mandate for HathiTrust's Disaster Recovery Program, common parlance has often obscured two prominent features of such initiatives. First, a Disaster Recovery Plan is actually comprised of a suite of documents which detail a range of issues, from crisis communications and the continuity of administrative activities to the restoration of hardware and data. Second, there is no conclusion to the planning process; it is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance.

The primary goal of the present document is to provide a foundation on which future planning efforts may build. To that end, it examines the strategies by which HathiTrust has anticipated and mitigated the risks posed by ten common scenarios which could precipitate a disaster:
o Hardware failure and data loss
o Network configuration errors
o External attacks
o Format obsolescence
o Core utility or building failure
o Software failure
o Operator error
o Physical security breach
o Media degradation
o Manmade as well as natural disasters.

As this list reveals, a disaster within the digital repository refers not merely to data loss, the destruction of equipment, or damage to its environment, but to any event which has the potential to cause an extended service outage. For each scenario, the report discusses possible threats, summarizes the potential severity of related events, and then details solutions HathiTrust has enacted through direct quotations from the HathiTrust Web site and TRAC self-assessment, Service Level Agreements, and literature from service providers and vendors. Attached appendices provide relevant information and include contacts for important HathiTrust resources, an annotated guide to Disaster Recovery Planning references, and an overview of key steps in the Disaster Recovery Planning process.

The concluding section of the report provides recommendations and action items for HathiTrust as it proceeds with its Disaster Recovery Initiative. These are divided into Short (0-6 mos.), Intermediate (6-12 mos.) and Long-Term (12+ mos.) objectives and are arranged in a suggested order of accomplishment.
o Short-term goals include:
   - Describing the nature and extent of HathiTrust's insurance coverage
   - Testing and validation of current tape backup procedures
   - Improved physical and intellectual control over system hardware
   - Establishment, distribution, and maintenance of phone trees
   - Increased documentation of institutional knowledge
   - Identification of Disaster Recovery measures in place at the Indianapolis site.
o Intermediate-term objectives focus on:
   - Creation of a Disaster Recovery Planning Committee
   - Initiation of the data collection and analysis essential to the creation of recovery strategies (This section provides a high-level breakdown of various tasks and includes the coordination of activities between the Ann Arbor and Indianapolis sites as well as with service providers and vendors.)
o Long-term action items deal with:
   - Completion and implementation of the suite of Disaster Recovery documents
   - Initiation of staff training and tests of organizational compliance.
   - Storage of an additional copy of backup tapes at a remote third location
   - Investigation of an alternate hot site in Ann Arbor in the event a disaster renders the MACC unusable
   - Consideration of a third instance of the repository
   - Avoidance of vendor lock-in if a key supplier should go out of business.

This report demonstrates that various risk management strategies, design elements, operating procedures, and support contracts have endowed HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a disaster. The establishment of the Indianapolis mirror site, the performance of nightly tape backups to a remote location, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. Unfortunately, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.
Acknowledgements

The author would like to thank Shannon Zachary for her encouragement and guidance; Cory Snavely and Jeremy York for their generous expenditure of time, energy, and knowledge; and Nancy McGovern and Lance Stuchell for access to their outstanding Disaster Recovery Planning resources. The following individuals have also been invaluable sources of advice, support, and information: John Wilkin, Bob Campe, Cyndi Mesa, Ann Thomas, John Weise, Larry Wentzel, Lara Unger Syrigos, Bill Hall, Emily Campbell, Sebastien Korner, Jessica Feeman, Phil Farber, Chris Powell, Cameron Hanover, Stephen Hipkiss, Tim Prettyman, Rene Gobeyn, and Krystal Hall. Thanks also to Dr. Elizabeth Yakel, Magia Krause, and Veronica and Cora Fambrough. The work in this report was made possible by an IMLS Grant.
Table of Contents

Executive Summary
Acknowledgements
Introduction
o Goals for HathiTrust's Disaster Recovery Program
o The Mandate for Disaster Recovery Planning in Digital Preservation
o Disaster Preparedness in the Design and Operation of HathiTrust
o Essential HathiTrust Business Functions
HathiTrust's Disaster Recovery Strategies
o Basic Requirements for Disaster Recovery
o Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites
o Disaster Recovery Strategy #2: Nightly Automated Tape Backups
Scenario 1: Hardware Failure or Obsolescence and Data Loss
o Review: Risks Involving Hardware Failure or Obsolescence and Data Loss
o HathiTrust's Solutions for Hardware Failure and Data Loss
o Redundant Components and Single Points of Failure in the HathiTrust Infrastructure
o Key Features of HathiTrust's Isilon IQ Clustered Storage
o Hardware Support and Service
o Equipment Tracking
o Hardware Replacement Schedule
o Timeline for Emergency Replacement of HathiTrust Infrastructure
o HathiTrust and Insurance Coverage at the University of Michigan
Scenario 2: Network Configuration Errors
o Review: Risks Involving Network Configuration Errors
o HathiTrust's Solutions for Network Configuration Errors
o Extent of ITCom Support
o ITCom Responsibilities
o ITCom Services in Response to Outages or Degradation Impacting the Network
o HathiTrust Responsibilities
Scenario 3: Network Security and External Attacks
o Review: Risks Involving Network Security and External Attacks
o HathiTrust's Solutions for Network Security
Scenario 4: Format Obsolescence
o Review: Risks Involving Format Obsolescence
o HathiTrust's Solutions for Format Obsolescence
o Selection of File Formats
o Format Migration Policies and Activities
Scenario 5: Core Utility and/or Building Failure
o Review: Risks Involving Core Utility or Building Failure
o HathiTrust's Solutions for Utility or Building Failure
o General Maintenance and Repairs in University of Michigan Facilities
o The Michigan Academic Computing Center (MACC)
o Arbor Lakes Data Facility (ALDF)
Scenario 6: Software Failure or Obsolescence
o Review: Risks Involving Software Failure or Obsolescence
o HathiTrust's Solutions for Software Issues
Scenario 7: Operator Error
o Review: Risks Involving Operator Error
o HathiTrust's Solutions for Operator Error
o Ingest
o Archival Storage
o Dissemination
o Data Management
Scenario 8: Physical Security Breach
o Review: Risks Involving a Physical Security Breach
o HathiTrust's Solutions for Physical Security
o Security at the MACC
o Security at the ALDF
Scenario 9: Natural or Manmade Disaster
o Review: Risks Involving a Natural or Manmade Disaster
o HathiTrust's Solutions for Natural or Manmade Catastrophic Events
o Basic Disaster Recovery Strategies
Scenario 10: Media Failure or Obsolescence
o Review: Risks Involving Media Failure or Obsolescence
o HathiTrust's Solutions for Media Failure
o Remaining Vulnerabilities
Conclusions and Action Items
o Conclusions
o Short-Term Action Items
o Intermediate-Term Action Items
o Long-Term Action Items
APPENDIX A: Contact Information for Important HathiTrust Resources
APPENDIX B: HathiTrust Outages from March 2008 through April 2009
APPENDIX C: Washtenaw County Hazard Ranking List
APPENDIX D: Annotated Guide to Disaster Recovery Planning References
APPENDIX E: Overview of the Disaster Recovery Planning Process
APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard Service Agreement (2006)
APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)
APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)

**Appendices F-I are embedded PDF files.**
20090824 1
Introduction

In the realm of print libraries, a disaster is a fairly unambiguous event: it is a fire, a broken pipe, an infestation of pests; in short, anything which threatens the continued use and existence of texts or the environment in which they are stored. This basic definition may also be applied to the digital library, in which a disaster refers not merely to the loss of content or corruption of data, the destruction of equipment or damage to its environment, but to any event which has the potential to cause an extended service outage. This last part proves to be the greatest difference between the print and digital worlds because there are a great many threats which can leave data intact but incapacitate the primary functions of a digital library. The daily operation of an institution such as HathiTrust involves the anticipation and resolution of a variety of problems (crashed servers, software bugs, networking errors, etc.) which only rise to the level of a disaster when they exceed the capacity of normal operating procedures and/or the maximum allowable outage periods. Disaster Recovery Planning thus prompts us to develop robust strategies to mitigate and limit the effects of common problems and at the same time forces us to think the unthinkable. Nevertheless, confronting worst-case scenarios is a vital activity; the belief that an event will never happen simply because it has never happened is an invitation to the very disaster we seek to avoid. Herein lies a conundrum, in that the creation of detailed plans for every eventuality is nearly impossible and also impractical, since the results of such an endeavor would be needlessly complex as well as expensive. At its basis, then, Disaster Recovery Planning demands an astute assessment of risk so that we may weigh the costs of preparations and solutions against the costs of a potential event.

So where to begin? When the subject of Disaster Recovery Planning arises, common parlance often obscures two prominent features of such initiatives. First, a Disaster Recovery Plan is actually comprised of a suite of documents which detail a variety of related issues, from crisis communications and the continuity of administrative activities to the recovery of hardware and data and the restoration of core functions. Second, there is no conclusion to the planning process or a point at which a plan is "done"; there is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance. The essential first step is therefore a thorough knowledge of the organization, its goals, and its mandate for a Disaster Recovery Program so that later efforts can focus on the articulation of policies and the development of solutions. As a preliminary step in this effort, this report looks to establish a basic foundation from which future planning efforts may grow.

Goals for HathiTrust's Disaster Recovery Program

While a more formal statement of HathiTrust's goals and requirements for its Disaster Recovery Program must be elucidated, the repository's mission statement provides a good indication of its main objective in the formation of a Disaster Recovery Plan. As part of its aim "to contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge," HathiTrust seeks "to help preserve these important human records by creating reliable and accessible electronic representations."[1] This statement clearly joins the twin imperatives of preservation and access with an additional requirement: reliability. The development and implementation of a Disaster Recovery Plan will ensure that digital objects will retain their authenticity and integrity over the long term and that partner libraries and designated users may rely on HathiTrust services (or their timely resumption) and content in the face of catastrophic events.

[1] HathiTrust. Mission & Goals (2009) retrieved from http://www.hathitrust.org/mission_goals on 8 July 2009.
The Mandate for Disaster Recovery Planning in Digital Preservation

HathiTrust's mandate for a comprehensive and proactive Disaster Recovery Plan stems from a number of significant sources, among which we may include its mission and goals. The Institutional Data Resource Management Policy (2008) of the University of Michigan's Standard Practice Guide also provides an impetus for the creation of a Disaster Recovery Program. While not necessarily inclusive of the Michigan Digitization Project materials stored in HathiTrust, this document underscores how important it is that data resources be safeguarded [and] protected and contingency plans [...] be developed and implemented.[2] In its discussion of the latter point, the policy specifies that:

   Disaster Recovery/Business Continuity plans and other methods of responding to an emergency or other occurrences of damage to systems containing institutional data [...] will be developed, implemented, and maintained. These contingency plans shall include, but are not limited to, data backup, Disaster Recovery, and emergency mode operations procedures. These plans will also address testing of and revision to disaster recovery/business continuity procedures and a criticality analysis.[3]

While data backup procedures and a host of risk management practices are already an integral part of HathiTrust's operation, the repository now looks to formalize the other strategies suggested by the Institutional Data Resource Management Policy. Beyond the example laid out by this document, HathiTrust's mandate for Disaster Recovery derives from the professional literature detailing best practices in the field of digital preservation. The Reference Model for an Open Archival Information System (OAIS) identifies Disaster Recovery as an essential component of its Archival Storage function and highlights the importance of such plans in achieving the goal of long-term preservation of a digital archive's holdings. As outlined in the OAIS document, the Disaster Recovery function provides a mechanism for duplicating the digital contents of the archive collection and storing the duplicate in a physically separate facility.[4] HathiTrust has successfully met this requirement by performing nightly tape backups and establishing a mirror site at Indiana University in Indianapolis. The Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007), or TRAC, is even more explicit in its requirement that repositories document their policies and procedures with suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).[5] Professional best practices as well as internal needs and goals thus provide the mandate which underlies HathiTrust's development of a formal Disaster Recovery Plan.
[2] University of Michigan. Institutional Data Resource Management Policy (2008) Standard Practice Guide, retrieved from http://spg.umich.edu/ on 8 July 2009.
[3] Ibid.
[4] Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (2002) p. 4-8.
[5] OCLC and CRL. Section C3.4, Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.

Disaster Preparedness in the Design and Operation of HathiTrust

One of the primary goals of HathiTrust is to provide transparency in all of its operations, including its work to comply with digital preservation standards and review processes.[6] Nowhere is this commitment more clear than in its efforts to anticipate and mitigate risks which could threaten the contents and functions of the Shared Digital Repository. As a first step in addressing the disaster preparedness requirement in section C3.4 of the TRAC Criteria and Checklist,[7] this document serves two purposes. First, it provides an overview of the policies, procedures, resources and contracts that enable HathiTrust to address the challenges and threats endemic to the field of digital preservation. Material is therefore cited directly from the HathiTrust Web site (http://www.hathitrust.org), the most recent version of HathiTrust's review of its compliance with the minimum required elements of the TRAC Criteria and Checklist,[8] and relevant literature provided by key vendors and service providers.[9] Second, this report examines HathiTrust's current level of disaster preparedness and defines current and forthcoming efforts in its development of a dynamic and proactive Disaster Recovery Program. Per the recommendations of the TRAC Criteria and Checklist, this document records the measures and precautions already in place in regards to specific types of disasters that could befall HathiTrust. These events include hardware failure, data loss, network configuration errors, external attacks, core utility failure, format obsolescence, software failure, physical security breach, and manmade as well as natural disasters. While a formal, written plan detailing individual roles and responsibilities in the repository's response to each of these scenarios is still forthcoming, the evidence gathered in this report reveals that crucial elements of a Disaster Recovery Plan are already in place within HathiTrust.[10]

[6] HathiTrust. Accountability (2009) retrieved from http://www.hathitrust.org/accountability on 25 June 2009.
Essential HathiTrust Business Functions

As the development of the Disaster Recovery Plan proceeds, it is important to bear in mind that its goal is not merely the restoration of hardware and data but also the recovery and continuity of essential repository functions. The following list represents core functions that need to be addressed by HathiTrust's Disaster Recovery Plan and as such should not be considered a comprehensive representation of the repository's functions. By directing planning efforts toward specific functions (rather than the organization's activities as a whole), HathiTrust may prioritize and focus its recovery responses and resources to ensure that the most essential functions go back online first. Subsequent discussion of Disaster Recovery strategies and risk management solutions in this report are presented under the assumption that the continuity of these functions is a primary objective. The prioritization of these functions remains to be determined by an appropriate authority.[11]
[7] "Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s). The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan and the specific risks addressed need to be appropriate to the repository's location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building." OCLC and CRL. Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
[8] HathiTrust Digital Library Review of Compliance with Trustworthy Repositories Audit & Certification: Criteria and Checklist Minimum Required Elements, revised May 20, 2009. Available at http://hathitrust.org/documents/trac.pdf
[9] Contact information for relevant University of Michigan departments and service providers as well as for external vendors may be found in Appendix A.
[10] A list of resources related to disaster recovery and the planning process may be found in Appendix D (Annotated List of Disaster Recovery Planning Resources).
[11] This list of essential HathiTrust business functions was developed in conjunction with Jeremy York.
o Ingest
   - Ingest digital objects (SIPs) via GRIN, the Google Return Interface (or a modified ingest portal for local content)
   - Validate ingested content with GROOVE, the Google Return Object-Oriented Validation Environment (or a modified version for localized ingest)
o Archival Storage
   - Preserve indefinitely digital objects and metadata (AIPs) in the Shared Digital Repository (includes ensuring the integrity and authenticity of materials). This function addresses the needs of partner libraries as well as individual users.
   - Record changes to and actions on items while they are in the repository
   - Maintain a persistent object address for items within the repository
o Dissemination
   - Provide access to digital objects for users
   - Allow for text searches through a variety of fields
   - Enable large-scale full-text searches
   - Permit the creation of public and private content collections
   - Disseminate digital objects (DIPs) to users (via the pageturner access system and data API)
   - Distribute datasets and HathiTrust APIs to developers
   - Research and develop additional applications and resources for HathiTrust
o Administration
   - Provide transparent and up-to-date information to users and the general public via http://www.hathitrust.org/
   - Communicate information and coordinate activities amongst partner libraries and HathiTrust boards and committees.
o Data Management
   - Update and manage the Rights and GeoIP databases
   - Build and maintain Collection Builder and Large-Scale Search Solr indexes
   - Determine appropriate user access to texts via database queries
   - Sync content with the Indianapolis site and back up content to tape
HathiTrust's Disaster Recovery Strategies

Basic Requirements for Disaster Recovery

Roy Tennant has identified three requisite components of a digital Disaster Recovery Plan: (1) the use of an effective data protection system (i.e. RAID), (2) redundant power and environmental systems, and (3) regular backup of information to tape and, ideally, to a remote mirrored site.[12] HathiTrust has incorporated all these elements into its design and operation. Its Isilon IQ storage cluster provides a high degree of data redundancy with its N+3 parity protection; the Michigan Academic Computing Center provides fully redundant power and environmental systems for HathiTrust infrastructure; and nightly tape backups and the replication of data to a fully operational mirror site located at Indiana University in Indianapolis (with the same levels of power and environmental conditioning) provide multiple copies as well as geographic distribution of content.
o "HathiTrust is intended to provide persistent and high availability storage for deposited files. In order to facilitate this, the initiative's technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage will be located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate Ann Arbor facility).
  Each of these storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel. Each separate storage system is also equipped with mechanisms to provide mirrored management and access functionality, and employ 100% data redundancy in an effort to prevent data loss."[13]

Details on parity protection and the HathiTrust server environment are available below (see Scenario 1 and Scenario 5, respectively).
[12] Tennant, Roy. "Digital Libraries: Coping with Disasters." Library Journal, 15 November 2001. Retrieved from http://www.libraryjournal.com/article/CA180529.html on 13 July 2009.
[13] HathiTrust. Technology, retrieved from http://www.hathitrust.org/technology on 15 June 2009.

Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites

HathiTrust's first line of defense in the event of a disaster is its hot mirror site in Indianapolis. While ingest of material is restricted to the Ann Arbor location, both sites possess two web servers, a MySQL database server, and an Isilon IQ storage cluster (currently composed of 21 nodes, each a server that combines Central Processing Units with storage). During normal operations, this arrangement allows HathiTrust to balance a high volume of web traffic across both sites such that individual user requests may be handled by either site in a transparent manner. Should the tolerances for failure be exceeded at a site (as in a disaster situation), the failover capability built into the HathiTrust architecture enables the remaining site to provide access to the designated community without noticeable service disruptions. As noted in the May 2009 HathiTrust Update, with the full operation of both locations, "We are now ensuring that users do not feel the effects of single-site outages, such as routine maintenance, by taking advantage of site redundancy."[14] However, because ingest takes place only in Ann Arbor, the loss of key components there would inhibit the repository's ability to acquire new content.
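The two-site arrangement described above can be sketched in a few lines. This is only an illustration under stated assumptions, not HathiTrust's actual load-balancing or failover mechanism, which this report does not describe at the code level. The `Site` class, the round-robin choice in `route_access`, and the `ingest_available` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical model of one repository site."""
    name: str
    healthy: bool
    accepts_ingest: bool = False  # per the report, only Ann Arbor ingests


def route_access(sites, request_id):
    """Pick a healthy site for an access request.

    Round-robin over healthy sites is an assumption made for illustration;
    the report only says traffic is balanced across both sites.
    """
    candidates = [s for s in sites if s.healthy]
    if not candidates:
        raise RuntimeError("no site is available to serve requests")
    return candidates[request_id % len(candidates)].name


def ingest_available(sites):
    """Ingest works only while a healthy ingest-capable site exists."""
    return any(s.healthy and s.accepts_ingest for s in sites)


sites = [Site("Ann Arbor", True, accepts_ingest=True), Site("Indianapolis", True)]
print(route_access(sites, 0), route_access(sites, 1))

# Simulate an outage in Ann Arbor: access fails over, ingest stops.
sites[0].healthy = False
print(route_access(sites, 0), ingest_available(sites))
```

The sketch makes the asymmetry explicit: access survives the loss of either site, while ingest depends on Ann Arbor alone.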
HathiTrust utilizes Isilon Systems' SyncIQ Application Software to synchronize data at the Indianapolis site with newly ingested or updated material from the Ann Arbor site. The sync to Indianapolis runs on 24 separate subsets of the data and each one runs every 2 hours, with the exception of Sundays. In other words, subset 1 runs at midnight on Monday, subset 2 runs at 2 a.m., and so on. The maximum time for data to be replicated from Ann Arbor to Indianapolis would therefore be three days plus the run time of the sync process (which tends to take less than three hours).[15]
o "SyncIQ is an asynchronous replication application that fully leverages the unique architecture of Isilon IQ storage to efficiently copy data from a primary cluster to one located at a secondary location."[16]
o "All nodes [in both the source and target Isilon IQ clusters] concurrently send and receive data during replication jobs in real-time, without impacting users reading and writing to the system."[17]
o "A robust wizard-driven web-based interface is fully integrated into [Isilon's proprietary] OneFS management tool to control all the functionality, including scheduling, policy settings, monitoring and logging of data transferred and bandwidth utilization."[18]
o "Only files that have changed will be replicated to the target clusters. This will optimize transfer times and minimize bandwidth used."[19]
o "In the event the secondary system is not available due to a system or network interruption, the replication job will be able to roll back and restart at the last successful copy operation."[20]
o "Upon a critical failure or loss of network connection, an alert will be sent to all recipients configured to receive critical alerts."[21]
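The three-day figure quoted for the sync schedule can be checked with a short calculation. The sketch below assumes that one subset starts every 2 hours from Monday 00:00, that the 24 subsets rotate in order, that no job starts on Sunday, and that the rotation restarts with subset 1 each Monday. These scheduling details are an interpretation of the description above, not confirmed configuration. All names are illustrative.

```python
SUBSETS = 24
INTERVAL_HOURS = 2
HOURS_PER_WEEK = 7 * 24
ACTIVE_HOURS = 6 * 24  # Monday 00:00 up to Sunday 00:00; Sunday is idle


def start_hours(subset):
    """Hours after Monday 00:00 at which a subset (0-based) starts each week."""
    slots = range(ACTIVE_HOURS // INTERVAL_HOURS)
    return [slot * INTERVAL_HOURS for slot in slots if slot % SUBSETS == subset]


def max_gap_hours(subset):
    """Longest wait between two consecutive starts of the same subset."""
    starts = start_hours(subset)
    following = starts[1:] + [starts[0] + HOURS_PER_WEEK]  # wrap to next week
    return max(later - earlier for earlier, later in zip(starts, following))


def worst_case_lag_hours(run_time_hours):
    """Worst-case replication delay: longest gap plus the job's run time."""
    return max(max_gap_hours(s) for s in range(SUBSETS)) + run_time_hours


print(start_hours(0))            # weekly start times of the first subset
print(max_gap_hours(0) / 24)     # longest gap in days
print(worst_case_lag_hours(3))   # with a run time of three hours
```

Under these assumptions every subset runs three times a week, 48 hours apart, and the idle Sunday stretches one of the gaps to 72 hours, which matches the "three days plus run time" bound stated above.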
[14] HathiTrust. Update on May 2009 Activities (2009) retrieved from http://www.hathitrust.org/updates_may2009 on 2 July 2009.
[15] Snavely, Cory (Head, UM Library IT Core Services). Personal email on 13 July 2009.
[16] Backup and Recovery With Isilon IQ Clustered Storage, 2007, p. 11.
[17] Ibid.
[18] Ibid.
[19] Ibid.
[20] Ibid.
[21] Ibid.
[22] Please refer to Appendix F (TSM Backup Service Standard Service Level Agreement).

Disaster Recovery Strategy #2: Nightly Automated Tape Backups

HathiTrust's ability to recover from a disaster is also ensured by the nightly automated tape backups performed by the Tivoli Storage Manager (TSM) client application installed on the ingest servers connected to the HathiTrust storage cluster and managed by Michigan's ITCS TSM Group. The TSM Backup Service Standard Service Level Agreement[22] outlines the obligations and responsibilities of both the service provider and HathiTrust:
o "The progressive incremental methodology used by Tivoli Storage Manager only backs up new or changed versions of files, thereby greatly reducing data redundancy, network bandwidth and storage pool consumption as compared to traditional methodologies based on periodic full backups."[23]
o "ITCS is responsible for all of the central server hardware, tape hardware, networking hardware, and related components. ITCS is also responsible for hardware maintenance as well as software maintenance, administration, and security audits on the central (non-client) TSM servers." (TSM Backup Service SLA, sec. 4.1)
o "ITCS provides 7x24 on-call monitoring and support, and strives to keep the servers up in production at all times. The target uptime is 99.9% of the time. The TSM hardware design is modular and should allow us to take pieces out of service without affecting customers. Whenever possible, system maintenance will be performed during standard weekend maintenance windows as defined by ITCS." (sec. 4.2)
o "In an emergency, [email protected] (this will go to the on-call staff's pager in real-time)." (sec. 4.6)
o "ITCS is responsible for physical security. Machine access audits, OS security, and network security on the TSM server end are also the responsibility of ITCS." (sec. 4.9)
o "The service [...] includes data compression, data encryptions, and data replication." (sec. 1.0)
o "ITCS will maintain at least two TSM sites and will mirror data between the sites to provide redundancy in the event of a disaster. Currently those sites are the Arbor Lakes Data Facility (ALDF) at 4251 Plymouth Rd. and the Michigan Academic Computing Center (MACC) located at 1000 Oakbrook Dr." (sec. 4.10)
o "Both facilities are secure, climate-controlled sites designed and built for high available production services."[24]
o "In the event of a customer disaster with large-scale (a full server or more) data loss, ITCS will work with the customer to optimize the restore time to best of our ability. We will only be able to devote resources to the extent that other customers are not affected. Restoring large file servers (multiple Terabytes) can take several days. If customers want to minimize this amount of time to restore, we can purchase additional resources for this purpose. Contact us directly, and we'll work out a scenario with costing information. In the event of a MAJOR campus outage affecting a large number of customers, ITCS management will work with customers to determine how to prioritize customer restores." (sec. 4.11)
o "Disaster Recovery planning is the responsibility of the customer unit." (sec. 5.8)

Having established the main Disaster Recovery strategies employed by HathiTrust, we may now proceed to investigate the means by which it anticipates and mitigates the most common threats facing digital repositories.
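The "progressive incremental" idea quoted above, in which only new or changed files are backed up, can be illustrated with a small sketch. This is not how Tivoli Storage Manager is implemented, and the report does not say how TSM detects changes. Comparing SHA-256 digests against a manifest kept from the previous run is an assumption chosen for clarity. The function names and the in-memory file dictionary are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint used to decide whether a file has changed."""
    return hashlib.sha256(data).hexdigest()


def select_for_backup(current_files, manifest):
    """Return {path: digest} for files that are new or changed.

    current_files maps path -> file contents (bytes);
    manifest maps path -> digest recorded at the previous backup.
    """
    changed = {}
    for path, data in current_files.items():
        fingerprint = digest(data)
        if manifest.get(path) != fingerprint:
            changed[path] = fingerprint
    return changed


def run_backup(current_files, manifest):
    """Back up only new or changed files and update the manifest in place."""
    changed = select_for_backup(current_files, manifest)
    manifest.update(changed)  # a real system would also copy the data to tape
    return sorted(changed)


files = {"vol1/page1.txt": b"first", "vol1/page2.txt": b"second"}
manifest = {}
print(run_backup(files, manifest))   # first run: everything is new
print(run_backup(files, manifest))   # second run: nothing changed
files["vol1/page2.txt"] = b"second, corrected"
print(run_backup(files, manifest))   # only the edited file is selected
```

After the first full pass, each nightly run touches only what changed, which is the source of the bandwidth and storage savings the vendor literature describes.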
[23] IBM. IBM Tivoli Storage Manager: Features and Benefits (2009) retrieved from http://www-01.ibm.com/software/tivoli/products/storagemgr/features.html?S_CMP=rnav on 16 June 2009.
[24] Information Technology Central Services at the University of Michigan. Frequently Asked Questions about the TSM Backup Service (2009) retrieved from http://www.itcs.umich.edu/tsm/questions.php on 16 June 2009.
Scenario 1: Hardware Failure or Obsolescence and Data Loss

Review: Risks Involving Hardware Failure or Obsolescence and Data Loss

The following table highlights the various events which pose a risk to the hardware and data of HathiTrust. These threats may stem from flaws or malfunctions in the equipment itself or as a result of external events that include physical security breaches and natural or manmade disasters. The arrangement of these potential risks reflects the relative severity of their respective consequences.

Severity: High Impact
Event: Loss at a single point of failure
   - An additional failure past tolerances when only one site is operational
   - Service is unavailable and cannot be restored until component is repaired/restored

Severity: Moderate Impact
Event: Failure of a component past redundancy tolerance
   - System no longer has redundancy: additional loss or failure of components will result in loss of system. This is a particular problem if one site is already down.
   - Loss of db server (home of Rights db) or of both Web servers at a site will render that location inaccessible
   - Loss of four drives or nodes in either Isilon storage cluster will result in the loss of that instance. The cluster will be offline and unable to handle read or write requests; all traffic would have to be handled by the remaining site.
   - Loss of UM Arbor Lakes site would prevent performance of tape backups.
   - Loss of UM MACC site would deprive IU site of data redundancy
   - Loss of ingest servers would prevent new content from entering repository

Severity: Low Impact
Event: Failure of redundant system components
   - Includes redundant components within each site as well as general redundancy between the IU and UM sites
      o HT infrastructure has been designed to avoid single points of failure and to ensure data and equipment redundancy
      o Service continues in an uninterrupted and transparent manner

HathiTrust's Solutions for Hardware Failure and Data Loss

The threats faced by HathiTrust's hardware (and associated applications as well as the data stored therein) are comprised of the failure of redundant features, failure that exceeds components' tolerance for redundancy, and single points of failure. While the failure of redundant components may happen more frequently (e.g., the loss of an individual drive within the Isilon IQ cluster), such losses do not have a large impact on the repository; events which compromise single points of failure will have much greater consequences for the continuity of HathiTrust operations. At the same time, while a component may have redundancy on one level (for example, there are five servers dedicated to ingest), that component simultaneously may be considered at a higher level to be a single point of failure (i.e., because the ingest servers are housed in a single chassis, the entire unit is vulnerable to an event such as a fire). This duality highlights the need for vigilance and foresight in managing the repository's infrastructure.

Because HathiTrust relies heavily upon hardware to fulfill its mission and deliver services to its designated community of users, the selection of equipment and development of system architecture has aimed at minimizing the dangers posed by single points of failure through the introduction of strategic redundancies. The basic means for avoiding the disastrous effects of hardware failure or data loss have been the establishment of the Indianapolis mirror site and the nightly backup of content to tape. (For more detail, please refer to the preceding section.) While these strategies account for extraordinary events, HathiTrust's server replacement schedule allows the repository to anticipate the results of normal equipment use and depreciation. Steps to safeguard the long-term functionality of HathiTrust have therefore been complemented by a consideration of best practices for disaster preparedness.
Redundant Components and Single Points of Failure in the HathiTrust Infrastructure

The following sections provide a general outline of HathiTrust's redundant components and single points of failure. Given the complexity of the repository's infrastructure, unknown or unanticipated scenarios may exist; future Disaster Recovery Planning will thus involve a periodic review of key features and vulnerabilities.

o Site Redundancy: The establishment of the mirror site in Indiana provides HathiTrust with a fully redundant operation. Because both instances provide full access to content in addition to other repository functions, users will not experience a loss or degradation of service in the event that service is lost from one site. Key exceptions to HathiTrust's site redundancy are noted below.
o Redundant Components at Each Site: The following components provide each site with a tolerance under which limited failures will not disrupt major HathiTrust functions and user services.
   - Web servers: each site has two servers so that if one fails, the other may continue to handle traffic. These also host the GeoIP database.
   - Isilon IQ clusters: the current configuration of 21 nodes features N+3 parity protection; this data redundancy permits the simultaneous failure of 3 drives on separate nodes or the loss of three entire nodes without service degradation.
   - Ingest servers: the Ann Arbor site possesses five servers so that ingest may continue (albeit at a slower rate) in the event of any failures.
   - Large Scale Search (LSS) Solr index: currently housed on the web servers, but will soon be maintained on five new servers in Ann Arbor.
o Single Points of Failure:25 These are components of a system which, if lost, will prevent the entire system from functioning. Even those components with wholly redundant peer devices (such as the web or ingest servers) may be considered single points of failure if they have exceeded their capacity to sustain losses (i.e., if one web server at a site has already been lost).
   - Single Points of Failure at the Component Level: Because only one of these components exists at each HathiTrust site, a loss will result in system failure.
      - MySQL database server: houses the rights database, ingest tracking database, and the Collection Builder Solr index
      - Server network switches
      - Outbound network switches
   - Single Points of Failure at the System Level: While any given component may have various degrees of internal redundancy (such as multiple power supplies or multiple drives) it might still fail as a whole and thus result in the loss of a particular instance of HathiTrust. The following are components located at each site which, while possessed of internal redundancies, are still subject to complete loss (as in the event of a fire) and may thus render a site inoperable.
      - Isilon IQ storage cluster: the entire cluster could be lost in a large-scale event. Additionally, the loss of a fourth drive or node will exceed the cluster's failure tolerance and result in a service disruption.
      - Web servers: should one fail, the remaining server will be a single point of failure.
      - Blade server chassis: since web, ingest, and database servers are housed in one chassis, the entire unit could potentially fail.
      - LSS index: in the near future, the servers in Ann Arbor will be the sole instance of the Large Scale Search index.
      - Mirlyn database and Mirlyn2 Solr index26: these are currently key components of the UM Library infrastructure; should these be unavailable, access to and use of HathiTrust will be compromised.

25 Content in this section is courtesy of Cory Snavely (personal email from 13 July 2009).
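
The redundancy rules outlined in this section can be restated as a small model. The following Python sketch is illustrative only: the class, field, and function names are hypothetical, the three status labels are an assumption of the sketch, and whole-unit risks such as the blade server chassis or a building fire are deliberately left out. It encodes only the per-site counts stated in the report (two web servers, one database server, one switch of each kind, and an Isilon tolerance of three failures).

```python
from dataclasses import dataclass

# N+3 parity protection: up to three drive or node failures are tolerated.
ISILON_FAILURE_TOLERANCE = 3


@dataclass
class SiteState:
    """Hypothetical snapshot of one HathiTrust site (field names are illustrative)."""
    web_servers_up: int        # each site has two web servers
    db_server_up: bool         # one MySQL database server per site
    server_switch_up: bool     # one server network switch per site
    outbound_switch_up: bool   # one outbound network switch per site
    isilon_failures: int       # failed drives (on separate nodes) or failed nodes


def site_status(site: SiteState) -> str:
    """Return 'down', 'degraded' (running, but with no redundancy left), or 'ok'."""
    lost_single_point = (
        site.web_servers_up == 0
        or not site.db_server_up
        or not site.server_switch_up
        or not site.outbound_switch_up
        or site.isilon_failures > ISILON_FAILURE_TOLERANCE
    )
    if lost_single_point:
        return "down"
    # One more failure of either kind would take the site offline.
    if site.web_servers_up == 1 or site.isilon_failures == ISILON_FAILURE_TOLERANCE:
        return "degraded"
    return "ok"
```

For example, `site_status(SiteState(1, True, True, True, 0))` reports a degraded site, since the remaining web server has become a single point of failure.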
Key Features of HathiTrust's Isilon IQ Clustered Storage

The Isilon IQ storage cluster stores and provides digital objects for HathiTrust's partner libraries and members of its designated community. The cluster provides a high degree of inherent redundancy, which gives both HathiTrust sites a considerable degree of tolerance in regards to the failure of various aspects of the storage units. As one example, Isilon's proprietary OneFS operating system permits the individual storage nodes (the individual servers that are the building blocks of the cluster) to function as coherent peers so that any one node knows everything contained on the other units in the cluster.

o "Isilon's OneFS operating system [...] intelligently stripes data across all nodes in a cluster to create a single, shared pool of storage."27
o "Because all files are striped across multiple nodes within a cluster, no single node stores 100% of a file; if a node fails, all other nodes in the cluster can deliver 100% of the files within that cluster."28
o "A distributed clustered architecture by definition is highly available since each node is a coherent peer to the other. If any node or component fails, the data is still accessible through any other node, and there is no single point of failure as the file system state is maintained across the entire cluster."29
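
The general idea behind striping with parity can be illustrated with a toy example. The Python sketch below is not how OneFS or FlexProtect work: it uses a single XOR parity block, which survives the loss of only one block per stripe, whereas the cluster's N+3 protection survives three failures using stronger error-correction codes. All function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data: bytes, data_nodes: int):
    """Split data into equal blocks (zero-padded) and append one XOR parity block."""
    block_len = -(-len(data) // data_nodes)  # ceiling division
    padded = data.ljust(block_len * data_nodes, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(data_nodes)]
    return blocks + [xor_blocks(blocks)]


def rebuild_missing(stripe, missing_index):
    """Recover the block at missing_index by XOR-ing all surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)
```

Each block of the stripe would live on a different node; if any one node is lost, `rebuild_missing` reconstructs its block from the others, which is the property the quotations above describe at a much larger scale.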
26 Mirlyn is the name of the University of Michigan's current Online Public Access Catalog, which is supported by the Aleph integrated library system. Mirlyn2 is a beta version of UM's recently implemented next-generation catalog, based on the VuFind platform, which will become the main library catalog on August 3, 2009.
27 Isilon Systems, Inc. "Isilon IQ OneFS Operating System" (2009) retrieved from http://www.isilon.com/products/OneFS.php on 17 June 2009.
28 Isilon Systems. "Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems" (2008) p. 7. "In computer data storage, data striping is the technique of segmenting logically sequential data, such as a single file, so that segments can be assigned to multiple physical devices. [...] if one drive fails and the system crashes, the data can be restored by using the other drives in the array." (http://en.wikipedia.org/wiki/Data_striping, retrieved on 16 August, 2009).
29 Isilon Systems. "Breaking the Bottleneck: Solving the Storage Challenges of Next Generation Data Centers" (2008) p. 8
HathiTrust's Isilon IQ clusters ensure a high degree of data redundancy with their N+3 parity protection. N+3 provides triple simultaneous failure protection so that up to three drives on separate Isilon IQ nodes, or three entire nodes, can fail at the same time and all data will still be fully available.

o "Traditional RAID 5 parity protection results in data loss if multiple components fail prior to the completion of a rebuild. FlexProtect, in contrast, automatically distributes all data and error correction information across the entire Isilon cluster and with its robust error correction techniques efficiently and reliably ensures that all data remains intact and fully accessible even in the unlikely event of simultaneous component failures."30
o "Each file is striped across multiple nodes within a cluster, with [three] parity stripes for each data block."31

The file system may also perform a Dynamic Sector Repair (DSR) at the time of any file writing. If it encounters a bad disk sector, the file system will use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block elsewhere on the drive. The bad sector will be remapped by the drive so that it is never used again and the write operation will be completed.

The Isilon restriper is a meta-process/infrastructure that has four primary phases to help manage and protect data in the event that components of the cluster sustain a partial failure or malfunction. The processes run as background operations and do not require system downtime.32 33
o FlexProtect repairs data (i.e., in the event of a drive loss) using parity.
   - "Isilon OneFS with FlexProtect can boast the industry-leading Mean Time to Data Loss (MTTDL) for petabyte clusters."34
   - "FlexProtect introduces state-of-the-art functionality, which rebuilds failed disks in a fraction of the time, harnesses free storage space across the entire cluster to further insure against data loss, and proactively monitors and preemptively migrates data off of at-risk components."35
o AutoBalance rebalances the data in a cluster according to business rules, in real time, non-disruptively.36
   - "As soon as the [new or repaired] node is turned on and network cables are connected, AutoBalance immediately begins to migrate content from the existing storage nodes to the newly added node across the cluster interconnect backend switch, rebalancing all of the content across all nodes in the cluster and maximizing utilization."37
30 Isilon Systems, Inc. "Isilon IQ OneFS Operating System" (2009) retrieved from http://www.isilon.com/products/OneFS.php on 30 June 2009.
31 Isilon Systems. "Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems" (2008) p. 7
32 Isilon X-Series Specifications (product brochure)
33 Information on the Isilon restriper comes from a personal email sent by Kip Cranford of Isilon Systems, Inc. on 1 June 2009.
34 Isilon Systems. "Data Protection for Isilon Scale-Out NAS" (2009) p. 4
35 Isilon Systems, Inc. "Isilon IQ OneFS Operating System" (2009) retrieved from http://www.isilon.com/products/OneFS.php on 15 June 2009.
36 McFarland, Anne. "Isilon Accelerates Delivery of Digital Content," The Clipper Group Navigator (2003).
37 Isilon Systems. "The Clustered Storage Revolution" (2008) p. 13
o Collect cleans up orphaned nodes and data blocks to prevent fragmentation of data.
o MediaScan verifies disk sectors.
   - The function of MediaScan is to scan every block in the file system looking for bad disk sectors. If it encounters a bad sector, it will perform a Dynamic Sector Repair (DSR) and use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block somewhere else on the drive.
   - MediaScan periodically reviews data blocks and disk sectors that may not have been accessed, from a file level, in months or years and thereby helps to keep the drives as healthy as possible.
o As of the OneFS 5.0 release, all file system metadata can be checked by the IntegrityScan restriper phase. This process will allow HathiTrust to completely check file data and metadata via associated checksums.

Other instances of inherent redundancy include non-volatile RAM, a fully journaled file system, and software applications that manage client connections in the event of a node's failure.

o "OneFS is a fully journaled file system with large amounts of battery-backed non-volatile random access memory (NVRAM) within each node, which ensures the integrity of the file system in the event of unexpected failures during any write operation."38
o "The Isilon SmartConnect module [ensures] that when a node failure occurs, all in-flight reads and writes are handed off to another node in the cluster to finish its operation without any user or application interruption. [...] If a node is brought down for any reason, including a failure, the virtual IP addresses on the clients will seamlessly fail over across all other nodes in the cluster. When the offline node is brought back online, SmartConnect automatically fails back and rebalances the NFS clients across the entire cluster to ensure maximum storage and performance utilization."39
Hardware Support and Service

HathiTrust equipment is covered by support and service agreements with its various vendors (Sun Microsystems, Dell, CDW-G, etc.). A good example of one such agreement is found in the Platinum support provided by Isilon Systems and which includes:

o Extended 24x7x365 Telephone & Online Hardware and Software Support
o 24x7 Proactive Monitoring & Alerts - Email Home (for Hardware and Software)
o Return Parts to Factory for Repair and 4-hour Replacement Parts Delivery
o SupportIQ (Enhanced Serviceability Diagnostics) and System Event Tracking
o On-site Troubleshooting
o Isilon Hardware Installation
o Software Product Documentation, Release Notes, and access to Product Technical Notes
o Remote Diagnosis (Provided User Grants Access)
o Maintenance & Patch Releases
o Minor and Major Upgrade Releases (Includes Performance Improvements, New Features, Serviceability Improvements).40

38 Isilon Systems. "Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems" (2008) p. 9
39 Isilon Systems. "Data Protection for Isilon Scale-Out NAS" (2009) p. 6
Equipment Tracking

LIT Core Services (CS) maintains an inventory of servers on a wiki page accessible to its staff. Details include each server's name, location, online and retire dates, upgrades, notes on storage, and its primary service. Additional information is provided related to specifications, support contracts, and key contact information. The CS server inventory is currently out of date.

Hardware Replacement Schedule

o "HathiTrust replaces storage regularly, approximately every 3-4 years or as the usable life of storage equipment dictates" (HT TRAC C1.7)
o "HathiTrust staff upgrade hardware on a regular basis (i.e., every three or four years), and to help detect more rapid growth in demands, the web server and storage infrastructures have their own performance monitoring that indicate overload conditions." (HT TRAC C1.10)
Timeline for Emergency Replacement of HathiTrust Infrastructure

Should a serious event require the replacement of part (or all) of the HathiTrust technical infrastructure, the following timeline provides a general estimate of the time required to order, ship, and install new equipment. A cursory review of the time necessary for HathiTrust to recover from a major disaster at the main Ann Arbor or Indianapolis data centers suggests that a large event could idle an instance of the repository for at least a month and a half. In addition to the servers and switches mentioned above, critical components include four 30A power distribution units (PDUs) per rack and four racks per data center as of this writing.

o Submission of Purchase Orders:
   - For orders under $5,000, the M-Pathways application allows the University Library's business manager to send purchase orders directly to vendors.
   - For orders over $5,000, Procurement Services normally takes one to two business days to approve the purchase, but the process may take up to a week if questions arise or additional purchase information is needed.
o Delivery of Equipment:
   - Products the vendor has in stock and available for immediate shipment take 1-3 days to be delivered.
   - Items that need to be configured (such as servers) usually take 1-2 weeks.
   - Isilon storage will take 3 weeks to be delivered in a worst-case scenario.
o Installation:
   - 3 days FTE for Isilon IQ cluster in addition to the time required for other servers, switches, PDUs and rack units.
40 Isilon Systems. "Support Advantage Offerings" (2009) retrieved from http://www.isilon.com/support/?page=plans on 30 June 2009.
o Data Restoration: about 0.5 TB/hour (15 days, as of June 2009)41
   - While HT has about 110 TB of data in its storage, the backup tapes maintained by the TSM Group contain roughly 176 TB of information due to the data encryption used to protect the intellectual rights of the material (as of 06/2009).
   - The length of time required for a bare-metal restoration will be influenced by tape mounts, network speed, restoring to the NFS shares, decryption, etcetera.
   - If the library/HT were to purchase an additional tape drive (at roughly $20,000), the process could be sped up, perhaps to about 1 TB/hour.
   - In the event of a large-scale disaster in which multiple campus units require extensive data restoration, the TSM Backup Service SLA states that "ITCS management will work with customers to determine how to prioritize customer restores." (sec. 4.11) This determination will reflect the University of Michigan's organizational priorities42:
      - Priority 1: Health and safety of faculty, staff, students, hospital patients, contractors, renters, and any other people on University premises.
      - Priority 2: Delivery of health care and hospital patient services
      - Priority 3: Continuation and maintenance of research specimens, animals, biomedical specimens, research archives.
      - Priority 4: Delivery of teaching/learning processes and services
      - Priority 5: Security and preservation of University facilities/equipment.
      - Priority 6: Maintenance of community/University partnerships.
o Fractional restores would, for the most part, run at comparable speeds unless there was a need to restore a large number of random files, in which case there would be a decrease in speed due to tape seek and mount times.
o Delays in recovery could be increased dramatically if the MACC data center or its infrastructure has sustained damage and needs repair.
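
The figures in this timeline can be checked with simple arithmetic. The Python sketch below is a rough estimate under stated assumptions: the phases run strictly one after another, "up to a week" is counted as seven calendar days, and only the Isilon worst-case delivery and installation times are included (time for other servers, switches, PDUs and racks is not). The function and dictionary names are hypothetical.

```python
BACKUP_SIZE_TB = 176   # encrypted data held on backup tape (as of 06/2009)
HOURS_PER_DAY = 24

# Worst-case durations, in days, taken from the timeline above.
# Assumption: "up to a week" is treated as 7 calendar days.
WORST_CASE_DAYS = {
    "purchase order approval": 7,
    "Isilon storage delivery": 21,
    "Isilon cluster installation": 3,
}


def restore_days(size_tb: float, rate_tb_per_hour: float) -> float:
    """Days of continuous restoration at a constant throughput."""
    return size_tb / rate_tb_per_hour / HOURS_PER_DAY


def worst_case_total_days(rate_tb_per_hour: float = 0.5) -> float:
    """Sum of the sequential worst-case phases plus the full data restoration."""
    return sum(WORST_CASE_DAYS.values()) + restore_days(BACKUP_SIZE_TB, rate_tb_per_hour)
```

At 0.5 TB/hour, `restore_days(176, 0.5)` comes to a little under 15 days, matching the figure quoted above, and the sequential total is between 45 and 46 days, which agrees with the estimate of at least a month and a half. At about 1 TB/hour the restoration alone would take roughly half as long.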
HathiTrust and Insurance Coverage at the University of Michigan

The Office of Financial Operations reviews and adds financial assets greater than $5,000 to the asset management system of the University of Michigan. The Property Control Office is then responsible for tagging financial assets with unique University of Michigan identifiers and tracking them. Risk Management Services administers the University's property insurance and will provide the reimbursement of replacement costs for items self-insured by Michigan. As of July 2009, the nature and extent of the University of Michigan's insurance coverage for HathiTrust hardware remained under review. The main contact with Risk Management Services in this matter has been Cyndi Mesa, Head of UM Library Finance.
41 Hanover, Cameron (ITCS TSM Group Storage Engineer). Personal email on 23 June 2009.
42 University of Michigan Administrative Information Services. "Emergency Management, Business Continuity, and Disaster Recovery Planning" (2007) retrieved from http://www.mais.umich.edu/projects/drbc_methodology.html on 6 July 2009.
Scenario 2: Network Configuration Errors

Review: Risks Involving Network Configuration Errors

The following table summarizes the risks facing HathiTrust as the result of network configuration errors. Consideration is given to network connections within UM data centers as well as at UM's Hatcher Graduate Library (site of key administrative and development activities). The arrangement of these events reflects the relative severity of their respective consequences.

Severity / Event

High Impact
o Loss of server network switch or outbound network switch
o Loss of access to UMnet Backbone

Moderate Impact
o Extended loss of power at Hatcher Library could lead to loss of local servers and disruption of administrative and operational activities.

Low Impact
o Loss of power that threatens ability to connect to Local Area Network (LAN)/Backbone
   - The library remains (for now) a priority recipient of electricity from the UM power plant
   - Campus data centers have UPSs and redundant backup power
o Failure of local/server-side connections
   - Should problems arise with connections to individual nodes, the clustered architecture of the Isilon system will allow read/write requests to be handled by alternate nodes.
   - If connections fail at one HT site, traffic can be handled by remaining site.

HathiTrust's Solutions for Network Configuration Errors

HathiTrust's continued access to the Internet via the UMnet Backbone is essential for its continued provision of service. The repository receives network infrastructure maintenance through UM's ITCS/ITCom; with its robust disaster planning in addition to the lessons learned from the Midwest blackout of 2003, ITCom guarantees continued network access in all but the most catastrophic scenarios. In the event of a widespread power outage, HathiTrust would be able to maintain access to the UMnet Backbone since data centers are equipped with redundant power supplies and the Hatcher Graduate Library is currently categorized as a priority recipient of power from the university. ITCS also has 17 generators which can be used to maintain power to network switches in the event of a blackout. The responsibilities and obligations of both parties are outlined in the Customer Network Infrastructure Maintenance Service Agreement.43

Extent of ITCom Support

o "ITCom agrees to provide the Unit Network Infrastructure Maintenance to include data switches, routers, access points, hubs, uninterruptible power supplies (UPSs), firewalls, and other identified and agreed upon components." (ITCS sec. 1.0)

43 Please refer to Appendix G (ITCS/ITCom Customer Network Infrastructure Maintenance Service Agreement).
ITCom Responsibilities

o Provide and maintain the necessary materials and electronic components to operate the Unit Network Infrastructure. (sec. 5.2)
o Provide configuration and Network Infrastructure Administration support necessary to repair and maintain the Unit Network Infrastructure hardware and software covered by this agreement. (sec. 5.3)
o Monitor 24 hours/day and 365 days/year (24x365), supported protocols to the backbone interface of the Unit's network up to and including the extension to the first hub or switch. (sec. 5.6)
o Monitor 24 hours/day and 365 days/year (24x365), network interfaces on uninterruptible power supplies (UPS) that support the Unit network switches. Provide notification in the event that a UPS is activated, (input power is lost or degraded and system switches to battery power), deactivated, (input power is restored), or unreachable. Provide notification to the Unit Network Administrator when batteries degrade to the point of needing replacement. (sec. 5.7)
o Provide maintenance on the station cabling as installed by ITCom, or an approved UM vendor which met ITCom installation specifications. (sec. 5.8)
o Provide Preventative Maintenance (clean & vacuum) on each Customer Unit switch covered in this agreement yearly. (sec. 5.9)

ITCom Services in Response to Outages or Degradation Impacting the Network

o A response within 30 minutes of the ITCom NOC notification or the Unit's call, to provide information to the Unit on specific steps that have been/will be taken to resolve the problem. (sec. 7.2.1)
o An on-site visit, if necessary, within two (2) hours of the response (i.e., the maximum on-site response time will be two and a half (2 1/2) hours). An update will be provided to the Unit Network Administrator if on-site and a best guess ETR will be provided based on available facts. ITCom will continue to provide the Unit with updates every two hours during an outage. (sec. 7.2.1)
o If an outage is identified within the agreement service hours ITCom will resolve the outage even if the repair time extends beyond the service agreement hours. (sec. 7.2.1) (Repairs outside of the agreement hours result in additional labor expenses.)
o Conduct monitoring via SNMP POLLING at one-minute intervals. (sec. 7.2.1)
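
The response commitments listed above translate into concrete deadlines for any given incident. The Python sketch below is a hypothetical helper, not part of the service agreement. It assumes the worst case at each step (the response arrives at the 30-minute limit) and that the two-hourly updates are counted from that response, which the agreement as quoted does not specify.

```python
from datetime import datetime, timedelta

RESPONSE_WINDOW = timedelta(minutes=30)  # response after NOC notification or Unit's call
ONSITE_WINDOW = timedelta(hours=2)       # on-site visit, measured from the response
UPDATE_INTERVAL = timedelta(hours=2)     # status updates during an outage


def response_deadlines(notified_at: datetime, outage_duration: timedelta) -> dict:
    """Latest acceptable times for the response, the on-site visit, and updates."""
    respond_by = notified_at + RESPONSE_WINDOW
    onsite_by = respond_by + ONSITE_WINDOW
    outage_end = notified_at + outage_duration

    # Assumption: updates are counted from the initial response.
    updates = []
    next_update = respond_by + UPDATE_INTERVAL
    while next_update <= outage_end:
        updates.append(next_update)
        next_update += UPDATE_INTERVAL

    return {"respond_by": respond_by, "onsite_by": onsite_by, "updates": updates}
```

For an outage reported at 9:00, the response is due by 9:30 and the on-site visit by 11:30, which is the two-and-a-half-hour maximum stated in the agreement.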
HathiTrust Responsibilities

ITCom's responsibilities end at the first network switch and from there to its servers, HathiTrust is responsible for maintaining network connectivity and security. The repository uses Internet2 for communication and synchronization between the Ann Arbor and Indianapolis sites. Each Isilon node has dual 10 GB Infiniband ports for internal (i.e., intra-cluster) communication and dual 1 GB Ethernet for external communication.
Scenario 3: Network Security and External Attacks

Review: Risks Involving Network Security and External Attacks

The following table gives a general overview of the basic threat an external attack or network security breach poses to HathiTrust; entries are arranged by severity. The list, however, is not exhaustive and no attempt has been made to publicize potential vulnerabilities.

Severity / Events

High Impact
o Unauthorized access to HathiTrust content leads to the infringement of copyrights.
o Loss of data or functionality for an extended period of time as a result of malicious activity.

Moderate Impact
o HathiTrust services are temporarily unavailable as a result of malicious activity.

Low Impact
o The delivery of HathiTrust services slows as the result of malicious activity.
o A security weakness exists within the system but remains unexploited.

HathiTrust's Solutions for Network Security

Malicious activity against HathiTrust could involve unauthorized access to a system or data, denial of service, or unauthorized changes to the system, software, or data. As an academic entity, the repository is seen as less of a target for such actions than commercial or governmental targets; despite this perceived lower risk, HathiTrust has not been lulled into a false sense of security. The repository takes seriously the potential for violations of its network and operating system security and therefore has instituted a program of periodic software updates in addition to the maintenance of an ITCom-supported firewall, authentication-required access, and other measures (such as throttling software to deter denial of service attacks). Because content is currently accepted from trusted sources (namely, Google and legacy digital collections from HathiTrust partners) the GROOVE process does not include a virus detection phase. As digital objects are ingested from a greater number of sources, additional security measures should be considered.

o "HathiTrust staff apply security updates to the operating system and to networking devices as soon as they become available in order to minimize system vulnerability. As with new software releases, security updates are tested in a development environment before being released to production. Software packages that present a lower security risk and that have a greater potential to affect application behavior (web servers, language interpreters, etc.) are generally installed, configured and tested manually to allow for greater control in managing updates. Software updates are not applied automatically; moreover, updates that present a potential for having an impact on system behavior are applied and tested first in the development environment. If no impacts are seen, HathiTrust staff apply these updates in production after a testing period of at least one week." (HT TRAC C1.10)
Scenario 4: Format Obsolescence

Review: Risks Involving Format Obsolescence

The following table outlines the threats posed by format obsolescence and arranges them according to their potential severity.

Severity / Events

High Impact
o Applications and hardware are no longer able to read or display digital objects.
o Errors in translating and reading files are not understood or acknowledged by repository users.

Moderate Impact
o Problems with the translation of file formats result in DIPs that do not faithfully reflect the original digital objects.

Low Impact
o Formats and associated applications change but retain compatibility with older versions of the file formats.

HathiTrust's Solutions for Format Obsolescence

An awareness and acknowledgement of the dangers of format obsolescence has led HathiTrust to implement proactive policies and procedures to ensure long-term access to the repository's content. The repository only accepts specific formats that meet rigorous specifications and, through the prior experience of University of Michigan personnel, has developed protocols for the successful migration of content from one format to another. In addressing the threat of format obsolescence, the preservation of the integrity and authenticity of deposited content has been an overarching concern.

Selection of File Formats

o "HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance and layout of materials digitized for deposit. HathiTrust stores and preserves metadata detailing the sequence of files for the digital object. HathiTrust has extensive specifications on file formats, preservation metadata, and quality control methods, included in the University of Michigan digitization specifications, dated May 1, 2007."44 (HT TRAC B1.1)
o "HathiTrust currently ingests only documented acceptable preservation formats, including TIFF ITU G4 files stored at 600 dpi, JPEG or JPEG2000 files stored at several resolutions ranging from 200 dpi to 400 dpi, and XML files with an accompanying DTD (typically METS). HathiTrust supports these formats because of their broad acceptance as preservation formats and because the formats are documented, open and standards-based, giving HathiTrust an effective means to migrate its contents to successive preservation formats over time, as necessary. The Repository Administrators have undertaken such transformations in the past; moreover, HathiTrust offers end-user services that routinely transform digital objects stored in HathiTrust to presentation formats using many of the widely available software tools associated with HathiTrust's preservation formats. HathiTrust gives attention to data integrity (e.g., through checksum validation) as part of format choice and migration."45
o "Each format conforms to a well-documented and registered standard (e.g., ITU TIFF and JPEG2000) and, where possible, is also non-proprietary (e.g., XML)." (HT TRAC B4.2)

44 Specifications are available at http://www.lib.umich.edu/lit/dlps/dcs/UMichDigitizationSpecifications20070501.pdf
Format Migration Policies and Activities

o "HathiTrust is committed to migrating the formats of materials created according to [its] specifications as technology, standards, and best practices in the digital library community change." (HT TRAC B1.1)
o "HathiTrust staff members conduct migrations from one storage medium to another using tools that validate checksums internally. (Digital objects are stored both online and on tape, and the online storage system conducts regular scans to detect and correct data integrity problems.) A total file count is done following a large data transfer, and regularly scheduled integrity checks follow." (HT TRAC C1.7)
o "[HathiTrust] has migrated large SGML-encoded collections to XML, and Latin-1 character encodings to UTF-8 Unicode. Our success in migrating from older formats to newer formats demonstrates our commitment to our collections and our ability to keep materials in our repository viable. All migrations are documented in change logs." (HT TRAC B4.2)
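
The post-transfer checks described above (a total file count plus checksum validation) can be sketched in a few lines. The Python example below is an illustration only: the report does not name the tools HathiTrust uses, so the function names, the choice of SHA-256, and the directory-to-directory comparison are all assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Checksum a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source: Path, target: Path) -> list:
    """Return a list of problems; an empty list means counts and checksums match."""
    src_files = sorted(p.relative_to(source) for p in source.rglob("*") if p.is_file())
    dst_files = sorted(p.relative_to(target) for p in target.rglob("*") if p.is_file())

    problems = []
    # Total file count, as done after a large data transfer.
    if len(src_files) != len(dst_files):
        problems.append(f"file count differs: {len(src_files)} vs {len(dst_files)}")
    # Per-file checksum comparison.
    for rel in src_files:
        if not (target / rel).is_file():
            problems.append(f"missing: {rel}")
        elif file_digest(source / rel) != file_digest(target / rel):
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

A migration would be accepted only when `verify_transfer(old_root, new_root)` returns an empty list; the regularly scheduled integrity checks mentioned in the quotation could rerun the same comparison against stored checksums.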
45 HathiTrust. "Preservation" (2009) retrieved from http://www.hathitrust.org/preservation on 16 June 2009.
Scenario 5: Core Utility and/or Building Failure

Review: Risks Involving Core Utility or Building Failure

The following table summarizes the dangers a utility or building failure poses to HathiTrust and ranks events by their potential severity.

Severity / Events

High Impact
o Extensive structural damage renders the MACC (or key elements of its infrastructure) unusable and necessitates the establishment of a hot site to recover and continue operations.
o Additional failure past tolerance in backup cooling or power infrastructure

Moderate Impact
o Failure of backup power past redundancy tolerance (failure of 2 generators)
   - Data center coordinator may initiate load shed and shut down half of the MACC (but library racks will remain operational)
o Structural damage renders facility temporarily unsafe and/or unusable.

Low Impact
o Loss of power
o Loss of environmental control units within redundancy

HathiTrust's Solutions for Utility or Building Failure

The continued delivery of HathiTrust's services depends upon the maintenance of power, environmental control, and security in its server environment at the Michigan Academic Computing Center (MACC) and other locations that host components of the repository. In this respect, HathiTrust is heavily reliant upon the infrastructure of the MACC as well as that of the Arbor Lakes Data Facility, home to one instance of the TSM Group's backup tape library. Both locations provide closely monitored and highly redundant environments that help ensure that HathiTrust's infrastructure remains secure and operable. At the same time, administrative and data management functions critical to the development and maintenance of the repository take place in the University of Michigan's Hatcher Graduate Library. The service and cooperation of Michigan's Plant Operations Division are therefore critical for the continued access to and use of this structure in the operation of HathiTrust.

General Maintenance and Repairs in University of Michigan Facilities

Facilities and maintenance issues on the University of Michigan campus are reported to the Plant Operations Division, the Department of Public Safety (DPS), and Occupational Safety and Environmental Health (OSEH) in addition to the impacted facility's manager. Repair work is coordinated by the University Library facilities manager in conjunction with administrators and workers from Plant Operations.

The Michigan Academic Computing Center (MACC)

The MACC hosts many of the key components of Michigan's University Library system as well as the technical infrastructure of HathiTrust. The University of Michigan does not own the building in which the data center is located but instead operates the MACC in conjunction with the Michigan Information Technology Center (MITC) Foundation and other partners. The MACC Server Hosting Service Level Agreement46 lists the responsibilities of the data center as well as the repository; of particular significance are the MACC's agreements to:

o "Provide a controlled physical environment to support servers [with] room average temperature of between 65 and 75 degrees and 35-50% relative humidity [and] monitored environmentals (temperature, humidity, smoke, water, electrical)." (sec. 4.1)
o "Provide adequate, conditioned, 60-cycle electrical service with adequate backup electrical capacity to support circuits, service, and outlets [and also to] provide Uninterruptible Power Supply (UPS) and generator backup" (sec. 4.2)
o "Provide 7x24 telephone contact for emergencies and for emergency access to facility." (sec. 4.4)
Inadditiontofeaturessuchasredundantelectricalandenvironmentalsystems,theMACC
maintainsafulltimecoordinatorandstaffwhoprovide24x7responsestofailuresormalfunctionsinthe
serverenvironment.Alertspromptedbyissueswiththeenvironmentalsystemsorpoweraresenttothe
UniversityofMichiganNetworkOperationsCenter(NOC)duringnonbusinesshours.
o Overview: The MACC's redundancy is designed to ensure the safety and security of the data housed within. It consists of:
   - A dual power path from the property line to the power distribution units
   - Diesel-powered generators for electrical backup
   - Flywheels (not batteries) to provide power while the generators come on
   - State-of-the-art generators and flywheels for backup power
   - Three extra computer room air conditioners
   - Two extra dry coolers
   - Glycol loop for cooling with two parallel pathways with crossover valves at regular intervals. [47]
   A state-of-the-art monitoring system keeps track of 1,700 different parameters and automatically notifies staff of any irregularity. [48]
o Environmental Controls and Monitoring
   - The MACC has 18 Computer Room Air Conditioning units (CRACs). At any given time, only 15 are necessary to maintain the required temperature and humidity. [Thus, the computer room has N5+1 redundancy in its cooling ability.] It also is equipped with a number of portable coolers to address specific cooling needs. The heat from the room is transferred to an under-floor glycol loop that releases the heat to the outdoors. [49]
[46] Please refer to Appendix H (MACC Server Hosting Service Level Agreement).
[47] Michigan Academic Computing Center. Vital Statistics (2009) retrieved from http://macc.umich.edu/about/vitalstatistics.php on 16 June 2009.
[48] Michigan Academic Computing Center (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
[49] Vital Statistics (2009) retrieved from http://macc.umich.edu/about/vitalstatistics.php on 16 June 2009.
   - The layout of the facility allows the front of the computer racks to face the cold aisles. These aisles have perforated floor tiles through which the cool air is pumped directly to the computers located there. Heat is discharged from the backs of the computers, which creates the hot aisles. This alternating arrangement facilitates the cooling process, as the hot air produced by the computers can be siphoned off before it mingles too much with the cooler air of the facility. [50]
   - Two separate smoke detection and fire alarm systems protect the MACC. One is for the building; the other is for the MACC itself. The two systems work together to activate alarm systems and notify the fire department and key personnel. In the event of an actual fire, the fire suppression system pipes will not fill with water unless there is a pressure drop caused by melting of one or more of the sprinkler heads. [51]
o Backup Power
   - Three generators, each roughly the size of a rail car, provide backup power. Only two of the three are required to run the facility in the event of a power outage. [52]
   - The MACC uses environmentally responsible flywheels instead of batteries for power backup while the generators come online. The combination of generators and flywheels provides the facility with a fully redundant uninterruptible power system (UPS). [53]
   - The MACC has a contract with the UM Plant Operations Division for the delivery of diesel fuel for its generators in the event of an extended blackout. [54]
   - In the event that a backup generator is disabled, the MACC coordinator will initiate load shed, in which one half of the MACC will be shut down so that the other half (and requisite environmental systems) may continue to operate. The HathiTrust and UM Library racks are among those which will retain power should this response prove necessary. [55]
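The redundancy figures quoted above (18 CRACs of which 15 are needed, three generators of which two are needed) can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the class `RedundantPool` and its method names are hypothetical, and the two status labels simply mirror the "within redundancy" and "past redundancy tolerance" wording of the severity table rather than any MACC monitoring software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundantPool:
    """A pool of interchangeable units, some of which are spares (hypothetical model)."""
    name: str
    installed: int  # units physically present
    required: int   # units needed to carry the full load

    def spares(self) -> int:
        """Number of units that can fail before capacity falls short."""
        return self.installed - self.required

    def status(self, failed: int) -> str:
        """Classify the pool after `failed` units have gone offline."""
        if failed < 0 or failed > self.installed:
            raise ValueError("failed must be between 0 and installed")
        working = self.installed - failed
        if working >= self.required:
            return "within redundancy"
        return "past redundancy"


# Figures taken from the report text.
CRACS = RedundantPool("CRAC units", installed=18, required=15)
GENERATORS = RedundantPool("generators", installed=3, required=2)

if __name__ == "__main__":
    for pool in (CRACS, GENERATORS):
        print(f"{pool.name}: {pool.spares()} spare unit(s)")
```

Under these figures the cooling plant tolerates three simultaneous CRAC failures and the power plant one generator failure; a second generator failure is the point at which the table's "past redundancy tolerance" event applies.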
Arbor Lakes Data Facility (ALDF)
The ALDF houses the TSM Group's infrastructure and one instance of the backup tape library that forms an integral part of HathiTrust's Disaster Recovery strategy. As the home of critical components of the UMnet Backbone, the ALDF provides a safe and secure location for one set of the repository's backup tapes. In the interest of security, this report will omit further information on the exact nature of the facility's power and environmental systems.
[50] Ibid.
[51] Ibid.
[52] Michigan Academic Computing Center (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
[53] Ibid.
[54] Gobeyn, Rene (MACC Data Center Coordinator). Personal interview on 23 June 2009.
[55] Ibid.
Scenario 6: Software Failure or Obsolescence

Review: Risks Involving Software Failure or Obsolescence
The following table details various risks inherent to software failure or obsolescence and ranks them according to their severity.
HathiTrust's Solutions for Software Issues
The development and use of HathiTrust's tools and resources depend on highly functional software applications. Repository policies have therefore been crafted to ensure that these applications are thoroughly tested and regularly updated to minimize the threat of service outages as a result of software failure or obsolescence. HathiTrust furthermore employs open-source applications that are well supported and enjoy widespread use and development within the digital library community.
o Changes in software releases of all components of the system (from ingest to access) are developed and tested in an isolated development environment to prepare for release to production. When ready for release, developers record the changes made and increment version numbers of system components as appropriate using a version control system. New versions of software are released using automated mechanisms (in order to prevent manual errors). Major changes and upgrades in hardware architecture are recorded in monthly reports of unit activity, and thus are traceable to that level of detail. (HT TRAC C1.8)
o Additionally, subsets of production data are available in the development environment to allow developers to ensure proper system behavior before releasing changes to production. (HT TRAC C1.9)
o In order to design, build and modify software for the designated end user community, HathiTrust conducts an active usability program and seeks input from the Strategic Advisory Board of HathiTrust. Similarly, with regard to software development in support of the archiving needs of the Participating Libraries, HathiTrust focuses on the development of highly functional ingest and validation mechanisms. HathiTrust also seeks and responds to guidance from the Strategic Advisory Board with regard to archiving services. (HT TRAC C2.2)
Severity    Events

High Impact
o Software bug escapes detection in development environment and results in crash of application.

Moderate Impact
o Software bug escapes detection in development environment and prevents full access to digital objects.
o Improper version of software is introduced to system (could have a greater or lesser impact depending on results of error and repository's ability to detect it).

Low Impact
o Software bug escapes detection in development environment and prevents full use of system capabilities (i.e., rotation of images or additional functionality).
Scenario 7: Operator Error

Review: Risks Involving Operator Error
The following table summarizes risks to HathiTrust posed by operator error; events are ranked according to their potential severity.

HathiTrust's Solutions for Operator Error
In any human enterprise, occasional operator error is unavoidable; HathiTrust strives to ensure that any such events are detected and resolved in a timely fashion. [56] To help avoid occurrences and mitigate their potential impact, HathiTrust has automated many procedures and also relies upon application assertions, which can notify administrators when processes are not operating correctly. Even if an error is introduced to the file system and then backed up, the TSM client saves up to seven versions of a file for up to six months so that an earlier version can be retrieved.
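To make the practical limit of that retention rule concrete, the sketch below models "up to seven versions for up to six months" and asks which retained version predates an error. It is a simplified illustration, not the TSM client's actual policy engine: the function names are hypothetical, "six months" is approximated as 183 days, and one backup per night is assumed.

```python
from datetime import datetime, timedelta

MAX_VERSIONS = 7                 # "up to seven versions" (from the report)
MAX_AGE = timedelta(days=183)    # "up to six months", approximated (assumption)


def retained_versions(backup_times, now):
    """Return the backup timestamps still retrievable, newest first."""
    recent = [t for t in backup_times if now - t <= MAX_AGE]
    return sorted(recent, reverse=True)[:MAX_VERSIONS]


def last_good_version(backup_times, error_time, now):
    """Return the newest retained version written before the error, or None."""
    for t in retained_versions(backup_times, now):
        if t < error_time:
            return t
    return None


if __name__ == "__main__":
    now = datetime(2009, 8, 24)
    nightly = [now - timedelta(days=i) for i in range(10)]
    # An error introduced two and a half days ago is still recoverable.
    print(last_good_version(nightly, now - timedelta(days=2, hours=12), now))
    # An error that went unnoticed for eight nightly backups is not.
    print(last_good_version(nightly, now - timedelta(days=8), now))
```

Under these assumptions, a file that changes nightly can only be rolled back about a week, so the version limit rather than the six-month window is what bounds the time available to detect an operator error.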
Ingest: The Google Return (Object-Oriented) Validation Environment (GROOVE) process is entirely automated to avoid the introduction of operator error to the process; steps include:
o Identification of material for ingest
o Decryption and unzipping of files
o Format verification and validation with JHOVE
o Luhn barcode and MD5 checksum validation
o Creation of HathiTrust METS documents
o Establishment of HathiTrust handles (persistent URLs)
o Extension of the pairtree file directory (as new material enters the system)
Archival Storage: Files stored within the repository are not accessed directly or manipulated by staff so that neither the zipped image and OCR files nor the METS document may be accidentally altered or deleted.
Dissemination: The page-turner application references the stored image and then creates a .png (for TIFFs) or .jpg (for JPEG2000s) file for display to the viewer.
Data Management: New versions of software are released using automated mechanisms (in order to prevent manual errors). (HT TRAC C1.8)
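The barcode and checksum validation step of the ingest process listed above can be illustrated with standard-library Python. This is a minimal sketch rather than GROOVE's own code: it assumes the barcode test is the standard Luhn check-digit algorithm and that each file arrives with an expected MD5 value; the function names `md5_matches` and `luhn_valid` are hypothetical.

```python
import hashlib


def md5_matches(path, expected_hex):
    """Compare a file's MD5 digest with the checksum supplied alongside it."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large image archives are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


def luhn_valid(barcode):
    """Return True if a numeric barcode passes the Luhn check-digit test."""
    if not (barcode.isascii() and barcode.isdigit()):
        return False
    total = 0
    for position, char in enumerate(reversed(barcode)):
        digit = int(char)
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

Because both checks are deterministic and need no human judgment, they suit the fully automated pipeline the report describes: a failed check can stop ingest of an item without an operator inspecting it by hand.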
Severity    Events

High Impact
o Operator error results in the irreparable loss of data or damage to equipment.
o Operator error results in loss of key repository functions (ingest, storage, dissemination, etc.) for an extended period of time.

Moderate Impact
o Operator error remains undetected and causes persistent problems in the system but has no long-term consequences.

Low Impact
o Operator error is detected by normal procedures or via an activity log and can be readily corrected.

[56] Please refer to Appendix B (HathiTrust Outages from March 2008 through April 2009).
Scenario 8: Physical Security Breach

Review: Risks Involving a Physical Security Breach
Maintaining the physical security of the HathiTrust infrastructure is yet another crucial element in the repository's efforts to manage risks and thereby lessen the chance that a disaster-type event occurs. Risks involve the damage and destruction of equipment and could even extend to unauthorized system access. Multiple levels of security exist at both the Michigan Academic Computing Center (MACC) and the Arbor Lakes Data Facility (ALDF) to protect HathiTrust from acts of vandalism, destruction or malicious tampering. Details on the potential impacts of a physical security breach are covered in Scenario 1: Hardware Failure and Scenario 3: Network Security.
HathiTrust's Solutions for Physical Security
o Each of [the HathiTrust] storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel. [57]

Security at the MACC
The MACC Server Hosting SLA states the data center staff will:
o Provide services necessary to maintain a safe, secure, and orderly environment for all tenants of the MACC. (sec. 4.7)
o Provide access control via HiD card and biometric readers for those listed on the Tenant Staff Authorized for Access list. (sec. 4.5)
The MACC website and the Michigan Academic Computing Center Operating Agreement [58] provide additional details concerning the resources and procedures that help protect HathiTrust's equipment at the MACC. The MACC Data Center Coordinator personally oversees the enforcement of security protocols and conducts regular audits of security logs and, when necessary, reviews surveillance video footage.
o Security Systems
   - State-of-the-art security devices such as iris scanners, cameras, closed-circuit television and on-call staff keep the data and machines housed in the MACC safe. [59]
   - Access to the data center will be by two-factor authentication (access card and iris scan) or escorted, supervised access. Access to the building will be by access card. (MACC OA, sec. 5.3.1)
   - Cameras throughout the corridor, security trap, and facility will be monitored and maintained by the Data Center Coordinator. (sec. 5.2.1)
o Security Procedures
   - The Operations Advisory Committee will establish procedures for granting access cards to the facility to those whose jobs require hands-on access to systems. All requests for access cards will be vetted and approved by the Operations Advisory Committee at their next meeting. (sec. 5.3.2)
   - Everyone on the access list for the data center will be required to attend a training session before working in the data center and sign an access agreement stating policies they must observe while in the data center. (sec. 5.3.8)

[57] HathiTrust. Technology (2009) retrieved from http://www.hathitrust.org/technology on 15 June 2009.
[58] Please refer to Appendix I (Michigan Academic Computing Center Operating Agreement).
[59] Michigan Academic Computing Center. Vital Statistics (2009) retrieved from http://macc.umich.edu/about/vitalstatistics.php on 17 June 2009.
Security at the ALDF
As noted in the TSM Backup Service SLA, the University of Michigan's ITCS is responsible for physical security at the ALDF. (sec. 4.9) While this document will not detail specific features of the ALDF's operation, multiple levels of security and oversight are employed.
Scenario 9: Natural or Man-made Disaster

Review: Risks Involving a Natural or Man-made Disaster
The following table details the risks to HathiTrust posed by a natural or man-made disaster; events are ranked by order of their severity. Due to possible overlap between this scenario and Scenario 1 (Hardware Failure), readers are encouraged to consult that earlier section.
HathiTrust's Solutions for Natural or Man-made Catastrophic Events
The University of Michigan Ann Arbor Campus Emergency Procedures (revised January 2008) has set procedures to address building evacuations (in the event of fire), tornadoes, severe weather, flooding, chemical/biological/radioactive spills, as well as bomb threats, civil disturbances, and acts of violence or terrorism. [60] In all cases, staff will follow the directions of Public Safety and not re-enter buildings or resume work until advised to do so by DPS or OSEH or someone from on-site incident command.
In the event of a severe natural or man-made disaster, the repair and restoration of the physical locations of HathiTrust infrastructure would need to be coordinated between the repository and the appropriate facility managers. Such activity would rely upon the disaster recovery plans in place at the MITC Building (home of the MACC) and University of Michigan (which includes the Hatcher Graduate Library and the ALDF). It must be noted that an event which causes significant damage to an important structure or to a building's infrastructure could result in the loss of an instance of the repository for an extended period of time. In such a case, HathiTrust would need to set up an alternate hot site until structural restoration is complete (or a new facility has been found).
Severity    Events

High Impact
o Widespread damage to a data center and/or its infrastructure that forces an instance of the repository to find a new hot site with sufficient power supply, environmental controls, and security.
o Damage to work areas forces staff to relocate to a new center of operations.
o Extensive loss or damage to hardware requires large-scale replacement.
o With the extended loss of one site, HathiTrust loses redundancy (and possibly some functionality: i.e. the ability to ingest new material in Ann Arbor) and thus a central component of its disaster recovery and backup plans.
o An act of violence or terrorism occurs at or near HathiTrust facilities.

Moderate Impact
o An event results in an extended outage at one site that exceeds the recovery time objective.
o Hardware sustains some damage and site is able to continue operation in a reduced capacity.
o An actual or threatened act of violence or terrorism forces the temporary evacuation or quarantine of HathiTrust facilities.

Low Impact
o Local conditions result in a temporary outage at a HathiTrust site.

[60] Please see Appendix C (Washtenaw County Hazard Ranking List).
Basic Disaster Recovery Strategies
In the immediate aftermath of a large-scale man-made or natural disaster, the repository's immediate recovery will be enabled by its basic system architecture:
o the initiative's technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage are located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate facility outside of Ann Arbor). [61]
The establishment of the mirror site in Indianapolis and the retention of multiple backup tapes at two locations in Ann Arbor ensure that a serious event at either location will not impede the continued functioning of the repository at the other. Consideration must be given as to how data at the Indianapolis site will be backed up and how key repository functions (such as ingest) will proceed if the Ann Arbor instance is offline for an extended period of time. Likewise, a long-term outage at the IU location would require HathiTrust to establish a third site for data backup (i.e., a location where additional copies of backup tapes could be stored).

[61] HathiTrust. Technology retrieved from http://www.hathitrust.org/technology on 15 June 2009.
Scenario 10: Media Failure or Obsolescence

Review: Risks Involving Media Failure or Obsolescence
The following table summarizes risks to HathiTrust posed by the failure of the media used for its data backups. While the risks from this are limited (both copies of the tape backups would have to be impacted for data to be unavailable), the issue should nonetheless be addressed with regular test restorations and/or inspections of the media.
HathiTrust's Solutions for Media Failure
Given the nature of HathiTrust's storage system, this scenario is only a concern with regard to the digital magnetic tapes used by the TSM Group for backups.
o Two tape copies of all backup data are made and these are stored in separate climate-controlled conditions in tape libraries at the MACC and the ALDF.
o Content is transferred to new tape during data defragmentation (which occurs when existing tapes are 80% full).
o If a degraded or otherwise bad section of tape is detected during a backup procedure, that tape is immediately marked as read-only.
   - Data is thenceforth written to a different tape; existing data on the bad tape will be copied to properly functioning media.
   - If data cannot be reclaimed from bad tape, the TSM Group would contact HathiTrust so that the backup of content can be properly completed.
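The tape-handling rules listed above amount to a small decision procedure, sketched here for illustration. Everything in the block is hypothetical modeling (the `Tape` class, the function names, and the idea of passing in the set of files that can still be read); only the 80% threshold, the read-only marking, the copy to good media, and the escalation to HathiTrust come from the report.

```python
from dataclasses import dataclass, field

DEFRAG_THRESHOLD = 0.80  # tapes are defragmented when 80% full (from the report)


@dataclass
class Tape:
    """Minimal stand-in for a backup tape (hypothetical model)."""
    label: str
    capacity: int                               # arbitrary size units
    files: dict = field(default_factory=dict)   # file name -> size
    read_only: bool = False

    def used_fraction(self) -> float:
        return sum(self.files.values()) / self.capacity


def needs_defragmentation(tape: Tape) -> bool:
    """True once the tape reaches the fill level that triggers defragmentation."""
    return tape.used_fraction() >= DEFRAG_THRESHOLD


def retire_bad_tape(bad: Tape, spare: Tape, readable: set) -> list:
    """Mark `bad` read-only and copy whatever is still readable to `spare`.

    Returns the names of files that could not be reclaimed; in the report's
    procedure these are the cases where the TSM Group contacts HathiTrust.
    """
    bad.read_only = True
    unrecovered = []
    for name, size in bad.files.items():
        if name in readable:
            spare.files[name] = size
        else:
            unrecovered.append(name)
    return unrecovered
```

Note that in this procedure a tape is only examined when it is written to or defragmented, which is the gap the following section on remaining vulnerabilities addresses.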
Remaining Vulnerabilities
There is some reason for concern in this area because the TSM Group does not have a regular program to monitor its media for physical degradation or impairment after data defragmentation. While the tapes are reported to be highly dependable, problems such as sticky-shed (the hydrolysis of the tape's binder) could become an issue with older tapes. A regular program of tape validation or test restorations would provide an opportunity to check on the physical condition and data integrity of the tapes. Likewise, the creation of a schedule for the replacement of older tapes could avoid future problems with media degradation.
Severity    Events

High Impact
o Physical degradation (i.e. in tape binder, substrate, or magnetic content) affects both copies of older backup tapes.

Moderate Impact
o Because backup tapes are not regularly tested or audited, the physical substrate of tapes may degrade over time.

Low Impact
o Bad tape is detected during a tape backup.
Conclusions and Action Items

Conclusions
As this report demonstrates, a variety of risk management strategies in addition to design elements, operating procedures, and service and support contracts endow HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a range of disasters. The establishment of the Indianapolis mirror site, the performance of nightly tape backups, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. As it is, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.
In the effort to secure HathiTrust's long-term continuity, the present document stands merely as a preliminary step in the establishment of a legitimate Disaster Recovery Plan. The data on HathiTrust's policies, procedures, and contracts consolidated herein should facilitate the data collection requisite to the initial phases of the planning process, but the core activities of formulating technical and administrative response strategies and delegating roles and responsibilities remain to be undertaken. The following section outlines recommendations and action items derived from research into the repository as well as from discussions with Cory Snavely and other HathiTrust staff members. Items have been separated into an approximate timeline of activity ranging from Short-Term through Long-Term and the arrangement within each category repres