Archiving Scientific Data
Susan B. Davidson
CIS 700: Advanced Topics in Databases
MW 1:30-3
Towne 309
http://www.cis.upenn.edu/~susan/cis700/homepage.html
Why archive?
• Data changes over time
  • New data is added
  • Mistakes are corrected
  • Old data is removed
• To enable reproducibility and verifiability, it must be possible to access the state of a database as of a certain point in time.
  • Also crucial for dereferencing citations
• May also want to ask questions about how the database has changed.
How to archive?
• Many databases periodically publish new versions
• Keep copy of each version
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history
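The trade-off between the two approaches can be sketched in a few lines of Python. This is only an illustration: the database is modeled as a flat dictionary, and the names `diff` and `replay` are hypothetical helpers, not part of any system described in these slides.

```python
from typing import Dict, List, Tuple

Database = Dict[str, str]           # assumed model: record id -> value
Delta = List[Tuple[str, str, str]]  # (op, key, value), op is "put" or "del"


def diff(old: Database, new: Database) -> Delta:
    """Compute the changes that turn `old` into `new`."""
    delta: Delta = []
    for key in old.keys() - new.keys():
        delta.append(("del", key, ""))
    for key, value in new.items():
        if old.get(key) != value:
            delta.append(("put", key, value))
    return delta


def replay(deltas: List[Delta], upto: int) -> Database:
    """Rebuild version `upto` (1-based) by applying the deltas in order."""
    state: Database = {}
    for delta in deltas[:upto]:
        for op, key, value in delta:
            if op == "put":
                state[key] = value
            else:
                state.pop(key, None)
    return state


versions = [
    {"a": "1"},
    {"a": "1", "b": "2"},
    {"a": "9", "b": "2"},
]

# Approach 1: keep every version; access by time is a single lookup.
snapshots = list(versions)

# Approach 2: keep only the deltas between consecutive versions.
deltas: List[Delta] = []
previous: Database = {}
for version in versions:
    deltas.append(diff(previous, version))
    previous = version
```

Here `snapshots[1]` is immediate but stores unchanged records repeatedly, while `replay(deltas, 2)` stores each change once but must walk the log from the beginning.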
Outline
• Versioning and citation: experiences with eagle-i
• Archiving XML datasets
• Conclusions
Our experience: eagle-i
• eagle-i is an RDF dataset which contains information about resources for translational research (e.g. software, cell lines, lab facilities)
• Each resource has an immutable eagle-i id; the subject of each resource triple is an eagle-i id
• Resources are classified using an ontology, and the citation depends on the classification of the resource.
• eagle-i talked about citation but didn’t automate it…
Citation architecture
eagle-i versioning manager
• The latest copy of eagle-i is available on the website, but it is not “versioned”
• We did a daily download since we didn’t know how frequently it changed (not frequently!)
• Needed “time queries” to understand how the dataset changed over time
  • What triples were added/deleted in the period [t, t’]?
  • What was the object of triple X at time t?
  • When was triple Y first added/deleted?
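One way to support these three time queries is to compare each downloaded snapshot with the previous one and record add/delete events per triple. The sketch below is a guess at such a design, not the actual versioning manager. The class `TripleHistory`, the integer timestamps, and the resource id `r1` are all made up for illustration, and `object_at` assumes at most one object per (subject, predicate) pair at any time.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleHistory:
    """Records, per triple, the snapshot times at which it appeared or vanished."""

    def __init__(self) -> None:
        self.events: Dict[Triple, List[Tuple[int, str]]] = {}
        self.current: Set[Triple] = set()

    def load_snapshot(self, t: int, triples: Set[Triple]) -> None:
        """Diff the snapshot taken at time t against the previous snapshot.

        Snapshots are assumed to be loaded in increasing order of t.
        """
        for triple in triples - self.current:
            self.events.setdefault(triple, []).append((t, "added"))
        for triple in self.current - triples:
            self.events.setdefault(triple, []).append((t, "deleted"))
        self.current = set(triples)

    def changes_between(self, t1: int, t2: int) -> List[Tuple[int, str, Triple]]:
        """Triples added or deleted in the period [t1, t2]."""
        return sorted(
            (t, kind, triple)
            for triple, events in self.events.items()
            for t, kind in events
            if t1 <= t <= t2
        )

    def object_at(self, subject: str, predicate: str, t: int) -> Optional[str]:
        """Object of the triple (subject, predicate, ?) at time t, if any."""
        for (s, p, o), events in self.events.items():
            if (s, p) != (subject, predicate):
                continue
            alive = False
            for when, kind in events:
                if when <= t:
                    alive = kind == "added"
            if alive:
                return o
        return None

    def first_event(self, triple: Triple, kind: str) -> Optional[int]:
        """Time at which the triple was first 'added' or first 'deleted'."""
        for when, k in self.events.get(triple, []):
            if k == kind:
                return when
        return None


history = TripleHistory()
history.load_snapshot(1, {("r1", "label", "Scanner")})
history.load_snapshot(2, {("r1", "label", "MRI scanner")})
```

Because only events are stored, a day on which nothing changed adds nothing to the history, which suits a dataset that changes infrequently.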
Example: versioning 2 RDF triples
Versioning and citation
• When should versioning be triggered?
  • At least when a user cites an eagle-i resource
• What should be versioned?
  • At least changes to the resource being cited.
➢ If a version of a resource is not cited, it does not have to be stored.
➢ However, time-based queries will only detect changes with respect to citations rather than all changes.
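The citation-triggered policy can be illustrated as follows. This is a minimal sketch under assumed names (`CitationVersioner`, `cite`, `dereference`, resource id `r1`), not the eagle-i implementation: a version of a resource is stored only at the moment it is cited, and only if it differs from the last stored version.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Citation = Tuple[str, int]     # (resource id, version number), hypothetical format


class CitationVersioner:
    """Stores a version of a resource only when that resource is cited."""

    def __init__(self) -> None:
        self.versions: Dict[str, List[FrozenSet[Triple]]] = {}

    def cite(self, resource_id: str, live_triples: Iterable[Triple]) -> Citation:
        """Version the cited resource if it changed since its last citation."""
        current = frozenset(t for t in live_triples if t[0] == resource_id)
        stored = self.versions.setdefault(resource_id, [])
        if not stored or stored[-1] != current:
            stored.append(current)
        return resource_id, len(stored)

    def dereference(self, citation: Citation) -> FrozenSet[Triple]:
        """Return the triples of the resource exactly as they were cited."""
        resource_id, version = citation
        return self.versions[resource_id][version - 1]


versioner = CitationVersioner()
first = versioner.cite("r1", {("r1", "label", "A")})
# The dataset then changes to label "B", but nobody cites r1, so nothing is stored.
second = versioner.cite("r1", {("r1", "label", "C")})
repeat = versioner.cite("r1", {("r1", "label", "C")})
```

The uncited state with label "B" never reaches the store, which is exactly the limitation noted above: time queries see changes only as of citation points.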
Outline
• Versioning and citation: experiences with eagle-i
• Archiving XML
• Conclusions
Recall: approaches to archiving
• Keep copy of each new version of the database
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history
Problem with diff-based approaches
• Ignores the “semantic continuity of keys” by focusing on minimal edit distance
Proposed approach in paper
• Focus on hierarchical scientific datasets
  • XML-based
  • Changes are primarily insertions
• Changes identified based on keys
• Version merging based on keys
• Inheritance of timestamps
  • Timestamp is stored at a child element only when it is different from the timestamp of its parent element
➢ “Key-based + merging” approach
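Timestamp inheritance can be sketched as below. The `Node` class and the functions `compact` and `effective_timestamps` are illustrative assumptions: a timestamp is modeled as a set of version numbers, and `None` stands for "inherit from the parent".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    label: str
    timestamp: Optional[Set[int]] = None  # None means "same as parent"
    children: List["Node"] = field(default_factory=list)


def compact(node: Node, inherited: Optional[Set[int]] = None) -> None:
    """Drop a stored timestamp wherever it equals the parent's effective one."""
    own = node.timestamp if node.timestamp is not None else inherited
    if inherited is not None and node.timestamp == inherited:
        node.timestamp = None
    for child in node.children:
        compact(child, own)


def effective_timestamps(
    node: Node, inherited: Optional[Set[int]] = None
) -> List[Tuple[str, Optional[Set[int]]]]:
    """Resolve inheritance: list (label, versions) for every node."""
    own = node.timestamp if node.timestamp is not None else inherited
    result = [(node.label, own)]
    for child in node.children:
        result.extend(effective_timestamps(child, own))
    return result


sal = Node("sal", {2, 3})
emp = Node("emp", {2, 3}, [sal])
dept = Node("dept", {1, 2, 3}, [emp])
db = Node("db", {1, 2, 3}, [dept])
compact(db)
```

After `compact(db)`, only `db` and `emp` still carry a stored timestamp; `dept` and `sal` inherit theirs, and `effective_timestamps(db)` recovers the full picture.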
Example: sequence of versions
Adding keys
Example of an archive
Representing archive in XML
What is a key for XML?
• A key has form (Q, {P1, …, Pk}), where Q, Pi are path expressions
  • Q identifies the target set
  • Pi are key paths, analogous to key attributes in relations
• An XML document satisfies a key (Q, {P1, …, Pk}) if
  • From any node identified by Q, every Pi exists uniquely
  • If two nodes n1 and n2 identified by Q have the same value at the end of each key path in {P1, …, Pk} then n1 and n2 are the same node.
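The two conditions for key satisfaction can be checked directly with Python's standard `xml.etree.ElementTree`. This is a simplified sketch: the function `satisfies_key` is hypothetical, paths are written relative to the parsed root element (so `/db/dept` becomes `dept`), and the "value" at the end of a key path is taken to be the element's text rather than a whole subtree.

```python
import xml.etree.ElementTree as ET
from typing import List


def satisfies_key(root: ET.Element, target_path: str, key_paths: List[str]) -> bool:
    """Check the key (target_path, key_paths) against the document at `root`."""
    seen = set()
    for node in root.findall(target_path):
        values = []
        for path in key_paths:
            matches = node.findall(path)
            if len(matches) != 1:  # every key path must exist uniquely
                return False
            values.append((matches[0].text or "").strip())
        key_value = tuple(values)
        if key_value in seen:      # two distinct nodes share a key value
            return False
        seen.add(key_value)
    return True


DOC = """<db>
  <dept>
    <name>finance</name>
    <emp><fn>Joe</fn><ln>Doe</ln></emp>
    <emp><fn>Jane</fn><ln>Doe</ln></emp>
  </dept>
</db>"""
root = ET.fromstring(DOC)
```

On this sample document, `satisfies_key(root, "dept/emp", ["fn", "ln"])` holds, while `["ln"]` alone is not a key because both employees share the last name.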
Relative keys
• Since XML is hierarchical, we also need to specify keys relative to a context node
  • (Q, (Q’, {P1, …, Pk}))
• Examples
  • (/, (db, {})). There is at most one db element below the root.
  • (/db, (dept, {name})). Every dept node within a db node can be uniquely identified by the contents of its name subelement.
  • (/db/dept, (emp, {fn, ln})). Every emp node within a dept node along the path /db/dept can be uniquely identified by the contents of its fn and ln subelements.
  • (/db/dept/emp, (sal, {})). There is at most one sal subelement under each emp node along the path /db/dept/emp.
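A relative key only requires uniqueness within each context node, which the following sketch makes concrete. As before, `satisfies_relative_key` is a hypothetical helper, paths are relative to the parsed root element, `"."` stands for the root itself, and key values are compared by element text.

```python
import xml.etree.ElementTree as ET
from typing import List


def satisfies_relative_key(
    root: ET.Element, context_path: str, target_path: str, key_paths: List[str]
) -> bool:
    """Check the relative key (context_path, (target_path, key_paths))."""
    contexts = [root] if context_path == "." else root.findall(context_path)
    for context in contexts:
        seen = set()  # uniqueness is enforced per context node only
        for node in context.findall(target_path):
            values = []
            for path in key_paths:
                matches = node.findall(path)
                if len(matches) != 1:
                    return False
                values.append((matches[0].text or "").strip())
            key_value = tuple(values)
            if key_value in seen:
                return False
            seen.add(key_value)
    return True


DOC = """<db>
  <dept><name>finance</name>
    <emp><fn>Joe</fn><ln>Doe</ln><sal>90K</sal></emp></dept>
  <dept><name>marketing</name>
    <emp><fn>Joe</fn><ln>Doe</ln><sal>80K</sal></emp></dept>
</db>"""
root = ET.fromstring(DOC)

per_dept = satisfies_relative_key(root, "dept", "emp", ["fn", "ln"])
whole_doc = satisfies_relative_key(root, ".", "dept/emp", ["fn", "ln"])
one_sal = satisfies_relative_key(root, "dept/emp", "sal", [])
```

Two employees named Joe Doe in different departments are fine under the key relative to `dept` (`per_dept` is true) but would violate the same key taken over the whole document (`whole_doc` is false). An empty set of key paths, as in the `sal` example, expresses "at most one such child".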
Archiver architecture
• Assumptions:
  • Every key defined for a node is relative to its parent, e.g. the key for emp is relative to its parent dept node
  • Frontier nodes identify unkeyed portions of the document
Nested merge
• Recursively merge nodes in the incoming version (D) to nodes in the archive (A) that have the same key value, starting from the root.
• When a node y in D is merged with a node x from A, the timestamp of x is augmented with i (the new version number), and subtrees are recursively merged.
• Nodes in D that do not have corresponding nodes in A are simply added with i as the timestamp
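The merge steps above can be sketched as follows. This is a simplified illustration, not the paper's algorithm in full: every node stores an explicit set of version numbers (no timestamp inheritance), key values are assumed to be precomputed into strings such as `emp[fn=Joe,ln=Doe]`, an incoming version is a nested dictionary from key string to children, and a text value is modeled as a child whose key is the text itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ArchiveNode:
    key: str  # label plus key value, assumed precomputed
    timestamps: Set[int] = field(default_factory=set)
    children: Dict[str, "ArchiveNode"] = field(default_factory=dict)


def nested_merge(archive: ArchiveNode, version: dict, i: int) -> None:
    """Merge version number i into the archive, matching nodes by key."""
    archive.timestamps.add(i)            # x's timestamp is augmented with i
    for key, subtree in version.items():
        child = archive.children.get(key)
        if child is None:                # no corresponding node in A: add it
            child = ArchiveNode(key)
            archive.children[key] = child
        nested_merge(child, subtree, i)  # recurse into the subtrees


JOE = "emp[fn=Joe,ln=Doe]"
JANE = "emp[fn=Jane,ln=Roe]"
DEPT = "dept[name=finance]"

v1 = {DEPT: {JOE: {"sal": {"90K": {}}}}}
v2 = {DEPT: {JOE: {"sal": {"95K": {}}}, JANE: {"sal": {"80K": {}}}}}

root = ArchiveNode("db")
nested_merge(root, v1, 1)
nested_merge(root, v2, 2)
```

Nodes that exist in the archive but not in the incoming version are left untouched, so their timestamps simply do not gain `i`. In the example, the old salary value keeps timestamp {1} while the employee node itself carries {1, 2}.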
Further compaction under frontier node
Querying the archive
• What is the database at t=1?
• When did Joe Doe get a salary raise?
• What were the changes to the database between t=1 and t=3?
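With timestamps attached to nodes, each of these questions becomes a filter over the archive. The sketch below assumes a flattened archive, a dictionary from a keyed path to the set of versions in which that node exists. The paths, salary values, and function names are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

Archive = Dict[str, Set[int]]  # keyed path -> versions in which the node exists


def snapshot(archive: Archive, t: int) -> List[str]:
    """The database at version t: every node whose timestamp contains t."""
    return sorted(path for path, versions in archive.items() if t in versions)


def changes(archive: Archive, t1: int, t2: int) -> Tuple[List[str], List[str]]:
    """Nodes (added, removed) when going from version t1 to version t2.

    Only the two endpoints are compared; intermediate versions are ignored.
    """
    before, after = set(snapshot(archive, t1)), set(snapshot(archive, t2))
    return sorted(after - before), sorted(before - after)


def value_history(archive: Archive, prefix: str) -> List[Tuple[int, str]]:
    """(first version, value) for each value ever stored under `prefix`."""
    return sorted(
        (min(versions), path[len(prefix):])
        for path, versions in archive.items()
        if path.startswith(prefix) and versions
    )


EMP = "/db/dept[name=finance]/emp[fn=Joe,ln=Doe]"
archive: Archive = {
    EMP: {1, 2, 3},
    EMP + "/sal/90K": {1, 2},
    EMP + "/sal/95K": {3},
}

salary_history = value_history(archive, EMP + "/sal/")
```

In this toy archive, `salary_history` shows the second salary value first appearing at version 3, which answers the "when" question, and `changes(archive, 1, 3)` reports the new value as added and the old one as removed.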
Conclusions
• Versioning is important for many different applications
• While techniques are similar between different representations (e.g. files, relations, XML, RDF), differences in assumptions can be used to build more efficient solutions.
  • And the operations (e.g. queries) you wish to perform are important too!