Archiving Scientific Data Susan B. Davidson CIS 700: Advanced Topics in Databases MW 1:30-3 Towne 309 http://www.cis.upenn.edu/~susan/cis700/homepage.html

Archiving Scientific Data - University of Pennsylvania


Page 1: Archiving Scientific Data - University of Pennsylvania

Archiving Scientific Data

Susan B. Davidson
CIS 700: Advanced Topics in Databases

MW 1:30-3

Towne 309

http://www.cis.upenn.edu/~susan/cis700/homepage.html

Why archive?

• Data changes over time
  • New data is added
  • Mistakes are corrected
  • Old data is removed
• To enable reproducibility and verifiability, it must be possible to access the state of a database as of a certain point in time.
  • Also crucial for dereferencing citations
• May also want to ask questions about how the database has changed.

How to archive?

• Many databases periodically publish new versions
• Keep copy of each version
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history
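The trade-off between the two strategies can be sketched in a few lines. This is a minimal illustration, assuming a toy key-value database; the helper names (`diff`, `apply_delta`, `state_at`) and the gene data are hypothetical and do not come from the slides.

```python
from typing import Dict, List, Optional

Database = Dict[str, str]
# A delta maps key -> new value, or None when the key is deleted.
Delta = Dict[str, Optional[str]]


def diff(old: Database, new: Database) -> Delta:
    """Compute the delta that turns `old` into `new`."""
    delta: Delta = {k: v for k, v in new.items() if old.get(k) != v}
    delta.update({k: None for k in old if k not in new})
    return delta


def apply_delta(db: Database, delta: Delta) -> Database:
    """Return a copy of `db` with `delta` applied."""
    result = dict(db)
    for key, value in delta.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result


def state_at(deltas: List[Delta], t: int) -> Database:
    """Replay the first `t` deltas from the empty database (cost grows with t)."""
    db: Database = {}
    for delta in deltas[:t]:
        db = apply_delta(db, delta)
    return db


# Three hypothetical versions of a tiny database.
versions: List[Database] = [
    {"gene1": "ACGT"},
    {"gene1": "ACGT", "gene2": "TTGA"},  # new data is added
    {"gene1": "ACGA", "gene2": "TTGA"},  # a mistake is corrected
]

# Strategy 1: keep every version; access by time is a list lookup.
snapshots = versions

# Strategy 2: keep only the changes between consecutive versions.
deltas: List[Delta] = []
previous: Database = {}
for version in versions:
    deltas.append(diff(previous, version))
    previous = version
```

With snapshots, `snapshots[t - 1]` answers "what was the data at version t" directly but stores unchanged entries repeatedly. With deltas, only the changes are stored, but `state_at(deltas, t)` has to replay every earlier delta.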

Outline

• Versioning and citation: experiences with eagle-i
• Archiving XML datasets
• Conclusions

Our experience: eagle-i

• eagle-i is an RDF dataset which contains information about resources for translational research (e.g. software, cell lines, lab facilities)
• Each resource has an immutable eagle-i id; the subject of each resource triple is an eagle-i id
• Resources are classified using an ontology, and the citation depends on the classification of the resource.
• eagle-i talked about citation but didn’t automate it…


Citation architecture

[Figure not recovered]

eagle-i versioning manager

• The latest copy of eagle-i is available on the website, but it is not “versioned”
• We did a daily download since we didn’t know how frequently it changed (not frequently!)
• Needed “time queries” to understand how the dataset changed over time
  • What triples were added/deleted in the period [t, t’]?
  • What was the object of triple X at time t?
  • When was triple Y first added/deleted?
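The three time queries can be sketched over a set of daily downloads. This is a minimal illustration, not the eagle-i versioning manager itself: the snapshot data, resource ids, and function names are hypothetical, each (subject, predicate) pair is assumed to have a single object, and `changes_between` compares only the two endpoint snapshots, so a triple added and deleted again inside the period would go unnoticed.

```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical daily downloads, indexed by day number.
snapshots: Dict[int, Set[Triple]] = {
    1: {("r1", "label", "Cell line A")},
    2: {("r1", "label", "Cell line A"), ("r2", "label", "Lab X")},
    3: {("r1", "label", "Cell line A (HeLa)"), ("r2", "label", "Lab X")},
}


def changes_between(t1: int, t2: int) -> Tuple[Set[Triple], Set[Triple]]:
    """Triples added and triples deleted between the snapshots at t1 and t2."""
    before, after = snapshots[t1], snapshots[t2]
    return after - before, before - after


def object_at(subject: str, predicate: str, t: int) -> Optional[str]:
    """Object of the (subject, predicate) triple at time t, if present."""
    for s, p, o in snapshots[t]:
        if (s, p) == (subject, predicate):
            return o
    return None


def first_added(triple: Triple) -> Optional[int]:
    """Earliest snapshot in which the triple appears, or None if it never does."""
    for t in sorted(snapshots):
        if triple in snapshots[t]:
            return t
    return None
```

Every query here scans whole snapshots, which is the cost of the "keep a copy of each version" strategy when the question is about change history rather than a single point in time.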

Example: versioning 2 RDF triples

[Figure not recovered]

Versioning and citation

• When should versioning be triggered?
  • At least when a user cites an eagle-i resource
• What should be versioned?
  • At least changes to the resource being cited.

➢ If a version of a resource is not cited, it does not have to be stored.
➢ However, time-based queries will only detect changes with respect to citations rather than all changes.
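The consequence of versioning only on citation can be made concrete with a small sketch. The class `CitationTriggeredStore`, its methods, and the sample resource are hypothetical names chosen for illustration; the slides do not describe an implementation.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


class CitationTriggeredStore:
    """Stores a version of a resource only at the moment it is cited."""

    def __init__(self) -> None:
        self.current: Dict[str, Resource] = {}
        self.versions: Dict[str, List[Tuple[int, Resource]]] = {}

    def update(self, rid: str, resource: Resource) -> None:
        # Uncited intermediate states are overwritten, never archived.
        self.current[rid] = dict(resource)

    def cite(self, rid: str, time: int) -> Tuple[str, int]:
        # Archive the cited state unless it is unchanged since the last citation.
        history = self.versions.setdefault(rid, [])
        if not history or history[-1][1] != self.current[rid]:
            history.append((time, dict(self.current[rid])))
        # The citation is the resource id plus the timestamp of the stored version.
        return rid, history[-1][0]


store = CitationTriggeredStore()
store.update("r1", {"label": "A"})
store.update("r1", {"label": "B"})  # state "A" was never cited, so it is lost
c1 = store.cite("r1", time=5)
store.update("r1", {"label": "C"})
c2 = store.cite("r1", time=9)
```

Only the states labelled "B" and "C" end up in `store.versions["r1"]`, so a later time query can see the change from "B" to "C" but not the earlier change from "A" to "B".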

Outline

• Versioning and citation: experiences with eagle-i
• Archiving XML
• Conclusions

Recall: approaches to archiving

• Keep copy of each new version of the database
  • Allows data as of a certain time to be accessed quickly
  • May not be space efficient since very little may change between versions
  • Doesn’t allow efficient queries over the change history
• Keep a log of changes (“sequence of delta”)
  • Space efficient
  • May be expensive to recompute data as of a certain time
  • May be expensive to query change history

Problem with diff-based approaches

• Ignores the “semantic continuity of keys” by focusing on minimal edit distance
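A small example may help show what ignoring keys means in practice. The sketch below uses hypothetical employee records, and `positional_diff` is only a simplified stand-in for a minimal-edit diff (it pairs records by position); it is not the algorithm of any particular diff tool.

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical data: one employee leaves and a different one, who happens
# to have the same salary and phone number, is added.
v1: List[Record] = [{"name": "John Doe", "sal": "90K", "tel": "123-4567"}]
v2: List[Record] = [{"name": "Jane Smith", "sal": "90K", "tel": "123-4567"}]


def positional_diff(old: List[Record], new: List[Record]) -> List[str]:
    """Stand-in for a minimal-edit diff: pair records by position."""
    edits = []
    for i, (a, b) in enumerate(zip(old, new)):
        for name in a:
            if a[name] != b.get(name):
                edits.append(
                    f"record {i}: {name} changed from {a[name]!r} to {b.get(name)!r}"
                )
    return edits


def keyed_diff(old: List[Record], new: List[Record], key: str = "name") -> List[str]:
    """Identify records by key, so a different key means a different entity."""
    old_keys = [r[key] for r in old]
    new_keys = [r[key] for r in new]
    edits = [f"deleted {k!r}" for k in old_keys if k not in new_keys]
    edits += [f"inserted {k!r}" for k in new_keys if k not in old_keys]
    return edits
```

The positional diff reports a single cheap edit (John Doe was "renamed" to Jane Smith), while the keyed diff reports one deletion and one insertion, which matches what actually happened to the two people.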

Proposed approach in paper

• Focus on hierarchical scientific datasets
  • XML-based
  • Changes are primarily insertions
• Changes identified based on keys
• Version merging based on keys
• Inheritance of timestamps
  • Timestamp is stored at a child element only when it is different from the timestamp of its parent element

➢ “Key-based + merging” approach
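Timestamp inheritance can be sketched as follows. The `Node` class, the path labels, and the salary values are assumptions made for illustration; a timestamp is modelled as the set of version numbers in which a node exists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    label: str
    # None means "same timestamp as the parent", so nothing is stored.
    timestamp: Optional[Set[int]] = None
    children: List["Node"] = field(default_factory=list)


def effective_timestamps(
    node: Node, inherited: Set[int], path: str = ""
) -> Dict[str, Set[int]]:
    """Map each node path to the set of versions in which the node exists."""
    own = node.timestamp if node.timestamp is not None else inherited
    here = f"{path}/{node.label}"
    result = {here: own}
    for child in node.children:
        result.update(effective_timestamps(child, own, here))
    return result


archive = Node("db", {1, 2, 3}, [
    Node("emp", None, [           # present in every version, like db
        Node("sal=90K", {1, 2}),  # differs from its parent, so it is stored
        Node("sal=95K", {3}),
    ]),
])

resolved = effective_timestamps(archive, set())
```

Only three of the four nodes carry an explicit timestamp; `emp` recovers `{1, 2, 3}` from `db` when the tree is walked. In a dataset where most changes are insertions, most nodes share their parent's timestamp, which is where the space saving comes from.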

Example: sequence of versions

[Figure not recovered]

Adding keys

[Figure not recovered]

Example of an archive

[Figure not recovered]

Representing archive in XML

[Figure not recovered]

What is a key for XML?

• A key has form (Q, {P1, …, Pk}), where Q, Pi are path expressions
  • Q identifies the target set
  • Pi are key paths, analogous to key attributes in relations
• An XML document satisfies a key (Q, {P1, …, Pk}) if
  • From any node identified by Q, every Pi exists uniquely
  • If two nodes n1 and n2 identified by Q have the same value at the end of each key path in {P1, …, Pk} then n1 and n2 are the same node.
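The two conditions can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` and makes two simplifying assumptions: paths are written relative to the document's root element, in ElementTree's limited XPath subset, and the "value" at the end of a key path is taken to be the element's text rather than its whole subtree. The function name `satisfies_key` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def satisfies_key(root: ET.Element, q: str, key_paths: List[str]) -> bool:
    """Check the key (q, key_paths) against the document rooted at `root`."""
    seen = set()
    for node in root.findall(q):
        values = []
        for p in key_paths:
            matches = node.findall(p)
            if len(matches) != 1:  # every key path must exist uniquely
                return False
            values.append((matches[0].text or "").strip())
        if tuple(values) in seen:  # two distinct nodes share all key values
            return False
        seen.add(tuple(values))
    return True


good = ET.fromstring(
    "<db><dept><name>finance</name></dept>"
    "<dept><name>marketing</name></dept></db>"
)
duplicate = ET.fromstring(
    "<db><dept><name>finance</name></dept>"
    "<dept><name>finance</name></dept></db>"
)
two_names = ET.fromstring(
    "<db><dept><name>finance</name><name>fin</name></dept></db>"
)
```

Here `good` satisfies the key `("dept", ["name"])`, `duplicate` violates the second condition (two dept nodes with the same name), and `two_names` violates the first (the key path `name` does not exist uniquely).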

Relative keys

• Since XML is hierarchical, we also need to specify keys relative to a context node
  • (Q, (Q’, {P1, …, Pk}))
• Examples
  • (/, (db, {})). There is at most one db element below the root.
  • (/db, (dept, {name})). Every dept node within a db node can be uniquely identified by the contents of its name subelement.
  • (/db/dept, (emp, {fn, ln})). Every emp node within a dept node along the path /db/dept can be uniquely identified by the contents of its fn and ln subelements.
  • (/db/dept/emp, (sal, {})). There is at most one sal subelement under each emp node along the path /db/dept/emp.
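The four example keys can be checked against a small document. This sketch uses the standard `xml.etree.ElementTree` module; the document contents, the function name `satisfies_relative_key`, and the synthetic `document` wrapper element (which stands in for the root "/", so that "/db/dept" is written as the relative path "db/dept") are assumptions for illustration. Key values are compared by element text only.

```python
import xml.etree.ElementTree as ET
from typing import List

XML = """
<db>
  <dept><name>finance</name>
    <emp><fn>John</fn><ln>Doe</ln><sal>90K</sal></emp>
    <emp><fn>Jane</fn><ln>Smith</ln><sal>80K</sal></emp>
  </dept>
  <dept><name>marketing</name>
    <emp><fn>John</fn><ln>Doe</ln><sal>70K</sal></emp>
  </dept>
</db>
"""

# Synthetic node above <db> that plays the role of the document root "/".
top = ET.Element("document")
top.append(ET.fromstring(XML))


def satisfies_relative_key(
    root: ET.Element, context: str, target: str, key_paths: List[str]
) -> bool:
    """Check (context, (target, key_paths)): uniqueness holds per context node."""
    for ctx in root.findall(context):
        seen = set()  # reset for every context node
        for node in ctx.findall(target):
            values = []
            for p in key_paths:
                matches = node.findall(p)
                if len(matches) != 1:
                    return False
                values.append((matches[0].text or "").strip())
            if tuple(values) in seen:
                return False
            seen.add(tuple(values))
    return True


results = [
    satisfies_relative_key(top, ".", "db", []),
    satisfies_relative_key(top, "db", "dept", ["name"]),
    satisfies_relative_key(top, "db/dept", "emp", ["fn", "ln"]),
    satisfies_relative_key(top, "db/dept/emp", "sal", []),
]
```

All four checks succeed even though a John Doe appears in both departments, because the emp key is relative to its dept. Checking the same key paths relative to db instead, with `satisfies_relative_key(top, "db", "dept/emp", ["fn", "ln"])`, fails for this document.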

Archiver architecture

• Assumptions:
  • Every key defined for a node is relative to its parent, e.g. the key for emp is relative to its parent dept node
  • Frontier nodes identify unkeyed portions of the document

Nested merge

• Recursively merge nodes in the incoming version (D) to nodes in the archive (A) that have the same key value, starting from the root.
• When a node y in D is merged with a node x from A, the timestamp of x is augmented with i (the new version number), and subtrees are recursively merged.
• Nodes in D that do not have corresponding nodes in A are simply added with i as the timestamp
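The merge steps above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not the archiver from the paper: an incoming version is modelled as a nested dict whose keys are already-computed key values (the labels such as "dept[name=finance]" are a hypothetical encoding), every node stores its timestamp explicitly (no inheritance), and content below a frontier node is treated as part of the label.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ArchiveNode:
    # Set of version numbers in which this node exists.
    timestamp: Set[int] = field(default_factory=set)
    # Children indexed by key value, e.g. "dept[name=finance]".
    children: Dict[str, "ArchiveNode"] = field(default_factory=dict)


def nested_merge(archive: ArchiveNode, version: dict, i: int) -> None:
    """Merge incoming version `version` (number i) into `archive` in place."""
    archive.timestamp.add(i)  # node x is augmented with the new version number
    for key, subtree in version.items():
        child = archive.children.get(key)
        if child is None:
            # No node with this key value in the archive: add it fresh.
            child = ArchiveNode()
            archive.children[key] = child
        nested_merge(child, subtree, i)
    # Archive children absent from `version` are left untouched, so their
    # timestamps simply do not include i.


v1 = {"dept[name=finance]": {"emp[fn=John,ln=Doe]": {"sal=90K": {}}}}
v2 = {"dept[name=finance]": {"emp[fn=John,ln=Doe]": {"sal=95K": {}}}}

root = ArchiveNode()
nested_merge(root, v1, 1)
nested_merge(root, v2, 2)
emp = root.children["dept[name=finance]"].children["emp[fn=John,ln=Doe]"]
```

After both merges the employee node exists once with timestamp `{1, 2}`, while the two salary values sit side by side under it with timestamps `{1}` and `{2}`.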

Further compaction under frontier node

[Figure not recovered]

Querying the archive

• What is the database at t=1?
• When did Joe Doe get a salary raise?
• What were the changes to the database between t=1 and t=3?
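With timestamps attached to nodes, each of these questions becomes a filter over the archive. The sketch below assumes a flattened archive that maps a node path to the set of versions in which the node exists; the paths, the second employee, and the salary figures are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical flattened archive: node path -> versions in which it exists.
archive: Dict[str, Set[int]] = {
    "/db": {1, 2, 3},
    "/db/emp[Joe Doe]": {1, 2, 3},
    "/db/emp[Joe Doe]/sal=90K": {1, 2},
    "/db/emp[Joe Doe]/sal=95K": {3},
    "/db/emp[Ann Lee]": {2, 3},
    "/db/emp[Ann Lee]/sal=80K": {2, 3},
}


def snapshot(t: int) -> List[str]:
    """What is the database at time t? Keep the nodes whose timestamp contains t."""
    return sorted(p for p, ts in archive.items() if t in ts)


def first_seen(path: str) -> int:
    """When did this node first appear?"""
    return min(archive[path])


def changes(t1: int, t2: int) -> Tuple[List[str], List[str]]:
    """Nodes present at t2 but not t1 (added), and at t1 but not t2 (removed)."""
    added = sorted(p for p, ts in archive.items() if t2 in ts and t1 not in ts)
    removed = sorted(p for p, ts in archive.items() if t1 in ts and t2 not in ts)
    return added, removed


raise_time = first_seen("/db/emp[Joe Doe]/sal=95K")
added, removed = changes(1, 3)
```

None of the queries needs to replay deltas or compare whole snapshots, which is the advantage of the merged archive over both approaches recalled earlier.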

Conclusions

• Versioning is important for many different applications
• While techniques are similar between different representations (e.g. files, relations, XML, RDF), differences in assumptions can be used to build more efficient solutions.
  • And the operations (e.g. queries) you wish to perform are important too!