Using Git to Manage the Storage and Versioning of Digital Objects

  • 8/10/2019 Using Git to Manage the Storage and Versioning of Digital Objects

    Using Git to Manage the Storage and Versioning of Digital Objects

    Richard Anderson
    Digital Library Systems & Services, Stanford University

    13 December 201

    Introduction

    This document summarizes some information I have recently gathered on the applicability of the Git Distributed Version Control System (DVCS) for use in managing the storage and versioning of digital objects.

    Git is optimized to facilitate collaborative development of software, but it has storage and version control capabilities that may be similarly applied to the management of digital objects in a preservation system. In this mode of usage, each digital object would be stored in its own Git "repository", and standard Git commands would be used to add or update the object's content and metadata files. The Git clone and pull commands could be used for replication to additional storage locations.

    Some users who have previously explored that approach, however, have encountered slowness and other issues when processing large binary files such as images or video.

    Basic Git References

    Here are some links to official and 3rd party web sites:

    - Git 2010, Git Homepage
    - Git 2010, The Git Community Book
    - Git 2010, Git User's Manual
    - Scott Chacon 2009, Pro Git

    Git Object Model

    Here are some links that give you an overview of Git's content storage architecture:

    - Git Book: The Git Object Model
    - Git Magic: The Object Database
    - John Wiegley - Git from the bottom up
    - Tommi Virtanen - Git for Computer Scientists
    - Pro Git: Git Objects
    - Git User Manual: The Object Database

    Git Storage

    In typical usage, the current version of a code project's files is stored in a hierarchy of folders under a top-level " [...] project.git) that does not include a working folder. Bare repositories are typically stored on remote shared sites.

    Git Blob

    All bytestreams (including content files) managed by Git are stored in a type of Git object called a blob, which has this structure:

    - the string "blob"
    - a space " "
    - a decimal string specifying the length of the content in bytes
    - a null "\000"
    - the content being stored in the blob

    Each blob is digested to generate a 40-digit SHA1 hash, which is used to specify the blob's identifier and location in the object tree. The blob is initially stored in a file where the first 2 digits are used as a folder name and the remainder used as the filename.

    This design is referred to as "content-addressable storage". Note that the SHA1 hash is not the digest of the original contents, but rather the digest of the content plus the header.
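The blob layout and naming rule described above can be sketched in a few lines of Python. This is an illustration only, not Git's own code: the function names `git_object_id` and `loose_object_path` are hypothetical, and the sketch covers hashing and path derivation, not the writing of the object file.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the 40-digit hex SHA-1 of an object: header plus content."""
    # Header: the type string, a space, the decimal content length, a null byte.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Return the loose object's path relative to the objects folder."""
    # The first 2 hex digits name the folder, the remaining 38 name the file.
    return f"{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    blob_id = git_object_id("blob", b"hello world\n")
    print(blob_id)
    print(loose_object_path(blob_id))
```

For the same bytes, the identifier printed here should agree with the output of `git hash-object`, and it differs from a plain SHA-1 of the content because of the header.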

    The other object types used by Git (tree, commit, tag) use the same object structure, differing mainly in the first string that specifies object type.

    Git Tree

    Tree objects contain references to sets of blobs and/or other trees (using SHA1 identifiers), similar to the function of directory entries in Unix filesystems. A tree object stores the original filenames of its child objects. This design allows a given child object to be referenced from more than one parent tree using different names, similar to the way Unix file links work.
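As a rough sketch of how a tree can name the same child twice, the following Python serializes tree entries and hashes them. The entry layout used here (mode, space, name, null byte, 20 raw SHA1 bytes) and the "100644" file mode are not spelled out in this document and are stated as assumptions; the function names are hypothetical.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Hash an object the same way as a blob: header plus content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def tree_content(entries) -> bytes:
    """Serialize (mode, name, hex_id) entries into tree object content."""
    body = b""
    # Entries are ordered by name. Plain byte order is a simplification
    # that ignores any special ordering rules for sub-tree names.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


if __name__ == "__main__":
    blob_id = git_object_id("blob", b"hello world\n")
    # One blob referenced under two different names from the same tree.
    entries = [("100644", "copy.txt", blob_id), ("100644", "original.txt", blob_id)]
    print(git_object_id("tree", tree_content(entries)))
```

Because the filename lives in the tree entry rather than in the blob, the two entries cost one stored blob, which is the link-like behavior described above.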

    Link targets for the Git Object Model references above:

    http://book.git-scm.com/1_the_git_object_model.html
    http://www-cs-students.stanford.edu/~blynn/gitmagic/ch08.html#_the_object_database
    http://ftp.newartisans.com/pub/git.from.bottom.up.pdf
    http://eagain.net/articles/git-for-computer-scientists/
    http://progit.org/book/ch9-2.html
    http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#the-object-database

    Git Commit

    Originally called a changeset, a commit object adds an annotation to a top-level tree object that represents a point-in-time snapshot of the collection of files being stored in the code "repository". It provides the ability to record the name of the content creator and the agent making the commit, as well as a pointer to the previous commit(s) that this "version" of the object is derived from (allowing version history to be traced).
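The pieces of a commit named above (top-level tree, creator, committing agent, previous commits) can be sketched as follows. The textual layout with `tree`, `parent`, `author` and `committer` lines is an assumption not given in this document, and the names, email addresses and timestamps are made up for illustration.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Hash an object the same way as a blob: header plus content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def commit_content(tree_id, parent_ids, author, committer, message) -> bytes:
    """Build commit content: a pointer to a tree plus annotations."""
    lines = [f"tree {tree_id}"]
    # Zero parents for a first version, one for a normal update,
    # two or more when histories are merged.
    lines += [f"parent {parent}" for parent in parent_ids]
    lines.append(f"author {author}")        # the content creator
    lines.append(f"committer {committer}")  # the agent making the commit
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


if __name__ == "__main__":
    empty_tree = git_object_id("tree", b"")
    creator = "A. Curator <curator@example.org> 1323734400 +0000"
    agent = "Ingest Robot <robot@example.org> 1323734400 +0000"
    first = git_object_id(
        "commit", commit_content(empty_tree, [], creator, agent, "Version 1\n"))
    second = git_object_id(
        "commit", commit_content(empty_tree, [first], creator, agent, "Version 2\n"))
    print(first, second)
```

Since the parent identifier is part of the hashed content, each version's identifier depends on the whole chain before it, which is what makes the history traceable.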

    Git References and Tags

    Git provides the ability to view the change history as a network of commits along with human-readable labels for development branches (e.g. "master" and "develop") and milestones (e.g. "v1.0.2"). Information about development branches is stored in reference files. A tag label can be attached to any given commit. Tags are customarily used to assign arbitrary release version labels to a specific point in the version history. A special label, "HEAD", refers to the tip of the currently checked-out branch.
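A minimal sketch of following these labels on disk, assuming the common layout in which the repository folder holds a `HEAD` file and each branch has a small text file under `refs/heads/` containing a commit identifier. The function name `resolve_head` is hypothetical, and references that have been packed into a single file are deliberately not handled.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (reference name, commit id) for HEAD, loose references only."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # HEAD names a branch; the branch's reference file holds the commit id.
        ref_name = head[len("ref: "):]
        return ref_name, (git_dir / ref_name).read_text().strip()
    # Otherwise HEAD holds a commit id directly (no branch checked out).
    return None, head
```

Called as `resolve_head("/path/to/object/.git")`, this returns the branch reference and the commit it currently points to, or raises `FileNotFoundError` if the reference file is absent.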

    Replication

    The "git clone" command is used to copy a Git repository from one location to another. The default behavior is to copy all version history. Slowness in cloning a git repository can be especially problematic if there is a high frequency of changes to a population of large files. That creates a large volume of history in the object database, which can take a long time to transfer between machines.

    The depth option can be used to modify this behavior. The command "git clone --depth <n>" creates a "shallow" clone with the history truncated to the specified number of revisions. A depth of 1 would transfer only the latest version.
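For scripted replication, the shallow clone command could be wrapped as in the sketch below. The helper names are hypothetical, and the check that depth is at least 1 reflects the behavior of recent Git versions, which reject a depth of 0.

```python
import subprocess


def shallow_clone_command(source, destination, depth=1):
    """Build the argument list for a history-truncated clone."""
    if depth < 1:
        raise ValueError("depth must be a positive integer")
    return ["git", "clone", "--depth", str(depth), source, destination]


def shallow_clone(source, destination, depth=1):
    """Run the clone; raises CalledProcessError if git reports failure."""
    subprocess.run(shallow_clone_command(source, destination, depth), check=True)
```

When the source is a plain local path, Git may ignore the depth option, so a `file://` URL is the safer form for local replication tests.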

    The Git fetch, pull, and push commands are used to synchronize the change histories of two copies of a repository. They do not work with shallow clones, however.

    Compression and Packing

    Links related to object packing basics:

    - Pro Git: Packfiles
    - Git Book: How Git Stores Objects
    - Git Book: The Packfile
    - Git User Manual: How git stores objects efficiently: pack files
    - GIT pack format

    [...] combined into one big pack file. There is also a "git gc --aggressive" option that can be used to force a repack of all objects from scratch.

    As mentioned previously, Git automatically packs any loose blobs whenever you do a push operation. This can make the transfer speed seem slower than would be expected. One can improve the perceived performance by doing a separate repack operation prior to the push.

    Suppressing compression and packing behaviors

    Links related to configuration of zlib and delta compression during storage and packing:

    - Git Manual - Config
    - Git Manual - Gitattributes
    - Stackoverflow - git pull without remotely compressing objects
    - How to prevent Git from compressing certain files?
    - Pro Git - Git Attributes

    By default, Git does automatic zlib compression of the bytestreams stored in loose and packed object files. Compression behavior can be suppressed or modified via the "core.compression" configuration option:

        An integer -1..9, indicating a default compression level. -1 is the zlib default. 0 means no compression, and 1..9 are various speed/size tradeoffs, 9 being slowest.

        If set, this provides a default to other compression variables, such as core.loosecompression and pack.compression.

    The config setting "core.compression 0" will disable zlib compression of loose objects and objects within pack files. But it does not affect delta compression that occurs when pack files are created.
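The meaning of the -1..9 scale can be seen directly with Python's own zlib binding, which uses the same level numbers. This is only a sketch of the zlib behavior, not of Git's storage code, and the sample data is invented.

```python
import zlib


def compressed_sizes(data: bytes, levels=(-1, 0, 1, 9)) -> dict:
    """Return the compressed size in bytes for each zlib level."""
    return {level: len(zlib.compress(data, level)) for level in levels}


if __name__ == "__main__":
    sample = b"digital object metadata " * 1000
    print(len(sample), compressed_sizes(sample))
```

At level 0 the data is stored with a small framing overhead, so the result is slightly larger than the input. For content that is already compressed (JPEG, video, gzip archives) the higher levels mostly cost time without saving space, which is the motivation for turning compression off.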

    The "pack.window" setting can be used to limit the number of other objects git will consider when doing delta compression. Setting it to 0 should eliminate delta compression entirely.

    A "gc.auto 0" config setting will disable automatic repacking when you have a lot of objects. But it does not affect the packing behavior that occurs during pushes and pulls.

    Use of "commit -q" suppresses the diff operation at the end of a commit.

    A more granular option is to use the ".gitattributes" file to indicate binary status and to suppress delta compression for specified file types, e.g.

        *.jpg binary -delta
        *.png binary -delta
        *.gz binary -delta

    The attribute "binary" is a macro that expands to "-crlf -diff". The "-crlf" option tells Git not to mess with line endings of files. The "-diff" option suppresses the analysis of textual differences and the inspection of blob contents that would normally occur to determine if the contents are text. The diff attribute can alternatively be used to specify a custom diff utility to use for the given file type.

    The filename pattern * can be used to match all files.

    The "-delta" option forces files to be copied into pack files without attempting to delta compress them.

    Link targets for the configuration references listed at the start of this section:

    http://www.kernel.org/pub/software/scm/git/docs/git-config.html
    http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html
    http://stackoverflow.com/questions/7102053/git-pull-without-remotely-compressing-objects
    http://git.661346.n2.nabble.com/How-to-prevent-Git-from-compressing-certain-files-td3305492.html
    http://progit.org/book/ch7-2.html
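To make the effect of ".gitattributes" patterns concrete, here is a deliberately simplified sketch of how such lines map filenames to attributes. It uses Python's `fnmatchcase` as a stand-in for Git's pattern matching, expands "binary" as described in this section, and ignores Git's precedence and path rules; all names are hypothetical.

```python
from fnmatch import fnmatchcase

# Macro expansion as described in the text above (an assumption of this sketch).
MACROS = {"binary": ["-crlf", "-diff"]}


def parse_attributes(text: str):
    """Parse attribute lines into (pattern, [attributes]) rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *attrs = line.split()
        rules.append((pattern, attrs))
    return rules


def attributes_for(filename: str, rules):
    """Collect the expanded attributes of every rule matching the filename."""
    result = []
    for pattern, attrs in rules:
        if fnmatchcase(filename, pattern):
            for attr in attrs:
                result.extend(MACROS.get(attr, [attr]))
    return result


if __name__ == "__main__":
    rules = parse_attributes("*.jpg binary -delta\n*.png binary -delta\n")
    print(attributes_for("scan0001.jpg", rules))
    print(attributes_for("notes.txt", rules))
```

A file that matches no pattern gets an empty list, meaning Git's default text detection and delta compression would still apply to it.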

    Problems with big files and/or lots of files

    Links to relevant email threads:

    - How to prevent Git from compressing certain files?
    - Serious performance issues with images, audio files, and other "non-code" data
    - Fwd: Git and Large Binaries: A Proposed Solution
    - Google Summer of Code 2011 Ideas
    - [PATCH v0 0/3] git add a-Big-file
    - Git 1.7.6 Release Notes

    The Git mailing list [git@vger.kernel.org] has fielded a variety of queries where users have reported serious performance issues with git repositories used to store media or other large binary files. Many of these discussion threads include suggestions to use one or more of the configuration options covered in the previous section.

    The first email thread explores ways to prevent Git from trying to compress files.

    The second email thread explores potential Git configuration enhancements that would speed up the handling of large binary files.

    The third email thread explores approaches that avoid directly including large binary files in the git object database, while still using Git to track versions.

    The Google Summer of Code proposals confirm that further Git enhancements are still desirable for better handling of large binary files.

    The git add a-big-file patch shows that enhancements to handle adding of big files are/were in progress.

    The version 1.7.6 release notes include the text:

        Adding a file larger than core.bigfilethreshold (defaults to 1/2 Gig) using "git add" will send the contents straight to a packfile without having to hold it and its compressed representation both at the same time in memory.

    In older versions of Git, when adding new content to the repository, it loaded the blob in its entirety into memory, computed the object name and compressed it into a loose object file. Handling large binary files (e.g. video and audio assets for games) has been problematic because of this design. Out of memory errors could occur.
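The memory problem described above comes from holding the whole blob at once. The sketch below shows the alternative for the hashing step only: the object name is computed by reading the file in fixed-size chunks. It does not reproduce Git's packfile streaming, and the function name and chunk size are illustrative choices.

```python
import hashlib
import os


def blob_id_streaming(path, chunk_size=1024 * 1024) -> str:
    """Compute a blob's SHA-1 name without loading the file into memory."""
    # The header needs the content length, which the filesystem can supply.
    size = os.path.getsize(path)
    digest = hashlib.sha1(f"blob {size}".encode("ascii") + b"\x00")
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays near `chunk_size` regardless of file size, so a multi-gigabyte video can be named with the same footprint as a small text file.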

    Link targets for the email threads, Summer of Code ideas, patch and release notes listed above:

    http://git.661346.n2.nabble.com/How-to-prevent-Git-from-compressing-certain-files-td3305492.html
    http://git.661346.n2.nabble.com/serious-performance-issues-with-images-audio-files-and-other-quot-non-code-quot-data-td5042748.html
    http://git.661346.n2.nabble.com/Fwd-Git-and-Large-Binaries-A-Proposed-Solution-td5948908.html
    https://git.wiki.kernel.org/index.php/SoC2011Ideas#Better_big-file_support
    http://git.661346.n2.nabble.com/PATCH-v0-0-3-git-add-a-Big-file-td6341445.html
    http://www.kernel.org/pub/software/scm/git/docs/RelNotes/1.7.6.txt

    Ancillary projects that address big file issues

    The following Git plugins provide mechanisms for separating the storage of large binary files from the storage of tracking information about those files.

    git-bigfiles
    http://caca.zoy.org/wiki/git-bigfiles

    This project appears to be a now inactive fork of Git that implemented some improvements for handling of big files. The core.bigFileThreshold config option added by the project seems to have been merged back into mainstream Git.

    git-annex
    http://git-annex.branchable.com/

    Git-annex is a git plugin (written in Haskell) that allows you to use Git for versioning symlinks to files, while storing the actual file in a separate "backend" location. This avoids many of the issues associated with big files. The tool seems targeted toward people who want to scatter files among many storage sites and/or have a simple mechanism for synchronizing storage between those sites. The walkthrough example gives one a feeling of how this tool operates. The software's home page and this LWN article [...] git-annex is not

    There is very little discussion of file versioning in the git-annex documentation and forums. The discussions I have found are not encouraging in that regard:

        Obviously, the core feature of git-annex is the ability to keep a subset of files in a local repo. The main trade-off is that you don't get version tracking.

        git-annex can allow reverting a file to an earlier version

        I think there is a major distinction between boar and [git-annex and git-media]. Boar tracks the content of your binary files, allowing you to retrieve to previous versions. The others don't seem to do that.

    git-media
    https://github.com/schacon/git-media

    Git-media has design goals similar to git-annex, but is not as well documented or actively developed. However, it has some attraction for the use case we envision, and the author, Scott Chacon, is highly regarded in the Git community (being the primary author of official Git documentation). According to a posting by the author, "it uses the smudge and clean filters to automatically redirect content into a .git/media directory instead of into Git itself while keeping the SHA in Git". See Git Large Object Support Proposal for some background reading. As with git-annex, I have concerns about the explicit support for file versioning, which would require more research to figure out.
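The clean and smudge idea quoted above can be sketched as a pair of functions: one replaces content with a short pointer on the way into Git, the other restores it on the way out. This is not git-media's implementation; the function names, the pointer format and the store layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def clean(content: bytes, media_dir) -> bytes:
    """Store the content outside Git and return the pointer to commit."""
    digest = hashlib.sha1(content).hexdigest()
    media_dir = Path(media_dir)
    media_dir.mkdir(parents=True, exist_ok=True)
    # The large file is kept under its digest in the side store.
    (media_dir / digest).write_bytes(content)
    return (digest + "\n").encode("ascii")


def smudge(pointer: bytes, media_dir) -> bytes:
    """Replace a committed pointer with the real content on checkout."""
    digest = pointer.decode("ascii").strip()
    return (Path(media_dir) / digest).read_bytes()
```

Git then versions only the 41-byte pointer. Each new version of a large file adds a new entry to the side store, so whether old versions remain retrievable depends on how that store is retained and replicated, which is the versioning concern raised above.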

    bfsync
    http://space.twc.de/~stefan/bfsync.php

    The home page says "bfsync is a program that provides git-style revision control for collections of big files. The contents of the files are managed by bfsync, and a git repository is used to do version control; in this repo only the hashes of the actual data files are stored." This is very new software without much of a track record; see http://blogs.gnome.org/stw/2011/08/23/23-08-2011-bfsync-0-1-0-or-managing-big-files-with-git-home/

    Link targets for the git-annex, git-media and bfsync references above:

    http://git-annex.branchable.com/
    http://git-annex.branchable.com/walkthrough/
    http://lwn.net/Articles/419241/
    http://git-annex.branchable.com/not/
    http://git-annex.branchable.com/forum/wishlist:_git_backend_for_git-annex/
    http://kristianrumberg.wordpress.com/2011/07/06/git-annex/
    http://www.reddit.com/r/linux/comments/fx9kr/boar_simple_version_control_and_backup_for_photos/
    https://github.com/schacon/git-media
    http://git.661346.n2.nabble.com/Git-Large-Object-Support-Proposal-td2505770.html
    http://space.twc.de/~stefan/bfsync.php
    http://blogs.gnome.org/stw/2011/08/23/23-08-2011-bfsync-0-1-0-or-managing-big-files-with-git-home/

    Some observations about other software version control systems

    Mercurial (Hg)

    Mercurial is very similar in functionality to Git. It differs mainly in the way that it structures the object store and in how it handles delta compression. They also differ in how they handle file renaming. Git uses heuristic methods to detect that renames have occurred, whereas Mercurial does explicit rename tracking. There are pros and cons to both approaches.

    Mercurial has a Bigfiles Extension (http://mercurial.selenic.com/wiki/BigfilesExtension) that allows one to track large files that are stored external to the VCS repository. This functionality is similar to git-annex and git-media.

    Subversion (SVN)

    Subversion uses a centralized repository model instead of a distributed model, thus it allows subsets of files to be checked out and committed without requiring the entire data set. However, SVN is not recommended for large binary files, and it too suffers from using delta technology in an attempt to reduce the storage needed. As with other VCS systems, this slows down storage and retrieval.

    Performance tuning Subversion: http://www.ibm.com/developerworks/java/library/j-svnbins/index.html