ResourceSync: Leveraging Sitemaps for Resource Synchronization
WWW 2013, Rio de Janeiro, May 17th
Bernhard Haslhofer | University of Vienna
Simeon Warner | Cornell University
Carl Lagoze | University of Michigan
Martin Klein, Robert Sanderson | Los Alamos National Labs
Michael L. Nelson | Old Dominion University
Herbert van de Sompel | Los Alamos National Labs
http://www.openarchives.org/rs/
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics
• Demos
• Status and Next Steps
What?
• A framework for synchronizing Web resources from a Source to a Destination
$ resync http://example.com
Why?
• rsync: filesystem sync, but not Web
• OAI-PMH: metadata, but not resources
• WebDAV: extends HTTP, but requires server installation at the source
• ...
… because lots of projects and services are doing synchronization but rely on ad-hoc solutions!
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics
• Demos
• Status and Next Steps
arxiv.org mirroring
• 2.4M resources (PDF, metadata, LaTeX source)
• ~800/day created or updated
• has used homebrew mirroring since 1994 (!)
• looking for a more general solution to support independent destinations
Wikipedia
• 1.4 updates / sec
• many dependent services reuse Wikipedia content (e.g., DBpedia, Freebase)
• they harvest articles via OAI-PMH, retrieve changes via IRC, or download dumps
data.europeana.eu
• aggregates metadata from >200 data providers in Europe
• 10 largest providers contribute 80%
• >190 providers contribute 20%
Design Guidelines
• Sync small websites / repositories (few resources) but also large data collections (millions of resources)
• Support low change frequency (weeks / months) to high change frequency (seconds) sources
• Low adoption barrier!
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics
• Demos
• Status and Next Steps
Resource List
Destination
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
  </url>
</urlset>
$ resync -b http://example.com
XML Sitemap
Resource List
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/>
  </url>
</urlset>
Source
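A Destination could read such a Resource List with any XML parser. The following is a minimal sketch using only the Python standard library, not the resync client's own code; the function name parse_resource_list and the dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used for lookups (the prefixes themselves are arbitrary).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Resource List from the slide, embedded so the sketch is self-contained.
RESOURCE_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Hypothetical helper: return (capability, list of resource dicts)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # The top-level rs:md element names the capability of the document.
    capability = root.find("rs:md", NS).get("capability")
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)  # per-resource metadata is optional
        resources.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "hash": md.get("hash") if md is not None else None,
        })
    return capability, resources


capability, resources = parse_resource_list(RESOURCE_LIST)
for res in resources:
    print(res["uri"], res["lastmod"], res["hash"])
```

Because lastmod and rs:md are treated as optional, the same sketch also accepts the simpler Resource List that carries only loc elements.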
Change List
Destination
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" modified="2013-01-03T11:00:00Z"/>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod>2013-01-02T18:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>
$ resync -b http://example.com
$ resync -i http://example.com
XML Sitemap
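An incremental sync amounts to replaying a Change List against the Destination's local state. The sketch below assumes the Destination tracks its mirror as a plain dict from URI to lastmod; that bookkeeping, and the function name apply_change_list, are illustrative assumptions and not how the resync client is implemented. The "created" change type does not appear on the slide and is handled here alongside "updated".

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Change List from the slide.
CHANGE_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" modified="2013-01-03T11:00:00Z"/>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod>2013-01-02T18:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>"""


def apply_change_list(xml_text, mirror):
    """Hypothetical helper: update `mirror` (URI -> lastmod) in place.

    Returns the URIs whose content the Destination would need to download.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    to_fetch = []
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        change = url.find("rs:md", NS).get("change")
        if change == "deleted":
            mirror.pop(uri, None)      # drop the local copy, if any
        elif change in ("created", "updated"):
            mirror[uri] = lastmod      # record the new state
            to_fetch.append(uri)       # content must be downloaded
    return to_fetch


# Assumed local state after an earlier baseline sync (made-up timestamps).
mirror = {
    "http://example.com/res1": "2013-01-02T13:00:00Z",
    "http://example.com/res2": "2013-01-02T12:00:00Z",
    "http://example.com/res3": "2013-01-02T12:30:00Z",
}
to_fetch = apply_change_list(CHANGE_LIST, mirror)
print("download:", to_fetch)
print("mirror now holds:", sorted(mirror))
```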
Resource Dump
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump" modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/resourcedump.zip</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </url>
</urlset>
XML Sitemap
Resource Dump
http://example.com/resourcedump.zip
|- manifest.xml
|- resources
   |- res1
   |- res2
Resource Dump Manifest
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-03T03:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-03T04:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e" path="/resources/res2"/>
  </url>
</urlset>
manifest.xml (XML Sitemap)
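The manifest lets a Destination check each file unpacked from a dump against the hash the Source advertised. The sketch below builds a small in-memory ZIP with the same layout (manifest.xml plus a resources directory) and verifies it. The file contents, and therefore the MD5 values, are made up for the example; build_sample_dump and verify_dump are hypothetical helper names.

```python
import hashlib
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def build_sample_dump():
    """Create an in-memory ZIP shaped like the Resource Dump on the slide."""
    files = {  # made-up resource contents
        "resources/res1": b"first resource",
        "resources/res2": b"second resource",
    }
    entries = "".join(
        '<url><loc>http://example.com/{name}</loc>'
        '<rs:md hash="md5:{digest}" path="/{path}"/></url>'.format(
            name=path.split("/")[-1],
            digest=hashlib.md5(data).hexdigest(),
            path=path,
        )
        for path, data in files.items()
    )
    manifest = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
        'xmlns:rs="http://www.openarchives.org/rs/terms/">'
        '<rs:md capability="resourcedump-manifest"/>'
        + entries + '</urlset>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("manifest.xml", manifest)
        for path, data in files.items():
            zf.writestr(path, data)
    buf.seek(0)
    return buf


def verify_dump(fileobj):
    """Return {resource URI: True if the packaged file matches its hash}."""
    results = {}
    with zipfile.ZipFile(fileobj) as zf:
        root = ET.fromstring(zf.read("manifest.xml"))
        for url in root.findall("sm:url", NS):
            md = url.find("rs:md", NS)
            # The hash attribute has the form "<algorithm>:<hex digest>".
            algo, _, expected = md.get("hash").partition(":")
            # Manifest paths start with "/"; ZIP member names do not.
            data = zf.read(md.get("path").lstrip("/"))
            actual = hashlib.new(algo, data).hexdigest()
            results[url.findtext("sm:loc", namespaces=NS)] = (actual == expected)
    return results


results = verify_dump(build_sample_dump())
print(results)
```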
Capability List
Destination
Source
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln href="http://example.com/info-about-source.xml" rel="describedby" type="application/xml"/>
  <rs:md capability="capabilitylist" modified="2013-01-02T14:00:00Z"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>
$ resync -x http://example.com
XML Sitemap
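The Capability List is the entry point from which a Destination can discover which documents a Source offers. A minimal sketch of that lookup, assuming the Destination only needs a mapping from capability name to document URL (discover_capabilities is a hypothetical name):

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Capability List from the slide.
CAPABILITY_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln href="http://example.com/info-about-source.xml"
         rel="describedby" type="application/xml"/>
  <rs:md capability="capabilitylist" modified="2013-01-02T14:00:00Z"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>"""


def discover_capabilities(xml_text):
    """Hypothetical helper: map capability name -> URL of its document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    capabilities = {}
    for url in root.findall("sm:url", NS):
        name = url.find("rs:md", NS).get("capability")
        capabilities[name] = url.findtext("sm:loc", namespaces=NS)
    return capabilities


capabilities = discover_capabilities(CAPABILITY_LIST)
# A Destination that already holds a baseline copy would look for a
# Change List first and fall back to the full Resource List otherwise.
next_document = capabilities.get("changelist") or capabilities["resourcelist"]
print(next_document)
```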
Large Resource Lists
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>
  <sitemap>
    <loc>http://example.com/resourcelist-part2.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/resourcelist-part1.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
</sitemapindex>
Source
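A Destination has to handle both forms: a plain urlset and a sitemapindex that points at several part documents. The sketch below follows index entries recursively. To stay runnable offline it replaces HTTP with an in-memory dict; the contents of the two part documents are invented for the example, and collect_uris and DOCS are hypothetical names.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SM, "rs": "http://www.openarchives.org/rs/terms/"}


def _urlset(*uris):
    """Build a tiny Resource List part for the example (made-up content)."""
    body = "".join("<url><loc>%s</loc></url>" % u for u in uris)
    return '<urlset xmlns="%s">%s</urlset>' % (SM, body)


# Stand-in for the Source's web server: URL -> document text.
DOCS = {
    "http://example.com/resourcelist.xml": (
        '<sitemapindex xmlns="%s" '
        'xmlns:rs="http://www.openarchives.org/rs/terms/">'
        '<rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>'
        "<sitemap><loc>http://example.com/resourcelist-part2.xml</loc></sitemap>"
        "<sitemap><loc>http://example.com/resourcelist-part1.xml</loc></sitemap>"
        "</sitemapindex>" % SM
    ),
    "http://example.com/resourcelist-part1.xml": _urlset(
        "http://example.com/res1", "http://example.com/res2"
    ),
    "http://example.com/resourcelist-part2.xml": _urlset(
        "http://example.com/res3"
    ),
}


def collect_uris(url, fetch):
    """Return all resource URIs reachable from `url`.

    `fetch` is any callable mapping a URL to document text; a real client
    would perform an HTTP GET here.
    """
    root = ET.fromstring(fetch(url).encode("utf-8"))
    if root.tag == "{%s}sitemapindex" % SM:
        uris = []
        for part in root.findall("sm:sitemap", NS):
            uris.extend(collect_uris(part.findtext("sm:loc", namespaces=NS), fetch))
        return uris
    return [u.findtext("sm:loc", namespaces=NS) for u in root.findall("sm:url", NS)]


all_uris = collect_uris("http://example.com/resourcelist.xml", DOCS.__getitem__)
print(sorted(all_uris))
```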
Other Capabilities
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics Walkthrough
• Demos
• Status and Next Steps
Available code
• ResourceSync client and library (Python)
• ResourceSync source simulator
http://github.com/resync
Install resync client/library
$ git clone git://github.com/resync/resync.git
$ cd resync/
$ python setup.py build
$ sudo python setup.py install
or
$ sudo easy_install resync
or
$ sudo pip install resync
Install resync simulator
$ git clone git://github.com/resync/simulator.git
$ cd simulator/
$ chmod u+x simulate-source
$ ./simulate-source
$ sudo easy_install tornado
Run client against simulator
$ resync -b http://localhost:8888
$ resync -i http://localhost:8888
resync @ arxiv.org
resync -v --noauth http://resync.library.cornell.edu/arxiv-q-bio\=/tmp/qbio http://resync.library.cornell.edu/arxiv\=/tmp/arxiv
resync @ en.wikipedia.org
ResourceSync
• What and Why?
• Synchronization Scenarios
• ResourceSync Basics Walkthrough
• Demos
• Status and Next Steps
Status
• Beta spec (v.0.6) for public comment: http://www.openarchives.org/rs/0.6/resourcesync
• Tool development started
• Separate documents for archiving and push deployments
Next Steps
• Continue tool development & deployment
• Collect
• public comments on [email protected]
• implementation issues on https://github.com/resync/resync/issues
• Version 0.9 to be released in summer 2013
• Version 1.0 in fall 2013 (NISO standard)
Thanks!
@bhaslhofer
http://slideshare.net/bhaslhofer
http://openarchives.org/rs