Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson
Web Science and Digital Libraries Research Group
Old Dominion University
Norfolk, Virginia, USA
@WebSciDL

InterPlanetary Wayback
The Next Step Towards Decentralized Web Archiving

IPFS Lab Day, Decentralized Web Summit, 2018
San Francisco, CA (USA)
August 3, 2018

http://github.com/oduwsdl/ipwb



Content Addressing

http://foo.com/spaceDog.jpg → QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4

http://example.org/yuri.jpg → QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4

Same content === same hash, regardless of URL

$ ipfs cat QmZAD4xeeNeYF3TmwWgBXypLKTiCGwGRMXHW7MtheWKtw4 > doge.jpg
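The idea on this slide can be illustrated with a minimal Python sketch. It uses a plain SHA-256 hex digest as a stand-in for an IPFS hash; real IPFS identifiers are multihash-encoded and computed over chunked data, so these values will not look like `Qm...` strings. The function name `content_address` and the sample bytes are hypothetical.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a digest derived only from the bytes, not from where they live."""
    return hashlib.sha256(data).hexdigest()


# Two URLs serving byte-identical images (the sample bytes are made up).
resources = {
    "http://foo.com/spaceDog.jpg": b"\xff\xd8\xff\xe0 same image bytes",
    "http://example.org/yuri.jpg": b"\xff\xd8\xff\xe0 same image bytes",
}

# Map each URL to the address of its content.
addresses = {url: content_address(body) for url, body in resources.items()}

# Both URLs collapse to a single address because the bytes are identical.
unique_addresses = set(addresses.values())
```

Because the address depends only on the bytes, storing both resources would need only one copy.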

@ibnesayeed


Rendered HTML vs. Source Code


HTTP Response vs. WARC Record

[Figure: an HTTP response (HTTP headers and payload) shown next to a WARC record, which adds WARC headers]


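How Does the Wayback Machine Work?

● The Archival Indexer processes WARC files and outputs references into the Archival Index (e.g., CDXJ)
● The Replay Engine reads a (file, offset) reference from the index
● The Replay Engine reads the archived content at that location
● The Replay Engine presents the WARC content to the user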
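A rough sketch of the index-then-seek pattern described on this slide, using an in-memory byte stream in place of a WARC file. The index keys, the record contents, and the use of an explicit length next to the offset are assumptions made for illustration; they are not taken from any particular Wayback implementation.

```python
import io

# Hypothetical archive file holding two records back to back.
ARCHIVE = io.BytesIO(b"RECORD-ONE;RECORD-TWO-IS-LONGER;")

# Hypothetical index: lookup key -> (offset, length) within the archive.
INDEX = {
    "com,example)/ 20180802012013": (0, 11),
    "com,example)/about 20180802012013": (11, 21),
}


def replay(key: str) -> bytes:
    """Look the key up in the index, then read just that slice of the archive."""
    offset, length = INDEX[key]
    ARCHIVE.seek(offset)
    return ARCHIVE.read(length)


first = replay("com,example)/ 20180802012013")
second = replay("com,example)/about 20180802012013")
```

The replay side never scans the archive; it only follows the references the indexer produced.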


Memento: Time Dimension to the Web

https://tools.ietf.org/html/rfc7089
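As a small illustration of the time dimension, the sketch below picks the capture closest to a requested datetime. Choosing the closest capture is only one possible selection policy and is an assumption here; the function name `closest_memento` and the capture list are hypothetical.

```python
from datetime import datetime

FORMAT = "%Y%m%d%H%M%S"  # 14-digit datetime, as used in the CDXJ records


def closest_memento(memento_datetimes, requested):
    """Return the capture datetime nearest to the requested datetime."""
    target = datetime.strptime(requested, FORMAT)
    return min(
        memento_datetimes,
        key=lambda value: abs(datetime.strptime(value, FORMAT) - target),
    )


# Hypothetical captures of one resource.
captures = ["20170101000000", "20180802012013", "20190615120000"]
chosen = closest_memento(captures, "20180901000000")
```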


Why IPWB?

● Persistence of archived web dependent on resilience of organizations
● Availability of data is subject to censorship
● Redundancy in web archive files of exact duplicate content
● Lack of public participation in web archiving
● Discoverability issue of small web archives


Indexing


WARC Creation → HTTP Header & Payload Extraction → Push to IPFS → Generate CDXJ → WARC-CDXJ Correspondence


HTTP HEADER BLOCK

HTTP PAYLOAD BLOCK
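The extraction step can be sketched as splitting a raw HTTP response at the first blank line. This is a minimal illustration, not IPWB's own code; the function name `split_http_response` and the sample response are hypothetical, and whether the separator is kept with either block is an assumption (here it is dropped).

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP response into (header block, payload block)."""
    header_block, separator, payload_block = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no blank line separating headers from payload")
    return header_block, payload_block


# Hypothetical response as it might appear inside a WARC response record.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)
header_block, payload_block = split_http_response(raw_response)
```

Keeping the two blocks separate means identical payloads served with different headers can still share one stored payload.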


HEADER DIGEST: QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB

PAYLOAD DIGEST: Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzXntvbp9sL


com,example,ipwb)/ 20180802012013 {
  "locator": "urn:ipfs/QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB/Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzXntvbp9sL",
  "mime_type": "text/html",
  "status_code": 200,
  "other_fields": "other values..."
}

* This is a single-line record; line breaks and indentation are added for readability only. The locator has the form urn:ipfs/<HEADER DIGEST>/<PAYLOAD DIGEST>.

CDXJ: http://ws-dl.blogspot.com/2015/09/2015-09-10-cdxj-object-resource-stream.html
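A record of this shape could be assembled roughly as follows. This is a sketch under assumptions: `surt_key` is a simplified, hypothetical helper that only reverses host labels and appends the path (real SURT canonicalization handles ports, schemes, and query strings), and the original URI `http://ipwb.example.com/` is inferred from the key on the slide.

```python
import json
from urllib.parse import urlsplit


def surt_key(uri: str) -> str:
    """Simplified SURT-style key: reversed host labels, then ')' and the path."""
    parts = urlsplit(uri)
    host = ",".join(reversed(parts.hostname.lower().split(".")))
    return f"{host}){parts.path or '/'}"


def cdxj_record(uri, datetime14, header_digest, payload_digest, mime_type, status_code):
    """Build one single-line CDXJ record pointing at two IPFS objects."""
    fields = {
        "locator": f"urn:ipfs/{header_digest}/{payload_digest}",
        "mime_type": mime_type,
        "status_code": status_code,
    }
    return f"{surt_key(uri)} {datetime14} {json.dumps(fields)}"


record = cdxj_record(
    "http://ipwb.example.com/",  # assumed original URI
    "20180802012013",
    "QmcN9eWwRF73dZj5BgT4x8jeEcFrxurX1hot8QwCbMi9PB",
    "Qmczh9YnB4U1ptPeqxcaTZA4aMmuNUswTLTWzXntvbp9sL",
    "text/html",
    "200",
)
```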


edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...", "mime_type": "text/html", "status_code": "200"}

edu,odu,cs)/~salam/dweb/style.css 20180802012013 { "locator": "urn:ipfs/QmU1k71bT6ibZBSd.../QmbvUAo9U31wSdvA...", "mime_type": "text/css", "status_code": "200"}

edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 { "locator": "urn:ipfs/QmTjfMxFGvbP4nwF.../QmYMKZbnk53kuPJi...", "mime_type": "image/png", "status_code": "200"}
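Records like these are straightforward to read back. The sketch below splits each line into its key, datetime, and JSON block. Because the digests are truncated on the slide, the locators here use placeholder names (`HEADER_A`, `PAYLOAD_A`, and so on) that are purely hypothetical.

```python
import json

# Placeholder digests stand in for the truncated hashes shown on the slide.
CDXJ_LINES = [
    'edu,odu,cs)/~salam/dweb/ 20180802012013 '
    '{"locator": "urn:ipfs/HEADER_A/PAYLOAD_A", "mime_type": "text/html", "status_code": "200"}',
    'edu,odu,cs)/~salam/dweb/style.css 20180802012013 '
    '{"locator": "urn:ipfs/HEADER_B/PAYLOAD_B", "mime_type": "text/css", "status_code": "200"}',
    'edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 '
    '{"locator": "urn:ipfs/HEADER_C/PAYLOAD_C", "mime_type": "image/png", "status_code": "200"}',
]


def parse_cdxj(line: str):
    """Split a CDXJ line into (key, 14-digit datetime, dict of JSON fields)."""
    key, datetime14, json_block = line.split(" ", 2)
    return key, datetime14, json.loads(json_block)


parsed = [parse_cdxj(line) for line in CDXJ_LINES]
```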


Replay


https://www.cs.odu.edu/~salam/dweb/

edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF.../Qmczh9YnB4...", "mime_type": "text/html", "status_code": "200"}

edu,odu,cs)/~salam/dweb/style.css 20180802012013 { "locator": "urn:ipfs/QmU1k71bT6.../QmbvUAo9U3...", "mime_type": "text/css", "status_code": "200"}

edu,odu,cs)/~salam/dweb/wsdl-logo.png 20180802012013 { "locator": "urn:ipfs/QmTjfMxFGv.../QmYMKZbnk5...", "mime_type": "image/png", "status_code": "200"}

Lookup in CDXJ → Fetch from IPFS → Reconstruct Response → Reroute Resources
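The lookup step can be sketched as a binary search over a sorted index. Whether IPWB itself uses binary search is an assumption; the tuple layout and the placeholder locators (`HEADER_A`, `PAYLOAD_A`, and so on) are hypothetical.

```python
import bisect

# Sorted (key, datetime, locator) entries; locators use placeholder digests.
INDEX = sorted([
    ("edu,odu,cs)/~salam/dweb/", "20180802012013", "urn:ipfs/HEADER_A/PAYLOAD_A"),
    ("edu,odu,cs)/~salam/dweb/style.css", "20180802012013", "urn:ipfs/HEADER_B/PAYLOAD_B"),
    ("edu,odu,cs)/~salam/dweb/wsdl-logo.png", "20180802012013", "urn:ipfs/HEADER_C/PAYLOAD_C"),
])


def lookup(index, key):
    """Return all (datetime, locator) pairs recorded for a key, oldest first."""
    # A one-element tuple sorts just before every entry that starts with the key.
    start = bisect.bisect_left(index, (key,))
    matches = []
    for entry_key, datetime14, locator in index[start:]:
        if entry_key != key:
            break
        matches.append((datetime14, locator))
    return matches


css_captures = lookup(INDEX, "edu,odu,cs)/~salam/dweb/style.css")
```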


edu,odu,cs)/~salam/dweb/ 20180802012013 { "locator": "urn:ipfs/QmcN9eWwRF73dZj5.../Qmczh9YnB4U1ptPe...", "mime_type": "text/html", "status_code": "200"}

HTTP HEADER BLOCK

HTTP PAYLOAD BLOCK


[Figure: the HTTP HEADER BLOCK and HTTP PAYLOAD BLOCK are combined to reconstruct the response]
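Fetching and reconstruction can be sketched together. The example below replaces IPFS with an in-memory dictionary so that it runs without a daemon; `FAKE_IPFS`, `fetch`, `reconstruct`, and the placeholder digests are all hypothetical, and joining the two blocks with a blank line is an assumption about how the header block was stored.

```python
# In-memory stand-in for IPFS: digest -> stored bytes (placeholder digests).
FAKE_IPFS = {
    "HEADER_A": b"HTTP/1.1 200 OK\r\nContent-Type: text/html",
    "PAYLOAD_A": b"<html><body>Archived page</body></html>",
}


def fetch(digest: str) -> bytes:
    """Stand-in for retrieving an object from IPFS by its hash."""
    return FAKE_IPFS[digest]


def reconstruct(locator: str) -> bytes:
    """Fetch both blocks named in a urn:ipfs locator and join them."""
    scheme, header_digest, payload_digest = locator.split("/")
    if scheme != "urn:ipfs":
        raise ValueError(f"unsupported locator: {locator}")
    return fetch(header_digest) + b"\r\n\r\n" + fetch(payload_digest)


response = reconstruct("urn:ipfs/HEADER_A/PAYLOAD_A")
```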


● https://oduwsdl.github.io/Reconstructive/
● http://ws-dl.blogspot.com/2018/01/2018-01-08-introducing-reconstructive.html
● Avoids zombies (live-leakage)
● Adds an unobtrusive archival banner (Custom HTML Element)


IPWB Indexing and Replay


Decentralization


Current Issues

● IPFS is permanent, but not persistent
● DHT-based IPNS is history-unaware
● CDXJ index, a critical piece of replay, is centralized


Persistence

● Data persistence is critical for web archiving
● A decentralized storage with sufficient replication is needed
● Memory organizations should contribute storage infrastructure
● Qri, Filecoin, IPFS-Cluster, IPFS-Sync, etc. can be helpful


IPNS: InterPlanetary Naming System

URI IPFS Hash

http://example.org/yuri.jpg

http://example.com/style.css

http://example.com/logo.png

http://example.com/style.css

How about changes and history?


IPNS Blockchain

● URI → Latest hash
● URI + DateTime → A historical hash
● URI → List historical hashes with times

https://github.com/oduwsdl/IPNS-Blockchain

Owner   URI   Time  Hash  PrevBlock
Pub K1  URI1  T1    H1    1234567...
Pub K2  URI2  T2    H2    0000000...
Pub K3  URI3  T3    H3    9876543...

Owner   URI   Time  Hash  PrevBlock
Pub K1  URI1  T4    H5    5463728...
Pub K3  URI4  T5    H6    0000000...

IPNS + Blockchain + Memento
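The three lookups listed on this slide can be sketched over the sample entries from the tables. This is not the IPNS-Blockchain implementation: it ignores owners, signatures, and block linkage, represents times T1..T5 as the integers 1..5, and the function names `latest`, `at`, and `timeline` are hypothetical.

```python
from bisect import bisect_right

# (uri, time, hash) entries from the two sample blocks; times T1..T5 -> 1..5.
ENTRIES = [
    ("URI1", 1, "H1"),
    ("URI2", 2, "H2"),
    ("URI3", 3, "H3"),
    ("URI1", 4, "H5"),
    ("URI4", 5, "H6"),
]

# Group entries per URI in chronological order.
HISTORY = {}
for uri, time, digest in sorted(ENTRIES, key=lambda entry: entry[1]):
    HISTORY.setdefault(uri, []).append((time, digest))


def latest(uri):
    """URI -> latest hash (None if the URI is unknown)."""
    versions = HISTORY.get(uri, [])
    return versions[-1][1] if versions else None


def at(uri, time):
    """URI + DateTime -> the hash that was current at that time, or None."""
    versions = HISTORY.get(uri, [])
    position = bisect_right([t for t, _ in versions], time)
    return versions[position - 1][1] if position else None


def timeline(uri):
    """URI -> list of (time, hash) pairs, oldest first."""
    return list(HISTORY.get(uri, []))
```

For example, `at("URI1", 3)` falls between the two recorded versions of URI1 and so resolves to the earlier hash.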


Lazy Relationship Evaluation

/<namespace>/about/<URI>

IP-LD to the rescue?

https://github.com/oduwsdl/ipwb/issues/61

[Figure: Mementos connected by MementoOf (Active Relation) and HasMemento (Lazy Evaluation) links]


Evaluation


Storage Space and Time Overhead

● Reported IPFS slowness: https://github.com/ipfs/go-ipfs/issues/1216
  ○ Has since been fixed, but we did not evaluate again
● 570 files per minute
● ~10% overhead


Replay Time

● 600 requests in 222 seconds
● Slower than PyWB (which took 5.26 seconds)
● File vs. rich object based retrieval
● Never expiring cache

https://github.com/ibnesayeed/ipfsapi-concurrency-test
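To put the replay figures in perspective, the arithmetic below converts them to request rates. It assumes the PyWB time of 5.26 seconds was measured on the same 600 requests, which the slide implies but does not state outright.

```python
requests = 600
ipwb_seconds = 222
pywb_seconds = 5.26  # assumed to cover the same 600 requests

ipwb_rate = requests / ipwb_seconds     # about 2.7 requests per second
pywb_rate = requests / pywb_seconds     # about 114 requests per second
slowdown = ipwb_seconds / pywb_seconds  # about 42 times slower
```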


Future Work

● Evaluate the improved IPFS on a large dataset
● Evaluate deduplication
● Implement an index-free collaborative archiving system
● Utilize IPNS to reference URI-Rs with datetime


Conclusions

● A proof-of-concept system that leverages a novel approach to archiving and retrieval
● Storage and time cost evaluation and qualitative analysis
● It can only work for small archives in its current state
● A path to answer “who will archive the archives?”
● More work is needed to make it a truly decentralized archiving system


InterPlanetary Wayback
The Next Step Towards Decentralized Web Archiving

@WebSciDL

http://github.com/oduwsdl/ipwb

Supported in part by Protocol Labs, AMF 11600663, and NSF IIS-1526700

Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson