A Comparison Between the Performance of Wayback Machines Fernando Melo [email protected]

A Comparison Between the Performance of Wayback Machinessobre.arquivo.pt/wp-content/uploads/a-comparison-between... · 2018-12-28 · What is a Wayback Machine? Software Component

A Comparison Between the Performance of Wayback Machines

Fernando Melo

Main reasons for this study

Outdated Wayback

Evaluate possible alternatives

How does a Web archive work?

[Diagram: web archive workflow. Labels: 2016 Live Page, Crawl, (W)ARC, Index (Lucene, CDX), Search, Replay, Wayback]

What is a Wayback Machine?

Software Component

Replay Archived Web Pages

Search by URL and Date
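
Search by URL and date usually maps onto a replay address that embeds a capture timestamp and the original URL. A minimal sketch of that convention, assuming the 14-digit `YYYYMMDDhhmmss` timestamp form common to Wayback-style replay URLs. The prefix `http://archive.example/wayback` and the helper `build_replay_url` are hypothetical and do not come from the slides:

```python
from datetime import datetime


def build_replay_url(replay_prefix: str, capture_time: datetime, original_url: str) -> str:
    """Compose a Wayback-style replay URL: <prefix>/<14-digit timestamp>/<original URL>."""
    timestamp = capture_time.strftime("%Y%m%d%H%M%S")
    return f"{replay_prefix.rstrip('/')}/{timestamp}/{original_url}"


# Hypothetical prefix; a real deployment defines its own replay path.
replay_url = build_replay_url(
    "http://archive.example/wayback",
    datetime(2012, 7, 14),
    "http://example.eu/",
)
print(replay_url)
```

Each of the tested Wayback Machines may use a different prefix or accept partial timestamps, so the exact shape should be checked against the deployment being queried.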

Common Wayback Machine Issues

Slow Replay

Not Found Errors

Live-Web Leaks

[Diagram: a link from a 2010 archived page leads to a 2016 live page]

[Diagram: a link from a 2010 archived page leads to a 2010 archived page]

[Screenshots: Original Web Page, July 14th, 2012, and Archived Web Page, July 14th, 2012]

Let’s evaluate the performance of Wayback Machine software!

Wayback Machines

Arquivo.pt Wayback

OpenWayback

PyWb

Arquivo.pt Wayback

Derives from version 1.2.1 of Open Source Wayback Machine (2008)

Java

Used by Arquivo.pt

Outdated - Presents several replay issues

PyWb Wayback

Developed by Ilya Kreymer

Python

Used by

http://rhizome.org

http://webrecorder.io

http://perma.cc

OpenWayback

Released by the Internet Archive

Maintained by the IIPC

Java

OpenWayback - Users

National and University Library of Iceland

The British Library

Archive-It Mirror @ ODU

Stanford Web Archive Portal

The Library of Congress

Bibliotheca Alexandrina

York University Digital Library

Bibliothèque nationale de France

University of North Texas Libraries

The .EU Collection - 2014

Domains can be sold to anyone with a valid address in the European Union

European Institutions, Online Shops, and Web Spam

250 million documents from 34 thousand seeds

6TB

Methodology

400 URLs from the .EU

WebPageTest service

4 Wayback Configurations

HAR – to record performance data

Only test each URL once

Tested using WebPageTest public servers

Response timeout of 2 minutes

Error Code – Leak to the live Web
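
A HAR file stores one entry per request, with the request URL and the response status, so the per-status counts can be tallied directly from it. The sketch below is an illustration only: it follows the standard HAR layout (`log.entries[].request.url`, `log.entries[].response.status`). The leak check, which flags any request whose host differs from the archive host, is an assumption and is not the study's own criterion (the slides identify leaks through an error code). The `archive.example` host and the `summarize_har` helper are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_har(har: dict, archive_host: str) -> dict:
    """Tally HTTP status codes and list requests that left the archive host."""
    statuses = Counter()
    leaks = []
    for entry in har["log"]["entries"]:
        statuses[entry["response"]["status"]] += 1
        url = entry["request"]["url"]
        # Assumption: a request to any other host counts as a live-Web leak.
        if urlsplit(url).hostname != archive_host:
            leaks.append(url)
    return {"statuses": dict(statuses), "leaks": leaks}


# Tiny hand-made HAR fragment, not data from the study.
sample_har = {"log": {"entries": [
    {"request": {"url": "http://archive.example/wayback/20140101000000/http://a.eu/"},
     "response": {"status": 200}},
    {"request": {"url": "http://archive.example/wayback/20140101000000/http://a.eu/x.css"},
     "response": {"status": 404}},
    {"request": {"url": "http://a.eu/live.js"},
     "response": {"status": 200}},
]}}

summary = summarize_har(sample_har, "archive.example")
print(summary)
```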

Wayback Specifications

Wayback                Year
Arquivo Pwa Lucene     2008
PyWb CDX               2015
PyWb Pwa Lucene        2015
OpenWayback CDX        2015

Replay Quality – HTTP Status and Error Codes

Results – Live Web Leaks

[Bar chart: number of URLs (axis 0 to 18000) for Arquivo PwaLucene, PyWb CDX, PyWb PwaLucene, and OpenWayback CDX]

Results – Timeout Error

[Bar chart: number of URLs (axis 0 to 1200) for Arquivo PwaLucene, PyWb CDX, PyWb PwaLucene, and OpenWayback CDX]

Results – 200 OK Status Code

[Bar chart: number of URLs (axis 0 to 25000) for Arquivo PwaLucene, PyWb CDX, PyWb PwaLucene, and OpenWayback CDX]

Results – 404 Error HTTP Code

[Bar chart: number of URLs (axis 0 to 7000) for Arquivo PwaLucene, PyWb CDX, PyWb PwaLucene, and OpenWayback CDX]

Results – Summary Table

Wayback           Success   Error    Success/Error
Arquivo           3 930     17 711   0.22
PyWb CDX          19 415    7 082    2.74
PyWb PwaLucene    11 087    4 652    2.38
OpenWayback       13 068    4 668    2.80
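
The Success/Error column is the success count divided by the error count, rounded to two decimals. A short sketch that recomputes it from the table's own figures; the `success_error_ratio` helper is hypothetical, written only for this check:

```python
# Success and error counts copied from the summary table.
results = {
    "Arquivo": (3930, 17711),
    "PyWb CDX": (19415, 7082),
    "PyWb PwaLucene": (11087, 4652),
    "OpenWayback": (13068, 4668),
}


def success_error_ratio(success: int, error: int) -> float:
    """Return successes per error, rounded to two decimals as in the table."""
    return round(success / error, 2)


ratios = {name: success_error_ratio(s, e) for name, (s, e) in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A ratio above 1 means a configuration returned more successful responses than errors; only the Arquivo configuration falls below that line.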

Response Speed

Results – Average Load Time

[Bar chart: average load time in seconds (axis 0 to 40) for Arquivo PwaLucene, PyWb CDX, PyWb PwaLucene, and OpenWayback CDX]

Conclusions

PyWb returned the largest number of 200 OK HTTP status codes

OpenWayback was the fastest Wayback

Replace or Update Arquivo.pt’s Wayback!

Future Work

Test with older collections to evaluate the performance of Wayback Machine software

Test with a private instance of the WebPageTest server, to be able to execute more tests and to control the server workload

References

https://github.com/Fernando-Melo/WaybackComparison