A temporal overview of TNA’s CDX index
Philip Webster, The University of Sheffield
Claire Newing, The National Archives
22/06/2017 © Philip Webster / The National Archives
Contents
• The Project
• The UK Government Web Archive
• CDX processing
• Archive Overview
• Temporal analysis – HTTP codes, media types and protocols
The Project
• What information can be extracted from CDX files?
• What can analysis of that data tell us about the UK Government Web Archive?
• What are the potential pitfalls with using CDX data in this way?
22/06/2017 © The University of Sheffield
UK Government Web Archive
• Administered by The National Archives
• Archived versions of UK central government websites dating from 1996 to the present
• Around 4,000 unique websites captured at least once
• Over 4 billion archive entries (DNS and HTTP: images, HTML, page resources and document types)
UK Government Web Archive
• Data made available as ‘units’
• Semi-arbitrary division of the entire archive over physical drives – 1 drive is 1 ‘unit’
• No guarantee of chronological sequence
• ARC files hold UKGWA archive data
• CDX index available (derived from ARC)
CDX processing
• Text-based index format
• Easily machine-readable
• Inefficient representation of dates and numeric types
• Easy to scan sequentially, but difficult to use for faceted, dynamic querying
• (because it wasn’t designed for that)
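As an illustration of the points above, here is a minimal sketch of turning one CDX line into typed values. It assumes the widely used 11-field, space-delimited layout (header ` CDX N b a m s k r M S V g`); the slides do not state which layout the UKGWA index uses, and the `CdxEntry` and `parse_cdx_line` names and the sample line are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class CdxEntry(NamedTuple):
    url: str
    captured: datetime
    mime_type: str
    status: Optional[int]  # None when the index records "-" instead of a code
    size: Optional[int]    # compressed record size in bytes, if present


def parse_cdx_line(line: str) -> Optional[CdxEntry]:
    """Parse one CDX line; return None for header or malformed lines."""
    if line.lstrip().startswith("CDX"):
        return None  # header line describing the field layout
    fields = line.split()
    if len(fields) != 11:  # assumption: 11-field layout
        return None
    _, timestamp, url, mime, status, _, _, _, size, _, _ = fields
    try:
        # Dates are stored as 14-digit text (YYYYMMDDhhmmss)
        captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    except ValueError:
        return None
    return CdxEntry(
        url=url,
        captured=captured,
        mime_type=mime,
        status=int(status) if status.isdigit() else None,
        size=int(size) if size.isdigit() else None,
    )


# Hypothetical sample line, not taken from the archive
sample = ("example.gov.uk/ 20100315120000 http://www.example.gov.uk/ "
          "text/html 200 ABCDEF - - 5120 1024 example.arc.gz")
entry = parse_cdx_line(sample)
```

Every date and number has to be converted from text on each pass, which is one reason a sequential scan is easy but repeated faceted querying is costly.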
Archive overview
• UKGWA composition:
• By media (MIME) type, temporal coverage, file size, HTTP response code
• Temporal range from 1996-present
• Most data from 2008-present
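Temporal coverage of this kind can be obtained by bucketing capture timestamps by month. The sketch below assumes 14-digit timestamps (YYYYMMDDhhmmss); the function names and sample values are hypothetical and are not the project's own code.

```python
from collections import Counter


def month_key(timestamp: str) -> str:
    """Turn a capture timestamp (YYYYMMDDhhmmss) into a 'YYYY-MM' bucket."""
    if len(timestamp) < 6 or not timestamp[:6].isdigit():
        raise ValueError(f"unexpected timestamp: {timestamp!r}")
    return f"{timestamp[:4]}-{timestamp[4:6]}"


def entries_per_month(timestamps):
    """Count index entries per calendar month."""
    return Counter(month_key(ts) for ts in timestamps)


# Hypothetical timestamps for illustration
sample_timestamps = ["20080115093000", "20080120110000", "20140502000000"]
per_month = entries_per_month(sample_timestamps)
```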
Archive overview
[Figure: Temporal distribution of CDX index entries in the UKGWA. x-axis: Month/year, June 2001 to May 2014; y-axis: Number of entries, 0 to 120,000,000.]
HTTP status codes
• 3.3 billion HTTP status codes (3,279,650,659)
• Data range: August 2003 to August 2014
HTTP status codes (absolute)
[Figure: Raw frequency data for 200, 302 and 404 HTTP responses during crawls, 2003-2014. x-axis: month, August 2003 to August 2014; y-axis: count, 0 to 100,000,000. Series: HTTP 302, HTTP 404, HTTP 200.]
HTTP status codes (absolute)
• Absolute frequency counts highlight peaks and troughs in crawl frequency.
• Data is sparse prior to 2008
• Proportions, rather than raw counts, must be used to identify shifts in the relative frequency of each code
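A minimal sketch of that normalisation: each status code's count is divided by the month's total, so months with heavy and light crawling become comparable. The input shape, a sequence of (month, status) pairs, and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def monthly_status_percentages(records):
    """records: iterable of (month, status) pairs, e.g. ("2010-03", 200)."""
    counts = defaultdict(Counter)
    for month, status in records:
        counts[month][status] += 1
    percentages = {}
    for month, by_status in counts.items():
        total = sum(by_status.values())
        percentages[month] = {
            status: 100.0 * n / total for status, n in by_status.items()
        }
    return percentages


# Hypothetical data: a sparse month and a busy month
records = (
    [("2008-01", 200)] * 8 + [("2008-01", 404)] * 2
    + [("2012-01", 200)] * 60 + [("2012-01", 404)] * 30
    + [("2012-01", 302)] * 10
)
result = monthly_status_percentages(records)
```

In this made-up data the busy month has far more 404 responses in absolute terms (30 against 2), but only the percentages (30% against 20%) say anything about a shift in the mix.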
HTTP status codes (relative)
• Graph restricted to 3 key HTTP status codes:
• 200 (success)
• 302 (redirect)
• 404 (not found)
• Other codes (500 etc.) excluded.
HTTP status codes (relative)
[Figure: Percentage data for 200, 302 and 404 HTTP responses during crawls, 2003-2014. x-axis: month, August 2003 to August 2014; y-axis: percentage. Series: HTTP 302, HTTP 404, HTTP 200.]
Post-2008 HTTP status codes
[Figure: Percentage data for 200, 302 and 404 HTTP responses during crawls, 2008-2014. x-axis: month; y-axis: percentage. Series: HTTP 302, HTTP 404, HTTP 200.]
HTTP status codes - trends
• Gradual increase in non-success response codes
• Possibly indicative of increased use of dynamic sites (HTTP 500), access control, or site closure.
• Gradual increase in the number of redirects (302) and not found (404) codes.
HTTP status codes - issues
• Data is known to be influenced by short-term changes in crawler focus.
• Shifting focus to specific domains of interest can skew results.
• Researchers using archive CDX data should consider this.
Media types (MIME types)
• Restricted to 4 media type groups:
• application/*
• image/*
• text/*
• video/*
• Significant media types within these groups selected for investigation
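The grouping above amounts to keeping the top-level part of each media type label. A minimal sketch, assuming labels of the form `type/subtype`; the `media_group` name is hypothetical, and anything outside the four groups is returned as `None`.

```python
from typing import Optional

GROUPS = ("application", "image", "text", "video")


def media_group(mime_type: str) -> Optional[str]:
    """Return the top-level group of a media type, or None if not analysed."""
    top_level, sep, subtype = mime_type.strip().partition("/")
    if not sep or not subtype:
        return None  # not a type/subtype label
    top_level = top_level.lower()  # media type names are case-insensitive
    return top_level if top_level in GROUPS else None
```

For example, `media_group("application/pdf")` gives `"application"`, while `media_group("audio/mpeg")` gives `None`.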
Media types - application
• application/x-shockwave-flash
• application/x-java
• application/java-byte-code
• application/javascript
• application/msword
• application/pdf
• application/rtf
• application/vnd.ms-excel
• application/xml
• application/zip
Media types - application
[Figure: Frequency of application media types. Categories shown: application/pdf; application/atom+xml; application/rss+xml; application/x-javascript; application/xml; application/octet-stream; application/msword; application/x-shockwave-flash.]
Media types - application
[Figure: Application media type frequency over time. x-axis: Month, June 2001 to May 2014; y-axis: Frequency, 0 to 7,000,000. Series: pdf_frequency, rss_frequency, msword_frequency, msexcel_frequency, odfsheet_frequency, odfsheet_frequency-2, javascript_frequency, flash_frequency, java_frequency, other_frequency.]
Media types - application
[Figure: Application media types as percentage of total, 2008-2014. x-axis: Month/Year, January 2008 to January 2014; y-axis: Percentage of total. Series: PDF, RSS, Word, Excel, Word (XML), Excel (XML), ODF doc, ODF sheet, Javascript, Flash, RDF, Atom, JSON, XML, Other.]
Media types - executable
[Figure: Executable media types (share of total). application/x-javascript: 68%; application/x-shockwave-flash: 20%; application/javascript: 12%.]
Media types - executable
[Figure: Executable content percentage over time. x-axis: Month; y-axis: Percentage. Series: javascript_percent, flash_percent, java_percent.]
Media types - document
[Figure: Document media types over time. x-axis: Month; y-axis: Percentage of total. Series: pdf_percent, msword_percent, msexcel_percent, odfdoc_percent, odfsheet_percent.]
Media types - document
[Figure: Document media types over time, excluding PDF. x-axis: Month; y-axis: Percentage of total. Series: msword_percent, msexcel_percent, odfdoc_percent, odfsheet_percent.]
Media types - image
• Consists of inline images appearing in documents, plus icons:
• image/jpeg
• image/png
• image/gif
• image/x-icon
• Occasional use of non-standard media type labels was ignored for this analysis
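One simple way to leave non-standard labels out is to count only exact matches against the selected types. This is a sketch of that idea, not necessarily how the analysis was implemented; `count_selected` and the sample labels are illustrative.

```python
from collections import Counter

SELECTED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/gif", "image/x-icon"}


def count_selected(mime_types, selected=SELECTED_IMAGE_TYPES):
    """Count labels that exactly match a selected type; tally the rest."""
    counts = Counter()
    ignored = 0
    for mime in mime_types:
        if mime in selected:
            counts[mime] += 1
        else:
            ignored += 1  # non-standard or unselected label
    return counts, ignored


# Illustrative labels, including some non-standard variants
labels = ["image/jpeg", "image/JPEG", "image/pjpeg", "image/png",
          "image/$inputFileExtension", "image/gif", "image/jpeg"]
counts, ignored = count_selected(labels)
```

Exact matching means variants such as `image/JPEG` are not folded into `image/jpeg`; reporting the ignored tally shows how much data that choice leaves out.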
Media types - image
[Figure: Image media types. Labels shown: image/jpeg; image/gif; image/png; image/pjpeg; image/jpg; image/x-icon; image/bmp; image/svg+xml; image/x-png; image/vnd.microsoft.icon; image/JPEG; image/vnd.wap.wbmp; image/$inputFileExtension; image/JPG; image/tiff; image/*,%20image/gif; image/.jpg.]
Media types - image
[Figure: Frequencies of common image media types over time. x-axis: Month, June 2001 to June 2014; y-axis: Frequency, 0 to 20,000,000. Series: image/svg, image/png, image/jpeg, image/gif, image/x-icon, image/tiff, Other.]
Media types - image
[Figure: Image media types as percentage of total over time. x-axis: Month; y-axis: Percentage of total. Series: image/svg, image/png, image/jpeg, image/gif, image/x-icon, image/tiff, Other.]
Media types - text
• Plain text formats, including hypertext:
• text/plain
• text/html
• text/x-html
• Occasional use of non-standard media type labels was ignored for this analysis
• Entirely dominated by HTML
Media types - text
[Figure: Text media types. Labels shown: text/html; text/plain; text/css; text/xml; text/javascript; text/csv; text/calendar; text/turtle; text/x-perl; text/rdf+n3; text/vbscript; text/n3; text/tab-separated-values; text/rtf; text/x-js; text/x-cross-domain-policy; text/x-component; text/x-vCalendar; text/HTML; text/enriched; text/comma-separated-values; text/x-c; text/js; text/x-vcard; text/Calendar; text/vcard; text/richtext; text/x-csv; text/x-javascript; text/x-ms-iqy.]
Media types - text
[Figure: Text media types (excluding HTML). Labels shown: text/plain; text/css; text/xml; text/javascript; text/csv; text/calendar; text/turtle; text/x-perl; text/rdf+n3; text/vbscript; text/n3; text/tab-separated-values; text/rtf; text/x-js; text/x-cross-domain-policy; text/x-component; text/x-vCalendar; text/HTML; text/enriched; text/comma-separated-values; text/x-c; text/js; text/x-vcard; text/Calendar; text/vcard; text/richtext; text/x-csv; text/x-javascript; text/x-ms-iqy; text/x-patch; text/json; text/x-c++; text/x-vCard; text/directory; text/x-comma-separated-values; text/rtf2; text/vnd.wap.wml; text/htm; text/x-java; text/x-json; text/JavaScript; text/ecmascript; text/plain,%20charset:%20UTF-8; text/text; text/x-python; text/x-vcalendar; text/html%20charset=iso-8859-1; text/troff; text/XML; text/rdf; text/x-fortran; text/dtd; text/css,%20charset:%20UTF-8; text/fragment; text/x-handlebars-template; text/lrc; text/illegal.]
Media types - video
• Compressed video formats:
• video/x-msvideo
• video/mpeg
• video/x-flv
• video/mp4
• Largely superseded by embedded YouTube links
Media types - video
[Figure: Video media types by total frequency. Labels shown: video/mp4; video/x-ms-wmv; video/quicktime; video/x-flv; video/x-ms-asf; video/x-ms-wmx; video/mpeg; video/x-ms-wvx; video/x-msvideo; video/x-m4v; video/ogg; video/webm; video/3gpp; video/x-ms-asx; video/x-mp4; video/x-frv; video/unknown; video/flv; video/mp4v-es; video/avi; video/vnd.objectvideo; video/m4v; video/x-f4v; video/mpv; video/mpeg4; video/x-mpeg; video/3gpp,%20audio/3gpp; video/x-flv%20.flv; video/wmv; video/x-sgi-movie; video/mp4v; video/x-ms-wmv%20video/quicktime; video/msvideo; video/x-FLV; video/ogv; video/dl; video/asf; video/MP4; video/swf; video/x-download-quicktime; video/x-unknown; video/mpg; video/x-ms-wm; video/f4m; video/h264; video/mpg4; video/shockwave; video/f4v.]
Media types - video
[Figure: Video media types as percentage of category total. x-axis: Month, January 2008 to January 2014; y-axis: Percentage of total. Series: MP4, WMV, Quicktime, Flash video, ASF, MPEG, MS Video, M4V, Ogg, WebM, Other.]
File size over time
• Analysis restricted to compressed image formats only
• This is because uncompressed images, documents, text, etc. compress to varying degrees, so their stored sizes are not comparable
• The CDX index contains only compressed size data and is therefore not a true representation of file size trends
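A minimal sketch of the aggregation behind such a trend: the mean compressed size per month and media type, rounded to whole bytes. The input tuples and the function name are hypothetical, and the sizes are the compressed record sizes from the index rather than original file sizes.

```python
from collections import defaultdict


def mean_size_by_month(entries):
    """entries: iterable of (month, mime_type, compressed_size_bytes)."""
    totals = defaultdict(lambda: [0, 0])  # (month, mime) -> [sum, count]
    for month, mime, size in entries:
        bucket = totals[(month, mime)]
        bucket[0] += size
        bucket[1] += 1
    return {
        key: round(total / count) for key, (total, count) in totals.items()
    }


# Hypothetical entries for illustration
entries = [
    ("2008-01", "image/png", 1000),
    ("2008-01", "image/png", 3000),
    ("2008-01", "image/jpeg", 50000),
]
means = mean_size_by_month(entries)
```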
File size over time - images
[Figure: Mean image file size over time. x-axis: Month, January 2008 to July 2014; y-axis: File size (bytes), 0 to 350,000. Series: svg_mean_size_rounded, png_mean_size_rounded, jpeg_mean_size_rounded, gif_mean_size_rounded.]
Protocol changes
• CDX records carry protocol information in the URL field, as the URL scheme
• The protocol can be extracted from this field and aggregated temporally
• This reveals popularity trends for protocols
• HTTP vs. HTTPS
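The extraction step can be sketched with the standard library's URL parser. It assumes that DNS lookups appear in the index with a `dns:` scheme, as some crawlers record them; the sample URLs and function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def protocol_of(url: str) -> str:
    """Return the URL scheme (e.g. 'http', 'https', 'dns') or 'unknown'."""
    return urlsplit(url).scheme.lower() or "unknown"


def protocol_counts_by_month(entries):
    """entries: iterable of (month, url); returns counts per (month, protocol)."""
    counts = Counter()
    for month, url in entries:
        counts[(month, protocol_of(url))] += 1
    return counts


# Hypothetical entries for illustration
entries = [
    ("2013-07", "http://www.example.gov.uk/"),
    ("2013-07", "https://www.example.gov.uk/login"),
    ("2013-07", "http://www.example.gov.uk/about"),
    ("2013-07", "dns:www.example.gov.uk"),
]
protocol_counts = protocol_counts_by_month(entries)
```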
HTTP vs HTTPS (absolute)
[Figure: Frequency of protocols over time. x-axis: Month, June 2001 to May 2014; y-axis: Frequency, 0 to 100,000,000. Series: DNS, HTTP, HTTPS.]
HTTP vs HTTPS (relative)
[Figure: Protocols as percentage of total, 2008-2014. x-axis: Month/Year, January 2008 to July 2014; y-axis: Percentage of total. Series: DNS, HTTP, HTTPS.]
Conclusions
• It is possible to perform useful temporal analysis of CDX index data
• Transformation is necessary – SQL is feasible, commonly available and low cost
• Archive data has particular weaknesses – data cannot be assumed to be fully representative of the content of the target Web subset
• Even with this noise, trends can be identified
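As a rough illustration of the SQL route, the sketch below loads a few parsed CDX rows into an in-memory SQLite database and aggregates status codes by month. SQLite, the table layout and the sample rows are assumptions chosen for the example; the slides do not name the database used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cdx ("
    "captured TEXT, url TEXT, mime_type TEXT, status INTEGER, size INTEGER)"
)

# Hypothetical rows; timestamps already converted to ISO 8601 text
rows = [
    ("2010-03-15 12:00:00", "http://www.example.gov.uk/", "text/html", 200, 5120),
    ("2010-03-20 08:30:00", "http://www.example.gov.uk/old", "text/html", 404, 300),
    ("2010-03-21 09:00:00", "http://www.example.gov.uk/a.pdf", "application/pdf", 200, 90000),
    ("2010-04-01 10:00:00", "http://www.example.gov.uk/move", "text/html", 302, 250),
]
conn.executemany("INSERT INTO cdx VALUES (?, ?, ?, ?, ?)", rows)

# Count entries per month and status code
query = """
SELECT strftime('%Y-%m', captured) AS month, status, COUNT(*) AS n
FROM cdx
GROUP BY month, status
ORDER BY month, status
"""
monthly = conn.execute(query).fetchall()
conn.close()
```

Once the data is in typed columns, the same table can answer the media type, file size and protocol questions by changing the grouping.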