Public services for dark archives

Webarchiv.cz: Addendum to the talk on running the memorial of the Czech web

2266 domains

Docker?

Monitrix

https://github.com/ukwa/monitrix

Prototype 1

Monitoring / front end for Heritrix 3

Analytics of a running crawl / probably aggregates data from a single machine only

Prototype 2

ELK: Elasticsearch / Logstash / Kibana

25 million log lines / 26 GB on disk / 4 vCPU / 20 GB RAM; open question: how to scale this to full-domain crawls

QA

a process for analysing the report of uncrawled sites and re-crawling them

a process for analysing discovered but uncrawled URLs

a process for checking crawls of special sites such as YouTube, Facebook and Twitter
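The QA steps above could start from the Heritrix crawl log. The following is only a minimal sketch, assuming the usual whitespace-separated `crawl.log` layout (timestamp, fetch status, size, URI, ...) and treating negative fetch status codes as "no HTTP response obtained". The function name `failed_hosts` and the sample lines are hypothetical, not part of any existing Webarchiv.cz tooling.

```python
from collections import Counter
from urllib.parse import urlsplit


def failed_hosts(lines):
    """Count, per host, the crawl.log entries whose fetch status is negative."""
    failures = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete crawl.log entry
        try:
            status = int(fields[1])  # second column: fetch status code
        except ValueError:
            continue
        if status >= 0:
            continue  # an HTTP (or DNS) response was recorded
        host = urlsplit(fields[3]).hostname  # fourth column: the URI
        if host:
            failures[host] += 1
    return failures


# Hypothetical sample lines, for illustration only.
sample_log = [
    "2015-09-15T10:00:00.000Z   200   1234 http://a.example/ L http://seed.example/ text/html",
    "2015-09-15T10:00:01.000Z    -2      - http://b.example/x L http://seed.example/ unknown",
    "2015-09-15T10:00:02.000Z    -2      - http://b.example/y L http://seed.example/ unknown",
]
summary = failed_hosts(sample_log)
```

Hosts with many failures would be candidates for the re-crawl list; which status codes should count as "uncrawled" is a policy decision this sketch does not settle.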

Webarchiv.cz: Where to head next?

Services

CDX SERVER API

http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200 will return 2 capture results with non-200 status codes.

http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV will return 10 capture results with non-200 status codes and MIME types other than text/html that also match a specific content digest.

https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp
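The two example queries above differ only in their `limit` and repeated `filter` parameters, so they can be assembled programmatically. The sketch below is a hedged illustration, not part of the CDX server itself: the helper names `build_cdx_url` and `rows_to_dicts` are hypothetical, and it assumes that `output=json` returns a JSON array whose first row holds the field names.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, limit=None, filters=()):
    """Build a CDX query URL; each entry in `filters` becomes one filter= parameter."""
    params = [("url", url), ("output", "json")]
    if limit is not None:
        params.append(("limit", str(limit)))
    params.extend(("filter", f) for f in filters)
    # Keep '!', ':' and '/' unescaped so the result matches the hand-written URLs.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="!:/")


def rows_to_dicts(json_text):
    """Turn the JSON response (assumed: header row first) into a list of dicts."""
    rows = json.loads(json_text)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


query = build_cdx_url("archive.org", limit=2, filters=["!statuscode:200"])
```

The resulting `query` string can be fetched with any HTTP client, for example `urllib.request.urlopen(query).read().decode()`, and the response text passed to `rows_to_dicts`.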

WAT

>> data['Envelope']['WARC-Header-Metadata']['WARC-Type']
"response"
>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['Headers']['Server']
"Apache"
>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Head']['Title']
"BBC NEWS | Africa | Namibia braces for Nujoma exit"
>> len(data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'])
42
>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'][28]
{"path": "A@/href", "title": "Home of BBC Sport on the internet", "url": "http://news.bbc.co.uk/sport1/hi/default.stm"}

Usage: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification

WAT specification: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview

Workshop on building a graph from WAT files: https://home.archive.org/~vinay/archive-web-graphs-workshop/
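To illustrate how a link graph might be derived from WAT records, here is a minimal sketch that turns one WAT JSON envelope into host-to-host edges. It follows the `Envelope` / `Payload-Metadata` / `HTML-Metadata` / `Links` structure shown in the WAT example; reading the source page from `WARC-Target-URI` is an assumption about the header metadata, and `host_edges` and the sample record are hypothetical, not taken from the workshop material.

```python
import json
from urllib.parse import urljoin, urlsplit


def host_edges(envelope_json):
    """Return sorted (source_host, target_host) pairs for outgoing <a href> links."""
    envelope = json.loads(envelope_json)["Envelope"]
    # Assumption: the captured page's URL is stored as WARC-Target-URI.
    source = envelope["WARC-Header-Metadata"].get("WARC-Target-URI", "")
    html = (envelope.get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {}))
    src_host = urlsplit(source).hostname
    edges = set()
    for link in html.get("Links", []):
        if not link.get("path", "").startswith("A@"):
            continue  # keep anchors only; skip images, scripts, etc.
        target = urljoin(source, link.get("url", ""))  # resolve relative links
        dst_host = urlsplit(target).hostname
        if src_host and dst_host and src_host != dst_host:
            edges.add((src_host, dst_host))
    return sorted(edges)


# Hypothetical sample record, for illustration only.
sample_record = json.dumps({"Envelope": {
    "WARC-Header-Metadata": {"WARC-Type": "response",
                             "WARC-Target-URI": "http://example.org/page.html"},
    "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
        {"path": "A@/href", "url": "http://news.bbc.co.uk/sport1/hi/default.stm"},
        {"path": "IMG@/src", "url": "http://img.example.net/x.png"},
        {"path": "A@/href", "url": "/local"},
    ]}}},
}})
edges = host_edges(sample_record)
```

Aggregating these pairs over all records of a crawl gives an edge list that graph tools can load directly.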

Common Crawl

Amazon infrastructure can be used to run analytics over Common Crawl data

more than roughly 100 TB added per month

Common Crawl: https://commoncrawl.org/the-data/get-started/

Examples of Common Crawl data in use: http://commoncrawl.org/the-data/examples/

CDX Server API with a GUI for browsing CDX files: http://index.commoncrawl.org
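An index lookup is useful mainly because it tells you where a capture sits inside a large WARC file, so only that slice needs to be downloaded. The sketch below is a hedged illustration: it assumes the index server returns one JSON object per line with `filename`, `offset` and `length` fields, which the slides do not state, and the helper names and sample line are hypothetical.

```python
import json


def parse_index_lines(text):
    """Parse a line-delimited JSON index response into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def range_header(record):
    """Build an HTTP Range header covering exactly one compressed WARC record."""
    offset = int(record["offset"])
    length = int(record["length"])
    # HTTP byte ranges are inclusive at both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Hypothetical index line, for illustration only.
sample_response = (
    '{"url": "http://example.org/", "filename": "path/to/file.warc.gz", '
    '"offset": "100", "length": "50"}\n'
)
records = parse_index_lines(sample_response)
headers = range_header(records[0])
```

Sending `headers` with a GET request for the file named in `filename` should return a single gzip member holding the one WARC record, provided the hosting server honours range requests.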

Full-text search

Portuguese full-text search prototype

http://www.arquivo.pt/resawdev

The login is: resaw/resaw.eu

https://sobre.arquivo.pt/news/a-first-attempt-to-archive-the-.eu-domain?set_language=en

https://netpreserveblog.wordpress.com/2015/06/03/a-first-attempt-to-archive-the-eu-domain/

Thesis http://sobre.arquivo.pt/sobre/publicacoes-1/Documentos-acerca-do-Arquivo.pt/information-search-in-web-archives

Slides from IIPC GA 2015 http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_11_Gomes.pptx

A colleague's notes: https://www.evernote.com/shard/s43/sh/e6e12603-ecb2-42ae-8532-67d2779b4a86/3b2162e0bcc710d847b6fa5e86cc70b2

UK Web Archive full-text search prototype: Shine

Prototype: https://www.webarchive.org.uk/shine/search/advanced

Wiki: https://github.com/ukwa/shine/wiki/Specification

Code: https://github.com/ukwa/shine

Presentation by Helen Hockx-Yu: http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_08_Hockx.ppt

Video: https://www.youtube.com/watch?v=o4iIdZP4rg8

Further examples

Website Classification Dataset

http://data.webarchive.org.uk/opendata/ukwa.ds.1/classification/

HTTP Archive

In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

http://httparchive.org/trends.php?s=All&minlabel=Nov+15+2010&maxlabel=Sep+15+2015

http://httparchive.org/interesting.php

Talks from Stanford on current thinking about web archives

IIPC GA 2015

https://www.youtube.com/channel/UCkUsw2Lo1ahekgy_xEb11BA/videos
