NetarchiveSuite Meeting, Tallinn, 29./30.01.2015
Web@rchive Austria
Updates and Plans for 2015
Michaela Mayr, Andreas Predikaka
Austrian National Library
[email protected]
www.onb.ac.at
Harvesting 2014
• Ongoing Collections:
  – Media (since 2011)
  – Politics (since 2013) incl. 1 regional election
• Olympic Winter Games Sochi
  – 3 seeds daily, 96 seeds weekly
• EU elections
  – 132 seeds daily, 33 seeds weekly
• World War I
  – 151 seeds
Budget = 2 TB
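
The daily and weekly seed counts above translate into a weekly harvesting load. The following is a minimal sketch, assuming every daily seed is harvested seven times a week and every weekly seed once; the slides do not state how long each event crawl ran, so only per-week figures are computed.

```python
# Seed counts are taken from the slide; the per-week model is an assumption.
EVENT_SEEDS = {
    "Olympic Winter Games Sochi": {"daily": 3, "weekly": 96},
    "EU elections": {"daily": 132, "weekly": 33},
}


def seed_harvests_per_week(daily: int, weekly: int) -> int:
    """Seed harvests in one week: daily seeds run 7 times, weekly seeds once."""
    return daily * 7 + weekly


weekly_load = {
    name: seed_harvests_per_week(**counts)
    for name, counts in EVENT_SEEDS.items()
}

for name, load in weekly_load.items():
    print(f"{name}: {load} seed harvests per week")
```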
Harvesting 2015
• Ongoing Collections:
  – Media (since 2011)
  – Politics (since 2013) incl. 4 regional elections
• 4th Broad Crawl
  – New TLDs .wien, .tirol
  – ARC format, NAS 4.4, PostgreSQL
• Eurovision Song Contest
• Content behind paywalls?
Budget = 10 TB
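
Including the new TLDs in the broad crawl implies picking out domains under .wien and .tirol from a candidate list. A minimal sketch of such a filter follows; the helper name and the example domains are hypothetical and not part of NetarchiveSuite.

```python
# Hypothetical helper: select domains under the newly added TLDs.
NEW_TLDS = (".wien", ".tirol")


def is_new_tld(domain: str) -> bool:
    """True if the domain ends with one of the new TLDs (case-insensitive)."""
    normalized = domain.strip().lower().rstrip(".")
    return normalized.endswith(NEW_TLDS)


# Example domains are made up for illustration.
candidates = ["example.at", "rathaus.wien", "Berge.TIROL", "wien.example.com"]
new_tld_domains = [d for d in candidates if is_new_tld(d)]
print(new_tld_domains)
```

Matching on the suffix rather than on a substring keeps a name such as wien.example.com out of the result.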
Statistics
Approximately:
• 1.4 million domains
• 60 TB raw / 30 TB compressed
• 2 billion files
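
A few derived figures follow directly from these approximate totals. The sketch below assumes decimal units (1 TB = 10**12 bytes), which the slide does not specify; the results are rough orders of magnitude, not measured values.

```python
# Totals from the slide; decimal terabytes are an assumption.
TB = 10**12
raw_bytes = 60 * TB
compressed_bytes = 30 * TB
file_count = 2 * 10**9
domain_count = 1_400_000

compression_ratio = raw_bytes / compressed_bytes   # raw size per compressed size
avg_raw_file_size = raw_bytes / file_count         # bytes per file, uncompressed
avg_files_per_domain = file_count / domain_count   # files per domain

print(f"compression ratio: {compression_ratio:.1f}:1")
print(f"average raw file size: {avg_raw_file_size / 1000:.0f} kB")
print(f"average files per domain: {avg_files_per_domain:.0f}")
```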
Access
• Prototype for online search interface (no access to data)
  – Improved search possibilities (partial full-text search of selected seeds)
  – User tracking (in-house, online) and data handling with ELK stack (Elasticsearch, Logstash, Kibana)
• External access for 4 libraries
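
For the user tracking mentioned above, one common way to feed an ELK stack is to write one JSON object per line and let Logstash parse it. The sketch below shows what such an event could look like; all field names except the conventional `@timestamp` are hypothetical, and the actual ONB log schema is not given in the slides.

```python
import json
from datetime import datetime, timezone


def make_search_event(query: str, source: str, hits: int, when: datetime) -> str:
    """Serialize one search event as a single JSON line."""
    event = {
        "@timestamp": when.astimezone(timezone.utc).isoformat(),
        "event": "search",   # hypothetical event type
        "source": source,    # "inhouse" or "online", as on the slide
        "query": query,
        "hits": hits,
    }
    return json.dumps(event, ensure_ascii=False)


line = make_search_event(
    "wahl 2015", "inhouse", 42,
    datetime(2015, 1, 29, 10, 0, tzinfo=timezone.utc),
)
print(line)
```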
NAS & other tech stuff
• E-mail notification tool (for selective crawls)
• NAS release tests
• File format identification (DROID, as part of ONB risk management)
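
DROID identifies formats by matching byte signatures from the PRONOM registry. The sketch below is not DROID and does not use PRONOM data; it only illustrates the underlying idea with a handful of well-known magic numbers and a hypothetical `identify` helper.

```python
# A tiny, hand-picked signature table; real tools use far larger registries.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x1f\x8b", "gzip"),
]


def identify(data: bytes) -> str:
    """Return a format label based on the leading bytes, or 'unknown'."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


print(identify(b"%PDF-1.4\n"))
print(identify(b"GIF89a\x01\x00"))
print(identify(b"plain text"))
```

Identifying by content rather than by file extension or declared MIME type matters for web archives, where server-supplied types are often missing or wrong.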
NAS & other tech stuff
• Hadoop
  – Responsibilities changed
  – Problem solving in progress
• To do before the broad crawl (03/15):
  – Database migration from MySQL to PostgreSQL
  – Switch to NAS 4.4
• Switch to OpenWayback
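
A simple sanity check after a database migration like the one listed above is to compare row counts per table between source and target. The sketch below uses two in-memory SQLite databases as stand-ins for the MySQL source and the PostgreSQL target so that it runs without any server; the table name `jobs` and the helper functions are hypothetical.

```python
import sqlite3

TABLES = ["jobs"]  # hypothetical table name, for illustration only


def make_db(statuses):
    """Create an in-memory stand-in database with one row per status."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany("INSERT INTO jobs (status) VALUES (?)", [(s,) for s in statuses])
    conn.commit()
    return conn


def row_counts(conn, tables):
    """Return {table: row count}; table names come from a fixed list, not user input."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}


def compare(source, target, tables):
    """Return {table: (source count, target count)} for tables that differ."""
    src, dst = row_counts(source, tables), row_counts(target, tables)
    return {t: (src[t], dst[t]) for t in tables if src[t] != dst[t]}


source_db = make_db(["done", "done", "failed"])
target_db = make_db(["done", "done", "failed"])
mismatches = compare(source_db, target_db, TABLES)
print(mismatches or "row counts match")
```

Equal row counts do not prove that the data is identical, but a mismatch is a cheap early signal that something was lost or duplicated.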