NetarchiveSuite Meeting, Tallinn, 29./30.01.2015 * Web@rchive Austria Updates and Plans for 2015 Michaela Mayr, Andreas Predikaka Austrian National Library [email protected] www.onb.ac.at


Web@rchive Austria
Updates and Plans for 2015

Michaela Mayr, Andreas Predikaka

Austrian National Library
[email protected]

www.onb.ac.at


Harvesting 2014

• Ongoing Collections:
  – Media (since 2011)
  – Politics (since 2013) incl. 1 regional election
• Olympic Winter Games Sochi
  – 3 seeds daily, 96 seeds weekly
• EU elections
  – 132 seeds daily, 33 seeds weekly
• World War I
  – 151 seeds

Budget = 2 TB


Harvesting 2015

• Ongoing Collections:
  – Media (since 2011)
  – Politics (since 2013) incl. 4 regional elections
• 4th Broad Crawl
  – New TLDs .wien, .tirol
  – ARC format, NAS 4.4, PostgreSQL
• Eurovision Song Contest
• Content behind paywalls?

Budget = 10 TB


Statistics

Approximately
• 1.4 m. domains
• 60 TB raw / 30 TB compressed
• 2 bn. files


Access

• Prototype for online search interface (no access to data)
  – Improved search possibilities (partial full-text search of selected seeds)
  – User tracking (in-house, online) and data handling with ELK stack (Elasticsearch, Logstash, Kibana)
• External access for 4 libraries


NAS & other tech stuff

E-Mail-Notification Tool (for selective crawls)

NAS Release tests

File Format Identification (DROID, as part of ONB risk management)


NAS & other tech stuff

• Hadoop
  – Responsibilities changed
  – Problem solving in progress
• To do until broad crawl (03/15):
  – Database Migration MySQL to PostgreSQL
  – Switch to NAS 4.4
• Switch to OpenWayback