Toke Eskildsen [email protected] IIPC Technical Training Workshop 2015 – Large Scale Net Archive Indexing - 1/26

Scaling Net Archive Indexing & Search

IIPC Technical Training Workshop 2014

@TokeEskildsen Low-level search guy

(boss says “System Architect”)

Scaling SolrCloud indexing

● CPU for analysis, bulk read & write for Solr
● Homogeneous shards (law of large numbers)
● Solr index update entry point might be bottleneck (so use more entry points)
● Routing overhead
● Splitting and moving shards
● Schema changes might require parallel rebuild

Static independent shards

● Easy scaling
– Predictable resource requirements
● Selective shard rebuilding
● Trivial backup
● Lower overall requirements
– Half the JVM heap requirements
– Single segment → Higher performance
– Less disk cache competition
● Temporal locality
– Better disk cache utilization with few users
– Hot spot problem with more users
– Ranking suffers (in theory)
● No document-level updates!

~250M docs / 900GB shard

Static independent shards search

[Diagram with Shard 01, Shard 02, Shard 03, Searcher 1 and ZooKeeper]

Building static shards

● Not standard Solr
● Sample setup (distribution optional)
– 24 CPU cores (more would be nice)
– 1 Solr indexer @ 40 GB RAM
– 1 Archon tracking (W)ARC files
– 1 Arctika controlling webarchive-discovery (Tika)
– 40 webarchive-discovery (Tika) @ 1 GB RAM
– Final shards: 250M docs, 900GB, fully optimized

Archon + Arctika: https://github.com/netarchivesuite/netsearch

Static independent shards index

[Diagram: Storage (ARC 1, ARC 2 ... ARC n); Archon (ARC paths); two groups of Arctika 1 with WAD 1, WAD 2 ... WAD n; Indexer 1 with Shard 4; Indexer 2 with Shard 5; Searcher 1 with Shard 1, Shard 2, Shard 3]

WAD = webarchive-discovery from UKWA: https://github.com/ukwa/webarchive-discovery

Measuring search performance

● Mimic real world scenarios
– Unique queries
  ● preferably logged from production
– Warmed caches
– Concurrent searches (if relevant)
– Measured time, not reported QTime
● Capture setup data
– Index size, shard count, document count, free cache memory, sar logs
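
A minimal sketch of the measuring approach on this slide: unique queries, a separate warm-up pass, optional concurrency, and wall-clock time measured on the client rather than the QTime reported by Solr. The `search` callable and `fake_search` are hypothetical stand-ins; in a real test `search` would issue the HTTP request and the queries would come from production logs.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_search(search, query):
    """Return wall-clock milliseconds for one call to search(query)."""
    start = time.perf_counter()
    search(query)
    return (time.perf_counter() - start) * 1000.0


def run_benchmark(search, queries, concurrency=1, warmup=()):
    """Time each unique query once, after an untimed warm-up pass."""
    for query in warmup:                    # warm caches; timings are discarded
        search(query)
    unique = list(dict.fromkeys(queries))   # drop repeats, keep original order
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(lambda q: timed_search(search, q), unique))
    return {
        "queries": len(unique),
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }


def fake_search(query):
    """Hypothetical stand-in for a real search request."""
    time.sleep(0.001 * len(query))


if __name__ == "__main__":
    result = run_benchmark(fake_search, ["solr", "net archive", "solr"],
                           concurrency=2, warmup=["warm"])
    print(result)
```

Setup data (index size, shard count, document count, free cache memory, sar logs) would be stored next to each result so runs stay comparable.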

Predicting scaling requirements

● All else is rarely equal
– Disk cache / index size ratio
– CPU cores / shard
– Slowest shard dictates total response time
● 3 or more measurement points
● Use 2 or more shards
● Visualize measurements
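
The point that the slowest shard dictates total response time can be illustrated with a small sketch. It assumes shards are queried in parallel and ignores coordination and merge overhead; the timing values are invented for illustration only.

```python
import statistics


def distributed_times(per_shard_ms):
    """per_shard_ms holds one list of timings per shard, aligned by query.

    Returns the per-query response time when the coordinator has to wait
    for the slowest shard (merge overhead is ignored in this sketch).
    """
    return [max(times) for times in zip(*per_shard_ms)]


if __name__ == "__main__":
    shard_a = [100, 120, 900, 110]   # invented timings in ms
    shard_b = [105, 800, 115, 100]
    total = distributed_times([shard_a, shard_b])
    print("per query:", total)
    print("median, shard A alone:", statistics.median(shard_a))
    print("median, both shards:  ", statistics.median(total))
```

Each shard has only one slow query here, yet half of the combined responses are slow, which is one reason to measure with 2 or more shards instead of extrapolating from a single one.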

SolrCloud distributed search

● Phase 1
– Tophits calculation (fast)
– Simple faceting (medium to slow)
● Phase 2
– Document resolving (fast)
– Facet fine count (medium to very slow)
● Coordination and merge overhead
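
A toy sketch of the two phases, covering only top hits and document resolving (faceting is left out). It is not Solr code: shards are plain dictionaries, the score is a simple term count, and all function names are hypothetical.

```python
import heapq


def phase1_top_hits(shard, term, rows):
    """Shard-local top hits: (score, doc_id) pairs only, no stored content."""
    scored = ((text.split().count(term), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(rows, (hit for hit in scored if hit[0] > 0))


def phase2_resolve(shard, doc_ids):
    """Fetch stored content for documents that made the merged result."""
    return {doc_id: shard[doc_id] for doc_id in doc_ids}


def distributed_search(shards, term, rows):
    # Phase 1: collect each shard's top hits and merge them centrally.
    candidates = []
    for shard_no, shard in enumerate(shards):
        for score, doc_id in phase1_top_hits(shard, term, rows):
            candidates.append((score, doc_id, shard_no))
    winners = heapq.nlargest(rows, candidates)

    # Phase 2: resolve documents, contacting only shards that own a winner.
    resolved = {}
    for shard_no in {owner for _, _, owner in winners}:
        wanted = [doc_id for _, doc_id, owner in winners if owner == shard_no]
        resolved.update(phase2_resolve(shards[shard_no], wanted))
    return [(doc_id, score, resolved[doc_id]) for score, doc_id, _ in winners]


if __name__ == "__main__":
    shards = [
        {"a1": "solr solr index", "a2": "archive"},
        {"b1": "solr search", "b2": "solr solr solr"},
    ]
    for hit in distributed_search(shards, "solr", rows=2):
        print(hit)
```

The merge step in the middle is where the coordination overhead sits: every shard contributes up to `rows` candidates even though only `rows` documents are returned in total.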

Interval popularity (aka long tail)

ms over time

hits, ms

log(hits), ms

Bucketed percentiles (candlesticks)
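
One way to produce the numbers behind such a candlestick chart is sketched below. It assumes measurements are (hits, ms) pairs, buckets them by order of magnitude of the hit count, and uses nearest-rank percentiles; the bucket boundaries and percentile choices are assumptions, not necessarily those used for the chart on the slide.

```python
import math
from collections import defaultdict


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    index = max(0, math.ceil(fraction * len(sorted_values)) - 1)
    return sorted_values[index]


def bucketed_percentiles(samples):
    """Group (hits, ms) samples by the number of digits in hits.

    Bucket 0 holds 0-9 hits, bucket 1 holds 10-99 hits, and so on.
    """
    buckets = defaultdict(list)
    for hits, ms in samples:
        buckets[len(str(int(hits))) - 1].append(ms)

    summary = {}
    for bucket, values in sorted(buckets.items()):
        values.sort()
        summary[bucket] = {
            "count": len(values),
            "min": values[0],
            "p25": percentile(values, 0.25),
            "median": percentile(values, 0.50),
            "p75": percentile(values, 0.75),
            "max": values[-1],
        }
    return summary


if __name__ == "__main__":
    samples = [(5, 10), (7, 20), (3, 30), (9, 40), (150, 100), (900, 300)]
    for bucket, stats in bucketed_percentiles(samples).items():
        print(f"10^{bucket} hits: {stats}")
```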

Abstract search hardware

● IOPS
– Needed for concurrent users and/or many shards
● Latency
– 1 request = 1 thread / shard (lying a bit)
– Lower latency → more IOPS
● Tapes < Spinning drives < SSDs < RAM
– But the truth is in the mix
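
The link between latency and IOPS can be made concrete with a back-of-the-envelope sketch. It assumes every search thread issues one blocking read at a time, so throughput is bounded by threads divided by latency. The user count, shard count and latency figures below are illustrative assumptions, not measurements from the talk.

```python
def search_threads(concurrent_users, shards):
    """Rule of thumb from the slide: 1 request = 1 thread per shard."""
    return concurrent_users * shards


def iops_ceiling(threads, latency_ms):
    """Upper bound on reads per second when each thread waits for one
    read at a time: threads / latency, with latency given in ms."""
    return threads * 1000.0 / latency_ms


if __name__ == "__main__":
    threads = search_threads(concurrent_users=4, shards=25)   # hypothetical load
    # Illustrative latencies only; real devices and cache hits vary widely.
    for name, latency_ms in [("spinning drive", 10.0), ("SSD", 0.5)]:
        print(f"{name}: at most {iops_ceiling(threads, latency_ms):,.0f} "
              f"reads/s from {threads} threads")
```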

Case study: Net Archive Search at State and University Library, Denmark

Standard request

● Free-text matching in 6 fields
● Phrase matching in 1 field
● Grouping on URL (not used in the tests)
● Faceting
– URL (~6b uniques, 7b references)
– Host & domain (millions of uniques, 7b references)
– 3 small ones (year, format, public suffix)
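
A request of this shape might be assembled as sketched below. The Solr parameters (`defType`, `qf`, `pf`, `facet.field`, `group.field`) are standard, but the use of edismax, the base URL and every field name are assumptions: the slide gives only the number of fields and their rough roles.

```python
from urllib.parse import urlencode

# Hypothetical field names; the slide only states how many fields are used.
FREE_TEXT_FIELDS = ["title", "content", "url_text", "host_text", "author", "description"]
PHRASE_FIELD = "content"
LARGE_FACETS = ["url", "host", "domain"]
SMALL_FACETS = ["crawl_year", "content_type", "public_suffix"]


def build_select_url(query,
                     base="http://localhost:8983/solr/netarchive/select",
                     facets=tuple(LARGE_FACETS + SMALL_FACETS),
                     group_on_url=False):
    """Build a select URL with free-text, phrase and facet parameters."""
    params = [
        ("q", query),
        ("defType", "edismax"),                  # assumed query parser
        ("qf", " ".join(FREE_TEXT_FIELDS)),      # free-text matching in 6 fields
        ("pf", PHRASE_FIELD),                    # phrase matching in 1 field
        ("rows", "10"),
        ("wt", "json"),
    ]
    if facets:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facets)
    if group_on_url:                             # not used in the tests
        params += [("group", "true"), ("group.field", "url")]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_select_url("net archive"))
    print(build_select_url("net archive", facets=SMALL_FACETS))
```

The second call drops the three large facets, which makes it easy to compare the full request against one with only the small facets.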

Solr version & schema

● Solr 4.8.1 + SOLR-5894 patch (optional)
● Piggybacking on UKWA work
● DocValues on all large facet fields (essential)

Clever Solr config tweaks

This space intentionally left blank

CPU

Disk cache

RAM (GB)   % of index   mean (ms)   median (ms)
     110         0.49         658           141
      98         0.44        1004           170
      54         0.24        2164           361
      27         0.12        5620           913
       7         0.03        8546          3012
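
The table can be sanity-checked with a little arithmetic. The sketch below assumes RAM is given in GB and timings in ms (the slide does not state units) and derives the index size implied by each row, plus the median slowdown relative to the row with the most cache.

```python
# Rows from the slide: (cache RAM, % of index, mean, median).
# Assumed units, not stated on the slide: GB for RAM, ms for timings.
MEASUREMENTS = [
    (110, 0.49, 658, 141),
    (98, 0.44, 1004, 170),
    (54, 0.24, 2164, 361),
    (27, 0.12, 5620, 913),
    (7, 0.03, 8546, 3012),
]


def implied_index_tb(ram_gb, pct_of_index):
    """Index size in TB implied by one row: RAM / (percentage / 100)."""
    return ram_gb / (pct_of_index / 100.0) / 1000.0


def median_slowdown(rows):
    """Median response time relative to the row with the most cache."""
    baseline = max(rows, key=lambda row: row[0])[3]
    return [(ram, median / baseline) for ram, _, _, median in rows]


if __name__ == "__main__":
    for ram, pct, _, _ in MEASUREMENTS:
        print(f"{ram:>4} GB cache -> index of about {implied_index_tb(ram, pct):.1f} TB")
    for ram, factor in median_slowdown(MEASUREMENTS):
        print(f"{ram:>4} GB cache -> median {factor:.1f}x the 110 GB case")
```

Under these assumptions every row implies an index of roughly 22 to 23 TB, which matches the 22TB of SSD mentioned in the closing slide.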

Concurrent requests

Concurrent requests (less faceting)

Faceting impact mitigation

Sparse faceting: http://tokee.github.io/lucene-solr/

Fewer, smaller facets

● Measure thrice & visualise
● Common Solr rules of thumb are not always applicable at Net Archive scale
● Static shards make scaling easier
● SSDs work very well for us (22TB costs £7500)
● Full distributed faceting is doable but heavy

Danish Net Archive: http://netarkivet.dk/in-english/
More Solr tech talk: http://sbdevel.wordpress.com